# 1 - Introduction

**Made by  :** 
  - Enggar Kristian 
  - Wahyudi

**Batch : FTDS - 012** 

**Objective : Final project, a Machine Learning model for a chatbot**

# 2 - Import Libraries

In [3]:
# Import library that will be used in this project

# Library for Visualization
import matplotlib.pyplot as plt

# Library for preprocessing
import string
import re
import numpy as np
import pandas as pd
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

# Library for modelling
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
import pickle
from sklearn.pipeline import make_pipeline
from util import JSONParser
from sklearn.model_selection import train_test_split

# Library for model evaluation
from sklearn import metrics

# 3 - Data Loading

In [20]:
# Load data from folders
# Raw string so the backslashes in the Windows path are not treated as escape sequences
path = r"D:\Bootcamp Data Science Batch 011 - Hacktiv8\Final Project\p2---final-project-ftds-012-group-003\data\intents.json"

# Defining JSONParser
jp = JSONParser()

# Parsing data intents
jp.parse(path)

# Building dataframe and save it to variable df
df = jp.get_dataframe()

Load intents.json, a dataset we wrote ourselves that contains conversations between a customer and the chatbot.
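
The `JSONParser` class comes from the project's local `util` module, which is not shown in this notebook. The sketch below is a hypothetical minimal stand-in, assuming intents.json follows the common layout of a list of intents with `tag`, `patterns`, and `responses` keys. The class name `SimpleIntentParser`, the key names, and the methods `parse_text` and `get_rows` are assumptions for illustration. Only the column names `chat_input` and `intents` and the method name `get_response` are taken from the notebook.

```python
import json
import random


class SimpleIntentParser:
    """Hypothetical stand-in for util.JSONParser (the real class is not shown)."""

    def __init__(self):
        self.rows = []        # one dict per training sentence
        self.responses = {}   # tag -> list of candidate replies

    def parse_text(self, json_text):
        # Assumed layout:
        # {"intents": [{"tag": ..., "patterns": [...], "responses": [...]}]}
        data = json.loads(json_text)
        for intent in data["intents"]:
            tag = intent["tag"]
            self.responses[tag] = list(intent["responses"])
            for pattern in intent["patterns"]:
                self.rows.append({"chat_input": pattern, "intents": tag})

    def get_rows(self):
        # The notebook's get_dataframe() returns a pandas DataFrame; plain
        # dictionaries keep this sketch dependency-free.
        return list(self.rows)

    def get_response(self, tag):
        # Pick one of the replies registered for the predicted tag.
        return random.choice(self.responses[tag])


sample = json.dumps({
    "intents": [
        {"tag": "salam", "patterns": ["Hai", "Halo"], "responses": ["Hai kakak!"]},
    ]
})
parser = SimpleIntentParser()
parser.parse_text(sample)
print(parser.get_rows())
print(parser.get_response("salam"))
```

Each pattern becomes one training row labelled with its tag, and the responses are kept per tag so the chatbot can look up a reply after predicting an intent.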

# 4 - Preprocessing

In [21]:
# Display the dataset
df

| | chat_input | intents |
|---|---|---|
| 0 | Hai | salam |
| 1 | Halo | salam |
| 2 | Hello | salam |
| 3 | Hei | salam |
| 4 | Hy | salam |
| ... | ... | ... |
| 486 | bisa akses dimana | website |
| 487 | info lengkap perusahaan dimana min? | website |
| 488 | apakah ada webnya | website |
| 489 | profile bank nya dimana ya? | website |


## Cleaning Data Text

In [22]:
# Look into sample chat
sample_chat = df.chat_input[430]
sample_chat



'boleh ketemu langsung?'

In [23]:
# Transforming the chat into lowercase
chat_lower = sample_chat.lower()
chat_lower

'boleh ketemu langsung?'

In [24]:

# Remove punctuations from the chat
chat_punct = chat_lower.translate(str.maketrans("","",string.punctuation))
chat_punct



'boleh ketemu langsung'

In [25]:
# Replace the remaining non-alphabetical characters (such as digits) with spaces to make the data cleaner.
chat_punct = re.sub(r"[^A-Za-z\s']", " ", chat_punct)
chat_punct

'boleh ketemu langsung'

In [26]:
# Remove leading and trailing whitespace, in case the customer accidentally typed a tab or extra spaces
chat_punct = chat_punct.strip()
chat_punct

'boleh ketemu langsung'

In [27]:
# Defining Stemmer
stemmer = StemmerFactory().create_stemmer()

# Applying stemmer to the chat
output   = stemmer.stem(chat_punct)
output

'boleh ketemu langsung'

In [28]:
# Create a function that applies all text-cleaning steps and stemming to one document
def document_processing(document):
    # Transform the document into lowercase
    document = document.lower()

    # Remove links first, while "://" and "." are still intact
    document = re.sub(r"http\S+", " ", document)
    document = re.sub(r"www\.\S+", " ", document)

    # Remove punctuation from the document
    document = document.translate(str.maketrans("", "", string.punctuation))

    # Replace digits and any other non-alphabetical characters with spaces
    document = re.sub(r"[^A-Za-z\s']", " ", document)

    # Remove leading and trailing whitespace (spaces, tabs, newlines)
    document = document.strip()

    # Stem with the Sastrawi stemmer created in the previous cell
    # (reusing it avoids rebuilding the stemmer on every call)
    document = stemmer.stem(document)

    return document

This function combines all the text-cleaning steps above with stemming, so they can be applied to any chat text in one call.
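
To see what the cleaning steps do to a message, here is a small sketch that uses only the standard library. It leaves out the Sastrawi stemming step on purpose, and it collapses repeated spaces, which the notebook's function does not do. The function name `clean_without_stemming` and the second sample sentence are made up for illustration.

```python
import re
import string


def clean_without_stemming(text):
    """Cleaning steps only; the Sastrawi stemming step is left out on purpose."""
    text = text.lower()
    # Drop links while "://" and "." are still present.
    text = re.sub(r"http\S+", " ", text)
    text = re.sub(r"www\.\S+", " ", text)
    # Delete punctuation, then turn any remaining non-letter into a space.
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"[^a-z\s]", " ", text)
    # Collapse repeated whitespace and trim the ends (an addition in this sketch).
    return re.sub(r"\s+", " ", text).strip()


print(clean_without_stemming("Boleh ketemu LANGSUNG?"))
# boleh ketemu langsung
print(clean_without_stemming("Info lengkap ada di www.example.com ya, nomor 2!"))
# info lengkap ada di ya nomor
```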

## Data Splitting

In [29]:
# Split between data (X) and target (y)
X = df.chat_input
y = df.intents

In [34]:
# Preprocess the text with the document_processing function
X_proc = X.apply(document_processing)


Apply the text-cleaning function to every chat string in the dataset.

## Word Vectorization


In [35]:
# Define text Vectorizer
vect = CountVectorizer()
vect.fit(X_proc)

CountVectorizer()

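CountVectorizer from scikit-learn converts each cleaned string into a vector of integer word counts. Note that each pipeline in the Model Training section builds its own CountVectorizer, so this standalone vectorizer is only used here.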
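
The idea behind count vectorization can be sketched in plain Python. This is a simplified illustration, not scikit-learn's implementation: it splits on whitespace only, while CountVectorizer by default also lowercases the text and ignores single-character tokens. The helper names `build_vocabulary` and `to_count_vector` are hypothetical.

```python
from collections import Counter


def build_vocabulary(documents):
    # Sorted unique tokens, each mapped to a column index.
    tokens = sorted({tok for doc in documents for tok in doc.split()})
    return {tok: idx for idx, tok in enumerate(tokens)}


def to_count_vector(document, vocabulary):
    # One count per vocabulary entry; words outside the vocabulary are ignored.
    counts = Counter(document.split())
    return [counts.get(tok, 0) for tok in vocabulary]


docs = ["boleh ketemu langsung", "bisa akses dimana", "boleh akses langsung langsung"]
vocab = build_vocabulary(docs)
print(vocab)
# {'akses': 0, 'bisa': 1, 'boleh': 2, 'dimana': 3, 'ketemu': 4, 'langsung': 5}
print([to_count_vector(d, vocab) for d in docs])
# [[0, 0, 1, 0, 1, 1], [1, 1, 0, 1, 0, 0], [1, 0, 1, 0, 0, 2]]
```

Each sentence becomes a row of integers with one column per known word, which is the numeric form the classifiers below are trained on.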

In [36]:
# Split Data for Train-Set and Test-Set
X_train, X_test,  y_train, y_test = train_test_split(X_proc, y,test_size=0.2, random_state=5)
print(f"Train-Set shape      : {len(X_train),len(y_train)}")
print(f"Test-Set shape       : {len(X_test),len(y_test)}")

Train-Set shape      : (392, 392)
Test-Set shape       : (99, 99)
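
The printed shapes can be checked by hand. The sketch below assumes that a fractional test size is rounded up and the remainder goes to the train set, which appears to match how train_test_split produced the numbers above. The total of 491 rows is taken from the printed shapes.

```python
import math

n_samples = 392 + 99          # total rows implied by the printed shapes
test_size = 0.2

n_test = math.ceil(test_size * n_samples)   # fractional test size is rounded up
n_train = n_samples - n_test                # the remainder goes to training
print(n_train, n_test)
# 392 99
```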


# 4 - Model Training

In [37]:
# MultinomialNB Pipeline
nb = make_pipeline(CountVectorizer(),
                   MultinomialNB())

# Training
nb.fit(X_train, y_train)

Pipeline(steps=[('countvectorizer', CountVectorizer()),
                ('multinomialnb', MultinomialNB())])

Define a Multinomial Naive Bayes model inside a pipeline, together with its own vectorizer.
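
For intuition, Multinomial Naive Bayes can be sketched in plain Python: it scores each intent by its prior plus the smoothed log-likelihood of every word in the message. This is a simplified illustration rather than scikit-learn's implementation. The function names are hypothetical, `alpha=1.0` mirrors the default smoothing of MultinomialNB, and the four training sentences are lowercased samples from the dataset shown earlier.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(documents, labels, alpha=1.0):
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(documents, labels):
        word_counts[label].update(doc.split())
    vocabulary = {tok for counts in word_counts.values() for tok in counts}

    model = {}
    for label, n_docs in class_docs.items():
        # Laplace smoothing: add alpha to every word count of this class.
        total = sum(word_counts[label].values()) + alpha * len(vocabulary)
        model[label] = {
            "log_prior": math.log(n_docs / len(labels)),
            "log_likelihood": {
                tok: math.log((word_counts[label][tok] + alpha) / total)
                for tok in vocabulary
            },
        }
    return model


def predict(model, document):
    scores = {}
    for label, params in model.items():
        score = params["log_prior"]
        for tok in document.split():
            # Words outside the training vocabulary are skipped.
            score += params["log_likelihood"].get(tok, 0.0)
        scores[label] = score
    return max(scores, key=scores.get)


docs = ["hai", "halo", "apakah ada webnya", "bisa akses dimana"]
tags = ["salam", "salam", "website", "website"]
model = train_multinomial_nb(docs, tags)
print(predict(model, "halo"))               # salam
print(predict(model, "ada webnya dimana"))  # website
```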

In [38]:
# XGBoost Pipeline
xgb = make_pipeline(CountVectorizer(),
                   XGBClassifier(random_state=5, learning_rate=0.1, max_depth=2, n_estimators=60))

# Training
xgb.fit(X_train, y_train)





Pipeline(steps=[('countvectorizer', CountVectorizer()),
                ('xgbclassifier',
                 XGBClassifier(base_score=0.5, booster='gbtree',
                               colsample_bylevel=1, colsample_bynode=1,
                               colsample_bytree=1, enable_categorical=False,
                               gamma=0, gpu_id=-1, importance_type=None,
                               interaction_constraints='', learning_rate=0.1,
                               max_delta_step=0, max_depth=2,
                               min_child_weight=1, missing=nan,
                               monotone_constraints='()', n_estimators=60,
                               n_jobs=4, num_parallel_tree=1,
                               objective='multi:softprob', predictor='auto',
                               random_state=5, reg_alpha=0, reg_lambda=1,
                               scale_pos_weight=None, subsample=1,
                               tree_method='exact', validate_par

Extreme Gradient Boosting (XGBoost) is one of our favourite Machine Learning models. It is a more recent ensemble method built on gradient-boosted decision trees.

In [39]:
# Decision Tree Pipeline
dt = make_pipeline(CountVectorizer(),
                   DecisionTreeClassifier())

# Training
dt.fit(X_train, y_train)

Pipeline(steps=[('countvectorizer', CountVectorizer()),
                ('decisiontreeclassifier', DecisionTreeClassifier())])

The Decision Tree model is wrapped in a pipeline as well, so vectorization and classification run in a single step.

# 5 - Model Evaluation

## Model Evaluation for MultinomialNB

In [40]:
y_train_pred_nb = nb.predict(X_train)
y_test_pred_nb = nb.predict(X_test)
# accuracy_score expects the true labels first, then the predictions
print("Train-Set Multinomial Naive Bayes model accuracy(in %):", metrics.accuracy_score(y_train, y_train_pred_nb)*100)
print("Test-Set Multinomial Naive Bayes model accuracy(in %) :", metrics.accuracy_score(y_test, y_test_pred_nb)*100)

Train-Set Multinomial Naive Bayes model accuracy(in %): 95.91836734693877
Test-Set Multinomial Naive Bayes model accuracy(in %) : 83.83838383838383


This is a good result for such a small dataset: MultinomialNB reaches about 95.9% accuracy on the train set and 83.8% on the test set, the best test accuracy of the three models.

## Model Evaluation for Decision Tree

In [41]:
y_train_pred_dt = dt.predict(X_train)
y_test_pred_dt = dt.predict(X_test)
# accuracy_score expects the true labels first, then the predictions
print("Train-Set Decision Tree model accuracy(in %):", metrics.accuracy_score(y_train, y_train_pred_dt)*100)
print("Test-Set Decision Tree model accuracy(in %) :", metrics.accuracy_score(y_test, y_test_pred_dt)*100)

Train-Set Decision Tree model accuracy(in %): 98.9795918367347
Test-Set Decision Tree model accuracy(in %) : 75.75757575757575


The Decision Tree does not beat MultinomialNB. It overfits: accuracy is about 99.0% on the train set but only 75.8% on the test set, well below MultinomialNB's test result.

## Model Evaluation for XGBoost

In [42]:
y_train_pred_xgb = xgb.predict(X_train)
y_test_pred_xgb = xgb.predict(X_test)
print("Train-Set XGBoost model accuracy(in %):", metrics.accuracy_score(y_train, y_train_pred_xgb)*100)
print("Test-Set XGBoost model accuracy(in %) :", metrics.accuracy_score(y_test, y_test_pred_xgb)*100)

Train-Set XGBoost model accuracy(in %): 76.53061224489795
Test-Set XGBoost model accuracy(in %) : 64.64646464646465


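XGBoost, our favourite model, does not learn this text data well. It has the lowest accuracy of the three models (about 76.5% on the train set and 64.6% on the test set). The gap between its train and test scores is moderate, so it is not badly overfit, but we will not use XGBoost for inference.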
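
The three evaluations can be put side by side. The sketch below only tabulates the accuracies printed in the cells above and computes the train-test gap for each model. The dictionary name `reported` is hypothetical.

```python
# Accuracies (in %) copied from the evaluation outputs above: (train, test)
reported = {
    "MultinomialNB": (95.91836734693877, 83.83838383838383),
    "Decision Tree": (98.9795918367347, 75.75757575757575),
    "XGBoost": (76.53061224489795, 64.64646464646465),
}

for name, (train_acc, test_acc) in reported.items():
    gap = train_acc - test_acc
    print(f"{name:<14} train={train_acc:6.2f}  test={test_acc:6.2f}  gap={gap:5.2f}")

# The model with the highest test accuracy is the candidate for deployment.
best = max(reported, key=lambda name: reported[name][1])
print("Best test accuracy:", best)
```

The Decision Tree shows by far the largest gap (about 23 points), while MultinomialNB has the highest test accuracy.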

# 6 - Model Saving

Based on these results, we chose MultinomialNB as the machine learning model for a chatbot in the banking industry. The chatbot is meant to improve service for customers who raise complaints on Twitter: it becomes active when a customer sends a Direct Message to our official account.

In [24]:
# Save the chosen MultinomialNB pipeline
with open('model_2.pkl', 'wb') as file:
    pickle.dump(nb, file)
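
A saved model is loaded back with `pickle.load`. The sketch below shows the save and load round trip with a stand-in dictionary instead of the fitted `nb` pipeline, so it runs without scikit-learn. The temporary directory and the name `stand_in_model` are assumptions for illustration.

```python
import os
import pickle
import tempfile

# Stand-in object; in the notebook this would be the fitted `nb` pipeline.
stand_in_model = {"name": "placeholder", "classes": ["salam", "website"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model_2.pkl")

    # Save, mirroring the cell above.
    with open(model_path, "wb") as file:
        pickle.dump(stand_in_model, file)

    # Load it back, as a separate serving script would.
    with open(model_path, "rb") as file:
        restored = pickle.load(file)

print(restored == stand_in_model)
# True
```

When loading the real pipeline, the serving environment needs scikit-learn installed, and pickle files should only be loaded from a trusted source.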

# 7 - Building Chatbot

In [43]:
# Greeting (in Indonesian): tells the user they are connected to the bank's official
# business account and that Chass, the Customer Handler Assistant, will reply
print("Hai kakak, saat ini kamu terhubung dengan akun bisnis resmi Bank T0yip. Pesan ini akan dibalas oleh Customer Handler Assistant atau kamu bisa panggil Chass :)")
while True:
    # Read the user's message
    chat = input("Saya : ")
    # Apply the same preprocessing used for training
    chat_processed = document_processing(chat)
    # Predict the probability of every intent (tag)
    res = nb.predict_proba([chat_processed])
    # Get the highest probability and its index
    max_prob = max(res[0])
    max_idx = np.argmax(res[0])

    # Echo the user's message
    print("Saya :", chat)
    # Low confidence: reply that the bot does not understand yet, then keep listening
    if max_prob < 0.20:
        print("Bot  : Maaf Kak, aku masih gak ngerti maksud kakak ):")
        continue
    # Otherwise reply with the response registered for the predicted tag
    print(f"Bot  : {jp.get_response(nb.classes_[max_idx])}\n")
    # End the chat when the predicted tag is "menutup" (closing)
    if nb.classes_[max_idx] == 'menutup':
        break

Anda Terhubung dengan chatbot Kami, panggil aja CHASS
Saya : hai
Bot  : Hai kakak!

Saya : atm
Bot  : Oke kak jangan khawatir, untuk ATM rusak, ATM tertelan, dan ATM hilang akan kami ganti dengan kartu yang baru serta Gratis! Oiya jangan lupa isi data diri dulu ya di link ini https:/www.banktoyib.com/data_diri

Saya : cara buat akun
Bot  : Baik kak, pendaftaran akun rekening cukup mudah hanya masukkan email yang aktif, lalu buat password, jangan lupa siapkan dokumen diri seperti KTP dan NPWP, setelah itu pengenalan wajah, isi data diri, Selamat! Akun rekening kamu telah jadi.

Saya : 
Bot  : Maaf Kak, aku masih gak ngerti maksud kakak ):
Saya : 
Bot  : Maaf Kak, aku masih gak ngerti maksud kakak ):
Saya : 
Bot  : Maaf Kak, aku masih gak ngerti maksud kakak ):
Saya : 
Bot  : Maaf Kak, aku masih gak ngerti maksud kakak ):
Saya : 
Bot  : Maaf Kak, aku masih gak ngerti maksud kakak ):
Saya : 
Bot  : Maaf Kak, aku masih gak ngerti maksud kakak ):
Saya : kamu siapa?
Bot  : Tugas Chass untuk 
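
The fallback rule in the chat loop, replying only when the top probability reaches the threshold, can be isolated into a small function. This is a sketch with a hypothetical helper name `pick_intent` and made-up probabilities. The 0.20 default matches the threshold used in the loop.

```python
def pick_intent(probabilities, classes, threshold=0.20):
    """Return the most likely intent, or None when confidence is too low."""
    best_idx = max(range(len(probabilities)), key=probabilities.__getitem__)
    if probabilities[best_idx] < threshold:
        return None
    return classes[best_idx]


classes = ["menutup", "salam", "website"]
print(pick_intent([0.05, 0.90, 0.05], classes))
# salam
print(pick_intent([0.34, 0.33, 0.33], classes, threshold=0.50))
# None
```

With only three classes the top probability can never fall below one third, so the second call raises the threshold to show the fallback. With the many intents in the real dataset, probabilities are spread more thinly and 0.20 can trigger.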

The model has been built successfully and can be connected to the Twitter Direct Message API!