# 1 - Introduction

**Made by  :** 
  - Enggar Kristian 
  - Wahyudi

**Batch : FTDS - 012** 

**Objective : Final project, model Machine Learning for chatbot**

# 2 - Import Libraries

In [1]:
# Import library that will be used in this project

# Library for Visualization
import matplotlib.pyplot as plt

# Library for preprocessing
import string
import re
import numpy as np
import pandas as pd
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

# Library for modelling
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
import pickle
from sklearn.pipeline import make_pipeline
from util import JSONParser
from sklearn.model_selection import train_test_split

# Library for model evaluation
from sklearn import metrics

# import for verification Name, email, and OTP
import os
import math
import random
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# 3 - Data Loading

In [2]:
# Load data from folders
path = "D:\Bootcamp Data Science Batch 011 - Hacktiv8\Final Project\p2---final-project-ftds-012-group-003\data\intents.json"

# Defining JSONParser
jp = JSONParser()

# Parsing data intents
jp.parse(path)

# Building dataframe and save it to variable df
df = jp.get_dataframe()

Loading data intents.json made by us for conversation between costumer and chatbot

# 4 - Preprocessing

In [3]:
# call dataset
df

Unnamed: 0,chat_input,intents
0,Hai,salam
1,Halo,salam
2,Hello,salam
3,Hei,salam
4,Hy,salam
...,...,...
523,bisa akses dimana,website
524,info lengkap perusahaan dimana min?,website
525,apakah ada webnya,website
526,profile bank nya dimana ya?,website


## Cleaning Data Text

In [4]:
# Look into sample chat
sample_chat = df.chat_input[430]
sample_chat



'dimana terdekat'

In [5]:
# Transforming the chat into lowercase
chat_lower = sample_chat.lower()
chat_lower

'dimana terdekat'

In [6]:

# Remove punctuations from the chat
chat_punct = chat_lower.translate(str.maketrans("","",string.punctuation))
chat_punct



'dimana terdekat'

In [7]:
# We need to change non-alphabetical characters with spaces too to make the data cleaner.
chat_punct = re.sub("[^A-Za-z\s']"," ", chat_punct)
chat_punct

'dimana terdekat'

In [8]:
# Remove tab, in case the customer is mistype inputting tab on their chat
chat_punct = chat_punct.strip()
chat_punct

'dimana terdekat'

In [9]:
# Defining Stemmer
stemmer = StemmerFactory().create_stemmer()

# Applying stemmer to the chat
output   = stemmer.stem(chat_punct)
output

'mana dekat'

In [10]:
# Create function to case folding corpus in the dataframe
def document_processing(document):
    # Transform Document Into Lowercase
    document = document.lower()

    # Remove Punctuation From Document
    document = document.translate(str.maketrans("","",string.punctuation))

    # Remove Digit From Document
    document = re.sub("[^A-Za-z\s']"," ", document)

    # Remove links
    document = re.sub(r'http\S+', '', document) # remove links
    document = re.sub(r"www.\S+", " ", document) # remove link
    
    # Remove Tab From Document
    document = document.strip()

    #Stemmer
    stemmer = StemmerFactory().create_stemmer()

    # stemming process
    document = stemmer.stem(document)

    return document

Make function for subtitute all equation for cleaning data text, and stemming. 

## Data Splitting

In [11]:
# Split between data (X) and target (y)
X = df.chat_input
y = df.intents

In [12]:
# Do Preprocessing text with Case Folding Function
X_proc = X.apply(document_processing)


Apply function cleaning text to the data set, and apply to all string. 

# Word Vectorization


In [13]:
# Define text Vectorizer
vect = CountVectorizer()
vect.fit(X_proc)

CountVectorizer()

Define Vectorization from scikit learn use to convert string to integer.

In [14]:
# Split Data for Train-Set and Test-Set
X_train, X_test,  y_train, y_test = train_test_split(X_proc, y,test_size=0.2, random_state=5)
print(f"Train-Set shape      : {len(X_train),len(y_train)}")
print(f"Test-Set shape       : {len(X_test),len(y_test)}")

Train-Set shape      : (422, 422)
Test-Set shape       : (106, 106)


# 4 - Model Training

In [15]:
# MultinomialNB Pipeline
nb = make_pipeline(CountVectorizer(),
                   MultinomialNB())

# Training
nb.fit(X_train, y_train)

Pipeline(steps=[('countvectorizer', CountVectorizer()),
                ('multinomialnb', MultinomialNB())])

Define with model machine learning Multinomial Naive Bayes with pipeline.

In [16]:
# Random Forest Pipeline
xgb = make_pipeline(CountVectorizer(),
                   XGBClassifier(random_state=5, learning_rate=0.1, max_depth=2, n_estimators=60))

# Training
xgb.fit(X_train, y_train)





Pipeline(steps=[('countvectorizer', CountVectorizer()),
                ('xgbclassifier',
                 XGBClassifier(base_score=0.5, booster='gbtree',
                               colsample_bylevel=1, colsample_bynode=1,
                               colsample_bytree=1, enable_categorical=False,
                               gamma=0, gpu_id=-1, importance_type=None,
                               interaction_constraints='', learning_rate=0.1,
                               max_delta_step=0, max_depth=2,
                               min_child_weight=1, missing=nan,
                               monotone_constraints='()', n_estimators=60,
                               n_jobs=4, num_parallel_tree=1,
                               objective='multi:softprob', predictor='auto',
                               random_state=5, reg_alpha=0, reg_lambda=1,
                               scale_pos_weight=None, subsample=1,
                               tree_method='exact', validate_par

One of our favourite model Machine Learning that is Extreme Gradient Boosting. Uprgade from ensemble learning and latest model.

In [17]:
# Decision Tree Pipeline
dt = make_pipeline(CountVectorizer(),
                   DecisionTreeClassifier())

# Training
dt.fit(X_train, y_train)

Pipeline(steps=[('countvectorizer', CountVectorizer()),
                ('decisiontreeclassifier', DecisionTreeClassifier())])

Make pipeline model Decision Tree for eficiency running time.

# 5 - Model Evaluation

## Model Evaluation for MultinomialNB

In [18]:
y_train_pred_nb = nb.predict(X_train)
y_test_pred_nb = nb.predict(X_test)
print("Train-Set Multinomial Naive Bayes model accuracy(in %):", metrics.accuracy_score(y_train_pred_nb, y_train)*100)
print("Test-Set Multinomial Naive Bayes model accuracy(in %) :", metrics.accuracy_score(y_test_pred_nb, y_test)*100)

Train-Set Multinomial Naive Bayes model accuracy(in %): 95.73459715639811
Test-Set Multinomial Naive Bayes model accuracy(in %) : 81.13207547169812


Such a great result, from a dataset that is not much but the MultinomialNB model can read it well, so that it gets optimal results for the test data.

## Model Evaluation for Decision Tree

In [19]:
y_train_pred_dt = dt.predict(X_train)
y_test_pred_dt = dt.predict(X_test)
print("Train-Set Decision Tree model accuracy(in %):", metrics.accuracy_score(y_train_pred_dt, y_train)*100)
print("Test-Set Decision Tree model accuracy(in %) :", metrics.accuracy_score(y_test_pred_dt, y_test)*100)

Train-Set Decision Tree model accuracy(in %): 98.81516587677726
Test-Set Decision Tree model accuracy(in %) : 73.58490566037736


Decision Tree is not better than the MultinomialNB model, we can see that the model brings overfit results and is far below MultinomialNB for the test data results.

# Model Evaluation for XGBoost

In [20]:
y_train_pred_xgb = xgb.predict(X_train)
y_test_pred_xgb = xgb.predict(X_test)
print("Train-Set Random Forest model accuracy(in %):", metrics.accuracy_score(y_train_pred_xgb, y_train)*100)
print("Test-Set Random Forest model accuracy(in %) :", metrics.accuracy_score(y_test_pred_xgb, y_test)*100)

Train-Set Random Forest model accuracy(in %): 70.85308056872039
Test-Set Random Forest model accuracy(in %) : 67.9245283018868


oops, the favorite model actually doesn't study the text data model well, the results are the worst among the other 2 models, but goodfit, the data inference will not use the XGBoost model later.

# 6 - Model Saving

Well, in the end, based on the results, we chose the MultinomialNB model to be a machine learning model that will be applied to chatbots in the banking industry, to add and satisfy service to customers who have complaints on Twitter later, then this chatbot will be active if customers chat via Direct Message on our official account.

In [21]:
# Model Saving Random Forest
with open('model_2.pkl', 'wb') as file:
  pickle.dump(nb, file)

# 7 - Building Chatbot

In [22]:
# random number for OTP verification
def OTP(nomor_rekening):
    digits= nomor_rekening
    OTP=""
    for i in range(6):
        OTP+=digits[math.floor(random.random()*10)]
        otp = OTP + " is your OTP"
        msg= otp
    return msg

In [23]:
OTP("3275032705981005")

'552573 is your OTP'

In [36]:
print("Hai kakak, saat ini kamu tehubung dengan akun bisnis resmi Bank T0yip. Pesan ini akan dibalas oleh Costumer Handler Assitant atau kamu bisa panggil Chass :)")
while True:
    # input user
    chat = input("Saya : ")
    # Preprocessing
    chat_processed = document_processing(chat)
    # Intent prediction (tag)
    res = nb.predict_proba([chat_processed])
    # get the probability value and its location
    max_prob = max(res[0])
    max_idx = np.argmax(res[0])
    
    print("Saya :",chat)
    # Define condition for unknown input
    if max_prob < 0.20:
        print("Bot  : Maaf Kak, aku masih gak ngerti maksud kakak :(")
    # Define condition to give response towards specific tag
    else:
        print(f"Bot  : {jp.get_response(nb.classes_[max_idx])}\n")

    # Define solving #gantiATm
    if chat == '#gantiATM':
        print("Baik kak aku bantu proses pergantian ATM ya, Sebelumnya ada beberapa syarat yang harus dilengkapi yaitu dengan isi data diri berikut:")
        print("Masukkan nama lengkap Anda: ")
        while True:
            nama = input("Nama:")
            if len(nama) <= 1:
                print("Kolom nama tidak boleh kosong.")
            else:
                print("\nMasukkan email Anda:")
                break
        while True:
            Email = input("Email: ")
            if "@" not in Email:
                print("Email anda salah tidak ada '@' di dalamnya\n")
            elif "." not in Email:
                print("Email anda salah tidak ada '.' di dalamnya\n")
            else:    
                print("\nMasukkan nomor handphone Anda:")
                break    
        no_hp = int(input("Nomor handphone: "))
        print("\nMasukkan nomor rekening Anda: ")
        while True:
            no_rek = input("Nomor rekekning minimal 10 digit: ")
            if len(no_rek) <= 8:
                print("Nomor rekening Anda salah!")
            else:
                print("\n")
                break
        no_otp = no_rek
        while True:
            # Define random OTP code
            OTP=""
            for i in range(4):
                OTP+=no_otp[math.floor(random.random()*10)]
            else:
                print(f"HATI-HATI PENIPUAN! JANGAN BERIKAN OTP ke siapapun \nKode OTP Anda: {OTP}. Silakan gunakan untuk melanjutkan proses transaksi.\n") 
                print("Masukkan 4 digit kode OTP Anda: ")
            while True:    
                input_otp = input("Kode OTP 4 digit: ")
                if input_otp != OTP:
                    print(f"Kode OTP yang Anda masukkan salah!")
                else:
                    print("Selamat kakak akan menerima kartu ATM baru! Tunggu kami ya kak maksimal 3x24 jam kartu ATM baru kakak sampai rumah.\n")
                    break
            print("\nAda lagi yang bisa Chess bantu kak?")
            break
        
    # Define response to end the chat for tag "penutup"
    if nb.classes_[max_idx] == 'penutup':
        break

    

    

Hai kakak, saat ini kamu tehubung dengan akun bisnis resmi Bank T0yip. Pesan ini akan dibalas oleh Costumer Handler Assitant atau kamu bisa panggil Chass :)
Saya : ganti
Bot  : Oke kak jangan khawatir, untuk penggantian kartu ATM rusak, kartu ATM tertelan, dan kartu ATM hilang kakak bisa menggantinya di kantor cabang terdekat atau bisa Chass bantu kakak, namun akan dikenakan biaya tambahan untuk materai 10000 sebesar Rp10.000 pembayarannya akan otomatis dipotong melalui kartu bank untuk penggantian kartu lama dengan kartu yang baru, dengan cara ketik #gantiATM untuk melanjutkan proses penggantiannya.

Saya : #gantiATM
Bot  : Maaf Kak, aku masih gak ngerti maksud kakak :(
Baik kak aku bantu proses pergantian ATM ya, Sebelumnya ada beberapa syarat yang harus dilengkapi yaitu dengan isi data diri berikut:
Masukkan nama lengkap Anda: 

Masukkan email Anda:

Masukkan nomor handphone Anda:

Masukkan nomor rekening Anda: 


HATI-HATI PENIPUAN! JANGAN BERIKAN OTP ke siapapun 
Kode OTP Anda: 49

The model has successfully made, and can apply to API Twitter Direct Message!