## <b>Chatbot</b>

## Problem Statement:
Build an AI-based chatbot that predicts the answer to a question asked by the user.
## Dataset:
Here we are using Bridgelabz fellowship program's frequently asked questions as dataset for tr

## Solution:
Here, we will use Natural Language Processing (NLP) for text processing and a machine learning algorithm, the Support Vector Machine (SVM), for prediction.

### Immersive experience to be gained
This notebook walks through an AI-based chatbot in detail: how it works, how we pre-process the chatbot dataset, and how we predict the answer to a user's question.


## Step 1: Import all the required libraries 

* __NLTK__ : The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing for English, written in the Python programming language.
* __Pandas__ : pandas is a software library written for the Python programming language for data manipulation, analysis and structured storage. In particular, it offers data structures and operations for manipulating numerical tables and time series.
* __Numpy__ : NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
* __Sklearn__ : Scikit-learn (formerly scikits.learn) is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy. It is built on SciPy (Scientific Python), which must be installed before you can use scikit-learn.
* __Pickle__ : The Python pickle module is used for serializing and de-serializing a Python object structure. Pickling is a way to convert a Python object (list, dict, etc.) into a byte stream. The idea is that this byte stream contains all the information necessary to reconstruct the object in another Python script.

In [1]:
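#Loading libraries
import pandas as pd
import numpy as np
import pickle
import nltk
import warnings
warnings.filterwarnings('ignore')
import seaborn as sns
import re
# Import the scikit-learn submodules used below explicitly;
# a bare "import sklearn" does not guarantee they are loaded.
import sklearn
import sklearn.feature_extraction.text
import sklearn.model_selection
import sklearn.preprocessing
import sklearn.svm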

## Step 2 : Read all the required data and combine it

In [2]:
#loading data
try:
    faq = pd.read_csv('Data/chatbot_faq.csv')
    greeting = pd.read_csv('Data/Greetings.csv')
except (FileNotFoundError, IOError):
    print("Wrong file or file path")
    raise  # stop here: the cells below need both DataFrames

Next we inspect the faq and greeting data, then concatenate them into one dataset.

In [3]:
print(faq.shape)
print(faq.head())

(30, 5)
                                       Question        DomainIntent  \
0           How long is the fellowship program?  fellowship program   
1        How much does fellowship program cost?  fellowship program   
2               What is the fellowship program?  fellowship program   
3  Can the fellowship program be done remotely?  fellowship program   
4                              How do I get in?  fellowship program   

               Intent                                             Answer  \
0            duration      The program is 4 months on a full-time basis.   
1      admission_fees   The program is free to the fellows. You do no...   
2  fellowship program  Coding jobs with emerging tech product compani...   
3         remote work   No! We believe that interaction with the ment...   
4               enter   You will require to register for one of our r...   

     Class  
0  general  
1  general  
2  general  
3  general  
4  general  


In [4]:
print(greeting.shape)
print(greeting.head())

(21, 5)
    Question        DomainIntent     Intent  Answer      Class
0      Hello  fellowship program  greetings      Hi  greetings
1         Hi  fellowship program  greetings   Hello  greetings
2        Hii  fellowship program  greetings   Hello  greetings
3        Hey  fellowship program  greetings      Hi  greetings
4  Hey There  fellowship program  greetings     Hey  greetings


In [5]:
data = pd.concat([faq, greeting], ignore_index=True)
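
To make the effect of `ignore_index=True` concrete, here is a minimal sketch on two tiny stand-in tables. The names `faq_demo` and `greeting_demo` and their contents are hypothetical and only illustrate the call; they are not the real dataset.

```python
import pandas as pd

# Hypothetical stand-ins for the FAQ and greeting tables.
faq_demo = pd.DataFrame({"Question": ["How long is the program?"],
                         "Class": ["general"]})
greeting_demo = pd.DataFrame({"Question": ["Hello", "Hi"],
                              "Class": ["greetings", "greetings"]})

# ignore_index=True discards the original row labels and renumbers from 0,
# so the combined frame has one clean, unique index.
combined = pd.concat([faq_demo, greeting_demo], ignore_index=True)

print(combined.shape)
print(list(combined.index))
```

Without `ignore_index=True`, both source frames would keep their own labels and the label 0 would appear twice in the result.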

After combining the data, we can inspect it using:


In [6]:
print ('The dataset has {0} rows and {1} columns'.format(data.shape[0],data.shape[1]))
print(data.iloc[29:32,:-1].values)

The dataset has 51 rows and 5 columns
[['What is the working hours of the training program?' ' fellowship'
  ' time'
  ' From morning 8.30 AM to 7.30 PM the fellowship engineers are expected to code. In the beginning it is data structures later it is live sample app and lastly it is to develop App solving the real-world problem statement.']
 ['Hello' 'fellowship program' 'greetings' 'Hi']
 ['Hi' 'fellowship program' 'greetings' 'Hello']]


In [7]:
#shuffle the data
data = data.sample(frac=1)
#print the shuffled frame itself; calling sample() again would show a different order
print(data)

                                             Question        DomainIntent  \
21      Do I have weekends off in fellowship program?  fellowship program   
36                              Hey, How is it going?  fellowship program   
1              How much does fellowship program cost?  fellowship program   
46                                       Good Morning  fellowship program   
30                                              Hello  fellowship program   
49                                           Good Day  fellowship program   
17  What percentage of the fellowship is developin...  fellowship program   
42                              Hi, nice to meet you.  fellowship program   
40                                  Nice to meet you.  fellowship program   
12       What happens to fellows after they graduate?  fellowship program   
19  What tools will fellowship engineer get a chan...  fellowship program   
20  how much time will take to complete fellowship...  fellowship program   

In [8]:
data.to_csv('01_shuffle_data.csv')

## Step 3: Data pre-processing


In this stage, we clean the text and remove inconsistencies from the dataset so that every question is in a uniform form:

* First we'll tokenize each word in the dataset.
* After we tokenize, we will clean up the tokens by lemmatizing.
- __Tokenizing__ : This breaks up a string into a list of words or pieces based on a specified pattern, for example with regular expressions (RegEx).
- e.g., "white brown fox" = "white", "brown", "fox"
- __Lemmatizing__ : Lemmatizing is the process of converting a word into its root form.
- e.g., "Playing", "Played" = "play".

In the data_cleanup() function, we first strip unwanted characters and tokenize the sentence (separating each word), then lemmatize each token (converting it into its root form), and at the end join all the words back into a sentence.
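
As a rough illustration of the cleaning and tokenizing steps, the sketch below applies the same regular expression used in this notebook and then splits on whitespace. It is a simplification under stated assumptions: the helper name `clean_and_split` is hypothetical, a plain `str.split()` stands in for `nltk.word_tokenize`, and lemmatization is left out so that the snippet runs without any NLTK data.

```python
import re

# Same pattern as the notebook's cleaning step, written as a raw string.
TEXT_CLEANING_RE = r"@\S+|https?:\S+|http?:\S|[^A-Za-z0-9]+"

def clean_and_split(sentence):
    """Lowercase, replace unwanted runs with a space, and split on whitespace."""
    cleaned = re.sub(TEXT_CLEANING_RE, " ", str(sentence).lower()).strip()
    return cleaned.split()

tokens = clean_and_split("How long is the fellowship program?")
print(tokens)
```

The question mark disappears because the last alternative in the pattern matches every run of non-alphanumeric characters.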

- After removing unwanted data, let's take a few more steps to make the data understandable to our program. That's why we do preprocessing.
- Here we are dealing with text data. We can understand it, but our machines can't, so we need to convert the data from text to numeric form.
- Vectorization : The process of converting text into numbers is called vectorization in ML.
- TF-IDF : TF-IDF stands for term frequency-inverse document frequency. It tells how important a word is to a sentence. The importance of a word grows with the number of times it occurs in that sentence and shrinks as the word appears in more of the other sentences. To understand it, let's look at each term:
- __Term Frequency (TF)__ : How frequently a word appears in a sentence. We can measure it with the equation:

- TF = __(Total number of times the word "W" occurred in the sentence) / (Total number of words in the sentence)__
- __Inverse Document Frequency (IDF)__ : How rare a word is across all the sentences. A word that appears in many sentences gets a low IDF.
- IDF = __log( (Total number of sentences) / (Number of sentences with word "W" in it))__
* Apply vectorization to the cleaned questions.
* Here we have used the TF-IDF vectorizer (TfidfVectorizer).
* It builds a vocabulary from the unique words in the text given to it and represents each question as a vector of TF-IDF weights over that vocabulary. With stop_words='english' it also removes stop words, so words that occur less often but are more informative give us better features.
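
The TF and IDF formulas above can be checked by hand with a few lines of plain Python. This is a minimal sketch of those two textbook formulas on three made-up sentences; the function names and the toy data are assumptions for illustration only.

```python
import math

# Three toy "sentences", already cleaned and tokenized.
sentences = [
    ["how", "long", "is", "the", "program"],
    ["how", "much", "does", "the", "program", "cost"],
    ["hello"],
]

def tf(word, sentence):
    """Occurrences of the word divided by the number of words in the sentence."""
    return sentence.count(word) / len(sentence)

def idf(word, sentences):
    """log(total sentences / sentences containing the word)."""
    containing = sum(1 for s in sentences if word in s)
    return math.log(len(sentences) / containing)

def tf_idf(word, sentence, sentences):
    return tf(word, sentence) * idf(word, sentences)

score_long = tf_idf("long", sentences[0], sentences)        # appears in 1 sentence
score_program = tf_idf("program", sentences[0], sentences)  # appears in 2 sentences
print(score_long, score_program)
```

The word "long" scores higher than "program" because it appears in fewer sentences. Note that scikit-learn's TfidfVectorizer, with its default settings, uses a smoothed IDF and normalizes each row, so its numbers will differ from this sketch.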

In [9]:
class Data_Cleanig:
    def data_cleanup(self, sentence):
        # Replace @mentions, URLs and runs of non-alphanumeric characters with a space
        TEXT_CLEANING_RE = r"@\S+|https?:\S+|http?:\S|[^A-Za-z0-9]+"
        cleaned_text = re.sub(TEXT_CLEANING_RE, ' ', str(sentence).lower()).strip()
        # Split into tokens, lemmatize each one, and rebuild the sentence
        word_tok = nltk.word_tokenize(cleaned_text)
        lemmatizer = nltk.stem.WordNetLemmatizer()
        lemmatized_words = [lemmatizer.lemmatize(w) for w in word_tok]
        return ' '.join(lemmatized_words)

Pass each question to the cleaning function defined above.

In [10]:
cleaning = Data_Cleanig()
questions_cleaned = []
questions = data['Question'].values
for question in questions:
    questions_cleaned.append(cleaning.data_cleanup(question))

In [11]:
data['Cleaned_questions'] = questions_cleaned

In [12]:
data.to_csv('02_cleaned_data.csv')

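After cleaning, the sentence __"How long is the fellowship program?"__ is lowercased and loses its question mark, because the regular expression replaces every run of non-alphanumeric characters with a space before the tokens are lemmatized.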

In [13]:
class Preprocessing():       
    # Vectorization for training
    def vectorize(self, clean_questions):
        vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(min_df=1, stop_words='english')
        # Learn the vocabulary and IDF weights, then transform the questions
        transformed_X_csr = vectorizer.fit_transform(clean_questions)
        transformed_X = transformed_X_csr.toarray()  # sparse CSR matrix to dense numpy array
        return transformed_X, vectorizer

    # Vectorization for input query
    def query(self, clean_usr_msg, usr, vectorizer):
        t_usr_array= None
        try:
            t_usr = vectorizer.transform([clean_usr_msg])
            t_usr_array = t_usr.toarray()
        except Exception as e:
            print(e)
            return "Could not follow your question [" + usr + "], Try again"

        return t_usr_array

In [14]:
le = sklearn.preprocessing.LabelEncoder()
preprocessing = Preprocessing()
X, vectorizer = preprocessing.vectorize(questions_cleaned)

y = data['Class'].values.tolist()
y = le.fit_transform(y)

In [15]:
y = y.reshape(len(y),1)
print(X.shape, y.shape)


(51, 68) (51, 1)


In [16]:
after_vectorize = np.append(X, y, axis=1)
np.savetxt("03_after_vectorize.csv", after_vectorize, delimiter=",")

## Step 5: Split the data into train and test set
- Now our data is ready to feed to the program. But here we'll split the data into train and test dataset so that after training the model we can test the model on the test dataset and find out how accurate are its predictions.
- Here we are splitting the data so that the training dataset contains 75% of the data and the test dataset contains 25% of the total data (test_size=0.25).
- Here we are using the train_test_split function from the sklearn library. We'll train our model on x_data_train and y_data_train, and test it on x_data_test and y_data_test.

- test_size: Here we specify the size we want for our test dataset.
- random_state: When we use a random number generator for number or sequence generation, we give a starting number (AKA seed). When we provide the same seed, every time it’ll generate the same sequence as the first one. That’s why to keep the same random values every time, we give seed as random_state in train_test_split().
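
To show what test_size and a fixed seed do, here is a standard-library sketch of a seeded shuffle-and-split. It is not scikit-learn's implementation: the helper `seeded_split` is hypothetical, and rounding the test share up is an assumption made here, chosen because it reproduces the 38/13 row counts printed in this notebook.

```python
import math
import random

def seeded_split(rows, test_size, seed):
    """Shuffle a copy of rows with a fixed seed, then cut off the test share."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # same seed -> same order on every run
    n_test = math.ceil(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(51))  # 51 row indices, the size of this dataset
train_a, test_a = seeded_split(rows, test_size=0.25, seed=42)
train_b, test_b = seeded_split(rows, test_size=0.25, seed=42)

print(len(train_a), len(test_a))
print(test_a == test_b)
```

Running the split twice with the same seed gives identical train and test sets, which is what makes results comparable between runs.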

In [17]:
#split the dataset into train and test sets
x_data_train, x_data_test, y_data_train, y_data_test = sklearn.model_selection.train_test_split(
        X, y, test_size=0.25,random_state=42)

We can't view all the values of an entire matrix, but by looking at its shape we can decide whether we are going in the right direction. Using ".shape" we can see the shape of a matrix, which is also helpful in debugging.

In [18]:
print(x_data_train.shape, x_data_test.shape, y_data_train.shape, y_data_test.shape)

(38, 68) (13, 68) (38, 1) (13, 1)


## Step 6: Train the Model using SVM

In [19]:
#Using the sklearn SVM classifier (SVC) with a linear kernel
model = sklearn.svm.SVC(kernel='linear')
#ravel() flattens the (n, 1) label column to 1-D, the shape fit() expects
model.fit(x_data_train, y_data_train.ravel())


SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [20]:
#using the test data to test the model
y_data_pred = model.predict(x_data_test)
y_data_pred = y_data_pred.reshape(len(y_data_pred),1)
type(y_data_pred), y_data_pred.shape

(numpy.ndarray, (13, 1))

In [21]:
#saving the y_pred_test_comparison in the csv file
y_pred_test_comparison = np.append(y_data_test, y_data_pred, axis=1)
np.savetxt("03_y_pred_test_comparison.csv", y_pred_test_comparison, delimiter=",")

In [22]:
diffs = y_data_test - y_data_pred
print(diffs)

[[ 0]
 [ 0]
 [ 0]
 [-1]
 [ 0]
 [ 0]
 [ 0]
 [ 0]
 [ 0]
 [ 0]
 [ 0]
 [ 0]
 [ 0]]


In [23]:
#SVM model score computed by the sklearn library
print("SVC:", model.score(x_data_test, y_data_test))

SVC: 0.9230769230769231
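
For a classifier, `score` reports mean accuracy, so the value above can be recomputed from the differences printed earlier. A small sketch, with the `diffs` values copied from that output:

```python
# Differences copied from the output above: 0 means the prediction matched the label.
diffs = [0, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

correct = sum(1 for d in diffs if d == 0)
accuracy = correct / len(diffs)
print(correct, len(diffs), accuracy)
```

Twelve of the thirteen test questions were classified correctly, which gives the 12/13 accuracy reported by the model. With only 13 test rows, a single misclassification changes the score by almost 8 percentage points, so the figure should be read with that in mind.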


## Step 7: Save the model in a pickle file

We now save the trained SVM model, together with the cleaning object, the preprocessing object and the fitted vectorizer, to a pickle file.
We save our model to a pickle file so that when we want to perform predictions on unseen data, we don't have to train the model again. Most Python objects can be pickled so that they can be saved on disk. Pickle "serializes" the object before writing it to file: pickling is a way to convert a Python object (list, dict, etc.) into a byte stream.

In [25]:
with open('../Model Testing/ml-chatbot-backend/model.pkl','wb') as f:
    pickle.dump(cleaning, f)
    pickle.dump(preprocessing, f)
    pickle.dump(vectorizer, f)
    pickle.dump(model, f)
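
Because four objects are dumped into one file, they have to be read back with four `pickle.load` calls in the same order. The sketch below shows the pattern with simple stand-in strings and a temporary file; the stand-in values and the file name `demo.pkl` are hypothetical.

```python
import os
import pickle
import tempfile

# Hypothetical stand-ins for the cleaning, preprocessing, vectorizer and model objects.
objects = ["cleaning", "preprocessing", "vectorizer", "model"]

path = os.path.join(tempfile.mkdtemp(), "demo.pkl")

# Several dumps into one file are stored back to back.
with open(path, "wb") as f:
    for obj in objects:
        pickle.dump(obj, f)

# Each load call reads the next object, so the order must mirror the dumps.
with open(path, "rb") as f:
    loaded = [pickle.load(f) for _ in range(len(objects))]

print(loaded)
```

Instances of classes defined in this notebook are pickled by reference to their class, so the script that loads the file must also be able to import or define those classes.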


## Summary

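This project was created to help people understand the complete process of machine learning / data science modeling: loading and combining the data, cleaning the text, vectorizing it with TF-IDF, splitting it into train and test sets, training an SVM, and saving the model. Following these steps helps ensure that you don't miss any information in the dataset and also helps another person understand your work.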