## Dataset
The dataset contains questions and their corresponding answers.

## Step 1: Import libraries and files

* __pandas__: loads the dataset as a DataFrame.
* __nltk__: an NLP library with tools for processing human language; here it is used to tokenize and stem the questions.
* __scikit-learn__: provides the TF-IDF vectorizer, label encoder, SVC classifier and cosine similarity used below.

In [1]:
import pandas as pd
import numpy as np
import pickle
import operator
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split as tts
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder 
from sklearn.metrics.pairwise import cosine_similarity
import random
import nltk
from nltk.stem.lancaster import LancasterStemmer

In [2]:
# Create objects
stemmer = LancasterStemmer()
le = LabelEncoder()

## Step 2: Import Data

In [3]:
FAQs = pd.read_csv('faqs/GSTFAQs.csv').dropna()
greet = pd.read_csv('faqs/Greetings.csv').dropna()

data = pd.concat([FAQs, greet], ignore_index=True)
data.head()

Unnamed: 0,Question,Answer,Class
0,Does aggregate turnover include value of inwar...,Section 2(6) of CGST Act. Aggregate turnover d...,general
1,What if the dealer migrated with wrong PAN as ...,registration would be required as partnership ...,general
2,A taxable person’s business is in many states....,is liable to register if the aggregate turnove...,general
3,Can we use provisional GSTIN or do we get new ...,GSTIN (PID) should be converted into final GST...,general
4,Whether trader of country liquor is required t...,the person is involved in 100% supply of goods...,general


## Step 3: Data Cleaning 

* First we tokenize each question into words.
* After tokenizing, we clean up the tokens by stemming. Stemming strips suffixes to reduce a word to a root form; the code below uses NLTK's `LancasterStemmer`.
* For example, run, runs and running all convey the same meaning and don't need to be treated as different words; stemming reduces them all to run. Unlike lemmatizing, stemming is rule-based and does not map irregular forms such as ran to run.
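As a small illustration of the stemming step, the sketch below runs NLTK's `LancasterStemmer` on a few sample words. The word list is made up for the example and is not taken from the dataset; the stemmer itself needs no extra NLTK data downloads.

```python
from nltk.stem.lancaster import LancasterStemmer

stemmer = LancasterStemmer()

# Hypothetical sample words, not taken from the FAQ dataset
sample_words = ["saying", "crying", "maximum", "running", "ran"]

# Map each word to its Lancaster stem
stems = {word: stemmer.stem(word) for word in sample_words}

for word, stem in stems.items():
    print(word, "->", stem)
```

Inflected forms that share a stem end up as the same token, which shrinks the vocabulary the vectorizer has to learn.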

In [4]:
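def cleanup(sentence):
    # Split the sentence into word tokens (needs the NLTK tokenizer data)
    word_tok = nltk.word_tokenize(sentence)
    # Reduce every token to its stem and rejoin into one string
    stemmed_words = [stemmer.stem(w) for w in word_tok]
    return ' '.join(stemmed_words)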

Pass each question to the cleaning function defined above.

In [5]:
questions_cleaned = []
questions = data['Question'].values
for question in questions:
    questions_cleaned.append(cleanup(question))

## Step 4: Data preprocessing
* Apply vectorization to the cleaned questions.
* Here we use the TF-IDF vectorizer.
* It builds a vocabulary from the unique words in the questions and drops English stop words. Each question then becomes a vector with one weight per vocabulary word: the word's frequency in that question scaled by how rare the word is across all questions, so less frequent but more distinctive words contribute stronger features. This is a weighted representation, not a one-hot encoding.
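To make the weighting concrete, here is a minimal sketch that applies `TfidfVectorizer` with the same settings to a tiny corpus. The three sentences are made up for illustration and are not from the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus, made up for illustration
corpus = [
    "what is the gst rate",
    "how to register for gst",
    "what is the penalty for late filing",
]

vectorizer = TfidfVectorizer(min_df=1, stop_words='english')
X = vectorizer.fit_transform(corpus).toarray()

vocab = vectorizer.vocabulary_  # maps each kept word to its column index
print(sorted(vocab))

# In the first question, "rate" occurs in one document and "gst" in two,
# so "rate" receives the larger weight
first = X[0]
print("gst:", first[vocab["gst"]], "rate:", first[vocab["rate"]])
```

Stop words such as "the" never enter the vocabulary, and the entries of `X` are fractional weights rather than 0/1 flags.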

In [6]:
# Vectorization for training
def vectorize(clean_questions):
    vectorizer = TfidfVectorizer(min_df=1, stop_words='english')  
    vectorizer.fit(clean_questions)
    transformed_X_csr = vectorizer.transform(clean_questions)
    transformed_X = transformed_X_csr.toarray()  # csr_matrix to dense NumPy array
    return transformed_X, vectorizer

# Vectorization for input query
def query(clean_usr_msg, vectorizer):
    t_usr_array = None
    try:
        t_usr = vectorizer.transform([clean_usr_msg])
        t_usr_array = t_usr.toarray()
    except Exception as e:
        print(e)
        # Use the argument here rather than relying on a global variable
        return "Could not follow your question [" + clean_usr_msg + "], Try again"

    return t_usr_array

## Step 5: Split data into train and test datasets

Here we split the data so that the training set contains 75% of the data and the test set contains the remaining 25%.

In [7]:
X, vectorizer = vectorize(questions_cleaned)

y = data['Class'].values.tolist()
y = le.fit_transform(y)

# Split the dataset into train and test dataset
X_train, X_test, y_train, y_test = tts(X, y, test_size=.25, random_state=42)
print(X_train.shape)


(265, 943)
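The 75/25 split can be checked by hand. The sketch below mirrors how `train_test_split` sizes the two sets when `test_size` is a fraction: the test count is rounded up and the rest goes to training. The helper `split_sizes` and the total of 354 rows are assumptions for illustration; 354 is chosen because it is consistent with the 265 training rows printed above.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size."""
    n_test = math.ceil(test_size * n_samples)  # test share is rounded up
    n_train = n_samples - n_test               # remainder is used for training
    return n_train, n_test

# Assumed total row count, consistent with the 265 training rows shown above
n_train, n_test = split_sizes(354, 0.25)
print(n_train, n_test)
```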


## Step 6: Train the model

Here we train an SVC (support vector classifier) with a linear kernel on the training data and report its accuracy on the test set.

In [8]:
model = SVC(kernel='linear')
model.fit(X_train, y_train)
print("SVC:", model.score(X_test, y_test))       

SVC: 0.651685393258427


## Step 7: Calculate the accuracy of our model

Accuracy is the share of all predictions made that are correct.

Accuracy = (correct predictions / total predictions made) * 100
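The formula above translates directly into a few lines of code. This is a minimal sketch with made-up labels; the helper name `accuracy_percent` is hypothetical.

```python
def accuracy_percent(y_true, y_pred):
    """Percentage of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true) * 100

# Made-up labels for illustration: 3 of 4 predictions are right
y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]
result = accuracy_percent(y_true, y_pred)
print(result)  # 75.0
```

With a fitted scikit-learn classifier, `model.score(X_test, y_test)` returns the same quantity as a fraction between 0 and 1.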


In [9]:
count = 0

# range(len(X_test)) visits every test sample, including the last one,
# so the result matches model.score above
for i in range(len(X_test)):
    t_usr_array = X_test[i : i + 1]
    prediction = model.predict(t_usr_array)[0]
    if prediction == y_test[i]:
        count += 1

accuracy = (count / len(y_test)) * 100
print('Accuracy =', round(accuracy, 2), '%')

Accuracy = 65.17 %


## Step 8: Save model to the pickle file

In [10]:
with open('model.pkl','wb') as f:
    pickle.dump(model,f)
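To answer queries in a later session, the fitted vectorizer and label encoder are needed alongside the model, since Step 9 uses all three. One option, sketched below, is to pickle them together in a single dictionary. The stand-in values, the dictionary keys and the file name `chatbot_bundle.pkl` are assumptions for illustration; in the notebook the values would be `model`, `vectorizer` and `le`.

```python
import os
import pickle
import tempfile

# Stand-ins for the fitted objects (hypothetical placeholders)
bundle = {
    "model": "fitted SVC goes here",
    "vectorizer": "fitted TfidfVectorizer goes here",
    "label_encoder": "fitted LabelEncoder goes here",
}

# Hypothetical file name, written to a temporary directory for the demo
path = os.path.join(tempfile.mkdtemp(), "chatbot_bundle.pkl")

with open(path, "wb") as f:
    pickle.dump(bundle, f)

# Later, load everything back in one step
with open(path, "rb") as f:
    restored = pickle.load(f)

print(sorted(restored))
```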

## Step 9: Test the model

Set your query in `usr` and check the output.

In [11]:
usr = 'Hello'
cleaned_usr = cleanup(usr)
t_usr_array = query(cleaned_usr, vectorizer)

# Predict the class of the query, then keep only the questions of that class
prediction = model.predict(t_usr_array)[0]
class_ = le.inverse_transform([prediction])[0]
questionset = data[data['Class'] == class_]

# Score each candidate question against the query
cos_sims = []
for question in questionset['Question']:
    cleaned_question = cleanup(question)
    question_arr = query(cleaned_question, vectorizer)
    sims = cosine_similarity(question_arr, t_usr_array)[0, 0]  # scalar score
    cos_sims.append(sims)

# Print the answer paired with the most similar question
if len(cos_sims) > 0:
    ind = cos_sims.index(max(cos_sims))
    print(data['Answer'][questionset.index[ind]])


Hi
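The loop in Step 9 scores one question at a time. As a sketch of the same idea, the function below computes the cosine similarity between a query vector and every row of a matrix in one pass and picks the best match. The vectors are made up and `best_match` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def best_match(question_vectors, query_vector):
    """Return (index, score) of the row most similar to query_vector."""
    q = np.asarray(question_vectors, dtype=float)
    u = np.asarray(query_vector, dtype=float)
    # Cosine similarity = dot product divided by the product of the norms
    norms = np.linalg.norm(q, axis=1) * np.linalg.norm(u)
    # Rows (or a query) with zero norm get similarity 0 instead of dividing by 0
    sims = np.divide(q @ u, norms, out=np.zeros(len(q)), where=norms > 0)
    idx = int(np.argmax(sims))
    return idx, float(sims[idx])

# Made-up TF-IDF-like vectors for three questions and one query
questions = [[1.0, 0.0, 0.0],
             [0.0, 1.0, 1.0],
             [0.0, 0.0, 0.0]]
query = [0.0, 2.0, 2.0]

idx, score = best_match(questions, query)
print(idx, round(score, 3))
```

The function expects a one-dimensional query vector, so with the notebook's variables the query would be passed as `t_usr_array[0]`.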
