#### Purpose of this Notebook
* A typical chatbot needs to understand the intent of each message (in layman's terms, what is being asked) before it can answer. If you are building an FAQ chatbot with 1K question-and-answer pairs, you need to build almost 1K intents so that the bot can answer each question. If a customer's FAQ grows to 10K, is building 10K intents viable, and can the bot scale with it?
* How can we solve this problem without writing intents, so that the bot can scale?
* This solution can also be used for search indexing.
* It can be integrated into a chatbot, or it can function as an independent solution.

#### Theoretical solution
* Broadly, all we need to do is find, in the matrix of questions and answers, the stored question that best matches the question asked. Once we have it, we can pull the corresponding answer from the index.
* This notebook uses soft cosine similarity from gensim as the matching algorithm. Unlike plain cosine similarity, soft cosine also gives credit to related but non-identical terms (for example 'retire' and 'retirement'), based on the similarity of their word embeddings.
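
To make the idea concrete, here is a minimal pure-Python sketch of the soft cosine formula, s(a, b) = aᵀSb / (√(aᵀSa) · √(bᵀSb)), where S holds term-to-term similarities. The three-word vocabulary and the 0.7 similarity value are made up for illustration; in the notebook, gensim builds S from word vectors.

```python
import math

def soft_cosine(a, b, S):
    """Soft cosine similarity between bag-of-words vectors a and b.

    S[i][j] is the similarity between term i and term j (S[i][i] == 1).
    """
    def bilinear(x, y):
        # x^T S y, written out as an explicit double sum
        return sum(x[i] * S[i][j] * y[j]
                   for i in range(len(x)) for j in range(len(y)))

    denom = math.sqrt(bilinear(a, a)) * math.sqrt(bilinear(b, b))
    return bilinear(a, b) / denom if denom else 0.0

# Toy vocabulary (illustrative only): ["retire", "retirement", "taxes"]
S = [
    [1.0, 0.7, 0.0],  # "retire" is assumed to be 0.7-similar to "retirement"
    [0.7, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query = [1, 0, 0]  # "retire"
doc_a = [0, 1, 0]  # "retirement"
doc_b = [0, 0, 1]  # "taxes"

print(soft_cosine(query, doc_a, S))  # non-zero although no term is shared
print(soft_cosine(query, doc_b, S))  # zero: unrelated terms
```

With S set to the identity matrix, the function reduces to ordinary cosine similarity, which would score both documents at zero for this query.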

#### Other similar solutions
* The same can be achieved with other similarity-matching approaches; which one to use depends on the data and the problem statement. Some of them work on word counts, such as count vectorizer or tf-idf (term frequency-inverse document frequency).
    * A sample solution built on tf-idf is available for the same dataset
    * A solution built on document-to-vector (doc2vec) is available for the same dataset
    * A solution using Word Mover's Distance (WMD) is available for the same dataset
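
For comparison, a count-based approach such as tf-idf can be sketched in a few lines of plain Python. This is not the tf-idf notebook mentioned above, only an illustration under stated assumptions: the three tokenized questions are a toy sample, and the `log(N/df) + 1` weighting is one common variant (libraries differ in the exact formula).

```python
import math
from collections import Counter

def fit_idf(docs):
    """Inverse document frequency per term (variant: log(N / df) + 1)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}

def tfidf_vector(tokens, idf):
    """Sparse {term: weight} vector; terms unseen during fitting are ignored."""
    return {t: c * idf[t] for t, c in Counter(tokens).items() if t in idf}

def cosine(u, v):
    """Plain cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

# Toy sample of pre-processed questions (illustrative only)
docs = [["retire", "early"],
        ["accounting", "taxes", "withdrawals"],
        ["plan", "retirement"]]
idf = fit_idf(docs)
doc_vectors = [tfidf_vector(doc, idf) for doc in docs]

def score_all(query_tokens):
    """Score a tokenized query against every document."""
    query_vec = tfidf_vector(query_tokens, idf)
    return [cosine(query_vec, dv) for dv in doc_vectors]

scores = score_all(["retire", "early"])
print(scores)  # highest for docs[0]; docs[2] scores 0.0 ("retire" != "retirement")
```

The zero score for the "retirement" question is the gap that soft cosine similarity is meant to close: exact-term methods cannot relate "retire" to "retirement".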

#### References
* https://www.machinelearningplus.com/nlp/cosine-similarity/
* https://www.machinelearningplus.com/nlp/gensim-tutorial/#18howtocomputesimilaritymetricslikecosinesimilarityandsoftcosinesimilarity

In [1]:
# Do all package imports 
import os

import gensim.downloader as api
from gensim.corpora import Dictionary
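# These import paths appear to match gensim 3.x (see the logs further down);
# on gensim 4.x, WordEmbeddingSimilarityIndex is imported from gensim.similarities instead
from gensim.models import Word2Vec, WordEmbeddingSimilarityIndex
from gensim.similarities import SoftCosineSimilarity, SparseTermSimilarityMatrix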

# The import and download below only need to run once; uncomment them as required
#import nltk
#nltk.download('stopwords')
from nltk.corpus import stopwords
import re

import pandas as pd

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [2]:
# Build the path to the data file (a CSV file in this case)
root_dir = "."
data_dir = "data"
data_file = "co_browsing_faq.csv"
data_path = os.path.join(root_dir, data_dir, data_file)

# Uncomment to cross-check the path built above
#data_path

# Import Data
data = pd.read_csv(data_path, encoding="ISO-8859-1")

In [3]:
# Inspect some of the rows 
#data.head()
#data.tail()
list(data['question'][10:15])

['Cost of Living for Your Location, considered for income after retirement',
 'Accounting for Taxes on Withdrawals',
 'What if I want to plan to leave investments to my heirs?',
 "How can I plan for retirement when I don't know anything about retirement",
 'What about If I retire early?']

In [4]:
# Get stop words from nltk
stopWords = stopwords.words('english')
# Inspect some stop words
stopWords[10:15]

["you've", "you'll", "you'd", 'your', 'yours']

In [5]:
# For pre-processing data
def clean_data(sentence):
    # convert to lowercase, ignore all special characters - 
    # keep only alpha-numericals and spaces (not removing full-stop here)
    sentence = re.sub(r'[^A-Za-z0-9\s.]', r'', str(sentence).lower())
    sentence = re.sub(r'\n', r' ', sentence)
    
    # remove stop words and return the remaining tokens as a list
    return [word for word in sentence.split() if word not in stopWords]

In [6]:
# Pre-process all the questions from the data frame
questions_list = data.question.map(clean_data)
# Convert the result to a plain list; the output is a list of token lists (the format the algorithm requires)
questions_list = questions_list.tolist()
# Inspect some of the questions after data clean up
questions_list[10:15]

[['cost', 'living', 'location', 'considered', 'income', 'retirement'],
 ['accounting', 'taxes', 'withdrawals'],
 ['want', 'plan', 'leave', 'investments', 'heirs'],
 ['plan', 'retirement', 'dont', 'know', 'anything', 'retirement'],
 ['retire', 'early']]
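
Note that 'dont' survives in the output above: special characters are stripped before stop words are filtered, so "don't" becomes "dont" and no longer matches the stop-word entry. A possible variant, sketched below, filters stop words first. The function name `clean_data_stopwords_first` and the small hard-coded stop-word set are assumptions for illustration; the set stands in for `stopwords.words('english')` so the snippet runs without NLTK.

```python
import re

# Small stand-in for nltk's stopwords.words('english'); illustrative only
STOP_WORDS = {"how", "can", "i", "for", "when", "don't",
              "about", "what", "if", "my", "to"}

def clean_data_stopwords_first(sentence):
    """Lowercase, drop stop words, then strip special characters."""
    words = str(sentence).lower().split()
    # Compare against the stop list while apostrophes are still present
    kept = [w for w in words if w.strip(".,?!") not in STOP_WORDS]
    # Same character filter as clean_data: keep letters, digits and full stops
    cleaned = [re.sub(r"[^a-z0-9.]", "", w) for w in kept]
    return [w for w in cleaned if w]

print(clean_data_stopwords_first(
    "How can I plan for retirement when I don't know anything about retirement"))
```

Whether this ordering is preferable depends on the data; the rest of the notebook keeps using `clean_data` as defined above.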

In [7]:
# Option 1: train word vectors on our own data
# The line below builds word2vec embeddings from the questions themselves
# (gensim 4.x renamed size -> vector_size and iter -> epochs)

#w2v_model = Word2Vec(questions_list, size=50, min_count=1, iter=50) 

In [8]:
%%time
"""
Option 2: load word vectors pre-trained on a large corpus. The link below lists all the
pre-trained word vectors available through gensim.
https://github.com/RaRe-Technologies/gensim-data
"""
#w2v_model = api.load("glove-wiki-gigaword-50")
# This takes 3-4 minutes to load; the projection weights for the pre-trained word2vec model below are around 1.7GB
w2v_model = api.load("word2vec-google-news-300")

2020-03-02 16:23:23,596 : INFO : loading projection weights from C:\Users\mommasani.srinivasul/gensim-data\word2vec-google-news-300\word2vec-google-news-300.gz
2020-03-02 16:27:15,467 : INFO : loaded (3000000, 300) matrix from C:\Users\mommasani.srinivasul/gensim-data\word2vec-google-news-300\word2vec-google-news-300.gz


Wall time: 4min 13s


In [15]:
%%time
print(w2v_model.similarity('retirement','planning'))
print(w2v_model.most_similar('retirement'))
#print(w2v_model.wv['save'])
#print(len(w2v_model.wv['save']))
#print(len(w2v_model.wv.vocab))

0.13682444
[('retirment', 0.7519336938858032), ('retire', 0.7195518016815186), ('Retirement', 0.6784062385559082), ('retiring', 0.6532286405563354), ('retirements', 0.6454570889472961), ('pension', 0.6300544142723083), ('retirees', 0.6023612022399902), ('pensions', 0.5700603723526001), ('retires', 0.562114417552948), ('retired', 0.5513665676116943)]
Wall time: 585 ms


In [16]:
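# Construct the term similarity index from the trained or pre-trained word vectors
# (api.load() returns KeyedVectors; on gensim 4.x you may need to pass w2v_model directly, without .wv)
termsim_index = WordEmbeddingSimilarityIndex(w2v_model.wv)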
# Construct the dictionary from our documents
dictionary = Dictionary(questions_list)
# Construct document to bag of words using the dictionary built above
bow_corpus = [dictionary.doc2bow(question) for question in questions_list]
# Construct similarity matrix using term similarity index built from word2vec and from the document dictionary
similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary)  
# Construct final document similarity index
scs_sim_index = SoftCosineSimilarity(bow_corpus, similarity_matrix, num_best=10)

  
2020-03-02 16:30:59,889 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2020-03-02 16:30:59,892 : INFO : built Dictionary(92 unique tokens: ['get', 'planning', 'retirement', 'started', 'plan']...) from 34 documents (total 161 corpus positions)
2020-03-02 16:30:59,897 : INFO : constructing a sparse term similarity matrix using <gensim.models.keyedvectors.WordEmbeddingSimilarityIndex object at 0x000001729825B348>
2020-03-02 16:30:59,898 : INFO : iterating over columns in dictionary order
2020-03-02 16:30:59,901 : INFO : PROGRESS: at 1.09% columns (1 / 92, 1.086957% density, 1.086957% projected density)
2020-03-02 16:31:35,711 : INFO : constructed a sparse term similarity matrix with 2.670132% density
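
The dictionary and bag-of-words steps above can be pictured with a small pure-Python sketch. It is a simplification, not gensim's implementation: the helper names `build_token_ids` and `doc_to_bow` are hypothetical, and ids are assigned in first-seen order, which may differ from the ids gensim's `Dictionary` assigns.

```python
from collections import Counter

def build_token_ids(docs):
    """Assign an integer id to each distinct token, in first-seen order."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def doc_to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs; unknown tokens are dropped."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# Two of the pre-processed questions shown earlier
docs = [["plan", "retirement", "dont", "know", "anything", "retirement"],
        ["retire", "early"]]
token2id = build_token_ids(docs)

print(doc_to_bow(docs[0], token2id))                         # "retirement" is counted twice
print(doc_to_bow(["retire", "early", "tomorrow"], token2id))  # "tomorrow" is unknown and dropped
```

The second call shows why query words missing from the FAQ vocabulary contribute nothing to the score: they never receive an id, so they never enter the bag-of-words vector.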


In [11]:
# Clean a query before inference (same as clean_data, but keeps stop words)
def clean_data_inference(sentence):
    # convert to lowercase, ignore all special characters - 
    # keep only alpha-numericals and spaces (not removing full-stop here)
    sentence = re.sub(r'[^A-Za-z0-9\s.]', r'', str(sentence).lower())
    sentence = re.sub(r'\n', r' ', sentence)
    
    return sentence.split()

In [18]:
# Make a query for inferencing
query = 'retire early'
query = clean_data(query)
#query = clean_data_inference(query)
#print(query)

# Calculate similarity of query to each doc from bow_corpus
scs_sims = scs_sim_index[dictionary.doc2bow(query)]

# Print the matched questions with their confidence scores
if scs_sims:
    for doc_index, score in scs_sims:
        print("confidence score --> {} for record --> {}".format(score, data.question[doc_index]))
else:
    print("search doesn't return anything")

confidence score --> 0.9999999403953552 for record --> What about If I retire early?
confidence score --> 0.36610791087150574 for record --> I want to plan for my retirement
confidence score --> 0.36610791087150574 for record --> Cost of Living for Your Location, considered for income after retirement
confidence score --> 0.36610791087150574 for record --> Is it possible to hold more than one account under retirement
confidence score --> 0.36610791087150574 for record --> Can I hold non-dollar currencies in my retirement fund?
confidence score --> 0.36610791087150574 for record --> How often should I rebalance my retirement account?
confidence score --> 0.36610791087150574 for record --> How can I plan for retirement when I don't know anything about retirement
confidence score --> 0.36610791087150574 for record --> Would it be possible to transfer other retirement plans under Digital adviser
confidence score --> 0.36610791087150574 for record --> Time Horizon for Retirement planning
co

  Y = np.multiply(Y, 1 / np.sqrt(Y_norm))
  Y = np.multiply(Y, 1 / np.sqrt(Y_norm))
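
The results above contain one near-exact match followed by a run of tied, much lower scores. If this were wired into a chatbot, one option is to answer only when the top score clears a threshold and fall back otherwise. The sketch below assumes the `(document index, score)` pair format printed above; the helper name `pick_answer`, the 0.5 threshold and the toy data are all hypothetical and would need tuning on real queries.

```python
def pick_answer(scored_matches, answers, min_score=0.5):
    """Return the best-scoring answer, or None if nothing clears min_score.

    scored_matches: list of (doc_index, score) pairs.
    answers: sequence of answers indexed by doc_index.
    """
    if not scored_matches:
        return None
    best_index, best_score = max(scored_matches, key=lambda pair: pair[1])
    if best_score < min_score:
        return None  # caller can fall back to "sorry, I did not understand"
    return answers[best_index]

# Toy data (made up): two answers and the scores of two queries
toy_answers = ["Use the retirement planning simulator.",
               "Please provide your income and age."]
confident = [(0, 0.99), (1, 0.37)]
uncertain = [(1, 0.37), (0, 0.36)]

print(pick_answer(confident, toy_answers))  # top score clears the threshold
print(pick_answer(uncertain, toy_answers))  # None: no match is confident enough
```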


In [19]:
# Print the answers for the matched question(s) above
if scs_sims:
    for doc_index, _ in scs_sims:
        print("answer --> {}".format(data.answer[doc_index]))
else:
    print("search doesn't return anything")

answer --> If you plan on retiring early, you can enter the details in retirement planning simulator and assess the amount you would be left with by the time of retirement.
answer --> Please provide your income and age, to plan for retirement.
answer --> While it's common to retire where you lived during your working years, your retirement location may be different. By default, we assume you'll continue to live where you do now.
answer --> Yes. Each investment account considered as a separate goal
answer --> Yes. Our system is capable of handling non-dollar currencies as well in retirement fund.
answer --> The decision to rebalance your retirement account should be made based on your desired asset allocation. This depends on your age and your desired risk level. 
answer --> The answer depends on your retirement goals and when you plan to retire. The earlier you want to retire, the more you will need to save.
answer --> Right now it is not possible. This feature is in the roadmap for de