
This notebook demonstrates two basic retrieval techniques for a Question-Answering (QA) chatbot: cosine similarity search over bag-of-words vectors, and Word2Vec phrase embeddings.

1. Cosine Similarity search: This technique involves using a vector-based representation of the documents and computing the cosine similarity between the query and each document. It enables the chatbot to search for the most similar document or context to the given question and retrieve relevant answers based on the similarity scores.

2. Word2Vec: Word2Vec is a widely used word embedding technique that represents words in a high-dimensional vector space. The notebook utilizes Word2Vec to capture semantic relationships between words and enhance the chatbot's understanding of the query and context. It allows the chatbot to find similar words or phrases and provide more accurate and contextually relevant answers.

The notebook walks through each technique with explanations and code. It shows how cosine similarity search and Word2Vec can be used in a QA chatbot that answers a user's question with the most similar sentence from the source text.

## Data Collection

In [1]:
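from pathlib import Path
from gensim.parsing.preprocessing import remove_stopwords

# Load the pre-cleaned corpus: one sentence per line.
text = Path('/content/drive/MyDrive/cleaned_sentences.txt').read_text()
cleaned_sentences_with_stopwords = text.split('\n')
# The whole document as a single string, with newlines replaced by spaces.
pdf_text = text.replace('\n', ' ')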

In [2]:
user_questions = ['When did the GARDASIL 9 recommendations change?',
'What were the past 3 recommendation changes for GARDASIL 9?',
'Is GARDASIL 9 recommended for Adults?',
'Does the ACIP recommend one dose GARDASIL 9?']

In [3]:
import re

def clean_sentence(sentence, stopwords=False):
  """Lowercase, drop punctuation, and remove stop words when stopwords=True."""
  sentence = sentence.lower().strip()
  sentence = re.sub(r'[^a-z0-9\s]', '', sentence)
  if stopwords:
    sentence = remove_stopwords(sentence)
  return sentence

# Cosine similarity

Cosine similarity search is a technique used in a QA chatbot to find the most relevant answers based on the similarity between a query and a set of documents or contexts. It operates by representing the text data and queries as high-dimensional vectors and then calculating the cosine similarity between them.

Cosine similarity is the cosine of the angle between the query vector and a document vector: the smaller the angle, the more similar the two texts. In general it ranges from -1 to 1, where 1 means the vectors point in the same direction, 0 means they are orthogonal, and -1 means they point in opposite directions. Bag-of-words count vectors have no negative components, so for them the score lies between 0 and 1.
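As a small worked illustration of the definition, the sketch below computes the cosine of the angle between two vectors in plain Python. The helper name `cosine` and the toy vectors are assumptions made for this example; the notebook itself relies on scikit-learn's `cosine_similarity`.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen for this sketch: zero vectors score 0
    return dot / (norm_u * norm_v)

same_direction = cosine([1, 2, 0], [2, 4, 0])  # parallel vectors, close to 1
orthogonal = cosine([1, 0, 0], [0, 1, 0])      # no shared component, 0
opposite = cosine([1, 2], [-1, -2])            # reversed direction, close to -1
```

Only the direction of the vectors matters: doubling every component of one vector leaves the score unchanged.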

The cosine similarity scores are used to rank the documents or contexts in descending order. The document or context with the highest similarity score is considered the most relevant and is returned as the answer by the QA chatbot.
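The ranking step can be sketched as follows, assuming the documents have already been turned into dense vectors of equal length. `rank_by_cosine` and the toy vectors are hypothetical names and data used only for illustration.

```python
import numpy as np

def rank_by_cosine(query, docs):
    """Return document indices sorted by descending cosine similarity, plus the scores."""
    docs = np.asarray(docs, dtype=float)
    query = np.asarray(query, dtype=float)
    denom = np.linalg.norm(docs, axis=1) * np.linalg.norm(query)
    # Documents with a zero norm keep a score of 0 instead of dividing by zero.
    scores = np.divide(docs @ query, denom, out=np.zeros(len(docs)), where=denom > 0)
    order = np.argsort(-scores)
    return order, scores

toy_docs = [[1, 0, 1], [0, 2, 0], [1, 1, 1]]
order, scores = rank_by_cosine([1, 0, 1], toy_docs)
best_index = int(order[0])  # index of the most similar document
```

The first entry of `order` plays the role of the answer returned by the chatbot; the remaining entries could be used to show runner-up candidates.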

In [4]:
import numpy
from gensim import corpora

sentences = cleaned_sentences_with_stopwords
# Tokenize each sentence on whitespace.
sentence_words = [document.split() for document in sentences]

# Map every distinct token to an integer id.
dictionary = corpora.Dictionary(sentence_words)

In [5]:
# Convert each tokenized sentence to a sparse bag-of-words vector of (token_id, count) pairs.
bow_corpus = [dictionary.doc2bow(text) for text in sentence_words]
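To make the shape of a bag-of-words vector concrete, here is a plain-Python sketch that mimics what `doc2bow` returns: a sorted list of `(token_id, count)` pairs, with words missing from the dictionary ignored. The names `token2id` and `to_bow` and the toy documents are assumptions for this example, and the ids gensim assigns may differ from the ones produced here.

```python
from collections import Counter

toy_docs = [["hpv", "vaccine", "is", "recommended"],
            ["vaccine", "dose", "dose"]]

# Assign an integer id to each distinct token, in order of first appearance.
token2id = {}
for doc in toy_docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(tokens):
    """Sparse bag-of-words: sorted (token_id, count) pairs for known tokens only."""
    counts = Counter(t for t in tokens if t in token2id)
    return sorted((token2id[t], c) for t, c in counts.items())

bow = to_bow(["dose", "dose", "vaccine", "unknown"])
```

Here `bow` holds one pair for "vaccine" (count 1) and one for "dose" (count 2); "unknown" is dropped because it never appeared in the toy corpus.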

The following cell defines `retrieveAndPrintFAQAnswer`, which finds the most relevant sentence for a question by comparing the question's embedding with each sentence embedding. It iterates over the sentence embeddings, computes the cosine similarity between each one and the question embedding, and keeps track of the maximum score. Despite its name, the function prints nothing: it returns the index of the best-scoring sentence, and the calling loop prints the question and the answer.

## Final Answer

In [6]:
from sklearn.metrics.pairwise import cosine_similarity
import re

def retrieveAndPrintFAQAnswer(question_embedding, sentence_embeddings, sentences):
  """Return the index of the sentence embedding most similar to the question."""
  max_sim = -1
  index_sim = -1
  # Note: the slice leaves out the final entry of the corpus.
  for index, embedding in enumerate(sentence_embeddings[:-1]):
    # cosine_similarity returns a matrix; [0][0] picks its first entry.
    sim = cosine_similarity(embedding, question_embedding)[0][0]
    if sim > max_sim:
      max_sim = sim
      index_sim = index

  return index_sim

for user_question in user_questions:
  question = clean_sentence(user_question, stopwords=False)
  # doc2bow yields sparse (token_id, count) pairs for the words found in the dictionary.
  question_embedding = dictionary.doc2bow(question.split())
  index = retrieveAndPrintFAQAnswer(question_embedding, bow_corpus, sentences)
  print("Question: ", question)
  print("Answer: ", sentences[index])

Question:  when did the gardasil 9 recommendations change
Answer:  however specimen collection was delayed 2 days until trained ecdhd staff members visited the facility
Question:  what were the past 3 recommendation changes for gardasil 9
Answer:  however specimen collection was delayed 2 days until trained ecdhd staff members visited the facility
Question:  is gardasil 9 recommended for adults
Answer:  hpv vaccine is recommended for routine vaccination at age 11 or 12 years
Question:  does the acip recommend one dose gardasil 9
Answer:  however specimen collection was delayed 2 days until trained ecdhd staff members visited the facility
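
The same unrelated sentence is returned for three of the four questions. This appears consistent with how the sparse vectors are compared: `doc2bow` yields `(token_id, count)` pairs, and `cosine_similarity` reads a list of such pairs as a matrix with one two-column row per token, so `[0][0]` compares only the first pair of each list. One possible remedy, sketched here on toy data, is to expand each sparse vector into a dense row of vocabulary length before comparing. `to_dense` and `cosine` are hypothetical helpers written for this sketch; gensim provides `gensim.matutils.sparse2full` for the densifying step.

```python
import numpy as np

def to_dense(bow, vocab_size):
    """Expand sparse (token_id, count) pairs into a dense row of shape (1, vocab_size)."""
    vec = np.zeros(vocab_size)
    for token_id, count in bow:
        vec[token_id] = count
    return vec.reshape(1, -1)

def cosine(a, b):
    """Cosine similarity of two (1, n) rows; 0.0 if either row is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a.ravel(), b.ravel()) / denom) if denom else 0.0

vocab_size = 5                                     # toy vocabulary size (assumption)
question_bow = [(1, 1), (4, 1)]                    # toy sparse question vector
sentence_bows = [[(0, 1), (2, 1), (3, 1)],         # shares no token with the question
                 [(1, 1), (4, 2)]]                 # shares both tokens

question_vec = to_dense(question_bow, vocab_size)
scores = [cosine(to_dense(b, vocab_size), question_vec) for b in sentence_bows]
best = int(np.argmax(scores))                      # index of the best-matching sentence
```

With dense rows, every token in the question and the sentence contributes to the score, rather than only the first pair.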


# Word2Vec


Word2Vec is a popular word embedding model that represents words as dense vectors in a high-dimensional space, capturing semantic relationships. It learns these vector representations by training a neural network on large text corpora, enabling efficient computation of word similarities and semantic analogies.

In [7]:
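from gensim.models import KeyedVectors
import gensim.downloader as api

v2w_model = None
try:
  # Reuse a locally cached copy of the vectors if one exists.
  v2w_model = KeyedVectors.load('./w2vecmodel.mod')
  print("w2v Model Successfully loaded")
except OSError:
  # No cached copy yet: download the pretrained vectors and cache them.
  v2w_model = api.load('word2vec-google-news-300')
  v2w_model.save("./w2vecmodel.mod")
  print("w2v Model Saved")

# Dimensionality of the word vectors (300 for this model).
w2vec_embedding_size = v2w_model.vector_size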

w2v Model Saved


The following cell defines two functions: `getWordVec` and `getPhraseEmbedding`.

The `getWordVec` function takes a word and a word embedding model as inputs. It tries to retrieve the vector representation of the word from the embedding model. If the word is present in the model, the corresponding vector is returned. Otherwise, it returns a zero vector of the same length as the sample vector from the model.

The `getPhraseEmbedding` function takes a phrase and an embedding model as inputs. It looks up the vector for the word "computer" only to learn the embedding size, then iterates over the words of the phrase, retrieves each word's vector with `getWordVec`, and accumulates the vectors. The sum is divided by the number of words in the phrase to obtain the average embedding. The function returns this average reshaped to a 2D array of shape `(1, n)`, the layout that `cosine_similarity` expects. Because cosine similarity ignores vector length, averaging and summing lead to the same ranking.

These functions enable obtaining word vectors and computing the average embedding for a given phrase using the specified word embedding model. These embeddings can be utilized for various natural language processing tasks such as similarity comparison, clustering, or input representation in machine learning models.

In [8]:
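def getWordVec(word, model):
  """Return the vector for `word`, or a zero vector if it is out of vocabulary."""
  samp = model['pc']
  try:
    vec = model[word]
  except KeyError:
    vec = [0]*len(samp)
  return vec


def getPhraseEmbedding(phrase, embeddingmodel):
  """Return the average word vector of `phrase` as an array of shape (1, n)."""
  samp = getWordVec('computer', embeddingmodel)
  vec = numpy.zeros(len(samp))
  den = 0
  for word in phrase.split():
    den = den+1
    vec = vec+numpy.array(getWordVec(word, embeddingmodel))
  if den > 0:
    vec = vec/den  # average; skipped for empty phrases to avoid dividing by zero
  return vec.reshape(1, -1)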
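
The averaging idea can be tried without downloading the full model. The sketch below swaps in a tiny hand-made lookup table for the pretrained vectors; `toy_vectors`, `word_vec`, and `phrase_embedding` are hypothetical names, and the three-dimensional vectors are made up for illustration.

```python
import numpy as np

# Stand-in for a pretrained embedding model (made-up values).
toy_vectors = {
    "hpv": np.array([1.0, 0.0, 2.0]),
    "vaccine": np.array([3.0, 2.0, 0.0]),
}
dim = 3

def word_vec(word):
    """Vector for a known word, or a zero vector for an out-of-vocabulary word."""
    return toy_vectors.get(word, np.zeros(dim))

def phrase_embedding(phrase):
    """Average of the word vectors, returned with shape (1, dim)."""
    words = phrase.split()
    if not words:
        return np.zeros((1, dim))
    total = sum((word_vec(w) for w in words), np.zeros(dim))
    return (total / len(words)).reshape(1, -1)

emb = phrase_embedding("hpv vaccine unknownword")
```

Out-of-vocabulary words still count toward the denominator here, so a phrase with many unknown words is pulled toward the zero vector in length but keeps the direction set by its known words.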

## Final Answer

In [9]:
sent_embeddings = []
for sent in sentences:
  sent_embeddings.append(getPhraseEmbedding(sent, v2w_model))

for user_question in user_questions:
  question = clean_sentence(user_question, stopwords=False)
  question_embedding = getPhraseEmbedding(question, v2w_model)
  index = retrieveAndPrintFAQAnswer(question_embedding, sent_embeddings, cleaned_sentences_with_stopwords)
  print("Question: ", question)
  print("Answer: ", sentences[index])

Question:  when did the gardasil 9 recommendations change
Answer:  two evidence to recommendations documents were developed   and presented along with proposed recommendations after a public comment period acip members voted unanimously to harmonize catchup vaccination recommendations across genders for all persons through age 26 years
Question:  what were the past 3 recommendation changes for gardasil 9
Answer:  all three vaccines have been approved for administration in a 3dose series at intervals of 0 1 or 2 and 6 months
Question:  is gardasil 9 recommended for adults
Answer:  catchup hpv vaccination is not recommended for all adults aged 26 years
Question:  does the acip recommend one dose gardasil 9
Answer:  the number of recommended doses is based on age at administration of the first dose
