<center><font size="6" color="blue">Text-based Question Answering</font></center><br>

## Objective:
To develop a closed-domain question answering system that either returns a direct answer from the context or matches the user's question on the fly against an FAQ dataset, when a similar question exists.

==> We need to return a span of text as the answer, not the entire paragraph.

In [None]:
!pip install spacy && python -m spacy download en

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/usr/local/lib/python3.6/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.6/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')


In [None]:
from pathlib import Path
import os, re, io
import pandas as pd
import numpy as np
import requests

## stopwords
from gensim.parsing.preprocessing import remove_stopwords
## lemma functionality provided by NLTK
from nltk.stem import WordNetLemmatizer
#nltk.download('wordnet')
from nltk import word_tokenize
#nltk.download('punkt')
import spacy
nlp = spacy.load('en')

from gensim.models import TfidfModel
from gensim.corpora import Dictionary
## cosine similarity
from sklearn.metrics.pairwise import cosine_similarity

## <font color="#007bff"><b>Data Loading</b></font><br><a id="2"></a>


The next task is to prepare the data in the format the problem needs. We have to extract every question and its answer from an unstructured document and store them in a structured file. The simplest way to extract this information from a text file is to parse it. Parsing retrieves the question-answer pairs under the following assumption.

#### Assumptions:

- The parsed data is stored in a tabular file (the loading cell reads a .csv) whose first two columns hold the "questions" and the "answers".


After loading, the data looks like this:

In [None]:
path = Path("/content/")
in_path = str(path / "BITSAT-FAQ.csv")
QA_df = pd.read_csv(os.path.join(in_path),header=None)
QA_df.columns = ['questions','answers']
QA_df.head()

Unnamed: 0,questions,answers
0,I am unable to access the Online application. ...,The application cannot be sent by email/post. ...
1,What is the eligibility for BITSAT?,"Candidates can write BITSAT 2020 with Physics,..."
2,I have completed the online application but I ...,You can go to the applying online page again a...
3,How will I get back my extra amount if I have ...,If you have made multiple payments towards BIT...
4,How can I edit/correct data in my application ...,The link for editing the application form will...


In [None]:
QA_df = QA_df[:15]
QA_df

Unnamed: 0,questions,answers
0,I am unable to access the Online application. ...,The application cannot be sent by email/post. ...
1,What is the eligibility for BITSAT?,"Candidates can write BITSAT 2020 with Physics,..."
2,I have completed the online application but I ...,You can go to the applying online page again a...
3,How will I get back my extra amount if I have ...,If you have made multiple payments towards BIT...
4,How can I edit/correct data in my application ...,The link for editing the application form will...
5,My 12th exam results are not expected before 1...,For admissions to I semester 2020-21 starting ...
6,I passed 12th in 2019. I didn't get 75% aggreg...,"If you are repeating 12th exam, you should do ..."
7,What are the tuition fees and other expenses?,The fee details for 2020-21 are yet to be fina...
8,I passed 12th exam in 2019. Am I eligible to a...,"As advertised, you are eligible to Apply for B..."
9,I had appeared in BITSAT-2019 but my marks wer...,"Yes, as advertised. Subject to eligibility con..."


## <font color="#007bff"><b>Preprocessing Techniques</b></font><br><a id="3"></a>

Next, we preprocess the data rather than using it as is. Preprocessing is an important step that cleans up the dataset before any representation is built from it.

1. Remove unwanted characters
2. Remove question numbers
3. Remove stopwords
4. Lemmatization, to reduce inflected word forms and minimize word ambiguity.
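
The first two steps can be sketched on a single string. This is a minimal illustration that reuses the two regular expressions from the `TextPreprocessor` class in this notebook; the `clean_question` helper and the numbered sample line are hypothetical and only serve the example.

```python
import re

def clean_question(text):
    """Lowercase, drop a leading question number, then replace symbols with spaces."""
    text = text.lower()
    # Same pattern as remove_question_no: leading digits followed by a dot
    text = re.sub(r'^\d+[.]', ' ', text)
    # Same pattern as remove_symbols: anything that is not a letter, digit, or whitespace
    text = re.sub(r'[^A-Za-z0-9\s]', ' ', text)
    return text

sample = "12. What is the eligibility for BITSAT?"  # hypothetical numbered FAQ line
print(clean_question(sample).split())
```

Symbols are replaced by spaces rather than deleted, so words separated only by punctuation (for example "email/post") do not get glued together.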

Why did I choose lemmatization over stemming? Lemmatization is the more powerful operation because it takes the morphological analysis of the word into account.

**Example:** bicycle and bicycles are both converted to bicycle. A stemming algorithm, by contrast, works by predefined rules that strip a prefix or suffix from the word.
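
The difference can be seen in a toy sketch. Everything here is hypothetical and deliberately tiny: the suffix list stands in for a real stemmer's rules and the lookup table stands in for a real lemmatizer's vocabulary (the notebook itself uses spaCy for lemmatization).

```python
# Toy illustration only: real projects would use a library such as NLTK or spaCy.
SUFFIXES = ("ies", "es", "s")   # hypothetical, deliberately tiny rule set

LEMMA_TABLE = {                 # hypothetical hand-made lemma dictionary
    "bicycles": "bicycle",
    "studies": "study",
    "was": "be",
}

def toy_stem(word):
    """Strip the first matching suffix, with no knowledge of the vocabulary."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in a table of known forms; fall back to the word itself."""
    return LEMMA_TABLE.get(word, word)

for word in ["bicycles", "studies", "was"]:
    print(word, "-> stem:", toy_stem(word), "| lemma:", toy_lemmatize(word))
```

The rule-based stemmer can produce fragments that are not words, while the lookup returns a valid base form, including for irregular forms that no suffix rule covers.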


In [None]:
## Data Preprocessing
class TextPreprocessor():
    def __init__(self, data_df, column_name=None):
        self.data_df = data_df  
        ## reject a missing column name or one that is not a string
        if not column_name or not isinstance(column_name, str):
            raise Exception("column name is mandatory. Make sure type is string format")
        self.column = column_name
        self.convert_lowercase()    
        self.applied_stopword = False
        self.processed_column_name = f"processed_{self.column}"
        
    def convert_lowercase(self):
        ## replace missing values with empty strings
        self.data_df.fillna('',inplace=True)
        ## reduce all the columns to lowercase
        self.data_df = self.data_df.apply(lambda column: column.astype(str).str.lower(), axis=0)    

    def remove_question_no(self):
        ## remove question no        
        self.data_df[self.column] = self.data_df[self.column].apply(lambda row: re.sub(r'^\d+[.]',' ', row))    
        
    def remove_symbols(self):
        ## remove unwanted character          
        self.data_df[self.column] = self.data_df[self.column].apply(lambda row: re.sub(r'[^A-Za-z0-9\s]', ' ', row))    

    def remove_stopwords(self):
        ## remove stopwords and create a new column 
        for idx, question in enumerate(self.data_df[self.column]):      
            self.data_df.loc[idx, self.processed_column_name] = remove_stopwords(question)        

    def apply_lemmatization(self, perform_stopword):
        ## get the root words to reduce inflection of words 
        lemmatizer = WordNetLemmatizer()    
        ## get the column name to perform lemma operation whether stopwords removed text or not
        if perform_stopword:
            column_name = self.processed_column_name
        else:
            column_name = self.column
        ## iterate every question, perform tokenize and lemma
        for idx, question in enumerate(self.data_df[column_name]):

            lemmatized_sentence = []
            ## use spacy for lemmatization
            doc = nlp(question.strip())
            for word in doc:
                lemmatized_sentence.append(word.lemma_)
            ## write the lemmatized sentence once, after all tokens are processed
            self.data_df.loc[idx, self.processed_column_name] = " ".join(lemmatized_sentence)

    def process(self, perform_stopword = True):
        self.remove_question_no()
        self.remove_symbols()
        if perform_stopword:
            self.remove_stopwords()
        self.apply_lemmatization(perform_stopword)    
        return self.data_df

In [None]:
## pre-process training question data
text_preprocessor = TextPreprocessor(QA_df.copy(), column_name="questions")
processed_QA_df = text_preprocessor.process(perform_stopword=True)
processed_QA_df.head(10)

Unnamed: 0,questions,answers,processed_questions
0,i am unable to access the online application ...,the application cannot be sent by email/post. ...,unable access online application send applicat...
1,what is the eligibility for bitsat,"candidates can write bitsat 2020 with physics,...",eligibility bitsat
2,i have completed the online application but i ...,you can go to the applying online page again a...,complete online application take printout prin...
3,how will i get back my extra amount if i have ...,if you have made multiple payments towards bit...,extra multiple payment single application conn...
4,how can i edit correct data in my application ...,the link for editing the application form will...,edit correct datum application form
5,my 12th exam results are not expected before 1...,for admissions to i semester 2020-21 starting ...,12th exam result expect 18th june 2020 apply b...
6,i passed 12th in 2019 i didn t get 75 aggreg...,"if you are repeating 12th exam, you should do ...",pass 12th 2019 t 75 aggregate pcm eligible rep...
7,what are the tuition fees and other expenses,the fee details for 2020-21 are yet to be fina...,tuition fee expense
8,i passed 12th exam in 2019 am i eligible to a...,"as advertised, you are eligible to apply for b...",pass 12th exam 2019 eligible appear bitsat 2020
9,i had appeared in bitsat 2019 but my marks wer...,"yes, as advertised. subject to eligibility con...",appear bitsat 2019 mark cut mark appear bitsat...


In [None]:
def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

## <font color="#007bff"><b>Techniques for Question representations</b></font><br><a id="4"></a>

This section discusses several ways of representing the FAQ questions.

1. TF-IDF
2. Word Embedding
3. BERT Embedding
4. Sentence BERT (SBERT) Embedding 

### <font color="#007bff"><b>1. TF-IDF Representation</b></font><br><a id="4.1"></a>

The first approach to question similarity builds on the Bag of Words (BOW) model. TF-IDF turns text into numeric weights and is a widely used feature extraction technique in NLP applications. TF (Term Frequency) counts the number of times a word appears in a document. IDF (Inverse Document Frequency) assigns a low weight to words that occur in many of the documents, so common words contribute less than rare ones.
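
To make the weighting concrete, here is a small hand-rolled sketch that uses only the standard library. It assumes one common variant (raw term count times log2(N / df), followed by L2 normalisation); the `tf_idf` function and the three toy documents are hypothetical, and the notebook itself relies on gensim's `TfidfModel` rather than on this code.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document (tf * log2(N / df), L2-normalised)."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    weighted = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: c * math.log2(n_docs / doc_freq[t]) for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {t: w / norm for t, w in weights.items()}
        weighted.append(weights)
    return weighted

# Three toy documents (hypothetical, in the style of the processed questions)
docs = ["eligibility bitsat", "tuition fee expense", "appear bitsat 2020 bitsat"]
for doc, weights in zip(docs, tf_idf(docs)):
    print(doc, {t: round(w, 3) for t, w in weights.items()})
```

In the first document, "bitsat" gets a lower weight than "eligibility" because it also occurs in another document. A term that occurs in every document gets weight 0 under this variant.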

In [None]:
class TF_IDF():
    def __init__(self):
        self.dictionary = None    
        self.model = None
        self.bow_corpus = None

    def create_tf_idf_model(self, data_df, column_name):
        ## create sentence token list
        sentence_token_list = [sentence.split(" ") for sentence in data_df[column_name]]

        ## dataset vocabulary
        self.dictionary = Dictionary(sentence_token_list) 

        ## bow representation of dataset
        self.bow_corpus = [self.dictionary.doc2bow(sentence_tokens) for sentence_tokens in sentence_token_list]

        ## compute TF-IDF score for corpus
        self.model = TfidfModel(self.bow_corpus)

        ## representation of question and respective TF-IDF value
        print("First 10 question representation of TF-IDF vector")
        for index, sentence in enumerate(data_df[column_name]):
            if index < 10:
                print(f"{sentence} {self.model[self.bow_corpus[index]]}")
            else:
                break

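    def to_dense(self, sparse_vector):
        ## gensim returns sparse (term_id, weight) pairs; sklearn's cosine_similarity
        ## needs 2D arrays of equal width, so expand to shape (1, vocabulary size)
        dense_vector = np.zeros((1, len(self.dictionary)))
        for term_id, weight in sparse_vector:
            dense_vector[0, term_id] = weight
        return dense_vector

    def get_vector_for_test_set(self, test_df, column_name):
        ## store tf-idf vector
        testset_tf_idf_vector = []
        sentence_token_list = [sentence.split(" ") for sentence in test_df[column_name]]
        test_bow_corpus = [self.dictionary.doc2bow(sentence_tokens) for sentence_tokens in sentence_token_list]
        for test_sentence in test_bow_corpus:
            testset_tf_idf_vector.append(self.to_dense(self.model[test_sentence]))

        return testset_tf_idf_vector

    def get_training_QA_vectors(self):
        QA_vectors = []
        for sentence_vector in self.bow_corpus:
            QA_vectors.append(self.to_dense(self.model[sentence_vector]))
        return QA_vectors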

    def get_train_vocabulary(self):
        vocab = []
        for index in self.dictionary:
            vocab.append(self.dictionary[index])
        return vocab

### <font color="#007bff"><b>2. Word Embedding</b></font><br><a id="4.2"></a>

*GloVe* is an unsupervised learning algorithm for obtaining vector representations of words. It is trained on a global word-word co-occurrence matrix. I use pre-trained GloVe vectors (the `glove-twitter-25` model loaded in the class) for this analysis. The `Embeddings` class builds a sentence representation for each question by summing the vectors of its words.
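
The idea of building a sentence vector from word vectors can be sketched without downloading a model. The three-dimensional vectors and the `sentence_embedding` helper are hypothetical, and this sketch averages the known words, whereas the notebook's class sums them; for cosine similarity the two differ only by a scale factor. It uses plain lists to stay dependency-free, although NumPy arrays would be the natural choice in practice.

```python
# Hypothetical 3-dimensional vectors; real GloVe vectors have 25 or more dimensions.
toy_vectors = {
    "tuition": [0.9, 0.1, 0.0],
    "fee":     [0.8, 0.2, 0.1],
    "exam":    [0.1, 0.9, 0.3],
}

def sentence_embedding(sentence, vectors, size=3):
    """Average the vectors of known words; out-of-vocabulary words are skipped."""
    total = [0.0] * size
    known = 0
    for word in sentence.split():
        if word in vectors:
            total = [t + v for t, v in zip(total, vectors[word])]
            known += 1
    if known == 0:
        return total  # all-zero vector when no word is known
    return [t / known for t in total]

print(sentence_embedding("tuition fee unknownword", toy_vectors))
```

Word order is lost in this representation, which is one reason the contextual models in the following sections can do better on longer questions.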

In [None]:
class Embeddings():
    def __init__(self, model_path):
        self.model_path = model_path
        self.model = None
        self.__load_model__()
        
    def __load_model__(self):
        #word_vectors = api.load("glove-wiki-gigaword-100")  
        model_name = 'glove-twitter-25' #'word2vec-google-news-50' #'glove-twitter-25'  
        if not os.path.exists(self.model_path+ model_name):
            print("Downloading model")
            self.model = api.load(model_name)
            self.model.save(self.model_path+ model_name)
        else:
            print("Loading model from Drive")
            self.model = KeyedVectors.load(self.model_path+ model_name)
        
    def get_oov_from_model(self, document_vocabulary):
        ## the below words are not available in our pre-trained model model_name
        print("The below words are not found in our pre-trained model")
        words = []
        for word in set(document_vocabulary):  
            if word not in self.model:
                words.append(word)
        print(words)  

    def get_sentence_embeddings(self, data_df, column_name):
        sentence_embeddings_list = []
        for sentence in data_df[column_name]:      
            sentence_embeddings = np.repeat(0, self.model.vector_size)
            try:
                tokens = sentence.split(" ")
                ## get the word embedding
                for word in tokens:
                    if word in self.model:
                        word_embedding = self.model[word]
                    else:
                        word_embedding = np.repeat(0, self.model.vector_size)          
                    sentence_embeddings = sentence_embeddings + word_embedding
                ## take the average for sentence embeddings
                #sentence_embeddings = sentence_embeddings / len(tokens)
                sentence_embeddings_list.append(sentence_embeddings.reshape(1, -1))
            except Exception as e:
                print(e)
            
        return sentence_embeddings_list

### <font color="#007bff"><b>3. BERT Embedding</b></font><br><a id="4.3"></a>

*BERT* is a transformer-based model that uses the context of each word to produce its embedding. At its release, BERT set new state-of-the-art results on several NLP benchmarks.

The following search query is a good way to understand what BERT adds.
> “2019 Brazil traveler to the USA need a visa”. 

The relationship of the word “to” to the other words in the sentence is important for decoding the meaning. Results about USA citizens traveling to Brazil are not relevant, because the query is about Brazilian citizens traveling to the USA. BERT can handle this well.

In [None]:
!pip install bert-embedding


In [None]:
from bert_embedding import BertEmbedding

In [None]:
## get bert embeddings
def get_bert_embeddings(sentences):
    bert_embedding = BertEmbedding()
    return bert_embedding(sentences)

### <font color="#007bff"><b>4. Sentence BERT (SBERT) Embedding</b></font><br><a id="4.4"></a>


In [None]:
!pip install sentence-transformers


In [None]:
from sentence_transformers import SentenceTransformer
sbert_model = SentenceTransformer('bert-base-nli-mean-tokens')



In [None]:
## get sbert embeddings
def get_sbert_embeddings(sentences):
    return sbert_model.encode(sentences)

## <font color="#007bff"><b>Encoding and Analysis using above techniques</b></font><br><a id="4"></a>

#### **1. TF-IDF Computation**

In [None]:
tf_idf = TF_IDF()
tf_idf.create_tf_idf_model(processed_QA_df, "processed_questions")
## get the tf-idf representation
question_QA_vectors = tf_idf.get_training_QA_vectors()

First 10 question representation of TF-IDF vector
unable access online application send application form email post [(0, 0.39028148276826063), (1, 0.2641098051558333), (2, 0.39028148276826063), (3, 0.2903858053190506), (4, 0.2903858053190506), (5, 0.39028148276826063), (6, 0.39028148276826063), (7, 0.39028148276826063)]
eligibility bitsat [(8, 0.32050826552749295), (9, 0.9472457187702449)]
complete online application take printout printout [(1, 0.131031595749367), (4, 0.28813557627267694), (10, 0.38725715198934396), (11, 0.7745143039786879), (12, 0.38725715198934396)]
extra multiple payment single application connection error [(1, 0.1368348114310468), (13, 0.4044082579070788), (14, 0.4044082579070788), (15, 0.4044082579070788), (16, 0.4044082579070788), (17, 0.4044082579070788), (18, 0.4044082579070788)]
edit correct datum application form [(1, 0.176667599103797), (3, 0.388488136661716), (19, 0.5221320162245074), (20, 0.5221320162245074), (21, 0.5221320162245074)]
12th exam result expe

In [None]:
## Get the document vocabulary list from TF-IDF
document_vocabulary = tf_idf.get_train_vocabulary()

#### **2. Embeddings (Glove)**

In [None]:
## Now, let's try building the embedding-based representation
import gensim.downloader as api
from gensim.models import KeyedVectors

In [None]:
## create Embedding object
embedding = Embeddings("")
## look for FAQ dataset words that are out of vocabulary for the pretrained model
embedding.get_oov_from_model(document_vocabulary)
## get the sentence embedding for the FAQ dataset
question_QA_embeddings = embedding.get_sentence_embeddings(processed_QA_df, "processed_questions")

Downloading model

KeyboardInterrupt: ignored

#### **3. BERT Embeddings**

In [None]:
question_QA_bert_embeddings_list = get_bert_embeddings(processed_QA_df["questions"].to_list())

Vocab file is not found. Downloading.
Downloading /root/.mxnet/models/book_corpus_wiki_en_uncased-a6607397.zip from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/vocab/book_corpus_wiki_en_uncased-a6607397.zip...
Downloading /root/.mxnet/models/bert_12_768_12_book_corpus_wiki_en_uncased-75cc780f.zip from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/bert_12_768_12_book_corpus_wiki_en_uncased-75cc780f.zip...


#### **4. Sentence BERT (SBERT) Embedding**


In [None]:
sentences = processed_QA_df["questions"].to_list()
question_QA_sbert_embeddings = get_sbert_embeddings(sentences)

# print('Sample SBERT embedding vector - length', len(question_QA_sbert_embeddings[1]))
# print('Sample SBERT embedding vector - note includes negative values', question_QA_sbert_embeddings[0])

## <font color="#007bff"><b>Evaluating with test queries</b></font><br><a id="5"></a>

Utility for evaluating user test queries.

One of the most widely used measures of similarity between two vectors is **cosine similarity**, and we use it to compare queries against FAQ questions under each representation. The `retrieveSimilarFAQ` helper scores every test query against every FAQ question and prints the closest match.
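
Cosine similarity is the dot product of two vectors divided by the product of their lengths, $\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$, so it depends only on direction and not on magnitude. A minimal standard-library sketch follows; the name `cosine_plain` is hypothetical, and returning 0.0 for an all-zero vector is an assumption made here to avoid a division by zero.

```python
import math

def cosine_plain(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

print(cosine_plain([1, 0], [2, 0]))    # same direction, different length
print(cosine_plain([1, 0], [0, 3]))    # orthogonal vectors
print(cosine_plain([1, 2], [-1, -2]))  # opposite direction
```

The three calls cover the extremes: close to 1 for the same direction, 0 for orthogonal vectors, and close to -1 for opposite directions.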

In [None]:
## retrieves the most similar FAQ question for each test query, based on the input vectors/embeddings
def retrieveSimilarFAQ(train_question_vectors, test_question_vectors, train_QA_df, train_column_name, test_QA_df, test_column_name):
    similar_question_index = []
    for test_index, test_vector in enumerate(test_question_vectors):
        sim, sim_Q_index = -1, -1
        for train_index, train_vector in enumerate(train_question_vectors):
            sim_score = cosine_similarity(train_vector, test_vector)[0][0]
            
            if sim < sim_score:
                sim = sim_score
                sim_Q_index = train_index

        print("######")
        print(f"Query Question: {test_QA_df[test_column_name].iloc[test_index]}")    
        print(f"Retrieved Question: {train_QA_df[train_column_name].iloc[sim_Q_index]}")
        print("######")
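
`retrieveSimilarFAQ` always reports the closest FAQ question, even when nothing in the dataset is really similar. Since the objective is to match a question only if a similar one exists, one possible extension is a minimum-similarity cutoff. This is a sketch of that idea and not part of the notebook: the `pick_best` helper, the example scores, and the 0.5 threshold are all hypothetical, and a real threshold would have to be tuned per representation.

```python
def pick_best(scores, min_score=0.5):
    """Return the index of the highest score, or None if even the best is below min_score."""
    if not scores:
        return None
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return best_index if scores[best_index] >= min_score else None

faq_questions = ["what is the eligibility for bitsat",
                 "what are the tuition fees and other expenses"]

# Hypothetical similarity scores of one query against the two FAQ questions
for scores in ([0.82, 0.31], [0.12, 0.20]):
    index = pick_best(scores)
    print(faq_questions[index] if index is not None else "no similar FAQ question found")
```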

Let's create a few sample questions for testing.

In [None]:
test_query_string = ["How to know if I am eligible to apply for BITSAT?",
                    "Can I get the application form via mail?",
                    "What are the fees?",
                    "Can I appear for BITSAT 2020 if I also took BITSAT 2019?",
                    "Can I appear if I failed 12th"]

# test_query_string = ["how does covid-19 spread?", 
#                      "What are the symptoms of COVID-19?",
#                     "Should I wear a mask to protect myself from covid-19",              
#                     "Is there a vaccine for COVID-19",
#                     "can the virus transmit through air?",
#                     "can the virus spread through air?"]

#test_query_string = ["Is it required to have background in  algorithms and complexity for data scientist roles"]

test_QA_df = pd.DataFrame(test_query_string, columns=["test_questions"])              
## pre-process testing QA data
text_preprocessor = TextPreprocessor(test_QA_df, column_name="test_questions")
query_QA_df = text_preprocessor.process(perform_stopword=True)

In [None]:
## TF-IDF vector representation
query_QA_vectors = tf_idf.get_vector_for_test_set(query_QA_df, "processed_test_questions")
query_QA_df.head()
      

NameError: ignored

### **Test with TF-IDF computation**

In [None]:
retrieveSimilarFAQ(question_QA_vectors, query_QA_vectors, processed_QA_df, "questions", query_QA_df, "test_questions")

######
Query Question: how does covid 19 spread 
Retrieved Question:   how does covid 19 spread 
######
######
Query Question: what are the symptoms of covid 19 
Retrieved Question:   what are the symptoms of covid 19 
######
######
Query Question: should i wear a mask to protect myself from covid 19
Retrieved Question:   can i catch covid 19 from my pet
######
######
Query Question: is there a vaccine for covid 19
Retrieved Question:   should i worry about covid 19 
######
######
Query Question: can the virus transmit through air 
Retrieved Question:   can i catch covid 19 from my pet
######
######
Query Question: can the virus spread through air 
Retrieved Question:   what is community spread 
######


### **Test with Embeddings (glove-twitter-25)**

In [None]:
## get the sentence embeddings for the test queries
query_QA_embeddings = embedding.get_sentence_embeddings(query_QA_df, "processed_test_questions")

retrieveSimilarFAQ(question_QA_embeddings, query_QA_embeddings, processed_QA_df, "questions", query_QA_df, "test_questions")

######
Query Question: how does covid 19 spread 
Retrieved Question:   how does covid 19 spread 
######
######
Query Question: what are the symptoms of covid 19 
Retrieved Question:   what are the symptoms of covid 19 
######
######
Query Question: should i wear a mask to protect myself from covid 19
Retrieved Question:   should i wear a mask to protect myself from catching the covid 19 virus 
######
######
Query Question: is there a vaccine for covid 19
Retrieved Question:   is there a vaccine  drug or treatment for covid 19 
######
######
Query Question: can the virus transmit through air 
Retrieved Question:   can the virus that causes covid 19 be transmitted through the air 
######
######
Query Question: can the virus spread through air 
Retrieved Question:   can the virus that causes covid 19 be transmitted through the air 
######


### **Test with BERT Embeddings**

In [None]:
query_QA_bert_embeddings_list = get_bert_embeddings(test_QA_df["test_questions"].to_list())

In [None]:
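## store QA bert embeddings in list
## bert_embedding returns (tokens, per-token vectors) for each sentence; mean-pool the
## token vectors (one common choice, an assumption here) into a (1, hidden_size) vector
## so that cosine_similarity compares whole sentences instead of only their first tokens
question_QA_bert_embeddings = []
for embeddings in question_QA_bert_embeddings_list:
    question_QA_bert_embeddings.append(np.mean(embeddings[1], axis=0).reshape(1, -1))

## store query string bert embeddings in list
query_QA_bert_embeddings = []
for embeddings in query_QA_bert_embeddings_list:
    query_QA_bert_embeddings.append(np.mean(embeddings[1], axis=0).reshape(1, -1))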

In [None]:
retrieveSimilarFAQ(question_QA_bert_embeddings, query_QA_bert_embeddings, processed_QA_df, "questions", query_QA_df, "test_questions")

######
Query Question: how does covid 19 spread 
Retrieved Question:   how does covid 19 spread 
######
######
Query Question: what are the symptoms of covid 19 
Retrieved Question:   what are the symptoms of covid 19 
######
######
Query Question: should i wear a mask to protect myself from covid 19
Retrieved Question:   should i wear a mask to protect myself from catching the covid 19 virus 
######
######
Query Question: is there a vaccine for covid 19
Retrieved Question:   is there a vaccine  drug or treatment for covid 19 
######
######
Query Question: can the virus transmit through air 
Retrieved Question:   can the virus that causes covid 19 be transmitted through the air 
######
######
Query Question: can the virus spread through air 
Retrieved Question:   can the virus that causes covid 19 be transmitted through the air 
######


### **Test with Sentence BERT Embeddings (SBERT)**

In [None]:
query_QA_sbert_embeddings_list = sbert_model.encode(test_query_string)

In [None]:
def evaluate_sbert(question_sbert_embeddings, query_sbert_embeddings, train_QA_df, train_column_name, test_QA_df, test_column_name):

    for test_index, test_vector in enumerate(query_sbert_embeddings):
        sim, sim_Q_index, sim2, sim_Q_index2 = -1, -1, -1, -1
        for train_index, train_vector in enumerate(question_sbert_embeddings):
            sim_score = cosine(train_vector, test_vector)

            if sim < sim_score:
                sim2 = sim
                sim_Q_index2 = sim_Q_index
                sim = sim_score
                sim_Q_index = train_index

            elif sim2 < sim_score:
                sim2 = sim_score
                sim_Q_index2 = train_index

        query = test_QA_df[test_column_name].iloc[test_index]
        retrieved_ques = train_QA_df[train_column_name].iloc[sim_Q_index]

        # to print query and corresponding retrieved question
        print("######")
        print(f"Query Question: {query}")
        print(f"Retrieved Question 1: {retrieved_ques}")
        # print(f"Retrieved Question 2: {train_QA_df[train_column_name].iloc[sim_Q_index2]}")
        print("######")

In [None]:
evaluate_sbert(question_QA_sbert_embeddings, query_QA_sbert_embeddings_list, processed_QA_df, "questions", query_QA_df, "test_questions")

######
Query Question: how to know if i am eligible to apply for bitsat 
Retrieved Question 1: what is the eligibility for bitsat 
######
######
Query Question: can i get the application form via mail 
Retrieved Question 1: i am unable to access the online application  can you send the application form by email or post 
######
######
Query Question: what are the fees 
Retrieved Question 1: what are the tuition fees and other expenses 
######
######
Query Question: can i appear for bitsat 2020 if i also took bitsat 2019 
Retrieved Question 1: i had appeared in bitsat 2019 but my marks were below the cut off marks  can i appear in bitsat 2020 
######
######
Query Question: can i appear if i failed 12th
Retrieved Question 1: how am i eligible for admission to m sc  programmes after 12th 
######


## <font color="#007bff"><b>Observation</b></font><br><a id="5"></a>

We achieved the best results with the **SBERT embedding** representation, as it derives semantically meaningful sentence embeddings (semantically similar sentences lie closer together in vector space).