# Interviewing System based on BM25

Problems with tf-idf:
- Problem with tf: if 'cat' occurs once in a 10-token document A and 10 times in a proportionally longer document B, the two documents are not equally relevant, even though the proportion suggests they are. The 10 occurrences in B might all fall in the last 30 tokens (highly unlikely here, but more plausible as documents grow). So the reader has to traverse more text as document length increases, and tf-idf does not penalise this.<br>
- The partial derivative of tf-idf with respect to tf is idf, which does not depend on how often the keyword already occurs in the document. Each additional occurrence therefore has a constant effect on the score, which is undesirable. For example, going from 10 'cats' to 100 'cats' while the document grows from 100 to 1250 tokens (instead of the proportional 1000) should be penalised more.
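The two points above can be illustrated with a minimal sketch that compares the linear tf-idf term with the standard Okapi BM25 term. The function names, the constant `idf = 1.0`, the average length of 100 tokens and the parameters `k1 = 1.5`, `b = 0.75` are assumptions chosen for illustration only.

```python
def tfidf_term(tf: float, idf: float) -> float:
    """Plain tf-idf: linear in tf, so d(score)/d(tf) is always idf."""
    return tf * idf


def bm25_term(tf: float, idf: float, doc_len: float, avg_doc_len: float,
              k1: float = 1.5, b: float = 0.75) -> float:
    """Okapi BM25 contribution of one query term to one document."""
    # Documents longer than average get a larger normaliser, hence a lower score
    length_norm = 1 - b + b * doc_len / avg_doc_len
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)


idf = 1.0        # assumed constant so that only tf and length vary
avg_len = 100.0  # assumed average document length

# tf-idf keeps growing with tf, BM25 saturates below idf * (k1 + 1)
for tf in (1, 10, 100):
    print(tf, tfidf_term(tf, idf), round(bm25_term(tf, idf, 100, avg_len), 3))

# Same tf, but the second document is ten times longer
short_doc = bm25_term(10, idf, doc_len=100, avg_doc_len=avg_len)
long_doc = bm25_term(10, idf, doc_len=1000, avg_doc_len=avg_len)
print(round(short_doc, 3), round(long_doc, 3))
```

With these assumed values the BM25 term never exceeds `idf * (k1 + 1)`, and the same term frequency scores lower in the longer document.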

Reference : https://www.youtube.com/watch?v=ruBm9WywevM

### Load and format Dataset

In [1]:
from datasets import load_dataset

data = load_dataset('medalpaca/medical_meadow_wikidoc')
data

DatasetDict({
    train: Dataset({
        features: ['output', 'instruction', 'input'],
        num_rows: 10000
    })
})

In [2]:
import pandas as pd

In [3]:
df = data['train'].to_pandas()

In [4]:
df = df[['output','input']][:1000]
df.columns = ['answer','question']

In [5]:
pd.set_option('display.max_colwidth', 200)
df.head()

Unnamed: 0,answer,question
0,"Squamous cell carcinoma of the lung may be classified according to the WHO histological classification system into 4 main types: papillary, clear cell, small cell, and basaloid.",Can you provide an overview of the lung's squamous cell carcinoma?
1,"Clear cell tumors are part of the surface epithelial-stromal tumor group of Ovarian cancers, accounting for 6% of these neoplastic cases. Clear cell tumors are also associated with the pancreas an...","What does ""Clear: cell"" mean?"
2,Two Japanese scientists commenced research into inhibitors of HMG-CoA reductase in 1971 reasoning that organisms might produce such products as the enzyme is important in some essential cell wall ...,Can you provide me with information regarding statins?
3,Symptoms of vulvovaginitis caused by Candida species are indistinguishable and include the following: \nPruritus is the most significant symptom Change in the amount and the color of vaginal dis...,What are the historical background and symptoms of Candida-induced vulvovaginitis?
4,Hypotension is the term for low blood pressure (BP). A systolic BP measuring less than 90mmHg and/ or diastolic BP of less than 60mmHg is considered hypotension. A difference of 20 mmHg systolic B...,"What does the ""Hypotension: Resident Survival Guide"" refer to?"


### Preprocess Data

In [6]:
from nltk.tokenize import word_tokenize

import re
from rank_bm25 import BM25Okapi
import numpy as np
import random

`BM25Okapi`: The ATIRE BM25 variant uses an idf function based on a log score. To prevent negative idf scores, this algorithm also floors the idf value at epsilon.
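The idf floor described above can be sketched as follows. This is an illustration of the idea rather than a copy of the library's code: the function name `floored_idf`, the toy document frequencies, the default `epsilon = 0.25` and the choice of `epsilon` times the average idf as the floor are all assumptions.

```python
import math


def floored_idf(doc_freqs: dict, corpus_size: int, epsilon: float = 0.25) -> dict:
    """Log-based idf per term, with negative values replaced by a small floor."""
    raw = {
        term: math.log(corpus_size - df + 0.5) - math.log(df + 0.5)
        for term, df in doc_freqs.items()
    }
    # Assumed floor: epsilon times the average idf over the vocabulary
    floor = epsilon * sum(raw.values()) / len(raw)
    return {term: (floor if value < 0 else value) for term, value in raw.items()}


# Hypothetical document frequencies in a corpus of 10 documents
idf = floored_idf({"the": 9, "syndrome": 2, "peutz": 1}, corpus_size=10)
print(idf)
```

A term such as "the", which appears in more than half of the documents, would get a negative raw idf; the floor keeps its weight small but positive.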

In [7]:
def preprocess(docs:list):
    res = []
    for i in docs:
        #i = i.lower()
        res.append(re.sub(r'[^\w\s]', ' ', i))
    return res

- Trick to interchange between interview bot and search engine is that instead of computing similarity between queries and questions (from dataset) to get answer, do it instead for user answer and dataset answer to get question. 
- If question obtained due to user answer and the actual question obtained from dataset answer match, first condition for going forward is satisfied.
- Get similarity scores between user answer and real answer and set a threshold to continue.
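The two conditions above can be sketched as a single check. The helper name `should_continue` and the default threshold of 0.5 are assumptions for illustration; the indices and scores in the usage lines are taken from runs later in this notebook.

```python
def should_continue(matched_index: int, asked_index: int,
                    score: float, threshold: float = 0.5) -> bool:
    """Return True when the interview may move on to the next question.

    matched_index: index of the dataset answer closest to the user's answer
    asked_index:   index of the question that was actually asked
    score:         normalised similarity between user answer and matched answer
    """
    same_question = matched_index == asked_index  # first condition
    good_enough = score >= threshold              # second condition
    return same_question and good_enough


print(should_continue(315, 315, 0.6336))  # matching index, score above threshold
print(should_continue(711, 315, 0.3724))  # wrong index, score below threshold
```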

### Defining Search Engine

In [8]:
def search_engine(dataset:pd.DataFrame, query:str):
    """Match a user answer against the dataset answers using BM25.

    Args:
        dataset (pandas.DataFrame): Dataframe with columns = ['question','answer']
        query (string): user answer to be matched

    Returns:
        user_score (float): Normalised similarity score between the user's answer and the best-matching answer
        user_index (int): Index of answer matched in dataset
    """
    dataset = dataset.drop_duplicates()
    dataset = dataset.dropna()
    
    topic = list(dataset['answer'])
    topic = preprocess(topic)
    
    tokenized_corpus = [word_tokenize(doc) for doc in topic]
    
    bm25 = BM25Okapi(tokenized_corpus)
    
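    # The query is split on single spaces (unlike the corpus, it is not passed
    # through preprocess/word_tokenize)
    doc_scores = bm25.get_scores(query.split(' '))

    # See below for reason of normalising
    i = np.argsort(doc_scores)  # indices that sort the scores in ascending order
    sorted_arr = doc_scores[i]  # sorted scores
    sorted_arr = sorted_arr[-10:]  # keep the 10 highest scores
    sorted_arr /= np.linalg.norm(sorted_arr)  # scale by the L2 norm of the top 10
    user_score = max(sorted_arr)

    # Position of the best-matching answer in the cleaned corpus
    query_results = bm25.get_top_n(query.split(' '), topic, n = 5)
    user_index = topic.index(query_results[0])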
    #result = dataset.iloc[user_index,:]
    
    return user_score, user_index

In [9]:
def qa_pair_generator(dataset:pd.DataFrame):
    indice = np.random.randint(len(dataset))
    question = dataset['question'][indice]
    answer = dataset['answer'][indice]
    return question, answer, indice

In [10]:
question, answer, indice = qa_pair_generator(df)

In [11]:
print(f'indice : {indice}\nquestion : {question}\nanswer : {answer}')

indice : 765
question : What is the historical background or context of T-cell prolymphocytic leukemia?
answer : 40 years ago, in 1973, Catovsky first described four cases of T-cell prolymphocytic leukemia.   In 1994, Harris a pathologist from Boston and his colleagues made an effort to classify T-cell prolymphocytic leukemia.


*The above function works.<br>For testing, however, I fix the index by hand instead of drawing it at random with the function, so that the results can be reproduced.*

In [12]:
indice = 315
question, answer, = df.iloc[315]['question'],df.iloc[315]['answer']

In [13]:
print(f'indice : {indice}\nquestion : {question}\nanswer : {answer}')

indice : 315
question : What are the epidemiological and demographic factors associated with Peutz-Jeghers Syndrome?
answer : The epidemiology and demographics are as follows:  
The prevalence of Peutz-Jeghers syndrome is estimated to be 0.8 to 2.8 in 100000.
Peutz-Jeghers syndrome affects individuals between the ages of 10 to 30 years. Average age of diagnosis of Peutz-Jeghers syndrome is 23 years for males and 26 years for females.
There is no racial predilection to Peutz-Jeghers syndrome.
Peutz-Jeghers syndrome affects men and women equally. Average age of diagnosis of Peutz-Jeghers syndrome is 23 years for males and 26 years for females.


In [14]:
query = "The prevalence of Peutz-Jeghers syndrome is estimated to be 0.8 to 2.8 in 100000. eutz-Jeghers syndrome affects men and women equally. Average age of diagnosis of Peutz-Jeghers syndrome is 23 years for males and 26 years for females"

In [15]:
user_score, user_index = search_engine(df,query)
print(f'score : {user_score}\n\nyour answer : {query}\n\ncorrect answer : {answer}')

score : 0.6335925891591059

your answer : The prevalence of Peutz-Jeghers syndrome is estimated to be 0.8 to 2.8 in 100000. eutz-Jeghers syndrome affects men and women equally. Average age of diagnosis of Peutz-Jeghers syndrome is 23 years for males and 26 years for females

correct answer : The epidemiology and demographics are as follows:  
The prevalence of Peutz-Jeghers syndrome is estimated to be 0.8 to 2.8 in 100000.
Peutz-Jeghers syndrome affects individuals between the ages of 10 to 30 years. Average age of diagnosis of Peutz-Jeghers syndrome is 23 years for males and 26 years for females.
There is no racial predilection to Peutz-Jeghers syndrome.
Peutz-Jeghers syndrome affects men and women equally. Average age of diagnosis of Peutz-Jeghers syndrome is 23 years for males and 26 years for females.


In [16]:
user_index,indice

(315, 315)

In [17]:
def test_conductor(ans:str):
    user_index,indice = 0,0
    # Keep asking while the matched answer belongs to the question that was asked
    while user_index == indice:
        question, answer, indice = qa_pair_generator(df)
        print(f'Your question is : {question}')
        # Score the answer passed in as `ans`
        user_score, user_index = search_engine(df,ans)
        print(f'score : {user_score}\n\nYour Answer : {ans}\n\nCorrect answer : {answer}')
        # Stop as soon as the score falls below the threshold
        if user_score < 0.3:
            break

It is worth noting that the nature of BM25 scores differs from that of dense vectors. While embedding models are designed to generate scores ranging from 0 to 1 - reflecting the likelihood of a document's relevance to the query - BM25 normalization merely divides by the maximum possible score a document can achieve for a given query. This means the score distribution for BM25 might vary significantly from one query to another.
<br><br>
The score that BM25 calculates is only usable to **compare search results for a specific query to each other**. It's not possible to transform that score to mean something independent of the query.

But there is one approach that might be acceptable in some cases; you decide whether it works in yours:

Normalise each score by dividing it by the sum of all scores (of, say, the top 10 results). The score of the first hit then answers the question: "Are there lots of other hits that also match this query?" If there are, the number will be low; otherwise it will be high. Note that `search_engine` above divides the top 10 scores by their L2 norm (`np.linalg.norm`) rather than by their sum, but the normalised top score is read the same way.
<br>Reference:<br>https://stats.stackexchange.com/questions/171589/normalised-score-for-bm25 
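As a rough sketch of this normalisation, the snippet below compares dividing the top-k scores by their sum with dividing them by their L2 norm (which is what `search_engine` does). The helper names and the two toy score lists are made up for illustration, and the scores are assumed to be non-negative.

```python
import math


def normalise_by_sum(scores, top_k=10):
    """Divide the top_k scores by their sum."""
    top = sorted(scores, reverse=True)[:top_k]
    total = sum(top)
    return [s / total for s in top] if total > 0 else [0.0] * len(top)


def normalise_by_l2(scores, top_k=10):
    """Divide the top_k scores by their L2 norm."""
    top = sorted(scores, reverse=True)[:top_k]
    norm = math.sqrt(sum(s * s for s in top))
    return [s / norm for s in top] if norm > 0 else [0.0] * len(top)


peaked = [12.0, 1.0, 1.0, 1.0]  # one document clearly stands out
flat = [4.0, 4.0, 4.0, 4.0]     # many documents match equally well

for name, scores in (("peaked", peaked), ("flat", flat)):
    print(name, normalise_by_sum(scores)[0], normalise_by_l2(scores)[0])
```

In both variants the top score is high when one document dominates and lower when many documents match the query about equally well.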

**NOTE**: Other Embedding Models/Techniques can be used instead of BM25, for instance doc2vec or sentence transformer for semantic search.

### Using Sentence Transformer

In [18]:
from sentence_transformers import SentenceTransformer,util
import torch

model_id = "sentence-transformers/all-MiniLM-L6-v2" # 384-dimensional embeddings
model = SentenceTransformer(model_id)



In [19]:
def ST_search(corpus:pd.DataFrame,query:str):
    topic = list(corpus['answer'])
    # Embed every dataset answer and the user's answer
    corpus_embeddings = model.encode(topic)
    query_embeddings = model.encode(query)
    similarity = util.cos_sim(corpus_embeddings,query_embeddings)
    
    # Index and text of the most similar dataset answer
    index = torch.argmax(similarity)
    index = index.item()
    result_text = topic[index]
    return index, result_text,similarity.max()

In [20]:
index, result_text,similarity = ST_search(df,query)

In [21]:
print(f'Your score : {similarity}')

Your score : 0.9409849643707275


In [22]:
index, indice

(315, 315)

### Testing

`Methodology`: 
<br><br>Since the model requires free-text input, the testing process is difficult to automate, so I test manually on 3 examples.<br><br>
Scoring criteria (confusion matrix): (True Positives, False Positives, False Negatives, True Negatives)
- TP, FP, TN and FN are calculated from the similarity scores (with a threshold of 0.5 in most cases) and not from index matching, which is the usual way of calculating them.
- This is mainly done to get a more definitive evaluation.
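The tallying described above can be sketched as a small helper. The name `tally`, the fixed threshold of 0.5 and the trial data are hypothetical; in this notebook the counts were kept by hand, and whether an answer counts as correct is a manual judgement.

```python
from collections import Counter


def tally(trials, threshold=0.5):
    """Count (TP, FP, FN, TN) from (score, answer_is_correct) pairs.

    An answer is "accepted" when its similarity score reaches the threshold.
    """
    counts = Counter()
    for score, is_correct in trials:
        accepted = score >= threshold
        if accepted and is_correct:
            counts["TP"] += 1  # correct answer, accepted
        elif accepted and not is_correct:
            counts["FP"] += 1  # wrong answer, accepted
        elif not accepted and is_correct:
            counts["FN"] += 1  # correct answer, rejected
        else:
            counts["TN"] += 1  # wrong answer, rejected
    return counts["TP"], counts["FP"], counts["FN"], counts["TN"]


# Made-up trials: (similarity score, was the answer actually correct?)
example_trials = [(0.9, True), (0.7, False), (0.2, False), (0.4, True)]
print(tally(example_trials))
```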

#### Example 1

In [23]:
query_new = "The syndrome has no racial or gender predilection. The syndrome observed to occur between 23 to 26 years for adults."

In [24]:
user_score, user_index = search_engine(df,query_new)
print(f'score using BM25 : {user_score}\n\nuser index:{user_index}\n\nreal index:{indice}')

score using BM25 : 0.6637617723906973

user index:315

real index:315


In [25]:
index, result_text,similarity = ST_search(df,query_new)
print(f'score using sentence transformer : {similarity}\n\nuser index:{index}\n\nreal index:{indice}')

score using sentence transformer : 0.5060998201370239

user index:315

real index:315


Scores:
- False Positive for BM25(0,1,0,0)
- ST(1,0,0,0)

In [26]:
query_new = "The syndrome has no racial or gender predilection. The syndrome observed to occur between 23 to 26 years for adults. The prevalence of Peutz-Jeghers syndrome is 0.8 to 2.8 in 100000"
user_score, user_index = search_engine(df,query_new)
index, result_text,similarity = ST_search(df,query_new)
print(f'score using BM25 : {user_score}\n\nuser index:{user_index}\n\nreal index:{indice}')
print(f'\nscore using sentence transformer : {similarity}\n\nuser index:{index}\n\nreal index:{indice}')

score using BM25 : 0.6159147979703334

user index:315

real index:315

score using sentence transformer : 0.8363104462623596

user index:315

real index:315


Scores:
- BM25(1,1,0,0)
- ST(2,0,0,0)

Giving a wrong answer to probe the quality of the two systems

In [27]:
query_new = "The syndrome has no race or gender preference. The syndromes is seen to occur in young adults. The prevalence of syndrome is 1.6 to 2.5 in 1000"
user_score, user_index = search_engine(df,query_new)
index, result_text,similarity = ST_search(df,query_new)
print(f'score using BM25 : {user_score}\n\nuser index:{user_index}\n\nreal index:{indice}')
print(f'\nscore using sentence transformer : {similarity}\n\nuser index:{index}\n\nreal index:{indice}')

score using BM25 : 0.3723763785263308

user index:711

real index:315

score using sentence transformer : 0.5869288444519043

user index:654

real index:315


Scores:
- BM25(1,1,0,1)
- False Positive for ST(2,1,0,0)

#### Example 2

In [28]:
indice = 893
question, answer, = df.iloc[indice]['question'],df.iloc[indice]['answer']
print(f'indice : {indice}\nquestion : {question}\nanswer : {answer}')

indice : 893
question : What is the history and what are the symptoms of disseminated intravascular coagulation?
answer : Patients with DIC may have a history of abruptio placentae, amniotic fluid embolism, aortic aneurysm, blood transfusion reaction, drug exposure (e.g. amphetamines), eclampsia, giant hemangioma, graft-versus-host disease, HELLP syndrome, hemolytic transfusion reaction, liver disease, malignancy (especially acute promyelocytic leukemia), sepsis (esp. gram-negative bacteria), severe allergic reaction, transplant rejection, trauma (e.g. fat embolism, head injury), venomous snake and viral hemorrhagic fever.


**Repeating the procedure of progressively convoluting the language**

In [29]:
query_new = "People with history of placentae might be prone to disseminated intravascular coagulation."
user_score, user_index = search_engine(df,query_new)
index, result_text,similarity = ST_search(df,query_new)
print(f'score using BM25 : {user_score}\n\nuser index:{user_index}\n\nreal index:{indice}')
print(f'\nscore using sentence transformer : {similarity}\n\nuser index:{index}\n\nreal index:{indice}')

score using BM25 : 0.35730133917757434

user index:266

real index:893

score using sentence transformer : 0.5851185321807861

user index:266

real index:893


Scores:
- BM25(1,1,0,2)
- False Positive ST(2,2,0,0)

In [30]:
query_new = "People with history of severe allergies, transplant rejection, trauma (e.g. fat embolism, head injury), venomous snake and viral hemorrhagic fever.be prone to DIC"
user_score, user_index = search_engine(df,query_new)
index, result_text,similarity = ST_search(df,query_new)
print(f'score using BM25 : {user_score}\n\nuser index:{user_index}\n\nreal index:{indice}')
print(f'\nscore using sentence transformer : {similarity}\n\nuser index:{index}\n\nreal index:{indice}')

score using BM25 : 0.7534539451842835

user index:893

real index:893

score using sentence transformer : 0.8098570704460144

user index:893

real index:893


Scores:
- BM25(2,1,0,2)
- ST(3,2,0,0)

**Providing Ambiguous Answers**

In [31]:
query_new = "Humans with history of severe allergies, brain transplant, trauma ,snake bites and viral  fever (hemorrhagi) are prone to DIC"
user_score, user_index = search_engine(df,query_new)
index, result_text,similarity = ST_search(df,query_new)
print(f'score using BM25 : {user_score}\n\nuser index:{user_index}\n\nreal index:{indice}')
print(f'\nscore using sentence transformer : {similarity}\n\nuser index:{index}\n\nreal index:{indice}')

score using BM25 : 0.5067731031429341

user index:893

real index:893

score using sentence transformer : 0.7604563236236572

user index:893

real index:893


Scores:
- False positive BM25(2,2,0,2)
- False positive ST(3,3,0,0)

**Incorrect but syntactically similar answers**

In [32]:
query_new = "Humans with history of severe algae,transport injection, coma , eagle kites and viral cold (hemorrhagi) are prone to BIC"
user_score, user_index = search_engine(df,query_new)
index, result_text,similarity = ST_search(df,query_new)
print(f'score using BM25 : {user_score}\n\nuser index:{user_index}\n\nreal index:{indice}')
print(f'\nscore using sentence transformer : {similarity}\n\nuser index:{index}\n\nreal index:{indice}')

score using BM25 : 0.4004705214410462

user index:290

real index:893

score using sentence transformer : 0.4721139967441559

user index:893

real index:893


Scores:
-  BM25(2,2,0,3)
-  ST(3,3,0,1)

#### Example 3

In [34]:
indice = 858
question, answer, = df.iloc[indice]['question'],df.iloc[indice]['answer']
print(f'indice : {indice}\nquestion : {question}\nanswer : {answer}')

indice : 858
question : What is the purpose of an MRI in detecting Pseudomyxoma peritonei?
answer : Abdominal MRI is helpful in the diagnosis of pseudomyxoma peritonei. On abdominal MRI, pseudomyxoma peritonei is characterized by a mass which is hypointense on T1-weighted MRI and hyperintense on T2-weighted MRI. MRI has better sensitivity in detecting ascitic fluid and mucocele.


In [35]:
query_new = " On abdominal MRI, pseudomyxoma peritonei is characterized by a mass which is hypointense on T1-weighted MRI and hyperintense on T2-weighted MRI."
user_score, user_index = search_engine(df,query_new)
index, result_text,similarity = ST_search(df,query_new)
print(f'score using BM25 : {user_score}\n\nuser index:{user_index}\n\nreal index:{indice}')
print(f'\nscore using sentence transformer : {similarity}\n\nuser index:{index}\n\nreal index:{indice}')

score using BM25 : 0.7184778547645139

user index:858

real index:858

score using sentence transformer : 0.9523378014564514

user index:858

real index:858


Scores:
-  BM25(3,2,0,3)
-  ST(4,3,0,1)

In [36]:
query_new = "Obviously for diagnosing."
user_score, user_index = search_engine(df,query_new)
index, result_text,similarity = ST_search(df,query_new)
print(f'score using BM25 : {user_score}\n\nuser index:{user_index}\n\nreal index:{indice}')
print(f'\nscore using sentence transformer : {similarity}\n\nuser index:{index}\n\nreal index:{indice}')

score using BM25 : 0.33264087606812864

user index:191

real index:858

score using sentence transformer : 0.41637101769447327

user index:604

real index:858


Scores:
-  BM25(3,2,0,4)
-  ST(4,3,0,2)

In [37]:
query_new = "For curing pseudomyxoma peritonei and reducing ascitic fluid and mucocele levels."
user_score, user_index = search_engine(df,query_new)
index, result_text,similarity = ST_search(df,query_new)
print(f'score using BM25 : {user_score}\n\nuser index:{user_index}\n\nreal index:{indice}')
print(f'\nscore using sentence transformer : {similarity}\n\nuser index:{index}\n\nreal index:{indice}')

score using BM25 : 0.7945593302531387

user index:858

real index:858

score using sentence transformer : 0.6618459224700928

user index:858

real index:858


Scores:
-  False Positive BM25(3,3,0,4)
-  False Positive ST(4,4,0,2)

`Observations`: 
- Neither model produced a false negative, that is, neither rejected an answer judged correct. BM25 is especially good at flagging wrong answers (4 true negatives against 2 for the sentence transformer).
- The sentence transformer produced as many false positives as true positives (4 each).
- Since we are designing an interview system, BM25 seems to have the edge: it rejects more incorrect answers, at the cost of slightly worse performance on correct answers.

#### Calculating some measures:

In [38]:
m1 = [3,3,0,4]
m2 = [4,4,0,2]

In [39]:
print(f'Confusion matrix for BM25:\n{m1[0]}\t{m1[1]}\n{m1[2]}\t{m1[3]}')
print(f'Confusion matrix for Sentence Transformers :\n{m2[0]}\t{m2[1]}\n{m2[2]}\t{m2[3]}')

Confusion matrix for BM25:
3	3
0	4
Confusion matrix for Sentence Transformers :
4	4
0	2


The metrics are calculated by hand here for simplicity; sklearn could be used for more thorough testing.

In [40]:
def get_report(m1:list,m2:list):
    pr_bm = m1[0]/(m1[0]+m1[1])
    pr_st = m2[0]/(m2[0]+m2[1])
    rec_bm = m1[0]/(m1[0]+m1[2])
    rec_st = m2[0]/(m2[0]+m2[2])
    ac_bm = ( m1[0] + m1[3]  ) / (m1[0]+m1[1]+m1[2]+m1[3])
    ac_st = ( m2[0] + m2[3] )/(m2[0]+m2[1]+m2[2]+m2[3])
    f1_bm = (2*pr_bm*rec_bm)/(pr_bm+rec_bm)
    f1_st = (2*pr_st*rec_st)/(pr_st+rec_st)
    print('Metric sheet\n-------------------------------------------------')
    print(f'Precision :\tBM25 : {pr_bm}\tST : {pr_st}')
    print(f'Recall :\tBM25 : {rec_bm}\tST : {rec_st}')
    print(f'Accuracy :\tBM25 : {ac_bm}\tST : {ac_st}')
    print(f'F1 :\t\tBM25 : {f1_bm:0.4f}\tST : {f1_st:0.4f}')

In [41]:
get_report(m1,m2)

Metric sheet
-------------------------------------------------
Precision :	BM25 : 0.5	ST : 0.5
Recall :	BM25 : 1.0	ST : 1.0
Accuracy :	BM25 : 0.7	ST : 0.6
F1 :		BM25 : 0.6667	ST : 0.6667


`Conclusions`: Search engines are generally evaluated with F1. Since both models have the same F1, precision and recall, the remaining measure, accuracy, is used to separate them, and BM25 edges out the sentence transformer model (0.7 against 0.6).
A better embedding model might beat BM25, but for now BM25 cannot be discarded and will be used in further iterations.