# QNA Chatbot:-
We build a chatbot that can answer questions from a given document.
We use a three step approach which has the following sub modules:-
1. Breaking the document collection into smaller segments
2. Identifying Potential Segments that can answer the questions
3. Retreival Augmented Generation for question answering:-

## Installing necessary libraries

In [None]:
!pip install PyPDF2
!pip install sentence-transformers
!pip install datasets
!pip install faiss-cpu
!pip install langchain==0.0.163
!pip install pygpt4all==1.1.0
!pip install -U transformers
!pip install datasets
!pip install chromadb
!pip install tiktoken
!pip install gpt4all
!pip install -U accelerate

## Importing Libraries

In [2]:
from PyPDF2 import PdfReader
import os
import numpy as np
# First let import the most necessary libs
import pandas as pd
import numpy as np
# Library to import pre-trained model for sentence embeddings
from sentence_transformers import SentenceTransformer
# Calculate similarities between sentences
from sklearn.metrics.pairwise import cosine_similarity
# Visualization library
import seaborn as sns
import matplotlib.pyplot as plt
# package for finding local minimas
from scipy.signal import argrelextrema
import time
import math
from transformers import DPRContextEncoder, DPRContextEncoderTokenizer
import torch
from datasets import load_dataset
from datasets import Dataset
from datasets import load_from_disk
from transformers import DPRQuestionEncoder, DPRQuestionEncoderTokenizer
from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import gpt4all
from transformers import GPTJForCausalLM, AutoTokenizer
import torch

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Extracting text from files :-
We iterate over all the files and extract text from the pdf files

In [4]:
dir_name = "/content/drive/MyDrive/truefoundry/"
text_list = list()
for file in os.listdir(dir_name):
    text_file = list()
    reader = PdfReader(dir_name+file)

    # getting a specific page from the pdf file
    for i in range(len(reader.pages)):
        page = reader.pages[i]

        # extracting text from page
        text = page.extract_text()
        text_file.append(text)
    text_list.append(text_file)

In [5]:
def get_avg_tokens(txt_list):
    word_count = []
    for i in txt_list:
        txt = i.split()
        word_count.append(len(txt))
    return np.median(word_count)

In [6]:
# Average token length for 1 page
get_avg_tokens(text_list[1])

853.5

## Breaking documents into smaller segments
Since the document length is too high for a page, we need to break it into smaller paragraphs which have similar context.
This has to be done in such a way that every segment contains a group of sentences that have similar context.
In order to do this we first break the documents into sentences seperated by a "." token. Then we create embeddings for every sentences using all-mpnet-base-v2 model. We iterate over all sentences and calculate the similarity between the current and previous sentences. Whenever we find a relative minima (i.e similarity is high before the point as well as after the point) we identify that as a split point and split sentences accordingly. A bigger insight can be by reading this [blog](https://medium.com/@npolovinkin/how-to-chunk-text-into-paragraphs-using-python-8ae66be38ea6)


## Loading model to embed documents

In [7]:
#loading sentence transformer to make sentence vectors
model = SentenceTransformer('all-mpnet-base-v2')
# Split text into sentences
sentences = text.split('. ')
# Embed sentences
embeddings = model.encode(sentences)

Downloading (…)a8e1d/.gitattributes:   0%|          | 0.00/1.18k [00:00<?, ?B/s]

Downloading (…)_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Downloading (…)b20bca8e1d/README.md:   0%|          | 0.00/10.6k [00:00<?, ?B/s]

Downloading (…)0bca8e1d/config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

Downloading (…)ce_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

Downloading (…)e1d/data_config.json:   0%|          | 0.00/39.3k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/438M [00:00<?, ?B/s]

Downloading (…)nce_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

Downloading (…)a8e1d/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

Downloading (…)8e1d/train_script.py:   0%|          | 0.00/13.1k [00:00<?, ?B/s]

Downloading (…)b20bca8e1d/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)bca8e1d/modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

In [8]:
print(embeddings.shape[0])

9


## Helper functions to segment the documents

In [29]:

def rev_sigmoid(x:float)->float:
    return (1 / (1 + math.exp(0.5*x)))
    
def activate_similarities(similarities:np.array, p_size=1)->np.array:
        """ Function returns list of weighted sums of activated sentence similarities
        Args:
            similarities (numpy array): it should square matrix where each sentence corresponds to another with cosine similarity
            p_size (int): number of sentences are used to calculate weighted sum 
        Returns:
            list: list of weighted sums
        """
        # To create weights for sigmoid function we first have to create space. P_size will determine number of sentences used and the size of weights vector.
        x = np.linspace(-10,10,p_size)
        # Then we need to apply activation function to the created space
        y = np.vectorize(rev_sigmoid) 
        # Because we only apply activation to p_size number of sentences we have to add zeros to neglect the effect of every additional sentence and to match the length ofvector we will multiply
        activation_weights = np.pad(y(x),(0,similarities.shape[0]-p_size))
        ### 1. Take each diagonal to the right of the main diagonal
        diagonals = [similarities.diagonal(each) for each in range(0,similarities.shape[0])]
        ### 2. Pad each diagonal by zeros at the end. Because each diagonal is different length we should pad it with zeros at the end
        diagonals = [np.pad(each, (0,similarities.shape[0]-len(each))) for each in diagonals]
        ### 3. Stack those diagonals into new matrix
        diagonals = np.stack(diagonals)
        ### 4. Apply activation weights to each row. Multiply similarities with our activation.
        diagonals = diagonals * activation_weights.reshape(-1,1)
        ### 5. Calculate the weighted sum of activated similarities
        activated_similarities = np.sum(diagonals, axis=0)
        return activated_similarities

In [30]:
# The function to get all segments from a page
def get_seg_list(page):
    text = page
    sentences = text.split('. ')
    # Embed sentences
    embeddings = model.encode(sentences)
    # Create similarities matrix
    similarities = cosine_similarity(embeddings)
    activated_similarities = activate_similarities(similarities, p_size=embeddings.shape[0])
    minmimas = argrelextrema(activated_similarities, np.less, order=2)
    sentece_length = [len(each) for each in sentences]
    # Determine longest outlier
    long = np.mean(sentece_length) + np.std(sentece_length) *2
    # Determine shortest outlier
    short = np.mean(sentece_length) - np.std(sentece_length) *2
    # Shorten long sentences
    text = ''
    for each in sentences:
        if len(each) > long:
            # let's replace all the commas with dots
            comma_splitted = each.replace(',', '.')
        else:
            text+= f'{each}. '
    sentences = text.split('. ')
    # Now let's concatenate short ones
    text = ''
    for each in sentences:
        if len(each) < short:
            text+= f'{each} '
        else:
            text+= f'{each}. '
    split_points = [each for each in minmimas[0]]
    # Create empty string
    text = ''
    seg_list = list()
    for num,each in enumerate(sentences):
        # Check if sentence is a minima (splitting point)
        if num in split_points:
            # If it is than add a dot to the end of the sentence and a paragraph before it.
            seg_list.append(text)
            text+=""
        else:
            # If it is a normal sentence just add a dot to the end and keep adding sentences.
            text+=f'{each}. '
    return seg_list

In [11]:
len(get_seg_list(text_list[1][0]))

3

In [12]:
# segments contains all similar context paragraphs
segments = []

## Making segments for all pages in the dataset

In [31]:
for doc in text_list:
    for page in doc:
        segments.extend(get_seg_list(page))

In [33]:
len(segments)

438

In [18]:
print(len(segments[4].split()))

427


In [32]:
get_avg_tokens(segments)## Average token size reduces after we break the document into smaller changes

228.5

As we can see median size of tokens for a segment is much smaller than that for a page

## DPR for finding out the best segments to answer a question
Our database is too large for us to use the entire document for searching the answer to the questions. To make our search faster we need to narrow it down to a few segments that may possibly contain the answer. To do this, we use DPR for searching the best segments that can answer the question. We use the pretrained dpr-ctx_encoder-single-nq-base for context and dpr-question_encoder-single-nq-base for the questions. We generate indexes using FAISS and search for the top 5 segments that are best for answering our question. We concatenate all the segments and use this as the context for answering the final question.

## Loading context encoder for DPR

In [34]:
torch.set_grad_enabled(False)
ctx_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'DPRQuestionEncoderTokenizer'. 
The class this function is called from is 'DPRContextEncoderTokenizer'.


In [35]:
# Split documents further if the tokens are more than 512. First 500 tokens in first part remaining in the next part
sentence_list = []
segment_list = []
for string in segments:
    encoding = ctx_tokenizer.encode(string)
    if len(encoding) > 500:
        first_part = encoding[:500]
        second_part = encoding[500:]
        sentence_list.append(ctx_tokenizer.decode(first_part))
        sentence_list.append(ctx_tokenizer.decode(second_part))
        segment_list.append(string)
        segment_list.append(string)
    else:
        sentence_list.append(string)
        segment_list.append(string)

Token indices sequence length is longer than the specified maximum sequence length for this model (550 > 512). Running this sequence through the model will result in indexing errors


In [36]:
len(sentence_list)

617

### Creating database for search by encoding all segments by context encoder

In [37]:
data_dict = {}
data_dict['line'] = sentence_list
data_dict['segment'] = segment_list

In [39]:
df = pd.DataFrame(data_dict)
df.to_csv('/content/drive/MyDrive/segments.csv', index=False)

In [59]:
df.head()

Unnamed: 0,line,segment
0,Morbidity and Mortality Weekly Report 698 MMWR...,Morbidity and Mortality Weekly Report 698 MMWR...
1,Morbidity and Mortality Weekly Report 698 MMWR...,Morbidity and Mortality Weekly Report 698 MMWR...
2,Morbidity and Mortality Weekly Report 698 MMWR...,Morbidity and Mortality Weekly Report 698 MMWR...
3,[CLS] morbidity and mortality weekly report 69...,Morbidity and Mortality Weekly Report 698 MMWR...
4,", acog, the american college of physicians ( a...",Morbidity and Mortality Weekly Report 698 MMWR...


In [60]:
ds = Dataset.from_dict(data_dict)# Creating hf dataset

In [61]:
ds

Dataset({
    features: ['line', 'segment'],
    num_rows: 617
})

In [62]:
ds.save_to_disk('/content/drive/MyDrive/embedded_truefoundry')

Saving the dataset (0/1 shards):   0%|          | 0/617 [00:00<?, ? examples/s]

In [63]:
ds = load_from_disk('/content/drive/MyDrive/embedded_truefoundry')

In [64]:
ds

Dataset({
    features: ['line', 'segment'],
    num_rows: 617
})

### Creating embeddings 

In [67]:
ds_with_embeddings = ds.map(lambda example: {'embeddings': ctx_encoder(**ctx_tokenizer(example["line"], truncation = True,return_tensors="pt"))[0][0].numpy()})



In [68]:
ds_with_embeddings

Dataset({
    features: ['line', 'segment', 'embeddings'],
    num_rows: 617
})

In [52]:
ds_with_embeddings.save_to_disk('/content/drive/MyDrive/embedded_truefoundry_vectors')

Saving the dataset (0/1 shards): 0 examples [00:00, ? examples/s]

In [5]:
ds_with_embeddings = load_from_disk('/content/drive/MyDrive/embedded_truefoundry_vectors')

In [6]:
ds_with_embeddings

Dataset({
    features: ['line', 'segment', 'embeddings'],
    num_rows: 617
})

## Adding Faiss Index to the dataset 

In [7]:
ds_with_embeddings.add_faiss_index(column='embeddings')

  0%|          | 0/1 [00:00<?, ?it/s]

Dataset({
    features: ['line', 'segment', 'embeddings'],
    num_rows: 617
})

## Initialize question tokenizer

In [4]:
q_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")

In [8]:
questions = ["When did the GARDASIL 9 recommendations change?",
"What were the past 3 recommendation changes for GARDASIL 9?",
"Is GARDASIL 9 recommended for Adults?",
"Does the ACIP recommend one dose GARDASIL 9?"
]

In [9]:
contexts = list()

## Getting context text for every question

In [10]:
for question in questions:
    question_embedding = q_encoder(**q_tokenizer(question, return_tensors="pt"))[0][0].detach().numpy()
    scores, retrieved_examples = ds_with_embeddings.get_nearest_examples('embeddings', question_embedding, k=5)
    context = "\n".join(retrieved_examples["line"])
    contexts.append(context)

In [11]:
contexts[0]

'Recommendations and ReportsMMWR  / August 29, 2014  / Vol . 5 19reported local symptoms were injection-site redness, swelling, \nand induration. Postlicensure safety data are available from other countries that have implemented vaccination programs using HPV2 (154,176). In a review of passive reports from countries that have implemented HPV2 vaccination programs, the distribution of adverse events was consistent with prelicensure trials. \nRecommendations and ReportsMMWR  / August 29, 2014  / Vol . 5 15bacterial meningitis, viral myocarditis, pulmonary embolism, \ndiabetic ketoacidosis, and seizure disorder. VSD conducts evaluations of specific events that might be associated with vaccination ( 151). Data were analyzed after \n600,558 doses of HPV4 had been administered to females. No statistically significant increased risks were observed for any of the prespecified endpoints including Guillain-Barré syndrome (GBS), stroke, venous thromboembolism, appendicitis, seizures, syncope, all

## Retrieval Augmented Generation for the answer from the context
Now that we have our question and context we finally perform retrieval augmented generation to get the answer for the question. The model used is gpt4all-j-v.1.3 which is a small model that can be run on colab and kaggle. We can use better models too but under the constraints this was the best possible choice. We use prompt engineering which passes the context and question from the model and asks it to give the answer from the context

## Loading LLM model

In [12]:
gptj = gpt4all.GPT4All("ggml-gpt4all-j-v1.3-groovy")

Found model file at  /root/.cache/gpt4all/ggml-gpt4all-j-v1.3-groovy.bin


In [13]:
answers = list()

## Performing retrieval for all questions

In [14]:
for i in range(len(questions)):
    question = questions[i]
    context = contexts[i]
    messages = [{"role": "user", "content": f"Please use the following context to answer questions.Context: {context} --- Question: {question}"}]
    ret = gptj.chat_completion(messages)
    answers.append(ret)

### Instruction: 
            The prompt below is a question to answer, a task to complete, or a conversation 
            to respond to; decide which and write an appropriate response.
            
### Prompt: 
Please use the following context to answer questions.Context: Recommendations and ReportsMMWR  / August 29, 2014  / Vol . 5 19reported local symptoms were injection-site redness, swelling, 
and induration. Postlicensure safety data are available from other countries that have implemented vaccination programs using HPV2 (154,176). In a review of passive reports from countries that have implemented HPV2 vaccination programs, the distribution of adverse events was consistent with prelicensure trials. 
Recommendations and ReportsMMWR  / August 29, 2014  / Vol . 5 15bacterial meningitis, viral myocarditis, pulmonary embolism, 
diabetic ketoacidosis, and seizure disorder. VSD conducts evaluations of specific events that might be associated with vaccination ( 151). Data were analyzed 

## Saving the output

In [15]:
import json

json_string = json.dumps(answers)

In [17]:
with open("/content/drive/MyDrive/output_truefoundry.json", "w") as outfile:
    outfile.write(json_string)