# **Write a simple chatbot to query a large PDF document using LLMs and return responses. Preferred language: Python.**


Ways to do it:

1. Use an existing pretrained LLM: BERT or GPT-3.5 (ChatGPT)
2. Build my own model? (Needs a lot of data and time, though)

## Assumptions:

1. There is only a single PDF document. Note that the approach can be extended to multiple PDF documents.
2. The PDF document is **not** a question-and-answer reference; it is a paragraph-style document. My task is to build a bot that can answer questions about the information in the document.


## model choices: LLM
* GPT-3 to GPT-4: best for generating new text from the document (while still grounded in its information), but they require paid API access ($$$).

* BERT: best for detail-specific (extractive) answering, and it can be fine-tuned on our document. Only disadvantage: it may struggle to create new text, as it can only point to spans of the existing text.

* T5: a text-to-text LLM, so it suits chatbots better because it can answer in a human-like way. Only disadvantage: accuracy.

> BERT also has lightweight variants (for example DistilBERT and ALBERT), and T5 has newer, better models such as Flan-T5.

* XLNet: may handle complex queries better, but it is more expensive to train and use.

## Rough planning:

* will use a sample pdf, and

## problems with finetuning:
* If an LLM is fine-tuned on a single PDF (for example, a case study), it won't be able to learn the details from it; the data is too sparse. Fine-tuning works if you have a large dataset.
* **For a small dataset, creating embeddings for the knowledge base is better.**
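To make the embedding idea concrete, here is a minimal sketch of retrieval over a tiny knowledge base. It assumes a toy word-count "embedding" as a stand-in for a real embedding model, and the names `embed`, `cosine`, and `best_chunk` are hypothetical helpers, not part of any library. No training is involved: the document is only indexed and searched.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding (assumption): lowercase word counts instead of a real model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def best_chunk(query, chunks):
    """Return the chunk whose vector is most similar to the query vector."""
    q = embed(query)
    return max(chunks, key=lambda c: cosine(q, embed(c)))

chunks = [
    "Optical flow is the apparent motion of objects between video frames.",
    "The dataset was split into training and validation sets.",
]
print(best_chunk("what is optical flow", chunks))
```

A real setup would swap `embed` for a learned embedding model, but the retrieval step keeps the same shape.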


## Preprocessing the document: things to consider:
* whether the document is plain text only, like a Wikipedia article
* whether the document also contains non-text content, like tables, graphs, or images
* tables can be handled in the preprocessing step, but graphs and images will require more complex models.
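One possible way to handle tables during preprocessing is to flatten each row into a short text line, so it can be chunked and embedded like ordinary text. This is only a sketch under the assumption that the table has already been extracted into a header and rows; `table_to_sentences` is a hypothetical helper and the sample table is made up for illustration.

```python
def table_to_sentences(header, rows):
    """Turn each table row into a 'column: value' line that can be embedded as text."""
    sentences = []
    for row in rows:
        pairs = [f"{col}: {val}" for col, val in zip(header, row)]
        sentences.append("; ".join(pairs))
    return sentences

# Illustrative table only, not taken from the PDF.
header = ["Model", "Type"]
rows = [["BERT", "encoder"], ["T5", "encoder-decoder"]]

for line in table_to_sentences(header, rows):
    print(line)
```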



## Metrics after deployment:
* UER (user engagement rate): measures session-level performance, i.e. how likely users are to use the bot
* **Fallback and containment rates**: the % of queries solved by the chatbot. This can also include handoff rates, the % of the time a query was handed off to a human
* There are others, such as: # of chats, # of popular topics, avg # of questions, etc.
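As a rough sketch, these rates could be computed from logged sessions as below. The log format is an assumption: each session is a dict with a hypothetical `outcome` field set to `"resolved"`, `"fallback"`, or `"handoff"`. Exact definitions of these metrics vary between teams.

```python
def chatbot_rates(sessions):
    """Compute containment, fallback, and handoff rates as fractions of all sessions."""
    total = len(sessions)
    if total == 0:
        return {"containment": 0.0, "fallback": 0.0, "handoff": 0.0}
    outcomes = [s["outcome"] for s in sessions]
    return {
        "containment": outcomes.count("resolved") / total,  # solved by the bot alone
        "fallback": outcomes.count("fallback") / total,     # bot could not answer
        "handoff": outcomes.count("handoff") / total,       # passed to a human
    }

# Made-up log entries, for illustration only.
sessions = [
    {"outcome": "resolved"},
    {"outcome": "resolved"},
    {"outcome": "fallback"},
    {"outcome": "handoff"},
]
print(chatbot_rates(sessions))
```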






# CODE:

In [None]:
import os
HuggingFaceApi = os.environ['HUGGINGFACEHUB_API_TOKEN'] # read the token from the environment (e.g. loaded from a .env file); never hard-code it in the notebook

# Extracting text from PDFs

* First we get the text from our PDF file, then we break the text into chunks (with overlap).

* Then we store the chunks in a knowledge base, using LangChain.
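To show what "chunks with overlap" means, here is a simplified fixed-width sketch. It is not how LangChain's `CharacterTextSplitter` works internally (that one splits on a separator and merges the pieces); `sliding_chunks` is a hypothetical helper used only to illustrate the overlap.

```python
def sliding_chunks(text, chunk_size, overlap):
    """Split text into fixed-size character windows; neighbours share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks

print(sliding_chunks("abcdefghij", chunk_size=4, overlap=2))
```

The overlap means a sentence cut at one chunk boundary still appears whole in the neighbouring chunk, at the cost of storing some text twice.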

In [None]:
!pip install PyPDF2



In [None]:
!pip install InstructorEmbedding



In [None]:
!pip install sentence_transformers



In [None]:
!pip install langchain



In [None]:
# importing libs:
from PyPDF2 import PdfReader
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS # vector store that indexes the chunk embeddings for similarity search
from langchain.embeddings import HuggingFaceInstructEmbeddings

In [None]:
!pip install faiss-gpu



In [None]:
# get text from pdf
def get_text(file_name):
  text = ''
  reader = PdfReader(file_name)
  for page in reader.pages:
    text += page.extract_text() or '' # guard against pages with no extractable text

  return text

# get chunks (chunk size of 512 is a common choice; note that length_function=len counts characters, not tokens. The overlap keeps context across chunk boundaries)
def get_chunks(text):
  text_splitter = CharacterTextSplitter(
      separator='\n',
      chunk_size = 512,
      chunk_overlap = 200,
      length_function = len,
  )

  chunks = text_splitter.split_text(text)
  return chunks

# get embeddings of the chunks and return a knowledge base
def get_embeddings(chunks):
    emb = HuggingFaceInstructEmbeddings(model_name = "hkunlp/instructor-xl") # a strong Hugging Face embedding model; it is slower, but okay for this one pdf.
    know_base = FAISS.from_texts(texts=chunks, embedding = emb)
    return know_base

# testing and running part:
#file_name = '/content/Sample.pdf'
file_name = '/content/TanishDissertationFinal2.pdf'
text = get_text(file_name) # raw text from the pdf; as assumed above, the pdf is simple text. Images and graphs would need more complex preprocessing
chunks = get_chunks(text)
kb = get_embeddings(chunks)








load INSTRUCTOR_Transformer
max_seq_length  512


In [None]:
# now we have the instructor embeddings which are stored in kb
print(kb)
# next we ask a query and embed it with the same embedding model, so it can be matched against the chunks


<langchain.vectorstores.faiss.FAISS object at 0x7d8991cf68c0>


# Handling queries

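* Now that we have a knowledge base, we can ask a query, embed it, and match it against our KB by vector similarity (LangChain's FAISS store uses L2 distance by default, which ranks chunks the same way as cosine similarity when the embeddings are normalized). This gives us the most relevant chunks.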
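A small sketch of why the choice between L2 distance and cosine similarity does not change the ranking for unit-length vectors: for normalized `a` and `b`, `||a - b||^2 = 2 - 2 * cos(a, b)`. The vectors below are made up for illustration, and `normalize`, `cosine`, `l2`, and `rank` are hypothetical helpers, not library functions.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(a, b):
    """Cosine similarity; for unit vectors this is just the dot product."""
    return sum(x * y for x, y in zip(a, b))

def l2(a, b):
    """Euclidean (L2) distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank(query, vectors, score, reverse):
    """Return vector indices ordered from best to worst match."""
    return sorted(range(len(vectors)), key=lambda i: score(query, vectors[i]), reverse=reverse)

query = normalize([1.0, 0.2, 0.0])
kb_vectors = [normalize(v) for v in ([0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.5])]

by_cosine = rank(query, kb_vectors, cosine, reverse=True)  # higher similarity is better
by_l2 = rank(query, kb_vectors, l2, reverse=False)         # lower distance is better
print(by_cosine, by_l2)
```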

In [None]:
import os
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain
from langchain.llms import HuggingFaceHub




# build the conversational chain; the memory stores the chat history and gives context to the llm
def chat_maker(kb):
    memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True)
    chats = ConversationalRetrievalChain.from_llm(
        llm = HuggingFaceHub(
            repo_id="google/flan-t5-xxl",
            model_kwargs={"temperature":0.2, "max_length":1000},
            huggingfacehub_api_token=os.environ['HUGGINGFACEHUB_API_TOKEN'], # read the token from the environment instead of hard-coding it
        ),
        retriever=kb.as_retriever(),
        memory=memory,
    )
    return chats



In [None]:
convo = chat_maker(kb) # kb is the knowledge base we created



In [None]:
print('type "exit" to exit the chat')
while True:
  query = input("Ask a question about AI: ") # the document is about artificial intelligence
  if query == 'exit':
    break
  print()
  history = convo({'question': query})['chat_history'] # add the question to convo and get the updated chat history, which now ends with the new answer
  print(history[-1].content)
  print('\n\n')

type "exit" to exit the chat
Ask a question about AI: what is optical flow

the apparent movement of objects, edges, and surfaces between consecutive video frames



Ask a question about AI: what are the types?

Sparse and Dense optical flow



Ask a question about AI: What is sparse?

Motion vectors are calculated for specific objects or features in the frame, i.e., not for each pixel in the image.



Ask a question about AI: what is dense then?

dense optical flow



Ask a question about AI: What is the parallax?

objects closer to a viewpoint have a faster perceived motion when compared to objects further away



Ask a question about AI: what is raft?

a machine learning model for estimating optical flow



Ask a question about AI: exit
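
Follow-up questions in the transcript above, such as "what are the types?", only make sense together with the earlier turns. A conversational retrieval chain typically handles this by first rewriting the follow-up into a standalone question using the chat history. The sketch below shows the idea with a hypothetical prompt template; it is an assumption for illustration and not LangChain's actual prompt.

```python
def build_standalone_prompt(history, follow_up):
    """Build a prompt asking an LLM to rewrite a follow-up as a standalone question.

    `history` is a list of (speaker, text) pairs. The wording is a made-up template.
    """
    lines = [f"{speaker}: {text}" for speaker, text in history]
    return (
        "Given the conversation below, rewrite the follow-up question "
        "as a standalone question.\n\n"
        + "\n".join(lines)
        + f"\n\nFollow-up question: {follow_up}\nStandalone question:"
    )

history = [
    ("Human", "what is optical flow"),
    ("AI", "the apparent movement of objects, edges, and surfaces between consecutive video frames"),
]
prompt = build_standalone_prompt(history, "what are the types?")
print(prompt)
```

The rewritten question is then what gets embedded and matched against the knowledge base, so a weak rewrite can lead to a weak answer such as the "dense optical flow" reply above.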
