### QA Bot: A PDF QA Assistant

This notebook demonstrates how to use an LLM to build a question-answering (QA) bot for PDF files. The process breaks down into four steps:
1. Data Preprocessing - PDF files are processed to extract their textual content.
2. Content Chunking - The text content is divided into chunks of at most 512 characters, with up to 100 characters of overlap between consecutive chunks. (```CharacterTextSplitter``` measures length in characters by default, not tokens.)
3. Embeddings Generation - Each content chunk is transformed into an embedding using the ```e5-base-v2``` model. (Any other suitable embedding model can be used here.)
4. Building the QA bot - The ```google/flan-t5-large``` LLM was selected for this task.


Note that this demo uses a plain conversation buffer and does not enhance the conversational memory. Readers interested in this topic can refer to [4].

<img src="overview.png"  width="600" height="300">

### PDF file sources
- https://en.wikipedia.org/wiki/Large_language_model
- https://en.wikipedia.org/wiki/ChatGPT
- https://en.wikipedia.org/wiki/Stable_Diffusion
- https://en.wikipedia.org/wiki/Midjourney



#### References
[1] - Chunking Strategies for LLM Applications - https://www.pinecone.io/learn/chunking-strategies/

[2] - Massive Text Embedding Benchmark (MTEB) Leaderboard - https://huggingface.co/spaces/mteb/leaderboard

[3] - Conversational Retrieval QA - https://python.langchain.com/docs/modules/chains/popular/chat_vector_db

[4] - Conversational Memory for LLMs with Langchain - https://www.pinecone.io/learn/series/langchain/langchain-conversational-memory/

In [1]:
from PyPDF2 import PdfReader
import glob
import os

from transformers import AutoTokenizer, pipeline

from langchain import HuggingFacePipeline
from langchain.chains import ConversationalRetrievalChain
from langchain.chains.conversation.memory import ConversationSummaryMemory
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.memory import ConversationBufferMemory
from langchain.prompts import SystemMessagePromptTemplate, HumanMessagePromptTemplate, ChatPromptTemplate
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS

import warnings
warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')

### 1. Data Preprocessing

In [2]:
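file_content = []  # extracted text, one entry per PDF page
metadatas = []     # source file name for each page, aligned with file_content
fs = glob.glob("./files/*.pdf")  # all PDF files in the ./files folder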

for f in fs:
    reader = PdfReader(f)
    for p in reader.pages:
        file_content.append(p.extract_text())
        metadatas.append({"source": os.path.basename(f)})
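
Some pages (scanned images, for example) may yield no extractable text. The following is a minimal sketch of an optional cleanup pass, assuming `extract_text()` returns an empty or whitespace-only string (or `None`) for such pages. The helper name `drop_empty_pages` and the toy data are hypothetical and not part of the original notebook.

```python
def drop_empty_pages(texts, metas):
    """Keep only pages with non-whitespace text, keeping metadata aligned."""
    kept_texts, kept_metas = [], []
    for text, meta in zip(texts, metas):
        if text and text.strip():
            kept_texts.append(text)
            kept_metas.append(meta)
    return kept_texts, kept_metas


# Toy data standing in for file_content / metadatas (assumed values).
pages = ["Large language models ...", "", "   \n", "ChatGPT is ..."]
sources = [
    {"source": "a.pdf"},
    {"source": "a.pdf"},
    {"source": "b.pdf"},
    {"source": "b.pdf"},
]
clean_pages, clean_sources = drop_empty_pages(pages, sources)
print(len(clean_pages), clean_sources)
```

Filtering texts and metadata together matters because `create_documents` pairs them by position.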
    

### 2. Content Chunking

In [3]:
# Create the splitter; chunk_size and chunk_overlap are measured in characters
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=512,
    chunk_overlap=100,
)

lst_chunk = text_splitter.create_documents(file_content, metadatas=metadatas)
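
To illustrate what chunk size and overlap mean, here is a simplified sliding-window sketch in plain Python. It is not how `CharacterTextSplitter` works internally (that class first splits on the separator and then merges pieces up to the size limit), so real chunk boundaries will differ. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size=512, overlap=100):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks


# A synthetic 1200-character string used only for demonstration.
sample = "".join(str(i % 10) for i in range(1200))
chunks = sliding_window_chunks(sample)
print([len(c) for c in chunks])
```

The overlap means that a sentence cut at a chunk boundary still appears intact in at least one of the two neighbouring chunks, provided it is shorter than the overlap.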


### 3. Embeddings Generation

In [4]:
embeddings = HuggingFaceEmbeddings(model_name="intfloat/e5-base-v2") 

db = FAISS.from_documents(lst_chunk, embeddings)
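
Conceptually, retrieval compares the embedding of a question against the embedding of every chunk and returns the closest ones. The brute-force sketch below shows that idea with cosine similarity on toy vectors. It is an illustration only: the FAISS index built above may use a different metric (such as L2 distance) and a much faster search, and the names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return (index, score) pairs for the k most similar vectors."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional vectors; real e5-base-v2 embeddings have 768 dimensions.
doc_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
query_vec = [1.0, 0.1, 0.0]
results = top_k(query_vec, doc_vecs, k=2)
print(results)
```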

### 4. QA Bot Construction

In [5]:
model = "google/flan-t5-large"

tokenizer = AutoTokenizer.from_pretrained(model, padding_side="left")

retriever = db.as_retriever(search_kwargs={"k": 3, "score_threshold": 0.9}) # top 3, threshold: 0.9

pl = pipeline(
            model=model,
            tokenizer = tokenizer,
            model_kwargs={"max_length": 512, "temperature": 0.0},
            device_map="auto"
        )
hf_llm = HuggingFacePipeline(pipeline=pl)



Some weights of T5ForConditionalGeneration were not initialized from the model checkpoint at google/flan-t5-large and are newly initialized: ['decoder.embed_tokens.weight', 'encoder.embed_tokens.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [6]:
general_system_template = r""" 
Given a specific context, please give a short answer to the question. 
If you cannot find a proper answer from the context, just say that I don't know, don't try to make up an answer. 
 ----
{context}
----
"""
general_user_template = "Question:```{question}```"
messages = [
            SystemMessagePromptTemplate.from_template(general_system_template),
            HumanMessagePromptTemplate.from_template(general_user_template)
]
qa_prompt = ChatPromptTemplate.from_messages( messages )

memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

conversation_chain = ConversationalRetrievalChain.from_llm(
        llm=hf_llm,
        retriever=retriever, 
        memory=memory,
        combine_docs_chain_kwargs={'prompt': qa_prompt}
)
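
To make the role of the `{context}` and `{question}` placeholders concrete, here is a rough sketch of how a final prompt could be assembled from retrieved chunks. It uses plain string formatting and assumes the chunks are simply joined with blank lines; the chain's actual internal formatting may differ. `build_prompt` and the sample chunks are hypothetical stand-ins.

```python
system_template = (
    "Given a specific context, please give a short answer to the question.\n"
    "If you cannot find a proper answer from the context, just say that "
    "I don't know, don't try to make up an answer.\n"
    "----\n"
    "{context}\n"
    "----\n"
)
user_template = "Question:```{question}```"


def build_prompt(chunks, question):
    """Join retrieved chunks into one context string and fill both templates."""
    context = "\n\n".join(chunks)
    return system_template.format(context=context) + user_template.format(question=question)


# Placeholder chunks standing in for the retriever's top results.
retrieved = [
    "ChatGPT is a chatbot developed by OpenAI.",
    "It was released in November 2022.",
]
prompt = build_prompt(retrieved, "Who developed ChatGPT?")
print(prompt)
```

Because the model only sees what is placed in the context, an answer such as "I don't know" is the intended behaviour when none of the retrieved chunks covers the question.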




### Chat with QA Bot

In [7]:
conversation_chain({'question': "What's the ChatGPT?"}) 

{'question': "What's the ChatGPT?",
 'chat_history': [HumanMessage(content="What's the ChatGPT?", additional_kwargs={}, example=False),
  AIMessage(content='Chat Generative Pre-Trained Transformer', additional_kwargs={}, example=False)],
 'answer': 'Chat Generative Pre-Trained Transformer'}

In [8]:
conversation_chain({'question': "What's the latest engine version of ChatGPT?"}) 

{'question': "What's the latest engine version of ChatGPT?",
 'chat_history': [HumanMessage(content="What's the ChatGPT?", additional_kwargs={}, example=False),
  AIMessage(content='Chat Generative Pre-Trained Transformer', additional_kwargs={}, example=False),
  HumanMessage(content="What's the latest engine version of ChatGPT?", additional_kwargs={}, example=False),
  AIMessage(content='GPT-4', additional_kwargs={}, example=False)],
 'answer': 'GPT-4'}

In [9]:
conversation_chain({'question': "what is the major difference between ChatGPT 3.5 and 4.0"}) # the bot should reply "I don't know" when the context lacks the information

{'question': 'what is the major difference between ChatGPT 3.5 and 4.0',
 'chat_history': [HumanMessage(content="What's the ChatGPT?", additional_kwargs={}, example=False),
  AIMessage(content='Chat Generative Pre-Trained Transformer', additional_kwargs={}, example=False),
  HumanMessage(content="What's the latest engine version of ChatGPT?", additional_kwargs={}, example=False),
  AIMessage(content='GPT-4', additional_kwargs={}, example=False),
  HumanMessage(content='what is the major difference between ChatGPT 3.5 and 4.0', additional_kwargs={}, example=False),
  AIMessage(content="I don't know", additional_kwargs={}, example=False)],
 'answer': "I don't know"}