| Version | Date     | Creator          | Change description                               |
|---------|----------|------------------|--------------------------------------------------|
| v0.02   | 04/09/23 | Jaikishan Khatri | Generation with diff models, model comparison    |
| v0.01   | 03/09/23 | Jaikishan Khatri | Loader, Splitter, Storage, Retrieval, Generation |

# QA Chatbot for parsing Harry Potter books to generate answers

## Process

According to LangChain, the process for transforming unstructured raw data into a QA chain is as follows:

1. <b>Loading</b>: We must load our data first. Numerous sources can be used to load unstructured data. The LangChain integration portal currently has 157 Document Loaders. Each loader produces a LangChain Document as the data output.
2. <b>Splitting</b>: Documents are divided into splits of a predetermined size using text splitters. 
3. <b>Storage</b>: The splits are stored, usually after being embedded, in a storage backend (such as a vectorstore). The LangChain integration portal currently has 38 Embedding Models and 53 Vector Stores.
4. <b>Retrieval</b>: The app fetches splits from storage, typically those whose embeddings are most similar to the input query.
5. <b>Generation</b>: An LLM generates a response using a prompt that contains the query and the data that was retrieved. The LangChain integration portal currently has 68 LLMs and 13 Chat Models.
6. <b>Conversation</b> (Extension): Adds Memory to the QA chain to hold a multi-turn dialogue.

![LLM-QA-flowchart.jpeg](https://python.langchain.com/assets/images/qa_flow-9fbd91de9282eb806bda1c6db501ecec.jpeg)

Image source: [LangChain](https://python.langchain.com/assets/images/qa_flow-9fbd91de9282eb806bda1c6db501ecec.jpeg)

### Dependencies

### Imports

In [1]:
# loaders
from langchain.document_loaders import PyPDFLoader
# from langchain.document_loaders import PyMuPDFLoader
from langchain.document_loaders import DirectoryLoader

# text splitter
from langchain.text_splitter import RecursiveCharacterTextSplitter

# retrievers
from langchain.retrievers import SVMRetriever

# prompts
from langchain import PromptTemplate, LLMChain

# vector stores
from langchain.vectorstores import FAISS

# models
from langchain.llms import HuggingFacePipeline
from InstructorEmbedding import INSTRUCTOR
from langchain.embeddings import HuggingFaceInstructEmbeddings

import torch
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, BitsAndBytesConfig, XLMRobertaForCausalLM


import warnings
warnings.filterwarnings("ignore")


### Configuration
Tunable variables for the experiment.

In [2]:
class CFG:
    # LLMs
    model_name = 'llama2-7b' # mdeberta-v3, wizardlm, bloom, falcon, llama2-7b, llama2-13b, Photolens-llama-2-7b, xlm-roberta
    temperature = 0    # no trailing comma: a trailing comma would turn the value into a tuple
    top_p = 0.95
    repetition_penalty = 1.15    
    
    # splitting
    split_chunk_size = 1000
    split_overlap = 0
    
    # embeddings
    embeddings_model_repo = 'sentence-transformers/multi-qa-mpnet-base-dot-v1' 
    # 'sentence-transformers/all-MiniLM-L6-v2', 
    # 'sentence-transformers/multi-qa-MiniLM-L6-cos-v1'
    # 'sentence-transformers/multi-qa-mpnet-base-dot-v1'
    # 'sentence-transformers/multi-qa-distilbert-cos-v1'
    
    # retriever
    retriever_type = 'similarity_search' # 'similarity_search', 'MultiQueryRetriever', 'Max marginal relevance', 'SVMRetriever'
    
    # number of extracted passages
    k = 5
    
    # quantization config
#     quantization_config = BitsAndBytesConfig(llm_int8_enable_fp32_cpu_offload=True)
    
    # paths
    PDFs_path = './Data/HP books/'
    Embeddings_path =  './faiss_index_hp/'
    Persist_directory = './harry-potter-vectorstore/' 
    offload_folder = './offload_folder/'
    csv_path = './model_comparison/'
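
One pitfall worth checking in configuration classes of this kind: a trailing comma after a value silently turns it into a one-element tuple, which is then passed on to whatever consumes the setting. The class below is a hypothetical illustration, not part of the notebook.

```python
class Example:
    # Hypothetical settings, used only to show the difference
    with_comma = 0.95,     # trailing comma: this is the tuple (0.95,)
    without_comma = 0.95   # plain float

print(type(Example.with_comma).__name__)     # tuple
print(type(Example.without_comma).__name__)  # float
```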

In [3]:
def get_model(model = CFG.model_name):
    
    """
    Returns the tokenizer and model for the specified model name.
    
    Args:
        model (str): model name
        
    Returns:
        tokenizer (transformers.PreTrainedTokenizerBase): tokenizer for the specified model
        model (transformers.PreTrainedModel): quantized model loaded from the Hugging Face Hub
        max_len (int): maximum sequence length (in tokens) used for generation
    """
    
    print('\nDownloading model: ', model, '\n\n')

    if model == 'wizardlm':
        model_repo = 'TheBloke/wizardLM-7B-HF'
        
        tokenizer = AutoTokenizer.from_pretrained(model_repo)

        model = AutoModelForCausalLM.from_pretrained(
            model_repo,
            load_in_4bit=True,
            device_map='auto',
            torch_dtype=torch.float16,
            low_cpu_mem_usage=True
        )
        
        max_len = 1024
        
    # cuda error
    elif model == 'xlm-roberta':
        model_repo = 'IProject-10/xlm-roberta-base-finetuned-squad2'
        
        tokenizer = AutoTokenizer.from_pretrained(model_repo, use_fast=True)
        
        model = XLMRobertaForCausalLM.from_pretrained(
            model_repo,
            load_in_4bit=True,
            device_map='auto',
            torch_dtype=torch.float16,
            low_cpu_mem_usage=True,
            trust_remote_code=True
        )
        
        max_len = 2048
    
    elif model == 'Photolens-llama-2-7b':
        model_repo = 'Photolens/llama-2-7b-langchain-chat'
        
        tokenizer = AutoTokenizer.from_pretrained(model_repo, use_fast=True)
        
        model = AutoModelForCausalLM.from_pretrained(
            model_repo,
            load_in_4bit=True,
            device_map='auto',
            torch_dtype=torch.float16,
            low_cpu_mem_usage=True,
            trust_remote_code=True
        )
        
        max_len = 4096
        
    elif model == 'llama2-7b':
        model_repo = 'daryl149/llama-2-7b-chat-hf'
        
        tokenizer = AutoTokenizer.from_pretrained(model_repo, use_fast=True)

        model = AutoModelForCausalLM.from_pretrained(
            model_repo,
            load_in_4bit=True,
            device_map='auto',
            torch_dtype=torch.float16,
            low_cpu_mem_usage=True,
            trust_remote_code=True
        )
        
        max_len = 2048

    elif model == 'llama2-13b':
        model_repo = 'daryl149/llama-2-13b-chat-hf'
        
        tokenizer = AutoTokenizer.from_pretrained(model_repo, use_fast=True)

        model = AutoModelForCausalLM.from_pretrained(
            model_repo,
            load_in_4bit=True,
            offload_folder=CFG.offload_folder,
            device_map='auto',
            torch_dtype=torch.float16,
            low_cpu_mem_usage=True,
            trust_remote_code=True,
            quantization_config = BitsAndBytesConfig(llm_int8_enable_fp32_cpu_offload=True)
        )
        
        max_len = 8192

    elif model == 'bloom':
        model_repo = 'bigscience/bloom-7b1'
        
        tokenizer = AutoTokenizer.from_pretrained(model_repo)

        model = AutoModelForCausalLM.from_pretrained(
            model_repo,
            load_in_4bit=True,
            device_map='auto',
            torch_dtype=torch.float16,
            low_cpu_mem_usage=True,
        )
        
        max_len = 1024

    elif model == 'falcon':
        model_repo = 'h2oai/h2ogpt-gm-oasst1-en-2048-falcon-7b-v2'
        
        tokenizer = AutoTokenizer.from_pretrained(model_repo)

        model = AutoModelForCausalLM.from_pretrained(
            model_repo,
            load_in_4bit=True,
            device_map='auto',
            torch_dtype=torch.float16,
            low_cpu_mem_usage=True,
            trust_remote_code=True
        )
        
        max_len = 1024
    # tokenizer error
    elif model == 'mdeberta-v3':
        model_repo = 'timpal0l/mdeberta-v3-base-squad2'
        
        tokenizer = AutoTokenizer.from_pretrained(model_repo)

        model = AutoModelForCausalLM.from_pretrained(
            model_repo,
            load_in_4bit=True,
            device_map='auto',
            torch_dtype=torch.float16,
            low_cpu_mem_usage=True,
            trust_remote_code=True
        )
        
        max_len = 2048

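    else:
        # fail early instead of hitting an UnboundLocalError at the return statement
        raise ValueError(f"Model '{model}' is not implemented (tokenizer and backbone)")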

    return tokenizer, model, max_len
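
The branches of `get_model` differ mainly in the repo name, the maximum length, and a few loading flags. One possible way to keep such a mapping compact is a lookup table. The sketch below is an assumption about how that could look: `MODEL_REGISTRY` and `resolve_model` are hypothetical names, only a few entries are shown, and the repo names and lengths are copied from the function above.

```python
# Hypothetical lookup table; values are copied from get_model above.
MODEL_REGISTRY = {
    "wizardlm": {"repo": "TheBloke/wizardLM-7B-HF", "max_len": 1024},
    "llama2-7b": {"repo": "daryl149/llama-2-7b-chat-hf", "max_len": 2048},
    "llama2-13b": {"repo": "daryl149/llama-2-13b-chat-hf", "max_len": 8192},
    "bloom": {"repo": "bigscience/bloom-7b1", "max_len": 1024},
}


def resolve_model(name: str) -> dict:
    """Return the registry entry for `name`, failing loudly for unknown names."""
    try:
        return MODEL_REGISTRY[name]
    except KeyError:
        raise ValueError(f"Model '{name}' is not implemented") from None


print(resolve_model("llama2-7b")["repo"])
```

The entry returned here would then feed a single `from_pretrained` call, so per-model differences live in data rather than in repeated branches.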

In [4]:
%%time

tokenizer, model, max_len = get_model(model = CFG.model_name)


Downloading model:  llama2-7b 




CPU times: user 4.38 s, sys: 4.23 s, total: 8.61 s
Wall time: 11.7 s


### Pipeline

Create a pipeline for the model using HuggingFacePipeline

In [5]:
pipe = pipeline(
    task = "text-generation",
    model = model,
    tokenizer = tokenizer,
    pad_token_id = tokenizer.eos_token_id,
    max_length = max_len,
    temperature = CFG.temperature,
    top_p = CFG.top_p,
    repetition_penalty = CFG.repetition_penalty
)

llm = HuggingFacePipeline(pipeline = pipe)

In [6]:
llm

HuggingFacePipeline(cache=None, verbose=False, callbacks=None, callback_manager=None, tags=None, metadata=None, pipeline=<transformers.pipelines.text_generation.TextGenerationPipeline object at 0x7faf6c471a80>, model_id='gpt2', model_kwargs=None, pipeline_kwargs=None)

### Step 1: Loader
using PyPDFLoader

Load the PDFs using `pypdf`; the loader chunks by page.\
The loader also stores page numbers in metadata.

In [7]:
%%time

loader = DirectoryLoader(
    CFG.PDFs_path,
    glob="./*.pdf",
    loader_cls=PyPDFLoader,
    show_progress=True,
    use_multithreading=True
)

documents = loader.load()

100%|██████████| 7/7 [00:34<00:00,  4.91s/it]

CPU times: user 33.3 s, sys: 1.38 s, total: 34.7 s
Wall time: 34.4 s





In [8]:
len(documents)

4114

### Step 2: Splitter

In [9]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = CFG.split_chunk_size,
    chunk_overlap  = CFG.split_overlap,
    length_function = len,
    add_start_index = True,
)
all_splits = text_splitter.split_documents(documents)
len(all_splits)

8383
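
The jump from 4114 pages to 8383 splits is what one would expect when pages longer than `chunk_size` characters are cut into several pieces. The sketch below is a deliberately simplified fixed-window splitter, not the algorithm of `RecursiveCharacterTextSplitter` (which first tries to break on separators such as paragraph and line breaks). `split_text` is a hypothetical helper that only illustrates the roles of `chunk_size` and `chunk_overlap`.

```python
def split_text(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Cut `text` into windows of at most `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


page = "abcdefghij" * 25  # 250 characters standing in for one PDF page
chunks = split_text(page, chunk_size=100, chunk_overlap=0)
print([len(c) for c in chunks])  # [100, 100, 50]
```

With `chunk_overlap = 0`, as configured in `CFG`, the pieces do not share any text, so a sentence that straddles a boundary is split between two chunks.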

### Step 3: Store

In [10]:
def create_embeddings(embeddings_model_repo: str, embeddings_path: str = './faiss_index_hp', new_vectorstore: bool = False):
    """
    Builds a new FAISS vectorstore from `all_splits`, or loads a persisted one.

    Args:
        embeddings_model_repo (str): Hugging Face repo of the embeddings model
        embeddings_path (str): folder where the FAISS index is saved to / loaded from
        new_vectorstore (bool): if True, embed `all_splits` and persist the index

    Returns:
        vectorstore (FAISS), embeddings (HuggingFaceInstructEmbeddings)
    """

    # download embeddings model (the same wrapper is used for every repo)
    embeddings = HuggingFaceInstructEmbeddings(model_name = embeddings_model_repo,
                                               model_kwargs = {"device": "cuda"})

    if new_vectorstore:
        # create embeddings and a new vectorstore from the splits of Step 2
        vectorstore = FAISS.from_documents(documents = all_splits,
                                           embedding = embeddings)

        # persist vector database
        vectorstore.save_local(embeddings_path)

    else:
        # load the persisted vectorstore
        vectorstore = FAISS.load_local(embeddings_path, embeddings)

    return vectorstore, embeddings

In [11]:
vectorstore, embeddings = create_embeddings(CFG.embeddings_model_repo, CFG.Embeddings_path, new_vectorstore=True)

load INSTRUCTOR_Transformer
max_seq_length  512


In [12]:
question = "What are Hagrid's favourite animals?"
docs = vectorstore.similarity_search(question)
len(docs)

4

In [13]:
docs

[Document(page_content='found\tHagrid,\tdon’t\tyou\tthink?\tWhy\tdidn’t\tI\tsee\tit\tbefore?”\n\t\t\t\t\t\t“What\tare\tyou\ttalking\tabout?”\tsaid\tRon,\tbut\tHarry,\tsprinting\tacross\tthe\ngrounds\ttoward\tthe\tforest,\tdidn’t\tanswer.\n\t\t\t\t\t\tHagrid\twas\tsitting\tin\tan\tarmchair\toutside\this\thouse;\this\ttrousers\tand\nsleeves\twere\trolled\tup,\tand\the\twas\tshelling\tpeas\tinto\ta\tlarge\tbowl.\n\t\t\t\t\t\t“Hullo,”\the\tsaid,\tsmiling.\t“Finished\tyer\texams?\tGot\ttime\tfer\ta\tdrink?”\n\t\t\t\t\t\t“Yes,\tplease,”\tsaid\tRon,\tbut\tHarry\tcut\thim\toff.\n\t\t\t\t\t\t“No,\twe’re\tin\ta\thurry.\tHagrid,\tI’ve\tgot\tto\task\tyou\tsomething.\tYou\tknow\nthat\tnight\tyou\twon\tNorbert?\tWhat\tdid\tthe\tstranger\tyou\twere\tplaying\tcards\twith\nlook\tlike?”\n\t\t\t\t\t\t“Dunno,”\tsaid\tHagrid\tcasually,\t“he\twouldn’\ttake\this\tcloak\toff.”\n\t\t\t\t\t\tHe\tsaw\tthe\tthree\tof\tthem\tlook\tstunned\tand\traised\this\teyebrows.\n\t\t\t\t\t\t“It’s\tnot\tthat\tunusual,\tyeh\tg
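
`similarity_search` returns four documents here because `k` defaults to 4 when it is not passed. Conceptually, the store embeds the query and returns the `k` stored vectors that score best against it. The sketch below illustrates that idea with cosine similarity and made-up two-dimensional vectors; the scoring function, the vectors, and the names `cosine` and `top_k` are assumptions for illustration, and the metric FAISS actually uses depends on how the index is configured.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(query_vec: list[float], store: dict[str, list[float]], k: int = 4) -> list[str]:
    """Return the ids of the k stored vectors most similar to the query."""
    ranked = sorted(store, key=lambda doc_id: cosine(query_vec, store[doc_id]), reverse=True)
    return ranked[:k]


# Made-up 2-d "embeddings"; real embeddings have hundreds of dimensions.
store = {
    "hagrid_dragons": [0.9, 0.1],
    "quidditch_rules": [0.1, 0.9],
    "hagrid_hut": [0.8, 0.3],
}
print(top_k([1.0, 0.0], store, k=2))  # ['hagrid_dragons', 'hagrid_hut']
```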

### Step 4: Retrieve
Retriever options (only `similarity_search` is used below):\
similarity_search\
MultiQueryRetriever\
Max marginal relevance\
SVMRetriever
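
Of the options listed, max marginal relevance (MMR) differs from plain similarity search in that it penalises passages that are too similar to ones already selected. The sketch below is a minimal greedy version under stated assumptions: cosine similarity as the score, made-up two-dimensional vectors, and hypothetical names (`mmr`, `lambda_mult`, the passage ids). It is not LangChain's implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec, store, k=2, lambda_mult=0.5):
    """Greedy MMR: trade relevance to the query against redundancy with earlier picks."""
    selected = []
    candidates = list(store)
    while candidates and len(selected) < k:
        def score(doc_id):
            relevance = cosine(query_vec, store[doc_id])
            redundancy = max((cosine(store[doc_id], store[s]) for s in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Made-up vectors: two near-identical passages and one different passage.
store = {
    "passage_a": [1.0, 0.1],
    "passage_a_near_duplicate": [1.0, 0.0],
    "passage_b": [0.2, 1.0],
}
print(mmr([1.0, 0.2], store, k=2))  # ['passage_a', 'passage_b']
```

A plain top-2 similarity search on the same data would return both near-identical passages, whereas MMR swaps the second one for the less similar `passage_b`.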

### Prompt Template

In [14]:
prompt_template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Answer in the same language the question was asked.
Use three sentences maximum and keep the answer as concise as possible.

{context}

Question: {question}
Answer:"""


PROMPT = PromptTemplate(
    template = prompt_template, 
    input_variables = ["context", "question"]
)

In [15]:
retriever = vectorstore.as_retriever(search_type = "similarity", search_kwargs = {"k": CFG.k})

In [16]:
retriever

VectorStoreRetriever(tags=['FAISS'], metadata=None, vectorstore=<langchain.vectorstores.faiss.FAISS object at 0x7faf64106110>, search_type='similarity', search_kwargs={'k': 5})

### Step 5: Generate
Retriever chain for QA

In [None]:
from langchain.chains import RetrievalQA
# from langchain.llms import GPT4All

# llm_gpt = GPT4All(model='./models/nous-hermes-13b.ggmlv3.q4_0', max_tokens=4096, n_threads = 12)

# nous-hermes-13b.ggmlv3.q4_0
# GPT4All-13B-snoozy.ggmlv3.q4_0
# ggml-gpt4all-j-v1.3-groovy

qa_chain = RetrievalQA.from_chain_type(
#     llm = llm_gpt,
    llm = llm,
    chain_type = "stuff", # map_reduce, map_rerank, stuff, refine
    retriever = retriever, 
    chain_type_kwargs = {"prompt": PROMPT},
    return_source_documents = True,
    verbose = False
)
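
The `"stuff"` chain type places all retrieved passages into a single context string and substitutes it into the prompt. The sketch below mimics that step with plain `str.format`. The shortened template, the placeholder passages, and the helper `build_stuff_prompt` are assumptions for illustration, not the chain's internals.

```python
PROMPT_TEMPLATE = """Use the following pieces of context to answer the question at the end.

{context}

Question: {question}
Answer:"""


def build_stuff_prompt(passages: list[str], question: str) -> str:
    """Join all retrieved passages and substitute them into the template."""
    context = "\n\n".join(passages)
    return PROMPT_TEMPLATE.format(context=context, question=question)


# Placeholder passages standing in for the k retrieved splits
passages = ["Placeholder passage one.", "Placeholder passage two."]
prompt = build_stuff_prompt(passages, "Which are Hagrid's favorite animals?")
print(prompt)
```

Because every passage ends up in one prompt, `k` passages of up to `split_chunk_size` characters each, plus the generated answer, have to fit within the model's `max_len`.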

### Post process outputs

In [None]:
import time
import textwrap

In [None]:
def wrap_text_preserve_newlines(text, width=700):
    # Split the input text into lines based on newline characters
    lines = text.split('\n')

    # Wrap each line individually
    wrapped_lines = [textwrap.fill(line, width=width) for line in lines]

    # Join the wrapped lines back together using newline characters
    wrapped_text = '\n'.join(wrapped_lines)

    return wrapped_text


def process_llm_response(llm_response):
    ans = wrap_text_preserve_newlines(llm_response['result'])
    
    sources_used = ' \n'.join(
        [
            source.metadata['source'].split('/')[-1][:-4] + ' - page: ' + str(source.metadata['page'])
            for source in llm_response['source_documents']
        ]
    )
    
    ans = ans + '\n\nSources: \n' + sources_used
    return ans

In [None]:
def llm_ans(query):
    start = time.time()
    llm_response = qa_chain(query)
    ans = process_llm_response(llm_response)
    end = time.time()

    time_elapsed = int(round(end - start, 0))
    time_elapsed_str = f'\n\nTime elapsed: {time_elapsed} s'
    return ans + time_elapsed_str



### Compare models

In [None]:
query = "Which are Hagrid's favorite animals?"
# print(llm_ans(query))

# create a new dictionary
ans = qa_chain(query)['result']
ans_dict = {query: ans}
print(ans)

In [None]:
query = "Which challenges does Harry face during the Triwizard Tournament?"
# print(llm_ans(query))

# add to the dictionary (the chain is run once per question)
ans = qa_chain(query)['result']
ans_dict = {**ans_dict, **{query: ans}}
print(ans)

In [None]:
query = "Give me 5 examples of cool potions and explain what they do"
# print(llm_ans(query))

# add to the dictionary (the chain is run once per question)
ans = qa_chain(query)['result']
ans_dict = {**ans_dict, **{query: ans}}
print(ans)

In [None]:
query = "What did Gandalf do in the story?"
# print(llm_ans(query))

# add to the dictionary
ans = qa_chain(query)['result']
ans_dict = {**ans_dict, **{query: ans}}
print(ans)

In [None]:
query = "¿Cuál es la profesión de los padres de Harry Potter?"
# print(llm_ans(query))

# add to the dictionary
ans = qa_chain(query)['result']
ans_dict = {**ans_dict, **{query: ans}}
print(ans)

In [None]:
query = "Dame 5 ejemplos de pociones geniales y explica para qué sirven."
# print(llm_ans(query))

# add to the dictionary
ans = qa_chain(query)['result']
ans_dict = {**ans_dict, **{query: ans}}
print(ans)

In [None]:
query = "Was Gandalf in the Harry Potter books?"
# print(llm_ans(query))

# add to the dictionary
ans = qa_chain(query)['result']
ans_dict = {**ans_dict, **{query: ans}}
print(ans)

In [None]:
query = "Name all seven Weasley children."
# print(llm_ans(query))

# add to the dictionary
ans = qa_chain(query)['result']
ans_dict = {**ans_dict, **{query: ans}}
print(ans)

In [None]:
query = "Moony, Wormtail, Padfoot, and Prongs are code names for which four characters?"
# print(llm_ans(query))

# add to the dictionary
ans = qa_chain(query)['result']
ans_dict = {**ans_dict, **{query: ans}}
print(ans)

In [None]:
query = "What position does Harry play on the Gryffindor Quidditch team?"
# print(llm_ans(query))

# add to the dictionary
ans = qa_chain(query)['result']
ans_dict = {**ans_dict, **{query: ans}}
print(ans)

In [None]:
query = "Name the three different types of balls used in Quidditch."
# print(llm_ans(query))

# add to the dictionary
ans = qa_chain(query)['result']
ans_dict = {**ans_dict, **{query: ans}}
print(ans)

In [None]:
# exporting csv of result answers for model comparison

import pandas as pd
ans_df = pd.DataFrame.from_dict([ans_dict])

# keep only the model name after the organisation prefix, if there is one
model_repo_name = CFG.embeddings_model_repo.split('/')[-1]
    
ans_df.to_csv(CFG.csv_path + CFG.model_name + '_' + model_repo_name +'.csv', index=False)
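
The comparison cells above repeat the same three steps for every question. A loop over a list of questions would collect the same dictionary with less repetition. The sketch below assumes a hypothetical `fake_answer` function in place of `qa_chain(query)['result']` and uses the standard-library `csv` module in place of pandas, writing to an in-memory buffer instead of a file.

```python
import csv
import io


def collect_answers(questions, answer_fn):
    """Map each question to the answer returned by `answer_fn`."""
    return {question: answer_fn(question) for question in questions}


def fake_answer(question: str) -> str:
    # Hypothetical stand-in for qa_chain(query)['result']
    return f"(answer to: {question})"


questions = [
    "Which are Hagrid's favorite animals?",
    "Name all seven Weasley children.",
]
ans_dict = collect_answers(questions, fake_answer)

# One row per run and one column per question, like the DataFrame export above
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(ans_dict))
writer.writeheader()
writer.writerow(ans_dict)
print(buffer.getvalue())
```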

### Conversational Chatbot

In [17]:
from langchain.memory import ConversationBufferMemory
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
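
Conceptually, a buffer memory keeps every turn of the dialogue in order and hands the full history back to the chain on the next call. The class below is a hypothetical stand-in that shows the idea in plain Python. It is not the `ConversationBufferMemory` implementation, and the names `BufferMemory`, `save_turn`, and `as_text` are invented for this sketch.

```python
class BufferMemory:
    """Keeps every turn of the dialogue in order (hypothetical stand-in)."""

    def __init__(self):
        self.messages = []  # list of (role, content) tuples

    def save_turn(self, question: str, answer: str) -> None:
        """Append one question/answer pair to the history."""
        self.messages.append(("human", question))
        self.messages.append(("ai", answer))

    def as_text(self) -> str:
        """Render the history as one string, one message per line."""
        return "\n".join(f"{role}: {content}" for role, content in self.messages)


memory_sketch = BufferMemory()
memory_sketch.save_turn("What position does Harry play?", "Seeker.")
print(memory_sketch.as_text())
```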

In [18]:
custom_prompt_template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Answer in the same language the question was asked.
Use three sentences maximum and keep the answer as concise as possible.

{context}

Combine the chat history and follow up question into a standalone question. Keep the question same if there is no chat history.
Chat History: {chat_history}
Follow up question: {question}

Answer:"""


CUSTOM_QUESTION_PROMPT = PromptTemplate(
    template = custom_prompt_template, 
    input_variables = ["context", "chat_history", "question"]
)

In [24]:
from langchain.chains import ConversationalRetrievalChain

qa = ConversationalRetrievalChain.from_llm(
    llm = llm,
    retriever = retriever,
    condense_question_prompt=CUSTOM_QUESTION_PROMPT,
    memory=memory
)

ValidationError: 1 validation error for ConversationalRetrievalChain
chain_type_kwargs
  extra fields not permitted (type=value_error.extra)

In [None]:
from langchain.chains import ConversationalRetrievalChain


conv_qa_chain = RetrievalQA.from_chain_type(
#     llm = llm_gpt,
    llm = llm,
    chain_type = "stuff", # map_reduce, map_rerank, stuff, refine
    retriever = retriever, 
    chain_type_kwargs = {"prompt": PROMPT},
    return_source_documents = True,
    verbose = False
)

In [22]:
query = "What position does Harry play on the Gryffindor Quidditch team?"
print(qa(query))

{'question': 'What position does Harry play on the Gryffindor Quidditch team?', 'chat_history': [HumanMessage(content='What position does Harry play on the Gryffindor Quidditch team?', additional_kwargs={}, example=False), AIMessage(content=' Harry plays the position of Seeker on the Gryffindor House team.', additional_kwargs={}, example=False)], 'answer': ' Harry plays the position of Seeker on the Gryffindor House team.'}


In [23]:
query = "What are the names of the three balls?"
print(qa(query))

ValueError: Missing some input keys: {'context'}
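
The error above is consistent with the question-condensing step receiving only `chat_history` and `question`, while the custom template also expects `context`. The first query presumably succeeds because there is no chat history yet, so nothing has to be condensed. The sketch below reproduces the mismatch with plain `str.format` instead of LangChain. Both templates are shortened and `condense_template` is a hypothetical rewrite without `{context}`.

```python
# Shortened version of the custom template: it still expects {context}
template_with_context = """{context}

Chat History: {chat_history}
Follow up question: {question}

Answer:"""

# Hypothetical condense-only template: no {context} placeholder
condense_template = """Combine the chat history and follow up question into a standalone question.
Chat History: {chat_history}
Follow up question: {question}
Standalone question:"""

# The condensing step only has these two inputs available
inputs = {
    "chat_history": "human: What position does Harry play?\nai: Seeker.",
    "question": "What are the names of the three balls?",
}

try:
    template_with_context.format(**inputs)
except KeyError as missing:
    print("missing input:", missing)  # missing input: 'context'

print(condense_template.format(**inputs))
```

In LangChain, the answering prompt, which does need `{context}`, is normally supplied separately from the condense prompt. To the best of our knowledge `ConversationalRetrievalChain.from_llm` accepts it through `combine_docs_chain_kwargs`, but this should be checked against the installed version.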