| Version | Date     | Creator          | Change description                                 |
|---------|----------|------------------|----------------------------------------------------|
| v0.07   | 10/09/23 | Jaikishan Khatri | Documentation update, Spanish language integration |
| v0.06   | 08/09/23 | Jaikishan Khatri | Model performance and public release of repo       |
| v0.05   | 07/09/23 | Jaikishan Khatri | Model comparison and tuning for better outputs     |
| v0.04   | 06/09/23 | Jaikishan Khatri | Memory integration for chat functionality          |
| v0.03   | 05/09/23 | Jaikishan Khatri | Trial of different embedding models                |
| v0.02   | 04/09/23 | Jaikishan Khatri | Generation with different local LLMs               |
| v0.01   | 03/09/23 | Jaikishan Khatri | Loader, Splitter, Storage, Retrieval, Generation   |

# QA Chatbot for parsing Harry Potter books to generate answers

## Process  

Here are a few methods for restricting an AI to answers grounded in a particular dataset:  
- **Retrieval-based**: Build an indexing and retrieval system that searches only your dataset. The chatbot queries this search engine to find answers, while the chatbot itself can remain a standard conversational model. This pattern is commonly called retrieval-augmented generation (RAG).  
- **Fine-tuning**: Use your dataset to fine-tune an existing large language model, such as [Llama 2](https://ai.meta.com/llama/). This customizes the model while keeping its basic conversational capabilities.  
- **Modular approach**: Use a generic conversational model to interpret natural language and a separate module that queries your knowledge base to produce answers, so the two parts stay decoupled.
- **Conversational scoping**: Whichever primary method is used, some conversational scoping is usually needed as well, such as permitting only specific topics, steering off-topic questions back to the domain, or responding "I don't know" when a question is not relevant.  
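
As a rough illustration of conversational scoping, the sketch below answers only when the retriever found at least one sufficiently similar passage and otherwise falls back to "I don't know." The names `scoped_answer`, `fake_generate` and the `min_score` threshold are hypothetical and are not part of this notebook or of LangChain.

```python
def scoped_answer(scored_chunks, generate, min_score=0.5):
    """Answer only from passages whose similarity score passes a threshold.

    scored_chunks: list of (text, similarity) pairs from a retriever (assumed).
    generate: callable that turns a list of passages into an answer (assumed).
    """
    relevant = [text for text, score in scored_chunks if score >= min_score]
    if not relevant:
        # Nothing in the dataset is close enough to the question.
        return "I don't know."
    return generate(relevant)


def fake_generate(texts):
    # Stand-in for an LLM call, used only to keep the sketch self-contained.
    return f"Answer based on {len(texts)} passage(s)."


in_scope = scoped_answer([("passage about Quidditch", 0.82)], fake_generate)
out_of_scope = scoped_answer([("unrelated passage", 0.12)], fake_generate)
print(in_scope)
print(out_of_scope)
```

A suitable threshold depends on the embedding model and the distance metric, so it would have to be tuned on real questions.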

The following notebook focuses on the Retrieval-based model using [LangChain](https://python.langchain.com/) and [HuggingFace](https://huggingface.co/) models.

According to 🦜🔗*LangChain*, the process for transforming unstructured raw data into a QA chain is as follows:

1. **Loading**: We must load our data first. Numerous sources can be used to load unstructured data.
2. **Splitting**: Documents are divided into splits of a predetermined size using text splitters. 
3. **Storage**: The splits are stored, usually as embeddings, in storage (such as a vectorstore).
4. **Retrieval**: The app fetches splits from storage (typically those whose embeddings are most similar to the input query).
5. **Generation**: An LLM generates a response using a prompt that contains the query and the data that was retrieved. 
6. **Conversation** (Extension): Adds Memory to the QA chain to hold a multi-turn dialogue.

![LLM-QA-flowchart.jpeg](https://python.langchain.com/assets/images/qa_flow-9fbd91de9282eb806bda1c6db501ecec.jpeg)

Image source: [LangChain](https://python.langchain.com/assets/images/qa_flow-9fbd91de9282eb806bda1c6db501ecec.jpeg)

## Dependencies

Install the dependencies from `requirements.txt` and make sure a GPU is available through CUDA and cuDNN.

In [1]:
# pip install -r requirements.txt

## Imports

In [2]:
import pandas as pd
import os
import torch 

import warnings
warnings.filterwarnings("ignore")

from langchain.chains import RetrievalQA
from langchain.document_loaders import DirectoryLoader, PyPDFLoader
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline


In [3]:
# Following code runs on CUDA and CuDNN
# Check if GPU is available

torch.cuda.is_available()

True

## Configuration
Major hyperparameters that can be used to tune the output of LLMs:  
- `temperature` controls the randomness of the language model output; lower values make the output more deterministic.  
- `top_p`, also known as nucleus sampling, also controls randomness. It sets a probability threshold and keeps the smallest set of most likely tokens whose cumulative probability reaches that threshold. The model then samples from this set to generate output.  
- `max_length` is the maximum length, in tokens, of the prompt plus the generated text.
- `repetition_penalty` is the penalty applied to repeated tokens.
- `chunk_size` is a text splitter parameter that sets the size of the chunks created by splitting.
- `chunk_overlap` is a text splitter parameter that sets the overlap between consecutive chunks; overlap reduces the information lost when the splitter cuts a passage in the middle of a thought.
- `k` is the number of document chunks returned by the retriever and passed to the LLM as context.
- `search_type` selects one of the three main vectorstore retriever types: similarity, maximal marginal relevance (MMR) and similarity score threshold.
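
To make the `top_p` description concrete, here is a minimal sketch of nucleus filtering in plain Python. The function name `nucleus_filter` and the toy token probabilities are illustrative assumptions; this is not how `transformers` implements it internally, only the idea behind it.

```python
def nucleus_filter(probs, top_p=0.95):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches top_p, then renormalize so they sum to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Toy next-token distribution (made up for illustration).
toy_probs = {"wand": 0.5, "broom": 0.3, "owl": 0.15, "car": 0.05}
filtered = nucleus_filter(toy_probs, top_p=0.9)
print(filtered)
```

With `top_p=0.9` the least likely token (`car`) is dropped, and sampling then happens only among the remaining three.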

In [4]:
# working available models
available_models = {'vicuna': 'lmsys/vicuna-7b-v1.3',
                    'wizardlm': 'TheBloke/wizardLM-7B-HF',
                    'Photolens-llama-2-7b': 'Photolens/llama-2-7b-langchain-chat',
                    'llama2-7b': 'daryl149/llama-2-7b-chat-hf',
                    'bloom': 'bigscience/bloom-7b1',
                    'falcon': 'h2oai/h2ogpt-gm-oasst1-en-2048-falcon-7b-v2',
                    'Llama2-7b-es': 'clibrain/Llama-2-7b-ft-instruct-es-gptq-4bit'
                   }

class CFG:
    
    # Model name
    # other working models in this program currently include: wizardlm, bloom, falcon, llama2-7b, Photolens-llama-2-7b, vicuna
    model_name = 'vicuna'
    
    # using temperature 0.0 or 0.1 because we don't want the LLM to go out of context
    temperature = 0.1
    
    # using 0.95 to keep relevancy
    top_p = 0.95 
    
    # use 1 for no penalty (higher the number higher the penalty)
    repetition_penalty = 1 
    
    # document splitting
    split_chunk_size = 500
    split_overlap = 100
    
    # vector-based retriever search type
    # 'similarity', 'mmr', 'similarity_score_threshold' (requires search_kwargs={"score_threshold": .5})
    search_type = 'similarity' 
    
    # create a new vectorstore, False for using pre-built vectorstore, always True after changing embeddings parameters
    new_vectorstore = True
    
    # number of extracted passages using retriever
    k = 4
    
    # quantization config
#     quantization_config = BitsAndBytesConfig(llm_int8_enable_fp32_cpu_offload=True)
    
    # paths
    offload_folder = './offload_folder/'
    csv_path = './model_comparison/'
    
    # select language to select model parameters
    model_language = 'English' # 'English', 'Spanish'
    
    if model_language == 'English':
        
        # embeddings (larger models take up more RAM)
        embeddings_model_repo = 'sentence-transformers/all-MiniLM-L12-v2'
        
        # path of vectorstore
        embeddings_path =  './data/vectorstore-en'
        
        # path of pdf books in English
        pdfs_path = './data/hp-books-en/'
        
    elif model_language == 'Spanish':
        
        # embeddings (larger models take up more RAM)
        embeddings_model_repo = 'sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2'
        
        # path of vectorstore
        embeddings_path =  './data/vectorstore-es'
        
        # path of pdf books in Spanish
        pdfs_path = './data/hp-books-es/'
        
    else:
        raise ValueError('Language can only be "English" or "Spanish"')

## Load Model

- There are many ways to load models through APIs from different LLM providers, both paid ([ChatOpenAI](https://openai.com/product)) and free ([HuggingFaceHub](https://huggingface.co/inference-api)). 
Models can also be loaded locally through [PrivateGPT](https://github.com/imartinez/privateGPT), [llama.cpp](https://github.com/ggerganov/llama.cpp) or [GPT4ALL](https://github.com/nomic-ai/gpt4all).  

- Currently, I'm using `HuggingFacePipeline` to load models hosted at [HuggingFace](https://huggingface.co/) and run them locally, because my local machine has sufficient resources. 

- The notebook might not run in Colab because of the limited resources in Google's free tier. Required resources: 32 GB RAM and an 8 GB dedicated GPU, or multiple GPUs with HuggingFace Accelerate.
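
A back-of-the-envelope calculation shows why 4-bit loading matters on an 8 GB GPU. The sketch below estimates the memory taken by the weights alone; it ignores activations, the KV cache and quantization overhead, so real usage is higher. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for the model weights only, in GB (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 7e9  # nominal parameter count of a "7B" model
for label, bits in [("float16", 16), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gb(n_params, bits):.1f} GB")
```

Under these assumptions a 7B model needs roughly 14 GB of weights in float16 but only about 3.5 GB in 4-bit, which is why the models below are loaded with `load_in_4bit=True`.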

### Load HuggingFace models locally

In [5]:
def create_model(av_models, model_name = CFG.model_name):
    
    """ Returns the tokenizer, model and maximum sequence length for the specified model name.
    Models are currently selected for their ability to run on a local machine with 32 GB of RAM and 8 GB of GPU memory.
    
    Parameters
    ----------
    av_models : dict
        Mapping from short model names to HuggingFace repository IDs.
    model_name : str
        Short name of the model to be used (a key of av_models).
        
    Returns
    -------
    tokenizer : transformers.tokenization_utils_base.PreTrainedTokenizerBase
        Tokenizer for the specified model.
    model : transformers.modeling_utils.PreTrainedModel
        Model for the specified model.
    max_len : int
        Maximum sequence length, in tokens, passed to the generation pipeline.
    """
    
    if model_name == 'vicuna':
        model_repo = av_models[model_name]
        model_tokenizer = AutoTokenizer.from_pretrained(model_repo, use_fast=True)
        model_name = AutoModelForCausalLM.from_pretrained(
            model_repo,
            load_in_4bit=True,
            device_map='auto',
            torch_dtype=torch.float16,
            low_cpu_mem_usage=True,
            trust_remote_code=True
        )
        
        max_length = 4096    
        
    elif model_name == 'wizardlm':
        model_repo = av_models[model_name]
        model_tokenizer = AutoTokenizer.from_pretrained(model_repo)
        model_name = AutoModelForCausalLM.from_pretrained(
            model_repo,
            load_in_4bit=True,
            device_map='auto',
            torch_dtype=torch.float16,
            low_cpu_mem_usage=True
        )
        
        max_length = 4096  
        
        
    elif model_name == 'Photolens-llama-2-7b':
        model_repo = av_models[model_name]
        model_tokenizer = AutoTokenizer.from_pretrained(model_repo, use_fast=True)
        model_name = AutoModelForCausalLM.from_pretrained(
            model_repo,
            load_in_4bit=True,
            device_map='auto',
            torch_dtype=torch.float16,
            low_cpu_mem_usage=True,
            trust_remote_code=True
        )
        
        max_length = 4096
        
    elif model_name == 'llama2-7b':
        model_repo = av_models[model_name]
        model_tokenizer = AutoTokenizer.from_pretrained(model_repo, use_fast=True)
        model_name = AutoModelForCausalLM.from_pretrained(
            model_repo,
            load_in_4bit=True,
            device_map='auto',
            torch_dtype=torch.float16,
            low_cpu_mem_usage=True,
            trust_remote_code=True
        )
        
        max_length = 4096

    elif model_name == 'bloom':
        model_repo = av_models[model_name]
        model_tokenizer = AutoTokenizer.from_pretrained(model_repo)
        model_name = AutoModelForCausalLM.from_pretrained(
            model_repo,
            load_in_4bit=True,
            device_map='auto',
            torch_dtype=torch.float16,
            low_cpu_mem_usage=True,
        )
        
        max_length = 4096

    elif model_name == 'falcon':
        model_repo = av_models[model_name]
        model_tokenizer = AutoTokenizer.from_pretrained(model_repo)
        model_name = AutoModelForCausalLM.from_pretrained(
            model_repo,
            load_in_4bit=True,
            device_map='auto',
            torch_dtype=torch.float16,
            low_cpu_mem_usage=True,
            trust_remote_code=True
        )
        
        max_length = 4096
    
    elif model_name == 'Llama2-7b-es':
        model_repo = av_models[model_name]
        model_tokenizer = AutoTokenizer.from_pretrained(model_repo)
        model_name = AutoModelForCausalLM.from_pretrained(
            model_repo,
#             load_in_4bit=True,
            device_map='auto',
            torch_dtype=torch.float16,
            low_cpu_mem_usage=True,
            trust_remote_code=True
        )
        
        max_length = 4096
        
    else:
        raise ValueError('Incorrect Model Name')

    return model_tokenizer, model_name, max_length

In [6]:
%%time

tokenizer, model, max_len = create_model(available_models, model_name=CFG.model_name)

You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. If you see this, DO NOT PANIC! This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=True`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

CPU times: user 4.85 s, sys: 4.88 s, total: 9.73 s
Wall time: 12 s


### Pipeline

Create a pipeline for the model using `HuggingFacePipeline`

In [7]:
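### Comment this cell if you want to run GPT4ALL models locally
# Note: in transformers, temperature and top_p only take effect when sampling is enabled (do_sample=True)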

pipe = pipeline(
    task = "text-generation",
    model = model,
    tokenizer = tokenizer,
    pad_token_id = tokenizer.eos_token_id,
    max_length = max_len,
    temperature = CFG.temperature,
    top_p = CFG.top_p,
    repetition_penalty = CFG.repetition_penalty
)

llm = HuggingFacePipeline(pipeline = pipe)

### Load GPT4ALL models locally

- To run GPT4ALL models locally, the Python library `gpt4all` must be installed. Currently, GPT4ALL models run on CPU only.
- GPT4ALL models need their `.bin` files in the directory `./models/`. These can either be downloaded from the [model explorer](https://gpt4all.io/index.html) on the website or fetched automatically with the parameter `allow_download=True`.

In [8]:
# ### Uncomment to run GPT4ALL models (Comment the Pipeline from HuggingFace)

# from langchain.llms import GPT4All

# llm = GPT4All(model='./models/ggml-gpt4all-j-v1.3-groovy', 
#               allow_download=True, 
#               temp=CFG.temperature,
#               top_p=CFG.top_p,
#               repeat_penalty=CFG.repetition_penalty,
#               max_tokens=4096, 
#               n_threads = 12)

# ### Tested models on this notebook
# # nous-hermes-13b.ggmlv3.q4_0
# # GPT4All-13B-snoozy.ggmlv3.q4_0
# # ggml-gpt4all-j-v1.3-groovy

#### Print the model pipeline

In [9]:
llm

HuggingFacePipeline(cache=None, verbose=False, callbacks=None, callback_manager=None, tags=None, metadata=None, pipeline=<transformers.pipelines.text_generation.TextGenerationPipeline object at 0x7efcf7ef0d90>, model_id='gpt2', model_kwargs=None, pipeline_kwargs=None)

## QA Pipeline

### Step 1: Loading

- Load the data (here, the Harry Potter books as PDFs) from a directory and convert it into text.  
- `DirectoryLoader` loads every PDF document in the directory.  
- The benefit of `PyPDFLoader` is that it returns one document per page and stores the page number in the metadata, which can be used to reference the source files.
- Loading can be done in multiple ways, as described in the [LangChain Document Loaders](https://python.langchain.com/docs/modules/data_connection/document_loaders.html) section.

**Note:**
The LangChain integration portal currently has [157 Document Loaders](https://integrations.langchain.com/). Each loader produces a LangChain Document as the data output.

In [10]:
def get_raw_pdf(pdfs_path):
    
    """ Loads PDF documents from a directory and converts them into text.
     
    Parameters
    ----------
    pdfs_path : str
        Path to the directory containing PDF documents.
        
    Returns
    -------
    documents : list
        List of LangChain Documents.
    """
    
    loader = DirectoryLoader(
        pdfs_path,
        loader_cls=PyPDFLoader,
        glob="./*.pdf",
        show_progress=True,
        use_multithreading=True
    )
    documents = loader.load()
    return documents

### Step 2: Splitter

- Split the text into small, semantically meaningful chunks (often sentences) of a predefined size. This creates smaller pieces of text to embed and store in the vectorstore. 
- Chunks with related meaning end up close to each other in embedding space, which improves retrieval.
- The benefit of `RecursiveCharacterTextSplitter` is that it tries a list of separators in order (here paragraph breaks, line breaks, tabs, then spaces), falling back to the next separator until the chunks are small enough.
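
The effect of chunk size and overlap can be seen with a much simpler splitter than the one used in this notebook. The sketch below cuts fixed-size character windows; it is an illustration only and does not reproduce the separator logic of `RecursiveCharacterTextSplitter`. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=100):
    """Split text into fixed-size character windows that overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


chunks = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=4)
print(chunks)
```

Each chunk repeats the last four characters of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk when the overlap is large enough.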


In [11]:
def get_document_chunks(documents, split_chunk_size: int=500, split_overlap: int=0):
    
    """ Splits the documents into chunks of predefined size.
    
    Parameters
    ----------
    documents : list
        List of LangChain Documents.
    split_chunk_size : int
        Size of the chunks to be created from the documents.
    split_overlap : int
        Overlap between two chunks.
            
    Returns
    -------
    doc_chunks : list
        List of LangChain Documents.   
    """
        
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=split_chunk_size,
        chunk_overlap=split_overlap,
        separators = ["\n\n", "\n", "\t", " ", ""],
        length_function=len,
        add_start_index=True,
    )
    doc_chunks = text_splitter.split_documents(documents)
    return doc_chunks

### Step 3: Store

- Embedding models are used to embed sentences (sentence-embedding performance) and to embed search queries and paragraphs (semantic-search performance).  
- The current embedding model was selected from the list of best [pre-trained sentence transformers](https://www.sbert.net/docs/pretrained_models.html#sentence-embedding-models), which are hosted on HuggingFace.
- The benefit of the [FAISS vector store](https://python.langchain.com/docs/integrations/vectorstores/faiss) is that it performs fast similarity search over the embeddings, optionally on GPU, which is faster than many other vector stores. It also saves the index to a local directory so that it can be reused locally.

**Note:**
The LangChain integration portal currently has [53 VectorStores](https://integrations.langchain.com/vectorstores) and [39 Embedding Models](https://integrations.langchain.com/embeddings).
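
Conceptually, a vector store keeps one embedding per chunk and returns the chunks whose embeddings are closest to the query embedding. The sketch below shows this with hand-written three-dimensional vectors and cosine similarity; the vectors, the names `cosine_similarity` and `top_k`, and the choice of cosine similarity are assumptions for illustration, and FAISS itself uses optimized indexes rather than this brute-force loop.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """store: list of (text, vector) pairs. Return the k most similar texts."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy "embeddings" (made up; real ones have hundreds of dimensions).
toy_store = [
    ("chunk about Quidditch", [0.9, 0.1, 0.0]),
    ("chunk about potions", [0.1, 0.9, 0.1]),
    ("chunk about owls", [0.0, 0.2, 0.9]),
]
hits = top_k([1.0, 0.2, 0.0], toy_store, k=2)
print(hits)
```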

In [12]:
def _create_new_vectorstore(pdfs_path: str, split_chunk_size: int, split_overlap: int, 
                            embeddings_model_repo: str, embeddings_path: str):
    
    """ Creates a new vectorstore and embeddings model. 
    
    Parameters
    ----------
    pdfs_path : str
        Path to the directory containing PDF documents.
    split_chunk_size : int
        Size of the chunks to be created from the documents.
    split_overlap : int
        Overlap between two chunks.
    embeddings_model_repo : str
        Name of the embedding model to be used.
    embeddings_path : str
        Path to the directory where the vectorstore is to be stored.
        
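    Returns
    -------
    vec_store : langchain.vectorstores.faiss.FAISS
        Vectorstore containing the embeddings of the documents.
    model_embeddings : langchain.embeddings.HuggingFaceInstructEmbeddings
        Embedding model used to build the vectorstore.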
    """

    # load PDF documents
    documents = get_raw_pdf(pdfs_path)
    
    # split them into chunks
    doc_chunks = get_document_chunks(documents, split_chunk_size, split_overlap)
    
    # create an embedding model
    model_embeddings = HuggingFaceInstructEmbeddings(model_name = embeddings_model_repo,
                                                     model_kwargs = {"device": "cuda"})

    # create new_vectorstore
    vec_store = FAISS.from_documents(documents = doc_chunks,
                                     embedding = model_embeddings)

    # persist vector database
    vec_store.save_local(embeddings_path)
    
    return vec_store, model_embeddings

def _load_prev_vectorstore(embeddings_model_repo: str, embeddings_path: str):
    
    """ Loads a previously created vectorstore and embeddings model.
    
    Parameters
    ----------
    embeddings_model_repo : str
        Name of the embedding model to be used.
    embeddings_path : str
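        Path to the directory where the vectorstore is stored.
                
    Returns
    -------
    vec_store : langchain.vectorstores.faiss.FAISS
        Vectorstore containing the embeddings of the documents.
    model_embeddings : langchain.embeddings.HuggingFaceInstructEmbeddings
        Embedding model used to query the vectorstore.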
    """
    
    # download embeddings model
    model_embeddings = HuggingFaceInstructEmbeddings(model_name = embeddings_model_repo,
                                                     model_kwargs = {"device": "cuda"})

    # load vectorstore and embeddings
    vec_store = FAISS.load_local(embeddings_path, model_embeddings)
    
    return vec_store, model_embeddings


def create_vectorstore_embeddings(pdfs_path: str, split_chunk_size: int, split_overlap: int, 
                                  embeddings_model_repo: str, embeddings_path: str,
                                  new_vectorstore: bool=False):
    
    """ Creates a new vectorstore and embeddings model if new_vectorstore is True else loads a previously created vectorstore and embeddings model.
    
    Parameters
    ----------
    pdfs_path : str
        Path to the directory containing PDF documents.
    split_chunk_size : int
        Size of the chunks to be created from the documents.
    split_overlap : int
        Overlap between two chunks.
    embeddings_model_repo : str
        Name of the embedding model to be used.
    embeddings_path : str
        Path to the directory where the vectorstore is to be stored.
    new_vectorstore : bool
        If True, creates a new vectorstore and embeddings model else loads a previously created vectorstore and embeddings model.
        
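    Returns
    -------
    vec_store : langchain.vectorstores.faiss.FAISS
        Vectorstore containing the embeddings of the documents.
    model_embeddings : langchain.embeddings.HuggingFaceInstructEmbeddings
        Embedding model used with the vectorstore.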
    """
    
    if not new_vectorstore:
        if os.path.isfile(embeddings_path+'/index.faiss'):
            vec_store, model_embeddings = _load_prev_vectorstore(embeddings_model_repo, embeddings_path)
            
        else:
            vec_store, model_embeddings = _create_new_vectorstore(pdfs_path, split_chunk_size, split_overlap, embeddings_model_repo, embeddings_path)
    
    else:
        vec_store, model_embeddings = _create_new_vectorstore(pdfs_path, split_chunk_size, split_overlap, embeddings_model_repo, embeddings_path)
     
        
    return vec_store, model_embeddings

In [13]:
vectorstore, embeddings = create_vectorstore_embeddings(CFG.pdfs_path,
                                                        CFG.split_chunk_size,
                                                        CFG.split_overlap,
                                                        CFG.embeddings_model_repo,
                                                        CFG.embeddings_path,
                                                        CFG.new_vectorstore)

100%|██████████| 7/7 [00:34<00:00,  4.93s/it]


load INSTRUCTOR_Transformer
max_seq_length  512


### Step 4. Retrieve
- A retriever is an interface that returns documents given an unstructured query. A vector store can be used as a retriever to fetch relevant documents.

- Retrieval methods include similarity search, maximal marginal relevance (MMR), and similarity score threshold.  
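
Maximal marginal relevance differs from plain similarity search in that it also penalizes results that are too similar to ones already selected. The sketch below is a minimal greedy version on toy two-dimensional vectors; the names `mmr` and `cosine`, the `lambda_mult` weight and the vectors are assumptions for illustration, not the notebook's or LangChain's implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec, candidates, k=2, lambda_mult=0.5):
    """Greedily pick k items, trading relevance to the query against
    similarity to items that were already picked.

    candidates: list of (text, vector) pairs.
    lambda_mult: 1.0 means pure relevance, 0.0 means pure diversity.
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(item):
            relevance = cosine(query_vec, item[1])
            redundancy = max((cosine(item[1], s[1]) for s in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [text for text, _ in selected]


docs = [
    ("doc_a", [1.0, 0.0]),
    ("doc_a_near_copy", [0.99, 0.05]),
    ("doc_b", [0.6, 0.8]),
]
query = [1.0, 0.1]
diverse = mmr(query, docs, k=2, lambda_mult=0.5)
relevance_only = mmr(query, docs, k=2, lambda_mult=1.0)
print(diverse)
print(relevance_only)
```

With `lambda_mult=1.0` the two near-identical documents are both returned, while `lambda_mult=0.5` replaces the second one with the less similar `doc_b`.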

In [14]:
# search_type is an argument of as_retriever itself, not an entry in search_kwargs
retriever = vectorstore.as_retriever(search_type = CFG.search_type, search_kwargs = {"k": CFG.k})

### Custom Prompt

- Prompt engineering is the process of structuring text so that it can be interpreted and understood by a generative AI model. It is part of the tuning methodology for getting better outputs from LLMs.  
- The context returned by the retriever is passed into the `context` variable, the user's question is passed into the `question` variable, and `PromptTemplate` combines both into a custom prompt.
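
The filling of the two variables can be sketched with plain string formatting. The example below joins retrieved passages into one context string and substitutes it into a shortened template; `build_prompt`, `TEMPLATE` and the placeholder passages are hypothetical, and `PromptTemplate` together with the chain performs the equivalent steps in the notebook.

```python
TEMPLATE = """Use the following pieces of context to answer the question at the end.

{context}

Question: {question}
Answer:"""


def build_prompt(chunks, question, template=TEMPLATE):
    """Join retrieved chunks into one context string and fill the template."""
    context = "\n\n".join(chunks)
    return template.format(context=context, question=question)


retrieved = ["First retrieved passage.", "Second retrieved passage."]
prompt = build_prompt(retrieved, "What is Hermione's cat's name?")
print(prompt)
```

Because every retrieved chunk is pasted into the prompt, `k` and `chunk_size` together determine how much of the model's context window the prompt consumes.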


In [15]:
prompt_template_en = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Answer in the same language the question was asked.

{context}

Question: {question}
Answer:"""

prompt_template_es = """Utilice las siguientes piezas de contexto para responder la pregunta al final.
Si no sabe la respuesta, simplemente diga que no la sabe, no intente inventar una respuesta.
Mantenga su respuesta lo más concisa posible.

{context}

Pregunta: {question}
Respuesta:"""

if CFG.model_language =='English':
    prompt_template =  prompt_template_en
elif CFG.model_language == 'Spanish':
    prompt_template = prompt_template_es
else:
    raise ValueError('Language can only be "English" or "Spanish"')

PROMPT = PromptTemplate(
    template = prompt_template,
    input_variables=["context", "question"]
)

### Step 5: Generate
Create a response from the collected documents using the LLM/Chat model.

- The LangChain integration portal currently has [69 LLMs](https://integrations.langchain.com/llms) and [14 Chat Models](https://integrations.langchain.com/chat-models).
- The Conversation extension is implemented in `chatbotapp.py`.

In [16]:
qa_chain = RetrievalQA.from_chain_type(
    llm = llm,
    chain_type = "stuff", # map_reduce, map_rerank, stuff, refine
    retriever = retriever, 
    chain_type_kwargs = {"prompt": PROMPT},
    return_source_documents = True,
    verbose = False
)

## Compare models

In [17]:
def compare_model_ans(user_query, model_answer, answer_dict):
    
    """ Stores the model's answer to a query in a dictionary so that answers from different models can be compared.
    
    Parameters
    ----------
    user_query : str
        Query or question asked by the user.    
    model_answer : dict
        Answer returned by the model.
    answer_dict : dict
        Dictionary to store the answers from different questions.
        
    Returns
    -------
    ans_dict : dict
        Dictionary with answers from different questions.
    """
    
    if answer_dict is None:
        answer_dict = {user_query: model_answer['result']}
    else:
        answer_dict = {**answer_dict, **{user_query: model_answer['result']}}
    return answer_dict

In [18]:
%%time

ans_dict = None

query = "Which are Hagrid's favorite animals?"
ans = qa_chain(query)

ans_dict = compare_model_ans(query, ans, ans_dict)
print(ans['result'])

 Hagrid's favorite animals are magical creatures.
CPU times: user 1.11 s, sys: 398 ms, total: 1.51 s
Wall time: 1.54 s


In [19]:
%%time

query = "Which challenges does Harry face during the Triwizard Tournament?"

ans = qa_chain(query)
ans_dict = compare_model_ans(query, ans, ans_dict)
print(ans['result'])

 Harry faces three challenges during the Triwizard Tournament: the first challenge is a swimming race, the second challenge is a diving competition, and the third challenge is a dragon-riding competition.
CPU times: user 2.7 s, sys: 649 ms, total: 3.35 s
Wall time: 3.36 s


In [20]:
%%time

query = "Give me 5 examples of cool potions and explain what they do"
ans = qa_chain(query)

ans_dict = compare_model_ans(query, ans, ans_dict)
print(ans['result'])



1.	Unfogging Solution: This potion is used to remove the effects of a "fomorogue" curse, which causes the victim to be unable to speak the truth. The potion is made by mixing the powdered leaves of the "floccinous frog" plant with a small amount of the "unfogging draught" made from the "star-shaped fern" and "glimmerwater" plants. The resulting potion is then drunk by the victim, restoring their ability to speak the truth.
2.	Polyjuice Potion: This potion is used to transform one person into another, allowing them to take on the appearance and voice of the target. The potion is made by mixing the powdered leaves of the "sphinx mandrake" plant with a small amount of the "hair of the dragon" and "hair of the cat" plants. The resulting potion is then drunk by the person who wants to transform into the target.
3.	Amortentia Potion: This potion is used to help a witch or wizard remember a particular scent, which can be used to trigger a memory or a feeling of love. The potion is made by m

In [21]:
%%time

query = "Name all seven Weasley children."
ans = qa_chain(query)

ans_dict = compare_model_ans(query, ans, ans_dict)
print(ans['result'])

 Bill, Charlie, Percy, Ron, and Ginny Weasley.
CPU times: user 1.81 s, sys: 649 ms, total: 2.45 s
Wall time: 2.46 s


In [22]:
%%time

query = "What position does Harry play on the Gryffindor Quidditch team?"
ans = qa_chain(query)

ans_dict = compare_model_ans(query, ans, ans_dict)
print(ans['result'])

 Harry plays as the Keeper on the Gryffindor Quidditch team.
CPU times: user 1.8 s, sys: 694 ms, total: 2.49 s
Wall time: 2.49 s


In [23]:
%%time

query = "Name the three different types of balls used in Quidditch."
ans = qa_chain(query)

ans_dict = compare_model_ans(query, ans, ans_dict)
print(ans['result'])

 The three different types of balls used in Quidditch are the Quaffle, the Bludger, and the Golden Snitch.
CPU times: user 2.18 s, sys: 743 ms, total: 2.92 s
Wall time: 2.92 s


In [24]:
%%time

query = "What is Hermione's cat's name?"
ans = qa_chain(query)

ans_dict = compare_model_ans(query, ans, ans_dict)
print(ans['result'])

 Crookshanks
CPU times: user 974 ms, sys: 452 ms, total: 1.43 s
Wall time: 1.42 s


#### Out of scope questions

In [25]:
%%time

query = "What did Gandalf do in the story?"
ans = qa_chain(query)

ans_dict = compare_model_ans(query, ans, ans_dict)
print(ans['result'])

 Gandalf was a powerful wizard who helped the characters in the story. He was a friend of Bilbo Baggins and helped the dwarves in their quest to reclaim their treasure from the dragon Smaug. He also helped the characters defeat the dark lord Sauron and his armies.
CPU times: user 3.36 s, sys: 667 ms, total: 4.02 s
Wall time: 4.02 s


In [26]:
%%time

query = "Was Gandalf in the Harry Potter books?"
ans = qa_chain(query)

ans_dict = compare_model_ans(query, ans, ans_dict)
print(ans['result'])

 No, Gandalf was not in the Harry Potter books.
CPU times: user 891 ms, sys: 180 ms, total: 1.07 s
Wall time: 1.08 s


In [27]:
%%time

query = "Which insect is Ron afraid of?"
ans = qa_chain(query)

ans_dict = compare_model_ans(query, ans, ans_dict)
print(ans['result'])


 Ron is afraid of spiders.
CPU times: user 978 ms, sys: 429 ms, total: 1.41 s
Wall time: 1.41 s


In [28]:
%%time

query = "Who killed Dobby?"
ans = qa_chain(query)

ans_dict = compare_model_ans(query, ans, ans_dict)
print(ans['result'])

 Dobby killed himself by hanging himself with a rope he made from his own socks.
CPU times: user 1.75 s, sys: 521 ms, total: 2.27 s
Wall time: 2.27 s


In [29]:
%%time

query = "How many players are on a Quidditch team?"
ans = qa_chain(query)

ans_dict = compare_model_ans(query, ans, ans_dict)
print(ans['result'])


 There are seven players on a Quidditch team: three Chasers, two Beaters, a Keeper, and a Seeker.
CPU times: user 2.26 s, sys: 783 ms, total: 3.04 s
Wall time: 3.04 s


In [30]:
%%time

query = "How many possible Quidditch fouls are there?"
ans = qa_chain(query)

ans_dict = compare_model_ans(query, ans, ans_dict)
print(ans['result'])

 There are 700 ways of committing a Quidditch foul.
CPU times: user 1.81 s, sys: 689 ms, total: 2.5 s
Wall time: 2.5 s


#### Spanish language questions

## Export CSV Results

- Model results for the 10 predefined questions are exported as a CSV file to `csv_path`.
- An empty answer means that CUDA ran out of memory on that particular question because of too many tokens.
- The best model is saved in the `./model_comparison/best_model/` directory.

In [31]:
# exporting csv of result answers for model comparison
def export_results_to_csv(answer_dict, csv_path, model_name, embeddings_model_repo, search_type, 
                          temp, top_p, r_penalty, chunk_size, overlap):
    
    """ Exports the answers from different models to a CSV file.
        
    Parameters
    ----------
    answer_dict : dict
        Dictionary with answers from different questions.
    csv_path : str
        Path to the directory where the CSV file is to be stored.
    model_name : str
        Name of the model used.
    embeddings_model_repo : str
        Name of the embedding model used.
    search_type : str
        Type of search used by the retriever.
    temp : float
        Temperature used by the LLM.
    top_p : float
        Top_p used by the LLM.
    r_penalty : float
        Repetition penalty used by the LLM.
    chunk_size : int
        Size of the chunks to be created from the documents.
    overlap : int
        Overlap between two chunks.
    """
    
    # one-row DataFrame with one column per question
    ans_df = pd.DataFrame([answer_dict])

    # '/' in the repo name would otherwise be read as a path separator
    embeddings_model_repo = embeddings_model_repo.replace('/', '--')

    # build the file name once so the saved file and the log message always match
    file_name = (f"{model_name}_{embeddings_model_repo}_{search_type}_"
                 f"({temp}_{top_p}_{r_penalty}_{chunk_size}_{overlap}).csv")

    ans_df.to_csv(csv_path + file_name, index=False)

    print('Model results saved at ' + csv_path + ' with the name ' + file_name)

In [32]:
export_results_to_csv(ans_dict,
                      CFG.csv_path,
                      CFG.model_name, 
                      CFG.embeddings_model_repo, 
                      CFG.search_type,
                      CFG.temperature, 
                      CFG.top_p, 
                      CFG.repetition_penalty,
                      CFG.split_chunk_size,
                      CFG.split_overlap
                     )

Model results saved at ./model_comparison/ with the name vicuna_sentence-transformers--all-MiniLM-L12-v2_similarity_(0.1_0.95_1_500_100).csv


## Model performance

- Because the data has no ground truth (the questions are not open domain), measuring LLM performance with Exact-Match accuracy (EM) and F1 score is not feasible ([reference](https://aclanthology.org/2023.acl-long.307.pdf)).
- One way to overcome this is to label the data (here, the Harry Potter books) with ground-truth answers and analyse the results with [Semantic Answer Similarity (SAS)](https://arxiv.org/abs/2108.06130), but labelling is an extensive process in itself.
- Another method is to rank the answers of different models using a separate, standalone LLM. This is the basis of [AlpacaEval](https://github.com/tatsu-lab/alpaca_eval?ref=radekosmulski.com); current model rankings are posted on its [leaderboard](https://tatsu-lab.github.io/alpaca_eval/).
- Currently, model performance is judged by experts, who compare the quality of the answers to 10 predefined questions of different kinds: short questions, long questions, out-of-scope questions and questions in a different language. The model that generates the best responses is exported, together with its parameters, to run in `chatbotapp.py`.
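
If ground-truth answers were available, EM and token-level F1 could be computed along the following lines. This is a minimal sketch, not part of the notebook: the helper names (`normalize`, `exact_match`, `token_f1`) and the sample ground-truth label are assumptions made for illustration, and the normalization follows the common SQuAD-style convention of lowercasing and dropping punctuation and English articles.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, truth: str) -> int:
    """Return 1 if the normalized strings are identical, else 0."""
    return int(normalize(prediction) == normalize(truth))


def token_f1(prediction: str, truth: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    if not pred_tokens or not true_tokens:
        # two empty strings count as a match, a single empty one as a miss
        return float(pred_tokens == true_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


# hypothetical ground-truth label for one of the notebook's questions
truth = "Crookshanks"
print(exact_match(" Crookshanks", truth))
print(token_f1("Hermione's cat is named Crookshanks", truth))
```

The second call shows why EM alone is harsh on generative models: a correct but verbose answer fails the exact match and only gets partial credit from F1.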

In [33]:
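import glob
import os

results_path = './model_comparison/'

# sorted() keeps the row order reproducible across file systems
result_files = sorted(glob.glob(os.path.join(results_path, "*.csv")))

# use the file name without its extension as the row label;
# os.path.basename does not depend on how deep results_path is nested
index_list = [os.path.splitext(os.path.basename(f))[0] for f in result_files]

# each CSV holds one row, so the labels line up with the concatenated rows
results_pd = pd.concat((pd.read_csv(f) for f in result_files), ignore_index=False)
results_pd['model_name'] = index_list
results_pd = results_pd.set_index('model_name')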

In [34]:
results_pd

| model_name | Which are Hagrid's favorite animals? | Which challenges does Harry face during the Triwizard Tournament? | Give me 5 examples of cool potions and explain what they do | Name all seven Weasley children. | What position does Harry play on the Gryffindor Quidditch team? | Name the three different types of balls used in Quidditch. | What is Hermione's cat's name? | What did Gandalf do in the story? | Which insect is Ron afraid of? | Who killed Dobby? | How many players are on a Quidditch team? | How many possible Quidditch fouls are there? | ¿Cuál es la profesión de los padres de Harry Potter? | Moony, Wormtail, Padfoot, and Prongs are code names for which four characters? | Dame 5 ejemplos de pociones geniales y explica para qué sirven. | Was Gandalf in the Harry Potter books? |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `falcon_intfloat--multilingual-e5-large_similarity_(0.0_0.95_1_500_0)` | Dragons. | Harry faces many challenges during the Triwiz... | \n1. Healing Potion: This potion restores the ... | \nRon\nHermione\nGinny\nFred\nGeorge\nLuna | Harry plays as a Keeper. | The Quaffle, the Bludger, and the Golden Snitch. | Crookshanks | Gandalf is a wizard who helps Frodo and Sam o... | Cockroach | Harry Potter | There are seven players on a Quidditch team. | There are 10 possible Quidditch fouls. | Los padres de Harry Potter son profesores de ... |  |  |  |
| `falcon_intfloat--multilingual-e5-large_similarity_(0.0_0.95_1_500_200)` | Hippogriffs | Harry faces challenges during the Triwizard T... | \n1. Polyjuice Potion: This potion allows the ... | \nRon\nHermione\nGinny\nFred\nGeorge\nCharlie | Harry plays as the Keeper. | The three different types of balls used in Qu... | Crookshanks | Gandalf was a wizard who helped the Fellowshi... | The giant centipede. | Harry Potter | There are seven players on a Quidditch team. |  | Los padres de Harry Potter son profesores de ... | They are code names for the four friends of H... |  |  |
| `llama2-7b_intfloat--multilingual-e5-large_similarity_(0.1_0.95_1.15_800_0)` | Hagrid doesn't have a favorite animal. | During the Triwizard Tournament, Harry faces ... | Ah, excellent! \*adjusts spectacles\* Well, my ... | Fred, George, Ron, Charlies, Percy, Bill, and... | Seeker | The three different types of balls used in Qu... | Snowy | In the story, Gandalf went to the rescue of F... | Ron is afraid of spiders. | Harry Potter | There are 7 players on a Quidditch team. | There are seven hundred ways of committing a ... | I don't know. |  |  |  |
| `llama2-7b_intfloat--multilingual-e5-large_similarity_(0.5_0.95_1.15_800_0)` | "Well, so they say... I'd like a dragon." | During the Triwizard Tournament, Harry faces ... | Ah, excellent! \*adjusts spectacles\* Well, my ... | Ron, Fred, George, Charlus (Charlie), Percy, ... | Harry plays the position of Seeker on the Gry... | The three different types of balls used in Qu... |  | In the story, Gandalf went to the rescue of t... |  |  |  |  | Los padres de Harry Potter son abogados. |  | \nDame five examples of genius potions and exp... |  |
| `vicuna_intfloat--multilingual-e5-large_similarity_(0.0_0.95_1_500_100)` | Hagrid's favorite animals are dragons. | Harry faces three challenges during the Triwi... | \n\n1. Polyjuice Potion: This potion allows th... | George, Fred, Ron, Ginny, Bill, Charlie, and ... |  |  | Crookshanks | Gandalf is a wizard who helps the characters ... | Scabbers. | Snape killed Dobby. |  |  | Los padres de Harry Potter son muggles, perso... | Moony, Wormtail, Padfoot, and Prongs are code... |  |  |
| `vicuna_intfloat--multilingual-e5-large_similarity_(0.0_0.95_1_600_100)` | Hagrid's favorite animals are dragons. | Harry faces three challenges during the Triwi... | \n\n1. Polyjuice Potion: This potion allows th... | George, Fred, Ron, Ginny, Bill, Charlie, and ... |  |  | Crookshanks | Gandalf is a wizard who helps the characters ... | Scabbers. | Snape killed Dobby. |  |  | Los padres de Harry Potter son muggles, perso... | Moony, Wormtail, Padfoot, and Prongs are code... |  |  |
| `vicuna_intfloat--multilingual-e5-large_similarity_(0.0_0.95_1_800_0)` | Hagrid's favorite animals are dragons. | Harry faces three challenges during the Triwi... | \n\n1. Polyjuice Potion: This potion allows th... | \n\n1. \n2. \n3. \n4. \n5. \n6. \n7. |  | Quaffle, Bludger, and Golden Snitch. | I don't know. | Gandalf was a wizard who helped the character... | Ron is afraid of spiders. | Harry Potter killed Dobby. | Harry Potter and Colin Creevey. | There are seven hundred ways of committing a ... | Los padres de Harry Potter son muggles, perso... |  |  |  |
| `vicuna_sentence-transformers--all-MiniLM-L12-v2_similarity_(0.1_0.95_1_500_100)` | Hagrid's favorite animals are magical creatures. | Harry faces three challenges during the Triwi... | \n\n1.\tUnfogging Solution: This potion is use... | Bill, Charlie, Percy, Ron, and Ginny Weasley. | Harry plays as the Keeper on the Gryffindor Q... | The three different types of balls used in Qu... | Crookshanks | Gandalf was a powerful wizard who helped the ... | Ron is afraid of spiders. | Dobby killed himself by hanging himself with ... | There are seven players on a Quidditch team: ... | There are 700 ways of committing a Quidditch ... |  |  |  | No, Gandalf was not in the Harry Potter books. |
| `wizardlm_intfloat--multilingual-e5-large_similarity_(0.0_0.95_1.15_800_0)` | Dogs and dragons (according to him). | The challenges that Harry faces during the Tr... | I apologize, but as an AI assistant, I am not ... | George, Fred, Ron, Ginny, Arthur, Molly, and B... | Harry plays the position of Seeker on the Gry... | The three different types of balls used in Qu... | I'm sorry, but there is no mention of a cat o... | In the story, Gandalf used his knowledge and ... | I'm sorry, but I cannot provide an accurate a... | I'm sorry, but I cannot provide an accurate a... | In the game of Quidditch, each team consists ... | There are seven hundred ways of committing a ... | I'm sorry, but I cannot provide an accurate a... |  |  |  |
| `wizardlm_intfloat--multilingual-e5-large_similarity_(0.0_0.95_1.25_800_0)` | Dogs and dragons (according to him). | The challenges that Harry faces during the Tr... | I apologize for any confusion but it seems lik... | George, Fred, Ron, Arthur, Molly, Bill, Charlie. | Harry plays the position of Seeker on the Gry... | The three different types of balls used in Qu... | I am sorry, but there is no mention of a cat ... | In J. R. R. Tolkien's The Lord of the Rings, ... | I am sorry, but there is no information given... | I am sorry, but as an AI assistant, I do not ... | In the game of Quidditch, each team consists ... | There are seven hundred ways of committing a ... | I'm sorry, but there is no mention of any par... |  |  |  |


#### Insert best model according to your judgement

In [35]:
best_model = "vicuna_intfloat--multilingual-e5-large_similarity_(0.0_0.95_1_500_100)"
# best_model = None

In [36]:
# # to copy the CSV file to the best_model folder for reference
# import shutil
# original = './model_comparison/' + best_model + '.csv'
# target = './model_comparison/best_model/' + best_model + '.csv'
# shutil.copyfile(original, target)

#### Best model parameters export

In [37]:
# file name layout: model_embedder_searchtype_(temp_topp_penalty_chunksize_overlap)
# note: this relies on neither the model nor the embedder name containing '_'
parts = best_model.split('_')

# model name
bm_name = parts[0]

# model repository
bm_repo = available_models[bm_name]

# embedding model ('--' replaced '/' when the CSV was exported)
bm_embedder = parts[1].replace('--', '/')

# search_type
bm_search_type = parts[2]

# model temperature (drop the opening parenthesis of the parameter group)
bm_temp = parts[3].lstrip('(')

# model top_p
bm_top_p = parts[4]

# repetition penalty
bm_rep_penalty = parts[5]

# model split chunks
bm_split_chunks = parts[6]

# model split overlap (drop the closing parenthesis of the parameter group)
bm_split_overlap = parts[7].rstrip(')')

In [38]:
param = [bm_name, bm_repo, bm_embedder, bm_search_type, bm_temp, bm_top_p, bm_rep_penalty, bm_split_chunks, bm_split_overlap]
# write one parameter per line; the context manager closes the file even if a write fails
with open('./model_comparison/best_model/best-model-parameters.txt', 'w') as f:
    for p in param:
        f.write(p + "\n")
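
A consumer of this file needs to read the nine lines back in the same order and convert the numeric ones. The sketch below shows one way to do that; it is an assumption about how such a loader could look, not the code in `chatbotapp.py`. The function name `load_best_model_params`, the dictionary keys and the placeholder repository `some-org/some-model` are all hypothetical.

```python
import os
import tempfile

# Field order mirrors the `param` list written above; the converters are an
# assumption about which types a consumer would want.
FIELDS = [
    ("model_name", str),
    ("model_repo", str),
    ("embeddings_model_repo", str),
    ("search_type", str),
    ("temperature", float),
    ("top_p", float),
    ("repetition_penalty", float),
    ("split_chunk_size", int),
    ("split_overlap", int),
]


def load_best_model_params(path):
    """Read one value per line and convert each to its expected type."""
    with open(path) as f:
        lines = [line.rstrip("\n") for line in f]
    if len(lines) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} lines, found {len(lines)}")
    return {name: cast(value) for (name, cast), value in zip(FIELDS, lines)}


# round trip with placeholder values written to a temporary file
sample = ["vicuna", "some-org/some-model", "intfloat/multilingual-e5-large",
          "similarity", "0.0", "0.95", "1", "500", "100"]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "best-model-parameters.txt")
    with open(demo_path, "w") as f:
        f.write("\n".join(sample) + "\n")
    params = load_best_model_params(demo_path)

print(params["temperature"], params["split_chunk_size"])
```

Checking the line count up front turns a truncated or hand-edited file into an immediate, readable error instead of a silent mismatch between values and names.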

### Improvements
With more time and compute, results can likely be improved by tuning the following:
- Model hyperparameters such as temperature, top_p, repetition_penalty and max_length
- Different embedding models
- Retriever settings such as the search type (similarity, MMR, similarity threshold) and `k`, or a different retriever such as `SVMRetriever`
- Bigger models with higher performance
- Custom prompt engineering
- Other types of models can be tried to improve performance; the current models were taken from `HuggingFaceHub` and work well
- Splitting: chunk size and overlap can be adjusted to improve performance
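
One way to explore these parameters systematically is to enumerate every combination up front and run the notebook once per configuration. The sketch below only builds the list of configurations. The candidate values are illustrative rather than recommendations, and `iter_configs` and `is_valid` are hypothetical helpers that do not exist in the notebook.

```python
from itertools import product

# Candidate values are illustrative, not recommendations.
grid = {
    "temperature": [0.0, 0.1, 0.5],
    "repetition_penalty": [1.0, 1.15],
    "chunk_size": [500, 800],
    "overlap": [0, 100],
}


def iter_configs(grid):
    """Yield one dict per combination of the grid values."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


def is_valid(cfg):
    """Skip splits whose overlap is not smaller than the chunk size."""
    return cfg["overlap"] < cfg["chunk_size"]


configs = [cfg for cfg in iter_configs(grid) if is_valid(cfg)]
print(len(configs), "configurations to evaluate")
print(configs[0])
```

Each configuration would then be evaluated on the predefined questions and exported with `export_results_to_csv`, so every run ends up as one row in the comparison table.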

### Use cases
- Other PDFs can be imported and parsed for QA
- More languages can be supported by adding a separate LLM chain for translation, by using multilingual embedding models and retrievers, or by switching to an LLM that supports multiple languages. Fine-tuning models on datasets in other languages is also possible.