| Version | Date     | Creator          | Change description                               |
|---------|----------|------------------|--------------------------------------------------|
| v0.01   | 03/09/23 | Jaikishan Khatri | Loader, Splitter, Storage, Retrieval, Generation |

# QA Chatbot for parsing Harry Potter books to generate answers

## Pipeline

The pipeline for converting raw unstructured data into a QA chain looks like this:

1. <b>Loading</b>: First, load the data. Unstructured data can be loaded from many sources; the LangChain integration hub lists the full set of loaders. Each loader returns data as a LangChain Document.
2. <b>Splitting</b>: Text splitters break Documents into splits of a specified size.
3. <b>Storage</b>: Storage (often a vectorstore) houses the splits and usually embeds them.
4. <b>Retrieval</b>: The app retrieves splits from storage (often those whose embeddings are most similar to the embedding of the input question).
5. <b>Generation</b>: An LLM produces an answer from a prompt that includes the question and the retrieved data.
6. <b>Conversation</b> (Extension): Hold a multi-turn conversation by adding Memory to your QA chain.

![LLM-QA-flowchart.jpeg](https://python.langchain.com/assets/images/qa_flow-9fbd91de9282eb806bda1c6db501ecec.jpeg)

Image source: [LangChain](https://python.langchain.com/assets/images/qa_flow-9fbd91de9282eb806bda1c6db501ecec.jpeg)

### Dependencies

### Imports

In [85]:
# loaders
from langchain.document_loaders import PyPDFLoader
# from langchain.document_loaders import PyMuPDFLoader
from langchain.document_loaders import DirectoryLoader

# text splitter
from langchain.text_splitter import RecursiveCharacterTextSplitter

# retrievers
from langchain.retrievers import SVMRetriever

# prompts and chains
from langchain import PromptTemplate, LLMChain
from langchain.chains import RetrievalQA  # used in Step 5

# vector stores
from langchain.vectorstores import FAISS

# models
from langchain.llms import HuggingFacePipeline
from InstructorEmbedding import INSTRUCTOR
from langchain.embeddings import HuggingFaceInstructEmbeddings

### Configuration
Tunable variables for the experiment.

In [96]:
class CFG:
    # LLMs
    model_name = 'Photolens-llama-2-7b' # wizardlm, bloom, falcon, llama2-7b, llama2-13b, 'Photolens-llama-2-7b'
    temperature = 0  # no trailing comma, otherwise the value becomes the tuple (0,)
    top_p = 0.95
    repetition_penalty = 1.15
    
    # splitting
    split_chunk_size = 1000
    split_overlap = 0
    
    # embeddings
    embeddings_model_repo = 'sentence-transformers/all-MiniLM-L6-v2' # 'sentence-transformers/all-MiniLM-L6-v2', 
    # 'sentence-transformers/multi-qa-MiniLM-L6-cos-v1', GPT4AllEmbeddings()
    
    retriever_type = 'similarity_search' # 'similarity_search', 'MultiQueryRetriever', 'Max marginal relevance', 'SVMRetriever'
    
    # similar passages
    k = 5
    
    # quantization config
#     quantization_config = BitsAndBytesConfig(llm_int8_enable_fp32_cpu_offload=True)
    
    # paths
    PDFs_path = './Data/HP books/'
    Embeddings_path =  './faiss-hp-sentence-transformers/'
    Persist_directory = './harry-potter-vectorstore/' 
    offload_folder = './offload_folder/'
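One thing to watch when editing a configuration class like this: in Python, a trailing comma after a value turns it into a one-element tuple, which a model or pipeline expecting a number will usually reject or misread. The short sketch below demonstrates the effect; the class name `Demo` is hypothetical and only serves the illustration.

```python
class Demo:
    temperature = 0,   # trailing comma: this is the tuple (0,)
    top_p = 0.95       # no trailing comma: this is the float 0.95

# Inspect the resulting types
print(type(Demo.temperature).__name__, Demo.temperature)
print(type(Demo.top_p).__name__, Demo.top_p)
```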

### Step 1: Loader 
using PyPDFLoader

Load PDFs using `pypdf` and chunk at character level.\
The loader also stores page numbers in the metadata.

In [52]:
%%time

loader = DirectoryLoader(
    CFG.PDFs_path,
    glob="./*.pdf",
    loader_cls=PyPDFLoader,
    show_progress=True,
    use_multithreading=True
)

documents = loader.load()



CPU times: user 32.2 s, sys: 1.3 s, total: 33.5 s
Wall time: 33.4 s





In [53]:
len(documents)

4114

### Step 2: Splitter

In [56]:


text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = CFG.split_chunk_size,
    chunk_overlap  = CFG.split_overlap,
    length_function = len,
    add_start_index = True,
)
all_splits = text_splitter.split_documents(documents)
len(all_splits)

8383
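To make the roles of `chunk_size`, `chunk_overlap`, and `add_start_index` concrete, here is a minimal sketch of fixed-width character chunking. It is a simplified stand-in and not the algorithm `RecursiveCharacterTextSplitter` uses (the library prefers to cut at separators such as paragraph and line breaks). The helper name `chunk_text` and the sample text are assumptions made for illustration.

```python
from typing import Dict, List


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 0) -> List[Dict]:
    """Cut text into fixed-width character windows (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        # record where the chunk begins, similar in spirit to add_start_index
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


page = "abcdefghij" * 250  # 2,500 characters standing in for one page
no_overlap = chunk_text(page, chunk_size=1000, chunk_overlap=0)
with_overlap = chunk_text(page, chunk_size=1000, chunk_overlap=200)
print([c["start_index"] for c in no_overlap])
print([c["start_index"] for c in with_overlap])
```

With an overlap, consecutive chunks share text, so a sentence that straddles a boundary is more likely to appear whole in at least one chunk, at the cost of some duplicated content in the index.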

### Step 3: Store

In [93]:
def create_embeddings(embeddings_model_repo: str, Embeddings_path: str = './faiss_index_hp', new_vectorstore: bool = False):
    
    # only these model families are handled in this experiment
    if not embeddings_model_repo.startswith(('sentence-transformers', 'GPT4All')):
        raise ValueError(f'Unsupported embeddings model: {embeddings_model_repo}')

    ### download embeddings model (needed both to build and to reload the index)
    embeddings = HuggingFaceInstructEmbeddings(model_name = embeddings_model_repo,
                                               model_kwargs = {"device": "cuda"})

    if new_vectorstore:

        ### create embeddings and new vectorstore from the splits of Step 2
        vectorstore = FAISS.from_documents(documents = all_splits, 
                                           embedding = embeddings)

        ### persist vector database to the same path it is later loaded from
        vectorstore.save_local(Embeddings_path)
        
    else:

        ### load vectorstore embeddings
        vectorstore = FAISS.load_local(Embeddings_path, embeddings)
        
    return vectorstore, embeddings

In [None]:
vectorstore, embeddings = create_embeddings(CFG.embeddings_model_repo, CFG.Embeddings_path, new_vectorstore=False)

In [88]:
question = "What are Hagrid's favourite animals?"
docs = vectorstore.similarity_search(question)
len(docs)

4
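Four documents come back because `similarity_search` uses `k=4` when no `k` is passed. Conceptually, the search ranks stored vectors by their closeness to the question's vector and keeps the top `k`. The sketch below shows that idea as a brute-force search with cosine similarity; the metric, the helper names `cosine` and `top_k`, and the toy 2-D vectors are all assumptions for illustration, and the FAISS index may use a different distance measure.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=4):
    """Return indices of the k vectors most similar to the query (hypothetical helper)."""
    scored = sorted(enumerate(doc_vecs),
                    key=lambda pair: cosine(query_vec, pair[1]),
                    reverse=True)
    return [index for index, _ in scored[:k]]


# Toy 2-D "embeddings"; real embeddings have hundreds of dimensions
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
query = [1.0, 0.0]
print(top_k(query, toy_vectors, k=3))  # → [0, 1, 3]
```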

### Step 4. Retrieve
Retriever types considered (selected via `CFG.retriever_type`):\
similarity_search\
MultiQueryRetriever\
Max marginal relevance\
SVMRetriever
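Of these options, max marginal relevance (MMR) is the least self-explanatory: it picks documents one at a time, rewarding similarity to the question and penalising similarity to documents already picked. The sketch below implements the standard greedy MMR rule on toy vectors. It is an illustration under stated assumptions, not necessarily identical to the library's implementation; the names `mmr` and `lambda_mult` and the sample vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Greedy MMR: lambda_mult=1.0 is pure relevance, lower values favour diversity."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # similarity to the closest already-selected document (0 if none yet)
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                             default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Vectors 0 and 1 are near-duplicates; 2 and 3 point in other directions
vectors = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8], [0.0, 1.0]]
query = [1.0, 0.2]
print(mmr(query, vectors, k=2, lambda_mult=1.0))  # relevance only
print(mmr(query, vectors, k=2, lambda_mult=0.5))  # relevance and diversity
```

With `lambda_mult=1.0` the two near-duplicate vectors are returned together; lowering it makes the second pick move away from the first, which is the behaviour that helps when many passages in the books say almost the same thing.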

### Prompt Template

In [98]:
prompt_template = """
Don't try to make up an answer, if you don't know just say that you don't know.
Answer in the same language the question was asked. 
Use only the following pieces of context to answer the question at the end.

{context}

Question: {question}
Answer:"""


PROMPT = PromptTemplate(
    template = prompt_template, 
    input_variables = ["context", "question"]
)

In [99]:
retriever = vectorstore.as_retriever(search_kwargs = {"k": CFG.k, "search_type" : "similarity"})

In [100]:
retriever

VectorStoreRetriever(tags=['FAISS'], metadata=None, vectorstore=<langchain.vectorstores.faiss.FAISS object at 0x7f60d2bf7100>, search_type='similarity', search_kwargs={'k': 5, 'search_type': 'similarity'})

In [None]:
def get_retriever(vectorstore, all_splits, embeddings, retriever_type: str = 'similarity_search', k: int = 4):
    
    # returns the k splits whose embeddings are most similar to the question
    if retriever_type == 'similarity_search':
        retriever = vectorstore.as_retriever(search_type = 'similarity', search_kwargs = {'k': k})
    
    # selects for relevance and diversity among the retrieved documents
    elif retriever_type == 'Max marginal relevance':
        retriever = vectorstore.as_retriever(search_type = 'mmr', search_kwargs = {'k': k})
        
    # ranks the splits with an SVM (uses the retriever's own default k)
    elif retriever_type == 'SVMRetriever':
        retriever = SVMRetriever.from_documents(all_splits, embeddings)
        
    # generates variants of the input question to improve retrieval using llm
#     elif retriever_type == 'MultiQueryRetriever':
#         retriever_from_llm = MultiQueryRetriever.from_llm(retriever=vectordb.as_retriever(), llm=llm)
    
    else:
        raise ValueError('Incorrect Retriever Type')
        
    # every branch returns a retriever; call retriever.get_relevant_documents(question) to fetch documents
    return retriever

In [90]:
svm_retriever = SVMRetriever.from_documents(all_splits, embeddings)
docs_svm=svm_retriever.get_relevant_documents(question)
len(docs_svm)

4

In [91]:
docs_svm

[Document(page_content='breeding creatures he has dubbed “Blast-Ended \nSkrewts,” highly dangerous crosses between manti-\ncores and fire-crabs. The creation of new breeds of magical creature is, of course, an activity usually \nclosely observed by the Department for the Regu-\nlation and Control of Magical Creatures. Hagrid, however, considers himself to be above such petty \nrestrictions.', metadata={'source': 'Data/HP books/Harry Potter - Book 4 - The Goblet of Fire.pdf', 'page': 453, 'start_index': 992}),
 Document(page_content='brate, does not present much of a ch allenge; the mouse, as a mammal,', metadata={'source': 'Data/HP books/Harry Potter - Book 5 - The Order of the Phoenix.pdf', 'page': 335, 'start_index': 1932}),
 Document(page_content='“But\tthere\taren’t\twild\tdragons\tin\tBritain?”\tsaid\tHarry.\n\t\t\t\t\t\t“Of\tcourse\tthere\tare,”\tsaid\tRon.\t“Common\tWelsh\tGreen\tand\tHebridean\nBlacks.\tThe\tMinistry\tof\tMagic\thas\ta\tjob\thushing\tthem\tup,\tI\tcan\ttell\tyo

In [94]:
docs = vectorstore.similarity_search(question)

In [95]:
docs

[Document(page_content='“But\tthere\taren’t\twild\tdragons\tin\tBritain?”\tsaid\tHarry.\n\t\t\t\t\t\t“Of\tcourse\tthere\tare,”\tsaid\tRon.\t“Common\tWelsh\tGreen\tand\tHebridean\nBlacks.\tThe\tMinistry\tof\tMagic\thas\ta\tjob\thushing\tthem\tup,\tI\tcan\ttell\tyou.\tOur\nkind\thave\tto\tkeep\tputting\tspells\ton\tMuggles\twho’ve\tspotted\tthem,\tto\tmake\tthem\nforget.”\n\t\t\t\t\t\t“So\twhat\ton\tearth’s\tHagrid\tup\tto?”\tsaid\tHermione.', metadata={'source': 'Data/HP books/Harry Potter - Book 1 - The Sorcerers Stone.pdf', 'page': 166, 'start_index': 1964}),
 Document(page_content='Passersby\tstared\ta\tlot\tat\tHagrid\tas\tthey\twalked\tthrough\tthe\tlittle\ttown\tto\nthe\tstation.\tHarry\tcouldn’t\tblame\tthem.\tNot\tonly\twas\tHagrid\ttwice\tas\ttall\tas\nanyone\telse,\the\tkept\tpointing\tat\tperfectly\tordinary\tthings\tlike\tparking\tmeters\tand\nsaying\tloudly,\t“See\tthat,\tHarry?\tThings\tthese\tMuggles\tdream\tup,\teh?”\n\t\t\t\t\t\t“Hagrid,”\tsaid\tHarry,\tpanting\ta\tbit\

### Step 5: Generate
Retriever chain for QA

In [None]:
qa_chain = RetrievalQA.from_chain_type(
    llm = llm,
    chain_type = "stuff", # map_reduce, map_rerank, stuff, refine
    retriever = retriever, 
    chain_type_kwargs = {"prompt": PROMPT},
    return_source_documents = True,
    verbose = False
)
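The `"stuff"` chain type places all retrieved passages into the `{context}` slot of a single prompt and sends that prompt to the LLM in one call. The sketch below previews the resulting prompt with plain string formatting so the final layout can be inspected without loading a model. The helper `stuff_prompt`, the separator, the shortened template, and the placeholder passages are assumptions for illustration; they are not output from the retriever.

```python
prompt_template = """Use only the following pieces of context to answer the question at the end.

{context}

Question: {question}
Answer:"""


def stuff_prompt(template, passages, question, separator="\n\n"):
    """Join all passages into one context block and fill the template (hypothetical helper)."""
    context = separator.join(passages)
    return template.format(context=context, question=question)


# Placeholder passages standing in for retrieved splits
passages = ["First retrieved passage.", "Second retrieved passage."]
prompt = stuff_prompt(prompt_template, passages, "What are Hagrid's favourite animals?")
print(prompt)
```

Because everything goes into one prompt, `k` passages of about `split_chunk_size` characters each must fit in the model's context window together with the instructions and the question.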