
# Overview

- Use [LangChain](https://python.langchain.com/en/latest/index.html) to <font color='orange'>build a chatbot that can answer questions about books or any PDF files</font>.
- **<font color='orange'>Flexible and customizable RAG pipeline (Retrieval Augmented Generation)</font>**
- Experiment with various LLMs (Large Language Models)
- Use [FAISS vector store](https://python.langchain.com/docs/integrations/vectorstores/faiss) to store text embeddings created with [Sentence Transformers](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) from 🤗. FAISS can run on GPU, and creating the embeddings for this notebook took ~3 min with FAISS versus ~35 min with Chroma
- Use [Retrieval chain](https://python.langchain.com/docs/modules/data_connection/retrievers/) to retrieve relevant passages from embedded text
- Summarize retrieved passages
- Leverage Kaggle dual GPU (2 * T4) with [Hugging Face Accelerate](https://huggingface.co/docs/accelerate/index)
- Chat UI with [Gradio](https://www.gradio.app/guides/quickstart)

**<font color='green'>No need to create any API key to use this notebook! Everything is open source.</font>**


### Models

- [TheBloke/wizardLM-7B-HF](https://huggingface.co/TheBloke/wizardLM-7B-HF)
- [daryl149/llama-2-7b-chat-hf](https://huggingface.co/daryl149/llama-2-7b-chat-hf)
- [daryl149/llama-2-13b-chat-hf](https://huggingface.co/daryl149/llama-2-13b-chat-hf)
- [mistralai/Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)

In [None]:
! nvidia-smi -L

GPU 0: Tesla T4 (UUID: GPU-dac9e11b-1404-94f6-6bd3-0b2de97506dd)
GPU 1: Tesla T4 (UUID: GPU-fb9bdc1f-3dba-3590-94ec-96589fa3ed78)


# Installs

In [None]:
# nvidia-smi (NVIDIA System Management Interface) is a command-line utility built on the NVIDIA Management Library (NVML) for managing and monitoring NVIDIA GPU devices.
# sentence-transformers (a.k.a. SBERT) is the go-to Python module for accessing, using, and training state-of-the-art text and image embedding models.
# langchain is a framework for developing applications powered by large language models (LLMs).
# tiktoken is a fast open-source tokenizer by OpenAI.
# pypdf is a free and open-source pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files.
# faiss is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size.
# InstructorEmbedding is an instruction-finetuned text embedding model that can generate text embeddings tailored to any task (e.g., classification, retrieval, clustering, text evaluation) and domain (e.g., science, finance) by simply providing the task instruction, without any finetuning.
# transformers is the Hugging Face library for loading and running pretrained Transformer models, a deep learning architecture that uses an attention mechanism to process text sequences.
# accelerate handles the boilerplate needed to run PyTorch code on multiple GPUs, on TPUs, or in fp16, so you can keep writing your own training loop.
# bitsandbytes is a lightweight wrapper around CUDA custom functions, in particular 8-bit optimizers and quantization functions.
# langchain-community contains third-party integrations that implement the base interfaces defined in LangChain Core, making them ready to use in any LangChain application.

In [None]:

# Install the desired version of sentence_transformers within the virtual environment
!pip install sentence_transformers==2.2.2


In [None]:

# Clear the output to keep the notebook clean
from IPython.display import clear_output
clear_output()


sentence_transformers version 2.2.2 installed successfully!


In [None]:
! pip install -qq -U langchain

In [None]:
! pip install -qq -U tiktoken

In [None]:
! pip install -qq -U pypdf

In [None]:
! pip install -qq -U faiss-gpu

In [None]:
! pip install -qq -U InstructorEmbedding

In [None]:
! pip install -qq -U transformers

In [None]:
! pip install -qq -U accelerate

In [None]:
! pip install -qq -U bitsandbytes

In [None]:
!pip install -U langchain-community

# Imports

In [None]:
%%time

import warnings
warnings.filterwarnings("ignore")

CPU times: user 28 µs, sys: 0 ns, total: 28 µs
Wall time: 31.7 µs


In [None]:
# Importing the os module for interacting with the operating system
import os

# Importing the glob module to find all the pathnames matching a specified pattern
import glob

# Importing the textwrap module for formatting text
import textwrap

# Importing the time module to handle time-related tasks
import time

# Importing the langchain module for building language model chains
import langchain


In [None]:
# Importing PyPDFLoader for loading PDF documents
# (document loaders live in the langchain-community package installed above)
from langchain_community.document_loaders import PyPDFLoader

# Importing DirectoryLoader for loading all documents from a directory
from langchain_community.document_loaders import DirectoryLoader


langchain-community installed successfully!


In [None]:
### Importing modules for text splitting
from langchain.text_splitter import RecursiveCharacterTextSplitter

### Importing modules for prompts and LLM chains
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain


In [None]:
### Importing modules for vector stores
from langchain_community.vectorstores import FAISS

### Importing modules for models
from langchain_community.llms import HuggingFacePipeline
from langchain_community.embeddings import HuggingFaceInstructEmbeddings

### Importing modules for retrievers
from langchain.chains import RetrievalQA


In [None]:
# Importing the torch module for working with PyTorch, an open source machine learning library
import torch

# Importing the transformers module for working with Transformer models
import transformers

# Importing specific classes and functions from the transformers module
from transformers import (
    AutoTokenizer,        # For loading pre-trained tokenizers
    AutoModelForCausalLM, # For loading pre-trained causal language models
    BitsAndBytesConfig,   # For configuring quantization settings for models
    pipeline              # For creating inference pipelines
)

# Clearing the output (commonly used in Jupyter notebooks)
from IPython.display import clear_output
clear_output()


In [None]:
# Printing the versions of the imported libraries
print('langchain:', langchain.__version__)
print('torch:', torch.__version__)
print('transformers:', transformers.__version__)

langchain: 0.2.3
torch: 2.0.0
transformers: 4.41.2


In [None]:
# Finding and sorting all the file paths in the specified directory
sorted(glob.glob('/kaggle/input/books/*'))

['/kaggle/input/harrypotter2/harry-potter-and-the-deathly-hallows-j.k.-rowling.pdf',
 '/kaggle/input/harrypotter2/harry-potter-sorcerers-stone.pdf']

# CFG

- CFG class enables easy and organized experimentation

In [None]:
class CFG:
    # Language Model Configuration
    model_name = 'llama2-13b-chat'  # Options: 'wizardlm', 'llama2-7b-chat', 'llama2-13b-chat', 'mistral-7B'
    temperature = 0  # Controls randomness of model's outputs
    top_p = 0.95  # Controls diversity of model's outputs
    repetition_penalty = 1.15  # Penalizes model for repeating itself in the output

    # Text Splitting Configuration
    split_chunk_size = 800  # Size of chunks to split text into for processing
    split_overlap = 0  # Overlap between chunks

    # Embeddings Configuration
    embeddings_model_repo = 'sentence-transformers/all-MiniLM-L6-v2'  # Repository for the embeddings model

    # Similar Passages Configuration
    k = 6  # Number of similar passages to retrieve

    # File Paths
    PDFs_path = '/kaggle/input/books/'  # Path to the PDF files containing the text data
    Embeddings_path = '/kaggle/input/faiss-hp-sentence-transformers'  # Path to the embeddings data
    Output_folder = './books-vectordb'  # Folder to save the output data


# Define model

In [None]:
def get_model(model=CFG.model_name):
    """
    Downloads and initializes a specific model based on the `model` parameter.

    Args:
        model (str): The name of the model to use. Defaults to `CFG.model_name`.

    Returns:
        tuple: A tuple containing the initialized tokenizer, model, and `max_len` parameter.
    """
    # Keep the requested name in its own variable: `model` is rebound to the
    # loaded model object below, so string comparisons must not use it
    model_name = model

    # Print a message indicating which model is being downloaded
    print('\nDownloading model:', model_name, '\n\n')

    # Default values for tokenizer, model, and max_len
    tokenizer, model, max_len = None, None, None

    # 4-bit NF4 quantization settings, shared by every supported model
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,
    )

    # Check if the model is 'wizardlm'
    if model_name == 'wizardlm':
        # Set the model repository for 'wizardlm'
        model_repo = 'TheBloke/wizardLM-7B-HF'

        # Initialize the tokenizer for 'wizardlm'
        tokenizer = AutoTokenizer.from_pretrained(model_repo)

        # Initialize the model for 'wizardlm'
        model = AutoModelForCausalLM.from_pretrained(
            model_repo,
            quantization_config=bnb_config,
            device_map='auto',
            low_cpu_mem_usage=True
        )

        # Set the maximum length for 'wizardlm'
        max_len = 1024

    # Check if the model is 'llama2-7b-chat' or 'llama2-13b-chat'
    elif model_name in ['llama2-7b-chat', 'llama2-13b-chat']:
        # The repositories are named 'llama-2-...', e.g. daryl149/llama-2-13b-chat-hf
        model_repo = 'daryl149/' + model_name.replace('llama2', 'llama-2') + '-hf'

        # Initialize the tokenizer for 'llama2-7b-chat' or 'llama2-13b-chat'
        tokenizer = AutoTokenizer.from_pretrained(model_repo, use_fast=True)

        # Initialize the model for 'llama2-7b-chat' or 'llama2-13b-chat'
        model = AutoModelForCausalLM.from_pretrained(
            model_repo,
            quantization_config=bnb_config,
            device_map='auto',
            low_cpu_mem_usage=True,
            trust_remote_code=True
        )

        # Set the maximum length for 'llama2-7b-chat' or 'llama2-13b-chat'
        # Note: Llama 2 has a 4096-token context window, so the 8192 used for
        # the 13B model exceeds it; lower this value if long generations degrade
        max_len = 2048 if model_name == 'llama2-7b-chat' else 8192

    # Check if the model is 'mistral-7B'
    elif model_name == 'mistral-7B':
        # Set the model repository for 'mistral-7B'
        # This is the base checkpoint; the Models list above links the
        # instruction-tuned 'mistralai/Mistral-7B-Instruct-v0.2'
        model_repo = 'mistralai/Mistral-7B-v0.1'

        # Initialize the tokenizer for 'mistral-7B'
        tokenizer = AutoTokenizer.from_pretrained(model_repo)

        # Initialize the model for 'mistral-7B'
        model = AutoModelForCausalLM.from_pretrained(
            model_repo,
            quantization_config=bnb_config,
            device_map='auto',
            low_cpu_mem_usage=True
        )

        # Set the maximum length for 'mistral-7B'
        max_len = 1024

    # Handle the case when the model is not implemented
    else:
        print("Not implemented model (tokenizer and backbone)")

    # Return the initialized tokenizer, model, and max_len
    return tokenizer, model, max_len


In [None]:
%%time
# Measure the execution time of this cell (%%time must be the first line of the cell)

# Get the tokenizer, model, and max_len using the get_model function with model set to CFG.model_name
tokenizer, model, max_len = get_model(model=CFG.model_name)

# Clear the output of the cell
clear_output()


CPU times: user 8.75 s, sys: 25.7 s, total: 34.4 s
Wall time: 1min 54s
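A rough back-of-the-envelope calculation shows why 4-bit quantization matters on two T4 GPUs (16 GB each). The sketch below counts weight storage only and uses the nominal parameter counts (7e9 and 13e9) as assumptions; it ignores quantization constants, layers that stay unquantized, activations, and the KV cache, so real usage is higher.

```python
def estimate_weight_memory_gib(n_params, bits_per_param):
    """Approximate memory needed to store the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3

# Nominal parameter counts, assumed for illustration
for name, n_params in [("7B", 7e9), ("13B", 13e9)]:
    fp16 = estimate_weight_memory_gib(n_params, 16)
    nf4 = estimate_weight_memory_gib(n_params, 4)
    print(f"{name}: fp16 ~{fp16:.1f} GiB, 4-bit ~{nf4:.1f} GiB")
```

Under these assumptions the 13B model needs roughly 24 GiB in fp16, more than a single T4 offers, but only about 6 GiB in 4-bit.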


In [None]:
# model.eval() puts the model in evaluation mode. In PyTorch this matters for layers such as Dropout and BatchNorm, which behave differently during training and inference.
model.eval()

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 5120, padding_idx=0)
    (layers): ModuleList(
      (0-39): 40 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (k_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (v_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (o_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=5120, out_features=13824, bias=False)
          (up_proj): Linear4bit(in_features=5120, out_features=13824, bias=False)
          (down_proj): Linear4bit(in_features=13824, out_features=5120, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    

In [None]:
### check how Accelerate split the model across the available devices (GPUs)
model.hf_device_map

{'model.embed_tokens': 0,
 'model.layers.0': 0,
 'model.layers.1': 0,
 'model.layers.2': 0,
 'model.layers.3': 0,
 'model.layers.4': 0,
 'model.layers.5': 0,
 'model.layers.6': 0,
 'model.layers.7': 0,
 'model.layers.8': 0,
 'model.layers.9': 0,
 'model.layers.10': 0,
 'model.layers.11': 0,
 'model.layers.12': 0,
 'model.layers.13': 0,
 'model.layers.14': 0,
 'model.layers.15': 0,
 'model.layers.16': 0,
 'model.layers.17': 1,
 'model.layers.18': 1,
 'model.layers.19': 1,
 'model.layers.20': 1,
 'model.layers.21': 1,
 'model.layers.22': 1,
 'model.layers.23': 1,
 'model.layers.24': 1,
 'model.layers.25': 1,
 'model.layers.26': 1,
 'model.layers.27': 1,
 'model.layers.28': 1,
 'model.layers.29': 1,
 'model.layers.30': 1,
 'model.layers.31': 1,
 'model.layers.32': 1,
 'model.layers.33': 1,
 'model.layers.34': 1,
 'model.layers.35': 1,
 'model.layers.36': 1,
 'model.layers.37': 1,
 'model.layers.38': 1,
 'model.layers.39': 1,
 'model.norm': 1,
 'lm_head': 1}
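The device map above is long, so a small summary can help when comparing runs. This is a minimal sketch: `summarize_device_map` is a hypothetical helper, and `example_map` is rebuilt by hand from the output shown above rather than read from a live model. In the notebook you would pass `model.hf_device_map` instead.

```python
from collections import Counter

def summarize_device_map(device_map):
    """Count how many top-level modules were placed on each device."""
    return dict(Counter(device_map.values()))

# Rebuild the mapping shown above: layers 0-16 on GPU 0, layers 17-39 on GPU 1
example_map = {"model.embed_tokens": 0}
example_map.update({f"model.layers.{i}": 0 for i in range(17)})
example_map.update({f"model.layers.{i}": 1 for i in range(17, 40)})
example_map.update({"model.norm": 1, "lm_head": 1})

print(summarize_device_map(example_map))
```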

# 🤗 pipeline

- Hugging Face pipeline

In [None]:
### Create a Hugging Face pipeline for text generation
pipe = pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    pad_token_id=tokenizer.eos_token_id,
#     do_sample=True,
    max_length=max_len,
    temperature=CFG.temperature,
    top_p=CFG.top_p,
    repetition_penalty=CFG.repetition_penalty
)

### Create a langchain pipeline using the Hugging Face pipeline
llm = HuggingFacePipeline(pipeline=pipe)


In [None]:
llm

HuggingFacePipeline(pipeline=<transformers.pipelines.text_generation.TextGenerationPipeline object at 0x79de501eb6d0>)

In [None]:
%%time
### testing model, not using the books yet
### answer is not necessarily related to books
query = "Give me 5 examples of cool potions and explain what they do"
llm.invoke(query)

CPU times: user 33.4 s, sys: 125 ms, total: 33.5 s
Wall time: 33.5 s


'Give me 5 examples of cool potions and explain what they do.\n\nSure thing! Here are five examples of cool potions that you might find in a fantasy world, along with their effects:\n\n1. Potion of Healing: This potion restores health to the drinker, healing wounds and injuries. It might also grant temporary immunity to future damage or disease.\n2. Potion of Strength: This potion grants the drinker increased physical strength and endurance for a short period of time, allowing them to lift heavier objects, run faster, and fight longer.\n3. Potion of Speed: This potion allows the drinker to move at incredible speeds for a short period of time, making it easier to escape danger or chase down enemies.\n4. Potion of Invisibility: This potion makes the drinker temporarily invisible, allowing them to sneak past guards, avoid detection by monsters, or steal valuable items without being caught.\n5. Potion of Flight: This potion gives the drinker the ability to fly for a short period of time, a
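The output above starts by repeating the query, because a text-generation pipeline returns the prompt followed by the completion by default. A minimal way to drop the echoed prompt after the fact is sketched below; `strip_prompt` is a hypothetical helper and the strings are toy values. The `transformers` text-generation pipeline also accepts `return_full_text=False`, which is probably the cleaner fix, but it is not used in this notebook.

```python
def strip_prompt(generated_text, prompt):
    """Return only the continuation when the generated text starts with the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].lstrip()
    # Leave the text untouched if the prompt was not echoed
    return generated_text

toy_prompt = "Question: 2 + 2?\nAnswer:"
toy_output = toy_prompt + " 4"
print(strip_prompt(toy_output, toy_prompt))
```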

# 🦜🔗 LangChain

- Multiple document retriever with LangChain

In [None]:
CFG.model_name

'llama2-13b-chat'

# Loader

- [Directory loader](https://python.langchain.com/docs/modules/data_connection/document_loaders/file_directory) for multiple files
- This step is not necessary if you are just loading the vector database
- This step is necessary if you are creating embeddings. In this case you need to:
    - load the PDF files
    - split them into chunks
    - create embeddings
    - save the embeddings in a vector store
- After that, you can just load the saved embeddings, run a similarity search with the user query, and use the LLM to answer the question
    
You can comment out this section if you use the embeddings I already created.

In [None]:
%%time
# Measure the execution time of this cell (%%time must be the first line of the cell)

# Load PDF documents using DirectoryLoader
loader = DirectoryLoader(
    CFG.PDFs_path,
    glob="./*.pdf",
    loader_cls=PyPDFLoader,
    show_progress=True,
    use_multithreading=True
)

# Load the documents
documents = loader.load()



  0%|          | 0/2 [00:00<?, ?it/s]
 50%|█████     | 1/2 [00:54<00:54, 54.60s/it]
100%|██████████| 2/2 [01:00<00:00, 30.23s/it]

CPU times: user 1min, sys: 231 ms, total: 1min
Wall time: 1min





In [None]:
print(f'We have {len(documents)} pages in total')

We have 871 pages in total


In [None]:
documents[8].page_content

'agree.”\n\t\t\t\t\t\tHe\tdidn’t\tsay\tanother\tword\ton\tthe\tsubject\tas\tthey\twent\tupstairs\tto\tbed.\nWhile\tMrs.\tDursley\twas\tin\tthe\tbathroom,\tMr.\tDursley\tcrept\tto\tthe\tbedroom\nwindow\tand\tpeered\tdown\tinto\tthe\tfront\tgarden.\tThe\tcat\twas\tstill\tthere.\tIt\twas\nstaring\tdown\tPrivet\tDrive\tas\tthough\tit\twere\twaiting\tfor\tsomething.\n\t\t\t\t\t\tWas\the\timagining\tthings?\tCould\tall\tthis\thave\tanything\tto\tdo\twith\tthe\nPotters?\tIf\tit\tdid...if\tit\tgot\tout\tthat\tthey\twere\trelated\tto\ta\tpair\tof\t—\twell,\the\tdidn’t\nthink\the\tcould\tbear\tit.\n\t\t\t\t\t\tThe\tDursleys\tgot\tinto\tbed.\tMrs.\tDursley\tfell\tasleep\tquickly\tbut\tMr.\nDursley\tlay\tawake,\tturning\tit\tall\tover\tin\this\tmind.\tHis\tlast,\tcomforting\tthought\nbefore\the\tfell\tasleep\twas\tthat\teven\tif\tthe\tPotters\twere\tinvolved,\tthere\twas\tno\nreason\tfor\tthem\tto\tcome\tnear\thim\tand\tMrs.\tDursley.\tThe\tPotters\tknew\tvery\twell\nwhat\the\tand\tPetunia\tthough
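The page above was extracted with tabs where the PDF had spaces. The notebook embeds the text as is, but if you want cleaner passages, a small normalization step could be applied to `page_content` before splitting. This is a sketch with a hypothetical helper, `normalize_whitespace`, shown on a short excerpt of the page above.

```python
import re

def normalize_whitespace(page_text):
    """Replace tabs and repeated spaces with a single space, keeping line breaks."""
    lines = page_text.split("\n")
    cleaned = [re.sub(r"[ \t]+", " ", line).strip() for line in lines]
    return "\n".join(cleaned)

sample = "agree.”\n\t\t\t\t\t\tHe\tdidn’t\tsay\tanother\tword"
print(normalize_whitespace(sample))
```

Whether this changes retrieval quality is something to test, not something I measured here.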

# Splitter

- Split the text into chunks so that individual passages can be retrieved by similarity search
- This step is also only necessary if you are creating the embeddings
- [RecursiveCharacterTextSplitter](https://python.langchain.com/en/latest/reference/modules/document_loaders.html?highlight=RecursiveCharacterTextSplitter#langchain.document_loaders.MWDumpLoader)

In [None]:
# Create a RecursiveCharacterTextSplitter for splitting text
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=CFG.split_chunk_size,
    chunk_overlap=CFG.split_overlap
)

# Split the documents into chunks
texts = text_splitter.split_documents(documents)

# Print the number of chunks created
print(f'We have created {len(texts)} chunks from {len(documents)} pages')


We have created 2566 chunks from 871 pages
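To see what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified fixed-width splitter. It is a sketch only: `RecursiveCharacterTextSplitter` first tries to break on separators such as paragraphs, lines, and spaces, so its chunks are usually shorter than `chunk_size` and do not cut words in half. `split_fixed` is a hypothetical helper.

```python
def split_fixed(text, chunk_size, chunk_overlap=0):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

toy_text = "abcdefghij"
print(split_fixed(toy_text, chunk_size=4, chunk_overlap=0))
print(split_fixed(toy_text, chunk_size=4, chunk_overlap=2))
```

With `split_overlap = 0`, as configured in `CFG`, consecutive chunks share no text, so a sentence that straddles a chunk boundary is split between two passages.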


# Create Embeddings


- Embed and store the texts in a vector database (FAISS)
- [LangChain Vector Stores docs](https://python.langchain.com/docs/modules/data_connection/vectorstores/)
- [FAISS - langchain](https://python.langchain.com/docs/integrations/vectorstores/faiss)
- [Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks - paper Aug/2019](https://arxiv.org/pdf/1908.10084.pdf)
- [This is a nice 4-minute video about vector stores](https://www.youtube.com/watch?v=dN0lsF2cvm4)

___

- If you use Chroma vector store it will take ~35 min to create embeddings
- If you use FAISS vector store on GPU it will take just ~3 min

___

We need to create the embeddings only once, and then we can just load the vector store and query the database using similarity search.

Loading the embeddings takes only a few seconds.

I uploaded the embeddings to a Kaggle Dataset so we just load it from [here](https://www.kaggle.com/datasets/hinepo/faiss-hp-sentence-transformers).
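Conceptually, similarity search compares the query embedding with every stored embedding and returns the closest ones. The sketch below does this with cosine similarity on tiny made-up 2-D vectors. It is not how FAISS is implemented, and as far as I know LangChain's FAISS wrapper defaults to Euclidean (L2) distance, which gives the same ranking as cosine when vectors are normalized.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]

toy_docs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]  # hypothetical passage embeddings
toy_query = [0.9, 0.1]                           # hypothetical query embedding
print(top_k(toy_query, toy_docs, k=2))
```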

In [None]:
%%time

### we create the embeddings only if they do not exist yet
if not os.path.exists(CFG.Embeddings_path + '/index.faiss'):

    ### download embeddings model
    embeddings = HuggingFaceInstructEmbeddings(
        model_name = CFG.embeddings_model_repo,
        model_kwargs = {"device": "cuda"}
    )

    ### create embeddings and DB
    vectordb = FAISS.from_documents(
        documents = texts,
        embedding = embeddings
    )

    ### persist vector database
    vectordb.save_local(f"{CFG.Output_folder}/faiss_index_hp") # save in output folder
#     vectordb.save_local(f"{CFG.Embeddings_path}/faiss_index_hp") # save in input folder

load INSTRUCTOR_Transformer
max_seq_length  512
CPU times: user 6.73 s, sys: 63.7 ms, total: 6.79 s
Wall time: 6.75 s


If you are creating embeddings, remember that on Kaggle we cannot write data to the input folder.

So save the embeddings to the output folder and then load them from there.
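Since the index can live either in the input dataset or in the output folder, one option is to pick whichever folder actually contains `index.faiss` before loading. The helper below is a hypothetical sketch (`resolve_index_dir` is not part of LangChain); the demo uses a temporary directory so it runs anywhere.

```python
import os
import tempfile

def resolve_index_dir(candidates, index_file="index.faiss"):
    """Return the first folder that contains the FAISS index file, or None."""
    for folder in candidates:
        if os.path.exists(os.path.join(folder, index_file)):
            return folder
    return None

# Demo: the first folder is missing, the second one holds an (empty) index file
with tempfile.TemporaryDirectory() as tmp:
    missing = os.path.join(tmp, "dataset-not-attached")
    saved = os.path.join(tmp, "faiss_index_hp")
    os.makedirs(saved)
    open(os.path.join(saved, "index.faiss"), "wb").close()
    found_saved = resolve_index_dir([missing, saved]) == saved

print(found_saved)
```

In this notebook the candidates would be `CFG.Embeddings_path` and `CFG.Output_folder + '/faiss_index_hp'`.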

# Load vector database

- After saving the vector database, we just load it from the Kaggle Dataset mentioned above
- The embeddings model used when loading the vector database must be the same one that was used to create the embeddings, otherwise the query vectors and the stored vectors are not comparable

In [None]:
%%time

from langchain_community.embeddings import HuggingFaceInstructEmbeddings
from langchain_community.vectorstores import FAISS

# Download embeddings model
embeddings = HuggingFaceInstructEmbeddings(
    model_name=CFG.embeddings_model_repo,
    model_kwargs={"device": "cuda"}
)

# Load vector DB embeddings
vectordb = FAISS.load_local(
    CFG.Embeddings_path,  # from input folder
    embeddings,
    allow_dangerous_deserialization=True  # Allow deserialization
)

from IPython.display import clear_output
clear_output()

print("FAISS vector database loaded successfully!")


load INSTRUCTOR_Transformer
max_seq_length  512
Unexpected exception formatting exception. Falling back to standard exception


Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/IPython/core/magics/execution.py", line 1325, in time
    exec(code, glob, local_ns)
  File "<timed exec>", line 11, in <module>
  File "/opt/conda/lib/python3.10/site-packages/langchain_community/vectorstores/faiss.py", line 1092, in load_local
    index = faiss.read_index(str(path / f"{index_name}.faiss"))
  File "/opt/conda/lib/python3.10/site-packages/faiss/swigfaiss.py", line 9849, in read_index
    return _swigfaiss.read_index(*args)
RuntimeError: Error in faiss::FileIOReader::FileIOReader(const char*) at /project/faiss/faiss/impl/io.cpp:68: Error: 'f' failed: could not open /kaggle/input/faiss-hp-sentence-transformers/index.faiss for reading: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 2105, in showtraceback
    stb = self.In

In [None]:
%%time

### download embeddings model
embeddings = HuggingFaceInstructEmbeddings(
    model_name = CFG.embeddings_model_repo,
    model_kwargs = {"device": "cuda"}
)

### load vector DB embeddings
### FAISS.load_local reads a pickle file, so LangChain raises a ValueError unless
### allow_dangerous_deserialization=True is passed. Pickle files can execute arbitrary
### code, so only enable this for index files you created yourself or otherwise trust.
vectordb = FAISS.load_local(
    CFG.Embeddings_path, # from input folder
#     CFG.Output_folder + '/faiss_index_hp', # from output folder
    embeddings,
    allow_dangerous_deserialization = True
)

clear_output()

load INSTRUCTOR_Transformer
max_seq_length  512


In [None]:
### test if vector DB was loaded correctly
vectordb.similarity_search('magic creatures')

[Document(page_content='J.K. Rowling      HARRY POTTER AND THE DEATHLY HALLOWS  \n634 believe that you have magic that I do not, or else a weapon more powerful \nthan mine? ” \n“I believe both, ” said Harry, and he saw shoc k flit across the snakelike \nface, though it w as instantly dispelled; Voldemort began to laugh, and the \nsound was more frightening than his screams; humorless and insane, it \nechoed around the silent Hall.  \n“You think you know more magic than I do? ” he said. “Than I, than Lord \nVoldemort, who has perform ed magic that Dumbledore himself never \ndreamed of? ” \n“Oh, he dreamed of it, ” said Harry, “but he knew more than you, knew \nenough not to do what you ’ve done.” \n“You mean he was weak! ” screamed Voldemort. “Too weak to dare, too', metadata={'source': '/kaggle/input/harrypotter2/harry-potter-and-the-deathly-hallows-j.k.-rowling.pdf', 'page': 633}),
 Document(page_content='J.K. Rowl ing     HARRY POTTER AND THE DEATHLY HALLOWS  \n573 watched her as gre

# Prompt Template

- Custom prompt

In [None]:
prompt_template = """
Don't try to make up an answer, if you don't know just say that you don't know.
Answer in the same language the question was asked.
Use only the following pieces of context to answer the question at the end.

{context}

Question: {question}
Answer:"""


PROMPT = PromptTemplate(
    template = prompt_template,
    input_variables = ["context", "question"]
)
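To see what the LLM actually receives, the template can be filled by hand with plain `str.format`. This is a sketch: the passages and question are placeholders, and joining the passages with a blank line mirrors what I understand the "stuff" chain type to do by default.

```python
prompt_template = """
Don't try to make up an answer, if you don't know just say that you don't know.
Answer in the same language the question was asked.
Use only the following pieces of context to answer the question at the end.

{context}

Question: {question}
Answer:"""

# Placeholder passages standing in for the retrieved chunks
toy_passages = ["First retrieved passage.", "Second retrieved passage."]
filled_prompt = prompt_template.format(
    context="\n\n".join(toy_passages),
    question="What do the passages say?",
)
print(filled_prompt)
```

Because the prompt ends with `Answer:`, the model's continuation is the answer itself.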

In [None]:
# llm_chain = LLMChain(prompt=PROMPT, llm=llm)
# llm_chain

# Retriever chain

- Retriever to retrieve relevant passages
- Chain to answer questions
- [RetrievalQA: Chain for question-answering](https://python.langchain.com/docs/modules/data_connection/retrievers/)

In [None]:
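# search_type is an argument of as_retriever itself; search_kwargs only carries the search parameters
retriever = vectordb.as_retriever(search_type = "similarity", search_kwargs = {"k": CFG.k})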

qa_chain = RetrievalQA.from_chain_type(
    llm = llm,
    chain_type = "stuff", # map_reduce, map_rerank, stuff, refine
    retriever = retriever,
    chain_type_kwargs = {"prompt": PROMPT},
    return_source_documents = True,
    verbose = False
)

In [None]:
### testing MMR search
question = "Which are Hagrid's favorite animals?"
vectordb.max_marginal_relevance_search(question, k = CFG.k)

[Document(page_content='“Hagrid’s\talways\twanted\ta\tdragon,\the\ttold\tme\tso\tthe\tfirst\ttime\tI\tever\tmet\nhim,\t“\tsaid\tHarry.\n\t\t\t\t\t\t“But\tit’s\tagainst\tour\tlaws,”\tsaid\tRon.\t“Dragon\tbreeding\twas\toutlawed\tby\nthe\tWarlocks’\tConvention\tof\t1709,\teveryone\tknows\tthat.\tIt’s\thard\tto\tstop\nMuggles\tfrom\tnoticing\tus\tif\twe’re\tkeeping\tdragons\tin\tthe\tback\tgarden\t—\nanyway,\tyou\tcan’t\ttame\tdragons,\tit’s\tdangerous.\tYou\tshould\tsee\tthe\tburns\nCharlie’s\tgot\toff\twild\tones\tin\tRomania.”\n\t\t\t\t\t\t“But\tthere\taren’t\twild\tdragons\tin\tBritain?”\tsaid\tHarry.\n\t\t\t\t\t\t“Of\tcourse\tthere\tare,”\tsaid\tRon.\t“Common\tWelsh\tGreen\tand\tHebridean\nBlacks.\tThe\tMinistry\tof\tMagic\thas\ta\tjob\thushing\tthem\tup,\tI\tcan\ttell\tyou.\tOur\nkind\thave\tto\tkeep\tputting\tspells\ton\tMuggles\twho’ve\tspotted\tthem,\tto\tmake\tthem\nforget.”\n\t\t\t\t\t\t“So\twhat\ton\tearth’s\tHagrid\tup\tto?”\tsaid\tHermione.', metadata={'source': '/kaggle/inp

In [None]:
### testing similarity search
question = "Which are Hagrid's favorite animals?"
vectordb.similarity_search(question, k = CFG.k)

[Document(page_content='“Hagrid’s\talways\twanted\ta\tdragon,\the\ttold\tme\tso\tthe\tfirst\ttime\tI\tever\tmet\nhim,\t“\tsaid\tHarry.\n\t\t\t\t\t\t“But\tit’s\tagainst\tour\tlaws,”\tsaid\tRon.\t“Dragon\tbreeding\twas\toutlawed\tby\nthe\tWarlocks’\tConvention\tof\t1709,\teveryone\tknows\tthat.\tIt’s\thard\tto\tstop\nMuggles\tfrom\tnoticing\tus\tif\twe’re\tkeeping\tdragons\tin\tthe\tback\tgarden\t—\nanyway,\tyou\tcan’t\ttame\tdragons,\tit’s\tdangerous.\tYou\tshould\tsee\tthe\tburns\nCharlie’s\tgot\toff\twild\tones\tin\tRomania.”\n\t\t\t\t\t\t“But\tthere\taren’t\twild\tdragons\tin\tBritain?”\tsaid\tHarry.\n\t\t\t\t\t\t“Of\tcourse\tthere\tare,”\tsaid\tRon.\t“Common\tWelsh\tGreen\tand\tHebridean\nBlacks.\tThe\tMinistry\tof\tMagic\thas\ta\tjob\thushing\tthem\tup,\tI\tcan\ttell\tyou.\tOur\nkind\thave\tto\tkeep\tputting\tspells\ton\tMuggles\twho’ve\tspotted\tthem,\tto\tmake\tthem\nforget.”\n\t\t\t\t\t\t“So\twhat\ton\tearth’s\tHagrid\tup\tto?”\tsaid\tHermione.', metadata={'source': '/kaggle/inp
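The two cells above compare plain similarity search with MMR (maximal marginal relevance). MMR picks passages one at a time, trading relevance to the query against redundancy with passages already picked. Below is a minimal sketch on made-up 2-D vectors; `mmr` is a hypothetical helper, not LangChain's implementation, and `lambda_mult` follows the usual convention where 1.0 means pure relevance.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr(query_vec, doc_vecs, k, lambda_mult=0.5):
    """Greedily select k indices, balancing relevance against redundancy."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest passage that was already selected
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected

# Passages 0 and 1 are near-duplicates; passage 2 is less relevant but different
toy_docs = [[1.0, 0.0], [0.99, 0.14], [0.6, 0.8]]
toy_query = [1.0, 0.1]

print(mmr(toy_query, toy_docs, k=2, lambda_mult=1.0))  # pure relevance
print(mmr(toy_query, toy_docs, k=2, lambda_mult=0.3))  # favors diversity
```

With pure relevance the two near-duplicates are returned; with a lower `lambda_mult` the second pick switches to the more distinct passage.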

# Post-process outputs

- Format LLM response
- Cite sources (PDFs)
- Change `width` parameter to format the output

In [None]:
def wrap_text_preserve_newlines(text, width=700):
    # Split the input text into lines based on newline characters
    lines = text.split('\n')

    # Wrap each line individually
    wrapped_lines = [textwrap.fill(line, width=width) for line in lines]

    # Join the wrapped lines back together using newline characters
    wrapped_text = '\n'.join(wrapped_lines)

    return wrapped_text


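def process_llm_response(llm_response):
    # Wrap the generated answer for display
    ans = wrap_text_preserve_newlines(llm_response['result'])

    # Build one "<file name> - page: <n>" line per retrieved passage.
    # [:-4] drops the '.pdf' extension; the page number stored by PyPDFLoader is zero-based
    sources_used = ' \n'.join(
        [
            source.metadata['source'].split('/')[-1][:-4]
            + ' - page: '
            + str(source.metadata['page'])
            for source in llm_response['source_documents']
        ]
    )

    # Append the list of sources to the answer
    ans = ans + '\n\nSources: \n' + sources_used
    return ans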

In [None]:
def llm_ans(query):
    # Run the QA chain on the query and return the formatted answer,
    # its sources, and the elapsed time in seconds
    start = time.time()

    llm_response = qa_chain.invoke(query)
    ans = process_llm_response(llm_response)

    end = time.time()

    time_elapsed = int(round(end - start, 0))
    time_elapsed_str = f'\n\nTime elapsed: {time_elapsed} s'
    return ans + time_elapsed_str
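With `k = 6` retrieved passages, several of them can come from the same page, so the source list may contain repeats. One possible refinement, sketched below, keeps each source line once in order of first appearance. `unique_sources` is a hypothetical helper, and the documents are stand-ins built with `SimpleNamespace` and made-up paths rather than real LangChain `Document` objects.

```python
from types import SimpleNamespace

def unique_sources(source_documents):
    """Return '<file name> - page: <n>' labels without duplicates, in first-seen order."""
    seen, ordered = set(), []
    for doc in source_documents:
        file_name = doc.metadata["source"].split("/")[-1].rsplit(".", 1)[0]
        label = f"{file_name} - page: {doc.metadata['page']}"
        if label not in seen:
            seen.add(label)
            ordered.append(label)
    return ordered

# Stand-in documents with hypothetical paths
toy_sources = [
    SimpleNamespace(metadata={"source": "/data/book-a.pdf", "page": 10}),
    SimpleNamespace(metadata={"source": "/data/book-a.pdf", "page": 10}),
    SimpleNamespace(metadata={"source": "/data/book-b.pdf", "page": 3}),
]
print("\n".join(unique_sources(toy_sources)))
```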

# Ask questions

- Question Answering from multiple documents
- Invoke QA Chain
- Talk to your data

In [None]:
CFG.model_name

'llama2-13b-chat'

In [None]:
query = "Which challenges does Harry face during the Triwizard Tournament?"
print(llm_ans(query))


Don't try to make up an answer, if you don't know just say that you don't know.
Answer in the same language the question was asked.
Use only the following pieces of context to answer the question at the end.

dad     says    it      must’ve been    a       powerful        Dark    wizard  to      get     round   Gringotts,      but     they
don’t   think   they    took    anything,       that’s  what’s  odd.    ’Course,        everyone        gets    scared
when    something       like    this    happens in      case    You-Know-Who’s  behind  it.”
                                                Harry   turned  this    news    over    in      his     mind.   He      was     starting        to      get     a       prickle
of      fear    every   time    You-Know-Who    was     mentioned.      He      supposed        this    was     all
part    of      entering        the     magical world,  but     it      had     been    a       lot     more    comfortable     saying
“Voldemort”     wi

In [None]:
query = "Why do the Malfoys look so unhappy with their lot? "
print(llm_ans(query))


Don't try to make up an answer, if you don't know just say that you don't know.
Answer in the same language the question was asked.
Use only the following pieces of context to answer the question at the end.

“Why do the Malfoys look so unhappy with their lot? Is my return, my
rise to power, not the very thing they professed to desire  for so many years? ”
“Of course, my Lord, ” said Lucius Malfoy. His hand shook as he wiped
sweat from his upper lip. “We did desire it – we do.”
To Malfoy ’s left, his wife made an odd, sti ff nod, her eyes averted from
Voldemort and the snake. To his right, his son, Draco, who had been gazing

a       mist    before  them    and     they    kept    as      close   as      possible        to      their   hot     cauldrons.
                                                “I      do      feel    so      sorry,” said    Draco   Malfoy, one     Potions class,  “for    all     those
people  who     have    to      stay    at      Hogwarts        for     Chri

In [None]:
query = "What are horcrux?"
print(llm_ans(query))


Don't try to make up an answer, if you don't know just say that you don't know.
Answer in the same language the question was asked.
Use only the following pieces of context to answer the question at the end.

and held it as gingerly as if it wer e something recently dead.
“This is the one that gives explicit instructions on how to make a Horcrux.
Secrets of the Darkest Art – it’s a horrible book, really aw ful, full of evil
magic. I wonder when Dumbledore removed it from the library ... If he
didn’t do it until he was headmaster, I bet Voldemort got all the instruction
he needed from here. ”
“Why did he have to ask Slughorn how to make a Horcrux, then, if he ’d
already read that? ” asked Ron.
“He only approached Slughorn to find out what would happen if  you split
your soul into seven, ” said Harry. “Dumbledore was sure Riddle already
knew how to make a Horcrux by the time he asked Slughorn about them. I
think you ’re right, Hermione, that could easily have been where he got the

rath

In [None]:
query = "Give me 5 examples of cool potions and explain what they do"
print(llm_ans(query))


Don't try to make up an answer, if you don't know just say that you don't know.
Answer in the same language the question was asked.
Use only the following pieces of context to answer the question at the end.

First-year      students        will    require:
1.      Three   sets    of      plain   work    robes   (black)
2.      One     plain   pointed hat     (black) for     day     wear
3.      One     pair    of      protective      gloves  (dragon hide    or      similar)
4.      One     winter  cloak   (black, silver  fastenings)
Please  note    that    all     pupils’ clothes should  carry   name    tags

COURSE  BOOKS
All     students        should  have    a       copy    of      each    of      the     following:
The     Standard        Book    of      Spells  (Grade  1)
by      Miranda Goshawk
A       History of      Magic
by      Bathilda        Bagshot
Magical Theory
by      Adalbert        Waffling
A       Beginners’      Guide   to      Transfiguration
by      Emeric  Swi

# Gradio Chat UI

- **<font color='orange'>At the moment this part only works on Google Colab. Gradio and Kaggle started having compatibility issues recently.</font>**
- If you plan to use the interface, run this notebook in Google Colab
- The Gradio cells below are left commented out for now
- Screenshots of the chat UI are shown below

___

- Create a chat UI with [Gradio](https://www.gradio.app/guides/quickstart)
- [ChatInterface docs](https://www.gradio.app/docs/chatinterface)
- The notebook must keep running while you use the chat interface

In [None]:
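# Workaround: force UTF-8 as the preferred encoding so that shell commands
# (e.g. `! pip install`) do not fail when the session locale is not UTF-8
import locale
locale.getpreferredencoding = lambda: "UTF-8"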

In [None]:
! pip install --upgrade gradio -qq
clear_output()

In [None]:
# Update typing_extensions to the latest version
#!pip install --upgrade typing_extensions

# Reinstall gradio to ensure all dependencies are met
#!pip install --upgrade gradio

# Import gradio and print its version
#import gradio as gr
#print(gr.__version__)


In [None]:
# def predict(message, history):
#     # output = message # debug mode

#     output = str(llm_ans(message)).replace("\n", "<br/>")
#     return output

# demo = gr.ChatInterface(
#     predict,
#     title = f' Open-Source LLM ({CFG.model_name}) for Harry Potter Question Answering'
# )

# demo.queue()
# demo.launch()
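
The commented-out cell above can be sketched as a small standalone script. This is a minimal sketch rather than the notebook's exact code: `llm_ans` here is a hypothetical stub standing in for the notebook's RAG function, `build_demo` is a helper name introduced for illustration, and `"my-model"` is a placeholder for `CFG.model_name`. Gradio is imported lazily so that `predict` can be checked without Gradio installed.

```python
def llm_ans(query: str) -> str:
    """Hypothetical stand-in for the notebook's llm_ans (the real one runs the RAG chain)."""
    return f"Question: {query}\nAnswer: (model output would go here)"


def predict(message, history):
    """Chat callback: Gradio passes the new user message and the chat history."""
    # Convert newlines to <br/> so multi-line answers keep their line breaks
    # in the chat window, as in the commented-out cell.
    return str(llm_ans(message)).replace("\n", "<br/>")


def build_demo(model_name: str = "my-model"):
    """Build the chat UI; model_name is a placeholder for CFG.model_name."""
    import gradio as gr  # imported here so the rest of the code loads without Gradio

    return gr.ChatInterface(
        predict,
        title=f"Open-Source LLM ({model_name}) for Harry Potter Question Answering",
    )


# To serve the UI (the notebook must keep running while it is in use):
# demo = build_demo()
# demo.queue()
# demo.launch()
```

Keeping `predict` free of any Gradio-specific code makes it easy to call directly, for example `predict("What are horcrux?", [])`, before wiring it into the interface.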

# Conclusions

- Feel free to fork and optimize the code. Lots of things can be improved.

- The things that had the most impact on the models' output quality in my experiments:
    - Prompt engineering
    - Bigger models
    - Other model families
    - Splitting: chunk size, overlap
    - Search: Similarity, MMR, k
    - Pipeline parameters (temperature, top_p, penalty)
    - Embeddings function
    - LLM parameters (max len)
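
To illustrate what the chunk size and overlap settings in the list above control, here is a minimal character-level sketch. It is not the splitter used in this notebook: `split_text` is an illustrative name, and LangChain's text splitters additionally try to break at separators such as paragraph boundaries, which this sketch does not.

```python
def split_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size chunks where consecutive chunks share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


print(split_text("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

Larger chunks give the LLM more context per retrieved passage but make each embedding less specific. Overlap reduces the chance that a relevant sentence is cut in half at a chunk boundary.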

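On the "Search: Similarity, MMR, k" item above: plain similarity search returns the k passages closest to the query, while MMR (maximal marginal relevance) also penalizes passages that are too similar to ones already selected. The following is a minimal pure-Python sketch of that idea, not the implementation used by LangChain or FAISS. `cosine`, `mmr`, and `lambda_mult` are illustrative names, and the vectors are toy data.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr(query_vec, doc_vecs, k, lambda_mult=0.5):
    """Return indices of up to k documents chosen greedily by MMR.

    lambda_mult = 1.0 is plain similarity ranking; lower values favor diversity.
    """
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best, best_score = None, -math.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected document (0 if none yet)
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.0]
docs = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]  # docs 0 and 1 are near-duplicates
print(mmr(query, docs, k=2, lambda_mult=1.0))  # [0, 1]
print(mmr(query, docs, k=2, lambda_mult=0.3))  # [0, 2]
```

With `lambda_mult=1.0` the two near-duplicate passages are both returned. With a lower value the second slot goes to the less similar but more diverse passage, which can help when the retrieved context would otherwise repeat itself.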

- LangChain, Hugging Face and Gradio are awesome libs!

- **<font color='orange'>If you liked this notebook, don't forget to show your support with an Upvote!</font>**

- If you are interested in LLMs, I also have some other notebooks you might want to check out:

    - [Instruction Finetuning](https://www.kaggle.com/code/hinepo/llm-instruction-finetuning-wandb)
    - [Preference Finetuning - LLM Alignment](https://www.kaggle.com/code/hinepo/llm-alignment-preference-finetuning)
    - [Synthetic Data for Finetuning](https://www.kaggle.com/code/hinepo/synthetic-data-creation-for-llms)
    - [Safeguards and Guardrails](https://www.kaggle.com/code/hinepo/llm-safeguards-and-guardrails)
    
___

🦜🔗🤗