<a href="https://colab.research.google.com/github/PouriaRouzrokh/machine-learning/blob/master/Education/Retrieval_Augmented_Generation/RAG_Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Retrieval Augmented Generation with Large Language Models

---

This notebook was created by the following members of the SIIM Machine Learning Committee, Subcommittee of Machine Learning Education:

<b>Pouria Rouzrokh, MD, MPH, MHPE;
Shahriar Faghani, MD, MPH, MHPE;
Felipe Kitamura, MD, PhD;
Timothy L. Kline, PhD </b>  
With special thanks to: **Moein Shariatnia, MD Candidate**

---

In recent years, the applications of large language models (LLMs) like GPT-4 have expanded rapidly, from simple tasks such as setting reminders and answering emails to more complex ones like drafting research papers, writing software, and even assisting in artistic creation. LLMs have found a foothold in a diverse array of domains. Notably, in medicine, these models have shown promise in interpreting complex datasets, searching patient records, and even generating synthetic text data. Their versatility stems from their enormous training datasets and underlying architectures, which allow them to generate human-like text in real time.

However, like all tools, LLMs come with their own set of limitations. One prominent challenge is **"hallucination"**: the model may generate information that is incorrect or unsupported by its training data. In fields like medicine, such errors could lead to misleading interpretations and, in worst-case scenarios, detrimental patient outcomes. The crux of the issue is that while LLMs can generate plausible-sounding content, they do not inherently verify the factual accuracy of the generated output against a trusted data source.

In this notebook, we will learn about **Retrieval Augmented Generation (RAG)**, an approach that may help mitigate hallucination errors in LLMs. It combines the generative capabilities of LLMs with the accuracy of retrieval-based models. In RAG, when a query is made, the model first fetches relevant documents or data snippets from a large pool of documents, which may already be available or may be provided by the user (retrieval phase), and then uses this information to generate a response (generation phase). By combining the strengths of retrieval and generation, RAG aims to provide more accurate and contextually relevant answers. In medical fields, RAG can help ground responses in accurate data, increasing the trustworthiness of the model's outputs.

## Notebook Requirements

To run this notebook, you need access to the following items:
1. **HuggingFace token**: This token will be necessary to access specific models and datasets in subsequent steps. For a quick overview of the process, you can check out the [Hugging Face documentation](https://huggingface.co/docs/api-inference/quicktour). It explains how to register or log in, obtain a user access or API token from your Hugging Face profile settings, and submit the token when making requests to the API.
2. **Access to the LLAMA2 model**: You can request access to the LLAMA2 model on HuggingFace (usually approved very quickly): https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
3. **A Kaggle username and key**: This is necessary to download the dataset we will work with. Here is a [tutorial](https://github.com/JovianHQ/opendatasets#kaggle-credentials) to learn how to obtain this information.
4. Access to **Medium** posts: Medium is a platform for hosting blog posts about data science. You can read three Medium articles per month for free, but you need a paid subscription for more frequent access.

Please note that access to all of the above items is free, but you may need to wait a little while for some of them to become available.

## Setting Up the Environment

Before diving into Retrieval Augmented Generation (RAG), we need to set up our environment by installing the necessary libraries. The libraries listed here provide us with tools and functionalities to implement and leverage RAG, as well as other related processes. Here's a brief overview of some of the core libraries:

*   **transformers**: Contains implementations of many state-of-the-art models, including those related to RAG.
*   **sentence-transformers**: Helps in creating embeddings for sentences, useful for the retrieval phase in RAG.
*   **Langchain**: A library for interacting with LLMs through high-level abstractions.
*   **pinecone-client** and **chromadb**: Clients for vector databases, which store embeddings and support similarity search.
*   **datasets**: Provides easy access to a plethora of datasets, which can be handy when testing and validating RAG.
*   **accelerate**: Handles device placement when loading and running large models (needed for `device_map='auto'`).
*   **einops** and **xformers**: Offer advanced operations and architectures for neural networks.
*   **bitsandbytes**: Provides the quantization routines used to load the model in 4-bit precision.
*   **opendatasets**: An easy way to download and use open datasets in our experiments.

After installing these, we import the necessary modules to prepare for our subsequent RAG experiments. It's essential to ensure a smooth experience by having all the required tools at our disposal.

In [1]:
!pip install -qU \
  transformers==4.31.0 \
  sentence-transformers==2.2.2 \
  pinecone-client==2.2.2 \
  datasets==2.14.0 \
  accelerate==0.21.0 \
  einops==0.6.1 \
  langchain==0.0.240 \
  xformers==0.0.20 \
  bitsandbytes==0.41.0 \
  opendatasets==0.1.22 \
  chromadb==0.4.6


In [2]:
# Imports

import os
import shutil
from typing import List, Dict
import warnings

import langchain
import opendatasets as od
import pandas as pd
from torch import cuda, bfloat16
from tqdm.notebook import tqdm
import random
import transformers

## Environment Configuration

In this section, we're setting up some preliminary configurations to ensure our experiments run seamlessly:

*   Warnings Filter: It's common to encounter warnings in Python, especially when working with experimental libraries or using functions that might soon be deprecated. To keep our notebook clutter-free, we suppress these warnings.
*   Pandas Display: When displaying dataframes using Pandas, long text columns can get truncated for brevity. We've increased the maximum displayed column width to 1000 characters for better visibility.
*   Hugging Face Token: Hugging Face provides numerous models and datasets that we might utilize. To access them, we need to provide an authentication token. This token connects us to the Hugging Face platform and ensures seamless downloads and uploads.
*   Cleaning Up: The Google Colab environment typically comes with a sample data directory. Since we want a clean workspace, we're removing this directory.

> **Note:** Have your Hugging Face token ready before proceeding, as it is needed to access specific models and datasets in subsequent steps. The Notebook Requirements section above links to instructions for obtaining one.
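
As an alternative to typing the token on every run, it can be read from an environment variable, with a hidden prompt as the fallback. The sketch below is one way to do this; the helper name `load_token` and the variable name `HF_TOKEN` are our own choices for illustration, not something the notebook or the Hugging Face libraries require.

```python
import os
from getpass import getpass


def load_token(env_var: str = "HF_TOKEN") -> str:
    """Return the token from the environment, or prompt for it without echoing."""
    token = os.environ.get(env_var)
    if not token:
        # getpass hides the typed value, so it never appears in the cell output
        token = getpass(f"Please enter your Hugging Face token ({env_var}): ")
    return token.strip()
```

Calling `load_token()` once and storing the result keeps the secret out of saved notebook outputs, which matters when the notebook is shared.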

In [3]:
# Configurations

warnings.filterwarnings('ignore')
pd.options.display.max_colwidth = 1000

from getpass import getpass

# getpass hides the typed token so it is not echoed into the cell output
HF_TOKEN = getpass('Please enter your Hugging Face token: ')

Please enter your Hugging Face token: ··········


In [4]:
# Remove the sample directory

if os.path.exists('sample_data'):
  shutil.rmtree('sample_data')

## Data Acquisition and Preparation

To explore and validate our RAG model, we'll be using the "Question-Answer Dataset" from Kaggle. This dataset comprises question-answer pairs, making it suitable for our purpose.

Steps:

*    Downloading the Dataset: We use the opendatasets library to effortlessly fetch our dataset from Kaggle. Ensure you're authenticated on Kaggle and have the necessary permissions to download the dataset.
*    Loading the Questions: The dataset is split across multiple files. We loop through these files, load them into Pandas dataframes, and concatenate them into a single dataframe. The encoding is set to 'latin1' and the delimiter to '\t' (tab) to correctly interpret the dataset structure. To maintain unique questions, duplicates are dropped based on the "Question" column.

> **Note**: You need your Kaggle username and key to download the dataset. Here is a [tutorial](https://github.com/JovianHQ/opendatasets#kaggle-credentials) to learn how to obtain this information.

In [8]:
# Download the "Question-Answer Dataset" dataset from Kaggle

od.download("https://www.kaggle.com/datasets/rtatman/questionanswer-dataset")

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username: pouriarouzrokh
Your Kaggle Key: ··········
Downloading questionanswer-dataset.zip to ./questionanswer-dataset


100%|██████████| 3.55M/3.55M [00:01<00:00, 2.91MB/s]





Once downloaded, we display the number of unique questions and the first few rows of our aggregated dataframe to understand its structure and contents. As seen in the output, we have a total of 2463 unique questions, and the dataset columns provide insights like article title, question, answer, difficulty levels from the questioner and answerer, and more.

This dataset will form the foundation of our retrieval and generation tasks in the upcoming steps.

In [15]:
# Load the questions

data_directory = "/content/questionanswer-dataset"
df = pd.DataFrame()
for file in ['S10_question_answer_pairs.txt','S09_question_answer_pairs.txt','S08_question_answer_pairs.txt']:
    filename = os.path.join(data_directory, file)
    df_tmp = pd.read_csv(filename, encoding='latin1', sep='\t').drop_duplicates(subset="Question")
    df = pd.concat([df,df_tmp])

print(f'Number of questions: {len(df)}')
df.head()

Number of questions: 2463


| | ArticleTitle | Question | Answer | DifficultyFromQuestioner | DifficultyFromAnswerer | ArticleFile | ï»¿ArticleTitle |
|---|---|---|---|---|---|---|---|
| 0 | Alessandro_Volta | Was Alessandro Volta a professor of chemistry? | Alessandro Volta was not a professor of chemistry. | easy | easy | S10_set4_a10 | |
| 2 | Alessandro_Volta | Did Alessandro Volta invent the remotely operated pistol? | Alessandro Volta did invent the remotely operated pistol. | easy | easy | S10_set4_a10 | |
| 4 | Alessandro_Volta | Was Alessandro Volta taught in public schools? | Volta was taught in public schools. | easy | easy | S10_set4_a10 | |
| 6 | Alessandro_Volta | Who did Alessandro Volta marry? | Alessandro Volta married Teresa Peregrini. | medium | medium | S10_set4_a10 | |
| 8 | Alessandro_Volta | What did Alessandro Volta invent in 1800? | In 1800, Alessandro Volta invented the voltaic pile. | medium | easy | S10_set4_a10 | |


#### Data cleaning

Data cleaning is a pivotal step in most data-driven projects. Clean and structured data ensures better performance and interpretability of our models.

Cleaning Steps:

*    Dropping Unnecessary Columns: The last column (extra Article Title) in our dataframe is redundant and doesn't contribute any meaningful information to our RAG model. Thus, we drop it.
*    Handling Missing Values: Questions and Answers are the backbone of our experiment. Hence, we remove any rows where the "Question" or "Answer" fields have missing values. This ensures our model isn't fed any incomplete or ambiguous data.
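
The odd name of the extra column, `ï»¿ArticleTitle`, looks like what a UTF-8 byte-order mark (BOM) turns into when a file is decoded as latin1. The sketch below reproduces that effect on a made-up header cell. That a BOM is the cause here is our assumption, not something stated by the dataset.

```python
# A header cell as it would be stored in a file saved as UTF-8 with a BOM
raw_header = "ArticleTitle".encode("utf-8-sig")

# Decoding as latin1 keeps the three BOM bytes as visible characters
as_latin1 = raw_header.decode("latin1")

# Decoding as utf-8-sig strips the BOM and recovers the clean name
as_utf8_sig = raw_header.decode("utf-8-sig")

print(repr(as_latin1))
print(repr(as_utf8_sig))
```

If this is indeed the cause, the rows from the affected file carry their article title in the extra column rather than in `ArticleTitle`, which is worth keeping in mind before discarding it.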


Post-cleaning, we inspect the size of our cleaned dataframe and the first few rows. As observed in the output, we're now working with 2190 unique and clean question-answer pairs, which are ready to be fed into the RAG model in subsequent steps.

In [16]:
# Cleaning the original dataframe

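# Drop the redundant last column; the `columns=` keyword replaces the positional
# axis argument, which recent pandas versions no longer accept
df.drop(columns=df.columns[-1], inplace=True)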
df.dropna(subset=['Question'], inplace=True)
df.dropna(subset=['Answer'], inplace=True)

print(f'new df length: {len(df)}')
df.head()

new df length: 2190


| | ArticleTitle | Question | Answer | DifficultyFromQuestioner | DifficultyFromAnswerer | ArticleFile |
|---|---|---|---|---|---|---|
| 0 | Alessandro_Volta | Was Alessandro Volta a professor of chemistry? | Alessandro Volta was not a professor of chemistry. | easy | easy | S10_set4_a10 |
| 2 | Alessandro_Volta | Did Alessandro Volta invent the remotely operated pistol? | Alessandro Volta did invent the remotely operated pistol. | easy | easy | S10_set4_a10 |
| 4 | Alessandro_Volta | Was Alessandro Volta taught in public schools? | Volta was taught in public schools. | easy | easy | S10_set4_a10 |
| 6 | Alessandro_Volta | Who did Alessandro Volta marry? | Alessandro Volta married Teresa Peregrini. | medium | medium | S10_set4_a10 |
| 8 | Alessandro_Volta | What did Alessandro Volta invent in 1800? | In 1800, Alessandro Volta invented the voltaic pile. | medium | easy | S10_set4_a10 |


#### Data Collection and Preprocessing

Before we can apply Retrieval Augmented Generation (RAG), we need to set up a corpus from which our model can retrieve information. In this section, we collect the article texts from the `.clean` files located in the dataset's `text_data` directory.

The provided code:

1.    Defines a directory path `txt_files_dir` where our textual data resides.
2.    Initializes an empty dictionary `text_records` to store the content of each text file.
3.    Iterates through each file in the directory, reads its content, and adds it to the dictionary using the file name (sans extension) as the key.
4.    Prints the number of collected records and shows the content of the first file to give a sense of our data.

This initial data collection phase ensures that we have a robust dataset from which our RAG model can pull relevant information when generating responses.
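
The same collection logic can be sketched with `pathlib` and tried on a throwaway directory before pointing it at the real data. The helper name `collect_texts` and the demo file names are made up for illustration.

```python
import tempfile
from pathlib import Path
from typing import Dict


def collect_texts(directory: Path, suffix: str = ".clean") -> Dict[str, str]:
    """Map each file name (up to its first dot) to the file's text content."""
    records: Dict[str, str] = {}
    for path in sorted(directory.glob(f"*{suffix}")):
        source = path.name.split(".")[0]
        # errors="replace" swaps undecodable bytes for U+FFFD instead of raising
        records[source] = path.read_text(encoding="utf-8", errors="replace")
    return records


# Demo on a temporary directory with made-up files
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / "demo_a1.txt.clean").write_text("Example article", encoding="utf-8")
    (tmp_dir / "notes.md").write_text("ignored", encoding="utf-8")
    demo_records = collect_texts(tmp_dir)

print(demo_records)
```

Only files with the requested suffix are picked up, so stray files in the same folder do not end up in the corpus.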

In [17]:
# Collect the texts for all txt files

txt_files_dir = os.path.join(data_directory, 'text_data/text_data')
text_records: Dict = dict()
for file_name in os.listdir(txt_files_dir):
    if file_name.endswith('.clean'):
        file_path = os.path.join(txt_files_dir, file_name)
        with open(file_path, 'r', encoding="utf-8", errors="replace") as f:
            assert os.path.isfile(file_path)
            text = f.read()
            source = file_name.split('.')[0]
            text_records[source] = text

print(f'Number of text records: {len(text_records)}')
list(text_records.items())[0]

Number of text records: 150


('S10_set5_a8',
 'Turkish_language\n\n\n\nTurkish (  IPA  ) is spoken as a first language by over 63 million people worldwide,    making it the most commonly spoken of the Turkic languages.  Its speakers are located predominantly in Turkey and Cyprus, with smaller groups in Iraq, Greece, Bulgaria, the Republic of Macedonia, Kosovo, Albania and other parts of Eastern Europe. Turkish is also spoken by several million immigrants in Western Europe, particularly in Germany.\n\nThe roots of the language can be traced to Central Asia, with the first written records dating back nearly 1,200 years. To the west, the influence of Ottoman Turkish—the variety of the Turkish language that was used as the administrative and literary language of the Ottoman Empire—spread as the Ottoman Empire expanded. In 1928, as one of Atatürk\'s Reforms in the early years of the Republic of Turkey, the Ottoman script was replaced with a phonetic variant of the Latin alphabet. Concurrently, the newly founded Turkish

## Regular question-answering with LLAMA2

Before diving into Retrieval Augmented Generation (RAG), it's crucial to understand the performance of traditional Large Language Models (LLMs) without retrieval augmentation. For this purpose, we're setting up a baseline using LLAMA 2, a state-of-the-art open-source language model.

> **Note**: RAG can also be applied to commercial language models like ChatGPT and GPT-4. However, you would need to re-implement the code using the OpenAI Python library instead of the HuggingFace library.

The provided code performs the following tasks:

1.   Specifies the model_id corresponding to LLAMA 2 available on the HuggingFace Model Hub.
2.   Determines the computational device (GPU or CPU) for running the model.
3.   Configures quantization settings via BitsAndBytesConfig to load the model using reduced memory. Quantization is a technique to store and compute on model parameters using fewer bits, which can be particularly useful when working with large models on limited hardware.
4.   Initializes the model configuration and the model itself using the provided model_id.


After these steps, we have our LLAMA 2 model loaded and ready for subsequent question-answering tasks. Remember, this baseline will give us a performance benchmark that we can compare against RAG to evaluate the potential improvements. Please note that running this cell may take several minutes.
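
To get a feel for why quantization matters, the memory needed for the weights alone can be estimated as parameters × bits ÷ 8. The sketch below applies this to a nominal 13-billion-parameter model. It is back-of-the-envelope arithmetic under our own simplifying assumptions: it ignores quantization constants, activations, and the generation cache, so real usage will be higher.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 13e9  # nominal parameter count of a "13B" model

fp16_gb = weight_memory_gb(n_params, 16)
int4_gb = weight_memory_gb(n_params, 4)
print(f"16-bit weights: {fp16_gb:.1f} GB, 4-bit weights: {int4_gb:.1f} GB")
```

Under these assumptions, going from 16-bit to 4-bit storage cuts the weight footprint by a factor of four, which is what makes a model of this size fit on a single Colab-class GPU.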

If you run into issues with this cell, you may need to request access to the model first (usually approved very quickly): https://huggingface.co/meta-llama/Llama-2-7b-chat-hf

In [11]:
# Loading the LLAMA2 model from HuggingFace

model_id = 'meta-llama/Llama-2-13b-chat-hf'
device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

# begin initializing HF items, need auth token for these
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    use_auth_token=HF_TOKEN
)

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
    use_auth_token=HF_TOKEN
)
model.eval()
print(f"Model loaded on {device}")

Model loaded on cuda:0


A crucial step in working with language models is to convert the textual input into a format that the model can understand. This process is known as **tokenization**. Essentially, tokenization breaks down text into smaller pieces, commonly called tokens. These tokens are then mapped to unique integers, allowing them to be processed by the model. Please refer to this [tutorial](https://medium.com/@fhirfly/understanding-tokens-in-the-context-of-large-language-models-like-bert-and-t5-8aa0db90ef39) to learn more about tokenization.

The provided code initializes a tokenizer associated with our LLAMA 2 model. The `AutoTokenizer.from_pretrained` method from the HuggingFace library simplifies this process by fetching the appropriate tokenizer based on the given `model_id`.

Once initialized, this tokenizer can be used to convert questions, paragraphs, or any textual data into a format suitable for our model. Understanding and using a tokenizer correctly is vital, as it ensures that our model receives data in a format it's trained to understand.

For a deep dive into tokenization and its importance, you can refer to the HuggingFace documentation on tokenizers.
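
The idea of mapping text to integers can be illustrated with a toy tokenizer. This is a deliberately simplified sketch: it splits on whitespace and the class name `ToyTokenizer` is ours, whereas real LLM tokenizers, including the one loaded below, split text into subword units with a learned vocabulary.

```python
from typing import Dict, List


class ToyTokenizer:
    """Whitespace tokenizer that assigns each new word the next free integer ID."""

    def __init__(self) -> None:
        self.token_to_id: Dict[str, int] = {"<unk>": 0}
        self.id_to_token: Dict[int, str] = {0: "<unk>"}

    def fit(self, text: str) -> None:
        for token in text.lower().split():
            if token not in self.token_to_id:
                new_id = len(self.token_to_id)
                self.token_to_id[token] = new_id
                self.id_to_token[new_id] = token

    def encode(self, text: str) -> List[int]:
        # Words never seen during fit() map to the <unk> ID
        return [self.token_to_id.get(t, 0) for t in text.lower().split()]

    def decode(self, ids: List[int]) -> str:
        return " ".join(self.id_to_token[i] for i in ids)


toy = ToyTokenizer()
toy.fit("was volta taught in public schools")
toy_ids = toy.encode("Was Volta taught in private schools")
print(toy_ids)
print(toy.decode(toy_ids))
```

The unseen word "private" falls back to the unknown token, which hints at why subword vocabularies are preferred in practice: they can represent new words by combining smaller known pieces.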

In [12]:
# Setup a tokenizer

tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id,
    use_auth_token=HF_TOKEN
)



#### Language Model Pipeline for Text Generation
Creating an efficient pipeline is fundamental for streamlined model inference. The provided code sets up a text generation pipeline using the loaded LLAMA 2 model and its corresponding tokenizer.

Here's a brief overview of the key components:

*    Pipeline: A utility from HuggingFace, which simplifies the process of input processing, model inference, and output post-processing into a single callable function.

*    Important parameters:

  *    `temperature`: It influences the randomness of the model's output. A value closer to 0 makes the output more deterministic (and hence, less diverse), while a value closer to 1 allows for more variation.
  *    `max_new_tokens`: This limits the length of the generated response, ensuring the model doesn't produce overly long outputs.
  *    `repetition_penalty`: A mechanism to prevent the model from repeating phrases or words in its response. A value greater than 1 penalizes repeated content.


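The final statement caters to a practical concern: enabling batching for inference. It reuses the end-of-sequence token as the padding token, so that inputs of different lengths can be padded to a common length and processed together, which makes better use of the GPU.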

Setting up such a pipeline is invaluable, especially when wanting to interact with the model iteratively or for batch processing, as it encapsulates the complexity of the underlying processes.

For more details on HuggingFace's pipeline utility, visit this [link](https://huggingface.co/docs/transformers/main_classes/pipelines).
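
To make the effect of `temperature` concrete, here is a minimal sketch of a temperature-scaled softmax, the usual way a sampling temperature is applied to a model's token scores. The scores are made up and the function name is ours; this illustrates the idea and is not the code path inside the `transformers` library.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Turn raw scores into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

At a low temperature almost all probability mass lands on the top-scoring token, so sampling becomes nearly deterministic. Higher temperatures flatten the distribution and allow more varied outputs.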

In [13]:
# LLM pipeline

generator = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
    task='text-generation',
    temperature=0.1,  # 'randomness' of outputs; lower values are more deterministic
    max_new_tokens=512,  # max number of tokens to generate in the output
    repetition_penalty=1.1  # without this output begins repeating
)

# Necessary to enable batching for inference
generator.tokenizer.pad_token_id = generator.model.config.eos_token_id

Now, let's ask LLAMA2 one of the questions in our dataset and see how it responds.

In [18]:
# Simple LLM inference with HuggingFace

question = df.iloc[2]['Question']

prompt=f"{question}\nAnswer in a short sentence, without explaining too much."
print(f'Prompt:\n```{prompt}```')

res = generator(prompt)
print(f'\nResponse:\n```{res[0]["generated_text"]}```')

Prompt:
```Was Alessandro Volta taught in public schools?
Answer in a short sentence, without explaining too much.```

Response:
```

The answer is: No, Alessandro Volta was not taught in public schools.```


Now let's do something fun! Did the model give us the right answer? No! It said the exact opposite of the reference answer shown in the next cell. What you see here is a clear example of LLM hallucination: the model confidently returns an answer that has no roots in reality.

In [19]:
df.iloc[2]['Answer']

'Volta was taught in public schools.'

It's worth noting that you could also use the "Langchain" library to do this simple Q/A with the model. However, we will see the real merit of Langchain in the next section, when we use it for doing RAG with our model.

In [20]:
# Simple LLM inference with langchain

lc_generator = transformers.pipeline(
    model=model, tokenizer=tokenizer,
    return_full_text=True, # This should be true for langchain
    task='text-generation',
    temperature=0.1,  # 'randomness' of outputs; lower values are more deterministic
    max_new_tokens=512,  # max number of tokens to generate in the output
    repetition_penalty=1.1  # without this output begins repeating
)

llm = langchain.llms.HuggingFacePipeline(pipeline=lc_generator)
llm(prompt)

'\n\nThe answer is: No, Alessandro Volta was not taught in public schools.'

## Understanding RAG: A Simplified Overview

The Retrieval-Augmented Generation (RAG) framework offers a blend of traditional large language models and external knowledge retrieval, making it especially beneficial for specialized tasks. Let's simplify the process with a general overview without diving into a specific domain.

Imagine a vast digital library filled with books on a variety of subjects. Now, think of the RAG system as a librarian with an impeccable memory. When you ask this librarian a question, rather than relying solely on memory, she looks up relevant information from the library to provide a comprehensive and precise answer.

Here’s a step-by-step breakdown of the process:

1.  **Embedding Knowledge**: Initially, the RAG system scans all the books (or documents) in the library, understanding their content, and converting each page into a digital fingerprint, or "vector". These vectors are stored in a special digital catalog.

2.  **Question Analysis**: Now, when you ask the librarian a question, she instantly translates your question into a similar digital fingerprint to know what to look for in the catalog.

3.  **Finding Relevant Information**: Using the fingerprint of your question, the librarian quickly searches the catalog to find the pages (or chunks of data) most closely related to your query, as if comparing the similarities between the patterns of two fingerprints.

4.  **Crafting the Response**: With the relevant pages in hand, the librarian now composes a well-informed answer, ensuring it's based on the information from the library. This answer is not just from memory but is augmented by the recent information she retrieved.

5.  **Delivering the Answer**: The librarian then shares her comprehensive response with you.

At the heart of this process is the digital catalog (vector database). It ensures that the RAG system provides answers grounded in the information it has been provided, ensuring accuracy and relevance. This approach is particularly beneficial for scenarios where a system needs to tap into specific, up-to-date, or domain-relevant data to answer queries effectively.
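
The librarian's steps can be sketched end to end in plain Python. This is a toy illustration under loud assumptions: the "fingerprint" is a simple word-count vector rather than a learned embedding, the "catalog" is a Python list rather than a vector database, the two passages are short snippets adapted from the dataset, and all function names are ours.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy 'fingerprint': a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, index: List[Tuple[str, Counter]], k: int = 1) -> List[str]:
    """Return the k passages whose vectors are most similar to the question's."""
    q_vec = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, passages: List[str]) -> str:
    """Place the retrieved passages in front of the question."""
    context = "\n".join(passages)
    return f"Use only this context to answer.\n\nContext:\n{context}\n\nQuestion: {question}"


# Step 1: embed the library
toy_passages = [
    "Volta was taught in public schools.",
    "Turkish is spoken as a first language by over 63 million people.",
]
toy_index = [(p, embed(p)) for p in toy_passages]

# Steps 2-4: embed the question, retrieve, and assemble the augmented prompt
toy_question = "Was Alessandro Volta taught in public schools?"
toy_top = retrieve(toy_question, toy_index, k=1)
toy_prompt = build_prompt(toy_question, toy_top)
print(toy_prompt)
```

In a real system the assembled prompt would then be passed to the LLM (step 5), which now has the relevant passage in front of it instead of relying on memory alone.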

The following figure summarizes question answering with LLMs using the RAG methodology:
<img src="https://i.ibb.co/5GchbqR/RAG.jpg" alt="RAG" border="0">

## RAG question-answering with LLAMA2

#### Loading the texts into Langchain documents

To apply the RAG methodology using the LangChain library, we first need to preprocess our data using some of the functionalities that this library provides.

To do so:

1.  We first initialize an empty list full_docs to store our processed documents.
2.  For each source in our text_records (the Wikipedia article texts collected earlier), we create metadata to tag the source of each document.
3.  We then create a Document object from the LangChain docstore module using the text content and its corresponding metadata. This structured format helps in better management and querying later on.
4.  Finally, we print out the total number of processed documents and provide a sample document as a sanity check.


Using a structured approach to load documents ensures that we can efficiently retrieve and use them in subsequent steps, especially when we implement Retrieval Augmented Generation.
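
Conceptually, a document in this setting is just a piece of text bundled with a metadata dictionary. The dataclass below is a hypothetical stand-in written for illustration, not LangChain's own class; it only mirrors the two fields used in this notebook.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyDocument:
    """Hypothetical stand-in for a document record: text plus metadata."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


toy_records = {"a1": "First article text.", "a2": "Second article text."}
toy_docs: List[ToyDocument] = [
    ToyDocument(page_content=text, metadata={"source": source})
    for source, text in toy_records.items()
]

# Metadata travels with the text, so a retrieved passage can be traced to its source
print(toy_docs[0].metadata["source"])
```

Keeping the source alongside the text is what later allows an answer to be traced back to the article it came from.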

In [21]:
# Loading text files as LangChain documents

full_docs: List = list()
for source in text_records:
  metadata = {'source': source}
  text = text_records[source]
  full_docs.append(
      langchain.docstore.document.Document(
          page_content=text, metadata=metadata
          )
  )

print(f'Number of documents: {len(full_docs)}')
print(f'Documents: {full_docs[0]}')


Number of documents: 150
Documents: page_content='Turkish_language\n\n\n\nTurkish (  IPA  ) is spoken as a first language by over 63 million people worldwide,    making it the most commonly spoken of the Turkic languages.  Its speakers are located predominantly in Turkey and Cyprus, with smaller groups in Iraq, Greece, Bulgaria, the Republic of Macedonia, Kosovo, Albania and other parts of Eastern Europe. Turkish is also spoken by several million immigrants in Western Europe, particularly in Germany.\n\nThe roots of the language can be traced to Central Asia, with the first written records dating back nearly 1,200 years. To the west, the influence of Ottoman Turkish—the variety of the Turkish language that was used as the administrative and literary language of the Ottoman Empire—spread as the Ottoman Empire expanded. In 1928, as one of Atatürk\'s Reforms in the early years of the Republic of Turkey, the Ottoman script was replaced with a phonetic variant of the Latin alphabet. Concurr

#### Chunking the texts

To facilitate efficient document retrieval, especially when dealing with large text files, it's often beneficial to divide these documents into manageable "**chunks**". This allows for faster indexing, storage, and retrieval, which is paramount in real-time applications like RAG.

The chunk_docs function:

*  Accepts a list of docs, the desired chunk_size, and an optional chunk_overlap as inputs.
*  Utilizes the RecursiveCharacterTextSplitter from the LangChain library to split each document into smaller text chunks. The chunk_size determines the length of each chunk, while chunk_overlap ensures a certain overlap between adjacent chunks for context preservation.
*  The split documents (chunks) are returned.
*  After defining the function, we call it on our full_docs and store the result in chunked_docs. We then print the total number of chunks and provide a sample chunk.

Dividing documents into chunks can be particularly advantageous in long texts like the Wikipedia pages we have here, where pertinent information may span a few sections, and context preservation is crucial.
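
The roles of the chunk size and the chunk overlap can be seen in a simplified sliding-window splitter. This is a sketch of the general idea only: the function name is ours, and unlike this fixed-window version, `RecursiveCharacterTextSplitter` also tries to split on natural separators such as paragraph breaks.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into fixed-size windows where neighbours share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


toy_text = "abcdefghijklmnopqrstuvwxyz"
toy_chunks = sliding_chunks(toy_text, chunk_size=10, chunk_overlap=3)
print(toy_chunks)
```

Each chunk begins with the last three characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.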

> **Question**: Why do we need to leave some overlap between the chunks we generate?

In [22]:
# Chunking documents

def chunk_docs(
    docs: List, chunk_size: int = 1500, chunk_overlap: int = 200
) -> List:
    """Split each document into chunks, based on the chunk size and chunk_overlap.

    Args:
        docs (List[Document]): list of input docs (here, the full article texts).
        chunk_size (int, optional): chunk size. Defaults to 1500.
        chunk_overlap (int, optional): chunk overlap. Defaults to 200.

    Returns:
        List[Document]: list of chunks as Documents.
    """
    text_splitter = langchain.text_splitter.RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap
    )
    docs = text_splitter.split_documents(docs)

    return docs

chunked_docs = chunk_docs(full_docs)

print(f'Number of chunks: {len(chunked_docs)}')
print(f'Chunks: {chunked_docs[0]}')

Number of chunks: 3986
Chunks: page_content="Turkish_language\n\n\n\nTurkish (  IPA  ) is spoken as a first language by over 63 million people worldwide,    making it the most commonly spoken of the Turkic languages.  Its speakers are located predominantly in Turkey and Cyprus, with smaller groups in Iraq, Greece, Bulgaria, the Republic of Macedonia, Kosovo, Albania and other parts of Eastern Europe. Turkish is also spoken by several million immigrants in Western Europe, particularly in Germany.\n\nThe roots of the language can be traced to Central Asia, with the first written records dating back nearly 1,200 years. To the west, the influence of Ottoman Turkish—the variety of the Turkish language that was used as the administrative and literary language of the Ottoman Empire—spread as the Ottoman Empire expanded. In 1928, as one of Atatürk's Reforms in the early years of the Republic of Turkey, the Ottoman script was replaced with a phonetic variant of the Latin alphabet. Concurrently,

#### Setting Up the Embedding Model
Embeddings play a pivotal role in retrieval tasks. They transform our textual data into numerical vectors in a high-dimensional space, where semantically similar documents are closer to each other. This allows for efficient searching and matching of related content.

Here's a breakdown of the code:

1.  We specify our embedding model ID: `sentence-transformers/all-MiniLM-L6-v2`. This model converts sentences into embeddings that capture their meaning.
2.  We then initialize our embedding model using LangChain's `HuggingFaceEmbeddings` class. This class wraps the Hugging Face sentence-transformers library, which gives easy access to many pre-trained embedding models.
  *  `model_name`: The identifier of our chosen embedding model.
  *  `model_kwargs`: Additional model-specific arguments. Here, we specify the device (such as a GPU) on which our model will run.
  *  `encode_kwargs`: Encoding-specific parameters, such as the device and batch size for processing.


By setting up an efficient embedding model, we lay the groundwork for the retrieval part of the RAG process, ensuring that our model can rapidly identify and fetch relevant document chunks or passages.

In [23]:
# Setup the embedding model

embed_model_id = 'sentence-transformers/all-MiniLM-L6-v2'
embed_model = langchain.embeddings.huggingface.HuggingFaceEmbeddings(
    model_name=embed_model_id,
    model_kwargs={'device': device},
    encode_kwargs={'device': device, 'batch_size': 32}
)

Before diving deep into RAG with large datasets, it's always good to ensure that our embedding model works as expected. This segment provides a demonstration using a simple list of sample texts.

In [24]:
# Demonstrate the embed_model performance

sample_texts = [
    'This is sample text 1.',
    'This is sample text 2.',
    'This is sample text 3.',
    'This is sample text 4.',
    'This is sample text 5.',
]

embeddings = embed_model.embed_documents(sample_texts)

print(f"We have {len(embeddings)} doc embeddings, each with "
      f"a dimensionality of {len(embeddings[0])}.")

We have 5 doc embeddings, each with a dimensionality of 384.


For retrieval tasks, it's not enough just to create embeddings; we also need an efficient storage and retrieval system for these vector representations. In this section, we set up a vector store using LangChain's integration with the **ChromaDB** vector store and populate it with our document embeddings.

ChromaDB is an open-source vector database, available as a Python library, for storing and retrieving vector embeddings. With ChromaDB, we can store embeddings along with their metadata and run similarity (semantic) searches over them, so that the retrieved content can later be used by large language models. It can also be used to build semantic search engines over text data.

In the code below, we create a `vector_db` using LangChain's `Chroma.from_documents` method:
*  `documents`: The chunks we previously created are passed to be converted into embeddings.
*  `embedding`: The embedding model we set up earlier, used to convert each chunk into its vector representation.
*  `persist_directory`: Specifies a directory where the vector embeddings will be persisted for future usage. This is essential for large-scale applications to avoid recalculating embeddings.


We then call the `persist()` method on vector_db to ensure that all vector embeddings are saved to the specified directory.


Storing our embeddings in a structured and efficient vector store, like Chroma, ensures rapid retrieval during real-time tasks, a fundamental requirement in the RAG process.

In [25]:
# Setup a vector store and load it with all vector embeddings

vector_db = langchain.vectorstores.Chroma.from_documents(
    documents=chunked_docs,
    embedding=embed_model,
    persist_directory='./chroma_vectors',
)

vector_db.persist()

Now that our vector store is populated with embeddings, let's demonstrate the retrieval process. The idea is to query the vector store to find the most semantically relevant document chunks based on our query. But how can we do that?

**The magic of cosine similarity**:

Cosine similarity is a metric that measures the cosine of the angle between two vectors. It is often used to compute the similarity between word vectors, indicating how similar two words are in terms of their usage or meaning. In the context of Retrieval-Augmented Generation (RAG) with Large Language Models (LLMs), cosine similarity can be used to compute the similarity between embeddings or vectors representing different pieces of text. By comparing the cosine similarity between these embeddings, we can identify how similar or related they are to each other.

So what happens behind the scenes when we query our vector database is that the query is embedded with the same embedding model, the cosine similarity between the query vector and the stored vectors is computed, and the vectors with the highest similarity are returned. These vectors belong to the text chunks that are most likely to be similar, in terms of content, to our question.
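
The idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the three-dimensional vectors and chunk labels are made up (real embeddings here have 384 dimensions), the helper names `cosine_similarity` and `top_k` are hypothetical, and the brute-force scan over every vector is for clarity. Real vector stores generally rely on an index and a configurable distance metric rather than this exact loop.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    stored: List[Tuple[str, Sequence[float]]],
    k: int,
) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(label, cosine_similarity(query_vec, vec)) for label, vec in stored]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up toy vectors standing in for chunk embeddings.
toy_store = [
    ("chunk about Saint Petersburg", [0.9, 0.1, 0.0]),
    ("chunk about pianos", [0.0, 0.2, 0.9]),
    ("chunk about Moscow", [0.8, 0.3, 0.1]),
]
toy_query = [1.0, 0.2, 0.0]  # stand-in for the embedded question

results = top_k(toy_query, toy_store, k=2)
for label, score in results:
    print(f"{score:.3f}  {label}")
```

Vectors that point in nearly the same direction as the query score close to 1, while unrelated ones score close to 0, which is why the two city chunks rank above the piano chunk in this toy example.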

Breakdown of the retrieval code in the cell below:

1.  We define a sample query: "what city was the capital of Russia?"
2.  We use the `similarity_search` method of our `vector_db` to find the most relevant chunks:
  *  `query`: The search query for which we want to retrieve relevant chunks.
  *  `k`: The number of top relevant chunks we want to retrieve. Here, we've chosen to retrieve the top 10 chunks.


By demonstrating the retrieval process, we illustrate how the model identifies and fetches relevant information, a critical step before generating a comprehensive answer in the RAG approach.

>**Question**: Look at the returned chunks. Do they look relevant to the question we asked?

In [26]:
# Demonstrating how the retriever works

query = 'what city was the capital of Russia?'

vector_db.similarity_search(
    query,  # the search query
    k=10  # return the 10 most relevant chunks of text
)

[Document(page_content="Saint_Petersburg\n\n\n\nSaint Petersburg ( ,  ) is a city and a federal subject (a federal city) of Russia located on the Neva River at the head of the Gulf of Finland on the Baltic Sea. The city's other names were Petrograd ( , 1914–1924) and Leningrad ( , 1924–1991). It is often called just Petersburg ( ) and is informally known as Piter (   ).\n\nFounded by Tsar Peter I of Russia on 27 May 1703, it was the capital of the Russian Empire for more than two hundred years (1713–1728, 1732–1918). Saint Petersburg ceased being the capital in 1918 after the Russian Revolution of 1917. Nicholas and Alexandra: An Intimate Account of the Last of the Romanovs and the Fall of Imperial Russia (Athenum, 1967) by Robert K. Massie, ASIN B000CGP8M2 (also, Ballantine Books, 2000, ISBN 0-345-43831-0 and Black Dog & Leventhal Publishers, 2005, ISBN 1-57912-433-X)  It is Russia's second largest city after Moscow with 4.6 million inhabitants, and over 6 million people live in its v

#### Setting up a RAG pipeline

Now that we have our vector database set up, let's put together a RAG pipeline using LangChain. Note that this pipeline works like the generator pipeline we created above, but every answer is now produced with retrieval: the most relevant chunks are first fetched from the vector database and passed to the model as context, so the responses are much more likely to be grounded in those texts.

The `rag_pipeline` below is built from our LLM and a retriever backed by our vector database.

In [27]:
k = 5  # Number of documents to retrieve

rag_pipeline = langchain.chains.RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vector_db.as_retriever(search_kwargs={"k": k})
)
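
Conceptually, a retrieval QA pipeline of this kind does roughly three things: retrieve the top `k` chunks, place them in a prompt next to the question, and ask the LLM to answer from that prompt. The sketch below shows that flow with stand-in components. Everything in it is an assumption for illustration: the prompt wording, the helper `answer_with_context`, the placeholder chunks, and the fake retriever and generator are hypothetical and are not LangChain's actual implementation.

```python
from typing import Callable, List

# Hypothetical prompt wording, not the template used by LangChain.
PROMPT_TEMPLATE = (
    "Use the following context to answer the question. "
    "If the context does not contain the answer, say that you do not know.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def answer_with_context(
    question: str,
    retrieve: Callable[[str, int], List[str]],
    generate: Callable[[str], str],
    k: int = 5,
) -> str:
    """Retrieve k chunks, put them into the prompt, and let the model answer."""
    chunks = retrieve(question, k)        # retrieval phase
    context = "\n\n".join(chunks)         # combine the chunks into one context
    prompt = PROMPT_TEMPLATE.format(context=context, question=question)
    return generate(prompt)               # generation phase


# Stand-ins so the sketch runs without a vector store or an LLM.
fake_chunks = [
    "Placeholder chunk A returned by the vector store.",
    "Placeholder chunk B returned by the vector store.",
    "Placeholder chunk C returned by the vector store.",
]


def fake_retrieve(question: str, k: int) -> List[str]:
    return fake_chunks[:k]


def echo_generate(prompt: str) -> str:
    return prompt  # a real LLM would return an answer instead of the prompt


result = answer_with_context(
    "Was Alessandro Volta taught in public schools?",
    fake_retrieve,
    echo_generate,
    k=2,
)
print(result)
```

Printing the prompt makes it clear that the model never searches the database itself; it only sees whichever chunks the retriever selected, so the quality of the answer depends on the quality of the retrieval.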

OK, now let's ask our LLM the same question about Alessandro Volta's public education as we did above, but this time using the RAG pipeline. It seems that our model is now able to return the correct answer!

In [28]:
rag_pipeline(df.iloc[2]['Question'])

{'query': 'Was Alessandro Volta taught in public schools?',
 'result': ' Yes, Alessandro Volta was taught in public schools. According to the text, he was taught in the public schools of Como, Italy.'}

This brings us to the end of this notebook. However, please feel free to play with the following two code cells, where you can ask our LLM the same question with and without RAG. Do you notice the differences? What if you ask a question that is not answered anywhere in our dataset? Which pipeline do you expect to do better in that situation?

In [29]:
# Basic Q/A with LLM for a sample query

query= "In what years, the Mozart-era piano underwent tremendous changes that led to the modern form of the instrument, and did Mozard and Alexandro ever meet?"

llm(query)


'\n\nAnswer: The Mozart-era piano underwent significant changes during the late 18th century, particularly in the 1790s. This period saw the development of the "Viennese" style of piano construction, which featured a more compact, lightweight design with a stronger and more responsive action. This new design allowed for greater expressiveness and technical facility, and it became the standard for pianos throughout Europe.\n\nAs for whether Mozart and Alexandro ever met, there is no historical evidence to suggest that they did. While Mozart traveled extensively throughout Europe during his lifetime, there is no record of him visiting Italy or meeting any Italian composers or musicians, including Alexandro. It\'s possible that Alexandro may have been influenced by Mozart\'s music, but there is no direct connection between the two.'

In [30]:
# RAG Q/A with LLM for the same query

rag_pipeline(query)

{'query': 'In what years, the Mozart-era piano underwent tremendous changes that led to the modern form of the instrument, and did Mozard and Alexandro ever meet?',
 'result': ' The Mozart-era piano underwent tremendous changes between 1790 and 1860. There is no record of Mozart and Alexandro ever meeting.'}

Thank you for reading this notebook. We hope you have found it useful!