# Retrieval-Augmented Generation (RAG) with open-source Hugging Face LLMs using LangChain

![RAG pic](pictures/RAG.png)

## Introduction

**Retrieval-Augmented Generation (RAG)** is an approach in natural language processing (NLP) that enhances the capabilities of generative models by integrating external knowledge retrieval into the generation process. This technique aims to improve the quality, relevance, and factual accuracy of the generated text by allowing the model to dynamically access and incorporate information from a large corpus of documents or databases during the generation task. The process involves two key components: a retrieval system and a generative model.

**Working Mechanism**

The working mechanism of RAG typically involves the following steps:

- Query Formation: The system formulates a query based on the initial input or prompt. This query is designed to retrieve information that is likely to be relevant to generating the desired output.

- Information Retrieval: The formulated query is used to fetch relevant information from an external database or knowledge base. The retrieval system may return one or more documents, passages, or data entries that match the query.

- Content Integration: The retrieved information, along with the original input, is provided to the generative model. The model then integrates this information to produce a coherent and contextually enriched output.

- Generation: The generative model synthesizes the final text, taking into account both the input and the retrieved external information. This step ensures that the output is not only relevant and informative but also maintains a natural and fluent language style.
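The four steps above can be sketched end to end in plain Python. This is a toy illustration, not the pipeline built later in this notebook: the three-passage knowledge base is hypothetical, retrieval uses bag-of-words cosine similarity instead of learned embeddings, and `generate` is a stand-in for a real LLM call.

```python
import math
import re
from collections import Counter

# Hypothetical three-passage knowledge base, used only for illustration.
KNOWLEDGE_BASE = [
    "FAISS is a library for similarity search over dense vectors.",
    "Median household income is reported by the U.S. Census Bureau.",
    "LangChain provides building blocks for LLM applications.",
]

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, passages, k=1):
    """Query formation and retrieval: rank passages by similarity to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(passages, key=lambda p: cosine(q, Counter(tokenize(p))), reverse=True)
    return ranked[:k]

def build_prompt(query, passages):
    """Content integration: place the retrieved passages next to the original input."""
    context = "\n".join(passages)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

def generate(prompt):
    """Generation: stand-in for the LLM call; a real model would complete the prompt."""
    return f"[model output for a {len(prompt)}-character prompt]"

question = "Who reports median household income?"
top = retrieve(question, KNOWLEDGE_BASE)
print(generate(build_prompt(question, top)))
```

The rest of this notebook replaces each toy piece with a real component: an embedding model and a vector store for retrieval, and an open-source LLM for generation.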

Let's get started!

## Library installation
- Create a virtual environment and install the necessary Python libraries (`pypdf` is required by the PDF loader used below):
- `pip install transformers sentence-transformers langchain torch faiss-cpu numpy pypdf`

## Library configuration

In [1]:
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain_core.vectorstores import VectorStoreRetriever
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, pipeline
from langchain import HuggingFacePipeline
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
import os
from urllib.request import urlretrieve
import numpy as np



## Document preparation
**We are going to download 4 publications from the United States Census Bureau on the following topics:**
- Occupation, Earnings, and Job Characteristics: July 2022
- Household Income in States and Metropolitan Areas: 2022
- Poverty in States and Metropolitan Areas: 2022
- Health Insurance Coverage Status and Type by Geography: 2021 and 2022

We prepare these documents for the LLM to use as a knowledge base.

In [2]:
# Download documents from U.S. Census Bureau to local directory.
os.makedirs("us_census", exist_ok=True)
files = [
    "https://www.census.gov/content/dam/Census/library/publications/2022/demo/p70-178.pdf",
    "https://www.census.gov/content/dam/Census/library/publications/2023/acs/acsbr-017.pdf",
    "https://www.census.gov/content/dam/Census/library/publications/2023/acs/acsbr-016.pdf",
    "https://www.census.gov/content/dam/Census/library/publications/2023/acs/acsbr-015.pdf",
]
for url in files:
    file_path = os.path.join("us_census", url.rpartition("/")[2])
    urlretrieve(url, file_path)

**Split documents to smaller chunks** 

Chunks should be:
- large enough to contain enough information to answer a question,
- small enough to fit into the LLM prompt (Mistral-7B-v0.1 input is limited to 4096 tokens), and
- small enough to fit into the embedding model (all-mpnet-base-v2 input is limited to 384 tokens, roughly 1500 characters at about 4 characters per token).

For this project, we split the documents into chunks of roughly 500 characters with an overlap of 50 characters.
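To make the chunk size and overlap concrete, here is a minimal sketch of fixed-window splitting in plain Python. It is a simplification: `RecursiveCharacterTextSplitter` prefers to cut at separators such as paragraph and line breaks, so its chunks are usually shorter than the limit, whereas the hypothetical `split_with_overlap` helper below cuts at exact character offsets. `approx_tokens` applies the 4-characters-per-token rule of thumb.

```python
def split_with_overlap(text, chunk_size=500, chunk_overlap=50):
    """Cut text into fixed-size character windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

def approx_tokens(num_chars, chars_per_token=4):
    """Rule-of-thumb token estimate (about 4 characters per token)."""
    return num_chars // chars_per_token

# Synthetic 1200-character text, used only to show the window arithmetic.
sample = "".join(chr(97 + i % 26) for i in range(1200))
sample_chunks = split_with_overlap(sample)
print([len(c) for c in sample_chunks])
print(approx_tokens(500))
```

By this estimate, a 500-character chunk is about 125 tokens, comfortably below the 384-token limit of the embedding model.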

In [3]:
# Load pdf files in the local directory
loader = PyPDFDirectoryLoader("./us_census/")

docs_before_split = loader.load()
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap  = 50,
)
docs_after_split = text_splitter.split_documents(docs_before_split)

In [4]:
docs_after_split[0]

Document(page_content='Health Insurance Coverage Status and Type \nby Geography: 2021 and 2022\nAmerican Community Survey Briefs\nACSBR-015Issued September 2023Douglas Conway and Breauna Branch\nINTRODUCTION\nDemographic shifts as well as economic and govern-\nment policy changes can affect people’s access to health coverage. For example, between 2021 and 2022, the labor market continued to improve, which may have affected private coverage in the United States \nduring that time.\n1 Public policy changes included', metadata={'source': 'us_census/acsbr-015.pdf', 'page': 0})

In [5]:
avg_doc_length = lambda docs: sum([len(doc.page_content) for doc in docs])//len(docs)
avg_char_before_split = avg_doc_length(docs_before_split)
avg_char_after_split = avg_doc_length(docs_after_split)

print(f'Before split, there were {len(docs_before_split)} documents loaded, with average characters equal to {avg_char_before_split}.')
print(f'After split, there were {len(docs_after_split)} documents (chunks), with average characters equal to {avg_char_after_split} (average chunk length).')

Before split, there were 63 documents loaded, with average characters equal to 3830.
After split, there were 576 documents (chunks), with average characters equal to 435 (average chunk length).


## Text Embeddings with Hugging Face Embedding Models
At the time of writing, there are 213 text embedding models for English on the [Massive Text Embedding Benchmark (MTEB) leaderboard](https://huggingface.co/spaces/mteb/leaderboard). For our project, we use LangChain's HuggingFaceEmbeddings, which only supports **sentence-transformers** embedding models. At the time of writing, the best sentence-transformers embedding model on MTEB is **all-mpnet-base-v2** (max sequence length: 384 tokens, dimensions: 768, size: 420MB). Another sentence-transformers embedding model, **all-MiniLM-L6-v2** (max sequence length: 256 tokens, dimensions: 384, size: 80MB), offers similar quality and runs about 5 times faster.
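The embedding dimension also determines how much memory the stored vectors need. The sketch below is back-of-the-envelope arithmetic, assuming each value is stored as a 4-byte float and ignoring any overhead added by the vector store; `index_size_bytes` is a hypothetical helper, and 576 is the chunk count reported after splitting.

```python
def index_size_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Raw storage for num_vectors embeddings, ignoring index overhead."""
    return num_vectors * dimensions * bytes_per_value

NUM_CHUNKS = 576  # chunk count reported after splitting the census PDFs

for name, dims in [("all-mpnet-base-v2", 768), ("all-MiniLM-L6-v2", 384)]:
    size_mib = index_size_bytes(NUM_CHUNKS, dims) / 2**20
    print(f"{name}: {size_mib:.2f} MiB")
```

For a corpus of this size either model fits easily in memory; the difference becomes relevant only with millions of chunks.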

In [6]:
huggingface_embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2",  # alternatively, use "sentence-transformers/all-MiniLM-L6-v2" for a lighter and faster experience.
    model_kwargs={'device':'cpu'}, 
    encode_kwargs={'normalize_embeddings': False}
)

Now we can see what a sample embedding looks like for one of those chunks.

In [7]:
sample_embedding = np.array(huggingface_embeddings.embed_query(docs_after_split[0].page_content))
print("Sample embedding of a document chunk: ", sample_embedding)
print("Size of the embedding: ", sample_embedding.shape)

Sample embedding of a document chunk:  [-4.33103070e-02  9.73282605e-02 -1.74221173e-02 -1.16056398e-01
 -1.40246563e-03  2.92262491e-02  4.38641682e-02  4.99014333e-02
  8.32912177e-02 -4.58104117e-03  7.65514746e-02  2.53201351e-02
  1.26149943e-02  5.29901013e-02  3.35386060e-02 -4.15676124e-02
 -1.62367355e-02  5.84275201e-02 -4.08476032e-03  5.90218976e-03
 -7.34952539e-02  1.56117259e-02 -4.89762537e-02  3.38563621e-02
  3.42362896e-02  9.21768323e-03  2.33672373e-02 -1.53755620e-02
 -1.24781756e-02 -4.80748639e-02  6.90151379e-02 -2.54305564e-02
 -2.47002710e-02 -8.32483321e-02  1.91394065e-06 -4.36883047e-02
  4.59129643e-03  4.91034091e-02  1.85815338e-02 -4.82816249e-02
 -1.48528116e-02 -7.85710961e-02 -6.41152114e-02  1.56047996e-02
  2.89277155e-02 -3.18426788e-02 -9.23720840e-03  9.53771267e-03
 -6.15626387e-02  1.08385680e-03 -3.05190459e-02  1.53597463e-02
 -8.13634545e-02 -2.76066065e-02 -1.24553833e-02 -2.58729444e-03
  1.27911975e-03 -1.16974479e-02  3.72260176e-02 -8

## Retrieval System for vector embeddings
Once we have an embedding model, we are ready to vectorize all our documents and store them in a vector store to construct a retrieval system. With purpose-built search algorithms, a retrieval system can run similarity searches efficiently to retrieve relevant documents.

FAISS (Facebook AI Similarity Search) is a library that allows developers to quickly search for embeddings of multimedia documents that are similar to each other. It solves limitations of traditional query search engines that are optimized for hash-based searches, and provides more scalable similarity search functions (nearest-neighbor search implementations).
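Conceptually, a nearest-neighbor search compares the query vector with every stored vector and keeps the closest ones. The sketch below does this exhaustively in plain Python using squared Euclidean (L2) distance. It is only an illustration of the idea: FAISS performs the same kind of comparison with optimized native code and optional approximate indexes, and the 2-dimensional vectors here are hypothetical values chosen so the result is easy to check by eye.

```python
import heapq

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def top_k_l2(query_vec, doc_vectors, k=4):
    """Return the k closest vectors as (distance, index) pairs, nearest first."""
    scored = [(squared_l2(query_vec, vec), idx) for idx, vec in enumerate(doc_vectors)]
    return heapq.nsmallest(k, scored)

# Hypothetical 2-dimensional "embeddings"; real ones have hundreds of dimensions.
doc_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0]]
query_vec = [0.9, 0.1]

nearest = top_k_l2(query_vec, doc_vectors, k=2)
for distance, index in nearest:
    print(f"document {index}: squared distance {distance:.2f}")
```

The exhaustive scan costs one distance computation per stored vector, which is fine for a few hundred chunks and is the reason specialized libraries exist for larger collections.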

In [8]:
vectorstore = FAISS.from_documents(docs_after_split, huggingface_embeddings)

In [9]:
query = """What were the trends in median household income across different states in the United States between 2021 and 2022."""  # Sample question, change to other questions you are interested in.
relevant_documents = vectorstore.similarity_search(query)
print(f'There are {len(relevant_documents)} documents retrieved which are relevant to the query. Display the first one:\n')
print(relevant_documents[0].page_content)

There are 4 documents retrieved which are relevant to the query. Display the first one:

programs-surveys/acs>.
HIGHLIGHTS
• Median household income in the United States was $74,755 in 2022, a decline of 0.8 percent from last year, after adjusting for inflation.
6
• Real median household income increased in five states and decreased in 17 states from 2021 to 2022. Twenty-eight states, the District of Columbia, and 
Puerto Rico showed no statisti-
cally significant differences. 
⁶ All income estimates in this report 
are inflation-adjusted to 2022 dollars.


### Create a retriever interface from the vector store

We will use it later to construct the Q & A chain with LangChain.

In [10]:
# Use similarity searching algorithm and return 3 most relevant documents.
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 3})

**Now we have our vector store and retrieval system ready. We then need a large language model (LLM) to process information and answer the question.**

## Open-source LLMs from Hugging Face
**There are two ways to use Hugging Face LLMs: online and local.**

### Hugging Face Hub
The Hugging Face Hub is an online platform with over 350k models, 75k datasets, and 150k demo apps (Spaces), all open source and publicly available, where people can easily collaborate and build ML together.

- To use it, install the `huggingface_hub` Python package.
- Set an environment variable called `HUGGINGFACEHUB_API_TOKEN` to your Hugging Face access token.
- Currently, the Hugging Face LangChain integration doesn't support the question-answering task, so we can't select Hugging Face QA models for this project. Instead, we select LLMs from the text-generation task category.
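Before calling the Hub, it can help to confirm that the token variable is actually visible to the Python process. The following is a minimal sketch using only the standard library; `hub_token_is_set` is a hypothetical helper, not part of LangChain or `huggingface_hub`.

```python
import os

TOKEN_VAR = "HUGGINGFACEHUB_API_TOKEN"

def hub_token_is_set(env=os.environ):
    """Return True when the access token variable exists and is not blank."""
    return bool(env.get(TOKEN_VAR, "").strip())

if not hub_token_is_set():
    print(f"{TOKEN_VAR} is not set; export it in your shell before calling the Hub.")
```

Keeping the token in the environment, rather than in the notebook itself, avoids committing it to version control by accident.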

In [11]:
# from langchain_community.llms import HuggingFaceHub

# hf = HuggingFaceHub(
#     repo_id="EleutherAI/gpt-neo-2.7B",
#     model_kwargs={"temperature":0.1, "max_length":500})

#query = """What were the trends in median household income across different states in the United States between 2021 and 2022."""  # Sample question, change to other questions you are interested in.
# hf.invoke(query)

The Hugging Face Hub can be slow when you run large models. You can get around this by downloading the model and running it on your local machine. This is how we use the LLM in this project.

### Hugging Face Local Pipelines

Hugging Face models can be run locally through the HuggingFacePipeline class.

- We need to install the `transformers` Python package.
- The Mistral-7B-v0.1 Large Language Model (LLM) is a pretrained generative text model with 7 billion parameters. Mistral-7B-v0.1 outperforms Llama-2-13B on all benchmarks tested. Read the [paper](https://arxiv.org/abs/2310.06825).
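- Mistral-7B-v0.1 has roughly half as many parameters as Llama-2-13B (13 billion parameters, 25GB model size). Its 16-bit weights take about 15GB on disk; a size near 3.5GB applies only to a 4-bit quantized copy.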
- In order to use Llama2, you need to request access from Meta. Mistral-7B-v0.1 is publicly available already.

In [12]:
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline

hf = HuggingFacePipeline.from_model_id(
    model_id="mistralai/Mistral-7B-v0.1",
    task="text-generation",
    pipeline_kwargs={"temperature": 0, "max_new_tokens": 300}
)

llm = hf

Loading checkpoint shards: 100%|██████████████████| 2/2 [00:19<00:00,  9.67s/it]


In [13]:
llm.invoke(query)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


'\n\n## Answer (1)\n\nThe data is available here.\n\nThe data is in the form of a table, so you can\'t use `ggplot2` directly. You can use `tidyverse` to convert the table to a data frame and then plot it.\n\n```\nlibrary(tidyverse)\n\n# Read the data\ndata <- read_csv("https://www.census.gov/data/tables/time-series/demo/income-poverty/historical-income-households/cps-historical-income-households.csv")\n\n# Convert the table to a data frame\ndata <- data %>%\n  as_tibble() %>%\n  select(Year, State, Median_Household_Income) %>%\n  mutate(Year = as.numeric(Year))\n\n# Plot the data\nggplot(data, aes(x = Year, y = Median_Household_Income, color = State)) +\n  geom_line() +\n  labs(x = "Year", y = "Median Household Income")\n```\n\nThis will produce the following plot:\n\nComment: Thank you so much! I\'m still learning R and I\'m not sure how to use the data frame.'

**At a glance, the LLM's output may seem plausible, but it is neither accurate nor factual: it does not even address the question. That is because the model has not been trained on the census data of recent years.**

- OpenAI GPT-3.5 model (for testing purposes only)

In [14]:
# from langchain_openai import ChatOpenAI
# chat = ChatOpenAI(temperature=0)
# chat.invoke(query)
# llm = chat

## Q & A chain 
We now have both the retrieval system for relevant documents and the LLM that will act as the QA chatbot.

We will take our initial query, together with the relevant documents retrieved based on the results of our similarity search, to create a prompt to feed into the LLM. The LLM will take the initial query as the question and relevant documents as the context information to generate a result.
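The prompt assembly described above can be sketched in a few lines of plain Python. This is an illustration of the idea rather than LangChain's implementation: `stuff_documents` and `build_qa_prompt` are hypothetical helpers, the template is a shortened stand-in, and the two chunks are abbreviated from the census text retrieved earlier.

```python
def stuff_documents(chunks, separator="\n\n"):
    """Concatenate retrieved chunk texts into one context string."""
    return separator.join(chunk.strip() for chunk in chunks)

def build_qa_prompt(template, chunks, question):
    """Fill the {context} and {question} placeholders of a prompt template."""
    return template.format(context=stuff_documents(chunks), question=question)

# Hypothetical template and chunks, shortened for illustration.
template = "Use the context to answer.\n\n{context}\n\nQuestion: {question}\n\nHelpful Answer:"
retrieved_chunks = [
    "Median household income in the United States was $74,755 in 2022.",
    "Real median household income increased in five states.",
]
qa_prompt = build_qa_prompt(
    template, retrieved_chunks, "What was the median household income in 2022?"
)
print(qa_prompt)
```

Because every retrieved chunk is pasted into a single prompt, the number and size of chunks must stay within the LLM's input limit.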

Luckily, **LangChain** provides an abstraction of the whole pipeline: **RetrievalQA**.

**Let's first construct a proper prompt for our task.**

Prompt engineering is another crucial factor in an LLM's performance.

In [15]:
prompt_template = """Use the following pieces of context to answer the question at the end. Please follow the following rules:
1. If you don't know the answer, don't try to make up an answer. Just say "I can't find the final answer but you may want to check the following links".
2. If you find the answer, write the answer in a concise way with five sentences maximum.

{context}

Question: {question}

Helpful Answer:
"""

PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

Call LangChain's RetrievalQA with the prompt above. 

In [16]:
retrievalQA = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT}
)

## Use RetrievalQA invoke method to execute the chain
Note that the input to the [invoke method](https://api.python.langchain.com/en/latest/chains/langchain.chains.retrieval_qa.base.RetrievalQA.html#langchain.chains.retrieval_qa.base.RetrievalQA.invoke) must be a dictionary.
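The output is a dictionary as well. With `return_source_documents=True`, it carries the answer under `result` and the retrieved chunks under `source_documents`. The sketch below shows one way to summarize where an answer came from; `unique_sources` is a hypothetical helper, and `fake_result` is a hand-built stand-in shaped like the chain's output, so the snippet runs without loading a model.

```python
from types import SimpleNamespace

def unique_sources(result):
    """Return sorted, de-duplicated (source, page) pairs from a result dictionary."""
    pairs = {
        (doc.metadata["source"], doc.metadata["page"])
        for doc in result["source_documents"]
    }
    return sorted(pairs)

# Hypothetical result; real entries are LangChain Document objects with page_content.
fake_result = {
    "query": "sample question",
    "result": "sample answer",
    "source_documents": [
        SimpleNamespace(metadata={"source": "us_census/acsbr-017.pdf", "page": 1}),
        SimpleNamespace(metadata={"source": "us_census/acsbr-017.pdf", "page": 2}),
        SimpleNamespace(metadata={"source": "us_census/acsbr-017.pdf", "page": 1}),
    ],
}
print(unique_sources(fake_result))
```

Listing the source file and page next to an answer lets a reader verify the claim against the original publication.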

In [17]:
# Call the QA chain with our query.
result = retrievalQA.invoke({"query": query})

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


In [18]:
print(result['result'])


The median household income in the United States was $74,755 in 2022, a decline of 0.8 percent from last year, after adjusting for inflation. Real median household income increased in five states and decreased in 17 states from 2021 to 2022. Twenty-eight states, the District of Columbia, and Puerto Rico showed no statistically significant differences.

The District of Columbia had the highest median household income of all states ($110,000), followed by Maryland ($94,991), New Jersey ($96,346), and Massachusetts ($93,900). The lowest median household income was in Mississippi ($50,000), followed by Louisiana ($50,100), Arkansas ($50,200), and West Virginia ($50,300).

The median household income in the United States was $74,755 in 2022, a decline of 0.8 percent from last year, after adjusting for inflation. Real median household income increased in five states and decreased in 17 states from 2021 to 2022. Twenty-eight states, the District of Columbia, and Puerto Rico showed no statist

In [19]:
relevant_docs = result['source_documents']
print(f'There are {len(relevant_docs)} documents retrieved which are relevant to the query.')
print("*" * 100)
for i, doc in enumerate(relevant_docs):
    print(f"Relevant Document #{i+1}:\nSource file: {doc.metadata['source']}, Page: {doc.metadata['page']}\nContent: {doc.page_content}")
    print("-"*100)

There are 3 documents retrieved which are relevant to the query.
****************************************************************************************************
Relevant Document #1:
Source file: us_census/acsbr-017.pdf, Page: 1
Content: programs-surveys/acs>.
HIGHLIGHTS
• Median household income in the United States was $74,755 in 2022, a decline of 0.8 percent from last year, after adjusting for inflation.
6
• Real median household income increased in five states and decreased in 17 states from 2021 to 2022. Twenty-eight states, the District of Columbia, and 
Puerto Rico showed no statisti-
cally significant differences. 
⁶ All income estimates in this report 
are inflation-adjusted to 2022 dollars.
----------------------------------------------------------------------------------------------------
Relevant Document #2:
Source file: us_census/acsbr-017.pdf, Page: 2
Content: U.S. Census Bureau  3
Table 1.
Median Household Income and Gini Index in the Past 12 Months by State and P

## Conclusion

- Enhanced Accuracy and Relevance: By leveraging external sources, RAG models can generate content that is more accurate, detailed, and relevant to the given context.
- Factuality: It helps in improving the factuality of the generated text, as the information is directly sourced from reliable external databases or knowledge bases.
- Versatility: RAG can be applied to a wide range of NLP tasks, including question answering, text summarization, content creation, and more, enhancing their performance by providing access to a broader range of information.