# Project: Build a QA RAG System on Wikipedia Data

## Install OpenAI and LangChain dependencies

In [None]:
!pip install langchain==0.3.11
!pip install langchain-openai==0.2.12
!pip install langchain-community==0.3.11
!pip install langchain-chroma==0.1.4
!pip install sentence-transformers==2.7.0

## Enter OpenAI API Key

In [None]:
from getpass import getpass

OPENAI_KEY = getpass('Enter Open AI API Key: ')

## Setup Environment Variables

In [None]:
import os

os.environ['OPENAI_API_KEY'] = OPENAI_KEY

### OpenAI Embedding Models

LangChain gives us access to OpenAI's embedding models, including the `text-embedding-3` family: the smaller, highly efficient `text-embedding-3-small` and the larger, more powerful `text-embedding-3-large`.

In [4]:
from langchain_openai import OpenAIEmbeddings

# details here: https://openai.com/blog/new-embedding-models-and-api-updates
openai_embed_model = OpenAIEmbeddings(model='text-embedding-3-small')

## Vector Databases

One of the most common ways to store and search over unstructured data is to embed it and store the resulting embedding vectors, and then at query time to embed the unstructured query and retrieve the embedding vectors that are 'most similar' to the embedded query. A vector database takes care of storing embedded data and performing vector search for you.
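
To make the idea of "most similar" concrete, here is a minimal sketch of a brute-force similarity search in plain Python. The three-dimensional vectors, the `toy_store` dictionary, and the `top_k` helper are hypothetical stand-ins for illustration; a real vector database such as Chroma stores much larger embeddings and typically uses an approximate index rather than scanning every vector.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k=2):
    """Return the k (doc_id, score) pairs most similar to query_vec."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in store.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical 3-dimensional "embeddings"; real models produce far more dimensions.
toy_store = {
    "cheetah": [0.9, 0.1, 0.0],
    "kolkata": [0.1, 0.9, 0.1],
    "linguistics": [0.0, 0.2, 0.9],
}

# The query vector points mostly along the first axis, so "cheetah" ranks first.
results = top_k([0.8, 0.2, 0.1], toy_store, k=2)
print(results)
```

The store and the query must be embedded with the same model, otherwise the scores are not comparable.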

### Chroma Vector DB

[Chroma](https://docs.trychroma.com/getting-started) is an AI-native open-source vector database focused on developer productivity and happiness. Chroma is licensed under Apache 2.0.

### Get the Wikipedia data

In [5]:
# gdown must be installed before the download command can run
!pip install -q gdown
# If the download below still fails, fetch the file manually from
# https://drive.google.com/file/d/1oWBnoxBZ1Mpeond8XDUSO6J9oAjcRDyW
# and upload it to your working directory (e.g. on Colab)
!gdown 1oWBnoxBZ1Mpeond8XDUSO6J9oAjcRDyW


In [6]:
import gzip
import json

wikipedia_filepath = 'simplewiki-2020-11-01.jsonl.gz'

docs = []
with gzip.open(wikipedia_filepath, 'rt', encoding='utf8') as fIn:
    for line in fIn:
        data = json.loads(line.strip())

        # Keep only the first 3 paragraphs of each article so later steps run faster
        docs.append({
            'metadata': {
                'title': data.get('title'),
                'article_id': data.get('id')
            },
            'data': ' '.join(data.get('paragraphs')[0:3])
        })

print("Passages:", len(docs))

Passages: 169597


In [7]:
# Keep only documents that mention one of a few keywords, so later steps run faster.
# Note: a document containing more than one keyword is added once per matching keyword.
docs = [doc for doc in docs for x in ['linguistics', 'india', 'cheetah']
              if x in doc['data'].lower().split()]

In [8]:
len(docs)

1364
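
The nested comprehension used for the subset emits one entry per (document, matching keyword) pair. The sketch below contrasts it with an `any()`-based filter that keeps each document at most once. The `toy_docs` records and the `matches_any` helper are hypothetical, and whether the real dataset contains documents that match several keywords is not checked here.

```python
keywords = ['linguistics', 'india', 'cheetah']

# Hypothetical toy records shaped like the ones built above.
toy_docs = [
    {'metadata': {'title': 'A'}, 'data': 'India has a long history of linguistics'},
    {'metadata': {'title': 'B'}, 'data': 'The cheetah is a fast cat'},
    {'metadata': {'title': 'C'}, 'data': 'Basil is a herb'},
]

# Nested comprehension: one entry per (document, matching keyword) pair.
per_match = [doc for doc in toy_docs for x in keywords
             if x in doc['data'].lower().split()]

def matches_any(doc, keywords):
    """True if at least one keyword appears as a whole token in the text."""
    tokens = set(doc['data'].lower().split())
    return any(k in tokens for k in keywords)

# any(): one entry per document, however many keywords match.
unique = [doc for doc in toy_docs if matches_any(doc, keywords)]

print(len(per_match), len(unique))
```

Switching the notebook to the `any()` form could change the document counts reported in the cells that follow, so treat it as an option rather than a drop-in replacement.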

In [9]:
docs[:3]

[{'metadata': {'title': 'Kurgan hypothesis', 'article_id': '72554'},
  'data': 'The Kurgan model of Indo-European origins is about both the people and their Proto-Indo-European language. It uses both archaeology and linguistics to show the history of their culture at different stages of the Indo-European expansion. The Kurgan model is the most widely accepted theory on the origins of Indo-European.'},
 {'metadata': {'title': 'Marija Gimbutas', 'article_id': '72558'},
  'data': 'Marija Gimbutas (Lithuanian: Marija Gimbutienė, born Marija Birutė Alseikaitė) (Vilnius, Lithuania, January 23, 1921 – Los Angeles, United States February 2, 1994), was a Lithuanian-American archeologist, known for her research into the Neolithic and Bronze Age cultures of "Old Europe" and the theories that she introduced. Between 1946 and 1971, her writings merged traditional spadework with linguistics and mythologies.'},
 {'metadata': {'title': 'Basil', 'article_id': '73985'},
  'data': 'Basil ("Ocimum basilic

### Create LangChain Documents

In [10]:
from langchain_core.documents import Document

docs = [Document(page_content=doc['data'],
                 metadata=doc['metadata']) for doc in docs]

In [11]:
docs[:3]

[Document(metadata={'title': 'Kurgan hypothesis', 'article_id': '72554'}, page_content='The Kurgan model of Indo-European origins is about both the people and their Proto-Indo-European language. It uses both archaeology and linguistics to show the history of their culture at different stages of the Indo-European expansion. The Kurgan model is the most widely accepted theory on the origins of Indo-European.'),
 Document(metadata={'title': 'Marija Gimbutas', 'article_id': '72558'}, page_content='Marija Gimbutas (Lithuanian: Marija Gimbutienė, born Marija Birutė Alseikaitė) (Vilnius, Lithuania, January 23, 1921 – Los Angeles, United States February 2, 1994), was a Lithuanian-American archeologist, known for her research into the Neolithic and Bronze Age cultures of "Old Europe" and the theories that she introduced. Between 1946 and 1971, her writings merged traditional spadework with linguistics and mythologies.'),
 Document(metadata={'title': 'Basil', 'article_id': '73985'}, page_content

In [12]:
len(docs)

1364

### Split larger documents into smaller chunks

In [13]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=300)
chunked_docs = splitter.split_documents(docs)

In [14]:
chunked_docs[:3]

[Document(metadata={'title': 'Kurgan hypothesis', 'article_id': '72554'}, page_content='The Kurgan model of Indo-European origins is about both the people and their Proto-Indo-European language. It uses both archaeology and linguistics to show the history of their culture at different stages of the Indo-European expansion. The Kurgan model is the most widely accepted theory on the origins of Indo-European.'),
 Document(metadata={'title': 'Marija Gimbutas', 'article_id': '72558'}, page_content='Marija Gimbutas (Lithuanian: Marija Gimbutienė, born Marija Birutė Alseikaitė) (Vilnius, Lithuania, January 23, 1921 – Los Angeles, United States February 2, 1994), was a Lithuanian-American archeologist, known for her research into the Neolithic and Bronze Age cultures of "Old Europe" and the theories that she introduced. Between 1946 and 1971, her writings merged traditional spadework with linguistics and mythologies.'),
 Document(metadata={'title': 'Basil', 'article_id': '73985'}, page_content

In [15]:
len(chunked_docs)

1388
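
For intuition about what `chunk_size` and `chunk_overlap` mean, here is a simplified fixed-window splitter. It is not the algorithm `RecursiveCharacterTextSplitter` uses: as its name suggests, that splitter prefers to break on separators such as paragraph breaks and spaces, whereas this hypothetical `chunk_text` helper cuts at fixed character offsets.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into windows of at most chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so neighbouring windows share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks

chunks = chunk_text("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(chunks)
```

A text shorter than `chunk_size` comes back as a single chunk, which is consistent with the chunk count above (1388) being only slightly higher than the document count (1364).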

### Create a Vector DB and persist on disk

Here we create a Chroma vector database from the chunked documents. Passing `persist_directory` tells Chroma to save the collection to that directory on disk.

In [16]:
from langchain_chroma import Chroma

# create vector DB of docs and embeddings - takes < 30s on Colab
chroma_db = Chroma.from_documents(documents=chunked_docs,
                                  collection_name='rag_wikipedia_db',
                                  embedding=openai_embed_model,
                                  # need to set the distance function to cosine else it uses euclidean by default
                                  # check https://docs.trychroma.com/guides#changing-the-distance-function
                                  collection_metadata={"hnsw:space": "cosine"},
                                  persist_directory="./wikipedia_db")

### Load Vector DB from disk

Once a vector database has been persisted to disk, you can reconnect to it at any time by pointing Chroma at the same directory and collection name.

In [17]:
# load from disk
chroma_db = Chroma(persist_directory="./wikipedia_db",
                   collection_name='rag_wikipedia_db',
                   embedding_function=openai_embed_model)

In [18]:
chroma_db

<langchain_chroma.vectorstores.Chroma at 0x73f90f3d9ad0>

## Load Connection to LLM

Here we create a connection to an OpenAI chat model (`gpt-4o-mini`) to use later in our chains.

In [19]:
from langchain_openai import ChatOpenAI

chatgpt = ChatOpenAI(model_name='gpt-4o-mini', temperature=0)

## Chained Retrieval Pipeline

This strategy chains several retrievers in sequence to narrow the results down to the most relevant documents. The flow is:

Similarity Retrieval → Compression Filter → Reranker Model Retrieval

![](https://i.imgur.com/KriNRDJ.gif)


### Multi-Stage Document Retrieval Pipeline

The code in this section implements a multi-stage document retrieval strategy that combines:
1. Fast vector similarity-based retrieval
2. LLM-based filtering
3. Cross-encoder-based reranking

The pipeline improves retrieval quality through:
- Initial recall of semantically relevant documents using cosine similarity.
- Intelligent filtering using an LLM to reason about contextual fit.
- Precise reranking with a cross-encoder that scores full query-document pairs.

Benefits:
- Balances speed and accuracy.
- Modular and interpretable.
- Especially effective for RAG systems, chatbots, and QA tasks where high-quality document retrieval is critical.
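
The three stages described above can be sketched without any libraries. In this toy version, shared-word counting stands in for vector similarity, a keyword check stands in for the LLM filter, and a hand-written score stands in for the cross-encoder. All function names, the corpus, and the scoring rules are hypothetical and only illustrate how each stage narrows the candidates.

```python
def tokens(text):
    return set(text.lower().split())

def retrieve(query, corpus, k):
    """Stage 1: cheap recall, scored by shared-word count."""
    ranked = sorted(corpus, key=lambda d: len(tokens(query) & tokens(d)), reverse=True)
    return ranked[:k]

def llm_filter(query, docs, is_relevant):
    """Stage 2: keep only documents that a judge function accepts."""
    return [d for d in docs if is_relevant(query, d)]

def rerank(query, docs, score, top_n):
    """Stage 3: re-score each (query, document) pair and keep the best top_n."""
    return sorted(docs, key=lambda d: score(query, d), reverse=True)[:top_n]

corpus = [
    "Kolkata was the capital of India until 1911",
    "Mumbai is the financial capital of India",
    "The cheetah is the fastest land animal",
    "India is a country in South Asia",
]
query = "old capital of India"

# Toy judge: a document is relevant only if it mentions a capital.
judge = lambda q, d: "capital" in tokens(d)
# Toy pair scorer: word overlap plus a bonus for words that hint at the past.
pair_score = lambda q, d: len(tokens(q) & tokens(d)) + 2 * len({"was", "until"} & tokens(d))

candidates = retrieve(query, corpus, k=3)
filtered = llm_filter(query, candidates, judge)
final = rerank(query, filtered, pair_score, top_n=1)
print(final)
```

Each stage is more expensive per document than the one before it, which is why the cheap stage runs first on the full collection and the costly ones only see a handful of candidates.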


In [20]:



from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from langchain.retrievers.document_compressors import CrossEncoderReranker, LLMChainFilter
from langchain.retrievers import ContextualCompressionRetriever

# Step 1: Simple similarity-based retrieval (fast and cheap)
# Retrieves top-k (k=5) documents most similar to the query using cosine similarity on embeddings
similarity_retriever = chroma_db.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5}
)

# Step 2: LLM-based filtering
# Creates a filter using an LLM (e.g., ChatGPT) to evaluate which of the retrieved documents
# are contextually relevant to the query. Helps reduce noise from vector search.
_filter = LLMChainFilter.from_llm(llm=chatgpt)

# Step 3: Apply the LLM filter on top of the initial retriever
# This wraps the previous retriever and filters its output using the LLM logic.
compressor_retriever = ContextualCompressionRetriever(
    base_compressor=_filter,
    base_retriever=similarity_retriever
)

# Step 4: Load cross-encoder reranker (more accurate but slower)
# Downloads a pre-trained cross-encoder model that jointly encodes the query and each document
# to score the pair precisely. Useful for refining the order of the top results.
reranker = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-large")

# Step 5: Wrap the reranker to keep top_n documents after reranking
reranker_compressor = CrossEncoderReranker(model=reranker, top_n=3)

# Step 6: Final retriever combines all the above stages
# This retriever performs:
# 1. Fast similarity search
# 2. LLM filtering
# 3. Cross-encoder reranking
# Result: You get the top 3 highest-quality, most relevant documents for your query
final_retriever = ContextualCompressionRetriever(
    base_compressor=reranker_compressor,
    base_retriever=compressor_retriever
)



In [21]:
query = "what is the old capital of India?"
docs = final_retriever.invoke(query)
docs

[Document(metadata={'article_id': '4062', 'title': 'Kolkata'}, page_content="Kolkata (spelled Calcutta before 1 January 2001) is the capital city of the Indian state of West Bengal. It is the second largest city in India after Mumbai. It is on the east bank of the River Hooghly. When it is called Calcutta, it includes the suburbs. This makes it the third largest city of India. This also makes it the world's 8th largest metropolitan area as defined by the United Nations. Kolkata served as the capital of India during the British Raj until 1911. Kolkata was once the center of industry and education. However, it has witnessed political violence and economic problems since 1954. Since 2000, Kolkata has grown due to economic growth. Like other metropolitan cities in India, Kolkata struggles with poverty, pollution and traffic congestion. The discovery of the nearby Chandraketugarh, an archaeological site has proved that people have lived there for over two millennia. The history of Kolkata b

In [22]:
query = "what is the fastest animal?"
docs = final_retriever.invoke(query)
docs

[Document(metadata={'article_id': '9800', 'title': 'Cheetah'}, page_content='A cheetah ("Acinonyx jubatus") is a medium large cat which lives in Africa. It is the fastest land animal and can run up to 112 kilometers per hour for a short time. Most cheetahs live in the savannas of Africa. There are a few in Asia. Cheetahs are active during the day, and hunt in the early morning or late evening. The cheetah compared to other big cats is light and slimly built. Its long thin legs and long spotted tail are necessary for fast running. Its lightly built, thin form is in sharp contrast with the robust build of other big cats. The head-and-body length ranges from . The cheetah stands 70 to 90\xa0cm at the shoulder, and weighs . The slightly curved claws are only weakly retractable (semi-retractable). This is a major point of difference between the cheetah and the other big cats, which have fully retractable claws.'),
 Document(metadata={'article_id': '528308', 'title': 'South African cheetah'}

## Build a QA RAG Chain

To build a RAG chain, we need a prompt template that instructs the LLM not to answer questions beyond the scope of the retrieved context documents. Many such prompts exist; we craft our own below.

In [23]:
from langchain_core.prompts import ChatPromptTemplate

prompt = """You are an assistant for question-answering tasks.
            Use the following pieces of retrieved context to answer the question.
            If no context is present or if you don't know the answer, just say that you don't know.
            Do not make up the answer unless it is there in the provided context.
            Give a detailed answer with regard to the question.

            Question:
            {question}

            Context:
            {context}

            Answer:
         """

prompt_template = ChatPromptTemplate.from_template(prompt)
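
To see what the `{question}` and `{context}` placeholders turn into, the following sketch fills a shortened copy of the template with plain `str.format`. It is only a stand-in for illustration: `ChatPromptTemplate` produces chat messages rather than a bare string, and the `build_prompt` helper and the sample context are hypothetical.

```python
# Shortened copy of the template; the instruction lines are omitted for brevity.
template = (
    "Question:\n{question}\n\n"
    "Context:\n{context}\n\n"
    "Answer:"
)

def build_prompt(question, context_chunks):
    """Join the retrieved chunks and substitute both placeholders."""
    context = "\n\n".join(context_chunks)
    return template.format(question=question, context=context)

filled = build_prompt(
    "What is the old capital of India?",
    ["Kolkata served as the capital of India during the British Raj until 1911."],
)
print(filled)
```

When no chunks are retrieved, the context section is simply empty, which is the case the instruction "If no context is present ... just say that you don't know" is meant to cover.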

## LCEL Syntax for QA RAG Chain - Recommended

Here we show you how to create the RAG chain using the LangChain Expression Language (LCEL), the syntax LangChain recommends.

In [24]:
from langchain_core.runnables import RunnablePassthrough

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

qa_rag_chain = (
    {
        "context": final_retriever | format_docs,
        "question": RunnablePassthrough(),
    }
    | prompt_template
    | chatgpt
)
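
The `|` operator and the leading dictionary can look opaque at first. The toy classes below imitate the two ideas in plain Python: piping one step's output into the next, and sending the same input to several branches. This is a hypothetical illustration of the pattern, not LangChain's implementation, and every name in it (`Step`, `fan_out`, the `toy_` steps) is made up for the example.

```python
class Step:
    """Toy runnable: wraps a function and supports `|` composition."""

    def __init__(self, fn):
        self.fn = fn

    def invoke(self, value):
        return self.fn(value)

    def __or__(self, other):
        # self | other: run self first, then feed its output to other.
        nxt = other if isinstance(other, Step) else Step(other)
        return Step(lambda value: nxt.invoke(self.invoke(value)))

def fan_out(mapping):
    """Toy version of the dict step: send the same input to every branch."""
    return Step(lambda value: {key: step.invoke(value) for key, step in mapping.items()})

toy_retriever = Step(lambda q: [f"doc about {q}"])
toy_format = Step(lambda docs: "\n\n".join(docs))
toy_passthrough = Step(lambda value: value)
toy_prompt = Step(lambda d: f"Q: {d['question']}\nC: {d['context']}")

chain = fan_out({
    "context": toy_retriever | toy_format,
    "question": toy_passthrough,
}) | toy_prompt

print(chain.invoke("cheetah"))
```

The query string enters once, the `context` branch turns it into formatted text, the `question` branch forwards it unchanged, and the prompt step receives both under their keys.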

In [25]:
from IPython.display import Markdown, display

query = "What is the financial capital of India?"
result = qa_rag_chain.invoke(query)
display(Markdown(result.content))

The financial capital of India is Mumbai. It is not only the largest city in India but also plays a crucial role in the country's economy. Mumbai generates more than 6% of India's GDP and is responsible for 25% of the industrial output, 40% of sea trade, and 70% of capital to India's economy. The city is home to key financial institutions such as the Reserve Bank of India, the Bombay Stock Exchange, and the National Stock Exchange of India, as well as numerous Indian companies and multinational corporations. Additionally, Mumbai is known for its vibrant cultural scene, including the Hindi film and television industry, commonly referred to as Bollywood.

In [26]:
query = "What is the old capital of India?"
result = qa_rag_chain.invoke(query)
display(Markdown(result.content))

The old capital of India was Kolkata, which was known as Calcutta until 1 January 2001. Kolkata served as the capital during the British Raj until 1911. It became the capital of British India in 1772 and remained so until the capital was moved to New Delhi. Kolkata was a significant center for industry and education during its time as the capital and played a crucial role in the history of British colonial rule in India.

In [27]:
query = "Tell me about the fastest animal?"
result = qa_rag_chain.invoke(query)
display(Markdown(result.content))

The fastest animal is the cheetah, scientifically known as "Acinonyx jubatus." This medium-large cat is primarily found in Africa, with a few populations in Asia. Cheetahs are renowned for their incredible speed, capable of reaching up to 112 kilometers per hour (about 70 miles per hour) for short bursts. This remarkable speed is facilitated by their light and slim build, which is in stark contrast to the more robust physiques of other big cats.

Cheetahs have long, thin legs and a long spotted tail, both of which are essential for maintaining balance and stability while running at high speeds. They are active during the day and typically hunt in the early morning or late evening. The cheetah's claws are slightly curved and only weakly retractable, which is another distinguishing feature compared to other big cats that possess fully retractable claws.

Among the different subspecies, the South African Cheetah ("Acinonyx jubatus jubatus") is the most abundant, with an estimated population of over 6,000 individuals in the wild. This subspecies is native to Southern Africa and has seen population increases in certain areas, such as Namibia, where numbers rose from approximately 2,500 individuals in 1990 to more than 3,500 by 2015.

In [28]:
query = "Explain linguistics in simple terms"
result = qa_rag_chain.invoke(query)
display(Markdown(result.content))

Linguistics is the study of language and how it works. People who specialize in this field are called linguists. There are five main areas of linguistics:

1. **Phonology**: This is the study of sounds in language. It looks at how sounds are organized and used in different languages.

2. **Morphology**: This area focuses on the structure of words, including parts of words like prefixes (e.g., "un-") and suffixes (e.g., "-ing").

3. **Syntax**: Syntax deals with how words are arranged to form sentences. It examines the rules that govern sentence structure.

4. **Semantics**: This part studies the meaning of words and how they combine to create meaning in sentences.

5. **Pragmatics**: Pragmatics looks at the unspoken aspects of communication, such as how context influences the interpretation of what is said. For example, if someone says "I'm cold," they might be hinting that they want someone to turn off a fan.

Linguistics can be divided into two main branches: theoretical linguistics and applied linguistics. Theoretical linguists explore the underlying principles of language, including how languages have evolved over time (historical linguistics) and how different social groups use language (sociolinguistics). On the other hand, applied linguists use linguistic knowledge in practical ways, such as in forensic linguistics for crime investigations or computational linguistics to help computers understand human language, like in speech recognition systems.

In summary, linguistics is a broad field that examines all aspects of language, from sounds and word structures to meanings and social usage.

In [29]:
query = "Who won the champions league in 2021"
result = qa_rag_chain.invoke(query)
display(Markdown(result.content))

I don't know.