# RAG + LLM: Document Ingestion, Embeddings & Contextual Generation

This notebook builds a Retrieval-Augmented Generation (RAG) pipeline around a Large Language Model (LLM). It ingests a PDF, splits it into chunks, embeds the chunks, stores them in a Chroma vector database, and answers questions using only the retrieved context.



## Check GPU Availability

In [3]:
!nvidia-smi

Tue Jul 29 08:55:55 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   40C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

## Install Required Packages

In [4]:
!pip install -U chromadb langchain langchain-groq langchain-community \
    langchain-chroma langchain-text-splitters transformers \
    sentence-transformers unstructured "unstructured[pdf]"



## Install Poppler Utilities

In [5]:
!apt-get install poppler-utils

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
poppler-utils is already the newest version (22.02.0-2ubuntu0.8).
0 upgraded, 0 newly installed, 0 to remove and 35 not upgraded.


## Import Necessary Modules for LLM Pipeline

In [6]:
import os

from langchain_community.document_loaders import UnstructuredFileLoader
from langchain_text_splitters import CharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_chroma import Chroma
from langchain_groq import ChatGroq
from langchain.chains import RetrievalQA

## Upgrade Unstructured Package for Local Inference

In [7]:
!pip install --upgrade "unstructured[local-inference]"



## Set API Key Environment Variable

In [None]:
# Replace the placeholder with your own Groq API key; avoid committing a real key.
os.environ["GROQ_API_KEY"] = "YOUR_API_KEY"
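
To keep the key out of the notebook itself, one option is to read it from the environment and prompt only when it is missing. The sketch below is a minimal illustration; the helper name `ensure_api_key` and its `prompt` parameter are hypothetical and not part of LangChain or Groq.

```python
import os
from getpass import getpass


def ensure_api_key(name="GROQ_API_KEY", prompt=getpass):
    """Return the key stored in os.environ[name], prompting only if it is missing.

    `ensure_api_key` is a hypothetical helper written for this notebook.
    """
    key = os.environ.get(name)
    if not key:
        # getpass hides the typed value, so the key never appears in cell output.
        key = prompt(f"Enter {name}: ")
        os.environ[name] = key
    return key


# Usage in the notebook (prompts once if the variable is not already set):
# ensure_api_key("GROQ_API_KEY")
```
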

## Download IPL Season Schedule PDF

In [9]:
import requests

url = "https://documents.iplt20.com/smart-images/1739621485265_IPL%20Season%20Schedule%202025-1.pdf"

response = requests.get(url)

In [10]:
response

<Response [200]>

## Save PDF Locally

In [11]:
# Save the PDF to a local file
with open("IPL  Schedule Season 2025.pdf", "wb") as f:
    f.write(response.content)
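
A 200 status does not by itself guarantee that the downloaded bytes are a PDF (a server can return an HTML error page with status 200). The following sketch checks both the status code and the `%PDF-` header that PDF files start with. The function name `check_pdf_bytes` is hypothetical.

```python
def check_pdf_bytes(status_code: int, content: bytes) -> None:
    """Raise if the download failed or the payload does not look like a PDF.

    Hypothetical helper: it only inspects the status code and the file header.
    """
    if status_code != 200:
        raise RuntimeError(f"Download failed with HTTP status {status_code}")
    # PDF files begin with the ASCII marker "%PDF-" followed by a version number.
    if not content.startswith(b"%PDF-"):
        raise ValueError("Downloaded content does not start with a PDF header")


# Usage in the notebook:
# check_pdf_bytes(response.status_code, response.content)
```
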

## Load PDF Document into LangChain

In [12]:
from langchain_community.document_loaders import UnstructuredFileLoader

loader = UnstructuredFileLoader("IPL  Schedule Season 2025.pdf")
documents = loader.load()

documents   # This will be a list of Document Objects





[Document(metadata={'source': 'IPL  Schedule Season 2025.pdf'}, page_content='Match No 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37\n\nMatch Day 1 2 2 3 4 5 6 7 8 9 9 10 11 12 13 14 15 15 16 16 17 18 19 20 21 22 22 23 23 24 25 26 27 28 29 29 30\n\nTITLE SPONSOR\n\n2025 SEASON SCHEDULE IPLT20.COM\n\nDate 22-Mar-25 23-Mar-25 23-Mar-25 24-Mar-25 25-Mar-25 26-Mar-25 27-Mar-25 28-Mar-25 29-Mar-25 30-Mar-25 30-Mar-25 31-Mar-25 01-Apr-25 02-Apr-25 03-Apr-25 04-Apr-25 05-Apr-25 05-Apr-25 06-Apr-25 06-Apr-25 07-Apr-25 08-Apr-25 09-Apr-25 10-Apr-25 11-Apr-25 12-Apr-25 12-Apr-25 13-Apr-25 13-Apr-25 14-Apr-25 15-Apr-25 16-Apr-25 17-Apr-25 18-Apr-25 19-Apr-25 19-Apr-25 20-Apr-25\n\nDay Sat Sun Sun Mon Tue Wed Thu Fri Sat Sun Sun Mon Tue Wed Thu Fri Sat Sat Sun Sun Mon Tue Wed Thu Fri Sat Sat Sun Sun Mon Tue Wed Thu Fri Sat Sat Sun\n\nStart 7:30 PM 3:30 PM 7:30 PM 7:30 PM 7:30 PM 7:30 PM 7:30 PM 7:30 PM 7:30 PM 3:30 PM 7:30 PM 7:30 PM 7:30 PM 

## Create Text Splitter

In [13]:
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=50)

In [14]:
text_splitter

<langchain_text_splitters.character.CharacterTextSplitter at 0x78425c21a390>

## Split Documents into Text Chunks

In [15]:
texts = text_splitter.split_documents(documents)
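
To make `chunk_size` and `chunk_overlap` concrete: `CharacterTextSplitter` first splits on a separator (a blank line, `"\n\n"`, by default) and then merges the pieces into chunks of at most `chunk_size` characters, carrying some trailing pieces into the next chunk as overlap. The pure-Python sketch below is a simplified approximation of that behaviour, not LangChain's actual implementation; `split_text_sketch` is a hypothetical name.

```python
def split_text_sketch(text, chunk_size=1000, chunk_overlap=50, separator="\n\n"):
    """Greedy separator-based chunking with overlap (simplified illustration)."""

    def joined_len(parts):
        return len(separator.join(parts))

    pieces = [p for p in text.split(separator) if p.strip()]
    chunks, current = [], []
    for piece in pieces:
        if current and joined_len(current + [piece]) > chunk_size:
            # The current chunk is full: emit it.
            chunks.append(separator.join(current))
            # Keep only trailing pieces that fit in the overlap budget
            # and still leave room for the incoming piece.
            while current and (
                joined_len(current) > chunk_overlap
                or joined_len(current + [piece]) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


# A single piece longer than chunk_size is never cut; it is emitted whole.
long_row = "z" * 1500
print([len(c) for c in split_text_sketch(long_row)])
```

Because splitting happens only at the separator, a long table row extracted from a PDF can end up as one chunk larger than `chunk_size`. Printing `len(t.page_content)` for each item in `texts` is a quick way to check this on the real data.
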

## Initialize Embeddings Model

In [16]:
embeddings = HuggingFaceEmbeddings()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



## Create Persistent Vector Database

In [17]:
persist_directory = "vector_db"

In [18]:
vectordb = Chroma.from_documents(texts, embeddings, persist_directory=persist_directory)

### Create Retriever Interface

In [19]:
retriever = vectordb.as_retriever()
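
Conceptually, the retriever embeds the query and returns the chunks whose vectors are closest to it. The sketch below shows that idea as a brute-force search using cosine similarity in pure Python. It is an illustration only: Chroma uses its own index and a configurable distance metric, and the names `cosine` and `top_k` as well as the value `k=4` are assumptions made for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, doc_vecs, k=4):
    """Return the indices of the k document vectors most similar to the query."""
    scored = sorted(
        ((cosine(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)),
        reverse=True,
    )
    return [idx for _, idx in scored[:k]]


# Toy 2-dimensional "embeddings" standing in for real model output.
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(top_k([1.0, 0.1], doc_vecs, k=2))
```
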

## Initialize Large Language Model (LLM)

In [20]:
llm = ChatGroq(model="llama-3.3-70b-versatile", temperature=0)

### Create RetrievalQA Chain

In [21]:
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
)
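
The `"stuff"` chain type places all retrieved chunks into a single prompt together with the question. The sketch below shows roughly what that assembly looks like. The template wording and the function name `build_stuff_prompt` are assumptions for illustration, not LangChain's exact prompt.

```python
def build_stuff_prompt(question, chunks):
    """Concatenate retrieved chunks into one context block and append the question.

    Illustrative only: the instruction text is an assumed template.
    """
    context = "\n\n".join(chunks)
    return (
        "Use the following pieces of context to answer the question at the end. "
        "If you don't know the answer, just say that you don't know.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Helpful Answer:"
    )


prompt = build_stuff_prompt(
    "List all venues of RCB",
    ["Venue Mumbai Kolkata Lucknow", "Away\n\nRoyal Challengers Bengaluru"],
)
print(prompt)
```

Because the model is instructed to rely on the supplied context, a question the retrieved chunks cannot answer should produce an "I don't know" style reply rather than an answer from the model's general knowledge.
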

In [23]:
query = 'List all venues of RCB'
response = qa_chain.invoke({'query': query})

In [24]:
print(response)

{'query': 'List all venues of RCB', 'result': 'Based on the provided context, the venues where Royal Challengers Bengaluru (RCB) will play are:\n\n1. Bengaluru \n2. Mumbai\n3. Lucknow\n4. Hyderabad\n5. Chennai\n6. Kolkata\n7. Jaipur\n8. Ahmedabad\n9. Dharamsala\n10. Delhi', 'source_documents': [Document(id='685157dd-9ee0-42f7-b843-ded5e0d02823', metadata={'source': 'IPL  Schedule Season 2025.pdf'}, page_content='Venue Mumbai Kolkata Lucknow Hyderabad Bengaluru Chennai Kolkata Mumbai Delhi Jaipur Delhi Chennai Jaipur Ahmedabad Bengaluru Kolkata Dharamsala Hyderabad Mumbai Kolkata Dharamsala Lucknow Hyderabad Dharamsala Delhi Chennai Bengaluru Ahmedabad Mumbai Jaipur Bengaluru Ahmedabad Lucknow Hyderabad Hyderabad Kolkata Kolkata\n\nOFFICIAL DIGITAL STREAMING PARTNER'), Document(id='bdaccc94-7f1f-4003-a7a4-3e3a33409844', metadata={'source': 'IPL  Schedule Season 2025.pdf'}, page_content='Away\n\nRoyal Challengers Bengaluru Rajasthan Royals Mumbai Indians Lucknow Super Giants Punjab Kings

In [25]:
print(response['result'])

Based on the provided context, the venues where Royal Challengers Bengaluru (RCB) will play are:

1. Bengaluru 
2. Mumbai
3. Lucknow
4. Hyderabad
5. Chennai
6. Kolkata
7. Jaipur
8. Ahmedabad
9. Dharamsala
10. Delhi
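
Since the chain was created with `return_source_documents=True`, it can help to look at which chunks the answer was based on before trusting it. The helper below is a small sketch that prints a one-line preview per source; `preview_sources` is a hypothetical name, and the `SimpleNamespace` objects only stand in for LangChain `Document` objects, which expose the same `metadata` and `page_content` attributes.

```python
from types import SimpleNamespace


def preview_sources(source_documents, width=80):
    """Return one short line per retrieved document: index, source, and a snippet."""
    lines = []
    for i, doc in enumerate(source_documents, 1):
        source = doc.metadata.get("source", "unknown")
        # Collapse all whitespace so multi-line chunks fit on one line.
        snippet = " ".join(doc.page_content.split())[:width]
        lines.append(f"[{i}] {source}: {snippet}")
    return lines


# Stand-in documents for demonstration; in the notebook, pass
# response['source_documents'] instead.
demo_docs = [
    SimpleNamespace(
        metadata={"source": "IPL  Schedule Season 2025.pdf"},
        page_content="Venue Mumbai Kolkata Lucknow\n\nOFFICIAL DIGITAL STREAMING PARTNER",
    )
]
for line in preview_sources(demo_docs, width=40):
    print(line)
```
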


In [27]:
query = 'Total runs scored by Virat Kohli'
response = qa_chain.invoke({'query': query})

In [28]:
print(response['result'])

I don't know the total runs scored by Virat Kohli as the provided context does not contain this information. The context appears to be a schedule for the 2025 IPL season, with team names, venues, and match dates, but it does not include any statistics about individual player performances.


In [31]:
query = 'Who is the Prime Minister of India'
response = qa_chain.invoke({'query': query})

In [32]:
print(response['result'])

I don't know the current Prime Minister of India based on the provided context, as it appears to be related to the Indian Premier League (IPL) schedule and does not contain information about the Prime Minister of India.
