<a href="https://colab.research.google.com/github/Thanos29992/LLM-Course-NCE-/blob/main/day2_session1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Objectives:
- Build a document loader that feeds a custom knowledge base to an LLM
- Convert text into numeric vectors called embeddings
- Store the embeddings in a vector store
- Build a question-answering (QA) pipeline on top of the custom knowledge base

## Step 1: Document Loader

In [None]:
!pip install langchain-community
!pip install pypdf

In [2]:
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("/content/drive/MyDrive/LLM_Workshop/PM2.pdf")
docs = loader.load()
print(docs)

[Document(metadata={'producer': 'Microsoft® Word for Microsoft 365', 'creator': 'Microsoft® Word for Microsoft 365', 'creationdate': '2025-07-15T20:33:04+05:45', 'author': 'Shlok 777', 'moddate': '2025-07-15T20:33:04+05:45', 'source': '/content/drive/MyDrive/LLM_Workshop/PM2.pdf', 'total_pages': 4, 'page': 0, 'page_label': '1'}, page_content='PM2.5 Satellite Dataset Reference Guide  \nThis document explains all the Earth Engine datasets used in our PM2.5 tracking project — what they \nmean, how trustworthy they are for air quality estimation, and how they can fit into modeling pipelines.  \n \n           Dataset Overview \n1.             Sentinel-5P (COPERNICUS/S5P/OFFL ) \nBand Meaning Typical Range PM2.5 Use Notes \nNO₂ Nitrogen Dioxide 0–500 µmol/m²       High Strong urban emission indicator  \nSO₂ Sulfur Dioxide 0–5 mmol/m²      \nMedium Relevant near power/industrial zones  \nCO Carbon Monoxide 0–50 mmol/m² ✅ High Good for tracking fires, combustion  \nO₃ Ozone 0–0.1 mol/m²       

In [3]:
from langchain_core.prompts import PromptTemplate

In [4]:
template = """
Answer the question, based on the context below:
If the context isn't relevant, just reply 'I have no idea, blud 💀'
Context : {context}
Question : {question}
"""
prompt = PromptTemplate(template=template)

print(prompt.format(context="Here is some context", question="here is a question"))


Answer the question, based on the context below:
If the context isn't relevant, just reply 'I have no idea, blud 💀'
Context : Here is some context
Question : here is a question
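
The placeholder filling shown above behaves much like Python's built-in `str.format`. The sketch below uses only the standard library to list the placeholder names in a template and fill them in. It is an illustration of the idea, not LangChain's implementation, and the helper name `placeholder_names` is hypothetical.

```python
from string import Formatter

# A shortened copy of the notebook's template (assumption: same placeholders).
template = """
Answer the question, based on the context below:
Context : {context}
Question : {question}
"""


def placeholder_names(text):
    """Return the distinct {placeholder} names in a format string, in order."""
    names = []
    for _, field_name, _, _ in Formatter().parse(text):
        if field_name and field_name not in names:
            names.append(field_name)
    return names


variables = placeholder_names(template)
filled = template.format(context="Here is some context",
                         question="here is a question")
print(variables)
print(filled)
```

Listing the placeholders first is a cheap way to catch a misspelled variable before the prompt is ever sent to a model.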



In [None]:
# Create an LLM Model
!pip install langchain_google_genai
import google.generativeai as genai
from google.colab import userdata


In [6]:
from langchain_google_genai import ChatGoogleGenerativeAI
# Read the key from Colab's secret store. Do not print it: cell outputs are saved with the notebook.
api_key = userdata.get("gemini_api_key")


In [7]:
llm = ChatGoogleGenerativeAI(
    google_api_key = api_key,
    model = "gemini-2.5-flash",
    temperature = 0.2,
    max_output_tokens = 2000,
    # Remaining sampling parameters; tune these as needed
    top_k = 40,
    top_p = 0.95,
)

In [8]:
llm_chain = prompt | llm
response = llm_chain.invoke({
    "context" : "NO2 is provided by the Sentinel-5P Satellite. NO2 is about 70% correlated with the amount of PM2.5 in the air. PM2.5 is the primary source of air pollution.",
    "question" : "What is my name?"
})
print(response.content)

I have no idea, blud 💀


## Step 2: Load the document, split it, and store in vector database
- This example uses Chroma (ChromaDB) as the vector store

In [None]:
!pip install langchain_chroma

In [10]:
from langchain_chroma import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_google_genai import GoogleGenerativeAIEmbeddings

In [11]:
emb = GoogleGenerativeAIEmbeddings(
  google_api_key = api_key,
  model = "models/embedding-001"
)

In [12]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 200
)
splits = text_splitter.split_documents(
    docs
)
print(splits)

[Document(metadata={'producer': 'Microsoft® Word for Microsoft 365', 'creator': 'Microsoft® Word for Microsoft 365', 'creationdate': '2025-07-15T20:33:04+05:45', 'author': 'Shlok 777', 'moddate': '2025-07-15T20:33:04+05:45', 'source': '/content/drive/MyDrive/LLM_Workshop/PM2.pdf', 'total_pages': 4, 'page': 0, 'page_label': '1'}, page_content='PM2.5 Satellite Dataset Reference Guide  \nThis document explains all the Earth Engine datasets used in our PM2.5 tracking project — what they \nmean, how trustworthy they are for air quality estimation, and how they can fit into modeling pipelines.  \n \n           Dataset Overview \n1.             Sentinel-5P (COPERNICUS/S5P/OFFL ) \nBand Meaning Typical Range PM2.5 Use Notes \nNO₂ Nitrogen Dioxide 0–500 µmol/m²       High Strong urban emission indicator  \nSO₂ Sulfur Dioxide 0–5 mmol/m²      \nMedium Relevant near power/industrial zones  \nCO Carbon Monoxide 0–50 mmol/m² ✅ High Good for tracking fires, combustion  \nO₃ Ozone 0–0.1 mol/m²       
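
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-size sliding window over characters. It is a simplification: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraph and line breaks, which this sketch ignores. The function name `split_with_overlap` and the sample text are hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks


# Synthetic 2500-character sample (an assumption, not the workshop PDF).
sample = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(sample, chunk_size=1000, chunk_overlap=200)
print([len(c) for c in chunks])
```

The overlap means the tail of one chunk is repeated at the head of the next, so a sentence that straddles a boundary still appears whole in at least one chunk.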

In [13]:
vector_store = Chroma.from_documents(
    splits, embedding=emb
)
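
Conceptually, a vector store keeps one embedding per chunk and answers a query by returning the chunks whose embeddings are closest to the query's embedding. The sketch below shows that idea in plain Python with cosine similarity. The three-dimensional vectors and chunk labels are made up for illustration, and cosine similarity is an assumption here: the metric Chroma actually uses is configurable and may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the texts of the k stored entries most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy store: (text, embedding) pairs with hypothetical 3-d embeddings.
toy_store = [
    ("chunk about aerosol optical depth", [0.9, 0.1, 0.0]),
    ("chunk about nitrogen dioxide", [0.1, 0.9, 0.0]),
    ("chunk about cloud cover", [0.0, 0.1, 0.9]),
]
query = [0.8, 0.2, 0.0]  # hypothetical embedding of the user's question
results = top_k(query, toy_store, k=2)
print(results)
```

Real embeddings have hundreds of dimensions and real stores use approximate indexes for speed, but the ranking principle is the same.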

## Step 3: Retrieve the relevant snippets from the document and generate an answer

In [14]:
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

In [18]:
parser = StrOutputParser()

In [19]:
retriever = vector_store.as_retriever()

def format_docs(docs):
    # Join the retrieved chunks into one context string
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | llm_chain
    | parser
)

In [22]:
rag_chain.invoke("What is the strongest correlative factor for PM2.5?")

'The strongest correlative factor for PM2.5 mentioned in the context is **Optical_Depth_055 (Aerosol Optical Depth at 550nm)**, which is described as the "Top Indicator" and "Strongest satellite proxy for PM2.5".'
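
The data flow of the chain above can be traced without any model or vector store. In this sketch the question is passed through unchanged, the retrieved chunks are joined into one context string, and both are fed into the prompt. `Doc`, `fake_retriever`, and `build_inputs` are hypothetical stand-ins for illustration; they are not LangChain classes.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document chunk."""
    page_content: str


def format_docs(docs):
    """Join the retrieved chunks into one context string."""
    return "\n\n".join(doc.page_content for doc in docs)


def fake_retriever(question):
    """Stand-in retriever: returns fixed chunks regardless of the question."""
    return [Doc("first retrieved chunk"), Doc("second retrieved chunk")]


def build_inputs(question):
    """Mirror the dict step: context from retrieval, question passed through as-is."""
    return {"context": format_docs(fake_retriever(question)),
            "question": question}


template = "Context : {context}\nQuestion : {question}"
inputs = build_inputs("example question")
prompt_text = template.format(**inputs)
print(prompt_text)
```

In the real chain the formatted prompt then goes to the LLM, and the output parser extracts the reply text as a plain string.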