# Basic RAG pipeline
Vector DB: Chroma DB

Embedding model: Sentence Transformers (all-mpnet-base-v2)

Loaders: load external data (web pages)

LLM integration: generate the final response from the user query and the retrieved chunks

In [1]:
# installations
%pip install langchain langchain-community chromadb langchain-huggingface protobuf langchain-google-genai BeautifulSoup4



In [2]:
import os

# Set USER_AGENT before importing the loader; langchain_community checks it at import time
os.environ['USER_AGENT'] = 'myagent'

from langchain_community.document_loaders import WebBaseLoader

URL = [
    "https://education.nationalgeographic.org/resource/global-warming/",
    "https://en.wikipedia.org/wiki/Climate_change",
    "https://www.nrdc.org/stories/global-warming-101",
]
# Create a loader for the pages
data = WebBaseLoader(URL)
# Fetch the pages and extract their text as Document objects
content = data.load()

In [3]:
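# Chunking: split each document into pieces of at most 1000 characters,
# with 60 characters shared between neighbouring chunks
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=60)
chunks = text_splitter.split_documents(content)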

In [4]:
len(chunks)

290
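
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a plain fixed-size character window with overlap. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and sentences). The function name `split_with_overlap` and the sample string are made up for illustration.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return pieces


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(pieces)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each piece begins with the last three characters of the previous one, so a sentence cut at a chunk boundary still appears partly in both chunks.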

In [5]:
#Downloading the embedding model
from langchain_huggingface import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")




In [6]:
# Define the vector DB: embed every chunk and store it in an in-memory Chroma collection
from langchain_community.vectorstores import Chroma
vectorstore = Chroma.from_documents(chunks, embeddings)

In [7]:
# Step 1: Retrieval
query = "What are the different causes of global warming?"
# Return the 4 chunks most similar to the query
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 4})
# invoke() replaces the deprecated get_relevant_documents()
docs_retrieved = retriever.invoke(query)
print(docs_retrieved)

[Document(metadata={'title': 'Climate change - Wikipedia', 'language': 'en', 'source': 'https://en.wikipedia.org/wiki/Climate_change'}, page_content="Causes of recent global temperature rise\nMain article: Causes of climate change\nPhysical drivers of global warming that has happened so far. Future global warming potential for long lived drivers like carbon dioxide emissions is not represented. Whiskers on each bar show the possible error range.\nThe climate system experiences various cycles on its own which can last for years, decades or even centuries. For example, El Niño events cause short-term spikes in surface temperature while La Niña events cause short term cooling.[98] Their relative frequency can affect global temperature trends on a decadal timescale.[99] Other changes are caused by an imbalance of energy from external forcings.[100] Examples of these include changes in the concentrations of greenhouse gases, solar luminosity, volcanic eruptions, and variations in the Earth'
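
Conceptually, a similarity retriever embeds the query, scores it against every stored chunk vector, and returns the `k` best matches. The sketch below illustrates that idea with cosine similarity over toy 3-dimensional vectors. It is an assumption-laden illustration: the vectors are invented, the helper names `cosine_similarity` and `top_k` are hypothetical, and Chroma's actual distance metric and index are configurable and may differ.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int) -> list[int]:
    """Return the indices of the k vectors most similar to query_vec, best first."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


# Toy "embeddings" standing in for real chunk vectors (invented for illustration)
toy_docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_query = [0.8, 0.2, 0.0]
print(top_k(toy_query, toy_docs, k=2))  # [1, 0]
```

A real vector store avoids scoring every vector by using an approximate nearest-neighbour index, but the ranking idea is the same.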

In [8]:
# Creating the LLM object
from langchain_google_genai import ChatGoogleGenerativeAI
from google.colab import userdata  # only available inside Google Colab

# Read the key stored under the Colab secret named GOOGLE_API_KEY
GEMINI_API_KEY = userdata.get('GOOGLE_API_KEY')
llm = ChatGoogleGenerativeAI(model='gemini-2.0-flash', api_key=GEMINI_API_KEY)
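
The `google.colab` module exists only inside Colab, so the cell above fails elsewhere. One possible workaround, sketched here under the assumption that the key is also exported as an environment variable with the same name, is a small helper that tries Colab secrets first and falls back to `os.environ`. The helper name `load_api_key` is hypothetical.

```python
import os


def load_api_key(name: str = "GOOGLE_API_KEY") -> str:
    """Return the API key from Colab secrets if available, else from the environment."""
    try:
        from google.colab import userdata  # importable only inside Colab
        key = userdata.get(name)
    except ImportError:
        # Not running in Colab: assume the key was exported as an environment variable
        key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"No API key found under the name {name!r}")
    return key
```

The result can then be passed as `api_key=load_api_key()` when constructing the LLM object.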

In [13]:
# Augment

query = 'What are the causes of global warming?'
system_prompt = f"""
You are an AI assistant that responds to a given user query. Please keep your answers relevant to the context you have.

User Query: {query}
"""

In [14]:
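# Generation
from langchain.chains import RetrievalQA
# chain_type="stuff" places all retrieved chunks into a single prompt
qa = RetrievalQA.from_chain_type(llm=llm, retriever=retriever, chain_type="stuff")
# Note: the whole prompt string, not only the question, is used as the retrieval query
response = qa.invoke({"query": system_prompt})
print(response['result'])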

Global warming refers to the increase in the planet’s overall average temperature in recent decades. These rapid changes are due to human activities and the widespread use of fossil fuels for energy like coal, oil, and natural gas. Burning fossil fuels causes the “greenhouse effect” in Earth’s atmosphere.
