# Assignment: RAG Workflow with LangChain, Groq API, and FAISS

In this assignment, you will:

1. Load web content using LangChain.
2. Store embeddings of the content in FAISS, a vector database.
3. Use the Groq API as your LLM to answer questions over the stored content via Retrieval-Augmented Generation (RAG).

## Objectives:

- Learn how to use LangChain for data loading from the web.
- Understand embeddings and how to store/retrieve them with FAISS.
- Implement a RAG workflow using the Groq LLM API.

## Setup and Installation


In [1]:
# Install required libraries
!pip install langchain groq faiss-cpu sentence-transformers beautifulsoup4 langchain-community langchain-groq


### Beautiful Soup for Webscraping:
BeautifulSoup is a library for parsing HTML and XML documents, whereas Selenium is built for end-to-end browser testing: it launches a real browser that you can script to interact with the UI. If you only need to parse web pages without interacting with them, `requests` (with or without BeautifulSoup) is usually enough.
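
To make "parsing without interacting" concrete, here is a minimal sketch that pulls the visible text out of a static HTML string. It deliberately uses Python's standard-library `html.parser` as a dependency-free stand-in for BeautifulSoup, so it is not how BeautifulSoup works internally; `TextExtractor`, `extract_text`, and `sample_html` are hypothetical names made up for this illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.parts.append(text)


def extract_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical static page: no browser and no interaction are needed
sample_html = (
    "<html><head><title>Demo</title><script>var x = 1;</script></head>"
    "<body><h1>Hello</h1><p>Static page text.</p></body></html>"
)
page_text = extract_text(sample_html)
print(page_text)
```

A page that builds its content with JavaScript after loading would yield little or nothing here, which is the case where a browser-driving tool such as Selenium becomes relevant.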

## Import Libraries

In [2]:
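# Integrations live in langchain_community; the old langchain.* paths are deprecated
from langchain_community.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS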
from langchain.chains import RetrievalQA
from langchain_groq import ChatGroq
import os



### WebBaseLoader:
- Fetches the page at the given URL and extracts its text, using BeautifulSoup behind the scenes.
- Returns the result as a list of Document objects (page text plus metadata such as the source URL) that can be used with LangChain components like vector stores, retrievers, and LLM chains.

### RetrievalQA:
- The purpose of RetrievalQA in LangChain is to create a question-answering system that looks up relevant information from a set of documents before answering a question.
- It is like giving your language model a smart "memory": instead of relying only on its internal knowledge, it retrieves relevant content from your documents or database and then generates an answer based on that content.
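
The retrieve-then-generate idea described above can be sketched in plain Python. This is a toy illustration under loud assumptions, not LangChain's implementation: retrieval here is simple word overlap rather than embedding similarity, and `fake_llm` is a hypothetical stand-in for a real model call. All function names are made up for this example.

```python
import re


def tokenize(text):
    """Lower-case the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, chunks, k=2):
    """Rank chunks by how many words they share with the question."""
    question_words = tokenize(question)
    ranked = sorted(
        chunks,
        key=lambda chunk: len(question_words & tokenize(chunk)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, context_chunks):
    """Place the retrieved chunks in front of the question."""
    context = "\n".join(context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


def fake_llm(prompt):
    # Hypothetical stand-in: a real system would send the prompt to an LLM here
    return f"[model would answer from {len(prompt)} prompt characters]"


chunks = [
    "FAISS stores vectors and finds nearest neighbours.",
    "Groq serves large language models through an API.",
    "BeautifulSoup parses HTML documents.",
]
question = "What does FAISS store?"
top_chunks = retrieve(question, chunks, k=1)
prompt = build_prompt(question, top_chunks)
print(prompt)
print(fake_llm(prompt))
```

The key point is the order of operations: the model only ever sees the question together with the chunks that retrieval selected, so answer quality depends heavily on what gets retrieved.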

## Step 1: Load and Prepare Data

Documentation:

- WebBaseLoader: https://python.langchain.com/docs/integrations/document_loaders/web_base/
- RecursiveCharacterTextSplitter: https://python.langchain.com/api_reference/text_splitters/character/langchain_text_splitters.character.RecursiveCharacterTextSplitter.html
- Getting started with embeddings: https://huggingface.co/blog/getting-started-with-embeddings
- FAISS vector store: https://python.langchain.com/docs/integrations/vectorstores/faiss/

In [7]:
#url = "https://en.wikipedia.org/wiki/LangChain"
url = "https://www.aljazeera.com/where/israel/"  # Example URL, replace with your choice
loader = WebBaseLoader(url)
docs = loader.load()

In [None]:
docs[0]

In [15]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=50)
documents = text_splitter.split_documents(docs)
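
To see what `chunk_size` and `chunk_overlap` mean, the sketch below splits a string into fixed-size windows that share characters with their neighbours. It is a simplification: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs, lines, and spaces, so its real chunks will differ. `split_with_overlap` is a hypothetical helper written for this illustration.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_overlap(text, chunk_size=10, chunk_overlap=5)
for piece in pieces:
    print(piece)
```

With the notebook's settings (`chunk_size=100`, `chunk_overlap=50`), roughly half of each chunk repeats its predecessor, and each chunk holds only about a sentence of text. Larger chunks are worth trying if the retrieved context turns out to be too thin.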

In [22]:
documents[0]


Document(metadata={'source': 'https://www.aljazeera.com/where/israel/', 'title': "Israel | Today's latest from Al Jazeera", 'description': 'Stay on top of Israel latest developments on the ground with Al Jazeera’s fact-based news, exclusive video footage, photos and updated maps.', 'language': 'en'}, page_content="Israel | Today's latest from Al Jazeera")

## Step 2: Generate Embeddings and Store in FAISS

In [23]:
embeddings_model = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")


In [24]:
vectorstore = FAISS.from_documents(documents, embeddings_model)
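
Conceptually, the vector store keeps one embedding per chunk and answers a query by finding the stored vectors closest to the query's embedding. The sketch below does this by brute force with Euclidean (L2) distance, which, to my understanding, matches the flat L2 index that LangChain's FAISS wrapper builds by default. The vectors are made up and three-dimensional for readability (real `all-MiniLM-L6-v2` embeddings have 384 dimensions), and `l2_distance` and `nearest` are hypothetical helpers.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query_vec, stored, k=2):
    """Return the k (distance, text) pairs closest to query_vec."""
    scored = [(l2_distance(query_vec, vec), text) for text, vec in stored]
    return sorted(scored)[:k]


# Made-up embeddings; a real model would produce these from the chunk text
stored = [
    ("chunk about vector search", [0.9, 0.1, 0.0]),
    ("chunk about web scraping", [0.1, 0.9, 0.0]),
    ("chunk about language models", [0.0, 0.2, 0.9]),
]
query_vec = [0.8, 0.2, 0.1]  # pretend embedding of the user's question
results = nearest(query_vec, stored, k=2)
for distance, text in results:
    print(f"{distance:.3f}  {text}")
```

FAISS performs the same kind of search with optimized native code and offers approximate indexes for large collections, which is why it scales far beyond this loop.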

## Step 3: Setup Groq LLM

In [25]:
from getpass import getpass
os.environ['GROQ_API_KEY'] = getpass("Groq API key: ")  # prompt for the key; never hard-code secrets in a notebook

In [29]:
llm = ChatGroq(model_name="gemma2-9b-it")

## Step 4: Retrieval-Augmented Generation (RAG)

In [30]:
retriever = vectorstore.as_retriever()
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)

## Step 5: Querying the System

In [37]:
query = "Tell me details of Iran-Israel current conflict?"
answer = qa_chain.invoke({"query": query})["result"]  # .run() is deprecated in recent LangChain releases

In [38]:
print(f"Question: {query}\n")
print(f"Answer: {answer}\n")

Question: Tell me details of Iran-Israel current conflict?

Answer: I can give you some details about the current Israel-Iran conflict based on the provided text:

* **The conflict is intensifying:** This suggests a growing escalation of violence and tension.
* **There have been bombardments and strikes:** This indicates that both sides are engaged in military action.
* **There have been deaths:** This highlights the serious human cost of the conflict.
* **Israel launched a preemptive attack:** This means Israel initiated the military action, claiming it was necessary to prevent a future attack by Iran.
* **The attack targeted both military and civilian infrastructure:** This suggests a broad and potentially devastating campaign.

**The text does not provide specific details about:**

* **The exact nature of the military action:** What types of weapons were used?
* **The locations of the attacks:** Where did the strikes take place?
* **The number of casualties:** How many people have b