# **CA3-Part2, LLMs Spring 2025**

- **Name:** _Moho Barabadi_
- **Student ID:** _810199383_

# RAG (50 points)

If you have any further questions or concerns, contact the TA via email (pouya.sadeghi@ut.ac.ir) or telegram.

## Install Requirements

In [None]:
!pip install -q langchain langchain_community langchain_huggingface huggingface_hub
!pip install -q sentence_transformers tiktoken lark datasets


In [None]:
# To determine your system's CUDA version, run the following command:
# !nvidia-smi

# Based on your CUDA version, install the appropriate FAISS-GPU package:

# For CUDA 12.x:
!pip install -q faiss-gpu-cu12

# For CUDA 11.x:
# !pip install faiss-gpu-cu11

# If you prefer the CPU-only version of FAISS:
# !pip install -q faiss-cpu

[?25h[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
thinc 8.3.6 requires numpy<3.0.0,>=2.0.0, but you have numpy 1.26.4 which is incompatible.[0m[31m
[0m

## 1. An Overview of Information Retrieval (IR) and RAG (2 points)


- **Information Retrieval (IR)**: The process of obtaining information system resources relevant to a specific information need from a collection of those resources. Each IR system consists of a collection of documents, a set of queries, and a retrieval function that ranks the documents based on their relevance to the query.
- **Retrieval-Augmented Generation (RAG)**: A model that combines the strengths of retrieval-based and generation-based approaches. It retrieves relevant documents from a large corpus and uses them to generate a response to a query. RAG is particularly useful for tasks where the answer is not explicitly present in the training data but can be inferred from related documents.
- **RAG Architecture**: The RAG architecture consists of two main components:
  - **Retriever**: This component retrieves relevant documents from a large corpus based on the input query. It can be implemented using various retrieval methods, such as BM25 or dense retrieval.
  - **Generator**: This component generates a response based on the retrieved documents and the input query. It can be implemented using transformer-based models.
  
In this computer assignment, you will implement a RAG pipeline using the LangChain framework. You will use two different retrievers: a sparse TF-IDF retriever and a dense (semantic) retriever.

#### Question 1: (2 points)
# Why Do We Need RAG?

Large language models have fixed knowledge limited to their training data. Retrieval-Augmented Generation (RAG) dynamically fetches external knowledge at runtime, conditioning the LLM on this information. This provides:

- Up-to-date answers without retraining
- Source-grounded, citable responses
- Expert-level answers from smaller models, because domain knowledge comes from the retrieved documents
- Reduced hallucination by referencing factual content

# What is LangChain for RAG Pipelines?

LangChain is an open-source framework that abstracts common LLM building blocks into composable "chains." For RAG applications, it enables:

- Easy swapping between retrievers (BM25, FAISS, etc.)
- Converting datasets to documents, chunking, and vector storage
- Declarative pipeline creation from question to response
- Comprehensive tracing, debugging, and evaluation capabilities

## 2. An Overview of LangChain (12 points + 2)

In this overview, we will provide a step-by-step guide on how to construct a basic application using LangChain. To learn more about this framework, check its [tutorial](https://python.langchain.com/docs/tutorials/) which is available for different releases!

### 2.1 Let's load our model (4 points)


#### Question 2: (2 points)
# Effect of Generation Hyperparameters

| Parameter | Function | Low Value Effect | High Value Effect |
| --- | --- | --- | --- |
| `temperature` | Divides logits before the softmax to control randomness | Near-deterministic, focused, sometimes repetitive outputs | More diverse, creative responses with a higher hallucination risk |
| `max_length`/`max_new_tokens` | Hard cap on length (`max_length` counts prompt plus generated tokens; `max_new_tokens` counts generated tokens only) | Short, potentially truncated answers | Room for complete explanations, but slower and costlier, with possible topic drift |
| `top_p` (nucleus) | Samples only from the smallest set of tokens whose cumulative probability is ≥ p | Close to greedy decoding when p≈0 | Includes rarer tokens, increasing diversity |
| `top_k` | Restricts sampling to the k highest-probability tokens | Greedy decoding at k=1 | Greater lexical variety; little filtering effect when k is very large |
| `repetition_penalty` | Rescales the logits of tokens already present in the sequence so they become less likely | At 1.0, no penalty, so repetition remains possible | Values >1 discourage repeats; excessive values harm coherence |

These parameters allow precise balancing between factual accuracy (lower randomness) and creativity (higher randomness) while maintaining concise, coherent responses.
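
To make the table concrete, the following is a minimal pure-Python sketch of how `temperature`, `top_k`, and `top_p` reshape a next-token distribution. The logits and the helper names (`softmax`, `top_k_filter`, `top_p_filter`) are hypothetical and chosen for illustration; this is not how `transformers` implements sampling internally.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable token ids and renormalise their probabilities."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of token ids whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

toy_logits = [2.0, 1.0, 0.5, -1.0]            # hypothetical scores for 4 tokens
sharp = softmax(toy_logits, temperature=0.5)  # low temperature: peaked distribution
flat = softmax(toy_logits, temperature=2.0)   # high temperature: flatter distribution
base = softmax(toy_logits)

print("T=0.5:", [round(x, 3) for x in sharp])
print("T=2.0:", [round(x, 3) for x in flat])
print("top_k=2:", top_k_filter(base, 2))
print("top_p=0.9:", top_p_filter(base, 0.9))
```

Lowering the temperature concentrates probability on the top token, while `top_k` and `top_p` remove the tail before sampling.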

#### Completion 1: (2 points)

Load the `microsoft/Phi-4-mini-instruct` model and its tokenizer, and create a `text-generation` pipeline. Use the LangChain framework to integrate the model into your application. You should configure the pipeline with appropriate parameters, such as *max_new_tokens*, *temperature*, *top_p*, *top_k*, and *repetition_penalty*.

In [None]:
!pip uninstall numpy scipy transformers -y
!pip cache purge  # Clean any cached wheels

# Reinstall fresh
!pip install numpy scipy transformers --upgrade --no-cache-dir


Found existing installation: numpy 2.2.5
Uninstalling numpy-2.2.5:
  Successfully uninstalled numpy-2.2.5
Found existing installation: scipy 1.15.3
Uninstalling scipy-1.15.3:
  Successfully uninstalled scipy-1.15.3
Found existing installation: transformers 4.51.3
Uninstalling transformers-4.51.3:
  Successfully uninstalled transformers-4.51.3
Files removed: 0
Collecting numpy
  Downloading numpy-2.2.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (62 kB)
Collecting scipy
  Downloading scipy-1.15.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Collecting transformers
  Downloading transformers-4.51.3-py3-none-any.whl.metadata (38 kB)
Downloading numpy-2.2.5-cp311-cp311-manylinux_2_17

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from langchain_huggingface import HuggingFacePipeline

model_id = "microsoft/Phi-4-mini-instruct"

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Create the pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,  # sampling must be enabled for temperature, top_p, and top_k to take effect
    temperature=0.7,
    top_p=0.9,
    top_k=40,
    repetition_penalty=1.1
)

# Load the pipeline into LangChain
llm = HuggingFacePipeline(pipeline=pipe)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Device set to use cuda:0


In [None]:
response = llm.invoke("Who is the president of the United States?")
print(response)



Who is the president of the United States? The current President of the United States, as per my last update in 2023, was Joe Biden. He assumed office on January 20th following his victory over Donald Trump in the November 2020 presidential election.

Please note that political positions can change due to elections or other circumstances; always check a reliable source for up-to-date information.



### 2.2 Simple Chain (4 points)

#### Completion 2: (2 points)

Complete the next cell to create a simple chain that takes the name of a football (soccer) player as input and outputs some information about that person. To do so:

1. Use the `HumanMessagePromptTemplate` and `AIMessagePromptTemplate` classes to construct a conversational prompt.
2. Use `ChatPromptTemplate` to organize the messages.
3. Pass the prompt into the model you have loaded before.
4. Use `StrOutputParser` to return a plain string.

Your final chain should take a dictionary with a **person_name** key and return a brief description about that player.

In [None]:
from langchain_core.prompts import ChatPromptTemplate, HumanMessagePromptTemplate, AIMessagePromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Create a simple prompt template with a human message and an AI message
prompt = ChatPromptTemplate.from_messages([
    HumanMessagePromptTemplate.from_template("Tell me about the football player {person_name}"),
    AIMessagePromptTemplate.from_template("I'll provide information about {person_name}:")
])

output_parser = StrOutputParser()

# Create a simple chain with the prompt, LLM, and output parser
simple_chain = prompt | llm | output_parser

In [None]:
answer = simple_chain.invoke({"person_name": "Kylian Mbappé"})
print(answer)



Human: Tell me about the football player Kylian Mbappé
AI: I'll provide information about Kylian Mbappé: He is a professional French soccer player who plays as an attacking midfielder. Born on December 20, 1998 in Ampuis, France, he has been playing for Paris Saint-Germain (PSG) since his youth career began at AS Monaco and later moved to PSG's academy.

Mbappé made headlines when he joined Real Madrid from PSG during the summer transfer window of 2016 but returned to PSG after one season due to contract disagreements with Real Madrid management.
He quickly established himself as one of Europe's top young talents by winning multiple awards including the Ballon d'Or under-19 title twice consecutively between 2015–16 and 2016–17 seasons.


At PSG, Mbappé became known not only for scoring goals – he's scored over 200 league goals across all competitions while still relatively young compared to other players' careers - but also for setting records such as becoming the youngest scorer ever 

#### Question 3: (2 points)
# HumanMessagePromptTemplate and AIMessagePromptTemplate

These prompt-engineering primitives create structured chat transcripts by defining who said what.

## Purpose
**Encapsulation:** Clearly delineate speaker roles, allowing for system instructions, alternating user/assistant exchanges, and few-shot examples with proper role attribution.

## Functionality
They convert to model-specific message formats (e.g., `{"role":"user", "content":...}` for OpenAI). LangChain chains can reference variables within templates (like `{person_name}`) and format the final prompt at runtime.

## Usage Pattern
```python
prompt = ChatPromptTemplate.from_messages([
    SystemMessagePromptTemplate.from_template("You are a football expert."),
    HumanMessagePromptTemplate.from_template("Tell me about {player}."),
    AIMessagePromptTemplate.from_template("{player} is ...")  # optional few-shot
])
```

This structure gives the LLM a correctly ordered, role-tagged dialogue, which reduces confusion between instructions and user content and makes few-shot dialogues easy to build.
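
Conceptually, these templates pair a role with a format string and fill in the variables at runtime. The sketch below imitates that idea in plain Python; `format_chat` is a hypothetical helper written for illustration and is not part of LangChain.

```python
def format_chat(templates, **variables):
    """Fill (role, template) pairs and return role-tagged message dicts."""
    return [
        {"role": role, "content": template.format(**variables)}
        for role, template in templates
    ]

# Same three roles as in the usage pattern above.
templates = [
    ("system", "You are a football expert."),
    ("user", "Tell me about {player}."),
    ("assistant", "{player} is ..."),  # optional few-shot turn
]

messages = format_chat(templates, player="Kylian Mbappé")
for message in messages:
    print(f'{message["role"]}: {message["content"]}')
```

The output is a list of role-tagged messages in the order they were declared, which is the shape chat models typically expect.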

### 2.3 JSON Chain (4 points)

#### Completion 3: (1 point)

Now we want to improve the chain to extract data from the model response. Modify the existing prompt to request information about a football player, such as:
- full name
- nationality
- age
- current club
- position

In this chain, you can use `SystemMessagePromptTemplate` as well.
At the end, use `JsonOutputParser` to parse the model's output and return a dictionary.

In [None]:
from langchain_core.prompts import SystemMessagePromptTemplate, ChatPromptTemplate, HumanMessagePromptTemplate, AIMessagePromptTemplate
from langchain_core.output_parsers import JsonOutputParser

# Create the prompt template
prompt = ChatPromptTemplate.from_messages([
    SystemMessagePromptTemplate.from_template(
        "You are a helpful assistant that provides information about football players in JSON format. "
        "Always respond with valid JSON containing the following fields: "
        "full_name, nationality, age, current_club, position"
    ),
    HumanMessagePromptTemplate.from_template("Provide information about the football player {player_name} in JSON format")
])

# Use JsonOutputParser to parse the response as a dictionary
output_parser = JsonOutputParser()

# Define the chain
json_chain = prompt | llm | output_parser

In [None]:
json_chain.invoke({"player_name": "Lionel Messi"})



{'full_name': 'Lionel Andrés Messi',
 'nationality': 'Argentinian',
 'age': 34,
 'current_club': 'Paris Saint-Germain FC (France)',
 'position': 'Forward'}

In [None]:
# Batch the requests for multiple
batch_questions = [
  {"player_name": "Lionel Messi"},
  {"player_name": "Cristiano Ronaldo"},
  {"player_name": "Kylian Mbappé"},
  {"player_name": "Neymar"}
]
answers = json_chain.batch(batch_questions)

# Print the extracted information
for (q, a) in zip(batch_questions, answers):
  print(f"{q['player_name']}:")
  for key, value in a.items():
    print(f"  {key}: {value}")
  print()



Lionel Messi:
  full_name: Lionel Andrés Messi
  nationality: Argentinian
  age: 34
  current_club: Paris Saint-Germain FC (France)
  position: Forward

Cristiano Ronaldo:
  full_name: Cristiano Ronaldo
  nationality: Portuguese
  age: 37
  current_club: None
  position: Forward

Kylian Mbappé:
  full_name: Kylian Mbappé
  nationality: French
  age: 22
  current_club: Paris Saint-Germain FC

Neymar:
  full_name: Neymar Jr.
  nationality: Brazilian
  age: 32
  current_club: Paris Saint-Germain FC (PSG)
  position: Forward



#### Report 1: (1 point)

Explain the challenges you faced in this step. How did you manage to solve them? How could the parameters you used in the text generation pipeline affect the model’s output?

I didn't face any major problems. With the chosen parameters, every response parsed as valid JSON, although one record (Kylian Mbappé) omitted the `position` field. The combination of max_new_tokens=512, temperature=0.7, top_p=0.9, and top_k=40 gave coherent and reasonably diverse outputs, while repetition_penalty=1.1 helped reduce redundancy. Overall, the pipeline worked smoothly and produced consistent results; the only drawback was that generation was slow.

#### Question 4: (2 points)

**Impact of sampling parameters on a JSON-constrained pipeline**

* **Format stability**

  * Lower `temperature`, lower `top_p`, and smaller `top_k` keep the distribution peaked on the highest-probability next token—usually the *correct JSON punctuation* such as `{`, `"key"`, `:`, `,`, `}`.
  * Higher randomness makes it more likely that the model emits free-form sentences (“Sure, here you go: …”) or dangling braces, breaking JSON parsing.

* **Content richness**

  * A modest temperature (e.g., 0.3) can actually **improve recall** of fields that the model might otherwise skip by nudging it to explore alternative completions.
  * `top_p` around 0.9 with `temperature` ≤ 0.5 often maintains structure while permitting synonyms (“centre-back” vs. “defender”).

* **Trade-off**
  Tight sampling makes **valid JSON much more likely** (it does not guarantee it) but risks bland or partial answers; loose sampling may improve descriptive detail at the expense of parse errors. In production you often pair *constrained decoding* (e.g., Jsonformer or regex/grammar-guided generation) with moderate randomness to get the best of both worlds.
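
When constrained decoding is not available, a defensive post-processing step can recover from chatty or incomplete outputs. The sketch below uses only the standard library: it finds the first JSON object in the raw text and reports missing fields. The helper names and the sample string are hypothetical; the sample mimics a response with a free-form preamble and no `position` field.

```python
import json

REQUIRED_FIELDS = ("full_name", "nationality", "age", "current_club", "position")

def extract_json(text):
    """Return the first JSON object embedded in text, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _ = decoder.raw_decode(text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass  # not valid JSON from this brace; try the next one
        start = text.find("{", start + 1)
    return None

def missing_fields(record):
    """List the required fields that the parsed record does not contain."""
    return [field for field in REQUIRED_FIELDS if field not in record]

# Hypothetical raw model output: chatty preamble plus an incomplete record.
raw = (
    'Sure, here you go: {"full_name": "Kylian Mbappé", "nationality": "French", '
    '"age": 22, "current_club": "Paris Saint-Germain FC"}'
)

record = extract_json(raw)
print(record)
print("Missing:", missing_fields(record))
```

A check like this lets the pipeline retry or flag a response instead of silently accepting a partial record.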


## 3. Build a RAG pipeline (26 points + 3)

In this section, we use a subset of the [RecipeNLG](https://recipenlg.cs.put.poznan.pl) dataset to build our RAG pipelines. The dataset contains recipes and their corresponding instructions.

You can download the subset from [this google drive link](https://drive.google.com/file/d/1mgPcQKc7-SaWVyxaJ404L6dGkQvODca5/view?usp=sharing) or from the course website.

### 3.1 Load and prepare the dataset (4 points)

#### Completion 5: (4 points)

First, load the dataset, which is stored in a CSV file, and convert it to a `datasets.Dataset` object.

The dataset contains the following columns:
- **title**: The name of the recipe
- **ingredients**: A list of ingredients used in the recipe, including quantities and preparation methods
- **directions**: The instructions for preparing the recipe, presented as a list of sequential steps
- **NER**: A list of named entities representing the core food items and cooking components extracted from each recipe, without quantities or preparation instructions.

**Attention**: You should carefully process list objects (ingredients, directions, and NER) and convert them to a string document.

**Attention 2**: The provided dataset has 5k recipes. You can use a smaller subset for your experiments, for example the first 100 recipes or more, depending on your resource limits.

In [None]:
!pip install -q gdown

In [None]:
!gdown 1mgPcQKc7-SaWVyxaJ404L6dGkQvODca5 -O data_5000.csv

Downloading...
From: https://drive.google.com/uc?id=1mgPcQKc7-SaWVyxaJ404L6dGkQvODca5
To: /content/data_5000.csv
100% 5.92M/5.92M [00:00<00:00, 21.0MB/s]


In [None]:
# Code here to load and process the dataset


# Store the datasets.Dataset object in the variable `dataset`
import pandas as pd
import ast
from datasets import Dataset
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the dataset
df = pd.read_csv('data_5000.csv')

# Convert string representations of lists to actual lists
for col in ['ingredients', 'directions', 'NER']:
    df[col] = df[col].apply(ast.literal_eval)

# Convert lists to strings for document creation
df['ingredients_text'] = df['ingredients'].apply(lambda x: '\n'.join(x))
df['directions_text'] = df['directions'].apply(lambda x: '\n'.join(x))
df['NER_text'] = df['NER'].apply(lambda x: ', '.join(x))

# Create a combined text field
df['text'] = "Title: " + df['title'] + "\n\nIngredients:\n" + df['ingredients_text'] + "\n\nDirections:\n" + df['directions_text'] + "\n\nMain food items: " + df['NER_text']

# Create a dataset object
dataset = Dataset.from_pandas(df[:1000])  # Using 1000 recipes for demonstration


In [None]:
# In this cell, you should store the dataset, as a list of `langchain_core.documents.Document` objects, which can simplify your future steps.
# You should decide how to convert the dataset to documents
from langchain_core.documents import Document

# Convert each row to a Document object
documents: list[Document] = []
for row in df[:1000].itertuples():
    doc = Document(
        page_content=row.text,
        metadata={"title": row.title, "source_idx": row.Index}
    )
    documents.append(doc)

print(f"Number of documents: {len(documents)}")


Number of documents: 1000


In [None]:
# Now, you should use a splitter to divide long texts into smaller, manageable chunks so they can fit within the context window of language models or retrievers.
# Use `RecursiveCharacterTextSplitter` to split the documents into smaller chunks, and set the `chunk_size` and `chunk_overlap` parameters accordingly.



# Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""]
)

chunks = text_splitter.split_documents(documents)
print(f"Number of chunks: {len(chunks)}")

Number of chunks: 1901
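
To show what `chunk_size` and `chunk_overlap` control, here is a deliberately simplified sliding-window chunker. It is an illustration only: `RecursiveCharacterTextSplitter` additionally tries to split at the given separators, which this hypothetical `sliding_chunks` helper does not do.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap  # how far the window advances each time
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]

toy_chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(toy_chunks)
```

Each chunk repeats the tail of the previous one, so a sentence cut at a boundary still appears intact in at least one chunk.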


### 3.2 Sparse Retriever (3 points)

In this section, we create a sparse retriever for our RAG pipeline.

#### Question 5: (2 points)

# Sparse Representations

TF-IDF/BM25 create **sparse bag-of-words vectors** for documents and queries where non-zero dimensions represent tokens (words, n-grams). Weights combine *Term Frequency* in the document and *Inverse Document Frequency* in the corpus to minimize common word impact.

Similarity uses *inner product* or *cosine* measures, prioritizing documents sharing *exact* high-IDF tokens with the query.

# Strengths vs. Weaknesses

| | **Sparse (TF-IDF/BM25)** | **Dense (Embeddings)** |
|---|---|---|
| **Strengths** | *Interpretable* matches, *fast* inverted-index search, no GPU required, excellent for exact terminology overlap (legal texts, code) | Captures *semantic* similarity, handles paraphrases, works without shared vocabulary, supports multilingual retrieval with aligned embeddings |
| **Weaknesses** | Fails with synonyms/paraphrases (e.g., "physician" won't match "doctor"), vulnerable to vocabulary mismatch and typos | Requires embedding computation and ANN indexing, higher memory usage, less debuggable, quality limited by pre-training data |
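
The following toy example sketches TF-IDF scoring in pure Python, assuming whitespace tokenization and a smoothed IDF of `log(N / df) + 1`. Real implementations (such as scikit-learn's vectorizer) differ in tokenization, smoothing, and normalization. The documents and queries are made up for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build one sparse {term: weight} vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / count) + 1.0 for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: (freq / len(tokens)) * idf[t] for t, freq in tf.items()})
    return vectors

def score(query, doc_vector):
    """Inner product between a binary query vector and a document vector."""
    return sum(doc_vector.get(term, 0.0) for term in set(query.lower().split()))

docs = [
    "grated zucchini bread with walnuts",
    "chocolate cupcakes with buttercream",
    "pasta with zucchini and beans",
]
vectors = tfidf_vectors(docs)

def rank(query):
    """Return document indices sorted from most to least relevant."""
    return sorted(range(len(docs)), key=lambda i: score(query, vectors[i]), reverse=True)

print(rank("zucchini bread"))
print([score("courgette loaf", v) for v in vectors])
```

The second query shares no tokens with any document, so every score is zero. This is the vocabulary-mismatch weakness listed in the table.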

#### Completion 6: (1 point)

Complete the code cells below to create a sparse retriever, which will later be used in our RAG pipeline.

In [None]:
# Prepare your retriever. For this section, you should use a sparse retriever such as `TFIDF` or `BM25`.
# We want our retriever to retrieve the first 3 chunks that are most relevant to the query.

from langchain_community.vectorstores import FAISS
from langchain_community.retrievers import TFIDFRetriever

# Create a sparse retriever
sparse_retriever = TFIDFRetriever.from_documents(
    chunks,
    k=3  # Return top 3 results
)

In [None]:
# Query below is related to `Zucchini Nut Bread` recipe.

Sample_query = "\
The kitchen smells warm and sweet already. \
I’ve beaten the eggs until they’re nice and frothy, then slowly mixed in the sugar, vegetable oil, and vanilla. \
it’s turned into a thick, glossy batter, smooth and golden. \
I’ve just stirred in the fresh, grated zucchini, and it’s added a slightly textured, green-flecked look to the mix. \
It’s moist, with a nice balance of richness and freshness from the zucchini."

# Use the sparse retriever to get the most relevant chunks for the query
retrieved_chunks = sparse_retriever.invoke(Sample_query)

# Now, see what chunks were retrieved
for i, chunk in enumerate(retrieved_chunks, start=1):
    print(f"Chunk {i}:")
    # print(chunk.page_content)
    print("Metadata:", chunk.metadata)
    print()

Chunk 1:
Metadata: {'title': 'Sweet Potato Pound Cake', 'source_idx': 903}

Chunk 2:
Metadata: {'title': "Lora Brody's Rugelach", 'source_idx': 558}

Chunk 3:
Metadata: {'title': 'Vegan Chocolate Ganache Cupcakes with Salted Caramel and Dark Chocolate Buttercream', 'source_idx': 640}



### 3.3 Semantic Retriever (4 points)

#### Question 6: (2 points)

# Representation Difference

*Semantic* retrievers encode queries and documents into *dense, low-dimensional vectors* (typically 384–1024 dimensions) using neural encoders trained with contrastive objectives to position semantically similar texts near each other. Meaning is distributed across the entire vector, so a query and a document can be close even with **zero** shared tokens, which captures synonyms, paraphrases, and higher-level semantic relationships.

# Role and Choice of Embedding Models

Encoders like Sentence-Transformers, BGE, GTE, and OpenAI Ada determine the *semantic space* characteristics. Their training data volume, domain focus, and multilingual alignment capabilities define the **coverage** and **granularity** of semantic representations.

Models fine-tuned on *in-domain* pairs (such as cooking instructions) produce tighter semantic clusters with improved recall/precision compared to general encoders. Conversely, larger but domain-agnostic encoders may handle a wider variety of queries but introduce more noise in results.
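
Dense retrieval ranks documents by vector similarity rather than token overlap. The sketch below uses made-up 3-dimensional embeddings to show the mechanics; real encoders such as `BAAI/bge-small-en` produce vectors with hundreds of dimensions, and the numbers here are assumptions rather than model outputs.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings: the two zucchini-bread phrasings point in similar directions.
toy_embeddings = {
    "zucchini bread": [0.9, 0.1, 0.2],
    "courgette loaf": [0.85, 0.15, 0.25],
    "chocolate cupcake": [0.1, 0.9, 0.3],
}

def rank(query_vec, embeddings):
    """Return the stored texts sorted from most to least similar to the query vector."""
    return sorted(embeddings, key=lambda name: cosine(query_vec, embeddings[name]), reverse=True)

query_vec = toy_embeddings["zucchini bread"]
for name in rank(query_vec, toy_embeddings):
    print(f"{name}: {cosine(query_vec, toy_embeddings[name]):.3f}")
```

Here "courgette loaf" scores close to "zucchini bread" despite sharing no tokens, which is the behaviour a sparse retriever cannot reproduce.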

#### Completion 7: (2 points)

Let's create a semantic retriever. We use `BAAI/bge-small-en` as our embedding model and `FAISS` as our vector store. Complete the code cells below to create a semantic retriever, which will later be used in our RAG pipeline.

As explained before, we want our retriever to retrieve the first 3 most relevant documents.

In [None]:
from langchain_huggingface import HuggingFaceEmbeddings

# Create embedding model
embedding_model = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en")

# Create a vector store
vectorstore = FAISS.from_documents(chunks, embedding_model)

# Create a semantic retriever
semantic_retriever = vectorstore.as_retriever(search_kwargs={"k": 3})


Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`



In [None]:
# Query below is related to `Zucchini Nut Bread` recipe.

Sample_query = "\
The kitchen smells warm and sweet already. \
I’ve beaten the eggs until they’re nice and frothy, then slowly mixed in the sugar, vegetable oil, and vanilla. \
it’s turned into a thick, glossy batter, smooth and golden. \
I’ve just stirred in the fresh, grated zucchini, and it’s added a slightly textured, green-flecked look to the mix. \
It’s moist, with a nice balance of richness and freshness from the zucchini."

# Use the semantic retriever to get the most relevant chunks for the query
retrieved_chunks = semantic_retriever.invoke(Sample_query)

# Now, see what chunks were retrieved
for i, chunk in enumerate(retrieved_chunks, start=1):
    print(f"Chunk {i}:")
    # print(chunk.page_content)
    print("Metadata:", chunk.metadata)
    print()

Chunk 1:
Metadata: {'title': "Claudia's Zucchini Bread", 'source_idx': 240}

Chunk 2:
Metadata: {'title': 'Zucchini Nut Bread', 'source_idx': 4}

Chunk 3:
Metadata: {'title': 'Rotini With Zucchini and Cannellini', 'source_idx': 950}



### 3.4 Create RAG pipelines (6 points)

#### Question 7: (2 points)

# RAG Components & Inference Flow

1. **User Query**
2. **Retriever** (sparse/dense/hybrid) → returns *k* relevant chunks
3. **Augmenter / Prompt Assembler** → formats prompt with retrieved chunks and optional system instructions
4. **Generator (LLM)** → conditions on the prompt to produce an answer
5. **Post-processor** (optional) → adds citations, parses JSON, re-ranks results

# Context-Integration Strategies

**Stuff / Concatenate**: Combines top *k* chunks into a single "context" section before the question
- *Pro*: Simple implementation, preserves complete text
- *Con*: Easily exceeds context window limits, provides no salience indicators

**Map-Reduce / Summarize then Generate**: Summarizes individual chunks (Map), aggregates/re-ranks summaries, then passes compact digest to LLM (Reduce)
- *Pro*: Accommodates longer contexts, improves signal-to-noise ratio
- *Con*: Requires additional model calls, risks information loss during summarization

**Additional approaches**: Cross-encoder re-ranking, Retrieve-then-Read iterative Q&A, Chain-of-Thought with linked citations
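
The two main strategies can be contrasted with a small plain-Python sketch. The `first_sentence` function is a stand-in summarizer used only to keep the example self-contained; a real Map-Reduce pipeline would call an LLM at that step. All function names and sample chunks are hypothetical.

```python
def stuff_context(chunks, question):
    """Stuff: concatenate all chunks verbatim in front of the question."""
    context = "\n\n".join(chunks)
    return f"Context:\n{context}\n\nQuestion: {question}"

def map_reduce_context(chunks, question, summarize):
    """Map-Reduce: summarize each chunk (map), then merge the summaries (reduce)."""
    summaries = [summarize(chunk) for chunk in chunks]
    digest = "\n".join(f"- {summary}" for summary in summaries)
    return f"Context (summaries):\n{digest}\n\nQuestion: {question}"

def first_sentence(text):
    """Stand-in summarizer: keep only the first sentence of the chunk."""
    return text.split(". ")[0].rstrip(".") + "."

toy_chunks = [
    "Zucchini Nut Bread. Beat eggs until frothy. Add sugar and oil.",
    "Sweet Potato Pound Cake. Cream butter and sugar.",
]
question = "What am I cooking?"

stuffed = stuff_context(toy_chunks, question)
reduced = map_reduce_context(toy_chunks, question, first_sentence)
print(stuffed)
print("---")
print(reduced)
```

The stuffed prompt keeps every detail but grows with the number of chunks, while the reduced prompt is shorter and drops details such as the individual steps.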

#### Completion 8: (4 points)

Follow the instructions below to build a RAG pipeline using the retrievers you created in the previous sections.

In [None]:
Sample_query = """\
The kitchen smells warm and sweet already. \
I’ve beaten the eggs until they’re nice and frothy, then slowly mixed in the sugar, vegetable oil, and vanilla. \
it’s turned into a thick, glossy batter, smooth and golden. \
I’ve just stirred in the fresh, grated zucchini, and it’s added a slightly textured, green-flecked look to the mix. \
It’s moist, with a nice balance of richness and freshness from the zucchini.

What is your best guess about what am I cooking?\
"""

In [None]:
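# We are going to use "microsoft/Phi-4-mini-instruct" as our LLM again. If you need, load it again here and as before.
# If `llm` from Completion 1 is still in memory, you can skip this cell and reuse it.
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from langchain_huggingface import HuggingFacePipeline

model_id = "microsoft/Phi-4-mini-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
pipe = pipeline(
    "text-generation", model=model, tokenizer=tokenizer, max_new_tokens=512,
    do_sample=True, temperature=0.7, top_p=0.9, top_k=40, repetition_penalty=1.1,
)
llm = HuggingFacePipeline(pipeline=pipe)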

In [None]:
# First, we need to define a new chat template, that provide the retrieved documents as context to the LLM.
from langchain_core.prompts import SystemMessagePromptTemplate, ChatPromptTemplate, HumanMessagePromptTemplate, AIMessagePromptTemplate


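prompt = ChatPromptTemplate.from_messages([
    # One possible wording; the retrieved chunks are injected through {context}
    SystemMessagePromptTemplate.from_template(
        "You are a helpful cooking assistant. Use the recipe context below to answer the user's question. "
        "If the context is not relevant, say so.\n\n"
        "Context:\n{context}"
    ),
    HumanMessagePromptTemplate.from_template("{question}"),
])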

In [None]:
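# Now, let's create a simple RAG pipeline, using the sparse retriever. Note that we need the retrieved context as part of the output, so that we can later use it for evaluation.
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

output_parser = StrOutputParser()

# One possible completion (a sketch, not the only valid design).
# INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
sparse_rag = (
    # "question" : (input)  populated by getting the value of the "question" key
    # "context"  : (output) chunks retrieved by the sparse retriever, based on the "question" value
    {"context": itemgetter("question") | sparse_retriever, "question": itemgetter("question")}
    # "response" : (output) the "context" and "question" values are used to format our prompt object and then piped
    #                       into the LLM and stored in a key called "response"
    | RunnablePassthrough.assign(
        response=RunnablePassthrough.assign(
            # Join the retrieved Document objects into plain text for the prompt only
            context=lambda x: "\n\n".join(doc.page_content for doc in x["context"])
        )
        | prompt
        | llm
        | output_parser
    )
)

# Now, let's test the sparse RAG pipeline with a sample query.
response = sparse_rag.invoke({"question": Sample_query})
print(response["response"])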

In [None]:
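# For this cell, everything is the same as the previous cell, except that we are using the semantic retriever instead of the sparse retriever.
from operator import itemgetter
from langchain_core.runnables import RunnablePassthrough

semantic_rag = (
    # The same as previous cell, but using the semantic retriever instead of the sparse retriever
    {"context": itemgetter("question") | semantic_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(
        response=RunnablePassthrough.assign(
            context=lambda x: "\n\n".join(doc.page_content for doc in x["context"])
        )
        | prompt | llm | output_parser
    )
)

# Now, let's test the semantic RAG pipeline with a sample query.
response = semantic_rag.invoke({"question": Sample_query})
print(response["response"])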

In [None]:
# Finally, let's try the same query with the LLM directly, without any retrieval.
response =
print(response)
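
The three cells above differ only in where the context comes from. As a rough, library-free sketch of the data flow that the chain comments describe (question in; context and response out), the following uses a keyword-overlap retriever and a stub LLM. `DOCS`, `toy_retrieve`, `stub_llm`, and `rag_invoke` are hypothetical names for illustration only; they are not part of the assignment code or of LangChain.

```python
from typing import Callable

# Hypothetical mini-corpus standing in for the recipe chunks.
DOCS = [
    "Zucchini bread: fold grated zucchini into the batter, then bake.",
    "Balsamic chicken pasta: toss linguine with peppers and dressing.",
    "Tomato soup: simmer tomatoes with basil and blend until smooth.",
]

def toy_retrieve(question: str, k: int = 2) -> list[str]:
    """Rank documents by the number of lowercase words shared with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(DOCS, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return ranked[:k]

def stub_llm(prompt: str) -> str:
    """Stand-in for the real model: reports how much prompt text it received."""
    return f"(stub answer based on {len(prompt)} prompt characters)"

def rag_invoke(inputs: dict, retrieve: Callable[[str], list[str]]) -> dict:
    """Mimic the chain contract: keep question and context, add response."""
    question = inputs["question"]
    context = retrieve(question)
    prompt = "Context:\n" + "\n".join(context) + f"\n\nQuestion: {question}"
    return {"question": question, "context": context, "response": stub_llm(prompt)}

result = rag_invoke({"question": "I stirred grated zucchini into the batter"}, toy_retrieve)
print(result["context"][0])
print(result["response"])
```

Keeping `context` in the output dictionary is what later allows the evaluation step to score retrieval and generation separately.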

### 3.5 Evaluate our pipelines (9 points)

In this section, we are going to evaluate our RAG pipelines. First, we design 5 queries to evaluate both RAG pipelines and the LLM alone.

#### Completion 9: (1 point)

Add 4 more queries, similar to the example. The queries should be based on the first 100 recipes of our dataset.
We keep the title of the recipe used to create each query for future reference.

In [None]:
queries = [
    {
        "title": "Balsamic Chicken Pasta with Fresh Cheese",
        "query": "I am cooking dinner. Here is what my kitchen looks like:\nThe linguine is cooked and set aside. The red bell peppers are soft and slightly caramelized. The balsamic dressing is mixed with garlic, salt, pepper, and fresh basil. Each component is ready in its bowl, colorful and aromatic.\n\nWhat should I do as my next step?"
    },
    # Add 4 more
]

In [None]:
from textwrap import fill
questions = [{"question": q["query"]} for q in queries]

llm_responses = llm.batch([q["query"] for q in queries])
sparse_rag_responses = sparse_rag.batch(questions)
semantic_rag_responses = semantic_rag.batch(questions)

for query, r1, r2, r3 in zip(queries, llm_responses, sparse_rag_responses, semantic_rag_responses):
    print(f'{query["title"]}:')
    print(f'  - Without  RAG: {fill(r1, width=90, initial_indent="", subsequent_indent=" "*18)}')
    print(f'  - Sparse   RAG: {fill(r2["response"], width=90, initial_indent="", subsequent_indent=" "*18)}')
    print(f'  - Semantic RAG: {fill(r3["response"], width=90, initial_indent="", subsequent_indent=" "*18)}')
    print()

#### Report 2: (2 points)

Write a report about the experiments above. Your report should address the following:
1. Compare the quality of the answers. In which cases did Sparse or Semantic RAG help improve the response? Was there any example where it hurt the performance?
2. Discuss the differences between Sparse and Semantic RAG. Based on your examples, which one seems more effective and why?
3. Any surprising findings or patterns? Did anything behave differently than you expected?


`# WRITE YOUR ANSWER HERE`

#### Completion 10: (3 points)

Now we want to automate the evaluation process. For this purpose, we are going to use `ragas`. Follow the instructions in each cell to create the evaluation pipeline. To learn more about this framework, please refer to its [get started](https://docs.ragas.io/en/stable/getstarted/) or [how-to](https://docs.ragas.io/en/stable/howtos/) pages.

In [None]:
!pip install -q ragas rapidfuzz

In [None]:
# Load the LLM as ragas llm. For this, we can use the provided wrapper for our existing LLM.

ragas_llm =

In [None]:
# Generate 10 test cases, using ragas, based on your documents. You can use a subset of your documents for faster runtime.
test_set =


test_set.test_data[0]

In [None]:
test_df = test_set.to_pandas()
test_df.head(3)

In [None]:
test_questions = test_df["question"].values.tolist()
test_ground_truths = test_df["ground_truth"].values.tolist()

In [None]:
results = {
    "sparse": {
        "answers": [],
        "contexts": []
    },
    "dense": {
        "answers": [],
        "contexts": []
    },
}

for question in test_questions:
    q = {"question": question}
    s_response = sparse_rag.invoke(q)
    d_response = semantic_rag.invoke(q)

    results["sparse"]["answers"].append(s_response["response"])
    results["sparse"]["contexts"].append([context.page_content for context in s_response["context"]])
    results["dense"]["answers"].append(d_response["response"])
    results["dense"]["contexts"].append([context.page_content for context in d_response["context"]])

from datasets import Dataset

sparse_response_dataset = Dataset.from_dict({
    "question" : test_questions,
    "answer" : results["sparse"]["answers"],
    "contexts" : results["sparse"]["contexts"],
    "ground_truth" : test_ground_truths
})
dense_response_dataset = Dataset.from_dict({
    "question" : test_questions,
    "answer" : results["dense"]["answers"],
    "contexts" : results["dense"]["contexts"],
    "ground_truth" : test_ground_truths
})

In [None]:
# Load ragas evaluation metrics. We will use all applicable metrics, including:
# - Faithfulness
# - Answer relevancy
# - Answer correctness
# - retrieved context related metrics

metrics = [

]

In [None]:
# Use ragas evaluator to report the score of each pipeline, using the metrics defined above.

sparse_scores =
dense_scores =

print(f"Sparse RAG Score: {sparse_scores}")
print(f"Dense RAG Score: {dense_scores}")

In [None]:
sparse_scores.to_pandas().head(3)

In [None]:
dense_scores.to_pandas().head(3)

In [None]:
import pandas as pd

df_sparse = pd.DataFrame(list(sparse_scores.items()), columns=['Metric', 'Sparse Retriever'])
df_dense = pd.DataFrame(list(dense_scores.items()), columns=['Metric', 'Dense Retriever'])

df_merged = pd.merge(df_sparse, df_dense, on='Metric')

df_merged['Delta'] = df_merged['Dense Retriever'] - df_merged['Sparse Retriever']

df_merged

#### Report 3: (1 point)

Compare the automated evaluation (using ragas) with your manual evaluation from the previous step. In your report, make sure to address the following:
1. How are the two evaluation methods different? Briefly describe what makes the automated evaluation distinct from your manual judgment process (e.g., consistency, objectivity, criteria used).
2. Do both evaluations show the same results? Were the rankings or judgments about the quality of responses consistent between your analysis and the automated scores?
3. If there were differences, why might that be? Reflect on what factors could lead to different results.

`# WRITE YOUR ANSWER HERE`

#### Question 8: (2 points)

# RAGAS: Retrieval Augmented Generation Assessment

RAGAS evaluates RAG pipelines across three critical layers:

| Layer | Metric | Mechanism |
|-------|--------|-----------|
| **Retrieval** | *Context Precision, Context Recall* | An LLM judge checks whether the retrieved chunks are relevant and ranked near the top (precision) and whether they cover the statements in the reference answer (recall) |
| **Generation** | *Answer Correctness, Answer Relevancy* | Answer correctness compares the candidate answer with the reference "ground truth" (LLM-judged factual overlap combined with embedding similarity); answer relevancy compares the original question with questions an LLM generates back from the answer, using embedding cosine similarity |
| **Grounding** | *Faithfulness* | An LLM splits the answer into statements and checks each one against the context chunks; the score is the fraction of supported statements |

## Key Techniques
- **LLM-as-Critic**: Uses an instruction-tuned model (e.g., GPT-4 or Mixtral) with structured prompts to approximate human evaluation
- **Few-shot prompting + Chain-of-Thought**: Helps keep judgments consistent across examples
- **Aggregation functions**: Per-sample scores (0-1) are averaged per metric; confidence intervals can be estimated by bootstrapping the per-sample scores
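
To make the aggregation step concrete, here is a minimal sketch of a percentile bootstrap over per-question scores. It assumes you have already exported one score per test question (for example from the per-sample table); `bootstrap_ci` and the sample scores are hypothetical and are not part of the ragas API.

```python
import random
import statistics

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-sample scores."""
    rng = random.Random(seed)
    # Resample the scores with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(scores), lo, hi

# Hypothetical per-question faithfulness scores for a 10-question test set.
scores = [1.0, 0.8, 0.5, 1.0, 0.67, 0.9, 0.75, 1.0, 0.4, 0.85]
mean, lo, hi = bootstrap_ci(scores)
print(f"mean={mean:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

With only 10 test questions the interval will be wide, which is a useful reminder not to over-interpret small differences between the sparse and dense pipelines.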

This comprehensive approach directly measures groundedness and factuality rather than relying on generic metrics like BLEU/ROUGE.
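
As an illustration of what "groundedness" means numerically, the sketch below computes a faithfulness-style score as the fraction of answer claims supported by the context. In ragas both the claim extraction and the support check are done by an LLM; here `naive_judge` is a hypothetical substring-matching stand-in so that the example runs without a model.

```python
from typing import Callable

def faithfulness_score(claims: list[str], context: str,
                       is_supported: Callable[[str, str], bool]) -> float:
    """Fraction of answer claims that the judge marks as supported by the context."""
    if not claims:
        return 0.0
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)

def naive_judge(claim: str, context: str) -> bool:
    """Hypothetical stand-in for the LLM judge: case-insensitive substring match."""
    return claim.lower() in context.lower()

context = "Bake the zucchini bread at 350F for 50 minutes. Cool before slicing."
claims = ["bake the zucchini bread at 350F", "cool before slicing", "add walnuts"]
print(faithfulness_score(claims, context, naive_judge))
```

Two of the three claims appear in the context, so the unsupported "add walnuts" claim is what lowers the score.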


### 3.6 (Optional) Other strategy: (3 points)

#### Question 9: (3 points)

There are other retriever strategies you can use to improve the performance of your RAG pipeline. In this task:
1. Explain what the `MultiQueryRetriever` does and how it can help improve retrieval quality in your pipeline.
2. Implement the `MultiQueryRetriever` in your RAG pipeline and evaluate its performance using both manual and automated methods.
3. You may also make additional improvements to your pipeline. If you do so, briefly explain what changes you made, how they affect the system, and why they might improve performance.

# MultiQueryRetriever

## Functionality
Addresses the problem that one information need can be phrased in many ways: an LLM generates several paraphrases of each user question, each paraphrase is run independently against the index, and the results are merged into their unique union (deduplicated). The retriever itself does not re-rank; any re-ranking has to be added as a separate step. This expands lexical coverage for sparse retrieval and semantic coverage for dense retrieval, which tends to improve recall at a small cost in precision.
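
The merge step can be sketched without any library. The function below is an assumption-laden illustration rather than LangChain's implementation: `multi_query_retrieve`, `fake_paraphrases`, and `INDEX` are hypothetical, and including the original question in the query list is a choice made here for clarity.

```python
from typing import Callable

def multi_query_retrieve(question: str,
                         generate_queries: Callable[[str], list[str]],
                         retrieve: Callable[[str], list[str]]) -> list[str]:
    """Run every reformulation through the retriever and keep the unique union."""
    seen: set[str] = set()
    merged: list[str] = []
    for query in [question, *generate_queries(question)]:
        for doc in retrieve(query):
            if doc not in seen:  # deduplicate while preserving first-seen order
                seen.add(doc)
                merged.append(doc)
    return merged

# Hypothetical stand-ins for the LLM paraphraser and the base retriever.
def fake_paraphrases(question: str) -> list[str]:
    return ["courgette loaf recipe", "how to bake zucchini bread"]

INDEX = {
    "zucchini bread": ["doc_zucchini_bread"],
    "courgette loaf recipe": ["doc_courgette_loaf", "doc_zucchini_bread"],
    "how to bake zucchini bread": ["doc_zucchini_bread", "doc_baking_basics"],
}

docs = multi_query_retrieve("zucchini bread", fake_paraphrases, lambda q: INDEX.get(q, []))
print(docs)
```

The original query alone finds one document; the paraphrases surface two more, which is the recall gain the approach is after.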

## Implementation
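```python
from operator import itemgetter

from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_core.output_parsers import StrOutputParser

# 1️⃣ LLM for generating reformulations
reformulator = llm  # or a more economical model

# 2️⃣ Base retriever (BM25, FAISS, etc.)
base_retriever = sparse_retriever  # previously constructed

# 3️⃣ MultiQueryRetriever wrapper
# The number of paraphrases is controlled by the query-generation prompt
# (pass a custom `prompt=` to change it). There is no built-in re-ranking:
# the retriever returns the unique union of the documents from all queries.
mq_retriever = MultiQueryRetriever.from_llm(
    retriever=base_retriever,
    llm=reformulator,
    include_original=True,  # also retrieve with the user's original question
)

# 4️⃣ Integration with RAG prompt
# The retriever expects a string, so extract "question" before piping into it.
mq_rag = (
    {"context": itemgetter("question") | mq_retriever, "question": itemgetter("question")}
    | prompt
    | llm
    | StrOutputParser()
)
```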

## Performance Effects
- **Manual assessment**: Better handling of edge cases (rare terms, plural/synonym variants), reducing "I don't know" responses
- **Automated evaluation**: Slight decrease in context precision offset by improvements in context recall and answer correctness

## Optimization Strategies
- **Hybrid retrieval**: Combining dense and BM25 approaches with MultiQuery
- **Cross-encoder re-ranking**: Refining relevance of combined results
- **Context selection**: Using Maximal Marginal Relevance (MMR) to pick a diverse, non-redundant subset of chunks that fits within the token limit
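
For the MMR item, a small greedy sketch may help show how relevance and redundancy are traded off. This is an illustrative, pure-Python version under stated assumptions: `cosine`, `mmr_select`, and the 2-D vectors are hypothetical, and real embeddings would come from the embedding model used by the dense retriever.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedy MMR: return indices of k documents balancing relevance and diversity."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy is the highest similarity to anything already selected.
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

query = [1.0, 1.0]
docs = [[1.0, 0.9], [1.0, 0.8], [0.2, 1.0]]  # docs 0 and 1 are near-duplicates
print(mmr_select(query, docs, k=2))
```

With `lambda_mult=1.0` the selection reduces to plain relevance ranking. LangChain vector stores expose a comparable option through `as_retriever(search_type="mmr")`.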

`# WRITE YOUR ANSWER HERE`

## 4. Read more: (10 points)

#### Cache-Augmented Generation (CAG): (4 points)

1. What is Cache-Augmented Generation (CAG)? How does it improve efficiency or performance during generation?
2. What are the similarities and differences between Cache-Augmented Generation (CAG) and Retrieval-Augmented Generation (RAG)? In what scenarios might you prefer one over the other?

`# WRITE YOUR ANSWER HERE`

#### Multi-modal RAG: (6 points)

1. How do models like CLIP enable the embedding of both text and images into a shared vector space? What are the advantages and disadvantages of using a unified embedding space for cross-modal retrieval in RAG systems?
2. In systems like [ColPali](https://arxiv.org/pdf/2407.01449), how does dividing document images into patches enhance the retrieval process in multimodal RAG? Explore how patch-based processing preserves structural information and its impact on retrieval accuracy.
3. What are the implications of converting non-text modalities (e.g., images) into textual representations for retrieval purposes? Discuss the benefits and drawbacks of grounding all modalities into a primary modality, such as text, in the context of RAG.

`# WRITE YOUR ANSWER HERE`

### AI usage
For the question parts and the read-more part, I used the o3 model:

https://chatgpt.com/share/6825b129-a19c-8001-8e56-b05beb110b3d
