In [24]:
!pip install -q sentence-transformers faiss-cpu transformers torch pypdf rouge-score scikit-learn nltk


In [25]:
!pip install -q accelerate bitsandbytes


In [26]:
from pypdf import PdfReader
import os

all_docs = []

for file in os.listdir():
    if file.lower().endswith(".pdf"):
        print("Reading:", file)
        reader = PdfReader(file)

        for page in reader.pages:
            text = page.extract_text()
            if text:
                all_docs.append(text)

print("Total pages extracted:", len(all_docs))


Reading: cs229-notes2.pdf
Reading: cs229-notes1.pdf
Reading: cs229-notes8.pdf
Reading: cs229-notes7a.pdf
Reading: cs229-notes4.pdf
Reading: cs229-notes3.pdf
Total pages extracted: 91


In [27]:
import re

def advanced_chunk(text, chunk_size=350):
    # Split sentences using regex
    sentences = re.split(r'(?<=[.!?])\s+', text)

    chunks = []
    current_chunk = ""

    for sentence in sentences:
        if len(current_chunk) + len(sentence) <= chunk_size:
            current_chunk += " " + sentence
        else:
            chunks.append(current_chunk.strip())
            current_chunk = sentence

    if current_chunk:
        chunks.append(current_chunk.strip())

    return chunks


chunk_texts = []

for doc in all_docs:
    chunks = advanced_chunk(doc)
    chunk_texts.extend(chunks)

print("Initial chunks:", len(chunk_texts))


Initial chunks: 564


In [28]:
clean_chunks = []

for chunk in chunk_texts:

    if len(chunk) < 250:
        continue

    # Remove extremely formula-heavy chunks
    if chunk.count("=") > 5:
        continue

    if chunk.count("∑") > 2:
        continue

    clean_chunks.append(chunk)

chunk_texts = clean_chunks

print("Cleaned chunks:", len(chunk_texts))


Cleaned chunks: 332


In [29]:
from sentence_transformers import SentenceTransformer
import numpy as np
import faiss

embedding_model = SentenceTransformer("BAAI/bge-base-en-v1.5")


embeddings = embedding_model.encode(
    chunk_texts,
    convert_to_numpy=True,
    show_progress_bar=True
)

# Normalize for cosine similarity
faiss.normalize_L2(embeddings)

dimension = embeddings.shape[1]

index = faiss.IndexFlatIP(dimension)
index.add(embeddings)

print("FAISS index size:", index.ntotal)

id_to_text = {i: chunk_texts[i] for i in range(len(chunk_texts))}


Loading weights:   0%|          | 0/199 [00:00<?, ?it/s]

BertModel LOAD REPORT from: BAAI/bge-base-en-v1.5
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


Batches:   0%|          | 0/11 [00:00<?, ?it/s]

FAISS index size: 332


In [30]:
def retrieve_context(question, top_k_dense=40, final_k=8):

    query_embedding = embedding_model.encode(
        [question],
        convert_to_numpy=True
    )

    faiss.normalize_L2(query_embedding)

    D, I = index.search(query_embedding, top_k_dense)

    candidates = [id_to_text[int(i)] for i in I[0]]

    # Keyword re-ranking
    keywords = question.lower().split()

    scored_chunks = []

    for chunk in candidates:
        score = sum(chunk.lower().count(word) for word in keywords)
        scored_chunks.append((score, chunk))

    scored_chunks.sort(reverse=True, key=lambda x: x[0])

    return [chunk for score, chunk in scored_chunks[:final_k]]


In [31]:
print("\n\n".join(retrieve_context("What is gradient descent?")))


}
The reader can easily verify that the quantity in the summati on in the update
rule above is just ∂J (θ)/∂θ j (for the original deﬁnition of J). So, this is
simply gradient descent on the original cost function J. This method looks
at every example in the entire training set on every step, and is called batch
gradient descent .

19
Above, we used the fact that g′(z) = g(z)(1 − g(z)). This therefore gives us
the stochastic gradient ascent rule
θj := θj + α
(
y(i) − hθ(x(i))
)
x(i)
j
If we compare this to the LMS update rule, we see that it looks i dentical; but
this is not the same algorithm, because hθ(x(i)) is now deﬁned as a non-linear
function of θT x(i).

Here is an example of gradient descent as it is run to minimize a quadratic
function. 1We use the notation “ a := b” to denote an operation (in a computer program) in
which we set the value of a variable a to be equal to the value of b. In other words, this
operation overwrites a with the value of b.

Note that, while gradient d

In [23]:
def generate_answer(question, context_chunks):
    context = "\n\n".join(context_chunks)

    prompt = f"""
You are an AI/ML professor.

Context:
{context}

Question:
{question}

Answer:
"""

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    outputs = model.generate(
        **inputs,
        max_new_tokens=250,
        temperature=0.2,
        do_sample=True
    )

    return tokenizer.decode(outputs[0], skip_special_tokens=True)


In [32]:
def rag_pipeline(question):
    context = retrieve_context(question)
    return generate_answer(question, context)


In [22]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "mistralai/Mistral-7B-Instruct-v0.2"

tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)


config.json:   0%|          | 0.00/596 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

`torch_dtype` is deprecated! Use `dtype` instead!


model.safetensors.index.json: 0.00B [00:00, ?B/s]

Downloading (incomplete total...): 0.00B [00:00, ?B/s]

Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]

Loading weights:   0%|          | 0/291 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]



In [33]:
print(rag_pipeline("What is gradient descent?"))


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



You are an AI/ML professor.

Context:
}
The reader can easily verify that the quantity in the summati on in the update
rule above is just ∂J (θ)/∂θ j (for the original deﬁnition of J). So, this is
simply gradient descent on the original cost function J. This method looks
at every example in the entire training set on every step, and is called batch
gradient descent .

19
Above, we used the fact that g′(z) = g(z)(1 − g(z)). This therefore gives us
the stochastic gradient ascent rule
θj := θj + α
(
y(i) − hθ(x(i))
)
x(i)
j
If we compare this to the LMS update rule, we see that it looks i dentical; but
this is not the same algorithm, because hθ(x(i)) is now deﬁned as a non-linear
function of θT x(i).

Here is an example of gradient descent as it is run to minimize a quadratic
function. 1We use the notation “ a := b” to denote an operation (in a computer program) in
which we set the value of a variable a to be equal to the value of b. In other words, this
operation overwrites a with the v

In [34]:
def baseline_answer(question):

    prompt = f"""
You are an AI/ML expert.

Answer clearly and concisely:

Question:
{question}

Answer:
"""

    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        truncation=True,
        max_length=512
    )

    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        num_beams=4,
        do_sample=False
    )

    return tokenizer.decode(outputs[0], skip_special_tokens=True)


In [35]:
questions = [
    "What is gradient descent?",
    "What is overfitting?",
    "Explain logistic regression.",
    "What is the EM algorithm?",
    "What is stochastic gradient descent?"
]

reference_answers = [
    "Gradient descent is an optimization algorithm used to minimize a loss function by iteratively updating parameters in the direction of the negative gradient.",

    "Overfitting occurs when a model learns noise and specific details from training data, resulting in poor generalization to unseen data.",

    "Logistic regression is a classification algorithm that models the probability of a binary outcome using a sigmoid function.",

    "The EM algorithm is an iterative method used for maximum likelihood estimation in models with latent variables.",

    "Stochastic gradient descent is a variant of gradient descent that updates model parameters using one training example at a time."
]


In [36]:
from rouge_score import rouge_scorer
from sentence_transformers import util

questions = [
    "What is gradient descent?",
    "What is overfitting?",
    "Explain logistic regression.",
    "What is the EM algorithm?",
    "What is stochastic gradient descent?"
]

reference_answers = [
    "Gradient descent is an optimization algorithm used to minimize a loss function by iteratively updating parameters in the direction of the negative gradient.",
    "Overfitting occurs when a model learns noise and specific details from training data, resulting in poor generalization to unseen data.",
    "Logistic regression is a classification algorithm that models the probability of a binary outcome using a sigmoid function.",
    "The EM algorithm is an iterative method used for maximum likelihood estimation in models with latent variables.",
    "Stochastic gradient descent is a variant of gradient descent that updates model parameters using one training example at a time."
]

scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)

# Generate answers once
rag_outputs = [rag_pipeline(q) for q in questions]
baseline_outputs = [baseline_answer(q) for q in questions]

# Batch embeddings
ref_embeddings = embedding_model.encode(reference_answers, convert_to_numpy=True)
rag_embeddings = embedding_model.encode(rag_outputs, convert_to_numpy=True)
base_embeddings = embedding_model.encode(baseline_outputs, convert_to_numpy=True)

rag_rouge = []
baseline_rouge = []
rag_sem = []
baseline_sem = []

for i in range(len(questions)):

    rag_score = scorer.score(reference_answers[i], rag_outputs[i])['rougeL'].fmeasure
    base_score = scorer.score(reference_answers[i], baseline_outputs[i])['rougeL'].fmeasure

    rag_rouge.append(rag_score)
    baseline_rouge.append(base_score)

    rag_sim = util.cos_sim(ref_embeddings[i], rag_embeddings[i]).item()
    base_sim = util.cos_sim(ref_embeddings[i], base_embeddings[i]).item()

    rag_sem.append(rag_sim)
    baseline_sem.append(base_sim)

print("===== RESULTS =====\n")

print("Average ROUGE-L (RAG):", sum(rag_rouge)/len(rag_rouge))
print("Average ROUGE-L (Baseline):", sum(baseline_rouge)/len(baseline_rouge))

print("\nAverage Semantic Similarity (RAG):", sum(rag_sem)/len(rag_sem))
print("Average Semantic Similarity (Baseline):", sum(baseline_sem)/len(baseline_sem))


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


===== RESULTS =====

Average ROUGE-L (RAG): 0.047702417123380296
Average ROUGE-L (Baseline): 0.190165544332211

Average Semantic Similarity (RAG): 0.73734050989151
Average Semantic Similarity (Baseline): 0.8603696823120117


In [None]:
for i, q in enumerate(questions):
    print("\n===========================")
    print("Question:", q)
    print("\nRAG Answer:\n", rag_pipeline(q))
    print("\nBaseline Answer:\n", baseline_answer(q))


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Question: What is gradient descent?


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



RAG Answer:
 
You are an AI/ML professor.

Context:
}
The reader can easily verify that the quantity in the summati on in the update
rule above is just ∂J (θ)/∂θ j (for the original deﬁnition of J). So, this is
simply gradient descent on the original cost function J. This method looks
at every example in the entire training set on every step, and is called batch
gradient descent .

19
Above, we used the fact that g′(z) = g(z)(1 − g(z)). This therefore gives us
the stochastic gradient ascent rule
θj := θj + α
(
y(i) − hθ(x(i))
)
x(i)
j
If we compare this to the LMS update rule, we see that it looks i dentical; but
this is not the same algorithm, because hθ(x(i)) is now deﬁned as a non-linear
function of θT x(i).

Here is an example of gradient descent as it is run to minimize a quadratic
function. 1We use the notation “ a := b” to denote an operation (in a computer program) in
which we set the value of a variable a to be equal to the value of b. In other words, this
operation overwrite

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Baseline Answer:
 
You are an AI/ML expert.

Answer clearly and concisely:

Question:
What is gradient descent?

Answer:
Gradient Descent is an optimization algorithm used to minimize the function by iteratively moving in the direction of steepest descent as defined by the negative of the gradient. It's commonly used in machine learning to find the optimal parameters for a model, such as the weights and biases in a neural network. The goal is to find the minimum of the cost function, which represents the error between the predicted and actual values. Gradient Descent updates the parameters based on the gradient of the cost function with respect to those parameters. The learning rate determines the size of the steps taken during each iteration.

Question: What is overfitting?


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



RAG Answer:
 
You are an AI/ML professor.

Context:
this is the probability that, if we now draw a new example ( x, y) from
the distribution D, h will misclassify it. Note that we have assumed that the training data was drawn from t he
same distribution D with which we’re going to evaluate our hypotheses (in
the deﬁnition of generalization error).

(Later in this class, when we talk about learning
theory we’ll formalize some of these notions, and also deﬁne more carefully
just what it means for a hypothesis to be good or bad.)
As discussed previously, and as shown in the example above, th e choice of
features is important to ensuring good performance of a lear ning algorithm.

4
This is just the fraction of training examples that h misclassiﬁes. When we
want to make explicit the dependence of ˆ ε(h) on the training set S, we may
also write this a ˆεS(h). We also deﬁne the generalization error to be
ε(h) = P(x,y)∼D (h(x) ̸= y). I.e.

Since our training set was drawn iid from D, Z and t

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Baseline Answer:
 
You are an AI/ML expert.

Answer clearly and concisely:

Question:
What is overfitting?

Answer:
Overfitting is a common issue in machine learning where a model learns the training data too well, including its noise and outliers, to the extent that it negatively impacts the model's ability to generalize and make accurate predictions on new, unseen data. Overfitting occurs when a model has too many parameters relative to the amount and complexity of the training data, leading to memorization of the training data rather than learning the underlying patterns and relationships. Regularization techniques, such as L1 and L2 regularization, dropout, and early stopping, can be used to prevent overfitting.

Question: Explain logistic regression.


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



RAG Answer:
 
You are an AI/ML professor.

Context:
26
9.2 Logistic Regression
We now consider logistic regression. Here we are interested i n binary classiﬁ-
cation, so y ∈ { 0, 1}. Given that y is binary-valued, it therefore seems natural
to choose the Bernoulli family of distributions to model the conditional dis-
tribution of y given x.

For example, if x|y = 0 ∼ Poisson(λ 0), and
x|y = 1 ∼ Poisson(λ 1), then p(y|x) will be logistic. Logistic regression will
also work well on Poisson data like this. But if we were to use G DA on such
data—and ﬁt Gaussian distributions to such non-Gaussian da ta—then the
results will be less predictable, and GDA may (or may not) do w ell.

On one side of
the boundary, we’ll predict y = 1 to be the most likely outcome, and on the
other side, we’ll predict y = 0. 1.3 Discussion: GDA and logistic regression
The GDA model has an interesting relationship to logistic re gression. If we
view the quantity p(y = 1|x; φ, µ 0, µ 1, Σ) as a function of x, we’l

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Baseline Answer:
 
You are an AI/ML expert.

Answer clearly and concisely:

Question:
Explain logistic regression.

Answer:
Logistic Regression is a supervised machine learning algorithm used for binary classification problems. It works by estimating probabilities using a logistic function, which maps any input to a value between 0 and 1. This output can be interpreted as the probability of a given instance belonging to a certain class. The logistic function is defined as:

f(z) = 1 / (1 + e^(-z))

Where z is the weighted sum of features: z = w0 * x0 + w1 * x1 + w2 * x2 + ... + wn * xn

The goal of logistic regression is to find the best set of coefficients (w0, w1, w2, ..., wn) that maximize the likelihood of the observed data given the model. This is typically done by minimizing the log loss function, which measures the difference between the predicted probabilities and the true labels.



Question: What is the EM algorithm?
