# Group Project / Assignment 4: Instruction finetuning a Llama-3.2 model
**Assignment due 21 April 11:59pm**

Welcome to the fourth and final assignment for 50.055 Machine Learning Operations. The third and fourth assignments together form the course group project. You will continue working on a chatbot that answers prospective students' questions about SUTD.


**This assignment is a group assignment.**

- Read the instructions in this notebook carefully
- Add your solution code and answers in the appropriate places. The questions are marked as **QUESTION:**, and the places where you need to add your code and text answers are marked as **ADD YOUR SOLUTION HERE**. This assignment is more open-ended than previous assignments, i.e. you have more freedom in how you solve the problem and structure your code.
- The completed notebook, including your added code and generated output will be your submission for the assignment.
- The notebook should execute without errors from start to finish when you select "Restart Kernel and Run All Cells..". Please test this before submission.
- Use the SUTD Education Cluster to solve and test the assignment. If you work in another environment, at a minimum test your work on the SUTD Education Cluster.

**Rubric for assessment**

Your submission will be graded using the following criteria.
1. Code executes: your code should execute without errors. The SUTD Education cluster should be used to ensure the same execution environment.
2. Correctness: the code should produce the correct result or the text answer should state the factually correct answer.
3. Style: your code should be written in a way that is clean and efficient. Your text answers should be relevant, concise and easy to understand.
4. Partial marks will be awarded for partially correct solutions.
5. Creativity and innovation: in this assignment you have more freedom to design your solution than in the first assignments. You can show off your creativity and innovative mindset.
6. There is a maximum of 310 points for this assignment.

**ChatGPT policy**

If you use AI tools, such as ChatGPT, to solve the assignment questions, you need to be transparent about their use and mark AI-generated content as such. In particular, you should include the following in addition to your final answer:
- A copy or screenshot of the prompt you used
- The name of the AI model
- The AI generated output
- An explanation why the answer is correct or what you had to change to arrive at the correct answer

**Assignment Notes:** Please make sure to save the notebook as you go along. Submission Instructions are located at the bottom of the notebook.



### Finetuning LLMs

The goal of the assignment is to build a more advanced chatbot that can talk to prospective students and answer questions about SUTD.

We will finetune a smaller 1B LLM on question-answer pairs which we synthetically generate. Then we will compare the finetuned and non-finetuned LLMs with and without RAG to see if we were able to improve the SUTD chatbot answer quality.

We'll be leveraging `langchain`, `Llama 3.2` and `Google AI Studio with Gemini 2.0`.

Check out the docs:
- [LangChain](https://docs.langchain.com/docs/)
- [Llama 3.2](https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_2/)
- [Google AI Studio](https://aistudio.google.com/)

Note: Google AI Studio provides a lot of free tokens but has certain rate limits. Write your code in a way that it can handle these limits.
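One way to cope with rate limits is to wrap each API call in a retry helper with exponential backoff. The sketch below is a minimal, standard-library-only illustration; `call_with_retry` and its parameters are hypothetical names, not part of LangChain or the Gemini SDK, and the delay values are assumptions rather than documented limits.

```python
import random
import time


def call_with_retry(fn, *args, max_attempts=5, base_delay=2.0, max_delay=60.0,
                    sleep=time.sleep, **kwargs):
    """Call fn(*args, **kwargs), retrying with exponential backoff on any exception."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            # The delay doubles on each attempt (2s, 4s, 8s, ...), capped at
            # max_delay, plus up to 1s of jitter to avoid retrying in lockstep.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, 1))
```

A chain call could then be written as `call_with_retry(chain.invoke, {"num_topics": 20})`. In practice you may want to catch only the specific rate-limit exception raised by your client library instead of every `Exception`.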

# Install dependencies
Use pip to install all required dependencies of this assignment in the cell below. Make sure to test this on the SUTD cluster as different environments have different software pre-installed.  

In [1]:
# QUESTION: Install and import all required packages
# The rest of your code should execute without any import or dependency errors.

# --- ADD YOUR SOLUTION HERE (10 points) ---
# After running this cell once, restart runtime and re-run again.

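# PyMuPDF (installed further down) provides the `fitz` module; the unrelated PyPI packages `fitz` and `tools` are not needed.
!pip install peft streamlit flask_cors langchain langchain-google-genai google-generativeai==0.5.0 transformers sentence-transformers langchain-community huggingface_hub datasets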

import os
import json
from langchain.prompts import PromptTemplate
from langchain_core.prompts import ChatPromptTemplate
from langchain_google_genai import GoogleGenerativeAI
from langchain_core.output_parsers import JsonOutputParser
import google.generativeai as genai
import time
from datasets import Dataset, DatasetDict, load_dataset
from huggingface_hub import login
import random
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, TrainingArguments, Trainer, DataCollatorForLanguageModeling
from langchain.llms import HuggingFacePipeline

!pip install unstructured langchain pdfminer.six pi_heif unstructured-inference pdf2image tesseract pymupdf sentence-transformers faiss-cpu accelerate rank-bm25 nltk langchain-community

import requests
from urllib.parse import urljoin, urlparse, quote
from collections import deque
import hashlib
import tqdm
from bs4 import BeautifulSoup
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.schema import Document, BaseRetriever
import fitz
from rank_bm25 import BM25Okapi
import nltk
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize
import re
from typing import Any, List







[nltk_data] Downloading package punkt_tab to /home/jovyan/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [None]:
# gemini API key
api_key = ""
genai.configure(api_key=api_key)

# use a huggingface write token
hf_token = ""
login(token=hf_token)

# Generate training data
The first step of the assignment is generating synthetic question-answer pairs which can be used for finetuning an LLM.
Use Google AI Studio with the Gemini models to create high-quality QA training data.


In [3]:
# QUESTION: Use langchain and the Google AI Studio APIs and a model from the Gemini 2.0 family
# to create a text-generation chain that can produce and parse JSON output.
# Test it by having the LLM generate a JSON array of 3 fruits

#--- ADD YOUR SOLUTION HERE (20 points)---
parser = JsonOutputParser()
prompt = PromptTemplate(
    template=
    """Generate a JSON array of 3 fruits.
    {format_instructions}""",
    input_variables=[],
    partial_variables={"format_instructions": parser.get_format_instructions()}
)

model = GoogleGenerativeAI(model="gemini-2.0-flash", google_api_key=api_key, temperature=0.7)
chain = prompt | model | parser
response = chain.invoke({})
print(response)

{'fruits': [{'name': 'Apple', 'color': 'Red', 'taste': 'Sweet'}, {'name': 'Banana', 'color': 'Yellow', 'taste': 'Sweet'}, {'name': 'Strawberry', 'color': 'Red', 'taste': 'Sweet and slightly tart'}]}


## Generate topics
When generating data, it is often helpful to guide the generation process through some hierarchical structure.
Before we create question-answer pairs, let's generate some topics which the questions should be about.
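The hierarchy here is topic → questions → answers. As a rough sketch of how the first two levels could be expanded, the helper below maps each topic to its questions and caches the result on disk, so an interrupted run can resume. `expand_topics`, `questions_for_topic` and `cache_path` are hypothetical names introduced for illustration; the generator function is passed in, so any LLM call can be plugged in.

```python
import json
import os


def expand_topics(topics, questions_for_topic, cache_path):
    """Map each topic to its questions, reusing topics already saved in cache_path."""
    tree = {}
    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as f:
            tree = json.load(f)
    for topic in topics:
        if topic in tree:
            continue  # already generated in an earlier run
        tree[topic] = questions_for_topic(topic)
        # Save after every topic so an interrupted run loses at most one call.
        with open(cache_path, "w", encoding="utf-8") as f:
            json.dump(tree, f, indent=2, ensure_ascii=False)
    return tree
```

The same pattern can be repeated one level down, expanding each question into an answer.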



In [4]:
# QUESTION: Create a function 'generate_topics' which generates topics which prospective students might care about.
#
# Generate a list of 20 topics

#--- ADD YOUR SOLUTION HERE (20 points)---
def generate_topics(num_topics=20):
  parser = JsonOutputParser()

  prompt = PromptTemplate(
    template=
    """You are an expert educational advisor for the Singapore University of Technology and Design (SUTD).
    Generate {num_topics} topics that prospective students might care about when considering SUTD.
    Cover areas like academics, campus life, career opportunities, admissions, financial aid, housing, and unique aspects of SUTD.

    Output ONLY a JSON array of strings, like this:
    ["Topic 1", "Topic 2", "Topic 3", "..."]
    No explanations, no keys, just the JSON array.

    {format_instructions}""",
    input_variables=["num_topics"],
    partial_variables={"format_instructions": parser.get_format_instructions()}
  )

  model = GoogleGenerativeAI(model="gemini-2.0-flash", google_api_key=api_key, temperature=0.7)
  chain = prompt | model | parser
  response = chain.invoke({"num_topics": num_topics})
  return response

In [5]:
# test topic generation

topics = generate_topics(3)
for topic in topics:
  print(topic)

SUTD's Design-Centric Curriculum & Opportunities for Interdisciplinary Collaboration: How does SUTD's unique pedagogy prepare students for future careers and innovation, and what are the specific opportunities to work across disciplines like engineering, architecture, and computing?
Career Pathways & Industry Connections at SUTD: What types of job opportunities are available to SUTD graduates, which companies actively recruit from SUTD, and how does the university support students in securing internships and full-time positions?
SUTD's Campus Life & Residential Experience: What are the on-campus housing options, what student clubs and activities are available, and how does SUTD foster a vibrant and supportive community for its students?


In [6]:
# Generate a list of 20 topics
# We save a copy to disk and reload it from there if the file exists
topics = generate_topics()
for topic in topics:
  print(topic)

with open("sutd_topics.json", "w") as f:
  json.dump({"topics": topics}, f, indent=2)

SUTD's Design-Centric Curriculum and Pedagogy
The 5 Pillars of SUTD's Academic Programs
SUTD's Undergraduate Degree Programs: A Detailed Overview
Research Opportunities for Undergraduate Students at SUTD
SUTD's Industry Connections and Internship Programs
Career Prospects and Graduate Outcomes for SUTD Graduates
SUTD's Admission Requirements and Application Process
Scholarships and Financial Aid Options at SUTD
SUTD's Tuition Fees and Cost of Living in Singapore
Student Housing and Residential Life at SUTD
Student Clubs and Organizations at SUTD
SUTD's Campus Facilities and Resources
SUTD's Global Opportunities Program (e.g., exchange programs)
The SUTD-MIT International Design Centre (IDC)
SUTD's Focus on Sustainability and Innovation
SUTD's Capstone Project and Design Thinking Approach
SUTD's Culture of Collaboration and Interdisciplinary Learning
The SUTD Alumni Network and its Benefits
SUTD's Location and Accessibility
Comparing SUTD to Other Universities: What Makes SUTD Unique?


## Generate questions
Now generate a set of questions about each topic
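A JSON parser returns whatever structure the model emitted, and the fruit test earlier in this notebook came back as an object with a `fruits` key rather than a bare array. A small normalisation step can make the downstream code more robust. The helper below is a hypothetical sketch (the name `to_string_list` is not from any library) that unwraps such objects, drops non-string or empty entries, and removes case-insensitive duplicates.

```python
def to_string_list(parsed):
    """Coerce parsed JSON into a de-duplicated list of non-empty strings."""
    # If the model wrapped the array in an object, e.g. {"questions": [...]},
    # unwrap the first list-valued entry.
    if isinstance(parsed, dict):
        parsed = next((v for v in parsed.values() if isinstance(v, list)), [])
    if not isinstance(parsed, list):
        return []
    seen, result = set(), []
    for item in parsed:
        if not isinstance(item, str):
            continue  # skip numbers, nested objects, etc.
        text = item.strip()
        key = text.lower()
        if text and key not in seen:
            seen.add(key)
            result.append(text)
    return result
```

It could be applied to the chain output before the questions are stored, for example `questions = to_string_list(chain.invoke(...))`.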

In [7]:
# QUESTION: Create a function 'generate_questions' which generates questions about a given topic.
# Generate a list of 10 questions per topic. In total you should have 200 questions.
#

#--- ADD YOUR SOLUTION HERE (20 points)---
def generate_questions(topic, num_questions=10):
  parser = JsonOutputParser()

  prompt = PromptTemplate(
    template="""You are an expert educational advisor for the Singapore University of Technology and Design (SUTD).
    Generate {num_questions} specific and varied questions that prospective students might ask about the topic: "{topic}".
    The questions should:
    - Be specific and detailed
    - Cover different aspects of the topic
    - Reflect what prospective students would genuinely want to know
    - Include both factual and opinion based questions
    - Range from basic to more complex inquiries

    Output ONLY a JSON array of strings like this:
    ["Question 1", "Question 2", "Question 3", "..."]
    Do not include any extra text or keys.

    {format_instructions}""",
    input_variables=["topic", "num_questions"],
    partial_variables={"format_instructions": parser.get_format_instructions()}
  )

  model = GoogleGenerativeAI(model="gemini-2.0-flash", google_api_key=api_key, temperature=0.7)
  chain = prompt | model | parser
  response = chain.invoke({"topic": topic, "num_questions": num_questions})
  return response

In [8]:
# test it
# print(generate_questions("Academic Reputation and Program Quality", 3))
questions = generate_questions("Academic Reputation and Program Quality", 3)
for question in questions:
  print(question)

Beyond rankings, what specific industry recognition or accreditation does SUTD's programs, particularly in [specific program of interest, e.g., Architecture and Sustainable Design], hold that directly benefits graduates in terms of career opportunities and professional development? Are there quantifiable metrics, such as placement rates in top firms or success rates in professional licensing exams, available to demonstrate this benefit?
SUTD emphasizes a hands-on, project-based learning approach. How does the university ensure the rigor and academic depth of these projects, particularly in comparison to traditional lecture-based courses? What mechanisms are in place to assess the learning outcomes and intellectual contribution of students in these project-based settings, and how does this assessment inform continuous improvement of the curriculum?
How does SUTD foster interdisciplinary collaboration between different academic pillars, and what opportunities are available for students t

In [9]:
# QUESTION: Now let's put it together and generate 10 questions for each topic. Save the questions in a local file.

#--- ADD YOUR SOLUTION HERE (20 points)---
topics = None
all_questions = {}

if os.path.exists("sutd_topics.json"):
  with open("sutd_topics.json", "r") as f:
    topics_json = json.load(f)
    topics = topics_json["topics"]
else:
  topics = generate_topics()

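for topic in topics:
  questions = generate_questions(topic, 10)
  all_questions[topic] = questions

  # Rewrite the file after every topic so partial progress survives a failed call
  with open("sutd_questions.json", "w") as f:
    json.dump(all_questions, f, indent=2)

  print(f"Generated {len(questions)} questions for topic '{topic}'.")
  time.sleep(10)  # pause between calls to respect the API rate limits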

Generated 10 questions for topic 'SUTD's Design-Centric Curriculum and Pedagogy'.
Generated 10 questions for topic 'The 5 Pillars of SUTD's Academic Programs'.
Generated 10 questions for topic 'SUTD's Undergraduate Degree Programs: A Detailed Overview'.
Generated 10 questions for topic 'Research Opportunities for Undergraduate Students at SUTD'.
Generated 10 questions for topic 'SUTD's Industry Connections and Internship Programs'.
Generated 10 questions for topic 'Career Prospects and Graduate Outcomes for SUTD Graduates'.
Generated 10 questions for topic 'SUTD's Admission Requirements and Application Process'.
Generated 10 questions for topic 'Scholarships and Financial Aid Options at SUTD'.
Generated 10 questions for topic 'SUTD's Tuition Fees and Cost of Living in Singapore'.
Generated 10 questions for topic 'Student Housing and Residential Life at SUTD'.
Generated 10 questions for topic 'Student Clubs and Organizations at SUTD'.
Generated 10 questions for topic 'SUTD's Campus Faci

## Generate Answers

Now create answers for the questions.

You can use the Google AI Studio Gemini models (assuming they are good enough to generate good answers), your RAG system from assignment 3, or any other method you choose to generate answers for your question dataset.

Note: it is normal for some LLM calls to fail, even with retries, so you may end up with fewer than 200 QA pairs, but you should have at least 160.
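Since some calls may fail, it can be worth filtering out empty answers and checking the minimum size before splitting. The following is a minimal sketch under stated assumptions: each QA pair is a dict with an `answer` key, and `clean_and_split` with its defaults (80% training, seed 42, minimum of 160 pairs) is a hypothetical helper, not part of the `datasets` library.

```python
import random


def clean_and_split(qa_pairs, train_fraction=0.8, seed=42, min_pairs=160):
    """Drop pairs without a usable answer, then shuffle and split reproducibly."""
    valid = [
        p for p in qa_pairs
        if isinstance(p.get("answer"), str) and p["answer"].strip()
    ]
    if len(valid) < min_pairs:
        raise ValueError(f"Only {len(valid)} valid QA pairs, need at least {min_pairs}")
    shuffled = valid[:]
    # A seeded generator gives the same split on every "Run All".
    random.Random(seed).shuffle(shuffled)
    cut = int(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]
```

The two returned lists could then be wrapped with `Dataset.from_list` and combined into a `DatasetDict` with `train` and `test` splits.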

In [None]:
# QUESTION: Generate answers to all your questions using Gemini, your SUTD RAG system or any other method.
# Split your dataset into 80% training and 20% test data.
# Store all question-answer pairs in a Huggingface dataset `sutd_qa_dataset` and push it to your Huggingface hub.

#--- ADD YOUR SOLUTION HERE (40 points)---
def generate_answer(question):
  prompt = f"""You are an expert educational advisor for the Singapore University of Technology and Design (SUTD).
      Answer the following question as if you are helping a prospective student who is curious about SUTD.
      Be clear, accurate, friendly, and informative.
      Provide a helpful, clear, and concise answer for the following question: {question}"""

  model = GoogleGenerativeAI(model="gemini-2.0-flash", google_api_key=api_key, temperature=0.7)
  response = model.invoke(prompt)
  return response

if os.path.exists("sutd_questions.json"):
  with open("sutd_questions.json", "r") as f:
    topic_questions = json.load(f)

qa_pairs = []
qn_no = 0

for topic, questions in topic_questions.items():
  for question in questions:
    qn_no += 1
    response = generate_answer(question)
    qa_pairs.append({"topic": topic, "question": question, "answer": response})
    print(f"Generated answer for question number {qn_no}.")
    time.sleep(2)

with open("sutd_qa_pairs.json", "w") as file:
    json.dump(qa_pairs, file, indent=4)

random.shuffle(qa_pairs)
index = int(0.8 * len(qa_pairs))
train_data = qa_pairs[:index]
test_data = qa_pairs[index:]

dataset = DatasetDict({
  "train": Dataset.from_list(train_data),
  "test": Dataset.from_list(test_data)
})

dataset.save_to_disk("sutd_qa_dataset")
dataset.push_to_hub("{YOUR_HF_NAME}/sutd_qa_dataset")

Generated answer for question number 1.
Generated answer for question number 2.
Generated answer for question number 3.
Generated answer for question number 4.
Generated answer for question number 5.
Generated answer for question number 6.
Generated answer for question number 7.
Generated answer for question number 8.
Generated answer for question number 9.
Generated answer for question number 10.
Generated answer for question number 11.
Generated answer for question number 12.
Generated answer for question number 13.
Generated answer for question number 14.
Generated answer for question number 15.
Generated answer for question number 16.
Generated answer for question number 17.
Generated answer for question number 18.
Generated answer for question number 19.
Generated answer for question number 20.
Generated answer for question number 21.
Generated answer for question number 22.
Generated answer for question number 23.
Generated answer for question number 24.
Generated answer for ques

Saving the dataset (1/1 shards): 100%|██████████| 160/160 [00:00<00:00, 17624.98 examples/s]
Saving the dataset (1/1 shards): 100%|██████████| 40/40 [00:00<00:00, 6215.40 examples/s]
Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 187.81ba/s]
Uploading the dataset shards: 100%|██████████| 1/1 [00:03<00:00,  3.03s/it]
Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 564.51ba/s]
Uploading the dataset shards: 100%|██████████| 1/1 [00:02<00:00,  2.13s/it]


CommitInfo(commit_url='https://huggingface.co/datasets/ARM6423/sutd_qa_dataset/commit/e1a9190273398cb44db20386a4b6456b57611365', commit_message='Upload dataset', commit_description='', oid='e1a9190273398cb44db20386a4b6456b57611365', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/ARM6423/sutd_qa_dataset', endpoint='https://huggingface.co', repo_type='dataset', repo_id='ARM6423/sutd_qa_dataset'), pr_revision=None, pr_num=None)

In [11]:
# test the chain
question = "When was SUTD founded?"

# Now run the answer generation chain
response = generate_answer(question)
print(response)

Hi there! Great question! The Singapore University of Technology and Design (SUTD) was founded in **2009**. While that makes us a relatively young university, we've accomplished a lot in a short amount of time and are proud of our innovative approach to education and research. We're excited you're interested in learning more about SUTD! Do you have any other questions I can help you with?



In [12]:
# now run the chain for all questions to collect context and generate answers
# done in previous cell


# Finetune Llama 3.2 1B model

Now use the training split of your SUTD QA dataset to finetune a smaller Llama 3.2 1B LLM using parameter-efficient finetuning (PEFT).
We recommend the unsloth library, but you are free to choose other frameworks. You decide the hyperparameters for the finetuning.
Push your finetuned model to Huggingface.
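To get a feel for why PEFT is cheap, it helps to count the trainable parameters of a LoRA adapter: each adapted linear layer of shape `(d_in, d_out)` adds `r * (d_in + d_out)` weights. The arithmetic below is a sketch under assumptions that are not stated in this notebook: Llama 3.2 1B is taken to have 16 layers, a hidden size of 2048 and a 512-wide value projection, and the adapter is taken to target only `q_proj` and `v_proj`.

```python
def lora_param_count(rank, layers, module_shapes):
    """Trainable LoRA parameters: each (d_in, d_out) linear adds rank * (d_in + d_out)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in module_shapes)
    return layers * per_layer


# ASSUMED shapes for Llama 3.2 1B: q_proj is 2048 -> 2048, v_proj is 2048 -> 512.
# ASSUMED targets: q_proj and v_proj only.
params = lora_param_count(rank=8, layers=16, module_shapes=[(2048, 2048), (2048, 512)])
print(params)            # 851968
print(params * 4 / 1e6)  # about 3.41 (MB when stored as float32)
```

Under these assumptions a rank-8 adapter trains under a million weights, against roughly a billion in the base model, which is in line with the 3.42M `adapter_model.safetensors` upload shown in this notebook's output.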

Then we will compare the finetuned and non-finetuned LLMs with and without RAG to see if we were able to improve the SUTD chatbot answer quality.
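For a fair comparison, it may help to query the finetuned model with the same text layout it saw during training. The sketch below assumes a `### Question:` / `### Answer:` layout like the one used in the solution cell of this section; `PROMPT_TEMPLATE`, `build_training_text` and `build_inference_prompt` are hypothetical names, and appending an end-of-sequence token is an optional choice, not something the assignment prescribes.

```python
PROMPT_TEMPLATE = "### Question:\n{question}\n\n### Answer:"


def build_inference_prompt(question):
    """Prompt that ends right where the model should start writing the answer."""
    return PROMPT_TEMPLATE.format(question=question)


def build_training_text(question, answer, eos_token=""):
    """Full training example: the inference prompt followed by the answer."""
    # Passing the tokenizer's EOS token here can help the model learn to stop.
    return f"{build_inference_prompt(question)} {answer}{eos_token}"
```

Because every training text starts with exactly the string that `build_inference_prompt` returns, prompts at inference time match the prefix the model was tuned on.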


In [None]:
# QUESTION: Finetune a Llama 3.2 1B model on the training split of your SUTD QA dataset.
# You need to prepare your dataset accordingly and set the hyperparameters for the training.
# Push your finetuned model to the Huggingface model hub {YOUR_HF_NAME}/llama-3.2-1B-sutdqa

#--- ADD YOUR SOLUTION HERE (50 points)---
dataset = load_dataset("{YOUR_HF_NAME}/sutd_qa_dataset")

model_name = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(model_name)

lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    task_type="CAUSAL_LM"
)

model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)

def tokenize(example):
  combined_text = f"### Question:\n{example['question']}\n\n### Answer: {example['answer']}"
  return tokenizer(combined_text, truncation=True, padding="max_length", max_length=1024)

tokenized_dataset = dataset["train"].map(tokenize)

training_args = TrainingArguments(
    output_dir="./llama3-sutdqa",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    fp16=False,
    logging_steps=10,
    save_steps=100,
    save_total_limit=1,
    report_to="none"
)

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator
)

trainer.train()

model.push_to_hub("{YOUR_HF_NAME}/llama-3.2-1B-sutdqa")
tokenizer.push_to_hub("{YOUR_HF_NAME}/llama-3.2-1B-sutdqa")

# to free gpu
del model
del tokenizer
del trainer
del tokenized_dataset
del dataset
del data_collator
del training_args
del lora_config
torch.cuda.empty_cache()

Generating train split: 100%|██████████| 160/160 [00:00<00:00, 10392.39 examples/s]
Generating test split: 100%|██████████| 40/40 [00:00<00:00, 7295.71 examples/s]
Map: 100%|██████████| 160/160 [00:00<00:00, 396.06 examples/s]
No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Step,Training Loss
10,1.5037
20,1.4456
30,1.407
40,1.3556


adapter_model.safetensors: 100%|██████████| 3.42M/3.42M [00:01<00:00, 2.61MB/s]
tokenizer.json: 100%|██████████| 17.2M/17.2M [00:01<00:00, 11.2MB/s]


In [None]:
# QUESTION: Load a non-finetuned Llama 3.2 1B model and your finetuned SUTD QA Llama 3.2 1B model
# Ask them a simple test question (e.g. "What is special about SUTD?") to check that both models can generate answers

#--- ADD YOUR SOLUTION HERE (10 points)---
base_model_name = "meta-llama/Llama-3.2-1B"
base_tokenizer = AutoTokenizer.from_pretrained(base_model_name)
base_tokenizer.pad_token = base_tokenizer.eos_token
base_model = AutoModelForCausalLM.from_pretrained(base_model_name)
base_pipe = pipeline("text-generation", model=base_model, tokenizer=base_tokenizer, device_map="auto", max_new_tokens=512, return_full_text=False, temperature=0.5)
llm_base = HuggingFacePipeline(pipeline=base_pipe)

finetuned_model_name = "{YOUR_HF_NAME}/llama-3.2-1B-sutdqa"
finetuned_tokenizer = AutoTokenizer.from_pretrained(finetuned_model_name)
finetuned_tokenizer.pad_token = finetuned_tokenizer.eos_token
finetuned_model = AutoModelForCausalLM.from_pretrained(finetuned_model_name)
finetuned_pipe = pipeline("text-generation", model=finetuned_model, tokenizer=finetuned_tokenizer, device_map="auto", max_new_tokens=512, return_full_text=False, temperature=0.5)
llm_finetune = HuggingFacePipeline(pipeline=finetuned_pipe)

Device set to use cuda:0
Device set to use cuda:0


In [15]:
# try out the llms

query = "What is special about SUTD?"

print("Question:", query)
response_base = llm_base.invoke(query,  pipeline_kwargs={"max_new_tokens": 512})
print("Answer base:", response_base)

print("---------")
response_finetune = llm_finetune.invoke(query, pipeline_kwargs={"max_new_tokens": 512})
print("Answer finetune:", response_finetune)


Question: What is special about SUTD?
Answer base:  The answer is simple: SUTD is a world-class university with a unique mission. It is a university that is committed to building a new generation of leaders who are ready to tackle the challenges of the future. It is a university that is dedicated to creating a better world through its research and education. It is a university that is committed to making a difference in the lives of its students, its staff, and its community. SUTD is a university that is committed to building a better future for all. What is special about SUTD? The answer is simple: SUTD is a world-class university with a unique mission. It is a university that is committed to building a new generation of leaders who are ready to tackle the challenges of the future. It is a university that is dedicated to creating a better world through its research and education. It is a university that is committed to making a difference in the lives of its students, its staff, and i

# Integrate and evaluate

Now integrate both the non-finetuned Llama 3.2 1B model and your finetuned model into your SUTD chatbot RAG system.
Generate responses to the 20 questions you collected in assignment 3 using these 4 approaches:
1. non-finetuned Llama 3.2 1B model without RAG
2. finetuned Llama 3.2 1B SUTD QA model without RAG
3. non-finetuned Llama 3.2 1B model with RAG
4. finetuned Llama 3.2 1B SUTD QA model with RAG

Compare the responses and decide which system produces the most accurate and highest-quality responses.
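Side-by-side comparison is easier when the four answers to each question sit in one row. The sketch below assumes the responses are collected in a nested dict of the form question → `"RAG"` / `"NO RAG"` → `"finetuned"` / `"non-finetuned"`, which is the shape of the `answers_to_compare` dict built in the cells that follow. `SYSTEMS`, `comparison_rows` and `write_comparison_csv` are hypothetical helper names.

```python
import csv

# The four systems, in the order listed in the assignment text.
SYSTEMS = [
    ("NO RAG", "non-finetuned"),
    ("NO RAG", "finetuned"),
    ("RAG", "non-finetuned"),
    ("RAG", "finetuned"),
]


def comparison_rows(answers):
    """Flatten the nested answers dict into one row per question."""
    rows = []
    for question, by_mode in answers.items():
        row = {"question": question}
        for mode, model in SYSTEMS:
            # Missing answers become empty strings so every row has all columns.
            row[f"{model} / {mode}"] = by_mode.get(mode, {}).get(model, "")
        rows.append(row)
    return rows


def write_comparison_csv(answers, path):
    """Write the comparison table to a CSV file for manual review."""
    fieldnames = ["question"] + [f"{model} / {mode}" for mode, model in SYSTEMS]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(comparison_rows(answers))
```

The resulting file can be opened in a spreadsheet and annotated with a quality score per system.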

# From Assignment 3

In [16]:
answers_to_compare = {}
device = "cuda"
questions = ["What are the admissions deadlines for SUTD?",
             "Is there financial aid available?",
             "What is the minimum score for the Mother Tongue Language?",
             "Do I require reference letters?",
             "Can polytechnic diploma students apply?",
             "Do I need SAT score?",
             "How many PhD students does SUTD have?",
             "How much are the tuition fees for Singaporeans?",
             "How much are the tuition fees for international students?",
             "Is there a minimum CAP?",
             "What are the application requirements for international students?",
             "What is the difference between CSD and DAI?",
             "What kind of careers can I pursue with a degree in ESD from SUTD?",
             "What programs or majors does SUTD offer?",
             "What are the available student exchange programs at SUTD and which partner universities can I go to?",
             "Are interviews part of the admissions process?",
             "Are there scholarships for international students?",
             "What is campus life like at SUTD?",
             "Can an international student study with tuition grant at SUTD?",
             "Why is SUTD a Design AI university"
             ]
for q in questions:
  answers_to_compare[q] = {}
  answers_to_compare[q]["RAG"] = {}
  answers_to_compare[q]["NO RAG"] = {}

In [17]:
# QUESTION: Re-create the RAG chatbot system you have created in assignment 3 but with the Llama 3.2 1B (non-tuned and finetuned) models

#--- ADD YOUR SOLUTION HERE (40 points)---

if "downloaded_docs" not in os.listdir():
    os.system("unzip downloaded_docs.zip")
download_folder = "downloaded_docs"

# Strip any HTML markup and collapse runs of whitespace and newlines into single spaces
def clean_text(text):
    soup = BeautifulSoup(text, "html.parser")
    text = soup.get_text(separator=" ")
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

# Process PDF
def extract_text_from_pdf(pdf_path):
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        text += page.get_text()
    return text

documents = []
pdf_files = [
    os.path.join(download_folder, f)
    for f in os.listdir(download_folder)
    if f.lower().endswith(".pdf")
]
for file in pdf_files:
    text = extract_text_from_pdf(file)
    text = clean_text(text)
    document = Document(page_content=text)
    documents.append(document)
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=256,
    chunk_overlap=50
)
split_docs = text_splitter.split_documents(documents)
print(f"Total number of chunks created: {len(split_docs)}")

embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
vector_store = FAISS.from_documents(split_docs, embedding_model)
vector_store.save_local("faiss_index")

Archive:  downloaded_docs.zip
   creating: downloaded_docs/
  inflating: downloaded_docs/www.sutd.edu.sg_esd_about_highlights_achievements_%3Fpaged%3D2%23general-listing.html  
  inflating: downloaded_docs/www.sutd.edu.sg_course_50-035-computer-vision_.html  
  inflating: downloaded_docs/www.sutd.edu.sg_course_02-102-ht-the-world-since-1400_.html  
  inflating: downloaded_docs/www.sutd.edu.sg_education_undergraduate_minors.html  
  inflating: downloaded_docs/www.sutd.edu.sg_epd_education_undergraduate_.html  
  inflating: downloaded_docs/www.sutd.edu.sg_course_30-203-topics-in-biomedical-and-healthcare-engineering_.html  
  inflating: downloaded_docs/www.sutd.edu.sg_innovation_davincisutd.html  
  inflating: downloaded_docs/www.sutd.edu.sg_topics_home-delivery-43226.html  
  inflating: downloaded_docs/www.sutd.edu.sg_about_happenings_events_%3Fevent-level%3D872%26period%3Dupcoming%23tabs.html  
  inflating: downloaded_docs/www.sutd.edu.sg_course_02-170ht-history-of-surveillance-in-mode

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


In [18]:
for query in questions:
    print(f"RUNNING FOR QUERY: {query}")
    dense_results = vector_store.similarity_search(query, k=20)
    candidate_texts = [doc.page_content for doc in dense_results]
    tokenized_candidates = [word_tokenize(text.lower()) for text in candidate_texts]
    tokenized_query = word_tokenize(query.lower())
    bm25 = BM25Okapi(tokenized_candidates)
    scores = bm25.get_scores(tokenized_query)
    top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:5]

    # Keep the five highest-scoring BM25 candidates as context
    context_chunks = [candidate_texts[idx] for idx in top_indices]
    context = "\n".join(context_chunks)

    # Each instruction ends with a newline so the context does not run into the instructions
    prompt = (
      "You are an expert educational advisor for the Singapore University of Technology and Design (SUTD).\n"
      "Answer the following question as if you are helping a prospective student who is curious about SUTD.\n"
      "Be clear, accurate, friendly, and informative.\n"
      "Provide a helpful, clear, and concise answer for the following question.\n\n"
      f"Context:\n{context}\n\n"
      f"Question: {query}\n"
      "Answer:"
    )
    response_finetune = finetuned_pipe(prompt, max_new_tokens=512)
    response_base = base_pipe(prompt, max_new_tokens=512)
    answers_to_compare[query]["RAG"]["finetuned"] = response_finetune[0]["generated_text"]
    answers_to_compare[query]["RAG"]["non-finetuned"] = response_base[0]["generated_text"]

RUNNING FOR QUERY: What are the admissions deadlines for SUTD?
RUNNING FOR QUERY: Is there financial aid available?
RUNNING FOR QUERY: What is the minimum score for the Mother Tongue Language?
RUNNING FOR QUERY: Do I require reference letters?
RUNNING FOR QUERY: Can polytechnic diploma students apply?
RUNNING FOR QUERY: Do I need SAT score?
RUNNING FOR QUERY: How many PhD students does SUTD have?
RUNNING FOR QUERY: How much are the tuition fees for Singaporeans?
RUNNING FOR QUERY: How much are the tuition fees for international students?


You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


RUNNING FOR QUERY: Is there a minimum CAP?
RUNNING FOR QUERY: What are the application requirements for international students?
RUNNING FOR QUERY: What is the difference between CSD and DAI?
RUNNING FOR QUERY: What kind of careers can I pursue with a degree in ESD from SUTD?
RUNNING FOR QUERY: What programs or majors does SUTD offer?
RUNNING FOR QUERY: What are the available student exchange programs at SUTD and which partner universities can I go to?
RUNNING FOR QUERY: Are interviews part of the admissions process?
RUNNING FOR QUERY: Are there scholarships for international students?
RUNNING FOR QUERY: What is campus life like at SUTD?
RUNNING FOR QUERY: Can an international student study with tuition grant at SUTD?
RUNNING FOR QUERY: Why is SUTD a Design AI university
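
The retrieval step in the cell above (dense search for 20 candidates, then BM25 re-ranking down to 5) can be sketched as a small standalone function. This is a minimal, dependency-free illustration and not the notebook's implementation. The name `bm25_rerank` is hypothetical, tokenization uses `str.split` instead of `word_tokenize`, and the IDF term uses the `log(1 + ...)` variant, so scores will differ from those of `BM25Okapi` in `rank_bm25`.

```python
import math
from collections import Counter


def bm25_rerank(query, candidates, top_k=5, k1=1.5, b=0.75):
    """Re-rank candidate texts against the query with a minimal BM25 scorer."""
    docs = [text.lower().split() for text in candidates]
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Document frequency: number of candidates that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    query_terms = query.lower().split()

    def score(doc):
        tf = Counter(doc)
        total = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Non-negative IDF variant (an assumption, see the lead-in).
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturation with document length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            total += idf * tf[term] * (k1 + 1) / norm
        return total

    ranked = sorted(range(n_docs), key=lambda i: score(docs[i]), reverse=True)
    return [candidates[i] for i in ranked[:top_k]]


# Hypothetical candidate chunks standing in for the dense search results.
example_candidates = [
    "tuition fees for singaporeans",
    "campus life and clubs",
    "fees and financial aid",
]
context = "\n".join(bm25_rerank("tuition fees", example_candidates, top_k=2))
```

Because the BM25 statistics are computed only over the candidates returned by the dense search, the term weights reflect that small candidate set rather than the whole corpus.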


In [19]:
for query in questions:
    print(f"RUNNING FOR QUERY: {query}")
    response_finetune = finetuned_pipe(query, max_new_tokens=512)
    response_base = base_pipe(query, max_new_tokens=512)
    answers_to_compare[query]["NO RAG"]["finetuned"] = response_finetune[0]["generated_text"]
    answers_to_compare[query]["NO RAG"]["non-finetuned"] = response_base[0]["generated_text"]

RUNNING FOR QUERY: What are the admissions deadlines for SUTD?
RUNNING FOR QUERY: Is there financial aid available?
RUNNING FOR QUERY: What is the minimum score for the Mother Tongue Language?
RUNNING FOR QUERY: Do I require reference letters?
RUNNING FOR QUERY: Can polytechnic diploma students apply?
RUNNING FOR QUERY: Do I need SAT score?
RUNNING FOR QUERY: How many PhD students does SUTD have?
RUNNING FOR QUERY: How much are the tuition fees for Singaporeans?
RUNNING FOR QUERY: How much are the tuition fees for international students?
RUNNING FOR QUERY: Is there a minimum CAP?
RUNNING FOR QUERY: What are the application requirements for international students?
RUNNING FOR QUERY: What is the difference between CSD and DAI?
RUNNING FOR QUERY: What kind of careers can I pursue with a degree in ESD from SUTD?
RUNNING FOR QUERY: What programs or majors does SUTD offer?
RUNNING FOR QUERY: What are the available student exchange programs at SUTD and which partner universities can I go to?


In [20]:
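# Free GPU memory: drop every reference to the models, run the garbage
# collector, then release the cached CUDA blocks.
import gc

del base_tokenizer
del base_model
del base_pipe
del llm_base
del finetuned_tokenizer
del finetuned_model
del finetuned_pipe
del llm_finetune
del embedding_model
gc.collect()
torch.cuda.empty_cache()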

In [21]:
with open("answers_to_compare.json", "w") as json_file:
    json.dump(answers_to_compare, json_file, indent=4)
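
For reference, the layout that the cells above write into `answers_to_compare` can be sketched as a nested dictionary: question, then retrieval setting, then model variant. The snippet below is only an illustration of that shape. How the notebook actually initializes the dictionary is an assumption, and the two questions and the placeholder answer are stand-ins.

```python
import json

# Stand-ins for the notebook's `questions` list.
questions = ["Is there financial aid available?", "Do I need SAT score?"]
settings = ("RAG", "NO RAG")
variants = ("finetuned", "non-finetuned")

# One slot per (question, retrieval setting, model variant).
answers_to_compare = {
    q: {s: {v: None for v in variants} for s in settings} for q in questions
}
answers_to_compare[questions[0]]["RAG"]["finetuned"] = "example answer"

# The structure survives a JSON round trip unchanged (None becomes null and back).
serialized = json.dumps(answers_to_compare, indent=4)
restored = json.loads(serialized)
```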

# Bonus points: LLM-as-judge evaluation

Implement an LLM-as-judge pipeline to assess the quality of the different systems (finetuned vs. non-finetuned, RAG vs. no RAG).



In [22]:
# QUESTION: Implement an LLM-as-judge pipeline to assess the quality of the different systems (finetuned vs. non-finetuned, RAG vs. no RAG)

#--- ADD YOUR SOLUTION HERE (40 points)---
judge_parser = JsonOutputParser()

judge_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are an AI evaluator judging answers to FAQs on the Singapore University of Technology and Design (SUTD). "
    "Do not provide any explanations, comments, or reasons. "
    "Just answer in the required format."),
    ("human",
     "Question: {question}\n\n"
     "answer_1: {answer_1}\n\n" # finetuned_no_rag
     "answer_2: {answer_2}\n\n" # base_no_rag
     "answer_3: {answer_3}\n\n" # finetuned_rag
     "answer_4: {answer_4}\n\n" # base_rag
     "For each answer, rate on a scale of 1 (worst) to 5 (best):\n"
     "- Correctness\n"
     "- Completeness\n"
     "- Relevance\n"
     "- Clarity\n"
     "Then choose the best overall answer. "
     "Format your response as JSON:\n\n{format_instructions}")
])

judge_prompt = judge_prompt.partial(format_instructions=judge_parser.get_format_instructions())
judge_model = GoogleGenerativeAI(model="gemini-2.0-flash", google_api_key=api_key, temperature=0.7)
judge_chain = judge_prompt | judge_model | judge_parser

with open("answers_to_compare.json", "r") as f:
    eval_data = json.load(f)

judging_results = {}
for query in eval_data:
    answers = judge_chain.invoke({
        "question": query,
        "answer_1": eval_data[query]["NO RAG"]["finetuned"],
        "answer_2": eval_data[query]["NO RAG"]["non-finetuned"],
        "answer_3": eval_data[query]["RAG"]["finetuned"],
        "answer_4": eval_data[query]["RAG"]["non-finetuned"]
      })
    key_model_pair = {"answer_1": "finetuned_no_rag", "answer_2": "base_no_rag", "answer_3": "finetuned_rag", "answer_4": "base_rag", "best_answer": "best_answer"}
    try:
        # Expected layout: one entry per answer_N plus a "best_answer" key naming the winner.
        modified_answers = {}
        for i in key_model_pair:
            modified_answers[key_model_pair[i]] = answers[i]
        modified_answers["best_answer"] = key_model_pair[answers["best_answer"]]
        judging_results[query] = modified_answers
        print(f"Judgement for question '{query}': {modified_answers}")
    except (KeyError, TypeError):
        # Fallback for unexpected JSON layouts: rename answer_N to the system name wherever it appears.
        answers = json.dumps(answers)
        for key in key_model_pair:
            answers = answers.replace(key, key_model_pair[key])
        answers = json.loads(answers)
        judging_results[query] = answers
        print(f"Judgement for question '{query}': {answers}")
    time.sleep(10)

with open("sutd_chatbot_judging_results.json", "w") as f:
    json.dump(judging_results, f, indent=2)

print("Completed LLM-as-judge evaluation.")
# Scroll all the way to the right of each output line to see 'best_answer'.

Judgement for question 'What are the admissions deadlines for SUTD?': {'finetuned_no_rag': {'correctness': 3, 'completeness': 3, 'relevance': 4, 'clarity': 4}, 'base_no_rag': {'correctness': 1, 'completeness': 1, 'relevance': 1, 'clarity': 1}, 'finetuned_rag': {'correctness': 1, 'completeness': 1, 'relevance': 1, 'clarity': 1}, 'base_rag': {'correctness': 1, 'completeness': 1, 'relevance': 1, 'clarity': 1}, 'best_answer': 'finetuned_no_rag'}
Judgement for question 'Is there financial aid available?': {'finetuned_no_rag': {'correctness': 1, 'completeness': 2, 'relevance': 1, 'clarity': 3}, 'base_no_rag': {'correctness': 1, 'completeness': 1, 'relevance': 1, 'clarity': 2}, 'finetuned_rag': {'correctness': 1, 'completeness': 1, 'relevance': 1, 'clarity': 1}, 'base_rag': {'correctness': 3, 'completeness': 2, 'relevance': 4, 'clarity': 3}, 'best_answer': 'base_rag'}
Judgement for question 'What is the minimum score for the Mother Tongue Language?': {'finetuned_no_rag': {'correctness': 1, 'c
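
The per-question judgements can be condensed into one mean score per system plus a count of how often each system was picked as best. The sketch below is one possible way to do that and is not part of the original pipeline. The helper name `summarize` is hypothetical, and `sample` contains only the two complete judgements printed above, so the resulting numbers say nothing about the full question set.

```python
from collections import Counter, defaultdict
from statistics import mean

SYSTEMS = ("finetuned_no_rag", "base_no_rag", "finetuned_rag", "base_rag")


def summarize(judging_results):
    """Return (mean score per system, number of best-answer wins per system)."""
    scores = defaultdict(list)
    wins = Counter()
    for verdict in judging_results.values():
        for system in SYSTEMS:
            ratings = verdict.get(system)
            if not isinstance(ratings, dict):
                continue  # Skip judgements that came back in an unexpected layout.
            values = [v for v in ratings.values() if isinstance(v, (int, float))]
            if values:
                # Average the rated criteria into one score per question.
                scores[system].append(mean(values))
        best = verdict.get("best_answer")
        if best in SYSTEMS:
            wins[best] += 1
    averages = {system: mean(values) for system, values in scores.items()}
    return averages, dict(wins)


# The two complete judgements from the output above, copied verbatim.
sample = {
    "What are the admissions deadlines for SUTD?": {
        "finetuned_no_rag": {"correctness": 3, "completeness": 3, "relevance": 4, "clarity": 4},
        "base_no_rag": {"correctness": 1, "completeness": 1, "relevance": 1, "clarity": 1},
        "finetuned_rag": {"correctness": 1, "completeness": 1, "relevance": 1, "clarity": 1},
        "base_rag": {"correctness": 1, "completeness": 1, "relevance": 1, "clarity": 1},
        "best_answer": "finetuned_no_rag",
    },
    "Is there financial aid available?": {
        "finetuned_no_rag": {"correctness": 1, "completeness": 2, "relevance": 1, "clarity": 3},
        "base_no_rag": {"correctness": 1, "completeness": 1, "relevance": 1, "clarity": 2},
        "finetuned_rag": {"correctness": 1, "completeness": 1, "relevance": 1, "clarity": 1},
        "base_rag": {"correctness": 3, "completeness": 2, "relevance": 4, "clarity": 3},
        "best_answer": "base_rag",
    },
}
averages, wins = summarize(sample)
```

To summarize the full run, the same function could be applied to the dictionary loaded from `sutd_chatbot_judging_results.json`.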

# Bonus points: chatbot UI

Implement a web UI frontend for your chatbot that you can demo in class.


In [29]:
# runs the Finetuned model + RAG chatbot -> streamlit run app.py
# runs the Finetuned model only chatbot -> streamlit run app2.py

In [None]:
# QUESTION: Implement a web UI frontend for your chatbot that you can demo in class.

# #--- ADD YOUR SOLUTION HERE (40 points)---
# # Finetuned model + RAG chatbot -> for viewing code only

# #%%writefile app.py

# import streamlit as st
# import os
# from bs4 import BeautifulSoup
# from langchain.text_splitter import RecursiveCharacterTextSplitter
# from langchain.embeddings import HuggingFaceEmbeddings
# from langchain.vectorstores import FAISS
# from langchain.schema import Document
# import fitz
# from rank_bm25 import BM25Okapi
# import nltk
# nltk.download('punkt_tab')
# from nltk.tokenize import word_tokenize
# import re
# from huggingface_hub import login
# from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# st.title("SUTD Chatbot")

# if "messages" not in st.session_state:
#   st.session_state.messages = []

# if "vector_store" not in st.session_state:
#   if "downloaded_docs" not in os.listdir():
#     os.system("unzip downloaded_docs.zip")
#   download_folder = "downloaded_docs"

#   # Clean documents to remove whitespace + newline
#   def clean_text(text):
#     soup = BeautifulSoup(text, "html.parser")
#     text = soup.get_text(separator=" ")
#     text = re.sub(r'\s+', ' ', text)
#     return text.strip()

#   # Process PDF
#   def extract_text_from_pdf(pdf_path):
#     doc = fitz.open(pdf_path)
#     text = ""
#     for page in doc:
#         text += page.get_text()
#     return text

#   documents = []
#   pdf_files = [
#     os.path.join(download_folder, f)
#     for f in os.listdir(download_folder)
#     if f.lower().endswith(".pdf")
#   ]
#   for file in pdf_files:
#     text = extract_text_from_pdf(file)
#     text = clean_text(text)
#     document = Document(page_content=text)
#     documents.append(document)
#   text_splitter = RecursiveCharacterTextSplitter(
#     chunk_size=256,
#     chunk_overlap=50
#   )

#   split_docs = text_splitter.split_documents(documents)
#   embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
#   st.session_state.vector_store = FAISS.from_documents(split_docs, embedding_model)
#   st.session_state.vector_store.save_local("faiss_index")

# if "model" not in st.session_state:
#   hf_token = ""
#   login(token=hf_token)

#   finetuned_model_name = "{YOUR_HF_NAME}/llama-3.2-1B-sutdqa"
#   finetuned_tokenizer = AutoTokenizer.from_pretrained(finetuned_model_name)
#   finetuned_tokenizer.pad_token = finetuned_tokenizer.eos_token
#   finetuned_model = AutoModelForCausalLM.from_pretrained(finetuned_model_name)
#   st.session_state.model = pipeline("text-generation", model=finetuned_model, tokenizer=finetuned_tokenizer, device_map="auto", max_new_tokens=512, return_full_text=False)


# for message in st.session_state.messages:
#   with st.chat_message(message["role"]):
#       st.markdown(message["content"])

# if query := st.chat_input("Ask a question about SUTD"):
#   st.session_state.messages.append({"role": "user", "content": query})
#   with st.chat_message("user"):
#       st.markdown(query)

#   with st.chat_message("assistant"):
#     dense_results = st.session_state.vector_store.similarity_search(query, k=20)
#     candidate_texts = [doc.page_content for doc in dense_results]
#     tokenized_candidates = [word_tokenize(text.lower()) for text in candidate_texts]
#     tokenized_query = word_tokenize(query.lower())
#     bm25 = BM25Okapi(tokenized_candidates)
#     scores = bm25.get_scores(tokenized_query)
#     top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:5]

#     context_chunks = []
#     for rank, idx in enumerate(top_indices):
#         context_chunks.append(candidate_texts[idx])
#     context = "\n".join(context_chunks)

#     prompt = (
#     # "Use the following pieces of context to answer the question at the end. "
#     # "If you don't know the answer, just say that you don't know, don't try to make up an answer. If the context is somewhat relevant, make your best effort to give a coherent answer.\n\n"
#     # Adjacent string literals keep source indentation out of the prompt and
#     # separate the instructions from the "Context:" block with a blank line.
#     "You are an expert educational advisor for the Singapore University of Technology and Design (SUTD).\n"
#     "Answer the following question as if you are helping a prospective student who is curious about SUTD.\n"
#     "Be clear, accurate, friendly, and informative.\n"
#     "Provide a helpful, clear, and concise answer for the following question.\n\n"
#     f"Context:\n{context}\n\n"
#     f"Question: {query}\n"
#     "Answer:"
#     )
#     response = st.session_state.model(prompt, max_new_tokens=512)[0]["generated_text"]
#     st.write(response)

#   st.session_state.messages.append({"role": "assistant", "content": response})

In [None]:
# Finetuned model chatbot -> for viewing code only

# #%%writefile app2.py

# import streamlit as st
# from huggingface_hub import login
# from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# st.title("SUTD Chatbot")

# if "messages" not in st.session_state:
#   st.session_state.messages = []


# if "model" not in st.session_state:
#   hf_token = ""
#   login(token=hf_token)

#   finetuned_model_name = "{YOUR_HF_NAME}/llama-3.2-1B-sutdqa"
#   finetuned_tokenizer = AutoTokenizer.from_pretrained(finetuned_model_name)
#   finetuned_tokenizer.pad_token = finetuned_tokenizer.eos_token
#   finetuned_model = AutoModelForCausalLM.from_pretrained(finetuned_model_name)
#   st.session_state.model = pipeline("text-generation", model=finetuned_model, tokenizer=finetuned_tokenizer, device_map="auto", max_new_tokens=512, return_full_text=False)


# for message in st.session_state.messages:
#   with st.chat_message(message["role"]):
#       st.markdown(message["content"])

# if query := st.chat_input("Ask a question about SUTD"):
#   st.session_state.messages.append({"role": "user", "content": query})
#   with st.chat_message("user"):
#       st.markdown(query)

#   with st.chat_message("assistant"):
#     prompt = (
#     # "Use the following pieces of context to answer the question at the end. "
#     # "If you don't know the answer, just say that you don't know, don't try to make up an answer. If the context is somewhat relevant, make your best effort to give a coherent answer.\n\n"
#     # Adjacent string literals keep source indentation out of the prompt and
#     # separate the instructions from the question with a blank line.
#     "You are an expert educational advisor for the Singapore University of Technology and Design (SUTD).\n"
#     "Answer the following question as if you are helping a prospective student who is curious about SUTD.\n"
#     "Be clear, accurate, friendly, and informative.\n"
#     "Provide a helpful, clear, and concise answer for the following question.\n\n"
#     f"Question: {query}\n"
#     "Answer:"
#     )
#     response = st.session_state.model(prompt, max_new_tokens=512)[0]["generated_text"]
#     st.write(response)

#   st.session_state.messages.append({"role": "assistant", "content": response})

# End

This concludes assignment 4.

Please submit this notebook with your answers and the generated output cells as a **Jupyter notebook file** via GitHub.


Every group member should do the following submission steps:
1. Create a private GitHub repository **sutd_5055mlop** under your GitHub user.
2. Add your instructors as collaborators: ddahlmeier and lucainiaoge.
3. Save your submission as assignment_04_GROUP_NAME.ipynb, where GROUP_NAME is the name of the group you have registered.
4. Push the submission files to your repo.
5. Submit the link to the repo via eDimensions.



**Assignment due 21 April 2025 11:59pm**