# Assignment 3: Retrieval-Augmented Generation Question Answering
**Assignment due 2 April 11:59pm**

Welcome to the third assignment for 50.055 Machine Learning Operations. These assignments give you a chance to practice the methods and tools you have learned. 

**This assignment is a group assignment.**

- Read the instructions in this notebook carefully
- Add your solution code and answers in the appropriate places. The questions are marked as **QUESTION:**, the places where you need to add your code and text answers are marked as **ADD YOUR SOLUTION HERE**
- The completed notebook, including your added code and generated output will be your submission for the assignment.
- The notebook should execute without errors from start to finish when you select "Restart Kernel and Run All Cells...". Please test this before submission.
- Use the SUTD Education Cluster to solve and test the assignment.

**Rubric for assessment** 

Your submission will be graded using the following criteria. 
1. Code executes: your code should execute without errors. The SUTD Education cluster should be used to ensure the same execution environment.
2. Correctness: the code should produce the correct result or the text answer should state the factually correct answer.
3. Style: your code should be written in a way that is clean and efficient. Your text answers should be relevant, concise and easy to understand.
4. Partial marks will be awarded for partially correct solutions.
5. There is a maximum of 178 points for this assignment.

**ChatGPT policy** 

If you use AI tools, such as ChatGPT, to solve the assignment questions, you need to be transparent about its use and mark AI-generated content as such. In particular, you should include the following in addition to your final answer:
- A copy or screenshot of the prompt you used
- The name of the AI model
- The AI generated output
- An explanation why the answer is correct or what you had to change to arrive at the correct answer

**Assignment Notes:** Please make sure to save the notebook as you go along. Submission Instructions are located at the bottom of the notebook.



### Retrieval-Augmented Generation (RAG) 

In this assignment you will be building a Retrieval-Augmented Generation (RAG) question answering system which can answer questions about SUTD.

Retrieval-Augmented Generation (RAG) is a natural language processing (NLP) model that combines both retrieval and generation techniques. It involves retrieving relevant information from a large external knowledge source, such as a document database, and then utilizing that information to generate coherent and contextually appropriate responses. RAG models are designed to enhance the performance of language generation tasks by leveraging the power of pre-existing knowledge during the generation process.
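The retrieve-then-generate flow described above can be sketched in a few lines. This is a toy illustration only, assuming word overlap as the relevance score and a hand-written mini knowledge base; the function names `tokenize`, `retrieve` and `build_prompt` and all passages are hypothetical, and no LLM is called. A real system would swap in embeddings for retrieval and pass the prompt to a language model.

```python
import re

def tokenize(text):
    """Lowercase a string and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, passages, k=2):
    """Rank passages by word overlap with the question and return the top k."""
    q_tokens = tokenize(question)
    ranked = sorted(passages, key=lambda p: len(q_tokens & tokenize(p)), reverse=True)
    return ranked[:k]

def build_prompt(question, passages):
    """Place the retrieved passages ahead of the question for the generator."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only these sources:\n{context}\n\nQuestion: {question}\nAnswer:"

# Hypothetical mini knowledge base, for illustration only.
knowledge_base = [
    "The library opens at 9am on weekdays.",
    "Hostel applications are submitted through the student portal.",
    "The canteen serves lunch from 11am to 2pm.",
]

question = "How are hostel applications submitted?"
top_passages = retrieve(question, knowledge_base, k=1)
prompt = build_prompt(question, top_passages)
print(prompt)
```

The generation step would then consume `prompt`, so the model answers from the retrieved passage rather than from its internal knowledge alone.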

The SUTD website already allows chatting with current students or submissions of questions via a web form. 

- https://www.sutd.edu.sg/Admissions/Undergraduate/AskAdmissions/Prospective-student-parent
- https://www.sutd.edu.sg/Admissions/chat


### RECAP: Conduct user research

What are the questions that prospective and current students have about SUTD? Before you start building a question-answering system, let us first try to understand the users.

### QUESTION: 

Conduct user research by interviewing at least 3 first-year students at SUTD about what questions they had when they were considering SUTD and what questions they had and have now that they are at SUTD. 

Enter your interview notes (not full transcripts, just bullet point notes).


**--- ADD YOUR SOLUTION HERE (20 points) ---**

**Finances, Financial Aid & Scholarships**

1. How do I apply for scholarships at SUTD?
2. How do I apply for financial aid?
3. How much are the tuition fees?

**Hostel**

1. Must I stay in hostel for my Freshmore term?
2. How difficult is it to secure further hostel stays after the first two terms?
3. Is there financial aid for hostel?
4. How much does it cost to stay in hostel?
 
**Overseas Opportunities**

1. What are the overseas opportunities in SUTD?
2. What are the subsidies available for us to participate in the Summer & GLP programmes?
3. What are the key differences between Summer and other overseas programmes? Why should I even consider Summer?
4. What is GEXP?
5. What is FACT?
6. Who can apply for FACT?

**Pillars**
1. When do I choose my pillar in SUTD?
2. When is the start of term/academic year?
3. What is ASD?
4. Is ASD recognised or accredited?
5. What are job prospects for ASD?
6. What is ESD?
7. Is there a lot of math and programming in ESD?
8. What is EPD?
9. What is ISTD?
10. What is DAI?
11. Why is SUTD launching the new Design and AI (DAI) degree?

**Special Programmes**
1. What is STEP?
2. What is SHARP?
3. What is SUTD Duke-NUS Special Track?
4. What are the admission requirements for SUTD-Duke-NUS Special Track or SHARP or STEP?


**Others**
1. How do I travel to SUTD?
2. What can I expect during an admissions/scholarship interview?

------------------------------

### RECAP: Value Proposition Canvas


### QUESTION: 

Summarize what you have learned in a value proposition canvas. 

List the "jobs to be done" of your customer (i.e. the students) together with their "pains" and "gains" on the right side of the canvas. Then design the value proposition for an automatic 
question-answering system which could address these needs. Include features of this system in the section "products and services", "gain creators" and "pain relievers".
Add your points to the value proposition canvas template below by downloading the image, adding your points using Preview, Powerpoint or any image editing tool you like and then replacing the canvas image in this notebook.

You can find out more about the value proposition canvas under https://www.strategyzer.com/library/the-value-proposition-canvas

**--- ADD YOUR SOLUTION HERE (10 points) ---**

Refer to the file "VPC.png" for our value proposition canvas.
![VPC.png](./VPC.png)

------------------------------

### QUESTION: 

Now try to improve the question answering system to do better according to the value proposition you have formulated. You are free to choose how you want to improve the system: you can add more data sources, change the LLM models, change the data pre-processing, etc. 

Add additional code cells below as needed (do not change the code cells above).
Try up to 3 different improvement strategies. 
Then repeat the manual evaluation and compare your results.


**--- ADD YOUR SOLUTION HERE (50 points) ---**


------------------------------

# Install and Import Relevant Libraries

In [1]:
# Installing all required packages
# Note: Do not add to this list.
# ----------------
! pip install -U "langchain==0.1.6" "transformers==4.32.0" "datasets==2.13.0" "peft==0.4.0" "accelerate==0.21.0" "bitsandbytes==0.40.2" "trl==0.4.7" "safetensors>=0.3.1"
! pip install -U faiss-cpu==1.7.4
! pip install tiktoken==0.6.0
! pip install sentence-transformers==2.3.1
! pip install pypdf==4.0.1
! pip install protobuf==4.25.2
! pip install lxml==5.1.0
# ----------------




In [2]:
# Importing all required packages
# ----------------
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import CacheBackedEmbeddings, HuggingFaceEmbeddings
from langchain.document_loaders import PyPDFLoader
from langchain.vectorstores import FAISS
from langchain.storage import LocalFileStore
from langchain.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA
from langchain.callbacks import StdOutCallbackHandler
from langchain_community.document_loaders import BSHTMLLoader

import transformers
import torch
import timeit
import re
# ----------------

import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Setting seed for transformer library for reproducibility
from transformers import set_seed
set_seed(42)

# Improvement 1 - Added more Data Sources

We crawled the SUTD domain, scraped the resulting webpages as HTML files, and used all of them as additional data sources. The HTML files are stored in `data/`. The downloaded files are based on the links in `crawled_links/cleaned_urls.txt`, which were extracted with the `scrapy` package. Refer to `Readme.md` for more information.

## Load & Split documents
Load the PDF documents and HTML files. Then use LangChain to split the documents into smaller text chunks.

In [3]:
import os
from glob import glob

pdf_filenames = [
    'SUTD_AnnualReport_2020.pdf',
    'SUTD_AnnualReport_2021.pdf',
    'SUTD_AnnualReport_2022_23.pdf',
]

# Use all webscraped HTML files after crawling SUTD Domain.
# Keep only the file name, since data_root is prepended again when loading.
html_filenames = [os.path.basename(path) for path in glob('./data/*.html')]

pdf_metadata = [
    dict(year=2020, source=pdf_filenames[0]),
    dict(year=2021, source=pdf_filenames[1]),
    dict(year=2023, source=pdf_filenames[2])
]

html_metadata = [dict(source=html_filenames[ind]) for ind in range(len(html_filenames))]

In [4]:
# load pdf files, attach meta data
data_root = "./data/"

documents = []
for idx, file in enumerate(pdf_filenames):
    print("Load file", file)
    loader = PyPDFLoader(data_root + file)
    document = loader.load()
    for document_fragment in document:
        document_fragment.metadata = pdf_metadata[idx]
    documents += document

# load html files, attach meta data
for idx, file in enumerate(html_filenames):
    print("Load file", file)
    loader = BSHTMLLoader(data_root + file)
    document = loader.load()
    for document_fragment in document:
        # remove duplicate whitespace
        document_fragment.page_content = repr(re.sub(r"(?<=\n)(\s+)",r" ", document_fragment.page_content))
        document_fragment.metadata = html_metadata[idx]
    documents += document


# recursively split the documents into chunks of 100 tokens with an overlap of 10 tokens between chunks
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=100, chunk_overlap=10
)
docs = text_splitter.split_documents(documents)

print(docs[0], docs[1])
#------------------------------
print(f'# of Document Pages {len(documents)}')
print(f'# of Document Chunks: {len(docs)}')

Load file SUTD_AnnualReport_2020.pdf
Load file SUTD_AnnualReport_2021.pdf
Load file SUTD_AnnualReport_2022_23.pdf
Load file https_wearesutd_sutd_edu_sg_uncategorized_one_course_down_.html
Load file https_wearesutd_sutd_edu_sg_category_exchange_other_summer_programmes_page_4_.html
Load file https_wearesutd_sutd_edu_sg_sutd_ents_class_uc_berkeley_extension_attachment_img_20180131_113002_.html
Load file https_wearesutd_sutd_edu_sg_tag_inaugural_exhibition_.html
Load file https_wearesutd_sutd_edu_sg_author_alp2016_themee_.html
Load file https_wearesutd_sutd_edu_sg_category_sutd_ians_alumni_.html
Load file https_wearesutd_sutd_edu_sg_uncategorized_short_getaway_to_zhangjiajie_and_finishing_up_the_app_.html
Load file https_www_sutd_edu_sg_Campus_Life_Sports_and_Recreation_Centre.html
Load file https_wearesutd_sutd_edu_sg_exchange_day_10_actuator_testing_.html
Load file https_wearesutd_sutd_edu_sg_exchange_global_exchange_programme_whats_deepavali_without_family_hong_kong_week_10_.html
Load f
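To make the chunk size and overlap settings concrete, here is a minimal sketch of fixed-window splitting with overlap. It is a simplification under stated assumptions: whitespace-separated words stand in for tiktoken tokens, and the hypothetical helper `split_with_overlap` cuts at fixed positions, whereas `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line breaks, so its real chunks will differ.

```python
def split_with_overlap(tokens, chunk_size=100, chunk_overlap=10):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks

# Whitespace-separated words stand in for tiktoken tokens here.
words = [f"w{i}" for i in range(250)]
chunks = split_with_overlap(words, chunk_size=100, chunk_overlap=10)
print([len(c) for c in chunks])
```

Each chunk starts with the last 10 tokens of the previous one, which reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.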

## Create FAISS Vector Store

We will add the document chunks into the FAISS Vector Store. We will utilise the same embedding model `sentence-transformers/all-MiniLM-L6-v2`. Due to the sheer size of the Document Chunks (180,000+), we will load the cached embeddings and indices that we have downloaded after running `setup.sh`. Refer to `Readme.md` for more information.

In [5]:
import os

# Create embeddings of document chunks and store them in vector store for fast lookup
store = LocalFileStore("./cache/")

embed_model_id = 'sentence-transformers/all-MiniLM-L6-v2'

core_embeddings_model = HuggingFaceEmbeddings(
    model_name=embed_model_id
)

embedder = CacheBackedEmbeddings.from_bytes_store(
    core_embeddings_model, store, namespace=embed_model_id
)

index_dir = 'faiss_index'

if os.path.exists(index_dir):
    # Load the vector store from a local directory
    vector_store = FAISS.load_local(index_dir, embedder)
else:
    # Create a vector store with the document chunk embeddings using the Facebook FAISS library
    vector_store = FAISS.from_documents(docs, embedder)
    # Save the vector store to a local directory
    vector_store.save_local(index_dir)


print(vector_store.index.ntotal)
#------------------------------

183827
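The idea behind a cache-backed embedder is that a text which has already been embedded is looked up rather than recomputed. The sketch below illustrates that idea only; `CachedEmbedder` and `toy_embedding` are hypothetical names, the in-memory dictionary stands in for the on-disk store, and LangChain's actual key scheme and storage format may differ.

```python
import hashlib

class CachedEmbedder:
    """Wrap an embedding function with a dictionary cache keyed by a hash of the text."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.cache = {}
        self.misses = 0

    def embed(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self.cache:
            self.misses += 1  # only uncached texts reach the underlying model
            self.cache[key] = self.embed_fn(text)
        return self.cache[key]

def toy_embedding(text):
    """Hypothetical stand-in for a real model: vowel counts as a 5-dimensional vector."""
    return [text.lower().count(v) for v in "aeiou"]

embedder_sketch = CachedEmbedder(toy_embedding)
first = embedder_sketch.embed("design")
second = embedder_sketch.embed("design")      # served from the cache
third = embedder_sketch.embed("technology")
print(first, embedder_sketch.misses)
```

With more than 180,000 chunks, skipping the model call for previously seen text is what makes re-running the notebook practical.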


In [6]:
# Execute a query against the vector store

query = "What is the vision and mission of SUTD?"
embedding_vector = core_embeddings_model.embed_query(query)

# run the query against the vector store, print the top 5 search results
results = vector_store.similarity_search_by_vector(embedding_vector, k=5)

for result in results:
    print(result, end="\n\n")
#------------------------------

page_content='Annual Report 2020/20212 Vision, Mission and About SUTD' metadata={'year': 2020, 'source': 'SUTD_AnnualReport_2020.pdf'}

page_content='Annual Report  2020/20213 Vision, Mission and About SUTD\nEmbracing this tenet as a call to action, SUTD is a leading research-intensive \nglobal university focused on technology and all elements of technology-based \ndesign.\nIt will educate technically-grounded leaders who are steeped in the \nfundamentals of science, mathematics and technology; are creative and' metadata={'year': 2020, 'source': 'SUTD_AnnualReport_2020.pdf'}

page_content='leaders and innovators to serve societal needs. To achieve this mission in this complex and volatile world, SUTD’s education is not just about the accumulation of knowledge. We also develop your ability to learn new things multiple times in your career, to keep pace with fast-evolving industry trends.\\n SUTD is committed to supporting your journey to become lifelong learners, leaders and innovators.

In [7]:
query = "When was SUTD founded?"
embedding_vector = embedder.embed_query(query)

# QUESTION: run the query against the vector store with top 3 retrieved results. Measure the average latency over 100 runs.
# Print average retrieval latency in milliseconds

#--- ADD YOUR SOLUTION HERE (10 points)---
def run_query(vector=embedding_vector):
    results = vector_store.similarity_search_by_vector(vector, k=3)

# Measure latency over 100 runs
n_runs = 100
total_time = timeit.timeit(run_query, number=n_runs)

average_latency_ms = (total_time / n_runs) * 1000
print(f"Average retrieval latency: {average_latency_ms:.2f} milliseconds")
#------------------------------
# Hint: use the timeit library

Average retrieval latency: 31.88 milliseconds
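A single average can hide variation between runs. As a possible refinement, the sketch below uses `timeit.repeat` to time several batches and report the spread. The helper `time_call_ms` and the workload `dummy_search` are hypothetical; in the notebook the timed function would be the vector store lookup instead.

```python
import statistics
import timeit

def time_call_ms(func, number=100, repeat=5):
    """Return per-call latencies in milliseconds, one value per repeated batch."""
    batch_totals = timeit.repeat(func, number=number, repeat=repeat)
    return [total / number * 1000 for total in batch_totals]

def dummy_search():
    """Hypothetical workload standing in for a vector store lookup."""
    return sorted(range(1000), key=lambda x: -x)[:3]

latencies = time_call_ms(dummy_search, number=100, repeat=5)
print(f"mean {statistics.mean(latencies):.3f} ms, "
      f"min {min(latencies):.3f} ms, max {max(latencies):.3f} ms")
```

The minimum is often the most stable figure, since slower batches usually reflect interference from other processes rather than the code being measured.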


# Improvement 2 - Replaced Llama 2 with Vicuna LLM

## Load Vicuna LLM

In [8]:
# Load Model from Huggingface

model_id ="lmsys/vicuna-7b-v1.5"
# model_id = "NousResearch/Llama-2-13b-chat-hf"


bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

model_config = transformers.AutoConfig.from_pretrained(
    model_id
)

import warnings
warnings.filterwarnings("ignore")

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    config=model_config,
    quantization_config=bnb_config
)

print(model)
#------------------------------

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096, padding_idx=0)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear4bit(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )


In [9]:
# Setup LLM Pipeline
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline


tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
pipe = transformers.pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=256, temperature=0)
llm = HuggingFacePipeline(pipeline=pipe)

# Improvement 3 - Utilised Hypothetical Document Embeddings (HyDe) as part of our Retrieval Augmented Generation (RAG) Pipeline

In [10]:
from pprint import pprint
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from langchain_core.prompts import ChatPromptTemplate, HumanMessagePromptTemplate, PromptTemplate
from langchain_core.output_parsers import StrOutputParser

handler = StdOutCallbackHandler()

hyde_template = """
"You are a university ambassador for the Singapore University of Technology & Design.
Your goal is to encourage students to enrol into the university.
Write a detailed response to answer the query below.
Do not include points if you are unsure of the answer.
Question: {query}"
"""

config = {
    "callbacks": [handler]
}

hyde_prompt = ChatPromptTemplate.from_template(hyde_template)

# Version 1
hyde_chain = (
    hyde_prompt 
    | llm 
    | StrOutputParser()
)

hyde_chain = RunnableParallel(
    {"query": RunnablePassthrough()}
).assign(answer=hyde_chain)
#------------------------------

In [11]:
# Example questions
pprint(hyde_chain.invoke("What courses are available in SUTD?",
                                    config=config
                                   ))



> Entering new RunnableSequence chain...
> Entering new RunnableParallel chain...
> Entering new RunnablePassthrough chain...
> Finished chain.
> Finished chain.
> Entering new RunnableAssign chain...
> Entering new RunnableParallel chain...
> Entering new RunnableSequence chain...
> Entering new ChatPromptTemplate chain...
> Finished chain.
> Entering new StrOutputParser chain...
> Finished chain.
> Finished chain.
> Finished chain.
> Finished chain.
> Finished chain.
{'answer': '\n'
           'As an ambassador for SUTD, I would be happy to answer your query '
           'about the courses available in our university. SUTD offers a wide '
           'range of undergraduate and postgraduate programmes in various '
           'fields of study.\n'
           '\n'
           'For undergraduate programmes, SUTD offers a 4-year Bachelor of 
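HyDE changes what is sent to the retriever: the LLM first drafts a hypothetical answer, and that draft, rather than the raw question, is used as the search query. The sketch below shows only this data flow. `hyde_retrieve`, `fake_generate`, `fake_search` and the three-sentence corpus are hypothetical stand-ins so that the example runs without a model or vector store.

```python
def hyde_retrieve(question, generate, search, k=3):
    """Retrieve with a generated hypothetical answer instead of the raw question."""
    hypothetical_answer = generate(question)   # step 1: draft an answer with the LLM
    return search(hypothetical_answer, k)      # step 2: use the draft as the search query

def fake_generate(question):
    """Hypothetical LLM stub that always returns the same draft answer."""
    return "Students may apply for bursaries and study loans to cover fees."

def fake_search(query, k):
    """Hypothetical retriever stub that ranks a tiny corpus by word overlap."""
    corpus = [
        "Bursaries and study loans help students cover fees.",
        "The campus is next to an MRT station.",
        "Sports facilities include a swimming pool.",
    ]
    query_words = set(query.lower().replace(".", "").split())

    def overlap(doc):
        return len(query_words & set(doc.lower().replace(".", "").split()))

    return sorted(corpus, key=overlap, reverse=True)[:k]

question = "Is there financial aid?"
raw_hits = fake_search(question, 1)                                   # plain retrieval
hyde_hits = hyde_retrieve(question, fake_generate, fake_search, k=1)  # HyDE retrieval
print(raw_hits, hyde_hits)
```

In this toy setup the short question shares almost no vocabulary with the relevant passage, while the drafted answer does. The trade-off is that a wrong draft can steer retrieval towards irrelevant documents.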

# Improvement 4 - Utilised Maximal Marginal Relevance (MMR) in our Retriever to increase diversity of sources fetched from the FAISS Vector Store

We will first fetch 25 sources from the FAISS Vector Store. Among the fetched 25, we will select 5 highly diverse sources that will be later used in the LLM generation.
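The selection step can be illustrated with a textbook formulation of MMR: each round picks the candidate that maximises `lambda_mult * relevance - (1 - lambda_mult) * redundancy`, where redundancy is the highest similarity to anything already selected. This is a sketch under stated assumptions, not LangChain's own code; `cosine`, `mmr_select` and the 2-D vectors are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr_select(query_vec, candidate_vecs, k=5, lambda_mult=0.25):
    """Greedily pick k candidate indices, trading query relevance against redundancy."""
    selected = []
    remaining = list(range(len(candidate_vecs)))
    while remaining and len(selected) < k:
        def mmr_score(i):
            relevance = cosine(query_vec, candidate_vecs[i])
            redundancy = max(
                (cosine(candidate_vecs[i], candidate_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical 2-D embeddings: candidates 0 and 1 are near-duplicates.
demo_query = [1.0, 0.0]
demo_candidates = [[1.0, 0.0], [0.99, 0.01], [0.6, 0.8]]
picked = mmr_select(demo_query, demo_candidates, k=2, lambda_mult=0.25)
print(picked)
```

With `lambda_mult=0.25` the near-duplicate is skipped in favour of a less similar candidate. Setting `lambda_mult=1.0` reduces the procedure to plain similarity ranking.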

In [12]:
# Integrate HyDe Chain into RAG Pipeline

# Maximal Marginal Relevance (MMR): fetch 25 candidates, keep the 5 most diverse
retriever = vector_store.as_retriever(
    search_type="mmr",
    search_kwargs={'k': 5, 'fetch_k': 25, 'lambda_mult': 0.25},
)
handler = StdOutCallbackHandler()

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

config = {
    "callbacks": [handler]
}

# rag_template = """
# "You are a university ambassador for the Singapore University of Technology & Design.
# Your goal is to encourage students to enrol into the university.
# Write a detailed response to answer the query below.
# Context: {hyde} \nSources: {sources} \nAnswer:"
# """

rag_template="""
"You are an assistant for question-answering tasks. Use the following pieces of retrieved sources to improve the answer. 
If you don't know the answer, just say that you don't know. Use five sentences maximum and keep the answer concise.
Answer: {hyde} \nSources: {sources} \nAnswer:"
"""

prompt = ChatPromptTemplate(input_variables=['sources', 'hyde'], messages=[
    HumanMessagePromptTemplate(
        prompt=PromptTemplate(input_variables=['sources', 'hyde'], template=rag_template)
    )
])

# Version 1
qa_with_sources_chain = (
    RunnablePassthrough.assign(context=(lambda x: format_docs(x["sources"])))
    | prompt
    | llm
    | StrOutputParser()
)

qa_with_sources_chain = RunnableParallel(
    {"sources": retriever, "hyde": RunnablePassthrough()}
).assign(answer=qa_with_sources_chain)

## Integrated HyDe Chain into the RAG Pipeline

In [13]:
def run_hyde_rag(question, hyde_chain, qa_chain, config):
    hyde_response = hyde_chain.invoke(question)['answer']
    result = qa_chain.invoke(hyde_response, config=config)
    return result

In [14]:
question="What courses are available in SUTD?"
pprint(run_hyde_rag(question, hyde_chain, qa_with_sources_chain, config))



> Entering new RunnableSequence chain...
> Entering new RunnableParallel chain...
> Entering new RunnablePassthrough chain...
> Finished chain.
> Finished chain.
> Entering new RunnableAssign chain...
> Entering new RunnableParallel chain...
> Entering new RunnableSequence chain...
> Entering new RunnableAssign chain...
> Entering new RunnableParallel chain...
> Entering new RunnableLambda chain...
> Finished chain.
> Finished chain.
> Finished chain.
> Entering new ChatPromptTemplate chain...
> Finished chain.
> Entering new StrOutputParser chain...
> Finished chain.
> Finished chain.
> Finished chain.
> Finished chain.
> Finished chain.
{'answer': 'As an ambassador for SUTD, I would be happy to answer your query '
           'about the courses available in our university. SUTD o

### Run Test Questions

In [15]:
# QUESTION: Below is a set of test questions. Add another 10 test questions based on your user interviews and your value proposition canvas.
# Run the complete set of test questions against the RAG question answering system.

questions = ["What are the admissions deadlines for SUTD?",
             "Is there financial aid available?",
             "What is the minimum score for the Mother Tongue Language?",
             "Do I require reference letters?",
             "Can polytechnic diploma students apply?",
             "Do I need SAT score?",
             "How many PhD students does SUTD have?",
             "How much are the tuition fees for Singaporeans?",
             "How much are the tuition fees for international students?",
             "Is there a minimum CAP?"
             ]

student_questions = [
    "How do I apply for scholarships at SUTD?",
    "How can I apply for a scholarship if I am not offered one at admission?",
    "Do SUTD students all graduate with a single degree?",
    "How are students assessed in SUTD?",
    "Must I stay in hostel for my Freshmore term, and must I really pay for it?",
    "How much does it cost to stay on campus?",
    "Is there financial aid for hostel?",
    "What are the key differences between Summer and other overseas programmes? Why should I even consider Summer?",
    "What are the Summer Programmes available? Is GLP same as Summer?",
    "What are the subsidies available for us to participate in the Summer & GLP programmes?"
]
#---------------------------------


questions += student_questions

responses = []
for question in questions:
    res = run_hyde_rag(question, hyde_chain, qa_with_sources_chain, config)
    res["query"] = question
    pprint(res)
    responses.append(res)



> Entering new RunnableSequence chain...
> Entering new RunnableParallel chain...
> Entering new RunnablePassthrough chain...
> Finished chain.
> Finished chain.
> Entering new RunnableAssign chain...
> Entering new RunnableParallel chain...
> Entering new RunnableSequence chain...
> Entering new RunnableAssign chain...
> Entering new RunnableParallel chain...
> Entering new RunnableLambda chain...
> Finished chain.
> Finished chain.
> Finished chain.
> Entering new ChatPromptTemplate chain...
> Finished chain.
> Entering new StrOutputParser chain...
> Finished chain.
> Finished chain.
> Finished chain.
> Finished chain.
> Finished chain.
{'answer': 'As an ambassador for SUTD, I would be happy to provide you with '
           'the admissions deadlines for the university. The admis

In [16]:
import json
import os
import time

# timestr = time.strftime("%Y%m%d-%H%M%S")
path = 'responses-Part2.json'
file_exists = os.path.isfile(path)
print(file_exists)
if not file_exists:
    with open(path, 'w') as fout:
        json.dump(responses, fout, default=vars)

False


### QUESTION: 


Manually inspect each answer, fact-check whether it is correct (use Google or any other method), and check the retrieved documents.

- How accurate is the answer (1-5, 5 best)?
- How relevant is the retrieved context (1-5, 5 best)?
- How grounded is the answer in the retrieved context (instead of relying on the LLM's internal knowledge) (1-5, 5 best)?

**--- ADD YOUR SOLUTION HERE (20 points) ---**
For a comprehensive overview of the data, including the external research used to evaluate the LLM answers, refer to 'Part 2 Eval.pdf'. That document contains the detailed analyses and insights that complement the findings. <br>

In this notebook, we use 'Part_2_Eval.xlsx' as our primary data source to showcase the key information and summary scores derived from the evaluation process.
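If the three criterion scores are stored as one text cell per answer, a small parser can turn them into numbers for averaging. The sketch below assumes cells that follow a `Criterion: x/5` pattern; `parse_scores`, `mean_score` and the example cell are hypothetical and not taken from the evaluation files.

```python
import re
import statistics

SCORE_PATTERN = re.compile(r"([A-Za-z ]+?):\s*(\d+(?:\.\d+)?)\s*/\s*5")

def parse_scores(cell):
    """Extract {criterion: score} pairs from a 'Name: x/5 Name: y/5' string."""
    return {name.strip(): float(value) for name, value in SCORE_PATTERN.findall(cell)}

def mean_score(cell):
    """Average the criterion scores found in one cell."""
    return statistics.mean(list(parse_scores(cell).values()))

# Hypothetical cell, following the 'Criterion: x/5' pattern.
example = "Accuracy: 4/5 Relevance: 5/5 Grounded in retrieved context: 3/5"
scores = parse_scores(example)
print(scores, mean_score(example))
```

Applying such a function to a whole column, for instance with `Series.apply`, would give one comparable number per question for each version of the system.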

------------------------------



In [2]:
import pandas as pd

pd.set_option('display.max_colwidth', None)  

file_path = './Part_2_Eval.xlsx'  
file_path1 = './Part_1_Eval.xlsx'  

df = pd.read_excel(file_path, usecols=[0, 1, 6]) 
df1 = pd.read_excel(file_path1) 
df.columns = ['Question', 'LLM Answer', 'Average Score']
df['Part 1 Avg Score']=df1['Average Score']

df_styled = df.style.set_properties(**{'text-align': 'left'})
df_styled.set_table_styles([dict(selector='th', props=[('text-align', 'left')])])

df_styled

Unnamed: 0,Question,LLM Answer,Average Score,Part 1 Avg Score
0,What are the admissions deadlines for SUTD?,"""As an ambassador for SUTD, I would be happy to provide you with the admissions deadlines for the university. The admissions deadlines for SUTD are as follows:\n\n* Round 1: 31 October\n* Round 2: 15 January\n* Round 3: 15 March\n* Round 4: 15 May\n\nIt is important to note that these deadlines are subject to change and it is always best to check the SUTD website for the most up-to-date information. Additionally, SUTD offers a rolling admissions process, which means that applications will be reviewed on a continuous basis after the final deadline of 15 May. This means that students who apply after the final deadline may still be considered for admission, depending on the availability of spaces in the programme.\n\nI hope this information is helpful to you. If you have any further questions about the admissions process at SUTD, please do not hesitate to ask."",",Accuracy: 1.5/5 Relevance: 2.5/5 Gounded in retrieved context: 2/5,Accuracy: 3/5 Relevance: 4/5 Gounded in retrieved context: 3/5
1,"""Is there financial aid available?""","""Yes, financial aid is available for students at the Singapore University of Technology & Design (SUTD). SUTD offers a range of financial aid options, including scholarships, loans, and grants, to support students who may need assistance with financing their education. The SUTD Scholarship is awarded to outstanding students based on their academic achievements and potential to contribute to the university's community. This scholarship covers tuition fees and provides a stipend to help with living expenses. The Student Loan Scheme (SLS) is offered by the Singapore government to help students finance their education. The SLS provides loans to eligible students to cover their tuition fees, as well as other educational expenses such as books and transportation costs. In addition to these scholarships and loans, SUTD also offers bursaries and grants to students who may be facing financial hardship. These bursaries and grants are awarded on a needs-basis and are designed to help students who are struggling to cover the costs of their education.""",Accuracy: 4/5 Relevance: 5/5 Gounded in retrieved context: 4/5,Accuracy: 4/5 Relevance: 5/5 Gounded in retrieved context: 5/5
2,"""What is the minimum score for the Mother Tongue Language?""","""As an ambassador for the Singapore University of Technology & Design (SUTD), I would like to inform prospective students that the minimum score required for the Mother Tongue Language (MTL) may vary depending on the specific programme and level of study.\n\nFor undergraduate programmes, SUTD generally requires a minimum of B or higher for MTL. However, this may differ for specific programmes or if the student has taken additional MTL subjects in their secondary education.\n\nFor postgraduate programmes, the minimum score required for MTL may be higher, as it is one of the factors considered during the admission process.\n\nIt is important to note that while MTL is an important factor in the admission process, it is not the only criterion used to evaluate applicants. Other factors such as academic performance, relevant work experience, and personal statements may also be taken into consideration.\n\nIf you are interested in applying to SUTD, I would encourage you to review the admission requirements for your specific programme of interest and to reach out to the university's admissions office for further clarification on the minimum MTL score required.""",Accuracy: 2/5 Relevance: 2/5 Gounded in retrieved context: 2.5/5,"""Accuracy: 5/5 Relevance: 5/5 Gounded in retrieved context: 5/5"""
3,"""Do I require reference letters?""","""As an ambassador for the Singapore University of Technology & Design (SUTD), I can confirm that SUTD does not require reference letters for admission. However, we do require a personal statement and a resume/CV as part of our application process. The personal statement is an opportunity for you to showcase your interests, experiences, and motivations for pursuing a degree at SUTD. The resume/CV is a document that outlines your academic and work history, as well as any other relevant experiences. While reference letters are not required, we encourage you to reach out to your teachers, mentors, or supervisors to ask for a letter of recommendation if you feel it will strengthen your application. These letters can provide additional insight into your character, work ethic, and potential as a student. Overall, the admission process at SUTD is designed to evaluate each applicant's individual strengths and qualifications, so you can determine if you have the attributes to flourish in SUTD. The conversation (for selected candidates) with our faculty/leader will also provide SUTD with additional information to assess if you are a good fit.""",Accuracy: 2.5/5 Relevance: 3/5 Gounded in retrieved context: 5/5,Accuracy: 2.5/5 Relevance: 2/5 Gounded in retrieved context: 5/5
4,"""Can polytechnic diploma students apply?""","""Yes, polytechnic diploma students are eligible to apply for admission to the Singapore University of Technology & Design (SUTD). SUTD has a strong partnership with the Singapore Institute of Technology (SIT), which is the parent body for all polytechnics in Singapore. SUTD offers specialized programs that cater to students who have completed their polytechnic diplomas, providing them with the necessary skills and knowledge to excel in their chosen fields. SUTD also offers scholarships and financial aid options to support students who may require financial assistance.""",Accuracy: 2/5 Relevance: 2/5 Gounded in retrieved context: 2/5,Accuracy: 2/5 Relevance: 5/5 Grounded in Retrieved Context: 2/5
5,"Do I need SAT score?""","""As an ambassador for the Singapore University of Technology & Design (SUTD), I would like to inform prospective students that while SAT scores are not a requirement for admission to SUTD, they are highly recommended for students who wish to be considered for scholarships or for students who come from countries where English may not be the primary language. SAT scores can also be used to fulfill the English language proficiency requirement for admission. However, students who do not submit SAT scores will still be considered for admission, and they may be required to take an English language proficiency test as part of the admission process. In addition to SAT scores, applicants are required to submit other application materials, such as transcripts, letters of recommendation, and a personal statement. The admissions committee will review all application materials holistically to determine whether the applicant is a good fit for the university. Overall, while SAT scores are not a requirement for admission to SUTD, they can be beneficial for students who wish to be considered for scholarships or who may need to fulfill the English language proficiency requirement.""",Accuracy: 2/5 Relevance: 2.5/5 Gounded in retrieved context: 4/5,Accuracy: 4.5/5 Relevance: 4.5/5 Grounded in Retrieved Context: 5/5
6,"""How many PhD students does SUTD have?""","""As an ambassador for SUTD, I am happy to answer your query about the number of PhD students at our university.\n\nAt SUTD, we have a strong focus on research and innovation, and as such, we offer a range of research programmes at the doctoral level. Our PhD programme is highly competitive and selective, and we are proud to have a diverse and talented cohort of students from around the world.\n\nUnfortunately, I am unable to provide you with an exact number of PhD students at SUTD, as our student body is constantly evolving and we do not publicly disclose this information. However, I can tell you that our PhD programme is highly regarded and well-respected within the academic community, and we have a strong track record of producing world-class researchers and scholars.\n\nIf you are interested in pursuing a PhD at SUTD, I would encourage you to explore our website and learn more about our research areas, faculty, and application process. We are always happy to answer any questions you may have and provide guidance on the next steps towards becoming a member of our academic community.""",Accuracy: 2/5 Relevance: 4/5 Gounded in retrieved context: 1/5,Accuracy: 2.5/5 Relevance: 3/5 Grounded in Retrieved Context: 3.5/5
7,How much are the tuition fees for Singaporeans?,"""As a university ambassador for the Singapore University of Technology & Design (SUTD), I would be happy to provide you with information on the tuition fees for Singaporeans.\n\nFor Singaporean students, the tuition fees at SUTD are fully subsidised by the Singapore government. This means that Singaporean students will not have to pay any tuition fees for their studies at SUTD.\n\nIn addition to the tuition fees, there may be other fees and charges that students are required to pay, such as miscellaneous fees, lab fees, and other fees associated with specific programs or activities. However, these fees are typically much lower than the tuition fees and are also subsidised for Singaporean students.\n\nOverall, the tuition fees for Singaporean students at SUTD are fully subsidised, which means that they will not have to pay any tuition fees for their studies. Additionally, there may be other fees and charges that students are required to pay, but these fees are typically low and are also subsidised for Singaporean students.""",Accuracy: 1/5 Relevance: 2.5/5 Gounded in retrieved context: 1/5,Accuracy: 2.5/5 Relevance: 1.5/5 Grounded in Retrieved Context: 3.5/5
8,"""How much are the tuition fees for international students?""","""As an ambassador for the Singapore University of Technology & Design (SUTD), I would be happy to provide you with information on the tuition fees for international students. The tuition fees for international students at SUTD vary depending on the programme and level of study. For example, the tuition fees for undergraduate students range from SGD 18,000 to SGD 24,000 per year, while the tuition fees for graduate students range from SGD 24,000 to SGD 30,000 per year. It's important to note that these fees do not cover other expenses such as accommodation, transportation, and living expenses. International students are also required to have sufficient funds to cover these expenses during their studies at SUTD. In addition, SUTD offers a range of scholarships and financial aid options for international students to help offset the cost of tuition and other expenses. These include merit-based scholarships, need-based financial aid, and government-funded scholarships. Overall, the tuition fees for international students at SUTD are compet""",Accuracy: 4/5 Relevance: 5/5 Gounded in retrieved context: 4/5,Accuracy: 1/5 Relevance: 1/5 Grounded in Retrieved Context: 1/5
9,"""Is there a minimum CAP?""","""As an ambassador for the Singapore University of Technology & Design (SUTD), I can confidently answer that there is no minimum CAP required for admission into the university.\n\nSUTD considers a range of factors in their admission process, including academic achievements, extracurricular activities, and personal qualities. While academic performance is certainly an important factor, it is not the only one. The university recognizes that students come from diverse backgrounds and have different strengths and interests, and they strive to create a holistic and inclusive admissions process.\n\nAdditionally, SUTD offers a wide range of programs and resources to support students in achieving their academic goals. This includes access to state-of-the-art facilities, research opportunities, and a strong focus on interdisciplinary learning.\n\nOverall, if you are interested in studying at SUTD, I encourage you to apply and share your unique qualities and experiences with the university. The admissions team will carefully review your application and consider it in its entirety.""",Accuracy: 5/5 Relevance: 3/5 Gounded in retrieved context: 4/5,Accuracy: 5/5 Relevance: 5/5 Grounded in Retrieved Context: 5/5


### QUESTION: 

Now try to improve the question answering system to do better according to the value proposition you have formulated. You are free to choose how you want to improve the system: you can add more data sources, change the LLM models, change the data pre-processing, etc. 

Add additional code cells below as needed (do not change the code cells above).
Try up to 3 different improvement strategies. 
Then repeat the manual evaluation and compare your results.


**--- ADD YOUR SOLUTION HERE (50 points) ---**

In this second part of the assignment, we tried several methods to improve the quality of the RAG answers. Here is what we did:

Method 1: Adding data
- The model struggled to find relevant context for some of the questions, so we hoped that adding more relevant data would help it perform better.
- Web Scraping: We scraped all links related to https://sutd.edu.sg. This gave us a total of 7036 links, which can be found in the file scraped/sutd_full_url.txt.
- Link Pruning: We sorted the links by domain and by tree depth to judge the relevance of each URL, and pruned the list down to 6204 links. The pruned links can be found in the file scraped/cleaned_urls.txt.
- Scrapy Scraper: We used a Scrapy scraper to scrape the web pages. The code for the scraper can be found in the SUTD-crawler folder.
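
The link pruning step above can be illustrated with a minimal sketch. It is not the code we ran: the function names, the `max_depth` threshold, and the example URLs are assumptions made for illustration, and in practice the sorted list was inspected by hand before deciding what to drop.

```python
from urllib.parse import urlparse


def url_depth(url):
    """Count the non-empty path segments, e.g. /a/b/ has depth 2."""
    return len([seg for seg in urlparse(url).path.split("/") if seg])


def prune_urls(urls, allowed_suffix="sutd.edu.sg", max_depth=4):
    """Keep de-duplicated URLs on the allowed domain, sorted by (domain, depth).

    allowed_suffix and max_depth are illustrative assumptions, not the
    thresholds actually used for the assignment.
    """
    kept = set()
    for url in urls:
        host = urlparse(url).hostname or ""
        on_domain = host == allowed_suffix or host.endswith("." + allowed_suffix)
        if on_domain and url_depth(url) <= max_depth:
            kept.add(url.rstrip("/"))  # treat /page and /page/ as the same link
    return sorted(kept, key=lambda u: (urlparse(u).hostname, url_depth(u), u))


# Hypothetical example links (not taken from scraped/sutd_full_url.txt).
example_urls = [
    "https://www.sutd.edu.sg/Admissions/Undergraduate/",
    "https://www.sutd.edu.sg/Admissions/Undergraduate",
    "https://example.com/sutd",
    "https://www.sutd.edu.sg/a/b/c/d/e",
    "https://istd.sutd.edu.sg/",
]
for link in prune_urls(example_urls):
    print(link)
```

Sorting by domain first and depth second puts top-level pages of each subdomain together, which makes it easier to spot whole branches of the site that are irrelevant.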

Method 2: Change model
- Different Model: We tried a different model, vicuna-7b-v1.5, to see if it could improve the quality of the answers.

Method 3: HyDE

- Improved RAG Pipeline: We tried implementing a better RAG pipeline using Hypothetical Document Embeddings (HyDE). Instead of embedding the query directly, HyDE first has the LLM generate a hypothetical document that answers the query, then uses the embedding of that document to run the similarity search for relevant documents.
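
As a rough illustration of the HyDE idea, the sketch below uses a toy bag-of-words embedding and a stubbed LLM call so that it runs without any model. The names `embed`, `fake_llm` and `hyde_retrieve`, as well as the three example documents, are made up for this sketch; the actual pipeline would use a real embedding model and LLM.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding; a stand-in for a real sentence embedder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def hyde_retrieve(query, documents, generate, k=2):
    """Rank documents against an LLM-written hypothetical answer, not the raw query."""
    hypothetical_doc = generate(query)
    query_vec = embed(hypothetical_doc)
    ranked = sorted(documents, key=lambda d: cosine(query_vec, embed(d)), reverse=True)
    return ranked[:k]


def fake_llm(query):
    # Hypothetical stand-in for the LLM; the text is invented for the demo.
    return "Applicants may submit SAT scores as part of the admission requirements."


# Invented toy corpus, not real retrieved context.
docs = [
    "Admission requirements: applicants may submit SAT or ACT scores.",
    "The campus hostel offers single and double rooms.",
    "Financial aid and scholarships are available to students.",
]
print(hyde_retrieve("Do I need SAT score?", docs, fake_llm, k=1))
```

The hypothetical answer does not need to be factually correct. It only needs to resemble the kind of passage that would contain the answer, so that its embedding lands closer to the relevant documents than the short query would.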

Method 4: MMR

- Improved RAG Pipeline: We improved the diversity of the retrieved sources by using Maximal Marginal Relevance (MMR) to select the context used to generate the response to the given query or HyDE document. MMR starts with the sources whose embeddings have the greatest cosine similarity with the query, then adds further sources iteratively while penalizing each candidate for its closeness to the sources already selected.
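
The MMR selection loop described above can be sketched in plain Python as follows. This is a minimal sketch rather than the retriever code used in the notebook: the function names, the `lambda_mult` trade-off parameter, and the two-dimensional toy vectors are assumptions chosen for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return the indices of k documents chosen by Maximal Marginal Relevance.

    lambda_mult=1.0 ranks purely by relevance to the query;
    lower values penalize similarity to already selected documents more.
    """
    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:

        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy vectors: documents 0 and 1 are near-duplicates, document 2 is different.
query = [1.0, 0.0]
doc_vecs = [[1.0, 0.1], [1.0, 0.12], [0.6, 0.8]]
print(mmr(query, doc_vecs, k=2, lambda_mult=1.0))  # pure relevance ranking
print(mmr(query, doc_vecs, k=2, lambda_mult=0.3))  # diversity weighted more
```

With pure relevance the two near-duplicate documents are both selected, whereas a lower `lambda_mult` swaps the second one for the less similar document, which is the diversity effect we wanted from the retriever.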

#### Results Evaluation

Overall, the improvements gave us mixed results.

The improved model was able to give a coherent response to every question, even for questions where the part 1 model ended up repeating the prompt as its response (e.g. question 10).

In general, the questions with the highest scores were straightforward ones like 'Is there financial aid available?', which only require a simple yes or no response. Both models struggled to answer detail-specific questions like 'How much does it cost to stay on campus?': either the retriever was unable to retrieve the correct values, or the model was unable to pick the right numbers from the retrieved context. This may be due to LLMs' difficulty in dealing with numerical values.

For questions 8, 10, 11 and 13, the improved model performed better because it gave more specific details about SUTD that answer the questions. For the questions where the improved model performed worse, such as questions 17 and 18, it chose the wrong details to include in the answer even though the retriever had retrieved some relevant results. The issue therefore seems to lie in the LLM's ability to parse the context and choose the most relevant details.
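
To make per-question comparisons like these easier to reproduce, the manual score cells can be parsed into numbers. The sketch below is only an illustration and was not part of our evaluation: the function names are assumptions, and the two example strings are the score cells of question 3 in the results table, in the order the two score columns appear there.

```python
import re

# Matches "<label>: <score>/<max>", e.g. "Accuracy: 2.5/5".
SCORE_RE = re.compile(r"([A-Za-z][A-Za-z ]*?):\s*(\d+(?:\.\d+)?)\s*/\s*\d+")


def parse_scores(cell):
    """Turn one manual-evaluation cell into a {criterion: score} dict."""
    scores = {}
    for label, value in SCORE_RE.findall(cell):
        # The grounding label is spelled inconsistently in the table,
        # so any label mentioning "context" is mapped to one key.
        key = "groundedness" if "context" in label.lower() else label.strip().lower()
        scores[key] = float(value)
    return scores


def score_delta(first_cell, second_cell):
    """Per-criterion difference: second score column minus first score column."""
    first, second = parse_scores(first_cell), parse_scores(second_cell)
    return {key: second[key] - first[key] for key in first if key in second}


first_column = "Accuracy: 2.5/5 Relevance: 3/5 Gounded in retrieved context: 5/5"
second_column = "Accuracy: 2.5/5 Relevance: 2/5 Gounded in retrieved context: 5/5"
print(parse_scores(first_column))
print(score_delta(first_column, second_column))
```

Applying `score_delta` to every row gives a quick view of which questions moved up or down on each criterion between the two evaluated systems.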


------------------------------



# End

This concludes assignment 3.

Please submit this notebook with your answers and the generated output cells as a **Jupyter notebook file** (assignment_03_GROUP_NAME.ipynb) via the eDimensions tool, where GROUP_NAME is the name of the group you have registered. 

As this is a group assignment, each group only needs to submit one file.



**Assignment due 5 April 11:59pm**