# Assignment 4: Instruction finetuning a Llama-2 7B model - part 1
**Assignment due 19 April 11:59pm**

Welcome to the fourth assignment for 50.055 Machine Learning Operations. These assignments give you a chance to practice the methods and tools you have learned. 

**This assignment is a group assignment.**

- Read the instructions in this notebook carefully
- Add your solution code and answers in the appropriate places. The questions are marked as **QUESTION:**, the places where you need to add your code and text answers are marked as **ADD YOUR SOLUTION HERE**
- The completed notebook, including your added code and generated output will be your submission for the assignment.
- The notebook should execute without errors from start to finish when you select "Restart Kernel and Run All Cells..". Please test this before submission.
- Use the SUTD Education Cluster to solve and test the assignment.

**Rubric for assessment** 

Your submission will be graded using the following criteria. 
1. Code executes: your code should execute without errors. The SUTD Education cluster should be used to ensure the same execution environment.
2. Correctness: the code should produce the correct result or the text answer should state the factually correct answer.
3. Style: your code should be written in a way that is clean and efficient. Your text answers should be relevant, concise and easy to understand.
4. Partial marks will be awarded for partially correct solutions.
5. There is a maximum of 200 (80 + 40 + 80) points for this assignment.

**ChatGPT policy** 

If you use AI tools, such as ChatGPT, to solve the assignment questions, you need to be transparent about their use and mark AI-generated content as such. In particular, you should include the following in addition to your final answer:
- A copy or screenshot of the prompt you used
- The name of the AI model
- The AI-generated output
- An explanation of why the answer is correct or what you had to change to arrive at the correct answer

**Assignment Notes:** Please make sure to save the notebook as you go along. Submission Instructions are located at the bottom of the notebook.



### Finetuning LLMs

The goal of the assignment is to build a chatbot that can talk to prospective students and answer questions about SUTD, similar to the chat-with-a-student function on the SUTD website (https://www.sutd.edu.sg/Admissions/chat).

Instead of using a RAG approach, in this assignment you will finetune an LLM to perform the task. We will finetune a Llama-2 7B model on question-answer pairs that we generate synthetically with an LLM.

We will be leveraging `langchain`, `llama 2`, and `LoRA` again.

Check out the docs:
- [LangChain](https://docs.langchain.com/docs/)
- [LLaMA 2](https://huggingface.co/blog/llama2)
- [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685)



### Step 1: generate training data
The first step of the assignment is generating synthetic question-answer pairs which can be used for finetuning an LLM model. 
To do this, we first load an LLM and the RAG question-answering system about SUTD from assignment 3. Ideally we would use a very accurate LLM, like GPT-4, to generate the training data. For cost reasons, here we will use Llama-2 13B.

You can check the Stanford Alpaca project for some examples on data generation: https://crfm.stanford.edu/2023/03/13/alpaca.html


In [1]:
# Installing required packages
# ----------------
! pip install -q -U peft==0.6.2 transformers==4.35.2 datasets==2.15.0 bitsandbytes==0.41.2.post2 trl==0.7.4 accelerate==0.24.1 wandb==0.16.3
! pip install -q -U langchain==0.1.13 
! pip install -q -U "safetensors>=0.3.1"
! pip install -q -U faiss-cpu==1.7.4
! pip install -q tiktoken==0.6.0
! pip install -q sentence-transformers==2.3.1
! pip install -q pypdf==4.0.1
! pip install -q protobuf==4.25.2
! pip install -q lxml==5.1.0
! pip install -q rouge_score==0.1.2
# ----------------

In [2]:
# Importing required packages
# ----------------
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import CacheBackedEmbeddings, HuggingFaceEmbeddings
from langchain.document_loaders import PyPDFLoader
from langchain.vectorstores import FAISS
from langchain.storage import LocalFileStore
from langchain.llms import HuggingFacePipeline
from langchain.callbacks import StdOutCallbackHandler
from langchain_community.document_loaders import BSHTMLLoader
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import JsonOutputParser

from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM, BitsAndBytesConfig, pipeline
from datasets import load_dataset, Dataset
from rouge_score import rouge_scorer

import torch
import re
import os
import pickle
# ----------------




# SUTD Question Answering RAG system 
First, we set up the basic RAG system on SUTD content, as you have explored in assignment 3.

In [3]:
# Download SUTD's annual reports
! mkdir -p ./data

! wget -nc -P data https://www.sutd.edu.sg/SUTD/media/SUTD/SUTD_AnnualReport_2022_23.pdf
! wget -nc -P data https://www.sutd.edu.sg/SUTD/media/SUTD/SUTD_AnnualReport_2021.pdf
! wget -nc -P data https://www.sutd.edu.sg/SUTD/media/SUTD/SUTD_AnnualReport_2020.pdf


File ‘data/SUTD_AnnualReport_2022_23.pdf’ already there; not retrieving.

File ‘data/SUTD_AnnualReport_2021.pdf’ already there; not retrieving.

File ‘data/SUTD_AnnualReport_2020.pdf’ already there; not retrieving.



In [4]:
# Download html files from SUTD website 
! curl --output data/Admission-Requirements.html https://www.sutd.edu.sg/Admissions/Undergraduate/Application/Admission-Requirements  
! curl --output data/Application-Timeline.html https://www.sutd.edu.sg/Admissions/Undergraduate/Application/Application-Timeline
! curl --output data/Singapore-Cambridge-GCE-A-Level.html https://www.sutd.edu.sg/Admissions/Undergraduate/Application/Admission-Requirements/Singapore-Cambridge-GCE-A-Level
! curl --output data/Local-Diploma.html https://www.sutd.edu.sg/Admissions/Undergraduate/Application/Admission-Requirements/Local-Diploma
! curl --output data/NUS-High-School-Diploma.html https://www.sutd.edu.sg/Admissions/Undergraduate/Application/Admission-Requirements/NUS-High-School-Diploma
! curl --output data/International-Baccalaureate-Diploma.html https://www.sutd.edu.sg/Admissions/Undergraduate/Application/Admission-Requirements/International-Baccalaureate-Diploma-\(Singapore\) 
! curl --output data/International-Qualifications.html https://www.sutd.edu.sg/Admissions/Undergraduate/Application/Admission-Requirements/International-Qualifications 


In [5]:
# Load the PDF documents and HTML files. Then use LangChain to split the documents into smaller text chunks.
data_root = "./data/"

pdf_filenames = [
    'SUTD_AnnualReport_2020.pdf',
    'SUTD_AnnualReport_2021.pdf',
    'SUTD_AnnualReport_2022_23.pdf',
]

html_filenames = [
    'Admission-Requirements.html',
    'Application-Timeline.html',
    'Singapore-Cambridge-GCE-A-Level.html',
    'Local-Diploma.html',
    'NUS-High-School-Diploma.html',
    'International-Baccalaureate-Diploma.html',
    'International-Qualifications.html'
]

pdf_metadata = [
    dict(year=2020, source=pdf_filenames[0]),
    dict(year=2021, source=pdf_filenames[1]),
    dict(year=2023, source=pdf_filenames[2])
]

html_metadata = [
    dict(year=2024, source=html_filenames[0]),
    dict(year=2024, source=html_filenames[1]),
    dict(year=2024, source=html_filenames[2]),
    dict(year=2024, source=html_filenames[3]),
    dict(year=2024, source=html_filenames[4]),
    dict(year=2024, source=html_filenames[5]),
    dict(year=2024, source=html_filenames[6])
]

# load pdf files, attach meta data
documents = []
for idx, file in enumerate(pdf_filenames):
    print("Load file", file)
    loader = PyPDFLoader(data_root + file)
    document = loader.load()
    for document_fragment in document:
        document_fragment.metadata = pdf_metadata[idx]
    documents += document

# load html files, attach meta data
for idx, file in enumerate(html_filenames):
    print("Load file", file)
    loader = BSHTMLLoader(data_root + file)
    document = loader.load()
    for document_fragment in document:
        # remove duplicate whitespace
        document_fragment.page_content = repr(re.sub(r"(?<=\n)(\s+)",r" ", document_fragment.page_content))
        document_fragment.metadata = html_metadata[idx]
    documents += document

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=200, 
    chunk_overlap=10
)

docs = text_splitter.split_documents(documents)

print(f'# of Document Pages {len(documents)}')
print(f'# of Document Chunks: {len(docs)}')

Load file SUTD_AnnualReport_2020.pdf
Load file SUTD_AnnualReport_2021.pdf
Load file SUTD_AnnualReport_2022_23.pdf
Load file Admission-Requirements.html
Load file Application-Timeline.html
Load file Singapore-Cambridge-GCE-A-Level.html
Load file Local-Diploma.html
Load file NUS-High-School-Diploma.html
Load file International-Baccalaureate-Diploma.html
Load file International-Qualifications.html
# of Document Pages 148
# of Document Chunks: 1042


In [6]:
# Create embeddings of document chunks and store them in vector store for fast lookup
store = LocalFileStore("./cache/")

embed_model_id = 'sentence-transformers/all-MiniLM-L6-v2'

core_embeddings_model = HuggingFaceEmbeddings(
    model_name=embed_model_id
)

embedder = CacheBackedEmbeddings.from_bytes_store(
    core_embeddings_model, store, namespace=embed_model_id
)

vector_store = FAISS.from_documents(docs, embedder)


In [7]:
# Load Llama-2 13B LLM model 

model_id = "NousResearch/Llama-2-13b-chat-hf"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

model_config = AutoConfig.from_pretrained(
    model_id
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto'
)

tokenizer = AutoTokenizer.from_pretrained(model_id)


Loading checkpoint shards: 100%|██████████| 3/3 [00:33<00:00, 11.06s/it]


In [8]:
# check that the model can generate text
prompt = "Today was an amazing day because"
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(**inputs, do_sample=True, num_beams=1, max_new_tokens=100)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))



["Today was an amazing day because I went for a ride on my new scooter! It was so much fun, and I felt so free and alive. The wind in my hair and the sun on my face made me feel like I was on top of the world. I can't wait to go on more adventures on my scooter.\n\nI also had a great conversation with a friend today. We talked about our dreams and goals, and it was so inspiring to hear each other's"]


In [9]:
# Create a text generation pipeline with the LLM model 
generate_text = pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    return_full_text=False,
    temperature=0.5,
    do_sample=True,
    max_new_tokens=500
)

llm = HuggingFacePipeline(pipeline=generate_text)

In [10]:
# instantiate retriever model and callback handler for QA results
retriever = vector_store.as_retriever()
handler = StdOutCallbackHandler()


In [11]:
# build RAG question answering chain with a custom prompt template

template = """You are a helpful assistant. Use the following pieces of context to answer the question at the end.
Answer the following questions about the Singapore University of Technology and Design (SUTD).
Use three sentences maximum and keep the answer as concise as possible.

Context: {context}

Question: {question}

Helpful Answer:"""
custom_rag_prompt = PromptTemplate.from_template(template)


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | custom_rag_prompt
    | llm
    | StrOutputParser()
)


In [12]:
# Test RAG with example question
rag_chain.invoke("What types of student organizations and clubs are available on campus?")

' SUTD has a variety of student organizations and clubs available on campus, including academic clubs, cultural clubs, and recreational clubs. Some examples include the Robotics Club, the Photography Club, and the SUTD Student Council. These clubs provide opportunities for students to engage in extracurricular activities, develop their skills and interests, and connect with their peers.'
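
To make the data flow of the chain concrete, the following minimal sketch reproduces only the prompt-assembly step in plain Python. It assumes the same template text as above; `join_chunks`, `build_prompt` and the placeholder chunks are hypothetical names and values used for illustration, standing in for `format_docs` and the retrieved documents.

```python
# Same template text as the notebook's RAG prompt.
TEMPLATE = """You are a helpful assistant. Use the following pieces of context to answer the question at the end.
Answer the following questions about the Singapore University of Technology and Design (SUTD).
Use three sentences maximum and keep the answer as concise as possible.

Context: {context}

Question: {question}

Helpful Answer:"""


def join_chunks(chunks):
    """Mirror format_docs: join chunk texts with a blank line in between."""
    return "\n\n".join(chunks)


def build_prompt(question, chunks):
    """Fill the template the way the chain does before the LLM is called."""
    return TEMPLATE.format(context=join_chunks(chunks), question=question)


# Hypothetical retrieved chunks, for illustration only.
example_chunks = ["Placeholder chunk one.", "Placeholder chunk two."]
prompt_text = build_prompt("What clubs are available on campus?", example_chunks)
print(prompt_text)
```

Printing the assembled prompt this way can help when debugging why the model gives an off-topic answer, since it shows exactly which context the model received.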

Great, we have a working LLM and RAG system about SUTD. Now it is time to generate some data.

In [13]:

# QUESTION: When generating data with LLMs, it is helpful to parse the LLM output into structured data formats. 
# Create a JsonOutputParser from langchain. Name the variable 'output_parser'. Print the format instructions that come with the parser.

#--- ADD YOUR SOLUTION HERE (5 points)---
output_parser = JsonOutputParser()
print(output_parser.get_format_instructions())

#---------------------------------



Return a JSON object.


In [14]:
# When generating data, it is often helpful to guide the generation process through some hierarchical structure. 
# Before we create question-answer pairs, let's generate some topics which the questions should be about.

# QUESTION: Create a function 'generate_topics' which takes an integer n_length as input and outputs a dictionary with key 'topics' 
# and as value a list of n_length topics which prospective students might care about such as financial aid, campus life etc.
# Use the LLM and an appropriate prompt to generate these topics and the Json parser to parse the LLM output (use the format instructions). 
# Make sure your function is robust to non well-formed LLM output.

#--- ADD YOUR SOLUTION HERE (20 points)---
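def generate_topics(n_length):

    try:
        # Retrieve the context up front as a plain string: a Runnable passed as a
        # partial variable is not executed when the prompt is formatted.
        # The retrieval query is an assumed seed query; adjust it as needed.
        context = format_docs(retriever.invoke("information for prospective SUTD students"))

        prompt = PromptTemplate(
            template="Generate a dictionary with key 'topics' and as value a list of {n_length} topics which prospective students of the Singapore University of Technology and Design (SUTD) might care about such as financial aid, campus life, undergraduate programs, etc. Keep your answer concise without any unnecessary details or information and ensure there are {n_length} topics. {format_instructions}\nYou can use the provided context to help you generate the list of topics.\nContext: {context}\nJSON Object:",
            input_variables=["n_length"],
            partial_variables={"format_instructions": output_parser.get_format_instructions(), "context": context},
        )

        chain = prompt | llm | output_parser
        output = chain.invoke({"n_length": n_length})
        # Well-formed JSON can still have the wrong shape, so validate it
        if not isinstance(output, dict) or not isinstance(output.get("topics"), list):
            raise ValueError("LLM output does not contain a 'topics' list")
        return output
    except Exception as e:
        print(f"Error: {e}")
        return {"topics": []}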
#---------------------------------



In [15]:
# Now let's generate a list of 20 topics 
# We save a copy to disk and reload it from there if the file exists


# generate topics
if os.path.exists("topics.txt"):
    print("File with topics exists. Read topics from file..")
    with open("topics.txt", "r") as fin:
        topics = {"topics": fin.read().splitlines()}
else:
    print("Generate topics..")
    n_topics = 20
    topics = generate_topics(n_topics)
    with open("topics.txt", "w") as fout:
        fout.write("\n".join(topics['topics']))
print(topics)


File with topics exists. Read topics from file..
{'topics': ['financial aid', 'campus life', 'undergraduate programs', 'graduate programs', 'research opportunities', 'international student support', 'student organizations', 'career services', 'academic calendar', 'course offerings', 'admissions requirements', 'scholarships and grants', 'housing options', 'student life', 'campus resources', 'community engagement', 'sustainability initiatives', 'student health services', 'mentorship programs', 'alumni network']}


In [16]:
# Now we need another function to generate questions for the topics.

# QUESTION: Create a function 'generate_questions' which takes a topic string and an integer n_length as input and outputs a dictionary with key 'questions' 
# and as value a list of at least n_length questions which prospective students might have about this topic.
# Again, use the LLM and an appropriate prompt and the Json parser to parse the LLM output (use the format instructions). 
# Make sure your function is robust to non well-formed LLM output.

#--- ADD YOUR SOLUTION HERE (20 points)---

import json

example_output={'questions': ['What types of library resources are available at SUTD?', 'How do I access library resources remotely?']}

def generate_questions(topic, n_length):
    
    prompt = PromptTemplate(
        template="Generate a dictionary with key 'questions' and as value a list of {n_length} questions which prospective students of the Singapore University of Technology and Design (SUTD) might have regarding the topic {topic}. Keep your answer concise without any unnecessary details or information and ensure there are exactly {n_length} questions. Example output with 2 questions on the topic 'library': {example_output}. {format_instructions}\nJSON Object:",
        input_variables=["topic", "n_length"],
        # Serialize the example with json.dumps so the model sees double-quoted, valid JSON
        # instead of a Python dict repr with single quotes, which it tends to imitate
        partial_variables={"format_instructions": output_parser.get_format_instructions(), "example_output": json.dumps(example_output)},
    )

    chain = prompt | llm | output_parser
    output = chain.invoke({"topic": topic, "n_length": n_length})
    return output

#---------------------------------

In [17]:
# Now let's generate some questions for the topics.

# QUESTION: For every topic, generate at least 10 questions. 
# LLM generation can take time, save intermediate results to disk and reload them if necessary to speed up subsequent runs.
# Store all questions in a list of strings 'questions_all'
# Extra points: check that there is diversity in the generated questions, i.e. they are not all the same or too similar.
# You can achieve this by checking that questions are not too similar to each other

n_questions_per_topic = 10
questions_all = []

#--- ADD YOUR SOLUTION HERE (20 points)---
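max_retries = 5  # assumed retry budget so malformed output cannot cause an endless loop
for topic in topics['topics']:
    questions = []
    for attempt in range(max_retries):
        try:
            questions = generate_questions(topic, n_questions_per_topic)['questions']
            break
        except Exception as e:
            print(f"Error: {e}")
    questions_all.extend(questions)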

#---------------------------------



Error: Invalid json output: {'questions': ['What is the purpose of community engagement at SUTD?', 'How does community engagement relate to the curriculum at SUTD?', 'What types of community engagement opportunities are available at SUTD?', 'How can I get involved in community engagement activities at SUTD?', 'What are the benefits of community engagement for students at SUTD?', 'How does community engagement contribute to the development of soft skills at SUTD?', 'What are some examples of community engagement projects that SUTD students have participated in?', 'How can I find out about community engagement opportunities at SUTD?', 'What support is available for students who participate in community engagement activities at SUTD?']}
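
The error above shows output written with Python-style single quotes, which is not valid JSON, so the strict parser rejects it. Below is a minimal sketch of a more tolerant parser, assuming the model returns a single dictionary somewhere in its text. `parse_llm_dict` is a hypothetical helper, not part of LangChain: it tries strict JSON first and falls back to `ast.literal_eval`, which accepts single-quoted Python literals without executing code.

```python
import ast
import json
import re


def parse_llm_dict(text):
    """Try to recover a dict from LLM output; return None if nothing parses."""
    # Keep only the span from the first '{' to the last '}', dropping surrounding chatter.
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        return None
    candidate = match.group(0)
    try:
        # First attempt: strict JSON.
        parsed = json.loads(candidate)
    except json.JSONDecodeError:
        try:
            # Fallback: Python literal syntax, which accepts single quotes.
            parsed = ast.literal_eval(candidate)
        except (ValueError, SyntaxError, TypeError):
            return None
    return parsed if isinstance(parsed, dict) else None
```

A helper like this could be applied to the raw LLM string in place of the JSON parser, with a retry only when it returns `None`.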




In [18]:
# save questions to disk 
if not os.path.exists("questions.txt"):
    print("Write all questions to questions.txt")
    with open("questions.txt", "w") as fout:
        fout.write("\n".join(questions_all))
else:
    print("File questions.txt exists. skip")

Write all questions to questions.txt
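
For the extra-points item on question diversity, one possible approach is a greedy near-duplicate filter. The sketch below uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure. The ROUGE scorer imported at the top of the notebook could be swapped in instead. The function names and the threshold of 0.8 are assumptions for illustration and would need tuning on the actual data.

```python
from difflib import SequenceMatcher


def is_near_duplicate(candidate, kept, threshold=0.8):
    """Return True if candidate is too similar to any already-kept question."""
    cand = candidate.lower().strip()
    return any(
        SequenceMatcher(None, cand, other.lower().strip()).ratio() >= threshold
        for other in kept
    )


def filter_similar(questions, threshold=0.8):
    """Greedily keep questions whose similarity to all kept ones is below threshold."""
    kept = []
    for question in questions:
        if not is_near_duplicate(question, kept, threshold):
            kept.append(question)
    return kept


# Small illustration with made-up questions.
sample_questions = [
    "What scholarships are available at SUTD?",
    "What scholarships are available at SUTD",
    "How do I apply for housing?",
]
unique_questions = filter_similar(sample_questions)
```

The filter compares each question against all kept ones, so the cost grows quadratically with the number of questions. For a few hundred questions this should be acceptable.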


In [20]:
# Now create answers to questions using the RAG pipeline

# QUESTION: For every question, generate an answer using the RAG system
# Store all answers in a list of strings 'answers_all'
# Extra points: check that there is diversity in the generated questions, i.e. they are not all the same or too similar.
# You can achieve this by checking that questions are not too similar to each other

answers_all = []


#--- ADD YOUR SOLUTION HERE (10 points)---
for question in questions_all:
    answers_all.append(rag_chain.invoke(question))


#---------------------------------



In [21]:
# save a copy of the answers to disk

if not os.path.exists("answers.txt"):
    print("Write all answers to answers.txt")
    with open("answers.txt", "w") as fout:
        fout.write("\n".join(answers_all))
else:
    print("File answers.txt exists. skip")


Write all answers to answers.txt
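
Generated answers may contain line breaks, in which case a newline-joined text file no longer maps one line to one answer. As a minimal sketch of an alternative, the question-answer pairs can be stored as JSON Lines, where `json.dumps` escapes embedded newlines. `save_qa_jsonl`, `load_qa_jsonl` and the file name are hypothetical and only illustrate the idea.

```python
import json
import os
import tempfile


def save_qa_jsonl(path, questions, answers):
    """Write one JSON object per line; json.dumps escapes embedded newlines."""
    if len(questions) != len(answers):
        raise ValueError("questions and answers must have the same length")
    with open(path, "w", encoding="utf-8") as fout:
        for question, answer in zip(questions, answers):
            fout.write(json.dumps({"question": question, "answer": answer}) + "\n")


def load_qa_jsonl(path):
    """Read the file back into parallel question and answer lists."""
    questions, answers = [], []
    with open(path, "r", encoding="utf-8") as fin:
        for line in fin:
            record = json.loads(line)
            questions.append(record["question"])
            answers.append(record["answer"])
    return questions, answers


# Round trip with a multi-line answer, written to a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "qa_pairs.jsonl")
save_qa_jsonl(demo_path, ["q1", "q2"], ["line one\nline two", "a2"])
loaded_questions, loaded_answers = load_qa_jsonl(demo_path)
```

Keeping each question and its answer in the same record also avoids the two files drifting out of sync between runs.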


In [22]:
# create huggingface dataset to make it easier to work with the data

# QUESTION: create a huggingface dataset object with the keys 'question' and 'answer' and the questions and answers you have generated, respectively
# shuffle the dataset. use a fixed seed.

#--- ADD YOUR SOLUTION HERE (5 points)---
sutd_qa_dataset = Dataset.from_dict({"question": questions_all, "answer": answers_all})
sutd_qa_dataset = sutd_qa_dataset.shuffle(seed=42)

#---------------------------------


In [23]:
# inspect schema and size of dataset
sutd_qa_dataset

Dataset({
    features: ['question', 'answer'],
    num_rows: 200
})

In [24]:
# inspect first instance
sutd_qa_dataset[0]

{'question': 'How can I get involved in community engagement initiatives at SUTD?',
 'answer': ' As an ambassador-at-large at the MFA and a Professor at SUTD, you can collaborate with the Venture, Innovation, and Entrepreneurship (VIE) Office to engage in community initiatives. They provide support for alumni, students, researchers, and mid-career aspiring entrepreneurs to turn their ideas into reality. You can also participate in curated entrepreneurship programs, entrepreneurship capstone projects, incubation support, and mentorship opportunities. Additionally, you can explore collaborations with strategic partners, social venture building programs, and research commercialization initiatives to make a positive impact on the community.'}

In [25]:
# save dataset to disk
with open('sutd_qa_dataset.pkl', 'wb') as f:
    pickle.dump(sutd_qa_dataset, f)



In [26]:
from huggingface_hub import login

# log in to huggingface, you need to put your huggingface access token
# https://huggingface.co/docs/hub/en/security-tokens

hf_access_token = ""
login(token=hf_access_token)

Token has not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /home/jovyan/.cache/huggingface/token
Login successful


In [27]:
# push dataset to huggingface
sutd_qa_dataset.push_to_hub("sutd_qa_dataset")



Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]
Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 191.18ba/s]
Uploading the dataset shards: 100%|██████████| 1/1 [00:02<00:00,  2.18s/it]


### This concludes the first part of the assignment. Continue with the next part.