# Assignment 4: Instruction finetuning a Llama-2 7B model - part 1
**Assignment due 19 April 11:59pm**

Welcome to the fourth assignment for 50.055 Machine Learning Operations. These assignments give you a chance to practice the methods and tools you have learned. 

**This assignment is a group assignment.**

- Read the instructions in this notebook carefully
- Add your solution code and answers in the appropriate places. The questions are marked as **QUESTION:**, the places where you need to add your code and text answers are marked as **ADD YOUR SOLUTION HERE**
- The completed notebook, including your added code and generated output will be your submission for the assignment.
- The notebook should execute without errors from start to finish when you select "Restart Kernel and Run All Cells..". Please test this before submission.
- Use the SUTD Education Cluster to solve and test the assignment.

**Rubric for assessment** 

Your submission will be graded using the following criteria. 
1. Code executes: your code should execute without errors. The SUTD Education cluster should be used to ensure the same execution environment.
2. Correctness: the code should produce the correct result or the text answer should state the factual correct answer.
3. Style: your code should be written in a way that is clean and efficient. Your text answers should be relevant, concise and easy to understand.
4. Partial marks will be awarded for partially correct solutions.
5. There is a maximum of 200 (80 + 40 + 80) points for this assignment.

**ChatGPT policy** 

If you use AI tools, such as ChatGPT, to solve the assignment questions, you need to be transparent about its use and mark AI-generated content as such. In particular, you should include the following in addition to your final answer:
- A copy or screenshot of the prompt you used
- The name of the AI model
- The AI generated output
- An explanation why the answer is correct or what you had to change to arrive at the correct answer

**Assignment Notes:** Please make sure to save the notebook as you go along. Submission Instructions are located at the bottom of the notebook.



### Finetuning LLMs

The goal of the assignment is to build a chatbot that can talk to prospective students and answer questions about SUTD, similar to the chat-with-a-student function on the SUTD website (https://www.sutd.edu.sg/Admissions/chat)

Instead of using a RAG approach, in this assignment, you will finetune an LLM to perform the task. We will fine-tune a LLama-2 7B LLM model on question-answer pairs which we synthetically generate with an LLM.

We will  be leveraging `langchain`, `llama 2`, and `LoRA` again.

Check out the docs:
- [LangChain](https://docs.langchain.com/docs/)
- [LLaMA 2](https://huggingface.co/blog/llama2)
- [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685)



### Step 1: generate training data
The first step of the assignment is generating synthetic question-answer pairs which can be used for finetuning an LLM model. 
To do this, we first load an LLM and the RAG question-answering system about SUTD from assignment 3. Ideally we would use a very accurate LLM, like GPT-4, to generate the
the training data. For cost reasons, here we will use Llama-2 13B.

You can check the Stanford Alpaca project for some examples on data generation: https://crfm.stanford.edu/2023/03/13/alpaca.html


In [3]:
# Installing required packages
# ----------------
! pip install -q -U peft==0.6.2 transformers==4.35.2 datasets==2.15.0 bitsandbytes==0.41.2.post2 trl==0.7.4 accelerate==0.24.1 wandb==0.16.3
! pip install -q -U langchain==0.1.13
! pip install -q -U safetensors>=0.3.1
! pip install -q -U faiss-cpu==1.7.4
! pip install -q tiktoken==0.6.0
! pip install -q sentence-transformers==2.3.1
! pip install -q pypdf==4.0.1
! pip install -q protobuf==4.25.2
! pip install -q lxml==5.1.0
! pip install -q rouge_score==0.1.2
# ----------------


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
zsh:1: 0.3.1 not found

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip

In [4]:
from langchain.chat_models import ChatOpenAI
from langchain_groq import ChatGroq
from langchain_community.chat_models.ollama import ChatOllama
from langchain_cohere import ChatCohere
from langchain_community.chat_models.perplexity import ChatPerplexity
from langchain_community.chat_models.bedrock import BedrockChat
from rich.pretty import pprint

In [7]:
# Importing required packages
# ----------------
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import CacheBackedEmbeddings, HuggingFaceEmbeddings
from langchain.document_loaders import PyPDFLoader
from langchain.vectorstores import FAISS
from langchain.storage import LocalFileStore
from langchain.llms import HuggingFacePipeline
from langchain.callbacks import StdOutCallbackHandler
from langchain_community.document_loaders import BSHTMLLoader
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import JsonOutputParser

from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM, BitsAndBytesConfig, pipeline
from datasets import load_dataset, Dataset
from rouge_score import rouge_scorer

import torch
import re
import os
import pickle
# ----------------


  from .autonotebook import tqdm as notebook_tqdm
Exception ignored in: <bound method IPythonKernel._clean_thread_parent_frames of <ipykernel.ipkernel.IPythonKernel object at 0x1081d1690>>
Traceback (most recent call last):
  File "/Users/jon/code/school/t8/mlops/assignments/4/venv/lib/python3.11/site-packages/ipykernel/ipkernel.py", line 775, in _clean_thread_parent_frames
    def _clean_thread_parent_frames(

KeyboardInterrupt: 
  _torch_pytree._register_pytree_node(
  _torch_pytree._register_pytree_node(
  _torch_pytree._register_pytree_node(


# SUTD Question Answering RAG system 
First, we set up the basic RAG system on SUTD content, as you have explored in assignment 3.

In [None]:
# Download SUTD's annual reports
! mkdir -p ./data

! wget -nc -P data https://www.sutd.edu.sg/SUTD/media/SUTD/SUTD_AnnualReport_2022_23.pdf
! wget -nc -P data https://www.sutd.edu.sg/SUTD/media/SUTD/SUTD_AnnualReport_2021.pdf
! wget -nc -P data https://www.sutd.edu.sg/SUTD/media/SUTD/SUTD_AnnualReport_2020.pdf


File ‘data/SUTD_AnnualReport_2022_23.pdf’ already there; not retrieving.

File ‘data/SUTD_AnnualReport_2021.pdf’ already there; not retrieving.

File ‘data/SUTD_AnnualReport_2020.pdf’ already there; not retrieving.



In [None]:
# Download html files from SUTD website
! curl --output data/Admission-Requirements.html https://www.sutd.edu.sg/Admissions/Undergraduate/Application/Admission-Requirements
! curl --output data/Application-Timeline.html https://www.sutd.edu.sg/Admissions/Undergraduate/Application/Application-Timeline
! curl --output data/Singapore-Cambridge-GCE-A-Level.html https://www.sutd.edu.sg/Admissions/Undergraduate/Application/Admission-Requirements/Singapore-Cambridge-GCE-A-Level
! curl --output data/Local-Diploma.html https://www.sutd.edu.sg/Admissions/Undergraduate/Application/Admission-Requirements/Local-Diploma
! curl --output data/NUS-High-School-Diploma.html https://www.sutd.edu.sg/Admissions/Undergraduate/Application/Admission-Requirements/NUS-High-School-Diploma
! curl --output data/International-Baccalaureate-Diploma.html https://www.sutd.edu.sg/Admissions/Undergraduate/Application/Admission-Requirements/International-Baccalaureate-Diploma-\(Singapore\)
! curl --output data/International-Qualifications.html https://www.sutd.edu.sg/Admissions/Undergraduate/Application/Admission-Requirements/International-Qualifications

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  519k    0  519k    0     0   130k      0 --:--:--  0:00:03 --:--:--  130k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  521k    0  521k    0     0   131k      0 --:--:--  0:00:03 --:--:--  131k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  526k    0  526k    0     0   133k      0 --:--:--  0:00:03 --:--:--  133k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  524k    0  524k    0     0   131k      0 --:--:--  0:00:03 --:--:--  131k
  % Total    % Received % Xferd  Average Speed   Tim

In [8]:
# Load the PDF documents and HTML files. Then use LangChain to split the documents into smaller text chunks.
data_root = "./data/"

pdf_filenames = [
    'SUTD_AnnualReport_2020.pdf',
    'SUTD_AnnualReport_2021.pdf',
    'SUTD_AnnualReport_2022_23.pdf',
]

html_filenames = [
    'Admission-Requirements.html',
    'Application-Timeline.html',
    'Singapore-Cambridge-GCE-A-Level.html',
    'Local-Diploma.html',
    'NUS-High-School-Diploma.html',
    'International-Baccalaureate-Diploma.html',
    'International-Qualifications.html'
]

pdf_metadata = [
    dict(year=2020, source=pdf_filenames[0]),
    dict(year=2021, source=pdf_filenames[1]),
    dict(year=2023, source=pdf_filenames[2])
]

html_metadata = [
    dict(year=2024, source=html_filenames[0]),
    dict(year=2024, source=html_filenames[1]),
    dict(year=2024, source=html_filenames[2]),
    dict(year=2024, source=html_filenames[3]),
    dict(year=2024, source=html_filenames[4]),
    dict(year=2024, source=html_filenames[5]),
    dict(year=2024, source=html_filenames[6])
]

# load pdf files, attach meta data
documents = []
for idx, file in enumerate(pdf_filenames):
    print("Load file", file)
    loader = PyPDFLoader(data_root + file)
    document = loader.load()
    for document_fragment in document:
        document_fragment.metadata = pdf_metadata[idx]
    documents += document

# load html files, attach meta data
for idx, file in enumerate(html_filenames):
    print("Load file", file)
    loader = BSHTMLLoader(data_root + file)
    document = loader.load()
    for document_fragment in document:
        # remove duplicate whitespace
        document_fragment.page_content = repr(re.sub(r"(?<=\n)(\s+)",r" ", document_fragment.page_content))
        document_fragment.metadata = html_metadata[idx]
    documents += document

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=200,
    chunk_overlap=10
)

docs = text_splitter.split_documents(documents)

print(f'# of Document Pages {len(documents)}')
print(f'# of Document Chunks: {len(docs)}')

Load file SUTD_AnnualReport_2020.pdf
Load file SUTD_AnnualReport_2021.pdf
Load file SUTD_AnnualReport_2022_23.pdf
Load file Admission-Requirements.html
Load file Application-Timeline.html
Load file Singapore-Cambridge-GCE-A-Level.html
Load file Local-Diploma.html
Load file NUS-High-School-Diploma.html
Load file International-Baccalaureate-Diploma.html
Load file International-Qualifications.html
# of Document Pages 148
# of Document Chunks: 1042


In [9]:
# Create embeddings of document chunks and store them in vector store for fast lookup
store = LocalFileStore("./cache/")

# embed_model_id = 'sentence-transformers/all-MiniLM-L6-v2'
embed_model_id = 'nomic-embed-text'

core_embeddings_model = HuggingFaceEmbeddings(
    model_name=embed_model_id
)

from langchain_community.embeddings.ollama import OllamaEmbeddings
core_embeddings_model = OllamaEmbeddings(model=embed_model_id)

embedder = CacheBackedEmbeddings.from_bytes_store(
    core_embeddings_model, store, namespace=embed_model_id
)

vector_store = FAISS.from_documents(docs, embedder)


In [10]:
# Load Llama-2 13B LLM model

model_id = "NousResearch/Llama-2-13b-chat-hf"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

model_config = AutoConfig.from_pretrained(
    model_id
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto'
)

tokenizer = AutoTokenizer.from_pretrained(model_id)


In [11]:
# check that the model can generate text
prompt = "Today was an amazing day because"
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(**inputs, do_sample=True, num_beams=1, max_new_tokens=100)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

In [12]:
# Create a text generation pipeline with the LLM model
generate_text = pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    return_full_text=False,
    temperature=0.5,
    do_sample=True,
    max_new_tokens=500
)

llm = HuggingFacePipeline(pipeline=generate_text)

In [13]:
# instantiate retriever model and callback handler for QA results
retriever = vector_store.as_retriever()
handler = StdOutCallbackHandler()

In [14]:
# build RAG question answering chain with a custom prompt template

template = """You are a helpful assistant. Use the following pieces of context to answer the question at the end.
Answer the following questions about the Singapore University of Technology and Design (SUTD).
Use three sentences maximum and keep the answer as concise as possible.

Context: {context}

Question: {question}

Helpful Answer:"""
custom_rag_prompt = PromptTemplate.from_template(template)


def format_docs(docs):
    docs_str = "\n\n".join(doc.page_content for doc in docs)
    # print(docs_str)
    return docs_str

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | custom_rag_prompt
    | llm
    | StrOutputParser()
)

In [15]:
# Test RAG with example question
rag_chain.invoke("What types of student organizations and clubs are available on campus?")

'There is mention of sports and recreation, as well as F&B services, but no specific student organizations or clubs are listed. It is best to contact the university directly for a comprehensive list of student groups and activities.'

Great, we have a working LLM and RAG system about SUTD. Now it is time to generate some data.

In [16]:

# QUESTION: When generating data with LLMs, it is helpful to parse the LLLM output into structured data formats.
# Create a JsonOutputParser from langchain. Name the variable 'output_parser'. Print the format instructions that come with the parser.

#--- ADD YOUR SOLUTION HERE (5 points)---
from langchain_core.output_parsers import JsonOutputParser

output_parser = JsonOutputParser()
print(output_parser.get_format_instructions())

#---------------------------------



Return a JSON object.


In [17]:
# When generating data, it is often helpful to guide the generation process through some hierachical structure.
# Before we create question-answer pairs, let's generate some topics which the questions should be about.

# QUESTION: Create a function 'generate_topics' which takes an integer n_length as input and outputs a dictionary with key 'topics'
# and as value a list of n_length topics which prospective students might care about such as financial aid, campus life etc.
# Use the LLM and an appropriate prompt to generate these topics and the Json parser to parse the LLM output (use the format instructions).
# Make sure your function is robust to non well-formed LLM output.

# REFERENCE: https://python.langchain.com/docs/modules/model_io/output_parsers/types/json/

#--- ADD YOUR SOLUTION HERE (20 points)---
from langchain_core.pydantic_v1 import BaseModel, Field
from typing import List
# Define your desired data structure.
class Data(BaseModel):
    """Data to be generated"""
    topics: List[str] = Field(description="Topics prospective students might care about")

def extract_json_str(ai_message) -> str:
    message = ai_message.content
    # Find the first occurrence of '{' and the last occurrence of '}'
    start = message.find('{')
    end = message.rfind('}') + 1  # '+ 1' to include the '}' in the slice

    # Extract the substring between these indices
    json_str = message[start:end]

    return json_str

def generate_topics(n_length):
    template = """
    Generate {n_length} topics that prospective students of SUTD university might care about.
    Ensure your output generation is in a structured json format with key topics and value as a list of {n_length} topics.
    You are to only generate a single JSON output, nothing else. DO not explain your answer.
    Return a JSON object.
    {format_instructions}
    """
    prompt = PromptTemplate.from_template(template)

    output_parser = JsonOutputParser(pydantic_object=Data)

    chain = prompt | llm | extract_json_str | output_parser
    output = chain.invoke({"n_length": str(n_length), "format_instructions": output_parser.get_format_instructions()})
    return output

#---------------------------------
out = generate_topics(20)
out

{'topics': ['Curriculum and Academic Programs',
  'Campus Life and Student Activities',
  'Housing and Residential Options',
  'Career Services and Job Prospects',
  'Research Opportunities',
  'Tuition Fees and Financial Aid',
  'Faculty and Teaching Quality',
  'Campus Facilities and Resources',
  'Student Diversity and Inclusion',
  'Study Abroad and International Opportunities',
  'Internship and Co-op Programs',
  'Student Support Services',
  'Campus Safety and Security',
  'Sustainability Initiatives',
  'Alumni Network and Connections',
  'Entrepreneurship and Innovation',
  'Sports and Recreation',
  'Arts and Cultural Events',
  'Student Organizations and Clubs',
  'Community Engagement and Service Learning']}

In [18]:
# Now let's generate a list of 20 topics
# We save a copy to disk and reload it from there if the file exists


# generate topics
if os.path.exists("topics.txt"):
    print("File with topics exists. Read topics from file..")
    with open("topics.txt", "r") as fin:
        topics = {"topics": fin.read().splitlines()}
else:
    print("Generate topics..")
    n_topics = 20
    topics = generate_topics(n_topics)
    with open("topics.txt", "w") as fout:
        fout.write("\n".join(topics['topics']))
pprint(topics)


File with topics exists. Read topics from file..


In [19]:
# Now we need another function to generate questions for the topics.

# QUESTION: Create a function 'generate_questions' which a topic string and takes an integer n_length as input and outputs a dictionary with key 'questions'
# and as value a list of at least n_length questions which prospective students might have about this topic.
# Again, use the LLM and an appropriate prompt and the Json parser to parse the LLM output (use the format instructions).
# Make sure your function is robust to non well-formed LLM output.

#--- ADD YOUR SOLUTION HERE (20 points)---
class QuestionsData(BaseModel):
    """Data to be generated"""
    questions: List[str] = Field(description="Questions prospective SUTD University students might have about the topic")

def generate_questions(topic, n_length):
    template = """
    Generate {n_length} questions that prospective students might have about the topic {topic}.
    Your output must be a structured json format with key 'questions' and value as a list of {n_length} questions.
    Ensure that there is diversity in the generated questions, i.e. they are not all the same or too similar
    Return the JSON object only, nothing esle.
    {format_instructions}
    """
    output_parser = JsonOutputParser(pydantic_object=QuestionsData)

    prompt = PromptTemplate(
        template=template,
        input_variables=["topic", "n_length"],
        partial_variables={"format_instructions": output_parser.get_format_instructions()},
    )

    chain = prompt | llm | output_parser
    output = chain.invoke({"topic": topic, "n_length": str(n_length)})
    return output

#---------------------------------
pprint(generate_questions("financial aid", 10))

In [20]:
# Now let's generate some questions for the topics.

# QUESTION: For every topic, generate at least 10 questions.
# LLM generation can take time, save intermediate results to disk and reload them if necessary to speed up subsequent runs.
# Store all questions in a list of strings 'questions_all'
# Extra points: check that there is diversity in the generated questions, i.e. they are not all the same or too similar.
# You can achieve this by checking that questions are not too similar to each other

n_questions_per_topic = 10
questions_all = []



#--- ADD YOUR SOLUTION HERE (20 points)---
for topic in topics['topics']:
    filename = f"questions_{topic}.txt"
    if os.path.exists(filename) and os.stat(filename).st_size > 0:
      print(f"File with questions for topic {topic} exists and has content. Read questions from file..")
      with open(filename, "r") as fin:
          questions = fin.read().splitlines()
    else:
        print(f"Generate questions for topic {topic}..")
        data = generate_questions(topic, n_questions_per_topic)
        questions = data['questions']
        questions_all += questions
        with open(filename, "w") as fout:
            fout.write("\n".join(questions))
    questions_all += questions
print(len(questions_all))
#---------------------------------

File with questions for topic Curriculum and Academic Programs exists and has content. Read questions from file..
File with questions for topic Campus Life and Student Activities exists and has content. Read questions from file..
File with questions for topic Housing and Residential Options exists and has content. Read questions from file..
File with questions for topic Career Services and Internship Opportunities exists and has content. Read questions from file..
File with questions for topic Research Facilities and Labs exists and has content. Read questions from file..
File with questions for topic Faculty Expertise and Achievements exists and has content. Read questions from file..
File with questions for topic Tuition Fees and Financial Aid exists and has content. Read questions from file..
File with questions for topic Student Diversity and Inclusion exists and has content. Read questions from file..
File with questions for topic Study Abroad and Exchange Programs exists and has 

In [21]:
# save questions to disk
if not os.path.exists("questions.txt"):
    print("Write all questions to questions.txt")
    with open("questions.txt", "w") as fout:
        fout.write("\n".join(questions_all))
else:
      print("File questions.txt exists. skip")

File questions.txt exists. skip


In [24]:
import time

answers_all = []
start_time = time.time()
counter = 0

for question in questions_all:
    if counter == 10:
        end_time = time.time()
        elapsed_time = end_time - start_time

        print(f"Made {counter} invoke calls in {elapsed_time} seconds")

        if elapsed_time < 60:
            time_to_sleep = 60 - elapsed_time
            print(f"Sleeping for {time_to_sleep} seconds to limit invoke calls to 10 per minute")
            time.sleep(time_to_sleep)

        start_time = time.time()
        counter = 0

    answer = rag_chain.invoke(question)
    print(f"question: {question}\n answer: {answer}")
    answers_all.append(answer)

    counter += 1

question: What are the core academic programs offered at SUTD?
 answer: The Singapore University of Technology and Design (SUTD) offers a range of academic programs with a focus on technology and design. Core programs include Master of Architecture, Master of Engineering, Master of Innovation by Design, and Master of Science in various fields. These programs provide a strong foundation for students seeking to excel in these areas.
question: How is the curriculum structured for each program?
 answer: The Singapore University of Technology and Design offers a unique curriculum. Undergraduates first complete a three-term Freshmore program covering fundamentals, then specialize in one of five areas: Architecture, Computer Science, Design & AI, Engineering Product, or Systems & Design. Minors are also available in AI, CS, entrepreneurship, and design & society.
question: Are there opportunities for interdisciplinary study or customizing your curriculum?
 answer: Yes, the Singapore Universit

In [None]:
# Now create answers to questions using the RAG pipeline

# QUESTION: For every question, generate an answer using the RAG system
# Store all answers in a list of strings 'answers_all'
# Extra points: check that there is diversity in the generated questions, i.e. they are not all the same or too similar.
# You can achieve this by checking that questions are not too similar to each other

answers_all = []

#--- ADD YOUR SOLUTION HERE (10 points)---
for question in questions_all:
  answer = rag_chain.invoke(question)
  print(f"question: {question}\n answer: {answer}")
  answers_all.append(answer)

#---------------------------------

In [25]:
# save a copy of the answers to disk

if not os.path.exists("answers.txt"):
    print("Write all answers to answers.txt")
    with open("answers.txt", "w") as fout:
        fout.write("\n".join(answers_all))
else:
      print("File answers.txt exists. skip")


Write all answers to answers.txt


In [26]:
len(questions_all)

200

In [27]:
len(answers_all)

200

In [28]:
# create huggingface dataset to make it easier to work with the data

# QUESTION: create a huggingface dataset object with the keys 'question' and 'answer' and the questions and answers you have generated, respectively
# shuffle the dataset. use a fixed seed.

#--- ADD YOUR SOLUTION HERE (5 points)---
from datasets import load_dataset, Dataset
data = {'question': questions_all, 'answer': answers_all}
sutd_qa_dataset = Dataset.from_dict(data)
#---------------------------------


In [29]:
# inspect schema and size of dataset
sutd_qa_dataset

Dataset({
    features: ['question', 'answer'],
    num_rows: 200
})

In [30]:
# inspect first instance
sutd_qa_dataset[0]

{'question': 'What are the core academic programs offered at SUTD?',
 'answer': 'The Singapore University of Technology and Design (SUTD) offers a range of academic programs with a focus on technology and design. Core programs include Master of Architecture, Master of Engineering, Master of Innovation by Design, and Master of Science in various fields. These programs provide a strong foundation for students seeking to excel in these areas.'}

In [31]:
# save dataset to disk
with open('sutd_qa_dataset.pkl', 'wb') as f:
    pickle.dump(sutd_qa_dataset, f)



In [32]:
from huggingface_hub import login

# log in to huggingface, you need to put your huggingface access token
# https://huggingface.co/docs/hub/en/security-tokens

hf_access_token = "<YOUR HF WRITE ACCESS TOKEN>"
login(token=hf_access_token)

Token has not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /Users/jon/.cache/huggingface/token
Login successful


In [33]:
# push dataset to huggingface
sutd_qa_dataset.push_to_hub("sutd_qa_dataset")



Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 287.01ba/s]
Uploading the dataset shards: 100%|██████████| 1/1 [00:01<00:00,  1.68s/it]


### This concludes the first part of the assignment. Continue with the next part