# Assignment 4: Instruction finetuning a Llama-2 7B model - part 4
## Objective
1. Create a dataset of answers generated using Claude-3-Sonnet
2. Create a dataset of answers to the same duplicate questions for both Claude-3-Sonnet and Mistral Large



In [11]:
# Installing required packages
# ----------------
! pip install -q -U peft==0.6.2 transformers==4.35.2 datasets==2.15.0 bitsandbytes==0.41.2.post2 trl==0.7.4 accelerate==0.24.1 wandb==0.16.3
! pip install -q -U langchain==0.1.13
! pip install -q -U safetensors>=0.3.1
! pip install -q -U faiss-cpu==1.7.4
! pip install -q tiktoken==0.6.0
! pip install -q sentence-transformers==2.3.1
! pip install -q pypdf==4.0.1
! pip install -q protobuf==4.25.2
! pip install -q lxml==5.1.0
! pip install -q rouge_score==0.1.2
! pip install -q beautifulsoup4 boto3
# ----------------


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
zsh:1: 0.3.1 not found

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip

In [12]:
# Importing required packages
# ----------------
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import CacheBackedEmbeddings, HuggingFaceEmbeddings
from langchain.document_loaders import PyPDFLoader
from langchain.vectorstores import FAISS
from langchain.storage import LocalFileStore
from langchain.llms import HuggingFacePipeline
from langchain.callbacks import StdOutCallbackHandler
from langchain_community.document_loaders import BSHTMLLoader
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import JsonOutputParser

from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM, BitsAndBytesConfig, pipeline
from datasets import load_dataset, Dataset
from rouge_score import rouge_scorer

import torch
import re
import os
import pickle
# ----------------


# SUTD Question Answering RAG system 
First, we set up the basic RAG system on SUTD content, as you have explored in assignment 3.

In [13]:
# Download SUTD's annual reports
! mkdir -p ./data

! wget -nc -P data https://www.sutd.edu.sg/SUTD/media/SUTD/SUTD_AnnualReport_2022_23.pdf
! wget -nc -P data https://www.sutd.edu.sg/SUTD/media/SUTD/SUTD_AnnualReport_2021.pdf
! wget -nc -P data https://www.sutd.edu.sg/SUTD/media/SUTD/SUTD_AnnualReport_2020.pdf


File ‘data/SUTD_AnnualReport_2022_23.pdf’ already there; not retrieving.

File ‘data/SUTD_AnnualReport_2021.pdf’ already there; not retrieving.

File ‘data/SUTD_AnnualReport_2020.pdf’ already there; not retrieving.



In [14]:
# Download html files from SUTD website
! curl --output data/Admission-Requirements.html https://www.sutd.edu.sg/Admissions/Undergraduate/Application/Admission-Requirements
! curl --output data/Application-Timeline.html https://www.sutd.edu.sg/Admissions/Undergraduate/Application/Application-Timeline
! curl --output data/Singapore-Cambridge-GCE-A-Level.html https://www.sutd.edu.sg/Admissions/Undergraduate/Application/Admission-Requirements/Singapore-Cambridge-GCE-A-Level
! curl --output data/Local-Diploma.html https://www.sutd.edu.sg/Admissions/Undergraduate/Application/Admission-Requirements/Local-Diploma
! curl --output data/NUS-High-School-Diploma.html https://www.sutd.edu.sg/Admissions/Undergraduate/Application/Admission-Requirements/NUS-High-School-Diploma
! curl --output data/International-Baccalaureate-Diploma.html https://www.sutd.edu.sg/Admissions/Undergraduate/Application/Admission-Requirements/International-Baccalaureate-Diploma-\(Singapore\)
! curl --output data/International-Qualifications.html https://www.sutd.edu.sg/Admissions/Undergraduate/Application/Admission-Requirements/International-Qualifications

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  512k    0  512k    0     0  1793k      0 --:--:-- --:--:-- --:--:-- 1798k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  515k    0  515k    0     0  3519k      0 --:--:-- --:--:-- --:--:-- 3506k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  519k    0  519k    0     0  3929k      0 --:--:-- --:--:-- --:--:-- 3937k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  518k    0  518k    0     0  4176k      0 --:--:-- --:--:-- --:--:-- 4212k
  % Total    % Received % Xferd  Average Speed   Tim

In [15]:
# Load the PDF documents and HTML files. Then use LangChain to split the documents into smaller text chunks.
data_root = "./data/"

pdf_filenames = [
    'SUTD_AnnualReport_2020.pdf',
    'SUTD_AnnualReport_2021.pdf',
    'SUTD_AnnualReport_2022_23.pdf',
]

html_filenames = [
    'Admission-Requirements.html',
    'Application-Timeline.html',
    'Singapore-Cambridge-GCE-A-Level.html',
    'Local-Diploma.html',
    'NUS-High-School-Diploma.html',
    'International-Baccalaureate-Diploma.html',
    'International-Qualifications.html'
]

pdf_metadata = [
    dict(year=2020, source=pdf_filenames[0]),
    dict(year=2021, source=pdf_filenames[1]),
    dict(year=2023, source=pdf_filenames[2])
]

html_metadata = [
    dict(year=2024, source=html_filenames[0]),
    dict(year=2024, source=html_filenames[1]),
    dict(year=2024, source=html_filenames[2]),
    dict(year=2024, source=html_filenames[3]),
    dict(year=2024, source=html_filenames[4]),
    dict(year=2024, source=html_filenames[5]),
    dict(year=2024, source=html_filenames[6])
]

# load pdf files, attach meta data
documents = []
for idx, file in enumerate(pdf_filenames):
    print("Load file", file)
    loader = PyPDFLoader(data_root + file)
    document = loader.load()
    for document_fragment in document:
        document_fragment.metadata = pdf_metadata[idx]
    documents += document

# load html files, attach meta data
for idx, file in enumerate(html_filenames):
    print("Load file", file)
    loader = BSHTMLLoader(data_root + file)
    document = loader.load()
    for document_fragment in document:
        # remove duplicate whitespace
        document_fragment.page_content = repr(re.sub(r"(?<=\n)(\s+)",r" ", document_fragment.page_content))
        document_fragment.metadata = html_metadata[idx]
    documents += document

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=200,
    chunk_overlap=10
)

docs = text_splitter.split_documents(documents)

print(f'# of Document Pages {len(documents)}')
print(f'# of Document Chunks: {len(docs)}')

Load file SUTD_AnnualReport_2020.pdf


Load file SUTD_AnnualReport_2021.pdf
Load file SUTD_AnnualReport_2022_23.pdf
Load file Admission-Requirements.html
Load file Application-Timeline.html
Load file Singapore-Cambridge-GCE-A-Level.html
Load file Local-Diploma.html
Load file NUS-High-School-Diploma.html
Load file International-Baccalaureate-Diploma.html
Load file International-Qualifications.html
# of Document Pages 148
# of Document Chunks: 1042


In [16]:
# Create embeddings of document chunks and store them in vector store for fast lookup
store = LocalFileStore("./cache/")

embed_model_id = 'sentence-transformers/all-MiniLM-L6-v2'

core_embeddings_model = HuggingFaceEmbeddings(
    model_name=embed_model_id
)

embedder = CacheBackedEmbeddings.from_bytes_store(
    core_embeddings_model, store, namespace=embed_model_id
)

vector_store = FAISS.from_documents(docs, embedder)

In [17]:
import os
os.environ["AWS_DEFAULT_REGION"] = "us-east-1"

In [18]:
# Load anthropic claude 3 sonnet using AWS bedrock
from langchain_community.chat_models.bedrock import BedrockChat
llm_sonnet = BedrockChat(model_id="anthropic.claude-3-sonnet-20240229-v1:0", model_kwargs={"temperature": 0.1})
llm_mistral_large = BedrockChat(model_id="mistral.mistral-large-2402-v1:0", model_kwargs={"temperature": 0.1})

In [19]:
# instantiate retriever model and callback handler for QA results
retriever = vector_store.as_retriever()
handler = StdOutCallbackHandler()

In [20]:
# build RAG question answering chain with a custom prompt template

template = """You are a helpful assistant. Use the following pieces of context to answer the question at the end.
Answer the following questions about the Singapore University of Technology and Design (SUTD).
Use three sentences maximum and keep the answer as concise as possible.

Context: {context}

Question: {question}

Helpful Answer:"""
custom_rag_prompt = PromptTemplate.from_template(template)


def format_docs(docs):
    docs_str = "\n\n".join(doc.page_content for doc in docs)
    # print(docs_str)
    return docs_str

rag_chain_sonnet = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | custom_rag_prompt
    | llm_sonnet
    | StrOutputParser()
)

rag_chain_mistral = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | custom_rag_prompt
    | llm_mistral_large
    | StrOutputParser()
)

In [22]:
# Test RAG with example question
# rag_chain_sonnet.invoke("What types of student organizations and clubs are available on campus?")

Great, we have a working LLM and RAG system about SUTD. Now it is time to generate the answers to the pre defined questions using the Sonnet LLM

In [31]:
# Now create answers to questions using the RAG pipeline

# QUESTION: For every question, generate an answer using the RAG system
# Store all answers in a list of strings 'answers_all'
# Extra points: check that there is diversity in the generated questions, i.e. they are not all the same or too similar.
# You can achieve this by checking that questions are not too similar to each other

with open('./part1-outputs/questions.txt', 'r') as f:
    questions_all = f.readlines()

import os

sonnet_answers = []
mistral_answers = []
filename_sonnet = "./part1-outputs/answers_sonnet.txt"
filename_mistral = "./part1-outputs/answers_mistral.txt"

# Check if the file exists for the sonnet answers
if os.path.exists(filename_sonnet):
  # Read the answers from the file
  with open(filename_sonnet, "r") as f:
    sonnet_answers = f.read().split("\n\n\n\n\n\n")
else:
  # Generate the answers
  for question in questions_all:
    answer = rag_chain_sonnet.invoke(question)
    print(f"question: {question}\n answer: {answer}")
    sonnet_answers.append(answer)

  # Write the answers to the file
  with open(filename_sonnet, "w") as f:
    for answer in sonnet_answers:
      f.write("\n\n\n\n\n\n".join(sonnet_answers))


# Check if the file exists: for mistral answers
if os.path.exists(filename_mistral):
  # Read the answers from the file
  with open(filename_mistral, "r") as f:
    mistral_answers = f.read().split("\n")
else:
  # Generate the answers
  for question in questions_all:
    answer = rag_chain_mistral.invoke(question)
    print(f"question: {question}\n answer: {answer}")
    mistral_answers.append(answer)

  # Write the answers to the file
  with open(filename_mistral, "w") as f:
    for answer in mistral_answers:
      f.write("\n".join(mistral_answers))
#---------------------------------
len(sonnet_answers)
len(mistral_answers)

200

Create a Sonnet dataset

In [32]:
# create huggingface dataset to make it easier to work with the data

# QUESTION: create a huggingface dataset object with the keys 'question' and 'answer' and the questions and answers you have generated, respectively
# shuffle the dataset. use a fixed seed.

#--- ADD YOUR SOLUTION HERE (5 points)---
from datasets import load_dataset, Dataset
data = {'question': questions_all, 'answer': sonnet_answers}
sutd_qa_dataset_sonnet = Dataset.from_dict(data)
#---------------------------------


In [33]:
# inspect schema and size of dataset
sutd_qa_dataset_sonnet

Dataset({
    features: ['question', 'answer'],
    num_rows: 200
})

In [34]:
# inspect first instance
sutd_qa_dataset_sonnet[0]

{'question': 'What are the core academic programs offered at SUTD?\n',
 'answer': 'Based on the context provided, the core academic programs offered at the Singapore University of Technology and Design (SUTD) are:\n\n1) Undergraduate programs, which are not explicitly listed but implied by mentions of "Transition Into SUTD" and "Integrated Learning Programme".\n\n2) Master\'s programs, including Master of Architecture, Master of Engineering (Research), Master of Innovation by Design, Master of Science in Security by Design, Master of Science in Urban Science, Policy and Planning, MSc in Technology and Design, and MTD (AI Empowered Built Environment).\n\n3) A Dual Master\'s program in Nano-Electronic Engineering and Design, offered in collaboration with CGU (likely referring to Claremont Graduate University).'}

In [35]:
# save dataset to disk
import pickle
with open('sutd_qa_dataset_sonnet.pkl', 'wb') as f:
    pickle.dump(sutd_qa_dataset_sonnet, f)

In [36]:
from huggingface_hub import login

# log in to huggingface, you need to put your huggingface access token
# https://huggingface.co/docs/hub/en/security-tokens

hf_access_token = "YOUR_HF_ACCESS"
login(token=hf_access_token)

Token has not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /home/jon/.cache/huggingface/token
Login successful


In [37]:
# push dataset to huggingface
sutd_qa_dataset_sonnet.push_to_hub("sutd_qa_dataset_sonnet")

Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 1461.94ba/s]
Uploading the dataset shards: 100%|██████████| 1/1 [00:00<00:00,  1.87it/s]


Create a Sonnet + Mistral dataset

In [38]:
from datasets import load_dataset, Dataset
data = {'question': questions_all + questions_all, 'answer': sonnet_answers + mistral_answers}
sutd_qa_dataset_sonnet_mistral = Dataset.from_dict(data)

In [41]:
sutd_qa_dataset_sonnet_mistral

Dataset({
    features: ['question', 'answer'],
    num_rows: 400
})

In [42]:
sutd_qa_dataset_sonnet_mistral[0]

{'question': 'What are the core academic programs offered at SUTD?\n',
 'answer': 'Based on the context provided, the core academic programs offered at the Singapore University of Technology and Design (SUTD) are:\n\n1) Undergraduate programs, which are not explicitly listed but implied by mentions of "Transition Into SUTD" and "Integrated Learning Programme".\n\n2) Master\'s programs, including Master of Architecture, Master of Engineering (Research), Master of Innovation by Design, Master of Science in Security by Design, Master of Science in Urban Science, Policy and Planning, MSc in Technology and Design, and MTD (AI Empowered Built Environment).\n\n3) A Dual Master\'s program in Nano-Electronic Engineering and Design, offered in collaboration with CGU (likely referring to Claremont Graduate University).'}

In [43]:
sutd_qa_dataset_sonnet_mistral[200]

{'question': 'What are the core academic programs offered at SUTD?\n',
 'answer': " The Singapore University of Technology and Design (SUTD) offers a variety of academic programs. At the graduate level, they offer Master's programs such as the Master of Architecture, Master of Engineering (Research), Master of Innovation by Design, and Master of Science in Security by Design, among others. They also have a dual Master's program in Nano-Electronic Engineering and Design with CGU. Additionally, SUTD provides opportunities for early matriculation and integrated learning."}

In [39]:
# save dataset to disk
import pickle
with open('sutd_qa_dataset_sonnet_mistral.pkl', 'wb') as f:
    pickle.dump(sutd_qa_dataset_sonnet_mistral, f)

In [40]:
# push dataset to huggingface
sutd_qa_dataset_sonnet_mistral.push_to_hub("sutd_qa_dataset_sonnet_mistral")

Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 595.61ba/s]
Uploading the dataset shards: 100%|██████████| 1/1 [00:01<00:00,  1.59s/it]
