## CA 3, LLMs Spring 2024

- **Name: Pouya Sadeghi**
- **Student ID: 810199447**

---
### This is due on **May 11th, 2024**, submitted via [elearn](https://elearn.ut.ac.ir/).
#### Your submission should be named using the following format: `CA3_LASTNAME_STUDENTID.ipynb`.

---

##### *How to do this problem set:*

- Some questions require writing Python code and computing results, and the rest of them have written answers. For coding problems, you will have to fill out all code blocks that say `WRITE YOUR CODE HERE`.

- For text-based answers, you should replace the text that says "Write your answer here..." with your actual answer.

- There is no penalty for using AI assistance on this homework as long as you fully disclose it in the final cell of this notebook (this includes storing any prompts that you feed to large language models). That said, anyone caught using AI assistance without proper disclosure will receive a zero on the assignment (we have several automatic tools to detect such cases). We're literally allowing you to use it with no limitations, so there is no reason to lie!

---

##### *Academic honesty*

- We will audit the Colab notebooks from a set number of students, chosen at random. The audits will check that the code you wrote actually generates the answers in your notebook. If you turn in correct answers on your notebook without code that actually generates those answers, we will consider this a serious case of cheating.

- We will also run automatic checks of Colab notebooks for plagiarism. Copying code from others is also considered a serious case of cheating.

---
Colab notebook: https://drive.google.com/file/d/19Kk9vsFbyBEQXVavfivyQHtLhqrmm7mC/view?usp=sharing

---

# Chain-of-Thought (CoT) (20 points)

If you have any further questions or concerns, contact the TA via email: mehdimohajeri@ut.ac.ir

LLMs have demonstrated good reasoning abilities. Furthermore, their capabilities can be further improved by incorporating reasoning techniques. One of the most notable developments in this area is the [Chain-of-Thought (CoT)](https://arxiv.org/abs/2201.11903), which was introduced by Google. This approach has shown promising results in improving the reasoning capabilities of language models across a variety of tasks. Can you explain what CoT is and how it works? (2.5 Points)

\# WRITE YOUR ANSWER HERE

Chain-of-Thought (CoT) is a prompting technique in which the model is led to write out intermediate reasoning steps before giving its final answer, instead of answering directly. Breaking a problem into smaller parts and solving them one at a time lets the model focus on one part of the problem at a time rather than on the entire problem at once, which makes multi-step problems easier to solve. CoT can be applied to a wide range of tasks, from simple math problems to complex real-world and reasoning problems. There are two variants of this method, few-shot CoT and zero-shot CoT, which were introduced at almost the same time. Few-shot CoT places worked examples with step-by-step solutions in the prompt, while zero-shot CoT only appends a trigger phrase such as "Let's think step by step".

In this section, you should use the CoT technique. First, you need to load the [Phi-2 model](https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/). This model was introduced by Microsoft as a small LLM.

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.set_default_device("cuda")

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

def generate_output(model, input, max_length=500, apply_template=True, temperature=False):
  if apply_template:
    input = f"Question: {input}\nOutput:"
  input = tokenizer(input, return_tensors="pt", return_attention_mask=True)
  if temperature:
    outputs = model.generate(
      **input,
      max_length=max_length,
      temperature=temperature,
      do_sample=True)
  else:
    outputs = model.generate(**input, max_length=max_length)
  text = tokenizer.batch_decode(outputs)[0]
  return text



Use Phi-2 to answer the questions below with and without CoT. Compare results and explain their difference. (4 Points)

There are two variants of CoT, known as `few-shot CoT` (the provided paper) and `zero-shot CoT`.
We will try both settings.

In [None]:
questions = ["Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much did she earn?",
"Jack is stranded on a desert island. He wants some salt to season his fish. He collects 2 liters of seawater in an old bucket. If the water is 20% salt, how many ml of salt will Jack get when all the water evaporates?",
"John volunteers at a shelter twice a month for 3 hours at a time. How many hours does he volunteer per year?",
"There are 32 tables in a hall. Half the tables have 2 chairs each, 5 have 3 chairs each and the rest have 4 chairs each. How many chairs in total are in the hall?",
"Bert fills out the daily crossword puzzle in the newspaper every day. He uses up a pencil to fill out the puzzles every two weeks. On average, it takes him 1050 words to use up a pencil. How many words are in each crossword puzzle on average?"
]

## Direct prompting

In [None]:
# WRITE YOUR CODE HERE
## Without CoT
direct_results = []
for question in questions:
  direct_results.append(generate_output(model, question))


## Zero-shot CoT


In [None]:
## With CoT
cot_results = []
cot_thinking = []
cot_prompt = "{question}\nlet's think step by step."
conc_prompt = "{question}\nContext: {cot}\nNow, what is the final result?"
for question in questions:
  cot = generate_output(model, cot_prompt.format(question=question))
  result = generate_output(model, conc_prompt.format(question=question, cot=cot))
  cot_results.append(result)
  cot_thinking.append(cot)


## Few-shot CoT

We use it in a one-shot format, i.e., with a single demonstration.

In [None]:
## With CoT
cot_results2 = []
cot_prompt2 = """\
Question: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
Output: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.

Question: {question}
Output: \
"""
for question in questions:
  cot = generate_output(model, cot_prompt2.format(question=question), apply_template=False)
  cot_results2.append(cot)


## Compare

In [None]:
from textwrap import fill

itr = enumerate(zip(questions, direct_results, cot_results, cot_results2))
subsequent_indent = '\t'+' '*15
for i, (que, direct, cot, cot2) in itr:
  print(f"Question ({i+1}):")
  print(f"\t{'(q)':<13}: {fill(que, width=100, initial_indent='', subsequent_indent=subsequent_indent)}")
  print(f"\t{'Direct':<13}: {fill(direct, width=100, initial_indent='', subsequent_indent=subsequent_indent)}")
  print(f"\tZero-shot CoT: {fill(cot, width=100, initial_indent='', subsequent_indent=subsequent_indent)}")
  print(f"\tFew-shot CoT : {fill(cot2, width=100, initial_indent='', subsequent_indent=subsequent_indent)}")
  print()

Question (1):
	(q)          : Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much
	               did she earn?
	Direct       : Question: Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting.
	               How much did she earn? Output: Weng earned $9 for babysitting. <|endoftext|>
	Zero-shot CoT: Question: Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting.
	               How much did she earn? Context: Question: Weng earns $12 an hour for babysitting.
	               Yesterday, she just did 50 minutes of babysitting. How much did she earn? let's
	               think step by step. Output: Let's convert 50 minutes to hours. Since there are 60
	               minutes in an hour, 50 minutes is equal to 50/60 = 5/6 hours. To find out how much
	               Weng earned, we can multiply her hourly rate of $12 by the number of hours she
	               babysat, wh

\# WRITE YOUR ANSWER HERE (Compare results)

First, let's check how the different approaches behaved on each question:

| Question Number | Direct       | Few-shot CoT               | Zero-shot CoT             |
|-----------------|--------------|----------------------------|---------------------------|
| 1               | Wrong answer | Almost correct (low error) | Correct + junk generation |
| 2               | Correct      | Correct                    | Correct                   |
| 3               | Wrong        | Correct                    | Wrong                     |
| 4               | Wrong        | Wrong                      | Correct                   |
| 5               | Correct      | Correct                    | Correct + junk generation |

As we can see, for every question at least one CoT method found the correct answer. Few-shot CoT seems to work slightly better than the zero-shot variant, but it requires carefully designed demonstration(s). The direct method achieves the lowest score; it generates fewer tokens and is therefore faster (and cheaper), but it is less accurate. Overall, CoT seems to be a good approach for improving the model's performance on reasoning tasks (such as arithmetic reasoning and mathematics).
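
As a sanity check on the table above, the reference answers can be computed directly from the problem statements. This is a small sketch with illustrative names; for Question 5 it assumes that "every two weeks" means 14 daily puzzles per pencil.

```python
# Reference answers for the five questions, computed from the problem statements.
reference_answers = {
    1: 12 * 50 / 60,                        # $12 per hour for 50 minutes
    2: 2 * 1000 * 20 / 100,                 # 20% of 2 liters, expressed in ml
    3: 2 * 3 * 12,                          # twice a month, 3 hours, 12 months
    4: 16 * 2 + 5 * 3 + (32 - 16 - 5) * 4,  # chairs over all 32 tables
    5: 1050 / 14,                           # words per pencil / puzzles per pencil
}

for number, answer in reference_answers.items():
    print(f"Question {number}: {answer:g}")
```

For example, Question 1 evaluates to 10, so the direct answer of $9 shown in the output above is wrong.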


## Other Methods for Reasoning

There are many other approaches to utilize the reasoning abilities of LLMs. Describe the [Tree-of-Thought (ToT)](https://arxiv.org/abs/2305.10601) and [Self-Consistency](https://arxiv.org/abs/2203.11171) within these approaches. (3.5 Points)

\# WRITE YOUR ANSWER HERE (Describe methods)

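**Tree-of-Thought (ToT):** It is designed to enhance LLMs’ problem-solving abilities by allowing them to consider multiple reasoning paths during inference. Intermediate reasoning steps ("thoughts") form the nodes of a tree: at each node the LLM proposes several candidate next thoughts and evaluates them itself, and a search algorithm such as BFS or DFS explores the branches until a leaf containing a solution is found. Some of its advantages: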
- Allows LLMs to deliberate by considering various reasoning paths and self-evaluating choices.
- LLMs can look ahead or backtrack (somehow correct itself) when necessary to make global decisions.
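
The search loop behind ToT can be sketched as a breadth-first search with a fixed beam width. This is a toy illustration, not the paper's implementation: `propose`, `evaluate`, and `is_solution` are hypothetical stand-ins for what would be LLM calls (thought generation and self-evaluation), and the number puzzle is made up for the example.

```python
def tot_bfs(root, propose, evaluate, is_solution, beam_width=2, max_depth=4):
    """Breadth-first search over partial reasoning paths, keeping the best few."""
    frontier = [root]
    for _ in range(max_depth):
        # Expand every kept path with all of its candidate next thoughts.
        candidates = [child for path in frontier for child in propose(path)]
        for candidate in candidates:
            if is_solution(candidate):
                return candidate
        # Keep only the most promising paths (the "self-evaluation" step).
        candidates.sort(key=evaluate, reverse=True)
        frontier = candidates[:beam_width]
        if not frontier:
            break
    return None

# Toy problem: starting from 1, reach TARGET using "+3" or "*2" steps.
TARGET = 11

def propose(path):
    last = path[-1]
    return [path + (last + 3,), path + (last * 2,)]

def evaluate(path):
    # Higher is better: paths whose last value is closer to the target.
    return -abs(TARGET - path[-1])

def is_solution(path):
    return path[-1] == TARGET

solution = tot_bfs((1,), propose, evaluate, is_solution)
print(solution)
```

Swapping the sorted beam for a stack with backtracking would give the DFS variant of the same idea.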

**Self-Consistency (CoT-SC):** Self-consistency with Chain of Thought (CoT-SC) is an ensemble approach that builds upon the Chain of Thought method (CoT). It is designed to improve CoT performance by letting the model try different reasoning paths. CoT-SC samples several independent chains of thought and then selects the most frequent final answer. It improves upon CoT because it considers different reasoning paths and aims for consistency across them. The idea is that the correct answer is the one reached most frequently and consistently by different reasoning paths.

Now, implement Self-Consistency to answer the questions of the previous section. (6 Points)

In [None]:
import re
from collections import Counter

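def get_last_digits(s):
  # Return the last number (integer or decimal) found in s, or None if s contains no digits.
  ss = re.findall(r"[-+]?(?:\d*\.?\d+)", s)
  return ss[-1] if ss else None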

def most_common(lst):
    occurrence_count = Counter(lst)
    return occurrence_count.most_common(1)[0][0]

In [None]:
# WRITE YOUR CODE HERE
number_of_samples = 3
sc_results = []
sc_paths = []

for question in questions:
  run_results = [generate_output(model, cot_prompt2.format(question=question), temperature=0.7, apply_template=False) for _ in range(number_of_samples)]
  sc_paths.append(run_results)
  votes = [get_last_digits(s) for s in run_results]
  most_voted = most_common(votes)
  sc_results.append(most_voted)
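
One caveat with voting on raw strings is that `"10"` and `"10.00"` count as different answers. A possible variant, sketched here with hypothetical helper names (`extract_last_number`, `majority_vote`) and made-up sample outputs, converts each vote to a number first and skips paths that contain no number:

```python
import re
from collections import Counter

def extract_last_number(text):
    """Return the last number in text as a float, or None if there is none."""
    matches = re.findall(r"[-+]?\d*\.?\d+", text)
    return float(matches[-1]) if matches else None

def majority_vote(texts):
    """Pick the most frequent numeric answer across sampled reasoning paths."""
    votes = [v for v in map(extract_last_number, texts) if v is not None]
    if not votes:
        return None
    return Counter(votes).most_common(1)[0][0]

# Made-up sample outputs, for illustration only.
sample_paths = [
    "Step 2: 12 * 0.83 = 10 dollars. The answer is 10 dollars.",
    "She worked 5/6 of an hour, so she earned $10.00",
    "Weng earned 9 dollars. The answer is 9.",
]
answer = majority_vote(sample_paths)
print(answer)
```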

In [None]:
from textwrap import fill

itr = enumerate(zip(questions, sc_results, sc_paths))
subsequent_indent = '\t'+' '*17
subsequent_indent2 = '\t'+' '*6
for i, (que, res, paths) in itr:
  print(f"Question ({i+1}):")
  print(f"\t{'(q)':<15}: {fill(que, width=100, initial_indent='', subsequent_indent=subsequent_indent)}")
  print(f"\t{'Answer':<15}: {res}")
  print("\tDifferent paths:")
  for j, path in enumerate(paths, start=1):  # j avoids shadowing the outer loop index i
    print(f"\t+ ({j}) {fill(path, width=100,initial_indent='', subsequent_indent=subsequent_indent2)}")
  print()

Question (1):
	(q)            : Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much
	                 did she earn?
	Answer         : 10
	Different paths:
	+ (1) Question: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis
	      balls. How many tennis balls does he have now? Output: Roger started with 5 balls. 2 cans of
	      3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.  Question: Weng earns
	      $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much did
	      she earn? Output:  Step 1: Convert the time (50 minutes) into hours. Since there are 60
	      minutes in an hour, 50 minutes is 50/60 = 0.83 hours. Step 2: Multiply the hourly rate by the
	      time spent. So, 12 * 0.83 = 10 dollars. The answer is 10 dollars. <|endoftext|>
	+ (2) Question: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis
	      balls

Consider LLMs' features and propose a new approach based on them to enhance LLMs' reasoning abilities. Why do you believe this approach could enhance LLMs' reasoning abilities? (4 Points)

\# WRITE YOUR ANSWER HERE

**Ensemble CoT-SC:** This approach builds on the existing CoT-SC, but samples reasoning paths from several different models instead of repeatedly sampling from the same model. The voting step is the same as in CoT-SC. The motivation is that models trained differently (data, architecture, number of epochs, ...) behave differently and tend to make different mistakes, much like different people bring different points of view to the same problem, so an answer that several models agree on is more likely to be correct.

---
**A suggestion from AI (MS Copilot):**
The following method was proposed by Microsoft's Copilot.

***Contextual Pathway Enrichment (CPE)***

1. **Motivation**:
   - LLMs, such as GPT-4, exhibit remarkable capabilities in understanding context and generating coherent text. However, their reasoning abilities can be further improved.
   - Existing approaches like Tree of Thoughts (ToT) and Self-Consistency (CoT-SC) have demonstrated the value of exploring multiple reasoning paths. CPE builds upon this foundation.

2. **Key Features of CPE**:
   - **Dynamic Pathway Exploration**:
     - LLMs maintain a dynamic graph of contextual pathways during inference. Each pathway represents a sequence of intermediate steps taken to arrive at a decision.
     - As the model processes input, it dynamically constructs and updates these pathways, considering both local and global context.
   - **Pathway Enrichment**:
     - LLMs enrich pathways by incorporating external knowledge, domain-specific facts, and logical rules.
     - For example, during a medical diagnosis task, the model can consult a medical knowledge base to validate its reasoning steps.
   - **Adaptive Pathway Pruning**:
     - Not all pathways are equally informative. CPE includes an attention mechanism that prunes less relevant pathways.
     - The model learns to allocate attention to pathways based on their contribution to the final decision.
   - **Feedback Loop**:
     - After generating an output, the model evaluates the quality of its reasoning pathways.
     - If inconsistencies or errors are detected, the model retraces its steps, revisits relevant context, and adjusts the pathways.
     - This feedback loop promotes self-improvement over time.

3. **Why CPE Could Enhance LLMs' Reasoning**:
   - **Richer Contextual Understanding**:
     - By maintaining pathways, LLMs gain a deeper understanding of context. They can reason over longer spans and consider diverse perspectives.
   - **External Knowledge Integration**:
     - Pathway enrichment allows LLMs to tap into external knowledge sources (e.g., databases, scientific literature, ontologies).
     - This integration enhances factual accuracy and domain-specific reasoning.
   - **Error Correction and Robustness**:
     - The feedback loop enables error correction. If a pathway leads to an incorrect conclusion, the model can revise its reasoning.
     - CPE promotes robustness by minimizing biases and improving logical consistency.

4. **Experimental Validation**:
   - Evaluate CPE on diverse tasks: question answering, dialogue, code generation, etc.
   - Compare against baselines (e.g., ToT, CoT-SC) to measure improvements in reasoning quality and efficiency.

In summary, Contextual Pathway Enrichment leverages LLMs' existing features while introducing dynamic pathways, external knowledge, and self-correction. By doing so, it enhances their reasoning abilities, making them more reliable and contextually aware.

# PEFT (30 + 5 points)

If you have any further questions or concerns, contact the TA via email: pedram.rostami@ut.ac.ir

## Why We Are Using PEFT (5 points)

In this question, we're delving into PEFT. First, let's start by exploring why PEFT is crucial when training LLMs. For instance, let's consider the scenario where we want to train the [microsoft/phi-2](https://huggingface.co/microsoft/phi-2) model. To get started, take a look at the Huggingface blog post on [model memory anatomy](https://huggingface.co/docs/transformers/en/model_memory_anatomy) to estimate how much memory we'll require. Just assume we're sticking to pure fp16 with Adam optimizer and a batch size of 1. (4 points)

\# WRITE YOUR ANSWER HERE

If we used fp32 (as it is):
- **Model Weights:** 4 * (2.7B) = 10.8 GB
- **Optimizer States:** 8 * (2.7B) = 21.6 GB (if we use normal AdamW)
- **Gradients:** 4 * (2.7B) = 10.8 GB (can't be reduced)
- **Forward Activations:** let's just skip this
- **In Total:** 10.8 + 21.6 + 10.8 = 43.2 GB

Now, if we use fp16 (model weights and optimizer states are halved, gradients are kept as above):
- **In Total:** $\frac{10.8 + 21.6}{2} + 10.8 = 27$ GB
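
The arithmetic above can be reproduced with a few lines of Python. This is a rough sketch under the same assumptions as the estimate (2.7B parameters, 1 GB taken as 10^9 bytes, forward activations ignored); `training_memory_gb` is an illustrative helper, not a library function.

```python
N_PARAMS = 2.7e9  # approximate parameter count of phi-2
GB = 1e9          # 1 GB taken as 10^9 bytes, as in the estimate above

def training_memory_gb(n_params, weight_bytes, optimizer_bytes, gradient_bytes):
    """Bytes per parameter for weights, optimizer states, and gradients, summed in GB."""
    return n_params * (weight_bytes + optimizer_bytes + gradient_bytes) / GB

# fp32: 4 bytes for weights, 8 for the AdamW states, 4 for gradients.
fp32_total = training_memory_gb(N_PARAMS, 4, 8, 4)
# fp16 as estimated above: weights and optimizer states halved, gradients unchanged.
fp16_total = training_memory_gb(N_PARAMS, 2, 4, 4)

print(f"fp32: {fp32_total:.1f} GB")
print(f"fp16: {fp16_total:.1f} GB")
```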

Compare your estimation with the memory estimation provided by the [Model Memory Calculator](https://huggingface.co/spaces/hf-accelerate/model-memory-usage). (1 point)

\# WRITE YOUR ANSWER HERE

For fp32:
- **Model Weights:** 9.88 GB
- **Backward pass:** 19.76 GB
- **Gradients:** 9.88 GB (can't be reduced)
- **Training using Adam (Peak vRAM):** 39.53 GB

For fp16/bfloat16:
- **Model Weights:** 9.88 GB
- **Backward pass:** 19.76 GB
- **Gradients:** 14.82 GB
- **Training using Adam (Peak vRAM):** 19.76 GB


## Preparing Dataset (5 points)

We're going to train the phi-2 model for a question generation task based on passages. For this purpose, we're using the Super-NaturalInstructions dataset, which comprises instruction tuning datasets for over 1600 tasks across different languages. While the dataset is available on the [Huggingface Hub](https://huggingface.co/datasets/Muennighoff/natural-instructions), downloading all its components consumes considerable time. Consequently, we're opting to download only the English Question Generation segment.

In [None]:
!wget https://huggingface.co/datasets/Muennighoff/natural-instructions/resolve/main/train/task001_quoref_question_generation_train.jsonl


Read the dataset file and convert it into a `dataset` object. Then, split the dataset, selecting 95% for the training set and 5% for the test set. (5 points)

In [None]:
!pip install -qU datasets

In [None]:
# WRITE YOUR CODE HERE
import datasets

dataset = datasets.load_dataset('json', data_files='task001_quoref_question_generation_train.jsonl')["train"]
dataset_splited = dataset.train_test_split(test_size=0.05, shuffle=True, seed=42)
train_dataset, test_dataset = dataset_splited["train"], dataset_splited["test"]


In [None]:
def apply_alpaca_template(instruction, input=None):
  if input is not None:
    return f"""\
### Instruction:
{instruction}

### Input:
{input}

### Response:
"""
  return f"""\
### Instruction:
{instruction}

### Response:
"""

def generate_output(model, input, max_length=500, temperature=False):
  input = tokenizer(input, return_tensors="pt", return_attention_mask=True).to("cuda" if torch.cuda.is_available() else "cpu")
  if temperature:
    outputs = model.generate(
      **input,
      max_length=max_length,
      temperature=temperature,
      do_sample=True)
  else:
    outputs = model.generate(**input, max_length=max_length)
  text = tokenizer.batch_decode(outputs)[0]
  return text

Choose random samples from the test set, apply the [Alpaca template](https://github.com/tatsu-lab/stanford_alpaca?tab=readme-ov-file#data-release) to them, and obtain the model outputs (If you are using the [sample code](https://huggingface.co/microsoft/phi-2#sample-code) provided by Microsoft for using the model, please comment out the `torch.set_default_device("cuda")` line to conserve memory. Instead, you can move the model to the GPU using the `.to` function after loading it.). (5 points)

In [None]:
import random

NUMBER_OF_SAMPLES = 3

random.seed(42)
test_samples = random.sample(list(test_dataset), NUMBER_OF_SAMPLES)

In [None]:
from textwrap import fill

def pretty_print_qa_samples(sample, generated: str, init_indent='\t'):
  subsequent_indent = init_indent + " " * 14
  instruction_idx = generated.find("### Instruction:")
  passage_idx = generated.find("### Input:")
  response_idx = generated.find("### Response:")
  print(f"{init_indent}{'(Instruction)':<12}: {fill(generated[instruction_idx:passage_idx], width=90, initial_indent='', subsequent_indent=subsequent_indent)}")
  print(f"{init_indent}{'(passage)':<12}: {fill(sample['inputs'], width=90, initial_indent='', subsequent_indent=subsequent_indent)}")
  print(f"{init_indent}{'(generated)':<12}: {fill(generated[response_idx:], width=90, initial_indent='', subsequent_indent=subsequent_indent)}")
  print(f"{init_indent}{'(target)':<12}: {fill(sample['targets'], width=90, initial_indent='', subsequent_indent=subsequent_indent)}")
  print()

## Pretrained Model (5 points)

In [None]:
# WRITE YOUR CODE HERE
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = (AutoModelForCausalLM.from_pretrained("microsoft/phi-2", torch_dtype="auto", trust_remote_code=True).
         to("cuda" if torch.cuda.is_available() else "cpu"))
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token



In [None]:
for i, sample in enumerate(test_samples, start=1):
  inst = sample["definition"]
  input = sample["inputs"]
  prompt = apply_alpaca_template(inst, input)
  output = generate_output(model, prompt, max_length=1000)

  print(f"Sample ({i})", "*"*110)
  pretty_print_qa_samples(sample, output)



Sample (1) **************************************************************************************************************
	(Instruction): ### Instruction: In this task, you're given passages that contain mentions of names of
	              people, places, or things. Some of these mentions refer to the same person,
	              place, or thing. Your job is to write questions that evaluate one's
	              understanding of such references. Good questions are expected to link
	              pronouns (she, her, him, his, their, etc.) or other mentions to people,
	              places, or things to which they may refer. Do not ask questions that can be
	              answered correctly without understanding the paragraph or having multiple
	              answers. Avoid questions that do not link phrases referring to the same
	              entity. For each of your questions, the answer should be one or more
	              phrases in the paragraph, and it should be unambiguous.
	(passa



Sample (2) **************************************************************************************************************
	(Instruction): ### Instruction: In this task, you're given passages that contain mentions of names of
	              people, places, or things. Some of these mentions refer to the same person,
	              place, or thing. Your job is to write questions that evaluate one's
	              understanding of such references. Good questions are expected to link
	              pronouns (she, her, him, his, their, etc.) or other mentions to people,
	              places, or things to which they may refer. Do not ask questions that can be
	              answered correctly without understanding the paragraph or having multiple
	              answers. Avoid questions that do not link phrases referring to the same
	              entity. For each of your questions, the answer should be one or more
	              phrases in the paragraph, and it should be unambiguous.
	(passa

In [None]:
import gc
del model, tokenizer

gc.collect()
if torch.cuda.is_available():
  torch.cuda.empty_cache()

## Fine-tuning with LoRA (15 + 5 points)

In this phase, we're fine-tuning the phi-2 model on a question generation dataset. To begin, we need to format our dataset into the instruction tuning format. For this task, we can employ `DataCollatorForCompletionOnlyLM`. Look at the [example](https://huggingface.co/docs/trl/en/sft_trainer#train-on-completions-only) in the HuggingFace documentation and instantiate the data collator using the Alpaca template. (3 points)
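
To make "train on completions only" concrete: the collator keeps the loss on the response tokens and masks everything before them. The sketch below imitates that idea on plain lists of token ids. It is a simplified illustration rather than TRL's actual code; `mask_prompt_tokens` is a hypothetical helper and the ids are made up. The value `-100` is the label index that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # labels with this value do not contribute to the loss

def mask_prompt_tokens(input_ids, response_template_ids):
    """Copy input_ids into labels, masking everything up to and including the template."""
    n = len(response_template_ids)
    labels = list(input_ids)
    for start in range(len(input_ids) - n + 1):
        if input_ids[start:start + n] == response_template_ids:
            end = start + n
            labels[:end] = [IGNORE_INDEX] * end
            return labels
    # Template not found: mask the whole example so it adds nothing to the loss.
    return [IGNORE_INDEX] * len(input_ids)

# Made-up ids: [instruction..., template (90, 91), response (42, 43)]
example_ids = [5, 6, 7, 90, 91, 42, 43]
example_labels = mask_prompt_tokens(example_ids, [90, 91])
print(example_labels)
```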

In [None]:
!pip install -qU trl peft accelerate

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
tokenizer.add_special_tokens({'pad_token': tokenizer.eos_token})
tokenizer.padding_side = "right"



In [None]:
# WRITE YOUR CODE HERE
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM

def formatting_prompts_func(sample):
    output_texts = []
    for i in range(len(sample['definition'])):
        text = f"""\
### Instruction:
{sample["definition"][i]}

### Input:
{sample["inputs"][i]}

### Response:
{sample["targets"][i]}
"""
        output_texts.append(text)
    return output_texts

response_template = "### Response:"
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)



Refer to the HuggingFace [documentation](https://huggingface.co/docs/trl/en/sft_trainer#training-adapters) and instantiate the Lora config. (3 points)

In [None]:
# WRITE YOUR CODE HERE
from peft import LoraConfig

peft_config = LoraConfig(
    r=4,
    lora_alpha=8,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
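
To get a feeling for what `r=4` and `lora_alpha=8` mean, the sketch below counts the trainable parameters that one LoRA adapter adds to a single square projection and computes the standard `alpha / r` scaling. The hidden size of 2560 is an assumption about phi-2's attention projections, and the helper names are hypothetical.

```python
def lora_param_count(in_features, out_features, r):
    """Trainable parameters of one adapter: A has shape (r, in), B has shape (out, r)."""
    return r * in_features + out_features * r


def lora_scaling(lora_alpha, r):
    """Factor applied to the low-rank update B @ A in standard LoRA."""
    return lora_alpha / r


hidden = 2560  # assumed hidden size of one attention projection
full = hidden * hidden  # weights of the frozen projection (bias ignored)
added = lora_param_count(hidden, hidden, r=4)

print(added, full, f"{added / full:.4%}")
print(lora_scaling(lora_alpha=8, r=4))
```

Under these assumptions the adapter adds 20,480 trainable weights next to roughly 6.5 million frozen ones, which is why LoRA fits in memory where full fine-tuning would not.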

Configure other training arguments. [Here](https://huggingface.co/docs/transformers/v4.40.1/en/main_classes/trainer#transformers.TrainingArguments) is a list of available options. Consider using a small batch size to prevent CUDA out of memory errors. You can augment batch size artificially through gradient accumulation. Enabling gradient checkpointing can further save memory. You may train the model for tens of steps. (3 points)

In [None]:
import torch
from transformers import AutoModelForCausalLM

model = (AutoModelForCausalLM.from_pretrained(
  "microsoft/phi-2",
  torch_dtype=torch.bfloat16,
  trust_remote_code=True).
         to("cuda" if torch.cuda.is_available() else "cpu"))





Take a look at the HuggingFace [documentation](https://huggingface.co/docs/trl/en/sft_trainer) on supervised fine-tuning trainers. Instantiate the trainer and train the model. (Note that you should initialize the phi-2 model with `bfloat16` or `float16` dtype to avoid CUDA out-of-memory errors.) (3 points)

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    save_total_limit=0,
    report_to="none",
    auto_find_batch_size=True,
    gradient_checkpointing=True,
    gradient_accumulation_steps=4,
)
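
Gradient accumulation sums the gradients of several small batches before each optimizer update, so the effective batch size is the product of the per-device batch size and the number of accumulation steps. The arithmetic can be sketched as follows; the helper names and the sample count of 1000 are hypothetical, and the exact rounding of the step count depends on the trainer.

```python
import math


def effective_batch_size(per_device_batch_size, accumulation_steps, num_devices=1):
    """Number of samples that contribute to one optimizer update."""
    return per_device_batch_size * accumulation_steps * num_devices


def approx_optimizer_steps(num_samples, per_device_batch_size, accumulation_steps):
    """Rough number of optimizer updates in one epoch on a single device."""
    batch = effective_batch_size(per_device_batch_size, accumulation_steps)
    return math.ceil(num_samples / batch)


print(effective_batch_size(4, 4))           # 16
print(approx_optimizer_steps(1000, 4, 4))   # 63
```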

In [None]:
print("Total Samples:", len(train_dataset))
train_dataset_splited = train_dataset.train_test_split(train_size=0.01, shuffle=True, seed=42)["train"]  # Training on the full dataset would take too long with our resources, so we keep a random 1% subset
print("Splited Samples:", len(train_dataset_splited))

Total Samples: 20726
Splited Samples: 207


In [None]:
# WRITE YOUR CODE HERE
trainer = SFTTrainer(
    model,
    train_dataset=train_dataset_splited,
    formatting_func=formatting_prompts_func,
    data_collator=collator,
    tokenizer=tokenizer,
    peft_config=peft_config,
    max_seq_length=None,
    args=training_args,
    packing=False,
)

Map:   0%|          | 0/207 [00:00<?, ? examples/s]

Get the final model from the trainer and merge the LoRA weights with it. Then, test the model with the inputs you gave to the pretrained model and compare the results. (3 points)

In [None]:
# WRITE YOUR CODE HERE
trainer.train()

Step,Training Loss


TrainOutput(global_step=207, training_loss=0.0, metrics={'train_runtime': 104.2636, 'train_samples_per_second': 1.985, 'train_steps_per_second': 1.985, 'total_flos': 2081291625154560.0, 'train_loss': 0.0, 'epoch': 1.0})

In [None]:
trainer.model

PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): PhiForCausalLM(
      (model): PhiModel(
        (embed_tokens): Embedding(51200, 2560)
        (embed_dropout): Dropout(p=0.0, inplace=False)
        (layers): ModuleList(
          (0-31): 32 x PhiDecoderLayer(
            (self_attn): PhiSdpaAttention(
              (q_proj): lora.Linear(
                (base_layer): Linear(in_features=2560, out_features=2560, bias=True)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.05, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=2560, out_features=4, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=4, out_features=2560, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
              )
              (k_proj): Linear(in_features

In [None]:
trainer.model.merge_and_unload()

PhiForCausalLM(
  (model): PhiModel(
    (embed_tokens): Embedding(51200, 2560)
    (embed_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-31): 32 x PhiDecoderLayer(
        (self_attn): PhiSdpaAttention(
          (q_proj): Linear(in_features=2560, out_features=2560, bias=True)
          (k_proj): Linear(in_features=2560, out_features=2560, bias=True)
          (v_proj): Linear(in_features=2560, out_features=2560, bias=True)
          (dense): Linear(in_features=2560, out_features=2560, bias=True)
          (rotary_emb): PhiRotaryEmbedding()
        )
        (mlp): PhiMLP(
          (activation_fn): NewGELUActivation()
          (fc1): Linear(in_features=2560, out_features=10240, bias=True)
          (fc2): Linear(in_features=10240, out_features=2560, bias=True)
        )
        (input_layernorm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
        (resid_dropout): Dropout(p=0.1, inplace=False)
      )
    )
    (final_layernorm): LayerNorm((256

In [None]:
model = trainer.model.merge_and_unload()

In [None]:
for i, sample in enumerate(test_samples, start=1):
  inst = sample["definition"]
  task_input = sample["inputs"]  # renamed to avoid shadowing the built-in input()
  prompt = apply_alpaca_template(inst, task_input)
  output = generate_output(model, prompt, max_length=1000)

  print(f"Sample ({i})", "*"*110)
  pretty_print_qa_samples(sample, output)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Sample (1) **************************************************************************************************************
	(Instruction): ### Instruction: In this task, you're given passages that contain mentions of names of
	              people, places, or things. Some of these mentions refer to the same person,
	              place, or thing. Your job is to write questions that evaluate one's
	              understanding of such references. Good questions are expected to link
	              pronouns (she, her, him, his, their, etc.) or other mentions to people,
	              places, or things to which they may refer. Do not ask questions that can be
	              answered correctly without understanding the paragraph or having multiple
	              answers. Avoid questions that do not link phrases referring to the same
	              entity. For each of your questions, the answer should be one or more
	              phrases in the paragraph, and it should be unambiguous.
	(passa

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Sample (2) **************************************************************************************************************
	(Instruction): ### Instruction: In this task, you're given passages that contain mentions of names of
	              people, places, or things. Some of these mentions refer to the same person,
	              place, or thing. Your job is to write questions that evaluate one's
	              understanding of such references. Good questions are expected to link
	              pronouns (she, her, him, his, their, etc.) or other mentions to people,
	              places, or things to which they may refer. Do not ask questions that can be
	              answered correctly without understanding the paragraph or having multiple
	              answers. Avoid questions that do not link phrases referring to the same
	              entity. For each of your questions, the answer should be one or more
	              phrases in the paragraph, and it should be unambiguous.
	(passa

\# WRITE YOUR ANSWER HERE

We know that fine-tuning LLMs on Colab or Kaggle notebooks can be a bit tricky, and fine-tuning phi-2 for this task may require more GPU hours. The main point of this question is to teach you how to train your model using HuggingFace packages. So, it's okay if your model doesn't produce optimal results. However, there are 5 additional points available if it can generate better results :)

# RAG (50 points)

If you have any further questions or concerns, contact the TA via email: alisalemi@ut.ac.ir

## Install Requirements

In [None]:
%pip install -q langchain
%pip install -q ctransformers
%pip install -q sentence_transformers
%pip install -q datasets
%pip install -q rank_bm25
%pip install -q faiss-gpu
%pip install -q arxiv
%pip install -q pymupdf
%pip install -q scikit-learn
%pip install -q arxivloader


## 1. An Overview of LangChain (10 pt)

LangChain is an open-source framework designed to simplify the creation of applications using LLMs. It provides a standard interface for chains, lots of integrations with other tools, and end-to-end chains for common applications.

In this overview, we will provide a step-by-step guide on how to construct a basic application using LangChain. This application will fetch country-related information from a Large Language Model. For this purpose, we will be utilizing the LLaMa 2 chat 7B as our base model.

In [None]:
from langchain_community.llms import CTransformers

model = CTransformers(
  model="TheBloke/Llama-2-7B-Chat-GGUF",
  model_file="llama-2-7b-chat.Q8_0.gguf",
  model_type="llama",
  config={
    "gpu_layers": 50
  }
)


### 1.1 GGUF Format (3 pt)

Write a brief paragraph discussing the GGUF format and its benefits. Compare it with the `transformers` library.

\# WRITE YOUR ANSWER HERE

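GGUF is a binary file format for storing language models for inference with llama.cpp (the project started by Georgi Gerganov) and other GGML-based runtimes. A GGUF file bundles the weights together with the metadata needed to run the model (architecture hyperparameters, tokenizer vocabulary, quantization type), so a model can be distributed and loaded as a single file, and the file can be memory-mapped for fast loading. The format is most often used for quantized weights, such as the 8-bit `Q8_0` file loaded above, which reduces memory use and makes inference on a CPU or with partial GPU offloading practical. Metadata is stored as extensible key-value pairs, so new fields can be added without breaking existing files.

The comparison with `transformers` is not like for like: GGUF is a file format, whereas `transformers` is a Python library. A typical `transformers` checkpoint is spread over several files (`config.json`, tokenizer files, one or more `safetensors` shards), is usually loaded in 16- or 32-bit precision on top of PyTorch, and supports training as well as inference. GGUF is aimed primarily at inference and trades that flexibility for smaller files and lower memory requirements.

GGUF was introduced by the llama.cpp team on August 21st, 2023 as a replacement for GGML, which llama.cpp no longer supports. [Source](https://huggingface.co/TheBloke/law-LLM-GGUF)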

### 1.2 Simple Chain (2 pt)

Complete the next cell to create a simple chain that takes the name of a country as input and outputs its capital. To accomplish this, you should utilize the `HumanMessagePromptTemplate` and `AIMessagePromptTemplate` classes to formulate an effective prompt.

In [None]:
from langchain_core.prompts import ChatPromptTemplate, HumanMessagePromptTemplate, AIMessagePromptTemplate
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_messages([
  HumanMessagePromptTemplate.from_template("What is the capital of {country}?"),
  AIMessagePromptTemplate.from_template("")
])

output_parser = StrOutputParser()

simple_chain = prompt | model | output_parser

answer = simple_chain.invoke({"country": "Iran"})

print(answer)


 The capital of Iran is Tehran.


Write about the objectives behind the creation of the `HumanMessagePromptTemplate` and `AIMessagePromptTemplate` classes. What do they actually do? Write a brief description.

\# WRITE YOUR ANSWER HERE

[Link to the LangChain docs](https://python.langchain.com/v0.1/docs/modules/model_io/prompts/quick_start/#message-prompts)
[Message types](https://python.langchain.com/v0.1/docs/modules/model_io/chat/message_types/)

**From the LangChain docs:**
ChatModels take a list of messages as input and return a message. There are a few different types of messages. All messages have a `role` and a `content` property. The role describes *who* is saying the message. LangChain has different message classes for different roles. The content property describes the content of the message. This can be a few different things: (i) a string, (ii) a list of dictionaries (for multimodal input).
The different types of messages are:
- **HumanMessage**: This represents a message from the user. Generally consists only of content.
- **AIMessage**: This represents a message from the AI model. This may have *additional_kwargs* in it - for example *tool_calls* if using OpenAI tool calling.
- **SystemMessage**: This represents a system message, which tells the model how to behave. This generally only consists of content. Not every model supports this.
- **FunctionMessage**: This represents the result of a function call. In addition to role and content, this message has a name parameter which conveys the name of the function that was called to produce this result.
- **ToolMessage**: This represents the result of a tool call. This is distinct from a FunctionMessage in order to match OpenAI's function and tool message types. In addition to role and content, this message has a tool_call_id parameter which conveys the id of the call to the tool that was called to produce this result.


`HumanMessagePromptTemplate` is a prompt template for the user's turn: formatting it with input variables (for example `{country}`) produces a `HumanMessage`.
It lets us write the user side of the conversation once, with f-string-style placeholders, and reuse it for different inputs.

`AIMessagePromptTemplate` is the counterpart for the model's turn: formatting it produces an `AIMessage`.
It is useful for including example answers (few-shot demonstrations), earlier assistant turns of a conversation, or the beginning of an answer that the model should continue.

What is the purpose of adding an empty `AIMessagePromptTemplate` at the end of the prompt? What are the consequences of omitting it?

\# WRITE YOUR ANSWER HERE

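The model used here is a plain text-completion LLM, so LangChain flattens the chat messages into a single string before calling it. The trailing empty `AIMessagePromptTemplate` makes that string end with an open AI turn, which cues the model to continue as the assistant and produce the answer directly. If it is omitted, the prompt ends right after the human question and the model is free to continue in other ways: it may extend the question, write its own role prefix before answering, or invent further turns of the conversation, so the output is less predictable and harder to parse.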
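
The effect of the trailing empty AI message can be illustrated with a simplified imitation of how chat messages are flattened into text for a completion model. The function below is hypothetical and only mimics a `Role: content` layout; the exact string LangChain builds may differ between versions.

```python
def render_chat_as_text(messages):
    """Flatten (role, content) pairs into one string, one 'Role: content' line each."""
    return "\n".join(f"{role}: {content}" for role, content in messages)


question = "What is the capital of Iran?"
with_ai_turn = render_chat_as_text([("Human", question), ("AI", "")])
without_ai_turn = render_chat_as_text([("Human", question)])

print(repr(with_ai_turn))     # 'Human: What is the capital of Iran?\nAI: '
print(repr(without_ai_turn))  # 'Human: What is the capital of Iran?'
```

In the first case the text ends with an open `AI: ` turn that the model only has to complete; in the second the model has to decide on its own who speaks next.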

### 1.3 JSON Chain (5 pt)

Now we want to improve the chain to extract data from the model response. Modify the existing prompt to request information about a country's name, population, and major cities in addition to the capital. Additionally, incorporate a `SystemMessagePromptTemplate` to ensure the model's response is structured in JSON format. Keep in mind that a distinct parser is required to parse the JSON output.

In [None]:
from langchain_core.prompts import SystemMessagePromptTemplate
from langchain_core.output_parsers import JsonOutputParser

resp_template = '{{"country": "{country}", "capital": "", "population": "", "cities": []}}'
prompt = ChatPromptTemplate.from_messages([
  SystemMessagePromptTemplate.from_template(f"You are a helpful AI assistant that answers in JSON format. Please respond in JSON format as shown below, and avoid generating any more tokens: \n{resp_template}"),
  HumanMessagePromptTemplate.from_template("What is the capital, population, and major cities of {country}?"),
  AIMessagePromptTemplate.from_template(""),
])
output_parser = JsonOutputParser()

json_chain = prompt | model | output_parser

answers = json_chain.batch([
  {"country": "Iran"},
  {"country": "USA"},
  {"country": "Japan"},
  {"country": "Nigeria"}
])

for ans in answers:
  print(f"{ans['country']}:")
  print(f"  capital: {ans['capital']}")
  print(f"  population: {ans['population']}")
  print(f"  important cities: {ans['cities']}")


Iran:
  capital: Tehran
  population: 831276549
  important cities: [{'name': 'Tehran', 'population': 155095647, 'province': 'Tehran Province'}, {'name': 'Mashhad', 'population': 31375944, 'province': 'Khorasan-e Razavi Province'}, {'name': 'Isfahan', 'population': 25680892, 'province': 'Esfahan Province'}]
USA:
  capital: Washington D.C.
  population: 329610257
  important cities: ['New York City', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix']
Japan:
  capital: Tokyo
  population: 1276537000
  important cities: [{'name': 'Tokyo', 'population': 138897000}, {'name': 'Osaka', 'population': 21731000}, {'name': 'Nagoya', 'population': 23546000}]
Nigeria:
  capital: Abuja
  population: 233 million
  important cities: ['Lagos', 'Kano', 'Ibadan', 'Port Harcourt']
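
The parsed answers above are valid JSON, but their fields are not uniform: the population is an integer for some countries and a string such as `233 million` for Nigeria, and the cities are sometimes strings and sometimes dictionaries. A small post-processing step, sketched below with hypothetical helper names, could bring them to one shape before further use.

```python
import re


def parse_population(value):
    """Best-effort conversion of values like 329610257 or '233 million' to an int."""
    if isinstance(value, (int, float)):
        return int(value)
    text = str(value).replace(",", "").strip()
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*(million|billion)?", text, flags=re.IGNORECASE)
    if match is None:
        return None  # unknown format: leave the decision to the caller
    scale = {"million": 1_000_000, "billion": 1_000_000_000}.get((match.group(2) or "").lower(), 1)
    return round(float(match.group(1)) * scale)


def city_names(cities):
    """Return plain city names whether the model produced strings or dictionaries."""
    return [city["name"] if isinstance(city, dict) else str(city) for city in cities]


print(parse_population("233 million"))
print(city_names([{"name": "Tokyo", "population": 138897000}, "Osaka"]))
```

This only normalizes the format; it does not check whether the numbers returned by the model are factually correct.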


## 2. Different Types of Retrievers (15 pt)

In this section, we use the mini-bioasq dataset to evaluate different types of retrievers.

In [None]:
import json
from datasets import load_dataset

corpus = load_dataset("rag-datasets/mini-bioasq", "text-corpus", split="passages")
qa_dataset = load_dataset("rag-datasets/mini-bioasq", "question-answer-passages", split="test[:100]")

qa_dataset = qa_dataset.map(lambda data: {
  "relevant_passage_ids": json.loads(data["relevant_passage_ids"])
})

print(corpus)
print(qa_dataset)



Dataset({
    features: ['passage', 'id'],
    num_rows: 40221
})
Dataset({
    features: ['question', 'answer', 'relevant_passage_ids', 'id'],
    num_rows: 100
})


In [None]:
from langchain_core.documents import Document

docs = []
for doc in corpus:
    docs.append(Document(page_content=doc["passage"], metadata={"id": doc["id"]}))

### 2.1 Evaluate Retriever (4 pt)

To effectively compare various retrieval systems, we must define a metric. Complete the `evaluate_retriever` function to measure the accuracy of the retrieved documents. Consider the `relevant_passage_ids` column as the expected documents to be retrieved.

In [None]:
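def evaluate_retriever(retriever, k=5):
    """Return the fraction of reachable relevant passages found in the top-k results."""
    correct = 0
    total = 0

    for data in qa_dataset:
        relevant_passage_ids = data["relevant_passage_ids"]
        question = data["question"]
        # Keep at most k results so the score matches the denominator below.
        retrieved_passage_ids = [d.metadata['id'] for d in retriever.invoke(question)[:k]]

        # Count the relevant passages that were actually retrieved.
        correct += len(set(relevant_passage_ids) & set(retrieved_passage_ids))
        # At most k relevant passages can be retrieved, so cap the denominator at k.
        total += min(len(relevant_passage_ids), k)

    return correct / total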
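
The metric used here is a recall capped at k: a query with more than k relevant passages is only expected to return k of them. A small worked example on made-up passage ids shows the arithmetic; the function name `capped_recall` and the ids are hypothetical.

```python
def capped_recall(relevant_ids_per_query, retrieved_ids_per_query, k=5):
    """Hits in the top-k results divided by the number of hits that were possible."""
    correct = 0
    total = 0
    for relevant, retrieved in zip(relevant_ids_per_query, retrieved_ids_per_query):
        correct += len(set(relevant) & set(retrieved[:k]))
        total += min(len(relevant), k)
    return correct / total if total else 0.0


relevant = [[1, 2, 3], [10, 11, 12, 13, 14, 15, 16]]
retrieved = [[1, 9, 3, 8, 7], [10, 11, 99, 98, 97]]

# Query 1: 2 hits out of min(3, 5) = 3. Query 2: 2 hits out of min(7, 5) = 5.
score = capped_recall(relevant, retrieved, k=5)
print(score)  # 0.5
```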

### 2.2 TF-IDF Retriever (3 pt)

Create a TF-IDF retriever and configure it to return the top 5 relevant documents.

In [None]:
from langchain_community.retrievers import TFIDFRetriever

tfidf_retriever = TFIDFRetriever.from_documents(docs, k=5)

### 2.3 Semantic Retriever (5 pt)

Semantic retrievers operate by retrieving documents through embeddings. These systems require an embedding model to convert documents into a vector space, and a vector database to find the closest documents to a query. Construct a semantic retriever that utilizes [`intfloat/e5-base`](https://huggingface.co/intfloat/e5-base) as the embedding model and FAISS for the vector database.

In [None]:
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
import torch

embedding_model = HuggingFaceEmbeddings(
        model_name="intfloat/e5-base",
        model_kwargs={"device": 'cuda' if torch.cuda.is_available() else 'cpu'},
        encode_kwargs={'normalize_embeddings': True},
)
semantic_retriever = (FAISS.
                      from_documents(documents=docs, embedding=embedding_model).
                      as_retriever(search_kwargs={"k":5}))
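
Because the embeddings are normalized to unit length, ranking passages by largest inner product and by smallest Euclidean distance gives the same order. The toy sketch below shows nearest-neighbour search on hand-written 2-dimensional vectors; the vectors and helper names are hypothetical stand-ins for real embeddings and for the vector index.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_k(query, passages, k=2):
    """Indices of the k passages with the highest cosine similarity to the query."""
    q = l2_normalize(query)
    scores = [dot(q, l2_normalize(p)) for p in passages]
    return sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)[:k]


passages = [[0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
print(top_k([1.0, 0.0], passages, k=2))  # [0, 2]

# For unit vectors, squared Euclidean distance equals 2 - 2 * inner product.
a, b = l2_normalize([0.9, 0.1]), l2_normalize([0.5, 0.5])
squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
print(abs(squared_distance - (2 - 2 * dot(a, b))) < 1e-9)  # True
```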



### 2.4 Compare Retrievers (3 pt)

Calculate the score for each retriever using the `evaluate_retriever` function you wrote previously. In this question, which one outperforms the other? Illustrate a scenario for each retriever in which it outperforms the other.

\# WRITE YOUR ANSWER HERE

In [None]:
tfidf_acc = evaluate_retriever(tfidf_retriever)
semantic_acc = evaluate_retriever(semantic_retriever)

print(f"TF-IDF accuracy: {tfidf_acc:.2f}")
print(f"semantic accuracy: {semantic_acc:.2f}")


TF-IDF accuracy: 0.50
semantic accuracy: 0.58


To compare these methods in different scenarios, we first need to look at their pros and cons.

**TF-IDF based retrievers**: A term's weight grows with its frequency in a document and shrinks with the number of documents that contain it, so documents that share rare, distinctive terms with the query are ranked highest. This works well for keyword-based search.
- Advantages:
  - It is computationally cheap and runs fast on a CPU.
  - It works well for small to medium-sized document collections and is scalable.
- Disadvantages:
  - It doesn't capture word order, meaning, or grammatical structure.
  - It ignores context beyond individual terms, so it cannot tell apart different senses of the same word or match synonyms.

**Semantic retrievers**: These use dense vector representations, known as embeddings, to capture semantic meaning. They are used for semantic search.
- Advantages:
  - They consider word meaning, context, and relationships (when a transformer-based encoder is used).
  - They perform well with complex queries and long documents.
- Disadvantages:
  - They require more resources and are slow on a CPU.
  - Embedding and indexing become even more expensive for large-scale applications with many documents.

In this experiment the semantic retriever outperforms TF-IDF (0.58 vs. 0.50). As the lists above suggest, `TF-IDF retrievers` are the better choice for simple keyword lookups and quick retrieval, for example when the query contains exact technical terms that also appear in the relevant passages, when the corpus is very large, or when no GPU is available and resources are limited. `Semantic retrievers` win in more complex scenarios where context and meaning matter, for example when the question paraphrases the relevant passage without sharing its keywords.
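
The keyword dependence of TF-IDF can be seen in a few lines of plain Python. The sketch below uses a simplified weighting (raw term count times a smoothed logarithmic IDF) on three made-up documents; it is not the exact formula used by scikit-learn or by `TFIDFRetriever`, and the function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Map each document to a {term: tf * idf} dictionary."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}
    return [
        {term: count * idf[term] for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


def score(query, vector):
    """Sum of the TF-IDF weights of the query terms that occur in the document."""
    return sum(vector.get(term, 0.0) for term in query.lower().split())


documents = [
    "heart attack treatment options",
    "myocardial infarction therapy",
    "weather forecast for tomorrow",
]
vectors = tfidf_vectors(documents)
scores = [score("heart attack", vector) for vector in vectors]
print(scores)
```

The second document is about the same condition as the query but shares no term with it, so its score is zero. An embedding-based retriever is expected to place it close to the query.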

---
Microsoft's Copilot helped to draft the list of advantages and disadvantages of these methods. The list was then edited and reviewed by the author, and the conclusion was written by the author.

## 3. RAG (25 pt)

### 3.0 Load model

In [None]:
from langchain_community.llms import CTransformers

model = CTransformers(
  model="TheBloke/Llama-2-7B-Chat-GGUF",
  model_file="llama-2-7b-chat.Q8_0.gguf",
  model_type="llama",
  config={
    "gpu_layers": 50,
    'context_length' : 2048
  }
)


In this section, you should use all the concepts you've learned until now to create a complete RAG chain.

### 3.1 Load Documents (2 pt)

Load [RAFT](https://arxiv.org/abs/2403.10131) and [DSPy](https://arxiv.org/abs/2401.12178) papers. You can use `ArxivLoader` to get documents from arXiv.


In [None]:
from langchain_community.document_loaders import ArxivLoader

docs = ArxivLoader("2403.10131").load() + ArxivLoader("2401.12178").load()

docs

[Document(page_content='In-Context Learning for Extreme Multi-Label Classification\nKarel D’Oosterlinck1,2,∗, Omar Khattab2, François Remy1,\nThomas Demeester1, Chris Develder1, Christopher Potts2\n1Ghent University – imec\n2Stanford University\n∗karel.doosterlinck@ugent.be\nAbstract\nMulti-label classification problems with thou-\nsands of classes are hard to solve with in-\ncontext learning alone, as language models\n(LMs) might lack prior knowledge about the\nprecise classes or how to assign them, and\nit is generally infeasible to demonstrate ev-\nery class in a prompt. We propose a general\nprogram, Infer–Retrieve–Rank, that defines\nmulti-step interactions between LMs and re-\ntrievers to efficiently tackle such problems. We\nimplement this program using the DSPy pro-\ngramming model, which specifies in-context\nsystems in a declarative manner, and use DSPy\noptimizers to tune it towards specific datasets\nby bootstrapping only tens of few-shot exam-\nples. Our primary extreme cl

In [None]:
# # We can use method below to download PDFs, this code is generated by AI (Cohere c4ai)
# import requests

# raft_url = "https://arxiv.org/pdf/2403.10131v1"
# dspy_url = "https://arxiv.org/pdf/2401.12178v1"

# output_dir = "./papers"

# import os
# os.makedirs(output_dir, exist_ok=True)

# def download_paper(url, filename):
#     response = requests.get(url)
#     with open(os.path.join(output_dir, filename), "wb") as file:
#         file.write(response.content)

# download_paper(raft_url, "raft.pdf")
# download_paper(dspy_url, "dspy.pdf")

# print("Downloaded papers to", output_dir)

Downloaded papers to ./papers


### 3.2 Split Documents into Chunks (4 pt)

Usually, each document is constructed from multiple sections, each with a separate topic. It is better to split each document into smaller parts named chunks and search among them instead of actual documents. Write a splitter to create chunks from loaded documents.

In [None]:
# We chunk the full paper text loaded by ArxivLoader above. The PDFs could also be downloaded and loaded with one of LangChain's PDF loaders.
from langchain.text_splitter import RecursiveCharacterTextSplitter  # one of several splitters available in LangChain

documents = docs

text_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=30)

chunks = text_splitter.split_documents(documents)
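
The roles of `chunk_size` and `chunk_overlap` can be illustrated with a plain sliding window over characters. This is only a sketch with a hypothetical `chunk_text` helper: `RecursiveCharacterTextSplitter` additionally tries to cut at separators such as paragraph and line breaks, so its chunks are usually not of exactly equal length.

```python
def chunk_text(text, chunk_size=400, chunk_overlap=30):
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


text = "".join(str(i % 10) for i in range(1000))  # 1000 dummy characters
toy_chunks = chunk_text(text, chunk_size=400, chunk_overlap=30)

print([len(chunk) for chunk in toy_chunks])        # [400, 400, 260]
print(toy_chunks[0][-30:] == toy_chunks[1][:30])   # True
```

The overlap repeats the end of one chunk at the start of the next, so a sentence that is cut at a chunk boundary still appears whole in at least one chunk when it is shorter than the overlap.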

### 3.3 Retriever (3 pt)

Create a retriever of your choice.

In [None]:
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
import torch

embedding_model = HuggingFaceEmbeddings(
        model_name="intfloat/e5-base",
        model_kwargs={"device": 'cuda' if torch.cuda.is_available() else 'cpu'},
        encode_kwargs={'normalize_embeddings': True},
)
retriever = (FAISS.
             from_documents(documents=chunks, embedding=embedding_model).
             as_retriever(search_kwargs={"k":2})) # We will just retrieve 2 documents




### 3.4 Design Prompt (2 pt)

Design a suitable prompt for RAG.

In [None]:
from langchain_core.prompts import (SystemMessagePromptTemplate,
                                    HumanMessagePromptTemplate,
                                    AIMessagePromptTemplate,
                                    ChatPromptTemplate)
from langchain_core.output_parsers import StrOutputParser


prompt = ChatPromptTemplate.from_messages([
    SystemMessagePromptTemplate.from_template("You are a helpful AI assistant. Answer the question based on the provided context. If the answer is not explicitly mentioned in the context, just respond with \"I don't know\" and avoid anything more. Do not make assumptions or provide uncertain answers. Your goal is to provide accurate and reliable information based on the given context."),
    HumanMessagePromptTemplate.from_template("Provided context to answer the question: \n{context}\n\nQuestion to be answered: \n{question}"),
    AIMessagePromptTemplate.from_template(""),
])


### 3.5 RAG Chain (3 pt)

Design a question from the documents and get the retriever and RAG output for that question.

In [None]:
from langchain_core.runnables import RunnablePassthrough
from textwrap import fill

rag_chain = (
  {"context": retriever, "question": RunnablePassthrough()}
  | prompt
  | model
  | StrOutputParser()
)

question = "How does LMs' lack of prior knowledge affect their performance on multi-label classification problems?"
retrieved_doc = retriever.invoke(question)
answer = rag_chain.invoke(question)

print(f"retrieved document:\n{retrieved_doc}\n")
answer = fill(answer, width=100, initial_indent='\t', subsequent_indent='\t')
print(f"answer:\n{answer}")

retrieved document:
[Document(page_content='hard to solve with in-context learning alone. Lan-\nguage models (LMs) might lack prior knowledge\nabout the precise classes, and the sheer number\nof available classes—often upwards of 10,000—\ngenerally means it is infeasible even to demon-\nstrate every class in a prompt. To deal with this,\nsome recent efforts make multiple LM calls at in-\nference time (Zhu and Zamani, 2023), while others', metadata={'Published': '2024-01-22', 'Title': 'In-Context Learning for Extreme Multi-Label Classification', 'Authors': "Karel D'Oosterlinck, Omar Khattab, François Remy, Thomas Demeester, Chris Develder, Christopher Potts", 'Summary': 'Multi-label classification problems with thousands of classes are hard to\nsolve with in-context learning alone, as language models (LMs) might lack prior\nknowledge about the precise classes or how to assign them, and it is generally\ninfeasible to demonstrate every class in a prompt. We propose a general\nprogram, $\\
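
As the output above shows, each retrieved document carries the chunk text together with a large metadata dictionary (title, authors, full summary). When the retriever output is placed into `{context}` directly, all of it ends up in the prompt. One possible refinement, sketched below, is to join only the chunk texts. The `Doc` class is a hypothetical stand-in for LangChain's `Document`, used so that the example runs on its own.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs):
    """Join the text of the retrieved chunks, dropping their metadata."""
    return "\n\n".join(doc.page_content for doc in docs)


retrieved = [
    Doc("LMs might lack prior knowledge about the precise classes.", {"Summary": "..."}),
    Doc("Some recent efforts make multiple LM calls at inference time."),
]
context = format_docs(retrieved)
print(context)
```

In the chain, such a helper could be placed behind the retriever, for example as `"context": retriever | format_docs`, assuming that the LangChain version in use accepts a plain function on the right-hand side of `|`.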

### 3.6 Out of Domain Question (4 pt)

Ask a question that is not related to the documents. Does the model answer it? Change your prompt to force the model to say "I don't know" when someone asks an out-of-domain question.

WRITE YOUR ANSWER HERE

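With the initial prompt, the model did not reply 'I don't know' to the out-of-domain question. We therefore modified the system prompt (the version shown in section 3.4) to force the model to say 'I don't know' whenever the answer is not explicitly in the provided context.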

In [None]:
rag_chain = (
  {"context": retriever, "question": RunnablePassthrough()}
  | prompt
  | model
  | StrOutputParser()
)

question = "How is the weather in Tehran today?"
retrieved_doc = retriever.invoke(question)
answer = rag_chain.invoke(question)

print(f"retrieved document:\n{retrieved_doc}\n")
answer = fill(answer, width=100, initial_indent='\t', subsequent_indent='\t')
print(f"answer:\n{answer}")

retrieved document:
[Document(page_content='qualifications, and occupations. We consider three\ndatasets each containing snippets (typically one\nsentence) of online job vacancies in English with\ntheir relevant ESCO labels. We use the HOUSE,\nTECH, and TECHWOLF datasets (Zhang et al.,\n2022; Decorte et al., 2022, 2023). We take 10\nexamples each from the HOUSE and TECH valida-\ntion sets as training examples, and keep the remain-', metadata={'Published': '2024-01-22', 'Title': 'In-Context Learning for Extreme Multi-Label Classification', 'Authors': "Karel D'Oosterlinck, Omar Khattab, François Remy, Thomas Demeester, Chris Develder, Christopher Potts", 'Summary': 'Multi-label classification problems with thousands of classes are hard to\nsolve with in-context learning alone, as language models (LMs) might lack prior\nknowledge about the precise classes or how to assign them, and it is generally\ninfeasible to demonstrate every class in a prompt. We propose a general\nprogram, $\\texttt

### 3.7 The Effect of Temperature (7 pt)

RAG performance is highly dependent on model temperature. Explain whether a low or a high temperature is better. For the same prompt, compare the output of the model at low and high temperature.

WRITE YOUR ANSWER HERE

At low temperatures, the model concentrates probability on the most likely next tokens, so it sticks to well-learned patterns and its responses are more predictable and conservative. At high temperatures, the probability distribution is flatter, which encourages more diverse and creative responses. These can be more interesting and novel, but they can also be less coherent and less relevant, because the model is more likely to sample unlikely or inaccurate continuations.
The same holds for RAG. A low temperature tends to produce a concise response that stays close to the retrieved context and answers the question directly. A high temperature tends to produce a more verbose response with additional, more descriptive detail. Higher temperatures may also weaken the model's instruction following; in our case, for example, this can cause the model to ignore the 'I don't know' instruction and generate an answer even for out-of-domain questions.

In short, the choice between a high and a low temperature depends on the use case: we pick the temperature that matches what we want and expect from the model.
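
To make the effect of temperature described above more concrete, here is a minimal sketch of temperature-scaled softmax in plain Python. The function name `softmax_with_temperature` and the example scores are hypothetical and only for illustration; treating temperature 0 as greedy decoding is an assumption of this sketch, not a statement about how the backend used in this notebook handles it.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw next-token scores into probabilities, scaled by a temperature."""
    if temperature <= 0:
        # Assumption of this sketch: temperature 0 means greedy decoding,
        # so all probability mass goes to the highest-scoring token.
        top = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == top else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.0, 0.1]
for t in (0.0, 0.5, 0.9, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")
```

As the temperature grows, the probabilities move closer together, so less likely tokens are sampled more often. This is why high-temperature answers tend to be more varied and less tightly bound to the prompt.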

In [None]:
from langchain_community.llms import CTransformers
model = CTransformers(
  model="TheBloke/Llama-2-7B-Chat-GGUF",
  model_file="llama-2-7b-chat.Q8_0.gguf",
  model_type="llama",
  config={
    "gpu_layers": 50,
    'context_length': 2048,
    'temperature': 0.9,
  }
)

from langchain.document_loaders import ArxivLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
documents = ArxivLoader("2403.10131").load() + ArxivLoader("2401.12178").load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=30)
chunks = text_splitter.split_documents(documents)

from langchain.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
import torch
embedding_model = HuggingFaceEmbeddings(
        model_name="intfloat/e5-base",
        model_kwargs={"device": 'cuda' if torch.cuda.is_available() else 'cpu'},
        encode_kwargs={'normalize_embeddings': True},
)
retriever = (FAISS.
             from_documents(documents=chunks, embedding=embedding_model).
             as_retriever(search_kwargs={"k":2}))

from langchain_core.prompts import (SystemMessagePromptTemplate,
                                    HumanMessagePromptTemplate,
                                    AIMessagePromptTemplate,
                                    ChatPromptTemplate)
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_messages([
    SystemMessagePromptTemplate.from_template("You are a helpful AI assistant. Answer the question based on the provided context. If the answer is not explicitly mentioned in the context, just respond with \"I don't know\" and avoid anything more. Do not make assumptions or provide uncertain answers. Your goal is to provide accurate and reliable information based on the given context."),
    HumanMessagePromptTemplate.from_template("Provided context to answer the question: \n{context}\n\nQuestion to be answered: \n{question}"),
    AIMessagePromptTemplate.from_template(""),
])

from langchain_core.runnables import RunnablePassthrough
from textwrap import fill
rag_chain = (
  {"context": retriever, "question": RunnablePassthrough()}
  | prompt
  | model
  | StrOutputParser()
)

question = "How model quatization can solve memory issues?"
answer = rag_chain.invoke(question)
answer = fill(answer, width=100, initial_indent='\t', subsequent_indent='\t')
print(f"(Higher temperature) answer:\n{answer}")


(Higher temperature) answer:
	 Based on the provided context, the answer to the question "How model quatization can solve memory
	issues?" is not explicitly mentioned. Therefore, I will respond with a clear "I don't know" answer
	without making assumptions or providing uncertain answers.


In [None]:
from langchain_community.llms import CTransformers
model = CTransformers(
  model="TheBloke/Llama-2-7B-Chat-GGUF",
  model_file="llama-2-7b-chat.Q8_0.gguf",
  model_type="llama",
  config={
    "gpu_layers": 50,
    'context_length': 2048,
    'temperature': 0.0,
  }
)

from langchain.document_loaders import ArxivLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
documents = ArxivLoader("2403.10131").load() + ArxivLoader("2401.12178").load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=30)
chunks = text_splitter.split_documents(documents)

from langchain.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
import torch
embedding_model = HuggingFaceEmbeddings(
        model_name="intfloat/e5-base",
        model_kwargs={"device": 'cuda' if torch.cuda.is_available() else 'cpu'},
        encode_kwargs={'normalize_embeddings': True},
)
retriever = (FAISS.
             from_documents(documents=chunks, embedding=embedding_model).
             as_retriever(search_kwargs={"k":2}))

from langchain_core.prompts import (SystemMessagePromptTemplate,
                                    HumanMessagePromptTemplate,
                                    AIMessagePromptTemplate,
                                    ChatPromptTemplate)
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_messages([
    SystemMessagePromptTemplate.from_template("You are a helpful AI assistant. Answer the question based on the provided context. If the answer is not explicitly mentioned in the context, just respond with \"I don't know\" and avoid anything more. Do not make assumptions or provide uncertain answers. Your goal is to provide accurate and reliable information based on the given context."),
    HumanMessagePromptTemplate.from_template("Provided context to answer the question: \n{context}\n\nQuestion to be answered: \n{question}"),
    AIMessagePromptTemplate.from_template(""),
])

from langchain_core.runnables import RunnablePassthrough
from textwrap import fill
rag_chain = (
  {"context": retriever, "question": RunnablePassthrough()}
  | prompt
  | model
  | StrOutputParser()
)

question = "How model quatization can solve memory issues?"
answer = rag_chain.invoke(question)
answer = fill(answer, width=100, initial_indent='\t', subsequent_indent='\t')
print(f"(Lower temperature) answer:\n{answer}")


(Lower temperature) answer:
	I don't know.
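
One way to compare the two runs above more systematically is to check whether each answer is only the refusal phrase or merely contains it. The sketch below is a rough illustration; the helpers `is_bare_refusal` and `contains_refusal` are hypothetical names and not part of LangChain, and the two strings are copied from the outputs printed above.

```python
import re

def _normalize(text):
    """Collapse whitespace and lowercase the text for comparison."""
    return re.sub(r"\s+", " ", text).strip().lower()

def is_bare_refusal(answer, refusal="I don't know"):
    """True if the answer is only the refusal phrase, ignoring trailing punctuation."""
    return _normalize(answer).rstrip(".!") == refusal.lower()

def contains_refusal(answer, refusal="I don't know"):
    """True if the refusal phrase appears anywhere in the answer."""
    return refusal.lower() in _normalize(answer)

# Answers copied from the two runs above.
high_temp_answer = (
    'Based on the provided context, the answer to the question "How model '
    'quatization can solve memory issues?" is not explicitly mentioned. '
    'Therefore, I will respond with a clear "I don\'t know" answer without '
    'making assumptions or providing uncertain answers.'
)
low_temp_answer = "\tI don't know."

for label, answer in [("high", high_temp_answer), ("low", low_temp_answer)]:
    print(label, "bare:", is_bare_refusal(answer), "contains:", contains_refusal(answer))
```

In these two runs, both answers contain the refusal, but only the low-temperature answer is limited to it, as the system prompt requests ("avoid anything more").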
