## CA 3, LLMs Spring 2024

- **Name:** Melika Nobakhtian
- **Student ID:** 4021305965008

---
### This is due on **May 11th, 2024**, submitted via [elearn](https://elearn.ut.ac.ir/).
#### Your submission should be named using the following format: `CA3_LASTNAME_STUDENTID.ipynb`.

---

##### *How to do this problem set:*

- Some questions require writing Python code and computing results, and the rest of them have written answers. For coding problems, you will have to fill out all code blocks that say `WRITE YOUR CODE HERE`.

- For text-based answers, you should replace the text that says "Write your answer here..." with your actual answer.

- There is no penalty for using AI assistance on this homework as long as you fully disclose it in the final cell of this notebook (this includes storing any prompts that you feed to large language models). That said, anyone caught using AI assistance without proper disclosure will receive a zero on the assignment (we have several automatic tools to detect such cases). We're literally allowing you to use it with no limitations, so there is no reason to lie!

---

##### *Academic honesty*

- We will audit the Colab notebooks from a set number of students, chosen at random. The audits will check that the code you wrote actually generates the answers in your notebook. If you turn in correct answers on your notebook without code that actually generates those answers, we will consider this a serious case of cheating.

- We will also run automatic checks of Colab notebooks for plagiarism. Copying code from others is also considered a serious case of cheating.

---

# Chain-of-Thought (CoT) (20 points)

If you have any further questions or concerns, contact the TA via email: mehdimohajeri@ut.ac.ir

LLMs have demonstrated good reasoning abilities. Furthermore, their capabilities can be further improved by incorporating reasoning techniques. One of the most notable developments in this area is the [Chain-of-Thought (CoT)](https://arxiv.org/abs/2201.11903), which was introduced by Google. This approach has shown promising results in improving the reasoning capabilities of language models across a variety of tasks. Can you explain what CoT is and how it works? (2.5 Points)

**Answer:**

**Chain of Thought (CoT):** Chain-of-thought prompting is a prompt engineering technique that aims to improve language models' performance on tasks requiring logic, calculation, and decision-making by structuring the input prompt in a way that mimics human reasoning.

To construct a chain-of-thought prompt, a user typically appends an instruction such as "Describe your reasoning in steps" or "Explain your answer step by step" to the end of their query to a large language model (LLM). In essence, this prompting technique asks the LLM to not only generate an end result, but also detail the series of intermediate steps that led to that answer.

**How CoT Works**:
   - At its core, CoT prompting enables complex reasoning capabilities through intermediate reasoning steps.
   - When faced with a complicated math or logic question, we often break down the larger problem into a series of smaller steps that help us arrive at a final answer.
   - CoT prompting encourages language models to perform similar intermediate reasoning steps.
   - By providing examples or prompts that guide the model step by step, we can enhance its ability to reason and solve complex tasks.

**Examples**:
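   - **Few-shot CoT Prompting**: the prompt contains a few worked examples, each showing a question, its step-by-step reasoning, and the final answer, followed by the new question.
   - **Zero-shot CoT Prompting**: no examples are given; an instruction such as "Let's think step by step" is simply appended to the question.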
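
To make the two variants concrete, here is a minimal sketch of how each prompt could be assembled as a plain string. The helper names `zero_shot_cot` and `few_shot_cot` and the worked exemplar are illustrative assumptions, not part of the assignment; the trigger phrase "Let's think step by step." is the one commonly used for zero-shot CoT.

```python
# Illustrative helpers (hypothetical names) for building CoT prompts as strings.


def zero_shot_cot(question, trigger="Let's think step by step."):
    """Append a reasoning trigger; no worked examples are included."""
    return f"Question: {question}\nOutput: {trigger}"


def few_shot_cot(question, exemplars):
    """Prepend (question, reasoning, answer) exemplars before the new question."""
    parts = []
    for ex_q, ex_reasoning, ex_answer in exemplars:
        parts.append(
            f"Question: {ex_q}\nOutput: {ex_reasoning} The answer is {ex_answer}."
        )
    # The new question comes last, with the output left open for the model.
    parts.append(f"Question: {question}\nOutput:")
    return "\n\n".join(parts)


# A made-up exemplar with its reasoning written out.
exemplars = [
    (
        "A pen costs $2. How much do 3 pens cost?",
        "Each pen costs $2, so 3 pens cost 3 * 2 = 6 dollars.",
        "$6",
    )
]

question = (
    "John volunteers at a shelter twice a month for 3 hours at a time. "
    "How many hours does he volunteer per year?"
)
print(zero_shot_cot(question))
print(few_shot_cot(question, exemplars))
```

Either string can then be passed to the model in place of the bare question.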
    



In this section, you should use the CoT technique. First, you need to load the [Phi-2 model](https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/), which Microsoft introduced as a small LLM.

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.set_default_device("cuda")

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)

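def generate_output(model, question, max_length=300):
  # Wrap the question in the "Question: ... Output:" template used throughout.
  prompt = f"Question: {question}\nOutput:"
  inputs = tokenizer(prompt, return_tensors="pt", return_attention_mask=False)
  # max_length counts the prompt tokens as well as the generated ones.
  outputs = model.generate(**inputs, max_length=max_length)
  # Decode the single sequence in the batch (prompt plus completion).
  text = tokenizer.batch_decode(outputs)[0]
  return text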



Use Phi-2 to answer the questions below with and without CoT. Compare results and explain their difference. (4 Points)

In [None]:
questions = ["Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much did she earn?",
"Jack is stranded on a desert island. He wants some salt to season his fish. He collects 2 liters of seawater in an old bucket. If the water is 20% salt, how many ml of salt will Jack get when all the water evaporates?",
"John volunteers at a shelter twice a month for 3 hours at a time. How many hours does he volunteer per year?",
"There are 32 tables in a hall. Half the tables have 2 chairs each, 5 have 3 chairs each and the rest have 4 chairs each. How many chairs in total are in the hall?",
"Bert fills out the daily crossword puzzle in the newspaper every day. He uses up a pencil to fill out the puzzles every two weeks. On average, it takes him 1050 words to use up a pencil. How many words are in each crossword puzzle on average?"
]

## Correct Answers for each question:
    # 1: $10
    # 2: 400 ml
    # 3: 72 hours
    # 4: 91 chairs
    # 5: 75 words

# WRITE YOUR CODE HERE
"""Step 1: Prompting without CoT"""
for q in questions:
    answer = generate_output(model, q)
    print(answer)
    print("************************")

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Question: Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much did she earn?
Output: Weng earned $9 for babysitting.
<|endoftext|>
************************


Question: Jack is stranded on a desert island. He wants some salt to season his fish. He collects 2 liters of seawater in an old bucket. If the water is 20% salt, how many ml of salt will Jack get when all the water evaporates?
Output: To find the amount of salt in the seawater, we need to multiply the volume of water by the percentage of salt. 

2 liters x 20% = 0.4 liters

To convert liters to milliliters, we need to multiply by 1000.

0.4 liters x 1000 = 400 ml

Therefore, Jack will get 400 ml of salt when all the water evaporates.
<|endoftext|>
************************


Question: John volunteers at a shelter twice a month for 3 hours at a time. How many hours does he volunteer per year?
Output: John volunteers a total of 36 hours per year (2 hours x 12 months = 24 hours per year; 24 hours x 2 months = 48 hours; 48 hours - 24 hours = 24 hours).
<|endoftext|>
************************


Question: There are 32 tables in a hall. Half the tables have 2 chairs each, 5 have 3 chairs each and the rest have 4 chairs each. How many chairs in total are in the hall?
Output: There are 32 tables in the hall.

Half of the tables have 2 chairs each, so there are 32/2 = 16 tables with 2 chairs each.

5 tables have 3 chairs each, so there are 5 x 3 = 15 tables with 3 chairs each.

The rest of the tables have 4 chairs each, so there are 32 - 16 - 5 = 11 tables with 4 chairs each.

The total number of chairs in the hall is 16 x 2 + 15 x 3 + 11 x 4 = 32 + 45 + 44 = 121.
<|endoftext|>
************************
Question: Bert fills out the daily crossword puzzle in the newspaper every day. He uses up a pencil to fill out the puzzles every two weeks. On average, it takes him 1050 words to use up a pencil. How many words are in each crossword puzzle on average?
Output: On average, it takes Bert 1050 words to use up a pencil, so he uses up a pencil every 1050/2 = 525 words.
Since he fills out

With `max_length=300`, the output for the last question was cut off before the model reached an answer, so in the following cells we repeat the prompting and also try a larger `max_length`:

In [None]:
"""Step 2: Prompting with CoT - Max length: 300"""
cot_prompt = " Try to solve this problem step by step."
for q in questions:
    answer = generate_output(model, q + cot_prompt)
    print(answer)
    print("************************")


Question: Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much did she earn? Try to solve this problem step by step.
Output: To find out how much Weng earned, we need to convert the minutes of babysitting into hours. Since there are 60 minutes in an hour, 50 minutes is equal to 50/60 = 5/6 hours. Now, we can multiply the number of hours by the hourly rate to find the total earnings. Weng earned $12/hour * 5/6 hours = $10.
<|endoftext|>
************************


Question: Jack is stranded on a desert island. He wants some salt to season his fish. He collects 2 liters of seawater in an old bucket. If the water is 20% salt, how many ml of salt will Jack get when all the water evaporates? Try to solve this problem step by step.
Output: Step 1: Convert liters to milliliters. 2 liters = 2000 ml.
Step 2: Calculate the amount of salt in the seawater. 20% of 2000 ml = 0.2 * 2000 = 400 ml.
So, Jack will get 400 ml of salt when all the water evaporates.
<|endoftext|>
************************


Question: John volunteers at a shelter twice a month for 3 hours at a time. How many hours does he volunteer per year? Try to solve this problem step by step.
Output: To find the total number of hours John volunteers per year, we need to multiply the number of hours he volunteers per month by the number of months in a year.

Step 1: Calculate the number of hours John volunteers per month.
John volunteers for 3 hours per session, and he volunteers twice a month.
3 hours/session * 2 sessions/month = 6 hours/month

Step 2: Calculate the number of hours John volunteers per year.
There are 12 months in a year.
6 hours/month * 12 months/year = 72 hours/year

Therefore, John volunteers for 72 hours per year.
<|endoftext|>
************************


Question: There are 32 tables in a hall. Half the tables have 2 chairs each, 5 have 3 chairs each and the rest have 4 chairs each. How many chairs in total are in the hall? Try to solve this problem step by step.
Output: Step 1: Find the number of tables with 2 chairs each.
Half the tables have 2 chairs each, so there are 32/2 = 16 tables with 2 chairs each.

Step 2: Find the number of tables with 3 chairs each.
There are 5 tables with 3 chairs each.

Step 3: Find the number of tables with 4 chairs each.
The remaining tables have 4 chairs each, so there are 32 - 16 - 5 = 11 tables with 4 chairs each.

Step 4: Find the total number of chairs.
The total number of chairs is the sum of the chairs in each type of table.
For the tables with 2 chairs each, there are 16 * 2 = 32 chairs.
For the tables with 3 chairs each, there are 5 * 3 = 15 chairs.
For the tables with 4 chairs each, there are 11 * 4 = 44 chairs.
So, the total number of chairs is 32 + 15 + 44 = 91 chairs.
<|endoftext|>
*******

In [None]:
"""Step 2: Prompting with CoT - Max length = 350"""
cot_prompt = " Try to solve this problem step by step."
for q in questions:
    answer = generate_output(model, q + cot_prompt, 350)
    print(answer)
    print("************************")


Question: Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much did she earn? Try to solve this problem step by step.
Output: To find out how much Weng earned, we need to convert the minutes of babysitting into hours. Since there are 60 minutes in an hour, 50 minutes is equal to 50/60 = 5/6 hours. Now, we can multiply the number of hours by the hourly rate to find the total earnings. Weng earned $12/hour * 5/6 hours = $10.
<|endoftext|>
************************


Question: Jack is stranded on a desert island. He wants some salt to season his fish. He collects 2 liters of seawater in an old bucket. If the water is 20% salt, how many ml of salt will Jack get when all the water evaporates? Try to solve this problem step by step.
Output: Step 1: Convert liters to milliliters. 2 liters = 2000 ml.
Step 2: Calculate the amount of salt in the seawater. 20% of 2000 ml = 0.2 * 2000 = 400 ml.
So, Jack will get 400 ml of salt when all the water evaporates.
<|endoftext|>
************************


Question: John volunteers at a shelter twice a month for 3 hours at a time. How many hours does he volunteer per year? Try to solve this problem step by step.
Output: To find the total number of hours John volunteers per year, we need to multiply the number of hours he volunteers per month by the number of months in a year.

Step 1: Calculate the number of hours John volunteers per month.
John volunteers for 3 hours per session, and he volunteers twice a month.
3 hours/session * 2 sessions/month = 6 hours/month

Step 2: Calculate the number of hours John volunteers per year.
There are 12 months in a year.
6 hours/month * 12 months/year = 72 hours/year

Therefore, John volunteers for 72 hours per year.
<|endoftext|>
************************


Question: There are 32 tables in a hall. Half the tables have 2 chairs each, 5 have 3 chairs each and the rest have 4 chairs each. How many chairs in total are in the hall? Try to solve this problem step by step.
Output: Step 1: Find the number of tables with 2 chairs each.
Half the tables have 2 chairs each, so there are 32/2 = 16 tables with 2 chairs each.

Step 2: Find the number of tables with 3 chairs each.
There are 5 tables with 3 chairs each.

Step 3: Find the number of tables with 4 chairs each.
The remaining tables have 4 chairs each, so there are 32 - 16 - 5 = 11 tables with 4 chairs each.

Step 4: Find the total number of chairs.
The total number of chairs is the sum of the chairs in each type of table.
For the tables with 2 chairs each, there are 16 * 2 = 32 chairs.
For the tables with 3 chairs each, there are 5 * 3 = 15 chairs.
For the tables with 4 chairs each, there are 11 * 4 = 44 chairs.
So, the total number of chairs is 32 + 15 + 44 = 91 chairs.
<|endoftext|>
*******

**Results without CoT:** Of the five questions in our prompts, the model answered only two correctly; for one of them (the Bert example), it followed incorrect steps but still arrived at the correct answer.

**Results with CoT:** CoT worked remarkably well: the model answered all of the questions correctly, and every answer was reached through correct reasoning steps. The drawback is that CoT makes the LLM's output longer than in the previous step, so a larger `max_length` is needed.

## Other Methods for Reasoning

There are many other approaches to utilize the reasoning abilities of LLMs. Describe the [Tree-of-Thought (ToT)](https://arxiv.org/abs/2305.10601) and [Self-Consistency](https://arxiv.org/abs/2203.11171) within these approaches. (3.5 Points)

 **Tree of Thoughts (ToT)**:
   - ToT is a novel approach that enhances language model (LM) inference by allowing deliberate problem solving through interconnected reasoning steps.
   - Key features of ToT:
     - **Coherent Units of Text (Thoughts)**: ToT maintains a tree structure where each node represents a coherent sequence of language (a "thought"). These thoughts serve as intermediate steps toward solving a problem.
     - **Self-Evaluation and Decision Making**: LMs using ToT can self-evaluate their progress by considering multiple reasoning paths. They deliberate on choices and decide the next course of action based on intermediate thoughts.
     - **Global Choices and Backtracking**: ToT enables LMs to look ahead or backtrack when necessary, allowing for global decisions that impact the overall problem-solving process.
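
The search loop behind ToT can be sketched on a toy problem. In the sketch below, `propose_thoughts` and `score_state` are hypothetical stand-ins for the two LLM calls (thought generation and self-evaluation), and the task (reach 10 from 1 using `+1` and `*2`) is made up for illustration. It follows the breadth-first variant, which prunes weak branches at each level; a depth-first variant would instead backtrack from dead ends.

```python
# Toy sketch of breadth-first Tree-of-Thought search.
# propose_thoughts and score_state are hypothetical stand-ins for LLM calls.

TARGET = 10


def propose_thoughts(state):
    """Return candidate next states; a state is (current value, steps so far)."""
    value, steps = state
    return [
        (value + 1, steps + ["+1"]),
        (value * 2, steps + ["*2"]),
    ]


def score_state(state):
    """Higher is better: states closer to the target get a larger score."""
    value, _ = state
    return -abs(TARGET - value)


def tree_of_thought_bfs(start, max_depth=5, beam_width=2):
    frontier = [(start, [])]
    for _ in range(max_depth):
        # Expand every kept state into its candidate thoughts.
        candidates = [nxt for state in frontier for nxt in propose_thoughts(state)]
        # Stop as soon as any candidate solves the problem.
        for value, steps in candidates:
            if value == TARGET:
                return steps
        # Keep only the most promising states; the rest are pruned.
        candidates.sort(key=score_state, reverse=True)
        frontier = candidates[:beam_width]
    return None


solution = tree_of_thought_bfs(start=1)
print(solution)
```

Because pruning is greedy, the path found is valid but not necessarily the shortest one; a wider `beam_width` keeps more alternatives alive at the cost of more evaluations.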

**Self-Consistency**:

  Self-consistency is an advanced prompting technique that builds on CoT prompting. It aims to improve on naive greedy decoding by sampling multiple diverse reasoning paths and selecting the most consistent answer. By taking a majority vote over the sampled answers, the model can arrive at more accurate and reliable results.


  To implement self-consistency, prompt engineers typically follow these steps:

- **Identify the problem:** Define the problem or question for which you require the LLM's assistance. Make sure it is clear and specific.
- **Create multiple prompts:** Develop various prompts that approach the problem from different angles or perspectives. Each prompt should provide a unique reasoning path for the AI to follow.
- **Generate responses:** Submit the prompts to the LLM and obtain the responses generated by the model.
- **Evaluate consistency:** Analyze the generated responses to determine their coherence, relevance, and consistency. This step may involve comparing the responses to each other, looking for common themes or patterns, and checking for internal logical consistency.
- **Select the best response:** Based on the evaluation, choose the most consistent and accurate response as the final answer.
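
The aggregation step can be sketched in a few lines. Here `sample_answer` is a hypothetical callable standing in for one stochastic CoT generation that returns only the final answer, and the canned answers are made up so the sketch runs without a model. In the original paper, diversity comes from sampling several completions of a single CoT prompt at a non-zero temperature; varying the prompt, as in the steps above, is another way to obtain different reasoning paths.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer, or None when there is no strict winner."""
    counts = Counter(answers).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: no single answer is most consistent
    return counts[0][0]


def self_consistency(sample_answer, question, n_samples=5):
    """Call a sampler n_samples times and aggregate the answers by majority vote.

    sample_answer is a hypothetical callable that runs one stochastic
    CoT generation and returns only the final answer as a string.
    """
    answers = [sample_answer(question) for _ in range(n_samples)]
    return majority_vote(answers), answers


# Stub sampler that replays canned answers instead of calling a model.
canned = iter(["72", "36", "72", "72", "36"])
winner, answers = self_consistency(lambda q: next(canned), "hours per year?")
print(winner, answers)
```

Returning `None` on a tie makes the "no majority" case explicit instead of silently picking one of the tied answers.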

Now, implement Self-Consistency to answer the questions of the previous section. (6 Points)

**Explanation:** I tried to do this section purely in code, but it was hard to make the model give only the final answer in a format that makes majority voting easy. I therefore did the majority voting manually by analysing the outputs. Here, I applied self-consistency by giving the model different prompts and aggregating their outputs.

**Q1:** All three prompts gave different outputs, so majority voting could not be applied.

**Q2:** By majority voting, the answer is 400 ml.

**Q3:** The true answer is 72 hours, but majority voting gives 36 hours.

**Q4:** By majority voting, the answer is 91 chairs.

**Q5:** All three prompts gave different outputs, so majority voting could not be applied.
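
One way the manual step might be partly automated is to take the last number in each output as the final answer. This is only a heuristic sketch: the function name `extract_final_number` is hypothetical, and the rule fails whenever the model appends further numbers after its answer or never states one. The three strings are the closing sentences of the Q1 outputs produced by the cell below.

```python
import re


def extract_final_number(text):
    """Return the last number in a model output as a float, or None if absent."""
    # Drop thousands separators so "1,050" is read as 1050.
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(matches[-1]) if matches else None


# Closing sentences of the three Q1 outputs in this notebook.
q1_outputs = [
    "This gives us $10. So, Weng earned $10 for babysitting.",
    "Final answer: Weng earned $9.96 for babysitting.",
    "Therefore, Weng earned $6 for babysitting.",
]
q1_answers = [extract_final_number(t) for t in q1_outputs]
print(q1_answers)
```

For Q1 the three extracted values all differ, which matches the observation that no majority exists for that question.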

In [None]:
# **** Question: diff prompts? we should recognize it by ourselves or not
"""Step 3: Prompting with CoT and Self-Consistency - Max length = 350"""
cot_prompts = [""" Try to solve this problem step by step. Then write the final answer in the last line.""",
              """Provide the answer for the problem by thinking step by step, Then write the final answer in the last line. """,
              """ Please present a soloution for the problem step by step, Then write the final answer in the last line."""]

for q in questions:
    for idx, cot_prompt in enumerate(cot_prompts):
        print(f"Iteration {idx + 1}:")
        answer = generate_output(model, q + cot_prompt, 350)
        print(answer)
    print("************************")

Iteration 1:


Question: Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much did she earn? Try to solve this problem step by step. Then write the final answer in the last line.
Output: To solve this problem, we need to find out how many hours Weng babysat and then multiply that by her hourly rate. To find out how many hours she babysat, we need to divide 50 minutes by 60 minutes per hour. This gives us 0.83 hours. Then, we need to multiply 0.83 hours by $12 per hour. This gives us $10. So, Weng earned $10 for babysitting.
<|endoftext|>
Iteration 2:


Question: Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much did she earn?Provide the answer for the problem by thinking step by step, Then write the final answer in the last line. 
Output: Step 1: Convert 50 minutes to hours by dividing by 60. 50 minutes / 60 = 0.83 hours.
Step 2: Multiply the number of hours by the hourly rate. 0.83 hours x $12 = $9.96.
Step 3: Round the answer to the nearest cent. $9.96 rounded to the nearest cent is $9.96.
Final answer: Weng earned $9.96 for babysitting.
<|endoftext|>
Iteration 3:


Question: Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much did she earn? Please present a soloution for the problem step by step, Then write the final answer in the last line.
Output: To find out how much Weng earned, we need to multiply her hourly rate by the number of hours she worked. Since she worked 50 minutes, which is half an hour, we can write:

$12 \times 0.5 = 6$

Therefore, Weng earned $6 for babysitting.
<|endoftext|>
************************
Iteration 1:


Question: Jack is stranded on a desert island. He wants some salt to season his fish. He collects 2 liters of seawater in an old bucket. If the water is 20% salt, how many ml of salt will Jack get when all the water evaporates? Try to solve this problem step by step. Then write the final answer in the last line.
Output: To find the amount of salt in the seawater, we need to multiply the volume of the water by the percentage of salt. This gives us:

2 liters x 0.2 = 0.4 liters

To convert liters to milliliters, we need to multiply by 1000. This gives us:

0.4 liters x 1000 = 400 ml

Therefore, Jack will get 400 ml of salt when all the water evaporates.
<|endoftext|>
Iteration 2:


Question: Jack is stranded on a desert island. He wants some salt to season his fish. He collects 2 liters of seawater in an old bucket. If the water is 20% salt, how many ml of salt will Jack get when all the water evaporates?Provide the answer for the problem by thinking step by step, Then write the final answer in the last line. 
Output: Step 1: Convert the volume of seawater from liters to milliliters.
2 liters = 2000 milliliters
Step 2: Calculate the amount of salt in the seawater.
20% of 2000 milliliters = 0.2 x 2000 = 400 milliliters
Step 3: The amount of salt that will be left after all the water evaporates is 400 milliliters.
Final answer: 400 milliliters
<|endoftext|>
Iteration 3:
Question: Jack is stranded on a desert island. He wants some salt to season his fish. He collects 2 liters of seawater in an old bucket. If the water is 20% salt, how many ml of salt will Jack get when all the water evaporates? Please present a soloution for the problem step by step, Then write the 

Question: John volunteers at a shelter twice a month for 3 hours at a time. How many hours does he volunteer per year? Try to solve this problem step by step. Then write the final answer in the last line.
Output: To find the total number of hours John volunteers per year, we need to multiply the number of hours he volunteers per month by the number of months in a year. We can use the following formula:

Total hours = Hours per month x Months per year

Plugging in the given values, we get:

Total hours = 3 x 12
Total hours = 36

Therefore, John volunteers 36 hours per year.
<|endoftext|>
Iteration 2:


Question: John volunteers at a shelter twice a month for 3 hours at a time. How many hours does he volunteer per year?Provide the answer for the problem by thinking step by step, Then write the final answer in the last line. 
Output: To find the total number of hours John volunteers per year, we need to multiply the number of hours he volunteers per month by the number of months in a year.

First, we need to find the number of hours he volunteers per month. We can do this by multiplying the number of times he volunteers per month by the number of hours he volunteers each time.

Number of hours per month = 2 times per month x 3 hours per time = 6 hours per month

Next, we need to find the number of months in a year. We can do this by using the fact that there are 12 months in a year.

Number of months in a year = 12 months

Finally, we need to multiply the number of hours per month by the number of months in a year to get the total number of hours per year.

Total number of hours per ye

Question: John volunteers at a shelter twice a month for 3 hours at a time. How many hours does he volunteer per year? Please present a soloution for the problem step by step, Then write the final answer in the last line.
Output: To find the total number of hours John volunteers per year, we need to multiply the number of hours he volunteers per month by the number of months in a year.

Step 1: Multiply the number of hours he volunteers per month by the number of months in a year.

3 hours/month x 12 months/year = 36 hours/year

Step 2: Write the final answer.

John volunteers 36 hours per year.
<|endoftext|>
************************
Iteration 1:


Question: There are 32 tables in a hall. Half the tables have 2 chairs each, 5 have 3 chairs each and the rest have 4 chairs each. How many chairs in total are in the hall? Try to solve this problem step by step. Then write the final answer in the last line.
Output: Step 1: Find the number of tables that have 2 chairs each.
Half the tables have 2 chairs each, so we can divide 32 by 2 to get 16 tables.

Step 2: Find the number of tables that have 3 chairs each.
5 tables have 3 chairs each, so we can multiply 5 by 3 to get 15 tables.

Step 3: Find the number of tables that have 4 chairs each.
To find the number of tables that have 4 chairs each, we can subtract the number of tables that have 2 chairs and the number of tables that have 3 chairs from the total number of tables.
32 - 16 - 15 = 3 tables

Step 4: Find the total number of chairs in the hall.
To find the total number of chairs in the hall, we can multiply the number of tables that have 2 chairs by 2, the number of tables that h



Question: There are 32 tables in a hall. Half the tables have 2 chairs each, 5 have 3 chairs each and the rest have 4 chairs each. How many chairs in total are in the hall?Provide the answer for the problem by thinking step by step, Then write the final answer in the last line. 
Output: Step 1: Find the number of tables with 2 chairs each.
Half of 32 tables is 16 tables.
Step 2: Find the number of tables with 3 chairs each.
5 tables have 3 chairs each.
Step 3: Find the number of tables with 4 chairs each.
32 tables - 16 tables - 5 tables = 11 tables have 4 chairs each.
Step 4: Find the total number of chairs in the hall.
16 tables x 2 chairs = 32 chairs
5 tables x 3 chairs = 15 chairs
11 tables x 4 chairs = 44 chairs
Total chairs = 32 + 15 + 44 = 91 chairs
Final answer: There are 91 chairs in total in the hall.
<|endoftext|>
Iteration 3:




Question: There are 32 tables in a hall. Half the tables have 2 chairs each, 5 have 3 chairs each and the rest have 4 chairs each. How many chairs in total are in the hall? Please present a soloution for the problem step by step, Then write the final answer in the last line.
Output: Step 1: Find the number of tables with 2 chairs each.
Half the tables have 2 chairs each, so there are 32/2 = 16 tables with 2 chairs each.

Step 2: Find the number of tables with 3 chairs each.
There are 5 tables with 3 chairs each.

Step 3: Find the number of tables with 4 chairs each.
The rest of the tables have 4 chairs each, so there are 32 - 16 - 5 = 11 tables with 4 chairs each.

Step 4: Find the total number of chairs.
The total number of chairs is the sum of the chairs in each type of table.
The chairs in the tables with 2 chairs each are 16 x 2 = 32.
The chairs in the tables with 3 chairs each are 5 x 3 = 15.
The chairs in the tables with 4 chairs each are 11 x 4 = 44.
The total number of chairs i



Question: Bert fills out the daily crossword puzzle in the newspaper every day. He uses up a pencil to fill out the puzzles every two weeks. On average, it takes him 1050 words to use up a pencil. How many words are in each crossword puzzle on average? Try to solve this problem step by step. Then write the final answer in the last line.
Output: Bert fills out the daily crossword puzzle in the newspaper every day. He uses up a pencil to fill out the puzzles every two weeks. On average, it takes him 1050 words to use up a pencil. How many words are in each crossword puzzle on average? Try to solve this problem step by step. Then write the final answer in the last line.

Step 1: Find the number of pencils Bert uses in a year.
Since he uses a pencil every two weeks, he uses a pencil 52/2 = 26 times in a year.

Step 2: Find the number of words Bert writes in a year.
Since he uses 1050 words to use up a pencil, he writes 1050 x 26 = 27,300 words in a year.

Step 3: Find the number of words i



Question: Bert fills out the daily crossword puzzle in the newspaper every day. He uses up a pencil to fill out the puzzles every two weeks. On average, it takes him 1050 words to use up a pencil. How many words are in each crossword puzzle on average?Provide the answer for the problem by thinking step by step, Then write the final answer in the last line. 
Output: To solve this problem, we need to find the average number of words in each crossword puzzle that Bert fills out. 
Step 1: Find the number of pencils Bert uses in a year. 
Since Bert fills out the crossword puzzle every day, he fills out 365 puzzles in a year. 
Since he uses up a pencil every two weeks, he uses up 365/14 = 26.07 pencils in a year. 
Step 2: Find the total number of words Bert uses in a year. 
Since it takes him 1050 words to use up a pencil, he uses up 26.07 x 1050 = 27,857.5 words in a year. 
Step 3: Find the average number of words in each crossword puzzle. 
To find the average number of words in each crossw

Consider LLMs' features and propose a new approach based on them to enhance LLMs' reasoning abilities. Why do you believe this approach could enhance LLMs' reasoning abilities? (4 Points)

One idea is self-verification: after solving a reasoning problem, the LLM re-evaluates its own reasoning steps and judges whether they lead to a right or a wrong answer. If it finds a mistake, it revises the solution. This approach lets LLMs use their own outputs as feedback and correct themselves without any human supervision.
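
The idea above could be sketched as a small loop. This is only an illustration under stated assumptions: `generate` is a hypothetical callable that sends a prompt to some LLM and returns its text, and the prompt wording and the `CORRECT` keyword are made up for the example.

```python
from typing import Callable


def solve_with_self_check(question: str,
                          generate: Callable[[str], str],
                          max_rounds: int = 2) -> str:
    """Answer `question`, then let the model critique and revise its own reasoning."""
    # First pass: ordinary step-by-step solution.
    solution = generate(f"Question: {question}\nSolve step by step.")
    for _ in range(max_rounds):
        # The model reviews its own steps (hypothetical prompt wording).
        verdict = generate(
            f"Question: {question}\nProposed solution:\n{solution}\n"
            "Check each step. Reply 'CORRECT' or explain the mistake."
        )
        if verdict.strip().upper().startswith("CORRECT"):
            break
        # The critique is fed back so the model can repair the solution.
        solution = generate(
            f"Question: {question}\nPrevious solution:\n{solution}\n"
            f"Reviewer feedback:\n{verdict}\nWrite a corrected step-by-step solution."
        )
    return solution
```

Any text-generation function can be passed as `generate`, for example a thin wrapper around `model.generate` and `tokenizer.decode`.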

# PEFT (30 + 5 points)

If you have any further questions or concerns, contact the TA via email: pedram.rostami@ut.ac.ir

## Why We Are Using PEFT (5 points)

In this question, we're delving into PEFT. First, let's start by exploring why PEFT is crucial when training LLMs. For instance, let's consider the scenario where we want to train the [microsoft/phi-2](https://huggingface.co/microsoft/phi-2) model. To get started, take a look at the Huggingface blog post on [model memory anatomy](https://huggingface.co/docs/transformers/en/model_memory_anatomy) to estimate how much memory we'll require. Just assume we're sticking to pure fp16 with Adam optimizer and a batch size of 1. (4 points)

Memory required for training = model weights + optimizer states + gradients + activations

**Model weights:**

- pure fp16/bf16: 2 bytes per parameter
- 2.7 billion * 2 bytes = 5.4 GB

**Optimizer states:**
- 8 bytes per parameter for standard AdamW (it maintains 2 states)
- 8 bytes * 2.7 billion = 21.6 GB

**Gradients:**
- 4 bytes per parameter for either fp32 or mixed-precision training (gradients are always kept in fp32)
- 4 bytes * 2.7 billion = 10.8 GB

**Activations:**
- The size depends on many factors, the key ones being sequence length, hidden size, and batch size.
- Rough estimate: s * b * h * L bytes
- **s**: sequence length, 1024 in our case
- **b**: batch size per GPU, 1 in our case
- **h**: hidden size of the transformer, 2560 for phi-2 (`hidden_size` in the model config)
- **L**: number of transformer layers, 32 decoder layers for phi-2 (`num_hidden_layers` in the model config)
- 1024 * 1 * 2560 * 32 = 83,886,080 bytes (about 0.08 GB)
- This term is hard to compute precisely, so I do not expect this estimate to be accurate.
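
The parameter-proportional terms above can be tallied with a few lines of Python. This is a minimal sketch, assuming 2.7 billion parameters and the bytes-per-parameter figures quoted above; activations are left out because they depend on the sequence length and architecture rather than on the parameter count alone. The function name is made up for the example.

```python
def training_memory_gb(n_params: float,
                       weight_bytes: int = 2,
                       grad_bytes: int = 4,
                       optim_bytes: int = 8) -> dict:
    """Estimate training memory (in GB, 1 GB = 1e9 bytes) excluding activations."""
    gb = 1e9
    parts = {
        "weights": n_params * weight_bytes / gb,
        "gradients": n_params * grad_bytes / gb,
        "optimizer": n_params * optim_bytes / gb,
    }
    parts["total_without_activations"] = sum(parts.values())
    return parts


estimate = training_memory_gb(2.7e9)
for name, value in estimate.items():
    print(f"{name}: {value:.1f} GB")
# weights: 5.4 GB
# gradients: 10.8 GB
# optimizer: 21.6 GB
# total_without_activations: 37.8 GB
```

Changing `grad_bytes` and `optim_bytes` shows how strongly the total depends on the precision assumed for gradients and optimizer states.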

Compare your estimation with the memory estimation provided by the [Model Memory Calculator](https://huggingface.co/spaces/hf-accelerate/model-memory-usage). (1 point)

**Memory usage for microsoft/phi-2:**
- Total Size: 4.94 GB
- Training using Adam (Peak vRAM): 19.76 GB

**Training using Adam explained:**
* model: 9.88 GB
* gradient calculation: 14.82 GB
* backward pass: 19.76 GB
* optimizer step: 19.76 GB

**Comparison:**
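- Our optimizer estimate alone (21.6 GB) is higher than the calculator's reported peak (19.76 GB), while for the model and gradient stages the calculator reports larger numbers (9.88 GB and 14.82 GB) than ours (5.4 GB and 10.8 GB).
- The calculator's peak is exactly four times its model size (4 * 4.94 GB = 19.76 GB), which suggests it counts the gradients and both Adam states at the same 2 bytes per parameter as the weights, whereas our estimate uses 4 and 8 bytes per parameter for them.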

****

## Preparing Dataset (5 points)

We're going to train the phi-2 model for a question generation task based on passages. For this purpose, we're using the Super-NaturalInstruction dataset, which comprises instruction tuning datasets for over 1600 tasks across different languages. While the dataset is available on the [Huggingface Hub](https://huggingface.co/datasets/Muennighoff/natural-instructions), downloading all its components consumes considerable time. Consequently, we're opting to download only the English Question Generation segment.

In [None]:
!wget https://huggingface.co/datasets/Muennighoff/natural-instructions/resolve/main/train/task001_quoref_question_generation_train.jsonl


Read the dataset file and convert it into a `dataset` object. Then, split the dataset, selecting 95% for the training set and 5% for the test set. (5 points)

In [None]:
# WRITE YOUR CODE HERE
import json
with open('/content/task001_quoref_question_generation_train.jsonl') as f:
    data = [json.loads(line) for line in f]

In [None]:
data[0]

{'task_name': 'task001_quoref_question_generation',
 'id': 'task001-f44801d948324957abe71877d837d070',
 'definition': "In this task, you're given passages that contain mentions of names of people, places, or things. Some of these mentions refer to the same person, place, or thing. Your job is to write questions that evaluate one's understanding of such references. Good questions are expected to link pronouns (she, her, him, his, their, etc.) or other mentions to people, places, or things to which they may refer. Do not ask questions that can be answered correctly without understanding the paragraph or having multiple answers. Avoid questions that do not link phrases referring to the same entity. For each of your questions, the answer should be one or more phrases in the paragraph, and it should be unambiguous.",
 'targets': 'What is the first name of the person who doubted it would turn out to be a highly explosive eruption like those that can occur in subduction-zone volcanoes?'}

In [None]:
%%capture
!pip install datasets

In [None]:
from datasets import Dataset

# Keep only the fields used later and build a Hugging Face `Dataset` from the records.
fields = ("id", "definition", "inputs", "targets")
dataset = Dataset.from_list([{key: example[key] for key in fields} for example in data])

# Hold out 5% of the examples as the test set; the fixed seed makes the split reproducible.
split = dataset.train_test_split(test_size=0.05, seed=42)
train_dataset, test_dataset = split["train"], split["test"]

## Pretrained Model (5 points)

Choose random samples from the test set, apply the [Alpaca template](https://github.com/tatsu-lab/stanford_alpaca?tab=readme-ov-file#data-release) to them, and obtain the model outputs (If you are using the [sample code](https://huggingface.co/microsoft/phi-2#sample-code) provided by Microsoft for using the model, please comment out the `torch.set_default_device("cuda")` line to conserve memory. Instead, you can move the model to the GPU using the `.to` function after loading it.). (5 points)

In [None]:
# WRITE YOUR CODE HERE
import random
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", torch_dtype=torch.float16).to(device)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)

def alpaca_generate(model, instruction, input_text):
    # Alpaca-style prompt: instruction, input, then an open response slot for the model to fill.
    template = f"Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:{instruction}\n\n### Input:{input_text}\n\n### Response:"
    inputs = tokenizer(template, return_tensors="pt", return_attention_mask=False).to(device)
    outputs = model.generate(**inputs, max_length=850)
    text = tokenizer.batch_decode(outputs)[0]
    return text


# Pick five distinct random test examples; the indices are kept so the same inputs can be reused later.
sample_indices = random.sample(range(len(test_dataset)), 5)
for idx in sample_indices:
    instruction, input_text = test_dataset[idx]['definition'], test_dataset[idx]['inputs']
    output = alpaca_generate(model, instruction, input_text)
    print(output)
    print("=============")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:In this task, you're given passages that contain mentions of names of people, places, or things. Some of these mentions refer to the same person, place, or thing. Your job is to write questions that evaluate one's understanding of such references. Good questions are expected to link pronouns (she, her, him, his, their, etc.) or other mentions to people, places, or things to which they may refer. Do not ask questions that can be answered correctly without understanding the paragraph or having multiple answers. Avoid questions that do not link phrases referring to the same entity. For each of your questions, the answer should be one or more phrases in the paragraph, and it should be unambiguous.

### Input:Passage: Although "Amazing Grace" set to "New Britain" was popular, other versions existed regionally. Primitiv



Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:In this task, you're given passages that contain mentions of names of people, places, or things. Some of these mentions refer to the same person, place, or thing. Your job is to write questions that evaluate one's understanding of such references. Good questions are expected to link pronouns (she, her, him, his, their, etc.) or other mentions to people, places, or things to which they may refer. Do not ask questions that can be answered correctly without understanding the paragraph or having multiple answers. Avoid questions that do not link phrases referring to the same entity. For each of your questions, the answer should be one or more phrases in the paragraph, and it should be unambiguous.

### Input:Passage: The film begins by introducing Kellyanne Williamson, playing with imaginary friends Pobby and Dingan. 



Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:In this task, you're given passages that contain mentions of names of people, places, or things. Some of these mentions refer to the same person, place, or thing. Your job is to write questions that evaluate one's understanding of such references. Good questions are expected to link pronouns (she, her, him, his, their, etc.) or other mentions to people, places, or things to which they may refer. Do not ask questions that can be answered correctly without understanding the paragraph or having multiple answers. Avoid questions that do not link phrases referring to the same entity. For each of your questions, the answer should be one or more phrases in the paragraph, and it should be unambiguous.

### Input:Passage: Bill Whitney seems to have it all.  His family is wealthy and he lives in a mansion in Beverly Hills, 



Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:In this task, you're given passages that contain mentions of names of people, places, or things. Some of these mentions refer to the same person, place, or thing. Your job is to write questions that evaluate one's understanding of such references. Good questions are expected to link pronouns (she, her, him, his, their, etc.) or other mentions to people, places, or things to which they may refer. Do not ask questions that can be answered correctly without understanding the paragraph or having multiple answers. Avoid questions that do not link phrases referring to the same entity. For each of your questions, the answer should be one or more phrases in the paragraph, and it should be unambiguous.

### Input:Passage: Writing in 1952, after the first section of the buildings was complete, J. M. Richards, the editor of 

In [None]:
del output

## Fine-tuning with LoRA (15 + 5 points)

In this phase, we're fine-tuning the phi-2 model on a question generation dataset. To begin, we need to format our dataset into the instruction tuning format. For this task, we can employ `DataCollatorForCompletionOnlyLM`. Look at the [example](https://huggingface.co/docs/trl/en/sft_trainer#train-on-completions-only) in the HuggingFace documentation and instantiate the data collator using the Alpaca template. (3 points)

In [None]:
# WRITE YOUR CODE HERE
from datasets import Dataset

instructions = [entry['definition'] for entry in train_dataset]
inputs = [entry['inputs'] for entry in train_dataset]
outputs = [entry['targets'] for entry in train_dataset]

# Dataset dict
data_dict = {
                "instruction": instructions,
                "input": inputs,
                "output": outputs,
            }

training_dataset = Dataset.from_dict(data_dict)

In [None]:
%%capture
!pip install trl

In [None]:
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM

def formatting_prompts_func(example):
    # Build one training string per example: instruction, input, then the target question.
    output_texts = []
    for i in range(len(example['instruction'])):
        text = f"### Instruction: {example['instruction'][i]}\n ### Input: {example['input'][i]}\n ### Response: {example['output'][i]}"
        output_texts.append(text)
    return output_texts

# The collator masks every token up to and including this marker, so the loss covers only the response.
response_template = " ### Response:"
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)
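
To make the effect of a completion-only collator concrete, here is a plain-Python sketch of the masking idea. It is not the TRL implementation: the token ids are toy values, `mask_prompt_tokens` is a hypothetical helper, and the only assumption carried over is that labels set to -100 are skipped by the loss.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_prompt_tokens(input_ids, response_ids):
    """Copy input_ids into labels and ignore everything up to the end of the response marker."""
    labels = list(input_ids)
    n = len(response_ids)
    for start in range(len(input_ids) - n + 1):
        if input_ids[start:start + n] == response_ids:
            end = start + n
            labels[:end] = [IGNORE_INDEX] * end
            return labels
    # Marker not found: ignore the whole example so it contributes no loss.
    return [IGNORE_INDEX] * len(input_ids)


toy_ids = [11, 12, 13, 90, 91, 21, 22]   # prompt tokens, marker (90, 91), response tokens
toy_marker = [90, 91]
print(mask_prompt_tokens(toy_ids, toy_marker))
# [-100, -100, -100, -100, -100, 21, 22]
```

Only the two response tokens keep their labels, so gradient updates come from the generated question and not from the instruction or passage.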

Refer to the HuggingFace [documentation](https://huggingface.co/docs/trl/en/sft_trainer#training-adapters) and instantiate the Lora config. (3 points)

In [None]:
%%capture
!pip install peft

In [None]:
# WRITE YOUR CODE HERE
from peft import LoraConfig, get_peft_model

peft_config = LoraConfig(
    r=4,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

trainable params: 4,587,520 || all params: 2,784,271,360 || trainable%: 0.16476554928898884
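
The trainable-parameter count can be checked by hand, since a rank-`r` LoRA adapter on a linear layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` parameters. The sketch below rests on assumptions that are not stated in the notebook: that the adapted modules are `q_proj`, `v_proj`, `fc1` and `fc2`, and that phi-2 has hidden size 2560, MLP size 10240 and 32 layers. These values were chosen because they reproduce the logged count.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    # LoRA adds A with shape (r, d_in) and B with shape (d_out, r).
    return r * (d_in + d_out)


hidden, mlp, layers, r = 2560, 10240, 32, 4  # assumed phi-2 dimensions, r from the LoraConfig

per_layer = (
    2 * lora_param_count(hidden, hidden, r)  # q_proj and v_proj (assumed targets)
    + lora_param_count(hidden, mlp, r)       # fc1 (assumed target)
    + lora_param_count(mlp, hidden, r)       # fc2 (assumed target)
)
total = per_layer * layers
print(total)                                 # 4587520
print(f"{100 * total / 2_784_271_360:.4f}%")  # 0.1648%
```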


Configure other training arguments. [Here](https://huggingface.co/docs/transformers/v4.40.1/en/main_classes/trainer#transformers.TrainingArguments) is a list of available options. Consider using a small batch size to prevent CUDA out of memory errors. You can augment batch size artificially through gradient accumulation. Enabling gradient checkpointing can further save memory. You may train the model for tens of steps. (3 points)

In [None]:
%%capture
!pip install -U transformers

In [None]:
# WRITE YOUR CODE HERE
from transformers import TrainingArguments

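# Set a small batch size to avoid OOM errors
batch_size = 1

# Increase the effective batch size through gradient accumulation:
# effective batch = batch_size * gradient_accumulation_steps = 2048 examples per optimizer step,
# i.e. about 10 optimizer steps for one epoch over the 20,726 training examples.
gradient_accumulation_steps = 2048
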

training_args = TrainingArguments(
    output_dir='./output',
    num_train_epochs=1,
    per_device_train_batch_size=batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    gradient_checkpointing=True,
)


Take a look at the HuggingFace [documentation](https://huggingface.co/docs/trl/en/sft_trainer) on supervised fine-tuning trainers. Instantiate the trainer and train the model (note that you should initialize the phi-2 model with `bfloat16` or `float16` dtype to avoid CUDA out-of-memory errors). (3 points)

**Attention:** Training was so time-consuming that my GPU time ran out, so I could not merge the model weights or test the model with inputs.

In [None]:
# WRITE YOUR CODE HERE
tokenizer.pad_token = tokenizer.eos_token

trainer = SFTTrainer(
    model,
    args=training_args,
    peft_config=peft_config,
    train_dataset=training_dataset,
    formatting_func=formatting_prompts_func,
    data_collator=collator,
    max_seq_length=1024,
)

trainer.train()

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Map:   0%|          | 0/20726 [00:00<?, ? examples/s]

 ### Input: Passage: Music critic J. D. Considine noted "on albums, Jackson's sound isn't defined by her voice so much as by the way her voice is framed by the lush, propulsive production of Jimmy Jam and Terry Lewis." Wendy Robinson of PopMatters said "the power of Janet Jackson's voice does not lie in her pipes. She doesn't blow, she whispers... Jackson's confectionary vocals are masterfully complemented by gentle harmonies and balanced out by pulsing rhythms, so she's never unpleasant to listen to."Matthew Perpetus of Fluxblog suggested Jackson's vocal techniques as a study for indie rock music, considering it to possess "a somewhat subliminal effect on the listener, guiding and emphasizing dynamic shifts without distracting attention from its primal hooks." Perpetus added: "Her voice effortlessly transitions from a rhythmic toughness to soulful emoting to a flirty softness without overselling any aspect of her performance... a continuum of emotions and attitudes that add up to the 

Step,Training Loss


 ### Input: Passage: Loder moved to St. John's in January 2010, and performed alongside Starfield and Roy Martin later that year at the Exploits Valley Salmon Festival gospel concert in Grand Falls-Windsor. She also performed at the 2010 One Worship Festival in Springdale, and officially released Imperfections & Directions, another independent release, at YC Newfoundland that October. Loder's nursing studies hampered her ability to showcase this album by touring. A reporter for The Telegram, a St. John's-based newspaper, noted that Imperfections & Directions "demonstrates how Loder wears her faith and love of God on her sleeve." Loder was nominated as Female Artist of the Year at the 2010 MusicNL awards with Mary Barry; Teresa Ennis; Irene Bridger; and Amelia Curran, the eventual winner. Loder was nominated for another MusicNL award the following year, this time in the Gospel Artist of the Year category; this nomination was, in part, due to Imperfections & Directions.In early 2012, Lod

Step,Training Loss


 ### Input: Passage: When Alexander the Great died at Babylon in 323 BC, his mother Olympias immediately accused Antipater and his faction of poisoning him, although there is no evidence to confirm this. With no official heir apparent, the Macedonian military command split, with one side proclaiming Alexander's half-brother Philip III Arrhidaeus (r. 323 – 317 BC) as king and the other siding with the infant son of Alexander and Roxana, Alexander IV (r. 323 – 309 BC). Except for the Euboeans and Boeotians, the Greeks also immediately rose up in a rebellion against Antipater known as the Lamian War (323–322 BC). When Antipater was defeated at the 323 BC Battle of Thermopylae, he fled to Lamia where he was besieged by the Athenian commander Leosthenes. A Macedonian army led by Leonnatus rescued Antipater by lifting the siege. Antipater defeated the rebellion, yet his death in 319 BC left a power vacuum wherein the two proclaimed kings of Macedonia became pawns in a power struggle between 

Get the final model from the trainer and merge the Lora weights with it. Then, test the model with the inputs you gave to the pretrained model and compare the results. (3 points)

In [None]:
# WRITE YOUR CODE HERE
# Take the trained PEFT model from the trainer and merge the LoRA weights into the base model
merged = trainer.model.merge_and_unload()
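
What merging does to a single linear layer can be shown with tiny matrices: the adapter update `(lora_alpha / r) * B @ A` is added to the frozen weight `W`, after which the adapter is no longer needed. This is a toy illustration in plain Python with made-up integer values, not the PEFT implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


W = [[1, 0], [0, 1]]      # frozen base weight, shape (d_out, d_in)
A = [[1, 2]]              # LoRA A, shape (r, d_in) with r = 1
B = [[1], [3]]            # LoRA B, shape (d_out, r)
r, lora_alpha = 1, 2
scale = lora_alpha / r

# Merged weight: W + scale * (B @ A)
BA = matmul(B, A)
W_merged = [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]

x = [[1], [1]]            # one input as a column vector
base_plus_adapter = [
    [wx[0] + scale * bax[0]]
    for wx, bax in zip(matmul(W, x), matmul(B, matmul(A, x)))
]
print(matmul(W_merged, x))    # [[7.0], [19.0]]
print(base_plus_adapter)      # [[7.0], [19.0]]
```

Both paths give the same output, which is why the merged model can be used for inference without the adapter modules.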

\# WRITE YOUR ANSWER HERE

We know that fine-tuning LLMs on Colab or Kaggle notebooks can be a bit tricky, and fine-tuning phi-2 for this task may require more GPU hours. The main point of this question is to teach you how to train your model using HuggingFace packages. So, it's okay if your model doesn't produce optimal results. However, there are 5 additional points available if it can generate better results :)

# RAG (50 points)

If you have any further questions or concerns, contact the TA via email: alisalemi@ut.ac.ir

## Install Requirements

In [1]:
%pip install -q langchain
%pip install -q ctransformers
%pip install -q sentence_transformers
%pip install -q datasets
%pip install -q rank_bm25
%pip install -q faiss-gpu
%pip install -q arxiv
%pip install -q pymupdf


## 1. An Overview of LangChain (10 pt)

LangChain is an open-source framework designed to simplify the creation of applications using LLMs. It provides a standard interface for chains, lots of integrations with other tools, and end-to-end chains for common applications.

In this overview, we will provide a step-by-step guide on how to construct a basic application using LangChain. This application will fetch country-related information from a Large Language Model. For this purpose, we will be utilizing the LLaMa 2 chat 7B as our base model.

In [1]:
from langchain_community.llms import CTransformers

model = CTransformers(
  model="TheBloke/Llama-2-7B-Chat-GGUF",
  model_file="llama-2-7b-chat.Q8_0.gguf",
  model_type="llama",
  config={
    "gpu_layers": 50
  }
)


### 1.1 GGUF Format (3 pt)

Write a brief paragraph discussing the GGUF format and its benefits. Compare it with the transformers library.


**GGUF**:
   - **Definition**: GGUF is a binary file format from the GGML/llama.cpp project for storing models intended for inference. A single file holds both the tensors and the metadata needed to load the model.
   - **Key Features**:
     - **Quantization**: GGUF files commonly store quantized weights, meaning the weights are represented at reduced precision (e.g., 4-bit or 8-bit quantization). This reduces memory requirements and allows efficient inference on CPUs.
     - **Compact Size**: Quantized GGUF files are smaller than their full-precision counterparts, making them more manageable for deployment.
     - **CPU-Friendly**: GGUF runtimes are optimized for CPU inference and can optionally offload layers to a GPU (as the `gpu_layers` setting above does), making them suitable for scenarios where GPU resources are limited.

**Comparison**:
   - **Use Case**:
     - **GGUF**: Inference only, optimized for CPUs and suitable for resource-constrained environments.
     - **Transformers**: A versatile library for training, fine-tuning, and inference, widely used for both research and production tasks.
   - **Model Size**:
     - **GGUF**: Smaller when the weights are quantized.
     - **Transformers**: Larger, since checkpoints are typically stored in 16-bit or 32-bit precision, especially for state-of-the-art models.
   - **Deployment**:
     - **GGUF**: Easy to deploy on CPUs.
     - **Transformers**: Usually requires GPU resources for good performance with large models.
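
The size difference can be estimated with simple arithmetic. The sketch below is a rough estimate under stated assumptions: about 6.74 billion parameters for a "7B" Llama 2 model, and block-wise quantization where each block of 32 weights carries one 16-bit scale, which gives roughly 8.5 bits per weight for `Q8_0` and 4.5 bits per weight for `Q4_0`. Real files differ somewhat because of metadata and because not every tensor uses the same type.

```python
def quantized_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate storage for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


n_params = 6.74e9  # assumed parameter count for a "7B" Llama 2 model

# Bits per weight: fp16 is exact; the quantized values are assumptions (see above).
for name, bits in [("fp16", 16.0), ("Q8_0", 8.5), ("Q4_0", 4.5)]:
    print(f"{name}: {quantized_size_gb(n_params, bits):.2f} GB")
```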



### 1.2 Simple Chain (2 pt)

Complete the next cell to create a simple chain that takes the name of a country as input and outputs its capital. To accomplish this, you should utilize the `HumanMessagePromptTemplate` and `AIMessagePromptTemplate` classes to formulate an effective prompt.

In [None]:
from langchain_core.prompts import ChatPromptTemplate, HumanMessagePromptTemplate, AIMessagePromptTemplate
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_messages([
  HumanMessagePromptTemplate.from_template("What is the capital of {country}?"),
  AIMessagePromptTemplate.from_template("The capital of {country} is ")
])

output_parser = StrOutputParser()

simple_chain = prompt | model | output_parser

answer = simple_chain.invoke({"country": "Iran"})

print(answer)


Tehran.


Write about the objectives behind the creation of the `HumanMessagePromptTemplate` and `AIMessagePromptTemplate` classes. What do they actually do? Write a brief description.

**`HumanMessagePromptTemplate`**:
- Represents a message sent by the user.

- It is a template: placeholders such as `{country}` are filled in at run time, and the result becomes a human message in the chat prompt.

- Like other LangChain prompt classes, it is a Pydantic model, so its constructor parses and validates its keyword arguments and raises `ValidationError` if they are invalid. ("Model" here means the Pydantic data model, not the LLM.)

- The objective behind the creation of the `HumanMessagePromptTemplate` class is to provide a template for crafting messages that simulate human input or conversation in the prompt.

**`AIMessagePromptTemplate`**:
- Represents a message sent by the AI.

- Like `HumanMessagePromptTemplate`, it is a Pydantic model that validates its arguments and fills in placeholders at run time.

- The `AIMessagePromptTemplate` class aims to provide a template for AI-side messages that provide context (for example, earlier turns or few-shot answers) or guide the model in generating its response.

What is the purpose of adding an empty `AIMessagePromptTemplate` at the end of the prompt? What are the consequences of omitting it?

The purpose of adding an empty (or partially filled) `AIMessagePromptTemplate` at the end of the prompt is to mark the start of the AI's turn, so that the model continues from that point as the assistant. Let's explain this in more detail:

**Purpose of Empty AIMessagePromptTemplate**:
   - When constructing a chat prompt, we alternate between human messages and AI messages.
   - For a plain-text completion model, the chat messages are flattened into a single string with role prefixes. Ending with an AI message means the prompt ends at the beginning of the AI's turn, which cues the model to write the answer next.
   - A partially filled AI message, such as "The capital of {country} is ", goes one step further and steers the model to complete exactly that sentence.

**Consequences of Omitting It**:
   - If we omit the trailing `AIMessagePromptTemplate`:
     - The prompt ends with the human message, so nothing tells a completion model that the AI's turn has started.
     - The model still generates text, but it may continue the human's message, invent additional dialogue turns, or add role labels instead of answering directly.
     - The output becomes less predictable and harder to parse, especially when a specific format is expected.
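
The effect of the trailing AI message can be illustrated with a small stand-in. The sketch below assumes that chat messages are flattened into one role-prefixed string before reaching a plain-text model. `render_chat` is a hypothetical helper written for this illustration, not LangChain's actual code.

```python
def render_chat(messages: list[tuple[str, str]]) -> str:
    """Flatten (role, text) pairs into one role-prefixed prompt string."""
    return "\n".join(f"{role}: {text}" for role, text in messages)


country = "Iran"

# Prompt that ends at the start of the AI's turn
with_ai_prefix = render_chat([
    ("Human", f"What is the capital of {country}?"),
    ("AI", f"The capital of {country} is "),
])

# Prompt that ends with the human message
without_ai_prefix = render_chat([
    ("Human", f"What is the capital of {country}?"),
])

print(repr(with_ai_prefix))
print(repr(without_ai_prefix))
```

In the first string the most natural continuation is the name of the capital. In the second, a completion model has no signal about whose turn comes next.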

### 1.3 JSON Chain (5 pt)

Now we want to improve the chain to extract data from the model response. Modify the existing prompt to request information about a country's name, population, and major cities in addition to the capital. Additionally, incorporate a `SystemMessagePromptTemplate` to ensure the model's response is structured in JSON format. Keep in mind that a distinct parser is required to parse the JSON output.

In [None]:
from langchain_core.prompts import ChatPromptTemplate, HumanMessagePromptTemplate, AIMessagePromptTemplate, SystemMessagePromptTemplate
from langchain_core.output_parsers import JsonOutputParser

# Create a chat prompt template
prompt = ChatPromptTemplate.from_messages([
    SystemMessagePromptTemplate.from_template("""You are a useful AI Assistant.
    Return the response in json format . The keys in the response are:country, capital, population, and cities.
     Do not include any additional text in the answer."""),
    HumanMessagePromptTemplate.from_template("Provide information about {country}: capital, population, and major cities."),
    AIMessagePromptTemplate.from_template(" "),

])

# Instantiate a JSON output parser
output_parser = JsonOutputParser()

# Create the JSON chain
json_chain = prompt | model | output_parser

# Batch process multiple countries
answers = json_chain.batch([
    {"country": "Iran"},
    {"country": "USA"},
    {"country": "Japan"},
    {"country": "Nigeria"}
])

# Print the extracted information
for ans in answers:
    print(f"{ans['country']}:")
    print(f"  capital: {ans['capital']}")
    print(f"  population: {ans['population']}")
    print(f"  important cities: {ans['cities']}")


Iran:
  capital: Tehran
  population: 820090400
  important cities: ['Tehran', 'Mashhad', 'Isfahan', 'Karaj', 'Shiraz']
USA:
  capital: Washington D.C.
  population: 327059148
  important cities: [{'name': 'New York City', 'population': 84258631}, {'name': 'Los Angeles', 'population': 39927183}, {'name': 'Chicago', 'population': 27223810}]
Japan:
  capital: Tokyo
  population: 127058000
  important cities: ['Tokyo', 'Osaka', 'Nagoya', 'Sapporo', 'Yokohama']
Nigeria:
  capital: Abuja
  population: 1900340241
  important cities: ['Lagos', 'Kano', 'Ibadan', 'Port Harcourt']
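
The outputs above show the model returning `cities` in two different shapes: a list of strings for most countries, and a list of dictionaries for the USA. A minimal sketch of a post-processing step that normalizes both shapes is given below. `normalize_cities` and the sample JSON string are hypothetical and only illustrate the idea.

```python
import json


def normalize_cities(cities):
    """Return a flat list of city names from either strings or {'name': ...} dicts."""
    names = []
    for city in cities:
        if isinstance(city, dict):
            names.append(str(city.get("name", "")))
        else:
            names.append(str(city))
    # Drop empty names that come from dicts without a "name" key
    return [name for name in names if name]


# Hypothetical model response mixing both shapes
raw = '{"country": "USA", "capital": "Washington D.C.", "cities": [{"name": "New York City"}, "Chicago"]}'
record = json.loads(raw)
record["cities"] = normalize_cities(record["cities"])
print(record["cities"])  # ['New York City', 'Chicago']
```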


## 2. Different Types of Retrievers (15 pt)

In this section, we use the mini-bioasq dataset to evaluate different types of retrievers.

In [None]:
import json
from datasets import load_dataset

corpus = load_dataset("rag-datasets/mini-bioasq", "text-corpus", split="passages")
qa_dataset = load_dataset("rag-datasets/mini-bioasq", "question-answer-passages", split="test[:100]")

qa_dataset = qa_dataset.map(lambda data: {
  "relevant_passage_ids": json.loads(data["relevant_passage_ids"])
})

print(corpus)
print(qa_dataset)



Dataset({
    features: ['passage', 'id'],
    num_rows: 40221
})
Dataset({
    features: ['question', 'answer', 'relevant_passage_ids', 'id'],
    num_rows: 100
})


In [None]:
corpus[0]

{'passage': 'New data on viruses isolated from patients with subacute thyroiditis de Quervain \nare reported. Characteristic morphological, cytological, some physico-chemical \nand biological features of the isolated viruses are described. A possible role \nof these viruses in human and animal health disorders is discussed. The isolated \nviruses remain unclassified so far.',
 'id': 9797}

### 2.1 Evaluate Retriever (4 pt)

To effectively compare various retrieval systems, we must define a metric. Complete the `evaluate_retriever` function to measure the accuracy of the retrieved documents. Consider the `relevant_passage_ids` column as the expected documents to be retrieved.

In [None]:
def evaluate_retriever(retriever, retriever_type):
    """Return precision@k, computed over all retrieved documents for all queries in qa_dataset."""
    total_queries = len(qa_dataset)
    total_relevant_docs_retrieved = 0
    total_retrieved_docs = 0

    for idx in range(total_queries):
        query = qa_dataset[idx]['question']
        relevant_passage_ids = qa_dataset[idx]['relevant_passage_ids']

        if retriever_type == "TF-IDF":
            docs = retriever.invoke(query)
        elif retriever_type == "Semantic":
            # The TF-IDF retriever returns 5 documents, so the semantic retriever uses k=5 as well
            docs = retriever.similarity_search(query, k=5)
        else:
            raise ValueError(f"Unknown retriever type: {retriever_type}")

        retrieved_doc_ids = [doc.metadata['document_id'] for doc in docs]

        # Count how many of the retrieved documents are relevant
        relevant_retrieved_count = sum(1 for doc_id in retrieved_doc_ids if doc_id in relevant_passage_ids)

        total_relevant_docs_retrieved += relevant_retrieved_count
        total_retrieved_docs += len(retrieved_doc_ids)

    # Precision at k: relevant retrieved documents divided by all retrieved documents
    precision_at_k = total_relevant_docs_retrieved / total_retrieved_docs if total_retrieved_docs > 0 else 0
    return precision_at_k
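
To make the metric concrete, here is a small worked example for a single query. The document IDs are made up for illustration, and `precision_at_k` is a simplified single-query version of the function above.

```python
def precision_at_k(retrieved_ids, relevant_ids):
    """Fraction of the retrieved documents that are relevant."""
    if not retrieved_ids:
        return 0.0
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in retrieved_ids if doc_id in relevant)
    return hits / len(retrieved_ids)


retrieved = [9797, 12, 35, 40, 77]  # hypothetical top-5 results
relevant = [12, 77, 500]            # hypothetical gold passage IDs

print(precision_at_k(retrieved, relevant))  # 2 hits out of 5 retrieved -> 0.4
```

Note that the relevant passage 500 was never retrieved, but precision does not penalize that. A recall-style metric would be needed to capture missed passages.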


### 2.2 TF-IDF Retriever (3 pt)

Create a TF-IDF retriever and configure it to return the top 5 relevant documents.

In [None]:
from langchain_core.documents import Document
from langchain_community.retrievers import TFIDFRetriever


# Define documents as a list of Document objects.
# Iterating over rows avoids re-reading the whole column on every step.
documents = [
  Document(page_content=row["passage"], metadata={"document_id": row["id"]})
  for row in corpus
]

# Define TFIDF retriever with k=5 (number of retrieved documents)
retriever = TFIDFRetriever.from_documents(documents, k=5)

### 2.3 Semantic Retriever (5 pt)

Semantic retrievers operate by retrieving documents through embeddings. These systems require an embedding model to convert documents into a vector space, and a vector database to find the closest documents to a query. Construct a semantic retriever that utilizes [`intfloat/e5-base`](https://huggingface.co/intfloat/e5-base) as the embedding model and FAISS for the vector database.

In [None]:
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

model_path = "intfloat/e5-base"

embedding_model = HuggingFaceEmbeddings(model_name=model_path)
semantic_retriever = FAISS.from_documents(documents, embedding_model)
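
The nearest-neighbour search that a vector store performs can be sketched in a few lines. The example below uses made-up 3-dimensional vectors and cosine similarity purely for illustration. Real embeddings have hundreds of dimensions, and a FAISS index may use a different distance such as L2 or inner product.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the IDs of the k documents most similar to the query."""
    scored = sorted(
        doc_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in scored[:k]]


# Hypothetical embeddings, invented for this example
toy_vectors = {
    "doc_thyroid": [0.9, 0.1, 0.0],
    "doc_virus": [0.7, 0.6, 0.1],
    "doc_finance": [0.0, 0.1, 0.9],
}
query = [0.8, 0.3, 0.0]

print(top_k(query, toy_vectors, k=2))  # ['doc_thyroid', 'doc_virus']
```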


### 2.4 Compare Retrievers (3 pt)

Calculate the score for each retriever using the `evaluate_retriever` function you wrote previously. In this question, which one outperforms the other? Illustrate a scenario for each retriever in which it outperforms the other.

1. **TF-IDF Retriever:**
   - **Scenario**: Information Retrieval in a Document Search Engine
     - **Use Case**: Imagine a large-scale document search engine (like Google Search) where users enter queries to find relevant documents.
     - **Advantages of TF-IDF**:
       - **Term Frequency-Inverse Document Frequency (TF-IDF)** is effective for keyword-based retrieval.
       - It assigns weights to terms based on their frequency in a document relative to their frequency across all documents.
       - In scenarios where users search for specific keywords or phrases, TF-IDF can quickly identify relevant documents.
       - It's computationally efficient and works well for simple queries.
     - **Example**:
       - A user searches for "machine learning algorithms." TF-IDF retrieves documents containing these exact terms efficiently.

2. **Semantic Retriever:**
   - **Scenario**: Contextual Search in Conversational AI
     - **Use Case**: Consider a chatbot or virtual assistant that engages in natural language conversations with users.
     - **Advantages of Semantic Retrieval**:
       - **Semantic retrievers** capture deeper meaning by considering context and semantic similarity.
       - They understand synonyms, related concepts, and contextually similar phrases.
       - In conversational scenarios, where users express queries naturally, semantic retrievers excel.
       - They can handle paraphrasing and variations in user input.
     - **Example**:
       - A user asks, "What are the benefits of deep learning?" A semantic retriever can identify relevant documents even if they don't contain the exact phrase "deep learning."


In our experiment, the semantic retriever outperforms the TF-IDF retriever (0.54 vs. 0.46 with 5 retrieved documents per query). The queries are natural-language questions, which are often phrased differently from the passages that answer them, so we expected the semantic retriever to perform better.
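
The keyword-matching behaviour of TF-IDF described above can be seen in a small pure-Python sketch. This is a simplified scoring function written for illustration, not the formula used by `TFIDFRetriever`, and the two documents are made up.

```python
import math
from collections import Counter


def tfidf_scores(query, docs):
    """Score each document by the summed TF-IDF weight of the query terms it contains."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term in counts:
                tf = counts[term] / len(tokens)
                idf = math.log(n_docs / doc_freq[term]) + 1.0
                score += tf * idf
        scores.append(score)
    return scores


docs = [
    "thyroid inflammation linked to viral infection",
    "stock markets fell sharply on monday",
]

# Query that shares exact terms with the first document
print(tfidf_scores("viral thyroid inflammation", docs))
# Paraphrased query with no shared terms: every score is zero
print(tfidf_scores("swollen neck gland after flu", docs))
```

The second query is about the same topic as the first document, but because no term overlaps, a purely lexical scorer cannot rank it. This is the kind of case where an embedding-based retriever can help.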

In [None]:
tfidf_acc = evaluate_retriever(retriever, "TF-IDF")
semantic_acc = evaluate_retriever(semantic_retriever, "Semantic")

print(f"TF-IDF accuracy: {tfidf_acc:.2f}")
print(f"semantic accuracy: {semantic_acc:.2f}")


TF-IDF accuracy: 0.46
semantic accuracy: 0.54


## 3. RAG (25 pt)

In this section, you should use all the concepts you've learned until now to create a complete RAG chain.

### 3.1 Load Documents (2 pt)

Load [RAFT](https://arxiv.org/abs/2403.10131) and [DSPy](https://arxiv.org/abs/2401.12178) papers. You can use `ArxivLoader` to get documents from arXiv.


In [3]:
%%capture
!pip install --upgrade --quiet  arxiv

In [21]:
from langchain_community.document_loaders import ArxivLoader
dspy_doc = ArxivLoader(query="2310.03714", load_max_docs=1).load()[0]
raft_doc = ArxivLoader(query="2403.10131", load_max_docs=1).load()[0]

### 3.2 Split Documents into Chunks (4 pt)

Usually, each document is constructed from multiple sections, each with a separate topic. It is better to split each document into smaller parts named chunks and search among them instead of actual documents. Write a splitter to create chunks from loaded documents.

In [22]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

docs = [raft_doc, dspy_doc]

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 100,
    chunk_overlap  = 20,
    length_function = len,
)

chunks = text_splitter.split_documents(docs)
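
The roles of `chunk_size` and `chunk_overlap` can be illustrated with a fixed-width sliding window. This is a simplified sketch: `RecursiveCharacterTextSplitter` additionally tries to split on separators such as paragraph and line breaks, which this version ignores. `sliding_chunks` is a hypothetical helper.

```python
def sliding_chunks(text, chunk_size=100, chunk_overlap=20):
    """Split text into fixed-width chunks where consecutive chunks share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


sample = "".join(str(i % 10) for i in range(250))  # 250 characters of dummy text
parts = sliding_chunks(sample, chunk_size=100, chunk_overlap=20)

print([len(p) for p in parts])         # [100, 100, 90]
print(parts[0][-20:] == parts[1][:20])  # True: the overlap is shared
```

The overlap means that a sentence cut at a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated text in the index.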

### 3.3 Retriever (3 pt)

Create a retriever of your choice.

In [23]:
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

model_path = "intfloat/e5-base"

embedding_model = HuggingFaceEmbeddings(model_name=model_path)
retriever = FAISS.from_documents(chunks, embedding_model).as_retriever()



### 3.4 Design Prompt (2 pt)

Design a suitable prompt for RAG.

In [24]:
from langchain_core.prompts import ChatPromptTemplate

template = """You are a useful AI assistant that answers questions about available papers .
Use the provided context to answer the question.
Context: {context}
Question: {question}
Answer: """
prompt = ChatPromptTemplate.from_template(template)

### 3.5 RAG Chain (3 pt)

Design a question from the documents and get the retriever and RAG output for that question.

In [6]:
from langchain_core.runnables import RunnablePassthrough, RunnableLambda
from langchain_core.output_parsers import StrOutputParser

rag_chain = (
  {"context":  retriever , "question": RunnablePassthrough()}
  | prompt
  | model
  | StrOutputParser()
)

question = "Compare DSPy with exsiting libraries like Langchain and llamaIndex."
retrieved_doc = retriever.invoke(question)
answer = rag_chain.invoke(question)

print(f"retrieved document:\n{retrieved_doc}\n")
print(f"answer:\n{answer}")


retrieved document:
[Document(page_content='answer found: bool.\nB\nCOMPARISON WITH EXISTING LIBRARIES LIKE LANGCHAIN AND\nLLAMAINDEX', metadata={'Published': '2023-10-05', 'Title': 'DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines', 'Authors': 'Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, Christopher Potts', 'Summary': 'The ML community is rapidly exploring techniques for prompting language\nmodels (LMs) and for stacking them into pipelines that solve complex tasks.\nUnfortunately, existing LM pipelines are typically implemented using hard-coded\n"prompt templates", i.e. lengthy strings discovered via trial and error. Toward\na more systematic approach for developing and optimizing LM pipelines, we\nintroduce DSPy, a programming model that abstracts LM pipelines as text\ntransformation graphs, i.e. imperative com
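
The retrieved output above shows that each `Document` carries a large metadata dictionary (title, authors, summary) next to a short `page_content`. When the retriever is placed directly in the chain, all of this is inserted into `{context}`. One possible refinement, sketched below, is to pass only the chunk text. `FakeDocument` is a hypothetical stand-in for LangChain's `Document` so that the example runs on its own.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Hypothetical stand-in for a LangChain Document, for illustration only."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs):
    """Join only the text of each retrieved chunk, dropping the metadata."""
    return "\n\n".join(doc.page_content for doc in docs)


docs = [
    FakeDocument(
        "COMPARISON WITH EXISTING LIBRARIES",
        {"Title": "DSPy: ...", "Summary": "a long abstract ..."},
    ),
    FakeDocument("DSPy abstracts LM pipelines as text transformation graphs."),
]

print(format_docs(docs))
```

In a chain, a function like this would typically be composed after the retriever (for example `retriever | format_docs` as the value of `"context"`). Whether this improves the answers in this notebook has not been tested here.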

### 3.6 Out of Domain Question (4 pt)

Ask a question that is not related to the documents. Does the model answer it? Change your prompt to force the model to say "I don't know" when someone asks an out-of-domain question.

**Answer:** The model tried to answer, but its output is incoherent and repetitive rather than a correct answer or a refusal.

In [8]:
# Out of Domain
question = "What are the best novels in the world?."
answer = rag_chain.invoke(question)
answer



' There are there are you can provide a list of course, I apologize and outperforming your question about English author Omarriage novels. The paper titled "DSPy have you can you can you can you can you can help me\nSorry, Hello!There are there are there are there are there are there are there are there are romance novels of course on the papers written by Omarriage novels. There are there are romance andrewritten in terms such as a large amount of the document containing Romance novel or English author wrote Romance novels where can provide the papers written by providing a collection of English authors write essays. Here are there are there are there are there are there are there are there are there are there are there are you can you can you can you can you can you can you can you can you can you can you can answer your assistant@User: DSPy there are you got it appears to searchable to provide papers.\nThere are there are there are there are there are there are there are there are t

In [10]:
template = """You are a useful AI assistant that answers questions about available papers .
Use the provided context to answer the question.
if someone asked you out of domain questions, just answer with "I don't know".
Question: {question}
Context: {context}
Answer: """
prompt = ChatPromptTemplate.from_template(template)

In [12]:
rag_chain = (
  {"context":  retriever , "question": RunnablePassthrough()}
  | prompt
  | model
  | StrOutputParser()
)

answer = rag_chain.invoke(question)
answer



" I cannot provide a useful A question difficult to answer: I don'The best novels.\nI cannot provide the best novels.\nI cannot provide a) The best novels, I cannot provide a list of course, I cannot provide a great question doesn'I'I'I'I'I'I'I'I'I'I cannot provide a) I don'The best novels in my apologize the best novels.\nI cannot provide a list of course.  I cannot provide a list of course of course of course of course of course novels and this is difficult to answer: As an interestingly, There are some of course! The best novels, I'The best novels in my apologize the best novels, unfortunately, I cannot determine the best novels\nI cannot provide a useful A great question does not able to beowing your question: There are some of course: The best novels or a useful A definitively the best novels for what answelling Novels or one can provide here are the best novels,  I don'DSPy There are many different people have you cannot answer"

### 3.7 The Effect of Temperature (7 pt)

RAG performance is highly dependent on model temperature. Explain whether a low or a high temperature is better. For the same prompt, compare the output of the model with low and high temperature.



According to the [CTransformers documentation](https://github.com/marella/ctransformers#config), the default value for temperature is 0.8.



1. **What is Temperature?**
   - In the context of language models, **temperature** is a hyperparameter that controls the randomness of the generated output.
   - It affects the **softness** of the probability distribution over possible tokens.
   - Higher temperature values make the distribution **more uniform**, leading to more diverse and creative outputs.
   - Lower temperature values make the distribution **sharper**, resulting in more deterministic and focused outputs.

2. **Effect of Temperature on RAG Models:**
   - RAG models combine retrieval (from a set of documents) with generation (language modeling).
   - The retrieval component provides context, and the generation component produces the final output.
   - Temperature affects only the generation step, since retrieval does not sample from the model:

     - **High Temperature:**
       - **Pros:**
         - Encourages exploration and diversity.
         - Generates more novel responses.
         - Useful for creative tasks (e.g., poetry, story generation).
       - **Cons:**
         - May produce less coherent or relevant responses.
         - Risk of hallucination (inventing information not present in the retrieval context).
         - Can lead to rambling or off-topic output.

     - **Low Temperature:**
       - **Pros:**
         - Encourages exploitation of existing information.
         - Produces more focused and deterministic responses.
         - Suitable for factual or precise answers.
       - **Cons:**
         - May lack creativity or diversity.
         - Can be overly conservative.
         - Might miss out on alternative valid responses.

3. **Choosing the Right Temperature:**
   - There's no one-size-fits-all answer; it depends on the task and desired output.
   - **Guidelines:**
     - For **fact-based tasks**, use a **lower temperature** to ensure accuracy.
     - For **creative tasks**, experiment with **higher temperatures** to encourage diversity.
     - **Tune the temperature** at inference time based on your specific use case; it is a sampling parameter, not something learned during fine-tuning.
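
The sharpening and flattening effect described above can be checked numerically. The sketch below divides a set of made-up logits by the temperature before applying softmax, which is the usual way temperature enters sampling. The logit values are assumptions chosen only for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after scaling them by 1 / temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    max_scaled = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - max_scaled) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens

for t in (0.3, 0.8, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At the lowest temperature almost all probability mass sits on the top token, while at higher temperatures the other tokens become more likely to be sampled.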



**Results:**
We expected the high-temperature model to give more creative answers that pay less attention to the context, and the low-temperature model to give conservative answers that depend only on the context and instructions. In practice, both outputs are incoherent and unrelated to the question, so the comparison is inconclusive. The two runs also use different questions, which further limits the comparison. I could not determine the reason for this behaviour, but the expectation remains as described above.

In [None]:
import torch

del model
del rag_chain
torch.cuda.empty_cache()

In [20]:
from langchain_community.llms import CTransformers

low_temp_model = CTransformers(
  model="TheBloke/Llama-2-7B-Chat-GGUF",
  model_file="llama-2-7b-chat.Q8_0.gguf",
  model_type="llama",
  config={
    "gpu_layers": 50,
    "temperature": 0.3
  }
)


In [25]:
from langchain_core.runnables import RunnablePassthrough, RunnableLambda
from langchain_core.output_parsers import StrOutputParser

rag_chain = (
  {"context":  retriever , "question": RunnablePassthrough()}
  | prompt
  | low_temp_model
  | StrOutputParser()
)

question = "What is RAFT?"
answer = rag_chain.invoke(question)
answer



' The paper "RAFT?The paper'

In [19]:
import torch

del low_temp_model  # free the low-temperature model before loading the high-temperature one
del rag_chain
torch.cuda.empty_cache()

In [12]:
high_temp_model = CTransformers(
  model="TheBloke/Llama-2-7B-Chat-GGUF",
  model_file="llama-2-7b-chat.Q8_0.gguf",
  model_type="llama",
  config={
    "gpu_layers": 50,
    "temperature" : 1.0
  }
)


In [18]:
rag_chain = (
  {"context":  retriever , "question": RunnablePassthrough()}
  | prompt
  | high_temp_model
  | StrOutputParser()
)

question = "Compare DSPy with exsiting libraries like Langchain and llamaIndex."
answer = rag_chain.invoke(question)
answer



' Yes, sorry I apologize them, whether or you can you are there is the provided paper\nDSPy DSPy DSPy DSPy DSPy, yes'