<div align="center">
    <img src="https://logoyab.com/wp-content/uploads/2024/08/IUST-University-Logo-1030x1030.png" alt="Logo" width="200">
    <p><b>HW5 @ Deep Learning Course, Dr. Mohammadi</b></p>
    <p><b>Designed by Nafiseh Ahmadi</b></p>
</div>

--------


*Full Name:*

*Student Number:*


------


# What are Soft prompts?
Soft prompts are learnable tensors concatenated with the input embeddings that can be optimized for a dataset. The downside is that they are not human-readable, because these “virtual tokens” do not correspond to the embeddings of real words.
<br>
<div>
<img src="https://www.researchgate.net/publication/366062946/figure/fig1/AS:11431281105340756@1670383256990/The-comparison-between-the-previous-T5-prompt-tuning-method-part-a-and-the-introduced.jpg">
</div>

Read More:
<br>[YouTube: PEFT and Soft Prompt](https://www.youtube.com/watch?v=8uy_WII76L0)
<br>[Paper: The Power of Scale for Parameter-Efficient Prompt Tuning](https://arxiv.org/pdf/2104.08691.pdf)
<br>[Paper: Prefix-Tuning: Optimizing Continuous Prompts for Generation](https://arxiv.org/pdf/2101.00190.pdf)

# Part 1
Before diving into the practical applications, let's first ensure your foundational knowledge is solid. Please answer the following questions.


**A) Compare and contrast model tuning and prompt tuning in terms of their effectiveness for specific downstream tasks.**

**B) Explore the challenges associated with interpreting soft prompts in the continuous embedding space and propose potential solutions.**

**C) What is the effect of initializing prompts randomly versus initializing them from the vocabulary, and how does this impact the performance of prompt tuning?**

**D) How does the optimization process in prefix tuning work ([Prefix-Tuning: Optimizing Continuous Prompts for Generation](https://arxiv.org/pdf/2101.00190.pdf)), and why did the authors use this technique?**


<font color='#FA5170'><b>Your answer:</b></font>

<font color='#FA5170'><b>A :</b></font>

<font color='#FA5170'><b>B :</b></font>

<font color='#FA5170'><b>C :</b></font>

<font color='#FA5170'><b>D :</b></font>

# Part 2

## Imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import transformers
from transformers import AutoModelForSequenceClassification, AutoTokenizer, AutoModel
from torch.optim import AdamW  # the AdamW class in transformers is deprecated in favor of this one
from tqdm import tqdm
import warnings
warnings.filterwarnings("ignore")

## Model Selection & Constants
We will use `bert-fa-base-uncased` as our base model from Hugging Face ([HF_Link](https://huggingface.co/HooshvareLab/bert-fa-base-uncased)). For our tuning, we intend to utilize 20 soft prompt tokens.

In [None]:
class CONFIG:
    seed =
    max_len =
    train_batch =
    valid_batch =
    epochs = 10
    n_tokens=
    learning_rate =
    model_name =
    tokenizer =
    device =

## Dataset

The dataset contains around 7,000 Persian sentences with their corresponding polarity. Each sentence has been manually classified into one of 5 categories (e.g., Angry).

### Load Dataset

In [None]:
import pandas as pd
file_path = "/content/softprompt_dataset.csv"
df = pd.read_csv(file_path)

### Pre-Processing

In [None]:
%pip install -U clean-text[gpl]
%pip install hazm

In [None]:
import re
from cleantext import clean
from hazm import *

In [None]:
import re
def cleanhtml(raw_html):
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', raw_html)
    return cleantext

def cleaning(text):
    text = text.strip()

    # regular cleaning
    text = clean(text,
        fix_unicode=True,
        to_ascii=False,
        lower=True,
        no_line_breaks=True,
        no_urls=True,
        no_emails=True,
        no_phone_numbers=True,
        no_numbers=False,
        no_digits=False,
        no_currency_symbols=True,
        no_punct=False,
        replace_with_url="",
        replace_with_email="",
        replace_with_phone_number="",
        replace_with_number="",
        replace_with_digit="0",
        replace_with_currency_symbol="",
    )

    text = cleanhtml(text)

    # normalizing
    #normalizer = hazm.Normalizer()
    #text = normalizer.normalize(text)

    wierd_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"
        u"\U0001f926-\U0001f937"
        u'\U00010000-\U0010ffff'
        u"\u200d"
        u"\u2640-\u2642"
        u"\u2600-\u2B55"
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\u3030"
        u"\ufe0f"
        u"\u2069"
        u"\u2066"
        u"\u2068"
        u"\u2067"
        "]+", flags=re.UNICODE)

    text = wierd_pattern.sub(r'', text)

    # removing extra spaces, hashtags
    text = re.sub("#", "", text)
    text = re.sub(r"\s+", " ", text)  # raw string avoids an invalid-escape warning

    return text

In [None]:
from tqdm import tqdm
from concurrent.futures import ThreadPoolExecutor

tqdm.pandas()

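def parallel_apply_with_progress(df, func, n_workers=4):
    # Apply `func` to every entry of df['text'] in a thread pool, with a progress bar
    with ThreadPoolExecutor(max_workers=n_workers) as executor, tqdm(total=len(df)) as pbar:
        results = []
        # executor.map yields results in the same order as the inputs
        for result in executor.map(func, df['text']):
            results.append(result)
            pbar.update()

        # Assign a plain list so values are matched by position, not by index label
        df['text'] = results

    return df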

In [None]:
df = parallel_apply_with_progress(df, cleaning)

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(df.index.values,
                                                  df.label.values,
                                                  test_size=0.15,
                                                  random_state=42,
                                                  stratify=df.label.values)

train_df = df.loc[X_train]
validation_df = df.loc[X_val]

In [None]:
possible_labels = df.label.unique()

label_dict = {}
for index, possible_label in enumerate(possible_labels):
    label_dict[possible_label] = index
label_dict

In [None]:
# Map each original label to its integer index
train_df['label'] = train_df.label.map(label_dict)
validation_df['label'] = validation_df.label.map(label_dict)

### Create Dataset Class
In this step we get the dataset ready for training.

We define a BERT-based dataset class for text classification, driven by the configuration parameters. It preprocesses the text and tokenizes it with the BERT tokenizer.


Complete the preprocessing step in the `__getitem__` method by adding padding tokens to `input_ids` and `attention_mask`.
The number of these pad tokens is equal to `n_tokens`.
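
As a rough illustration of this step, the sketch below uses plain Python lists in place of tensors. It rests on several assumptions that the assignment does not spell out: the placeholders are prepended (so that a prompt layer can later slice them off the front), the placeholder id is the tokenizer's pad id (taken to be 0 here), and the mask is 1 at the prompt positions so that the model attends to them. `add_prompt_placeholders` is a hypothetical helper name, not part of the template.

```python
def add_prompt_placeholders(input_ids, attention_mask, n_tokens, pad_id=0):
    """Prepend n_tokens placeholder ids and matching mask entries.

    The placeholder ids are never embedded as words: the prompt layer is
    expected to drop them and put its learned vectors in their place.
    """
    new_ids = [pad_id] * n_tokens + list(input_ids)
    # The mask is 1 at prompt positions so attention can see the soft prompt.
    new_mask = [1] * n_tokens + list(attention_mask)
    return new_ids, new_mask


# Toy example: a 6-position sequence (4 real tokens, 2 padding) and 3 prompt slots.
ids, mask = add_prompt_placeholders([2, 57, 91, 4, 0, 0], [1, 1, 1, 1, 0, 0], n_tokens=3)
print(ids)
print(mask)
```

With this layout each sample occupies `max_len + n_tokens` positions, which has to stay within the model's maximum sequence length.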

In [None]:
class BERTDataset(Dataset):
    def __init__(self,df):
        self.text = df['text'].values
        self.labels = df['label'].values
        self.all_labels = [0, 1, 2, 3, 4]
        self.max_len = CONFIG.max_len
        self.tokenizer = CONFIG.tokenizer
        self.n_tokens=CONFIG.n_tokens

    def __len__(self):
        return len(self.text)

    def __getitem__(self, index):
        text = self.text[index]
        text = ' '.join(text.split())
        inputs = self.tokenizer.encode_plus(
            text,
            None,
            truncation=True,
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',
            return_token_type_ids=True
        )

        ######### Your code begins #########

        ######### Your code ends ###########
        # Get the ground-truth class label for the current sample
        labels = #TODO
        # Create a one-hot dictionary for the current label
        label_dict = #TODO
        # Convert the one-hot dictionary to a tensor
        labels_tensor = #TODO
        return {
            'ids':  #TODO # Token IDs including padding and special tokens
            'mask':  #TODO # Attention mask indicating real tokens vs padding
            'label':  #TODO # One-hot encoded label tensor for multi-class classification
        }


In [None]:
train_dataset = BERTDataset(train_df)
validation_dataset = BERTDataset(validation_df)

## Define Prompt Embedding Layer
In this part we define our prompt layer in the `PROMPTEmbedding` module.


<font color='#AA1A73'><b>You have to complete the</b></font> `initialize_embedding` <font color='#AA1A73'><b>and</b></font> `forward` <font color='#AA1A73'><b>functions.</b></font>

In the `initialize_embedding` function, initialize the learned embeddings either from the vocabulary or randomly within the specified range, depending on `initialize_from_vocab`.

In the `forward` function, slice the input tokens based on `n_tokens` so that only the relevant part is embedded as `input_embedding`.

Repeat `learned_embedding` along the batch dimension to match the batch size of `input_embedding`.

Concatenate `learned_embedding` and `input_embedding` along the sequence dimension.
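
The shape bookkeeping behind these steps can be sketched without PyTorch, using nested lists as stand-ins for tensors. This is only an illustration under assumptions: the placeholders are taken to sit at the front of each sequence, vocabulary initialization is taken to mean copying the first `n_tokens` rows of the embedding table, and `init_prompt` and `prompt_forward` are hypothetical names rather than the expected solution.

```python
import random


def init_prompt(vocab_embeddings, n_tokens, random_range=0.5, from_vocab=True, seed=0):
    """Return n_tokens prompt vectors, each as wide as a word vector."""
    dim = len(vocab_embeddings[0])
    if from_vocab:
        # Copy (not alias) the first n_tokens rows of the word-embedding table.
        return [list(row) for row in vocab_embeddings[:n_tokens]]
    rng = random.Random(seed)
    # Uniform values in [-random_range, random_range].
    return [[rng.uniform(-random_range, random_range) for _ in range(dim)]
            for _ in range(n_tokens)]


def prompt_forward(token_batch, vocab_embeddings, prompt, n_tokens):
    """Swap the first n_tokens placeholder positions for the learned prompt."""
    output = []
    for tokens in token_batch:
        # 1) Embed only the real tokens; the placeholders are skipped.
        input_embedding = [vocab_embeddings[t] for t in tokens[n_tokens:]]
        # 2) The same prompt is reused ("repeated") for every sequence in the batch.
        # 3) Concatenate along the sequence axis: prompt first, then the text.
        output.append(prompt + input_embedding)
    return output


# Toy setup: vocabulary of 5 words, embedding width 2, 2 prompt tokens.
vocab = [[0.0, 0.0], [0.1, 0.1], [0.2, 0.2], [0.3, 0.3], [0.4, 0.4]]
prompt = init_prompt(vocab, n_tokens=2)
batch = [[0, 0, 3, 4], [0, 0, 1, 2]]  # two placeholder ids, then two real ids
out = prompt_forward(batch, vocab, prompt, n_tokens=2)
print(len(out), len(out[0]), len(out[0][0]))  # batch size, sequence length, width
```

In PyTorch the same three steps are usually written with tensor slicing, `Tensor.repeat`, and `torch.cat` along dimension 1.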


In [None]:
class PROMPTEmbedding(nn.Module):
    def __init__(self,
                 emb_layer: nn.Embedding,
                 n_tokens: int = 20,
                 random_range: float = 0.5,
                 initialize_from_vocab: bool = True):

        super(PROMPTEmbedding, self).__init__()
        self.emb_layer = emb_layer
        self.n_tokens = n_tokens
        self.learned_embedding = nn.parameter.Parameter(self.initialize_embedding(emb_layer,
                                                                                   n_tokens,
                                                                                   random_range,
                                                                                   initialize_from_vocab))

    def initialize_embedding(self,
                             emb_layer: nn.Embedding,
                             n_tokens: int = 20,
                             random_range: float = 0.5,
                             initialize_from_vocab: bool = True):

        if initialize_from_vocab:
            # Initialize embeddings from the vocabulary
            vocab_emb = #TODO
            return vocab_emb
        else:
            # Initialize embeddings randomly within the specified range
            random_emb = #TODO
            return random_emb

    def forward(self, tokens):
        ######### Your code begins #########

        ######### Your code ends ###########
        return concat_embedding


## Replace model's embedding layer with our layer

In [None]:
######### Your code begins #########

# Define your BERT model
# Load a pretrained BERT model for classification with 5 output labels
model = #TODO

# Get the word embedding from the BERT model
# Extract the original word embedding layer from the BERT model
bert_embedding_layer = #TODO

# Create an instance of PROMPTEmbedding to replace it
prompt_embedding_layer = #TODO

# Set the embedding of the BERT model to the new PROMPTEmbedding instance
#TODO

######### Your code ends ###########

## Freezing Model Parameters
In this part we freeze the entire model except `learned_embedding`.

In [None]:
######### Your code begins #########

######### Your code ends ###########

## Optimizer


In [None]:
######### Your code begins #########

# Using AdamW with the configs you have already set.

######### Your code ends ###########

## Training & Evaluation


### Define dataloaders

In [None]:
######### Your code begins #########

######### Your code ends ###########

### Define evaluation function

In [None]:
from sklearn.metrics import f1_score

def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = np.argmax(labels, axis=1).flatten()
    return f1_score(labels_flat, preds_flat, average='weighted')
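
To make `average='weighted'` concrete, the sketch below recomputes the metric by hand on a tiny example: each class gets its own F1 score, and the scores are averaged with weights proportional to the number of true samples per class. `weighted_f1` is a hypothetical helper for illustration only; the assignment itself relies on scikit-learn's `f1_score`.

```python
def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    total = len(y_true)
    score = 0.0
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        support = tp + fn  # number of true samples of class c
        score += f1 * support / total
    return score


y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
# Per-class F1 is 0.8, 0.8 and 1.0 with supports 3, 2 and 1.
print(round(weighted_f1(y_true, y_pred), 4))
```

Because the weights follow class frequency, frequent classes dominate this score, which is worth keeping in mind if the five categories are imbalanced.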

In [None]:
######### Your code begins #########

# Evaluate the model on the validation set and return average loss, predictions, and true labels

######### Your code ends ###########

### Define training loop


In this section, you will implement the training loop for a model using the `tqdm` library to visualize progress. The function `train()` manages the training and evaluation of the model for a set number of epochs. It displays the training loss during each epoch and reports the validation loss and F1 score after each epoch without disrupting the progress bar display.

In [None]:
######### Your code begins #########

# Train the model using training data and evaluate on validation data each epoch

######### Your code ends ###########

### Run

Now that the training function is defined, you can call it to begin training your model. The following line initializes the training process by passing the model, optimizer, training data loader, and validation data loader to the train() function. Add this line of code below to execute training.

In [None]:
######### Your code begins #########

# Start training the model with the specified optimizer, training data, and validation data

######### Your code ends ###########

## Using OpenDelta library

In [None]:
!pip install git+https://github.com/thunlp/OpenDelta.git

Use the `OpenDelta` library to do the same thing. [link](https://opendelta.readthedocs.io/en/latest/modules/deltas.html)

For hyperparameters, test with `N_SOFT_PROMPT_TOKENS=10` and `N_SOFT_PROMPT_TOKENS=20` and report the results for both.

The OpenDelta library adds the soft tokens to the input itself, so we do not need to add them ourselves. The dataset therefore has to be re-created without the extra placeholder tokens.

In [None]:
######### Your code begins #########

# Define a custom Dataset class for handling BERT tokenization and multi-class labels
class NewBERTDataset(Dataset):

######### Your code ends ###########

In [None]:
train_dataset = NewBERTDataset(train_df)
validation_dataset = NewBERTDataset(validation_df)

In [None]:
train_loader = DataLoader(train_dataset, batch_size=CONFIG.train_batch,
                              num_workers=2, shuffle=True, pin_memory=True)

# Validation data does not need shuffling; a fixed order keeps evaluation reproducible
validation_loader = DataLoader(validation_dataset, batch_size=CONFIG.valid_batch,
                              num_workers=2, shuffle=False, pin_memory=True)

The results in both cases show competitive performance, but with `N_SOFT_PROMPT_TOKENS=10` the F1-score is slightly better. This experiment can be continued with larger values of `soft_token_num` to see whether performance improves.

In this section, you will load a pre-trained transformer model and apply soft prompt tuning using `SoftPromptModel` from the `opendelta` library. This approach prepends a set number of learnable prompt tokens to the model's input without updating the full model weights. After freezing the original model and initializing the optimizer, the training process begins using the custom `train()` function.

In [None]:
######### Your code begins #########

# CASE 1: N_SOFT_PROMPT_TOKENS = 10

######### Your code ends ###########

In [None]:
######### Your code begins #########

# CASE 2: N_SOFT_PROMPT_TOKENS=20

######### Your code ends ###########

# Reasoning

Reasoning is the mental process of drawing conclusions, making decisions, or solving problems by thinking through information step by step. In the context of humans, it's how we make sense of things. In the context of AI and language models, it's how the model simulates a logical thought process to arrive at an answer.

## Chain-of-Thought (CoT)

LLMs have demonstrated good reasoning abilities. Furthermore, their capabilities can be further improved by incorporating reasoning techniques. One of the most notable developments in this area is the [Chain-of-Thought (CoT)](https://arxiv.org/abs/2201.11903), which was introduced by Google. This approach has shown promising results in improving the reasoning capabilities of language models across a variety of tasks. Can you explain what CoT is and how it works?

<font color='#FA5170'><b>Your answer:</b></font>

In this section, you should use the CoT technique. First, you need to load the [Phi-2 model](https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/), which was introduced by Microsoft as a small LLM.

In [None]:
######### Your code begins #########

# Initialize the model and tokenizer, and implement the generate_output function to generate responses based on input questions

######### Your code ends ###########

Use Phi-2 to answer the questions below with and without CoT. Compare results and explain their difference.
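
One way to set up the two conditions is to build both prompts from the same question, as in the minimal sketch below. It makes two assumptions: the `Instruct: ... Output:` wrapper is taken to be the question-answering format Phi-2 expects, and the CoT condition uses the zero-shot cue "Let's think step by step", which is a simpler variant than the few-shot worked exemplars of the linked CoT paper. `build_prompt` is a hypothetical helper name.

```python
def build_prompt(question, use_cot=False):
    """Wrap a question in an instruction-style prompt, optionally with a CoT cue."""
    if use_cot:
        # Zero-shot CoT: ask for intermediate reasoning before the answer.
        instruction = f"{question}\nLet's think step by step."
    else:
        # Direct prompting: ask for the answer with no reasoning.
        instruction = f"{question}\nGive only the final answer."
    # Assumed instruction/answer wrapper for the model.
    return f"Instruct: {instruction}\nOutput:"


question = "John volunteers at a shelter twice a month for 3 hours at a time. How many hours does he volunteer per year?"
print(build_prompt(question))
print(build_prompt(question, use_cot=True))
```

Keeping everything except the cue identical makes the comparison between the two conditions easier to interpret.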

In [None]:
questions = ["Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much did she earn?",
"Jack is stranded on a desert island. He wants some salt to season his fish. He collects 2 liters of seawater in an old bucket. If the water is 20% salt, how many ml of salt will Jack get when all the water evaporates?",
"John volunteers at a shelter twice a month for 3 hours at a time. How many hours does he volunteer per year?",
"There are 32 tables in a hall. Half the tables have 2 chairs each, 5 have 3 chairs each and the rest have 4 chairs each. How many chairs in total are in the hall?",
"Bert fills out the daily crossword puzzle in the newspaper every day. He uses up a pencil to fill out the puzzles every two weeks. On average, it takes him 1050 words to use up a pencil. How many words are in each crossword puzzle on average?"
]

## Correct Answers for each question:
    # 1: $10
    # 2: 400 ml
    # 3: 72 hours
    # 4: 91 chairs
    # 5: 75 words

######### Your code begins #########

# Prompt the model with each question (without Chain-of-Thought) to generate direct answers
"""Step 1: Prompting without CoT"""

######### Your code ends ###########
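
The reference answers listed with the questions can be checked with a few lines of arithmetic, which is also the reasoning a correct chain of thought should reproduce. The dictionary keys below are made-up labels for the five questions.

```python
from fractions import Fraction

expected = {
    "babysitting": Fraction(12) * Fraction(50, 60),  # $12/hour for 50 minutes
    "salt": 2 * 1000 * Fraction(20, 100),            # 20% of 2000 ml
    "volunteering": 2 * 3 * 12,                      # 2 visits x 3 hours x 12 months
    "chairs": 16 * 2 + 5 * 3 + (32 - 16 - 5) * 4,    # 16, 5 and 11 tables
    "crossword": Fraction(1050, 14),                 # 1050 words over 14 daily puzzles
}

for name, value in expected.items():
    print(name, value)
```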

With the initial maximum length the model could not finish answering the last question, so in the next part we increase `max_length` and repeat the prompting:

In [None]:
######### Your code begins #########

# Use the Chain-of-Thought (CoT) prompting technique to generate step-by-step answers for each question, limiting the output length to 300 tokens
"""Step 2: Prompting with CoT - Max length: 300"""

######### Your code ends ###########

In [None]:
######### Your code begins #########

# Use the Chain-of-Thought (CoT) prompting technique to generate step-by-step answers for each question, limiting the output length to 350 tokens
"""Step 2: Prompting with CoT - Max length = 350"""

######### Your code ends ###########

**Results without CoT:** Of the five questions in our prompts, the model answered only two correctly. For one of them (the Bert example), it followed incorrect steps but still arrived at the correct answer.

**Results with CoT:** CoT performed very well: the model answered all of the questions correctly, and every answer was reached through correct reasoning steps. The drawback is that CoT makes the LLM output longer than in the previous step, so a larger `max_length` is needed.

## Other Methods for Reasoning

There are many other approaches to utilize the reasoning abilities of LLMs. Describe the [Tree-of-Thought (ToT)](https://arxiv.org/abs/2305.10601) and [Self-Consistency](https://arxiv.org/abs/2203.11171) within these approaches.

 **Tree of Thoughts (ToT)**:
   - ToT is a novel approach that enhances language model (LM) inference by allowing deliberate problem solving through interconnected reasoning steps.
   - Key features of ToT:
     - **Coherent Units of Text (Thoughts)**: ToT maintains a tree structure where each node represents a coherent sequence of language (a "thought"). These thoughts serve as intermediate steps toward solving a problem.
     - **Self-Evaluation and Decision Making**: LMs using ToT can self-evaluate their progress by considering multiple reasoning paths. They deliberate on choices and decide the next course of action based on intermediate thoughts.
     - **Global Choices and Backtracking**: ToT enables LMs to look ahead or backtrack when necessary, allowing for global decisions that impact the overall problem-solving process.
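
The search pattern behind ToT can be sketched on a toy arithmetic task. In the sketch below, `propose` and `evaluate` are deterministic stand-ins for what would be LLM calls (generating candidate thoughts and scoring partial solutions), and the search is a breadth-first variant that keeps the best `beam_width` states per level. All names and the task itself are hypothetical.

```python
def propose(state):
    """Stand-in for the LLM 'thought generator': all one-step continuations."""
    value, path = state
    return [
        (value + 1, path + ["+1"]),
        (value + 3, path + ["+3"]),
        (value * 2, path + ["*2"]),
    ]


def evaluate(state, target):
    """Stand-in for the LLM 'state evaluator': closer to the target is better."""
    return -abs(target - state[0])


def tree_of_thoughts_bfs(start, target, depth, beam_width):
    """Breadth-first search that keeps the beam_width best states per level."""
    frontier = [(start, [])]
    for _ in range(depth):
        # Expand every kept state, then rank all children together.
        candidates = [child for state in frontier for child in propose(state)]
        candidates.sort(key=lambda s: evaluate(s, target), reverse=True)
        frontier = candidates[:beam_width]
        if frontier[0][0] == target:
            break
    return frontier[0]


# Keeping two branches alive reaches the target; a single greedy branch does not.
print(tree_of_thoughts_bfs(start=1, target=10, depth=3, beam_width=2))
print(tree_of_thoughts_bfs(start=1, target=10, depth=3, beam_width=1))
```

The greedy run commits to the locally best step each time and ends one off the target, while the wider beam keeps a second branch alive and reaches the target through it. This is the effect of considering multiple reasoning paths instead of a single chain.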

**Self-Consistency**:

  Self-consistency is an advanced prompting technique that builds on CoT prompting. The aim is to improve on naive greedy decoding with CoT prompting by sampling multiple diverse reasoning paths and selecting the most consistent answer. By using majority voting, the model can arrive at more accurate and reliable answers.


  To implement self-consistency, prompt engineers typically follow these steps:

- **Identify the problem:** Define the problem or question for which you require the LLM's assistance. Make sure it is clear and specific.
- **Create multiple prompts:** Develop various prompts that approach the problem from different angles or perspectives. Each prompt should provide a unique reasoning path for the AI to follow.
- **Generate responses:** Submit the prompts to the LLM and obtain the responses generated by the model.
- **Evaluate consistency:** Analyze the generated responses to determine their coherence, relevance, and consistency. This step may involve comparing the responses to each other, looking for common themes or patterns, and checking for internal logical consistency.
- **Select the best response:** Based on the evaluation, choose the most consistent and accurate response as the final answer.
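
The evaluation and selection steps above are often reduced to a majority vote over the final answers. The sketch below shows that vote in isolation and applies whether the completions come from different prompts or from repeated sampling of a single prompt. The completions are hand-written stand-ins rather than real model outputs, and taking the last number in a completion as its final answer is a simplifying assumption.

```python
import re
from collections import Counter


def extract_final_number(text):
    """Return the last number in a completion, or None if there is none."""
    numbers = re.findall(r"\d+(?:\.\d+)?", text.replace(",", ""))
    return float(numbers[-1]) if numbers else None


def majority_vote(completions):
    """Pick the most frequent final answer among the reasoning paths."""
    answers = [extract_final_number(c) for c in completions]
    answers = [a for a in answers if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


# Hand-written stand-ins for sampled model outputs (not real Phi-2 generations).
samples = [
    "50 minutes is 50/60 of an hour, so 12 * 50 / 60 = 10",
    "She earns 0.2 per minute, 0.2 * 50 = 10",
    "12 * 50 = 600",
]
print(majority_vote(samples))
```

Two of the three paths agree, so the vote returns their answer even though the third path made a unit error.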

Now, implement Self-Consistency to answer the questions of the previous section.
Analyze the results obtained from Steps 1 and 2.

<font color='#FA5170'><b>Your answer:</b></font>

In [None]:
######### Your code begins #########

# Apply multiple Chain-of-Thought (CoT) prompts with self-consistency to generate answers and compare the results across different prompt variations
"""Step 3: Prompting with CoT and Self-Consistency - Max length = 350"""

######### Your code ends ###########

Consider LLMs' features and propose a new approach based on them to enhance LLMs' reasoning abilities. Why do you believe this approach could enhance LLMs' reasoning abilities?

<font color='#FA5170'><b>Your answer:</b></font>