## CA 2, LLMs Spring 2024

- **Name:** Melika Noubakhtian
- **Student ID:** 4021305965008

---
#### Your submission should be named using the following format: `CA2_LASTNAME_STUDENTID_soft_prompt.ipynb`.

- There is no penalty for using AI assistance on this homework as long as you fully disclose it in the final cell of this notebook (this includes storing any prompts that you feed to large language models). That said, anyone caught using AI assistance without proper disclosure will receive a zero on the assignment (we have several automatic tools to detect such cases). We're literally allowing you to use it with no limitations, so there is no reason to lie!

---

##### *Academic honesty*

- We will audit the Colab notebooks from a set number of students, chosen at random. The audits will check that the code you wrote actually generates the answers in your notebook. If you turn in correct answers on your notebook without code that actually generates those answers, we will consider this a serious case of cheating.

- We will also run automatic checks of Colab notebooks for plagiarism. Copying code from others is also considered a serious case of cheating.

---

If you have any further questions or concerns, contact the TA via email:
mohammad136631@gmail.com

---

# What are Soft prompts?
Soft prompts are learnable tensors that are concatenated with the input embeddings and optimized on a dataset. The downside is that they are not human-readable, because these “virtual tokens” are not tied to the embedding of any real word.
<br>
<div>
<img src="https://www.researchgate.net/publication/366062946/figure/fig1/AS:11431281105340756@1670383256990/The-comparison-between-the-previous-T5-prompt-tuning-method-part-a-and-the-introduced.jpg"/>
</div>

Read More:
<br>[Youtube : PEFT and Soft Prompt](https://www.youtube.com/watch?v=8uy_WII76L0)
<br>[Paper: The Power of Scale for Parameter-Efficient Prompt Tuning](https://arxiv.org/pdf/2104.08691.pdf)
<br>[Paper: Prefix-Tuning: Optimizing Continuous Prompts for Generation](https://arxiv.org/pdf/2101.00190.pdf)

# Part 1 (20 Points)
Before diving into the practical applications, let's first ensure your foundational knowledge is solid. Please answer the following questions.


**A) Compare and contrast model tuning and prompt tuning in terms of their effectiveness for specific downstream tasks. (5 Points)**

**B) Explore the challenges associated with interpreting soft prompts in the continuous embedding space and propose potential solutions. (5 Points)**

**C) What is the effect of initializing prompts randomly versus initializing them from the vocabulary, and how does this impact the performance of prompt tuning? (5 Points)**

**D) How does the optimization process work in prefix tuning ([Prefix-Tuning: Optimizing Continuous Prompts for Generation](https://arxiv.org/pdf/2101.00190.pdf)), and why did the authors use this technique? (5 Points)**


# Answer:

**Part (a):**

- **Prompt Tuning:**
  - Involves updating only the prompt weights while keeping the model weights fixed.
  - Eliminates the need for separate base models for each task by introducing specific virtual tokens for individual tasks.
  - Recognized for its efficiency, particularly with larger models.
  - May not yield optimal results with smaller models due to their limited capacity.

- **Model Tuning:**
  - Entails modifying the model weights directly.
  - Requires maintaining distinct model versions for each task, resulting in greater storage and memory demands.
  - Updates every parameter of the model, giving it the most capacity to adapt to a task; the architecture itself stays the same.
  - Usually the stronger choice for small and mid-sized models; in Lester et al. (2021), prompt tuning only matches model tuning once the model reaches billions of parameters.
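
To make the storage argument concrete, here is a back-of-the-envelope sketch of the per-task trainable parameters under each approach. The hidden size of 768 is the standard BERT-base value and the 20 prompt tokens match this notebook; the backbone size is an assumed round figure for a BERT-base-sized model, not a measured count for any specific checkpoint.

```python
# Rough per-task storage comparison (illustrative numbers only).
hidden_size = 768              # BERT-base hidden size
n_prompt_tokens = 20           # soft prompt length used in this notebook
backbone_params = 110_000_000  # ASSUMPTION: round figure for a BERT-base-sized model

# Prompt tuning trains one vector of size hidden_size per virtual token.
prompt_params = n_prompt_tokens * hidden_size


def per_task_params(n_tasks: int, tuned_params: int) -> int:
    """Parameters that must be stored separately when serving n_tasks tasks."""
    return n_tasks * tuned_params


print(f"prompt tuning, per task: {prompt_params:,}")
print(f"model tuning, per task:  {backbone_params:,}")
print(f"fraction trained:        {prompt_params / backbone_params:.4%}")
print(f"10 tasks, prompt tuning: {per_task_params(10, prompt_params):,}")
print(f"10 tasks, model tuning:  {per_task_params(10, backbone_params):,}")
```

With prompt tuning, a single frozen backbone can be shared by all tasks, and only the small prompt matrix has to be stored per task.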


**Part (b):**

**Challenges:**

1. **Curse of Dimensionality:** Continuous embedding spaces can have hundreds or even thousands of dimensions. Visualizing or reasoning about such high-dimensional spaces is incredibly difficult.

2. **Context Dependence:** Embeddings can shift their meaning depending on the surrounding context, so we cannot interpret a prompt embedding in isolation; the tokens around it matter as well.

3. **Specific Words:** A prompt is interpretable to humans if it consists of meaningful words, but a continuous embedding often does not correspond to any particular word, which makes it hard to understand why the optimizer chose it.


**Potential Solutions:**


1. **Dimensionality Reduction Techniques:** Techniques like Principal Component Analysis (PCA) can project the high-dimensional embedding space into a lower-dimensional one, allowing for better visualization and analysis of the prompt's influence.

2. **Attention Mechanisms:** Examining the attention weights assigned by the LLM during generation can provide insights into which parts of the soft prompt embeddings were most influential in generating specific parts of the output.

3. **Mapping to Nearest Words:** Mapping each continuous embedding to the vocabulary word closest to it gives us meaningful words instead of raw numbers, which makes the prompt easier to explain.

4. **Explainable AI (XAI) Techniques:** Integrating XAI methods into LLMs can help explain the internal decision-making processes during generation. This could shed light on how the model interprets and utilizes the soft prompt information.
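
The nearest-word idea from the list above can be sketched with cosine similarity. This is a minimal illustration on made-up data: the four-word vocabulary, the 3-dimensional embeddings, and the helper name `nearest_tokens` are all hypothetical; with a real model the vocabulary matrix would come from its word-embedding layer.

```python
import numpy as np


def nearest_tokens(prompt_vectors, vocab_embeddings, vocab, top_k=3):
    """Return the top_k vocabulary entries closest (by cosine) to each prompt vector."""
    p = prompt_vectors / np.linalg.norm(prompt_vectors, axis=1, keepdims=True)
    v = vocab_embeddings / np.linalg.norm(vocab_embeddings, axis=1, keepdims=True)
    sims = p @ v.T                                   # (n_prompt, vocab_size)
    top = np.argsort(-sims, axis=1)[:, :top_k]       # indices of the most similar words
    return [[vocab[j] for j in row] for row in top]


# Toy data (hypothetical): 4-word vocabulary with 3-dimensional embeddings.
vocab = ["good", "bad", "angry", "happy"]
vocab_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [-1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.3],
])
# One "learned" prompt vector that lies close to the embedding of "angry".
prompt_vectors = np.array([[0.1, 0.95, 0.0]])

neighbours = nearest_tokens(prompt_vectors, vocab_embeddings, vocab, top_k=2)
print(neighbours)
```

The nearest word is only a rough label: a learned vector can sit between several words, so the similarity scores are worth inspecting alongside the words themselves.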



**Part (c):**

**Random Initialization:**

* **Effect:**
    * Introduces a high degree of variability in the initial prompt. This can be beneficial for exploration, potentially leading to discovering unexpected directions for the model.
    * However, random initialization often leads to prompts with little semantic meaning, requiring the model to work harder to learn a meaningful representation.
* **Impact on Performance:**
    * Can be slow and inefficient, requiring many iterations of prompt tuning to find effective configurations.
    * May lead to prompts that are nonsensical or nonsyntactic.

**Vocabulary-based Initialization:**

* **Effect:**
    * Starts with prompts that at least have meaningful tokens and potentially hold some semantic meaning based on the chosen words.
    * This provides a better foundation for the model to build upon during prompt tuning.
* **Impact on Performance:**
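    * Generally leads to faster convergence during prompt tuning as the starting point is closer to a potentially useful prompt.
    * Offers more control over the initial direction of the prompt, allowing for targeted tuning towards specific functionalities.
    * In Lester et al. (2021), vocabulary-based (and class-label) initialization outperforms random initialization on smaller models, while the gap largely disappears at the largest model size.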
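
The two strategies can be sketched side by side as follows. This is a hedged illustration on a tiny hypothetical embedding table; the helper name `init_soft_prompt` and the choice to sample random vocabulary rows are assumptions made for the example.

```python
import torch
import torch.nn as nn


def init_soft_prompt(emb_layer: nn.Embedding, n_tokens: int,
                     from_vocab: bool, random_range: float = 0.5) -> nn.Parameter:
    """Build a trainable (n_tokens, dim) prompt with either initialization strategy."""
    if from_vocab:
        # Copy randomly chosen vocabulary rows; clone() keeps the prompt
        # independent of the embedding matrix it was copied from.
        ids = torch.randperm(emb_layer.num_embeddings)[:n_tokens]
        init = emb_layer.weight[ids].clone().detach()
    else:
        # Uniform values in [-random_range, random_range].
        init = torch.empty(n_tokens, emb_layer.embedding_dim).uniform_(-random_range, random_range)
    return nn.Parameter(init)


torch.manual_seed(0)
toy_embeddings = nn.Embedding(100, 8)   # hypothetical tiny vocabulary
vocab_prompt = init_soft_prompt(toy_embeddings, 5, from_vocab=True)
random_prompt = init_soft_prompt(toy_embeddings, 5, from_vocab=False)
print(vocab_prompt.shape, random_prompt.shape)
```

Both prompts have the same shape and are trained identically; only the starting point differs.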



**Part (d):**

How it works and why we should use it:

1. **Fine-tuning vs. Prefix-Tuning**:
   - Fine-tuning modifies all the language model parameters, which necessitates storing a full copy for each task.
   - In contrast, prefix-tuning keeps the language model parameters frozen and optimizes a small continuous task-specific vector (referred to as the "prefix"). This approach allows for more efficient parameter updates.

2. **Optimization Process**:
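   - During prefix-tuning, the language model parameters remain fixed, and only the prefix parameters are updated by gradient descent on the usual language-modeling objective.
   - The prefix serves as context for subsequent tokens, allowing them to attend to it as if it were "virtual tokens." Unlike prompt tuning, the prefix activations are prepended at every layer, not only at the input embeddings.
   - The authors found that updating the prefix parameters directly leads to unstable optimization, so they reparameterize the prefix as a smaller matrix passed through an MLP; after training, the MLP is discarded and only the resulting prefix is stored.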
   - By learning only 0.1% of the parameters, prefix-tuning achieves comparable performance to fine-tuning in the full data setting, outperforms fine-tuning in low-data settings, and extrapolates better to examples with topics unseen during training.

3. **Applications**:
   - The authors apply prefix-tuning to GPT-2 for table-to-text generation and to BART for summarization.
   - Notably, prefix-tuning demonstrates promising results with significantly fewer parameters, making it an attractive alternative for certain natural language generation tasks.
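
The reparameterization used during prefix-tuning optimization can be sketched as a small trainable matrix that an MLP expands into per-layer key and value prefixes. This is a minimal sketch under stated assumptions, not the authors' implementation: the class name `PrefixEncoder`, the bottleneck size, the `Tanh` activation, and the output layout are all illustrative choices.

```python
import torch
import torch.nn as nn


class PrefixEncoder(nn.Module):
    """Reparameterized prefix: small matrix -> MLP -> key/value prefix for every layer."""

    def __init__(self, prefix_len: int, n_layers: int, hidden: int, bottleneck: int):
        super().__init__()
        self.prefix_len, self.n_layers, self.hidden = prefix_len, n_layers, hidden
        self.small = nn.Embedding(prefix_len, bottleneck)    # the smaller trainable matrix
        self.mlp = nn.Sequential(
            nn.Linear(bottleneck, hidden),
            nn.Tanh(),
            nn.Linear(hidden, n_layers * 2 * hidden),        # key + value for every layer
        )

    def forward(self, batch_size: int) -> torch.Tensor:
        idx = torch.arange(self.prefix_len)
        prefix = self.mlp(self.small(idx))                    # (prefix_len, n_layers*2*hidden)
        prefix = prefix.view(self.prefix_len, self.n_layers, 2, self.hidden)
        # The same prefix is shared by every example in the batch.
        return prefix.unsqueeze(0).expand(batch_size, -1, -1, -1, -1)


encoder = PrefixEncoder(prefix_len=10, n_layers=2, hidden=16, bottleneck=8)
out = encoder(batch_size=4)
print(out.shape)  # dimensions: batch, prefix_len, n_layers, key/value, hidden
```

Only `encoder` would receive gradients during training; once training ends, the expanded prefix can be computed a single time and stored in place of the MLP.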

# Part 2 (35 points)

## Imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import transformers
from transformers import AutoModelForSequenceClassification, AutoTokenizer, AutoModel
from torch.optim import AdamW  # transformers.AdamW is deprecated in favor of the PyTorch implementation
from tqdm import tqdm
import warnings
warnings.filterwarnings("ignore")

## Model Selection & Constants
We will use `bert-fa-base-uncased` as our base model from Hugging Face ([HF_Link](https://huggingface.co/HooshvareLab/bert-fa-base-uncased)). For our tuning, we intend to utilize 20 soft prompt tokens.

In [None]:
class CONFIG:
    seed = 42
    max_len = 128
    train_batch = 16
    valid_batch = 32
    epochs = 10
    n_tokens=20
    learning_rate = 0.01
    model_name = 'HooshvareLab/bert-fa-base-uncased'
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
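
`CONFIG.seed` is defined above, but none of the cells shown here pass it to the random number generators. A small helper along the following lines could be called once before training to make runs more repeatable; the function name `set_seed` is an assumption for this sketch, and exact reproducibility on GPU may still require further settings.

```python
import random

import numpy as np
import torch


def set_seed(seed: int = 42) -> None:
    """Seed the Python, NumPy and PyTorch RNGs for more repeatable runs."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)


set_seed(42)   # in the notebook this would be set_seed(CONFIG.seed)
a = torch.rand(3)
set_seed(42)
b = torch.rand(3)
print(torch.equal(a, b))
```

Re-seeding before each draw reproduces the same random tensor, which is what makes random prompt initialization comparable across runs.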

## Dataset

The dataset contains around 7,000 Persian sentences with their corresponding polarity; each sentence has been manually classified into one of 5 categories (e.g., Angry).

### Load Dataset

In [None]:
import pandas as pd
file_path = "/content/softprompt_dataset.csv"
df = pd.read_csv(file_path)

### Pre-Processing

In [None]:
%pip install -U clean-text[gpl]
%pip install hazm

In [None]:
import re
from cleantext import clean
from hazm import *

In [None]:
import re
def cleanhtml(raw_html):
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', raw_html)
    return cleantext

def cleaning(text):
    text = text.strip()

    # regular cleaning
    text = clean(text,
        fix_unicode=True,
        to_ascii=False,
        lower=True,
        no_line_breaks=True,
        no_urls=True,
        no_emails=True,
        no_phone_numbers=True,
        no_numbers=False,
        no_digits=False,
        no_currency_symbols=True,
        no_punct=False,
        replace_with_url="",
        replace_with_email="",
        replace_with_phone_number="",
        replace_with_number="",
        replace_with_digit="0",
        replace_with_currency_symbol="",
    )

    text = cleanhtml(text)

    # normalizing
    #normalizer = hazm.Normalizer()
    #text = normalizer.normalize(text)

    wierd_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"
        u"\U0001f926-\U0001f937"
        u'\U00010000-\U0010ffff'
        u"\u200d"
        u"\u2640-\u2642"
        u"\u2600-\u2B55"
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\u3030"
        u"\ufe0f"
        u"\u2069"
        u"\u2066"
        u"\u2068"
        u"\u2067"
        "]+", flags=re.UNICODE)

    text = wierd_pattern.sub(r'', text)

    # removing hashtags and collapsing repeated whitespace
    text = re.sub("#", "", text)
    text = re.sub(r"\s+", " ", text)  # raw string avoids an invalid-escape warning

    return text

In [None]:
from tqdm import tqdm
from concurrent.futures import ThreadPoolExecutor

tqdm.pandas()

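def parallel_apply_with_progress(df, func, n_workers=4):
    # Apply func to every row of df['text'] in a thread pool, with a progress bar.
    with ThreadPoolExecutor(max_workers=n_workers) as executor, tqdm(total=len(df)) as pbar:
        results = []
        # executor.map preserves input order, so results line up with the rows
        for result in executor.map(func, df['text']):
            results.append(result)
            pbar.update()

        # Reuse df's index so values align even if the index is not 0..n-1
        df['text'] = pd.Series(results, index=df.index)

    return df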

In [None]:
df = parallel_apply_with_progress(df, cleaning)

100%|██████████| 7023/7023 [00:02<00:00, 3278.45it/s]


In [None]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(df.index.values,
                                                  df.label.values,
                                                  test_size=0.15,
                                                  random_state=42,
                                                  stratify=df.label.values)

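# .copy() avoids pandas' SettingWithCopyWarning when the label column is reassigned later
train_df = df.loc[X_train].copy()
validation_df = df.loc[X_val].copy()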

In [None]:
possible_labels = df.label.unique()

label_dict = {}
for index, possible_label in enumerate(possible_labels):
    label_dict[possible_label] = index
label_dict

{0: 0, 1: 1, 2: 2, -1: 3, -2: 4}

In [None]:
train_df['label'] = train_df.label.replace(label_dict)
validation_df['label'] = validation_df.label.replace(label_dict)

### Create Dataset Class (5 Points)
In this step we get our dataset ready for training.

We define a BERT-based dataset class for text classification that uses the configuration parameters above. It preprocesses the text and tokenizes it with the BERT tokenizer.


Complete the preprocessing step in the `__getitem__` method by prepending padding tokens to `input_ids` and matching entries to `attention_mask`.
The number of padding tokens equals `n_tokens`.
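
As a toy illustration of this step, the sketch below reserves placeholder positions in front of an already tokenized sentence using plain lists. The token ids are made up, `n_tokens = 3` is used only to keep the output short, and the pad id of 0 is an assumption; in practice it should be read from `tokenizer.pad_token_id`.

```python
n_tokens = 3                            # hypothetical small value; the notebook uses 20
pad_token_id = 0                        # ASSUMPTION: read from tokenizer.pad_token_id in practice
input_ids = [2, 1045, 2293, 4, 0, 0]    # made-up ids, already padded to max_len = 6
attention_mask = [1, 1, 1, 1, 0, 0]

# Reserve n_tokens placeholder positions at the front for the soft prompt.
padded_ids = [pad_token_id] * n_tokens + input_ids
# The placeholders will hold the learned prompt, so the model must attend to them.
padded_mask = [1] * n_tokens + attention_mask

print(padded_ids)
print(padded_mask)
```

The placeholder ids themselves are never embedded; they only make the sequence length match the embeddings that the prompt layer will produce.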

In [None]:
class BERTDataset(Dataset):
    def __init__(self,df):
        self.text = df['text'].values
        self.labels = df['label'].values
        self.all_labels = [0, 1, 2, 3, 4]
        self.max_len = CONFIG.max_len
        self.tokenizer = CONFIG.tokenizer
        self.n_tokens=CONFIG.n_tokens

    def __len__(self):
        return len(self.text)

    def __getitem__(self, index):
        text = self.text[index]
        text = ' '.join(text.split())
        inputs = self.tokenizer.encode_plus(
            text,
            None,
            truncation=True,
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',
            return_token_type_ids=True
        )

        ######### Your code begins #########
        inputs['input_ids'] = torch.tensor([self.tokenizer.pad_token_id] * self.n_tokens + inputs['input_ids'])
        inputs['attention_mask'] = torch.tensor([1] * (self.n_tokens) + inputs['attention_mask'])
        ######### Your code ends ###########

        labels = self.labels[index]
        label_dict = {label: (label == labels) for label in self.all_labels}
        labels_tensor = torch.tensor([float(label_dict[label]) for label in self.all_labels])
        return {
            'ids': inputs['input_ids'],
            'mask': inputs['attention_mask'],
            'label': labels_tensor
        }


In [None]:
train_dataset = BERTDataset(train_df)
validation_dataset = BERTDataset(validation_df)

## Define Prompt Embedding Layer (15 Points)
In this part we will define our prompt layer in `PROMPTEmbedding` module.


<font color='#73FF73'><b>You have to complete</b></font> `initialize_embedding` and  `forward` <font color='#73FF73'><b>functions.</b></font>

In `initialize_embedding`, initialize the learned embeddings either from the vocabulary or randomly within the specified range, depending on `initialize_from_vocab`.

In `forward`:

- Embed only the real tokens by skipping the first `n_tokens` placeholder positions of the input.
- Repeat `learned_embedding` along the batch dimension so that it matches the batch size of the input embedding.
- Concatenate `learned_embedding` and the input embedding along the sequence dimension.
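
The shapes involved in `forward` can be traced on toy tensors. This is a sketch with made-up sizes and a stand-in embedding layer; it uses `expand`, which broadcasts without copying, where `repeat` would give the same values.

```python
import torch
import torch.nn as nn

batch, seq_len, n_tokens, dim = 2, 6, 3, 4          # toy sizes
emb = nn.Embedding(50, dim)                          # stand-in for the word-embedding layer
learned = nn.Parameter(torch.zeros(n_tokens, dim))   # stand-in for learned_embedding

# Input ids: n_tokens placeholder positions followed by seq_len real token ids.
tokens = torch.randint(0, 50, (batch, n_tokens + seq_len))

real = emb(tokens[:, n_tokens:])                     # (batch, seq_len, dim): placeholders skipped
prompt = learned.unsqueeze(0).expand(batch, -1, -1)  # (batch, n_tokens, dim)
out = torch.cat([prompt, real], dim=1)               # (batch, n_tokens + seq_len, dim)

print(real.shape, prompt.shape, out.shape)
```

The output has the same sequence length as the input ids, so the attention mask built in the dataset class lines up with it position by position.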


In [None]:
class PROMPTEmbedding(nn.Module):
    def __init__(self,
                 emb_layer: nn.Embedding,
                 n_tokens: int = 20,
                 random_range: float = 0.5,
                 initialize_from_vocab: bool = True):

        super(PROMPTEmbedding, self).__init__()
        self.emb_layer = emb_layer
        self.n_tokens = n_tokens
        self.learned_embedding = nn.parameter.Parameter(self.initialize_embedding(emb_layer,
                                                                                   n_tokens,
                                                                                   random_range,
                                                                                   initialize_from_vocab))

    def initialize_embedding(self,
                             emb_layer: nn.Embedding,
                             n_tokens: int = 20,
                             random_range: float = 0.5,
                             initialize_from_vocab: bool = True):

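        if initialize_from_vocab:
            # Initialize from the first n_tokens vocabulary embeddings.
            # clone().detach() makes an independent copy; without it the new
            # parameter would share storage with the embedding matrix.
            vocab_emb = emb_layer.weight[:n_tokens].clone().detach()
            return vocab_emb
        else:
            # Initialize embeddings uniformly at random in [-random_range, random_range]
            random_emb = torch.empty(n_tokens, emb_layer.embedding_dim).uniform_(-random_range, random_range)
            return random_emb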

    def forward(self, tokens):
        ######### Your code begins #########
        ext_input_embedding = self.emb_layer(tokens[:, self.n_tokens:])
        learned_embedding = self.learned_embedding.unsqueeze(0).repeat(ext_input_embedding.size(0), 1, 1)
        concat_embedding = torch.cat([learned_embedding, ext_input_embedding], dim=1)
        ######### Your code ends ###########
        return concat_embedding


## Replace model's embedding layer with our layer (5 Points)

In [None]:
# Define your BERT model
model = AutoModelForSequenceClassification.from_pretrained(CONFIG.model_name, num_labels=5, output_attentions = False,
                                                           output_hidden_states = False).to(CONFIG.device)
######### Your code begins #########
# Get the word embedding from the BERT model
bert_embedding_layer = model.bert.embeddings.word_embeddings

# Create an instance of PROMPTEmbedding to replace it
prompt_embedding_layer = PROMPTEmbedding(bert_embedding_layer)

# Set the embedding of the BERT model to the new PROMPTEmbedding instance
model.bert.embeddings.word_embeddings = prompt_embedding_layer

######### Your code ends ###########


pytorch_model.bin:   0%|          | 0.00/654M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at HooshvareLab/bert-fa-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Freezing Model Parameters (5 points)
In this part we freeze the entire model except `learned_embedding`.

In [None]:
######### Your code begins #########
for name, param in model.named_parameters():
    if 'learned_embedding' not in name:  # Exclude the learned_embedding layer
        param.requires_grad = False

######### Your code ends ###########
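
A quick way to confirm that a freezing step like the one above worked is to count trainable parameters afterwards. The sketch below applies the same name-based rule to a small hypothetical model; `ToyModel` and `count_parameters` are illustrative names, not part of the assignment code.

```python
import torch
import torch.nn as nn


class ToyModel(nn.Module):
    """Hypothetical stand-in: a soft prompt plus two ordinary layers."""

    def __init__(self):
        super().__init__()
        self.learned_embedding = nn.Parameter(torch.zeros(20, 8))
        self.encoder = nn.Linear(8, 8)
        self.classifier = nn.Linear(8, 5)


def count_parameters(model: nn.Module):
    """Return (trainable, total) parameter counts."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total


toy = ToyModel()
# Same freezing rule as in the notebook: keep only `learned_embedding` trainable.
for name, param in toy.named_parameters():
    if 'learned_embedding' not in name:
        param.requires_grad = False

trainable, total = count_parameters(toy)
print(f"trainable: {trainable} / total: {total}")
```

On the real model, the trainable count should equal `n_tokens` times the hidden size if only the soft prompt is left unfrozen.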

## Optimizer


In [None]:
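from torch.optim import AdamW  # transformers.AdamW is deprecated in favor of the PyTorch implementation

# Only pass the parameters that are still trainable (the soft prompt).
trainable_params = [p for p in model.parameters() if p.requires_grad]
# weight_decay=0.0 matches the default of the deprecated transformers.AdamW
optimizer = AdamW(trainable_params, lr=CONFIG.learning_rate, weight_decay=0.0)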

## Training & Evaluation


### Define dataloaders

In [None]:
train_loader = DataLoader(train_dataset, batch_size=CONFIG.train_batch,
                              num_workers=2, shuffle=True, pin_memory=True)

validation_loader = DataLoader(validation_dataset, batch_size=CONFIG.valid_batch,
                              num_workers=2, shuffle=True, pin_memory=True)

### Define evaluation function

In [None]:
from sklearn.metrics import f1_score

def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = np.argmax(labels, axis=1).flatten()
    return f1_score(labels_flat, preds_flat, average='weighted')
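
To show what `average='weighted'` computes, here is a NumPy-only sketch that takes the arg-max of toy logits and one-hot labels and averages the per-class F1 scores weighted by class support. The helper `weighted_f1` and the toy arrays are illustrative; this sketch scores a class with no true or predicted samples as 0.

```python
import numpy as np


def weighted_f1(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Support-weighted mean of per-class F1 scores."""
    classes = np.union1d(y_true, y_pred)
    total = 0.0
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom > 0 else 0.0
        total += f1 * np.sum(y_true == c)      # weight by the class's support
    return float(total / len(y_true))


# Toy logits for 4 samples and 3 classes, with one-hot labels.
logits = np.array([[2.0, 0.1, 0.3],
                   [0.2, 1.5, 0.1],
                   [0.9, 0.8, 0.1],
                   [0.1, 0.2, 3.0]])
one_hot = np.array([[1, 0, 0],
                    [0, 1, 0],
                    [0, 1, 0],
                    [0, 0, 1]])

y_pred = logits.argmax(axis=1)
y_true = one_hot.argmax(axis=1)
score = weighted_f1(y_true, y_pred)
print(score)
```

Because frequent classes carry more weight, a model that ignores rare classes can still obtain a moderate weighted F1, which is worth keeping in mind when reading the scores logged during training.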

In [None]:
def evaluate(val_dataloader):

    model.eval()

    loss_val_total = 0
    predictions, true_vals = [], []

    for batch in val_dataloader:


        inputs = {'input_ids':      batch['ids'].to(CONFIG.device),
                  'attention_mask': batch['mask'].to(CONFIG.device),
                  'labels':         batch['label'].to(CONFIG.device),
                 }

        with torch.no_grad():
            outputs = model(**inputs)

        loss = outputs["loss"]
        logits = outputs["logits"]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)

    loss_val_avg = loss_val_total/len(val_dataloader)

    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)

    return loss_val_avg, predictions, true_vals

### Define training loop


In [None]:
def train(model, optimizer, train_dataloader, val_dataloader):

    epochs = CONFIG.epochs

    for epoch in tqdm(range(1, epochs+1)):

      model.train()

      loss_train_total = 0

      progress_bar = tqdm(train_dataloader, desc='Epoch {:1d}'.format(epoch), leave=False, disable=True)

      for batch in progress_bar:

        optimizer.zero_grad()

        inputs = {'input_ids':      batch['ids'].to(CONFIG.device),
                  'attention_mask': batch['mask'].to(CONFIG.device),
                  'labels':         batch['label'].to(CONFIG.device),
                }
        output = model(**inputs)

        loss = output["loss"]
        loss_train_total += loss.item()

        loss.backward()
        optimizer.step()

        # loss is already averaged over the batch; len(batch) would count dict keys, not samples
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item())})


      tqdm.write(f'\nEpoch {epoch}')
      loss_train_avg = loss_train_total/len(train_dataloader)
      tqdm.write(f'Training loss: {loss_train_avg}')


      val_loss, predictions, true_vals = evaluate(val_dataloader)
      val_f1 = f1_score_func(predictions, true_vals)
      tqdm.write(f'Validation loss: {val_loss}')
      tqdm.write(f'F1 Score (Weighted): {val_f1}')


### Run

In [None]:
train(model=model, optimizer=optimizer, train_dataloader=train_loader, val_dataloader=validation_loader)

Epoch 1
Training loss: 0.4698566302736813
Validation loss: 0.4465766151746114
F1 Score (Weighted): 0.3042141094704492

Epoch 2
Training loss: 0.4407631951698007
Validation loss: 0.41510971116297174
F1 Score (Weighted): 0.41760541996352235

Epoch 3
Training loss: 0.4245450972395147
Validation loss: 0.4040612822229212
F1 Score (Weighted): 0.44267267985414166

Epoch 4
Training loss: 0.4158958150422509
Validation loss: 0.3942379454771678
F1 Score (Weighted): 0.46508969735608335

Epoch 5
Training loss: 0.40972501294498137
Validation loss: 0.39107980692025385
F1 Score (Weighted): 0.4497250070674416

Epoch 6
Training loss: 0.4052949426645901
Validation loss: 0.39074160023169086
F1 Score (Weighted): 0.49867124138254026

Epoch 7
Training loss: 0.4046495989522832
Validation loss: 0.3831418517864112
F1 Score (Weighted): 0.478198945184259

Epoch 8
Training loss: 0.4013456246033709
Validation loss: 0.38097597884409357
F1 Score (Weighted): 0.49900824971274427

Epoch 9
Training loss: 0.39893683838971794
Validation loss: 0.37837752428921784
F1 Score (Weighted): 0.5046570187767595

Epoch 10
Training loss: 0.3984285789058808
Validation loss: 0.3806744118531545
F1 Score (Weighted): 0.49334732040674817

100%|██████████| 10/10 [20:03<00:00, 120.38s/it]





## Using OpenDelta library (5 Points)

In [None]:
!pip install git+https://github.com/thunlp/OpenDelta.git

Use `OpenDelta` library to do the same thing. [link](https://opendelta.readthedocs.io/en/latest/modules/deltas.html)

For hyperparameters, test with `N_SOFT_PROMPT_TOKENS=10` and `N_SOFT_PROMPT_TOKENS=20` and report them.

The OpenDelta library inserts the soft tokens itself, so we do not need to add placeholder tokens ourselves. We therefore rebuild the dataset without them.

In [None]:
class NewBERTDataset(Dataset):
    def __init__(self,df):
        self.text = df['text'].values
        self.labels = df['label'].values
        self.all_labels = [0, 1, 2, 3, 4]
        self.max_len = CONFIG.max_len
        self.tokenizer = CONFIG.tokenizer
        self.n_tokens=CONFIG.n_tokens

    def __len__(self):
        return len(self.text)

    def __getitem__(self, index):
        text = self.text[index]
        text = ' '.join(text.split())
        inputs = self.tokenizer.encode_plus(
            text,
            None,
            truncation=True,
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',
            return_token_type_ids=True
        )

        labels = self.labels[index]
        label_dict = {label: (label == labels) for label in self.all_labels}
        labels_tensor = torch.tensor([float(label_dict[label]) for label in self.all_labels])
        return {
            'ids': torch.tensor(inputs['input_ids']),
            'mask': torch.tensor(inputs['attention_mask']),
            'label': labels_tensor
        }


In [None]:
train_dataset = NewBERTDataset(train_df)
validation_dataset = NewBERTDataset(validation_df)

In [None]:
train_loader = DataLoader(train_dataset, batch_size=CONFIG.train_batch,
                              num_workers=2, shuffle=True, pin_memory=True)

validation_loader = DataLoader(validation_dataset, batch_size=CONFIG.valid_batch,
                              num_workers=2, shuffle=True, pin_memory=True)

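In both cases F1 fluctuates considerably from epoch to epoch. In the runs below, `N_SOFT_PROMPT_TOKENS=20` performs better than `N_SOFT_PROMPT_TOKENS=10`: weighted F1 after the final epoch is about 0.464 versus 0.422, and the best epoch reaches about 0.526 versus 0.439. Since each setting was run only once, the difference should be read with caution. We can continue this experiment with larger values for `soft_token_num` to see whether performance improves: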
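
The weighted F1 values logged by the two OpenDelta runs in this section can be compared side by side with a few lines of Python. The scores below are copied from the training logs and rounded to four decimals; the dictionary name is just an illustrative choice.

```python
# Per-epoch weighted F1 from the two OpenDelta runs (rounded from the logs).
f1_by_prompt_len = {
    10: [0.2269, 0.4227, 0.3990, 0.3652, 0.4248, 0.3792, 0.3803, 0.4387, 0.3888, 0.4215],
    20: [0.2543, 0.3233, 0.3591, 0.4178, 0.3633, 0.4594, 0.4077, 0.4274, 0.5261, 0.4644],
}

for n_tokens, scores in f1_by_prompt_len.items():
    best = max(scores)
    best_epoch = scores.index(best) + 1   # epochs are numbered from 1
    print(f"soft_token_num={n_tokens}: final={scores[-1]:.4f}, "
          f"best={best:.4f} (epoch {best_epoch})")
```

Reporting both the final and the best epoch is useful here because the scores move by several points between consecutive epochs.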

In [None]:
######### Your code begins #########
from opendelta import SoftPromptModel
model = AutoModelForSequenceClassification.from_pretrained(CONFIG.model_name, num_labels=5, output_attentions = False,
                                                           output_hidden_states = False)

# CASE 1: N_SOFT_PROMPT_TOKENS=10
prompt_model = SoftPromptModel(backbone_model=model, soft_token_num=10, init_range=0.5)

prompt_model.freeze_module()

model.to(CONFIG.device)

optimizer = AdamW(model.parameters(), lr=CONFIG.learning_rate)

train(model=model, optimizer=optimizer, train_dataloader=train_loader, val_dataloader=validation_loader)

pytorch_model.bin:   0%|          | 0.00/654M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at HooshvareLab/bert-fa-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Epoch 1
Training loss: 0.4688398495396191
Validation loss: 0.4527739467042865
F1 Score (Weighted): 0.22693251170851797

Epoch 2
Training loss: 0.45014101386388994
Validation loss: 0.42514780163764954
F1 Score (Weighted): 0.4226839638728487

Epoch 3
Training loss: 0.4381828267465938
Validation loss: 0.4208501089702953
F1 Score (Weighted): 0.39901451867361704

Epoch 4
Training loss: 0.4320531115334302
Validation loss: 0.40971773953148816
F1 Score (Weighted): 0.3652065080963162

Epoch 5
Training loss: 0.42674727180106115
Validation loss: 0.4125773346785343
F1 Score (Weighted): 0.42480877483042057

Epoch 6
Training loss: 0.42502613142531187
Validation loss: 0.4254924086007205
F1 Score (Weighted): 0.3792286737530338

Epoch 7
Training loss: 0.42459849296087887
Validation loss: 0.41107688799048914
F1 Score (Weighted): 0.38029761599780254

Epoch 8
Training loss: 0.4221869628219044
Validation loss: 0.4014404930851676
F1 Score (Weighted): 0.4387404508453415

Epoch 9
Training loss: 0.4194143962732611
Validation loss: 0.40442008141315344
F1 Score (Weighted): 0.3887846476841646

Epoch 10
Training loss: 0.42011230801516036
Validation loss: 0.4060551804123503
F1 Score (Weighted): 0.4215387693294917

100%|██████████| 10/10 [16:48<00:00, 100.83s/it]





In [None]:
model = AutoModelForSequenceClassification.from_pretrained(CONFIG.model_name, num_labels=5, output_attentions = False,
                                                           output_hidden_states = False)

# CASE 2: N_SOFT_PROMPT_TOKENS=20
prompt_model = SoftPromptModel(backbone_model=model, soft_token_num=20, init_range=0.5)

prompt_model.freeze_module()

model.to(CONFIG.device)

optimizer = AdamW(model.parameters(), lr=CONFIG.learning_rate)

train(model=model, optimizer=optimizer, train_dataloader=train_loader, val_dataloader=validation_loader)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at HooshvareLab/bert-fa-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Epoch 1
Training loss: 0.4656994526080269
Validation loss: 0.4511566405946558
F1 Score (Weighted): 0.25425743122970584

Epoch 2
Training loss: 0.45472557715234907
Validation loss: 0.4341157349673184
F1 Score (Weighted): 0.3233461068947574

Epoch 3
Training loss: 0.4403417745535386
Validation loss: 0.41616956663854193
F1 Score (Weighted): 0.35909447803248495

Epoch 4
Training loss: 0.42633764055323475
Validation loss: 0.40275415687850025
F1 Score (Weighted): 0.4177976331742116

Epoch 5
Training loss: 0.42031978014956184
Validation loss: 0.4041371995752508
F1 Score (Weighted): 0.36334689423794364

Epoch 6
Training loss: 0.4141849941588978
Validation loss: 0.39452579256260034
F1 Score (Weighted): 0.45942909108806174

Epoch 7
Training loss: 0.4091011870273932
Validation loss: 0.39395132751175854
F1 Score (Weighted): 0.4076876714215705

Epoch 8
Training loss: 0.4067523095378264
Validation loss: 0.40398977380810364
F1 Score (Weighted): 0.42743823216317983

Epoch 9
Training loss: 0.40519137752247364
Validation loss: 0.38387309963052924
F1 Score (Weighted): 0.5261349742358886

Epoch 10
Training loss: 0.40254843306732685
Validation loss: 0.3869407393715598
F1 Score (Weighted): 0.46437329948726763

100%|██████████| 10/10 [18:02<00:00, 108.25s/it]



