# <Font color = 'indianred'>**Multi Label Analysis using Hugging Face Ecosystem** </font>

## <Font color = 'indianred'>**1. Set Environment**

In this notebook, we need to install the following additional libraries (compared to previous notebooks) from the Hugging Face ecosystem to enhance our workflow: **transformers**, **datasets**, **evaluate**, and **accelerate**. In addition, we also install **wandb**.

- The **transformers** library provides the **Trainer** class that we will use to manage the training process.
- The **datasets** library simplifies the process of accessing and manipulating a wide array of datasets.
- The **evaluate** library offers a suite of standardized metrics and methods for robust and consistent model evaluation.
- We will not use the **accelerate** library directly. However, we need to install it because the transformers library uses it in the background.
- Finally, the **wandb** library provides tools for efficient experiment tracking.
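
After installation, it can be useful to confirm which versions of these libraries are actually present in the environment. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of any of the libraries above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


libs = ["transformers", "datasets", "evaluate", "accelerate", "wandb"]
for name, ver in installed_versions(libs).items():
    print(f"{name}: {ver or 'not installed'}")
```

A `None` entry simply means the package was not found, which is a hint to rerun the install cell.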

In [None]:
# If in Colab, then import the drive module from google.colab
if 'google.colab' in str(get_ipython()):
  from google.colab import drive
  # Mount the Google Drive to access files stored there
  drive.mount('/content/drive')

  # Install or upgrade the Hugging Face libraries and wandb quietly (-qq suppresses most output)

  !pip install transformers evaluate wandb datasets accelerate  -U -qq  ## NEW LINES ##

  basepath = '/content/drive/MyDrive/data/Colab Notebooks/'
else:
  basepath = '/Users/bvand/Desktop/Homework4'



Mounted at /content/drive

In [None]:
# Importing PyTorch library for tensor computations and neural network modules
import torch
import torch.nn as nn

# General-purpose Python libraries for random number generation and numerical operations
import random
import numpy as np

# Utilities for efficient serialization/deserialization of Python objects and for element tallying
import joblib
from collections import Counter

# For partial function application (used later to bind the vocabulary to the collate function)
from functools import partial

# For filesystem path handling, generating and displaying confusion matrices, and date-time manipulations
from pathlib import Path
from sklearn.metrics import confusion_matrix
from datetime import datetime

# For plotting and visualization
import matplotlib.pyplot as plt
import seaborn as sns
# %matplotlib inline

### NEW ##########################
# imports from Huggingface ecosystem
from transformers.modeling_outputs import SequenceClassifierOutput
from transformers import PreTrainedModel, PretrainedConfig
from transformers import TrainingArguments, Trainer
from datasets import Dataset
import evaluate

# wandb library
import wandb

- `from transformers.modeling_outputs import SequenceClassifierOutput`: This import provides a specific output format for sequence classification tasks required by Huggingface Trainer. We will use this in our custom model.
- `from transformers import PreTrainedModel, PretrainedConfig`: A custom model should subclass `PreTrainedModel` so that it works with the Trainer. We also need a configuration class for the model, which should subclass `PretrainedConfig`.
- `from transformers import TrainingArguments, Trainer`: `TrainingArguments` is used to define training hyperparameters, while `Trainer` is a high-level API for training, fine-tuning, and evaluating models easily.
- `from datasets import Dataset`: This import from the `datasets` library is used to handle datasets more efficiently.
- `import evaluate`: This import brings in the `evaluate` library that offers various metrics to assess the performance of NLP models.
- `import wandb`: This import integrates the Weights & Biases library, a tool for experiment tracking.



<Font color = 'indianred'>*Specify Project Folders*

In [None]:
base_folder = Path(basepath)
data_folder = base_folder/'stackexchange'
model_folder = base_folder/'models/nlp_fall_2024/emotion/nn'
custom_functions = base_folder/'custom-functions'

In [None]:
model_folder.mkdir(exist_ok=True, parents = True)

In [None]:
model_folder

PosixPath('/content/drive/MyDrive/data/Colab Notebooks/models/nlp_fall_2024/emotion/nn')

In [None]:
data_folder

PosixPath('/content/drive/MyDrive/data/Colab Notebooks/stackexchange')

## <Font color = 'indianred'>**2. Load Data** </font>




<Font color = 'indianred'>*Load the train and test CSV files using pandas*

In [None]:
X_train_cleaned_file = '/content/drive/MyDrive/data/train.csv'
X_test_cleaned_file = '/content/drive/MyDrive/data/test.csv'

In [None]:
type(X_train_cleaned_file)

str

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

# The file paths are plain strings, so they can be passed to read_csv directly
train_data = pd.read_csv(X_train_cleaned_file)
test_data = pd.read_csv(X_test_cleaned_file)

# Features (tweets) and multi-label targets (emotions) from train data
X_train_cleaned = train_data['Tweet']  # Assuming 'Tweet' is the feature column
y_train = train_data[['anger', 'anticipation', 'disgust', 'fear', 'joy', 'love',
                      'optimism', 'pessimism', 'sadness', 'surprise', 'trust']]

# Prepare the test data
X_test_cleaned = test_data['Tweet']
y_test = test_data[['anger', 'anticipation', 'disgust', 'fear', 'joy', 'love',
                    'optimism', 'pessimism', 'sadness', 'surprise', 'trust']]

# Convert to Dataset for multi-label classification
trainset = Dataset.from_dict({
    'texts': X_train_cleaned.tolist(),
    'labels': y_train.values.tolist()  # Convert to list for compatibility
})



testset = Dataset.from_dict({
    'texts': X_test_cleaned.tolist(),
    'labels': y_test.values.tolist()
})

# Check data types
print(type(y_train))  # Should show it's a DataFrame


<class 'pandas.core.frame.DataFrame'>


In [None]:
type(y_train)

In [None]:
trainset.features

{'texts': Value(dtype='string', id=None),
 'labels': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)}

In [None]:
trainset.features['labels']

Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None)

In [None]:
trainset[0]

{'texts': "“Worry is a down payment on a problem you may never have'. \xa0Joyce Meyer.  #motivation #leadership #worry",
 'labels': [0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1]}

## <Font color = 'indianred'>**4. Create Custom Model and Model Config Class** </font>


In [None]:
class CustomConfig(PretrainedConfig):
    def __init__(self, vocab_size=0, embedding_dim=0, hidden_dim1=0, hidden_dim2=0, num_labels=11, **kwargs):
        # Forward extra attributes (e.g. id2label, label2id) to the parent class
        super().__init__(**kwargs)
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.hidden_dim1 = hidden_dim1
        self.hidden_dim2 = hidden_dim2
        self.num_labels = num_labels

* `**kwargs` allows the class to accept any additional configuration attributes that are not part of the standard set of attributes defined in the class, providing flexibility and extensibility.
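
The extra attributes only reach the parent class if they are forwarded with `super().__init__(**kwargs)`. The following is a minimal sketch of that pattern in plain Python. `BaseConfig` and `ToyConfig` are hypothetical stand-ins used for illustration only; they are not the transformers classes.

```python
class BaseConfig:
    """Hypothetical parent config that stores any extra keyword arguments as attributes."""

    def __init__(self, **kwargs):
        for key, value in kwargs.items():
            setattr(self, key, value)


class ToyConfig(BaseConfig):
    """Hypothetical child config with two named fields plus arbitrary extras."""

    def __init__(self, vocab_size=0, num_labels=11, **kwargs):
        super().__init__(**kwargs)  # forward the extras to the parent
        self.vocab_size = vocab_size
        self.num_labels = num_labels


cfg = ToyConfig(vocab_size=100, dropout=0.5)
print(cfg.vocab_size, cfg.num_labels, cfg.dropout)
```

Here `dropout` is not a named parameter of `ToyConfig`, yet it ends up as an attribute because it travels through `**kwargs`.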

In [None]:
class CustomMLP(PreTrainedModel):
    config_class = CustomConfig

    def __init__(self, config):
        super().__init__(config)

        self.embedding_bag = nn.EmbeddingBag(config.vocab_size, config.embedding_dim)
        self.layers = nn.Sequential(
            nn.Linear(config.embedding_dim, config.hidden_dim1),
            nn.BatchNorm1d(num_features=config.hidden_dim1),
            nn.ReLU(),
            nn.Dropout(p=0.5),
            nn.Linear(config.hidden_dim1, config.hidden_dim2),
            nn.BatchNorm1d(num_features=config.hidden_dim2),
            nn.ReLU(),
            nn.Dropout(p=0.5),
            nn.Linear(config.hidden_dim2, config.num_labels)
        )

    def forward(self, input_ids, offsets, labels=None):
        embed_out = self.embedding_bag(input_ids, offsets)
        logits = self.layers(embed_out)
        loss = None
        if labels is not None:
            loss_fct = nn.BCEWithLogitsLoss()  # Use BCE for multi-label classification
            loss = loss_fct(logits.view(-1, self.config.num_labels), labels.float())

        return SequenceClassifierOutput(
            loss=loss,
            logits=logits
        )
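
To make the loss in `forward` concrete, here is a small worked example of binary cross-entropy computed from raw logits, written in plain Python. It is a sketch of the formula with mean reduction over every (sample, label) pair; the function name `bce_with_logits` is hypothetical, and the toy logits and targets are made up for illustration.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over all (sample, label) pairs, starting from raw logits."""
    total, count = 0.0, 0
    for row_logits, row_targets in zip(logits, targets):
        for z, y in zip(row_logits, row_targets):
            p = 1.0 / (1.0 + math.exp(-z))  # sigmoid turns the logit into a probability
            total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
            count += 1
    return total / count


# One sample with three independent labels (toy values)
logits = [[2.0, -1.0, 0.0]]
targets = [[1, 0, 1]]
loss = bce_with_logits(logits, targets)
print(round(loss, 4))
```

Each label contributes its own term, which is why this loss suits multi-label problems where several labels can be active at once.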


## <Font color = 'indianred'>**5. Train Model** </font>

We will train our model utilizing the Hugging Face Trainer, a versatile and powerful tool for training machine learning models. To effectively use the Trainer, specific inputs are required:

1. **Dataset**: This refers to the data used for training the model. We have already created `trainset`; the validation set `validset` is split off from it in Section 5.5.
2. **Collate Function**: This function batches individual data points together. Its primary role is to ensure the data is correctly formatted for the model's first layer, specifically the EmbeddingBag layer. This step is crucial for the effective processing of inputs by the model.
3. **Model (Instance of the Model Class)**: We have developed a custom class for the Model and its Configuration. First, we will instantiate the model configuration using our custom config file. Then, we'll use this configuration to instantiate the model itself.
4. **Compute Metric Function**: To evaluate the model's performance during training, a function to compute metrics (like accuracy, F1 score, etc.) is necessary. This function will guide the training process by providing feedback on the model's current performance.
5. **Training Arguments**: These arguments encompass various settings for the training process, such as the number of epochs, learning rate, batch size, etc. They are essential for controlling how the model learns.

Next we will discuss how to specify these inputs in detail. After defining each of these components, we will instantiate the Trainer with these inputs and commence the training of our model.



### <font color = 'indianred'> **5.1 Collate Function**</font>
The collate function needs a vocabulary. Hence, we will first define a function that builds the vocabulary, and then create the vocabulary so that it can be passed to the collate function.

**Function to create vocab**

The `get_vocab` function has been updated to iterate directly over the 'texts' column of a Hugging Face `Dataset` object, a shift from the previous approach of handling custom data structures like lists of tuples. The rest of the function is similar to the previous notebook.

In [None]:
from collections import Counter, OrderedDict
from typing import Dict, List, Optional, Union

class Vocab:
    def __init__(self, tokens: List[str]) -> None:
        self.itos: List[str] = tokens
        self.stoi: Dict[str, int] = {token: i for i, token in enumerate(tokens)}
        self.default_index: Optional[int] = None

    def __getitem__(self, token: str) -> int:
        if token in self.stoi:
            return self.stoi[token]
        if self.default_index is not None:
            return self.default_index
        raise RuntimeError(f"Token '{token}' not found in vocab")

    def __contains__(self, token: str) -> bool:
        return token in self.stoi

    def __len__(self) -> int:
        return len(self.itos)

    def insert_token(self, token: str, index: int) -> None:
        if index < 0 or index > len(self.itos):
            raise ValueError("Index out of range")
        if token in self.stoi:
            old_index = self.stoi[token]
            if old_index < index:
                self.itos.pop(old_index)
                self.itos.insert(index - 1, token)
            else:
                self.itos.pop(old_index)
                self.itos.insert(index, token)
        else:
            self.itos.insert(index, token)

        self.stoi = {token: i for i, token in enumerate(self.itos)}

    def append_token(self, token: str) -> None:
        if token in self.stoi:
            raise RuntimeError(f"Token '{token}' already exists in the vocab")
        self.insert_token(token, len(self.itos))

    def set_default_index(self, index: Optional[int]) -> None:
        self.default_index = index

    def get_default_index(self) -> Optional[int]:
        return self.default_index

    def lookup_token(self, index: int) -> str:
        if 0 <= index < len(self.itos):
            return self.itos[index]
        raise RuntimeError(f"Index {index} out of range")

    def lookup_tokens(self, indices: List[int]) -> List[str]:
        return [self.lookup_token(index) for index in indices]

    def lookup_indices(self, tokens: List[str]) -> List[int]:
        return [self[token] for token in tokens]

    def get_stoi(self) -> Dict[str, int]:
        return self.stoi.copy()

    def get_itos(self) -> List[str]:
        return self.itos.copy()

    @classmethod
    def vocab(cls, ordered_dict: Union[OrderedDict, Counter], min_freq: int = 1, specials: Optional[List[str]] = None, special_first: bool = True) -> 'Vocab':
        specials = specials or []
        for token in specials:
            ordered_dict.pop(token, None)

        tokens = [token for token, freq in ordered_dict.items() if freq >= min_freq]

        if special_first:
            tokens = specials + tokens
        else:
            tokens = tokens + specials

        return cls(tokens)

In [None]:
def get_vocab(dataset, min_freq=1):
    """
    Generate a vocabulary from a dataset.

    Args:
        dataset (Dataset): A Hugging Face Dataset object. The dataset should
                           have a key 'texts' that contains the text data.
        min_freq (int): The minimum frequency for a token to be included in
                        the vocabulary.

    Returns:
        Vocab: Instance of the custom `Vocab` class defined above, containing
               tokens from the dataset that meet or exceed the specified
               minimum frequency. It also includes a special '<unk>' token
               for unknown words.
    """
    # Initialize a counter object to hold token frequencies
    counter = Counter()

    # Update the counter with tokens from each text in the dataset
    # Iterating through texts in the dataset
    for text in dataset['texts']:  ###### Change from previous function ####
        counter.update(str(text).split())

    # Create a vocabulary using the counter object
    # Tokens that appear fewer times than `min_freq` are excluded
    my_vocab = Vocab.vocab(counter, min_freq=min_freq)

    # Insert a '<unk>' token at index 0 to represent unknown words
    my_vocab.insert_token('<unk>', 0)

    # Set the default index to 0
    # This ensures that any unknown word will be mapped to '<unk>'
    my_vocab.set_default_index(0)

    return my_vocab

In [None]:
# Creating a function that will be used to get the indices of words from vocab
def tokenizer(text, vocab):
    """Converts text to a list of indices using a vocabulary dictionary"""
    return [vocab[token] for token in str(text).split()]
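
The combined effect of `min_freq` and the `<unk>` default index can be seen on a toy corpus. The sketch below mirrors the same idea with a plain dictionary instead of the `Vocab` class; the texts and the helper name `encode` are made up for illustration.

```python
from collections import Counter

texts = ["happy happy day", "sad day", "happy night"]

# Count whitespace-separated tokens across all texts
counter = Counter()
for text in texts:
    counter.update(text.split())

min_freq = 2
# '<unk>' sits at index 0, followed by tokens that meet min_freq (first-seen order)
itos = ["<unk>"] + [tok for tok, freq in counter.items() if freq >= min_freq]
stoi = {tok: i for i, tok in enumerate(itos)}


def encode(text):
    # Tokens missing from the vocabulary fall back to index 0 ('<unk>')
    return [stoi.get(tok, 0) for tok in text.split()]


print(itos)
print(encode("happy sad day"))
```

The word "sad" appears only once, so it is dropped from the vocabulary and encoded as index 0.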

In [None]:
def collate_batch(batch, my_vocab):
    # Similar to the previous example but keeps offsets
    labels = [sample['labels'] for sample in batch]
    texts = [sample['texts'] for sample in batch]

    labels = torch.tensor(labels, dtype=torch.float32)

    list_of_list_of_indices = [tokenizer(text, my_vocab) for text in texts]

    # Compute the offsets for each text
    offsets = [0] + [len(i) for i in list_of_list_of_indices]
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)

    input_ids = torch.cat([torch.tensor(i, dtype=torch.long) for i in list_of_list_of_indices])

    return {
        'input_ids': input_ids,
        'offsets': offsets,
        'labels': labels
    }


In [None]:
emotion_vocab = get_vocab(trainset, min_freq=2)
collate_fn = partial(collate_batch, my_vocab=emotion_vocab)
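
The offsets returned by `collate_batch` mark where each text starts inside the single flat tensor of token ids. The following plain-Python sketch reproduces that bookkeeping on a toy batch; the index lists are made up, and lists are used in place of tensors.

```python
from itertools import accumulate

# Three already-tokenized texts (lists of vocabulary indices)
batch_indices = [[4, 7, 2], [9], [3, 3, 8, 1]]

# One flat list holding all token ids in the batch
flat_ids = [idx for seq in batch_indices for idx in seq]

# Offsets: start position of each text inside flat_ids
lengths = [len(seq) for seq in batch_indices]
offsets = [0] + list(accumulate(lengths))[:-1]

print(flat_ids)
print(offsets)

# Sanity check: the original texts can be recovered from flat_ids and offsets
bounds = offsets + [len(flat_ids)]
recovered = [flat_ids[bounds[i]:bounds[i + 1]] for i in range(len(offsets))]
print(recovered == batch_indices)
```

This layout lets an EmbeddingBag layer average the embeddings of each text without padding the texts to a common length.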

### <Font color = 'indianred'>**5.2. Instantiate Model**
We will now specify the model using (1) the model config class `CustomConfig` and (2) the model class `CustomMLP` created earlier.

In [None]:
my_config = CustomConfig(vocab_size=len(emotion_vocab),
                         embedding_dim=300,
                         hidden_dim1=200,
                         hidden_dim2=100,
                         num_labels=11)



In [None]:
my_config

CustomConfig {
  "embedding_dim": 300,
  "hidden_dim1": 200,
  "hidden_dim2": 100,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5",
    "6": "LABEL_6",
    "7": "LABEL_7",
    "8": "LABEL_8",
    "9": "LABEL_9",
    "10": "LABEL_10"
  },
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_10": 10,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4,
    "LABEL_5": 5,
    "LABEL_6": 6,
    "LABEL_7": 7,
    "LABEL_8": 8,
    "LABEL_9": 9
  },
  "transformers_version": "4.45.1",
  "vocab_size": 10344
}

In [None]:
my_config.id2label = {
    0: 'anger',
    1: 'anticipation',
    2: 'disgust',
    3: 'fear',
    4: 'joy',
    5: 'love',
    6: 'optimism',
    7: 'pessimism',
    8: 'sadness',
    9: 'surprise',
    10: 'trust'
}

In [None]:
# Generate label2id by reversing the key-value pairs in id2label
my_config.label2id = {label: idx for idx, label in my_config.id2label.items()}

The above code is used to create a label2id mapping based on the existing id2label mapping in the my_config object.
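
One practical use of these mappings is turning a binary label vector back into readable emotion names. The sketch below assumes the same label order as above; the helper name `decode` is hypothetical.

```python
id2label = {
    0: 'anger', 1: 'anticipation', 2: 'disgust', 3: 'fear', 4: 'joy', 5: 'love',
    6: 'optimism', 7: 'pessimism', 8: 'sadness', 9: 'surprise', 10: 'trust',
}
# Reverse mapping: label name -> integer id
label2id = {label: idx for idx, label in id2label.items()}


def decode(binary_vector):
    """Return the names of all labels whose flag is 1."""
    return [id2label[i] for i, flag in enumerate(binary_vector) if flag == 1]


example = [0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1]  # label vector of the first training example
print(decode(example))
print(label2id['joy'])
```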

In [None]:
my_config

CustomConfig {
  "embedding_dim": 300,
  "hidden_dim1": 200,
  "hidden_dim2": 100,
  "id2label": {
    "0": "anger",
    "1": "anticipation",
    "2": "disgust",
    "3": "fear",
    "4": "joy",
    "5": "love",
    "6": "optimism",
    "7": "pessimism",
    "8": "sadness",
    "9": "surprise",
    "10": "trust"
  },
  "label2id": {
    "anger": 0,
    "anticipation": 1,
    "disgust": 2,
    "fear": 3,
    "joy": 4,
    "love": 5,
    "optimism": 6,
    "pessimism": 7,
    "sadness": 8,
    "surprise": 9,
    "trust": 10
  },
  "transformers_version": "4.45.1",
  "vocab_size": 10344
}

In [None]:
model = CustomMLP(config=my_config)


In [None]:
model

CustomMLP(
  (embedding_bag): EmbeddingBag(10344, 300, mode='mean')
  (layers): Sequential(
    (0): Linear(in_features=300, out_features=200, bias=True)
    (1): BatchNorm1d(200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
    (3): Dropout(p=0.5, inplace=False)
    (4): Linear(in_features=200, out_features=100, bias=True)
    (5): BatchNorm1d(100, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (6): ReLU()
    (7): Dropout(p=0.5, inplace=False)
    (8): Linear(in_features=100, out_features=11, bias=True)
  )
)

###  <font color = 'indianred'> **5.3. compute_metrics function** </font>
To provide context for the `compute_metrics` function, it's important to understand the shift in approach to model evaluation when using the Hugging Face `Trainer` compared to traditional methods:

*Role of `compute_metrics` Function in Hugging Face Ecosystem:*

- In the earlier notebook, model evaluation metrics like accuracy were explicitly calculated within the training and validation loops. This required manual coding of the metric computation, which can be complex and repetitive.

- With the Hugging Face `Trainer` (discussed later on), the process is simplified. The `Trainer` automates training, evaluation, and testing loops but requires a way to compute evaluation metrics. This is where the `compute_metrics` function comes into play.

- The `compute_metrics` function serves as a standardized way to calculate and return various evaluation metrics. It can be easily customized to include any metric supported by the `evaluate` module.
   
- This function is passed to the `Trainer` and is automatically called to compute metrics on the evaluation dataset.


In [None]:
!pip install numpy scipy
import numpy as np
from scipy.special import expit



In [None]:
import numpy as np
import evaluate

def compute_metrics(eval_pred):
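    # Load accuracy and F1 metrics
    accuracy_metric = evaluate.load("accuracy")
    # Note: `average` is normally an argument of `compute()`. Passed to `load()` here,
    # it most likely has no effect, so the F1 below uses the default binary averaging
    # on the flattened labels even though it is reported under the key 'f1_macro'.
    f1_metric = evaluate.load("f1", average="macro")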

    logits, labels = eval_pred

    # Calculate probabilities using the sigmoid function
    probabilities = 1 / (1 + np.exp(-logits))  # Sigmoid calculation

    # Apply a threshold to convert probabilities to binary predictions (0 or 1)
    threshold = 0.5
    binary_predictions = (probabilities > threshold).astype(int)

    # Ensure the predictions and labels are in the correct format
    # Flatten both predictions and labels
    binary_predictions = binary_predictions.flatten()
    labels = labels.flatten()

    # Convert to list for metric calculations
    binary_predictions_list = binary_predictions.tolist()
    labels_list = labels.tolist()

    # Calculate metrics
    accuracy = accuracy_metric.compute(predictions=binary_predictions_list, references=labels_list)
    f1 = f1_metric.compute(predictions=binary_predictions_list, references=labels_list)

    # Combine results into a single evaluation dictionary
    evaluations = {
        'accuracy': accuracy['accuracy'],
        'f1_macro': f1['f1']
    }

    return evaluations
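
Because `compute_metrics` flattens predictions and labels before scoring, its F1 pools every (sample, label) decision together. A per-label macro average instead computes F1 for each emotion separately and then averages, so rare labels count as much as frequent ones. The sketch below contrasts the two on toy data in plain Python. The helper names are hypothetical, and returning 0.0 when a label has no true or predicted positives is an assumed convention.

```python
def f1_binary(preds, refs):
    """F1 for the positive class; returns 0.0 when there are no positives at all (assumed convention)."""
    tp = sum(1 for p, r in zip(preds, refs) if p == 1 and r == 1)
    fp = sum(1 for p, r in zip(preds, refs) if p == 1 and r == 0)
    fn = sum(1 for p, r in zip(preds, refs) if p == 0 and r == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_flattened(pred_rows, ref_rows):
    """F1 over all (sample, label) decisions pooled into one long vector."""
    flat_preds = [p for row in pred_rows for p in row]
    flat_refs = [r for row in ref_rows for r in row]
    return f1_binary(flat_preds, flat_refs)


def f1_macro_per_label(pred_rows, ref_rows):
    """Average of the per-label F1 scores (one score per column)."""
    num_labels = len(ref_rows[0])
    scores = [
        f1_binary([row[j] for row in pred_rows], [row[j] for row in ref_rows])
        for j in range(num_labels)
    ]
    return sum(scores) / num_labels


# Toy data: 4 samples, 3 labels; the model only ever predicts the frequent first label
refs = [[1, 0, 0], [1, 0, 0], [1, 1, 0], [0, 0, 1]]
preds = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [0, 0, 0]]

flat_score = f1_flattened(preds, refs)
macro_score = f1_macro_per_label(preds, refs)
print(round(flat_score, 4), round(macro_score, 4))
```

The pooled score looks much better than the per-label average here because the two rare labels are never predicted, which only the macro average penalizes heavily.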


### <font color = 'indianred'> **5.4. Training Arguments**</font>


In [None]:
# Configure training parameters
training_args = TrainingArguments(

    # Training-specific configurations
    num_train_epochs=20,
    per_device_train_batch_size=128, # Number of samples per training batch
    per_device_eval_batch_size=128, # Number of samples per validation batch
    weight_decay=0.1, # weight decay (L2 regularization)
    learning_rate=0.001, # learning rate
    optim='adamw_torch', # optimizer
    remove_unused_columns=False, # flag to retain unused columns

    # Checkpoint saving and model evaluation settings
    output_dir=str(model_folder),  # Directory to save model checkpoints
    evaluation_strategy='steps',  # Evaluate model at specified step intervals
    eval_steps=50,  # Perform evaluation every 50 training steps
    save_strategy="steps",  # Save model checkpoint at specified step intervals
    save_steps=50,  # Save a model checkpoint every 50 training steps
    load_best_model_at_end=True,  # Reload the best model at the end of training
    save_total_limit=2,  # Retain only the best and the most recent model checkpoints
    # Use 'accuracy' as the metric to determine the best model
    metric_for_best_model="accuracy",
    greater_is_better=True,  # A model is 'better' if its accuracy is higher


    # Experiment logging configurations
    logging_strategy='steps',
    logging_steps=50,
    report_to='wandb',  # Log metrics and results to Weights & Biases platform
    run_name='imdb_hf_trainer',  # Experiment name for Weights & Biases
)



###  <font color = 'indianred'> **5.5. Initialize Trainer**</font>



In [None]:
train_set = trainset.train_test_split(test_size=0.2)

In [None]:
train_set

DatasetDict({
    train: Dataset({
        features: ['texts', 'labels'],
        num_rows: 6179
    })
    test: Dataset({
        features: ['texts', 'labels'],
        num_rows: 1545
    })
})

In [None]:
trainset = train_set['train']
validset = train_set['test']

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=trainset,
    eval_dataset = validset,
    data_collator=collate_fn,
    compute_metrics=compute_metrics,
)


### <Font color = 'indianred'>**5.5. Setup WandB**
Before we start training, we will log into WandB so that we can track our experiment.

In [None]:
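# Log in to Weights & Biases. Calling `wandb.login()` without arguments prompts for the
# API key, which keeps the key out of the notebook (never store a key in plain text).
wandb.login()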

[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


In [None]:
# specify the project name where the experiment will be logged
%env WANDB_PROJECT = nlp_course_spring_2024-sentiment-analysis-hf-trainer

env: WANDB_PROJECT=nlp_course_spring_2024-sentiment-analysis-hf-trainer


###  <font color = 'indianred'> **5.6. Training and Validation**</font>

In [None]:
trainer.train()

[34m[1mwandb[0m: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33mbvandanareddy8[0m ([33mbvandanareddy8-university-of-texas-at-dallas[0m). Use [1m`wandb login --relogin`[0m to force relogin


| Step | Training Loss | Validation Loss | Accuracy | F1 Macro |
|------|---------------|-----------------|----------|----------|
| 50   | 0.5694 | 0.525216 | 0.785054 | 0.038431 |
| 100  | 0.4818 | 0.482849 | 0.787467 | 0.132565 |
| 150  | 0.465  | 0.468283 | 0.793174 | 0.251331 |
| 200  | 0.4465 | 0.458019 | 0.798058 | 0.33617  |
| 250  | 0.4276 | 0.445275 | 0.80306  | 0.377997 |
| 300  | 0.4126 | 0.439608 | 0.806002 | 0.40861  |
| 350  | 0.3952 | 0.43231  | 0.808826 | 0.434662 |
| 400  | 0.3794 | 0.427438 | 0.813769 | 0.468514 |
| 450  | 0.3659 | 0.425202 | 0.813592 | 0.473229 |
| 500  | 0.3581 | 0.421689 | 0.816358 | 0.488947 |


TrainOutput(global_step=980, training_loss=0.37849834987095426, metrics={'train_runtime': 80.0259, 'train_samples_per_second': 1544.25, 'train_steps_per_second': 12.246, 'total_flos': 37848715637040.0, 'train_loss': 0.37849834987095426, 'epoch': 20.0})
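
The reported `global_step=980` can be checked by hand from the training split size, the batch size, and the number of epochs. The sketch below assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept; the variable names are chosen for illustration.

```python
import math

num_train_samples = 6179  # rows in the training split
batch_size = 128          # per_device_train_batch_size
num_epochs = 20           # num_train_epochs
eval_every = 50           # eval_steps

# The last, partially filled batch still counts as one optimizer step
steps_per_epoch = math.ceil(num_train_samples / batch_size)
total_steps = steps_per_epoch * num_epochs
periodic_evals = total_steps // eval_every

print(steps_per_epoch, total_steps, periodic_evals)
```

With 49 steps per epoch, evaluating every 50 steps corresponds to roughly one evaluation per epoch.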

<font color = 'indianred'> *Evaluate model on Validation Set* </font>

Even though we have been evaluating the model periodically during training (every 50 steps here), `trainer.evaluate()` is typically used to perform a final evaluation after all training epochs are completed. Because `load_best_model_at_end=True`, this final evaluation uses the best checkpoint rather than the last one. We can use these statistics to compare experiments with different hyperparameters or models.

In [None]:
trainer.evaluate()

{'eval_loss': 0.420954167842865,
 'eval_accuracy': 0.8194763165636952,
 'eval_f1_macro': 0.5064350064350064,
 'eval_runtime': 2.6513,
 'eval_samples_per_second': 582.732,
 'eval_steps_per_second': 4.903,
 'epoch': 20.0}

In [None]:
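# Note: prediction is run on `trainset`, so the metrics stored in `valid_output`
# describe the training split; pass `validset` to score the validation split instead
valid_output = trainer.predict(trainset)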

In [None]:
valid_output._fields

('predictions', 'label_ids', 'metrics')

In [None]:
valid_output

PredictionOutput(predictions=array([[-2.5525103 , -0.48390844, -2.307775  , ..., -2.252241  ,
        -2.114294  , -1.5065308 ],
       [ 0.02062101, -1.1015617 , -0.01023399, ..., -1.4708297 ,
        -1.7262698 , -1.9699585 ],
       [-0.28701395, -0.9499339 , -0.08711691, ..., -0.95581025,
        -2.382355  , -2.3736637 ],
       ...,
       [ 1.7703618 , -2.8455849 ,  1.5533438 , ..., -0.2650468 ,
        -3.8126452 , -4.848672  ],
       [ 2.746284  , -3.3804846 ,  2.4396737 , ..., -0.47917095,
        -4.0508432 , -5.4608192 ],
       [ 1.3494853 , -1.5440271 ,  1.0650587 , ..., -0.5860844 ,
        -2.4959521 , -3.117495  ]], dtype=float32), label_ids=array([[0., 1., 0., ..., 0., 0., 0.],
       [1., 0., 1., ..., 0., 0., 1.],
       [1., 0., 1., ..., 0., 0., 0.],
       ...,
       [1., 0., 1., ..., 1., 0., 0.],
       [1., 0., 1., ..., 1., 0., 0.],
       [1., 0., 1., ..., 0., 0., 0.]], dtype=float32), metrics={'test_loss': 0.2969598174095154, 'test_accuracy': 0.87735585340375

In [None]:
valid_output.metrics

{'test_loss': 0.2969598174095154,
 'test_accuracy': 0.8773558534037575,
 'test_f1_macro': 0.6727386934673367,
 'test_runtime': 4.164,
 'test_samples_per_second': 1483.901,
 'test_steps_per_second': 11.767}

In [None]:
# Multi-label: threshold each logit independently (logit > 0 is equivalent to sigmoid > 0.5)
valid_preds = (valid_output.predictions > 0).astype(int)
valid_labels = np.array(valid_output.label_ids)
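
In a multi-label setting each label is decided independently, so predictions come from thresholding every logit rather than picking the single largest one. The sketch below illustrates the difference on toy logits in plain Python; the helper names are hypothetical and the values are rounded, made-up examples.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def threshold_predictions(logit_rows, threshold=0.5):
    """Return a 0/1 decision for every label of every sample."""
    return [[1 if sigmoid(z) > threshold else 0 for z in row] for row in logit_rows]


logits = [
    [1.77, -2.85, 1.55, -0.27],    # two labels clearly active
    [-2.55, -0.48, -2.31, -1.51],  # no label active
]

multi_label_preds = threshold_predictions(logits)
argmax_preds = [row.index(max(row)) for row in logits]

print(multi_label_preds)
print(argmax_preds)
```

Thresholding can return several labels or none for a sample, whereas argmax always returns exactly one label index, even for the second sample where every probability is below 0.5.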

<font color = 'indianred'> *Get best checkpoint*</font>

In [None]:
# After training, let us check the best checkpoint
# We need this for Inference
best_model_checkpoint_step = trainer.state.best_model_checkpoint.split('-')[-1]
print(f"The best model was saved at step {best_model_checkpoint_step}.")

The best model was saved at step 700.
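
Splitting on `'-'` works as long as the checkpoint folder follows the `checkpoint-<step>` naming pattern. A slightly more defensive variant is sketched below; the function name `checkpoint_step` and the example paths are hypothetical.

```python
from pathlib import PurePosixPath


def checkpoint_step(checkpoint_path):
    """Extract the integer step from a path ending in 'checkpoint-<step>'."""
    name = PurePosixPath(checkpoint_path).name  # last path component
    prefix, _, step = name.rpartition('-')
    if prefix != 'checkpoint' or not step.isdigit():
        raise ValueError(f"Unexpected checkpoint folder name: {name!r}")
    return int(step)


print(checkpoint_step('models/example-run/checkpoint-700'))
```

Using only the last path component avoids being confused by hyphens elsewhere in the path, such as in a project folder name.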


In [None]:
wandb.finish()

Run summary:

| Metric | Value |
|--------|-------|
| eval/accuracy | 0.81948 |
| eval/f1_macro | 0.50644 |
| eval/loss | 0.42095 |
| eval/runtime | 2.6513 |
| eval/samples_per_second | 582.732 |
| eval/steps_per_second | 4.903 |
| test/accuracy | 0.87736 |
| test/f1_macro | 0.67274 |
| test/loss | 0.29696 |
| test/runtime | 4.164 |


## <Font color = 'indianred'> **6. Performance on Test Set**

<Font color = 'indianred'> **Load Model from checkpoint**

In [None]:
# Define the path to the best model checkpoint
# 'model_checkpoint' variable is constructed using the model folder path and the checkpoint step
# This step is identified as having the best model performance during training
model_checkpoint = model_folder/f'checkpoint-{best_model_checkpoint_step}'


In [None]:
# Instantiate the CustomMLP model with predefined configurations
# 'my_config' is an instance of the CustomConfig class, containing specific model settings like
# vocabulary size, embedding dimensions, etc.
model = CustomMLP(my_config)


In [None]:
model

CustomMLP(
  (embedding_bag): EmbeddingBag(10344, 300, mode='mean')
  (layers): Sequential(
    (0): Linear(in_features=300, out_features=200, bias=True)
    (1): BatchNorm1d(200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
    (3): Dropout(p=0.5, inplace=False)
    (4): Linear(in_features=200, out_features=100, bias=True)
    (5): BatchNorm1d(100, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (6): ReLU()
    (7): Dropout(p=0.5, inplace=False)
    (8): Linear(in_features=100, out_features=11, bias=True)
  )
)

In [None]:
# Load the trained weights into a CustomMLP model from the specified checkpoint
# 'model_checkpoint' refers to the path where the model's best-performing state is saved
# `from_pretrained` is a class method, so it is called on the class and returns a new model
model = CustomMLP.from_pretrained(model_checkpoint, config=my_config)


In [None]:
model

CustomMLP(
  (embedding_bag): EmbeddingBag(10344, 300, mode='mean')
  (layers): Sequential(
    (0): Linear(in_features=300, out_features=200, bias=True)
    (1): BatchNorm1d(200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
    (3): Dropout(p=0.5, inplace=False)
    (4): Linear(in_features=200, out_features=100, bias=True)
    (5): BatchNorm1d(100, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (6): ReLU()
    (7): Dropout(p=0.5, inplace=False)
    (8): Linear(in_features=100, out_features=11, bias=True)
  )
)

<Font color = 'indianred'> **Instantiate Trainer for evaluation**

In [None]:
# Create a partial function 'collate_fn' using 'collate_batch' with 'my_vocab' set to 'emotion_vocab'
# This function will be used by the Trainer to process batches of data during evaluation
collate_fn = partial(collate_batch, my_vocab=emotion_vocab)

# Configure training arguments for model evaluation
# 'output_dir' specifies where to save the results
# 'per_device_eval_batch_size' sets the batch size for evaluation, adjusted based on available GPU memory
# 'do_train = False' and 'do_eval=True' indicate that training is not performed, but evaluation is
# 'remove_unused_columns=False' ensures that all columns in the dataset are retained during evaluation
# 'report_to=[]' disables logging to external services like Weights & Biases

training_args = TrainingArguments(
    output_dir="./results",
    per_device_eval_batch_size=16,
    do_train=False,
    do_eval=True,
    remove_unused_columns=False,
    report_to=[]
)


In [None]:
test_dataset = pd.read_csv('/content/drive/MyDrive/data/test.csv', usecols=lambda column: column != 'ID')
testset = Dataset.from_dict({
    'texts': test_dataset['Tweet'].to_list(),
    'labels': [[0] * 11 for _ in range(len(test_dataset))],  # Placeholder all-zero labels, one independent list per row
})
testset[0]

{'texts': '@Adnan__786__ @AsYouNotWish Dont worry Indian army is on its ways to dispatch all Terrorists to Hell',
 'labels': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}

In [None]:
# Initialize the Trainer with the specified model and training arguments
# 'model' is the CustomMLP model loaded with pre-trained weights
# 'training_args' contains the configurations for evaluation, including batch sizes and output directory
# 'eval_dataset' is set to 'testset', which is the dataset used for evaluating the model
# 'data_collator' is assigned 'collate_fn', the function for processing batches of data
# 'compute_metrics' is a function that calculates evaluation metrics like accuracy and F1 score

trainer = Trainer(
    model=model,
    args=training_args,
    eval_dataset=testset,
    data_collator=collate_fn,
    compute_metrics=compute_metrics,
)


In [None]:
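# Run the evaluation loop on 'testset'
# The labels in 'testset' are all-zero placeholders, so the reported loss and metrics do not measure model quality
trainer.evaluate()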

{'eval_loss': 0.3602648079395294,
 'eval_model_preparation_time': 0.0006,
 'eval_accuracy': 0.8577087226979833,
 'eval_f1_macro': 0.0,
 'eval_runtime': 3.1795,
 'eval_samples_per_second': 1024.993,
 'eval_steps_per_second': 64.16}
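
The `eval_f1_macro` of 0.0 above is most likely a consequence of the placeholder labels rather than of the model: with all-zero ground truth, no label can have a true positive. A minimal sketch in plain Python, assuming `compute_metrics` (defined earlier in the notebook) averages per-label F1 and counts an undefined F1 as 0. The helpers `f1_for_label` and `macro_f1` and the toy rows are hypothetical, and the element-wise accuracy is an assumption about how accuracy is computed.

```python
def f1_for_label(y_true, y_pred):
    """F1 for one binary label; an undefined F1 (no positives at all) counts as 0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


def macro_f1(true_rows, pred_rows):
    """Unweighted mean of the per-label F1 scores."""
    num_labels = len(true_rows[0])
    scores = [
        f1_for_label([row[j] for row in true_rows], [row[j] for row in pred_rows])
        for j in range(num_labels)
    ]
    return sum(scores) / num_labels


# Toy data (made up): 4 samples, 3 labels, ground truth is all zeros
true_rows = [[0, 0, 0] for _ in range(4)]
pred_rows = [[1, 0, 0], [0, 0, 0], [1, 1, 0], [0, 0, 0]]

placeholder_f1 = macro_f1(true_rows, pred_rows)

# Element-wise accuracy only counts how often the model predicted 0
matches = sum(t == p for tr, pr in zip(true_rows, pred_rows) for t, p in zip(tr, pr))
elementwise_accuracy = matches / (len(true_rows) * len(true_rows[0]))

print(placeholder_f1)        # 0.0
print(elementwise_accuracy)  # 0.75
```

Under these assumptions the accuracy figure only reflects the share of predicted zeros, so neither number should be read as test performance.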

## <Font color = 'indianred'> **7. Model Inference**
Model inference is the stage in the machine learning process where a trained model is used to make predictions on new, unseen data. Unlike the training or evaluation phases, labels are not required at this stage, as the primary goal is to apply the model's learned patterns and knowledge to generate predictions.




In [None]:
testset

Dataset({
    features: ['texts', 'labels'],
    num_rows: 3259
})

In [None]:
sample_X = testset['texts']

*Step 1. Preprocessing*

In [None]:
device = 'cpu'
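# Convert the list of texts into a list of lists; each inner list contains the vocabulary indices for a text
# Iterate over 'sample_X' (the raw tweet strings) rather than over 'testset', whose rows are dictionaries
list_of_list_of_indices = [tokenizer(text, emotion_vocab) for text in sample_X]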

# Compute the offsets for each text in the concatenated tensor
offsets = [0] + [len(i) for i in list_of_list_of_indices]
offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)

# Concatenate all text indices into a single tensor
indices = torch.cat([torch.tensor(i, dtype=torch.int64) for i in list_of_list_of_indices])
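
The offset computation above can be checked on a toy input. The sketch below uses only the standard library; the three token-index lists are made up for illustration, and `build_flat_indices_and_offsets` is a hypothetical helper name. Each offset is the position in the flat index list where a text begins, which is the layout `EmbeddingBag` expects for a 1-D input.

```python
from itertools import accumulate


def build_flat_indices_and_offsets(list_of_list_of_indices):
    """Flatten per-text index lists and record where each text starts."""
    if not list_of_list_of_indices:
        return [], []
    lengths = [len(ids) for ids in list_of_list_of_indices]
    # Exclusive cumulative sum: the first text starts at 0, and each
    # following text starts where the previous one ended
    offsets = [0] + list(accumulate(lengths))[:-1]
    flat_indices = [idx for ids in list_of_list_of_indices for idx in ids]
    return flat_indices, offsets


# Hypothetical vocabulary indices for three short texts
toy_indices = [[4, 7, 9], [2], [5, 5, 8, 1]]
flat_indices, offsets = build_flat_indices_and_offsets(toy_indices)

print(flat_indices)  # [4, 7, 9, 2, 5, 5, 8, 1]
print(offsets)       # [0, 3, 4]
```

The number of offsets always equals the number of texts, which is a quick sanity check before passing the tensors to the model.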

*Step 2: Get Predictions*

In [None]:
# move model to appropriate device
model.to(device)

# put model in evaluation mode
model.eval()

# get outputs (logits) from model
outputs = model(indices, offsets)
outputs

SequenceClassifierOutput(loss=None, logits=tensor([[-1.0018, -0.7015, -0.7786,  ..., -1.3333, -1.8851, -1.8930],
        [-0.6305, -1.0281, -0.2947,  ..., -0.7987, -2.4085, -2.4518],
        [-1.6315, -0.7372, -1.4425,  ..., -1.7249, -1.8496, -1.7397],
        ...,
        [ 0.3254, -1.0429,  0.4378,  ..., -0.5334, -2.0963, -2.9929],
        [-1.4181, -0.7953, -1.1589,  ..., -1.3231, -1.7889, -1.6151],
        [-0.8624, -0.7395, -0.7451,  ..., -1.0174, -1.9232, -2.3489]],
       grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)

In [None]:
outputs.logits

tensor([[-1.0018, -0.7015, -0.7786,  ..., -1.3333, -1.8851, -1.8930],
        [-0.6305, -1.0281, -0.2947,  ..., -0.7987, -2.4085, -2.4518],
        [-1.6315, -0.7372, -1.4425,  ..., -1.7249, -1.8496, -1.7397],
        ...,
        [ 0.3254, -1.0429,  0.4378,  ..., -0.5334, -2.0963, -2.9929],
        [-1.4181, -0.7953, -1.1589,  ..., -1.3231, -1.7889, -1.6151],
        [-0.8624, -0.7395, -0.7451,  ..., -1.0174, -1.9232, -2.3489]],
       grad_fn=<AddmmBackward0>)

*Step 3: Post Processing*

In [None]:
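# NOTE: torch.abs discards the sign of each logit, so every probability computed below is >= 0.5
# (a sigmoid is normally applied to the raw logits)
predictions = torch.abs(outputs.logits)
predictions = predictions.detach().numpy()
label_columns = ['anger', 'anticipation', 'disgust', 'fear', 'joy', 'love', 'optimism', 'pessimism', 'sadness', 'surprise', 'trust']
# Map each label index to its emotion name
original_labels = {idx: column for idx, column in enumerate(label_columns)}
# Convert the values to probabilities with the sigmoid function
sigmoid_result = 1 / (1 + np.exp(-predictions))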
sigmoid_result[:5]

array([[0.73141205, 0.66852367, 0.6853813 , 0.76264274, 0.5013437 ,
        0.8701301 , 0.62698454, 0.8876162 , 0.7913903 , 0.8681953 ,
        0.869095  ],
       [0.6526042 , 0.73655105, 0.5731455 , 0.6574237 , 0.74999654,
        0.9351286 , 0.7933292 , 0.83004355, 0.6896929 , 0.9174763 ,
        0.92069614],
       [0.8363751 , 0.6763792 , 0.80884147, 0.8472085 , 0.68911004,
        0.7729242 , 0.53268355, 0.9249624 , 0.8487603 , 0.86408365,
        0.85065436],
       [0.6566412 , 0.7010369 , 0.6173616 , 0.80843395, 0.54551876,
        0.89702123, 0.6893649 , 0.9048782 , 0.78803635, 0.87993777,
        0.9136633 ],
       [0.52838236, 0.7791795 , 0.5856165 , 0.61196715, 0.8414582 ,
        0.9713345 , 0.85331404, 0.857773  , 0.6748327 , 0.9140988 ,
        0.95538956]], dtype=float32)

In [None]:
threshold = 0.6
predictions_labels = (sigmoid_result > threshold)
predictions_labels.astype(int)[:5]

array([[1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1],
       [1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1],
       [1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1],
       [0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1]])
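
Almost every label is switched on in the output above because the absolute value is taken before the sigmoid, which pushes every probability to 0.5 or higher. The sketch below, in plain Python with a few logits copied from the printed tensor, contrasts that with applying the sigmoid to the raw logits. `sigmoid` and `to_labels` are hypothetical helpers, and this is an illustration of the difference, not a rerun of the notebook.

```python
import math


def sigmoid(x):
    """Logistic function mapping a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def to_labels(logits, threshold, use_abs=False):
    """Threshold sigmoid probabilities; optionally take |logit| first."""
    labels = []
    for x in logits:
        value = abs(x) if use_abs else x
        labels.append(int(sigmoid(value) > threshold))
    return labels


# A few logits taken from the printed 'outputs.logits' tensor
toy_logits = [-1.0018, 0.3254, -2.9929, 0.4378]

raw_labels = to_labels(toy_logits, threshold=0.6)
abs_labels = to_labels(toy_logits, threshold=0.6, use_abs=True)

print(raw_labels)  # [0, 0, 0, 1]
print(abs_labels)  # [1, 0, 1, 1]
```

With raw logits, strongly negative scores map to probabilities near 0 and stay off. With absolute values, the same scores map to probabilities near 1 and are switched on.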

In [None]:
submission = pd.read_csv('/content/drive/MyDrive/data/sample_submission.csv')

In [None]:
submission.head()

           ID  anger  anticipation  disgust  fear  joy  love  optimism  pessimism  sadness  surprise  trust
0  2018-01559      0             0        0     0    0     0         0          0        0         0      0
1  2018-03739      0             0        0     0    0     0         0          0        0         0      0
2  2018-00385      0             0        0     0    0     0         0          0        0         0      0
3  2018-03001      0             0        0     0    0     0         0          0        0         0      0
4  2018-01988      0             0        0     0    0     0         0          0        0         0      0


In [None]:
submission.columns


Index(['ID', 'anger', 'anticipation', 'disgust', 'fear', 'joy', 'love',
       'optimism', 'pessimism', 'sadness', 'surprise', 'trust'],
      dtype='object')

In [None]:
predictions_num = predictions_labels.astype(int)
predictions_num[:5]

array([[1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1],
       [1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1],
       [1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1],
       [0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1]])

In [None]:
submission[['anger', 'anticipation', 'disgust', 'fear', 'joy', 'love',
            'optimism', 'pessimism', 'sadness', 'surprise', 'trust']] = predictions_num

In [None]:
submission.head()

           ID  anger  anticipation  disgust  fear  joy  love  optimism  pessimism  sadness  surprise  trust
0  2018-01559      1             1        1     1    0     1         1          1        1         1      1
1  2018-03739      1             1        0     1    1     1         1          1        1         1      1
2  2018-00385      1             1        1     1    1     1         0          1        1         1      1
3  2018-03001      1             1        1     1    0     1         1          1        1         1      1
4  2018-01988      0             1        0     1    1     1         1          1        1         1      1


In [None]:
submission.to_csv('/content/drive/MyDrive/data/emotion_kaggle.csv', index = False)