# Practical Session 2 - Model Frugality

Welcome to this second practical session of the Frugal AI series. In this session we'll delve into the concept of model frugality: how to build models that are both accurate and efficient while minimizing the resources required to train them.



*During this session you're invited to measure the energy consumption of your code snippets for different models. This is optional: the measurement library does not work on Google Colab and requires further configuration on some laptops.*


### Energy Consumption Measurements
To quantify the frugality of our approaches, we'll try to use the library `pyRAPL` to measure the energy consumed by the CPU while training our different models.
The library isn't compatible with every piece of hardware or OS. If you encounter a problem using it, just skip this part :)

Try the snippet below to check it out!

### Test of an Energy Consumption estimation snippet

In [None]:
import pyRAPL
import time
def sum_range(total = 1000000):
    sum(range(total))
    return

all_energies = []
start = 5
end = 9
pyRAPL.setup()  # only needs to be called once
for total in range(start, end):
    meter = pyRAPL.Measurement("energy-snippet")
    meter.begin()
    sum_range(10**total)
    meter.end()
    all_energies.append(meter.result.pkg[0])  # energy of CPU socket 0
# Energy consumption is measured in micro-joules (μJ)
for i, energy in enumerate(all_energies):
    print(f"Summing 10^{start+i} numbers consumed {energy} μJ")



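
Since `pyRAPL` is not available everywhere, here is a minimal sketch of a wrapper that falls back to wall-clock time when energy counters cannot be read. The helper name `measure` is hypothetical, and wall-clock time is only a rough proxy for energy, not an equivalent measurement.

```python
import time

def measure(func, *args, label="snippet"):
    """Run func(*args) and return (result, value, unit).

    Tries pyRAPL first (CPU package energy in microjoules) and falls back
    to wall-clock time in seconds when pyRAPL is missing or unsupported.
    """
    try:
        import pyRAPL
        pyRAPL.setup()
        meter = pyRAPL.Measurement(label)
        meter.begin()
        result = func(*args)
        meter.end()
        return result, meter.result.pkg[0], "uJ"
    except Exception:
        # pyRAPL not installed, no RAPL counters, or no read permission:
        # run the function (again, if needed) and time it instead
        start = time.perf_counter()
        result = func(*args)
        return result, time.perf_counter() - start, "s"

result, value, unit = measure(sum, range(10**5))
print(f"sum = {result}, cost = {value} {unit}")
```

The same wrapper can be placed around a `fit()` call later in the session to compare models with a single unit per machine.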

# 1 - Dataset Preparation
In this [data folder](https://drive.google.com/drive/folders/1jX4omja8UBcX3MNupX2rLmUJkzR9RhM6?usp=drive_link), you'll find two datasets.

#### Amazon Reviews (6-class classification)
The first one is another extract of the Amazon reviews dataset, framed as a six-category classification task. Each class has 500 examples, which have been pre-selected and randomly split following a classic 80%/10%/10% train/val/test scheme.
The following categories have been kept:
[
    "Beauty",
    "Movies",
    "Appliances",
    "Digital Music",
    "Software",
    "Video_Games"
]

#### WikiText-103-100k

This second dataset is an extract of WikiText-103, a well-known dataset of high-quality Wikipedia articles. Later in the session, this extract will allow you to train low-resource embedding models.

In [None]:
#1.1 Load the amazon dataset in a variable called split and print the first 5 lines of the train set
import os
import json
import gzip
from typing import List

path_data_amazon = "/home/ed/Dev/CODE_PERSO/data_tp2/amazon_reviews/" # To replace with your local path or GoogleColab path

paths_amazon_splits = {
    key : os.path.join(path_data_amazon, f'{key}.jsonl.gz') for key in ["train", "val", "test"]
}

splits = {

# `splits` should be of the form splits["train"] = [
#     {
#         "review": 'Works great',
#         "rating": 4.0
#      ...
#      }
# ]


}


# Your code here

for i in range(5):
    print(splits["train"][i])
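
If the `.jsonl.gz` format is new to you, here is a small sketch of how such a file can be read: one JSON object per line, inside a gzip archive. The helper `read_jsonl_gz` and the toy rows are hypothetical and only illustrate the file format, not the actual dataset.

```python
import gzip
import json
import os
import tempfile

def read_jsonl_gz(path):
    """Return the list of JSON objects stored one per line in a gzipped file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Toy rows standing in for the content of train.jsonl.gz (hypothetical)
toy_rows = [
    {"review": "Works great", "rating": 4.0, "category": "Appliances"},
    {"review": "Fun game", "rating": 5.0, "category": "Video_Games"},
]

with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "train.jsonl.gz")
    # Write the toy file, then read it back with the helper
    with gzip.open(toy_path, "wt", encoding="utf-8") as f:
        for row in toy_rows:
            f.write(json.dumps(row) + "\n")
    loaded = read_jsonl_gz(toy_path)

print(loaded[0])
```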


In [None]:
# 1.2 Plot the label distribution of the classes in the training set
# You can copy paste and modify the function you made during the first practical session
import matplotlib.pyplot as plt
import numpy as np
from typing import Any

labels = [elem["category"] for elem in splits["train"]]


def plot_distribution_histogram(list_labels: List[Any]):
  # Your code here
  pass

plot_distribution_histogram(labels)

You should obtain something like this:


<a href="https://ibb.co/NTp03yN"><img src="https://i.ibb.co/g4w07RW/labels-distribution.png" alt="labels-distribution" border="0"></a>

## 2 - Bag-of-Words Classification Methods

- **BoW** methods are a simple family of methods to represent text data, treating it as a collection of words without considering grammar or word order.
- Each word is mapped to a vector representing word frequencies or occurrences in a document.

In this part we'll study two of those approaches: Naive Bayes classification and TF-IDF classification.
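
To make the idea concrete, here is a minimal sketch of a count-based BoW representation on a toy corpus (the two sentences and the helper `bow_vector` are made up for illustration):

```python
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat"]

# Vocabulary: sorted unique words across the toy corpus
vocab = sorted({word for doc in docs for word in doc.split()})

def bow_vector(doc, vocab):
    """Count how many times each vocabulary word occurs in doc."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [bow_vector(doc, vocab) for doc in docs]
print(vocab)    # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)  # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

Word order is lost: any permutation of the words of a document gives the same vector.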

### Advantages
- **Simplicity**: Easy to implement.
- **Efficiency**: Low-resource approaches that scale well.
- **Interpretability**: Easy to understand how classification works with word counts.
- **Universal**: Language-agnostic and domain-agnostic.


### Lemmatization and Stop Words Removal in BoW

In BoW, the size of the final feature space equals the number of unique words. To reduce
the size of the feature space by removing the less informative words, we usually perform two pre-processing steps: lemmatization and stop-word removal.

#### 1. Lemmatization or Stemming
- **Definition**: Reducing words to their base or root form (e.g., 'running' to 'run').
- **Why Use It?**: Minimizes redundancy by treating different forms of a word as the same feature, improving model performance.
- **If no resources are available**: A simple `Stemmer` can be built that removes common suffixes or prefixes of words, such as marks of plurals, conjugations, gender, case endings...
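
As an illustration of such a resource-free stemmer, here is a deliberately naive sketch. The suffix list is a hypothetical, English-only choice, and the rule makes mistakes (for example `running` becomes `runn`), which is the price of not using a real lemmatizer.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical, English-only list

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

stems = [naive_stem(w) for w in ["cats", "jumped", "walking", "boxes", "is"]]
print(stems)  # ['cat', 'jump', 'walk', 'box', 'is']
```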


#### 2. Stop Words Removal
- **Definition**: Removing common words (e.g., 'the', 'is') that carry little semantic value.
- **Why Use It?**: Reduces noise, improves computational efficiency, and prevents irrelevant words from influencing the model.
- **If no resources are available**: A simple stop word list can be created by calculating word frequencies across a corpus, selecting the most frequent and semantically insignificant words, and optionally refining the list manually.
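
The frequency-based approach can be sketched as follows. The helper `frequent_words` and the toy corpus are hypothetical, and ranking by document frequency is one possible choice among others; the resulting list would still need a manual check.

```python
from collections import Counter

def frequent_words(tokenized_docs, top_k=2):
    """Return the top_k words ranked by the number of documents they appear in."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each word once per document
    return [word for word, _ in doc_freq.most_common(top_k)]

toy_corpus = [
    ["the", "cat", "is", "here"],
    ["the", "dog", "is", "big"],
    ["the", "bird", "sings"],
    ["a", "fish", "is", "the", "pet"],
]
stop_candidates = frequent_words(toy_corpus, top_k=2)
print(stop_candidates)  # ['the', 'is']
```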

For lemmatization and stop-word removal we'll use a well-known library called spaCy. It is available for many languages (see https://spacy.io/usage/models).

We first need to download the English model:

In [None]:
! python -m spacy download en_core_web_sm

Below are two code snippets that show how to use spaCy's lemmatizer and stop-word filter.

In [None]:
# Here is a code snippet that shows how to use the spaCy lemmatizer

import spacy
# Load the spacy model
nlp = spacy.load("en_core_web_sm")

# Sample text
text = "The cats are running faster than the dogs."

# Process the text
doc = nlp(text)

# Print the lemmas
lemmatized = [token.lemma_ for token in doc]
print(lemmatized)

In [None]:
# Here is a code snippet that shows how to use the spaCy stopwords filter
import spacy

# Load the spacy model
nlp = spacy.load("en_core_web_sm")

# Sample text
text = "The cats are running faster than the dogs."

# Process the text
doc = nlp(text)

# Filter out stop words
filtered_tokens = [token.text for token in doc if not token.is_stop]

print(filtered_tokens)

In [None]:
# 2.1 Process the three splits by creating a lemmatized version of each review without stop words
# Only keep words that are composed of letters, lowercase all the words, and remove accents
# Create entries named lemmatized_no_stopwords in the splits dictionary (see below for the expected output)
#
# splits["train"][0] = {'review': 'Works great / packaged in box was great too.',
#  'rating': 4.0,
#  'category': 'Appliances',
#  'lemmatized_no_stopwords': ['work', 'great', 'package', 'box', 'great']}

# The function to remove accents is supplied below
import unicodedata

def remove_accents(input_str):
    nkfd_form = unicodedata.normalize('NFKD', input_str)
    return u"".join([c for c in nkfd_form if not unicodedata.combining(c)])

def lemmatize_and_remove_stopwords(splits: dict):

    # Your code here
    return splits

splits = lemmatize_and_remove_stopwords(splits)


## First Model: Naive Bayes Classification

## Overview
Naive Bayes is a probabilistic classifier based on Bayes' theorem, assuming independence between features. It is particularly effective for text classification tasks.

## Bag of Words Model
The Bag of Words (BoW) model represents text data by converting it into a matrix of token counts. Each document is transformed into a vector where each element corresponds to the frequency of a word in the document.

### Steps:

1. **Text Preprocessing**:
   - Tokenization: Split text into words.
   - Lowercasing: Convert all words to lowercase.
   - Stop-word removal: Eliminate common words (e.g., "and", "the").
   - Stemming/Lemmatization: Reduce words to their root forms.

2. **Vocabulary Creation**:
   - Build a vocabulary of unique words from the training dataset.

3. **Feature Extraction**:
   - Convert documents into vectors based on the vocabulary using count or TF-IDF.

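4. **Training the Model**:
   - Calculate prior probabilities for each class:

     $$
     P(Class) = \frac{Count(Class)}{Total\ Count}
     $$
     where $Count(Class)$ is the number of training documents of the class and $Total\ Count$ the total number of training documents.
   - Calculate likelihood probabilities:

     $$
     P(Word|Class) = \frac{Count(Word \cap Class) + 1}{\sum_{w} Count(w \cap Class) + V}
     $$
     where $Count(Word \cap Class)$ is the number of occurrences of the word in the training documents of the class and $V$ is the size of the vocabulary (Laplace smoothing). Note that this is the multinomial form; the `BernoulliNB` classifier used below models the presence or absence of each word instead.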

5. **Classification**:
   - For a new document, compute the posterior probability for each class:
     $$
     P(Class|Document) \propto P(Class) \prod_{i} P(Word_i|Class)
     $$
   - Choose the class with the highest posterior probability.
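
The training and classification steps above can be sketched from scratch in a few lines. This is a toy multinomial version written for illustration, not scikit-learn's implementation. It assumes the denominator of the likelihood is the total number of word occurrences in the class plus $V$, and it works in log space to avoid multiplying many small probabilities. The functions `train_nb` and `predict_nb` and the toy documents are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Estimate log-priors and Laplace-smoothed log-likelihoods."""
    vocab = {w for doc in docs for w in doc}
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_lik = {}
    for c in class_docs:
        total = sum(word_counts[c].values())  # word occurrences in class c
        log_lik[c] = {
            w: math.log((word_counts[c][w] + 1) / (total + len(vocab)))
            for w in vocab
        }
    return log_prior, log_lik

def predict_nb(doc, log_prior, log_lik):
    """Return the class with the highest log-posterior; unseen words are ignored."""
    scores = {
        c: log_prior[c] + sum(log_lik[c][w] for w in doc if w in log_lik[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

train_docs = [["great", "game"], ["fun", "game"],
              ["great", "song"], ["nice", "album", "song"]]
train_labels = ["Video_Games", "Video_Games", "Digital Music", "Digital Music"]
log_prior, log_lik = train_nb(train_docs, train_labels)

pred_game = predict_nb(["fun", "game"], log_prior, log_lik)
pred_song = predict_nb(["nice", "song"], log_prior, log_lik)
print(pred_game, "|", pred_song)  # Video_Games | Digital Music
```

Training is a single counting pass over the corpus, which is why the method is so cheap.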

## Advantages
- Simple and fast.
- Works well with large datasets.
- Handles high dimensionality effectively.

## Disadvantages
- Assumes independence of features.
- May perform poorly with highly correlated features.

## Use Cases
- Spam detection.
- Sentiment analysis.
- Document categorization.

In [None]:
# 2.1 - Solution: one possible implementation of lemmatize_and_remove_stopwords
# It relies on `nlp` and `remove_accents`, both defined in the cells above

def lemmatize_and_remove_stopwords(splits: dict):
    for split in splits:
        for elem in splits[split]:
            doc = nlp(elem["review"])
            elem["lemmatized_no_stopwords"] = [remove_accents(token.lemma_.lower()) for token in doc if not token.is_stop and token.text.isalpha()]
    return splits

splits = lemmatize_and_remove_stopwords(splits)

In [None]:
# 2.2 Using CountVectorizer and BernoulliNB from scikit-learn, train a Naive Bayes classifier on the training set
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer

# Mappings between category labels and integer ids (useful to convert labels both ways)
category2id = {category: i for i, category in enumerate(np.unique([elem["category"] for elem in splits["train"]]))}
id2category = {i: category for category, i in category2id.items()}

count_vectorizer = CountVectorizer()

X_train, y_train = None, None
# Your code here


clf_nb = BernoulliNB()
clf_nb.fit(X_train, y_train)


In [None]:
# 2.3 Evaluate the performance of your Naive Bayes classifier on the test set
from sklearn.metrics import classification_report

X_test, y_test = None, None

# Your code here


print(classification_report(y_test, clf_nb.predict(X_test), target_names=category2id.keys()))

#### TODO

Independently of the energy measurement, how would you quantify the computational and memory complexity of the Naive Bayes approach?
- Express the computational complexity of training as a function of the number of words in the training set
- Express the memory complexity as a function of variables of your choice

In [None]:
# 2.4 Log probabilities of each word for each class can be found in the feature_log_prob_ array
# Extract top-10 words per class and print them

log_probabilities = clf_nb.feature_log_prob_

# You will need the inverse_transform function from count_vectorizer


top_words = {}
for k in range(6):
    # Your code here
    pass

# Second Model: TF-IDF Classification
TF-IDF, or "Term-frequency - inverse document frequency," is a method used to determine which words are the most discriminative in a text corpus.

It can also be used to represent each document as a vector of scores, one per word of the vocabulary.

TF-IDF is a score assigned to a word relative to a document. This score is calculated using two terms:

## TF "term-frequency":

The number of occurrences of the word "cat" in document *i*:

$TF_i(w_{cat}) = count(w_{cat},document_i)$

## IDF "inverse document frequency":

$IDF(w_{cat}) = \frac{\text{Total number of documents in the corpus}}{\text{Number of documents in which cat appears}}$

## Therefore, the TF-IDF of the word "cat" for document *i* is:

$TFIDF_i(w_{cat}) = TF_i(w_{cat}) \cdot IDF(w_{cat})$

We note that the more frequent the word "cat" is in the corpus, the lower its IDF.

Conversely, the more frequent the word "cat" is in document *i*, the higher its TF.

Words with the highest scores for document *i* will be frequent in this document and rare in the corpus, making them adequate to discriminate this type of document!
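
Here is a small worked example that applies the formulas above literally (raw counts, IDF as a plain ratio without logarithm). The toy corpus and the helper names are made up, and scikit-learn's `TfidfVectorizer` uses a smoothed, logarithmic variant, so its numbers will differ.

```python
docs = [
    ["cat", "cat", "mouse"],
    ["dog", "mouse"],
    ["dog", "bone"],
]

def tf(word, doc):
    """Number of occurrences of word in one document."""
    return doc.count(word)

def idf(word, docs):
    """Total number of documents divided by the number containing word.

    Raises ZeroDivisionError if the word appears in no document.
    """
    n_containing = sum(1 for doc in docs if word in doc)
    return len(docs) / n_containing

def tfidf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

score_cat = tfidf("cat", docs[0], docs)      # 2 * (3 / 1)
score_mouse = tfidf("mouse", docs[0], docs)  # 1 * (3 / 2)
print(score_cat, score_mouse)  # 6.0 1.5
```

In the first document, "cat" gets a much higher score than "mouse": it is more frequent in the document and rarer in the corpus.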

## Remarks:

Variants exist that limit the magnitude of the TF and IDF terms by applying a logarithm. We activate one of them with the option `sublinear_tf=True`, which replaces the TF term by $1 + \log(TF)$.

In [None]:
# 2.4 Fit the below TFidfVectorizer to your preprocessed BoW representation of your training set
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(min_df=3, max_df=0.8, sublinear_tf=True, norm='l2')

# Your code here


In [None]:
# 2.5 Fit a logistic regression classifier to predict category from the tf-idf representations
# print the classification report on the test set
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

clf_logistic = LogisticRegression(random_state=0, max_iter=3000)

X_test, y_test = None, None

# Your code here


print(classification_report(y_test, clf_logistic.predict(X_test), target_names=category2id.keys()))


In [None]:
# 2.6 count the number of zero-entries in X_train


### Todo

What can you say about the sparsity of X_train? (2-3 sentences)


## Latent Semantic Indexing (LSI): An extension of TF-IDF

If you know PCA (Principal Component Analysis), or the closely related SVD (Singular Value Decomposition), you know that we can extract the axes that explain most of the variability of our data. We can thus truncate and densify the representation by only keeping the k axes that explain most of the data variability.

This dimensionality reduction approach is called Truncated Singular Value Decomposition.
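
As a sketch of what the truncation does (the notation is assumed here, it is not taken from the session): if $X$ is the $n \times d$ TF-IDF matrix, keeping the $k$ largest singular values gives

$$
X \approx X_k = U_k \Sigma_k V_k^\top, \qquad Z = X V_k = U_k \Sigma_k \in \mathbb{R}^{n \times k}
$$

where $U_k$ ($n \times k$) and $V_k$ ($d \times k$) hold the first $k$ left and right singular vectors, and $\Sigma_k$ is the diagonal matrix of the $k$ largest singular values. Each document is then described by a dense row of $Z$ with $k$ entries instead of a sparse row of $X$ with $d$ entries.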

<img src="https://www.researchgate.net/profile/Jila-Ayubi/publication/271076899/figure/fig1/AS:614261244051470@1523462701842/Singular-value-decomposition-of-A-a-full-rank-r-b-rank-k-approximation.png">

In [None]:
# 2.7 Truncate the SVD for different numbers of components and fit a logistic regression model
# Plot the evolution of f1_macro as a function of the number of components
from sklearn.decomposition import TruncatedSVD
import matplotlib.pyplot as plt
from sklearn.metrics import f1_score

for n in [8,16,32,128,256]:
    svd = TruncatedSVD(n_components=n, n_iter=50, random_state=42)

    X_train = None # TODO: modify

    svd.fit(X_train)

    # Your code here



In [None]:
# 2.8 Plot the confusion matrix for the smallest and the highest number of components
# Below is some code to plot the confusion matrix

from sklearn.metrics import confusion_matrix

import seaborn as sns

# Plot confusion matrix
import matplotlib.pyplot as plt



# Create the confusion matrix
cm = confusion_matrix(y_test, y_pred)
labels = [id2category[i] for i in range(6)]
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.title('Confusion Matrix')
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.xticks(ticks=np.arange(len(labels))+0.5, labels=labels)
plt.yticks(ticks=np.arange(len(labels))+0.5, labels=labels, rotation=0)
plt.show()


#### TODO

Do you notice any difference between the two confusion matrices? (1-2 sentences)

# Train your TF-IDF on Wikipedia

Sometimes the quantity of data available isn't sufficient to build informative TF-IDF representations.

In those cases, one can try to learn the IDF coefficients from another, larger corpus.

Here we're going to do it with a corpus of 100k paragraphs extracted from WikiText-103.


### Todo

Explain why this dataset is not ideal. (2-3 sentences)

In [None]:
# 2.9 learn a tf-idf on wiki-103

#TODO: Modify with your path
path_wikipedia = ".../wikitext-103/wiki-103-extract-100k.jsonl.gz"

# You can pass a generator to the fit() method to avoid loading the whole corpus in RAM
text_generator = (json.loads(line)["text"] for line in gzip.open(path_wikipedia, "rt"))


tfidf = TfidfVectorizer(sublinear_tf=True, max_df = 0.8, min_df=10, norm='l2')
# Your code here

In [None]:
# 2.10 transform your training set, learn a logistic regression classifier and plot the classification_report

# 3 - Word Embeddings: Fasttext and Word2Vec

FastText and Word2Vec are two popular algorithms for generating word embeddings, which are dense vector representations of words in a continuous vector space. There's also a third popular algorithm called GloVe, which we won't cover in this session, as it is very different from the other two.

Those two models exist in two flavors: Skip-gram and Continuous Bag of Words (CBOW). The main difference between the two is the way they are trained. Skip-gram predicts context words from a target word, while CBOW predicts the target word based on its context.

![word2vec](https://miro.medium.com/v2/resize:fit:4800/format:webp/1*cuOmGT7NevP9oJFJfVpRKA.png)

## Word2Vec
- **Training**: Trains using Skip-gram or Continuous Bag of Words (CBOW) models. It converts words into dense vectors based on their context within a window.
  - **Skip-gram**: Predicts context words from a target word.
  - **CBOW**: Predicts the target word based on its context.
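
The difference between the two flavors is easiest to see in the training examples they generate. Below is a minimal sketch that only builds those examples (no model is trained); the function names and the window size are illustrative assumptions.

```python
def skipgram_pairs(tokens, window=1):
    """(target, context) pairs: the target word predicts each neighbour."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs += [(target, tokens[j]) for j in range(lo, hi) if j != i]
    return pairs

def cbow_pairs(tokens, window=1):
    """(context, target) pairs: the neighbours jointly predict the target."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, target))
    return pairs

tokens = ["the", "cat", "sat"]
sg = skipgram_pairs(tokens)
cb = cbow_pairs(tokens)
print(sg)  # [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
print(cb)  # [(['cat'], 'the'), (['the', 'sat'], 'cat'), (['cat'], 'sat')]
```

Skip-gram produces one example per (word, neighbour) pair, while CBOW produces one example per position.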

## FastText
- **Training**: Also supports Skip-gram and CBOW. Key difference: it represents words as a bag of character **n-grams**.
  - Instead of mapping each word directly, it breaks words into subword units, helping with rare words and morphology.

## Similarities
- Both use Skip-gram and CBOW.
- Both output dense word vectors.
- Both leverage surrounding context in training.

## Differences
- **FastText**: Uses subword information (character n-grams), improving performance with rare/unknown words.
- **Word2Vec**: Treats each word as a single entity, making it less effective for rare words or inflectional languages.

In FastText, the representation of a word is thus computed from the representations of its character n-grams, which are summed (or averaged).
Example: with n = 5, the word `elephant` is represented by combining the vectors of its character n-grams `'<elep', 'eleph', 'lepha', 'ephan', 'phant', 'hant>'`.
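
The n-gram extraction of this example can be reproduced with a few lines. This is a sketch for a single value of n; the helper `char_ngrams` is hypothetical and is not part of the fasttext library.

```python
def char_ngrams(word, n=5):
    """Character n-grams of the word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]

ngrams = char_ngrams("elephant")
print(ngrams)  # ['<elep', 'eleph', 'lepha', 'ephan', 'phant', 'hant>']
```

Because a misspelled or unseen word still shares most of its n-grams with known words, it still gets a meaningful vector.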


## FastText in Supervised Setting

FastText can also be trained in supervised settings such as text classification. In a supervised setting, it learns word and sentence representations while simultaneously training a classifier.

In this part we'll only use fastText in an unsupervised setting, through the `cc.en.300.bin` model pretrained on Common Crawl and Wikipedia.

In [None]:
# 3.1 Load the fasttext model cc.en.300.bin
import fasttext

# Your code here

In [None]:
# 3.2 Using the get_sentence_vector() method, add to each entry of splits the sentence vector of its lemmatized review without stop words (joined into a single string). Store it in a field "fasttext_wiki"

for split in splits.keys():
    for elem in splits[split]:
        # Your code here
        pass

In [None]:
# 3.3 Train a logistic regression model and print its classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

y_test, y_pred = None, None
clf_logistic = LogisticRegression(random_state=0, max_iter=3000)

# Your code here

print(classification_report(y_test, y_pred))

### Todo
Explain to what extent fastText representations are expected to be better than the TF-IDF ones learned on Wikipedia.

Explain to what extent they're also expected to be less relevant.

(2-3 sentences)

### Weighted Fasttext:

We can try to make use of the best of both worlds by using the idf coefficient to weight the sum of our fasttext features.
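
One possible way to write this weighting (a sketch; the normalization is an assumption and other choices are valid): for a review made of words $w_1, \dots, w_n$ with fastText vectors $v_{w_i}$,

$$
v_{review} = \sum_{i=1}^{n} \alpha_i \, v_{w_i}, \qquad \alpha_i = \frac{IDF(w_i)}{\sum_{j=1}^{n} IDF(w_j)}
$$

so that the weights $\alpha_i$ sum to 1 and rare words (high IDF) contribute more than frequent ones.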

In [None]:
# Below we supply some functions to compute average representations using Fasttext and IDF coefficients
import numpy as np


def get_idf_coefs(words, tfidf):
    """
    Given a fitted tf-idf model, this function only returns the words that are in the tf-idf vocabulary.
    It also returns their associated idf coefficients.
    """
    indices = [tfidf.vocabulary_.get(word) for word in words]
    filtered_words = [word for word, i in zip(words, indices) if i is not None]
    idf_coefs = [tfidf.idf_[i] for i in indices if i is not None]
    return filtered_words, idf_coefs

def average_word_vectors(model, words, idf_coefs = None):
    """
    This function computes the average word vector of a list of words.
    If no idf_coefs are provided, it computes the simple average.
    If idf_coefs are provided, it computes the weighted average.
    """
    word_vectors = [model.get_word_vector(word) for word in words]
    if not(idf_coefs is None):
        # Your code here
        # The list of coefs should be normalized to sum to 1 using the method of your choice
        return None
    else:
        return np.average(word_vectors, axis=0)


In [1]:
# 3.4 Make idf weighted average representation in splits called `weighted_fasttext`
# For reviews that don't have any words in the tf-idf model, use the simple average representation

In [2]:
# 3.5 Train a logistic regression model and print its classification_report

### Todo

Are the results as expected? (1-2 sentences)

## 4 - Tiny Bert (Bonus part)

In this part we concentrate on the use of a pretrained encoder model composed of two BertEncoder layers.

We're going to fine-tune this model in two settings: full fine-tuning and LoRA fine-tuning (which is related to dimensionality reduction).

In [4]:
# Load the model of interest
from transformers import AutoModel, AutoTokenizer
bert_model = AutoModel.from_pretrained("prajjwal1/bert-tiny")
tokenizer = AutoTokenizer.from_pretrained("prajjwal1/bert-tiny")


In [None]:
# Below we supply two classes, one for the classification head and one for the full model
import torch

class HeadClassification(torch.nn.Module):
    def __init__(self, input_dim, output_dim):
        """
        input_dim: size of hidden layer (pooled_output)
        output_dim: number of classes
        """
        super(HeadClassification, self).__init__()
        self.linear = torch.nn.Linear(input_dim, output_dim)

    def forward(self, x):
        """
        x: tensor of shape (batch_size, input_dim)
        return unnormalized logits of shape (batch_size, output_dim)
        """
        return self.linear(x)


class BertForClassification(torch.nn.Module):
    def __init__(self, bert, input_dim, output_dim):
        """
        bert: bert model
        input_dim: size of the pooled output of the bert model
        output_dim: number of classes
        """
        super(BertForClassification, self).__init__()
        self.bert = bert
        self.input_dim = input_dim
        self.classifier = HeadClassification(self.input_dim, output_dim)

    def forward(self, input_ids, attention_mask, **kwargs):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = outputs.pooler_output
        logits = self.classifier(pooled_output)
        return logits


In [None]:
# Below we supply two classes, this time for the dataset and dataloader in pytorch
import torch
from torch.utils.data import Dataset, DataLoader
from typing import List

class MyDataset(Dataset):
    def __init__(self, X: List[str], y: List[int]):
        self.X = X
        self.y = y

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

class Datacollator:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer


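    def __call__(self, inputs):
        X = [elem[0] for elem in inputs]
        y = [elem[1] for elem in inputs]
        # Use the tokenizer stored on the collator rather than a global variable.
        # Truncate to 512 tokens, the maximum sequence length of BERT models.
        batch = self.tokenizer(X, return_tensors="pt", padding="longest", truncation=True, max_length=512)
        labels = torch.tensor(y)
        return batch, labels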


# Create the dataset
# X_train must be the list of raw review strings and y_train the list of integer labels
dataset = MyDataset(X_train, y_train)

data_collator = Datacollator(tokenizer)

# Create the dataloader
batch_size = 32
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True, collate_fn=data_collator)

In [None]:
class LinearWarmupScheduler:
    """
    This class implements a standard learning rate schedule: warmup steps followed by a linear decay of the learning rate, updated after each `scheduler.step()` call
    """
    def __init__(self, optimizer, warmup_steps, training_steps):
        self.optimizer = optimizer
        self.warmup_steps = warmup_steps
        self.training_steps = training_steps

    def __call__(self, current_step: int):
        return self.warmup(current_step)

    def warmup(self, current_step: int):
        if current_step < self.warmup_steps:  # current_step / warmup_steps * base_lr
            return float(current_step / self.warmup_steps)
        else:                                 # (num_training_steps - current_step) / (num_training_steps - warmup_steps) * base_lr
            return max(0.0, float(self.training_steps - current_step) / float(max(1, self.training_steps - self.warmup_steps)))
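
To see what this schedule looks like, here is the same rule rewritten as a plain function and evaluated at a few steps. The values 10 and 100 are arbitrary and chosen only for illustration. The returned number is the multiplier applied to the base learning rate.

```python
def lr_multiplier(step, warmup_steps, training_steps):
    """Same rule as LinearWarmupScheduler.warmup, written as a plain function."""
    if step < warmup_steps:
        # Linear increase from 0 to 1 during warmup
        return step / warmup_steps
    # Linear decrease from 1 to 0 over the remaining steps
    return max(0.0, (training_steps - step) / max(1, training_steps - warmup_steps))

# Hypothetical schedule: 10 warmup steps out of 100 training steps
multipliers = {
    step: lr_multiplier(step, warmup_steps=10, training_steps=100)
    for step in [0, 5, 10, 55, 100]
}
print(multipliers)  # {0: 0.0, 5: 0.5, 10: 1.0, 55: 0.5, 100: 0.0}
```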



In [None]:
# 4.1 Load the validation dataloader

In [None]:
# 4.2 Complete the training loop
# Save the model that has the best validation loss after an epoch
# You can use `best_model` = copy.deepcopy(model)
from sklearn.metrics import f1_score
import torch.nn as nn
import torch.optim as optim
import copy
best_val_loss = float('inf')
epochs_without_improvement = 0
patience = 8  # Number of epochs with no improvement after which training will be stopped
num_epochs = 40
learning_rate = 1e-4
criterion = nn.CrossEntropyLoss()
best_model = None

### TODO: Modify

classification_model = None
train_loader = None
val_loader= None

###



optimizer = optim.Adam(classification_model.parameters(), lr=learning_rate)


training_steps = len(train_loader) * num_epochs
linearwarmup_func =  LinearWarmupScheduler(optimizer, warmup_steps=int(0.1*training_steps), training_steps=int(training_steps))
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=linearwarmup_func)

for epoch in range(num_epochs):
    # Training phase
    classification_model.train()
    train_loss = 0.0
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        outputs = classification_model.forward(**inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        scheduler.step()
        train_loss += loss.item()
    train_loss /= len(train_loader)
    print(f"Epoch {epoch+1}/{num_epochs}, Learning Rate: {optimizer.param_groups[0]['lr']}")
    print(f"Epoch {epoch+1}/{num_epochs}, Training Loss: {train_loss:.4f}")

    # Validation phase
    validation_loss = 0
    y_pred = []
    y_true = []
    with torch.no_grad():
        classification_model.eval()
        for inputs, labels in val_loader:
            # tolist() (without squeeze) also works when the last batch has a single element
            y_true += labels.tolist()
            outputs = classification_model(**inputs)
            predictions = torch.argmax(outputs, dim=1)
            y_pred += predictions.tolist()
            validation_loss += criterion(outputs, labels).item()
        f1_macro = f1_score(y_true, y_pred, average="macro")
        print(f"F1 Score (Macro): {f1_macro:.4f}")
    validation_loss /= len(val_loader)

    # Early stopping check
    print(f"Epoch {epoch+1}/{num_epochs}, Validation Loss: {validation_loss:.4f}")
    if validation_loss < best_val_loss:
        best_val_loss = validation_loss
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1

    if epochs_without_improvement >= patience:
        print(f"Early stopping at epoch {epoch+1}. No improvement for {patience} epochs.")
        break

In [None]:
# 4.3 Measure the model performance using the classification_report
from sklearn.metrics import classification_report

# Your code here

## Reducing the computational and memory complexity of transformer finetuning
### LoRA (Low-Rank Adaptation) for Transformer Fine-tuning

LoRA is a technique to reduce the number of trainable parameters during fine-tuning by injecting trainable low-rank matrices into the pre-trained model's weight matrices. Instead of updating the entire weight matrix, LoRA introduces low-rank factorization.


![lora](https://www.researchgate.net/profile/Ruibo-Fu/publication/371490294/figure/fig1/AS:11431281167009019@1686539371648/Transformer-architecture-in-wav2vec2-along-with-LoRA.png)


#### Key Concepts:

- **Rank (r):** The rank of the factorized matrices. It controls the dimensionality of the introduced low-rank matrices. A lower rank reduces the trainable parameters but may limit the model's capacity.
  
- **Alpha (α):** A scaling factor applied to the learned low-rank matrices before they are added to the original weight matrix. It controls how much influence the low-rank matrices have.

LoRA fine-tunes transformers efficiently with fewer trainable parameters by modifying only the low-rank matrices.
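
A quick back-of-the-envelope count shows where the savings come from. For one weight matrix of shape $d_{out} \times d_{in}$, full fine-tuning updates $d_{out} \cdot d_{in}$ weights, while LoRA only trains two factors of shapes $d_{out} \times r$ and $r \times d_{in}$. The sketch below assumes a hidden size of 128 (the hidden size of bert-tiny) and ignores biases; the function names are hypothetical.

```python
def full_params(d_in, d_out):
    """Trainable weights when the whole d_out x d_in matrix is updated (bias ignored)."""
    return d_in * d_out

def lora_params(d_in, d_out, r):
    """Trainable weights of the two low-rank factors (d_out x r) and (r x d_in)."""
    return r * (d_in + d_out)

d = 128  # assumed hidden size of bert-tiny
counts = {r: lora_params(d, d, r) for r in (2, 4, 8)}
print(full_params(d, d))  # 16384
print(counts)             # {2: 512, 4: 1024, 8: 2048}
```

With such a small hidden size the absolute savings are modest; the ratio becomes far more favorable for larger models.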


In [None]:
# Below is a manual implementation of LoRA where the attention projections of each BertEncoder layer are wrapped in a LoRA layer
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class LoRALayer(nn.Module):
    def __init__(self, original_layer, r=4, alpha=16):
        super().__init__()
        self.original_layer = original_layer
        self.r = r
        self.alpha = alpha

        # Set requires_grad to False for the original layer
        for param in self.original_layer.parameters():
            param.requires_grad = False

        # Initialize LoRA components: the weight update is lora_A @ lora_B, with the same shape as the original weight
        self.lora_A = nn.Parameter(torch.zeros(original_layer.weight.size(0), r))
        self.lora_B = nn.Parameter(torch.randn(r, original_layer.weight.size(1)) * 0.01)
        self.scaling = alpha / r

        # lora_A starts at zero so that the original layer is unchanged at the start of training.
        # lora_B must NOT also be zero: if both factors are zero, both gradients vanish
        # and the LoRA weights never move.

    def forward(self, x):
        # Original layer output
        original_output = self.original_layer(x)

        # LoRA output: Adding scaled low-rank matrices' contribution
        lora_output = (x @ self.lora_B.T) @ self.lora_A.T
        lora_output *= self.scaling

        return original_output + lora_output

class BertWithLoRA(nn.Module):
    def __init__(self, model_name='prajjwal1/bert-tiny', r=8, alpha=16):
        super().__init__()
        # Load the pre-trained BERT model
        self.bert = BertModel.from_pretrained(model_name)

        # Freeze all parameters in the model
        for param in self.bert.parameters():
            param.requires_grad = False

        # Apply LoRA to the attention projections (query, key, value and output) of every encoder layer
        for layer in self.bert.encoder.layer:
            attention_head = layer.attention.self
            layer.attention.self.query = LoRALayer(attention_head.query, r, alpha)
            layer.attention.self.key = LoRALayer(attention_head.key, r, alpha)
            layer.attention.self.value = LoRALayer(attention_head.value, r, alpha)
            layer.attention.output.dense = LoRALayer(layer.attention.output.dense, r, alpha)

    def forward(self, input_ids, attention_mask=None):
        return self.bert(input_ids=input_ids, attention_mask=attention_mask)


In [None]:
# 4.4 Re-implement the training loop but using the Lora augmented model

In [None]:
# 4.5 Using the classification_report function compare the performances of the two models

#### Todo

How do you explain that transformer models underperform much lighter approaches in this practical session?