# Step 0: Use a GPU (T4 is enough)
**Goal**: Predict the genre of a track from its title alone (a text classification task); the last part of the lab adds further features. We will use and fine-tune a Transformer model (BERT).

# Step 1: Environment Setup and Installing Dependencies
In Google Colab, you need to install the required libraries, primarily Hugging Face's Transformers library and PyTorch. You can do this with the following commands:

In [None]:
!pip uninstall -y torch
!pip install transformers[torch]

Then restart the runtime if you are on Colab.

# Step 2: Preparing Your Dataset

Since you're focusing on predicting playlist_genre based on track_name, you'll preprocess track_name as input and playlist_genre as labels. Here's how you can prepare your dataset:

## Import Necessary Libraries:

In [None]:
import pandas as pd
from transformers import BertTokenizer
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, Dataset

## Load and Preprocess the Dataset:

Read your dataset.
Tokenize track_name.
Convert playlist_genre into numerical labels.

In [None]:
data_path = # Your path

# Load dataset
data = pd.read_csv(data_path)

# Drop nan values
# TODO

# TODO: Analyse and plot data we will use (genre_label)

# Initialize BERT tokenizer
tokenizer = # TODO: Use the pretrained bert-base-uncased (from transformers)

# Tokenize track names
tokenized_data = # TODO: Use the tokenizer previously defined

# Convert genres to categorical labels
data['genre_label'] = # TODO: Use pandas.Categorical -> labels to numbers

# Split the dataset into training and validation sets
train_data, val_data = # TODO: Recommended test_size=0.2
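
As a small illustration of the label-encoding step, the sketch below runs on a made-up frame rather than the lab's CSV. It assumes the column names `track_name` and `playlist_genre` used in the text; the rows and the `id2genre` helper are invented for the example.

```python
import pandas as pd

# Toy stand-in for the real CSV (rows are invented for illustration)
toy = pd.DataFrame({
    "track_name": ["Song A", "Song B", None, "Song D", "Song E"],
    "playlist_genre": ["rock", "pop", "rock", "edm", "pop"],
})

# Drop rows where the input text or the label is missing
toy = toy.dropna(subset=["track_name", "playlist_genre"]).reset_index(drop=True)

# pandas.Categorical sorts the categories, then assigns codes 0..n-1
genres = pd.Categorical(toy["playlist_genre"])
toy["genre_label"] = genres.codes

# Keep the code -> name mapping (hypothetical helper) for reporting and inference
id2genre = dict(enumerate(genres.categories))
print(toy[["playlist_genre", "genre_label"]])
print(id2genre)
```

Because the categories are sorted, the integer codes follow alphabetical order of the genre names, not their order of appearance in the file.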


## Create a Custom Dataset Class:
For use with PyTorch, you need to create a custom dataset class.

In [None]:
import torch
from torch.utils.data import Dataset  # Import the Dataset class specifically

# TracksDataset stores the tokenized track names and their labels and serves them to PyTorch during training.
class TracksDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
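        # as_tensor accepts both lists and existing tensors without the copy warning torch.tensor() raises
        item = {key: torch.as_tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.as_tensor(self.labels[idx])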
        return item

    def __len__(self):
        return len(self.labels)

# Example of how to tokenize and encode the track names
def encode_tracks(track_names):
    # The tokenizer expects a str or a list of str, so convert a pandas Series first
    return tokenizer(list(track_names), padding=True, truncation=True, return_tensors="pt")

# Encode your data
encoded_train_tracks = # TODO: Use the previously defined function and the feature 'track_name'
encoded_val_tracks = # TODO: Same

# Assuming genre labels are categorical and need to be converted to numerical labels
# Example: using LabelEncoder from sklearn
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
train_labels = label_encoder.fit_transform(train_data['genre_label'])
val_labels = label_encoder.transform(val_data['genre_label'])

# Now create your dataset using the encoded tracks and numerical labels
train_dataset = # TODO: Use the previously created class
val_dataset = # TODO: Same

## Create Data Loaders:
You need data loaders to efficiently feed data to the model during training.

In [None]:
train_loader = # TODO
val_loader = # TODO
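
To show what a data loader yields, here is a minimal sketch that feeds a dataset of the same shape as `TracksDataset` with hand-written token ids instead of real tokenizer output. The class name `ToyTracksDataset`, the token ids, and `batch_size=2` are assumptions chosen for the example, not values prescribed by the lab.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class ToyTracksDataset(Dataset):
    """Same structure as TracksDataset, fed with hand-written token ids."""
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.as_tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.as_tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

# Invented, already padded token ids (0 = padding) for three "titles"
encodings = {
    "input_ids": [[101, 7, 8, 102], [101, 9, 102, 0], [101, 5, 6, 102]],
    "attention_mask": [[1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 1, 1]],
}
labels = [2, 0, 1]

dataset = ToyTracksDataset(encodings, labels)
# shuffle=True is typical for the training loader, shuffle=False for validation
loader = DataLoader(dataset, batch_size=2, shuffle=False)

batches = list(loader)
for batch in batches:
    print({key: tuple(value.shape) for key, value in batch.items()})
```

Each batch is a dictionary with the keys the training loop reads later (`input_ids`, `attention_mask`, `labels`); the last batch is smaller when the dataset size is not a multiple of the batch size.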

# Step 3: Choose the Appropriate Model Architecture
For a classification task, you'll use a BERT model specifically designed for sequence classification. Hugging Face provides a model called BertForSequenceClassification that is suitable for this purpose.

First, import the necessary classes:

In [None]:
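from transformers import BertForSequenceClassification
# AdamW now comes from PyTorch; transformers.AdamW is deprecated and may be missing from recent releases
from torch.optim import AdamW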

Then, initialize the BERT model for sequence classification:

In [None]:
# Number of classification labels: the number of genres in your dataset
num_labels = # TODO

# Load pre-trained BERT model for sequence classification
model = # TODO (Tip: we are doing sequence classification)

# Step 4: Customize the Model’s Head
In BertForSequenceClassification, the model's head is already designed for a classification task. It adds a fully connected layer on top of the pooled output, specifically for the purpose of classification. This means you don't need to manually customize the head for a basic classification task, as it's already set up for you.

If you want to further customize this layer or add additional layers, you can modify the BertForSequenceClassification class, but for most standard classification tasks, this isn't necessary.
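
To make the shape of that head concrete, the sketch below builds a deliberately tiny, randomly initialized model from a `BertConfig`, so nothing has to be downloaded. All sizes are illustrative assumptions; the lab itself loads pre-trained weights, and `toy_model` is a name made up for this example.

```python
import torch
from transformers import BertConfig, BertForSequenceClassification

# Tiny, randomly initialized config (illustrative sizes only);
# the lab loads pre-trained weights instead
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
    num_labels=6,
)
toy_model = BertForSequenceClassification(config)
toy_model.eval()

# The head is a single linear layer: hidden_size -> num_labels
print(toy_model.classifier)

# Two fake sequences of 8 random token ids each
input_ids = torch.randint(0, config.vocab_size, (2, 8))
attention_mask = torch.ones_like(input_ids)
with torch.no_grad():
    logits = toy_model(input_ids=input_ids, attention_mask=attention_mask).logits
print(logits.shape)  # one row per sequence, one column per label
```

The model returns one logit per label for each input sequence, which is why `num_labels` must match the number of genres.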

Remember to move the model to the GPU if you're using one, to speed up training:

In [None]:
%%capture
# Check if a GPU is available and if not, use a CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move the model to the specified device
model.to(device)

# Step 5: Define Hyperparameters
Before training, you need to set various hyperparameters for the training process. These include the learning rate, number of epochs, and the optimizer. Here's how you can do it:

In [None]:
# Define Hyperparameters
learning_rate = # TODO
epochs = # TODO

# Use the AdamW optimizer: Adam with decoupled weight decay
optimizer = # TODO



For the learning rate scheduler, you can use a scheduler that has a warm-up period and then linearly decays the learning rate:

In [None]:
from transformers import get_linear_schedule_with_warmup

# Total number of training steps
total_steps = len(train_loader) * epochs

# Create the learning rate scheduler
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=0,  # No warm-up phase (the argument is required)
                                            num_training_steps=total_steps)
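
To see what "warm-up, then linear decay" means in numbers, here is a plain-Python sketch of the multiplier applied to the base learning rate. It is meant to mirror the behaviour of `get_linear_schedule_with_warmup`, not to replace it; the function name `linear_warmup_decay`, the base learning rate, and the step counts are assumptions made for the example.

```python
def linear_warmup_decay(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given step."""
    if step < num_warmup_steps:
        # Warm-up: grow linearly from 0 to 1
        return step / max(1, num_warmup_steps)
    # Decay: shrink linearly from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 2e-5      # illustrative value, not a recommendation from the lab
total_steps = 10    # hypothetical; the lab uses len(train_loader) * epochs
warmup_steps = 2

schedule = [base_lr * linear_warmup_decay(s, warmup_steps, total_steps)
            for s in range(total_steps + 1)]
for step, lr in enumerate(schedule):
    print(f"step {step:2d}: lr = {lr:.2e}")
```

With `num_warmup_steps=0`, as in the cell above, the multiplier starts at 1 and only the linear decay remains.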

# Step 6: Training the Model
Now, you can train the model. This involves multiple epochs where each epoch consists of a training phase followed by a validation phase.

In [None]:
import numpy as np
from sklearn.metrics import accuracy_score

# Function to calculate the accuracy of predictions vs labels
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

# Move the model to the selected device (GPU if available)
model.to(device)

# Training loop
for epoch_i in range(0, epochs):

    # Training
    model.train()
    total_train_loss = 0

    for step, batch in enumerate(train_loader):

        b_input_ids = batch['input_ids'].to(device)
        b_input_mask = batch['attention_mask'].to(device)
        b_labels = batch['labels'].to(device)

        model.zero_grad()

        outputs = model(b_input_ids,
                        token_type_ids=None,
                        attention_mask=b_input_mask,
                        labels=b_labels)

        loss = outputs.loss
        total_train_loss += loss.item()

        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        optimizer.step()
        scheduler.step()

    avg_train_loss = total_train_loss / len(train_loader)

    # Validation
    model.eval()
    total_eval_accuracy = 0
    total_eval_loss = 0

    for batch in val_loader:

        b_input_ids = batch['input_ids'].to(device)
        b_input_mask = batch['attention_mask'].to(device)
        b_labels = batch['labels'].to(device)

        with torch.no_grad():
            outputs = model(b_input_ids,
                            token_type_ids=None,
                            attention_mask=b_input_mask,
                            labels=b_labels)

        loss = outputs.loss
        total_eval_loss += loss.item()

        logits = outputs.logits
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()

        total_eval_accuracy += flat_accuracy(logits, label_ids)


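    avg_val_accuracy = total_eval_accuracy / len(val_loader)
    avg_val_loss = total_eval_loss / len(val_loader)
    print(f'Epoch {epoch_i + 1}/{epochs} - train loss: {avg_train_loss:.4f} - '
          f'val loss: {avg_val_loss:.4f} - val accuracy: {avg_val_accuracy:.4f}')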

print("Training complete!")
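
One detail of the validation loop is worth checking by hand: it averages per-batch accuracies, which differs slightly from per-sample accuracy when the last batch is smaller. The sketch below uses invented logits and labels and a hypothetical batch size of 4 to show the gap.

```python
import numpy as np

def flat_accuracy(preds, labels):
    # Fraction of rows whose highest logit matches the label
    return np.sum(np.argmax(preds, axis=1) == labels) / len(labels)

# Invented logits for 5 samples and 3 classes
logits = np.array([
    [2.0, 0.1, 0.3],   # predicts 0
    [0.2, 1.5, 0.1],   # predicts 1
    [0.3, 0.2, 2.2],   # predicts 2
    [1.9, 0.4, 0.2],   # predicts 0
    [0.1, 0.2, 3.0],   # predicts 2
])
labels = np.array([0, 1, 2, 1, 2])  # the 4th sample is misclassified

batch_size = 4  # leaves a final batch of size 1
batch_accs = [
    flat_accuracy(logits[i:i + batch_size], labels[i:i + batch_size])
    for i in range(0, len(labels), batch_size)
]

mean_of_batches = float(np.mean(batch_accs))    # what the training loop reports
overall = float(flat_accuracy(logits, labels))  # per-sample accuracy
print(mean_of_batches, overall)
```

The difference is usually small on a large validation set, but the classification report in the next step works per sample and is the more reliable figure.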


# Step 7: Evaluation
After training your model, you should evaluate its performance on a test set (or validation set, if a separate test set isn't available). Evaluation helps you understand how well your model generalizes to unseen data. Here's a basic framework for evaluating your model:

In [None]:
from sklearn.metrics import classification_report

def evaluate(model, val_loader, device):
    model.eval()
    predictions, true_labels = [], []

    for batch in val_loader:
        # Move tensors to the selected device
        b_input_ids = batch['input_ids'].to(device)
        b_input_mask = batch['attention_mask'].to(device)
        b_labels = batch['labels'].to(device)

        with torch.no_grad():
            outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)

        logits = outputs.logits
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()

        predictions.extend(np.argmax(logits, axis=1).flatten())
        true_labels.extend(label_ids.flatten())


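    # Class names must follow the integer label order: pandas.Categorical and
    # LabelEncoder both sort the categories, whereas unique() keeps order of appearance
    print(classification_report(true_labels, predictions, target_names=sorted(data['playlist_genre'].unique())))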

evaluate(model, val_loader, device)
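
For reference, this is what `classification_report` computes on a small invented example. The labels, predictions, and genre names are made up; `output_dict=True` is used only so that individual numbers can be read programmatically.

```python
from sklearn.metrics import classification_report

# Invented labels and predictions for three classes coded 0, 1, 2
true_labels = [0, 1, 2, 2, 1, 0, 2, 1]
predictions = [0, 1, 2, 1, 1, 0, 2, 2]

# Names listed in the same order as the integer codes
target_names = ["edm", "pop", "rock"]

report = classification_report(
    true_labels,
    predictions,
    target_names=target_names,
    output_dict=True,   # return a dict instead of a formatted string
    zero_division=0,    # avoid warnings for classes with no predictions
)
print(report["accuracy"])
print(report["rock"]["recall"])
```

If `target_names` were listed in a different order than the integer codes, every per-class row of the report would be attributed to the wrong genre.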


# Step 8: Inference Function
To let users try out the model with their own input, you can create an inference function. This function will take a track name as input, process it, and then use the model to predict the genre.

In [None]:
def predict_genre(track_name, model, tokenizer, device):
    # TODO: Use the eval mode of the model
    # TODO: Tokenize your input
    # TODO: Put the input_ids and attention_mask on the GPU (.to(device))
    # TODO: Make the prediction
    # TODO: Map the prediction with labels
    # TODO: Return the predicted genre
    return predicted_genre

# Example Usage
track_name = "All the Day (Don Rokoko Remix)"
predicted_genre = predict_genre(track_name, model, tokenizer, device)
print(f"Predicted Genre: {predicted_genre}")


Predicted Genre: pop
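
The last two steps of `predict_genre` (turning logits into a label) can be sketched in isolation. The logits below are invented and stand in for the model output on one title; the three-genre `id2genre` mapping is hypothetical and must be replaced by the mapping built from your own labels.

```python
import torch

# Hypothetical code -> genre mapping, in the order of the integer labels
id2genre = {0: "edm", 1: "pop", 2: "rock"}

# Invented logits, standing in for the model output on a single title
logits = torch.tensor([[0.2, 2.5, -1.0]])

probs = torch.softmax(logits, dim=-1)        # logits -> probabilities
pred_id = int(torch.argmax(probs, dim=-1))   # index of the most likely class
predicted_genre = id2genre[pred_id]
confidence = probs[0, pred_id].item()
print(predicted_genre, round(confidence, 3))
```

Returning the probability alongside the label is optional, but it helps to spot titles on which the model is unsure.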


# Step 9: Enhanced Finetuning
Here, the goal is to redo all the steps with different features (adding the artist name, the release date, ...) and improve the results.
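
One possible way to feed extra features to a text model is to merge them into a single string per track. The sketch below is only a starting point: the column names `track_artist` and `track_album_release_date` are assumptions about the dataset, the rows are invented, and the literal `[SEP]` string is expected to be mapped to BERT's separator token by the tokenizer (a plain separator such as ` - ` would also work).

```python
import pandas as pd

# Invented rows; `track_artist` and `track_album_release_date` are assumed column names
toy = pd.DataFrame({
    "track_name": ["Song A", "Song B"],
    "track_artist": ["Artist X", "Artist Y"],
    "track_album_release_date": ["2019-06-14", "2015"],
})

# Keep only the year, since some dates may be year-only strings
toy["release_year"] = toy["track_album_release_date"].str[:4]

# One text field per track, with [SEP] separating the parts
toy["text"] = (
    toy["track_name"] + " [SEP] " + toy["track_artist"] + " [SEP] " + toy["release_year"]
)
print(toy["text"].tolist())
```

The combined `text` column can then go through the same tokenization, dataset, and training steps as `track_name` did.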