## **Sentiment analysis**
In this stage of the project, we integrate a pre-trained Hugging Face model, fine-tuned for emotion classification, into the pipeline. The model analyzes the lyrics of each song and assigns them one of seven emotion categories: **Anger**, **Joy**, **Disgust**, **Fear**, **Sadness**, **Surprise**, and **Neutral**. Capturing the emotional tone of the lyrics enriches the song metadata and lets the recommendation system make mood-based suggestions, so recommendations rest on emotional resonance as well as textual similarity.

### Fine‑tuning **roberta-base** for Lyrics Emotion Classification

#### Preparing **GoEmotions Dataset** for Lyrics Emotion Classification

We use the GoEmotions dataset from Kaggle, which contains Reddit comments labeled with 27 fine-grained emotions plus neutral. Since our goal is to classify song lyrics into 7 broader emotion categories (**Anger, Joy, Disgust, Fear, Sadness, Surprise, and Neutral**), we map the original labels to these categories.

The steps are:

- **Load** the GoEmotions dataset.
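- **Map** the original 28 labels (27 emotions plus neutral) to our 7 target categories using a predefined mapping dictionary.
- **Assign** each example a single emotion label by selecting the first active label in the mapping's order (or defaulting to Neutral if none match). GoEmotions examples can carry several labels, so any additional labels are dropped at this step.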
- **Remove** all unnecessary columns, keeping only the text and the new emotion label.

The resulting dataset is ready for fine-tuning transformer models like `roberta-base` for emotion classification on lyrics.

This lets us train on a large labeled dataset even though no labeled lyrics are available. Because the training text consists of Reddit comments rather than lyrics, predictions on songs should be read as approximate.


In [1]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("shivamb/go-emotions-google-emotions-dataset")

print("Path to dataset files:", path)

Path to dataset files: C:\Users\zoro\.cache\kagglehub\datasets\shivamb\go-emotions-google-emotions-dataset\versions\1


In [2]:
import os

print("Files in dataset folder:", os.listdir(path))


Files in dataset folder: ['go_emotions_dataset.csv']


In [3]:
import pandas as pd

file_path = os.path.join(path, "go_emotions_dataset.csv")
df = pd.read_csv(file_path)
print(df.head())

        id                                               text  \
0  eew5j0j                                    That game hurt.   
1  eemcysk   >sexuality shouldn’t be a grouping category I...   
2  ed2mah1     You do right, if you don't care then fuck 'em!   
3  eeibobj                                 Man I love reddit.   
4  eda6yn6  [NAME] was nowhere near them, he was by the Fa...   

   example_very_unclear  admiration  amusement  anger  annoyance  approval  \
0                 False           0          0      0          0         0   
1                  True           0          0      0          0         0   
2                 False           0          0      0          0         0   
3                 False           0          0      0          0         0   
4                 False           0          0      0          0         0   

   caring  confusion  ...  love  nervousness  optimism  pride  realization  \
0       0          0  ...     0            0         0      0 

In [4]:
print(df.columns)

Index(['id', 'text', 'example_very_unclear', 'admiration', 'amusement',
       'anger', 'annoyance', 'approval', 'caring', 'confusion', 'curiosity',
       'desire', 'disappointment', 'disapproval', 'disgust', 'embarrassment',
       'excitement', 'fear', 'gratitude', 'grief', 'joy', 'love',
       'nervousness', 'optimism', 'pride', 'realization', 'relief', 'remorse',
       'sadness', 'surprise', 'neutral'],
      dtype='object')


In [5]:
# Map the 27 GoEmotions labels (plus neutral) to the 7 target emotions
label_map = {
    "admiration": "Joy",
    "amusement": "Joy",
    "anger": "Anger",
    "annoyance": "Anger",
    "approval": "Joy",
    "caring": "Joy",
    "confusion": "Surprise",
    "curiosity": "Surprise",
    "desire": "Joy",
    "disappointment": "Sadness",
    "disapproval": "Disgust",
    "disgust": "Disgust",
    "embarrassment": "Sadness",
    "excitement": "Joy",
    "fear": "Fear",
    "gratitude": "Joy",
    "grief": "Sadness",
    "joy": "Joy",
    "love": "Joy",
    "nervousness": "Fear",
    "optimism": "Joy",
    "pride": "Joy",
    "realization": "Surprise",
    "relief": "Joy",
    "remorse": "Sadness",
    "sadness": "Sadness",
    "surprise": "Surprise",
    "neutral": "Neutral"
}

In [6]:
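def map_emotion(row):
    # Return the target category of the first active label, in label_map order
    for col, target in label_map.items():
        if row[col] == 1:  # also matches True, since True == 1
            return target
    return "Neutral"  # fallback when no emotion column is set

# Row-wise apply: each example ends up with exactly one of the 7 categories
df["emotion"] = df.apply(map_emotion, axis=1)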
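
With roughly 211k rows, the row-wise `apply` above can take a while. A vectorized variant is sketched here as one possible alternative. It assumes, like the original function, that the label columns hold 0/1 flags and that the first active label in mapping order wins. The name `map_emotions_vectorized` and the toy frame are illustrative only.

```python
import numpy as np
import pandas as pd

def map_emotions_vectorized(frame, mapping):
    """Return one target emotion per row: the first active label in mapping order."""
    cols = list(mapping.keys())
    flags = frame[cols].to_numpy().astype(bool)      # shape: (rows, labels)
    first_active = flags.argmax(axis=1)              # index of the first True per row (0 if none)
    has_label = flags.any(axis=1)                    # rows with at least one active label
    targets = np.array([mapping[c] for c in cols], dtype=object)
    emotions = np.where(has_label, targets[first_active], "Neutral")
    return pd.Series(emotions, index=frame.index, name="emotion")

# Toy data standing in for the GoEmotions columns (hypothetical values)
toy_map = {"anger": "Anger", "joy": "Joy", "neutral": "Neutral"}
toy = pd.DataFrame({"anger": [0, 1, 0], "joy": [1, 1, 0], "neutral": [0, 0, 0]})
toy_emotions = map_emotions_vectorized(toy, toy_map)
print(toy_emotions.tolist())
```

On the real data the call would be `df["emotion"] = map_emotions_vectorized(df, label_map)`, which should give the same column as the loop-based version.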

In [7]:
df = df[["text", "emotion"]]

In [8]:
print(df.head())
print(df["emotion"].value_counts())

                                                text  emotion
0                                    That game hurt.  Sadness
1   >sexuality shouldn’t be a grouping category I...  Neutral
2     You do right, if you don't care then fuck 'em!  Neutral
3                                 Man I love reddit.      Joy
4  [NAME] was nowhere near them, he was by the Fa...  Neutral
emotion
Joy         79436
Neutral     58709
Surprise    22904
Anger       19885
Sadness     14494
Disgust     12337
Fear         3460
Name: count, dtype: int64
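
The counts above are strongly imbalanced: Joy has roughly 23 times as many examples as Fear. One common heuristic for compensating is inverse-frequency class weighting, sketched below from the printed counts. This is not part of the original pipeline; using the weights would require a weighted loss during training, which the notebook does not set up.

```python
# Class counts copied from the value_counts() output above
counts = {
    "Joy": 79436, "Neutral": 58709, "Surprise": 22904, "Anger": 19885,
    "Sadness": 14494, "Disgust": 12337, "Fear": 3460,
}

total = sum(counts.values())
n_classes = len(counts)

# "Balanced" heuristic: weight = total / (n_classes * class_count)
class_weights = {label: total / (n_classes * n) for label, n in counts.items()}

# Print classes from most to least frequent (smallest weight first)
for label, weight in sorted(class_weights.items(), key=lambda kv: kv[1]):
    share = counts[label] / total
    print(f"{label:<9} share={share:6.1%}  weight={weight:.2f}")
```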


### Training the model 

##### Dependencies


In [9]:
!pip install transformers datasets evaluate accelerate

Collecting fsspec>=2023.5.0 (from huggingface-hub<1.0,>=0.34.0->transformers)
  Using cached fsspec-2025.3.0-py3-none-any.whl.metadata (11 kB)
Using cached fsspec-2025.3.0-py3-none-any.whl (193 kB)
Installing collected packages: fsspec
  Attempting uninstall: fsspec
    Found existing installation: fsspec 2025.9.0
    Uninstalling fsspec-2025.9.0:
      Successfully uninstalled fsspec-2025.9.0
Successfully installed fsspec-2025.3.0



[notice] A new release of pip is available: 25.1.1 -> 25.2
[notice] To update, run: python.exe -m pip install --upgrade pip


In [10]:
from datasets import Dataset, DatasetDict
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from transformers import RobertaTokenizerFast, RobertaForSequenceClassification, TrainingArguments, Trainer
import evaluate
import numpy as np
import joblib

In [12]:
# Split into train/val/test
train_df, temp_df = train_test_split(df, test_size=0.2, random_state=42)
val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42)

# Convert Pandas → Hugging Face datasets
goemotions = DatasetDict({
    "train": Dataset.from_pandas(train_df.reset_index(drop=True)),
    "validation": Dataset.from_pandas(val_df.reset_index(drop=True)),
    "test": Dataset.from_pandas(test_df.reset_index(drop=True))
})
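
Because rare classes such as Fear make up a small share of the data, a purely random split can leave their proportions slightly different across train, validation, and test. A stratified split keeps the proportions aligned. The sketch below uses a toy frame (hypothetical data) with the same `train_test_split` function; the variable names are illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with a rare class, standing in for the real `df` (hypothetical data)
toy_df = pd.DataFrame({
    "text": [f"example {i}" for i in range(100)],
    "emotion": ["Joy"] * 90 + ["Fear"] * 10,
})

# stratify= keeps the class proportions the same in every split
train_part, temp_part = train_test_split(
    toy_df, test_size=0.2, random_state=42, stratify=toy_df["emotion"]
)
val_part, test_part = train_test_split(
    temp_part, test_size=0.5, random_state=42, stratify=temp_part["emotion"]
)

for name, part in [("train", train_part), ("validation", val_part), ("test", test_part)]:
    print(name, part["emotion"].value_counts().to_dict())
```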

##### Encoding emotions

In [13]:
label_encoder = LabelEncoder()
label_encoder.fit(goemotions["train"]["emotion"])
print("Classes:", label_encoder.classes_)

Classes: ['Anger' 'Disgust' 'Fear' 'Joy' 'Neutral' 'Sadness' 'Surprise']
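
The encoder assigns integer ids in the alphabetical order printed above. A small sketch of the resulting id/label lookup tables follows; the dictionary names `id2label` and `label2id` mirror the configuration fields used by Hugging Face models, but wiring them into the model is an optional step the notebook does not perform.

```python
# Class order as printed by the LabelEncoder above (alphabetical)
classes = ["Anger", "Disgust", "Fear", "Joy", "Neutral", "Sadness", "Surprise"]

id2label = {idx: name for idx, name in enumerate(classes)}
label2id = {name: idx for idx, name in id2label.items()}

print(id2label)
# These dicts could be passed as id2label=... and label2id=... when loading the
# model with from_pretrained, so that the saved config carries the label names.
```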


##### Tokenizing the text

In [14]:
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")

def tokenize_function(example):
    return tokenizer(
        example["text"],
        truncation=True,
        padding="max_length",
        max_length=128
    )

goemotions = goemotions.map(tokenize_function, batched=True)

Map:   0%|          | 0/168980 [00:00<?, ? examples/s]

Map:   0%|          | 0/21122 [00:00<?, ? examples/s]

Map:   0%|          | 0/21123 [00:00<?, ? examples/s]
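
With `padding="max_length"` and `max_length=128`, every example ends up with exactly 128 token ids plus an attention mask marking the real tokens. The sketch below mimics this on plain lists to show the shapes involved. It is an illustration, not the tokenizer's implementation: it ignores special tokens, and the default `pad_id=1` is an assumption matching RoBERTa's `<pad>` id.

```python
def pad_or_truncate(token_ids, max_length=128, pad_id=1):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                    # truncation: drop tokens past the limit
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad              # padding: fill up to the limit
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask

# A short and a long sequence, shown with max_length=8 to keep the output readable
short_ids, short_mask = pad_or_truncate(list(range(10, 15)), max_length=8)
long_ids, long_mask = pad_or_truncate(list(range(100, 300)), max_length=8)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The same cutoff applies later to lyrics: anything past the first 128 tokens of a song is not seen by the model.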

##### Encoding labels as integers

In [15]:
def encode_labels(batch):
    # Encode a whole batch of emotion strings at once (much faster than row by row)
    batch["label"] = label_encoder.transform(batch["emotion"]).tolist()
    return batch

goemotions = goemotions.map(encode_labels, batched=True)


##### Removing unused columns: the dataset is then ready for fine-tuning the model

In [16]:
goemotions = goemotions.remove_columns(["text", "emotion"])
goemotions.set_format("torch")

#### Training the RoBERTa model

##### Loading the RoBERTa model

In [17]:
num_labels = len(label_encoder.classes_)

In [18]:
from transformers import RobertaForSequenceClassification

model = RobertaForSequenceClassification.from_pretrained(
    "roberta-base",
    num_labels=num_labels
)


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


##### Configuring training: batch size, learning rate, etc.

In [19]:
from transformers import TrainingArguments

In [20]:
!pip install --upgrade transformers






In [21]:
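training_args = TrainingArguments(
    output_dir="./results",          # checkpoints and logs are written here
    eval_strategy="epoch",           # evaluate on the validation set after every epoch
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01
)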
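
To get a feel for how long training will run, the number of optimizer steps can be estimated from the training-set size and the settings above. This back-of-the-envelope sketch assumes a single device and no gradient accumulation; the variable names are illustrative.

```python
import math

train_examples = 168_980      # training split size shown in the Map progress output
batch_size = 16               # per_device_train_batch_size
epochs = 3                    # num_train_epochs

# Assumes one device and no gradient accumulation; the last partial batch counts as a step
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)
```

That is about 10.6k steps per epoch and about 31.7k steps in total, which is a long run on a machine without a GPU.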


##### Defining how to check **accuracy**
We track accuracy and the weighted F1 score on the validation set during training.

In [22]:
accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")

In [23]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    acc = accuracy.compute(predictions=predictions, references=labels)
    f1_score = f1.compute(predictions=predictions, references=labels, average="weighted")
    return {"accuracy": acc["accuracy"], "f1": f1_score["f1"]}
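
Weighted F1 averages per-class scores by class frequency, so on imbalanced data it can look healthy even when a rare class is never predicted. The toy example below (hypothetical labels, using scikit-learn rather than the `evaluate` wrappers) contrasts it with macro F1, which weights every class equally.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels (hypothetical): 8 examples of class 0, 2 of class 1,
# and a classifier that always predicts class 0
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10

acc = accuracy_score(y_true, y_pred)
weighted_f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
print(f"accuracy={acc:.3f} weighted_f1={weighted_f1:.3f} macro_f1={macro_f1:.3f}")
```

Reporting macro F1 next to the weighted score is one way to notice when a small class such as Fear is being ignored.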


##### Creating the trainer and training the model

In [25]:
!pip install --upgrade --force-reinstall transformers


Collecting transformers
  Using cached transformers-4.56.2-py3-none-any.whl.metadata (40 kB)
Collecting filelock (from transformers)
  Using cached filelock-3.19.1-py3-none-any.whl.metadata (2.1 kB)
Collecting huggingface-hub<1.0,>=0.34.0 (from transformers)
  Using cached huggingface_hub-0.35.1-py3-none-any.whl.metadata (14 kB)
Collecting numpy>=1.17 (from transformers)
  Using cached numpy-2.3.3-cp313-cp313-win_amd64.whl.metadata (60 kB)
Collecting packaging>=20.0 (from transformers)
  Using cached packaging-25.0-py3-none-any.whl.metadata (3.3 kB)
Collecting pyyaml>=5.1 (from transformers)
  Using cached PyYAML-6.0.2-cp313-cp313-win_amd64.whl.metadata (2.1 kB)
Collecting regex!=2019.12.17 (from transformers)
  Using cached regex-2025.9.18-cp313-cp313-win_amd64.whl.metadata (41 kB)
Collecting requests (from transformers)
  Using cached requests-2.32.5-py3-none-any.whl.metadata (4.9 kB)
Collecting tokenizers<=0.23.0,>=0.22.0 (from transformers)
  Using cached tokenizers-0.22.1-cp39-abi

  You can safely remove it manually.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
datasets 4.0.0 requires fsspec[http]<=2025.3.0,>=2023.1.0, but you have fsspec 2025.9.0 which is incompatible.



In [26]:
import transformers
print(transformers.__version__)

4.56.2


In [27]:
import sys
print(sys.executable)

c:\Users\zoro\Desktop\Recommender_music\.env\Scripts\python.exe


In [28]:
!{sys.executable} -m pip install --upgrade transformers






In [29]:
import torch
print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Using GPU:", torch.cuda.get_device_name(0))

GPU available: False


In [30]:
# `tokenizer=` is deprecated in recent transformers releases; `processing_class=` replaces it
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=goemotions["train"],
    eval_dataset=goemotions["validation"],
    processing_class=tokenizer,
    compute_metrics=compute_metrics
)


In [None]:
trainer.train()





Calling `trainer.train()` will:
- Go through the training data multiple times (3 epochs here).
- Adjust the model weights to improve predictions.

##### Testing the model

In [None]:
trainer.evaluate(goemotions["test"])

##### Saving the model & label encoder

In [None]:
trainer.save_model("./emotion_model")
joblib.dump(label_encoder, "label_encoder.pkl")
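
When the saved artifacts are reused in a later session, the label encoder has to come back with the same class order, otherwise predicted indices map to the wrong emotions. The sketch below checks a `joblib` round trip on a stand-in encoder written to a temporary directory; the stand-in and the paths are illustrative.

```python
import os
import tempfile

import joblib
from sklearn.preprocessing import LabelEncoder

# Stand-in encoder fitted on the seven categories (the real one comes from training)
encoder = LabelEncoder().fit(
    ["Anger", "Disgust", "Fear", "Joy", "Neutral", "Sadness", "Surprise"]
)

with tempfile.TemporaryDirectory() as tmp_dir:
    encoder_path = os.path.join(tmp_dir, "label_encoder.pkl")
    joblib.dump(encoder, encoder_path)
    restored = joblib.load(encoder_path)

# The index -> label mapping must be identical after reloading
print(list(restored.classes_))
# In a fresh session the model itself could be reloaded with
# RobertaForSequenceClassification.from_pretrained("./emotion_model")
```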

### **Emotions prediction** on the lyrics

In [None]:
import pandas as pd

# Load the dataset
songs_sentiment_df = pd.read_csv("sentiment_analysis.csv")

In [None]:
def predict_emotion(text):
    # Tokenize the cleaned lyrics; anything beyond 128 tokens is truncated.
    # No padding is needed for a single input.
    inputs = tokenizer(text, truncation=True, max_length=128, return_tensors="pt")
    inputs = inputs.to(model.device)  # keep inputs on the same device as the model

    # Forward pass in eval mode (disables dropout); no gradients needed
    model.eval()
    with torch.no_grad():
        logits = model(**inputs).logits

    # Get predicted class index
    predicted_idx = torch.argmax(logits, dim=1).item()

    # Convert index to emotion label
    return label_encoder.inverse_transform([predicted_idx])[0]

##### Apply prediction on songs_sentiment_df dataframe

In [None]:
songs_sentiment_df["predicted_emotion"] = songs_sentiment_df["lyrics_cleaned"].apply(predict_emotion)
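
Since the tokenizer truncates at 128 tokens, only the opening of each song reaches the model. One possible workaround, sketched below, is to split the lyrics into word chunks, classify each chunk, and take a majority vote. This is not what the notebook does; `predict_long_text`, `chunk_words`, and the stand-in classifier `fake_predict` are hypothetical names used for illustration.

```python
from collections import Counter

def predict_long_text(text, predict_fn, chunk_words=80):
    """Split text into word chunks, classify each, return the most common label."""
    words = text.split()
    if not words:
        return "Neutral"  # assumption: treat empty lyrics as Neutral
    chunks = [
        " ".join(words[i:i + chunk_words])
        for i in range(0, len(words), chunk_words)
    ]
    votes = Counter(predict_fn(chunk) for chunk in chunks)
    return votes.most_common(1)[0][0]

# Stand-in classifier for demonstration only (not the fine-tuned model)
def fake_predict(chunk):
    return "Sadness" if "tears" in chunk else "Joy"

lyrics = "sunny day " * 5 + "tears fall " * 30
majority_label = predict_long_text(lyrics, fake_predict, chunk_words=10)
print(majority_label)
```

With the real model, `predict_emotion` would be passed as `predict_fn`; the chunk size in words is only a rough proxy for the token limit.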

##### Prediction confidence (softmax probabilities)

In [None]:
import torch.nn.functional as F

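def predict_emotion_with_confidence(text):
    # Tokenize the cleaned lyrics; anything beyond 128 tokens is truncated
    inputs = tokenizer(text, truncation=True, max_length=128, return_tensors="pt")
    inputs = inputs.to(model.device)  # keep inputs on the same device as the model
    model.eval()  # disable dropout so predictions are deterministic
    with torch.no_grad():
        logits = model(**inputs).logits
        probs = F.softmax(logits, dim=1)  # logits -> class probabilities

    predicted_idx = torch.argmax(probs, dim=1).item()
    confidence = probs[0, predicted_idx].item()  # probability of the predicted class
    predicted_emotion = label_encoder.inverse_transform([predicted_idx])[0]
    return predicted_emotion, confidence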
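
The confidence returned above is the softmax probability of the winning class. For reference, here is a minimal NumPy sketch of that computation on made-up logits for one song; the values are invented and the helper `softmax` is written out only to show the arithmetic.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # avoids overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

classes = ["Anger", "Disgust", "Fear", "Joy", "Neutral", "Sadness", "Surprise"]
logits = np.array([[0.2, -1.0, -0.5, 3.1, 1.4, 0.0, 0.3]])  # made-up values for one song

probs = softmax(logits)
best = int(probs.argmax(axis=-1)[0])
print(classes[best], round(float(probs[0, best]), 3))
```

A softmax probability is a relative score across the seven classes, not a calibrated likelihood, so high values should not be read as guarantees.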

##### Using the prediction function

In [None]:
# Usage
results = songs_sentiment_df["lyrics_cleaned"].apply(predict_emotion_with_confidence)
songs_sentiment_df["predicted_emotion"] = results.apply(lambda x: x[0])
songs_sentiment_df["prediction_confidence"] = results.apply(lambda x: x[1])
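
The two `apply` calls above unpack the `(label, confidence)` tuples one field at a time. An alternative sketch, shown on toy results with hypothetical values, expands them into both columns in one step and then filters by an illustrative confidence threshold of 0.6.

```python
import pandas as pd

# Toy results shaped like the output of predict_emotion_with_confidence (hypothetical values)
toy_results = pd.Series([("Joy", 0.91), ("Sadness", 0.48), ("Anger", 0.77)])

# Expand the (label, confidence) tuples into two columns in one step
expanded = pd.DataFrame(
    toy_results.tolist(),
    index=toy_results.index,
    columns=["predicted_emotion", "prediction_confidence"],
)

# Keep only predictions at or above an illustrative confidence threshold
confident = expanded[expanded["prediction_confidence"] >= 0.6]
print(confident)
```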

##### Saving the results

In [None]:
import pandas as pd

# Save to CSV
songs_sentiment_df.to_csv('songs_with_predicted_emotions.csv', index=False)