# Notebook 1: BERT sentiment analysis

This notebook scores the sentiment of the text fields associated with each track:

- lyrics
- track_name
- track_album_name
- playlist_name


We could use an LSTM for this; however, that architecture has to be trained on a labeled target column, which this dataset does not have. We therefore use a pretrained transformer-based architecture: *BERT*.

Specifically, we use the BERT model available on Hugging Face: [bert-base-multilingual-uncased-sentiment](https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment)

According to the documentation, sentiment prediction yields a vector of probabilities over 5 positions, the first being the most negative and the last the most positive. We take the position with the highest probability as the predicted sentiment.
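The selection rule above can be illustrated without loading the model. This is a minimal sketch, not the notebook's own code: `softmax`, `predicted_stars` and `example_logits` are hypothetical names, and the logits are made-up values. It assumes the 5 positions correspond to ratings of 1 to 5 stars, so the 0-based position is shifted by one; the notebook itself stores the 0-based position.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predicted_stars(logits):
    """Return (star rating, probability) for the most probable position."""
    probs = softmax(logits)
    best_index = max(range(len(probs)), key=probs.__getitem__)
    # Assumption: position 0 means 1 star (most negative), position 4 means 5 stars
    return best_index + 1, probs[best_index]


# Made-up logits for illustration only, not real model output
example_logits = [-1.2, -0.3, 0.4, 2.1, 1.0]
stars, confidence = predicted_stars(example_logits)
print(stars, round(confidence, 3))
```

Only the position of the maximum is kept; the probability itself is discarded, so a confident prediction and a near-tie between two positions are stored identically.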


For computational efficiency, we use batch processing and CUDA to speed up sentiment prediction.



In [8]:
import pandas as pd
import re

In [9]:
data = pd.read_csv("./data/spotify_songs.csv")

In [10]:
print(data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18454 entries, 0 to 18453
Data columns (total 25 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   track_id                  18454 non-null  object 
 1   track_name                18454 non-null  object 
 2   track_artist              18454 non-null  object 
 3   lyrics                    18194 non-null  object 
 4   track_popularity          18454 non-null  int64  
 5   track_album_id            18454 non-null  object 
 6   track_album_name          18454 non-null  object 
 7   track_album_release_date  18454 non-null  object 
 8   playlist_name             18454 non-null  object 
 9   playlist_id               18454 non-null  object 
 10  playlist_genre            18454 non-null  object 
 11  playlist_subgenre         18454 non-null  object 
 12  danceability              18454 non-null  float64
 13  energy                    18454 non-null  float64
 14  key   

In [11]:
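def clean_text(text):
    # Keep only ASCII letters and whitespace; non-string values such as NaN become ''
    return re.sub(r'[^a-zA-Z\s]', '', text) if isinstance(text, str) else ''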

In [12]:
# Keep English tracks only; .copy() avoids SettingWithCopyWarning on the assignments below
data = data[data['language'] == 'en'].copy()
data['lyrics'] = data['lyrics'].apply(clean_text)
data['track_name'] = data['track_name'].apply(clean_text)
data['track_album_name'] = data['track_album_name'].apply(clean_text)

In [13]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

In [14]:
tokenizer = AutoTokenizer.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment")
model = AutoModelForSequenceClassification.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment")

In [15]:
# Check whether CUDA is available
if torch.cuda.is_available():
    model = model.to('cuda')
else:
    raise Exception("Must have CUDA available")

Exception: Must have CUDA available

In [None]:
def sentiment_score(text):
    # Tokenize a single text, truncating to the 512-token limit
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

    # Move the input tensors to the GPU when one is available
    inputs = {k: v.to('cuda') for k, v in inputs.items()} if torch.cuda.is_available() else inputs

    with torch.no_grad():
        outputs = model(**inputs)

    # outputs[0] holds the logits; softmax turns them into class probabilities
    scores = outputs[0][0].softmax(0)
    scores = scores.detach().cpu().numpy()  # Move the result to the CPU to convert it to NumPy
    # Index (0-4) of the most probable class: 0 is the most negative, 4 the most positive
    max_score = scores.argmax()
    return max_score


In [None]:
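def process_in_batches(data, column, batch_size=32):
    # Number of chunks, rounding up so that a final partial chunk is included
    num_batches = len(data) // batch_size + (0 if len(data) % batch_size == 0 else 1)
    results = []

    for i in range(num_batches):
        # Positional slice of the column for this chunk
        batch = data[column][i * batch_size:(i + 1) * batch_size]
        # Each text in the chunk is still scored one at a time by sentiment_score
        batch_result = batch.apply(sentiment_score)
        results.extend(batch_result)

    return results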
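The function above splits the column into chunks, but the model is still called once per text. Below is a minimal sketch of the chunking step on its own, written so that each chunk is handed to a scorer as a whole list. The names `iter_batches`, `score_in_batches` and `fake_score_fn` are hypothetical and not part of the notebook. `fake_score_fn` is a stand-in that needs no model; a real scorer would, as an assumption, tokenize the whole list with `padding=True` and `truncation=True` and run a single forward pass per chunk.

```python
from typing import Callable, Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def score_in_batches(
    texts: Sequence[str],
    score_fn: Callable[[List[str]], List[int]],
    batch_size: int = 32,
) -> List[int]:
    """Call score_fn once per chunk and concatenate the results in order."""
    results: List[int] = []
    for batch in iter_batches(texts, batch_size):
        batch_scores = score_fn(batch)
        if len(batch_scores) != len(batch):
            raise ValueError("score_fn must return one score per input text")
        results.extend(batch_scores)
    return results


def fake_score_fn(batch: List[str]) -> List[int]:
    """Stand-in scorer (hypothetical): returns a label in 0-4 based on text length."""
    return [len(text) % 5 for text in batch]


print(score_in_batches(["a", "bb", "ccc"], fake_score_fn, batch_size=2))
```

With a scorer that accepts a list, the number of model calls drops from one per row to one per chunk, which is where GPU batching would pay off.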

In [None]:
data['lyrics_sentiment'] = process_in_batches(data, 'lyrics', batch_size=32)

In [None]:
data['album_name_sentiment'] = process_in_batches(data, 'track_album_name', batch_size=32)

In [None]:
data['track_name_sentiment'] = process_in_batches(data, 'track_name', batch_size=32)

In [None]:
data['playlist_name_sentiment'] = process_in_batches(data, 'playlist_name', batch_size=32)

In [None]:
data.to_csv("spotify_songs_processed.csv", index=False)