# Data Collection and Creation

In [15]:
!pip install datasets



## Dataset

The song dataset comes from *Hugging Face*: [tsterbak/lyrics-dataset](https://huggingface.co/datasets/tsterbak/lyrics-dataset). It contains about 158k songs with the following fields:
* artist (`artist`)
* lyrics (`seq`)
* song name (`song`)
* label (`label`)

The goal is to add a new column called *emotions* holding up to three emotions that characterize each song.

In [16]:
from datasets import load_dataset
import pandas as pd

dataset = load_dataset("tsterbak/lyrics-dataset")


In [17]:
split_name = 'train'
music_data = dataset[split_name].to_pandas()
music_data['emotions'] = None
music_data

| Unnamed: 0.1 | Unnamed: 0 | artist | seq | song | label | emotions |
|---|---|---|---|---|---|---|
| 0 | 0 | Elijah Blake | No, no\r\nI ain't ever trapped out the bando\r... | Everyday | 0.626 | |
| 1 | 1 | Elijah Blake | The drinks go down and smoke goes up, I feel m... | Live Till We Die | 0.630 | |
| 2 | 2 | Elijah Blake | She don't live on planet Earth no more\r\nShe ... | The Otherside | 0.240 | |
| 3 | 3 | Elijah Blake | Trippin' off that Grigio, mobbin', lights low\... | Pinot | 0.536 | |
| 4 | 4 | Elijah Blake | I see a midnight panther, so gallant and so br... | Shadows & Diamonds | 0.371 | |
| ... | ... | ... | ... | ... | ... | ... |
| 158348 | 158348 | Adam Green | And we live on borrowed time,\r\nBut this head... | Friends of Mine | 0.737 | |
| 158349 | 158349 | Adam Green | Frozin in time forever\r\nCarrying that torch ... | Frozen in Time | 0.482 | |
| 158350 | 158350 | Adam Green | Hard to be a girl. \r\nSo nice to be a boy. \r... | Hard to Be a Girl | 0.733 | |
| 158351 | 158351 | Adam Green | I want to chose to die,\r\nAnd be buried with ... | I Wanna Die | 0.361 | |


## Classifying songs

We use a model obtained by fine-tuning DistilRoBERTa-base, which classifies the emotion expressed in a text. Each prediction falls into one of seven classes: six basic emotions (anger, disgust, fear, joy, sadness, and surprise) plus a neutral class.
* Jochen Hartmann, "Emotion English DistilRoBERTa-base". https://huggingface.co/j-hartmann/emotion-english-distilroberta-base/, 2022.

In [None]:
from transformers import pipeline

classifier = pipeline("text-classification", model="j-hartmann/emotion-english-distilroberta-base", top_k=3)


Because a song does not always convey a single feeling, we allow each song to be tagged with up to three emotions. This avoids losing feelings that may not be dominant but are still strongly present in the song.

We set a minimum score of 25% for a label to be kept: if all three labels pass it, together they account for just over 75% of the total score.

In [None]:
def generate_response(input_text):
    # Run the classifier; with top_k=3, result[0] holds the three
    # highest-scoring emotions for the input text
    result = classifier(input_text)
    emotions = []

    # Keep only the emotions whose score exceeds the 25% threshold
    for emotion_dict in result[0]:
        if emotion_dict['score'] > 0.25:
            emotions.append(emotion_dict['label'])

    return emotions
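
To make the threshold rule concrete, the sketch below applies the same `score > 0.25` filter to a hand-written classifier result. The scores are made up for illustration and the helper name `filter_emotions` is hypothetical; only the nested list-of-dicts shape is meant to mirror what the `classifier` above returns with `top_k=3`.

```python
# Illustrative only: a made-up result in the shape returned by the pipeline
# for one input with top_k=3 (a list holding one list of label/score dicts).
mock_result = [[
    {"label": "sadness", "score": 0.48},
    {"label": "fear", "score": 0.31},
    {"label": "neutral", "score": 0.12},
]]

THRESHOLD = 0.25


def filter_emotions(result, threshold=THRESHOLD):
    """Keep the labels whose score is strictly above the threshold."""
    return [item["label"] for item in result[0] if item["score"] > threshold]


selected = filter_emotions(mock_result)
print(selected)  # ['sadness', 'fear']
```

A song whose top three scores all sit at or below 0.25 gets an empty list, which is why empty tags have to be filtered out later.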

Whether to run the next cell in full is left to the reader, depending on needs and available time. We stopped it after it had processed almost 50% of the songs.

Afterwards, we remove the songs without a tag and save the results to a CSV file so they can be reused later without repeating the process.

In [None]:
for index in music_data.index:

    # Truncate the lyrics to their first 511 characters to keep the input short
    lyrics = music_data.at[index, 'seq'][:511]
    output = generate_response(lyrics)

    # Store the list of detected emotions in the 'emotions' column
    music_data.at[index, 'emotions'] = output
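
Because the labeling loop can run for a long time, one option is to label only the rows that are still empty, so an interrupted run can pick up where it stopped. The sketch below assumes that pattern and is not the notebook's own code; `label_missing`, `fake_classify`, and `max_rows` are hypothetical names, and the stand-in classifier replaces `generate_response` so the example runs without downloading the model.

```python
import pandas as pd


def fake_classify(text):
    """Stand-in for generate_response: tags text mentioning 'cry' as sadness."""
    return ["sadness"] if "cry" in text else ["joy"]


def label_missing(df, classify, max_rows=None):
    """Label rows whose 'emotions' cell is still empty; return how many were labeled."""
    pending = df.index[df["emotions"].isna()]
    if max_rows is not None:
        pending = pending[:max_rows]
    for index in pending:
        lyrics = df.at[index, "seq"][:511]  # same 511-character cut as above
        df.at[index, "emotions"] = classify(lyrics)
    return len(pending)


# Toy data (made up) with an empty 'emotions' column, as in the notebook.
songs = pd.DataFrame({"seq": ["I cry at night", "Sun is out", "Don't cry"]})
songs["emotions"] = None

first = label_missing(songs, fake_classify, max_rows=2)  # simulate an interrupted run
second = label_missing(songs, fake_classify)             # resume: only the remaining row
print(first, second)  # 2 1
```

Rows that already hold a list are skipped on the second call, so no song is classified twice.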

In [None]:
df = music_data.copy()
df.drop(columns='label', inplace=True)
df = df.dropna(subset=['emotions'])
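
The CSV step mentioned earlier is not shown in the notebook. A minimal sketch of it, assuming toy data and an in-memory buffer in place of a real file path, might look like the following. It also illustrates a side effect worth knowing: CSV stores each list as its string representation, so after reloading the *emotions* values are text such as `"['fear']"` and need parsing to become lists again.

```python
import ast
import io

import pandas as pd

# Toy data standing in for the labeled songs (values are made up).
toy = pd.DataFrame({
    "song": ["A", "B", "C"],
    "emotions": [["fear"], ["sadness", "joy"], None],
})

# Drop songs that were never tagged, then write to CSV.
# An in-memory buffer stands in for a real file path here.
tagged = toy.dropna(subset=["emotions"])
buffer = io.StringIO()
tagged.to_csv(buffer, index=False)

# Reading the CSV back yields strings, not lists.
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(type(reloaded.loc[0, "emotions"]))  # <class 'str'>

# Parse the string representation back into Python lists.
reloaded["emotions"] = reloaded["emotions"].apply(ast.literal_eval)
print(reloaded["emotions"].tolist())  # [['fear'], ['sadness', 'joy']]
```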

In [11]:
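from datasets import Dataset
import pandas as pd

# Make sure the text columns are plain strings
df['song'] = df['song'].astype(str)
df['artist'] = df['artist'].astype(str)
df['seq'] = df['seq'].astype(str)

# Keep only songs tagged with exactly one emotion: drop multi-label rows
# (their string form contains a comma) and rows with an empty list
df = df[~df['emotions'].astype(str).str.contains(',')]
df = df[df['emotions'].astype(str) != '[]']

# Convert the pandas DataFrame into a Hugging Face Dataset
dataset = Dataset.from_pandas(df)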


In [12]:
df['emotions'].value_counts(normalize=True, dropna=False)

['fear']        0.296447
['sadness']     0.292680
['neutral']     0.154999
['anger']       0.109033
['joy']         0.078082
['surprise']    0.046237
['disgust']     0.022522
Name: emotions, dtype: float64

## Uploading to Hugging Face

In [13]:
!huggingface-cli login


    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) n
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [14]:
# Save a local copy of the dataset to disk
dataset.save_to_disk('manoh2f2/tsterbak-lyrics-dataset-with-emotions')

# Upload to the Hugging Face Hub (replace the repository ID with your own 'username/dataset_name')
dataset.push_to_hub('manoh2f2/tsterbak-lyrics-dataset-with-emotions')


Saving the dataset (0/1 shards):   0%|          | 0/36897 [00:00<?, ? examples/s]
