
# TP3 Part II: Baseline

From the exploratory analysis in Part I we learned a few things:
- Some columns give us no information. In particular, *did* is not useful because it is missing from most rows, and we do not know how to interpret *s-label*. *language* does not contribute much either, since it is missing for many songs. *artist* should not give the model much information, given that we already have each artist's number of songs and usual genres.

- Some columns require preprocessing. The three text columns we have, *track_name*, *lyric* and *artist*, cannot be used directly. *mode* and *key*, on the other hand, are categorical features. *a_genres* is also a categorical variable, one that can hold several classes at once. So we need to preprocess the text features to derive useful new features, and preprocess the categorical features so they can be used in the logistic regression.

- We do not have enough observations for some of the possible target values.

In [83]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from google.colab import drive 
from nltk.sentiment import SentimentIntensityAnalyzer
import nltk
import functools

drive.mount('/content/gdrive')
path_a_training_set = 'gdrive/MyDrive/TP3 dataset music/train.parquet'

df_music_train = pd.read_parquet(path_a_training_set)

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [84]:
df_music_train_filtered = df_music_train.drop(labels=["s-label", "did", "language"], axis=1)
df_music_train_filtered.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 31383 entries, 0 to 34336
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   track_name        31383 non-null  object 
 1   lyric             31380 non-null  object 
 2   genre             31383 non-null  object 
 3   popularity        31383 non-null  int64  
 4   artist            31383 non-null  object 
 5   a_genres          31383 non-null  object 
 6   a_songs           31383 non-null  float64
 7   a_popularity      31383 non-null  float64
 8   acousticness      31383 non-null  float64
 9   danceability      31383 non-null  float64
 10  duration_ms       31383 non-null  int64  
 11  energy            31383 non-null  float64
 12  instrumentalness  31383 non-null  float64
 13  key               31383 non-null  object 
 14  liveness          31383 non-null  float64
 15  loudness          31383 non-null  float64
 16  mode              31383 non-null  object

First, note that the categorical variables *key* and *time_signature* are both ordinal. The first represents the dominant key of the song, and we take the order given in [this blog](https://viva.pressbooks.pub/openmusictheory/chapter/pitch-and-pitch-class/). The *time_signature* measures the number of beats per bar and is also naturally ordered. *mode* is a binary variable, so we encode it as 0 and 1.

In [99]:
ordinalEncoder = preprocessing.OrdinalEncoder(categories = [
    ["Minor", "Major"],  # mode
    ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B'],  # key
    ['1/4', '3/4', '4/4', '5/4']])  # time_signature
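To make the encoder's behaviour concrete, here is a minimal sketch on a made-up three-row frame (the rows are illustrative and not taken from the dataset). Each value is mapped to its position in the category list supplied for its column, so the lists must be given in the same order as the columns.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy rows, invented for illustration; column order is mode, key, time_signature.
toy = pd.DataFrame({
    "mode": ["Minor", "Major", "Major"],
    "key": ["C", "F#", "B"],
    "time_signature": ["3/4", "4/4", "5/4"],
})

encoder = OrdinalEncoder(categories=[
    ["Minor", "Major"],
    ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"],
    ["1/4", "3/4", "4/4", "5/4"],
])

# Each value becomes the index of its category within the corresponding list.
encoded = encoder.fit_transform(toy)
print(encoded)
```

For example, `F#` is the seventh entry of the key list, so it is encoded as 6.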

For the song lyrics, I thought of applying some NLP to perform sentiment analysis.

In [86]:
def apply_sentiment_analysis_to_lyrics(df_music: pd.DataFrame) -> pd.DataFrame:
  # VADER needs its lexicon; the download is skipped when it is already present.
  nltk.download("vader_lexicon", quiet=True)
  sia = SentimentIntensityAnalyzer()
  # Treat missing lyrics as instrumental tracks, without mutating the caller's frame.
  lyrics = df_music["lyric"].map(lambda lyric: lyric if lyric else "instrumental")
  # polarity_scores returns a dict with the keys "neg", "neu", "pos" and "compound".
  scores = pd.DataFrame(lyrics.map(sia.polarity_scores).to_list(), index=lyrics.index)
  return scores[["neg", "pos", "neu", "compound"]]
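As a small illustration of the reshaping step, the sketch below expands a Series of score dictionaries into one column per score. The dictionaries are hand-written stand-ins for what `SentimentIntensityAnalyzer.polarity_scores` returns (keys `neg`, `neu`, `pos`, `compound`); the numbers are invented, so the example runs without the VADER lexicon.

```python
import pandas as pd

# Hand-written stand-ins for polarity_scores output;
# the numbers are invented and only the dict layout matters here.
fake_scores = pd.Series([
    {"neg": 0.0, "neu": 0.4, "pos": 0.6, "compound": 0.8},
    {"neg": 0.5, "neu": 0.5, "pos": 0.0, "compound": -0.7},
])

# Expand the dicts into one column per score, selecting columns by name
# rather than by position.
score_columns = pd.DataFrame(fake_scores.to_list(), index=fake_scores.index)
score_columns = score_columns[["neg", "pos", "neu", "compound"]]
print(score_columns)
```

Selecting by key avoids depending on the order in which the dictionary happens to list its values.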

Another idea is to use the sum of the tf-idf weights to measure how "rich" a song's vocabulary is. The hypothesis is that songs from genres like pop, which are more mainstream and aimed at a general audience, will have a high tf-idf sum, while genres like alternative music should use a more "peculiar" vocabulary.

In [97]:
def get_sum_tfidf_from_lyrics(df_music: pd.DataFrame) -> pd.DataFrame:
  nltk.download("stopwords")
  # scikit-learn expects stop_words as a list, so merge the three languages and convert.
  stopwords = sorted(set().union(*(nltk.corpus.stopwords.words(language)
                                   for language in ("english", "spanish", "french"))))
  # Treat missing lyrics as instrumental tracks, without mutating the caller's frame.
  lyrics = df_music["lyric"].map(lambda lyric: lyric if lyric else "instrumental")
  # Note: the vectorizer is refit on whatever frame is passed in, including at prediction time.
  tfidf = TfidfVectorizer(input = "content", stop_words = stopwords).fit_transform(lyrics)
  # Sum each row of the sparse matrix in a single vectorised call.
  return pd.DataFrame(np.asarray(tfidf.sum(axis = 1)).ravel())
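To get a feel for what the tf-idf sum measures, here is a minimal sketch on two invented "lyrics" using `TfidfVectorizer` with its default settings (no stop words, L2-normalised rows). It is only meant to show the shape of the computation and is not a claim about the real dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Two invented "lyrics": one repeats a single word, the other uses six distinct words.
toy_lyrics = [
    "la la la",
    "midnight trains carry quiet strangers home",
]

tfidf = TfidfVectorizer().fit_transform(toy_lyrics)

# Row-wise sum of the sparse matrix, flattened to a 1-D array.
tfidf_sums = np.asarray(tfidf.sum(axis=1)).ravel()
print(tfidf_sums)
```

Because each row is normalised to unit length by default, a lyric with a single distinct term sums to 1, while six equally weighted terms sum to √6. Under these defaults the feature therefore grows with the number of distinct weighted terms in a song.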

For the genres I will simply use one-hot encoding, because I do not want mean encoding to blur the very valuable information provided by the artists' usual genres, and besides we do not have too many classes.

In [130]:
def one_hot_encode_genres(df_music: pd.DataFrame) -> pd.DataFrame:
  # The columns are the sorted target genres present in the frame that is passed in.
  genres = sorted(df_music["genre"].unique())
  # One indicator per genre: 1 when the genre appears in the artist's a_genres value.
  create_one_hot_vector_for_artist_genres = lambda a_genres: [1 if genre in a_genres else 0 for genre in genres]
  # Build the vectors without adding a helper column to the caller's frame.
  return pd.DataFrame(df_music["a_genres"].map(create_one_hot_vector_for_artist_genres).to_list())
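The membership test `genre in a_genres` behaves differently depending on how *a_genres* is stored, which the notebook does not show. The sketch below uses invented values and a hypothetical two-genre vocabulary to contrast the two cases.

```python
# Hypothetical genre vocabulary, for illustration only.
genres = ["pop", "rock"]

def one_hot(a_genres):
    # 1 when the genre is "in" the a_genres value, 0 otherwise.
    return [int(genre in a_genres) for genre in genres]

# If a_genres is a list, `in` tests for an exact element.
as_list = one_hot(["k-pop", "indie"])
# If a_genres is a single string, `in` tests for a substring, so "pop" matches "k-pop".
as_string = one_hot("k-pop, indie")

print(as_list, as_string)
```

If the column turns out to hold plain strings, sub-genre names that contain a broader genre name will also activate that genre's indicator, which may or may not be desirable.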

In [161]:
def eliminate_genres_without_enough_observations(df_music: pd.DataFrame) -> pd.DataFrame:
  # Count rows per genre, sorted from the rarest genre to the most common.
  df_music_recuento_filas_por_genero = df_music.groupby("genre").count().reset_index()[["genre", "track_name"]].rename(
    columns = {"track_name": "rowCount"}).sort_values("rowCount")
  # Genres with fewer than 50 rows are dropped; the [1:] slice skips the first (rarest) of them, which is therefore kept.
  problematic_genres = list(df_music_recuento_filas_por_genero[df_music_recuento_filas_por_genero["rowCount"] < 50].genre)[1:]
  return df_music[~df_music["genre"].isin(problematic_genres)]
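The same kind of filtering can be sketched more compactly with `value_counts`. The data and the threshold below are invented for illustration. Unlike the function above, this version drops every genre under the threshold, including the rarest one.

```python
import pandas as pd

# Invented toy data: three "pop" rows and one "polka" row.
toy = pd.DataFrame({
    "genre": ["pop", "pop", "pop", "polka"],
    "track_name": ["a", "b", "c", "d"],
})

min_rows = 2  # hypothetical threshold; the notebook uses 50

# Number of rows per genre, indexed by genre name.
rows_per_genre = toy["genre"].value_counts()
rare_genres = rows_per_genre[rows_per_genre < min_rows].index

# Keep only rows whose genre has enough observations.
filtered = toy[~toy["genre"].isin(rare_genres)]
print(filtered)
```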

In [154]:
labelEncoder = preprocessing.LabelEncoder()
labelEncoder.fit_transform(df_music_train_filtered["genre"])

array([ 8, 16, 17, ..., 25, 25, 25])
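For reference, this is how `LabelEncoder` assigns the integer codes, shown on a handful of invented genre labels: the classes are sorted, and each label is replaced by its index in that sorted array.

```python
from sklearn.preprocessing import LabelEncoder

# Invented genre labels, for illustration only.
toy_genres = ["rock", "pop", "jazz", "pop"]

encoder = LabelEncoder()
codes = encoder.fit_transform(toy_genres)

# classes_ holds the sorted labels; a code is the label's index in that array.
print(list(encoder.classes_))
print(codes)

# inverse_transform recovers the original strings from the integer codes.
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```

`inverse_transform` is what turns the model's integer predictions back into genre names.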

In [174]:
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression

text_features = ["track_name", "lyric", "artist"]

numerical_features = ["a_songs", "a_popularity", "popularity", "acousticness", "danceability", "duration_ms",
                   "energy", "instrumentalness", "liveness", "loudness", "speechiness", "tempo", "valence"]

ordinal_features = ["mode", "key", "time_signature"]

artist_genres = ["a_genres", "genre"]

label = ["genre"]

identity_transformer = preprocessing.FunctionTransformer(None)

full_processor = ColumnTransformer(transformers=[
    ('text_sentiment_analysis', preprocessing.FunctionTransformer(apply_sentiment_analysis_to_lyrics), text_features),
    ('text_tf_idf', preprocessing.FunctionTransformer(get_sum_tfidf_from_lyrics), text_features),
    ('artist_genres', preprocessing.FunctionTransformer(one_hot_encode_genres), artist_genres),
    ('numerical', identity_transformer, numerical_features),
    ('ordinal', ordinalEncoder, ordinal_features),
])

logistic_regression_pipeline = Pipeline(steps = [
    ('preprocess_X', full_processor),
    ('model', LogisticRegression(penalty='l2', C = 1, solver = "liblinear", max_iter = 50))
])
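A reduced version of this pipeline, on invented data with just one categorical and one numerical column, may help show how the pieces fit together. The frame, the labels and the name `toy_pipeline` are made up for the sketch; it also uses the `"passthrough"` option of `ColumnTransformer` as one possible alternative to an identity `FunctionTransformer`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Invented toy data with one categorical and one numerical feature.
X = pd.DataFrame({
    "mode": ["Minor", "Major", "Minor", "Major", "Minor", "Major"],
    "energy": [0.1, 0.9, 0.2, 0.8, 0.15, 0.95],
})
y = [0, 1, 0, 1, 0, 1]

preprocess = ColumnTransformer(transformers=[
    ("ordinal", OrdinalEncoder(categories=[["Minor", "Major"]]), ["mode"]),
    # "passthrough" forwards the selected columns unchanged.
    ("numerical", "passthrough", ["energy"]),
])

toy_pipeline = Pipeline(steps=[
    ("preprocess_X", preprocess),
    ("model", LogisticRegression(solver="liblinear")),
])

toy_pipeline.fit(X, y)
# One row per sample, one column of probabilities per class.
probabilities = toy_pipeline.predict_proba(X)
print(probabilities.shape)
```

Each transformer receives only the columns listed for it, and the outputs are concatenated column-wise before reaching the model.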

In [176]:
labelEncoder = preprocessing.LabelEncoder()
y = labelEncoder.fit_transform(eliminate_genres_without_enough_observations(df_music_train_filtered)["genre"])

logistic_regression_pipeline.fit(eliminate_genres_without_enough_observations(df_music_train), y)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Pipeline(steps=[('preprocess_X',
                 ColumnTransformer(transformers=[('text_sentiment_analysis',
                                                  FunctionTransformer(func=<function apply_sentiment_analysis_to_lyrics at 0x7f09342eec20>),
                                                  ['track_name', 'lyric',
                                                   'artist']),
                                                 ('text_tf_idf',
                                                  FunctionTransformer(func=<function get_sum_tfidf_from_lyrics at 0x7f093426e830>),
                                                  ['track_name', 'lyric',
                                                   'artist']),
                                                 ('artist_genr...
                                                   'danceability',
                                                   'duration_ms', 'energy',
                                                   'instrumentalness'

In [189]:
preds = logistic_regression_pipeline.predict_proba(eliminate_genres_without_enough_observations(df_music_train))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [190]:
from sklearn.metrics import top_k_accuracy_score
top_k_accuracy_score(y, preds, k=2)

0.3156024963994239
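To clarify what this metric counts, the sketch below computes top-2 accuracy by hand on invented scores and compares it with `top_k_accuracy_score`. A sample counts as correct when its true class is among the two classes with the highest predicted scores.

```python
import numpy as np
from sklearn.metrics import top_k_accuracy_score

# Invented scores for four samples and three classes (each row sums to 1).
y_true = np.array([0, 1, 2, 2])
y_score = np.array([
    [0.6, 0.3, 0.1],   # true class 0 is ranked first
    [0.5, 0.4, 0.1],   # true class 1 is ranked second
    [0.5, 0.3, 0.2],   # true class 2 is ranked third
    [0.1, 0.2, 0.7],   # true class 2 is ranked first
])

# Indices of the two highest-scoring classes per row.
top2 = np.argsort(y_score, axis=1)[:, -2:]
# A sample is a hit when its true class is among them.
manual = np.mean([label in row for label, row in zip(y_true, top2)])

library = top_k_accuracy_score(y_true, y_score, k=2)
print(manual, library)
```

Three of the four toy samples are hits, so both values are 0.75. Note that the score reported above is computed on the same rows the pipeline was fitted on, so it describes fit to the training data and is not a held-out estimate.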

In [188]:
preds

array([12, 12, 12, ...,  7,  0,  0])