<a href="https://colab.research.google.com/github/Nacho2904/orga_de_datos/blob/main/tp3_parte_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TP3 Part II: Baseline

From the exploratory analysis in Part I we learned a few things:
- Some columns give us no information. In particular, *did* is not useful because it is missing for most rows, and we do not know how to interpret *s-label*. *language* does not add much either, because it is missing for many songs.

- Some columns require preprocessing. The three text columns we have, *track_name*, *lyric* and *artist*, cannot be used directly. *mode* and *key*, on the other hand, are categorical features. *a_genres* is also a categorical variable, and it holds several classes. So the text features have to be preprocessed to create new useful features, and the categorical features have to be preprocessed before they can be used in the logistic regression.

- We do not have enough observations for some of the possible values of the target.
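
The last point can be checked by counting rows per target class. The following is a minimal sketch on toy data, assuming `genre` is the target column; the helper `find_rare_classes` and the threshold `min_observations` are hypothetical names introduced only for illustration.

```python
import pandas as pd

def find_rare_classes(target: pd.Series, min_observations: int) -> list:
    """Return the target values that appear fewer than min_observations times."""
    counts = target.value_counts()
    return sorted(counts[counts < min_observations].index)

# Toy data standing in for the real training set (assumption: `genre` is the target).
toy = pd.DataFrame({"genre": ["pop", "pop", "pop", "rock", "rock", "jazz"]})
rare = find_rare_classes(toy["genre"], min_observations=2)
print(rare)  # ['jazz']
```

On the real data, the same call on `df_music_train_filtered["genre"]` would list the classes that may need to be merged or dropped before fitting a baseline.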

In [135]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from google.colab import drive 
import nltk
import functools

drive.mount('/content/gdrive')
path_a_training_set = 'gdrive/MyDrive/TP3 dataset music/train.parquet'

df_music_train = pd.read_parquet(path_a_training_set)

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [62]:
df_music_train_filtered = df_music_train.drop(labels=["s-label", "did", "language"], axis=1)
df_music_train_filtered.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 31383 entries, 0 to 34336
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   track_name        31383 non-null  object 
 1   lyric             31380 non-null  object 
 2   genre             31383 non-null  object 
 3   popularity        31383 non-null  int64  
 4   artist            31383 non-null  object 
 5   a_genres          31383 non-null  object 
 6   a_songs           31383 non-null  float64
 7   a_popularity      31383 non-null  float64
 8   acousticness      31383 non-null  float64
 9   danceability      31383 non-null  float64
 10  duration_ms       31383 non-null  int64  
 11  energy            31383 non-null  float64
 12  instrumentalness  31383 non-null  float64
 13  key               31383 non-null  object 
 14  liveness          31383 non-null  float64
 15  loudness          31383 non-null  float64
 16  mode              31383 non-null  object

First, we note that the categorical variables *key* and *time_signature* are both ordinal. The former represents the dominant pitch of the song, and we take the order given in [this blog](https://viva.pressbooks.pub/openmusictheory/chapter/pitch-and-pitch-class/). *time_signature* measures the number of beats per bar, and it is also naturally ordered. *mode* is a binary variable, so we encode it as 0 and 1.

In [40]:
# The category lists follow the order of the columns passed to fit: mode, key, time_signature.
ordinalEncoder = preprocessing.OrdinalEncoder(categories = [
    ["Minor", "Major"],
    ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B'],
    ['1/4', '3/4', '4/4', '5/4']])
ordinalEncoder.fit(df_music_train_filtered[["mode", "key", "time_signature"]])

OrdinalEncoder(categories=[['Minor', 'Major'],
                           ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#',
                            'A', 'A#', 'B'],
                           ['1/4', '3/4', '4/4', '5/4']])
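
To make the encoding concrete, here is a small sketch on two made-up rows (the toy frame is an assumption, not part of the dataset). Each value is replaced by its position in the corresponding category list.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

keys = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']
encoder = OrdinalEncoder(categories=[["Minor", "Major"], keys, ['1/4', '3/4', '4/4', '5/4']])

# Two made-up songs, with columns in the same order as the category lists.
toy = pd.DataFrame({
    "mode": ["Minor", "Major"],
    "key": ["B", "C"],
    "time_signature": ["4/4", "3/4"],
})
encoded = encoder.fit_transform(toy)
print(encoded.tolist())  # [[0.0, 11.0, 2.0], [1.0, 0.0, 1.0]]
```

With the default settings, a value that is not in the given lists raises an error, so any unexpected key or time signature in the data would surface at fit or transform time.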

For the song lyrics, my idea is to apply some NLP to do sentiment analysis.

In [142]:
def apply_sentiment_analysis_to_lyrics(df_music: pd.DataFrame) -> pd.DataFrame:
  import nltk
  nltk.download(["stopwords", "averaged_perceptron_tagger", "vader_lexicon", "punkt"])
  from nltk.sentiment import SentimentIntensityAnalyzer
  # Missing lyrics are scored as empty strings; the input frame is left untouched.
  lyrics = df_music["lyric"].fillna("")
  sia = SentimentIntensityAnalyzer()
  # polarity_scores returns a dict with the keys "neg", "neu", "pos" and "compound".
  scores = pd.DataFrame(lyrics.map(sia.polarity_scores).tolist(), index = lyrics.index)
  # Output column order: negative, positive, neutral, compound.
  return scores[["neg", "pos", "neu", "compound"]]
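
The step that turns per-song score dictionaries into feature columns can be tried without downloading any NLTK data. In the sketch below, `scores_to_frame` and `toy_scorer` are hypothetical names; `toy_scorer` is only a stand-in that returns the same four keys as VADER's `polarity_scores`, not a real sentiment model.

```python
import pandas as pd

def scores_to_frame(lyrics: pd.Series, scorer) -> pd.DataFrame:
    """Score each lyric and return one column per score; missing lyrics count as empty."""
    filled = lyrics.fillna("")
    rows = [scorer(text) for text in filled]
    frame = pd.DataFrame(rows, index=lyrics.index)
    return frame[["neg", "pos", "neu", "compound"]]

def toy_scorer(text: str) -> dict:
    # Hypothetical stand-in for SentimentIntensityAnalyzer().polarity_scores.
    positive = 1.0 if "love" in text else 0.0
    return {"neg": 0.0, "neu": 1.0 - positive, "pos": positive, "compound": positive}

lyrics = pd.Series(["love me do", None, "la la la"], index=[10, 11, 12])
scores = scores_to_frame(lyrics, toy_scorer)
print(scores)
```

Passing the scorer as an argument keeps the column handling separate from the sentiment model, so the real analyzer can be swapped in once the lexicon is available.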


Another idea is to use the mean tf-idf to measure how "rich" a song's vocabulary is. The hypothesis is that songs from genres such as pop, which are more mainstream and aimed at a general audience, will have a high tf-idf, whereas genres such as alternative music should use a more "peculiar" vocabulary.

In [153]:
def get_mean_tfidf_from_lyrics(df_music: pd.DataFrame) -> list:
  nltk.download("stopwords")
  # Combine the English, Spanish and French stopword lists; scikit-learn expects a list here.
  stopwords = list(set(nltk.corpus.stopwords.words("english")).union(set(nltk.corpus.stopwords.words("spanish"))).union(set(nltk.corpus.stopwords.words("french"))))
  # Replace missing or empty lyrics with a placeholder, without modifying the input frame.
  lyrics = df_music["lyric"].fillna("").map(lambda lyric: lyric if lyric else "instrumental")
  vectorizer = TfidfVectorizer(input = "content", stop_words = stopwords)
  vectorizer.fit(lyrics)
  # One mean tf-idf value per song.
  return [np.mean(tfidf_vector) for tfidf_vector in vectorizer.transform(lyrics)]
np.max(get_mean_tfidf_from_lyrics(df_music_train_filtered))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


1.0
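
How the mean is taken matters for this feature. The sketch below uses a made-up three-song corpus (an assumption, not the real lyrics) to compare two options: averaging over the whole vocabulary, where most entries are zero, and averaging only over the terms that appear in each song.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up lyrics: two repetitive songs and one with unusual words.
toy_lyrics = [
    "love love love baby baby",
    "love baby tonight",
    "quantum lighthouse whispers beneath obsidian tides",
]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(toy_lyrics)  # CSR matrix, one L2-normalised row per song

# Option 1: mean over every vocabulary term, zeros included.
mean_over_vocabulary = np.asarray(tfidf.mean(axis=1)).ravel()

# Option 2: mean over the terms present in each song.
terms_per_song = np.diff(tfidf.indptr)  # number of stored entries in each CSR row
mean_over_present_terms = np.asarray(tfidf.sum(axis=1)).ravel() / terms_per_song

print(mean_over_vocabulary)
print(mean_over_present_terms)
```

Because each row is L2-normalised by default, both quantities also depend on how many distinct words a song uses, which is worth keeping in mind when reading the feature as a measure of vocabulary richness.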

In [121]:
from sklearn.compose import ColumnTransformer

text_features = ["track_name", "lyric", "artist"]

numerical_features = ["a_songs", "a_popularity", "popularity", "acousticness", "danceability", "duration_ms",
                   "energy", "instrumentalness", "liveness", "loudness", "speechiness", "tempo", "valence"]

ordinal_features = ["mode", "key", "time_signature"]

identity_transformer = preprocessing.FunctionTransformer(None)

full_processor = ColumnTransformer(transformers=[
    ('text', preprocessing.FunctionTransformer(apply_sentiment_analysis_to_lyrics), text_features),
    ('numerical', identity_transformer, numerical_features),
    ('ordinal', ordinalEncoder, ordinal_features)
])

pd.DataFrame(full_processor.fit_transform(df_music_train_filtered))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.036 | 0.125 | 0.839 | 0.9855 | 276.0 | 205.5 | 79.0 | 0.2720 | 0.508 | 261640.0 | 0.720 | 0.000000 | 0.0563 | -5.908 | 0.0628 | 79.983 | 0.472 | 0.0 | 11.0 | 2.0 |
| 1 | 0.036 | 0.125 | 0.839 | 0.9855 | 276.0 | 205.5 | 80.0 | 0.2720 | 0.508 | 261640.0 | 0.720 | 0.000000 | 0.0563 | -5.908 | 0.0628 | 79.983 | 0.472 | 0.0 | 11.0 | 2.0 |
| 2 | 0.036 | 0.125 | 0.839 | 0.9855 | 276.0 | 205.5 | 80.0 | 0.2720 | 0.508 | 261640.0 | 0.720 | 0.000000 | 0.0563 | -5.908 | 0.0628 | 79.983 | 0.472 | 0.0 | 11.0 | 2.0 |
| 3 | 0.035 | 0.157 | 0.808 | 0.9921 | 276.0 | 205.5 | 71.0 | 0.0296 | 0.412 | 319467.0 | 0.441 | 0.072600 | 0.3060 | -11.523 | 0.2910 | 185.571 | 0.174 | 0.0 | 11.0 | 2.0 |
| 4 | 0.035 | 0.157 | 0.808 | 0.9921 | 276.0 | 205.5 | 71.0 | 0.0296 | 0.412 | 319467.0 | 0.441 | 0.072600 | 0.3060 | -11.523 | 0.2910 | 185.571 | 0.174 | 0.0 | 11.0 | 2.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 31378 | 0.025 | 0.059 | 0.915 | 0.5219 | 89.0 | 1.0 | 19.0 | 0.6120 | 0.616 | 190733.0 | 0.822 | 0.000003 | 0.1650 | -8.063 | 0.0561 | 119.962 | 0.748 | 0.0 | 4.0 | 2.0 |
| 31379 | 0.319 | 0.141 | 0.540 | -0.8176 | 89.0 | 1.0 | 34.0 | 0.8890 | 0.457 | 230200.0 | 0.369 | 0.011100 | 0.2310 | -12.515 | 0.0353 | 80.214 | 0.868 | 0.0 | 4.0 | 2.0 |
| 31380 | 0.000 | 0.000 | 1.000 | 0.0000 | 89.0 | 1.0 | 18.0 | 0.7630 | 0.717 | 275640.0 | 0.566 | 0.812000 | 0.1440 | -14.200 | 0.0405 | 117.321 | 0.793 | 0.0 | 6.0 | 1.0 |
| 31381 | 0.104 | 0.000 | 0.896 | -0.8807 | 89.0 | 1.0 | 28.0 | 0.8040 | 0.633 | 204373.0 | 0.553 | 0.000866 | 0.1390 | -8.851 | 0.0376 | 87.608 | 0.738 | 1.0 | 0.0 | 2.0 |
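
The transformed output only has integer column labels. Since `ColumnTransformer` concatenates its outputs in the order the transformers are listed, the labels can be rebuilt as sketched below. The four `lyric_*` names are hypothetical, chosen here for illustration; the feature lists are copied from the cell above, and the sample row is the first row of the output.

```python
import numpy as np
import pandas as pd

# Hypothetical names for the four sentiment scores (negative, positive, neutral, compound).
sentiment_columns = ["lyric_neg", "lyric_pos", "lyric_neu", "lyric_compound"]
numerical_features = ["a_songs", "a_popularity", "popularity", "acousticness", "danceability",
                      "duration_ms", "energy", "instrumentalness", "liveness", "loudness",
                      "speechiness", "tempo", "valence"]
ordinal_features = ["mode", "key", "time_signature"]

# Same order as the transformers: text, numerical, ordinal.
output_columns = sentiment_columns + numerical_features + ordinal_features

# First row of the transformed output shown above.
first_row = np.array([[0.036, 0.125, 0.839, 0.9855, 276.0, 205.5, 79.0, 0.2720, 0.508,
                       261640.0, 0.720, 0.0, 0.0563, -5.908, 0.0628, 79.983, 0.472,
                       0.0, 11.0, 2.0]])
labelled = pd.DataFrame(first_row, columns=output_columns)
print(labelled.T)
```

In the notebook, the same `output_columns` list could be passed as `columns=` when wrapping `full_processor.fit_transform(...)` in a `DataFrame`.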
