### PAIR X MODULE 3: Metrics

For the Pair exercises we used the following dataset: [Spotify Tracks Dataset](https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset)

In [6]:
# Data handling
# ==============================================================================
import numpy as np
import pandas as pd

# Plots
# ==============================================================================
import matplotlib.pyplot as plt
from matplotlib import style
import matplotlib.ticker as ticker
import seaborn as sns

# Multicollinearity (VIF)
# ==============================================================================
from statsmodels.tools.tools import add_constant
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Assumptions and preprocessing
# ==============================================================================
from scipy import stats
import math
from scipy.stats import levene
import researchpy as rp
from sklearn.preprocessing import StandardScaler
import itertools

# ANOVA
# ==============================================================================
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Modeling and evaluation
# ------------------------------------------------------------------------------
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Cross-validation and metrics
# ------------------------------------------------------------------------------
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_validate
from sklearn import metrics
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

The objectives of this pair programming session are:
- Compute the metrics for your model.
- Discuss the metric results and draw conclusions.
- Save the metric results to a CSV file for later use.

In [7]:
df = pd.read_pickle('data/spotify_songs_estandarizado_encoding.pkl')
df.head(2)

|   | popularity | duration_ms | danceability | energy | loudness | speechiness | acousticness | instrumentalness | liveness | valence | ... | owners_F_G | owners_G | owners_G_A | owners_major | owners_minor | owners_compas_1 | owners_compas_2 | owners_compas_3 | owners_compas_4 | owners_compas_5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18 | -0.184255 | -0.157447 | 0.025974 | -0.153345 | 4.038855 | -0.065621 | -0.000916 | 0.100897 | -0.102381 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 1 | 58 | -0.832951 | 0.612766 | -0.194805 | 0.225755 | 7.351738 | -0.116098 | -0.000916 | -0.266256 | -0.038095 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |


In [10]:
X = df.drop("popularity", axis = 1)
y = df["popularity"]

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

lr = LinearRegression(n_jobs=-1)

lr.fit(x_train, y_train)

y_predict_train = lr.predict(x_train)
y_predict_test = lr.predict(x_test)

train_df = pd.DataFrame({'Real': y_train, 'Predicted': y_predict_train, 'Set': ['Train']*len(y_train)})
test_df  = pd.DataFrame({'Real': y_test,  'Predicted': y_predict_test,  'Set': ['Test']*len(y_test)})
resultados = pd.concat([train_df,test_df], axis = 0)


resultados['residuos'] = resultados['Real'] - resultados['Predicted']

In [11]:
resultados_metricas = {
    'MAE': [mean_absolute_error(y_test, y_predict_test), mean_absolute_error(y_train, y_predict_train)],
    'MSE': [mean_squared_error(y_test, y_predict_test), mean_squared_error(y_train, y_predict_train)],
    'RMSE': [np.sqrt(mean_squared_error(y_test, y_predict_test)), np.sqrt(mean_squared_error(y_train, y_predict_train))],
    'R2': [r2_score(y_test, y_predict_test), r2_score(y_train, y_predict_train)],
    'set': ["test", "train"],
    'modelo': ["Linear Regresion", "Linear Regresion"],
}

df_resultados = pd.DataFrame(resultados_metricas)

df_resultados

|   | MAE | MSE | RMSE | R2 | set | modelo |
|---|---|---|---|---|---|---|
| 0 | 18.232066 | 475.870761 | 21.814462 | 0.033295 | test | Linear Regresion |
| 1 | 18.38884 | 485.86475 | 22.04234 | 0.027021 | train | Linear Regresion |
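
To make the four columns of this table concrete, here is a minimal sketch that computes MAE, MSE, RMSE and R2 by hand with NumPy. The `y_true` and `y_pred` arrays are toy values invented for illustration, not taken from the Spotify data; the formulas are the standard definitions that `mean_absolute_error`, `mean_squared_error` and `r2_score` implement.

```python
import numpy as np

# Toy values, invented for illustration only (not from the Spotify dataset).
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 37.0])

errors = y_true - y_pred                        # residuals: real minus predicted

mae = np.mean(np.abs(errors))                   # mean absolute error
mse = np.mean(errors ** 2)                      # mean squared error
rmse = np.sqrt(mse)                             # same units as the response

ss_res = np.sum(errors ** 2)                    # variation the model leaves unexplained
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2 = 1 - ss_res / ss_tot                        # share of variation explained

print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")
```

Because RMSE squares the errors before averaging, it penalizes large misses more than MAE does, which is why RMSE is never smaller than MAE in the table.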


In [12]:
df_resultados.to_csv("data/resultados_spotify.csv",index=False)
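
Since the file is meant to be reused later, it helps to see what `index=False` does when the CSV is read back. The following is a small sketch under stated assumptions: it rebuilds the results table from the values displayed above and writes to an in-memory buffer (`io.StringIO`) rather than to `data/resultados_spotify.csv`, so that it runs without the project folder.

```python
import io

import pandas as pd

# Values copied from the results table shown above.
df_resultados = pd.DataFrame({
    "MAE": [18.232066, 18.38884],
    "MSE": [475.870761, 485.86475],
    "RMSE": [21.814462, 22.04234],
    "R2": [0.033295, 0.027021],
    "set": ["test", "train"],
    "modelo": ["Linear Regresion", "Linear Regresion"],
})

# With index=False only the six data columns are written.
buffer = io.StringIO()
df_resultados.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# With the default index=True the row index is written as an unnamed first
# column, which read_csv then labels "Unnamed: 0".
buffer_idx = io.StringIO()
df_resultados.to_csv(buffer_idx)
buffer_idx.seek(0)
reloaded_idx = pd.read_csv(buffer_idx)

print(list(reloaded.columns))
print(list(reloaded_idx.columns))
```

Keeping `index=False` means that metrics from later models can be concatenated with this file without first dropping a stray index column.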

Looking at R2 and RMSE:

- The model does not predict the response variable adequately (underfitting: no metric is good on either train or test):

    - the predictor variables explain only about 3% of the variation in the dependent variable (R2 of 0.027 on train and 0.033 on test).
    - the model's predictions of the response variable are typically off by about 22 points (RMSE of 21.8 on test and 22.0 on train).
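
The two observations are linked: since R2 = 1 - MSE / Var(y), the reported test metrics are enough to recover the RMSE of a naive baseline that always predicts the mean of the test set. The sketch below uses only the test values displayed in the results table; `rmse_baseline` and `improvement` are names introduced here for illustration.

```python
import math

# Test-set metrics as displayed in the results table.
mse_test = 475.870761
r2_test = 0.033295

# R2 = 1 - MSE / Var(y)  =>  Var(y) = MSE / (1 - R2)
var_y = mse_test / (1 - r2_test)

rmse_model = math.sqrt(mse_test)
rmse_baseline = math.sqrt(var_y)   # RMSE of always predicting the test mean

# Relative reduction in RMSE achieved by the regression over that baseline.
improvement = 1 - rmse_model / rmse_baseline

print(f"model RMSE:    {rmse_model:.2f}")
print(f"baseline RMSE: {rmse_baseline:.2f}")
print(f"improvement:   {improvement:.1%}")
```

The baseline RMSE comes out at roughly 22.2, so the linear regression reduces the error by less than 2% compared with predicting the mean for every track, which is consistent with the underfitting diagnosis.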
