**Model Objective**

The goal is to predict the genre of a playlist (`playlist_genre`) from the characteristics of its songs.

In [39]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [40]:
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
# Optimization module from SciPy
from scipy import optimize

In [41]:
df = pd.read_csv('/content/drive/MyDrive/datasets/spotify_songs.csv')
df.head()

Unnamed: 0,track_id,track_name,track_artist,track_popularity,track_album_id,track_album_name,track_album_release_date,playlist_name,playlist_id,playlist_genre,...,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms
0,6f807x0ima9a1j3VPbc7VN,I Don't Care (with Justin Bieber) - Loud Luxur...,Ed Sheeran,66,2oCs0DGTsRO98Gh5ZSl2Cx,I Don't Care (with Justin Bieber) [Loud Luxury...,2019-06-14,Pop Remix,37i9dQZF1DXcZDD7cfEKhW,pop,...,6,-2.634,1,0.0583,0.102,0.0,0.0653,0.518,122.036,194754
1,0r7CVbZTWZgbTCYdfa2P31,Memories - Dillon Francis Remix,Maroon 5,67,63rPSO264uRjW1X5E6cWv6,Memories (Dillon Francis Remix),2019-12-13,Pop Remix,37i9dQZF1DXcZDD7cfEKhW,pop,...,11,-4.969,1,0.0373,0.0724,0.00421,0.357,0.693,99.972,162600
2,1z1Hg7Vb0AhHDiEmnDE79l,All the Time - Don Diablo Remix,Zara Larsson,70,1HoSmj2eLcsrR0vE9gThr4,All the Time (Don Diablo Remix),2019-07-05,Pop Remix,37i9dQZF1DXcZDD7cfEKhW,pop,...,1,-3.432,0,0.0742,0.0794,2.3e-05,0.11,0.613,124.008,176616
3,75FpbthrwQmzHlBJLuGdC7,Call You Mine - Keanu Silva Remix,The Chainsmokers,60,1nqYsOef1yKKuGOVchbsk6,Call You Mine - The Remixes,2019-07-19,Pop Remix,37i9dQZF1DXcZDD7cfEKhW,pop,...,7,-3.778,1,0.102,0.0287,9e-06,0.204,0.277,121.956,169093
4,1e8PAfcKUYoKkxPhrHqw4x,Someone You Loved - Future Humans Remix,Lewis Capaldi,69,7m7vv9wlQ4i0LFuJiE2zsQ,Someone You Loved (Future Humans Remix),2019-03-05,Pop Remix,37i9dQZF1DXcZDD7cfEKhW,pop,...,1,-4.672,1,0.0359,0.0803,0.0,0.0833,0.725,123.976,189052


In [42]:
df.shape

(32833, 23)

In [43]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32833 entries, 0 to 32832
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   track_id                  32833 non-null  object 
 1   track_name                32828 non-null  object 
 2   track_artist              32828 non-null  object 
 3   track_popularity          32833 non-null  int64  
 4   track_album_id            32833 non-null  object 
 5   track_album_name          32828 non-null  object 
 6   track_album_release_date  32833 non-null  object 
 7   playlist_name             32833 non-null  object 
 8   playlist_id               32833 non-null  object 
 9   playlist_genre            32833 non-null  object 
 10  playlist_subgenre         32833 non-null  object 
 11  danceability              32833 non-null  float64
 12  energy                    32833 non-null  float64
 13  key                       32833 non-null  int64  
 14  loudne

In [44]:
pd.DataFrame({'count': df.shape[0],
              'nulls': df.isnull().sum(),
              'nulls%': df.isnull().mean() * 100,
              'cardinality': df.nunique(),
             })

Unnamed: 0,count,nulls,nulls%,cardinality
track_id,32833,0,0.0,28356
track_name,32833,5,0.015229,23449
track_artist,32833,5,0.015229,10692
track_popularity,32833,0,0.0,101
track_album_id,32833,0,0.0,22545
track_album_name,32833,5,0.015229,19743
track_album_release_date,32833,0,0.0,4530
playlist_name,32833,0,0.0,449
playlist_id,32833,0,0.0,471
playlist_genre,32833,0,0.0,6



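As the table shows, `track_name`, `track_artist`, and `track_album_name` each have 5 null values (about 0.015% of the rows). Rather than dropping those rows, we fill the missing values with the placeholder 'N/A' in the cells below, after first checking for duplicate rows.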

In [45]:
df.duplicated().sum()  # The result is zero, so there are no fully duplicated rows.

0

In [46]:
# Fill the missing values in the three text columns with a placeholder.
# Assigning the result back avoids the chained-assignment warning that inplace=True raises on a single column.
for col in ['track_artist', 'track_album_name', 'track_name']:
    df[col] = df[col].fillna('N/A')

In [47]:
df.isnull().sum()

track_id                    0
track_name                  0
track_artist                0
track_popularity            0
track_album_id              0
track_album_name            0
track_album_release_date    0
playlist_name               0
playlist_id                 0
playlist_genre              0
playlist_subgenre           0
danceability                0
energy                      0
key                         0
loudness                    0
mode                        0
speechiness                 0
acousticness                0
instrumentalness            0
liveness                    0
valence                     0
tempo                       0
duration_ms                 0
dtype: int64

#### Encoding

We convert the categorical variables to numeric ones.

Why use encoding?

Most machine learning models operate on numbers, so encoding lets them take categorical information as input and use it to make predictions.
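As a small illustration of what label encoding does, the sketch below maps a hypothetical toy column (not taken from the dataset) to integer codes with `np.unique`. This is an assumption-level stand-in: it should give the same kind of result as `LabelEncoder`, which also numbers the sorted unique values starting at 0.

```python
import numpy as np

# Hypothetical toy column, not taken from the dataset
genres = np.array(["rock", "pop", "rap", "pop", "rock"])

# classes: sorted unique values; codes: position of each value in classes
classes, codes = np.unique(genres, return_inverse=True)

print(classes.tolist())         # the sorted class names
print(codes.tolist())           # the integer code of each original value
print(classes[codes].tolist())  # decoding the codes recovers the original strings
```

Note that the integers only reflect alphabetical order; a model that treats them as magnitudes will read an ordering into the categories that is not really there.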

In [48]:
from sklearn.preprocessing import LabelEncoder  # Import the LabelEncoder class
# Identify the categorical columns
categorical_columns = df.select_dtypes(include=['object']).columns

# Create a label encoder object
le = LabelEncoder()

# Apply label encoding to each categorical column.
# The same encoder is refit on every column, so after the loop
# le.classes_ only describes the last column processed.
for col in categorical_columns:
    df[col] = le.fit_transform(df[col])

#### Separating the independent and dependent variables

In [49]:
# Dependent variable (target)
y = df['playlist_genre']
y

0        2
1        2
2        2
3        2
4        2
        ..
32828    0
32829    0
32830    0
32831    0
32832    0
Name: playlist_genre, Length: 32833, dtype: int64

In [50]:
# Independent variables (features)
X = df.drop(columns=['playlist_genre'])
X.head()

Unnamed: 0,track_id,track_name,track_artist,track_popularity,track_album_id,track_album_name,track_album_release_date,playlist_name,playlist_id,playlist_subgenre,...,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms
0,24150,8898,2782,66,8225,7614,4315,292,235,3,...,6,-2.634,1,0.0583,0.102,0.0,0.0653,0.518,122.036,194754
1,3061,12520,6084,67,17650,10410,4492,292,235,3,...,11,-4.969,1,0.0373,0.0724,0.00421,0.357,0.693,99.972,162600
2,7219,924,10417,70,3798,985,4335,292,235,3,...,1,-3.432,0,0.0742,0.0794,2.3e-05,0.11,0.613,124.008,176616
3,25699,3020,9216,60,5293,2798,4348,292,235,3,...,7,-3.778,1,0.102,0.0287,9e-06,0.204,0.277,121.956,169093
4,5987,17911,5402,69,21936,14844,4220,292,235,3,...,1,-4.672,1,0.0359,0.0803,0.0,0.0833,0.725,123.976,189052


#### Train/Test Split

In [51]:
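# Drop the identifier, name, and date columns along with the target.
# track_artist and playlist_subgenre are not in this list, so they stay in X as label-encoded features.
cols_to_drop = ['track_id', 'track_name', 'track_album_id', 'track_album_name', 'track_album_release_date', 'playlist_name', 'playlist_id']
X = df.drop(columns=cols_to_drop + ['playlist_genre'])
y = df['playlist_genre']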

In [52]:
from sklearn.model_selection import train_test_split
# test_size=0.2 holds out 20% of the rows for testing; the remaining 80% are used for training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

#### Standard Scaling

In [53]:
# Feature Scaling
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train_s = sc.fit_transform(X_train)
X_test_s = sc.transform(X_test)
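The scaling step can be reproduced by hand, which makes it clearer why the scaler is fitted on the training split only. The sketch below uses hypothetical toy arrays (not the Spotify data) and assumes the usual standardization formula z = (x - mean) / std with the population standard deviation, which is what `StandardScaler` computes by default.

```python
import numpy as np

# Hypothetical toy data: 4 training rows and 2 test rows, 2 features each
X_train_toy = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])
X_test_toy = np.array([[2.5, 250.0], [5.0, 500.0]])

# Statistics come from the training split only
mu = X_train_toy.mean(axis=0)
sigma = X_train_toy.std(axis=0)  # population std (ddof=0)

# The same training statistics are applied to both splits
X_train_scaled = (X_train_toy - mu) / sigma
X_test_scaled = (X_test_toy - mu) / sigma

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
print(X_test_scaled)
```

Reusing the training mean and standard deviation on the test split keeps information about the test rows out of the preprocessing step.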

In [54]:
# Number of classes in the target. `le` was last fit on playlist_subgenre,
# so len(le.classes_) would report the 24 subgenres rather than the 6 genres.
num_labels = y_train.nunique()
print("Shape of X_train_s:", X_train_s.shape)
print("Number of labels (num_labels):", num_labels)

Shape of X_train_s: (26266, 15)
Number of labels (num_labels): 6


#### Applying Multiclass Logistic Regression with One-vs-All

**What is multiclass logistic regression?**

Logistic regression is a statistical method for modeling the probability of a categorical dependent variable. In its basic form it handles binary problems. In many real situations, however, the label takes more than two values, and that is where multiclass logistic regression becomes relevant.
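For reference, the binary model can be written out as follows. This is the standard regularized logistic regression formulation, and it is meant to correspond to the `lrCostFunction` cell later in this notebook. The notation ($m$ samples, $n$ features, regularization strength $\lambda$, intercept $\theta_0$ left unregularized) is assumed here rather than taken from the original text.

$$
h_\theta(x) = \sigma(\theta^\top x) = \frac{1}{1 + e^{-\theta^\top x}}
$$

$$
J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \Big[ -y^{(i)} \log h_\theta(x^{(i)}) - (1 - y^{(i)}) \log\big(1 - h_\theta(x^{(i)})\big) \Big] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2
$$

$$
\frac{\partial J}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big)\, x_0^{(i)},
\qquad
\frac{\partial J}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big)\, x_j^{(i)} + \frac{\lambda}{m}\, \theta_j \quad (j \ge 1)
$$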

**Why use One-vs-All?**

One-vs-All is a technique that extends binary logistic regression to multiclass classification. It trains one binary classifier per class, each predicting whether or not an observation belongs to its own class. At prediction time, the class whose classifier reports the highest probability is chosen.


In [61]:
def sigmoid(z):
    """
    Compute the sigmoid of z.
    """
    return 1.0 / (1.0 + np.exp(-z))


def lrCostFunction(theta, X, y, lambda_):
    """
    Regularized logistic regression cost and gradient.
    Returns (J, grad) so it can be passed to optimize.minimize with jac=True.
    """
    m = y.size

    # Convert boolean labels to integers
    if y.dtype == bool:
        y = y.astype(int)

    h = sigmoid(X.dot(theta))

    # Copy theta before zeroing the bias term so the caller's array is not modified;
    # the bias (index 0) is excluded from regularization
    temp = theta.copy()
    temp[0] = 0

    J = (1 / m) * (-y.dot(np.log(h)) - (1 - y).dot(np.log(1 - h))) + (lambda_ / (2 * m)) * np.sum(np.square(temp))

    grad = (1 / m) * (h - y).dot(X) + (lambda_ / m) * temp

    return J, grad
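One way to gain confidence in a hand-written gradient is to compare it with central finite differences. The sketch below is self-contained: `cost_and_grad` is a hypothetical stand-in that follows the same regularized formula as `lrCostFunction`, and the toy data, seed, and step size `eps` are assumptions chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grad(theta, X, y, lambda_):
    """Regularized logistic cost and gradient; the bias theta[0] is not regularized."""
    m = y.size
    h = sigmoid(X.dot(theta))
    reg = theta.copy()  # copy so the caller's theta is left untouched
    reg[0] = 0.0
    J = (1 / m) * (-y.dot(np.log(h)) - (1 - y).dot(np.log(1 - h))) \
        + (lambda_ / (2 * m)) * np.sum(reg ** 2)
    grad = (1 / m) * (h - y).dot(X) + (lambda_ / m) * reg
    return J, grad

# Hypothetical toy problem: 20 samples, intercept column plus 3 features
rng = np.random.default_rng(0)
X_toy = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 3))])
y_toy = (rng.random(20) > 0.5).astype(float)
theta = rng.normal(scale=0.1, size=4)
theta_before = theta.copy()
lambda_ = 0.1

_, analytic = cost_and_grad(theta, X_toy, y_toy, lambda_)

# Central finite differences, one coordinate at a time
eps = 1e-6
numeric = np.zeros_like(theta)
for j in range(theta.size):
    step = np.zeros_like(theta)
    step[j] = eps
    J_plus, _ = cost_and_grad(theta + step, X_toy, y_toy, lambda_)
    J_minus, _ = cost_and_grad(theta - step, X_toy, y_toy, lambda_)
    numeric[j] = (J_plus - J_minus) / (2 * eps)

max_diff = np.max(np.abs(analytic - numeric))
print("Largest gap between analytic and numeric gradient:", max_diff)
```

A gap close to zero suggests the gradient matches the cost. The check also confirms that evaluating the function leaves `theta` unchanged, which matters because the optimizer reuses that array between iterations.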

#### One-vs-All Classification


In [56]:
def oneVsAll(X, y, num_labels, lambda_):
    # Use the shape of the X that was passed in, not the global X_train_s
    m, n = X.shape

    all_theta = np.zeros((num_labels, n + 1))

    # Add a column of ones (intercept term) to X
    X = np.concatenate([np.ones((m, 1)), X], axis=1)

    # Train one binary classifier per class: class c against all the others
    for c in np.arange(num_labels):
        initial_theta = np.zeros(n + 1)
        options = {'maxiter': 50}
        res = optimize.minimize(lrCostFunction,
                                initial_theta,
                                (X, (y == c), lambda_),
                                jac=True,
                                method='CG',
                                options=options)

        all_theta[c] = res.x

    return all_theta
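The `(y == c)` argument in `oneVsAll` is what turns the multiclass target into a binary one for each classifier. A minimal sketch of that relabeling, using a hypothetical label vector with three classes:

```python
import numpy as np

# Hypothetical labels for a 3-class problem
y_toy = np.array([0, 2, 1, 2, 0])
num_classes = 3

# Row c holds the binary target used to train the classifier for class c
binary_targets = np.stack([(y_toy == c).astype(int) for c in range(num_classes)])

for c in range(num_classes):
    print("class", c, "vs rest:", binary_targets[c].tolist())
```

Each sample is positive for exactly one classifier and negative for all the others, so every column of `binary_targets` sums to 1.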

In [57]:
num_labels = len(np.unique(y_train))
lambda_ = 0.1
all_theta = oneVsAll(X_train_s, y_train, num_labels, lambda_)
print("Shape of all_theta:", all_theta.shape)
print("Theta for each class:\n", all_theta)

Shape of all_theta: (6, 16)
Theta for each class:
 [[-2.07954436e+00 -6.28243935e-03 -2.88438685e-01 -3.88885305e-02
   4.96875940e-01  6.25918368e-01 -1.26883624e-02  4.30561199e-01
  -1.24855920e-01 -3.49040298e-01 -2.97128130e-01  4.98298109e-01
   8.37701379e-02 -7.66759598e-01  2.46011035e-01 -1.96034595e-01]
 [-2.06529497e+00 -5.09908246e-02  1.23666243e-01  7.31488511e-01
   3.98273240e-01  2.61778127e-01  2.76075249e-02  6.68092534e-02
   4.39057806e-02 -2.13969821e-01  3.07688096e-01 -1.57540582e-01
  -3.14043564e-02  3.68492802e-01 -3.47942182e-02 -1.61497162e-01]
 [-1.88579608e+00 -4.49614388e-02  2.33304563e-01 -6.52738711e-01
  -2.60000260e-03 -2.39600430e-01 -1.75572904e-03  3.34751889e-01
   2.52219532e-03 -5.98109993e-01 -1.26315848e-02 -1.72755931e-01
  -1.04477330e-01 -1.54392240e-02  1.82886562e-02 -1.69804937e-01]
 [-2.05023585e+00 -9.55216643e-02 -1.54555128e-01  6.03912478e-01
  -8.87355279e-02 -7.27261979e-01 -1.35845913e-03 -8.87897336e-03
  -1.10887779e-01  5.

#### One-vs-All Prediction

In [58]:
def predictOneVsAll(all_theta, X):
    m = X.shape[0]

    # Add a column of ones (intercept term) to X
    X = np.concatenate([np.ones((m, 1)), X], axis=1)

    # Each column holds one classifier's probability; pick the most confident class per row
    p = np.argmax(sigmoid(X.dot(all_theta.T)), axis=1)

    return p
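To see the argmax step in isolation, the sketch below applies the same computation to hypothetical hand-picked parameters (three classes, one feature plus an intercept). The values are made up for illustration and are not the fitted `all_theta`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters: one row per class, columns are [intercept, weight]
all_theta_toy = np.array([[0.0, -2.0],   # class 0 scores high for small x
                          [1.0,  0.0],   # class 1 has a constant score
                          [0.0,  2.0]])  # class 2 scores high for large x

X_toy = np.array([[-1.5], [0.0], [1.5]])
X_aug = np.hstack([np.ones((X_toy.shape[0], 1)), X_toy])

probs = sigmoid(X_aug @ all_theta_toy.T)  # shape (samples, classes)
pred_toy = probs.argmax(axis=1)

print(np.round(probs, 3))
print(pred_toy.tolist())
```

The rows of `probs` do not sum to 1, because each classifier is trained independently. Only their relative order is used to pick the class.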

In [59]:
print("Shape of X_train_s:", X_train_s.shape)
print("Shape of X_test_s:", X_test_s.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

pred = predictOneVsAll(all_theta, X_train_s)
print('Training set accuracy: {:.2f}%'.format(np.mean(pred == y_train) * 100))

# Predictions on the test set
predTest = predictOneVsAll(all_theta, X_test_s)
print('Test set accuracy: {:.2f}%'.format(np.mean(predTest == y_test) * 100))

Shape of X_train_s: (26266, 15)
Shape of X_test_s: (6567, 15)
Shape of y_train: (26266,)
Shape of y_test: (6567,)
Training set accuracy: 50.66%
Test set accuracy: 51.83%


##### Model Building, Training, Testing and comparing actual vs predicted values

In [60]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

# With the default lbfgs solver, LogisticRegression fits a multinomial model for multiclass targets,
# so the multi_class='multinomial' argument (deprecated in recent scikit-learn releases) is omitted
model = LogisticRegression(max_iter=5000)
model.fit(X_train_s, y_train)

# Model evaluation
y_pred = model.predict(X_test_s)
y_predTrain = model.predict(X_train_s)

print("Test accuracy:", accuracy_score(y_test, y_pred))
print("Train accuracy:", accuracy_score(y_train, y_predTrain))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Test accuracy: 0.5124105375361657
Train accuracy: 0.5078428386507272

Confusion Matrix:
 [[723 137 151  75  82  74]
 [135 427 122 143 155  44]
 [157 123 426  88  80 218]
 [ 48 154 116 517 197  46]
 [116 157 111 171 571  20]
 [ 83  27 111  49  12 701]]

Classification Report:
               precision    recall  f1-score   support

           0       0.57      0.58      0.58      1242
           1       0.42      0.42      0.42      1026
           2       0.41      0.39      0.40      1092
           3       0.50      0.48      0.49      1078
           4       0.52      0.50      0.51      1146
           5       0.64      0.71      0.67       983

    accuracy                           0.51      6567
   macro avg       0.51      0.51      0.51      6567
weighted avg       0.51      0.51      0.51      6567
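The figures in the classification report can be recomputed directly from the confusion matrix printed above, which shows how the two outputs relate. The sketch below copies that matrix by hand and assumes scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true class, columns: predicted class)
cm = np.array([[723, 137, 151,  75,  82,  74],
               [135, 427, 122, 143, 155,  44],
               [157, 123, 426,  88,  80, 218],
               [ 48, 154, 116, 517, 197,  46],
               [116, 157, 111, 171, 571,  20],
               [ 83,  27, 111,  49,  12, 701]])

correct = np.diag(cm)                 # correctly classified samples per class
accuracy = correct.sum() / cm.sum()   # share of all samples on the diagonal
recall = correct / cm.sum(axis=1)     # diagonal over row totals (true counts)
precision = correct / cm.sum(axis=0)  # diagonal over column totals (predicted counts)

print("accuracy:", accuracy)
print("recall per class:", np.round(recall, 2))
print("precision per class:", np.round(precision, 2))
```

The row totals reproduce the `support` column of the report, and the off-diagonal entries show which classes are confused with each other. For example, 218 samples of class 2 are predicted as class 5.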

