In [1]:
import pandas as pd

### Extraction

#### Load the dataset from a CSV file

In [2]:
df = pd.read_csv("../data/spotify_dataset.csv")
print(df.head())  

   Unnamed: 0                track_id                 artists  \
0           0  5SuOikwiRyPMVoIQDJUgSV             Gen Hoshino   
1           1  4qPNDBW1i3p13qLCt0Ki3A            Ben Woodward   
2           2  1iJBSr7s7jYXzM8EGcbK5b  Ingrid Michaelson;ZAYN   
3           3  6lfxq3CG4xtTiEg7opyCyx            Kina Grannis   
4           4  5vjLSffimiIP26QG5WcN2K        Chord Overstreet   

                                          album_name  \
0                                             Comedy   
1                                   Ghost (Acoustic)   
2                                     To Begin Again   
3  Crazy Rich Asians (Original Motion Picture Sou...   
4                                            Hold On   

                   track_name  popularity  duration_ms  explicit  \
0                      Comedy          73       230666     False   
1            Ghost - Acoustic          55       149610     False   
2              To Begin Again          57       210826     False   


### Drop the "Unnamed: 0" column

In [3]:
df = df.drop(columns=["Unnamed: 0"])

The presence of Unnamed: 0 suggests that the CSV was written with an unnamed index column. It duplicates the default row index, so dropping it is justified.
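The following is a minimal sketch of how an unnamed index column can end up as "Unnamed: 0", using a small toy frame (the values are illustrative, not taken from the real file). It also shows `index_col=0` as one possible alternative to dropping the column after loading.

```python
import io

import pandas as pd

# Toy frame standing in for the Spotify data (illustrative values only).
toy = pd.DataFrame({"track_id": ["a1", "b2"], "popularity": [73, 55]})

# to_csv() writes the index by default, as a first column with an empty header.
csv_text = toy.to_csv()

# Reading the text back without options turns that column into "Unnamed: 0".
naive = pd.read_csv(io.StringIO(csv_text))

# index_col=0 tells pandas to use the first column as the index instead.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(naive.columns.tolist())
print(restored.columns.tolist())
```

Whether to drop the column or read it as the index is a matter of preference here, since the notebook resets the index later anyway.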

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 114000 entries, 0 to 113999
Data columns (total 20 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   track_id          114000 non-null  object 
 1   artists           113999 non-null  object 
 2   album_name        113999 non-null  object 
 3   track_name        113999 non-null  object 
 4   popularity        114000 non-null  int64  
 5   duration_ms       114000 non-null  int64  
 6   explicit          114000 non-null  bool   
 7   danceability      114000 non-null  float64
 8   energy            114000 non-null  float64
 9   key               114000 non-null  int64  
 10  loudness          114000 non-null  float64
 11  mode              114000 non-null  int64  
 12  speechiness       114000 non-null  float64
 13  acousticness      114000 non-null  float64
 14  instrumentalness  114000 non-null  float64
 15  liveness          114000 non-null  float64
 16  valence           11

After dropping Unnamed: 0, the DataFrame has 114,000 rows and 20 columns.
Most columns have 114,000 non-null values; artists, album_name, and track_name have 113,999 (one null each).
Data types: 5 object (string), 9 float, 5 integer, 1 boolean.

## Null values

In [5]:
null_values= df.isnull().sum()
print("Valores nulos:\n", null_values)

Valores nulos:
 track_id            0
artists             1
album_name          1
track_name          1
popularity          0
duration_ms         0
explicit            0
danceability        0
energy              0
key                 0
loudness            0
mode                0
speechiness         0
acousticness        0
instrumentalness    0
liveness            0
valence             0
tempo               0
time_signature      0
track_genre         0
dtype: int64


The same null count in three columns suggests that the nulls are concentrated in a single row, probably a corrupt or incomplete entry. The next cell checks this.
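One way to test the single-row hypothesis directly is to count nulls per row rather than per column. The sketch below uses a small toy frame (hypothetical values) with the same pattern of one row missing three fields.

```python
import pandas as pd

# Toy frame: one row is missing artists, album_name and track_name.
toy = pd.DataFrame({
    "track_id": ["a1", "b2", "c3"],
    "artists": ["X", None, "Z"],
    "album_name": ["A", None, "C"],
    "track_name": ["S", None, "U"],
})

# Rows that contain at least one null value.
rows_with_nulls = toy[toy.isnull().any(axis=1)]

# Number of null fields in each row.
nulls_per_row = toy.isnull().sum(axis=1)

print(len(rows_with_nulls))
print(nulls_per_row.max())
```

If the number of affected rows is 1 and the maximum per-row null count equals the number of columns with nulls, all missing values sit in the same record.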

In [6]:
df[df["artists"].isnull()]

Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
65900,1kR4gIb7nGxHPI3D2ifs59,,,,0,0,False,0.501,0.583,7,-9.46,0,0.0605,0.69,0.00396,0.0747,0.734,138.391,4,k-pop


Since this row carries no useful information about the song (no artist, album, or track name, and both popularity and duration are 0), dropping it is reasonable.

In [7]:
df = (df
      .dropna()
      .reset_index(drop=True))

In [8]:
null_values= df.isnull().sum()
print("Valores nulos:\n", null_values)

Valores nulos:
 track_id            0
artists             0
album_name          0
track_name          0
popularity          0
duration_ms         0
explicit            0
danceability        0
energy              0
key                 0
loudness            0
mode                0
speechiness         0
acousticness        0
instrumentalness    0
liveness            0
valence             0
tempo               0
time_signature      0
track_genre         0
dtype: int64


## Duplicates

### Identify exact duplicates (fully identical rows)

In [9]:
duplicados = df.duplicated()

print(df[duplicados])

                      track_id                  artists  \
1925    0CDucx9lKxuCZplLXUz0iX   Buena Onda Reggae Club   
2155    2aibwv5hGXSgw7Yru8IYTO    Red Hot Chili Peppers   
3738    7mULVp0DJrI2Nd6GesLvxn             Joy Division   
4648    6d3RIvHfVkoOtW1WHXmbX3          Little Symphony   
5769    481beimUiUnMUzSbOAFcUT             SUPER BEAVER   
...                        ...                      ...   
111245  0sSjIvTvd6fUSZZ5rnTPDW  Everything But The Girl   
111361  2zg3iJW4fK7KZgHOvJU67z                Faithless   
111979  46FPub2Fewe7XrgM0smTYI                Morcheeba   
112967  6qVA1MqDrDKfk9144bhoKp              Acil Servis   
113344  5WaioelSGekDk3UNQy8zaw              Matt Redman   

                                              album_name  \
1925                                             Disco 2   
2155                                    Stadium Arcadium   
3738                                  Timeless Rock Hits   
4648                                            Ser

In [10]:
df = df.drop_duplicates()

### Identify duplicates by 'track_id' (same ID but possibly different values in other columns)

In [11]:
duplicados_id = df[df.duplicated(subset='track_id')]
print(duplicados_id)

                      track_id                                     artists  \
3000    5E30LdtzQTGqRvNd7l6kG5                           The Neighbourhood   
3002    2K7xn816oNHJZ0aVqdQsha                           The Neighbourhood   
3003    2QjOHCTQ1Jl3zawyYOpxh6                           The Neighbourhood   
3011    6rrKbzJGGDlSZgLphopS49                                   The Score   
3012    0AUyNF6iFxMNQsNx2nhtrw                                    grandson   
...                        ...                                         ...   
113571  1saXUvvFlAQaefZUFVmhCn                   Bethel Music;Paul McClure   
113604  1Q5jFp1g2Ns4gBsHRpcqhu  Bethel Music;Jenn Johnson;Michaela Gentile   
113616  71dLJx3qHOTQMTvvoE2dmd                    Bethel Music;Amanda Cook   
113618  6OG5TBCmuTOuWCzSGsETrE     Bethel Music;Brian Johnson;Jenn Johnson   
113640  7xsirhcgFWOnItsGuBfrv9            Bethel Music;Steffany Gretzinger   

                                             album_name  \
3000

There are 23,809 rows duplicated by track_id, which suggests that the same songs are repeated under different track_genre values.
Example: "Sweater Weather" by The Neighbourhood appears multiple times.

In [12]:
conteo_ids = df['track_id'].value_counts()
duplicados_reales = conteo_ids[conteo_ids > 1]

print(f"Total de track_id únicos con duplicados: {duplicados_reales.shape[0]}")
print(duplicados_reales.head())

Total de track_id únicos con duplicados: 16299
track_id
6S3JlDAGk3uu3NtZbPnuhS    9
2kkvB3RNRzwjFdGhaUA0tz    8
2Ey6v4Sekh3Z0RUSISRosD    8
2aaClnypAakdAmLw74JXxB    7
5sqkarfxe7UejHTlCtHCLS    7
Name: count, dtype: int64


In [13]:
# Example with one of the duplicates
df[df['track_id'] == '2kkvB3RNRzwjFdGhaUA0tz']

Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
8262,2kkvB3RNRzwjFdGhaUA0tz,Derek & The Dominos,Layla And Other Assorted Love Songs (Remastere...,Layla,74,423840,False,0.404,0.902,1,-3.88,1,0.0665,0.577,0.297,0.287,0.497,115.669,4,blues
11170,2kkvB3RNRzwjFdGhaUA0tz,Derek & The Dominos,Layla And Other Assorted Love Songs (Remastere...,Layla,74,423840,False,0.404,0.902,1,-3.88,1,0.0665,0.577,0.297,0.287,0.497,115.669,4,british
19915,2kkvB3RNRzwjFdGhaUA0tz,Derek & The Dominos,Layla And Other Assorted Love Songs (Remastere...,Layla,74,423840,False,0.404,0.902,1,-3.88,1,0.0665,0.577,0.297,0.287,0.497,115.669,4,country
34974,2kkvB3RNRzwjFdGhaUA0tz,Derek & The Dominos,Layla And Other Assorted Love Songs (Remastere...,Layla,74,423840,False,0.404,0.902,1,-3.88,1,0.0665,0.577,0.297,0.287,0.497,115.669,4,folk
47257,2kkvB3RNRzwjFdGhaUA0tz,Derek & The Dominos,Layla And Other Assorted Love Songs (Remastere...,Layla,74,423840,False,0.404,0.902,1,-3.88,1,0.0665,0.577,0.297,0.287,0.497,115.669,4,hard-rock
84163,2kkvB3RNRzwjFdGhaUA0tz,Derek & The Dominos,Layla And Other Assorted Love Songs (Remastere...,Layla,74,423840,False,0.404,0.902,1,-3.88,1,0.0665,0.577,0.297,0.287,0.497,115.669,4,psych-rock
99874,2kkvB3RNRzwjFdGhaUA0tz,Derek & The Dominos,Layla And Other Assorted Love Songs (Remastere...,Layla,74,423840,False,0.404,0.902,1,-3.88,1,0.0665,0.577,0.297,0.287,0.497,115.669,4,singer-songwriter
102883,2kkvB3RNRzwjFdGhaUA0tz,Derek & The Dominos,Layla And Other Assorted Love Songs (Remastere...,Layla,74,423840,False,0.404,0.902,1,-3.88,1,0.0665,0.577,0.297,0.287,0.497,115.669,4,songwriter


In this example, the rows that share a track_id differ only in track_genre; every other attribute is identical.
This suggests that the dataset classifies a single song under multiple genres, which calls for consolidation.
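The claim that only track_genre varies within a track_id can be checked for the whole table rather than for one example. The sketch below shows one way to do this on a toy frame. The data is hypothetical and deliberately includes a second varying column to show what a counterexample would look like.

```python
import pandas as pd

# Toy frame: t1 varies only in track_genre, t2 varies only in popularity.
toy = pd.DataFrame({
    "track_id": ["t1", "t1", "t2", "t2", "t3"],
    "track_name": ["Layla", "Layla", "Song B", "Song B", "Song C"],
    "popularity": [74, 74, 10, 12, 50],
    "track_genre": ["blues", "folk", "pop", "pop", "rock"],
})

# Number of distinct values per column within each track_id.
distinct = toy.groupby("track_id").nunique()

# Columns that take more than one value inside at least one track_id group.
varying_cols = distinct.columns[(distinct > 1).any()].tolist()

print(varying_cols)
```

On the real DataFrame, a result containing only `track_genre` would confirm the observation; any other column in the list would point to duplicates that need a closer look.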

#### Simple mapping of genres to broader categories

In [14]:
df["track_genre"].unique()


array(['acoustic', 'afrobeat', 'alt-rock', 'alternative', 'ambient',
       'anime', 'black-metal', 'bluegrass', 'blues', 'brazil',
       'breakbeat', 'british', 'cantopop', 'chicago-house', 'children',
       'chill', 'classical', 'club', 'comedy', 'country', 'dance',
       'dancehall', 'death-metal', 'deep-house', 'detroit-techno',
       'disco', 'disney', 'drum-and-bass', 'dub', 'dubstep', 'edm',
       'electro', 'electronic', 'emo', 'folk', 'forro', 'french', 'funk',
       'garage', 'german', 'gospel', 'goth', 'grindcore', 'groove',
       'grunge', 'guitar', 'happy', 'hard-rock', 'hardcore', 'hardstyle',
       'heavy-metal', 'hip-hop', 'honky-tonk', 'house', 'idm', 'indian',
       'indie-pop', 'indie', 'industrial', 'iranian', 'j-dance', 'j-idol',
       'j-pop', 'j-rock', 'jazz', 'k-pop', 'kids', 'latin', 'latino',
       'malay', 'mandopop', 'metal', 'metalcore', 'minimal-techno', 'mpb',
       'new-age', 'opera', 'pagode', 'party', 'piano', 'pop-film', 'pop',
       'pow

In [None]:


genre_categories = {
    'rock': 'Rock', 'rockabilly': 'Rock', 'alt-rock': 'Rock', 'alternative': 'Rock', 'emo': 'Rock', 
    'goth': 'Rock', 'grunge': 'Rock', 'hard-rock': 'Rock', 'psych-rock': 'Rock', 'punk-rock': 'Rock', 
    'rock-n-roll': 'Rock',
    'pop': 'Pop', 'power-pop': 'Pop', 'j-pop': 'Pop', 'k-pop': 'Pop', 'synth-pop': 'Pop', 
    'indie-pop': 'Pop', 'cantopop': 'Pop', 'mandopop': 'Pop',
    'electronic': 'Electronic', 'edm': 'Electronic', 'techno': 'Electronic', 'house': 'Electronic', 
    'trance': 'Electronic', 'idm': 'Electronic', 'hardstyle': 'Electronic', 'progressive-house': 'Electronic', 
    'minimal-techno': 'Electronic', 'electro': 'Electronic', 'breakbeat': 'Electronic', 'drum-and-bass': 'Electronic', 
    'dubstep': 'Electronic', 'chicago-house': 'Electronic', 'deep-house': 'Electronic', 'detroit-techno': 'Electronic', 
    'industrial': 'Electronic', 'trip-hop': 'Electronic',
    'classical': 'Classical', 'opera': 'Classical', 'new-age': 'Classical', 'piano': 'Classical', 
    'ambient': 'Classical',
    'folk': 'Folk', 'acoustic': 'Folk', 'bluegrass': 'Folk', 'country': 'Folk', 'honky-tonk': 'Folk', 
    'guitar': 'Folk', 'singer-songwriter': 'Folk', 'songwriter': 'Folk',
    'jazz': 'Jazz/Blues', 'blues': 'Jazz/Blues', 'soul': 'Jazz/Blues', 'funk': 'Jazz/Blues', 'groove': 'Jazz/Blues', 
    'r-n-b': 'Jazz/Blues',
    'latin': 'Latin', 'latino': 'Latin', 'salsa': 'Latin', 'samba': 'Latin', 'sertanejo': 'Latin', 
    'pagode': 'Latin', 'tango': 'Latin', 'forro': 'Latin', 'mpb': 'Latin', 'reggaeton': 'Latin',
    'hip-hop': 'Hip-Hop', 'afrobeat': 'Hip-Hop', 'dancehall': 'Hip-Hop',
    'metal': 'Metal', 'heavy-metal': 'Metal', 'death-metal': 'Metal', 'black-metal': 'Metal', 
    'metalcore': 'Metal', 'grindcore': 'Metal',
    'punk': 'Punk', 'ska': 'Punk', 'hardcore': 'Punk',
    'reggae': 'Reggae', 'dub': 'Reggae',
    'happy': 'Moods', 'sleep': 'Moods', 'chill': 'Moods', 'sad': 'Moods', 'study': 'Moods', 
    'romance': 'Moods', 'party': 'Moods',
    'french': 'Regional', 'german': 'Regional', 'british': 'Regional', 'turkish': 'Regional', 
    'iranian': 'Regional', 'spanish': 'Regional', 'swedish': 'Regional', 'indian': 'Regional', 
    'malay': 'Regional', 'brazil': 'Regional',
    'kids': 'Other', 'anime': 'Other', 'comedy': 'Other', 'disney': 'Other', 'show-tunes': 'Other', 
    'club': 'Other', 'gospel': 'Other', 'world-music': 'Other', 'children': 'Other', 'pop-film': 'Other', 
    'j-idol': 'Pop', 'j-dance': 'Electronic'
}

# Assign a broader category to each genre
def get_category(genre):
    """Assign a broader category to a specific music genre.

    Matching is by substring, in the insertion order of genre_categories:
    the first key contained in the genre name wins.

    Args:
        genre: Music genre as a string.

    Returns:
        str: The assigned category, 'Unknown' if genre is empty, or
            'Other' if no key matches.
    """
    if not genre:
        return 'Unknown'
    genre = genre.lower()
    for key, category in genre_categories.items():
        if key in genre:
            return category
    return 'Other'


df['track_genre'] = df['track_genre'].apply(get_category)
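Because `get_category` matches by substring in dictionary order, a short key can shadow a longer, more specific one that appears later. The sketch below illustrates this with a hypothetical three-entry excerpt of the mapping (`mini_map`) and compares it with an exact-lookup variant. The exact-lookup function is an alternative shown for comparison, not the approach used in this notebook, and switching to it would change the category counts reported further down.

```python
# Hypothetical excerpt of the mapping, keeping the same relative order of keys.
mini_map = {"pop": "Pop", "dub": "Reggae", "pop-film": "Other"}


def substring_category(genre):
    """Return the category of the first key contained in the genre name."""
    for key, category in mini_map.items():
        if key in genre:
            return category
    return "Other"


def exact_category(genre):
    """Return the category only when the genre name equals a key."""
    return mini_map.get(genre, "Other")


# "pop" is a substring of "pop-film" and comes first, so it wins.
print(substring_category("pop-film"))
# The exact lookup reaches the dedicated "pop-film" entry instead.
print(exact_category("pop-film"))
```

Substring matching has the advantage of catching genres that have no entry of their own, at the cost of order-dependent results.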


In [16]:
df.head()

Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
0,5SuOikwiRyPMVoIQDJUgSV,Gen Hoshino,Comedy,Comedy,73,230666,False,0.676,0.461,1,-6.746,0,0.143,0.0322,1e-06,0.358,0.715,87.917,4,Folk
1,4qPNDBW1i3p13qLCt0Ki3A,Ben Woodward,Ghost (Acoustic),Ghost - Acoustic,55,149610,False,0.42,0.166,1,-17.235,1,0.0763,0.924,6e-06,0.101,0.267,77.489,4,Folk
2,1iJBSr7s7jYXzM8EGcbK5b,Ingrid Michaelson;ZAYN,To Begin Again,To Begin Again,57,210826,False,0.438,0.359,0,-9.734,1,0.0557,0.21,0.0,0.117,0.12,76.332,4,Folk
3,6lfxq3CG4xtTiEg7opyCyx,Kina Grannis,Crazy Rich Asians (Original Motion Picture Sou...,Can't Help Falling In Love,71,201933,False,0.266,0.0596,0,-18.515,1,0.0363,0.905,7.1e-05,0.132,0.143,181.74,3,Folk
4,5vjLSffimiIP26QG5WcN2K,Chord Overstreet,Hold On,Hold On,82,198853,False,0.618,0.443,2,-9.681,1,0.0526,0.469,0.0,0.0829,0.167,119.949,4,Folk


In [17]:
df[df['track_id'] == '2kkvB3RNRzwjFdGhaUA0tz']

Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
8262,2kkvB3RNRzwjFdGhaUA0tz,Derek & The Dominos,Layla And Other Assorted Love Songs (Remastere...,Layla,74,423840,False,0.404,0.902,1,-3.88,1,0.0665,0.577,0.297,0.287,0.497,115.669,4,Jazz/Blues
11170,2kkvB3RNRzwjFdGhaUA0tz,Derek & The Dominos,Layla And Other Assorted Love Songs (Remastere...,Layla,74,423840,False,0.404,0.902,1,-3.88,1,0.0665,0.577,0.297,0.287,0.497,115.669,4,Regional
19915,2kkvB3RNRzwjFdGhaUA0tz,Derek & The Dominos,Layla And Other Assorted Love Songs (Remastere...,Layla,74,423840,False,0.404,0.902,1,-3.88,1,0.0665,0.577,0.297,0.287,0.497,115.669,4,Folk
34974,2kkvB3RNRzwjFdGhaUA0tz,Derek & The Dominos,Layla And Other Assorted Love Songs (Remastere...,Layla,74,423840,False,0.404,0.902,1,-3.88,1,0.0665,0.577,0.297,0.287,0.497,115.669,4,Folk
47257,2kkvB3RNRzwjFdGhaUA0tz,Derek & The Dominos,Layla And Other Assorted Love Songs (Remastere...,Layla,74,423840,False,0.404,0.902,1,-3.88,1,0.0665,0.577,0.297,0.287,0.497,115.669,4,Rock
84163,2kkvB3RNRzwjFdGhaUA0tz,Derek & The Dominos,Layla And Other Assorted Love Songs (Remastere...,Layla,74,423840,False,0.404,0.902,1,-3.88,1,0.0665,0.577,0.297,0.287,0.497,115.669,4,Rock
99874,2kkvB3RNRzwjFdGhaUA0tz,Derek & The Dominos,Layla And Other Assorted Love Songs (Remastere...,Layla,74,423840,False,0.404,0.902,1,-3.88,1,0.0665,0.577,0.297,0.287,0.497,115.669,4,Folk
102883,2kkvB3RNRzwjFdGhaUA0tz,Derek & The Dominos,Layla And Other Assorted Love Songs (Remastere...,Layla,74,423840,False,0.404,0.902,1,-3.88,1,0.0665,0.577,0.297,0.287,0.497,115.669,4,Folk


#### Group the genres and select the most frequent one

In [None]:
duplicate_columns = ['artists', 'track_id']  


def pick_genre(group):
    """Select one row from a group of duplicates based on the most frequent genre.

    Args:
        group: Sub-DataFrame with rows duplicated by 'artists' and 'track_id'.

    Returns:
        Series: The first row whose genre is the most frequent one in the group
            (or the first row of the group if no mode is found).
    """
    if len(group) == 1:
        return group.iloc[0]  
    most_common = group['track_genre'].mode()
    if not most_common.empty:
        return group[group['track_genre'] == most_common[0]].iloc[0]
    return group.iloc[0] 


df = df.groupby(duplicate_columns, as_index=False).apply(pick_genre).reset_index(drop=True)

In [19]:
df[df['track_id'] == '2kkvB3RNRzwjFdGhaUA0tz']

Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
20373,2kkvB3RNRzwjFdGhaUA0tz,Derek & The Dominos,Layla And Other Assorted Love Songs (Remastere...,Layla,74,423840,False,0.404,0.902,1,-3.88,1,0.0665,0.577,0.297,0.287,0.497,115.669,4,Folk


Consolidating by artists and track_id selected "Folk" as the most frequent genre (4 of 8) and removed the other 7 entries.
This ensures that each track_id has a single record.
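The "most frequent genre" rule relies on `Series.mode()`, whose behavior on ties is worth knowing. The sketch below reproduces the eight categories of the "Layla" example and then a hypothetical two-way tie.

```python
import pandas as pd

# The eight mapped categories of the "Layla" rows shown above.
genres = pd.Series(
    ["Jazz/Blues", "Regional", "Folk", "Folk", "Rock", "Rock", "Folk", "Folk"]
)
print(genres.mode()[0])

# Hypothetical tie: mode() returns every tied value, in sorted order.
tied = pd.Series(["Rock", "Folk"])
print(tied.mode().tolist())
```

With a tie, taking element `[0]` of the mode picks the alphabetically first category, so the choice is deterministic but arbitrary.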

In [20]:
conteo_ids = df['track_id'].value_counts()
duplicados_reales = conteo_ids[conteo_ids > 1]

print(f"Total de track_id únicos con duplicados: {duplicados_reales.shape[0]}")
print(duplicados_reales.head())

Total de track_id únicos con duplicados: 0
Series([], Name: count, dtype: int64)


In [21]:
df.shape

(89740, 20)

### Identify duplicates by the combination of 'track_name' and 'artists'

In [22]:
song_counts = df.value_counts(subset=['track_name', 'artists']).reset_index(name='counts')

# Keep only the combinations that appear more than once
repeated_songs = song_counts[song_counts['counts'] > 1]

# Show results
print("=== Canciones duplicadas por 'track_name' y 'artists' ===")
print(f"Número de combinaciones duplicadas: {len(repeated_songs)}")
print(repeated_songs.head(10))

=== Canciones duplicadas por 'track_name' y 'artists' ===
Número de combinaciones duplicadas: 4657
                                    track_name          artists  counts
0            Rockin' Around The Christmas Tree       Brenda Lee      45
1               Little Saint Nick - 1991 Remix   The Beach Boys      41
2                              Run Rudolph Run      Chuck Berry      40
3                           Frosty The Snowman  Ella Fitzgerald      34
4       Let It Snow! Let It Snow! Let It Snow!      Dean Martin      32
5                                    Mistletoe    Justin Bieber      31
6                                  Sleigh Ride  Ella Fitzgerald      30
7              I Saw Mommy Kissing Santa Claus    The Jackson 5      27
8                Santa Claus Is Coming To Town    The Jackson 5      26
9  The Christmas Song (Merry Christmas To You)    Nat King Cole      26


There are 4,657 distinct combinations of track_name and artists that appear more than once.
The most repeated songs are Christmas songs, such as "Rockin' Around The Christmas Tree" (45 times) and "Run Rudolph Run" (40 times).

In [23]:
duplicadas = df[(df['track_name'] == "Run Rudolph Run") & (df['artists'] == 'Chuck Berry')]
print(duplicadas.shape)
duplicadas.head()


(40, 20)


Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
15631,02NjAsgrCjbFioTyGI0FeT,Chuck Berry,All I Want For Christmas Is You,Run Rudolph Run,0,162897,False,0.647,0.876,10,-5.662,1,0.185,0.881,3.6e-05,0.26,0.949,151.925,4,Jazz/Blues
15632,03MW3H9B2P7tgpvzG3klNI,Chuck Berry,pov: you hear the bells jingle,Run Rudolph Run,0,162897,False,0.647,0.876,10,-5.662,1,0.185,0.881,3.6e-05,0.26,0.949,151.925,4,Jazz/Blues
15633,03Uo2utHEdsOoEKwPs4w0G,Chuck Berry,pov: you are walking in a winter wonderland,Run Rudolph Run,0,162897,False,0.647,0.876,10,-5.662,1,0.185,0.881,3.6e-05,0.26,0.949,151.925,4,Jazz/Blues
15635,0A2fX1kBVpevzGnMjx5fnX,Chuck Berry,Christmas Playlist 2022,Run Rudolph Run,0,162897,False,0.647,0.876,10,-5.662,1,0.185,0.881,3.6e-05,0.26,0.949,151.925,4,Jazz/Blues
15636,0FJ5C03igALJMeicHGorCo,Chuck Berry,Christmas Best Hits 2022,Run Rudolph Run,0,162897,False,0.647,0.876,10,-5.662,1,0.185,0.881,3.6e-05,0.26,0.949,151.925,4,Jazz/Blues


The 40 entries are the same song on different albums or compilations, with minor variations in popularity and audio features. This suggests that the dataset includes multiple versions or re-releases of the same recording.

#### Build a list of columns for removing duplicates, excluding 'track_id' and 'album_name'

In [None]:

subset_cols = [col for col in df.columns if col not in ["track_id", "album_name"]]

df = df.drop_duplicates(subset=subset_cols, keep="first")

df.shape



(86061, 20)
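The step above treats two rows as duplicates when they agree on everything except the identifier and the album. A minimal sketch on a toy frame (hypothetical IDs and album names, values loosely modeled on the "Run Rudolph Run" rows) shows the effect:

```python
import pandas as pd

# Toy frame: the first two rows differ only in track_id and album_name.
toy = pd.DataFrame({
    "track_id": ["id1", "id2", "id3"],
    "album_name": ["Album A", "Playlist B", "Album C"],
    "track_name": ["Run Rudolph Run"] * 3,
    "popularity": [0, 0, 59],
    "duration_ms": [162897, 162897, 165733],
})

# Compare rows on every column except the two that identify the release.
subset_cols = [c for c in toy.columns if c not in ["track_id", "album_name"]]
deduped = toy.drop_duplicates(subset=subset_cols, keep="first")

print(deduped["track_id"].tolist())
```

Rows that differ in any remaining column, such as popularity or duration, survive this step, which is why several entries for the same song can still remain afterwards.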

In [25]:
duplicadas = df[(df['track_name'] == "Run Rudolph Run") & (df['artists'] == 'Chuck Berry')]
print(duplicadas.shape)
duplicadas.head()

(4, 20)


Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
15631,02NjAsgrCjbFioTyGI0FeT,Chuck Berry,All I Want For Christmas Is You,Run Rudolph Run,0,162897,False,0.647,0.876,10,-5.662,1,0.185,0.881,3.6e-05,0.26,0.949,151.925,4,Jazz/Blues
15638,0HeG8Hl4Upx3zqwK7VNY2l,Chuck Berry,Christmas Songs 2022,Run Rudolph Run,1,162897,False,0.647,0.876,10,-5.662,1,0.185,0.881,3.6e-05,0.26,0.949,151.925,4,Jazz/Blues
15645,1GqAFWj0HSbX055zDc94Wj,Chuck Berry,Classic Christmas Greatest Hits,Run Rudolph Run,0,164160,False,0.682,0.776,7,-8.314,0,0.114,0.83,1.1e-05,0.17,0.959,151.799,4,Jazz/Blues
15655,2pnPe4pJtq7689i5ydzvJJ,Chuck Berry,Rock 'N' Roll Rarities,Run Rudolph Run,59,165733,False,0.681,0.715,7,-10.609,0,0.0912,0.812,9e-06,0.0777,0.957,152.132,4,Rock


#### Select the row with the highest popularity for each combination of 'track_name' and 'artists'

In [26]:
idx = df.groupby(['track_name', 'artists'])['popularity'].idxmax()
df = df.loc[idx].reset_index(drop=True)


In [27]:
df.shape

(81343, 20)

In [28]:
duplicadas = df[(df['track_name'] == "Run Rudolph Run") & (df['artists'] == 'Chuck Berry')]
print(duplicadas.shape)
duplicadas.head()

(1, 20)


Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
55570,2pnPe4pJtq7689i5ydzvJJ,Chuck Berry,Rock 'N' Roll Rarities,Run Rudolph Run,59,165733,False,0.681,0.715,7,-10.609,0,0.0912,0.812,9e-06,0.0777,0.957,152.132,4,Rock


Selecting by popularity kept the most popular version (popularity 59) and dropped the low-popularity ones (0 and 1), which is consistent with the goal of keeping representative records.
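The `groupby(...).idxmax()` pattern used for this selection can be sketched on a toy frame. The second song and all album names below are hypothetical placeholders.

```python
import pandas as pd

# Toy frame: three versions of one song plus one unrelated song.
toy = pd.DataFrame({
    "track_name": ["Run Rudolph Run"] * 3 + ["Song B"],
    "artists": ["Chuck Berry"] * 3 + ["Artist B"],
    "album_name": ["Album 1", "Album 2", "Album 3", "Album 4"],
    "popularity": [0, 1, 59, 40],
})

# Index label of the most popular row within each (track_name, artists) group.
idx = toy.groupby(["track_name", "artists"])["popularity"].idxmax()

# Keep only those rows and renumber the index.
best = toy.loc[idx].reset_index(drop=True)

print(best[["track_name", "album_name", "popularity"]])
```

When several rows share the maximum popularity, `idxmax` returns the first of them, so the row order before grouping decides which version is kept.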

### Show descriptive statistics for the numeric columns

In [29]:
df.describe()


Unnamed: 0,popularity,duration_ms,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
count,81343.0,81343.0,81343.0,81343.0,81343.0,81343.0,81343.0,81343.0,81343.0,81343.0,81343.0,81343.0,81343.0,81343.0
mean,35.242111,231397.1,0.559285,0.634894,5.285335,-8.598587,0.632519,0.088989,0.329703,0.184738,0.219751,0.463301,122.132604,3.896893
std,19.414223,116474.5,0.177734,0.258651,3.557453,5.305827,0.482122,0.116622,0.339945,0.331602,0.198269,0.263406,30.122034,0.456624
min,0.0,8586.0,0.0,0.0,0.0,-49.531,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,21.0,173866.0,0.446,0.455,2.0,-10.459,0.0,0.0361,0.0159,0.0,0.0986,0.241,99.3765,4.0
50%,35.0,215200.0,0.573,0.678,5.0,-7.267,1.0,0.0491,0.19,8.9e-05,0.133,0.449,122.028,4.0
75%,49.0,267333.0,0.69,0.8565,8.0,-5.143,1.0,0.087,0.629,0.153,0.283,0.676,140.1235,4.0
max,100.0,5237295.0,0.985,1.0,11.0,4.532,1.0,0.965,0.996,1.0,1.0,0.995,243.372,5.0


The dataset shows a varied distribution across audio metrics and popularity. Tracks are typically of medium length (median duration_ms of 215,200, about 3.6 minutes) and moderately danceable and energetic (median danceability 0.573, median energy 0.678).
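The duration figures in the table are in milliseconds, so a quick conversion helps when reading them. The short calculation below takes the quartiles from the `describe()` output above and converts them to minutes.

```python
MS_PER_MINUTE = 60_000

# Quartiles of duration_ms copied from the describe() output above.
quartiles_ms = {"25%": 173866.0, "50%": 215200.0, "75%": 267333.0}

# Convert each quartile to minutes, rounded to two decimals.
quartiles_min = {q: round(ms / MS_PER_MINUTE, 2) for q, ms in quartiles_ms.items()}

print(quartiles_min)
```

The middle half of the tracks therefore lasts between roughly 2.9 and 4.5 minutes.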

### Numeric to categorical

In [None]:
def categorize_popularity(p):
    """Categorize popularity as low, medium, or high.

    Args:
        p: Popularity value (int).

    Returns:
        str: Assigned category ('low', 'medium', 'high').
    """
    if p < 30:
        return 'low'
    elif p < 70:
        return 'medium'
    else:
        return 'high'

df['popularity_cat'] = df['popularity'].apply(categorize_popularity)


df.head()


Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,...,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre,popularity_cat
0,0fROT4kK5oTm8xO8PX6EJF,Rilès,!I'll Be Back!,!I'll Be Back!,52,178533,True,0.823,0.612,1,...,1,0.248,0.168,0.0,0.109,0.688,142.959,4,Regional,medium
1,1hH0t381PIXmUVWyG1Vj3p,Brian Hyland,The Bashful Blond,"""A"" You're Adorable",39,151680,False,0.615,0.375,0,...,0,0.0319,0.482,0.0,0.111,0.922,110.72,4,Rock,medium
2,1B45DvGMoFWdbAEUH2qliG,Little Apple Band,The Favorite Songs Of Sesame Street,"""C"" IS FOR COOKIE",32,84305,False,0.553,0.812,3,...,1,0.0558,0.132,1e-05,0.0794,0.871,118.368,4,Other,medium
3,73lXf5if6MWVWnsgXhK8bd,Little Apple Band,Sesame Street and Friends,"""C"" is for Cookie",8,86675,False,0.664,0.611,3,...,1,0.0886,0.12,0.0,0.0408,0.758,118.443,4,Other,low
4,0jmz4aHEIBCRgrcV2xEkwB,Traditional;Sistine Chapel Choir;Massimo Palom...,Classical Christmas,"""Christe, Redemptor omnium""",0,289133,False,0.111,0.0568,10,...,1,0.0551,0.99,0.697,0.11,0.0395,169.401,1,Moods,low
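As an alternative to a row-wise `apply`, the same thresholds can be expressed with `pd.cut`. The sketch below is not the notebook's method; it reproduces the 30 and 70 boundaries of `categorize_popularity` on a few hypothetical values chosen to hit the edges.

```python
import numpy as np
import pandas as pd

# Hypothetical popularity values, including both boundaries.
popularity = pd.Series([0, 29, 30, 69, 70, 100])

# right=False makes each bin closed on the left: [-inf, 30), [30, 70), [70, inf).
popularity_cat = pd.cut(
    popularity,
    bins=[-np.inf, 30, 70, np.inf],
    labels=["low", "medium", "high"],
    right=False,
)

print(popularity_cat.tolist())
```

One difference to keep in mind is that `pd.cut` returns a categorical column rather than plain strings, which may or may not be what later steps expect.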


In [None]:
def categorize_duration(d):
    """Categorize a duration in minutes as short, medium, or long.

    Args:
        d: Duration in minutes (float).

    Returns:
        str: Assigned category ('short', 'medium', 'long').
    """
    if d < 2.5:
        return 'short'
    elif d <= 4:
        return 'medium'
    else:
        return 'long'

# duration_min is not created in any earlier cell shown here; derive it from duration_ms
df['duration_min'] = df['duration_ms'] / 60_000
df['duration_cat'] = df['duration_min'].apply(categorize_duration)
df.head()

Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,...,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre,popularity_cat,duration_min,duration_cat
0,0fROT4kK5oTm8xO8PX6EJF,Rilès,!I'll Be Back!,!I'll Be Back!,52,178533,True,0.823,0.612,1,...,0.168,0.0,0.109,0.688,142.959,4,Regional,medium,2.97555,medium
1,1hH0t381PIXmUVWyG1Vj3p,Brian Hyland,The Bashful Blond,"""A"" You're Adorable",39,151680,False,0.615,0.375,0,...,0.482,0.0,0.111,0.922,110.72,4,Rock,medium,2.528,medium
2,1B45DvGMoFWdbAEUH2qliG,Little Apple Band,The Favorite Songs Of Sesame Street,"""C"" IS FOR COOKIE",32,84305,False,0.553,0.812,3,...,0.132,1e-05,0.0794,0.871,118.368,4,Other,medium,1.405083,short
3,73lXf5if6MWVWnsgXhK8bd,Little Apple Band,Sesame Street and Friends,"""C"" is for Cookie",8,86675,False,0.664,0.611,3,...,0.12,0.0,0.0408,0.758,118.443,4,Other,low,1.444583,short
4,0jmz4aHEIBCRgrcV2xEkwB,Traditional;Sistine Chapel Choir;Massimo Palom...,Classical Christmas,"""Christe, Redemptor omnium""",0,289133,False,0.111,0.0568,10,...,0.99,0.697,0.11,0.0395,169.401,1,Moods,low,4.818883,long


In [None]:
def categorize_level(x):
    """Categorize a numeric value between 0 and 1 as low, medium, or high.

    Args:
        x: Numeric value between 0 and 1 (float).

    Returns:
        str: Assigned category ('low', 'medium', 'high').
    """
    if x < 0.33:
        return 'low'
    elif x < 0.66:
        return 'medium'
    else:
        return 'high'

df['danceability_cat'] = df['danceability'].apply(categorize_level)
df['energy_cat'] = df['energy'].apply(categorize_level)

df.head()

Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,...,liveness,valence,tempo,time_signature,track_genre,popularity_cat,duration_min,duration_cat,danceability_cat,energy_cat
0,0fROT4kK5oTm8xO8PX6EJF,Rilès,!I'll Be Back!,!I'll Be Back!,52,178533,True,0.823,0.612,1,...,0.109,0.688,142.959,4,Regional,medium,2.97555,medium,high,medium
1,1hH0t381PIXmUVWyG1Vj3p,Brian Hyland,The Bashful Blond,"""A"" You're Adorable",39,151680,False,0.615,0.375,0,...,0.111,0.922,110.72,4,Rock,medium,2.528,medium,medium,medium
2,1B45DvGMoFWdbAEUH2qliG,Little Apple Band,The Favorite Songs Of Sesame Street,"""C"" IS FOR COOKIE",32,84305,False,0.553,0.812,3,...,0.0794,0.871,118.368,4,Other,medium,1.405083,short,medium,high
3,73lXf5if6MWVWnsgXhK8bd,Little Apple Band,Sesame Street and Friends,"""C"" is for Cookie",8,86675,False,0.664,0.611,3,...,0.0408,0.758,118.443,4,Other,low,1.444583,short,high,medium
4,0jmz4aHEIBCRgrcV2xEkwB,Traditional;Sistine Chapel Choir;Massimo Palom...,Classical Christmas,"""Christe, Redemptor omnium""",0,289133,False,0.111,0.0568,10,...,0.11,0.0395,169.401,1,Moods,low,4.818883,long,low,low


In [None]:
def categorize_valence(v):
    """Categorize valence into levels of emotional state.

    Args:
        v: Valence value between 0 and 1 (float).

    Returns:
        str: Assigned category ('very sad', 'sad', 'neutral', 'happy', 'very happy').
    """
    if v < 0.2:
        return 'very sad'
    elif v < 0.4:
        return 'sad'
    elif v < 0.6:
        return 'neutral'
    elif v < 0.8:
        return 'happy'
    else:
        return 'very happy'

df['valence_cat'] = df['valence'].apply(categorize_valence)


df.head()

Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,...,valence,tempo,time_signature,track_genre,popularity_cat,duration_min,duration_cat,danceability_cat,energy_cat,valence_cat
0,0fROT4kK5oTm8xO8PX6EJF,Rilès,!I'll Be Back!,!I'll Be Back!,52,178533,True,0.823,0.612,1,...,0.688,142.959,4,Regional,medium,2.97555,medium,high,medium,happy
1,1hH0t381PIXmUVWyG1Vj3p,Brian Hyland,The Bashful Blond,"""A"" You're Adorable",39,151680,False,0.615,0.375,0,...,0.922,110.72,4,Rock,medium,2.528,medium,medium,medium,very happy
2,1B45DvGMoFWdbAEUH2qliG,Little Apple Band,The Favorite Songs Of Sesame Street,"""C"" IS FOR COOKIE",32,84305,False,0.553,0.812,3,...,0.871,118.368,4,Other,medium,1.405083,short,medium,high,very happy
3,73lXf5if6MWVWnsgXhK8bd,Little Apple Band,Sesame Street and Friends,"""C"" is for Cookie",8,86675,False,0.664,0.611,3,...,0.758,118.443,4,Other,low,1.444583,short,high,medium,happy
4,0jmz4aHEIBCRgrcV2xEkwB,Traditional;Sistine Chapel Choir;Massimo Palom...,Classical Christmas,"""Christe, Redemptor omnium""",0,289133,False,0.111,0.0568,10,...,0.0395,169.401,1,Moods,low,4.818883,long,low,low,very sad


### Create boolean columns for 'loudness' and 'liveness'

In [36]:
df['is_loud'] = df['loudness'] > -5
df['is_live'] = df['liveness'] > 0.8

df.head()

Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,...,time_signature,track_genre,popularity_cat,duration_min,duration_cat,danceability_cat,energy_cat,valence_cat,is_loud,is_live
0,0fROT4kK5oTm8xO8PX6EJF,Rilès,!I'll Be Back!,!I'll Be Back!,52,178533,True,0.823,0.612,1,...,4,Regional,medium,2.97555,medium,high,medium,happy,False,False
1,1hH0t381PIXmUVWyG1Vj3p,Brian Hyland,The Bashful Blond,"""A"" You're Adorable",39,151680,False,0.615,0.375,0,...,4,Rock,medium,2.528,medium,medium,medium,very happy,False,False
2,1B45DvGMoFWdbAEUH2qliG,Little Apple Band,The Favorite Songs Of Sesame Street,"""C"" IS FOR COOKIE",32,84305,False,0.553,0.812,3,...,4,Other,medium,1.405083,short,medium,high,very happy,False,False
3,73lXf5if6MWVWnsgXhK8bd,Little Apple Band,Sesame Street and Friends,"""C"" is for Cookie",8,86675,False,0.664,0.611,3,...,4,Other,low,1.444583,short,high,medium,happy,False,False
4,0jmz4aHEIBCRgrcV2xEkwB,Traditional;Sistine Chapel Choir;Massimo Palom...,Classical Christmas,"""Christe, Redemptor omnium""",0,289133,False,0.111,0.0568,10,...,1,Moods,low,4.818883,long,low,low,very sad,False,False


In [37]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 81343 entries, 0 to 81342
Data columns (total 28 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   track_id          81343 non-null  object 
 1   artists           81343 non-null  object 
 2   album_name        81343 non-null  object 
 3   track_name        81343 non-null  object 
 4   popularity        81343 non-null  int64  
 5   duration_ms       81343 non-null  int64  
 6   explicit          81343 non-null  bool   
 7   danceability      81343 non-null  float64
 8   energy            81343 non-null  float64
 9   key               81343 non-null  int64  
 10  loudness          81343 non-null  float64
 11  mode              81343 non-null  int64  
 12  speechiness       81343 non-null  float64
 13  acousticness      81343 non-null  float64
 14  instrumentalness  81343 non-null  float64
 15  liveness          81343 non-null  float64
 16  valence           81343 non-null  float6

## Drop the original numeric columns and other non-essential ones

In [None]:
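df = df.drop(columns=[
    'popularity', 'duration_ms', 'danceability', 'energy', 'valence',
    'loudness', 'liveness', 'key', 'mode', 'time_signature', 'tempo',
    'speechiness', 'acousticness', 'instrumentalness',
    # album_name and track_name are absent from the output of the next cell,
    # so they are assumed to have been dropped here as well
    'album_name', 'track_name',
])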


In [39]:
df.head()

Unnamed: 0,track_id,artists,explicit,track_genre,popularity_cat,duration_min,duration_cat,danceability_cat,energy_cat,valence_cat,is_loud,is_live
0,0fROT4kK5oTm8xO8PX6EJF,Rilès,True,Regional,medium,2.97555,medium,high,medium,happy,False,False
1,1hH0t381PIXmUVWyG1Vj3p,Brian Hyland,False,Rock,medium,2.528,medium,medium,medium,very happy,False,False
2,1B45DvGMoFWdbAEUH2qliG,Little Apple Band,False,Other,medium,1.405083,short,medium,high,very happy,False,False
3,73lXf5if6MWVWnsgXhK8bd,Little Apple Band,False,Other,low,1.444583,short,high,medium,happy,False,False
4,0jmz4aHEIBCRgrcV2xEkwB,Traditional;Sistine Chapel Choir;Massimo Palom...,False,Moods,low,4.818883,long,low,low,very sad,False,False


## Filter out unwanted genres ('Other' and 'Moods')

In [46]:
df["track_genre"].value_counts()

track_genre
Electronic    7590
Other         3745
Regional      3161
Latin         2537
Pop           2439
Moods         2219
Folk          1759
Rock          1698
Classical     1670
Jazz/Blues    1524
Hip-Hop       1115
Metal          994
Punk           828
Reggae         158
Name: count, dtype: int64

In [47]:
df = df[df["track_genre"].str.lower() != "other"].reset_index(drop=True)

In [54]:
df = df[~df["track_genre"].str.lower().isin(["moods"])].reset_index(drop=True)


In [56]:
df["track_genre"].value_counts()

track_genre
Electronic    7590
Regional      3161
Latin         2537
Pop           2439
Folk          1759
Rock          1698
Classical     1670
Jazz/Blues    1524
Hip-Hop       1115
Metal          994
Punk           828
Reggae         158
Name: count, dtype: int64

The filtering focused the dataset on well-defined musical genres, removing about 19% of the rows (5,964 of 31,437: 3,745 "Other" and 2,219 "Moods") to support more specific analyses. These rows were removed because they were considered to represent podcasts rather than songs.
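The share of removed rows can be recomputed from the category counts printed before the filter. The calculation below copies those counts into a plain dictionary.

```python
# Category counts copied from the value_counts() output before filtering.
counts = {
    "Electronic": 7590, "Other": 3745, "Regional": 3161, "Latin": 2537,
    "Pop": 2439, "Moods": 2219, "Folk": 1759, "Rock": 1698,
    "Classical": 1670, "Jazz/Blues": 1524, "Hip-Hop": 1115, "Metal": 994,
    "Punk": 828, "Reggae": 158,
}

# Rows dropped by the two filters, and the total before filtering.
removed = counts["Other"] + counts["Moods"]
total = sum(counts.values())
removed_pct = round(100 * removed / total, 1)

print(removed, total, removed_pct)
```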