# **Algorhythm**

## Data Exploration

This notebook covers the initial exploration of the **_Algorhythm_** dataset. The goal is to understand its structure and data types, identify data quality issues, unify the representation of missing values, and ensure consistent types across all columns. After cleaning and validation, the processed dataset is saved in Parquet format for further modeling and development.

`Simón Correa Marín`  
`Luis Felipe Ospina Giraldo`


### **1. Import Libraries**


In [1]:
# base libraries for data science
from pathlib import Path
import numpy as np
import pandas as pd
import pyarrow as pa

### **2. Load Data**


In [2]:
# Define the data directory path (relative, matching your save location)
DATA_DIR = Path("../../data/01_raw")

# Read the CSV file into a DataFrame
algorhythm_df = pd.read_csv(DATA_DIR / "algorhythm_final.csv")

### **3. Data Description**


In [3]:
algorhythm_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2805 entries, 0 to 2804
Data columns (total 26 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   age                 2332 non-null   float64
 1   album_name          2332 non-null   object 
 2   album_popularity    2332 non-null   float64
 3   artist_genres       1118 non-null   object 
 4   artist_name         2332 non-null   object 
 5   artist_popularity   2332 non-null   float64
 6   chart_chart_name    473 non-null    object 
 7   chart_genres        387 non-null    object 
 8   chart_track_name    473 non-null    object 
 9   chart_popularity    473 non-null    float64
 10  chart_position      473 non-null    float64
 11  features_vector     0 non-null      float64
 12  gender              2332 non-null   object 
 13  is_liked            2332 non-null   float64
 14  is_recent_play      2332 non-null   float64
 15  is_top_track        2332 non-null   float64
 16  locati

In [4]:
algorhythm_df.sample(10)

Unnamed: 0,age,album_name,album_popularity,artist_genres,artist_name,artist_popularity,chart_chart_name,chart_genres,chart_track_name,chart_popularity,...,location,music_profile,track_name,track_popularity,album_age_days,chart_age_days,track_age_days,played_day_of_week,played_hour,is_recommended
1389,21.0,Hasta el Fin Del Mundo,0.0,"latin folk, mexican indie",Kevin Kaarl,77.0,,,,,...,Colombia,"reggaeton, country, urbano latino, latin pop, ...",Vámonos a Marte,78.0,2286.0,,2286.0,2.0,7.0,1
2095,21.0,reputation,0.0,,Taylor Swift,98.0,,,,,...,Colombia,"reggaeton, country, urbano latino, latin pop, ...",Delicate,82.0,2761.0,,2761.0,2.0,7.0,1
436,21.0,Blue Banisters,0.0,,Lana Del Rey,92.0,,,,,...,Colombia,"reggaeton, country, urbano latino, latin pop, ...",Cherry Blossom,61.0,1319.0,,1319.0,2.0,7.0,1
1740,21.0,Manic,0.0,,Halsey,83.0,,,,,...,Colombia,"reggaeton, country, urbano latino, latin pop, ...",Graveyard,67.0,1963.0,,1963.0,2.0,7.0,1
2291,21.0,Eyes Wide Open,0.0,,Sabrina Carpenter,91.0,,,,,...,Colombia,"reggaeton, country, urbano latino, latin pop, ...",We'll Be The Stars,56.0,3706.0,,3706.0,2.0,7.0,1
2136,21.0,OASIS,0.0,"reggaeton, latin",J Balvin,89.0,,,,,...,Colombia,"reggaeton, country, urbano latino, latin pop, ...",COMO UN BEBÉ,73.0,2166.0,,2166.0,2.0,7.0,1
2642,,,,,,,Top 50 Global,,Te Mentiría 420,7.0,...,,,,,,885.0,,,,0
364,21.0,Back To Black (Deluxe Edition),0.0,,Amy Winehouse,78.0,,,,,...,Colombia,"reggaeton, country, urbano latino, latin pop, ...",Back To Black,83.0,,,,2.0,7.0,1
1405,21.0,Prisoner (feat. Dua Lipa),0.0,,Miley Cyrus,85.0,,,,,...,Colombia,"reggaeton, country, urbano latino, latin pop, ...",Prisoner (feat. Dua Lipa),2.0,1655.0,,1655.0,2.0,7.0,1
80,21.0,The Secret of Us (Deluxe),0.0,,Gracie Abrams,87.0,,,,,...,Colombia,"reggaeton, country, urbano latino, latin pop, ...","I Love You, I'm Sorry - Live From Vevo",73.0,227.0,,227.0,2.0,7.0,1


### **4. Null Values**


In [5]:
# Dataset length
len(algorhythm_df)

2805

In [6]:
# Percentage of missing values for each column
missing_values = algorhythm_df.isnull().sum() * 100 / len(algorhythm_df)

for column, percentage in missing_values.items():
    print(f"{column}: {percentage:.3f}%")

age: 16.863%
album_name: 16.863%
album_popularity: 16.863%
artist_genres: 60.143%
artist_name: 16.863%
artist_popularity: 16.863%
chart_chart_name: 83.137%
chart_genres: 86.203%
chart_track_name: 83.137%
chart_popularity: 83.137%
chart_position: 83.137%
features_vector: 100.000%
gender: 16.863%
is_liked: 16.863%
is_recent_play: 16.863%
is_top_track: 16.863%
location: 16.863%
music_profile: 16.863%
track_name: 16.863%
track_popularity: 16.863%
album_age_days: 17.968%
chart_age_days: 83.137%
track_age_days: 17.968%
played_day_of_week: 16.863%
played_hour: 16.863%
is_recommended: 0.000%
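
The percentages above fall into two complementary groups (about 16.9% and 83.1%), which suggests, though the output alone does not prove it, that the rows come from two sources: user listening records and chart entries. A minimal sketch of how that could be checked; the helper name `missing_together` and the toy frame are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd


def missing_together(df: pd.DataFrame, cols: list) -> bool:
    """True if, in every row, the given columns are all null or all non-null."""
    null_counts = df[cols].isna().sum(axis=1)
    return bool(null_counts.isin([0, len(cols)]).all())


# Toy frame standing in for algorhythm_df (values are illustrative only)
toy_df = pd.DataFrame({
    "artist_name": ["Halsey", "J Balvin", np.nan],
    "track_name": ["Graveyard", "COMO UN BEBÉ", np.nan],
    "chart_track_name": [np.nan, np.nan, "Te Mentiría 420"],
})

# Columns from the same source should be missing in the same rows
same_source = missing_together(toy_df, ["artist_name", "track_name"])
# Columns from different sources should not
mixed_source = missing_together(toy_df, ["artist_name", "chart_track_name"])
print(same_source, mixed_source)
```

If the check holds on the real data, the two groups of columns could be profiled separately instead of treating every null as a data quality problem.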


In [7]:
# Check whether missing values are represented in any other way
mv = ["?", " ", "", "nan", "N/A", "na", "NA", "NAN", "None", "none", "NONE", "null", "NULL", "Null"]
for col in algorhythm_df.columns:
    print(col, algorhythm_df[col].isin(mv).sum())

age 0
album_name 3
album_popularity 0
artist_genres 0
artist_name 0
artist_popularity 0
chart_chart_name 0
chart_genres 0
chart_track_name 0
chart_popularity 0
chart_position 0
features_vector 0
gender 0
is_liked 0
is_recent_play 0
is_top_track 0
location 0
music_profile 0
track_name 0
track_popularity 0
album_age_days 0
chart_age_days 0
track_age_days 0
played_day_of_week 0
played_hour 0
is_recommended 0
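
Only `album_name` has matches (3 values). The notebook does not show which values matched, so they should be inspected first: an album could legitimately be titled "?" or "None". If they do turn out to be placeholders, one way to unify them with NaN is sketched below; `unify_missing` is a hypothetical helper and the toy frame is illustrative.

```python
import numpy as np
import pandas as pd

# Same placeholder tokens that the check above searches for
MISSING_TOKENS = ["?", " ", "", "nan", "N/A", "na", "NA", "NAN",
                  "None", "none", "NONE", "null", "NULL", "Null"]


def unify_missing(df: pd.DataFrame, tokens=MISSING_TOKENS) -> pd.DataFrame:
    """Return a copy of df where exact-match placeholder tokens become NaN."""
    return df.mask(df.isin(tokens))


# Toy frame standing in for algorhythm_df (values are illustrative only)
toy_df = pd.DataFrame({
    "album_name": ["Manic", "None", "NA", "OASIS"],
    "track_popularity": [67.0, 82.0, np.nan, 73.0],
})

clean_df = unify_missing(toy_df)
print(clean_df["album_name"].isna().sum())
```

The match is exact and case-sensitive, so a value such as `" NA "` with surrounding spaces would not be replaced.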


In [8]:
cols_to_drop = ["features_vector"]

algorhythm_df = algorhythm_df.drop(columns=cols_to_drop)

### **5. Data Types**


#### **Categorical Values**

- **Ordinal**

  - **chart_position**: Position of the track in the chart (may be dropped due to high nulls).

- **Nominal**

  - **album_name**: Name of the album.
  - **artist_genres**: List of genres for the artist.
  - **artist_name**: Name of the artist.
  - **gender**: User’s gender (e.g., male, female, other).
  - **location**: User’s location.
  - **music_profile**: User’s reported music preferences or profile.
  - **track_name**: Name of the track.
  - **chart_chart_name**: Name of the chart (if present).
  - **chart_genres**: Genres associated with the chart (if present).
  - **chart_track_name**: Track name as shown in the chart (if present).

- **Boolean**
  - **is_liked**: If the user liked/saved the track (0 = No, 1 = Yes).
  - **is_recent_play**: If the track was played recently (0 = No, 1 = Yes).
  - **is_top_track**: If the track is among the user's top tracks (0 = No, 1 = Yes).
  - **is_recommended**: If the track was recommended (0 = No, 1 = Yes).

---

#### **Numerical Values**

- **Discrete**

  - **age**: User’s age (years).
  - **album_popularity**: Popularity score of the album (0–100).
  - **artist_popularity**: Popularity score of the artist (0–100).
  - **track_popularity**: Popularity score of the track (0–100).
  - **chart_popularity**: Popularity score within the chart (0–100, if present).
  - **played_day_of_week**: Day of the week when the track was played (0 = Monday, 6 = Sunday).
  - **played_hour**: Hour of the day when the track was played (0–23).

- **Continuous**
  - **album_age_days**: Number of days since the album release.
  - **track_age_days**: Number of days since the track release.
  - **chart_age_days**: Number of days since the track was added to the chart.

---

#### **Other / Dropped or Empty**

- **features_vector**: Dropped (100% nulls).


#### **Convert data types**


In [9]:
# Unique values for each column
for col in algorhythm_df.columns:
    print(col, algorhythm_df[col].unique())

age [21. nan]
album_name ['The Secret of Us' 'THE TORTURED POETS DEPARTMENT' 'Submarine' ...
 'Ins and Outs (Originally Performed by Sofia Carson) (Karaoke Version)'
 'Adiós (Cover en Español)' nan]
album_popularity [ 0. 68. 81. 83. 80. 75. 74. 86. 87. 69. 93. 79.  2. 82. nan]
artist_genres [nan 'bedroom pop' 'reggaeton, urbano latino' 'pop'
 'dream pop, shoegaze, slowcore' 'art pop' 'hip hop, west coast hip hop'
 'latin pop' 'r&b' 'opera' 'salsa, merengue, son cubano' 'argentine trap'
 'k-pop' 'argentine rock, latin rock, rock en español, latin alternative'
 'latin'
 'christmas, big band, adult standards, swing music, vocal jazz, jazz'
 'progressive metal, metalcore' 'reggaeton'
 'neoperreo, cloud rap, hyperpop' 'art pop, pop'
 'reggaeton chileno, chilean trap, reggaeton mexa, chilean mambo'
 'corrido, corridos tumbados, corridos bélicos, música mexicana, sad sierreño, banda, electro corridos, dembow belico'
 'soft pop' 'reggaeton, latin' 'bolero, bacardi' 'neoclassical, classical'
 '

Some variables contain unexpected values (for example, `age` has a single non-null value, 21.0). We'll handle them.


In [10]:
# Ordinal categorical columns
ordinal_categorical_cols = [
    "chart_position"
]

# Nominal categorical columns
nominal_categorical_cols = [
    "album_name",
    "artist_genres",
    "artist_name",
    "gender",
    "location",
    "music_profile",
    "track_name",
    "chart_chart_name",
    "chart_genres",
    "chart_track_name"
]

# Boolean columns
boolean_cols = [
    "is_liked",
    "is_recent_play",
    "is_top_track",
    "is_recommended"
]

# Discrete numerical columns
disc_numerical_cols = [
    "age",
    "album_popularity",
    "artist_popularity",
    "track_popularity",
    "chart_popularity",
    "played_day_of_week",
    "played_hour"
]

# Continuous numerical columns
cont_numerical_cols = [
    "album_age_days",
    "track_age_days",
    "chart_age_days"
]

# Other / dropped or empty
dropped_cols = [
    "features_vector"
]
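
Because the column groups are typed by hand, it is easy to leave a column out or list it twice. A small sanity check can catch that before the conversion step; this is a sketch with a hypothetical helper (`check_column_groups`) and toy inputs, not part of the original notebook.

```python
from collections import Counter


def check_column_groups(columns, groups):
    """Return (missing, unexpected, duplicated) column names for the groups."""
    assigned = [col for cols in groups.values() for col in cols]
    counts = Counter(assigned)
    missing = sorted(set(columns) - set(assigned))     # in the data, in no group
    unexpected = sorted(set(assigned) - set(columns))  # in a group, not in the data
    duplicated = sorted(col for col, n in counts.items() if n > 1)
    return missing, unexpected, duplicated


# Toy example; in the notebook the groups would be the lists defined above
columns = ["age", "gender", "is_liked", "track_age_days"]
groups = {
    "nominal": ["gender"],
    "boolean": ["is_liked"],
    "discrete": ["age"],
    "continuous": [],  # track_age_days deliberately left out
}

missing, unexpected, duplicated = check_column_groups(columns, groups)
print(missing, unexpected, duplicated)
```

In the notebook itself, `features_vector` would be reported as unexpected, since it is listed in `dropped_cols` but no longer present in the DataFrame.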

In [11]:
# Ordinal and Nominal Categorical columns
categorical_cols = ordinal_categorical_cols + nominal_categorical_cols
algorhythm_df[categorical_cols] = algorhythm_df[categorical_cols].astype("category")

# Boolean columns
# Note: astype("bool") treats NaN as True, so rows with missing flags become True
algorhythm_df[boolean_cols] = algorhythm_df[boolean_cols].astype("bool")

# Discrete numerical columns
algorhythm_df[disc_numerical_cols] = algorhythm_df[disc_numerical_cols].astype("float")  # Use float to allow for NaN

# Continuous numerical columns
algorhythm_df[cont_numerical_cols] = algorhythm_df[cont_numerical_cols].astype("float")
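
The cast to `bool` has a side effect worth checking: NaN is truthy, so missing flags become `True` (the later `info()` output reports 2805 non-null values for the boolean columns, up from 2332). The sketch below contrasts that with pandas' nullable `boolean` dtype, offered here as one possible alternative rather than the notebook's method; the toy Series is illustrative.

```python
import numpy as np
import pandas as pd

# Toy flag column with one missing value (illustrative only)
flags = pd.Series([0.0, 1.0, np.nan], name="is_liked")

# NumPy bool has no missing value: NaN is truthy and becomes True
as_numpy_bool = flags.astype("bool")

# The nullable "boolean" dtype keeps the missing value as <NA>
as_nullable = flags.astype("boolean")

print(as_numpy_bool.tolist())
print(as_nullable.isna().sum())
```

Whether missing flags should stay missing or be filled with `False` is a modeling decision; the nullable dtype only makes that choice explicit.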

In [12]:
# Unique values for each categorical column
for col in algorhythm_df.select_dtypes(include="category").columns:
    unique_vals = algorhythm_df[col].unique()
    if len(unique_vals) < 10:
        print(f"{col}: {unique_vals}")

chart_chart_name: [NaN, 'Today's Top Hits', 'Top 50 Global', 'Top 50 Colombia']
Categories (3, object): ['Today's Top Hits', 'Top 50 Colombia', 'Top 50 Global']
gender: ['male', NaN]
Categories (1, object): ['male']
location: ['Colombia', NaN]
Categories (1, object): ['Colombia']
music_profile: ['reggaeton, country, urbano latino, latin pop,..., NaN]
Categories (1, object): ['reggaeton, country, urbano latino, latin pop,...]


In [13]:
# Unique values for each boolean column
for col in algorhythm_df.select_dtypes(include="boolean").columns:
    print(col, algorhythm_df[col].unique())

is_liked [False  True]
is_recent_play [False  True]
is_top_track [ True False]
is_recommended [ True False]


In [14]:
# Numerical columns overview
algorhythm_df.describe()

Unnamed: 0,age,album_popularity,artist_popularity,chart_popularity,track_popularity,album_age_days,chart_age_days,track_age_days,played_day_of_week,played_hour
count,2332.0,2332.0,2332.0,473.0,2332.0,2301.0,473.0,2301.0,2332.0,2332.0
mean,21.0,1.783019,80.762436,64.585624,61.821612,2477.791395,465.215645,2477.791395,2.0,7.0
std,0.0,11.932425,16.697249,22.653281,24.695397,1938.830203,411.94466,1938.830203,0.0,0.0
min,21.0,0.0,0.0,0.0,0.0,24.0,4.0,24.0,2.0,7.0
25%,21.0,0.0,77.0,59.0,56.0,1407.0,44.0,1407.0,2.0,7.0
50%,21.0,0.0,86.0,70.0,70.0,2166.0,388.0,2166.0,2.0,7.0
75%,21.0,0.0,91.0,80.0,77.25,2986.0,743.0,2986.0,2.0,7.0
max,21.0,93.0,100.0,97.0,100.0,23436.0,1539.0,23436.0,2.0,7.0
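
In the summary above, `age`, `played_day_of_week`, and `played_hour` have a standard deviation of 0, meaning they hold a single value. A short sketch for listing such columns, including non-numeric ones; `constant_columns` is a hypothetical helper and the toy frame is illustrative.

```python
import numpy as np
import pandas as pd


def constant_columns(df: pd.DataFrame) -> list:
    """Columns with at most one distinct non-null value."""
    n_unique = df.nunique(dropna=True)
    return n_unique[n_unique <= 1].index.tolist()


# Toy frame standing in for algorhythm_df (values are illustrative only)
toy_df = pd.DataFrame({
    "age": [21.0, 21.0, np.nan],
    "played_hour": [7.0, 7.0, 7.0],
    "track_popularity": [78.0, 82.0, 61.0],
})

print(constant_columns(toy_df))
```

Constant columns carry no information for a model trained on this dataset alone, so they are candidates for review before modeling.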


In [15]:
algorhythm_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2805 entries, 0 to 2804
Data columns (total 25 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   age                 2332 non-null   float64 
 1   album_name          2332 non-null   category
 2   album_popularity    2332 non-null   float64 
 3   artist_genres       1118 non-null   category
 4   artist_name         2332 non-null   category
 5   artist_popularity   2332 non-null   float64 
 6   chart_chart_name    473 non-null    category
 7   chart_genres        387 non-null    category
 8   chart_track_name    473 non-null    category
 9   chart_popularity    473 non-null    float64 
 10  chart_position      473 non-null    category
 11  gender              2332 non-null   category
 12  is_liked            2805 non-null   bool    
 13  is_recent_play      2805 non-null   bool    
 14  is_top_track        2805 non-null   bool    
 15  location            2332 non-null   ca

In [16]:
algorhythm_df.sample(3)

Unnamed: 0,age,album_name,album_popularity,artist_genres,artist_name,artist_popularity,chart_chart_name,chart_genres,chart_track_name,chart_popularity,...,location,music_profile,track_name,track_popularity,album_age_days,chart_age_days,track_age_days,played_day_of_week,played_hour,is_recommended
1689,21.0,Tu Peor Error,0.0,"reggaeton, trap latino, dembow, urbano latino",Darell,77.0,,,,,...,Colombia,"reggaeton, country, urbano latino, latin pop, ...",Tu Peor Error,0.0,2537.0,,2537.0,2.0,7.0,True
1501,21.0,Platonicos,0.0,reggaeton,Jay Wheeler,78.0,,,,,...,Colombia,"reggaeton, country, urbano latino, latin pop, ...",La Curiosidad,81.0,1816.0,,1816.0,2.0,7.0,True
1486,21.0,"good kid, m.A.A.d city (Deluxe)",0.0,"hip hop, west coast hip hop",Kendrick Lamar,95.0,,,,,...,Colombia,"reggaeton, country, urbano latino, latin pop, ...",Money Trees,70.0,4606.0,,4606.0,2.0,7.0,True


### **6. Save dataframe with data types**


In [17]:
schema = pa.Table.from_pandas(algorhythm_df).schema

In [18]:
DATA_DIR = Path("../../data/02_intermediate")
DATA_DIR.mkdir(parents=True, exist_ok=True)  # make sure the output folder exists

# Save DataFrame in Parquet format
algorhythm_df.to_parquet(
    DATA_DIR / "algorhythm_fixed.parquet", index=False, schema=schema)

## **Analysis of Results**

- The dataset has 2805 rows and, after cleaning, 25 columns: 11 categorical (10 nominal, 1 ordinal), 4 boolean, and 10 numerical. They describe tracks, artists, albums, charts, and the user.

- `features_vector`, the only column with 100% null values, was removed.

- The chart-related columns are 83–86% null. They were retained in this notebook; whether they stay depends on project requirements and data utility.

- Non-standard representations of missing data (e.g., "nan", "null", "?") were checked across all columns. Only `album_name` had matches (3 values).

- All columns were converted to `category`, `bool`, or `float`. Because `astype("bool")` maps NaN to `True`, `is_liked`, `is_recent_play`, and `is_top_track` report 2805 non-null values after conversion: rows that were missing these flags now read as `True`, which downstream analysis should take into account.

- Four categorical columns have fewer than 10 unique values (`chart_chart_name`, `gender`, `location`, `music_profile`) and are candidates for encoding or review.

- The type-consistent dataset was saved in Parquet format for efficient storage and further modeling.
