# **Algorhythm**

## Data Collection

This notebook focuses on downloading and collecting the dataset required for the project **_Algorhythm_**. The goal is to gather the necessary data, ensure its integrity, and prepare it for further processing and analysis.

This notebook loads, cleans, and merges music listening and chart data from CSV files to create a unified dataset for a music recommendation system. It removes duplicates, engineers features, labels tracks, and prepares the data for further analysis or modeling.

`Simón Correa Marín`

`Luis Felipe Ospina Giraldo`


### **1. Import Libraries**


In [1]:
# Base libraries for data science
from pathlib import Path
import pandas as pd
from datetime import datetime
import re
import numpy as np
pd.set_option('display.max_columns', None)

### **2. Collect Data**


- **What is the objective of the Algorhythm recommendation problem?**

  The main goal is to build a recommendation system that predicts and suggests songs and albums personalized to each user, based on their listening history, favorite artists, and music preferences. This aims to enhance the user experience by helping them discover music they are likely to enjoy.

  > A key task is to analyze user listening data to anticipate which tracks or albums a user might like and to extract valuable insights into musical habits.

- **How will the solution be used?**

  The solution will function as a smart recommendation engine, delivering tailored music suggestions directly to users on the platform. It can be integrated into web apps or dashboards and updated dynamically as users interact with music.

- **What are the current solutions?**

  Major music streaming services (Spotify, Apple Music, YouTube Music) already use collaborative, content-based, and hybrid recommendation models. Algorhythm stands out by combining real Spotify API data with synthetic user data, enabling flexible experimentation with different algorithms to maximize recommendation quality.


### **Load data - CSV files**


In [2]:
# Define the path to the CSV files directory
csv_dir = Path("../../data/01_raw/csv")
csv_files = list(csv_dir.glob("*.csv"))

# Load each CSV into a variable, using a standardized variable name
for csv_file in csv_files:

    stem = csv_file.stem

    # Standardize variable name:
    # - Insert underscores before capital letters (camelCase to snake_case)
    # - Replace dashes and spaces with underscores
    # - Convert everything to lowercase
    var_name = re.sub(r'(?<!^)(?=[A-Z])', '_', stem)  # camelCase to camel_Case
    var_name = var_name.replace("-", "_").replace(" ", "_").lower()
    
    # Load CSV into a DataFrame assigned to the standardized variable name
    globals()[var_name] = pd.read_csv(csv_file, low_memory=False)
    print(f"Loaded {csv_file.name} as variable '{var_name}' (shape: {globals()[var_name].shape})")

Loaded Artist.csv as variable 'artist' (shape: (748, 4))
Loaded Album.csv as variable 'album' (shape: (1772, 6))
Loaded Track.csv as variable 'track' (shape: (2468, 8))
Loaded UserTrackHistory.csv as variable 'user_track_history' (shape: (3052, 6))
Loaded User.csv as variable 'user' (shape: (1, 5))
Loaded ChartTrack.csv as variable 'chart_track' (shape: (473, 10))
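
As a possible alternative to writing into `globals()`, the same loading step can be sketched with a plain dictionary, which keeps the loaded DataFrames in one place and avoids creating variables dynamically. The helper names `to_snake_case` and `load_csv_dir` are hypothetical and not part of the original notebook; the regular expression is the one used above.

```python
import re
from pathlib import Path

import pandas as pd


def to_snake_case(stem: str) -> str:
    """Convert a file stem such as 'UserTrackHistory' to 'user_track_history'."""
    # Insert an underscore before every capital letter except the first character
    name = re.sub(r"(?<!^)(?=[A-Z])", "_", stem)
    return name.replace("-", "_").replace(" ", "_").lower()


def load_csv_dir(csv_dir: Path) -> dict[str, pd.DataFrame]:
    """Load every CSV in csv_dir into a dict keyed by its snake_case name."""
    return {
        to_snake_case(path.stem): pd.read_csv(path, low_memory=False)
        for path in sorted(csv_dir.glob("*.csv"))
    }
```

With this sketch, `frames = load_csv_dir(Path("../../data/01_raw/csv"))` would give access to each table as `frames["artist"]`, `frames["chart_track"]`, and so on.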


In [3]:
# Display the first few rows of each DataFrame
for var_name in ['artist', 'album', 'track', 'user_track_history', 'user', 'chart_track']:
    print(f"{var_name} df:")
    display(globals()[var_name].head(3))

artist df:


Unnamed: 0,artist_id,name,genres,popularity
0,4tuJ0bMpJh08umKkEXKUI5,Gracie Abrams,,87
1,2sSGPbdZJkaSE2AbcGOACx,The Marías,bedroom pop,86
2,4Uc8Dsxct0oMqx0P6i60ea,Conan Gray,,80


album df:


Unnamed: 0,album_id,name,artist_id,release_date,popularity,genres
0,4XXTsu7r9865VvXdvF2iQP,The Secret of Us,4tuJ0bMpJh08umKkEXKUI5,2024-06-20,0,
1,56bdWeO40o3WfAD2Lja4dl,The Secret of Us,4tuJ0bMpJh08umKkEXKUI5,2024-06-21,0,
2,1Mo4aZ8pdj6L1jx8zSwJnt,THE TORTURED POETS DEPARTMENT,06HL4z0CvFAxyc27GXpf02,2024-04-18,0,


track df:


Unnamed: 0,track_id,name,artist_id,album_id,popularity,genres,features_vector,release_date
0,6nN8W5zHOii0P61I8eSdR3,Free Now,4tuJ0bMpJh08umKkEXKUI5,4XXTsu7r9865VvXdvF2iQP,69,,,2024-06-20
1,51rfRCiUSvxXlCSCfIztBy,"I Love You, I'm Sorry",4tuJ0bMpJh08umKkEXKUI5,56bdWeO40o3WfAD2Lja4dl,90,,,2024-06-21
2,5wbg8kepMFoMzHOEuxiI0q,Close To You,4tuJ0bMpJh08umKkEXKUI5,4XXTsu7r9865VvXdvF2iQP,85,,,2024-06-20


user_track_history df:


Unnamed: 0,user_id,track_id,played_at,is_top_track,is_recent_play,is_liked
0,94aet5tklbier5yaxaofrdl0w,6nN8W5zHOii0P61I8eSdR3,2025-05-28T07:20:47.132908,1.0,0.0,0
1,94aet5tklbier5yaxaofrdl0w,51rfRCiUSvxXlCSCfIztBy,2025-05-28T07:20:47.270594,1.0,0.0,0
2,94aet5tklbier5yaxaofrdl0w,5wbg8kepMFoMzHOEuxiI0q,2025-05-28T07:20:47.386019,1.0,0.0,0


user df:


Unnamed: 0,user_id,age,gender,location,music_profile
0,94aet5tklbier5yaxaofrdl0w,21,male,Colombia,"reggaeton, country, urbano latino, latin pop, ..."


chart_track df:


Unnamed: 0,chart_id,track_id,name,artist_id,album_id,popularity,genres,chart_name,position,added_at
0,4Jb4PDWREzNnbZcOHPcZPy,5IZXB5IKAD2qlvTPJYDCFB,I Had Some Help (Feat. Morgan Wallen),246dkjvS1zLTtiykXe5h60,4BbsHmXEghoPPevQjPnHXx,88,,Today's Top Hits,1,2024-10-23T15:33:22Z
1,4Jb4PDWREzNnbZcOHPcZPy,2uqYupMHANxnwgeiXTZXzd,Austin (Boots Stop Workin'),7Ez6lTtSMjMf2YSYpukP1I,40HsqPqeSR9Xe3IyAJWr6e,88,,Today's Top Hits,2,2024-05-09T22:50:38Z
2,4Jb4PDWREzNnbZcOHPcZPy,3Rfre3qkrhwdZZ7dyznwbN,Lonely Road (with Jelly Roll),6TIYQ3jFPwQSRmorSezPxX,4tU0FNnuiBD1P6IRTARHww,79,,Today's Top Hits,3,2024-10-23T14:37:11Z


In [4]:
# artist info
display("Artist", artist.info())
# album info
display("Album", album.info())
# track info
display("Track", track.info())
# user_track_history info
display("User Track History", user_track_history.info())
# user info
display("User", user.info())
# chart_track info
display("Chart Track", chart_track.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 748 entries, 0 to 747
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   artist_id   748 non-null    object
 1   name        748 non-null    object
 2   genres      396 non-null    object
 3   popularity  748 non-null    int64 
dtypes: int64(1), object(3)
memory usage: 23.5+ KB


'Artist'

None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1772 entries, 0 to 1771
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   album_id      1772 non-null   object 
 1   name          1772 non-null   object 
 2   artist_id     1772 non-null   object 
 3   release_date  1772 non-null   object 
 4   popularity    1772 non-null   int64  
 5   genres        0 non-null      float64
dtypes: float64(1), int64(1), object(4)
memory usage: 83.2+ KB


'Album'

None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2468 entries, 0 to 2467
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   track_id         2468 non-null   object 
 1   name             2468 non-null   object 
 2   artist_id        2468 non-null   object 
 3   album_id         2468 non-null   object 
 4   popularity       2468 non-null   int64  
 5   genres           367 non-null    object 
 6   features_vector  0 non-null      float64
 7   release_date     2468 non-null   object 
dtypes: float64(1), int64(1), object(6)
memory usage: 154.4+ KB


'Track'

None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3052 entries, 0 to 3051
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   user_id         3052 non-null   object 
 1   track_id        3052 non-null   object 
 2   played_at       3052 non-null   object 
 3   is_top_track    2336 non-null   float64
 4   is_recent_play  2336 non-null   float64
 5   is_liked        3052 non-null   int64  
dtypes: float64(2), int64(1), object(3)
memory usage: 143.2+ KB


'User Track History'

None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   user_id        1 non-null      object
 1   age            1 non-null      int64 
 2   gender         1 non-null      object
 3   location       1 non-null      object
 4   music_profile  1 non-null      object
dtypes: int64(1), object(4)
memory usage: 172.0+ bytes


'User'

None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 473 entries, 0 to 472
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   chart_id    473 non-null    object
 1   track_id    473 non-null    object
 2   name        473 non-null    object
 3   artist_id   473 non-null    object
 4   album_id    473 non-null    object
 5   popularity  473 non-null    int64 
 6   genres      387 non-null    object
 7   chart_name  473 non-null    object
 8   position    473 non-null    int64 
 9   added_at    473 non-null    object
dtypes: int64(2), object(8)
memory usage: 37.1+ KB


'Chart Track'

None

In [5]:
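# Keep a copy of the user history rows whose track also appears in the charts
user_tracks_in_chart = user_track_history[user_track_history['track_id'].isin(chart_track['track_id'])]

# Remove those rows from the user history so that user tracks and chart tracks do not overlap
user_track_history = user_track_history[~user_track_history['track_id'].isin(chart_track['track_id'])]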

### **Merge into a flat dataframe**


In [6]:
# Merge user information into the user_track_history DataFrame
user_track = user_track_history.merge(user, on='user_id', how='left')

# Rename columns in the track DataFrame to prevent name conflicts when merging
track_renamed = track.rename(columns={
    'popularity': 'track_popularity',
    'genres': 'track_genres',
    'release_date': 'track_release_date',
    'name': 'track_name'
})

# Merge track details into user_track
user_track = user_track.merge(track_renamed, on='track_id', how='left')

# Rename columns in the artist DataFrame to prevent name conflicts when merging
artist_renamed = artist.rename(columns={
    'name': 'artist_name',
    'genres': 'artist_genres',
    'popularity': 'artist_popularity'
})

# Merge artist details into user_track
user_track = user_track.merge(artist_renamed, on='artist_id', how='left')

# Remove 'artist_id' from album DataFrame before merging to avoid duplication
album_cleaned = album.drop(columns=['artist_id'])

# Rename columns in the album DataFrame to prevent name conflicts when merging
album_renamed = album_cleaned.rename(columns={
    'name': 'album_name',
    'genres': 'album_genres',
    'popularity': 'album_popularity',
    'release_date': 'album_release_date'
})

# Merge album details into user_track
user_track = user_track.merge(album_renamed, on='album_id', how='left')

# Display the shape and the first 3 rows of the final merged DataFrame
display(user_track.shape)
display(user_track.head(3))

(2332, 24)

Unnamed: 0,user_id,track_id,played_at,is_top_track,is_recent_play,is_liked,age,gender,location,music_profile,track_name,artist_id,album_id,track_popularity,track_genres,features_vector,track_release_date,artist_name,artist_genres,artist_popularity,album_name,album_release_date,album_popularity,album_genres
0,94aet5tklbier5yaxaofrdl0w,6nN8W5zHOii0P61I8eSdR3,2025-05-28T07:20:47.132908,1.0,0.0,0,21,male,Colombia,"reggaeton, country, urbano latino, latin pop, ...",Free Now,4tuJ0bMpJh08umKkEXKUI5,4XXTsu7r9865VvXdvF2iQP,69,,,2024-06-20,Gracie Abrams,,87,The Secret of Us,2024-06-20,0,
1,94aet5tklbier5yaxaofrdl0w,51rfRCiUSvxXlCSCfIztBy,2025-05-28T07:20:47.270594,1.0,0.0,0,21,male,Colombia,"reggaeton, country, urbano latino, latin pop, ...","I Love You, I'm Sorry",4tuJ0bMpJh08umKkEXKUI5,56bdWeO40o3WfAD2Lja4dl,90,,,2024-06-21,Gracie Abrams,,87,The Secret of Us,2024-06-21,0,
2,94aet5tklbier5yaxaofrdl0w,5wbg8kepMFoMzHOEuxiI0q,2025-05-28T07:20:47.386019,1.0,0.0,0,21,male,Colombia,"reggaeton, country, urbano latino, latin pop, ...",Close To You,4tuJ0bMpJh08umKkEXKUI5,4XXTsu7r9865VvXdvF2iQP,85,,,2024-06-20,Gracie Abrams,,87,The Secret of Us,2024-06-20,0,
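
The chained left joins above assume that each lookup table has one row per key. A minimal sketch of how that assumption could be checked uses the `validate` and `indicator` parameters of `DataFrame.merge`. The toy frames and their values are made up for illustration and do not come from the project data.

```python
import pandas as pd

# Toy stand-ins for user_track_history and track (values are invented)
history = pd.DataFrame({"track_id": ["t1", "t2", "t3"], "is_liked": [0, 1, 0]})
tracks = pd.DataFrame({"track_id": ["t1", "t2"], "track_name": ["A", "B"]})

merged = history.merge(
    tracks,
    on="track_id",
    how="left",
    validate="many_to_one",  # raises MergeError if track_id repeats in tracks
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# A left join against a unique right key never changes the row count
assert len(merged) == len(history)

# Rows without a match in the lookup table keep NaN in the joined columns
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched["track_id"].tolist())
```

If a lookup table did contain repeated keys, the merge would fail early instead of silently multiplying rows.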


In [7]:
# 1. Add 'is_recommended' column: 1 for user tracks, 0 for chart tracks
user_track['is_recommended'] = 1
chart_track['is_recommended'] = 0

# 2. Add the prefix 'chart_' to all columns in chart_track except 'is_recommended'
cols_to_rename = {col: f"chart_{col}" for col in chart_track.columns if col != 'is_recommended'}
chart_track = chart_track.rename(columns=cols_to_rename)

# 3. Make sure both DataFrames have the same columns before concatenation
all_cols = sorted(set(user_track.columns).union(set(chart_track.columns)))
user_track = user_track.reindex(columns=all_cols)
chart_track = chart_track.reindex(columns=all_cols)

# 4. Concatenate the DataFrames
combined_tracks = pd.concat([user_track, chart_track], ignore_index=True)

# 5. Show results
display(combined_tracks.shape)
display(combined_tracks.head(3))
display(combined_tracks['is_recommended'].value_counts())

(2805, 35)

Unnamed: 0,age,album_genres,album_id,album_name,album_popularity,album_release_date,artist_genres,artist_id,artist_name,artist_popularity,chart_added_at,chart_album_id,chart_artist_id,chart_chart_id,chart_chart_name,chart_genres,chart_name,chart_popularity,chart_position,chart_track_id,features_vector,gender,is_liked,is_recent_play,is_recommended,is_top_track,location,music_profile,played_at,track_genres,track_id,track_name,track_popularity,track_release_date,user_id
0,21.0,,4XXTsu7r9865VvXdvF2iQP,The Secret of Us,0.0,2024-06-20,,4tuJ0bMpJh08umKkEXKUI5,Gracie Abrams,87.0,,,,,,,,,,,,male,0.0,0.0,1,1.0,Colombia,"reggaeton, country, urbano latino, latin pop, ...",2025-05-28T07:20:47.132908,,6nN8W5zHOii0P61I8eSdR3,Free Now,69.0,2024-06-20,94aet5tklbier5yaxaofrdl0w
1,21.0,,56bdWeO40o3WfAD2Lja4dl,The Secret of Us,0.0,2024-06-21,,4tuJ0bMpJh08umKkEXKUI5,Gracie Abrams,87.0,,,,,,,,,,,,male,0.0,0.0,1,1.0,Colombia,"reggaeton, country, urbano latino, latin pop, ...",2025-05-28T07:20:47.270594,,51rfRCiUSvxXlCSCfIztBy,"I Love You, I'm Sorry",90.0,2024-06-21,94aet5tklbier5yaxaofrdl0w
2,21.0,,4XXTsu7r9865VvXdvF2iQP,The Secret of Us,0.0,2024-06-20,,4tuJ0bMpJh08umKkEXKUI5,Gracie Abrams,87.0,,,,,,,,,,,,male,0.0,0.0,1,1.0,Colombia,"reggaeton, country, urbano latino, latin pop, ...",2025-05-28T07:20:47.386019,,5wbg8kepMFoMzHOEuxiI0q,Close To You,85.0,2024-06-20,94aet5tklbier5yaxaofrdl0w


is_recommended
1    2332
0     473
Name: count, dtype: int64
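
The labeling and stacking step can be illustrated on a small scale. The sketch below uses invented toy frames and relies on `DataFrame.add_prefix` and on the column alignment that `pd.concat` performs by itself, so it is a simplified variant of the cell above rather than a copy of it.

```python
import pandas as pd

# Toy stand-ins for user_track and chart_track (values are invented)
user_rows = pd.DataFrame({"track_name": ["A", "B"], "is_liked": [1, 0]})
chart_rows = pd.DataFrame({"name": ["C"], "position": [1]})

# Label the origin of each row: 1 = user history, 0 = chart
user_rows["is_recommended"] = 1
chart_rows = chart_rows.add_prefix("chart_")
chart_rows["is_recommended"] = 0

# concat aligns on column names and fills the non-shared columns with NaN
combined = pd.concat([user_rows, chart_rows], ignore_index=True, sort=False)
print(combined.dtypes)
```

Because the chart rows have no value for `is_liked`, that column is filled with NaN and becomes a float column. The same effect explains why integer columns such as `age` appear as `21.0` in the combined output above.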

In [8]:
# For each column, percentage of null values in combined_tracks
for col in combined_tracks.columns:
    null_pct = 100 * combined_tracks[col].isnull().mean()
    print(f"{col}: {null_pct:.2f}% null values")

age: 16.86% null values
album_genres: 100.00% null values
album_id: 16.86% null values
album_name: 16.86% null values
album_popularity: 16.86% null values
album_release_date: 16.86% null values
artist_genres: 60.14% null values
artist_id: 16.86% null values
artist_name: 16.86% null values
artist_popularity: 16.86% null values
chart_added_at: 83.14% null values
chart_album_id: 83.14% null values
chart_artist_id: 83.14% null values
chart_chart_id: 83.14% null values
chart_chart_name: 83.14% null values
chart_genres: 86.20% null values
chart_name: 83.14% null values
chart_popularity: 83.14% null values
chart_position: 83.14% null values
chart_track_id: 83.14% null values
features_vector: 100.00% null values
gender: 16.86% null values
is_liked: 16.86% null values
is_recent_play: 16.86% null values
is_recommended: 0.00% null values
is_top_track: 16.86% null values
location: 16.86% null values
music_profile: 16.86% null values
played_at: 16.86% null values
track_genres: 100.00% null values
... (output truncated)
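
The per-column loop above can also be written as a single vectorized expression, which makes it easy to sort the columns by missingness. This is a minimal sketch on an invented toy frame; the column names `a`, `b`, and `c` are placeholders.

```python
import pandas as pd

# Toy frame: 'a' is half empty, 'b' is complete, 'c' is entirely missing
df = pd.DataFrame({"a": [1, None, 3, None], "b": [1, 2, 3, 4], "c": [None] * 4})

# Share of nulls per column, as a percentage, highest first
null_pct = df.isna().mean().mul(100).round(2).sort_values(ascending=False)

# Columns with no usable values at all
fully_empty = null_pct[null_pct == 100].index.tolist()
print(null_pct)
```

Applied to `combined_tracks`, the same expression would list the entirely empty columns first, which can help when deciding what to drop.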

### **Drop unnecessary columns**


In [9]:
# Identifier columns and fully empty genre columns that are not needed downstream
cols_to_remove = [
    'album_id', 'artist_id', 'track_id',
    'album_genres', 'track_genres', 'chart_album_id', 'chart_artist_id',
    'chart_track_id', 'chart_chart_id', 'user_id'
]

# Keep only the columns that are actually present in the DataFrame
cols_to_remove_filtered = [col for col in cols_to_remove if col in combined_tracks.columns]

# Drop the unnecessary columns
combined_tracks_processed = combined_tracks.drop(columns=cols_to_remove_filtered)

# Rename 'chart_name' to 'chart_track_name' for clarity
combined_tracks_processed = combined_tracks_processed.rename(columns={'chart_name': 'chart_track_name'})

### **Parse datetime features and create new ones**


In [10]:
# Reference date for the age features: the current local time as a naive datetime
# (datetime.now() carries no timezone, so the derived ages depend on the day the notebook is run)
ref_date = datetime.now()

# Ensure 'chart_added_at' is in datetime format and remove timezone info
combined_tracks_processed['chart_added_at'] = pd.to_datetime(
    combined_tracks_processed['chart_added_at'], errors='coerce'
).dt.tz_localize(None)

# Convert 'album_release_date' to datetime and create 'album_age_days' feature
# This variable indicates the number of days since the album release up to today
combined_tracks_processed['album_release_date'] = pd.to_datetime(
    combined_tracks_processed['album_release_date'], errors='coerce'
)
combined_tracks_processed['album_age_days'] = (
    ref_date - combined_tracks_processed['album_release_date']
).dt.days

# Create 'chart_age_days': days since the track was added to the chart
combined_tracks_processed['chart_age_days'] = (
    ref_date - combined_tracks_processed['chart_added_at']
).dt.days

# Convert 'track_release_date' to datetime and create 'track_age_days'
# This shows days since the song was released
combined_tracks_processed['track_release_date'] = pd.to_datetime(
    combined_tracks_processed['track_release_date'], errors='coerce'
)
combined_tracks_processed['track_age_days'] = (
    ref_date - combined_tracks_processed['track_release_date']
).dt.days

# Convert 'played_at' to datetime and extract useful features: day of week and hour
combined_tracks_processed['played_at'] = pd.to_datetime(
    combined_tracks_processed['played_at'], errors='coerce'
)
combined_tracks_processed['played_day_of_week'] = combined_tracks_processed['played_at'].dt.dayofweek
combined_tracks_processed['played_hour'] = combined_tracks_processed['played_at'].dt.hour

# Drop the original date columns after generating new features
combined_tracks_processed = combined_tracks_processed.drop(
    columns=['album_release_date', 'chart_added_at', 'track_release_date', 'played_at']
)
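
Since the age features above depend on the day the notebook is run, a reproducible variant could pin the reference date. The sketch below is an illustration under that assumption: `REF_DATE` is a hypothetical constant, the toy values are invented apart from following the formats seen in the data, and `utc=True` is used here to parse the `Z`-suffixed chart timestamps explicitly.

```python
import pandas as pd

# Hypothetical fixed reference date; the notebook uses datetime.now() instead
REF_DATE = pd.Timestamp("2025-06-01")

df = pd.DataFrame({
    "album_release_date": ["2024-06-20", "not a date", None],
    "chart_added_at": ["2024-10-23T15:33:22Z", None, "2024-05-09T22:50:38Z"],
})

# Unparsable or missing values become NaT instead of raising an error
release = pd.to_datetime(df["album_release_date"], errors="coerce")

# Parse as UTC, then drop the timezone so it can be compared with a naive date
added = pd.to_datetime(df["chart_added_at"], errors="coerce", utc=True).dt.tz_localize(None)

# NaT propagates as NaN, so the age columns end up as floats
df["album_age_days"] = (REF_DATE - release).dt.days
df["chart_age_days"] = (REF_DATE - added).dt.days
print(df[["album_age_days", "chart_age_days"]])
```

With a fixed reference date, rerunning the notebook later produces the same feature values, which makes results easier to compare across runs.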

In [11]:
# Reorder columns to have 'is_recommended' as the last column
cols = list(combined_tracks_processed.columns)
if 'is_recommended' in cols:
    cols.remove('is_recommended')
    cols.append('is_recommended')  # Place 'is_recommended' at the end

combined_tracks_processed = combined_tracks_processed[cols]

# Display the first 10 rows and the final column order
display(combined_tracks_processed.head(10))
display("Final columns:", combined_tracks_processed.columns)

Unnamed: 0,age,album_name,album_popularity,artist_genres,artist_name,artist_popularity,chart_chart_name,chart_genres,chart_track_name,chart_popularity,chart_position,features_vector,gender,is_liked,is_recent_play,is_top_track,location,music_profile,track_name,track_popularity,album_age_days,chart_age_days,track_age_days,played_day_of_week,played_hour,is_recommended
0,21.0,The Secret of Us,0.0,,Gracie Abrams,87.0,,,,,,,male,0.0,0.0,1.0,Colombia,"reggaeton, country, urbano latino, latin pop, ...",Free Now,69.0,347.0,,347.0,2.0,7.0,1
1,21.0,The Secret of Us,0.0,,Gracie Abrams,87.0,,,,,,,male,0.0,0.0,1.0,Colombia,"reggaeton, country, urbano latino, latin pop, ...","I Love You, I'm Sorry",90.0,346.0,,346.0,2.0,7.0,1
2,21.0,The Secret of Us,0.0,,Gracie Abrams,87.0,,,,,,,male,0.0,0.0,1.0,Colombia,"reggaeton, country, urbano latino, latin pop, ...",Close To You,85.0,347.0,,347.0,2.0,7.0,1
3,21.0,The Secret of Us,0.0,,Gracie Abrams,87.0,,,,,,,male,0.0,0.0,1.0,Colombia,"reggaeton, country, urbano latino, latin pop, ...",Risk,80.0,347.0,,347.0,2.0,7.0,1
4,21.0,THE TORTURED POETS DEPARTMENT,0.0,,Taylor Swift,98.0,,,,,,,male,0.0,0.0,1.0,Colombia,"reggaeton, country, urbano latino, latin pop, ...",My Boy Only Breaks His Favorite Toys,72.0,410.0,,410.0,2.0,7.0,1
5,21.0,Submarine,0.0,bedroom pop,The Marías,86.0,,,,,,,male,0.0,0.0,1.0,Colombia,"reggaeton, country, urbano latino, latin pop, ...",No One Noticed,93.0,367.0,,367.0,2.0,7.0,1
6,21.0,The Secret of Us (Deluxe),0.0,,Gracie Abrams,87.0,,,,,,,male,0.0,0.0,1.0,Colombia,"reggaeton, country, urbano latino, latin pop, ...",That’s So True,97.0,227.0,,227.0,2.0,7.0,1
7,21.0,Submarine,0.0,bedroom pop,The Marías,86.0,,,,,,,male,0.0,0.0,1.0,Colombia,"reggaeton, country, urbano latino, latin pop, ...",Ride,61.0,367.0,,367.0,2.0,7.0,1
8,21.0,Kid Krow,0.0,,Conan Gray,80.0,,,,,,,male,0.0,0.0,1.0,Colombia,"reggaeton, country, urbano latino, latin pop, ...",The Cut That Always Bleeds,87.0,1900.0,,1900.0,2.0,7.0,1
9,21.0,Red (Taylor's Version),0.0,,Taylor Swift,98.0,,,,,,,male,0.0,0.0,1.0,Colombia,"reggaeton, country, urbano latino, latin pop, ...",All Too Well (10 Minute Version) (Taylor's Ver...,82.0,1298.0,,1298.0,2.0,7.0,1


'Final columns:'

Index(['age', 'album_name', 'album_popularity', 'artist_genres', 'artist_name',
       'artist_popularity', 'chart_chart_name', 'chart_genres',
       'chart_track_name', 'chart_popularity', 'chart_position',
       'features_vector', 'gender', 'is_liked', 'is_recent_play',
       'is_top_track', 'location', 'music_profile', 'track_name',
       'track_popularity', 'album_age_days', 'chart_age_days',
       'track_age_days', 'played_day_of_week', 'played_hour',
       'is_recommended'],
      dtype='object')

In [12]:
print("=== Quick Data Quality Overview ===\n")

df = combined_tracks_processed  # Adjust if your DataFrame is named differently

for col in df.columns:
    print(f"\033[1m{col}\033[0m")  # Bold column name for readability

    # Nulls
    pct_nulls = df[col].isnull().mean() * 100
    print(f"  Null values: {pct_nulls:.2f}%")
    
    # Duplicates (for likely identifier/text columns)
    if df[col].dtype == 'object' or 'name' in col or 'genres' in col:
        n_dups = df[col].duplicated().sum()
        print(f"  Duplicates: {n_dups}")
    
    # Outliers / Non-binary checks
    if col in ['age']:
        print(f"  Out-of-range (age <10 or >90): {((df[col]<10)|(df[col]>90)).sum()}")
    if col.endswith('popularity') or col == 'chart_position':
        print(f"  Out-of-range (<0 or >100): {((df[col]<0)|(df[col]>100)).sum()}")
    if col in ['is_liked', 'is_recent_play', 'is_top_track', 'is_recommended']:
        print(f"  Non-binary (not 0 or 1): {(~df[col].isin([0,1,True,False])).sum()}")
    if col in ['album_release_date', 'track_release_date', 'chart_added_at', 'played_at']:
        invalid_dates = df[col].isnull().sum()
        print(f"  Invalid dates (unparsable): {invalid_dates}")
    if col == 'features_vector':
        malformed = df[col].apply(lambda x: isinstance(x, str) and not x.startswith('[') if pd.notnull(x) else False).sum()
        print(f"  Malformed vectors: {malformed}")
    if col == 'gender':
        print(f"  Unexpected gender values: {df[~df[col].isin(['male', 'female', 'other', None, np.nan])][col].unique()}")
    print("-" * 40)

=== Quick Data Quality Overview ===

age
  Null values: 16.86%
  Out-of-range (age <10 or >90): 0
----------------------------------------
album_name
  Null values: 16.86%
  Duplicates: 1478
----------------------------------------
album_popularity
  Null values: 16.86%
  Out-of-range (<0 or >100): 0
----------------------------------------
artist_genres
  Null values: 60.14%
  Duplicates: 2633
----------------------------------------
artist_name
  Null values: 16.86%
  Duplicates: 2196
----------------------------------------
artist_popularity
  Null values: 16.86%
  Out-of-range (<0 or >100): 0
----------------------------------------
chart_chart_name
  Null values: 83.14%
  Duplicates: 2801
----------------------------------------
chart_genres
  Null values: 86.20%
  Duplicates: 2739
----------------------------------------
chart_track_name
  Null values: 83.14%
  Duplicates: 2364
----------------------------------------

#### **Dataset Variables and first detected anomalies**


### **Data Quality Overview: Key Anomalies Detected**

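- **Null values:** Most user-side columns have about 17% nulls, which corresponds to the chart rows; chart-related columns have more than 80% nulls, which corresponds to the user rows; `features_vector` is entirely missing.
- **Duplicates:** Textual fields (e.g., artist, album, chart, location) contain many repeated values. This is expected for categorical attributes and does not by itself indicate duplicate rows.
- **Invalid dates:** The raw date columns (`album_release_date`, `chart_added_at`, `played_at`, `track_release_date`) are dropped once the derived features are created, so the date check in the overview does not run on them. Missing or unparsable dates surface as nulls in `album_age_days`, `chart_age_days`, `track_age_days`, `played_day_of_week`, and `played_hour`.
- **Out-of-range values:** Some chart positions are outside the 0–100 expected range; all popularity and age columns are within valid ranges.
- **Non-binary values:** Binary indicator columns (`is_liked`, `is_recent_play`, `is_top_track`) are flagged for values other than 0 or 1. Because the check also counts nulls, these flags include the chart rows, which carry no user interaction data.
- **Malformed/missing vectors:** `features_vector` is missing in all rows.
- **Unexpected categories:** No unexpected values detected for gender.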

> **Conclusion:**  
> The dataset has substantial missing data, many duplicates in categorical/text fields, and several columns with potential format issues. Careful data cleaning, imputation, and validation are required before analysis or modeling.
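
To separate genuinely invalid indicator values from plain missing ones, the binary check could report the two counts independently. The following is a minimal sketch; `binary_report` is a hypothetical helper and the sample values are invented.

```python
import numpy as np
import pandas as pd


def binary_report(series: pd.Series) -> dict[str, int]:
    """Count missing values separately from values that are neither 0 nor 1."""
    missing = series.isna()
    # isin() returns False for NaN, so exclude missing values explicitly
    invalid = ~series.isin([0, 1]) & ~missing
    return {"missing": int(missing.sum()), "invalid": int(invalid.sum())}


# Toy indicator column with two nulls and one out-of-domain value
flags = pd.Series([1.0, 0.0, np.nan, 2.0, np.nan])
report = binary_report(flags)
print(report)
```

Run on columns such as `is_liked`, a report of this kind would show whether the flagged rows are only the chart rows with no interaction data or whether some values are actually outside {0, 1}.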


### **Feature Descriptions**

- **age:**  
  User’s age (in years).

- **album_name:**  
  Name of the album the track belongs to.

- **album_popularity:**  
  Popularity score of the album, as provided by Spotify (range: 0–100).

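- **album_release_date:**  
  Release date of the album. It is replaced in the final dataset by `album_age_days`, the number of days between the release and the reference date.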

- **artist_genres:**  
  List of musical genres associated with the main artist.

- **artist_name:**  
  Name of the main artist performing the track.

- **artist_popularity:**  
  Popularity score of the artist, as provided by Spotify (range: 0–100).

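- **chart_added_at:**  
  Date the track was added to a music chart. It is replaced in the final dataset by `chart_age_days`, the number of days between that date and the reference date.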

- **chart_chart_name:**  
  Name of the chart (e.g., "Top 50 Global") where the track appeared.

- **chart_genres:**  
  Genres commonly represented in the chart.

- **chart_track_name:**  
  Name of the track as listed in the chart.

- **chart_popularity:**  
  Popularity score of the track within the chart context (range: 0–100).

- **chart_position:**  
  Position or rank of the track on the chart (lower = higher rank).

- **features_vector:**  
  Numerical vector of audio features (e.g., danceability, energy, tempo).

- **gender:**  
  User’s gender identity (e.g., Male, Female, Other).

- **is_liked:**  
  Whether the user has liked/saved the track (1 = yes, 0 = no).

- **is_recent_play:**  
  Indicates if the track was played recently by the user (1 = yes, 0 = no).

- **is_top_track:**  
  Indicates if the track is among the user's top tracks (1 = yes, 0 = no).

- **location:**  
  User’s location (could be city, region, or country).

- **music_profile:**  
  User’s self-reported music preferences or listening profile.

- **played_at:**  
  Timestamp of when the user played the track. It is replaced in the final dataset by `played_day_of_week` (0 = Monday) and `played_hour`.

- **track_name:**  
  Name of the track.

- **track_popularity:**  
  Popularity score of the track (range: 0–100).

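- **track_release_date:**  
  Release date of the track. It is replaced in the final dataset by `track_age_days`, the number of days between the release and the reference date.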

- **is_recommended:**  
  Label assigned when the datasets are combined: 1 for tracks from the user's listening history, 0 for tracks that come only from the charts. It serves as the target variable for training.


### **3. Data Information**


- **How should this problem be framed?**

  This is a supervised learning (classification) problem: we have a labeled dataset where the target variable (is_recommended) indicates whether a track was recommended or not for a given user. The model will be trained offline using historical user interaction data and then deployed to make real-time recommendations.

- **How should the solution’s performance be measured, as an initial intuition?**

  Performance will be evaluated using metrics common in recommender systems and classification tasks, such as accuracy, precision, recall, f1-score, and ROC-AUC. In recommendation scenarios, precision (fraction of recommended tracks that the user actually liked or interacted with) and recall (fraction of all relevant tracks that were recommended) are especially important. Additional metrics like Hit Rate@K or Mean Average Precision (MAP) can also be considered for ranked recommendations.

- **Is the performance metric aligned with the problem’s objective?**

  Yes, the chosen metrics directly assess how well the system recommends tracks that users are likely to enjoy, aligning with the project’s goal of delivering relevant and personalized music recommendations.

- **What is the minimum performance required to achieve the problem’s objective?**

  While perfect performance is unrealistic, a model with precision and recall above 70% would already provide clear value for music discovery, provided that its recommendations are validated on real or realistic user data. In deployed scenarios, a positive user response (e.g., skip rate decrease, increased listen duration) is a strong indicator of success.

- **What are similar problems? Can existing experiences or tools be reused?**

  Similar problems include movie or product recommendation systems (e.g., Netflix, Amazon) and personalized news feeds. Existing collaborative filtering, content-based filtering, and hybrid recommender techniques—many of which are implemented in libraries like Surprise, LightFM, or implicit—can be leveraged and adapted for this context.

- **Is there existing experience with this problem?**

  Yes, recommendation systems are a mature research area with extensive literature and open-source implementations. Many approaches from academic and industry applications (e.g., Spotify, YouTube) can inform the solution design for Algorhythm.

- **How can the problem be solved manually?**

  A human curator could analyze a user’s listening history, favorite genres/artists, and behavioral patterns to select new tracks or albums they are likely to enjoy, possibly referencing popular charts or similar users’ preferences. However, this process is subjective, time-consuming, and does not scale—hence the need for automation via machine learning.
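
The evaluation metrics mentioned above (precision, recall, and Hit Rate@K) can be sketched in plain Python. The function names and the sample track identifiers are hypothetical, and the sketch assumes that each recommendation list contains no repeated items.

```python
def precision_recall(recommended: list[str], relevant: set[str]) -> tuple[float, float]:
    """Precision and recall of one recommendation list against the relevant set."""
    hits = sum(1 for item in recommended if item in relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def hit_rate_at_k(
    ranked_per_user: dict[str, list[str]],
    relevant_per_user: dict[str, set[str]],
    k: int,
) -> float:
    """Share of users with at least one relevant item in their top-k list."""
    if not ranked_per_user:
        return 0.0
    hits = sum(
        1
        for user, ranked in ranked_per_user.items()
        if set(ranked[:k]) & relevant_per_user.get(user, set())
    )
    return hits / len(ranked_per_user)


# Invented example: 2 of 4 recommendations are relevant, 2 of 3 relevant items are found
precision, recall = precision_recall(["a", "b", "c", "d"], {"a", "c", "e"})

ranked = {"u1": ["a", "b", "c"], "u2": ["x", "y"]}
relevant = {"u1": {"c"}, "u2": {"x"}}
rate_at_2 = hit_rate_at_k(ranked, relevant, k=2)
```

Precision and recall score a single list, while Hit Rate@K aggregates over users, so it becomes informative only once the dataset contains more than one user.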


In [13]:
# Which columns are going to be used in the project?
combined_tracks_processed.columns

Index(['age', 'album_name', 'album_popularity', 'artist_genres', 'artist_name',
       'artist_popularity', 'chart_chart_name', 'chart_genres',
       'chart_track_name', 'chart_popularity', 'chart_position',
       'features_vector', 'gender', 'is_liked', 'is_recent_play',
       'is_top_track', 'location', 'music_profile', 'track_name',
       'track_popularity', 'album_age_days', 'chart_age_days',
       'track_age_days', 'played_day_of_week', 'played_hour',
       'is_recommended'],
      dtype='object')

**Current Assumptions**

- The available data (user interaction logs, Spotify metadata, and synthetic user profiles) is representative of the target user population.
- The selected features (e.g., age, genres, artist/album/track popularity, recent plays, and user profile information) are informative and relevant for predicting music preferences.
- There are no major biases in data collection or simulation that would affect model fairness or performance (e.g., no genre, artist, or demographic is systematically over- or under-represented).
- The trained recommendation model will generalize well to new users and unseen tracks, provided their behavior and attributes are similar to those in the training data.


### **4. Data Download**


In [14]:
algorhythm_df = combined_tracks_processed.copy()

# Set the output folder as a Path object (relative to the notebook's location)
DATA_DIR = Path("../../data/01_raw")

# Build the file path
file_path = DATA_DIR / "algorhythm_final.csv"

# Create the folder if it does not exist
DATA_DIR.mkdir(parents=True, exist_ok=True)

# Save the DataFrame as CSV
algorhythm_df.to_csv(file_path, index=False)
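
A quick way to confirm that the export is intact is to read the file back and compare it with the DataFrame in memory. The sketch below writes to a temporary directory and uses an invented toy frame, so it does not touch the project's data folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for algorhythm_df (values are invented)
df = pd.DataFrame({
    "track_name": ["A", "B"],
    "track_popularity": [69.0, 90.0],
    "is_recommended": [1, 0],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "algorhythm_final.csv"
    # index=False keeps the row index out of the file
    df.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# Raises AssertionError if columns, dtypes, or values differ
pd.testing.assert_frame_equal(df, reloaded)
```

For the real dataset, columns that are entirely empty or that mix types may come back with a different dtype, so a stricter comparison might need explicit `dtype` arguments in `pd.read_csv`.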