# Next Artist Up Predictor (Spotify)
This model uses key components of song data to identify which attributes are most associated with popularity, and from those attributes to identify:
1. the next songs to become popular
2. artists that fit this profile through their music

### Imports
Create a new `.venv` or `conda` environment (Python 3.12) for this project and run `pip install -r requirements.txt` in the main directory.

When finished, select it as your kernel if using VS Code.

In [1]:
# General Math
import numpy as np
import pandas as pd

# Dataset
import kagglehub
from kagglehub import KaggleDatasetAdapter

# Visualization
import seaborn as sbn
sbn.set_style("whitegrid")
from matplotlib import pyplot as plt




### Dataset Loading

In [2]:
# Downloading the latest version of the dataset
path = kagglehub.dataset_download("maharshipandya/-spotify-tracks-dataset")

print("Path to dataset files:", path)


Downloading from https://www.kaggle.com/api/v1/datasets/download/maharshipandya/-spotify-tracks-dataset?dataset_version_number=1...


100%|██████████| 8.17M/8.17M [00:00<00:00, 14.3MB/s]

Extracting files...





Path to dataset files: /Users/carsonmcneill/.cache/kagglehub/datasets/maharshipandya/-spotify-tracks-dataset/versions/1


In [3]:
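# Load the CSV from the downloaded dataset directory
df = pd.read_csv(path + '/dataset.csv')
# Drop the first column, which appears to be a saved row index rather than a feature
df = df.drop(df.columns[0], axis=1)
df.head()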

Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
0,5SuOikwiRyPMVoIQDJUgSV,Gen Hoshino,Comedy,Comedy,73,230666,False,0.676,0.461,1,-6.746,0,0.143,0.0322,1e-06,0.358,0.715,87.917,4,acoustic
1,4qPNDBW1i3p13qLCt0Ki3A,Ben Woodward,Ghost (Acoustic),Ghost - Acoustic,55,149610,False,0.42,0.166,1,-17.235,1,0.0763,0.924,6e-06,0.101,0.267,77.489,4,acoustic
2,1iJBSr7s7jYXzM8EGcbK5b,Ingrid Michaelson;ZAYN,To Begin Again,To Begin Again,57,210826,False,0.438,0.359,0,-9.734,1,0.0557,0.21,0.0,0.117,0.12,76.332,4,acoustic
3,6lfxq3CG4xtTiEg7opyCyx,Kina Grannis,Crazy Rich Asians (Original Motion Picture Sou...,Can't Help Falling In Love,71,201933,False,0.266,0.0596,0,-18.515,1,0.0363,0.905,7.1e-05,0.132,0.143,181.74,3,acoustic
4,5vjLSffimiIP26QG5WcN2K,Chord Overstreet,Hold On,Hold On,82,198853,False,0.618,0.443,2,-9.681,1,0.0526,0.469,0.0,0.0829,0.167,119.949,4,acoustic
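
The `drop` call in the cell above removes the first column, which in this CSV appears to be a saved row index (pandas labels such a column `Unnamed: 0`). As a minimal sketch on a made-up two-row CSV (the `toy_csv` string is hypothetical, not the Kaggle file), passing `index_col=0` to `pd.read_csv` should give the same feature columns in one step:

```python
import io

import pandas as pd

# Hypothetical stand-in for dataset.csv: the first header cell is empty,
# which is what a CSV written with its row index looks like.
toy_csv = ",track_id,popularity\n0,abc,73\n1,def,55\n"

# Approach used in the notebook: read everything, then drop the first column.
dropped = pd.read_csv(io.StringIO(toy_csv))
dropped = dropped.drop(dropped.columns[0], axis=1)

# Alternative: tell pandas that the first column is the row index.
indexed = pd.read_csv(io.StringIO(toy_csv), index_col=0)

print(list(dropped.columns))
print(list(indexed.columns))
```

Either approach leaves only the feature columns; `index_col=0` simply avoids creating the extra column in the first place.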


In [4]:
df.describe()

Unnamed: 0,popularity,duration_ms,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
count,114000.0,114000.0,114000.0,114000.0,114000.0,114000.0,114000.0,114000.0,114000.0,114000.0,114000.0,114000.0,114000.0,114000.0
mean,33.238535,228029.2,0.5668,0.641383,5.30914,-8.25896,0.637553,0.084652,0.31491,0.15605,0.213553,0.474068,122.147837,3.904035
std,22.305078,107297.7,0.173542,0.251529,3.559987,5.029337,0.480709,0.105732,0.332523,0.309555,0.190378,0.259261,29.978197,0.432621
min,0.0,0.0,0.0,0.0,0.0,-49.531,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,17.0,174066.0,0.456,0.472,2.0,-10.013,0.0,0.0359,0.0169,0.0,0.098,0.26,99.21875,4.0
50%,35.0,212906.0,0.58,0.685,5.0,-7.004,1.0,0.0489,0.169,4.2e-05,0.132,0.464,122.017,4.0
75%,50.0,261506.0,0.695,0.854,8.0,-5.003,1.0,0.0845,0.598,0.049,0.273,0.683,140.071,4.0
max,100.0,5237295.0,0.985,1.0,11.0,4.532,1.0,0.965,0.996,1.0,1.0,0.995,243.372,5.0


In [5]:
print(df.keys())

Index(['track_id', 'artists', 'album_name', 'track_name', 'popularity',
       'duration_ms', 'explicit', 'danceability', 'energy', 'key', 'loudness',
       'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness',
       'valence', 'tempo', 'time_signature', 'track_genre'],
      dtype='object')


## Data Cleaning

### Null Checking

To make sure our models run correctly, we check the dataset for null values and drop any rows that contain them.

In [None]:
# Count nulls per column before cleaning
null_amount = df.isnull().sum()
print("ORIGINAL: \n", null_amount)

# Drop every row that contains a null in any column
df = df.dropna()

# Check that no nulls remain
null_amount = df.isnull().sum()
print("FINAL: \n", null_amount)

ORIGINAL: 
 track_id            0
artists             0
album_name          0
track_name          0
popularity          0
duration_ms         0
explicit            0
danceability        0
energy              0
key                 0
loudness            0
mode                0
speechiness         0
acousticness        0
instrumentalness    0
liveness            0
valence             0
tempo               0
time_signature      0
track_genre         0
dtype: int64
FINAL: 
 track_id            0
artists             0
album_name          0
track_name          0
popularity          0
duration_ms         0
explicit            0
danceability        0
energy              0
key                 0
loudness            0
mode                0
speechiness         0
acousticness        0
instrumentalness    0
liveness            0
valence             0
tempo               0
time_signature      0
track_genre         0
dtype: int64


We can see that our dataset is complete and not missing any values. 
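
Because the real dataset shows no nulls, the check above never visibly changes anything. The sketch below runs the same two steps on a small made-up frame (`toy` and its values are hypothetical) so the effect of `isnull().sum()` and `dropna()` is easier to see:

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame with one missing value in each column.
toy = pd.DataFrame({
    "track_name": ["a", "b", None],
    "tempo": [120.0, np.nan, 98.5],
})

# Per-column count of missing values.
null_counts = toy.isnull().sum()

# Keep only the rows that have no missing values in any column.
cleaned = toy.dropna()

print(null_counts)
print(len(cleaned))
```

Note that `dropna()` with no arguments removes a row if any column is null, so two missing cells in different rows cost two rows here.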

### Data Normalization

`duration_ms` ranges from 0 to about 5.24e+06 ms, a far wider scale than the other features. To keep this from distorting our data, we standardize duration along with the other numeric features.
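
The `StandardScaler` step used in the next cell rescales each column to mean 0 and standard deviation 1, that is, z = (x - mean) / std. As a rough illustration, the sketch below applies it to five `duration_ms` values taken from the `describe()` summary above (min, quartiles, and max) and compares the result with the formula computed by hand. The variable names are ours, chosen for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# min, 25%, 50%, 75%, and max of duration_ms from df.describe()
durations = np.array([[0.0], [174066.0], [212906.0], [261506.0], [5237295.0]])

# Manual z-score; np.std defaults to the population standard deviation (ddof=0),
# which is also what StandardScaler uses.
manual = (durations - durations.mean(axis=0)) / durations.std(axis=0)

scaled = StandardScaler().fit_transform(durations)

print(np.allclose(manual, scaled))
```

Standardizing does not remove the outlier: the 5.24e+06 ms track still sits far from the rest, just on a scale comparable to the other features.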

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer

# Numeric features (mostly continuous; key and time_signature are integer-coded)
numeric_features = [
    'duration_ms', 'danceability', 'energy', 'key', 'loudness',
    'speechiness', 'acousticness', 'instrumentalness', 'liveness',
    'valence', 'tempo', 'time_signature'
]

# Binary features (explicit is True/False, mode is 0/1)
binary_features = ['explicit', 'mode']


# Features with set of categories
categorical_features = ['track_genre']

# Process numeric features
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')), # This shouldn't matter because data is complete
    ('scaler', StandardScaler()) # rescales data so mean: 0, sd: 1
])

# Process binary features
binary_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')) # This shouldn't matter because data is complete
])

# Process categorical features
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='Unknown')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])


preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('bin', binary_transformer, binary_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

# Space for ML Algorithms
## Supervised
## Unsupervised