In [1]:
import numpy as np
import pandas as pd
df = pd.read_csv('/data.csv')
df.head()

Unnamed: 0,name,artists,id,popularity,release_date,year,acousticness,danceability,duration_ms,energy,explicit,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,valence
0,Keep A Song In Your Soul,['Mamie Smith'],0cS0A1fUEUd1EW3FcF8AEI,12,1920,1920,0.991,0.598,168333,0.224,0,0.000522,5,0.379,-12.628,0,0.0936,149.976,0.634
1,I Put A Spell On You,"[""Screamin' Jay Hawkins""]",0hbkKFIJm7Z05H8Zl9w30f,7,5/1/1920,1920,0.643,0.852,150200,0.517,0,0.0264,5,0.0809,-7.261,0,0.0534,86.889,0.95
2,Golfing Papa,['Mamie Smith'],11m7laMUgmOKqI3oYzuhne,4,1920,1920,0.993,0.647,163827,0.186,0,1.8e-05,0,0.519,-12.098,1,0.174,97.6,0.689
3,True House Music - Xavier Santos & Carlos Gomi...,['Oscar Velazquez'],19Lc5SfJJ5O1oaxY0fpwfh,17,1/1/1920,1920,0.000173,0.73,422087,0.798,0,0.801,2,0.128,-7.311,1,0.0425,127.997,0.0422
4,Xuniverxe,['Mixe'],2hJjbsLCytGsnAHfdsLejp,2,1/10/1920,1920,0.295,0.704,165224,0.707,1,0.000246,10,0.402,-6.036,0,0.0768,122.076,0.299


In [2]:
df.dtypes

name                 object
artists              object
id                   object
popularity            int64
release_date         object
year                  int64
acousticness        float64
danceability        float64
duration_ms           int64
energy              float64
explicit              int64
instrumentalness    float64
key                   int64
liveness            float64
loudness            float64
mode                  int64
speechiness         float64
tempo               float64
valence             float64
dtype: object
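
The `artists` column is typed `object`, and the rows shown by `df.head()` suggest it holds string representations of Python lists (for example `['Mamie Smith']`). A minimal sketch of turning those strings into real lists follows; the `sample` frame and the `parse_artists` helper are hypothetical names introduced here for illustration, and the fallback behavior is an assumption rather than part of the original notebook.

```python
import ast

import pandas as pd

# Hypothetical sample mirroring the 'artists' strings shown in df.head().
sample = pd.DataFrame({
    "artists": ["['Mamie Smith']", "[\"Screamin' Jay Hawkins\"]"],
})


def parse_artists(value):
    """Turn a string such as "['Mamie Smith']" into a list of artist names."""
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError, TypeError):
        # Assumption: anything that is not a valid literal is kept as one name.
        return [value]
    return list(parsed) if isinstance(parsed, (list, tuple)) else [parsed]


sample["artist_list"] = sample["artists"].apply(parse_artists)
print(sample["artist_list"].tolist())
```

`ast.literal_eval` is used instead of `eval` because it only evaluates literals, so a malformed cell cannot execute arbitrary code.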

Column Descriptions:

1. Name: Title of the track or song.

2. Artists: The name of the artist or artists performing the track.

3. ID: The Spotify-specific identifier for the track.

4. Popularity: A score between 0 and 100 that quantifies the track's popularity, based on the total number of plays and how recent those plays are. Tracks played more frequently in the recent past score higher.

5. Release Date: The date on which the track was officially released.

6. Year: The year in which the track was released.

7. Acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic; 1.0 represents high confidence that it is.

8. Danceability: A measure from 0.0 to 1.0 of how suitable a track is for dancing, based on a combination of musical elements such as tempo, rhythm stability, beat strength, and overall regularity. Higher values indicate greater suitability for dancing.

9. Duration_ms: The duration of the track in milliseconds.

10. Energy: A perceptual measure of intensity and activity from 0.0 to 1.0. Energetic tracks typically feel fast, loud, and noisy, so higher values indicate more intense music.

11. Explicit: Indicates whether the track contains explicit lyrics (1 = yes; 0 = no or unknown).

12. Instrumentalness: A measure from 0.0 to 1.0 that predicts whether a track contains no vocal content. Values closer to 1.0 indicate a greater likelihood that the track is instrumental.

13. Key: The key the track is in, using standard Pitch Class notation where integers map to pitches (e.g., 0 = C, 1 = C♯/D♭, etc.). A value of -1 indicates that no key was detected.

14. Liveness: Detects the presence of an audience in the recording. Higher values indicate a higher probability that the track was performed live.

15. Loudness: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are used to compare the relative loudness of tracks. Typical values range from -60 to 0 dB.

16. Mode: Indicates the modality (major or minor) of a track, derived from the type of scale from which its melodic content is constructed. Major is denoted by 1 and minor by 0.

17. Speechiness: Measures the presence of spoken words in a track on a scale from 0.0 to 1.0. Values above 0.66 typically indicate tracks that are probably made entirely of spoken words; values between 0.33 and 0.66 indicate tracks that may contain both music and speech; values below 0.33 most likely represent music and other non-speech-like tracks.

18. Tempo: The overall estimated tempo of the track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece.

19. Valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g., happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g., sad, depressed, angry).
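
Several of the encoded columns described above (key, mode, speechiness, duration_ms) can be decoded into readable labels. The sketch below is illustrative only: the helper names `describe_key`, `speechiness_band`, and `format_duration`, as well as the label strings, are assumptions introduced here, and the thresholds are taken directly from the descriptions in this list.

```python
# Pitch Class notation: index 0 is C, index 11 is B.
PITCH_CLASSES = ["C", "C♯/D♭", "D", "D♯/E♭", "E", "F",
                 "F♯/G♭", "G", "G♯/A♭", "A", "A♯/B♭", "B"]


def describe_key(key, mode):
    """Combine the key and mode columns into a label such as 'F minor'."""
    if key == -1:
        return "unknown"  # -1 means no key was detected
    scale = "major" if mode == 1 else "minor"
    return f"{PITCH_CLASSES[key]} {scale}"


def speechiness_band(value):
    """Bucket speechiness using the 0.33 and 0.66 thresholds quoted above."""
    if value > 0.66:
        return "mostly speech"
    if value >= 0.33:
        return "music and speech"
    return "mostly music"


def format_duration(duration_ms):
    """Convert milliseconds to a 'minutes:seconds' string."""
    minutes, seconds = divmod(duration_ms // 1000, 60)
    return f"{minutes}:{seconds:02d}"


# Values taken from the first row shown by df.head().
print(describe_key(5, 0), speechiness_band(0.0936), format_duration(168333))
```

Applied to a DataFrame, helpers like these could be mapped over the corresponding columns, for example with `Series.map` or `DataFrame.apply`.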

References: Spotify API Documentation, https://developer.spotify.com/documentation/web-api

Data Prep and Cleaning

In [4]:
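def handle_dates(date_str):
    """Parse a release date; year-only values become January 1 of that year."""
    try:
        # Full dates already in YYYY-MM-DD form.
        return pd.to_datetime(date_str, format='%Y-%m-%d', errors='raise')
    except ValueError:
        try:
            # Year-only values such as '1920': append a month and a day.
            return pd.to_datetime(str(date_str) + '-01-01', format='%Y-%m-%d', errors='raise')
        except ValueError:
            # Any other layout (e.g. '5/1/1920') matches neither format and becomes NaT.
            return pd.NaT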


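# Work on a copy so the raw frame stays untouched, then drop incomplete rows
# and the columns that are not needed for the analysis.
cleaned_df = df.copy()
cleaned_df = cleaned_df.dropna()
cleaned_df = cleaned_df.drop(columns=['id', 'year'])

# Convert release_date strings to datetimes; unparseable values become NaT.
cleaned_df['release_date'] = cleaned_df['release_date'].apply(handle_dates)

# Keep only tracks with a non-zero popularity score.
cleaned_df = cleaned_df[cleaned_df['popularity'] != 0]

print(cleaned_df.head())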

                                                name  \
0                           Keep A Song In Your Soul   
1                               I Put A Spell On You   
2                                       Golfing Papa   
3  True House Music - Xavier Santos & Carlos Gomi...   
4                                          Xuniverxe   

                     artists  popularity release_date  acousticness  \
0            ['Mamie Smith']          12   1920-01-01      0.991000   
1  ["Screamin' Jay Hawkins"]           7          NaT      0.643000   
2            ['Mamie Smith']           4   1920-01-01      0.993000   
3        ['Oscar Velazquez']          17          NaT      0.000173   
4                   ['Mixe']           2          NaT      0.295000   

   danceability  duration_ms  energy  explicit  instrumentalness  key  \
0         0.598       168333   0.224         0          0.000522    5   
1         0.852       150200   0.517         0          0.026400    5   
2         0.647  
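
In the output above, rows whose raw `release_date` uses slashes (such as `5/1/1920`) come back as `NaT`, because `handle_dates` only tries the `YYYY-MM-DD` layout. One possible alternative, sketched below, tries several explicit formats in turn. The helper name `parse_release_date` and the format list are hypothetical, and the sketch assumes slash-separated dates are month/day/year, which the data shown here cannot confirm.

```python
from datetime import datetime

import pandas as pd

# Candidate layouts, tried in order. Treating slash dates as month/day/year
# is an assumption; swap in "%d/%m/%Y" if the source uses day/month/year.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%Y-%m", "%Y")


def parse_release_date(value):
    """Return a Timestamp for the first matching format, or NaT if none match."""
    text = str(value).strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return pd.Timestamp(datetime.strptime(text, fmt))
        except ValueError:
            continue
    return pd.NaT


# Hypothetical sample covering the layouts seen in df.head().
sample = pd.Series(["1920", "5/1/1920", "1/10/1920", "1921-03-20"])
parsed = sample.apply(parse_release_date)
print(parsed)
```

Year-only and year-month values default to the first day of the missing period, which matches the January 1 convention used by `handle_dates`.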

In [5]:
cleaned_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 32393 entries, 0 to 39050
Data columns (total 17 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   name              32393 non-null  object        
 1   artists           32393 non-null  object        
 2   popularity        32393 non-null  int64         
 3   release_date      10269 non-null  datetime64[ns]
 4   acousticness      32393 non-null  float64       
 5   danceability      32393 non-null  float64       
 6   duration_ms       32393 non-null  int64         
 7   energy            32393 non-null  float64       
 8   explicit          32393 non-null  int64         
 9   instrumentalness  32393 non-null  float64       
 10  key               32393 non-null  int64         
 11  liveness          32393 non-null  float64       
 12  loudness          32393 non-null  float64       
 13  mode              32393 non-null  int64         
 14  speechiness       32393 non