<a href="https://colab.research.google.com/github/DAWEENOT/data_science_bootcamp_8/blob/main/Spotify_songs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Spotify Dataset from Kaggle

### About dataset
Almost 30,000 Songs from the Spotify API. See the readme file for a formatted data dictionary table.

***
### **Data Dictionary:**
Each entry lists the variable, its class (data type), and a description:
- **track_id:** character Song unique ID
- **track_name:** character Song Name
- **track_artist:** character Song Artist
- **track_popularity:** double Song Popularity (0-100) where higher is better
- **track_album_id:** character Album unique ID
- **track_album_name:** character Song album name
- **track_album_release_date:** character Date when album released
- **playlist_name:** character Name of playlist
- **playlist_id:** character Playlist ID
- **playlist_genre:** character Playlist genre
- **playlist_subgenre:** character Playlist subgenre
- **danceability:** double Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
- **energy:** double Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
- **key:** double The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
- **loudness:** double The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
- **mode:** double Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
- **speechiness:** double Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
- **acousticness:** double A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
- **instrumentalness:** double Predicts whether a track contains no vocals. "Ooh" and "aah" sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly "vocal". The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
- **liveness:** double Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
- **valence:** double A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
- **tempo:** double The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
- **duration_ms:** double Duration of song in milliseconds
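
The `key` and `mode` encodings described above can be turned into readable labels. The following is a minimal sketch on toy rows, not part of the original analysis; `PITCH_CLASSES`, `key_name`, and `demo` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Pitch Class notation: 0 = C, 1 = C♯/D♭, ..., 11 = B; -1 means no key was detected
PITCH_CLASSES = ["C", "C♯/D♭", "D", "D♯/E♭", "E", "F",
                 "F♯/G♭", "G", "G♯/A♭", "A", "A♯/B♭", "B"]

def key_name(key, mode):
    """Return a label such as 'C major' (hypothetical helper)."""
    if key == -1:
        return "unknown"
    scale = "major" if mode == 1 else "minor"  # mode: 1 = major, 0 = minor
    return f"{PITCH_CLASSES[int(key)]} {scale}"

# Toy rows standing in for the real dataset
demo = pd.DataFrame({"key": [0, 1, 9, -1], "mode": [1, 0, 0, 1]})
demo["key_name"] = [key_name(k, m) for k, m in zip(demo["key"], demo["mode"])]
print(demo)
```

The same helper could be applied to the `key` and `mode` columns of the full dataset once it is loaded.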

In [1]:
## import pandas, numpy, ml
import pandas as pd
import numpy as np

## ML
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import train_test_split

In [None]:
## import dataset
sp_songs = pd.read_csv("spotify_songs.csv")

sp_songs

In [None]:
## info()
sp_songs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4319 entries, 0 to 4318
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   track_id                  4319 non-null   object 
 1   track_name                4319 non-null   object 
 2   track_artist              4319 non-null   object 
 3   track_popularity          4319 non-null   int64  
 4   track_album_id            4319 non-null   object 
 5   track_album_name          4319 non-null   object 
 6   track_album_release_date  4319 non-null   object 
 7   playlist_name             4319 non-null   object 
 8   playlist_id               4319 non-null   object 
 9   playlist_genre            4318 non-null   object 
 10  playlist_subgenre         4318 non-null   object 
 11  danceability              4318 non-null   float64
 12  energy                    4318 non-null   float64
 13  key                       4318 non-null   float64
 14  loudness

In [4]:
## checking NA
sp_songs.isna().sum()

track_id                    0
track_name                  5
track_artist                5
track_popularity            0
track_album_id              0
track_album_name            5
track_album_release_date    0
playlist_name               0
playlist_id                 0
playlist_genre              0
playlist_subgenre           0
danceability                0
energy                      0
key                         0
loudness                    0
mode                        0
speechiness                 0
acousticness                0
instrumentalness            0
liveness                    0
valence                     0
tempo                       0
duration_ms                 0
dtype: int64

In [5]:
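## Fill missing values with the placeholder string 'NA'
## (only the text columns track_name, track_artist, track_album_name have any)
sp_songs = sp_songs.fillna(value = 'NA')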


In [6]:
## Recheck NA
print(sp_songs.isna().sum())

track_id                    0
track_name                  0
track_artist                0
track_popularity            0
track_album_id              0
track_album_name            0
track_album_release_date    0
playlist_name               0
playlist_id                 0
playlist_genre              0
playlist_subgenre           0
danceability                0
energy                      0
key                         0
loudness                    0
mode                        0
speechiness                 0
acousticness                0
instrumentalness            0
liveness                    0
valence                     0
tempo                       0
duration_ms                 0
dtype: int64
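
The fill step above is applied to the whole frame, which works here because only text columns contain missing values. A more defensive variant restricts the fill to those columns, so a numeric column would never receive a string placeholder. This is a sketch on a toy frame; `demo` and `text_cols` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for sp_songs: three text columns and one numeric column
demo = pd.DataFrame({
    "track_name": ["Song A", np.nan, "Song C"],
    "track_artist": ["Artist A", "Artist B", np.nan],
    "track_album_name": [np.nan, "Album B", "Album C"],
    "tempo": [120.0, 98.5, 101.2],
})

# Fill only the text columns that were reported as having missing values
text_cols = ["track_name", "track_artist", "track_album_name"]
demo[text_cols] = demo[text_cols].fillna("NA")

print(demo.isna().sum())
print(demo.dtypes)
```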


In [None]:
## select columns (test)
sp_songs[['track_id', 'track_name', 'track_popularity']]\
    .drop_duplicates()\
    .sort_values('track_popularity', ascending = False)\
    .head(10)

In [None]:
## Question 1: How many genres are in the dataset?
## count rows per genre; the number of rows in the result is the number of genres
sp_songs.groupby(['playlist_genre'])['playlist_id']\
    .count()\
    .reset_index()\
    .sort_values('playlist_id', ascending = False)


Unnamed: 0,playlist_genre,playlist_id
0,edm,6043
4,rap,5746
2,pop,5507
3,r&b,5431
1,latin,5155
5,rock,4951


In [73]:
## set datetime
sp_songs['track_album_release_date'] = pd.to_datetime(sp_songs['track_album_release_date'])

In [59]:
## Question 2: What are the top 10 most popular tracks released in 2016?
sp_2016 = sp_songs[sp_songs['track_album_release_date'].dt.year == 2016]

sp_2016[['track_artist',\
         'track_name',\
         'track_album_release_date',\
         'track_popularity']]\
         .drop_duplicates()\
         .sort_values('track_popularity', ascending = False)\
         .reset_index()\
         .head(10)

Unnamed: 0,index,track_artist,track_name,track_album_release_date,track_popularity
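
The year filter above relies on the `.dt` accessor, which only works once the column holds datetimes. The sketch below shows the parse-then-filter pattern on toy rows. Passing `errors="coerce"` is an assumption made here for illustration (it turns unparseable values into `NaT` instead of raising); the original cell does not use it.

```python
import pandas as pd

# Toy rows standing in for sp_songs; the last date is deliberately invalid
demo = pd.DataFrame({
    "track_name": ["A", "B", "C", "D"],
    "track_album_release_date": ["2016-03-04", "2019-06-14", "2016-11-25", "unknown"],
})

# Parse strings to datetimes; invalid entries become NaT rather than raising
demo["track_album_release_date"] = pd.to_datetime(
    demo["track_album_release_date"], errors="coerce"
)

# Keep only rows whose release year is 2016 (NaT rows never match)
demo_2016 = demo[demo["track_album_release_date"].dt.year == 2016]
print(demo_2016)
```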


In [None]:
sp_songs.duration_ms

0        194754
1        162600
2        176616
3        169093
4        189052
          ...  
32828    204375
32829    353120
32830    210112
32831    367432
32832    337500
Name: duration_ms, Length: 32833, dtype: int64

In [None]:
194754 / 60000

3.2459

In [9]:
## Add a new column converting milliseconds to minutes
sp_songs['duration_min'] = sp_songs['duration_ms'] / 60000

sp_songs[['duration_ms', 'duration_min']]

Unnamed: 0,duration_ms,duration_min
0,194754,3.245900
1,162600,2.710000
2,176616,2.943600
3,169093,2.818217
4,189052,3.150867
...,...,...
32828,204375,3.406250
32829,353120,5.885333
32830,210112,3.501867
32831,367432,6.123867
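
Decimal minutes such as 3.2459 are easy to compute but awkward to read. As an illustration only, the sketch below formats a duration in milliseconds as minutes and seconds; `format_duration` is a hypothetical helper, not part of the original notebook.

```python
def format_duration(duration_ms):
    """Format a duration in milliseconds as 'M:SS' (hypothetical helper)."""
    total_seconds = int(duration_ms) // 1000  # drop the fractional second
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes}:{seconds:02d}"

# First two durations shown in the output above
print(format_duration(194754))
print(format_duration(162600))
```

With a DataFrame, the helper could be applied with `sp_songs['duration_ms'].apply(format_duration)`.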


In [None]:
## Question 3: Which tracks have the longest duration in minutes?

sp_songs[['track_name',\
          'duration_min']]\
          .drop_duplicates()\
          .sort_values('duration_min', ascending = False)\
          .reset_index()\
          .head(5)

Unnamed: 0,index,track_name,duration_min
0,21327,47 - Remix,8.630167
1,11770,Kashmir - 2012 Remaster,8.61875
2,12379,American Pie,8.614883
3,20643,Jam On It (Re-Recorded Version),8.612667
4,11919,Roundabout - 2008 Remaster,8.599333


In [17]:
## Question 4: How many danceable songs are there?
    ## create a new column: 1 if danceability >= 0.5, else 0
sp_songs['danceability_logi'] = sp_songs['danceability']\
                                    .apply(lambda x: 1 if x >= 0.5 else 0)

## count 1
dance_logi_1 = (sp_songs['danceability_logi'] == 1)\
                    .sum()
print("Count of danceability equal to 1:", dance_logi_1)

## count 0
dance_logi_0 = (sp_songs['danceability_logi'] == 0)\
    .sum()
print("Count of danceability equal to 0:", dance_logi_0)



Count of danceability equal to 1: 27943
Count of danceability equal to 0: 4890


In [74]:
## convert release dates to Unix timestamps (seconds) so they can be used as a numeric feature
sp_songs['track_album_release_date'] = sp_songs['track_album_release_date'].apply(lambda x: x.timestamp())

In [118]:
## ML

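## prepare data
## note: 'tempo' is not in the drop list below, so x still contains the target column y;
## the model can read the answer directly, which inflates the score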
x = sp_songs.drop(['track_id',\
                   'track_name',\
                   'track_artist',\
                   'track_popularity',\
                   'track_album_id',\
                   'track_album_name',\
                   'playlist_id',\
                   'playlist_name',\
                   'playlist_genre',\
                   'playlist_subgenre',\
                   'danceability_logi',\
                   'duration_min'], axis = 1)
y = sp_songs['tempo']


## split data
x_train, x_test, y_train, y_test =  train_test_split(x, y, test_size = 0.25, random_state = 42)


## train data
model = BaggingRegressor()
model.fit(x_train, y_train)


## prediction
p = model.predict(x_test)

## scoring (R^2, the coefficient of determination)
scoring = model.score(x_test, y_test)
print("Scoring:", scoring)

## Mean error (signed): no absolute value is taken, so positive and
## negative errors cancel; this measures bias, not MAE
me = np.mean(y_test - p)
print("Mean Error:", me)

## MSE
mse = np.mean((y_test - p)** 2)
print("Mean Square Error:", mse)

## RMSE
rmse = np.sqrt(mse)
print("Root Mean Square Error:", rmse)

Scoring: 0.9996990270783161
Mean Error: -0.005966354001705181
Mean Square Error: 0.2167072123206237
Root Mean Square Error: 0.4655182191070761
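
Two details of the evaluation above are worth checking: the feature matrix keeps `tempo` while `tempo` is also the target, and the averaged error `y_test - p` is signed. The sketch below shows one way to set up the same steps without the target in the features and with an absolute error. It runs on random synthetic data, so its numbers say nothing about the Spotify dataset; `demo`, `target`, and `errors` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for sp_songs (random values, for illustration only)
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "danceability": rng.uniform(0, 1, 300),
    "energy": rng.uniform(0, 1, 300),
    "tempo": rng.uniform(60, 200, 300),
})

target = "tempo"
x = demo.drop(columns=[target])  # features must not contain the target
y = demo[target]

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=42
)

model = BaggingRegressor(random_state=42)
model.fit(x_train, y_train)
p = model.predict(x_test)

errors = y_test - p
mae = np.mean(np.abs(errors))  # absolute value keeps errors from cancelling
mse = np.mean(errors ** 2)
rmse = np.sqrt(mse)

print("R^2:", model.score(x_test, y_test))
print("Mean Absolute Error:", mae)
print("Mean Square Error:", mse)
print("Root Mean Square Error:", rmse)
```

On the real data, the same pattern would mean adding `'tempo'` to the list of dropped columns before splitting.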
