# EECS 3401 Final Project

# Authors: Jamie Fletcher, Arshvir Singh, Kwonmin Bok

**Original Dataset Source: Amitansh Joshi, Amit Parolkar, & Vedant Das. (2023). <i>Spotify_1Million_Tracks</i> [Data set]. Kaggle. https://doi.org/10.34740/KAGGLE/DSV/5987852**

**Modified Dataset: 2023 Spotify song data. https://raw.githubusercontent.com/MrP29/EECS3401_Project/main/data/spotify_data_2023.csv**

# 2023 Spotify Song Dataset Description

**Attributes for spotify_data_2023.csv dataset:**  
The attributes below are copied **AS IS** from the original dataset.

1. artist_name - Name of artist

2. track_name - Name of track

3. track_id - Unique id code of track

4. popularity - Track popularity (0 to 100)

5. year - Year released (2023)

6. genre - Genre of the song

7. danceability - Track suitability for dancing (0.0 to 1.0)

8. energy - The perceptual measure of intensity and activity (0.0 to 1.0)

9. key - The key the track is in (-1 to 11)

10. loudness - Overall loudness of track in decibels (-60 to 0 dB)

11. mode - Modality of the track (Major '1' / Minor '0')

12. speechiness - Presence of spoken words in the track

13. acousticness - Confidence measure from 0 to 1 of whether the track is acoustic

14. instrumentalness - Whether tracks contain vocals (0.0 to 1.0)

15. liveness - Presence of audience in the recording (0.0 to 1.0)

16. valence - Musical positiveness (0.0 to 1.0)

17. tempo - Tempo of the track in beats per minute (BPM)

18. duration_ms - Duration of track in milliseconds

19. time_signature - Estimated time signature (3 to 7)
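
The documented ranges above can be checked against the data once it is loaded. The following is a minimal sketch, not part of the original notebook: the helper name `out_of_range_counts`, the `EXPECTED_RANGES` dictionary, and the toy rows are all assumptions made for illustration, and only a few of the attributes are covered.

```python
import pandas as pd

# Ranges taken from the attribute list above; only a subset is shown.
# EXPECTED_RANGES and out_of_range_counts are hypothetical names.
EXPECTED_RANGES = {
    "popularity": (0, 100),
    "danceability": (0.0, 1.0),
    "energy": (0.0, 1.0),
    "loudness": (-60.0, 0.0),
    "valence": (0.0, 1.0),
}

def out_of_range_counts(df, ranges):
    """Count, per column, the rows that fall outside the documented range."""
    counts = {}
    for col, (low, high) in ranges.items():
        if col in df.columns:
            # Series.between is inclusive on both ends by default.
            counts[col] = int((~df[col].between(low, high)).sum())
    return counts

# Made-up rows for illustration only; these are not taken from the dataset.
toy = pd.DataFrame({
    "popularity": [10, 50, 120],
    "danceability": [0.2, 0.5, 0.9],
    "loudness": [-6.0, 2.5, -70.0],
})

print(out_of_range_counts(toy, EXPECTED_RANGES))
```

Columns listed in `EXPECTED_RANGES` but absent from the frame are skipped, so the same helper could be applied to the real `songs` DataFrame after loading.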

# 1 - Look at the big picture and frame the problem.

### Frame the problem
1. Supervised learning
2. A regression task
3. Batch learning

### Look at the big picture
The predicted popularity scores will help music producers identify which kinds of songs are currently trending and popular.

In [None]:
# Import libraries

import sklearn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# 2 - Load the dataset
Read the dataset with pandas and load it into a DataFrame, the object pandas uses to store tabular data.
pandas provides two main objects for storing data: the Series and the DataFrame.
A Series holds a single labeled column of values, while a DataFrame holds a table made up of one or more columns.
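
To make the distinction concrete, here is a small sketch on a made-up table (the values and the names `df`, `col`, and `sub` are illustrative assumptions, not part of the project data). Selecting one column by name returns a Series, while selecting with a list of names returns a DataFrame.

```python
import pandas as pd

# Toy table for illustration; the values are made up.
df = pd.DataFrame({
    "track_name": ["Song A", "Song B", "Song C"],
    "popularity": [10, 50, 90],
})

col = df["popularity"]    # a single column label returns a Series
sub = df[["popularity"]]  # a list of labels returns a one-column DataFrame

print(type(col).__name__, col.shape)
print(type(sub).__name__, sub.shape)
```

A Series is one-dimensional, so its shape has a single entry, whereas even a one-column DataFrame keeps two dimensions (rows and columns).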

In [7]:
# Load the dataset

url = "https://raw.githubusercontent.com/MrP29/EECS3401_Project/main/data/spotify_data_2023.csv"
songs = pd.read_csv(url, sep=',')

# Create an independent backup copy of the dataset
songs_backup = songs.copy()
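
A note on the backup step: in pandas, plain assignment binds a second name to the same DataFrame, while `DataFrame.copy()` produces an independent object. The sketch below shows the difference on a made-up toy frame; the names `toy`, `alias`, and `snapshot` are illustrative and do not appear in the notebook.

```python
import pandas as pd

toy = pd.DataFrame({"popularity": [10, 50, 90]})  # made-up values

alias = toy            # a second name for the same object
snapshot = toy.copy()  # an independent copy (deep=True is the default)

toy.loc[0, "popularity"] = 0  # modify the original in place

print(alias is toy)                   # the alias is the very same object
print(alias.loc[0, "popularity"])     # so it reflects the modification
print(snapshot.loc[0, "popularity"])  # the copy keeps the old value
```

Only the copy is useful as a backup, because it is unaffected by later in-place changes to the original frame.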

# 2.1 - Take a quick look at the data structure

In [8]:
songs

Unnamed: 0.1,Unnamed: 0,artist_name,track_name,track_id,popularity,year,genre,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
0,584002,Jason Mraz,I Feel Like Dancing,4xYlpJPENfM4DT0rUYFqSP,67,2023,acoustic,0.812,0.706,4,-6.054,0,0.0709,0.04120,0.000000,0.107,0.621,112.002,218702,4
1,584003,Drew Holcomb & The Neighbors,Find Your People,6GuyIXoGIaTw1Pg6Ug9enJ,54,2023,acoustic,0.678,0.526,5,-7.292,1,0.0281,0.32900,0.000000,0.302,0.492,87.005,194093,4
2,584004,Wilder Woods,Get It Back,29mhNauP6A7LSLqiMOWNlv,50,2023,acoustic,0.588,0.721,8,-5.691,1,0.0409,0.24600,0.000434,0.114,0.667,161.958,204787,4
3,584005,Wilder Woods,Maestro (Tears Don't Lie),6N8hCmutQjQ3zZevRbJk36,49,2023,acoustic,0.604,0.937,7,-5.498,0,0.0452,0.00156,0.000122,0.266,0.726,120.010,202040,4
4,584006,Ben Rector,Range Rover (A Capella),1X9XILnuFHH4G7mkXSNPsn,49,2023,acoustic,0.482,0.488,2,-8.144,1,0.1040,0.85700,0.000000,0.719,0.261,140.658,182826,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38645,622647,Groove Armada,Rescue Me - Roland Leesker's You Got To Dance ...,3ZZ7FgBdEv5Eq537nTvdYH,1,2023,trip-hop,0.769,0.604,9,-13.930,1,0.0634,0.00582,0.246000,0.114,0.330,122.999,423418,4
38646,622648,Brazilian Girls,Good Time - Radio Edit,2BRpsndU96kgMjlfKjyY8q,0,2023,trip-hop,0.803,0.847,10,-5.808,0,0.0765,0.08750,0.097600,0.289,0.573,129.987,180413,4
38647,622649,Brazilian Girls,Good Time,4da5NJr6Pm72tdEzzbK5Us,1,2023,trip-hop,0.794,0.839,10,-6.373,0,0.0526,0.04980,0.106000,0.137,0.767,129.980,228600,4
38648,622650,The Future Sound Of London,Dead Skin Cells,4yKyp7I1iyYTup8xs2JVgp,0,2023,trip-hop,0.308,0.468,0,-17.676,0,0.0573,0.83600,0.684000,0.161,0.114,118.376,410529,3


In [9]:
songs.head()

Unnamed: 0.1,Unnamed: 0,artist_name,track_name,track_id,popularity,year,genre,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
0,584002,Jason Mraz,I Feel Like Dancing,4xYlpJPENfM4DT0rUYFqSP,67,2023,acoustic,0.812,0.706,4,-6.054,0,0.0709,0.0412,0.0,0.107,0.621,112.002,218702,4
1,584003,Drew Holcomb & The Neighbors,Find Your People,6GuyIXoGIaTw1Pg6Ug9enJ,54,2023,acoustic,0.678,0.526,5,-7.292,1,0.0281,0.329,0.0,0.302,0.492,87.005,194093,4
2,584004,Wilder Woods,Get It Back,29mhNauP6A7LSLqiMOWNlv,50,2023,acoustic,0.588,0.721,8,-5.691,1,0.0409,0.246,0.000434,0.114,0.667,161.958,204787,4
3,584005,Wilder Woods,Maestro (Tears Don't Lie),6N8hCmutQjQ3zZevRbJk36,49,2023,acoustic,0.604,0.937,7,-5.498,0,0.0452,0.00156,0.000122,0.266,0.726,120.01,202040,4
4,584006,Ben Rector,Range Rover (A Capella),1X9XILnuFHH4G7mkXSNPsn,49,2023,acoustic,0.482,0.488,2,-8.144,1,0.104,0.857,0.0,0.719,0.261,140.658,182826,4


In [10]:
songs.describe()

Unnamed: 0.1,Unnamed: 0,popularity,year,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
count,38650.0,38650.0,38650.0,38650.0,38650.0,38650.0,38650.0,38650.0,38650.0,38650.0,38650.0,38650.0,38650.0,38650.0,38650.0,38650.0
mean,603333.795136,19.921759,2022.999483,0.539989,0.616256,5.281164,-9.632492,0.623726,0.088411,0.343714,0.28468,0.216473,0.424243,121.650334,227761.0,3.890841
std,11261.762774,17.925338,0.101731,0.186339,0.286571,3.573919,6.610375,0.484456,0.10261,0.367293,0.381647,0.187803,0.265711,29.779188,124211.1,0.464543
min,584002.0,0.0,2003.0,0.0554,9.7e-05,0.0,-46.995,0.0,0.022,0.0,0.0,0.0132,0.0,32.15,15282.0,0.0
25%,593664.25,2.0,2023.0,0.416,0.413,2.0,-11.358,0.0,0.0382,0.008682,2e-06,0.101,0.194,98.97725,162691.2,4.0
50%,603326.5,17.0,2023.0,0.556,0.677,5.0,-7.69,1.0,0.0513,0.174,0.00437,0.133,0.394,123.003,204480.0,4.0
75%,612988.75,32.0,2023.0,0.681,0.865,8.0,-5.429,1.0,0.0886,0.699,0.742,0.282,0.634,139.993,263980.0,4.0
max,904608.0,100.0,2023.0,0.982,1.0,11.0,2.64,1.0,0.96,0.996,0.999,0.999,0.999,247.465,3715161.0,5.0


In [11]:
songs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38650 entries, 0 to 38649
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Unnamed: 0        38650 non-null  int64  
 1   artist_name       38650 non-null  object 
 2   track_name        38650 non-null  object 
 3   track_id          38650 non-null  object 
 4   popularity        38650 non-null  int64  
 5   year              38650 non-null  int64  
 6   genre             38650 non-null  object 
 7   danceability      38650 non-null  float64
 8   energy            38650 non-null  float64
 9   key               38650 non-null  int64  
 10  loudness          38650 non-null  float64
 11  mode              38650 non-null  int64  
 12  speechiness       38650 non-null  float64
 13  acousticness      38650 non-null  float64
 14  instrumentalness  38650 non-null  float64
 15  liveness          38650 non-null  float64
 16  valence           38650 non-null  float6