# Spotify
## Cleaning Dataset
* Identify and remove duplicate rows
* Look for missing values and fill them in as appropriate
* Ensure consistency in text fields
* Handle missing values in numerical columns
* Use `AVERAGE()` or `MEDIAN()` to calculate the replacement values
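
The checklist above names the spreadsheet functions `AVERAGE()` and `MEDIAN()`. A rough pandas analogue of the imputation step might look like the sketch below. The `toy` frame and its values are made up for illustration; only the column names are borrowed from the Spotify data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column names mirror the Spotify data, values are invented.
toy = pd.DataFrame({
    "tempo": [120.0, np.nan, 90.0, 150.0],
    "loudness": [-6.0, -10.0, np.nan, -8.0],
})

# Median of each numeric column, skipping NaN (the pandas counterpart of MEDIAN()).
medians = toy.median(numeric_only=True)

# Passing a Series to fillna fills each column with the value under the matching label.
filled = toy.fillna(medians)
print(filled)
```

Swapping `median` for `mean` gives the `AVERAGE()` variant. The median is usually the safer choice when a column is skewed or has outliers.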
 
## Preliminary Exploratory Data Analysis
* Use the Data Analysis Toolpak to generate descriptive statistics for numerical columns
* Create histograms to visualize the distribution of numerical variables
* Generate scatter plots to examine relationships between different numerical variables
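
The exploratory steps above are phrased for a spreadsheet. In pandas they could be approximated as in the sketch below, again on a made-up `toy` frame rather than the real dataset. The plotting calls are left commented out because they assume matplotlib is installed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column names mirror the Spotify data, values are invented.
toy = pd.DataFrame({
    "energy": [0.2, 0.4, 0.6, 0.8],
    "loudness": [-18.0, -12.0, -9.0, -5.0],
})

# Descriptive statistics (count, mean, std, min, quartiles, max) per numeric column.
summary = toy.describe()
print(summary)

# Histogram counts without plotting: two equal-width bins over [0, 1].
counts, edges = np.histogram(toy["energy"], bins=2, range=(0.0, 1.0))
print(counts, edges)

# Pearson correlation as a numeric stand-in for eyeballing a scatter plot.
corr = toy["energy"].corr(toy["loudness"])
print(corr)

# With matplotlib available, the visual versions would be:
# toy.hist()
# toy.plot.scatter(x="energy", y="loudness")
```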

In [1]:
import pandas as pd
# from ydata_profiling import ProfileReport

In [2]:
df = pd.read_csv("spotify_data.txt", sep=",", index_col=0)
df.head()

track_id,track_url,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
0,5SuOikwiRyPMVoIQDJUgSV,Gen Hoshino,Comedy,Comedy,73,230666,False,0.676,0.461,1.0,-6.746,0.0,0.143,0.0322,1e-06,0.358,0.715,87.917,4.0,acoustic
1,4qPNDBW1i3p13qLCt0Ki3A,Ben Woodward,,Ghost - Acoustic,55,149610,False,0.42,0.166,1.0,-17.235,1.0,0.0763,0.924,6e-06,0.101,0.267,77.489,4.0,acoustic
2,1iJBSr7s7jYXzM8EGcbK5b,Ingrid Michaelson;ZAYN,To Begin Again,To Begin Again,57,210826,False,0.438,0.359,0.0,-9.734,1.0,0.0557,0.21,0.0,0.117,0.12,76.332,4.0,acoustic
3,6lfxq3CG4xtTiEg7opyCyx,Kina Grannis,Crazy Rich Asians (Original Motion Picture Sou...,Can't Help Falling In Love,71,201933,False,0.266,0.0596,0.0,-18.515,1.0,0.0363,0.905,7.1e-05,0.132,0.143,181.74,3.0,acoustic
4,5vjLSffimiIP26QG5WcN2K,Chord Overstreet,Hold On,Hold On,82,198853,False,0.618,0.443,2.0,-9.681,1.0,0.0526,0.469,0.0,0.0829,0.167,119.949,4.0,acoustic


In [3]:
# profile = ProfileReport(df, title='Spotify Tracks Data')
# profile.to_notebook_iframe()
# profile.to_file("spotify_data.html")

The report shows that the data contains duplicate rows.<br>
Use the code below to remove duplicates from the DataFrame.

In [4]:
# drop_duplicates must be called, and it returns a new DataFrame, so assign the result back
df = df.drop_duplicates()
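
As a quick check on the de-duplication step, it can help to count duplicates before dropping them. The sketch below uses a small invented `toy` frame. Note that `drop_duplicates()` has to be called with parentheses and returns a new DataFrame instead of modifying the original.

```python
import pandas as pd

# Hypothetical toy frame with one exact duplicate row.
toy = pd.DataFrame({
    "track_name": ["Comedy", "Comedy", "Hold On"],
    "artists": ["Gen Hoshino", "Gen Hoshino", "Chord Overstreet"],
})

# duplicated() flags every row that repeats an earlier row.
n_dupes = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence and returns a new DataFrame.
deduped = toy.drop_duplicates()

print(n_dupes, len(toy), len(deduped))
```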

Some tracks have no corresponding album.<br>
The code below fills each missing album name with the track name followed by ` (single)`.

In [5]:
df['album_name'] = df['album_name'].fillna(df['track_name']+' (single)')
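
To see what this `fillna` call does, here is a minimal sketch on an invented two-row frame. Because the replacement is a Series, pandas aligns it by index and only uses it where `album_name` is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: the second track has no album.
toy = pd.DataFrame({
    "album_name": ["Comedy", np.nan],
    "track_name": ["Comedy", "Ghost - Acoustic"],
})

# Candidate album names for every row; only rows with NaN will use them.
fallback = toy["track_name"] + " (single)"
toy["album_name"] = toy["album_name"].fillna(fallback)

print(toy)
```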

Drop the rows where the artist name is NaN.<br>
Keep the main artist and remove any other artists from the data.

In [6]:
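# df['featured'] = df['artists'].str.split(';').str[1]
# dropna returns a new DataFrame, so assign the result back
df = df.dropna(subset=['artists'])
# Keep only the first (main) artist of each ';'-separated list
df['artists'] = df['artists'].str.split(';').str[0]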
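
If the featured artists are worth keeping, as the commented-out line hints, one possible variant splits once on `;` and stores both halves. This is only a sketch on an invented `toy` frame. The `featured` column name comes from the commented line, and everything else is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one collaboration, one solo artist, one missing value.
toy = pd.DataFrame({"artists": ["Ingrid Michaelson;ZAYN", "Gen Hoshino", np.nan]})

# Remove rows without an artist; assign the result because dropna returns a new frame.
toy = toy.dropna(subset=["artists"])

# Split on the first ';' only: column 0 is the main artist, column 1 is the remainder.
parts = toy["artists"].str.split(";", n=1, expand=True)
toy["featured"] = parts[1]
toy["artists"] = parts[0]

print(toy)
```

Rows with a single artist end up with a missing value in `featured`.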


In [7]:
# Preview the data without track_url; drop() returns a new DataFrame, so df itself is unchanged
df.drop(columns='track_url')

track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
0,Gen Hoshino,Comedy,Comedy,73,230666,False,0.676,0.4610,1.0,-6.746,0.0,0.1430,0.0322,0.000001,0.3580,0.7150,87.917,4.0,acoustic
1,Ben Woodward,Ghost - Acoustic (single),Ghost - Acoustic,55,149610,False,0.420,0.1660,1.0,-17.235,1.0,0.0763,0.9240,0.000006,0.1010,0.2670,77.489,4.0,acoustic
2,Ingrid Michaelson,To Begin Again,To Begin Again,57,210826,False,0.438,0.3590,0.0,-9.734,1.0,0.0557,0.2100,0.000000,0.1170,0.1200,76.332,4.0,acoustic
3,Kina Grannis,Crazy Rich Asians (Original Motion Picture Sou...,Can't Help Falling In Love,71,201933,False,0.266,0.0596,0.0,-18.515,1.0,0.0363,0.9050,0.000071,0.1320,0.1430,181.740,3.0,acoustic
4,Chord Overstreet,Hold On,Hold On,82,198853,False,0.618,0.4430,2.0,-9.681,1.0,0.0526,0.4690,0.000000,0.0829,0.1670,119.949,4.0,acoustic
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
113995,Rainy Lullaby,#mindfulness - Soft Rain for Mindful Meditatio...,Sleep My Little Boy,21,384999,False,0.172,0.2350,5.0,-16.393,1.0,0.0422,0.6400,0.928000,0.0863,0.0339,125.995,5.0,world-music
113996,Rainy Lullaby,#mindfulness - Soft Rain for Mindful Meditatio...,Water Into Light,22,385000,False,0.174,0.1170,0.0,-18.318,0.0,0.0401,0.9940,0.976000,0.1050,0.0350,85.239,4.0,world-music
113997,Cesária Evora,Best Of,Miss Perfumado,22,271466,False,0.629,0.3290,0.0,-10.895,0.0,0.0420,0.8670,0.000000,0.0839,0.7430,132.378,4.0,world-music
113998,Michael W. Smith,Change Your World,Friends,41,283893,False,0.587,0.5060,7.0,-10.889,1.0,0.0297,0.3810,0.000000,0.2700,0.4130,135.960,4.0,world-music


In [8]:
df.head()

track_id,track_url,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
0,5SuOikwiRyPMVoIQDJUgSV,Gen Hoshino,Comedy,Comedy,73,230666,False,0.676,0.461,1.0,-6.746,0.0,0.143,0.0322,1e-06,0.358,0.715,87.917,4.0,acoustic
1,4qPNDBW1i3p13qLCt0Ki3A,Ben Woodward,Ghost - Acoustic (single),Ghost - Acoustic,55,149610,False,0.42,0.166,1.0,-17.235,1.0,0.0763,0.924,6e-06,0.101,0.267,77.489,4.0,acoustic
2,1iJBSr7s7jYXzM8EGcbK5b,Ingrid Michaelson,To Begin Again,To Begin Again,57,210826,False,0.438,0.359,0.0,-9.734,1.0,0.0557,0.21,0.0,0.117,0.12,76.332,4.0,acoustic
3,6lfxq3CG4xtTiEg7opyCyx,Kina Grannis,Crazy Rich Asians (Original Motion Picture Sou...,Can't Help Falling In Love,71,201933,False,0.266,0.0596,0.0,-18.515,1.0,0.0363,0.905,7.1e-05,0.132,0.143,181.74,3.0,acoustic
4,5vjLSffimiIP26QG5WcN2K,Chord Overstreet,Hold On,Hold On,82,198853,False,0.618,0.443,2.0,-9.681,1.0,0.0526,0.469,0.0,0.0829,0.167,119.949,4.0,acoustic
