# Data Cleaning

In this notebook we remove duplicate entries from the dataset and save the resulting dataframe as a new CSV file for use in model training.

First, import the necessary packages.

In [1]:
import numpy as np
import pandas as pd

Next, load the CSV file.

In [2]:
songdata = pd.read_csv('data/genre_music.csv')
songdata.head(20)

Unnamed: 0,track,artist,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_s,time_signature,chorus_hit,sections,popularity,decade,genre
0,Jealous Kind Of Fella,Garland Green,0.417,0.62,3,-7.727,1,0.0403,0.49,0.0,0.0779,0.845,185.655,173.533,3,32.94975,9,1,60s,edm
1,Initials B.B.,Serge Gainsbourg,0.498,0.505,3,-12.475,1,0.0337,0.018,0.107,0.176,0.797,101.801,213.613,4,48.8251,10,0,60s,pop
2,Melody Twist,Lord Melody,0.657,0.649,5,-13.392,1,0.038,0.846,4e-06,0.119,0.908,115.94,223.96,4,37.22663,12,0,60s,pop
3,Mi Bomba Sonó,Celia Cruz,0.59,0.545,7,-12.058,0,0.104,0.706,0.0246,0.061,0.967,105.592,157.907,4,24.75484,8,0,60s,pop
4,Uravu Solla,P. Susheela,0.515,0.765,11,-3.515,0,0.124,0.857,0.000872,0.213,0.906,114.617,245.6,4,21.79874,14,0,60s,r&b
5,Beat n. 3,Ennio Morricone,0.697,0.673,0,-10.573,1,0.0266,0.714,0.919,0.122,0.778,112.117,167.667,4,65.48604,7,0,60s,pop
6,Samba De Uma Nota So (One Note Samba),Antônio Carlos Jobim,0.662,0.272,0,-18.883,1,0.0313,0.36,0.228,0.0963,0.591,143.507,134.36,4,47.82155,7,0,60s,pop
7,Happy Days,Marv Johnson,0.72,0.624,5,-9.086,0,0.0473,0.795,0.0,0.488,0.887,119.999,160.04,4,30.42891,8,1,60s,pop
8,Carolina - Remastered 2006,Caetano Veloso,0.545,0.22,2,-15.079,0,0.0828,0.582,0.239,0.269,0.386,118.223,158.413,4,47.08099,6,0,60s,pop
9,I Can Hear Music,The Beach Boys,0.511,0.603,2,-7.637,1,0.028,0.0385,2e-06,0.142,0.685,128.336,157.293,4,43.36534,9,1,60s,pop


In [3]:
print("Shape: ", songdata.shape)
songdata.dtypes

Shape:  (41099, 20)


track                object
artist               object
danceability        float64
energy              float64
key                   int64
loudness            float64
mode                  int64
speechiness         float64
acousticness        float64
instrumentalness    float64
liveness            float64
valence             float64
tempo               float64
duration_s          float64
time_signature        int64
chorus_hit          float64
sections              int64
popularity            int64
decade               object
genre                object
dtype: object

In [4]:
songdata.info()  # info is a method; it must be called to print the column summary

### Clean the dataset to remove duplicate songs.

Reference: https://towardsdatascience.com/finding-and-removing-duplicate-rows-in-pandas-dataframe-c6117668631f

We can use the `duplicated()` method to find duplicates in the dataset. Called on the `track` column, it flags every row whose track name has already appeared in an earlier row. Summing the flags shows that there are 5,246 such rows.

In [5]:
songdata.track.duplicated().sum()

5246
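
To make clear what this count means, here is a minimal sketch on a small made-up dataframe (the titles and artists are invented for illustration and are not rows from the dataset). With the default `keep='first'`, the first occurrence of each value is not flagged, so the sum counts only the later repeats; `keep=False` flags every copy.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "track": ["Song A", "Song B", "Song A", "Song C", "Song A"],
    "artist": ["X", "Y", "X", "Z", "W"],
})

# Default keep='first': the first "Song A" is not flagged, the two later ones are
later_repeats = toy.track.duplicated()
n_later_repeats = int(later_repeats.sum())

# keep=False flags every row whose track value occurs more than once
all_copies = toy.track.duplicated(keep=False)
n_all_copies = int(all_copies.sum())

print(n_later_repeats, n_all_copies)
```

The first count is the number of rows that `drop_duplicates(keep='first')` would remove; the second is the number of rows involved in any duplication.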

To drop the duplicate rows, we can use the `drop_duplicates()` method. We pass `subset=['track']` so that rows are compared on the track name only, set `keep='first'` to keep the first occurrence of each track, and set `inplace=True` to update the original dataframe. This removes the 5,246 flagged rows, leaving 41,099 - 5,246 = 35,853 rows.

In [6]:
songdata.drop_duplicates(subset=['track'], keep='first', inplace=True)
print('New shape: ', songdata.shape)
songdata.head()

New shape:  (35853, 20)


Unnamed: 0,track,artist,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_s,time_signature,chorus_hit,sections,popularity,decade,genre
0,Jealous Kind Of Fella,Garland Green,0.417,0.62,3,-7.727,1,0.0403,0.49,0.0,0.0779,0.845,185.655,173.533,3,32.94975,9,1,60s,edm
1,Initials B.B.,Serge Gainsbourg,0.498,0.505,3,-12.475,1,0.0337,0.018,0.107,0.176,0.797,101.801,213.613,4,48.8251,10,0,60s,pop
2,Melody Twist,Lord Melody,0.657,0.649,5,-13.392,1,0.038,0.846,4e-06,0.119,0.908,115.94,223.96,4,37.22663,12,0,60s,pop
3,Mi Bomba Sonó,Celia Cruz,0.59,0.545,7,-12.058,0,0.104,0.706,0.0246,0.061,0.967,105.592,157.907,4,24.75484,8,0,60s,pop
4,Uravu Solla,P. Susheela,0.515,0.765,11,-3.515,0,0.124,0.857,0.000872,0.213,0.906,114.617,245.6,4,21.79874,14,0,60s,r&b
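
Because `subset=['track']` compares titles only, two different songs that happen to share a title are treated as duplicates. Whether that is desirable depends on the dataset. As a hedged alternative, the sketch below (toy data, invented for illustration) compares deduplicating on `track` alone with deduplicating on `track` and `artist` together, and uses the returned copy rather than `inplace=True` so the original stays available for comparison.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "track": ["Song A", "Song B", "Song A", "Song A"],
    "artist": ["X", "Y", "X", "W"],
    "genre": ["pop", "pop", "edm", "r&b"],
})

# Title only: the "Song A" by artist W is dropped along with the true repeat
by_track = toy.drop_duplicates(subset=["track"], keep="first")

# Title and artist: only the repeated ("Song A", "X") row is dropped
by_track_artist = toy.drop_duplicates(subset=["track", "artist"], keep="first")

print(len(toy), len(by_track), len(by_track_artist))
```

Note that the surviving rows keep their original index labels; calling `reset_index(drop=True)` on the result renumbers them if a contiguous index is wanted.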


### Save cleaned dataframe to csv

Now we save the cleaned dataframe as a CSV file, passing `index=False` so that the row index is not written as an extra column.

In [7]:
songdata.to_csv('data/genre_music_cleaned.csv', index=False)
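
As a small illustration of why `index=False` matters, the sketch below writes a toy dataframe (invented for illustration) to an in-memory buffer with and without the index and reads it back. The buffer stands in for a file on disk; the names used here are assumptions and are not part of the notebook.

```python
import io
import pandas as pd

# Toy data with a non-contiguous index, as left behind after dropping rows
toy = pd.DataFrame(
    {"track": ["Song A", "Song B"], "tempo": [120.0, 98.5]},
    index=[0, 3],
)

# Default: the index is written as an unnamed first column
buf_with_index = io.StringIO()
toy.to_csv(buf_with_index)
buf_with_index.seek(0)
reloaded_with_index = pd.read_csv(buf_with_index)

# index=False: only the data columns are written
buf_no_index = io.StringIO()
toy.to_csv(buf_no_index, index=False)
buf_no_index.seek(0)
reloaded_no_index = pd.read_csv(buf_no_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_no_index.columns))
```

Without `index=False`, the reloaded dataframe gains an extra `Unnamed: 0` column holding the old index labels, which would otherwise have to be dropped before model training.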