## DATA DICTIONARY

1. `name`: The title of the track.
2. `popularity`: A numerical popularity score for the track, stored as an integer.
3. `duration`: The length of the track in seconds, converted from the raw `duration_ms` column (milliseconds).
4. `explicit`: A binary indicator (0 or 1) of whether the track contains explicit content.
5. `artists`: The name(s) of the artist(s) credited on the track.
6. `id_artists`: The unique identifier of the track's artist. When several artists are credited, only the first ID is kept.
7. `danceability`: A numerical measure of how suitable the track is for dancing.
8. `energy`: A numerical measure of the track's energy level.
9. `loudness`: The loudness of the track's audio, as a numerical value.
10. `speechiness`: A measure of the presence of spoken words in the track.
11. `acousticness`: A measure of the track's acoustic (non-electronic) character.
12. `instrumentalness`: A measure of the track's instrumental (non-vocal) character.
13. `liveness`: A measure of the presence of a live audience or live performance in the track.
14. `valence`: A measure of the track's emotional positivity or happiness.
15. `tempo`: The tempo of the track in beats per minute (BPM).
16. `release_date`: The release date of the track, parsed to a datetime.
17. `release_month`: The month extracted from `release_date`.
18. `release_year`: The year extracted from `release_date`.
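
A dictionary like this can drift from the table it describes, so it may help to compare the documented names against the DataFrame's columns. The sketch below is one possible check, not part of the original notebook: the helper `compare_with_dictionary` and the toy frame are hypothetical and only illustrate the idea.

```python
import pandas as pd

def compare_with_dictionary(df, documented):
    """Return (missing, undocumented) column names relative to a data dictionary."""
    actual = set(df.columns)
    expected = set(documented)
    missing = sorted(expected - actual)        # documented but absent from df
    undocumented = sorted(actual - expected)   # present in df but not documented
    return missing, undocumented

# Toy frame standing in for the real table (values are illustrative only)
toy = pd.DataFrame({"name": ["a"], "popularity": [10], "key": [0]})
missing, undocumented = compare_with_dictionary(
    toy, ["name", "popularity", "duration"]
)
print("missing:", missing)
print("undocumented:", undocumented)
```

Running the same helper on the cleaned `tracks` frame with the names listed above would show whether any column lacks a description.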


In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

In [2]:
tracks=pd.read_csv('tracks.csv')
tracks.head(5)

Unnamed: 0,id,name,popularity,duration_ms,explicit,artists,id_artists,release_date,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
0,35iwgR4jXetI318WEWsa1Q,Carve,6,126903,0,['Uli'],['45tIt06XoI0Iio4LBEVpls'],1922-02-22,0.645,0.445,0,-13.338,1,0.451,0.674,0.744,0.151,0.127,104.851,3
1,021ht4sdgPcrDgSk7JTbKY,Capítulo 2.16 - Banquero Anarquista,0,98200,0,['Fernando Pessoa'],['14jtPCOoNZwquk5wd9DxrY'],1922-06-01,0.695,0.263,0,-22.136,1,0.957,0.797,0.0,0.148,0.655,102.009,1
2,07A5yehtSnoedViJAZkNnc,Vivo para Quererte - Remasterizado,0,181640,0,['Ignacio Corsini'],['5LiOoJbxVSAMkBS2fUm3X2'],1922-03-21,0.434,0.177,1,-21.18,1,0.0512,0.994,0.0218,0.212,0.457,130.418,5
3,08FmqUhxtyLTn6pAh6bk45,El Prisionero - Remasterizado,0,176907,0,['Ignacio Corsini'],['5LiOoJbxVSAMkBS2fUm3X2'],1922-03-21,0.321,0.0946,7,-27.961,1,0.0504,0.995,0.918,0.104,0.397,169.98,3
4,08y9GfoqCWfOGsKdwojr5e,Lady of the Evening,0,163080,0,['Dick Haymes'],['3BiJGZsyX9sJchTqcSA7Su'],1922,0.402,0.158,3,-16.9,0,0.039,0.989,0.13,0.311,0.196,103.22,4


In [3]:
tracks.tail()

Unnamed: 0,id,name,popularity,duration_ms,explicit,artists,id_artists,release_date,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
586667,5rgu12WBIHQtvej2MdHSH0,云与海,50,258267,0,['阿YueYue'],['1QLBXKM5GCpyQQSVMNZqrZ'],2020-09-26,0.56,0.518,0,-7.471,0,0.0292,0.785,0.0,0.0648,0.211,131.896,4
586668,0NuWgxEp51CutD2pJoF4OM,blind,72,153293,0,['ROLE MODEL'],['1dy5WNgIKQU6ezkpZs4y8z'],2020-10-21,0.765,0.663,0,-5.223,1,0.0652,0.141,0.000297,0.0924,0.686,150.091,4
586669,27Y1N4Q4U3EfDU5Ubw8ws2,What They'll Say About Us,70,187601,0,['FINNEAS'],['37M5pPGs6V1fchFJSgCguX'],2020-09-02,0.535,0.314,7,-12.823,0,0.0408,0.895,0.00015,0.0874,0.0663,145.095,4
586670,45XJsGpFTyzbzeWK8VzR8S,A Day At A Time,58,142003,0,"['Gentle Bones', 'Clara Benin']","['4jGPdu95icCKVF31CcFKbS', '5ebPSE9YI5aLeZ1Z2g...",2021-03-05,0.696,0.615,10,-6.212,1,0.0345,0.206,3e-06,0.305,0.438,90.029,4
586671,5Ocn6dZ3BJFPWh4ylwFXtn,Mar de Emociones,38,214360,0,['Afrosound'],['0i4Qda0k4nf7jnNHmSNpYv'],2015-07-01,0.686,0.723,6,-7.067,1,0.0363,0.105,0.0,0.264,0.975,112.204,4


In [4]:
tracks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 586672 entries, 0 to 586671
Data columns (total 20 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   id                586672 non-null  object 
 1   name              586601 non-null  object 
 2   popularity        586672 non-null  int64  
 3   duration_ms       586672 non-null  int64  
 4   explicit          586672 non-null  int64  
 5   artists           586672 non-null  object 
 6   id_artists        586672 non-null  object 
 7   release_date      586672 non-null  object 
 8   danceability      586672 non-null  float64
 9   energy            586672 non-null  float64
 10  key               586672 non-null  int64  
 11  loudness          586672 non-null  float64
 12  mode              586672 non-null  int64  
 13  speechiness       586672 non-null  float64
 14  acousticness      586672 non-null  float64
 15  instrumentalness  586672 non-null  float64
 16  liveness          58

## Clean data

## Drop columns that are not required

In [5]:
# Keep only the first artist ID when a track lists several
tracks.id_artists=tracks.id_artists.apply(lambda x:x.split(',')[0])

In [6]:
tracks.drop(['key','mode','id','time_signature'],axis='columns',inplace=True)

In [7]:
# Convert duration from milliseconds to seconds and rename the column to match
tracks.duration_ms=tracks.duration_ms/1000
tracks.rename(columns={'duration_ms':'duration'},inplace=True)

In [8]:
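# Strip list brackets and single quotes from the artist columns
# (this also removes apostrophes inside artist names)
tracks.artists=tracks.artists.str.replace(r"['\[\]]","",regex=True)
tracks.id_artists=tracks.id_artists.str.replace(r"['\[\]]","",regex=True)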
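
The regex in this cell removes every single quote, so an apostrophe inside an artist name disappears along with the list punctuation. The sketch below shows that effect on a few illustrative strings and compares it with parsing the list literal via `ast.literal_eval`. The third row is a hypothetical example, not taken from the dataset, and the `literal_eval` route is an alternative technique, not what the notebook does.

```python
import ast
import pandas as pd

# Illustrative rows in the same shape as the raw `artists` column;
# the last one is a hypothetical name containing an apostrophe
raw = pd.Series([
    "['Uli']",
    "['Gentle Bones', 'Clara Benin']",
    "[\"Guns N' Roses\"]",
])

# Regex approach from the notebook: strips brackets and every single quote
stripped = raw.str.replace(r"['\[\]]", "", regex=True)

# Alternative: parse each string as a Python list, then join the names
parsed = raw.apply(lambda s: ", ".join(ast.literal_eval(s)))

for before, a, b in zip(raw, stripped, parsed):
    print(f"{before!r:40} regex={a!r:30} literal_eval={b!r}")
```

`ast.literal_eval` only evaluates Python literals, so it is a reasonable fit for list-shaped strings, though it will raise on malformed rows where the regex would silently pass them through.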

In [9]:
tracks.columns

Index(['name', 'popularity', 'duration', 'explicit', 'artists', 'id_artists',
       'release_date', 'danceability', 'energy', 'loudness', 'speechiness',
       'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo'],
      dtype='object')

In [10]:
# format='mixed' (pandas 2.0+) infers the format per value, so year-only
# strings such as '1922' parse alongside full dates
tracks.release_date=pd.to_datetime(tracks['release_date'],format='mixed')

In [11]:
tracks['release_month']=tracks.release_date.dt.month
tracks['release_year']=tracks.release_date.dt.year
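
Some raw `release_date` values carry only a year (for example `1922` in the fifth row of the `head()` output). When such a value is parsed to a timestamp, pandas typically fills the missing month and day with January 1, so `release_month` would read 1 for those rows even though the real month is unknown. The sketch below is one way to record the precision of each raw string before parsing. It assumes the raw values are `YYYY`, `YYYY-MM`, or `YYYY-MM-DD`; the `YYYY-MM` form is an assumption, and the sample values are illustrative.

```python
import pandas as pd

# Illustrative raw values: a full date, a year-month, and a year-only entry
raw_dates = pd.Series(["1922-02-22", "2015-07", "1922"])

# Classify precision by string length: 4 -> year, 7 -> month, 10 -> day
precision = raw_dates.str.len().map({4: "year", 7: "month", 10: "day"})

# The year is always the first four characters, whatever the precision
year = raw_dates.str[:4].astype(int)

print(pd.DataFrame({"raw": raw_dates, "precision": precision, "year": year}))
```

With a `precision` column in hand, month-level analysis could be restricted to rows where the month was actually recorded.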


In [12]:
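# Keep tracks released after 2000 (the year 2000 itself is excluded)
tracks=tracks[tracks.release_year>2000].copy()
# Preview with a fresh 0-based index; this does not modify `tracks`
tracks.set_index(np.arange(len(tracks))).head()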

Unnamed: 0,name,popularity,duration,explicit,artists,id_artists,release_date,danceability,energy,loudness,speechiness,acousticness,instrumentalness,liveness,valence,tempo,release_month,release_year
0,You'll Never Walk Alone - Mono; 2002 Remaster,56,160.187,0,Gerry & The Pacemakers,3UmBeGyNwr4iDWi1vTxWi8,2008-02-11,0.484,0.265,-11.101,0.0322,0.394,0.0,0.149,0.285,113.564,2,2008
1,A Lover's Concerto,41,159.56,0,The Toys,6lH5PpuiMa5SpfjoIOlwCS,2020-03-13,0.671,0.867,-2.706,0.0571,0.436,0.0,0.139,0.839,120.689,3,2020
2,Ferry Cross the Mersey - Mono; 2002 Remaster,40,141.987,0,Gerry & The Pacemakers,3UmBeGyNwr4iDWi1vTxWi8,2008-02-11,0.405,0.365,-10.226,0.0289,0.255,5e-06,0.163,0.588,104.536,2,2008
3,Don't Let the Sun Catch You Crying (Main) - Mono,34,157.093,0,Gerry & The Pacemakers,3UmBeGyNwr4iDWi1vTxWi8,2008-02-11,0.477,0.352,-14.165,0.03,0.406,0.0,0.122,0.478,106.773,2,2008
4,The September Of My Years - Live At The Sands ...,26,187.333,0,Frank Sinatra,1Mxqyy3pSjf8kZZL4QVxS0,2018-05-04,0.319,0.201,-17.796,0.0623,0.887,0.0,0.904,0.239,117.153,5,2018
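
The comparison `release_year>2000` drops tracks released in 2000 itself. Whether that is intended depends on the analysis, so the toy sketch below contrasts `>` with `>=` and uses `reset_index(drop=True)` to renumber the rows. The frame and its values are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["a", "b", "c"],
    "release_year": [1999, 2000, 2001],
})

# Strictly after 2000: the row for the year 2000 is excluded
after_2000 = toy[toy.release_year > 2000].reset_index(drop=True)

# 2000 and later: the row for the year 2000 is kept
from_2000 = toy[toy.release_year >= 2000].reset_index(drop=True)

print(after_2000)
print(from_2000)
```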


## Missing values

In [13]:
tracks.isnull().sum()

name                0
popularity          0
duration            0
explicit            0
artists             0
id_artists          0
release_date        0
danceability        0
energy              0
loudness            0
speechiness         0
acousticness        0
instrumentalness    0
liveness            0
valence             0
tempo               0
release_month       0
release_year        0
dtype: int64

In [14]:
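# No null names remain after the year filter (see the counts above),
# so this acts as a safeguard
tracks.dropna(subset=['name'],inplace=True)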
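
The raw `info()` output lists 586601 non-null names out of 586672 rows, while the count taken after the year filter shows none missing. A minimal sketch of the count-then-drop pattern on a toy frame (illustrative values only):

```python
import pandas as pd

toy = pd.DataFrame({"name": ["a", None, "c"], "popularity": [1, 2, 3]})

# Count missing names before dropping anything
n_missing = int(toy["name"].isnull().sum())

# Drop rows without a name and renumber the index
cleaned = toy.dropna(subset=["name"]).reset_index(drop=True)

print(n_missing, len(cleaned))
```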

## Duplicate values

In [15]:
tracks.duplicated().sum()

618

In [16]:
tracks.drop_duplicates(keep='first',ignore_index=True,inplace=True)

In [17]:
tracks.duplicated().sum()

0
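
Because the `id` column was dropped earlier, `duplicated()` here flags rows whose remaining columns all match. A stricter or looser notion of "duplicate" is possible with the `subset` argument. The toy sketch below compares the two; treating `name` plus `artists` as the key is an assumption made for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["x", "x", "x", "y"],
    "artists": ["A", "A", "A", "B"],
    "popularity": [10, 10, 12, 5],
})

# Rows identical in every column (the first occurrence is not counted)
exact = int(toy.duplicated().sum())

# Rows sharing name and artists, even if other columns differ
by_key = int(toy.duplicated(subset=["name", "artists"]).sum())

# Same call as in the notebook: keep the first copy, renumber the index
deduped = toy.drop_duplicates(keep="first", ignore_index=True)

print(exact, by_key, len(deduped))
```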

In [18]:
tracks.head()

Unnamed: 0,name,popularity,duration,explicit,artists,id_artists,release_date,danceability,energy,loudness,speechiness,acousticness,instrumentalness,liveness,valence,tempo,release_month,release_year
0,You'll Never Walk Alone - Mono; 2002 Remaster,56,160.187,0,Gerry & The Pacemakers,3UmBeGyNwr4iDWi1vTxWi8,2008-02-11,0.484,0.265,-11.101,0.0322,0.394,0.0,0.149,0.285,113.564,2,2008
1,A Lover's Concerto,41,159.56,0,The Toys,6lH5PpuiMa5SpfjoIOlwCS,2020-03-13,0.671,0.867,-2.706,0.0571,0.436,0.0,0.139,0.839,120.689,3,2020
2,Ferry Cross the Mersey - Mono; 2002 Remaster,40,141.987,0,Gerry & The Pacemakers,3UmBeGyNwr4iDWi1vTxWi8,2008-02-11,0.405,0.365,-10.226,0.0289,0.255,5e-06,0.163,0.588,104.536,2,2008
3,Don't Let the Sun Catch You Crying (Main) - Mono,34,157.093,0,Gerry & The Pacemakers,3UmBeGyNwr4iDWi1vTxWi8,2008-02-11,0.477,0.352,-14.165,0.03,0.406,0.0,0.122,0.478,106.773,2,2008
4,The September Of My Years - Live At The Sands ...,26,187.333,0,Frank Sinatra,1Mxqyy3pSjf8kZZL4QVxS0,2018-05-04,0.319,0.201,-17.796,0.0623,0.887,0.0,0.904,0.239,117.153,5,2018


## Columns after filtering to releases after 2000

In [19]:
tracks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 204129 entries, 0 to 204128
Data columns (total 18 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   name              204129 non-null  object        
 1   popularity        204129 non-null  int64         
 2   duration          204129 non-null  float64       
 3   explicit          204129 non-null  int64         
 4   artists           204129 non-null  object        
 5   id_artists        204129 non-null  object        
 6   release_date      204129 non-null  datetime64[ns]
 7   danceability      204129 non-null  float64       
 8   energy            204129 non-null  float64       
 9   loudness          204129 non-null  float64       
 10  speechiness       204129 non-null  float64       
 11  acousticness      204129 non-null  float64       
 12  instrumentalness  204129 non-null  float64       
 13  liveness          204129 non-null  float64       
 14  vale

## Export

In [20]:
tracks.to_csv('tracks_cleaned.csv',index=False)
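
CSV does not store dtypes, so `release_date` comes back as text when the exported file is read again unless it is parsed explicitly. The sketch below demonstrates this with an in-memory buffer standing in for `tracks_cleaned.csv`; the toy frame is illustrative.

```python
import io
import pandas as pd

toy = pd.DataFrame({
    "name": ["a"],
    "release_date": pd.to_datetime(["2008-02-11"]),
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default read: the date column is loaded as text
buffer.seek(0)
plain = pd.read_csv(buffer)

# Explicit parse: the date column is restored as datetime
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["release_date"])

print(plain.dtypes)
print(parsed.dtypes)
```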