# <p style='color: green;' > Spotify Exploratory Data Analysis (EDA) </p>

### Goal of EDA:
    * Analyze the data set by summarizing its main characteristics, complemented by visual representations.
    * Understand the data's structure, its outliers, and its anomalies.
    * Uncover underlying patterns before applying more complex statistical modeling or machine learning algorithms.

### Key aspects of EDA:
    1. Descriptive Statistics: Calculating measures such as mean, median, mode, standard deviation, and range to summarize the data.

    2. Data Visualization: Creating plots and graphs like histograms, scatter plots, box plots, and bar charts to visualize data distributions and relationships between variables.

    3. Data Cleaning: Identifying and handling missing values, duplicates, and errors in the data set.

    4. Outlier Detection: Identifying and analyzing outliers to understand their impact on the data set.

    5. Correlation Analysis: Examining relationships between variables to identify any associations or dependencies.

    6. Hypothesis Generation: Formulating hypotheses about the data that can be tested with more formal statistical methods or experiments.
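
As a rough illustration of aspects 4 and 5, the sketch below flags outliers with the 1.5 × IQR rule and computes a Pearson correlation on a tiny made-up DataFrame. The IQR rule, the helper name `iqr_outliers`, and the toy values are assumptions chosen for illustration; they are not taken from the analysis in this notebook.

```python
import pandas as pd

# Hypothetical toy data: the column names mirror the Spotify dataset,
# but the values are invented for this example.
toy = pd.DataFrame({
    "energy": [0.20, 0.40, 0.50, 0.60, 0.80, 0.90],
    "loudness": [-18.0, -12.0, -10.0, -8.0, -5.0, -4.0],
    "duration_ms": [200_000, 210_000, 190_000, 205_000, 195_000, 900_000],
})


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Outlier detection: only the 900_000 ms row falls outside the IQR fences.
outlier_mask = iqr_outliers(toy["duration_ms"])
print(toy[outlier_mask])

# Correlation analysis: pairwise Pearson correlation of two features.
corr = toy[["energy", "loudness"]].corr()
print(corr)
```

The multiplier `k` is a convention rather than a rule; a larger value flags fewer points.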



In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as seaborn

In [2]:
df = pd.read_csv('dataset-2.csv',
                 index_col='track_id')
df.head()

track_id,Unnamed: 0,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
5SuOikwiRyPMVoIQDJUgSV,0,Gen Hoshino,Comedy,Comedy,73,230666,False,0.676,0.461,1,-6.746,0,0.143,0.0322,1e-06,0.358,0.715,87.917,4,acoustic
4qPNDBW1i3p13qLCt0Ki3A,1,Ben Woodward,Ghost (Acoustic),Ghost - Acoustic,55,149610,False,0.42,0.166,1,-17.235,1,0.0763,0.924,6e-06,0.101,0.267,77.489,4,acoustic
1iJBSr7s7jYXzM8EGcbK5b,2,Ingrid Michaelson;ZAYN,To Begin Again,To Begin Again,57,210826,False,0.438,0.359,0,-9.734,1,0.0557,0.21,0.0,0.117,0.12,76.332,4,acoustic
6lfxq3CG4xtTiEg7opyCyx,3,Kina Grannis,Crazy Rich Asians (Original Motion Picture Sou...,Can't Help Falling In Love,71,201933,False,0.266,0.0596,0,-18.515,1,0.0363,0.905,7.1e-05,0.132,0.143,181.74,3,acoustic
5vjLSffimiIP26QG5WcN2K,4,Chord Overstreet,Hold On,Hold On,82,198853,False,0.618,0.443,2,-9.681,1,0.0526,0.469,0.0,0.0829,0.167,119.949,4,acoustic


# <p style='color:green;'> Data Cleaning </p>

First, we need to clean the data of any nulls, duplicates, and features we may deem unnecessary at first glance.

The column "Unnamed: 0" appears to be an incrementing row counter that was saved with the dataset, so it only repeats the row position. We don't need this column and can drop it.

The data has one row with nulls in three crucial features: artists, album_name, and track_name.

The data also contains 577 duplicate rows, which need to be dropped as well.

In [5]:
df.drop('Unnamed: 0', axis = 1, inplace=True)
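
As an alternative sketch, the counter column could be skipped while loading instead of being dropped afterwards. The example below uses a hypothetical two-row CSV held in memory (`csv_text` is an assumption that imitates the file's layout, not the real file) and relies on the fact that `usecols` in `pd.read_csv` accepts a callable evaluated against each column name.

```python
import io

import pandas as pd

# Hypothetical miniature CSV: an unnamed leading column with the row counter,
# followed by track_id and two of the dataset's columns.
csv_text = (
    ",track_id,artists,popularity\n"
    "0,5SuOikwiRyPMVoIQDJUgSV,Gen Hoshino,73\n"
    "1,4qPNDBW1i3p13qLCt0Ki3A,Ben Woodward,55\n"
)

toy = pd.read_csv(
    io.StringIO(csv_text),
    # Keep every column whose name does not start with "Unnamed".
    usecols=lambda name: not name.startswith("Unnamed"),
).set_index("track_id")

print(toy.columns.tolist())
```

Dropping the column after loading, as done above, gives the same result; filtering at load time only saves a step.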

In [6]:
df.columns

Index(['artists', 'album_name', 'track_name', 'popularity', 'duration_ms',
       'explicit', 'danceability', 'energy', 'key', 'loudness', 'mode',
       'speechiness', 'acousticness', 'instrumentalness', 'liveness',
       'valence', 'tempo', 'time_signature', 'track_genre'],
      dtype='object')

In [7]:
df.isna().any(), df.isna().sum()

(artists              True
 album_name           True
 track_name           True
 popularity          False
 duration_ms         False
 explicit            False
 danceability        False
 energy              False
 key                 False
 loudness            False
 mode                False
 speechiness         False
 acousticness        False
 instrumentalness    False
 liveness            False
 valence             False
 tempo               False
 time_signature      False
 track_genre         False
 dtype: bool,
 artists             1
 album_name          1
 track_name          1
 popularity          0
 duration_ms         0
 explicit            0
 danceability        0
 energy              0
 key                 0
 loudness            0
 mode                0
 speechiness         0
 acousticness        0
 instrumentalness    0
 liveness            0
 valence             0
 tempo               0
 time_signature      0
 track_genre         0
 dtype: int64)

In [8]:
df.loc[df['artists'].isna()]

track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
1kR4gIb7nGxHPI3D2ifs59,,,,0,0,False,0.501,0.583,7,-9.46,0,0.0605,0.69,0.00396,0.0747,0.734,138.391,4,k-pop


In [9]:
df.loc[df['album_name'].isna()]

track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
1kR4gIb7nGxHPI3D2ifs59,,,,0,0,False,0.501,0.583,7,-9.46,0,0.0605,0.69,0.00396,0.0747,0.734,138.391,4,k-pop


In [10]:
df.loc[df['track_name'].isna()]

track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
1kR4gIb7nGxHPI3D2ifs59,,,,0,0,False,0.501,0.583,7,-9.46,0,0.0605,0.69,0.00396,0.0747,0.734,138.391,4,k-pop


In [11]:
df.isna().value_counts()

artists  album_name  track_name  popularity  duration_ms  explicit  danceability  energy  key    loudness  mode   speechiness  acousticness  instrumentalness  liveness  valence  tempo  time_signature  track_genre
False    False       False       False       False        False     False         False   False  False     False  False        False         False             False     False    False  False           False          113999
True     True        True        False       False        False     False         False   False  False     False  False        False         False             False     False    False  False           False               1
Name: count, dtype: int64

In [12]:
df.drop(['1kR4gIb7nGxHPI3D2ifs59'], inplace=True)
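
Dropping by a hard-coded `track_id` works here because only one row is affected. A more general sketch, assuming the same three metadata columns are the ones that matter, uses `dropna(subset=...)` so the row does not have to be identified by hand. The toy frame, its ids, and the name `metadata_cols` are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame: the second row imitates the track whose
# artists, album_name, and track_name are all missing.
toy = pd.DataFrame(
    {
        "artists": ["Artist A", None, "Artist C"],
        "album_name": ["Album A", None, "Album C"],
        "track_name": ["Track A", None, "Track C"],
        "popularity": [73, 0, 71],
    },
    index=pd.Index(["id_a", "id_b", "id_c"], name="track_id"),
)

metadata_cols = ["artists", "album_name", "track_name"]

# Drop every row that has a null in any of the listed columns.
cleaned = toy.dropna(subset=metadata_cols)
print(cleaned)
```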


In [17]:
df.isna().value_counts()

artists  album_name  track_name  popularity  duration_ms  explicit  danceability  energy  key    loudness  mode   speechiness  acousticness  instrumentalness  liveness  valence  tempo  time_signature  track_genre
False    False       False       False       False        False     False         False   False  False     False  False        False         False             False     False    False  False           False          113999
Name: count, dtype: int64

In [21]:
df.duplicated().value_counts()

False    113422
True        577
Name: count, dtype: int64
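
One detail worth keeping in mind: `DataFrame.duplicated()` compares column values only and ignores the index, and with the default `keep='first'` it marks every repeat after the first occurrence. Because `track_id` is the index here, repeated ids are a separate question that `df.index.duplicated()` can answer. The sketch below shows the difference on a hypothetical toy frame; whether the real dataset contains repeated `track_id` values is not checked in this notebook.

```python
import pandas as pd

# Hypothetical toy frame with track_id as the index, as in the notebook.
toy = pd.DataFrame(
    {
        "track_name": ["Song A", "Song A", "Song B", "Song B"],
        "track_genre": ["rock", "rock", "pop", "indie"],
    },
    index=pd.Index(["id_1", "id_2", "id_3", "id_3"], name="track_id"),
)

# Column-wise duplicates: the index is not part of the comparison,
# so id_2 is flagged even though its track_id is unique.
row_dupes = toy.duplicated()  # keep='first' is the default

# Index duplicates: the second id_3 is flagged even though its
# track_genre differs from the first.
id_dupes = toy.index.duplicated()

print(row_dupes.tolist())
print(id_dupes.tolist())
```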

In [28]:
df[df.duplicated()]

track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
0CDucx9lKxuCZplLXUz0iX,Buena Onda Reggae Club,Disco 2,Song for Rollins,16,219346,False,0.841,0.577,0,-7.544,1,0.0438,0.238000,0.860000,0.0571,0.843,90.522,4,afrobeat
77s65ayZ3gXbqMV8jKH1A3,The Killers;Ryan Pardey,Alternative Christmas 2022,Don't Shoot Me Santa,0,245106,False,0.588,0.847,8,-4.164,1,0.0705,0.060100,0.000000,0.3070,0.662,120.041,4,alt-rock
4fdy3vg2bCXU6L77vC6li8,The Smashing Pumpkins,Alternative Christmas 2022,Christmastime,0,196723,False,0.165,0.434,0,-8.163,1,0.0288,0.316000,0.171000,0.2130,0.186,77.983,3,alt-rock
7mntHnF2frXuZwFAp8ouCB,Weezer,Alternative Christmas 2022,We Wish You A Merry Christmas,0,84973,False,0.387,0.786,11,-4.127,1,0.0436,0.019500,0.000000,0.1230,0.462,149.806,3,alt-rock
2aibwv5hGXSgw7Yru8IYTO,Red Hot Chili Peppers,Stadium Arcadium,Snow (Hey Oh),80,334666,False,0.427,0.900,11,-3.674,1,0.0499,0.116000,0.000017,0.1190,0.599,104.655,4,alt-rock
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2zg3iJW4fK7KZgHOvJU67z,Faithless,Faithless 2.0,Tarantula,21,398152,False,0.622,0.816,6,-11.095,0,0.0483,0.009590,0.578000,0.0991,0.427,136.007,4,trip-hop
46FPub2Fewe7XrgM0smTYI,Morcheeba,Parts of the Process,Undress Me Now,17,203773,False,0.576,0.352,7,-10.773,0,0.0268,0.700000,0.270000,0.1600,0.360,95.484,4,trip-hop
6qVA1MqDrDKfk9144bhoKp,Acil Servis,Küçük Adam,Bebek,38,319933,False,0.486,0.485,5,-12.391,0,0.0331,0.004460,0.000017,0.3690,0.353,120.095,4,turkish
5WaioelSGekDk3UNQy8zaw,Matt Redman,Sing Like Never Before: The Essential Collection,Our God - New Recording,34,265373,False,0.487,0.895,11,-5.061,1,0.0413,0.000183,0.000000,0.3590,0.384,105.021,4,world-music


In [47]:
df = df.drop_duplicates()

In [49]:
df.shape

(113422, 19)

The data is now free of nulls and duplicate rows, so we can proceed with the EDA.

## 1. Descriptive Statistics: 

Calculating measures such as mean, median, mode, standard deviation, and range to summarize the data.
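
A minimal sketch of these measures, using only the `popularity` and `tempo` values from the five rows shown by `df.head()` above. The names `toy`, `summary`, and `modes` are illustrative, and the numbers describe those five rows only, not the full dataset.

```python
import pandas as pd

# popularity and tempo for the five rows displayed by df.head().
toy = pd.DataFrame({
    "popularity": [73, 55, 57, 71, 82],
    "tempo": [87.917, 77.489, 76.332, 181.74, 119.949],
})

# One row per feature, one column per statistic.
summary = toy.agg(["mean", "median", "std", "min", "max"]).T

# Range is not a built-in aggregation, so derive it from max and min.
summary["range"] = summary["max"] - summary["min"]
print(summary)

# mode() returns a DataFrame because a column can have several modes.
modes = toy.mode()
print(modes)
```

On the full DataFrame, note that `mode` is also a column name in this dataset: `df['mode']` selects the column, while `df.mode()` calls the method.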