## Exploratory Data Analysis on the Most Streamed Spotify Songs of 2023 by Bon-ao, Angelo B.

#### This notebook performs an exploratory data analysis (EDA) of the most streamed Spotify songs of 2023.
#### It covers exploring, cleaning, and visualizing the dataset.

### Preparation/Pre-processing for EDA

#### Importing necessary libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

##### These libraries are needed to perform EDA on the given dataset.
##### The pandas library reads the dataset from the CSV file into a dataframe.
##### It also provides the methods needed for the EDA.
#####
##### The matplotlib.pyplot library lets us visualize the data and highlight the stories within it.
##### 
##### The seaborn library is built on top of matplotlib and offers higher-level statistical plots with more polished default styling and color palettes.

#### Loading/Reading the data as a dataframe

In [3]:
df_spotify = pd.read_csv('spotify-2023.csv', encoding='ISO-8859-1')
df_spotify

Unnamed: 0,track_name,artist(s)_name,artist_count,released_year,released_month,released_day,in_spotify_playlists,in_spotify_charts,streams,in_apple_playlists,...,bpm,key,mode,danceability_%,valence_%,energy_%,acousticness_%,instrumentalness_%,liveness_%,speechiness_%
0,Seven (feat. Latto) (Explicit Ver.),"Latto, Jung Kook",2,2023,7,14,553,147,141381703,43,...,125,B,Major,80,89,83,31,0,8,4
1,LALA,Myke Towers,1,2023,3,23,1474,48,133716286,48,...,92,C#,Major,71,61,74,7,0,10,4
2,vampire,Olivia Rodrigo,1,2023,6,30,1397,113,140003974,94,...,138,F,Major,51,32,53,17,0,31,6
3,Cruel Summer,Taylor Swift,1,2019,8,23,7858,100,800840817,116,...,170,A,Major,55,58,72,11,0,11,15
4,WHERE SHE GOES,Bad Bunny,1,2023,5,18,3133,50,303236322,84,...,144,A,Minor,65,23,80,14,63,11,6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
948,My Mind & Me,Selena Gomez,1,2022,11,3,953,0,91473363,61,...,144,A,Major,60,24,39,57,0,8,3
949,Bigger Than The Whole Sky,Taylor Swift,1,2022,10,21,1180,0,121871870,4,...,166,F#,Major,42,7,24,83,1,12,6
950,A Veces (feat. Feid),"Feid, Paulo Londra",2,2022,11,3,573,0,73513683,2,...,92,C#,Major,80,81,67,4,0,8,6
951,En La De Ella,"Feid, Sech, Jhayco",3,2022,10,20,1320,0,133895612,29,...,97,C#,Major,82,67,77,8,0,12,5


In [None]:
##### To load the given dataset, which is a CSV file, pd.read_csv is used.
##### encoding='ISO-8859-1' is passed because the file would not load without it, most likely because it contains characters that are not valid under the default UTF-8 encoding.
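
##### A minimal sketch of how the encoding choice could be made explicit: try UTF-8 first and fall back to ISO-8859-1 only when decoding fails.
##### The helper name read_csv_with_fallback and the tiny stand-in file are assumptions for illustration; they are not part of the original notebook.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "ISO-8859-1")):
    """Try each encoding in order; return the dataframe and the encoding that worked."""
    last_error = None
    for encoding in encodings:
        try:
            return pd.read_csv(path, encoding=encoding), encoding
        except UnicodeDecodeError as error:
            last_error = error
    raise last_error


# Build a tiny stand-in file containing the byte 0xE9 ("é" in ISO-8859-1),
# which is not valid UTF-8 when followed by a comma.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as handle:
    handle.write("track_name,streams\nCaf\u00e9,100\n".encode("ISO-8859-1"))
    demo_path = handle.name

demo_df, used_encoding = read_csv_with_fallback(demo_path)
os.remove(demo_path)

print(used_encoding)
print(demo_df)
```

##### With the real file, the same helper could be called as read_csv_with_fallback('spotify-2023.csv').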

#### Viewing/Checking the data

In [None]:
df_spotify.info()

In [None]:
##### Opening the CSV file in Excel and scrolling through the data shows that some values are missing.
#####
##### Calling .info() gives a basic summary of the data.
##### There are 953 rows and 24 columns.
##### The streams, in_deezer_playlists, and in_shazam_charts columns are also incorrectly detected as object instead of a numeric datatype.
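
##### Rather than spotting the problem values by eye, the non-numeric columns can be scanned for entries that fail numeric parsing.
##### This is only a sketch on a small stand-in dataframe with hypothetical values; non_numeric_report is an illustrative helper, not part of the original notebook.

```python
import pandas as pd

# Small stand-in dataframe (hypothetical values that mimic the issues in the real file)
demo = pd.DataFrame({
    "track_name": ["A", "B", "C"],
    "streams": ["100", "200", "BPM110KeyA"],
    "in_deezer_playlists": ["45", "2,445", "58"],
    "bpm": [125, 92, 138],
})


def non_numeric_report(df):
    """For each non-numeric column, list the values that cannot be parsed as numbers."""
    report = {}
    for name in df.columns:
        column = df[name]
        if pd.api.types.is_numeric_dtype(column):
            continue  # already numeric, nothing to check
        converted = pd.to_numeric(column, errors="coerce")
        # Values that were present but became NaN after conversion
        failed = column[converted.isna() & column.notna()]
        report[name] = failed.tolist()
    return report


report = non_numeric_report(demo)
for name, values in report.items():
    print(name, values)
```

##### A column where every value fails (such as track_name) is genuinely text, while a column with only a few failures is likely a numeric column with bad entries.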

### Fixing the dataset

In [None]:
### Fixing the datatypes for streams, in_deezer_playlists, and in_shazam_charts

In [None]:
#### For the streams columns

In [None]:
df_spotify['streams'] = pd.to_numeric(df_spotify['streams'], errors = 'coerce')
df_spotify['streams']

In [None]:
df_spotify['streams'].iloc[574]

##### errors = 'coerce' is used because, without it, the value at row index 574 cannot be converted to a number and the conversion raises an error
##### With it, the streams value at row index 574 is converted to NaN and is now missing
##### That cell held the values of other columns for the song 'Love Grows (Where My Rosemary Goes)' by Edison Lighthouse instead of a stream count
##### Its original content was 'BPM110KeyAModeMajorDanceability53Valence75Energy69Acousticness7Instrumentalness0Liveness17Speechiness3'
#####
##### Upon researching, the original song had about 276,093,748 accumulated streams and was being streamed about 168,328 times per day as of 2024/10/16, which was 290 days after the end of 2023
##### Rounding the daily rate to 168,000 and subtracting 168,000 × 290 = 48,720,000 from 276,093,748 to remove the extra days gives an estimate of 227,373,748 streams
##### This estimated value is now stored in that row, as done below,

In [None]:
df_spotify.at[574, 'streams'] = 227373748
df_spotify['streams'].iloc[574]
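
##### The estimate stored in row 574 can be reproduced with simple arithmetic. The sketch below uses only the figures quoted above.
##### The variable names are illustrative, and the result is an approximation because it assumes a constant daily streaming rate.

```python
from datetime import date

# Figures quoted in the commentary (looked up on 2024/10/16)
total_streams_at_lookup = 276_093_748
daily_streams_rounded = 168_000  # 168,328 per day, rounded as in the commentary

# Days elapsed between the end of 2023 and the lookup date
days_since_2023 = (date(2024, 10, 16) - date(2023, 12, 31)).days

# Remove the streams assumed to have accumulated after 2023
estimated_2023_streams = total_streams_at_lookup - daily_streams_rounded * days_since_2023

print(days_since_2023)
print(estimated_2023_streams)
```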

In [None]:
#### For the in_deezer_playlists columns

In [5]:
df_spotify['in_deezer_playlists'].iloc[48]

'2,445'

In [7]:
df_spotify['in_deezer_playlists'] = df_spotify['in_deezer_playlists'].str.replace(",","").astype(float)
df_spotify['in_deezer_playlists']

0       45.0
1       58.0
2       91.0
3      125.0
4       87.0
       ...  
948     37.0
949      8.0
950      7.0
951     17.0
952     32.0
Name: in_deezer_playlists, Length: 953, dtype: float64

In [9]:
df_spotify['in_deezer_playlists'].iloc[48]

2445.0

In [None]:
##### The in_deezer_playlists column contains numbers with thousands separators (commas), which is probably why it was detected as an object
##### Removing all the commas with .str.replace(",", "") (replacing each with an empty string) ensures that every value can be converted to a number
##### .astype(float) then converts all the values in the column to the float datatype
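
##### A small sketch of the comma-removal step on stand-in values, including a missing entry to show that it stays NaN after the conversion.
##### It also shows an alternative: pd.read_csv accepts a thousands=',' argument that parses such numbers at load time. The sample values and the inline CSV text are hypothetical.

```python
import io

import pandas as pd

# Stand-in values with thousands separators and one missing entry
raw = pd.Series(["45", "2,445", None, "1,021"])

# Strip the commas (literal match, no regex), then convert to float
cleaned = raw.str.replace(",", "", regex=False).astype(float)
print(cleaned)

# Alternative: let read_csv handle the thousands separator while loading
csv_text = 'track_name,in_deezer_playlists\nA,45\nB,"2,445"\n'
parsed = pd.read_csv(io.StringIO(csv_text), thousands=",")
print(parsed["in_deezer_playlists"])
```

##### The float datatype is used in the notebook because integer columns in pandas cannot hold NaN by default.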

In [None]:
#### For the in_shazam_charts

In [11]:
##### It has the same problem as the 'in_deezer_playlists' column, so we just need to repeat the previous process

In [13]:
df_spotify['in_shazam_charts'].iloc[12]

'1,021'

In [15]:
df_spotify['in_shazam_charts'] = df_spotify['in_shazam_charts'].str.replace(",","").astype(float)
df_spotify['in_shazam_charts']

0      826.0
1      382.0
2      949.0
3      548.0
4      425.0
       ...  
948      0.0
949      0.0
950      0.0
951      0.0
952      0.0
Name: in_shazam_charts, Length: 953, dtype: float64

In [17]:
df_spotify['in_shazam_charts'].iloc[12]

1021.0

## Guide Questions for EDA

#### How many rows and columns does the dataset contain?

##### Using .info(), as before, to view the basic information of the data,
##### the dataset contains 953 rows and 24 columns

#### What are the data types of each column? Are there any missing values?

##### Also from .info(), each column was originally loaded as either a 64-bit integer (int64) or an object
##### The streams, in_deezer_playlists, and in_shazam_charts values were incorrectly detected as object and were converted to numeric values above

In [19]:
df_spotify.isna().sum()

track_name               0
artist(s)_name           0
artist_count             0
released_year            0
released_month           0
released_day             0
in_spotify_playlists     0
in_spotify_charts        0
streams                  0
in_apple_playlists       0
in_apple_charts          0
in_deezer_playlists      0
in_deezer_charts         0
in_shazam_charts        50
bpm                      0
key                     95
mode                     0
danceability_%           0
valence_%                0
energy_%                 0
acousticness_%           0
instrumentalness_%       0
liveness_%               0
speechiness_%            0
dtype: int64

In [None]:
##### Combining .isna(), which flags each value that is missing or null, with
##### .sum(), which counts the flagged values in each column,
##### shows that there are 50 missing values in the 'in_shazam_charts' column and 95 missing values in the 'key' column
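
##### The missing-value counts can also be expressed as a share of all rows, which makes columns easier to compare.
##### This is a minimal sketch on a small stand-in dataframe with hypothetical values; the column names mirror the real dataset.

```python
import pandas as pd

# Small stand-in dataframe with a few missing values
demo = pd.DataFrame({
    "key": ["B", None, "F", None],
    "in_shazam_charts": [826.0, 382.0, None, 548.0],
    "bpm": [125, 92, 138, 170],
})

missing_counts = demo.isna().sum()          # number of missing values per column
missing_percent = demo.isna().mean() * 100  # share of rows that are missing, in percent

summary = pd.DataFrame({"missing": missing_counts, "percent": missing_percent})
summary = summary[summary["missing"] > 0]   # keep only columns with missing values
print(summary)
```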

In [None]:
df_spotify['streams'] = pd.to_numeric(df_spotify['streams'])

### Basic Descriptive Statistics

#### What are the mean, median, and standard deviation of the streams column?

In [None]:
df_spotify['streams'].mean()

In [None]:
df_spotify['streams'].median()

In [None]:
df_spotify['streams'].std()

In [None]:
##### The mean of the streams column shows that, on average, the most streamed songs on Spotify in 2023
##### have close to half a billion streams, but this average may be pulled up by the highest streamed songs having very large counts
#####
##### The median of streams is the middle value when the counts are sorted in ascending order
##### Since the median is lower than the mean, more than half of the songs in this list have a stream count below the mean
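
##### A toy example of why a few very large values pull the mean above the median.
##### The numbers below are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical stream counts: four ordinary songs and one very large outlier
toy_streams = pd.Series([100, 120, 150, 200, 3000])

toy_mean = toy_streams.mean()
toy_median = toy_streams.median()
share_below_mean = (toy_streams < toy_mean).mean()  # fraction of songs under the mean
toy_skew = toy_streams.skew()                       # positive for a right-skewed distribution

print(toy_mean, toy_median, share_below_mean)
```

##### The same comparison on the real data would use df_spotify['streams'] in place of toy_streams.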

In [None]:
sorted_streams = df_spotify['streams'].sort_values(ascending=False)
sorted_streams.head(10)