## Exploratory Data Analysis on the Most Streamed Spotify Songs of 2023 by Bon-ao, Angelo B.

#### This notebook presents an exploratory data analysis (EDA) of the most streamed Spotify songs of 2023.
#### It covers exploring, cleaning, and visualizing the dataset.


### Preparation/Pre-processing for EDA

#### Importing necessary libraries

In [10]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#### Loading/Reading the data as a dataframe

In [11]:
df_spotify = pd.read_csv('spotify-2023.csv', encoding='ISO-8859-1')

#### Viewing/Checking the data

In [14]:
df_spotify.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 953 entries, 0 to 952
Data columns (total 24 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   track_name            953 non-null    object
 1   artist(s)_name        953 non-null    object
 2   artist_count          953 non-null    int64 
 3   released_year         953 non-null    int64 
 4   released_month        953 non-null    int64 
 5   released_day          953 non-null    int64 
 6   in_spotify_playlists  953 non-null    int64 
 7   in_spotify_charts     953 non-null    int64 
 8   streams               953 non-null    object
 9   in_apple_playlists    953 non-null    int64 
 10  in_apple_charts       953 non-null    int64 
 11  in_deezer_playlists   953 non-null    object
 12  in_deezer_charts      953 non-null    int64 
 13  in_shazam_charts      903 non-null    object
 14  bpm                   953 non-null    int64 
 15  key                   858 non-null    ob

In [16]:
df_spotify.describe()

,artist_count,released_year,released_month,released_day,in_spotify_playlists,in_spotify_charts,in_apple_playlists,in_apple_charts,in_deezer_charts,bpm,danceability_%,valence_%,energy_%,acousticness_%,instrumentalness_%,liveness_%,speechiness_%
count,953.0,953.0,953.0,953.0,953.0,953.0,953.0,953.0,953.0,953.0,953.0,953.0,953.0,953.0,953.0,953.0,953.0
mean,1.556139,2018.238195,6.033578,13.930745,5200.124869,12.009444,67.812172,51.908709,2.666317,122.540399,66.96957,51.43127,64.279119,27.057712,1.581322,18.213012,10.131165
std,0.893044,11.116218,3.566435,9.201949,7897.60899,19.575992,86.441493,50.630241,6.035599,28.057802,14.63061,23.480632,16.550526,25.996077,8.4098,13.711223,9.912888
min,1.0,1930.0,1.0,1.0,31.0,0.0,0.0,0.0,0.0,65.0,23.0,4.0,9.0,0.0,0.0,3.0,2.0
25%,1.0,2020.0,3.0,6.0,875.0,0.0,13.0,7.0,0.0,100.0,57.0,32.0,53.0,6.0,0.0,10.0,4.0
50%,1.0,2022.0,6.0,13.0,2224.0,3.0,34.0,38.0,0.0,121.0,69.0,51.0,66.0,18.0,0.0,12.0,6.0
75%,2.0,2022.0,9.0,22.0,5542.0,16.0,88.0,87.0,2.0,140.0,78.0,70.0,77.0,43.0,0.0,24.0,11.0
max,8.0,2023.0,12.0,31.0,52898.0,147.0,672.0,275.0,58.0,206.0,96.0,97.0,97.0,97.0,91.0,97.0,64.0


##### .info() gives a basic summary of the data.
##### There are 953 rows and 24 columns.
##### The streams column is read as an object instead of a number. One possible cause is values written in exponential notation
##### using the letter 'E', although pandas normally parses such values as floats, so a non-numeric entry somewhere in the column may be the more likely cause.
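
##### A minimal sketch of how the cause could be checked, assuming a small made-up sample in place of the real column (the values below are hypothetical, not taken from the dataset).
##### pd.to_numeric with errors="coerce" turns entries it cannot parse into NaN, which makes them easy to list.

```python
import pandas as pd

# Hypothetical stand-in for the 'streams' column; these values are made up
sample = pd.DataFrame(
    {"streams": ["141381703", "1.33E+08", "not a number", "800840817"]}
)

# errors="coerce" converts anything that cannot be parsed into NaN
parsed = pd.to_numeric(sample["streams"], errors="coerce")

# Rows where parsing failed are the ones that keep the column non-numeric
bad_rows = sample[parsed.isna()]
print(bad_rows)

# Replace the text column with its numeric version
sample["streams"] = parsed
print(sample["streams"].dtype)
```

##### In this sample the exponential value parses without trouble and only the text entry fails, so the same check on df_spotify['streams'] would show which values are actually responsible.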


##### For the descriptive statistics, the following stood out about the most streamed songs on Spotify in 2023:
##### They have higher values in the Apple Music charts column (mean of about 51.9) than in the Deezer (about 2.7) and Spotify (about 12.0) charts columns.
##### They tend to have a fast tempo: the median is 121 BPM, which falls in the Allegro range of 109-132 BPM.
##### They are more acoustic, and slightly lower in valence and energy, than I expected.

## Guide Questions for EDA

#### How many rows and columns does the dataset contain?

##### .info(), used earlier to view the basic information of the data,
##### shows that the dataset contains 953 rows and 24 columns.
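
##### As a quick alternative, the .shape attribute returns the same counts as a (rows, columns) tuple. A minimal sketch on a made-up two-column frame (the demo data is hypothetical):

```python
import pandas as pd

# Hypothetical miniature frame; the real dataset would be df_spotify
demo = pd.DataFrame(
    {"track_name": ["song a", "song b", "song c"], "bpm": [100, 120, 140]}
)

# .shape is an attribute (no parentheses) holding (number of rows, number of columns)
n_rows, n_cols = demo.shape
print(f"{n_rows} rows, {n_cols} columns")
```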

#### What are the data types of each column? Are there any missing values?

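##### .info() also shows that each column is either a 64-bit integer (int64) or an object.
##### The streams column is incorrectly read as an object even though it should be numeric; the output above shows in_deezer_playlists and in_shazam_charts read as objects as well.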
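
##### A minimal sketch of listing data types directly, assuming a made-up frame with the same kinds of columns (the values are hypothetical).
##### .dtypes shows the type of every column, and select_dtypes(exclude="number") keeps only the columns that were not read as numbers.

```python
import pandas as pd

# Hypothetical miniature frame; 'streams' is deliberately stored as text
demo = pd.DataFrame(
    {
        "artist_count": [1, 2],
        "streams": ["141381703", "133716286"],
        "key": ["B", None],
    }
)

# One dtype per column
print(demo.dtypes)

# Columns that were not read as numbers and may need cleaning
non_numeric_cols = demo.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_cols)
```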

In [24]:
df_spotify.isna().sum()

track_name               0
artist(s)_name           0
artist_count             0
released_year            0
released_month           0
released_day             0
in_spotify_playlists     0
in_spotify_charts        0
streams                  0
in_apple_playlists       0
in_apple_charts          0
in_deezer_playlists      0
in_deezer_charts         0
in_shazam_charts        50
bpm                      0
key                     95
mode                     0
danceability_%           0
valence_%                0
energy_%                 0
acousticness_%           0
instrumentalness_%       0
liveness_%               0
speechiness_%            0
dtype: int64

##### .isna() flags each missing or null value in a column, and .sum() then counts those flags per column.
##### Together they show 50 missing values in the 'in_shazam_charts' column and 95 missing values in the 'key' column.
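
##### A minimal sketch of the same idea on a made-up frame (hypothetical values), extended to show only the affected columns and the share of rows that are missing.
##### Taking .mean() of the True/False flags gives the fraction of missing rows, since True counts as 1.

```python
import pandas as pd

# Hypothetical miniature frame with two missing 'key' values
demo = pd.DataFrame(
    {"key": ["B", None, "C#", None], "bpm": [125, 92, 138, 170]}
)

# Count of missing values per column
missing_counts = demo.isna().sum()

# Keep only the columns that actually have missing values
missing_only = missing_counts[missing_counts > 0]
print(missing_only)

# Percentage of rows that are missing in each column
missing_share = demo.isna().mean() * 100
print(missing_share)
```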