## 1) Data Collection 

Initially we used Spotify's own Web API, through a Python library called spotipy, to harvest our data. However, because we sent too many requests, we reached the Web API rate limit. We therefore accepted the server's resource limits and looked for datasets elsewhere, including Amazon's AWS datasets, OpenDataMonitor.eu, OpenMLG.org, Kaggle and Reddit's r/datasets.

On Reddit, we found a dataset published by a Kaggle user named asaniczka. The data had been updated to include songs from 2024, which fitted our time limit perfectly: we wanted to include only data that was no more than 12 months old.

We will now start the data collection process, using Python, pandas, matplotlib and seaborn.

Link: https://www.kaggle.com/datasets/asaniczka/top-spotify-songs-in-73-countries-daily-updated?rvi=1 



In [4]:
import pandas as pd

df = pd.read_csv('universal_top_spotify_songs.csv') 

pd.set_option('display.max_rows', 30)
pd.set_option('display.max_columns', 30)


In [5]:
df.head()

Unnamed: 0,spotify_id,name,artists,daily_rank,daily_movement,weekly_movement,country,snapshot_date,popularity,is_explicit,duration_ms,album_name,album_release_date,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
0,2KslE17cAJNHTsI2MI0jb2,Standing Next to You,Jung Kook,1,0,0,VN,2024-04-22,92,False,206019,GOLDEN,2023-11-03,0.711,0.809,2,-4.389,0,0.0955,0.0447,0.0,0.339,0.816,106.017,4
1,2OzhQlSqBEmt7hmkYxfT6m,Fortnight (feat. Post Malone),"Taylor Swift, Post Malone",2,0,48,VN,2024-04-22,88,False,228965,THE TORTURED POETS DEPARTMENT,2024-04-18,0.675,0.397,11,-10.895,1,0.0245,0.499,6e-06,0.0939,0.319,95.988,4
2,2xOhv7XudrBDtkID1jwsFE,Từng Là,Vũ Cát Tường,3,1,0,VN,2024-04-22,71,False,252500,Từng Là,2024-03-01,0.808,0.414,5,-10.95,1,0.038,0.864,0.000118,0.174,0.609,115.041,4
3,2HRgqmZQC0MC7GeNuDIXHN,Seven (feat. Latto) (Explicit Ver.),"Jung Kook, Latto",4,-1,0,VN,2024-04-22,87,True,183550,GOLDEN,2023-11-03,0.79,0.831,11,-4.185,1,0.044,0.312,0.0,0.0797,0.872,124.987,4
4,3qhYidu0cemx1v9PgTtpS5,Chúng Ta Của Tương Lai,Sơn Tùng M-TP,5,0,-3,VN,2024-04-22,73,False,249871,Chúng Ta Của Tương Lai,2024-03-08,0.694,0.556,0,-7.097,1,0.0805,0.787,0.00688,0.115,0.485,145.954,4


The info() method gives a quick overview of the dataset, in particular the total number of rows and columns, the type of each attribute, and the number of non-null values.

As we can see, the dataset is fairly large and contains approximately 675,000 rows. We will now proceed to investigate the data further.

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 674853 entries, 0 to 674852
Data columns (total 25 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   spotify_id          674853 non-null  object 
 1   name                674828 non-null  object 
 2   artists             674828 non-null  object 
 3   daily_rank          674853 non-null  int64  
 4   daily_movement      674853 non-null  int64  
 5   weekly_movement     674853 non-null  int64  
 6   country             665696 non-null  object 
 7   snapshot_date       674853 non-null  object 
 8   popularity          674853 non-null  int64  
 9   is_explicit         674853 non-null  bool   
 10  duration_ms         674853 non-null  int64  
 11  album_name          674603 non-null  object 
 12  album_release_date  674603 non-null  object 
 13  danceability        674853 non-null  float64
 14  energy              674853 non-null  float64
 15  key                 674853 non-nul
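
The info() output lists `snapshot_date` and `album_release_date` as `object` columns, which means pandas read them as plain strings. Below is a minimal sketch of how such a column could be parsed and checked against the 12-month limit mentioned earlier. The toy rows and the `reference_date` are assumptions made up for illustration; they are not taken from the real file.

```python
import pandas as pd

# Toy stand-in for the real CSV; only the date column matters here.
toy = pd.DataFrame({
    "name": ["song_a", "song_b", "song_c"],
    "snapshot_date": ["2024-04-22", "2023-10-18", "2023-01-05"],
})

# Convert the string column into real datetimes.
toy["snapshot_date"] = pd.to_datetime(toy["snapshot_date"])

# Hypothetical reference date; in practice this could be the newest snapshot.
reference_date = pd.Timestamp("2024-04-22")
cutoff = reference_date - pd.DateOffset(months=12)

# Keep only rows that are at most 12 months older than the reference date.
recent = toy[toy["snapshot_date"] >= cutoff]
print(recent["name"].tolist())
```

On the full dataset the same conversion would be applied to `df['snapshot_date']` before any time-based filtering.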

## 2) Data Pre-processing

In this part we explore the dataset further in order to understand its underlying structure. This step includes handling any missing values, duplicates and outliers.

When handling missing values, there are several techniques one can use, depending on whether the data is numerical, text or categorical.

We can either remove the missing values or use imputation to substitute values for them.
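
The two options can be sketched on a small made-up frame. The column values below are assumptions for illustration only. The median is used for the numerical column and the most frequent value for the categorical one.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing numerical and one missing categorical value.
toy = pd.DataFrame({
    "tempo": [120.0, np.nan, 98.0, 140.0],
    "country": ["DK", "SE", None, "DK"],
})

# Option 1: drop every row that has at least one missing value.
dropped = toy.dropna()

# Option 2: impute. Median for the numerical column,
# most frequent value (mode) for the categorical column.
imputed = toy.copy()
imputed["tempo"] = imputed["tempo"].fillna(imputed["tempo"].median())
imputed["country"] = imputed["country"].fillna(imputed["country"].mode()[0])

print(len(dropped), len(imputed))
```

Dropping loses whole rows, while imputation keeps every row at the cost of inserting estimated values.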

We will go with imputation, as the number of missing values is very low relative to the whole dataset.

A better way to make sure that we have no missing values in our test set or training set is to use SimpleImputer from scikit-learn.

The benefit is that, when set to the median strategy, it stores the median value of each numerical attribute, which enables us to impute missing values in our training set, test set, validation set and any new data in the future.
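
A minimal sketch of that idea, assuming scikit-learn is installed. The `train` and `test` frames are toy data invented for illustration. The imputer is fitted on the training data only, and the medians it learned are then reused on the test data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy numerical data; the values are made up for this example.
train = pd.DataFrame({"tempo": [100.0, 120.0, np.nan, 140.0],
                      "energy": [0.2, np.nan, 0.6, 0.8]})
test = pd.DataFrame({"tempo": [np.nan, 90.0],
                     "energy": [0.5, np.nan]})

# The default strategy is "mean", so the median has to be requested explicitly.
imputer = SimpleImputer(strategy="median")
imputer.fit(train)  # learns one median per column from the training data

# statistics_ holds the learned medians, in column order.
print(dict(zip(train.columns, imputer.statistics_)))

# Reuse the training medians to fill the gaps in the test data.
test_filled = pd.DataFrame(imputer.transform(test), columns=test.columns)
print(test_filled)
```

The median only applies to numerical columns. For text columns such as `country` or `album_name`, `strategy="most_frequent"` would be one possible alternative.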



In [7]:
df.isna().sum()

spotify_id               0
name                    25
artists                 25
daily_rank               0
daily_movement           0
weekly_movement          0
country               9157
snapshot_date            0
popularity               0
is_explicit              0
duration_ms              0
album_name             250
album_release_date     250
danceability             0
energy                   0
key                      0
loudness                 0
mode                     0
speechiness              0
acousticness             0
instrumentalness         0
liveness                 0
valence                  0
tempo                    0
time_signature           0
dtype: int64
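
To put these counts in proportion to the 674,853 rows of the dataset, the short sketch below recomputes the share of missing values per column from the numbers printed above. The dictionary is copied by hand from that output; on the DataFrame itself, `df.isna().mean()` would give the same fractions directly.

```python
# Row count and missing-value counts copied from the outputs above.
total_rows = 674853
missing = {
    "name": 25,
    "artists": 25,
    "country": 9157,
    "album_name": 250,
    "album_release_date": 250,
}

# Fraction of rows with a missing value, per column.
share = {col: n / total_rows for col, n in missing.items()}

for col, frac in share.items():
    print(f"{col}: {frac:.3%}")
```

Even `country`, the column with the most gaps, is missing in under 1.5% of the rows.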

In [8]:
df.shape

(674853, 25)

In [10]:
#Let's look at all the countries in the data
df['country'].unique()

array(['VN', 'UA', 'TW', 'TR', 'TH', 'SV', 'PT', 'PL', 'PK', 'PH', 'PE',
       'PA', 'NI', 'MY', 'MX', 'KZ', 'JP', 'IT', 'IL', 'ID', 'HU', 'HN',
       'HK', 'GT', 'GR', 'FR', 'FI', 'EG', 'DO', 'CZ', 'CR', 'CL', 'BY',
       'BR', 'BO', 'BE', 'AR', 'SK', 'SG', 'NL', 'KR', nan, 'ZA', 'US',
       'SE', 'SA', 'RO', 'NZ', 'NO', 'NG', 'MA', 'LV', 'LU', 'LT', 'IS',
       'IE', 'GB', 'ES', 'EE', 'DK', 'DE', 'CH', 'CA', 'BG', 'AU', 'AT',
       'VE', 'UY', 'PY', 'IN', 'EC', 'CO', 'AE'], dtype=object)

In [15]:
#Our project focuses on the European market; the list below currently holds only Denmark ('DK'), so the filter keeps only the Danish chart entries
european_countries = ['DK']

#Now we filter the DataFrame to include only rows where the 'country' column is in the list of European countries
df = df[df['country'].isin(european_countries)]
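
A side effect of filtering with isin() is that rows with a missing `country` are dropped as well, since NaN never matches an entry in the list. A small sketch on made-up data (the `toy` frame and the `selected` list are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing country code.
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "country": ["DK", np.nan, "SE", "DK"],
})

selected = ["DK"]

# isin() returns False for NaN, so the row with the missing country is excluded.
mask = toy["country"].isin(selected)
filtered = toy[mask].copy()  # copy() avoids chained-assignment warnings later on

print(filtered["country"].isna().sum())
print(filtered.index.tolist())
```

The filtered frame keeps the original row labels; `reset_index(drop=True)` can be used if a fresh 0-based index is preferred.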

In [20]:
#That leaves us with 9,210 rows of data
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 9210 entries, 6200 to 674052
Data columns (total 25 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   spotify_id          9210 non-null   object 
 1   name                9210 non-null   object 
 2   artists             9210 non-null   object 
 3   daily_rank          9210 non-null   int64  
 4   daily_movement      9210 non-null   int64  
 5   weekly_movement     9210 non-null   int64  
 6   country             9210 non-null   object 
 7   snapshot_date       9210 non-null   object 
 8   popularity          9210 non-null   int64  
 9   is_explicit         9210 non-null   bool   
 10  duration_ms         9210 non-null   int64  
 11  album_name          9210 non-null   object 
 12  album_release_date  9210 non-null   object 
 13  danceability        9210 non-null   float64
 14  energy              9210 non-null   float64
 15  key                 9210 non-null   int64  
 16  l

In [21]:
#Let's start by handling missing data:
df.isna().sum()

spotify_id            0
name                  0
artists               0
daily_rank            0
daily_movement        0
weekly_movement       0
country               0
snapshot_date         0
popularity            0
is_explicit           0
duration_ms           0
album_name            0
album_release_date    0
danceability          0
energy                0
key                   0
loudness              0
mode                  0
speechiness           0
acousticness          0
instrumentalness      0
liveness              0
valence               0
tempo                 0
time_signature        0
dtype: int64

In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9210 entries, 6200 to 674052
Data columns (total 25 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   spotify_id          9210 non-null   object 
 1   name                9210 non-null   object 
 2   artists             9210 non-null   object 
 3   daily_rank          9210 non-null   int64  
 4   daily_movement      9210 non-null   int64  
 5   weekly_movement     9210 non-null   int64  
 6   country             9210 non-null   object 
 7   snapshot_date       9210 non-null   object 
 8   popularity          9210 non-null   int64  
 9   is_explicit         9210 non-null   bool   
 10  duration_ms         9210 non-null   int64  
 11  album_name          9210 non-null   object 
 12  album_release_date  9210 non-null   object 
 13  danceability        9210 non-null   float64
 14  energy              9210 non-null   float64
 15  key                 9210 non-null   int64  
 16  l