## 1) Data Collection 

Initially, we collected our data through Spotify's own API using the Python library spotipy, focusing solely on the Danish market. However, because we sent too many requests, Spotify rate-limited us and spotipy raised a `SpotifyException` with HTTP status 429 (Too Many Requests).
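
A common way to cope with a 429 response is to retry with exponential backoff. The following is only a minimal sketch of that idea, not the code we ran: `RateLimitError`, `call_with_backoff`, and the `fetch` callable are hypothetical names introduced for illustration. With spotipy, one would presumably catch its `SpotifyException` and check for status 429 instead.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for an HTTP 429 error raised by an API client."""

    def __init__(self, retry_after=None):
        super().__init__("429 Too Many Requests")
        self.retry_after = retry_after  # seconds suggested by the server, if any


def call_with_backoff(fetch, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(); on a rate-limit error, wait and retry with growing delays."""
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except RateLimitError as err:
            if attempt == max_retries:
                raise  # give up after the last allowed retry
            # Prefer the server's hint; otherwise double the delay each attempt.
            if err.retry_after is not None:
                delay = err.retry_after
            else:
                delay = base_delay * 2 ** attempt
            # A little random jitter avoids retrying in lockstep with other clients.
            sleep(delay + random.uniform(0, 0.1 * delay))
```

Passing `sleep` as a parameter is a design choice that makes the helper easy to test without actually waiting.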

Since this blocked our project group, we started to look for alternatives and for other recent music data, focusing mostly on Europe, as that is where we are from.

We found a dataset on Kaggle that contains recent data extending into 2024. We will therefore proceed with that data.

In [36]:
import pandas as pd

df = pd.read_csv('universal_top_spotify_songs.csv') 

pd.set_option('display.max_rows', 30)
pd.set_option('display.max_columns', 30)


In [37]:
df.head()

Unnamed: 0,spotify_id,name,artists,daily_rank,daily_movement,weekly_movement,country,snapshot_date,popularity,is_explicit,duration_ms,album_name,album_release_date,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
0,2KslE17cAJNHTsI2MI0jb2,Standing Next to You,Jung Kook,1,0,0,VN,2024-04-22,92,False,206019,GOLDEN,2023-11-03,0.711,0.809,2,-4.389,0,0.0955,0.0447,0.0,0.339,0.816,106.017,4
1,2OzhQlSqBEmt7hmkYxfT6m,Fortnight (feat. Post Malone),"Taylor Swift, Post Malone",2,0,48,VN,2024-04-22,88,False,228965,THE TORTURED POETS DEPARTMENT,2024-04-18,0.675,0.397,11,-10.895,1,0.0245,0.499,6e-06,0.0939,0.319,95.988,4
2,2xOhv7XudrBDtkID1jwsFE,Từng Là,Vũ Cát Tường,3,1,0,VN,2024-04-22,71,False,252500,Từng Là,2024-03-01,0.808,0.414,5,-10.95,1,0.038,0.864,0.000118,0.174,0.609,115.041,4
3,2HRgqmZQC0MC7GeNuDIXHN,Seven (feat. Latto) (Explicit Ver.),"Jung Kook, Latto",4,-1,0,VN,2024-04-22,87,True,183550,GOLDEN,2023-11-03,0.79,0.831,11,-4.185,1,0.044,0.312,0.0,0.0797,0.872,124.987,4
4,3qhYidu0cemx1v9PgTtpS5,Chúng Ta Của Tương Lai,Sơn Tùng M-TP,5,0,-3,VN,2024-04-22,73,False,249871,Chúng Ta Của Tương Lai,2024-03-08,0.694,0.556,0,-7.097,1,0.0805,0.787,0.00688,0.115,0.485,145.954,4


In [38]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 674853 entries, 0 to 674852
Data columns (total 25 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   spotify_id          674853 non-null  object 
 1   name                674828 non-null  object 
 2   artists             674828 non-null  object 
 3   daily_rank          674853 non-null  int64  
 4   daily_movement      674853 non-null  int64  
 5   weekly_movement     674853 non-null  int64  
 6   country             665696 non-null  object 
 7   snapshot_date       674853 non-null  object 
 8   popularity          674853 non-null  int64  
 9   is_explicit         674853 non-null  bool   
 10  duration_ms         674853 non-null  int64  
 11  album_name          674603 non-null  object 
 12  album_release_date  674603 non-null  object 
 13  danceability        674853 non-null  float64
 14  energy              674853 non-null  float64
 15  key                 674853 non-nul

## 2) Data Pre-processing

In this part we explore the dataset further to understand its underlying structure and to identify missing values, duplicates, and outliers. We then preprocess the data as necessary, which includes handling missing values, scaling numerical features, and encoding categorical variables.
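
To make these steps concrete, here is a minimal sketch on a small toy DataFrame. The values are made up, and the specific choices (the 1.5 × IQR rule, min-max scaling, one-hot encoding) are assumptions for illustration, not necessarily the methods applied later in this notebook.

```python
import pandas as pd

# Toy data for illustration only; the values are not taken from the dataset.
toy = pd.DataFrame({
    "country": ["DK", "DK", "SE", "NO", "SE", "DK"],
    "duration_ms": [200_000, 200_000, 210_000, 190_000, 205_000, 900_000],
    "tempo": [100.0, 100.0, 120.0, 90.0, 110.0, 140.0],
})

# 1) Remove exact duplicate rows.
toy = toy.drop_duplicates().reset_index(drop=True)

# 2) Flag outliers with the 1.5 * IQR rule on one numeric column.
q1, q3 = toy["duration_ms"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = toy["duration_ms"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
toy["duration_outlier"] = ~in_range

# 3) Min-max scale a numeric feature to the range [0, 1].
tempo = toy["tempo"]
toy["tempo_scaled"] = (tempo - tempo.min()) / (tempo.max() - tempo.min())

# 4) One-hot encode a categorical feature.
encoded = pd.get_dummies(toy, columns=["country"], prefix="country")
```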



In [39]:
df['country'].unique()

array(['VN', 'UA', 'TW', 'TR', 'TH', 'SV', 'PT', 'PL', 'PK', 'PH', 'PE',
       'PA', 'NI', 'MY', 'MX', 'KZ', 'JP', 'IT', 'IL', 'ID', 'HU', 'HN',
       'HK', 'GT', 'GR', 'FR', 'FI', 'EG', 'DO', 'CZ', 'CR', 'CL', 'BY',
       'BR', 'BO', 'BE', 'AR', 'SK', 'SG', 'NL', 'KR', nan, 'ZA', 'US',
       'SE', 'SA', 'RO', 'NZ', 'NO', 'NG', 'MA', 'LV', 'LU', 'LT', 'IS',
       'IE', 'GB', 'ES', 'EE', 'DK', 'DE', 'CH', 'CA', 'BG', 'AU', 'AT',
       'VE', 'UY', 'PY', 'IN', 'EC', 'CO', 'AE'], dtype=object)

In [40]:
# In our project we focus solely on the European market, so we filter the data to include only European countries
european_countries = ['UA', 'PT', 'PL', 'IT', 'HU', 'GR', 'FR', 'FI', 'CZ', 'BE', 'SK', 'NL', 'SE', 'RO', 'NO', 'LV', 'LU', 'LT', 'IS', 'IE', 'GB', 'ES', 'EE', 'DK', 'DE', 'CH', 'BG', 'AT']

# Keep only the rows whose 'country' value is in the list of European countries
df = df[df['country'].isin(european_countries)]
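
A filter like this can be sanity-checked in a few lines. The sketch below uses a toy column and a shortened, hypothetical country list. It illustrates that `isin` evaluates to `False` for missing values, so rows without a country code are dropped as well, and it shows how to spot list entries that never occur in the data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 'country' column; values are illustrative only.
toy = pd.DataFrame({"country": ["DK", "SE", "US", np.nan, "DK", "JP"]})
european = ["DK", "SE", "NO"]  # shortened, hypothetical list

mask = toy["country"].isin(european)
kept = toy[mask]

# isin() is False for NaN, so rows without a country code are dropped too.
dropped_missing = int(toy.loc[~mask, "country"].isna().sum())

# Codes in the list that never occur in the data (possible typos or absent markets).
unused_codes = sorted(set(european) - set(toy["country"].dropna()))

# Rows kept per country, as a quick overview.
rows_per_country = kept["country"].value_counts()
```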

In [41]:
# That leaves us with roughly 258k rows of data
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 258485 entries, 50 to 674752
Data columns (total 25 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   spotify_id          258485 non-null  object 
 1   name                258482 non-null  object 
 2   artists             258482 non-null  object 
 3   daily_rank          258485 non-null  int64  
 4   daily_movement      258485 non-null  int64  
 5   weekly_movement     258485 non-null  int64  
 6   country             258485 non-null  object 
 7   snapshot_date       258485 non-null  object 
 8   popularity          258485 non-null  int64  
 9   is_explicit         258485 non-null  bool   
 10  duration_ms         258485 non-null  int64  
 11  album_name          258396 non-null  object 
 12  album_release_date  258396 non-null  object 
 13  danceability        258485 non-null  float64
 14  energy              258485 non-null  float64
 15  key                 258485 non-nu
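
The `df.info()` output lists `snapshot_date` and `album_release_date` as `object`, meaning they are stored as plain strings. If date arithmetic is needed later, they could be converted as in the sketch below. It uses toy values and assumes the ISO `YYYY-MM-DD` format seen in the output above; the `days_since_release` column is a hypothetical example.

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    "snapshot_date": ["2024-04-22", "2023-12-04"],
    "album_release_date": ["2023-11-03", None],
})

for col in ["snapshot_date", "album_release_date"]:
    # errors="coerce" turns missing or unparseable entries into NaT instead of raising.
    toy[col] = pd.to_datetime(toy[col], format="%Y-%m-%d", errors="coerce")

# With real datetimes, date arithmetic works, e.g. days between release and snapshot.
toy["days_since_release"] = (toy["snapshot_date"] - toy["album_release_date"]).dt.days
```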

In [42]:
# Let's start with missing data: count the missing values per column
df.isna().sum()

spotify_id             0
name                   3
artists                3
daily_rank             0
daily_movement         0
weekly_movement        0
country                0
snapshot_date          0
popularity             0
is_explicit            0
duration_ms            0
album_name            89
album_release_date    89
danceability           0
energy                 0
key                    0
loudness               0
mode                   0
speechiness            0
acousticness           0
instrumentalness       0
liveness               0
valence                0
tempo                  0
time_signature         0
dtype: int64
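
One possible way to act on these counts is sketched below on toy rows. It assumes that rows without a track name or artist are unusable and can be dropped, while rows that only lack album metadata are kept. The `"Unknown"` placeholder is a hypothetical choice, not something the dataset prescribes.

```python
import numpy as np
import pandas as pd

# Toy rows mimicking the two kinds of gaps; values are illustrative only.
toy = pd.DataFrame({
    "spotify_id": ["a1", "b2", "c3"],
    "name": ["Song A", np.nan, "Song C"],
    "artists": ["Artist A", np.nan, "Artist C"],
    "album_name": ["Album A", np.nan, np.nan],
    "album_release_date": ["2023-11-03", np.nan, np.nan],
})

# Rows without a track name or artist cannot be identified, so drop them.
cleaned = toy.dropna(subset=["name", "artists"]).copy()

# Keep tracks that only lack album metadata, but mark the gap explicitly.
cleaned["album_name"] = cleaned["album_name"].fillna("Unknown")

# Re-count the missing values to confirm what is left.
remaining = cleaned.isna().sum()
```

The release date is deliberately left missing here, since no placeholder date would be meaningful.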

In [45]:
# Subset the DataFrame to inspect the rows with missing album names
df_missing_album = df[df['album_name'].isna()]

In [46]:
df_missing_album

Unnamed: 0,spotify_id,name,artists,daily_rank,daily_movement,weekly_movement,country,snapshot_date,popularity,is_explicit,duration_ms,album_name,album_release_date,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature
23599,3yrSvpt2l1xhsV9Em88Pul,Brown Eyed Girl,Van Morrison,50,0,-3,IE,2024-04-15,85,False,183306,,,0.491,0.583,7,-10.964,1,0.0376,0.1850,0.0,0.4060,0.908,150.566,4
49146,3yrSvpt2l1xhsV9Em88Pul,Brown Eyed Girl,Van Morrison,47,0,-13,IE,2024-04-08,84,False,183306,,,0.491,0.583,7,-10.964,1,0.0376,0.1850,0.0,0.4060,0.908,150.566,4
52796,3yrSvpt2l1xhsV9Em88Pul,Brown Eyed Girl,Van Morrison,47,2,2,IE,2024-04-07,84,False,183306,,,0.491,0.583,7,-10.964,1,0.0376,0.1850,0.0,0.4060,0.908,150.566,4
56448,3yrSvpt2l1xhsV9Em88Pul,Brown Eyed Girl,Van Morrison,49,-1,1,IE,2024-04-06,84,False,183306,,,0.491,0.583,7,-10.964,1,0.0376,0.1850,0.0,0.4060,0.908,150.566,4
60097,3yrSvpt2l1xhsV9Em88Pul,Brown Eyed Girl,Van Morrison,48,2,2,IE,2024-04-05,84,False,183306,,,0.491,0.583,7,-10.964,1,0.0376,0.1850,0.0,0.4060,0.908,150.566,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
501768,4HEOgBHRCExyYVeTyrXsnL,Jingle Bells - Remastered 1999,Frank Sinatra,49,1,1,LU,2023-12-04,83,False,120693,,,0.512,0.339,8,-13.119,1,0.0498,0.7270,0.0,0.0977,0.951,174.609,4
503264,4HEOgBHRCExyYVeTyrXsnL,Jingle Bells - Remastered 1999,Frank Sinatra,45,5,5,CH,2023-12-04,83,False,120693,,,0.512,0.339,8,-13.119,1,0.0498,0.7270,0.0,0.0977,0.951,174.609,4
503656,4HEOgBHRCExyYVeTyrXsnL,Jingle Bells - Remastered 1999,Frank Sinatra,37,13,13,AT,2023-12-04,83,False,120693,,,0.512,0.339,8,-13.119,1,0.0498,0.7270,0.0,0.0977,0.951,174.609,4
589880,7lyv2sysHCzFjypILxAynT,,,12,38,38,ES,2023-11-10,0,True,0,,,0.730,0.792,0,-4.643,1,0.0517,0.0232,0.0,0.0699,0.533,90.019,4
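
Because the same track appears on many days and in several countries, the rows above repeat a handful of songs. A grouped summary makes this easier to see. The sketch below runs on a toy stand-in for `df_missing_album`; the values and the `summary` name are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_missing_album; values are illustrative only.
toy_missing = pd.DataFrame({
    "spotify_id": ["x1", "x1", "x1", "y2", "y2", "z3"],
    "name": ["Track X"] * 3 + ["Track Y"] * 2 + [np.nan],
    "artists": ["Artist X"] * 3 + ["Artist Y"] * 2 + [np.nan],
    "country": ["IE", "IE", "IE", "LU", "CH", "ES"],
})

# Group by the track ID (never missing) and count chart rows and distinct countries.
summary = (
    toy_missing
    .groupby("spotify_id")
    .agg(
        name=("name", "first"),          # stays missing if the track has no name
        rows=("country", "size"),        # number of chart entries
        countries=("country", "nunique"),
    )
    .sort_values("rows", ascending=False)
)
```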
