### **Project Introduction: Spotify data analysis** 

This notebook is a working draft for developing the individual parts of the project.

It uses a Kaggle dataset to prototype the analysis and the data pipelines, which will later run on real data fetched from the project's data sources (such as the Spotify API, ...):
- dataset: https://www.kaggle.com/datasets/nelgiriyewithana/top-spotify-songs-2023

Contains:

- Data engineering requirements:
    - data manipulation
    - ETL requirements
    - anomaly analysis

- Data Analysis:
    - EDA

In [1]:
import os 
import sys

sys.dont_write_bytecode = True

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Data preparation

In [2]:
spotify_2023_df = pd.read_csv('./Data/spotify-2023.csv', encoding='latin-1')

In [3]:
spotify_2023_df.head()

Unnamed: 0,track_name,artist(s)_name,artist_count,released_year,released_month,released_day,in_spotify_playlists,in_spotify_charts,streams,in_apple_playlists,...,bpm,key,mode,danceability_%,valence_%,energy_%,acousticness_%,instrumentalness_%,liveness_%,speechiness_%
0,Seven (feat. Latto) (Explicit Ver.),"Latto, Jung Kook",2,2023,7,14,553,147,141381703,43,...,125,B,Major,80,89,83,31,0,8,4
1,LALA,Myke Towers,1,2023,3,23,1474,48,133716286,48,...,92,C#,Major,71,61,74,7,0,10,4
2,vampire,Olivia Rodrigo,1,2023,6,30,1397,113,140003974,94,...,138,F,Major,51,32,53,17,0,31,6
3,Cruel Summer,Taylor Swift,1,2019,8,23,7858,100,800840817,116,...,170,A,Major,55,58,72,11,0,11,15
4,WHERE SHE GOES,Bad Bunny,1,2023,5,18,3133,50,303236322,84,...,144,A,Minor,65,23,80,14,63,11,6


data investigation

In [4]:
spotify_2023_df.dtypes

track_name              object
artist(s)_name          object
artist_count             int64
released_year            int64
released_month           int64
released_day             int64
in_spotify_playlists     int64
in_spotify_charts        int64
streams                 object
in_apple_playlists       int64
in_apple_charts          int64
in_deezer_playlists     object
in_deezer_charts         int64
in_shazam_charts        object
bpm                      int64
key                     object
mode                    object
danceability_%           int64
valence_%                int64
energy_%                 int64
acousticness_%           int64
instrumentalness_%       int64
liveness_%               int64
speechiness_%            int64
dtype: object

Exploration of the object-typed (categorical) columns.

A few problems:
- streams should be numerical
- in_deezer_playlists and in_shazam_charts also appear to hold numerical data

In [5]:
spotify_2023_df[spotify_2023_df.columns[spotify_2023_df.dtypes == object]]

Unnamed: 0,track_name,artist(s)_name,streams,in_deezer_playlists,in_shazam_charts,key,mode
0,Seven (feat. Latto) (Explicit Ver.),"Latto, Jung Kook",141381703,45,826,B,Major
1,LALA,Myke Towers,133716286,58,382,C#,Major
2,vampire,Olivia Rodrigo,140003974,91,949,F,Major
3,Cruel Summer,Taylor Swift,800840817,125,548,A,Major
4,WHERE SHE GOES,Bad Bunny,303236322,87,425,A,Minor
...,...,...,...,...,...,...,...
948,My Mind & Me,Selena Gomez,91473363,37,0,A,Major
949,Bigger Than The Whole Sky,Taylor Swift,121871870,8,0,F#,Major
950,A Veces (feat. Feid),"Feid, Paulo Londra",73513683,7,0,C#,Major
951,En La De Ella,"Feid, Sech, Jhayco",133895612,17,0,C#,Major


Conversion investigation
- result: the data is corrupted, so the ETL needs conversion error handling

In [6]:
convert_cols = ['streams', 'in_deezer_playlists', 'in_shazam_charts']

for col in convert_cols:

    try:
        _  = spotify_2023_df[col].apply(lambda x: float(x))

    except ValueError as e:
        print(f"COLUMN: {col} --> Float conversion failed with value = {e}")

COLUMN: streams --> Float conversion failed with value = could not convert string to float: 'BPM110KeyAModeMajorDanceability53Valence75Energy69Acousticness7Instrumentalness0Liveness17Speechiness3'
COLUMN: in_deezer_playlists --> Float conversion failed with value = could not convert string to float: '2,445'
COLUMN: in_shazam_charts --> Float conversion failed with value = could not convert string to float: '1,021'
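
The three failures above fall into two groups: thousands separators (`'2,445'`, `'1,021'`) and a genuinely corrupted value in `streams`. A minimal sketch of how the ETL step could handle both, assuming that stripping commas is safe for these columns and that unparseable values should become `NaN` rather than raise. The toy frame and the helper name `to_numeric_clean` are illustrative and not part of the project code:

```python
import pandas as pd

# Toy frame mimicking the failure modes seen above (values are illustrative).
raw = pd.DataFrame({
    "streams": ["141381703", "BPM110KeyAModeMajor", "800840817"],
    "in_deezer_playlists": ["45", "2,445", "125"],
    "in_shazam_charts": ["826", "1,021", None],
})


def to_numeric_clean(series: pd.Series) -> pd.Series:
    """Remove thousands separators, then coerce unparseable values to NaN."""
    cleaned = series.str.replace(",", "", regex=False)
    return pd.to_numeric(cleaned, errors="coerce")


converted = raw.apply(to_numeric_clean)

# A cell is "rejected" when it had a value that could not be parsed.
rejected_mask = converted.isna() & raw.notna()
rejected_rows = raw[rejected_mask.any(axis=1)]

print(converted)
print(rejected_rows)
```

Keeping `rejected_rows` around makes it possible to log or quarantine corrupted records instead of dropping them silently.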


Analysis of unique values in the categorical columns, excluding the columns in convert_cols (those will be converted to numeric).

In [7]:
categorical_columns = set(spotify_2023_df[spotify_2023_df.columns[spotify_2023_df.dtypes == object]]) - set(convert_cols)
categorical_columns = list(categorical_columns)

Analysis:

- artist(s)_name: will require an encoding strategy
- track_name: as expected, track names are almost all unique, and the column will probably be dropped
- key: no anomalous labels (note that `nunique` skips missing values, so NaNs need a separate check)
- mode: no anomalies

In [8]:
spotify_2023_df[categorical_columns].apply(lambda x: x.nunique())

key                11
artist(s)_name    645
mode                2
track_name        943
dtype: int64

IMPORTANT: Artists are separated by commas, so we first need to split the names and then count the unique values. Rows with multiple artists will therefore need a multi-label encoding (probably an indicator vector, something like 01000100...)
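
A minimal sketch of such an indicator ("multi-hot") encoding, assuming no artist name itself contains a comma. It uses `Series.str.get_dummies`, which builds one 0/1 column per distinct token; the sample names are taken from the rows shown earlier:

```python
import pandas as pd

artists = pd.Series([
    "Latto, Jung Kook",
    "Myke Towers",
    "Feid, Paulo Londra",
    "Feid, Sech, Jhayco",
])

# Normalise the separator first so names carry no stray whitespace,
# then build one indicator column per artist.
normalised = artists.str.replace(r"\s*,\s*", ",", regex=True).str.strip()
multi_hot = normalised.str.get_dummies(sep=",")

print(multi_hot)
```

With several hundred distinct artists the resulting matrix is wide and sparse, so this is more a baseline than a final choice.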

In [9]:
spotify_2023_df[categorical_columns].head()

Unnamed: 0,key,artist(s)_name,mode,track_name
0,B,"Latto, Jung Kook",Major,Seven (feat. Latto) (Explicit Ver.)
1,C#,Myke Towers,Major,LALA
2,F,Olivia Rodrigo,Major,vampire
3,A,Taylor Swift,Major,Cruel Summer
4,A,Bad Bunny,Minor,WHERE SHE GOES


In [10]:
data = np.sum(spotify_2023_df['artist(s)_name'].str.split(',').to_numpy())
data_series = pd.Series(data)

In [11]:
data_series.to_frame()

Unnamed: 0,0
0,Latto
1,Jung Kook
2,Myke Towers
3,Olivia Rodrigo
4,Taylor Swift
...,...
1477,Paulo Londra
1478,Feid
1479,Sech
1480,Jhayco


True number of unique artists

In [88]:
data_series.nunique()

803
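
One caveat about the count: splitting on `','` alone keeps the space that follows each comma, so `' Jung Kook'` (as a featured artist) and `'Jung Kook'` (as a solo artist) would be counted as two different names. Whether this affects the figure above has not been checked here. A small sketch of a whitespace-safe variant on toy data, using `explode` instead of summing lists:

```python
import pandas as pd

names = pd.Series(["Latto, Jung Kook", "Jung Kook", "Feid, Sech, Jhayco"])

artists = (
    names.str.split(",")  # list of names per track
    .explode()            # one row per (track, artist) pair
    .str.strip()          # drop the space left after each comma
)

naive_count = names.str.split(",").explode().nunique()
clean_count = artists.nunique()
print(naive_count, clean_count)
```

`explode` also keeps the original row index, which makes it easy to join the per-artist rows back to the track-level columns.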

### Exploratory data analysis

Minor data fixes for convenience of analysis (to be moved into the ETL later)

In [58]:
(~spotify_2023_df['streams'].str.fullmatch(r'-?\d+(\.\d+)?')).any()

True

In [60]:
# locate rows whose stream value is not a plain number

invalid_row_index = spotify_2023_df[~spotify_2023_df['streams'].str.fullmatch(r'-?\d+(\.\d+)?')].index

dropping invalid rows

In [62]:
spotify_2023_df = spotify_2023_df.drop(invalid_row_index, errors='ignore')

recasting dtype

In [65]:
spotify_2023_df['streams'] = spotify_2023_df['streams'].astype(float)

Artist analysis - counts & streams

In [66]:
spotify_2023_df_copy = spotify_2023_df.copy()

spotify_2023_df_copy['artist(s)_name'] = spotify_2023_df_copy['artist(s)_name'].str.split(',')

track counts

In [71]:
spotify_2023_df_copy.explode('artist(s)_name').reset_index().value_counts(subset='artist(s)_name').to_frame()

artist(s)_name,count
Taylor Swift,36
The Weeknd,34
Bad Bunny,26
SZA,23
Kendrick Lamar,23
...,...
Supernova Ent,1
Swae Lee,1
Swedish House Mafia,1
Tanna Leone,1


track streams

In [79]:
spotify_2023_df_copy.explode('artist(s)_name').reset_index().groupby('artist(s)_name').agg(
    most_popular_stream_sums = ('streams', 'sum')
).sort_values(by='most_popular_stream_sums', ascending=False)

artist(s)_name,most_popular_stream_sums
The Weeknd,2.151655e+10
Bad Bunny,1.536378e+10
Ed Sheeran,1.455968e+10
Taylor Swift,1.442324e+10
Harry Styles,1.160865e+10
...,...
DJ 900,1.195664e+07
Sog,1.159939e+07
Ryan Castro,1.159939e+07
Sukriti Kakar,1.365184e+06


#### **categorical data analysis**
- preparation for data encoding: determination of encoding types, column handling,...

In [82]:
spotify_2023_df[categorical_columns]

Unnamed: 0,key,artist(s)_name,mode,track_name
0,B,"Latto, Jung Kook",Major,Seven (feat. Latto) (Explicit Ver.)
1,C#,Myke Towers,Major,LALA
2,F,Olivia Rodrigo,Major,vampire
3,A,Taylor Swift,Major,Cruel Summer
4,A,Bad Bunny,Minor,WHERE SHE GOES
...,...,...,...,...
948,A,Selena Gomez,Major,My Mind & Me
949,F#,Taylor Swift,Major,Bigger Than The Whole Sky
950,C#,"Feid, Paulo Londra",Major,A Veces (feat. Feid)
951,C#,"Feid, Sech, Jhayco",Major,En La De Ella


attributes:
- **key**: The key represents the tonal center or the "home" note and scale of a piece of music. For example, if a song is in the key of C major, the note C is the tonal center, and the scale used is the C major scale.

- **mode**: The two most common modes are:
    - Major (e.g., C major): Sounds bright, happy, and uplifting.
    - Minor (e.g., A minor): Sounds dark, sad, or introspective.

<img src="https://musiccrashcourses.com/images/notation/circle_of_fifths_colors.svg">

**Encoding interpretation**

- **mode**: Since mode takes only two values ('Minor', 'Major'), binary encoding makes the most sense here:
    - Minor = 0
    - Major = 1

- **key**: Based on the "circle of fifths", there are several ways to encode this attribute:
    - Ordinal encoding: a custom ordinal encoding that preserves the relationships between keys (close keys get nearby values, distant keys get distant values)
    - Circular encoding: sine/cosine gives a continuous, periodic representation in two columns, where $k \in \{0, \dots, 11\}$ is the key's position on the circle:
    
        - $x = \cos(\frac{2\pi k}{12})$
        
        - $y = \sin(\frac{2\pi k}{12})$

- **key-mode combination**: With one-hot encoding we could in theory create $12 \times 2 = 24$ columns, one per key-mode pair.
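
The encodings above can be sketched as follows. This is an illustration, not the project's final encoder. The ordering in `CIRCLE_OF_FIFTHS` (clockwise from C, spelled with sharps to match the dataset's labels) and the new column names are assumptions made for this example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "key": ["B", "C#", "F", "A", "A"],
    "mode": ["Major", "Major", "Major", "Major", "Minor"],
})

# Binary encoding of mode.
df["mode_bin"] = df["mode"].map({"Minor": 0, "Major": 1})

# Assumed ordering: circle of fifths, clockwise from C, sharps spelling.
CIRCLE_OF_FIFTHS = ["C", "G", "D", "A", "E", "B",
                    "F#", "C#", "G#", "D#", "A#", "F"]
position = {name: k for k, name in enumerate(CIRCLE_OF_FIFTHS)}

# Circular encoding: neighbouring keys on the circle get nearby (x, y) points,
# and F (k = 11) ends up next to C (k = 0).
k = df["key"].map(position)
df["key_cos"] = np.cos(2 * np.pi * k / 12)
df["key_sin"] = np.sin(2 * np.pi * k / 12)

# One-hot encoding of the key-mode combination (at most 24 columns).
combo = df["key"] + "_" + df["mode"]
one_hot = pd.get_dummies(combo, prefix="key_mode")

print(df)
print(one_hot.columns.tolist())
```

Rows with a missing key would get NaN in `key_cos`/`key_sin` and an all-zero one-hot row, so missing keys need to be handled before or after this step.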