### This notebook contains early-stage experimentation, exploratory analysis, and feature prototyping. The final, cleaned analysis and modeling pipeline is available on the `main` branch.

## Data Collection and Extraction Stages
### The goal is to have audio features for each track as well as the number of streams across all platforms. We will attempt to predict the number of streams from the track's audio features.
1. Pull a sample of metal tracks released from 2005 onward from Chartmetric. The data includes stream counts for Spotify and other platforms, as well as the ISRC (International Standard Recording Code). This code is shared between multiple versions of the same song (e.g. album vs. single, remastered versions, etc.) that are essentially the same track.
2. Spotify no longer allows its data to be used for AI training and has deprecated the API for pulling tracks' audio features, so we will use a third-party API, Reccobeats, to calculate audio features for each track.
3. Create a script to pull audio features from Reccobeats by ISRC for each song in the Chartmetric dataset.
4. Since an ISRC can return multiple versions of the same song, calculate the mean of each audio feature to approximate the most popular version's features. The Reccobeats API currently does not provide stream data, so it is not possible to pull only the audio features of the main version. Averaging across the versions of a song is a reasonable alternative.
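
The averaging in step 4 could be sketched as follows. This is a minimal illustration with made-up values: the `versions` frame and its column names (`isrc`, `energy`, `valence`, `tempo`) are assumptions for the example, not the actual Reccobeats response schema.

```python
import pandas as pd

# Hypothetical audio features: one row per version of a track (values are made up).
versions = pd.DataFrame(
    {
        "isrc": ["AAA000000001", "AAA000000001", "BBB000000002"],
        "energy": [0.90, 0.94, 0.70],
        "valence": [0.30, 0.34, 0.50],
        "tempo": [135.0, 137.0, 120.0],
    }
)

# Collapse the versions to one row per ISRC by taking the mean of each feature.
mean_features = versions.groupby("isrc", as_index=False).mean(numeric_only=True)
print(mean_features)
```

A plain mean weights every version equally, which matches the reasoning above: without stream data there is no basis for weighting the main version more heavily.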


In [32]:
# Import libraries
from pathlib import Path

# Data analysis
import numpy as np
import pandas as pd

# Project root: the parent of the current working directory
project_root = Path.cwd().parent

## Data Analysis

In [27]:
# Load Chartmetric data
raw_data_path = project_root / "data" / "raw"
chart_ds = pd.read_csv(raw_data_path / "chartmetric_raw.csv")
print(chart_ds.shape)
chart_ds.head(5)

(6800, 30)


Unnamed: 0,Track,Album Name,Artists,Release Date,ISRC,All Time Rank,Track Score,Spotify Streams,Spotify Playlist Count,Spotify Playlist Reach,...,Deezer Playlist Count,Deezer Playlist Reach,Amazon Music Playlist Count,Pandora Streams,Pandora Track Stations,Soundcloud Streams,Shazams,TIDAL Popularity,Explicit Track,Shortlists
0,Psychosocial,Psychosocial,Slipknot,2008-07-08,NLA320886993,1912,97.14,808166393,172861,23649176,...,20.0,277509,34.0,128354144.0,82682.0,14547258,,,0,
1,Stricken,Stricken,Disturbed,2005-07-20,USRE10500766,6688,95.45,466331168,92604,17799301,...,7.0,201307,24.0,316520872.0,63786.0,6196954,,,0,
2,Unsainted,Unsainted,Slipknot,2019-05-16,NLA321900089,8169,95.1,379175553,85612,9081286,...,13.0,109511,24.0,37473127.0,13403.0,633800,524502.0,,1,
3,Sanctified with Dynamite,Blood of the Saints,Powerwolf,2011-07-29,USMBR1108247,9865,94.75,62337875,19076,1982796,...,2.0,4037,11.0,,,154715,62296.0,,0,
4,Nero Forte,We Are Not Your Kind,Slipknot,2019-08-09,NLA321900097,12992,94.19,205529233,52321,6025170,...,7.0,67513,4.0,20807156.0,5627.0,398199,304875.0,,1,


In [28]:
# Check for null values and dtypes
chart_ds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6800 entries, 0 to 6799
Data columns (total 30 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Track                        6800 non-null   object 
 1   Album Name                   6800 non-null   object 
 2   Artists                      6800 non-null   object 
 3   Release Date                 6800 non-null   object 
 4   ISRC                         6800 non-null   object 
 5   All Time Rank                6800 non-null   object 
 6   Track Score                  6800 non-null   float64
 7   Spotify Streams              6762 non-null   object 
 8   Spotify Playlist Count       6800 non-null   object 
 9   Spotify Playlist Reach       6800 non-null   object 
 10  Spotify Popularity           6749 non-null   object 
 11  YouTube Views                3642 non-null   object 
 12  YouTube Likes                3640 non-null   object 
 13  TikTok Videos     

### We are seeing a lot of null data in the dataset; however, we are mostly interested in Spotify stream counts, as it is the biggest platform. A lot of numeric data is also typed as object, which we will correct. We will therefore reload the dataset using only the relevant columns, correct the data types, and decide what to do with the null values.


In [56]:
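# count() returns the number of non-null values, so 1 - share of non-nulls is the null fraction
non_null_count = chart_ds['Spotify Streams'].count()
null_frac = 1 - non_null_count / chart_ds.shape[0]
print(null_frac)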

0.005588235294117672
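
The same fraction can also be computed directly with `isna().mean()`. The sketch below uses a toy Series as a stand-in for the `Spotify Streams` column; the values are illustrative only.

```python
import pandas as pd

# Toy stand-in for chart_ds['Spotify Streams']; None marks a missing value.
streams = pd.Series(["808,166,393", None, "379,175,553", "62,337,875"])

# isna() gives a boolean mask; its mean is the fraction of missing entries.
missing_fraction = streams.isna().mean()
print(f"{missing_fraction:.2%}")
```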


In [69]:

# Only a tiny proportion (about 0.56%) of the Spotify stream values is missing, so those rows can safely be dropped.
# Columns to load: Track, Artists, Release Date, ISRC, Spotify Streams, Explicit Track
cols = [0, 2, 3, 4, 7, 28]
chart_ds = pd.read_csv(raw_data_path / "chartmetric_raw.csv", usecols=cols, parse_dates=[2], thousands=",",
                       keep_default_na=False, na_filter=False, dtype={'Spotify Streams': np.float64})
chart_ds.head(5)

ValueError: could not convert string to float: '808,166,393'
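
One way to sidestep this error, sketched under the assumption that the stream counts arrive as comma-formatted strings such as `'808,166,393'`, is to read the column as text, strip the separators, and convert with `pd.to_numeric`. The inline CSV below is a made-up stand-in for `chartmetric_raw.csv`, not the real file.

```python
import io

import pandas as pd

# Made-up stand-in for the raw file; the second row has no stream count.
csv_text = (
    "Track,ISRC,Spotify Streams\n"
    'Song A,AAA000000001,"808,166,393"\n'
    "Song B,BBB000000002,\n"
    'Song C,CCC000000003,"62,337,875"\n'
)

# Read the column as text so no numeric parsing is attempted at load time.
df = pd.read_csv(io.StringIO(csv_text), dtype={"Spotify Streams": str})

# Strip thousands separators, then convert; values that cannot be parsed become NaN.
df["Spotify Streams"] = pd.to_numeric(
    df["Spotify Streams"].str.replace(",", "", regex=False), errors="coerce"
)

# Drop the rows without a stream count.
df = df.dropna(subset=["Spotify Streams"]).reset_index(drop=True)
print(df)
```

Doing the conversion after loading keeps the missing values as NaN, so dropping them is a separate, explicit step.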

In [42]:
chart_ds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6800 entries, 0 to 6799
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   Track            6800 non-null   object        
 1   Artists          6800 non-null   object        
 2   Release Date     6800 non-null   datetime64[ns]
 3   ISRC             6800 non-null   object        
 4   Spotify Streams  6762 non-null   object        
 5   Explicit Track   6800 non-null   int64         
dtypes: datetime64[ns](1), int64(1), object(4)
memory usage: 318.9+ KB
