# Spotify Data Analysis

Now that I've scraped data using Spotify's API, it's time to analyze it.

## Here's The Plan

1. Data Cleaning
2. Exploratory Data Analysis
3. Data Visualization

## 0 - Importing Libraries and Reading Data

In [1]:
import pandas as pd

In [4]:
playlist = pd.read_csv('files/playlist.csv')
artists = pd.read_csv('files/updated_artists.csv')

## 1 - Data Cleaning

When I scraped the data from my playlist, I gave the CSV file the wrong header names: the last six columns hold artist IDs, but I labeled them as popularity. So I'll fix them here.

In [12]:
playlist.columns

Index(['Track_Name', 'Track_Id', 'Track_Popularity', 'Artist_1_Name',
       'Artist_2_Name', 'Artist_3_Name', 'Artist_4_Name', 'Artist_5_Name',
       'Artist_6_Name', 'Artist_1_Popularity', 'Artist_2_Popularity',
       'Artist_3_Popularity', 'Artist_4_Popularity', 'Artist_5_Popularity',
       'Artist_6_Popularity'],
      dtype='object')

In [25]:
# rename returns a new DataFrame, so assign it back to keep the fixed headers
playlist = playlist.rename(columns={
                    'Artist_1_Popularity': 'Artist_1_Id',
                    'Artist_2_Popularity': 'Artist_2_Id',
                    'Artist_3_Popularity': 'Artist_3_Id',
                    'Artist_4_Popularity': 'Artist_4_Id',
                    'Artist_5_Popularity': 'Artist_5_Id',
                    'Artist_6_Popularity': 'Artist_6_Id'})
playlist

| | Track_Name | Track_Id | Track_Popularity | Artist_1_Name | Artist_2_Name | Artist_3_Name | Artist_4_Name | Artist_5_Name | Artist_6_Name | Artist_1_Id | Artist_2_Id | Artist_3_Id | Artist_4_Id | Artist_5_Id | Artist_6_Id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Felices los 4 | 1RouRzlg8OKFeqc6LvdxmB | 77 | Maluma | | | | | | 1r4hJ1h58CWwUQe3MxPuau | | | | | |
| 1 | ADMV | 3eJMSq78dDaFb7VvhNFnq6 | 72 | Maluma | | | | | | 1r4hJ1h58CWwUQe3MxPuau | | | | | |
| 2 | Chantaje (feat. Maluma) | 6mICuAdrwEjh6Y6lroV2Kg | 80 | Shakira | Maluma | | | | | 0EmeFodog0BfCgMzAIvKQp | 1r4hJ1h58CWwUQe3MxPuau | | | | |
| 3 | Ignorantes | 3wYRLYuO1M88d8woWUIxct | 72 | Bad Bunny | Sech | | | | | 4q3ewBCX7sLwd24euuV69X | 77ziqFxp5gaInVrF2lj4ht | | | | |
| 4 | Me Gusta | 5Xhqh4lwJPtMUTsdBztN1a | 74 | Shakira | Anuel AA | | | | | 0EmeFodog0BfCgMzAIvKQp | 2R21vXR83lH98kGeO99Y66 | | | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 434 | tqum | 3zR2iyPKGtAVyvAYZH1YUr | 77 | Sofia Reyes | Danna Paola | | | | | 0haZhu4fFKt0Ag94kZDiz2 | 5xSx2FM8mQnrfgM1QsHniB | | | | |
| 435 | BESO | 609E1JCInJncactoMmkDon | 96 | ROSALIA | Rauw Alejandro | | | | | 7ltDVBr6mKbRvohxheJ9h1 | 1mcTU81TzQhprhouKaTkpq | | | | |
| 436 | Tattoo | 7na7Bk98usp84FaOJFPv3d | 69 | Rauw Alejandro | | | | | | 1mcTU81TzQhprhouKaTkpq | | | | | |
| 437 | 8 AM | 7CSmXJNeArnwDfUmtP4Gve | 77 | Nicki Nicole | Young Miko | | | | | 2UZIAOlrnyZmyzt1nuXr9y | 3qsKSpcV3ncke3hw52JSMB | | | | |
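
One thing worth keeping in mind with `rename`: it returns a new DataFrame rather than changing the original, so the result has to be assigned to a variable for the new headers to stick. Here's a minimal sketch on a toy frame (the two-artist layout and the `toy`, `mapping`, and `renamed` names are just for illustration) that also builds the mapping with a dict comprehension instead of typing each pair by hand:

```python
import pandas as pd

# Toy frame with the same kind of mislabeled headers (illustrative only).
toy = pd.DataFrame({
    'Artist_1_Name': ['Maluma'],
    'Artist_1_Popularity': ['1r4hJ1h58CWwUQe3MxPuau'],
    'Artist_2_Name': [None],
    'Artist_2_Popularity': [None],
})

# Build the old -> new header mapping programmatically.
n_artists = 2  # the real playlist has 6 artist slots
mapping = {
    f'Artist_{i}_Popularity': f'Artist_{i}_Id'
    for i in range(1, n_artists + 1)
}

renamed = toy.rename(columns=mapping)  # returns a new DataFrame

print(list(toy.columns))      # original headers are unchanged
print(list(renamed.columns))  # the copy carries the corrected headers
```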


## 2 - Exploratory Data Analysis

Here are the metrics I'm looking for:

1. total number of songs
2. total number of artists
3. average song popularity
4. average artist popularity
5. average number of artists per song
6. number of times each artist appears
7. percentage of artists' top tracks in my playlist
8. ratio of "% top tracks in playlist" to total number of artist's songs in playlist
9. total number of songs with featured artists

### 1. total number of songs

This is as easy as checking the pandas `shape` attribute to see how many rows are in the DataFrame.

And you can see there are 439 rows... aka 439 songs.

In [26]:
playlist.shape

(439, 15)
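
Treating the row count as the song count assumes no track is listed twice. Here's a rough sketch of how that could be checked, using a toy frame with made-up IDs (`toy`, `n_rows`, and `n_unique` are names I'm introducing just for this example):

```python
import pandas as pd

# Toy stand-in for the playlist (the IDs are made up for illustration).
toy = pd.DataFrame({
    'Track_Name': ['Felices los 4', 'ADMV', 'Tattoo'],
    'Track_Id': ['id_a', 'id_b', 'id_c'],
})

n_rows, n_cols = toy.shape            # shape is a (rows, columns) tuple
n_unique = toy['Track_Id'].nunique()  # number of distinct track IDs

print(n_rows, n_cols)
print(n_rows == n_unique)  # True when no track appears more than once
```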

### 2. total number of artists

Again, we use the pandas `shape` attribute, this time on the `artists` DataFrame, to find how many rows there are.

**There are 232 artists.**

In [28]:
artists.shape

(232, 13)

### 3. average song popularity

Using the pandas `describe` method, **the average song popularity is about 61.**

In [29]:
playlist.describe()

| | Track_Popularity |
|---|---|
| count | 439.0 |
| mean | 61.369021 |
| std | 20.061675 |
| min | 0.0 |
| 25% | 56.0 |
| 50% | 65.0 |
| 75% | 74.0 |
| max | 96.0 |
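
If only the average is needed, calling `mean` on the column gives the same number as the `mean` row of `describe`. A small sketch, using the popularity values of the first five tracks shown earlier as toy data:

```python
import pandas as pd

# Popularity of the first five tracks in the playlist preview.
toy = pd.DataFrame({'Track_Popularity': [77, 72, 80, 72, 74]})

mean_pop = toy['Track_Popularity'].mean()
print(mean_pop)  # 75.0

# describe() reports the same value in its 'mean' row.
print(toy.describe().loc['mean', 'Track_Popularity'])
```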


### 4. average artist popularity

Again using `describe`, **the average artist popularity is about 67.**

In [30]:
artists.describe()

| | Artist_Popularity |
|---|---|
| count | 232.0 |
| mean | 67.258621 |
| std | 16.186479 |
| min | 0.0 |
| 25% | 59.75 |
| 50% | 71.0 |
| 75% | 79.0 |
| max | 96.0 |


### 5. average number of artists per song

Here's where the code gets fun...

Once I find how many songs have exactly 1, 2, ..., 6 artists, I'll use the `Expected Value` formula I learned while studying probability and stats.

The result: **the average number of artists per song is about 1.91 (which rounds to 2).**

In [35]:
# Testing the for loop I'll use in my EV equation

for column in playlist.columns:
    if column.startswith('Artist_') and column.endswith('Name'):
        # read_csv loads empty cells as NaN, and count() skips NaN
        count = playlist[column].count()
        print(f'{column} has {count} non-null values')

Artist_1_Name has 439 non-null values
Artist_2_Name has 282 non-null values
Artist_3_Name has 73 non-null values
Artist_4_Name has 25 non-null values
Artist_5_Name has 13 non-null values
Artist_6_Name has 5 non-null values
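
To see how these counts feed the expected value: a song with exactly $n$ artists fills the first $n$ name columns, so each count above is the number of songs with *at least* that many artists. Subtracting the next column's count gives the number of songs with *exactly* $n$ artists: 157, 209, 48, 12, 8, and 5 (which add up to 439). With $N$ as the number of artists on a song, the arithmetic works out to:

$$
E[N] = \sum_{n=1}^{6} n \cdot P(N = n) = \frac{1(157) + 2(209) + 3(48) + 4(12) + 5(8) + 6(5)}{439} = \frac{837}{439} \approx 1.907
$$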


In [50]:
num_artists = [1, 2, 3, 4, 5, 6]
temp_count = []

# Non-null count per name column = songs with at least that many artists
for column in playlist.columns:
    if column.startswith('Artist_') and column.endswith('Name'):
        count = playlist[column].count()
        temp_count.append(count)

# Turn "at least n artists" into "exactly n artists" by subtracting the next count
for i in range(len(temp_count) - 1):
    temp_count[i] -= temp_count[i+1]

# Expected value: sum of n * P(exactly n artists)
expected_value = sum(n * (p/len(playlist)) for n,p in zip(num_artists, temp_count))
print(f'The average number of artists per song is {expected_value}')

The average number of artists per song is 1.9066059225512528
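
As a cross-check, the same average can be reached without the expected value formula by counting the non-null names in each row and averaging those counts. This is a sketch of that alternative on a toy frame, not the approach used above; `toy`, `name_cols`, and `artists_per_song` are names made up for the example:

```python
import pandas as pd

# Toy frame with three songs and up to three artist slots.
toy = pd.DataFrame({
    'Track_Name': ['Felices los 4', 'Chantaje (feat. Maluma)', 'Ignorantes'],
    'Artist_1_Name': ['Maluma', 'Shakira', 'Bad Bunny'],
    'Artist_2_Name': [None, 'Maluma', 'Sech'],
    'Artist_3_Name': [None, None, None],
})

# Keep only the Artist_<n>_Name columns.
name_cols = toy.filter(regex=r'^Artist_\d+_Name$')

# Count the filled-in names per row, then average across songs.
artists_per_song = name_cols.notna().sum(axis=1)
print(artists_per_song.tolist())  # [1, 2, 2]
print(artists_per_song.mean())
```

Both routes add up the same non-null cells and divide by the number of songs, so on the real playlist they should agree.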


### 6. number of times each artist appears

Now I'm looking for the number of times each artist shows up on my playlist. ***Note:*** this doesn't mean the number of songs each artist has on my playlist. This calculation includes the *number of features* each artist has too.
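
One way this count could be sketched (an assumption about the approach, shown on toy data with illustrative names like `toy` and `appearances`) is to stack all the artist name columns into a single column and count the values, so lead credits and features are tallied together:

```python
import pandas as pd

# Toy frame: Maluma appears once as lead artist and once as a feature.
toy = pd.DataFrame({
    'Artist_1_Name': ['Maluma', 'Shakira', 'Shakira'],
    'Artist_2_Name': [None, 'Maluma', 'Anuel AA'],
})

# Melt every Artist_<n>_Name column into one long 'Artist' column.
names = toy.filter(regex=r'^Artist_\d+_Name$').melt(value_name='Artist')['Artist']

# value_counts() skips missing values by default.
appearances = names.value_counts()
print(appearances.to_dict())
```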