# Exploratory Data Analysis
In this notebook, I investigate the structure of the Spotify pop playlists in search of interesting conclusions. All plots are based on data as of July $29^{th}$.

In [1]:
# Import necessary libraries

import pandas as pd
import numpy as np
import sqlite3
from lets_plot import *

# Set up the lets-plot packages and sql magic
LetsPlot.setup_html()
%load_ext sql
%config SqlMagic.autocommit=True

# Connect to the database
%sql sqlite:///../data/clean/spotify_playlists.db --alias db


## What playlists do people usually listen to?
Let us inspect how popular each playlist is.

In [None]:
%sql pop << SELECT * FROM playlists

pop = pop.DataFrame()
pop = pop.sort_values('num_followers')
p1 = ggplot(pop, aes(x='name', y='num_followers')) + \
    geom_lollipop(fatten=1.5) + \
    scale_y_log10() + \
    coord_flip() + \
    ylab('Number of Followers (log scale)') + \
    xlab('Playlist') + \
    ggtitle('Only four playlists cross the line of 1 mln followers')
    # The log scale belongs to the y aesthetic (num_followers); coord_flip() only swaps the displayed axes

p1.show()

The plot shows that four playlists lead all the others in number of followers by a margin of 2 million. Those are [Today's Top Hits](https://open.spotify.com/playlist/37i9dQZF1DXcBWIGoYBM5M), [Songs to Sing in the Car](https://open.spotify.com/playlist/37i9dQZF1DWWMOmoXKqHTD), [Mega Hit Mix](https://open.spotify.com/playlist/37i9dQZF1DXbYM3nMM0oPk) and [just hits](https://open.spotify.com/playlist/37i9dQZF1DXcRXFNfZr7Tp).
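
The claim about the 1 mln threshold can also be checked numerically rather than read off the plot. Below is a minimal sketch, assuming a table with the same `name` and `num_followers` columns as `pop`; the playlist names and follower counts are made-up toy values, not the real data.

```python
import pandas as pd

# Toy stand-in for the `pop` DataFrame; names and counts are hypothetical.
pop_demo = pd.DataFrame({
    'name': ['Playlist A', 'Playlist B', 'Playlist C', 'Playlist D'],
    'num_followers': [35_000_000, 2_500_000, 800_000, 120_000],
})

threshold = 1_000_000

# Playlists above the threshold, largest first
above = pop_demo[pop_demo['num_followers'] > threshold] \
    .sort_values('num_followers', ascending=False)

# Gap between the smallest playlist above the threshold
# and the largest playlist at or below it
below_max = pop_demo.loc[pop_demo['num_followers'] <= threshold, 'num_followers'].max()
margin = above['num_followers'].min() - below_max

print(above['name'].tolist())
print(margin)
```

Running the same two steps on the real `pop` table would give the list of leading playlists and the size of the gap that separates them from the rest.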

## Are songs with adult content more popular than others?
Now it is time to dig into the details and find out whether the inclusion of adult content is a recipe for a song's success. The [popularity](https://developer.spotify.com/documentation/web-api/reference/get-track) variable is provided by Spotify's API and i

In [None]:
%%sql

tab << SELECT is_explicit, popularity, release_date, title, album_name
FROM songs
LEFT JOIN song_album_map
ON songs.song_id = song_album_map.song_id
LEFT JOIN albums
ON song_album_map.album_id = albums.album_id
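
To see what the chain of `LEFT JOIN`s returns without the `%sql` magic, here is a small sketch on an in-memory SQLite database. The table layout is an assumption (I place `title`, `is_explicit` and `popularity` in `songs`, and `album_name` and `release_date` in `albums`), and all rows are toy values.

```python
import sqlite3
import pandas as pd

# In-memory database with a hypothetical, simplified schema
conn = sqlite3.connect(':memory:')
conn.executescript("""
CREATE TABLE songs (song_id INTEGER, title TEXT, is_explicit INTEGER, popularity INTEGER);
CREATE TABLE albums (album_id INTEGER, album_name TEXT, release_date TEXT);
CREATE TABLE song_album_map (song_id INTEGER, album_id INTEGER);
INSERT INTO songs VALUES (1, 'Song A', 0, 70), (2, 'Song B', 1, 55);
INSERT INTO albums VALUES (10, 'Album X', '2020-05-01');
INSERT INTO song_album_map VALUES (1, 10);
""")

query = """
SELECT is_explicit, popularity, release_date, title, album_name
FROM songs
LEFT JOIN song_album_map ON songs.song_id = song_album_map.song_id
LEFT JOIN albums ON song_album_map.album_id = albums.album_id
"""

# LEFT JOIN keeps every song, even one without a matching album
demo = pd.read_sql_query(query, conn)
conn.close()
print(demo)
```

In this toy example 'Song B' has no entry in `song_album_map`, so it still appears in the result, but with missing `album_name` and `release_date`. Such rows would end up with a missing date after `pd.to_datetime`.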


In [None]:
songs = tab.DataFrame()
songs['release_date'] = pd.to_datetime(songs['release_date'], format='ISO8601')
songs = songs.sort_values('release_date')

# Categorical type resulted in incorrectly formatted plots, so I changed the type to str
songs['is_explicit'] = songs['is_explicit'].astype(str)

TO DO: ADD title, album_name TO THE POINTS

In [None]:
plot = ggplot(songs, aes(x='release_date', y='popularity', color='is_explicit')) + \
    geom_point(alpha=0.6, tooltips=layer_tooltips(['title', 'album_name'])) + \
    ggtitle('Songs\' Popularity versus Release Date') + \
    ylab('Popularity') + \
    xlab('Date of Release') + \
    scale_x_datetime() + \
    scale_color_manual(values=['red', 'blue'], name='Explicit Content', labels=['No', 'Yes'])

plot.show()

The plot above is difficult to interpret, mainly because of the overwhelming number of data points in the two most recent years. The only interesting insight is that mainstream music began to include songs with explicit content only in the 90s. In the next plot I aggregate the songs from each year and compute the following statistics: the mean and standard deviation of popularity and the fraction of songs with explicit content.

In [None]:
songs['year'] = songs['release_date'].dt.year
songs['avg_popularity'] = songs.groupby('year')['popularity'].transform('mean')
songs['std_popularity'] = songs.groupby('year')['popularity'].transform('std')
songs['is_explicit'] = songs['is_explicit'].astype(int)
songs['frac_explicit'] = songs.groupby('year')['is_explicit'].transform('mean')
songs['lower_ci'] = songs['avg_popularity'] - songs['std_popularity']
songs['upper_ci'] = songs['avg_popularity'] + songs['std_popularity']
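
Note that `transform` keeps one row per song, so every yearly statistic is repeated for each song released in that year. A possible alternative is to collapse the data to one row per year with `groupby().agg()`. The sketch below uses toy values and the hypothetical names `songs_demo` and `yearly`; it is not the cell used for the plots in this notebook.

```python
import pandas as pd

# Toy stand-in for `songs`; values are made up
songs_demo = pd.DataFrame({
    'year': [2000, 2000, 2001, 2001, 2001],
    'popularity': [40, 60, 50, 70, 90],
    'is_explicit': [0, 1, 0, 0, 1],
})

# One row per year instead of one row per song
yearly = (
    songs_demo.groupby('year')
    .agg(avg_popularity=('popularity', 'mean'),
         std_popularity=('popularity', 'std'),
         frac_explicit=('is_explicit', 'mean'))
    .reset_index()
)

# Band of one standard deviation around the mean
yearly['lower'] = yearly['avg_popularity'] - yearly['std_popularity']
yearly['upper'] = yearly['avg_popularity'] + yearly['std_popularity']
print(yearly)
```

Plotting such a table draws each point and error bar once per year rather than once per song, which may make the `alpha` setting behave as intended.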

In [None]:
plot = ggplot(songs, aes(x='year', y='avg_popularity', color='frac_explicit')) + \
    geom_point(alpha=0.6) + \
    geom_errorbar(aes(ymin='lower_ci', ymax='upper_ci'), width=0.2) + \
    ggtitle('Songs\' Popularity and Release Date with Explicit Content') + \
    ylab('Popularity') + \
    xlab('Date of Release') + \
    scale_color_viridis() + \
    scale_x_continuous(breaks=[1960, 1970, 1980, 1990, 2000, 2010, 2020], 
                       labels=['1960', '1970', '1980', '1990', '2000', '2010', '2020']) + \
    ggsize(width=1000, height = 330)
    # Explicit labels were necessary to remove comma from the year (it was treated as numeric)
plot.show()

Here, aggregating per year and adding vertical error bars (the mean plus and minus one standard deviation) still did not improve the readability significantly. The next plot differs only in the aggregation period: I cluster the observations into 5-year intervals.

In [None]:
# ChatGPT helped in this cell
year_bins = np.arange(1960, 2025, 5)

# Create a function to map each year to the nearest value in the sequence
def map_to_nearest_5year(year):
    return year_bins[np.abs(year_bins - year).argmin()]

# Apply the function to create the '5years' column
songs['5years'] = songs['year'].apply(map_to_nearest_5year)

songs['avg_popularity'] = songs.groupby('5years')['popularity'].transform('mean')
songs['std_popularity'] = songs.groupby('5years')['popularity'].transform('std')
songs['frac_explicit'] = songs.groupby('5years')['is_explicit'].transform('mean')
songs['lower_ci'] = songs['avg_popularity'] - songs['std_popularity']
songs['upper_ci'] = songs['avg_popularity'] + songs['std_popularity']


plot = ggplot(songs, aes(x='5years', y='avg_popularity', color='frac_explicit')) + \
    geom_point(alpha=0.6) + \
    geom_errorbar(aes(ymin='lower_ci', ymax='upper_ci'), width=0.2) + \
    ggtitle('Songs\' Popularity and Release Date with Explicit Content') + \
    ylab('Popularity') + \
    xlab('Date of Release') + \
    scale_color_viridis() + \
    scale_x_continuous(breaks=[1960, 1970, 1980, 1990, 2000, 2010, 2020], 
                       labels=['1960', '1970', '1980', '1990', '2000', '2010', '2020']) 
    # Explicit labels were necessary to remove comma from the year (it was treated as numeric)
plot.show()
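
The nearest-bin mapping used above assigns each year to the closest multiple of 5, so the label 1965 covers 1963 to 1967, while everything up to 1962 falls into 1960 and everything from 2018 onwards into 2020. The same mapping can be sketched without `apply` by rounding and clipping; the snippet below only checks that the two versions agree on a range of years and is not part of the original analysis.

```python
import numpy as np
import pandas as pd

year_bins = np.arange(1960, 2025, 5)  # 1960, 1965, ..., 2020

def map_to_nearest_5year(year):
    # Same function as in the cell above
    return year_bins[np.abs(year_bins - year).argmin()]

years = pd.Series(range(1950, 2025))

# Round to the nearest multiple of 5, then clip to the first and last bin
vectorized = ((years / 5).round() * 5).clip(1960, 2020).astype(int)
looped = years.apply(map_to_nearest_5year)

print((vectorized == looped).all())
```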

In [None]:
songs[['5years', 'avg_popularity']]

| | 5years | avg_popularity |
|---|---|---|
| 1370 | 1960 | 4.000000 |
| 1226 | 1960 | 75.000000 |
| 1211 | 1960 | 77.000000 |
| 605 | 1965 | 68.500000 |
| 1300 | 1965 | 68.500000 |
| ... | ... | ... |
| 454 | 2020 | 51.282958 |
| 420 | 2020 | 51.282958 |
| 452 | 2020 | 51.282958 |
| 459 | 2020 | 51.282958 |


## Second Plot
