# Song Popularity on Spotify over Time

## Introduction: 
*Spotify is one of the world's biggest audio streaming service providers. Launched in late 2008, it rapidly gained popularity by offering unlimited access to music through a "freemium" business model: anyone can use it for free (with advertisements and lower sound quality) or pay for a premium subscription with additional features. Because Spotify is so widely used for accessing music, the data collected from it is a good way to find out how "popular" a song is.*
![Markdown Logo is here.](https://i.guim.co.uk/img/media/ae483ce4f1bfc5497fee1b5387711d1ff0172ec9/232_0_3268_1963/master/3268.jpg?width=1200&quality=85&auto=format&fit=max&s=fcfceea59329a6bee9c9b75dd8d7a055)
## Objective: 
*Different parameters of a song affect its popularity, and both the typical levels of these parameters and how strongly they affect popularity change over time.*

Our objective is to investigate which levels of multiple parameters make a song most likely to be popular. 

## Overview & Method:

1. Tidy the data we found on <a href="https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks" target="_blank">Kaggle</a> and keep only the columns with parameters that we will use to predict the popularity of a song. This data set contains ...
### Primary:
* id (Id of track generated by Spotify) - (Don't keep, not a property of the song itself)
### Numerical:
* acousticness (Ranges from 0 to 1) - (Keep, is a parameter) 
* danceability (Ranges from 0 to 1) - (Keep, is a parameter) 
* energy (Ranges from 0 to 1) - (Keep, is a parameter) 
* duration_ms (Integer typically ranging from 200k to 300k) - (Keep, is a parameter) 
* instrumentalness (Ranges from 0 to 1) - (Keep, is a parameter) 
* valence (Ranges from 0 to 1) - (Keep, is a parameter) 
* popularity (Ranges from 0 to 100) - (Keep in the training set to create the model, remove from the test data before testing it)
* tempo (dbl typically ranging from 50 to 150) - (Keep, is a parameter) 
* liveness (Ranges from 0 to 1) - (Keep, is a parameter) 
* loudness (dbl typically ranging from -60 to 0) - (Keep, is a parameter) 
* speechiness (Ranges from 0 to 1) - (Keep, is a parameter) 
* year (Ranges from 1921 to 2020) - (Keep, used to filter the data and to see changes over time)
### Dummy:
* mode (0 = Minor, 1 = Major) - (Don't keep, not numerical) 
* explicit (0 = No explicit content, 1 = Explicit content) - (Don't keep, not numerical)
### Categorical:
* key (All keys on octave encoded as values ranging from 0 to 11, starting on C as 0, C# as 1 and so on…) - (Don't keep, not numerical, is categorical)
* artists (List of artists mentioned) - (Don't keep, would cause too much bias, and is not numerical)
* release_date (Date of release mostly in yyyy-mm-dd format, however precision of date may vary) - (Don't keep, we are filtering by year)
* name (Name of the song) - (Don't keep, we can use the row number to index the data)

2. The data set covers around 100 years of release dates. We will filter it to keep only the years 2011 to 2021, because the kind of music that is popular changes over time; without this filter we would not be able to come up with a reasonably accurate model to predict the popularity of a song from our chosen parameters.

3. Classification can only predict the class of an observation, not a numerical value. Spotify uses a numerical score from 0 to 100 to measure popularity, so we will have to turn it into a factor. We do this by fixing ranges of values and assigning a class to each:
* 0 to under 20: Not popular (N)
* 20 to under 50: Slightly popular (S)
* 50 to under 90: Popular (P)
* 90 to 100: Most popular (M)

4. Once the data is tidy we will standardise it so that each parameter is represented fairly in our investigation.
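
As a rough illustration of what standardising means here, the sketch below centres and scales two columns with base R's `scale()`. The three rows are copied from tracks that appear later in this notebook; the name `toy_tracks` is made up for the example, and the actual analysis may standardise differently (for instance inside a tidymodels recipe).

```r
# Toy data: three tracks with two parameters on very different scales
toy_tracks <- data.frame(
  danceability = c(0.716, 0.548, 0.501),
  duration_ms  = c(165907, 174000, 182161)
)

# scale() subtracts each column's mean and divides by its standard deviation
toy_scaled <- as.data.frame(scale(toy_tracks))

toy_scaled
colMeans(toy_scaled)      # approximately 0 for every column
sapply(toy_scaled, sd)    # 1 for every column
```

After this step both columns are expressed in standard deviations from their own mean, so neither one dominates simply because of its units.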

## Expected Outcome: 

We think that... 



In [1]:
# Loading the libraries
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)


In [2]:
# Reading the raw data 
spotify_year_data <- read_csv("data/data.csv")
spotify_year_data

Parsed with column specification:
cols(
  acousticness = col_double(),
  artists = col_character(),
  danceability = col_double(),
  duration_ms = col_double(),
  energy = col_double(),
  explicit = col_double(),
  id = col_character(),
  instrumentalness = col_double(),
  key = col_double(),
  liveness = col_double(),
  loudness = col_double(),
  mode = col_double(),
  name = col_character(),
  popularity = col_double(),
  release_date = col_character(),
  speechiness = col_double(),
  tempo = col_double(),
  valence = col_double(),
  year = col_double()
)



acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,valence,year
<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
0.991,['Mamie Smith'],0.598,168333,0.224,0,0cS0A1fUEUd1EW3FcF8AEI,5.22e-04,5,0.3790,-12.628,0,Keep A Song In Your Soul,12,1920,0.0936,149.976,0.634,1920
0.643,"[""Screamin' Jay Hawkins""]",0.852,150200,0.517,0,0hbkKFIJm7Z05H8Zl9w30f,2.64e-02,5,0.0809,-7.261,0,I Put A Spell On You,7,1920-01-05,0.0534,86.889,0.950,1920
0.993,['Mamie Smith'],0.647,163827,0.186,0,11m7laMUgmOKqI3oYzuhne,1.76e-05,0,0.5190,-12.098,1,Golfing Papa,4,1920,0.1740,97.600,0.689,1920
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
0.806,['Roger Fly'],0.671,218147,0.589,0,48Qj61hOdYmUCFJbpQ29Ob,0.920,4,0.113,-12.393,0,Together,0,2020-12-09,0.0282,108.058,0.714,2020
0.920,['Taylor Swift'],0.462,244000,0.240,1,1gcyHQpBQ1lfXGdhZmWrHP,0.000,0,0.113,-12.077,1,champagne problems,69,2021-01-07,0.0377,171.319,0.320,2021
0.239,['Roger Fly'],0.677,197710,0.460,0,57tgYkWQTNHVFEt6xDKKZj,0.891,7,0.215,-12.237,1,Improvisations,0,2020-12-09,0.0258,112.208,0.747,2020


In [44]:
# Drop the columns that are not numerical song parameters: key, mode, explicit, id, artists, release_date and name.
spotify_selected_parameters <- spotify_year_data %>%
                               select(-key, -mode, -id, -artists, -release_date, -name, -explicit) %>%
                               filter(year > 2010) # keep only tracks released from 2011 onwards
spotify_selected_parameters

acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,popularity,speechiness,tempo,valence,year
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
0.887,0.319,187333,0.201,0.00e+00,0.904,-17.796,27,0.0623,117.153,0.239,2018
0.938,0.269,236800,0.129,4.87e-06,0.683,-18.168,26,0.0576,82.332,0.160,2018
0.881,0.644,313093,0.212,2.22e-05,0.798,-14.118,19,0.0347,117.072,0.441,2020
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
0.806,0.671,218147,0.589,0.920,0.113,-12.393,0,0.0282,108.058,0.714,2020
0.920,0.462,244000,0.240,0.000,0.113,-12.077,69,0.0377,171.319,0.320,2021
0.239,0.677,197710,0.460,0.891,0.215,-12.237,0,0.0258,112.208,0.747,2020


In [45]:
# Split the data by popularity range and replace the numerical (dbl) popularity with a class label (chr).
# The labels are converted to a factor once the four pieces are joined back together.
spotify_most_popular <- filter(spotify_selected_parameters, popularity >= 90 )  %>% 
                        mutate(popularity = "M")
spotify_most_popular
spotify_popular <- filter(spotify_selected_parameters, popularity >= 50 & popularity < 90 ) %>% 
                        mutate(popularity = "P")
spotify_popular
spotify_slightly_popular <- filter(spotify_selected_parameters, popularity >= 20 & popularity < 50 ) %>% 
                        mutate(popularity = "S")
spotify_slightly_popular
spotify_not_popular <- filter(spotify_selected_parameters, popularity <20 ) %>% 
                        mutate(popularity = "N")
spotify_not_popular

acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,popularity,speechiness,tempo,valence,year
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
0.483,0.716,165907,0.512,0,0.0928,-6.257,M,0.0331,104.957,0.326,2018
0.122,0.548,174000,0.816,0,0.3350,-4.209,M,0.0465,95.390,0.557,2019
0.751,0.501,182161,0.405,0,0.1050,-5.679,M,0.0319,109.891,0.446,2019
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
0.0847,0.574,183240,0.891,0,0.1600,-3.665,M,0.1570,100.978,0.707,2020
0.0882,0.659,180520,0.701,0,0.0866,-4.107,M,0.1640,91.970,0.623,2020
0.1670,0.824,187427,0.457,0,0.0410,-5.428,M,0.0543,87.977,0.950,2020


acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,popularity,speechiness,tempo,valence,year
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
0.68800,0.375,209134,0.418,0,0.371,-5.999,P,0.0360,136.319,0.287,2015
0.74100,0.418,183814,0.343,0,0.113,-7.492,P,0.0339,121.805,0.327,2015
0.00847,0.560,218013,0.936,0,0.161,-5.835,P,0.0439,112.960,0.371,2011
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
0.498,0.597,196493,0.368,0,0.1090,-10.151,P,0.0936,171.980,0.590,2021
0.105,0.781,172720,0.487,0,0.0802,-7.301,P,0.1670,129.941,0.327,2021
0.920,0.462,244000,0.240,0,0.1130,-12.077,P,0.0377,171.319,0.320,2021


acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,popularity,speechiness,tempo,valence,year
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
0.88700,0.319,187333,0.201,0.00e+00,0.904,-17.796,S,0.0623,117.153,0.239,2018
0.93800,0.269,236800,0.129,4.87e-06,0.683,-18.168,S,0.0576,82.332,0.160,2018
0.00157,0.693,194508,0.387,5.88e-01,0.108,-13.820,S,0.0520,129.968,0.282,2019
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
0.952000,0.232,622147,0.0986,0.847,0.0645,-19.897,S,0.0376,88.988,0.0567,2020
0.000283,0.635,198164,0.9420,0.737,0.1100,-5.689,S,0.0324,131.978,0.4330,2020
0.808000,0.404,120869,0.2220,0.135,0.1150,-20.526,S,0.0342,180.023,0.5410,2021


acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,popularity,speechiness,tempo,valence,year
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
0.881,0.644,313093,0.212,2.22e-05,0.7980,-14.118,N,0.0347,117.072,0.441,2020
0.955,0.627,295093,0.184,1.62e-04,0.0986,-15.533,N,0.0450,115.864,0.299,2020
0.888,0.581,183440,0.331,1.50e-05,0.1470,-14.087,N,0.2430,88.303,0.642,2020
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
0.795,0.429,144720,0.211,0.000,0.196,-11.665,N,0.0360,94.710,0.228,2021
0.806,0.671,218147,0.589,0.920,0.113,-12.393,N,0.0282,108.058,0.714,2020
0.239,0.677,197710,0.460,0.891,0.215,-12.237,N,0.0258,112.208,0.747,2020


In [46]:
spotify_data <- full_join(spotify_most_popular, spotify_popular) %>%
                       full_join(spotify_slightly_popular) %>% 
                       full_join(spotify_not_popular) %>% 
                       mutate(popularity = as_factor(popularity))
spotify_data
# our data is now tidy, as we have the correct columns with the correct data types
# and there is no missing data. 

Joining, by = c("acousticness", "danceability", "duration_ms", "energy", "instrumentalness", "liveness", "loudness", "popularity", "speechiness", "tempo", "valence", "year")

Joining, by = c("acousticness", "danceability", "duration_ms", "energy", "instrumentalness", "liveness", "loudness", "popularity", "speechiness", "tempo", "valence", "year")

Joining, by = c("acousticness", "danceability", "duration_ms", "energy", "instrumentalness", "liveness", "loudness", "popularity", "speechiness", "tempo", "valence", "year")



acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,popularity,speechiness,tempo,valence,year
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<fct>,<dbl>,<dbl>,<dbl>,<dbl>
0.483,0.716,165907,0.512,0,0.0928,-6.257,M,0.0331,104.957,0.326,2018
0.122,0.548,174000,0.816,0,0.3350,-4.209,M,0.0465,95.390,0.557,2019
0.751,0.501,182161,0.405,0,0.1050,-5.679,M,0.0319,109.891,0.446,2019
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
0.795,0.429,144720,0.211,0.000,0.196,-11.665,N,0.0360,94.710,0.228,2021
0.806,0.671,218147,0.589,0.920,0.113,-12.393,N,0.0282,108.058,0.714,2020
0.239,0.677,197710,0.460,0.891,0.215,-12.237,N,0.0258,112.208,0.747,2020
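
The four `filter()` calls and three joins above could probably be collapsed into a single step. The following is a minimal sketch, assuming the same cut-offs (20, 50 and 90), that uses base R's `cut()` on a hypothetical vector of scores chosen to sit on each boundary. It is an alternative formulation, not the code that produced the output above.

```r
# Hypothetical popularity scores chosen to sit on every class boundary
toy_popularity <- c(0, 19, 20, 49, 50, 89, 90, 100)

# right = FALSE makes the intervals [0,20), [20,50), [50,90), [90,100];
# include.lowest = TRUE closes the last interval so that 100 is labelled "M"
toy_class <- cut(toy_popularity,
                 breaks = c(0, 20, 50, 90, 100),
                 labels = c("N", "S", "P", "M"),
                 right = FALSE,
                 include.lowest = TRUE)

data.frame(popularity = toy_popularity, class = toy_class)
levels(toy_class)
```

The same call can be placed inside `mutate()`. Note that the factor levels come out in the order N, S, P, M here, whereas `as_factor()` on the joined data orders them by first appearance (M, P, S, N).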


In [47]:
# separating training data from test data 
set.seed(1)

spotify_split <- initial_split(spotify_data, prop = 0.75, strata = popularity)
spotify_training_data <- training(spotify_split)
spotify_test_data <- testing(spotify_split)

spotify_training_data 
spotify_test_data

acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,popularity,speechiness,tempo,valence,year
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<fct>,<dbl>,<dbl>,<dbl>,<dbl>
0.483,0.716,165907,0.512,0,0.0928,-6.257,M,0.0331,104.957,0.326,2018
0.122,0.548,174000,0.816,0,0.3350,-4.209,M,0.0465,95.390,0.557,2019
0.751,0.501,182161,0.405,0,0.1050,-5.679,M,0.0319,109.891,0.446,2019
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
0.79500,0.429,144720,0.211,0.00e+00,0.196,-11.665,N,0.0360,94.710,0.228,2021
0.00917,0.792,147615,0.866,5.99e-05,0.178,-5.089,N,0.0356,125.972,0.186,2020
0.79500,0.429,144720,0.211,0.00e+00,0.196,-11.665,N,0.0360,94.710,0.228,2021


acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,popularity,speechiness,tempo,valence,year
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<fct>,<dbl>,<dbl>,<dbl>,<dbl>
0.213,0.662,161385,0.413,0.00,0.134,-7.357,M,0.0299,93.005,0.467,2020
0.218,0.889,174321,0.340,0.13,0.055,-7.773,M,0.0697,94.009,0.716,2020
0.584,0.357,198040,0.425,0.00,0.322,-7.301,M,0.0333,102.078,0.270,2020
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
0.795,0.429,144720,0.211,0.000,0.196,-11.665,N,0.0360,94.710,0.228,2021
0.806,0.671,218147,0.589,0.920,0.113,-12.393,N,0.0282,108.058,0.714,2020
0.239,0.677,197710,0.460,0.891,0.215,-12.237,N,0.0258,112.208,0.747,2020


In [48]:
# summarising data to count the number of rows with different classes

summerise_info <- spotify_training_data %>% 
                  group_by(popularity) %>% 
                  summarise(n=n())
summerise_info

summerise_info_test <- spotify_test_data %>% 
                       group_by(popularity) %>% 
                       summarise(n=n())
summerise_info_test


`summarise()` ungrouping output (override with `.groups` argument)



popularity,n
<fct>,<int>
M,33
P,7238
S,1115
N,11586


`summarise()` ungrouping output (override with `.groups` argument)



popularity,n
<fct>,<int>
M,10
P,2426
S,373
N,3847


We can see from the tibbles above that, for every class, the number of observations in the test set is approximately one third of the number in the training set, as expected from a 75/25 split. This means that the data has been split evenly between the two sets, with each class represented in the same proportion, so when we test our model we will get a more accurate picture of how well it works. 
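
The ratio can be checked directly from the counts printed above. The sketch below retypes those counts into a small data frame (the name `split_counts` is only for this example) and computes the test-to-training ratio and the share of each class within each set.

```r
# Class counts copied from the two summary tables above
split_counts <- data.frame(
  popularity = c("M", "P", "S", "N"),
  n_train    = c(33, 7238, 1115, 11586),
  n_test     = c(10, 2426, 373, 3847)
)

# Test-to-training ratio per class; a 75/25 split should give roughly 1/3
split_counts$test_to_train <- split_counts$n_test / split_counts$n_train

# Share of each class within its own set, to confirm the stratification
split_counts$share_train <- split_counts$n_train / sum(split_counts$n_train)
split_counts$share_test  <- split_counts$n_test / sum(split_counts$n_test)

split_counts
```

If the stratified split worked as intended, `share_train` and `share_test` should be close to each other for every class.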

In [54]:
# now we will see the mean values of each parameter with respect to each of the levels
# of the popularity factor

average_values_by_factor <- spotify_training_data %>% 
                            select(-year) %>% 
                            group_by(popularity) %>% 
                            summarise_all(mean)
                           
average_values_by_factor

popularity,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence
<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
M,0.292983,0.712697,185053.8,0.6117273,0.0006184312,0.1716364,-6.252061,0.10107879,117.0918,0.5209
P,0.2582682,0.6153288,216728.8,0.6210593,0.0503848103,0.1829737,-7.211345,0.10621951,121.3042,0.4555018
S,0.2182744,0.5337483,203937.3,0.6886261,0.2690283819,0.280554,-8.670645,0.08736709,123.7486,0.4660721
N,0.2218599,0.5813572,268890.3,0.6743113,0.3753101829,0.2313699,-8.955802,0.09212121,124.2667,0.4359647


Looking at the means of each parameter with respect to the popularity factor, we can see that some of the means are very similar across the different classes, whereas others are very different. To build a better model we should use the parameters whose mean values vary most between classes. After looking at the tibble above we concluded that we will use: 
* danceability 
* duration_ms
* instrumentalness
* liveness 
* loudness 
* tempo 
* valence
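
Because the parameters are on very different scales, comparing raw means by eye can be misleading. One possible way to make the comparison more systematic is sketched below: it retypes the class means from the tibble above and divides the range of the four means by the size of their average. This heuristic is an assumption of the example, ignores the variation within each class, and is not the method used to choose the list above.

```r
# Class means copied from average_values_by_factor (rows in the order M, P, S, N)
class_means <- data.frame(
  acousticness     = c(0.292983, 0.2582682, 0.2182744, 0.2218599),
  danceability     = c(0.712697, 0.6153288, 0.5337483, 0.5813572),
  duration_ms      = c(185053.8, 216728.8, 203937.3, 268890.3),
  energy           = c(0.6117273, 0.6210593, 0.6886261, 0.6743113),
  instrumentalness = c(0.0006184312, 0.0503848103, 0.2690283819, 0.3753101829),
  liveness         = c(0.1716364, 0.1829737, 0.280554, 0.2313699),
  loudness         = c(-6.252061, -7.211345, -8.670645, -8.955802),
  speechiness      = c(0.10107879, 0.10621951, 0.08736709, 0.09212121),
  tempo            = c(117.0918, 121.3042, 123.7486, 124.2667),
  valence          = c(0.5209, 0.4555018, 0.4660721, 0.4359647)
)

# Range of the class means divided by the size of their average:
# a unit-free measure of how far apart the classes are on each parameter
relative_spread <- sapply(class_means, function(x) {
  (max(x) - min(x)) / abs(mean(x))
})

# Largest relative spread first
relative_spread <- sort(relative_spread, decreasing = TRUE)
round(relative_spread, 3)
```

Parameters near the top of the printed vector separate the classes most strongly on this measure, which gives a second opinion on the selection made by inspection.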