
Popularity of Music Records
===========================

The music industry has a well-developed market with a global annual revenue around $15 billion. 
The recording industry is highly competitive and is dominated by three big production companies 
that together account for nearly 82% of total annual album sales. 

Artists are at the core of the music industry and record labels provide them with the necessary resources
to sell their music on a large scale. A record label incurs numerous costs
(studio recording, marketing, distribution, and touring) in exchange for a percentage of the profits from album sales, singles and concert tickets.

Unfortunately, the success of an artist's release is highly uncertain: a single may be extremely popular, resulting in widespread radio play and digital downloads, while another single may turn out quite unpopular, and therefore unprofitable. 

Knowing the competitive nature of the recording industry, record labels face the fundamental decision problem of which musical releases to support to maximize their financial success. 

How can we use analytics to predict the popularity of a song? In this assignment, we challenge ourselves to predict whether a song will reach a spot in the Top 10 of the Billboard Hot 100 Chart.

Taking an analytics approach, we aim to use information about a song's properties to predict its popularity. The dataset songs.csv consists of all songs which made it to the Top 10 of the Billboard Hot 100 Chart from 1990-2010 plus a sample of additional songs that didn't make the Top 10. This data comes from three sources: Wikipedia, Billboard.com, and EchoNest.

The variables included in the dataset either describe the artist or the song, or they are associated with the following song attributes: time signature, loudness, key, pitch, tempo, and timbre.

Here's a detailed description of the variables:

- year = the year the song was released
- songtitle = the title of the song
- artistname = the name of the artist of the song
- songID and artistID = identifying variables for the song and artist
- timesignature and timesignature_confidence = a variable estimating the time signature of the song, and the confidence in the estimate
- loudness = a continuous variable indicating the average amplitude of the audio in decibels
- tempo and tempo_confidence = a variable indicating the estimated beats per minute of the song, and the confidence in the estimate
- key and key_confidence = a variable with twelve levels indicating the estimated key of the song (C, C#, . . ., B), and the confidence in the estimate
- energy = a variable that represents the overall acoustic energy of the song, using a mix of features such as loudness
- pitch = a continuous variable that indicates the pitch of the song
- timbre_0_min, timbre_0_max, timbre_1_min, timbre_1_max, . . . , timbre_11_min, and timbre_11_max = variables that indicate the minimum/maximum values over all segments for each of the twelve values in the timbre vector (resulting in 24 continuous variables)
- Top10 = a binary variable indicating whether or not the song made it to the Top 10 of the Billboard Hot 100 Chart (1 if it was in the top 10, and 0 if it was not)

## Problem 1.1 - Understanding the Data

Use the read.csv function to load the dataset "songs.csv" into R.

How many observations (songs) are from the year 2010?

In [1]:
songs <- read.csv("songs.csv")
str(songs)

'data.frame':	7574 obs. of  39 variables:
 $ year                    : int  2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
 $ songtitle               : Factor w/ 7141 levels "'03 Bonnie & Clyde",..: 6204 5522 241 3098 47 607 254 4419 2887 6756 ...
 $ artistname              : Factor w/ 1032 levels "50 Cent","98 Degrees",..: 3 3 3 3 3 3 3 3 3 12 ...
 $ songID                  : Factor w/ 7549 levels "SOAACNI1315CD4AC42",..: 595 5439 5252 1716 3431 1020 1831 3964 6904 2473 ...
 $ artistID                : Factor w/ 1047 levels "AR00B1I1187FB433EB",..: 671 671 671 671 671 671 671 671 671 507 ...
 $ timesignature           : int  3 4 4 4 4 4 4 4 4 4 ...
 $ timesignature_confidence: num  0.853 1 1 1 0.788 1 0.968 0.861 0.622 0.938 ...
 $ loudness                : num  -4.26 -4.05 -3.57 -3.81 -4.71 ...
 $ tempo                   : num  91.5 140 160.5 97.5 140.1 ...
 $ tempo_confidence        : num  0.953 0.921 0.489 0.794 0.286 0.347 0.273 0.83 0.018 0.929 ...
 $ key                  

The songs dataset consists of 7574 observations.

In [2]:
table(songs$year)["2010"]

Of these 7574 observations, 373 are from the year 2010.

## Problem 1.2 - Understanding the Data

How many songs does the dataset include for which the artist name is "Michael Jackson"?

In [3]:
table(songs$artistname)["Michael Jackson"]

## Problem 1.3 - Understanding the Data

Which of these songs by Michael Jackson made it to the Top 10? Select all that apply.

- Beat It
- You Rock My World
- Billie Jean
- You Are Not Alone

In [4]:
MJ <- subset(songs, artistname == "Michael Jackson")

In [5]:
MJ_songs <- MJ[c("songtitle", "Top10")]
subset(MJ_songs, MJ_songs$Top10 == 1)

             songtitle Top10
4329 You Rock My World     1
6207 You Are Not Alone     1
6210    Black or White     1
6218 Remember the Time     1
6915     In The Closet     1


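Of the four options listed, only You Rock My World and You Are Not Alone made it to the Top 10. Beat It and Billie Jean do not appear among the Top 10 rows returned above.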

## Problem 1.4 - Understanding the Data

The variable corresponding to the estimated time signature (timesignature) is discrete, meaning that 
it only takes integer values (0, 1, 2, 3, . . . ). 

What are the values of this variable that occur in our dataset?

In [6]:
table(songs$timesignature)


   0    1    3    4    5    7 
  10  143  503 6787  112   19 

Which timesignature value is the most frequent among songs in our dataset?

The most frequent value is 4, which occurs 6787 times.
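
The most frequent value can also be read off programmatically. This is a minimal sketch, assuming we re-enter the counts printed above by hand; `ts_counts`, `most_frequent`, and `share` are hypothetical names that do not appear in the original notebook.

```r
# Counts copied from the table(songs$timesignature) output above.
ts_counts <- c("0" = 10, "1" = 143, "3" = 503, "4" = 6787, "5" = 112, "7" = 19)

# which.max() returns the position of the largest count;
# names() recovers the time signature label attached to it.
most_frequent <- names(which.max(ts_counts))

# Fraction of all songs that have the most frequent time signature.
share <- max(ts_counts) / sum(ts_counts)

print(most_frequent)
print(round(share, 3))
```

On the full data, `names(which.max(table(songs$timesignature)))` should give the same label without retyping the counts.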

## Problem 1.5 - Understanding the Data

Out of all of the songs in our dataset, the song with the highest tempo is one of the following songs. Which one is it?

- Until The Day I Die
- Wanna Be Startin' Somethin'
- My Happy Ending
- You Make Me Wanna...

In [7]:
songs$songtitle[which.max(songs$tempo)]

## Problem 2.1 - Creating Our Prediction Model

We wish to predict whether or not a song will make it to the Top 10.

To do this, first use the subset function to split the data into a training set "SongsTrain" consisting of all the observations up to and including 2009 song releases, and a testing set "SongsTest", consisting of the 2010 song releases.

How many observations (songs) are in the training set?

In [8]:
SongsTrain <- subset(songs, songs$year <= 2009)
SongsTest <- subset(songs, songs$year > 2009)

In [9]:
nrow(SongsTrain)
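
A quick sanity check on the split: the training and testing sets should partition the data, so their row counts must add up to the total. The sketch below illustrates this on a small made-up data frame (`toy` and its values are hypothetical, not taken from songs.csv) and then applies the same arithmetic to the counts reported earlier in this notebook.

```r
# Hypothetical toy data standing in for songs.csv.
toy <- data.frame(
  year  = c(2007, 2008, 2009, 2009, 2010, 2010),
  Top10 = c(0, 1, 0, 0, 1, 0)
)

# Same split rule as above: up to and including 2009 for training, 2010 for testing.
toy_train <- subset(toy, year <= 2009)
toy_test  <- subset(toy, year > 2009)

# Every row lands in exactly one of the two sets.
stopifnot(nrow(toy_train) + nrow(toy_test) == nrow(toy))

# With 7574 songs in total and 373 from 2010, the training set
# is expected to hold the remaining rows.
expected_train_rows <- 7574 - 373
print(expected_train_rows)
```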

## Problem 2.2 - Creating our Prediction Model

In this problem, our outcome variable is "Top10" - we are trying to predict whether or not a song will make it to the Top 10 of the Billboard Hot 100 Chart. 

Since the outcome variable is binary, we will build a logistic regression model. We'll start by using all song attributes as our independent variables, which we'll call Model 1.

We will only use the variables in our dataset that describe the numerical attributes of the song in our logistic regression model. So we won't use the variables "year", "songtitle", "artistname", "songID" or "artistID".

Use the glm function to build a logistic regression model to predict Top10 using all of the other variables as the independent variables. You should use SongsTrain to build the model.

Looking at the summary of your model, what is the value of the Akaike Information Criterion (AIC)?

In [10]:
nonvars = c("year", "songtitle", "artistname", "songID", "artistID")

SongsTrain = SongsTrain[ , !(names(SongsTrain) %in% nonvars) ]
SongsTest = SongsTest[ , !(names(SongsTest) %in% nonvars) ]

In [11]:
SongsLog1 = glm(Top10 ~ ., data=SongsTrain, family=binomial)

In [17]:
summary(SongsLog1)


Call:
glm(formula = Top10 ~ ., family = binomial, data = SongsTrain)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.9220  -0.5399  -0.3459  -0.1845   3.0770  

Coefficients:
                           Estimate Std. Error z value Pr(>|z|)    
(Intercept)               1.470e+01  1.806e+00   8.138 4.03e-16 ***
timesignature             1.264e-01  8.674e-02   1.457 0.145050    
timesignature_confidence  7.450e-01  1.953e-01   3.815 0.000136 ***
loudness                  2.999e-01  2.917e-02  10.282  < 2e-16 ***
tempo                     3.634e-04  1.691e-03   0.215 0.829889    
tempo_confidence          4.732e-01  1.422e-01   3.329 0.000873 ***
key                       1.588e-02  1.039e-02   1.529 0.126349    
key_confidence            3.087e-01  1.412e-01   2.187 0.028760 *  
energy                   -1.502e+00  3.099e-01  -4.847 1.25e-06 ***
pitch                    -4.491e+01  6.835e+00  -6.570 5.02e-11 ***
timbre_0_min              2.316e-02  4.256e-03   5.44

Looking at the bottom of the summary(SongsLog1) output, we can see that the AIC value is 4827.2.
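
For reference, the AIC reported for a binomial glm equals 2k minus twice the maximized log-likelihood, where k is the number of estimated coefficients. The sketch below checks this relationship on R's built-in `mtcars` data rather than on songs.csv, so the model and its variables (`fit`, `am`, `wt`) are illustrative assumptions only.

```r
# Illustrative logistic regression on a built-in dataset (not songs.csv).
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Number of estimated coefficients (intercept + slope).
k <- length(coef(fit))

# AIC computed by hand from the log-likelihood.
manual_aic <- 2 * k - 2 * as.numeric(logLik(fit))

# Compare with the value R reports.
print(manual_aic)
print(AIC(fit))
```

Because AIC penalizes each extra coefficient, it is most useful for comparing models fitted to the same training data, with lower values preferred.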

## Problem 2.3 - Creating Our Prediction Model

Let's now think about the variables in our dataset related to the confidence of the time signature, key and tempo (timesignature_confidence, key_confidence, and tempo_confidence). Our model seems to indicate that these confidence variables are significant (rather than the variables timesignature, key and tempo themselves). What does the model suggest?


- The lower our confidence about time signature, key and tempo, the more likely the song is to be in the Top 10
- The higher our confidence about time signature, key and tempo, the more likely the song is to be in the Top 10

In [25]:
summary(SongsLog1)$coefficients[c("timesignature_confidence", "key_confidence", "tempo_confidence"),"Estimate"] > 0

All three coefficients are positive, which means that higher confidence leads to a higher predicted probability of a Top 10 hit.

## Problem 2.4 - Creating Our Prediction Model

In general, if the confidence is low for the time signature, tempo, and key, then the song is more likely to be complex. What does Model 1 suggest in terms of complexity?

- Mainstream listeners tend to prefer more complex songs.
- Mainstream listeners tend to prefer less complex songs.

Since the coefficient values for timesignature_confidence, tempo_confidence, and key_confidence are all positive, lower confidence (which signals a more complex song) leads to a lower predicted probability of a song being a hit. So Model 1 suggests that mainstream listeners tend to prefer less complex songs.
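
To make the sign argument more concrete, logistic regression coefficients can be exponentiated into odds ratios. This is a minimal sketch, assuming we copy the three estimates by hand from the Model 1 summary above (rounded as printed); `conf_coefs` and `odds_ratios` are hypothetical names.

```r
# Estimates copied from the summary(SongsLog1) output above.
conf_coefs <- c(
  timesignature_confidence = 0.7450,
  tempo_confidence         = 0.4732,
  key_confidence           = 0.3087
)

# exp(coefficient) is the multiplicative change in the odds of a Top 10 hit
# for a one-unit increase in the variable, holding the others fixed.
odds_ratios <- exp(conf_coefs)

print(round(odds_ratios, 2))
```

An odds ratio above 1 corresponds to a positive coefficient, so all three confidence variables raise the predicted odds of a hit. The confidence values shown in the str output lie between 0 and 1, so a one-unit increase here spans what appears to be their whole range.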

## Problem 2.5 - Creating Our Prediction Model

Songs with heavier instrumentation tend to be louder (have higher values in the variable "loudness") and more energetic (have higher values in the variable "energy").

By inspecting the coefficient of the variable "loudness", what does Model 1 suggest?


- Mainstream listeners prefer songs with heavy instrumentation.
- Mainstream listeners prefer songs with light instrumentation.

In [26]:
summary(SongsLog1)$coefficients[c("energy","loudness"),"Estimate"]

The coefficient estimate for loudness is positive, meaning that mainstream listeners prefer louder songs, which are those with heavier instrumentation.

However, the coefficient estimate for energy is negative, meaning that mainstream listeners prefer songs that are less energetic, which are those with light instrumentation.

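Taken together, the two coefficients point to opposite conclusions about instrumentation, even though the loudness coefficient on its own suggests a preference for heavy instrumentation.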

## Problem 3.1 - Beware of Multicollinearity Issues!

What is the correlation between the variables "loudness" and "energy" in the training set?

In [30]:
cor(SongsTrain[c("energy","loudness")])

            energy  loudness
energy   1.0000000 0.7399067
loudness 0.7399067 1.0000000


Given that these two variables are highly correlated, Model 1 suffers from multicollinearity.

To avoid this issue, we will omit one of these two variables and rerun the logistic regression.

In the rest of this problem, we'll build two variations of our original model: Model 2, in which we keep "energy" and omit "loudness", and Model 3, in which we keep "loudness" and omit "energy".
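
One way to gauge how much the two variables overlap is to square the correlation, which gives the share of variance in one variable that is linearly explained by the other. This is a rough sketch, assuming we copy the correlation reported above by hand; `r` and `shared_variance` are hypothetical names.

```r
# Correlation between energy and loudness, copied from the output above.
r <- 0.7399067

# Squared correlation: proportion of variance shared by the two variables.
shared_variance <- r^2

print(round(shared_variance, 3))
```

With roughly 55% of the variance shared, the model has difficulty attributing an effect to one variable rather than the other, which is consistent with the conflicting signs seen in Model 1.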

## Problem 3.2 - Beware of Multicollinearity Issues!

Create Model 2, which is Model 1 without the independent variable "loudness".

Look at the summary, and inspect the coefficient of the variable "energy". What do you observe?


- Model 2 suggests that songs with high energy levels tend to be more popular. This contradicts our observation in Model 1.
- Model 2 suggests that, similarly to Model 1, songs with low energy levels tend to be more popular.

In [31]:
SongsLog2 <- glm(Top10 ~. -loudness, data=SongsTrain, family=binomial)
summary(SongsLog2)


Call:
glm(formula = Top10 ~ . - loudness, family = binomial, data = SongsTrain)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0983  -0.5607  -0.3602  -0.1902   3.3107  

Coefficients:
                           Estimate Std. Error z value Pr(>|z|)    
(Intercept)              -2.241e+00  7.465e-01  -3.002 0.002686 ** 
timesignature             1.625e-01  8.734e-02   1.860 0.062873 .  
timesignature_confidence  6.885e-01  1.924e-01   3.578 0.000346 ***
tempo                     5.521e-04  1.665e-03   0.332 0.740226    
tempo_confidence          5.497e-01  1.407e-01   3.906 9.40e-05 ***
key                       1.740e-02  1.026e-02   1.697 0.089740 .  
key_confidence            2.954e-01  1.394e-01   2.118 0.034163 *  
energy                    1.813e-01  2.608e-01   0.695 0.486991    
pitch                    -5.150e+01  6.857e+00  -7.511 5.87e-14 ***
timbre_0_min              2.479e-02  4.240e-03   5.847 5.01e-09 ***
timbre_0_max             -1.007e-01  1.178

In [35]:
summary(SongsLog2)$coefficients["energy", "Estimate"]
summary(SongsLog2)$coefficients["energy", "Estimate"] > 0

The coefficient estimate for energy is positive in Model 2, suggesting that songs with higher energy levels tend to be more popular. However, the variable energy is not significant in this model.

## Problem 3.3 - Beware of Multicollinearity Issues!

Now, create Model 3, which should be exactly like Model 1, but without the variable "energy".

Look at the summary of Model 3 and inspect the coefficient of the variable "loudness". Remembering that higher loudness and energy both occur in songs with heavier instrumentation, do we make the same observation about the popularity of heavy instrumentation as we did with Model 2?

In [36]:
SongsLog3 <- glm(Top10 ~. -energy, data=SongsTrain, family=binomial)
summary(SongsLog3)


Call:
glm(formula = Top10 ~ . - energy, family = binomial, data = SongsTrain)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.9182  -0.5417  -0.3481  -0.1874   3.4171  

Coefficients:
                           Estimate Std. Error z value Pr(>|z|)    
(Intercept)               1.196e+01  1.714e+00   6.977 3.01e-12 ***
timesignature             1.151e-01  8.726e-02   1.319 0.187183    
timesignature_confidence  7.143e-01  1.946e-01   3.670 0.000242 ***
loudness                  2.306e-01  2.528e-02   9.120  < 2e-16 ***
tempo                    -6.460e-04  1.665e-03  -0.388 0.698107    
tempo_confidence          3.841e-01  1.398e-01   2.747 0.006019 ** 
key                       1.649e-02  1.035e-02   1.593 0.111056    
key_confidence            3.394e-01  1.409e-01   2.409 0.015984 *  
pitch                    -5.328e+01  6.733e+00  -7.914 2.49e-15 ***
timbre_0_min              2.205e-02  4.239e-03   5.200 1.99e-07 ***
timbre_0_max             -3.105e-01  2.537e-

In [38]:
summary(SongsLog3)$coefficients["loudness", "Estimate"]
summary(SongsLog3)$coefficients["loudness", "Estimate"] > 0

Looking at the output of summary(SongsLog3), we can see that loudness has a positive coefficient estimate, meaning that our model predicts that songs with heavier instrumentation tend to be more popular.

This agrees with the direction suggested by Model 2, although the energy coefficient in Model 2 was not statistically significant.

In the remainder of this problem, we'll just use Model 3.

## Problem 4.1 - Validating Our Model

Make predictions on the test set using Model 3. What is the accuracy of Model 3 on the test set, using a threshold of 0.45? (Compute the accuracy as a number between 0 and 1.)

In [41]:
predMod3 <- predict(SongsLog3, newdata=SongsTest, type="response")
thresh <- .45
table(SongsTest$Top10, predMod3 >= thresh)

   
    FALSE TRUE
  0   309    5
  1    40   19

In [44]:
accuracy_predMod3 <- (309 + 19) / (309 + 19 + 40 + 5)
print(accuracy_predMod3)

[1] 0.8793566
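
Rather than typing the four cell counts into the formula, the accuracy can be computed from the confusion matrix itself. This is a minimal sketch, assuming we rebuild the matrix by hand from the table printed above; `cm` and `accuracy_from_cm` are hypothetical names.

```r
# Confusion matrix copied from the output above:
# rows are actual Top10 values, columns are predictions at threshold 0.45.
# matrix() fills column by column, so the first two numbers are the FALSE column.
cm <- matrix(
  c(309, 40, 5, 19),
  nrow = 2,
  dimnames = list(actual = c("0", "1"), predicted = c("FALSE", "TRUE"))
)

# Correct predictions sit on the diagonal (true negatives and true positives).
accuracy_from_cm <- sum(diag(cm)) / sum(cm)

print(accuracy_from_cm)
```

In the notebook, the same expression should work on the result of `table(SongsTest$Top10, predMod3 >= thresh)`, provided both predicted classes occur so that the table is 2 by 2.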


## Problem 4.2 - Validating Our Model

Let's check if there's any incremental benefit in using Model 3 instead of a baseline model.

Given the difficulty of guessing which song is going to be a hit, an easier model would be to pick the most frequent outcome (a song is not a Top 10 hit) for all songs.

What would the accuracy of the baseline model be on the test set? (Give your answer as a number between 0 and 1.)

In [43]:
table(SongsTest$Top10)


  0   1 
314  59 

The baseline model would get 314 observations correct, and 59 wrong.

In [45]:
accuracy_baseline <- 314/(314+59)
print(accuracy_baseline)

[1] 0.8418231
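
The baseline can be expressed generically as "always predict the majority class". This is a minimal sketch, assuming we copy the class counts from the table above; `test_counts`, `majority_class`, and `baseline_check` are hypothetical names.

```r
# Class counts copied from table(SongsTest$Top10) above.
test_counts <- c("0" = 314, "1" = 59)

# The baseline always predicts the most frequent outcome.
majority_class <- names(which.max(test_counts))

# Its accuracy is the share of observations belonging to that class.
baseline_check <- max(test_counts) / sum(test_counts)

print(majority_class)
print(baseline_check)
```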


## Problem 4.3 - Validating Our Model

It seems that Model 3 gives us a small improvement over the baseline model. Still, does it create an edge?

Let's view the two models from an investment perspective. A production company is interested in investing in songs that are highly likely to make it to the Top 10. The company's objective is to minimize its risk of financial losses attributed to investing in songs that end up unpopular.

A competitive edge can therefore be achieved if we can provide the production company a list of songs that are highly likely to end up in the Top 10. We note that the baseline model does not prove useful, as it simply does not label any song as a hit. Let us see what our model has to offer.

How many songs does Model 3 correctly predict as Top 10 hits in 2010 (remember that all songs in 2010 went into our test set), using a threshold of 0.45?

In [47]:
table(SongsTest$Top10, predMod3 >= thresh)

   
    FALSE TRUE
  0   309    5
  1    40   19

The model correctly predicts 19 songs as Top 10 hits.

How many non-hit songs does Model 3 predict will be Top 10 hits (again, looking at the test set), using a threshold of 0.45?

The model falsely predicts that 5 non-hit songs will be Top 10 hits.

## Problem 4.4 - Validating Our Model

What is the sensitivity of Model 3 on the test set, using a threshold of 0.45?

In [50]:
sensitivity_mod3 <- 19/(40+19)
print(sensitivity_mod3)

[1] 0.3220339


What is the specificity of Model 3 on the test set, using a threshold of 0.45?

In [51]:
specificity_mod3 <- 309/(309+5)
print(specificity_mod3)

[1] 0.9840764
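
Both measures can be derived from the rows of the confusion matrix, which makes it explicit what each one conditions on. This is a minimal sketch, assuming we rebuild the matrix by hand from the table printed earlier; `cm`, `sensitivity_check`, and `specificity_check` are hypothetical names.

```r
# Confusion matrix copied from the earlier output:
# rows are actual Top10 values, columns are predictions at threshold 0.45.
cm <- matrix(
  c(309, 40, 5, 19),
  nrow = 2,
  dimnames = list(actual = c("0", "1"), predicted = c("FALSE", "TRUE"))
)

# Sensitivity: among actual hits (row "1"), the share predicted as hits.
sensitivity_check <- cm["1", "TRUE"] / sum(cm["1", ])

# Specificity: among actual non-hits (row "0"), the share predicted as non-hits.
specificity_check <- cm["0", "FALSE"] / sum(cm["0", ])

print(sensitivity_check)
print(specificity_check)
```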


## Problem 4.5 - Validating Our Model

What conclusions can you make about our model? (Select all that apply.)

- Model 3 favors specificity over sensitivity.
- Model 3 favors sensitivity over specificity.
- Model 3 captures less than half of Top 10 songs in 2010. Model 3 therefore does not provide a useful list of candidate songs to investors, and hence offers no competitive edge.
- Model 3 provides conservative predictions, and predicts that a song will make it to the Top 10 very rarely. So while it detects less than half of the Top 10 songs, we can be very confident in the songs that it does predict to be Top 10 hits.

In [55]:
table(SongsTest$Top10, predMod3 >= thresh)
print(paste("sensitivity: ", sensitivity_mod3))
print(paste("specificity: ", specificity_mod3))

   
    FALSE TRUE
  0   309    5
  1    40   19

[1] "sensitivity:  0.322033898305085"
[1] "specificity:  0.984076433121019"


Model 3 favors specificity over sensitivity.

Although Model 3 captures fewer than half of the Top 10 songs, it can still offer a competitive edge, because it is very conservative in its predictions and the songs it does flag are mostly genuine hits.
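
How much we can trust the songs the model flags can be quantified with the positive predictive value (precision): the share of predicted hits that were actual hits. This measure is not computed in the original notebook; the sketch below derives it from the confusion matrix printed above, and `cm`, `ppv`, and `false_discovery` are hypothetical names.

```r
# Confusion matrix copied from the output above:
# rows are actual Top10 values, columns are predictions at threshold 0.45.
cm <- matrix(
  c(309, 40, 5, 19),
  nrow = 2,
  dimnames = list(actual = c("0", "1"), predicted = c("FALSE", "TRUE"))
)

# Positive predictive value: among songs predicted as hits (column "TRUE"),
# the share that actually reached the Top 10.
ppv <- cm["1", "TRUE"] / sum(cm[, "TRUE"])

# Complement: share of predicted hits that turned out not to be hits.
false_discovery <- cm["0", "TRUE"] / sum(cm[, "TRUE"])

print(round(ppv, 3))
print(round(false_discovery, 3))
```

Of the 24 songs flagged at this threshold, 19 were actual hits, which is far above the 59 in 373 rate at which Top 10 songs occur in the test set.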