# Popularity of Music Records

<img src="images/music.jpg"/>

The music industry is a well-developed market with global annual revenue of around $15 billion. The recording industry is highly competitive and is dominated by three big production companies that account for nearly 82% of total annual album sales.

Artists are at the core of the music industry and record labels provide them with the necessary resources to sell their music on a large scale. A record label incurs numerous costs (studio recording, marketing, distribution, and touring) in exchange for a percentage of the profits from album sales, singles and concert tickets.

Unfortunately, the success of an artist's release is highly uncertain: a single may be extremely popular, resulting in widespread radio play and digital downloads, while another single may turn out quite unpopular, and therefore unprofitable.

Given the competitive nature of the recording industry, record labels face a fundamental decision problem: which musical releases to support in order to maximize their financial success.

How can we use analytics to predict the popularity of a song? In this assignment, we challenge ourselves to predict whether a song will reach a spot in the Top 10 of the Billboard Hot 100 Chart.

Taking an analytics approach, we aim to use information about a song's properties to predict its popularity. The dataset songs.csv consists of all songs that made it to the Top 10 of the Billboard Hot 100 Chart from 1990 to 2010, plus a sample of additional songs that didn't make the Top 10. This data comes from three sources: Wikipedia, Billboard.com, and EchoNest.

The variables included in the dataset either describe the artist or the song, or they are associated with the following song attributes: time signature, loudness, key, pitch, tempo, and timbre.

Here's a detailed description of the variables:

- year = the year the song was released
- songtitle = the title of the song
- artistname = the name of the artist of the song
- songID and artistID = identifying variables for the song and artist
- timesignature and timesignature_confidence = a variable estimating the time signature of the song, and the confidence in the estimate
- loudness = a continuous variable indicating the average amplitude of the audio in decibels
- tempo and tempo_confidence = a variable indicating the estimated beats per minute of the song, and the confidence in the estimate
- key and key_confidence = a variable with twelve levels indicating the estimated key of the song (C, C#, . . ., B), and the confidence in the estimate
- energy = a variable that represents the overall acoustic energy of the song, using a mix of features such as loudness
- pitch = a continuous variable that indicates the pitch of the song
- timbre_0_min, timbre_0_max, timbre_1_min, timbre_1_max, . . . , timbre_11_min, and timbre_11_max = variables that indicate the minimum/maximum values over all segments for each of the twelve values in the timbre vector (resulting in 24 continuous variables)
- Top10 = a binary variable indicating whether or not the song made it to the Top 10 of the Billboard Hot 100 Chart (1 if it was in the top 10, and 0 if it was not)
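As a quick sanity check on the variable list, the 24 timbre column names can be generated programmatically and the total column count tallied. This is only a sketch: `timbre_names` and `n_other` are illustrative names, and the count of 14 non-timbre predictors is taken from the list above.

```r
# Build the 24 timbre column names: timbre_0_min, timbre_0_max, ..., timbre_11_max
timbre_names <- paste0("timbre_", rep(0:11, each = 2), c("_min", "_max"))

length(timbre_names)   # 24
head(timbre_names, 4)  # "timbre_0_min" "timbre_0_max" "timbre_1_min" "timbre_1_max"

# 14 non-timbre variables listed above (year ... pitch), plus Top10
n_other <- 14
n_total <- n_other + length(timbre_names) + 1
n_total                # 39
```

Once the data is loaded, something like `grep("^timbre_", names(songs), value = TRUE)` should list the same names, assuming the file uses exactly this naming scheme.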

### Problem 1.1 - Understanding the Data

Use the read.csv function to load the dataset "songs.csv" into R.

In [1]:
songs = read.csv("data/songs.csv")
head(songs)

,year,songtitle,artistname,songID,artistID,timesignature,timesignature_confidence,loudness,tempo,tempo_confidence,⋯,timbre_7_max,timbre_8_min,timbre_8_max,timbre_9_min,timbre_9_max,timbre_10_min,timbre_10_max,timbre_11_min,timbre_11_max,Top10
,<int>,<fct>,<fct>,<fct>,<fct>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>
1,2010,This Is the House That Doubt Built,A Day to Remember,SOBGGAB12C5664F054,AROBSHL1187B9AFB01,3,0.853,-4.262,91.525,0.953,⋯,82.475,-52.025,39.116,-35.368,71.642,-126.44,18.658,-44.77,25.989,0
2,2010,Sticks & Bricks,A Day to Remember,SOPAQHU1315CD47F31,AROBSHL1187B9AFB01,4,1.0,-4.051,140.048,0.921,⋯,106.918,-61.32,35.378,-81.928,74.574,-103.808,121.935,-38.892,22.513,0
3,2010,All I Want,A Day to Remember,SOOIZOU1376E7C6386,AROBSHL1187B9AFB01,4,1.0,-3.571,160.512,0.489,⋯,80.621,-59.773,45.979,-46.293,59.904,-108.313,33.3,-43.733,25.744,0
4,2010,It's Complicated,A Day to Remember,SODRYWD1315CD49DBE,AROBSHL1187B9AFB01,4,1.0,-3.815,97.525,0.794,⋯,96.675,-78.66,41.088,-49.194,95.44,-102.676,46.422,-59.439,37.082,0
5,2010,2nd Sucks,A Day to Remember,SOICMQB1315CD46EE3,AROBSHL1187B9AFB01,4,0.788,-4.707,140.053,0.286,⋯,110.332,-56.45,37.555,-48.588,67.57,-52.796,22.888,-50.414,32.758,0
6,2010,Better Off This Way,A Day to Remember,SOCEYON1315CD4A23E,AROBSHL1187B9AFB01,4,1.0,-3.807,160.366,0.347,⋯,91.117,-54.378,53.808,-33.183,54.657,-64.478,34.522,-40.922,36.453,0


In [2]:
str(songs)

'data.frame':	7574 obs. of  39 variables:
 $ year                    : int  2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
 $ songtitle               : Factor w/ 7141 levels "'03 Bonnie & Clyde",..: 6204 5522 241 3098 47 607 254 4419 2887 6756 ...
 $ artistname              : Factor w/ 1032 levels "50 Cent","98 Degrees",..: 3 3 3 3 3 3 3 3 3 12 ...
 $ songID                  : Factor w/ 7549 levels "SOAACNI1315CD4AC42",..: 595 5439 5252 1716 3431 1020 1831 3964 6904 2473 ...
 $ artistID                : Factor w/ 1047 levels "AR00B1I1187FB433EB",..: 671 671 671 671 671 671 671 671 671 507 ...
 $ timesignature           : int  3 4 4 4 4 4 4 4 4 4 ...
 $ timesignature_confidence: num  0.853 1 1 1 0.788 1 0.968 0.861 0.622 0.938 ...
 $ loudness                : num  -4.26 -4.05 -3.57 -3.81 -4.71 ...
 $ tempo                   : num  91.5 140 160.5 97.5 140.1 ...
 $ tempo_confidence        : num  0.953 0.921 0.489 0.794 0.286 0.347 0.273 0.83 0.018 0.929 ...
 $ key                  

In [3]:
summary(songs)

      year          songtitle              artistname  
 Min.   :1990   Intro    :  15   Various artists: 162  
 1st Qu.:1997   Forever  :   8   Anal Cunt      :  49  
 Median :2002   Home     :   7   Various Artists:  44  
 Mean   :2001   Goodbye  :   6   Tori Amos      :  41  
 3rd Qu.:2006   Again    :   5   Eels           :  37  
 Max.   :2010   Beautiful:   5   Napalm Death   :  37  
                (Other)  :7528   (Other)        :7204  
                songID                   artistID    timesignature  
 SOALSZJ1370F1A7C75:   2   ARAGWS81187FB3F768: 222   Min.   :0.000  
 SOANPAC13936E0B640:   2   ARL14X91187FB4CF14:  49   1st Qu.:4.000  
 SOBDGMX12B0B80808E:   2   AR4KS8C1187FB4CF3D:  41   Median :4.000  
 SOBUDCZ12A58A80013:   2   AR0JZZ01187B9B2C99:  37   Mean   :3.894  
 SODFRLK13134387FB5:   2   ARZGTK71187B9AC7F5:  37   3rd Qu.:4.000  
 SOEJPOK12A6D4FAFE4:   2   AR95XYH1187FB53951:  31   Max.   :7.000  
 (Other)           :7562   (Other)           :7157                  
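The output above shows the text columns as factors, which is why `summary` reports counts per level. Since R 4.0, `read.csv` reads strings as character vectors by default, so reproducing this output on a recent R version would likely require `stringsAsFactors = TRUE`. The sketch below illustrates the difference on a small temporary CSV; the file and its rows are hypothetical stand-ins, not the real songs.csv.

```r
# Write a tiny stand-in CSV (hypothetical rows, not the real data)
toy_path <- tempfile(fileext = ".csv")
writeLines(c(
  "year,songtitle,artistname,Top10",
  "2010,Song A,Artist X,0",
  "2009,Song B,Artist Y,1"
), toy_path)

# Request factors explicitly, as in the output shown in this notebook
toy_factor <- read.csv(toy_path, stringsAsFactors = TRUE)
# Keep strings as plain character vectors (the default since R 4.0)
toy_char <- read.csv(toy_path, stringsAsFactors = FALSE)

class(toy_factor$artistname)  # "factor"
class(toy_char$artistname)    # "character"
```

Comparisons such as `artistname == "Michael Jackson"` behave the same way for both column types, so the later steps do not depend on this choice.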


**How many observations (songs) are from the year 2010?**

In [6]:
table(songs$year)


1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 
 328  196  186  324  198  258  178  329  380  357  363  282  518  434  479  392 
2006 2007 2008 2009 2010 
 479  622  415  483  373 

Answer: 373.
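When only one year is of interest, the count can also be read off directly rather than scanning the full table. A minimal sketch on a toy data frame (`toy` is a hypothetical stand-in for `songs`):

```r
# Toy stand-in for the songs data frame (hypothetical values)
toy <- data.frame(year = c(2009, 2010, 2010, 2008, 2010))

# Option 1: count rows where the condition holds
n_2010 <- sum(toy$year == 2010)

# Option 2: index the frequency table by the year's label
n_2010_tbl <- table(toy$year)[["2010"]]

n_2010      # 3
n_2010_tbl  # 3
```

Applied to the real data, `sum(songs$year == 2010)` should agree with the 2010 entry of the table above.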

### Problem 1.2 - Understanding the Data

**How many songs does the dataset include for which the artist name is "Michael Jackson"?**

In [7]:
# Michael Jackson songs that reached the Top 10
MJTop10 <- subset(songs, artistname == "Michael Jackson" & Top10 == 1)
head(MJTop10)

,year,songtitle,artistname,songID,artistID,timesignature,timesignature_confidence,loudness,tempo,tempo_confidence,⋯,timbre_7_max,timbre_8_min,timbre_8_max,timbre_9_min,timbre_9_max,timbre_10_min,timbre_10_max,timbre_11_min,timbre_11_max,Top10
,<int>,<fct>,<fct>,<fct>,<fct>,<int>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>
4329,2001,You Rock My World,Michael Jackson,SOBLCOF13134393021,ARXPPEY1187FB51DF4,4,1.0,-2.768,95.003,0.892,⋯,120.076,-53.839,63.576,-85.169,84.84,-102.185,55.266,-48.107,56.116,1
6207,1995,You Are Not Alone,Michael Jackson,SOJKNNO13737CEB162,ARXPPEY1187FB51DF4,4,1.0,-9.408,120.566,0.805,⋯,90.735,-61.583,60.92,-55.904,76.632,-69.799,46.173,-67.281,47.128,1
6210,1995,Black or White,Michael Jackson,SOBBRFO137756C9CB7,ARXPPEY1187FB51DF4,4,1.0,-4.017,115.027,0.535,⋯,107.974,-55.063,52.505,-110.999,71.477,-133.939,60.442,-55.008,43.473,1
6218,1995,Remember the Time,Michael Jackson,SOIQZMT136C9704DA5,ARXPPEY1187FB51DF4,4,1.0,-3.633,107.921,1.0,⋯,146.587,-58.117,62.157,-54.44,94.501,-112.348,90.437,-53.634,51.681,1
6915,1992,In The Closet,Michael Jackson,SOKIOOC12AF729ED9E,ARXPPEY1187FB51DF4,4,0.991,-4.315,110.501,0.949,⋯,124.354,-78.303,41.322,-83.184,106.263,-136.109,102.829,-48.192,74.575,1


In [8]:
nrow(MJTop10)

In [9]:
# Rows: is the artist Michael Jackson? Columns: did the song reach the Top 10?
table(songs$artistname == "Michael Jackson", songs$Top10 == 1)

       
        FALSE TRUE
  FALSE  6442 1114
  TRUE     13    5

Answer: 18 songs (13 + 5 in the TRUE row of the table), of which 5 reached the Top 10.
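The question asks for all songs by the artist, whereas the subset in this section also filters on `Top10`. A more direct route is to subset on the artist alone and then pick out the Top 10 titles from that. The sketch below uses a toy data frame with hypothetical titles, not the real data:

```r
# Toy stand-in for the songs data frame (hypothetical rows)
toy <- data.frame(
  songtitle  = c("T1", "T2", "T3", "T4"),
  artistname = c("Michael Jackson", "Michael Jackson", "Other Artist", "Michael Jackson"),
  Top10      = c(1, 0, 1, 0),
  stringsAsFactors = TRUE
)

# All songs by the artist, regardless of chart position
mj <- subset(toy, artistname == "Michael Jackson")
n_mj <- nrow(mj)

# Titles of the artist's songs that reached the Top 10
mj_top10_titles <- as.character(mj$songtitle[mj$Top10 == 1])

n_mj             # 3
mj_top10_titles  # "T1"
```

On the real data, `nrow(subset(songs, artistname == "Michael Jackson"))` should return the 18 reported in the answer.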

### Problem 1.4 - Understanding the Data

The variable corresponding to the estimated time signature (timesignature) is discrete, meaning that it only takes integer values (0, 1, 2, 3, . . . ). What are the values of this variable that occur in our dataset?