# Data Science Visualization in Python

## 1. Region and Domain
Global, Music Industry
    
## 2. Research Question
Which, if any, features of a song _might_ be predictive of its popularity? I won't be doing any statistical analysis or modeling here, just exploring a few variables of interest for the question.  
    
## 3. Links
A discussion of the data set: https://labrosa.ee.columbia.edu/millionsong/

I ended up using a CSV subset from here: https://think.cs.vt.edu/corgis/csv/music/music.csv?forcedownload=1

Field descriptions here: https://think.cs.vt.edu/corgis/csv/music/music.html

## 4. Graphs
<b>You must upload an image which addresses the research question you stated. In addition to addressing the question, this visual should follow Cairo’s principles of truthfulness, functionality, beauty, and insightfulness.</b>
<img src="final.png" width="1000" align="center"/>

Truthfulness: You will notice that many of the charts in the grid do not show a strong relationship. That is because the data has not been tampered with to suggest one, and in most studies most of the variables collected don't turn out to be strong predictors of the variable of interest (in this case, song popularity).

Functionality: Every chart puts song popularity on the X axis, so you can compare popularity against each of the data features in turn. For the Minor/Major key panel I used a probability density rather than raw frequency. That keeps the two groups comparable even if one of them contains more songs than the other.
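
To illustrate why a density is easier to compare than a raw count, here is a minimal sketch using NumPy on synthetic scores. The group sizes (300 and 700) and the uniformly distributed scores are assumptions made for illustration; they are not taken from the dataset.

```python
import numpy as np

# Synthetic stand-ins for the popularity scores of two groups of different size
# (hypothetical data, not drawn from the Million Song Dataset)
rng = np.random.default_rng(0)
minor_scores = rng.uniform(0.1, 1.0, size=300)
major_scores = rng.uniform(0.1, 1.0, size=700)

bins = np.linspace(0.1, 1.0, 21)  # 20 equal-width bins
bin_width = np.diff(bins)

# Raw counts scale with group size, so the larger group towers over the smaller one
minor_counts, _ = np.histogram(minor_scores, bins=bins)
major_counts, _ = np.histogram(major_scores, bins=bins)

# density=True rescales each histogram so that its total area is 1
minor_density, _ = np.histogram(minor_scores, bins=bins, density=True)
major_density, _ = np.histogram(major_scores, bins=bins, density=True)

minor_area = float((minor_density * bin_width).sum())
major_area = float((major_density * bin_width).sum())

print('total counts:', minor_counts.sum(), major_counts.sum())
print('total area:', round(minor_area, 6), round(major_area, 6))
```

The `density=True` argument passed to `ax4.hist` in the code section applies the same normalization.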

Beauty: The charts use soft colors along with transparency.

Insightfulness: I hope you're left interested in how song popularity could be predicted by the features charted here.
## 5. Discussion
<b>You must contribute a short (1-2 paragraph) written justification of how your visualization addresses your stated research question.</b>

The point of this research was to identify a few variables that might be valuable in a predictive model for song popularity. From the graph, Artist Popularity, loudness, and tempo look like they deserve a closer look, perhaps with the full dataset.

On the other hand, the length of the artist's name and the key of the song could probably be left off of a machine learning model, and certainly additional features should be explored. 
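
One possible way to start that closer look is to rank candidate features by their rank correlation with popularity. The sketch below is only an illustration on synthetic data: the frame `demo`, its values, and the way the two features are generated are assumptions, and it says nothing about what the real dataset would show.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the filtered frame (hypothetical values, not the real data):
# one feature is built to track popularity, the other is built to be unrelated to it
rng = np.random.default_rng(42)
n = 500
popularity = rng.uniform(0.1, 1.0, size=n)
demo = pd.DataFrame({
    'song.hotttnesss': popularity,
    'artist.hotttnesss': np.clip(0.2 + 0.6 * popularity + rng.normal(0, 0.1, size=n), 0, 1),
    'artist_name_length': rng.integers(1, 23, size=n),
})

# The Pearson correlation of the ranks is the Spearman rank correlation
rank_corr = demo.rank().corr()['song.hotttnesss'].drop('song.hotttnesss')

# Order the candidate features by the absolute strength of the association
ranking = rank_corr.abs().sort_values(ascending=False)
print(ranking)
```

A rank correlation is used here because it only assumes a monotonic relationship, not a linear one.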
    
## 6. Citations
Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. 
The Million Song Dataset. In Proceedings of the 12th International Society
for Music Information Retrieval Conference (ISMIR 2011), 2011.

## 7. Code


In [1]:
#ALL libraries should go here
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [2]:
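#Set a nice style for the plots
#In this case, I'm making a plot for music studio execs and will choose a soft, pleasing theme
#Matplotlib 3.6 renamed the bundled seaborn styles to 'seaborn-v0_8-*', so use whichever name is available
style_name = 'seaborn-v0_8-pastel' if 'seaborn-v0_8-pastel' in plt.style.available else 'seaborn-pastel'
plt.style.use(style_name)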

In [None]:
!wget --no-check-certificate https://think.cs.vt.edu/corgis/csv/music/music.csv
df = pd.read_csv('music.csv')
print(df.head())
print(df.columns)

In [None]:
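# preprocessing
#remove rows of very unpopular songs
tempdf = df[(df['song.hotttnesss']>=0.1)]
#remove rows of very unpopular artists; copy so the new column below is added to an independent frame
tempdf = tempdf[(tempdf['artist.hotttnesss']>=0.1)].copy()
#length of each artist name (.str.len() gives NaN for a missing name instead of raising)
tempdf['artist_name_length'] = tempdf['artist.name'].str.len()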
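
The preprocessing steps above can be tried out on a tiny hand-made frame. This is a sketch under stated assumptions: the four rows in `toy` are invented for illustration, and `.str.len()` is used for the name length because it returns NaN for a missing name rather than raising.

```python
import pandas as pd

# Hand-made rows (hypothetical values, not taken from music.csv)
toy = pd.DataFrame({
    'artist.name': ['Abba', 'The Rolling Stones', None, 'U2'],
    'song.hotttnesss': [0.55, 0.80, 0.40, 0.05],
    'artist.hotttnesss': [0.60, 0.90, 0.30, 0.70],
})

# Keep rows where both the song and the artist reach the 0.1 popularity threshold
keep = (toy['song.hotttnesss'] >= 0.1) & (toy['artist.hotttnesss'] >= 0.1)
filtered = toy.loc[keep].copy()

# Character count of each artist name; the missing name becomes NaN
filtered['artist_name_length'] = filtered['artist.name'].str.len()
print(filtered)
```

The last row is dropped because its song popularity is below 0.1, and the row with the missing name is kept with an undefined length.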

In [None]:
#Let's create a 2x3 grid (2 rows, 3 columns) and select my favorite graphs for it
fig = plt.figure(figsize=(20,11))
# ax = fig.add_subplot(111)
ax1 = fig.add_subplot(231)
ax2 = fig.add_subplot(232)
ax3 = fig.add_subplot(233)
ax4 = fig.add_subplot(234)
ax5 = fig.add_subplot(235)
ax6 = fig.add_subplot(236)

# ax.tick_params(labelcolor='w', top=False, bottom=False, left=False, right=False)

# ax.set_xlabel('Song Popularity')
# ax.set_title('Song Popularity Vs. Various Song Features')
fig.suptitle('Song Popularity Vs. Various Song Features', fontsize=24)

ax1.scatter(tempdf['song.hotttnesss'], tempdf['loudness'], alpha=0.35, color='red')
ax1.set_ylabel('Loudness',fontsize=16)
ax1.set_ylim([-30,0])

ax2.scatter(tempdf['song.hotttnesss'], tempdf['duration'], alpha=0.35, color='grey')
ax2.set_ylabel('Duration (s)',fontsize=16)
ax2.set_ylim([0,600])

ax3.scatter(tempdf['song.hotttnesss'], tempdf['tempo'], alpha=0.35, color='blue')
ax3.set_ylabel('Tempo (bpm)',fontsize=16)
ax3.set_ylim([25,250])

minor = tempdf[(tempdf['mode'] == 0)]
major = tempdf[(tempdf['mode'] == 1)]
ax4.hist(minor['song.hotttnesss'],alpha=0.35, bins=20, label='Minor', density=True)
ax4.hist(major['song.hotttnesss'],alpha=0.35, bins=20, label='Major', density=True)
ax4.set_ylabel('Probability Density of Songs - key',fontsize=16)
#the legend entries come from the label= arguments passed to hist above
ax4.legend()

ax5.scatter(tempdf['song.hotttnesss'], tempdf['artist.hotttnesss'], alpha=0.35, color='purple')
ax5.set_xlabel('Song Popularity', fontsize=20)
ax5.set_ylabel('Artist Popularity',fontsize=16)
ax5.set_ylim([0,1])

ax6.scatter(tempdf['song.hotttnesss'], tempdf['artist_name_length'], alpha=0.35, color='green')
ax6.set_ylabel('Artist Name Length',fontsize=16)
ax6.set_ylim([0,25])

fig.tight_layout()
fig.subplots_adjust(top=0.92, bottom=0.05, hspace = 0.1)
plt.savefig('final.png', dpi=72)
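
As a possible variation on the grid above, `plt.subplots` can link the X axes explicitly with `sharex=True`, so that every panel is guaranteed to cover the same popularity range. This is a sketch on synthetic data: the feature values are assumptions for illustration, and only three of the six panels are reproduced.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-ins for three of the features (hypothetical values, not the real data)
rng = np.random.default_rng(1)
popularity = rng.uniform(0.1, 1.0, size=200)
features = {
    'Loudness': rng.uniform(-30, 0, size=200),
    'Duration (s)': rng.uniform(0, 600, size=200),
    'Tempo (bpm)': rng.uniform(25, 250, size=200),
}

# sharex=True links the X limits of every panel in the grid
fig, axes = plt.subplots(1, 3, sharex=True, figsize=(15, 4))
for ax, (label, values) in zip(axes, features.items()):
    ax.scatter(popularity, values, alpha=0.35)
    ax.set_xlabel('Song Popularity')
    ax.set_ylabel(label)

# Changing the limits on one panel now changes them on all panels
axes[0].set_xlim(0, 1)
fig.tight_layout()
```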


In [None]:
#Hypothesis: Very loud and very quiet songs will not be popular
#recent seaborn releases require x and y to be passed as keywords
grid = sns.jointplot(x='song.hotttnesss', y='loudness', data=tempdf, alpha=0.5)
grid.set_axis_labels('Song Popularity','Loudness')
#It seems only quiet songs pay a penalty

In [None]:
#Hypothesis: Very short songs and very long songs will not be popular
grid = sns.jointplot(x='song.hotttnesss', y='duration', data=tempdf, alpha=0.5)
grid.set_axis_labels('Song Popularity','Duration (s)')
#Hypothesis is more or less true when song popularity >0.8

In [None]:
#Hypothesis: Key will be important to Song Popularity
grid = sns.jointplot(x='song.hotttnesss', y='key', data=tempdf, alpha=0.5)
grid.set_axis_labels('Song Popularity','Key')
#Not so much...

In [None]:
#Hypothesis - Very Fast Tempo Songs, and Very slow tempo songs will not be popular
grid = sns.jointplot(x='song.hotttnesss', y='tempo', data=tempdf, alpha=0.5, kind='hex')
grid.set_axis_labels('Song Popularity','Tempo')
#This is somewhat predictive: VERY slow songs (Tempo <80) and VERY fast songs (>180) tend not to be highly
#successful (popularity >0.7)
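
The tempo observation above could be checked numerically by grouping songs into tempo bands and computing the share of highly successful songs in each. The sketch below uses eight invented rows, so its numbers are purely illustrative. The band edges (80 and 180) and the 0.7 popularity threshold come from the comment above, while the band labels and the frame `songs` are assumptions.

```python
import numpy as np
import pandas as pd

# Invented rows for illustration only (not taken from music.csv)
songs = pd.DataFrame({
    'tempo': [60, 70, 100, 120, 140, 160, 190, 200],
    'song.hotttnesss': [0.30, 0.50, 0.80, 0.75, 0.60, 0.90, 0.40, 0.72],
})

# Left-closed bands: [0, 80), [80, 180), [180, inf)
tempo_band = pd.cut(
    songs['tempo'],
    bins=[0, 80, 180, np.inf],
    right=False,
    labels=['slow (<80)', 'mid (80-180)', 'fast (>=180)'],
)

# Share of songs in each band whose popularity exceeds 0.7
is_hit = songs['song.hotttnesss'] > 0.7
hit_share = is_hit.groupby(tempo_band, observed=True).mean()
print(hit_share)
```

On the real data, `songs` would be replaced by `tempdf`. Whether the pattern holds there is left open.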

In [None]:
#Hypothesis - major or minor keys will have no effect on song hotness
minor = tempdf[(tempdf['mode'] == 0)]
major = tempdf[(tempdf['mode'] == 1)]
ax = minor['song.hotttnesss'].plot.hist(alpha=0.5, bins=20)
ax1 = major['song.hotttnesss'].plot.hist(alpha=0.5, bins=20)
plt.legend(['Minor', 'Major'])
#call set_xlabel (assigning a string to it would replace the method and leave the axis unlabeled)
ax.set_xlabel('Song Popularity')

In [None]:
#Hypothesis - artist hotness and song hotness are closely related
ax = tempdf.plot.scatter('song.hotttnesss', 'artist.hotttnesss', alpha=0.5)
ax.axes.set_xlabel('Song Popularity')
ax.axes.set_ylabel('Artist Popularity')
#Not as closely related as you might expect

In [None]:
#strategy is - formulate a hypothesis, investigate a single dimension against song.hotttnesss

#hypothesis - length of artist name will be unrelated to song hotness
#.str.len() gives NaN for a missing name instead of raising
tempdf['artist_name_length'] = tempdf['artist.name'].str.len()
#remove rows where artist_name_length is = 0
tempdf = tempdf[(tempdf['artist_name_length']!=0)]
#remove rows where artist name length > 22
tempdf = tempdf[(tempdf['artist_name_length']<=22)]
#remove rows of very unpopular songs
tempdf = tempdf[(tempdf['song.hotttnesss']>=0.1)]
ax = tempdf.plot.scatter('song.hotttnesss', 'artist_name_length')
ax.set_xlabel('Song Popularity')
ax.set_ylabel('Artist Name Length')
#it's very much unrelated!