# Data Acquisition and Cleaning
The first part of this notebook describes how we built a usable dataset. The Million Song Dataset (MSD) turned out to be lacking in song features (danceability, energy, etc.), which were all set to 0, so we took additional steps to enrich it.

We first import the MSD data and keep only the fields of interest. Next, we query the Spotify API to obtain additional information about each song. We also add a genre classification for each song from three datasets built around the MSD. Finally, we generate a CSV file that serves as our main data source for the rest of the project.

In [None]:
import pandas as pd
import sqlite3
from sqlite3 import Error
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from ipywidgets import widgets, interactive
%matplotlib inline

%load_ext autoreload
%autoreload 2

sns.set(font_scale=1.75)

The following function creates a connection to an SQLite database.

In [None]:
def create_connection(db_file):
    """ create a database connection to the SQLite database
        specified by the db_file
    :param db_file: database file
    :return: Connection object or None
    """
    try:
        conn = sqlite3.connect(db_file)
        return conn
    except Error as e:
        print(e)

    return None

We have a quick look at the columns we have at our disposal.

In [None]:
database = "data/track_metadata.db"
conn = create_connection(database)
cur = conn.cursor()
cur.execute("PRAGMA table_info(songs)")
rows = cur.fetchall()
print(rows)

We query the columns of interest and put the data in a dataframe.

In [None]:
cur.execute("SELECT track_id, song_id, artist_id, duration, artist_hotttnesss, year FROM songs ORDER BY track_id")
rows = cur.fetchall()
songs = pd.DataFrame(rows, columns=['track_id', 'song_id', 'artist_id', 'duration', 'artist_hotttnesss', 'year'])
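As a side note, pandas can run the query and build the dataframe in one step. The sketch below is only an illustration: it uses a throwaway in-memory database with made-up rows (not MSD data) to show `pd.read_sql_query`, which takes the column names directly from the query.

```python
import sqlite3
import pandas as pd

# In-memory database with made-up rows, standing in for data/track_metadata.db
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE songs (track_id TEXT, song_id TEXT, duration REAL, year INTEGER)")
demo_conn.executemany(
    "INSERT INTO songs VALUES (?, ?, ?, ?)",
    [("TR_B", "SO_2", 201.5, 1999), ("TR_A", "SO_1", 180.0, 0)],
)

# The dataframe columns are named after the selected columns
demo_songs = pd.read_sql_query(
    "SELECT track_id, song_id, duration, year FROM songs ORDER BY track_id", demo_conn
)
demo_conn.close()
print(demo_songs)
```

This avoids repeating the column list in both the query and the `pd.DataFrame` call.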

The next three cells merge in the genre classifications coming from three datasets.

In [None]:
track_genre_cd1 = pd.read_csv('data/msd_tagtraum_cd1.cls', sep='\t', names=['track_id', 'genre1_cd1', 'genre2_cd1'])
songs = track_genre_cd1.merge(songs, on='track_id', how='right')

In [None]:
track_genre_cd2 = pd.read_csv('data/msd_tagtraum_cd2.cls', sep='\t', names=['track_id', 'genre1_cd2', 'genre2_cd2'])
songs = track_genre_cd2.merge(songs, on='track_id', how='right')

In [None]:
track_genre_cd2c = pd.read_csv('data/msd_tagtraum_cd2c.cls', sep='\t', names=['track_id', 'genre1_cd2c', 'genre2_cd2c'])
songs = track_genre_cd2c.merge(songs, on='track_id', how='right')
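To illustrate what these right merges do, here is a minimal sketch on made-up data (the IDs and genres are hypothetical). With `how='right'` every song is kept whether or not it has a genre label, and the optional `indicator=True` argument makes it easy to count how many songs were actually labelled.

```python
import pandas as pd

# Made-up tracks and genre labels (hypothetical IDs, for illustration only)
demo_songs = pd.DataFrame({"track_id": ["TR_A", "TR_B", "TR_C"], "year": [1999, 2005, 0]})
demo_genres = pd.DataFrame({"track_id": ["TR_A", "TR_C"], "genre1_cd1": ["Rock", "Jazz"]})

# how='right' keeps every song, labelled or not; indicator adds a '_merge' column
merged = demo_genres.merge(demo_songs, on="track_id", how="right", indicator=True)
labelled = int((merged["_merge"] == "both").sum())
print(merged)
print(f"{labelled} of {len(merged)} songs have a genre label")
```

Songs without a label simply get NaN in the genre columns.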

We take a look at the song analysis data from the MSD.

In [None]:
msd_summary_file = pd.HDFStore("data/msd_summary_file.h5")
songs_analysis = msd_summary_file.get('/analysis/songs')
songs_analysis.columns

We then add the columns we find interesting. Many columns in this dataset are empty and cannot be used.

In [None]:
songs_analysis = songs_analysis[['track_id', 'loudness', 'mode', 'tempo', 'key']]
songs = songs_analysis.merge(songs, on='track_id')
del(songs_analysis)

We do the same operation for the metadata.

In [None]:
songs_metadata = msd_summary_file.get('/metadata/songs')
songs_metadata.columns

In [None]:
songs_metadata = songs_metadata[['song_hotttnesss', 'song_id', 'artist_latitude', 'artist_location', 'artist_longitude']]
songs = songs_metadata.merge(songs, on='song_id')
del(songs_metadata)

We take a look at the data we have gathered so far.

In [None]:
songs.head()

To gather data from the Spotify API, we wrote some scripts in auxiliary files (stored in the folder spotify_requests_tools). Using these tools, we created two CSV files, 'feature_songs.csv' and 'track_year_popularity.csv', which contain additional information for each song.

In [None]:
spotify_data = pd.read_csv('data/feature_songs.csv')
spotify_data.columns

In [None]:
# Replace the unknown values (zeros) by NaN
for column in spotify_data.columns:
    spotify_data.loc[spotify_data[column] == 0, column] = np.nan
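Note that this loop treats every zero as missing, including zeros that can be legitimate values (in Spotify's encoding, a key of 0 is C and a mode of 0 is minor). A more selective variant is sketched below on made-up data. The list `zero_means_unknown` is a hypothetical name and would have to be chosen by inspecting the real columns.

```python
import numpy as np
import pandas as pd

# Made-up rows; 'energy' uses 0 as a placeholder, 'mode' uses 0 as a real value
demo = pd.DataFrame({
    "song_id": ["SO_1", "SO_2", "SO_3"],
    "energy": [0.0, 0.8, 0.5],
    "mode": [0, 1, 0],
})

# Assumption: only these columns use 0 to mean "unknown"
zero_means_unknown = ["energy"]
demo[zero_means_unknown] = demo[zero_means_unknown].replace(0, np.nan)
print(demo)
```

Here the zero in `energy` becomes NaN while the zeros in `mode` are preserved.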

In [None]:
songs2 = songs.merge(spotify_data, how='left', on='song_id')

We take another look at the data gathered so far.

In [None]:
songs2.head()

In [None]:
songs2.iloc[1, :]

In [None]:
spotify_year_pop = pd.read_csv('data/track_year_popularity.csv')
final_merge = songs2.merge(spotify_year_pop.drop_duplicates(['song_id'], keep='last'), how='left', on='song_id')

Finally, we take a look at our data and save it to a CSV file.

In [None]:
final_merge.columns

In [None]:
final_merge.to_csv('final_merge.csv')

In [None]:
final_merge.shape

It seems that we have gained some rows while joining the datasets. This may be due to duplicate IDs.

In [None]:
final_merge.loc[2,:]
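To check the duplicate-ID hypothesis, one could look for repeated keys before merging. The sketch below uses made-up frames (`left` and `right` are hypothetical stand-ins for the real tables) to show how a duplicated key inflates the row count, how the `validate` argument of `merge` makes the problem visible, and how dropping duplicates first keeps the row count stable.

```python
import pandas as pd

# Made-up data: SO_1 appears twice on the right-hand side
left = pd.DataFrame({"song_id": ["SO_1", "SO_2"], "year": [1999, 2005]})
right = pd.DataFrame({"song_id": ["SO_1", "SO_1", "SO_2"], "energy": [0.4, 0.5, 0.9]})

# Keys that appear more than once on the right-hand side
dup_keys = right.loc[right["song_id"].duplicated(keep=False), "song_id"].unique()
print("duplicated keys:", list(dup_keys))

# A plain left merge silently repeats the row of SO_1
inflated = left.merge(right, on="song_id", how="left")
print(len(left), "->", len(inflated))

# validate= makes the merge fail loudly instead
try:
    left.merge(right, on="song_id", how="left", validate="many_to_one")
except pd.errors.MergeError as err:
    print("merge rejected:", err)

# Dropping duplicates first keeps one row per song
deduped = left.merge(right.drop_duplicates("song_id", keep="last"), on="song_id", how="left")
print(len(left), "->", len(deduped))
```

The last step mirrors what the notebook already does for 'track_year_popularity.csv'.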

# Year analysis

In [None]:
df = final_merge.copy()
df['album_release'] = df['album_release'].fillna(0).astype(int) #must fill nan with value to convert to int
df = df.loc[df['year'] != 0] #don't take 0, as it means unknown
df.describe()

We can see that we still have more than 500k tracks to do our year/time analysis, which should be enough to see correlations if there are any.

In [None]:
df_c = df.copy()

year_plot = df_c.groupby('year').size().plot(kind='bar', figsize=(16,12), color='g')
ax = plt.gca()
for label in ax.get_xticklabels(): #Little trick to avoid cluttering the x axis and only see every 5 years
    label.set_visible(False)
for label in ax.get_xticklabels()[2::5]:
    label.set_visible(True)
ax.set_ylabel("Number of samples")
fig = plt.gcf()
fig.suptitle("Samples per year", y=0.9)
fig.savefig("../year_analysis_plots/samples_per_year.png", bbox_inches='tight')

## Analysis

We can see, as expected, that the dataset doesn't have many songs before ~1990. It also stops after 2010 (when the dataset was created). The release dates are therefore not sampled uniformly. As explained on the MSD website, the dataset was built from the most popular artists and tracks, which explains why older songs are underrepresented.

In [None]:
def plot_by(df, idx, column, ax=None):
    '''
        Plot a column (y axis) against an index (x axis)
        Will generate a plot matching for each idx value the mean of the column value for this idx.
        :param df: The dataframe containing the data
        :param idx: The x axis series
        :param column: the y axis series
        :param ax: A custom axis object to plot on
        :type df: DataFrame
        :type idx: string
        :type column: string
        :type ax: Axes
    '''
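    df_c = df[[idx, column]]
    # Fall back to the current axes when no axis is given
    axes = plt.gca() if ax is None else ax
    # Plot on the resolved axes (passing ax=None would open a new figure)
    df_c.groupby([idx]).mean().plot(ax=axes)
    # Label after plotting, since pandas sets its own x label
    axes.set_xlabel(idx)
    axes.set_ylabel(column)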
    
def plot_by_year(df, column, ax = None):
    '''
        Plot a column (y axis) against the year (x axis)
        Will generate a plot matching for each year the mean of the column value for this year.
        :param df: The dataframe containing the data
        :param column: the y axis series
        :param ax: A custom axis object to plot on
        :type df: DataFrame
        :type column: string
        :type ax: Axes
    '''
    plot_by(df,'year',column, ax=ax)
    
def plot_heatmap_by(df, idx, column, ax = None):
    '''
        Plot a heatmap using an index (x axis), and a column (y axis)
        The color value of the heatmap will be the number of samples for this coordinate.
        :param df: The dataframe containing the data
        :param idx: the name of the x axis series
        :param column: the name of the y axis series
        :param ax: A custom axis object to plot on
        :type df: DataFrame
        :type idx: string
        :type column: string
        :type ax: Axes
    '''
    df_c = df[[idx, column]].copy()
    if pd.api.types.is_float_dtype(df_c[column]): #Bin the data if needed (np.float no longer exists in recent NumPy)
        bins = np.linspace(df_c[column].min(), df_c[column].max(), 20)
        df_c[column] = pd.cut(df_c[column], bins, include_lowest=True) #include_lowest keeps the minimum value
    df_c = df_c.dropna()
    df_count = pd.DataFrame(df_c.groupby([idx, column]).size().rename('count'))
    df_c = df_c.join(df_count, on=[idx,column])
    df_c = df_c.reset_index().pivot_table(index=idx, columns=column, values='count', aggfunc='mean')
    axes = plt.gca() if ax is None else ax
    axes.set_xlabel(idx)
    axes.set_ylabel(column)
    sns.heatmap(df_c, ax=axes, cbar_kws={'label': 'Number of samples'})

def plot_heatmap_by_year(df, column, ax=None):
    '''
        Plot a heatmap using a column (y axis) against the years
        The color value of the heatmap will be the number of samples for this coordinate.
        :param df: The dataframe containing the data
        :param column: the name of the y axis series
        :param ax: A custom axis object to plot on
        :type df: DataFrame
        :type column: string
        :type ax: Axes
    '''
    plot_heatmap_by(df, 'year', column, ax=ax)
    
def plot_for_year(df, column, year, ax=None, decade=False):
    '''
        Plot a stripplot (lineplot) of a feature / column for a given year against the song_hotttnesss
        The plot will have jitter to better visualize the data
        :param df: The dataframe containing the data
        :param column: the name of the series
        :param year: the year
        :param ax: A custom axis object to plot on
        :param decade: Whether to consider the whole decade or not
        :type df: DataFrame
        :type column: string
        :type year: int
        :type ax: Axes
        :type decade: boolean
    '''
    df_c = df.copy()
    if decade:
        # Keep the whole decade [year, year + 10). Comparing against a hand-built
        # pd.Interval misses the first bin, whose left edge pd.cut shifts below 1960
        df_c = df_c[(df_c['year'] >= year) & (df_c['year'] < year + 10)]
    else:
        df_c = df_c[df_c['year']==year]
    df_c = df_c[[column, 'song_hotttnesss']].copy()
    df_c['song_hotttnesss'] = pd.cut(df_c['song_hotttnesss'], np.linspace(0,1,11)) #Bin the song hotness
    axes = plt.gca() if ax is None else ax

    axes.set_xlim(df[column].min()-0.1,df[column].max()+0.1)
    axes.set_xlabel(column)
    axes.set_ylabel(year)
    sns.stripplot(x=column, y='song_hotttnesss',data=df_c, jitter=0.5, ax=axes)
    axes.invert_yaxis()
    min_y = axes.get_ylim()[0]
    max_y = axes.get_ylim()[1]
    length = max_y-min_y
    for hotness in df_c['song_hotttnesss'].unique(): #Plot the median of each song hotness bin
        if isinstance(hotness, pd.Interval):
            median = df_c.loc[df_c['song_hotttnesss']==hotness][column].median()
            start = hotness.left
            end = hotness.right
            axes.plot([median,median], [length*start+min_y, length*end+min_y], color='k', zorder=1000)
    
    
def plot_heatmap_for_year(df, column1, column2, year, ax=None):
    '''
        Plot a kdeplot of two features / columns for a given year
        It will allow to see correlation between the two columns
        :param df: The dataframe containing the data
        :param column1: the name of the first series
        :param column2: the name of the second series
        :param year: the year
        :param ax: A custom axis object to plot on
        :type df: DataFrame
        :type column1: string
        :type column2: string
        :type year: int
        :type ax: Axes
    '''
    df_c = df.copy()
    df_c = df_c[df_c['year']==year]
    df_c = df_c[[column1, column2]]
    df_c = df_c.dropna()
    if ax == None:
        axes = plt.gca()
    else:
        axes = ax
    axes.set_xlabel(column1)
    axes.set_ylabel(column2)
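    # Recent seaborn versions require x= / y= keywords and use fill= instead of shade=;
    # the default threshold already leaves the lowest density level unshaded
    sns.kdeplot(x=df_c[column1], y=df_c[column2], cmap="Reds", fill=True, ax=axes)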

def ceil(x):
    '''
        Shortcut for np.ceil(x).astype(int)
        :param x: the value to ceil
        :type x: number
        :return: The result of the ceiling as an int
        :rtype: int
    '''
    return np.ceil(x).astype(int)

We select the features that make sense and then plot their evolution through the years.

In [None]:
selected = ['song_hotttnesss', 'loudness_x', 'mode_x', 'tempo_x', 'key_x', 'duration', 'artist_hotttnesss', 'danceability', 'energy', 'key_y', 'loudness_y', 'mode_y', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo_y', 'duration_ms']
selected.sort()

n_cols = 4
n_rows = ceil(len(selected)/n_cols)
fig, ax = plt.subplots(n_rows,n_cols)
fig.set_size_inches(20,n_rows*20/n_cols)
idx_r = 0
idx_c = 0

#For each feature, plot the mean of the feature for each year
for col in selected:
    plot_by_year(df,col,ax=ax[idx_r, idx_c])
    if idx_c==n_cols-1:
        idx_r+=1
        idx_c=0
    else:
        idx_c+=1

## Analysis

For most features, the values before ~1960 vary a lot because of the low number of samples. However, we can still see trends for several features.

- Duration: There is a clear spike (for both duration and duration_ms) around 1960. We suppose that it is due to the emergence and popularization of the vinyl record (more precisely its more modern iteration), which allowed musicians to record longer tracks (this seemed to be a limitation before). The duration hasn't increased since, probably because artists and the public feel that the current mean duration is about right.

- Acousticness: We see a massive drop through the years. This is most likely due to the rise of electronic music (acousticness reflects the absence of electronic instruments).

- Loudness: The music seems to get louder and louder. This is probably due to cultural changes (genre, etc.).

- Energy: The energy also increases along with the loudness.

- Song hotness: The hotness seems to spike around 2010. This may be explained by the algorithm: if it is similar to Spotify's, the hotness is heavily influenced by how recent the music is. Otherwise, it may be due to the same problem seen with tempo, mode, etc., which all spike at the end of the graph.

- We can drop mode_y, as it is always 1. This is probably a side effect of replacing the zeros of the Spotify data with NaN, which also removes every mode of 0.

- We can't really say much about the other values, except that they seem to stay stable through the years.

In [None]:
n_cols = 1
selected = ['song_hotttnesss', 'loudness_x', 'mode_x', 'tempo_x', 'key_x', 'duration', 'artist_hotttnesss', 'danceability', 'energy', 'key_y', 'loudness_y', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo_y', 'duration_ms']
n_rows = ceil(len(selected)/n_cols)
fig, ax = plt.subplots(n_rows,n_cols)
fig.set_size_inches(20,n_rows*20/n_cols/2) #Trying to find a good aspect
fig.tight_layout()
plt.subplots_adjust(hspace=0.5)

#For each feature, print the heatmap of the feature regarding the year
for idx,col in enumerate(selected):
    plot_heatmap_by_year(df, col, ax=ax[idx])

## Analysis
These graphs don't show much beyond what we saw earlier: most of the samples are recent, so the heatmaps mainly reflect the feature values of those recent years. Normalizing by year might give a better visualization.
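A possible way to normalize by year is sketched below on synthetic data (the values are randomly generated, not taken from the MSD). It relies on `pd.crosstab` with `normalize='index'`, which divides each row (here each year) by its total, so that years with few samples become comparable to years with many.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic data: many more samples in 2005 than in 1965
demo = pd.DataFrame({
    "year": np.repeat([1965, 2005], [20, 2000]),
    "energy": np.concatenate([rng.uniform(0, 1, 20), rng.uniform(0, 1, 2000)]),
})
demo["energy_bin"] = pd.cut(demo["energy"], np.linspace(0, 1, 6), include_lowest=True)

# normalize='index' turns the counts of each year into shares that sum to 1
share = pd.crosstab(demo["year"], demo["energy_bin"], normalize="index")
print(share.round(2))
```

Passing `share` to `sns.heatmap` would then show, for each year, how its samples are spread over the bins rather than how many samples the year has.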

In [None]:
n_cols = 2
year_r = range(1960,2012)
n_rows = np.ceil(len(year_r)/n_cols).astype(int)
fig, ax = plt.subplots(n_rows,n_cols)
fig.set_size_inches(20,n_rows/n_cols*20)
idx_r = 0
idx_c = 0

#Plot the song hotness distribution for each year
for year in year_r:
    plot_for_year(df,'song_hotttnesss',year,ax=ax[idx_r, idx_c])
    if idx_c==n_cols-1:
        idx_r+=1
        idx_c=0
    else:
        idx_c+=1

## Analysis example

The first thing we notice is that a lot of the samples have a hotness of 0. We wondered whether this means that the song hasn't been rated, but the description of the dataset only says that the hotness goes from 0 to 1, so it seems that those songs are simply very unpopular (which is plausible).

We can see that, although the number of samples drastically increases as time goes on, the distribution roughly stays the same (which we could see in the previous graphs). We can also distinguish what look like lines, mainly around 0.2, 0.27 and 0.3 (that is, a higher concentration of samples). At the moment we don't know whether this is due to the algorithm used for the rating or just a coincidence.
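One way to check whether such lines are real concentrations would be to count how often each rounded hotness value occurs. The sketch below does this on synthetic data, where a cluster at 0.27 is added on purpose. It only illustrates the technique and says nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Synthetic hotness: mostly continuous values, plus an artificial cluster at 0.27
hot = pd.Series(
    np.concatenate([rng.uniform(0, 1, 1000), np.full(200, 0.27)]),
    name="song_hotttnesss",
)

# Values that occur far more often than their neighbours show up at the top
counts = hot.round(2).value_counts()
print(counts.head())
```

On the real data, the same call on `df['song_hotttnesss']` would list the most frequent values first.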

### You will need the ipython extensions / widgets enabled to use the interactive graph below

In [None]:
df_c = df.copy()
df_c = df_c[df_c['year']>=1960] #Only take after 1960, as before there are very few samples
DECADE = "Decade"
SINGLE_YEAR = "Single year"
df_c['artist_hotttnesss'] = df_c['artist_hotttnesss'].clip(0, 1) # Clamp the artist hotness to [0, 1] as the data is sometimes faulty
decades = widgets.RadioButtons( #Button to select year or decade
options=[DECADE,SINGLE_YEAR],value=SINGLE_YEAR, disabled=False, description="Year grouping")
year = widgets.BoundedIntText( #Field to select year
    value=1960,
    min=1960,
    max=2010,
    step=1,
    description='Year:',
    disabled=False,
    color='black'
)
def update_year(*args):
    """
    Update the year widget depending on the button value
    """
    if decades.value==SINGLE_YEAR:
        year.step=1
        year.max=2010
        year.description='Year:'
    else:
        year.step=10
        year.value=np.round(year.value/10)*10
        year.max=2000
        year.description="Decade:"
    
decades.observe(update_year,'value') #Listener

feature = widgets.Dropdown( #Widget to select the feature
    options=selected[1:], #Don't take song hotness
    value='loudness_x',
    description='Feature:',
)

def save_all_plots():
    """
    Saves all the year stripplots
    """
    for f in selected[1:]:
        for year in range(1960, 2011):
            plotit(f, year, SINGLE_YEAR)
            fig = plt.gcf()
            fig.savefig("../year_analysis_plots/single_year/"+f+"_"+str(year)+".png", bbox_inches='tight')
            if year<2010 and year%10==0: #Save the decades too
                plotit(f,year,DECADE)
                fig = plt.gcf()
                fig.savefig("../year_analysis_plots/decade/"+f+"_"+str(year)+"-"+str(year+10)+".png",bbox_inches="tight")
            
def plotit(feature, year, decades):
    df_d = df_c.copy()
    plt.clf()
    fig = plt.gcf()
    fig.set_size_inches(16,16)
    decade = decades==DECADE
    ax = plt.gca()
    if decade:
        title = str(year)+" - "+str(year+10)
    else:
        title = str(year)
    fig.suptitle(title, y=0.9)
    plot_for_year(df_d,feature,year, decade=decade)

interactive(plotit,feature=feature, year=year, decades = decades)
# save_all_plots()

## Analysis
The analysis is done by decade: the yearly values jump around a lot, and grouping by decade makes the evolution of the features (their trend) easier to see. To avoid cluttering the notebook, the plots were saved to disk and examined there (they are also available in the GitHub repository).

### Features
First off, the features don't seem to influence the song hotness at all (except for the artist hotness, obviously), which is a bit of a letdown. The median of each feature is roughly the same in every song hotness bin, and so are the distributions.
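This visual impression could be backed by a rank correlation between each feature and the song hotness. The sketch below uses synthetic data built so that one column is related to the hotness and another is not. The column names mirror the dataset but the values are random, so the numbers say nothing about the MSD itself.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500
hot = rng.uniform(0, 1, n)
demo = pd.DataFrame({
    "song_hotttnesss": hot,
    # Built to be related to the hotness
    "artist_hotttnesss": np.clip(hot + rng.normal(0, 0.1, n), 0, 1),
    # Built to be unrelated to the hotness
    "tempo": rng.normal(120, 25, n),
})

# Spearman correlation of every feature with the song hotness
corr = demo.corr(method="spearman")["song_hotttnesss"].drop("song_hotttnesss")
print(corr.round(2))
```

A value close to 0 would support the "no influence" reading, while a value close to 1 or -1 would contradict it.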

### Evolution
There is some evolution through the years / decades though:
- The acousticness seems to decrease throughout the years, and the mean acousticness is quite low overall.
- The duration increased from the 60s to the 70s, and then stayed at ~250s.
- The songs become more and more energetic decade after decade.
- The loudness also increases with the decades. The top songs seem to be louder as well.
- The valence decreases over time, and the top songs hang around a valence of 0.5.

## Conclusion
We can't really see any link between the evolution of a feature through the years and the song hotness. For most of the features, we also can't see any real evolution. These problems may be due to the selection bias of the MSD.

In [None]:
df = final_merge.copy()

In [None]:
df.shape

In [None]:
df.columns

In [None]:
#df.columns.values[0] = 'id'
df.head()

In [None]:
# Keep the songs that have a genre label in at least one of the three datasets
df2 = df[df.genre1_cd1.notna() | df.genre1_cd2.notna() | df.genre1_cd2c.notna()].copy()

In [None]:
df2_year = df2.groupby(['year']).size().reset_index(name='counts')

First we check the number of songs per year in the dataset. As expected, the number of songs increases over the years, except for 2010. This is probably because 2010 was just ending when the dataset was created, and the songs from 2010 had not yet had time to reach their maximum popularity.

In [None]:
df2_year.iloc[1:, :].plot(x='year', y='counts', kind='line')

As said before, there are not many songs before the 60s, so we drop the songs from 1960 and earlier to keep the analysis meaningful.

In [None]:
df2 = df2[df2.year > 1960]

In [None]:
genres = set([])
genres_cols = ['genre1_cd2c', 'genre2_cd2c', 'genre1_cd2', 'genre2_cd2', 'genre1_cd1', 'genre2_cd1']
for col_name in genres_cols:
    genres = genres | set(df2[col_name].unique())
print(genres)
print(len(genres))

We have 17 different genres (nan means unknown and International is the same as World). For the analysis of a genre over the years to be meaningful, the dataset must contain a minimum number of songs of that genre. In the following cells we first replace nan with Unknown and International with World.

In [None]:
df2[genres_cols] = df2[genres_cols].fillna('Unknown')
df2[genres_cols] = df2[genres_cols].replace('International', 'World')

In [None]:
genres = set([])
genres_cols = ['genre1_cd2c', 'genre2_cd2c', 'genre1_cd2', 'genre2_cd2', 'genre1_cd1', 'genre2_cd1']
for col_name in genres_cols:
    genres = genres | set(df2[col_name].unique())
    df2[col_name] = df2[col_name].astype(str)
print(genres)
print(len(genres))

In [None]:
df2[genres_cols].head()

For the moment we have 6 columns for the genres, and we would like to see if we can summarize them in one or two columns.
First we create one indicator column per genre (1 if the song carries that genre in any of the 6 columns, 0 otherwise), then we count the number of different genres of each song.

In [None]:
for genre in list(genres):
    df2[genre] = 0
    for col_name in genres_cols:
        df2.loc[df2[col_name] == genre, genre] = 1
df2 = df2.drop(columns=['Unknown'])
genres.remove('Unknown')

In [None]:
# Sum the genre indicator columns by name rather than by position
df2['nb_genre'] = df2[list(genres)].sum(axis=1)

In [None]:
df2['nb_genre'].plot(kind='hist')

We see that the majority of the songs have 1 or 2 different genres. Some have 3, and 4 is rare. We can now drop the 6 columns containing the genre labels.

In [None]:
df2 = df2.drop(columns=genres_cols)

Now for each genre we plot the number of samples per year.

In [None]:
import json

n_rows = int(np.ceil(len(genres) / 3))  # Enough rows of 3 plots to fit every genre
f, axarr = plt.subplots(n_rows, 3)
f.set_size_inches(15, 20)
plt.subplots_adjust(hspace=.4)
all_data = {}
for i, genre in enumerate(genres):
    cur_ax = axarr[i // 3, i % 3]
    data_genre = df2[df2[genre] == 1].groupby(['year']).size().reset_index(name='counts')
    data_genre.plot(x='year', y='counts', kind='line', title=genre, ax=cur_ax)
    fig = cur_ax.get_figure()
    extent = cur_ax.get_window_extent().transformed(fig.dpi_scale_trans.inverted())
    # Convert to plain Python types so that the data can be serialized as JSON
    all_data[genre] = {
        'years': data_genre['year'].astype(str).tolist(),
        'count': data_genre['counts'].tolist(),
    }
    #fig.savefig('figures/%s_distri_year.png' % (genre), bbox_inches=extent.expanded(1.2, 1.15), dpi = 500)
# Write valid JSON (the str() of a dict is not JSON)
with open('counts.json', 'w') as out:
    json.dump(all_data, out)

These plots are useful to get a sense of the data we have.

First, we observe that most of the music we have is rock, pop, pop_rock, electronic or metal. Conversely, World, Latin and blues are poorly represented. This may be because the dataset is biased, or because some genres are more popular than others. Latin music, for instance, is underrepresented even though Latin culture is widespread around the world. These observations could be made more precise by looking at the total number of songs for each genre.

Second, these plots reveal some trends in the evolution of music. If we assume that the dataset is not too biased for the most represented genres, we can make some interesting observations: punk music suddenly appears in the middle of the 70s, and rock starts in the 60s and has grown exponentially since then. These plots thus tell us when a genre appeared and how it has evolved since.
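The total number of songs per genre, and the first year in which each genre appears, can be read directly from the indicator columns. The sketch below uses a tiny made-up table with the same layout as `df2` (a `year` column plus one 0/1 column per genre). The rows are invented for illustration.

```python
import pandas as pd

# Made-up one-hot genre indicators, one row per song
demo = pd.DataFrame({
    "year": [1975, 1980, 1995, 2001, 2005],
    "Rock": [1, 1, 0, 1, 1],
    "Punk": [1, 0, 0, 0, 1],
    "Latin": [0, 0, 1, 0, 0],
})
genre_names = ["Rock", "Punk", "Latin"]

# Total number of songs per genre, most represented first
totals = demo[genre_names].sum().sort_values(ascending=False)
# First year in which each genre appears
first_year = {g: int(demo.loc[demo[g] == 1, "year"].min()) for g in genre_names}
print(totals)
print(first_year)
```

On the real data, `genre_names` would be the list of genre columns of `df2`.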

Now we want to look at how genres are connected, so let's construct a graph in which the nodes are the genres and two genres are connected when at least one song has both. The weight of the connection is the number of such songs.

In [None]:
adj_mat_genres = np.zeros([len(genres), len(genres)])
genres = list(genres)
for i in range(len(genres)):
    for j in range(i, len(genres)):
        nb_songs = df2[(df2[genres[i]] == 1) & (df2[genres[j]] == 1)].shape[0]
        adj_mat_genres[i, j] = nb_songs
        adj_mat_genres[j, i] = nb_songs

adj_mat_genres
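The same co-occurrence counts can also be obtained with a single matrix product, which avoids the double loop. This is an alternative sketch on made-up indicator columns, not the method used above: if `X` is the songs-by-genres 0/1 matrix, then entry (i, j) of `X.T @ X` is the number of songs tagged with both genre i and genre j.

```python
import numpy as np
import pandas as pd

# Made-up one-hot genre indicators, one row per song
demo = pd.DataFrame({
    "Rock": [1, 1, 0, 1],
    "Pop":  [1, 0, 0, 1],
    "Jazz": [0, 0, 1, 0],
})

X = demo.to_numpy()
# Off-diagonal entries count shared songs; the diagonal holds the songs per genre
cooc = pd.DataFrame(X.T @ X, index=demo.columns, columns=demo.columns)
print(cooc)
```

As in `adj_mat_genres`, the result is symmetric and its diagonal gives the size of each genre.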

In [None]:
adj_df = pd.DataFrame(adj_mat_genres, columns=genres)

In [None]:
adj_df['genre'] = genres

In [None]:
adj_df['radius'] = (np.diag(adj_mat_genres))**.5
adj_df['id'] = range(len(adj_df))

In [None]:
adj_df[['radius', 'id', 'genre']].T.to_dict().values()

In [None]:
import json
a = adj_df[['radius', 'id', 'genre']].T.to_dict().values()
list(a)

In [None]:
adj_df = adj_df.iloc[:, :-3]

In [None]:
adj = adj_df.values

In [None]:
edges = []
for i in range(len(adj)):
    for j in range(i+1, len(adj)):
        edge = {'source_id': i, 'target_id': j, 'stroke_width': adj[i, j]/1000}
        edges.append(edge)
edges

In [None]:
%%html
<iframe src="http://www.cbinge.com/file/test.html" width=1000 height = 1000/>

In [None]:
import json
from sklearn.ensemble import RandomForestRegressor

n_rows = int(np.ceil(len(genres) / 3))  # Enough rows of 3 plots to fit every genre
f, axarr = plt.subplots(n_rows, 3)
f.set_size_inches(15, 20)
plt.subplots_adjust(hspace=.4)
all_data = {}
years = pd.DataFrame({'year': range(1960, 2011)})  # Years on which the fitted curves are evaluated

# Compute the avg hotness by year for all the data
# (selecting the column first avoids averaging the non-numeric columns)
hottness_avg = df2[df2['song_hotttnesss'].notna()].groupby('year')['song_hotttnesss'].mean().reset_index()
regr = RandomForestRegressor(n_estimators=10, n_jobs=-1, max_depth=5)
regr.fit(hottness_avg[['year']], hottness_avg['song_hotttnesss'])
hottness_predict = regr.predict(years)
all_data['avg'] = {
    'years': hottness_avg['year'].astype(str).tolist(),
    'hottness': hottness_avg['song_hotttnesss'].tolist(),
    'predict': hottness_predict.tolist(),
}

# Compute it for each genre
for i, genre in enumerate(genres):
    cur_ax = axarr[i // 3, i % 3]
    regr = RandomForestRegressor(n_estimators=10, n_jobs=-1, max_depth=5)
    hottness = df2[(df2[genre] == 1) & df2['song_hotttnesss'].notna()].groupby('year')['song_hotttnesss'].mean().reset_index()
    hottness = hottness[hottness['song_hotttnesss'] > 0]
    regr.fit(hottness[['year']], hottness['song_hotttnesss'])
    hottness.plot(x='year', y='song_hotttnesss', kind='scatter', title=genre, ax=cur_ax, color='orange')
    hottness_predict = regr.predict(years)
    cur_ax.plot(years['year'], hottness_predict)
    all_data[genre] = {
        'years': hottness['year'].astype(str).tolist(),
        'hottness': hottness['song_hotttnesss'].tolist(),
        'predict': hottness_predict.tolist(),
    }
# Write valid JSON (the str() of a dict is not JSON)
with open('hottness.json', 'w') as out:
    json.dump(all_data, out)

# Milestone 3
### Analysis of the distribution of the different features for each genre
Look at the empirical probability function of the genre.
### Look at the influence of the year on these distributions
Is genre time invariant or not?
### Visualize the interesting results obtained
Visualization using graphs that evolve with time.

In [None]:
for genre in genres:
    tmp = df2[df2[genre] == 1].reset_index()
    text = tmp[['song_hotttnesss', 'duration', 'speechiness', 'acousticness', 'instrumentalness']].describe().to_html()
    # Save the summary table of the genre as HTML
    with open('figures/%s.tab' % genre, 'w') as out:
        out.write(text)

In [None]:
df2.columns

In [None]:
import json
from sklearn.ensemble import RandomForestRegressor

for col in ['song_hotttnesss', 'duration', 'speechiness', 'acousticness', 'instrumentalness']:
    all_data = {}

    # Compute the yearly average of the feature for all the data
    avg = df2[df2[col].notna()].groupby('year')[col].mean().reset_index()
    regr = RandomForestRegressor(n_estimators=10, n_jobs=-1, max_depth=5)
    regr.fit(avg[['year']], avg[col])  # Fit on the years of this feature (was hottness_avg)
    predict = regr.predict(avg[['year']])
    all_data['avg'] = {
        'years': avg['year'].astype(str).tolist(),
        col: avg[col].tolist(),
        'predict': predict.tolist(),
    }

    # Compute it for each genre
    print("%s: " % col)
    for genre in genres:
        regr = RandomForestRegressor(n_estimators=10, n_jobs=-1, max_depth=5)
        datas = df2[(df2[genre] == 1) & df2[col].notna()]
        col_data = datas.groupby('year')[col].mean().reset_index()
        col_data = col_data[col_data[col] > 0]
        regr.fit(col_data[['year']], col_data[col])
        predict = regr.predict(avg[['year']])  # Evaluate on every year available for this feature
        all_data[genre] = {
            'years': col_data['year'].astype(str).tolist(),  # Years of this genre (was the stale `hottness` frame)
            col: col_data[col].tolist(),
            'predict': predict.tolist(),
        }
        print("\t %s: Nb songs = %d and Avg Value is %f" % (genre, len(datas), col_data[col].mean()))
    # Write valid JSON (the str() of a dict is not JSON)
    with open('%s.json' % col, 'w') as out:
        json.dump(all_data, out)

In [None]:
len(df2[df2['speechiness'].notna()])