# ADA 2018 -  “Happiness” Share it through music.

#### Andres Montero, Ariel Alba, Diego Iriarte




This project explores the relationship between tweets and music. Specifically, we focus on tweets that reflect a positive sentiment or mood and relate them to the kind of music their authors listen to. We believe that music preferences vary with mood, so we aim to detect what kind of music people tend to listen to when they feel happy or post tweets with positive sentiment, which we treat as equivalent.
To do so, we work with two datasets:
* FMA: A Dataset For Music Analysis
* Sentiment140 dataset with 1.6 million tweets

The tweets dataset is labeled: each tweet is marked as having positive sentiment or not.
To determine whether the proposed idea is feasible, we follow these steps:

**A)** Preprocessing and analyzing the FMA dataset. The dataset is very extensive, so we first select the data relevant to our study. We clean the data, define the variable types, normalize the tables, and deal with missing values. From the cleaned data we compute summary statistics, such as the most listened songs and albums, and look for relationships between attributes such as danceability, energy, number of listens, location and others.

**B)** Loading and analyzing the Sentiment140 dataset. We focus only on tweets labeled as positive. The dataset contains little information about each tweet (date, sentiment, user ID and text) and is limited to tweets from 2009. Before preprocessing this dataset any further, we check which relationships we can find with the FMA dataset.


**C)** Find relationships between both datasets. It is important to identify whether the idea of the project is feasible. For it to be feasible, we need to find a considerable number of positive tweets that are somehow related to music or songs. To measure these relationships, our approach is the following:
* Count how many tweets contain a link to a song.
* Count how many tweets explicitly mention Spotify.
* Find music-related words in the tweets (may be a difficult task).
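
The three counts listed above can be tried on plain strings before touching the full dataset. The snippet below is only a minimal sketch, not the project's final method: the URL pattern, the `MUSIC_WORDS` vocabulary and the function name `count_music_signals` are assumptions introduced here for illustration. Note that the link pattern matches any URL, so it gives an upper bound on the number of song links.

```python
import re

# Hypothetical patterns and vocabulary, chosen only for illustration
URL_PATTERN = re.compile(r'https?://\S+', re.IGNORECASE)
SPOTIFY_PATTERN = re.compile(r'spotify', re.IGNORECASE)
MUSIC_WORDS = {'song', 'music', 'album', 'listening', 'playlist'}


def count_music_signals(texts):
    """Count tweets that contain a link, mention Spotify, or use a music word."""
    counts = {'link': 0, 'spotify': 0, 'music_word': 0}
    for text in texts:
        if URL_PATTERN.search(text):
            counts['link'] += 1
        if SPOTIFY_PATTERN.search(text):
            counts['spotify'] += 1
        # Tokenize on letters only so punctuation does not hide a match
        words = set(re.findall(r'[a-z]+', text.lower()))
        if words & MUSIC_WORDS:
            counts['music_word'] += 1
    return counts


sample = [
    'Loving this song http://example.com/track',
    'Spotify is great today',
    'Going for a walk',
]
print(count_music_signals(sample))
```

The same function could be applied to the `text` column of the tweets once it is loaded, since it accepts any iterable of strings.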


**D)** If our first two approaches to relating the datasets do not succeed, we may not be able to attempt the third one due to time constraints and task difficulty. In that scenario, the project scope will be reduced to focus mostly on the FMA dataset and a specific event of the past years which may have had an impact on music production and preferences. That event will be chosen according to the insights found in part A.

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import reverse_geocoder as rg
import os.path
import ast
import seaborn as sns

from helpers import *
from datetime import datetime, date, time
from scipy import stats

%load_ext autoreload
%autoreload 2


### A) Preprocessing and Analyzing FMA Dataset

The first part consists of loading the data needed for the analysis, cleaning it, and storing it in DataFrames for further work.

In [None]:
# Constants
DATA_DIR = './data/'
MUSIC_DIR = '{dir}{file}/'.format(dir=DATA_DIR, 
                                  file='fma_metadata')
PKL_DIR = '{dir}{file}/'.format(dir=DATA_DIR, 
                               file='pkl')

# True if we want to execute the clean phase and force
# saving the cleaned file
CLEAN_PHASE = True
DEBUG = True


In [None]:
if CLEAN_PHASE:
    echonest_path = '{dir}{file}'.format(dir=MUSIC_DIR, 
                                         file='echonest.csv')
    features_path = '{dir}{file}'.format(dir=MUSIC_DIR,
                                         file='features.csv')
    genres_path = '{dir}{file}'.format(dir=MUSIC_DIR,
                                       file='genres.csv')
    tracks_path = '{dir}{file}'.format(dir=MUSIC_DIR,
                                       file='tracks.csv')
    

We define which columns to load, their names, and their types.

In [None]:
if CLEAN_PHASE:
    # Load datasets into pandas dataframes
    echonest_col_names=['track_id', 'danceability', 'energy',
                        'valence', 'artist_latitude',
                        'artist_longitude', 'artist_name',
                        'artist_discovery', 'artist_family',
                        'artist_hotness', 'song_currency', 
                        'song_hotness']
    
    echonest_dtypes = {'track_id': int, 'danceability': float, 
                       'energy': float, 'valence': float,
                       'artist_latitude': float, 'artist_longitude': float,
                       'artist_name': str, 'artist_discovery': float, 
                       'artist_family': float, 'artist_hotness': float,
                       'song_currency': float, 'song_hotness': float}
    
    echonest_df = pd.read_csv(echonest_path, names=echonest_col_names,
                              header=3, dtype=echonest_dtypes, 
                              usecols=[0, 2, 3, 8, 11, 13, 14,
                                       21, 22, 23, 24, 25])

    genres_dtypes = {'genre_id': int, '#tracks': int, 
                     'parent': int, 'top_level': int}
    
    genres_df = pd.read_csv(genres_path, dtype=genres_dtypes)
    
    track_col_names = ['track_id', 'album_date_created',
                       'album_date_released', 'album_id',
                       'album_listens', 'album_title', 
                       'artist_id', 'artist_latitude',
                       'artist_longitude', 'artist_name',
                       'track_duration', 'track_genre_top',
                       'track_genres_all', 'track_language', 
                       'track_listens', 'track_tags', 
                       'track_title']

    tracks_dtypes = {'track_id': int, 'album_date_created': str,
                     'album_date_released': str, 
                     'album_id': int, 'album_listens': int, 
                     'album_title': str, 'artist_id': int,
                     'artist_latitude': float, 'artist_longitude': float,
                     'artist_name': str, 'track_duration': int, 
                     'track_genre_top': str, 'track_genres_all': str, 
                     'track_language': str, 'track_listens': int,
                     'track_tags': str, 'track_title': str}

    tracks_df = pd.read_csv(tracks_path, names=track_col_names,
                            header=2, usecols=[0, 2, 3, 6, 8, 11,
                                               21, 22, 24, 26, 38, 
                                               40, 41, 45, 47, 51,
                                               52])


In [None]:
# Tracks clean phase
if CLEAN_PHASE:
    # Transforms str to datetime
    tracks_df['album_date_released'] = pd.to_datetime( \
                                            tracks_df['album_date_released'])
    tracks_df['album_date_created'] = pd.to_datetime( \
                                            tracks_df['album_date_created'])
    # Transform str to list
    tracks_df['track_tags'] = tracks_df['track_tags'] \
                                .apply(lambda x: ast.literal_eval(x))
    tracks_df['track_genres_all'] = tracks_df['track_genres_all'] \
                                .apply(lambda x: ast.literal_eval(x))
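
The tag and genre columns arrive as strings that look like Python lists, which is why the cell above parses them with `ast.literal_eval`. A minimal sketch of that step with a guard for missing values follows; the helper name `parse_list_field` and the choice to map missing values to an empty list are assumptions made here, not part of the original notebook.

```python
import ast


def parse_list_field(raw):
    """Turn a string such as "[21, 10]" into a Python list.

    Anything that is not a string (for example a NaN float from pandas)
    is mapped to an empty list instead of raising an error.
    """
    if not isinstance(raw, str):
        return []
    return ast.literal_eval(raw)


print(parse_list_field('[21, 10]'))
print(parse_list_field("['rock', 'live']"))
print(parse_list_field(float('nan')))
```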


We normalize the DataFrames and remove redundant information. We first merge the Echo Nest and track data, which were loaded into two different DataFrames.

In [None]:
if CLEAN_PHASE:
    # Normalize dataframes to take out redundancy
    
    # Join echonest data to tracks
    echo_tracks = echonest_df.merge(tracks_df, left_on='track_id', 
                                    right_on='track_id', how='outer')
    
    print(len(echo_tracks))
     
    # Select main columns to create new df
    artists_df = echo_tracks[['artist_id', 'artist_name_x',
                              'artist_latitude_x', 'artist_longitude_x', 
                              'artist_discovery', 'artist_family', 
                              'artist_hotness', 'artist_latitude_y',
                              'artist_longitude_y', 'artist_name_y']].copy()

    albums_df = echo_tracks[['album_id', 'album_date_created',
                             'album_date_released', 'album_title',
                             'album_listens']].copy()
    
    tracks_df = echo_tracks[['track_id', 'track_title', 
                             'track_duration', 'artist_id', 
                             'album_id', 'track_genre_top',
                             'track_genres_all','track_language',
                             'track_listens', 'track_tags',
                             'danceability', 'energy',
                             'valence', 'song_currency',
                             'song_hotness']].copy()
    
    # Drop duplicates
    artists_df = artists_df.drop_duplicates('artist_id')
    albums_df = albums_df.drop_duplicates('album_id')
    
    # Reset Index
    artists_df = artists_df.reset_index(drop=True)
    albums_df = albums_df.reset_index(drop=True)
    
    # Clean listen count
    albums_df['album_listens'] = albums_df['album_listens'] \
                                    .apply(lambda x : neg_to_zero(x))
    
    tracks_df['track_listens'] = tracks_df['track_listens'] \
                                    .apply(lambda x : neg_to_zero(x))
    
    # Artist name differs between echonest and the Free Music Archive
    if DEBUG:
        name_comp = not_eq_ign_case(artists_df['artist_name_x'],
                                    artists_df['artist_name_y'])
        artist_name_diff = artists_df[name_comp]
        
        print('# Different artist names: {}\n' \
                  .format(len(artist_name_diff)))

        print('Example:\n{}'.format(artist_name_diff[['artist_name_x',
                                                      'artist_name_y']].head(3)))
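
The cell above relies on `neg_to_zero` and `not_eq_ign_case`, which come from the `helpers` module and are not shown in this notebook. The following is a hypothetical sketch of what such helpers could look like, inferred only from their names and from how they are called; the real implementations may differ.

```python
import pandas as pd


def neg_to_zero(value):
    """Replace a negative count with zero and leave other values unchanged."""
    return 0 if value < 0 else value


def not_eq_ign_case(left, right):
    """Return a boolean Series, True where two names differ ignoring case."""
    return left.str.lower() != right.str.lower()


names_x = pd.Series(['The Band', 'AWOL'])
names_y = pd.Series(['the band', 'A.W.O.L.'])
print(not_eq_ign_case(names_x, names_y).tolist())
print([neg_to_zero(v) for v in (-1, 0, 42)])
```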


In [None]:
# Artist location differs between echonest and the Free Music Archive
if DEBUG and CLEAN_PHASE:
    lat_comp = artists_df['artist_latitude_x'] != artists_df['artist_latitude_y']
    long_comp = artists_df['artist_longitude_x'] != artists_df['artist_longitude_y']
    latlong_comp = lat_comp | long_comp
    latlong_diff = artists_df[latlong_comp]

    print('# Different Lat Long values: {}\n' \
              .format(len(latlong_diff)))

    # Show rows from the location comparison, not from the name comparison
    print('Example:\n{}'.format(latlong_diff[['artist_latitude_x',
                                              'artist_latitude_y',
                                              'artist_longitude_x',
                                              'artist_longitude_y']].head(3)))
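
One caveat about the comparison above: `NaN != NaN` evaluates to `True`, so artists whose location is missing in both sources are counted as different. A NaN-aware comparison could look like the sketch below; the helper name `differs` and the sample coordinates are made up for illustration.

```python
import numpy as np
import pandas as pd


def differs(left, right):
    """True where two numeric Series differ; two missing values count as equal."""
    both_missing = left.isna() & right.isna()
    return (left != right) & ~both_missing


lat_x = pd.Series([40.7, np.nan, np.nan, 51.5])
lat_y = pd.Series([40.7, np.nan, 48.9, 52.0])

print((lat_x != lat_y).sum())       # plain comparison
print(differs(lat_x, lat_y).sum())  # NaN-aware comparison
```

On this sample the plain comparison reports one more difference than the NaN-aware one, because the row that is missing on both sides is no longer counted.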

In [None]:
# Echonest clean phase 
if CLEAN_PHASE:
    # Get city, state, country from artist longitude and latitude
    cities=[]
    states=[]
    countries=[]
    
    for i in range (0, len(artists_df)):
        if np.isnan(artists_df.artist_latitude_y[i]):
            city = np.nan
            state = np.nan
            country = np.nan
        else:
            coordinates = (artists_df.artist_latitude_y[i], 
                           artists_df.artist_longitude_y[i])
            results = rg.search(coordinates, mode=1)
            city = results[0]['name']
            state = results[0]['admin1']
            country = results[0]['cc']
        
        cities.append(city)
        states.append(state)
        countries.append(country)
    
    artists_df.insert(loc=5, column='city', 
                      value=pd.Series(cities))
    artists_df.insert(loc=6, column='state',
                      value=pd.Series(states))
    artists_df.insert(loc=7, column='country', 
                      value=pd.Series(countries))
    

In [None]:
# Write the cleaned DataFrames to pickle files, or read them back if they exist
echonest_df_path = '{dir}{file}'.format(dir=PKL_DIR, 
                                        file='echonest_df.pkl')
genres_df_path = '{dir}{file}'.format(dir=PKL_DIR, 
                                     file='genres_df.pkl')
albums_df_path = '{dir}{file}'.format(dir=PKL_DIR, 
                                     file='albums_df.pkl')
artists_df_path = '{dir}{file}'.format(dir=PKL_DIR, 
                                     file='artists_df.pkl')
tracks_df_path = '{dir}{file}'.format(dir=PKL_DIR, 
                                      file='tracks_df.pkl')

if CLEAN_PHASE:
    genres_df.to_pickle(genres_df_path)
elif os.path.exists(genres_df_path):
    genres_df = pd.read_pickle(genres_df_path)
else:
    print('There is no genres pandas data')
    
if CLEAN_PHASE:
    albums_df.to_pickle(albums_df_path)
elif os.path.exists(albums_df_path):
    albums_df = pd.read_pickle(albums_df_path)
else:
    print('There is no albums pandas data')
    
if CLEAN_PHASE:
    artists_df.to_pickle(artists_df_path)
elif os.path.exists(artists_df_path):
    artists_df = pd.read_pickle(artists_df_path)
else:
    print('There is no artists pandas data')
    
if CLEAN_PHASE:
    tracks_df.to_pickle(tracks_df_path)
elif os.path.exists(tracks_df_path):
    tracks_df = pd.read_pickle(tracks_df_path)
else:
    print('There is no tracks pandas data')
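
The save-or-load logic above repeats the same three branches for every DataFrame. A small helper could express the pattern once. This is a sketch under assumptions: `load_or_build` and its parameters are hypothetical names, and unlike the cell above it rebuilds the data when the file is missing instead of printing a message.

```python
import os
import tempfile

import pandas as pd


def load_or_build(path, build, force=False):
    """Return a cached DataFrame from `path`, rebuilding it when needed.

    `build` is a zero-argument function that produces the DataFrame;
    `force=True` plays the role of CLEAN_PHASE and always rebuilds.
    """
    if force or not os.path.exists(path):
        frame = build()
        frame.to_pickle(path)
        return frame
    return pd.read_pickle(path)


with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo_df.pkl')
    first = load_or_build(demo_path, lambda: pd.DataFrame({'a': [1, 2]}))
    # The second call finds the file and does not invoke the builder
    second = load_or_build(demo_path, lambda: pd.DataFrame({'a': [99]}))
    print(second['a'].tolist())
```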
    

We identify the missing values in artists_df

In [None]:
print('Artists size: {}'.format(len(artists_df)))
if DEBUG: print('\nNaN count by column:\n{}' \
                .format(artists_df.isna().sum(axis=0)))
artists_df.head()


We identify the missing values on albums_df

In [None]:
print('Albums size: {}'.format(len(albums_df)))
if DEBUG: print('\nNaN count by column:\n{}' \
                .format(albums_df.isna().sum(axis=0)))
albums_df.head()


We identify the missing values on echonest_df

In [None]:
print('Echonest size: {}'.format(len(echonest_df)))
if DEBUG: print('\nNaN count by column:\n{}' \
                .format(echonest_df.isna().sum(axis=0)))
echonest_df.head()


We identify the missing values on genres_df

In [None]:
print('Genres size: {}'.format(len(genres_df)))
if DEBUG: print('\nNaN count by column:\n{}' \
                .format(genres_df.isna().sum(axis=0)))
genres_df.head()


We identify the missing values on tracks_df

In [None]:
print('Tracks size: {}'.format(len(tracks_df)))
if DEBUG: print('\nNaN count by column:\n{}' \
                .format(tracks_df.isna().sum(axis=0)))
tracks_df.head()
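
The five checks above all print the same NaN count per column. As a possible refactoring, the count could be wrapped in a helper that also reports percentages, which are easier to compare across tables of different sizes. The function name `nan_report` and the demo data are assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def nan_report(frame):
    """Return per-column missing-value counts and percentages, worst first."""
    counts = frame.isna().sum()
    report = pd.DataFrame({
        'missing': counts,
        'percent': (100 * counts / len(frame)).round(1),
    })
    return report.sort_values('missing', ascending=False)


demo = pd.DataFrame({
    'track_id': [1, 2, 3, 4],
    'valence': [0.5, np.nan, np.nan, 0.9],
    'track_title': ['a', 'b', None, 'd'],
})
print(nan_report(demo))
```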


#### First we will see the top 10 countries where the music is produced.

In [None]:
#Merge data to have the entire data frame
tracks_echonest = echonest_df.merge(tracks_df, left_on='track_id', 
                                    right_on='track_id', how='right')
tracks_echonest.head()


In [None]:
track_artist = tracks_df.merge(artists_df, left_on='artist_id', 
                               right_on='artist_id')
country_grouped = track_artist.groupby(track_artist['country']).size()
country_top10 = country_grouped.sort_values(ascending=False) \
                                     .head(10)
country_top10.plot(kind='bar', 
                   title="Top 10 countries that produce tracks")
plt.ylabel('Number of Tracks')
plt.grid()


#### Top 10 albums

In [None]:
albums_grouped = albums_df.groupby(albums_df['album_id']) \
                          .first()[['album_title', 'album_listens']]
albums_top10 = albums_grouped.sort_values(by='album_listens', 
                                          ascending=False).head(10)

plt.figure(figsize=(12, 6))
plt.title("Top 10 Albums listened")
plt.grid()
ax = sns.barplot(x='album_title', y= 'album_listens',
                 data=albums_top10)
plt.xticks(rotation=90)


#### Top 10 Tracks

In [None]:
tracks_grouped = tracks_df.groupby(tracks_df['track_id']) \
                          .first()[['track_title', 'track_listens']]
tracks_top10 = tracks_grouped.sort_values(by='track_listens', 
                                          ascending=False).head(10)

albums_top10.plot(x='album_title', kind='bar', title="Top 10 Albums")
plt.grid()
plt.figure(figsize=(12, 6))
plt.title("Top 10 tracks listened")
plt.grid()
ax = sns.barplot(x='track_title', y= 'track_listens',
                 data=tracks_top10)
plt.xticks(rotation=45)


In [None]:

# albums_top10.plot(x='album_title', kind='bar', title="Top 10 Albums")
# plt.grid()
# plt.figure(figsize=(15, 8))
# plt.title("Top 10 tracks listened")
# plt.grid()
# ax = sns.barplot(x='track_title', y= 'track_listens',
#                  data=tracks_top10)
tracks_top10.plot(x='track_title', kind='bar', title='Top 10 tracks listened')

#### Relation between danceability and duration of the song

In [None]:
tracks_df.insert(loc=15, column='track_duration_minutes',
                 value=(tracks_df['track_duration'] / 60))

tracks_df['track_duration_minutes'] = pd.to_numeric(tracks_df['track_duration_minutes'])


In [None]:
# intervals = pd.IntervalIndex.from_arrays([0, 2, 4], 
#                                          [2, 4, 1000], closed='left')

# tracks_df['duration_class'] = pd.cut(tracks_df['track_duration_minutes'], 
#                                      bins=intervals)

tracks_df['track_duration_minutes'] = tracks_df['track_duration_minutes'] \
                                                .apply(lambda x : np.rint(x))
sns.regplot(x='danceability', y='track_duration_minutes',
            data=tracks_df, ci=95, 
            line_kws = {'color': 'green'})


In [None]:
tracks_df.head()


In [None]:
# Rows with a missing value would otherwise make the coefficient NaN
spearman_coeff = stats.spearmanr(tracks_df['danceability'], 
                                 tracks_df['track_duration_minutes'],
                                 nan_policy='omit')
spearman_coeff
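
Tracks without Echo Nest features have a NaN `danceability` after the outer merge, and a rank correlation is only meaningful on rows where both values are present. The sketch below makes that step explicit using pandas only, relying on the fact that Spearman's rho is the Pearson correlation of the ranks. The helper name `spearman_complete_cases` and the demo values are assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd


def spearman_complete_cases(frame, x_col, y_col):
    """Spearman correlation using only rows where both columns are present."""
    clean = frame[[x_col, y_col]].dropna()
    # Spearman's rho is the Pearson correlation of the ranks
    rho = clean[x_col].rank().corr(clean[y_col].rank())
    return rho, len(clean)


demo = pd.DataFrame({
    'danceability': [0.1, 0.4, np.nan, 0.7, 0.9],
    'track_duration_minutes': [5, 4, 3, np.nan, 2],
})
rho, n_used = spearman_complete_cases(demo, 'danceability',
                                      'track_duration_minutes')
print(rho, n_used)
```

Returning the number of rows actually used alongside the coefficient makes it easier to judge how much of the dataset supports the result.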


#### Relation between valence and other variables

Valence refers to the degree of positive or negative emotion one perceives from a song. We'll look for relations between this variable and others.

Relation between track_listens and valence:

In [None]:
sns.regplot(x='valence', y='track_listens',
            data=tracks_df, ci=95, 
            line_kws = {'color': 'green'})

As seen in the plot, it is difficult to find a relationship between the valence of a song and its number of listens. We also looked for a relation between valence and energy, and between valence and danceability, and found that these relationships are highly variable and do not show any dependence.

Mean value of valence per genre

In [None]:
genre_valence=tracks_df.groupby('track_genre_top', as_index=False)['valence'].mean()
genre_valence.plot(x='track_genre_top', kind='bar', title='Valence per genre')
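
A mean valence per genre can be misleading when a genre has only a few tracks with a valence value. One way to check this, sketched here on made-up data, is to aggregate the mean together with the number of non-missing values; the demo genres and numbers are purely illustrative.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'track_genre_top': ['Rock', 'Rock', 'Jazz', 'Jazz', 'Folk'],
    'valence': [0.2, 0.6, 0.8, np.nan, 0.5],
})

# count() ignores NaN, so it shows how many tracks back each mean
genre_summary = (demo.groupby('track_genre_top')['valence']
                     .agg(['mean', 'count'])
                     .sort_values('mean', ascending=False))
print(genre_summary)
```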

In [None]:
track_album = tracks_df.merge(albums_df, left_on='album_id',
                              right_on='album_id')

genre_year = track_album[['track_genre_top', 'album_date_released']].copy()
genre_year.insert(loc=2, column='album_released_year',
                  value=(genre_year['album_date_released'].dt.year))
genre_year.head()


In [None]:
top_genre = genre_year.groupby(genre_year['track_genre_top']).size()

In [None]:
top_genre_sorted = top_genre.sort_values(ascending=False).head(10)
top_genre_sorted.plot(kind='bar', title="Top 10 Genres")
plt.grid()


In [None]:
track_album.sort_values(by='album_date_released', ascending=False)

In [None]:
track_album.sort_values(by='album_date_created')

### B) Loading and analyzing Sentiment140

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import reverse_geocoder as rg
import os.path
import ast
import seaborn as sns

from helpers import *
from datetime import datetime, date, time
from scipy import stats

%load_ext autoreload
%autoreload 2

In [None]:
# Constants
DATA_DIR = './data/'

In [None]:
tweets_col_names=['sentiment', 'ID', 'Date',
                        'user', 'text']

tweets_dtypes = {'sentiment': int, 'ID': int, 
                       'Date': str, 'user': str,
                       'text': str }

# DATA_DIR already ends with a slash
tweets_df = pd.read_csv(DATA_DIR + 'tweets.csv', names=tweets_col_names,
                        dtype=tweets_dtypes,
                        usecols=[0, 1, 2, 4, 5], encoding='latin1')
# tweets_df['Date'] = pd.to_datetime(tweets_df['Date'])

### C) Find relationships between both datasets

In [None]:
# Case-insensitive match, so a tweet containing both spellings is counted once
tweets_spotify = tweets_df[tweets_df.text.str.contains('spotify', case=False)]
num_tweets_spotify = tweets_spotify.count()
num_tweets_spotify[0]

In [None]:
num_tweets_total=tweets_df.count()
num_tweets_total[0]

In [None]:
tweets_spotify_percentage = num_tweets_spotify[0]*100/num_tweets_total[0]
tweets_spotify_percentage

Less than 0.017% of the tweets mention Spotify, so there is not enough data in our training set to build a confident model.

#### Because too few tweets in the training set relate to Spotify, we conclude that this approach is not feasible. We therefore decide to work with the music dataset and relate it to an important event of the last decade, to analyze whether and how music played a role in that event.

### D) Redefine Project Scope

As we can see in the music dataset, most of it comes from artists in the US, and one of the biggest events of the last decade was the 2016 United States presidential election, in which Donald Trump was elected president.
With this in mind, the new research question is the following:
##### By analyzing the music dataset from 2012 to 2016, find the relation that music may have had with this event, looking at the top genres and top tracks of that period and, most importantly, the valence and energy of the songs, to try to find out how people in the US felt at that specific time.

##### The next step will be to compare this data with music records from the end of 2016 onward, which we will obtain through the Spotify API, and try to find out how people feel in this period. If possible, we will also provide a prediction of what kind of music (genre, energy, valence) people in the US will listen to in the following years.

The election took place in November 2016, so the pre-Trump period covers 5 years of music data (2012 to 2016).

In [None]:
pre_trump = track_album[(track_album.album_date_created.dt.year >= 2012) & (track_album.album_date_created.dt.year <= 2016)]
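
The year-based filter above also keeps albums created in late November and December 2016, after the election. If a sharper boundary is wanted, the split could use an explicit cutoff date instead. This is a sketch on made-up rows; the constant names and the choice of 1 January 2012 as the start of the period are assumptions.

```python
import pandas as pd

# 8 November 2016 is the date of the US presidential election
ELECTION_DAY = pd.Timestamp('2016-11-08')
PERIOD_START = pd.Timestamp('2012-01-01')

demo = pd.DataFrame({
    'track_id': [1, 2, 3, 4],
    'album_date_created': pd.to_datetime(
        ['2011-06-01', '2013-03-15', '2016-11-07', '2016-12-24']),
})

created = demo['album_date_created']
pre_election = demo[(created >= PERIOD_START) & (created < ELECTION_DAY)]
post_election = demo[created >= ELECTION_DAY]
print(pre_election['track_id'].tolist(), post_election['track_id'].tolist())
```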

In [None]:
pre_trump.album_date_created

In [None]:
pre_trump

Top 10 tracks

In [None]:
pre_trump_grouped = pre_trump.groupby('track_id') \
                          .first()[['track_title', 'track_listens']]
pre_trump_tracks_top10 = pre_trump_grouped.sort_values(by='track_listens', 
                                          ascending=False).head(10)


fig, axs = plt.subplots(ncols=2,  constrained_layout=True, figsize=(12,5))

f1=sns.barplot(x='track_title', y= 'track_listens',data=tracks_top10, ax=axs[0])
f1.set_xticklabels(f1.get_xticklabels(), rotation=90)
f1.set_title("Top Tracks")

f2=sns.barplot(x='track_title', y= 'track_listens',data=pre_trump_tracks_top10,  ax=axs[1])
f2.set_xticklabels(f2.get_xticklabels(), rotation=90)
f2.set_title("Pre trump Top Tracks")
fig.tight_layout()
plt.show()


In [None]:
pre_trump_albums_grouped = pre_trump.groupby(pre_trump['album_id']) \
                          .first()[['album_title', 'album_listens']]
pre_trump_albums_top10 = pre_trump_albums_grouped.sort_values(by='album_listens', 
                                          ascending=False).head(10)

fig, axs = plt.subplots(ncols=2,  constrained_layout=True, figsize=(12,5))

f1=sns.barplot(x='album_title', y= 'album_listens',data=albums_top10, ax=axs[0])
f1.set_xticklabels(f1.get_xticklabels(), rotation=90)
f1.set_title("Top Albums")

f2=sns.barplot(x='album_title', y= 'album_listens',data=pre_trump_albums_top10,  ax=axs[1])
f2.set_xticklabels(f2.get_xticklabels(), rotation=90)
f2.set_title("Pre trump Top Albums")
fig.tight_layout()
plt.show()

