# ADA 2018 - Evolution of Music

#### Andres Montero, Ariel Alba, Diego Iriarte




This project aims to find relationships between tweets and music. Specifically, we focus on tweets that reflect a positive sentiment or mood and relate them to the kind of music those users listen to. We believe that music preferences vary with mood, so we want to detect what kind of music people tend to listen to when they feel happy or post tweets with positive sentiment, which we treat as equivalent in this project.
To do so, we work with two datasets:

* FMA: A Dataset For Music Analysis
* Sentiment140 dataset with 1.6 million tweets

The tweets dataset is labeled: each tweet is marked as having a positive sentiment or not.
To decide whether the proposed idea is feasible, we follow these steps:

**A)** Preprocessing and analyzing the FMA dataset. The dataset is very extensive, so we first select only the data needed for our study. We clean the data, define the variable types, normalize them and deal with missing values. From the cleaned data we derive some statistics, such as the most listened songs and albums. We also look for relationships between features of the dataset such as danceability, energy, number of listens and location.

**B)** Loading and analyzing the Sentiment140 dataset. We focus only on tweets labeled as positive sentiment. The dataset contains little information about each tweet (date, sentiment, user ID and tweet text) and is limited to tweets from 2009. Before doing any further preprocessing of this dataset, we decided to check which relationships we could find with the FMA dataset.

**C)** Finding relationships between both datasets. It is important to identify whether the idea of the project is feasible or not. For the idea to be feasible, we need to find a considerable number of positive tweets that are somehow related to music or songs. To measure these relationships, our approach is the following:

* Count how many tweets are linked to a song.
* Count how many tweets explicitly mention Spotify.
* Find music-related words contained in the tweets (this may be a difficult task).

**D)** If our first two approaches to relating the datasets do not succeed, we may not be able to try the third one due to time constraints and task difficulty. In that scenario, the project scope will be reduced and will focus mostly on the FMA dataset and on a specific event in past years that may have had an impact on music production and preferences. That scenario will be defined according to the insights found in part A.
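The first two counts listed in step C could be sketched roughly as follows. This is only an illustration on made-up tweets: the column name `text` and the sample rows are assumptions, and a URL is at best a proxy for "linked to a song", since a link can point anywhere.

```python
import pandas as pd

# Hypothetical sample standing in for the positive tweets of Sentiment140;
# the column name 'text' is an assumption.
tweets = pd.DataFrame({'text': [
    'Listening to this on Spotify all day http://open.spotify.com/track/abc',
    'What a sunny morning!',
    'New song on repeat http://bit.ly/xyz',
    'spotify just made my day',
]})

# Any http(s) URL counts as a link in this sketch
URL_PATTERN = r'https?://\S+'

has_link = tweets['text'].str.contains(URL_PATTERN, regex=True)
# Case-insensitive plain substring match on the word 'spotify'
mentions_spotify = tweets['text'].str.contains('spotify', case=False,
                                               regex=False)

link_count = int(has_link.sum())
spotify_count = int(mentions_spotify.sum())
print('Tweets with a link: {}'.format(link_count))
print('Tweets mentioning Spotify: {}'.format(spotify_count))
```

Comparing these counts with the total number of positive tweets gives a first idea of whether enough music-related tweets exist.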


In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import reverse_geocoder as rg
import os.path
import ast
import seaborn as sns

from helpers import *
from datetime import datetime, date, time
from scipy import stats

%load_ext autoreload
%autoreload 2


### A) Preprocessing and Analyzing FMA Dataset

The first part consists of loading the data needed for the analysis, cleaning it and storing it in dataframes so we can work with it further.

First, we will define some constants and paths that will help us later.

In [2]:
# Constants
DATA_DIR = '../data/'
MUSIC_DIR = '{dir}{file}/'.format(dir=DATA_DIR, 
                                  file='fma_metadata')
PKL_DIR = '{dir}{file}/'.format(dir=DATA_DIR, 
                               file='pkl')

# True if we want to execute the clean phase and force
# saving the cleaned files
CLEAN_PHASE = True
DEBUG = True


In [3]:
if CLEAN_PHASE:
    echonest_path = '{dir}{file}'.format(dir=MUSIC_DIR, 
                                         file='echonest.csv')
    features_path = '{dir}{file}'.format(dir=MUSIC_DIR,
                                         file='features.csv')
    genres_path = '{dir}{file}'.format(dir=MUSIC_DIR,
                                       file='genres.csv')
    tracks_path = '{dir}{file}'.format(dir=MUSIC_DIR,
                                       file='tracks.csv')
    

Looking at the FMA dataset, we find that the metadata CSV files are full of information that is unnecessary for our analysis.

Thus, we define which variables, variable types and column names we are going to extract from the FMA CSV files.

In [4]:
if CLEAN_PHASE:
    # Load datasets into pandas dataframes
    echonest_col_names=['track_id', 'danceability', 'energy',
                        'valence', 'artist_latitude',
                        'artist_longitude', 'artist_name',
                        'artist_discovery', 'artist_family',
                        'artist_hotness', 'song_currency', 
                        'song_hotness']
    
    echonest_dtypes = {'track_id': int, 'danceability': float, 
                       'energy': float, 'valence': float,
                       'artist_latitude': float, 'artist_longitude': float,
                       'artist_name': str, 'artist_discovery': float, 
                       'artist_family': float, 'artist_hotness': float,
                       'song_currency': float, 'song_hotness': float}
    
    echonest_df = pd.read_csv(echonest_path, names=echonest_col_names,
                              header=3, dtype=echonest_dtypes, 
                              usecols=[0, 2, 3, 8, 11, 13, 14,
                                       21, 22, 23, 24, 25])

    genres_dtypes = {'genre_id': int, '#tracks': int, 
                     'parent': int, 'top_level': int}
    
    genres_df = pd.read_csv(genres_path, dtype=genres_dtypes)
    
    track_col_names = ['track_id', 'album_date_created',
                       'album_date_released', 'album_id',
                       'album_listens', 'album_title', 
                       'artist_id', 'artist_latitude',
                       'artist_longitude', 'artist_name',
                       'track_duration', 'track_genre_top',
                       'track_genres_all', 'track_language', 
                       'track_listens', 'track_tags', 
                       'track_title']

    tracks_dtypes = {'track_id': int, 'album_date_created': str,
                     'album_date_released': str, 
                     'album_id': int, 'album_listens': int, 
                     'album_title': str, 'artist_id': int,
                     'artist_latitude': float, 'artist_longitude': float,
                     'artist_name': str, 'track_duration': int, 
                     'track_genre_top': str, 'track_genres_all': str, 
                     'track_language': str, 'track_listens': int,
                     'track_tags': str, 'track_title': str}

    tracks_df = pd.read_csv(tracks_path, names=track_col_names,
                            header=2, usecols=[0, 2, 3, 6, 8, 11,
                                               21, 22, 24, 26, 38, 
                                               40, 41, 45, 47, 51,
                                               52])
    track_genre_rel_df = pd.DataFrame(columns=['track_id', 'genre_id'])
    

At this point we have 3 dataframes (echonest, tracks and genres), each corresponding to one of the CSV files.
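As a small illustration of the loading step above, the combination of `names`, `header`, `usecols` and `dtype` can be tried on an in-memory CSV. The miniature file below is made up and only loosely imitates the multi-row headers of the FMA metadata files.

```python
import io
import pandas as pd

# Hypothetical miniature CSV with two header rows (rows 0 and 1)
raw = io.StringIO(
    'group,track,track,artist\n'
    'field,title,listens,name\n'
    '2,Food,1293,AWOL\n'
    '3,Electric Ave,514,AWOL\n'
)

col_names = ['track_id', 'track_listens', 'artist_name']
dtypes = {'track_id': int, 'track_listens': int, 'artist_name': str}

# header=1 marks row 1 as the last header row, so data starts at row 2;
# names replaces the original labels of the three selected columns.
sample_df = pd.read_csv(raw, names=col_names, header=1,
                        usecols=[0, 2, 3], dtype=dtypes)
print(sample_df)
```

Selecting columns by position keeps the call short, but it depends on the column order of the file, so it is worth checking the result after each dataset update.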

Now it is time to clean the data and bring some of the information into a more suitable representation for further analysis.

We start by cleaning the tracks dataframe.
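The string-to-list conversion and the track-genre table built in the next cell can be illustrated on a toy example. The rows below are made up; only the column names follow the notebook.

```python
import ast
import pandas as pd

# Hypothetical rows; in the CSV, list-valued columns arrive as strings.
sample_tracks = pd.DataFrame({
    'track_id': [2, 10, 20],
    'track_genres_all': ['[21]', '[10]', '[76, 103]'],
})

# Parse each string into a real Python list
sample_tracks['track_genres_all'] = (
    sample_tracks['track_genres_all'].apply(ast.literal_eval))

# Emit one row per (track, genre) pair
pairs = [{'track_id': track_id, 'genre_id': genre_id}
         for track_id, genre_ids in zip(sample_tracks['track_id'],
                                        sample_tracks['track_genres_all'])
         for genre_id in genre_ids]
track_genre_rel = pd.DataFrame(pairs, columns=['track_id', 'genre_id'])
print(track_genre_rel)
```

In pandas 0.25 or later, `DataFrame.explode` offers a similar flattening of a list-valued column.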

In [5]:
# Tracks clean phase
if CLEAN_PHASE:
    
    # Transform str to datetime
    tracks_df['album_date_released'] = pd.to_datetime(
        tracks_df['album_date_released'])
    tracks_df['album_date_created'] = pd.to_datetime(
        tracks_df['album_date_created'])
    # Transform str to list
    tracks_df['track_tags'] = tracks_df['track_tags'] \
                                .apply(ast.literal_eval)
    tracks_df['track_genres_all'] = tracks_df['track_genres_all'] \
                                        .apply(ast.literal_eval)
    
    # Generate track-genre relational dataframe
    track_genre_tuple = list(zip(tracks_df['track_id'], 
                             tracks_df['track_genres_all']))
    track_genre_list = [{'track_id': track_id, 'genre_id': genre_id} 
                         for track_id, genres_id in track_genre_tuple
                         for genre_id in genres_id]

    # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
    track_genre_rel_df = pd.concat([track_genre_rel_df,
                                    pd.DataFrame(track_genre_list)],
                                   ignore_index=True,
                                   sort=False)
    

We notice that the tracks dataframe also contains information about artists and albums, and that echonest has information for each track. To organize the data better and avoid redundancy, we join tracks and echonest, then split the joined table into 3 new dataframes: tracks, albums and artists.
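The join-then-split idea can be sketched on two tiny tables. The data is invented; the point is only to show where the `_x` and `_y` suffixes come from and how one row per artist is obtained.

```python
import pandas as pd

# Hypothetical miniature tables; column names follow the notebook.
echonest = pd.DataFrame({
    'track_id': [2, 3],
    'artist_name': ['AWOL', 'AWOL'],
    'energy': [0.63, 0.82],
})
tracks = pd.DataFrame({
    'track_id': [2, 3, 10],
    'artist_id': [1, 1, 6],
    'artist_name': ['AWOL', 'AWOL', 'Kurt Vile'],
    'track_title': ['Food', 'Electric Ave', 'Freeway'],
})

# Columns present in both inputs get the _x (left) and _y (right) suffixes.
joined = echonest.merge(tracks, on='track_id', how='outer')

# One row per artist: keep the artist columns, then drop repeated ids.
artists = (joined[['artist_id', 'artist_name_x', 'artist_name_y']]
           .drop_duplicates('artist_id')
           .reset_index(drop=True))
print(artists)
```

With an outer join, tracks that are missing from one source are kept and simply carry NaN in that source's columns, as happens here for the third track.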

In [6]:
if CLEAN_PHASE:
    # Normalize dataframes to take out redundancy
    # Join echonest data to tracks
    echo_tracks = echonest_df.merge(tracks_df, left_on='track_id', 
                                    right_on='track_id', how='outer')
    
    print(len(echo_tracks))
     
    # Select main columns to create new df
    artists_df = echo_tracks[['artist_id', 'artist_name_x',
                              'artist_latitude_x', 'artist_longitude_x', 
                              'artist_discovery', 'artist_family', 
                              'artist_hotness', 'artist_latitude_y',
                              'artist_longitude_y', 'artist_name_y']].copy()

    albums_df = echo_tracks[['album_id', 'album_date_created',
                             'album_date_released', 'album_title',
                             'album_listens']].copy()
    
    tracks_df = echo_tracks[['track_id', 'track_title', 
                             'track_duration', 'artist_id', 
                             'album_id', 'track_genre_top',
                             'track_genres_all','track_language',
                             'track_listens', 'track_tags',
                             'danceability', 'energy',
                             'valence', 'song_currency',
                             'song_hotness']].copy()
    
    # Drop duplicates
    artists_df = artists_df.drop_duplicates('artist_id')
    albums_df = albums_df.drop_duplicates('album_id')
    
    # Reset Index
    artists_df = artists_df.reset_index(drop=True)
    albums_df = albums_df.reset_index(drop=True)
    
    # Clean listen count
    albums_df['album_listens'] = albums_df['album_listens'] \
                                    .apply(lambda x : neg_to_zero(x))
    
    tracks_df['track_listens'] = tracks_df['track_listens'] \
                                    .apply(lambda x : neg_to_zero(x))
    
    
    # Include genre_parent column in genres_df
    genres_df1=genres_df.set_index('genre_id')
    new_genre = [genres_df1.at[int(top_genre), 'title'] for top_genre in genres_df.top_level]
    genres_df=genres_df.assign(genre_parent=new_genre)
    
    # Artist name different on echonest and Free Music Archive
    if DEBUG:
        name_comp = not_eq_ign_case(artists_df['artist_name_x'],
                                    artists_df['artist_name_y'])
        artist_name_diff = artists_df[name_comp]
        
        print('# Different artist names: {}\n' \
                  .format(len(artist_name_diff)))

        print('Example:\n{}'.format(artist_name_diff[['artist_name_x',
                                                      'artist_name_y']].head(3)))


106574
# Different artist names: 13797

Example:
                        artist_name_x                    artist_name_y
2   Alec K. Redfearn and the Eyesores  Alec K. Redfearn & the Eyesores
26                      From Quagmire                        Dora Bleu
31                     Marc A. Pullen                          Fanatic
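The cell above relies on `neg_to_zero` and `not_eq_ign_case` from the project's `helpers` module, which is not shown in this notebook. The stand-ins below are hypothetical, written only from what the function names and the printed output suggest; the real helpers may differ.

```python
import pandas as pd

def neg_to_zero(value):
    """Hypothetical stand-in: replace a negative count with 0."""
    return 0 if value < 0 else value

def not_eq_ign_case(left, right):
    """Hypothetical stand-in: True where two string Series differ,
    ignoring case. A missing value on either side counts as different."""
    return left.str.lower() != right.str.lower()

# Usage on made-up values
names_x = pd.Series(['AWOL', 'From Quagmire'])
names_y = pd.Series(['awol', 'Dora Bleu'])
name_mask = not_eq_ign_case(names_x, names_y)
print(name_mask.tolist())
print(neg_to_zero(-1), neg_to_zero(5))
```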


Finally, we have 4 main dataframes: genres, artists, albums and tracks.

The FMA dataset (https://lts2.epfl.ch/datasets/fma/) was gathered by joining information from several music API services such as Echonest (now Spotify), LastFM, MusicBrainz and others, so some columns contain redundant data.

In [7]:
# Artist location different on echonest and Free Music Archive
if DEBUG and CLEAN_PHASE:
    lat_comp = artists_df['artist_latitude_x'] != artists_df['artist_latitude_y']
    long_comp = artists_df['artist_longitude_x'] != artists_df['artist_longitude_y']
    latlong_comp = lat_comp | long_comp
    latlong_diff = artists_df[latlong_comp]

    print('# Different Lat Long values: {}\n' \
              .format(len(latlong_diff)))

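    # Note: the sample rows are taken from artist_name_diff (built in the
    # previous cell), which is why they match the rows printed there
    print('Example:\n{}'.format(artist_name_diff[['artist_latitude_x',
                                                  'artist_latitude_y',
                                                  'artist_longitude_x',
                                                  'artist_longitude_y']].head(3)))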

# Different Lat Long values: 16339

Example:
    artist_latitude_x  artist_latitude_y  artist_longitude_x  \
2             41.8239          41.823989            -71.4120   
26                NaN                NaN                 NaN   
31            40.7978                NaN            -81.3716   

    artist_longitude_y  
2           -71.412834  
26                 NaN  
31                 NaN  
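Part of the count above may come from two effects that the example rows hint at: `NaN != NaN` evaluates to True, and the two sources store coordinates with different precision. A tolerant comparison, sketched here on the latitude values printed above, separates those cases from real disagreements. The tolerance of 1e-3 degrees is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Latitude values taken from the example rows printed above
coords = pd.DataFrame({
    'artist_latitude_x': [41.8239, np.nan, 40.7978],
    'artist_latitude_y': [41.823989, np.nan, np.nan],
})

# Plain inequality flags all three rows as different
strict_diff = coords['artist_latitude_x'] != coords['artist_latitude_y']

# Tolerant comparison: equal within 1e-3 degrees, and NaN equals NaN
close = np.isclose(coords['artist_latitude_x'].values,
                   coords['artist_latitude_y'].values,
                   rtol=0, atol=1e-3, equal_nan=True)
tolerant_diff = ~close

print(strict_diff.tolist())
print(tolerant_diff.tolist())
```

Under this comparison only the third row, where one source has a value and the other has none, still counts as different.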


Some columns also encode spatial information. To handle them, we used the `reverse_geocoder` package to retrieve the corresponding location (city, state, country) from the coordinates.

In [8]:
if CLEAN_PHASE:
    # Getting city, state, country from artist longitude and latitude
    cities=[]
    states=[]
    countries=[]
    
    for i in range(len(artists_df)):
        if np.isnan(artists_df.artist_latitude_y[i]):
            city = np.nan
            state = np.nan
            country = np.nan
        else:
            coordinates = (artists_df.artist_latitude_y[i], 
                           artists_df.artist_longitude_y[i])
            results = rg.search(coordinates, mode=1)
            city = results[0]['name']
            state = results[0]['admin1']
            country = results[0]['cc']
        
        cities.append(city)
        states.append(state)
        countries.append(country)
    
    artists_df.insert(loc=5, column='city', 
                      value=pd.Series(cities))
    artists_df.insert(loc=6, column='state',
                      value=pd.Series(states))
    artists_df.insert(loc=7, column='country', 
                      value=pd.Series(countries))
    

Loading formatted geocoded file...


#### Dataset from Spotify

In [9]:
# Data from Spotify and LastFM API with updated information
YEARS = [2015, 2016, 2017, 2018]
dfs = {}

for year in YEARS:
    albums_year_path = '{dir}albums_{year}_df.pkl'.format(dir=PKL_DIR,
                                                          year=year)
    artists_year_path = '{dir}artists_{year}_df.pkl'.format(dir=PKL_DIR,
                                                            year=year)
    tracks_year_path = '{dir}tracks_{year}_df.pkl'.format(dir=PKL_DIR,
                                                          year=year)

    dfs[year] = {'tracks': pd.read_pickle(tracks_year_path),
                 'albums': pd.read_pickle(albums_year_path),
                 'artists': pd.read_pickle(artists_year_path)}
    

In [10]:
if CLEAN_PHASE:
    tracks_post = pd.DataFrame(columns=tracks_df.columns) 
    albums_post = pd.DataFrame(columns=albums_df.columns)
    artists_post = pd.DataFrame(columns=artists_df.columns)
    result = []
    
    for index, year in enumerate(YEARS):
        tracks = dfs[year]['tracks']
        albums = dfs[year]['albums']
        artists = dfs[year]['artists']
        # Sample
        if(year == 2015):
            tracks = tracks.sample(frac=0.375)
            albums = albums.sample(frac=0.10)
            artists = artists.sample(frac=0.175)
        elif(year == 2016):
            tracks = tracks.sample(frac=0.50)
            albums = albums.sample(frac=0.125)
            artists = artists.sample(frac=0.175)
        elif(year == 2017):
            tracks = tracks.sample(frac=0.80)
            albums = albums.sample(frac=0.20)
            artists = artists.sample(frac=0.175)
        elif(year == 2018):
            tracks = tracks.sample(frac=0.90)
            albums = albums.sample(frac=0.225)
            artists = artists.sample(frac=0.175)
        
        # New tracks cleaning
        # Transform duration from milliseconds to minutes
        # Note: the FMA track_duration values printed further below (e.g. 168.0)
        # appear to be in seconds, so the units differ once the data are merged
        tracks['track_duration'] = tracks['track_duration'] / 60000

        # Normalize song_hotness to a value between 0 and 1
        tracks['song_hotness'] = tracks['song_hotness'] / 100

        # New albums cleaning
        albums['album_date_released'] = pd.to_datetime(
            albums['album_date_released'])

        # New artist cleaning
        # Normalize artist_hotness to a value between 0 and 1
        artists['artist_hotness'] = artists['artist_hotness'] / 100

        # Merge the datasets of all years to have just one dataset for tracks,
        # albums and artists
        tracks_post = pd.concat([tracks_post, tracks], axis=0, 
                                sort=False)
        artists_post = pd.concat([artists_post, artists], axis=0, 
                                 sort=False)
        albums_post = pd.concat([albums_post, albums], axis=0, 
                                sort=False)
        
    # Drop duplicates
    tracks_post.drop_duplicates('track_id', inplace=True)
    artists_post.drop_duplicates('artist_id', inplace=True)
    albums_post.drop_duplicates('album_id', inplace=True)
    

    
    # Generate relational dataframe track-genres
    # Look at genres on tags and add them to track_genres_all
    genre_list = genres_df['title'].tolist()
    genres = [genre.lower() for genre in genre_list]

    # Build the genre set once instead of on every iteration
    genre_set = set(genres)

    for _, row in tracks_post.iterrows():
        tag_set = set(row['track_tags'])
        genres_from_tags = list(genre_set.intersection(tag_set))
        track_uniq_genres = uniq(row['track_genres_all'] + genres_from_tags)
        result.extend({'track_id': row['track_id'], 'genre_name': genre}
                      for genre in track_uniq_genres)
    
    genres_merge = genres_df.copy()
    genres_merge['title'] = genres_merge['title'].str.lower()
    track_genre_name = pd.DataFrame(data=result)
    track_genre_name['genre_name'] = track_genre_name['genre_name'].str.lower()
    track_genre_rel_post = track_genre_name.merge(genres_merge, left_on='genre_name',
                                                right_on='title')[['track_id', 'genre_id']]
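In the cell above, genre titles are lowercased before the intersection while the tags are compared as they are, so a tag such as `Hip-Hop` would not match the genre `hip-hop`. If case-insensitive matching is what is intended, a minimal sketch could lowercase both sides. The titles and tags below are made up for illustration.

```python
# Hypothetical genre titles and tags, for illustration only
genre_titles = ['Hip-Hop', 'Jazz', 'Folk', 'Rock']
track_tags = ['Hip-Hop', 'rnb', '2015', 'JAZZ', 'favorite songs']

# Lowercase both sides so that 'Hip-Hop' and 'hip-hop' compare equal
genre_set = {title.lower() for title in genre_titles}
tag_set = {tag.lower() for tag in track_tags}

# Sorted only to make the printed order deterministic
genres_from_tags = sorted(genre_set & tag_set)
print(genres_from_tags)
```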


In [11]:
tracks_post

Unnamed: 0,track_id,track_title,track_duration,artist_id,album_id,track_genre_top,track_genres_all,track_language,track_listens,track_tags,danceability,energy,valence,song_currency,song_hotness,spoti_track_id,spoti_album_id,spoti_artist_id
64851,32WCcSindDy2AKj1IdV2aE,Lover,2.5571,27hSR8e34ZM5vj5fUFixyb,0QlGRL8MgEErqJW4a02iWk,,[],,9926,"[jazz, favorite songs, bebop, jazz instrumenta...",0.393,0.51100,0.2740,,0.03,32WCcSindDy2AKj1IdV2aE,0QlGRL8MgEErqJW4a02iWk,27hSR8e34ZM5vj5fUFixyb
84465,0aOoS0SpzGRnDXmuVbiCPZ,Mighty Mouse,0.528667,7pZPJvNcF3seJ8lHPfNwEA,66jOHevrhZbt8CWLmERUA9,,[],,2,[],0.440,0.47000,0.9650,,0.01,0aOoS0SpzGRnDXmuVbiCPZ,66jOHevrhZbt8CWLmERUA9,7pZPJvNcF3seJ8lHPfNwEA
4042,6g7amHSbqmmk7fpViGqXb3,Wolves,4.1051,099toTcKJoywTosZr2hHjy,77UiJMD9OVYj2YXr2gO9L5,,[],,5677,"[folk, acoustic, americana, afternoon, morning...",0.740,0.45300,0.6100,,0.26,6g7amHSbqmmk7fpViGqXb3,77UiJMD9OVYj2YXr2gO9L5,099toTcKJoywTosZr2hHjy
82084,60OwwNiAIeoe8533hk0USo,Eyes,4.08725,1Cb9Edoxp02mgY0VWVlwQo,05dDeQ6BITiw2hjuXxjBcU,,[],,10539,[],0.501,0.81400,0.2320,,0.19,60OwwNiAIeoe8533hk0USo,05dDeQ6BITiw2hjuXxjBcU,1Cb9Edoxp02mgY0VWVlwQo
55396,41eHsMDXaqizJxLYI9AkFa,Entre ciel et terre,2.63177,6bebPrkEPr2xEkiHP3QkNT,3jz23zr9FLsqJAC9hNpbxX,,[],,4730,[],0.797,0.44400,0.8770,,0.19,41eHsMDXaqizJxLYI9AkFa,3jz23zr9FLsqJAC9hNpbxX,6bebPrkEPr2xEkiHP3QkNT
55052,3QesnFSEEgZ1wwyox1ne1f,Not Tomorrow (Silent Hill),3.25527,7zdmbPudNX4SQJXnYIuCTC,218XV2o9997yzQZr2NA5Xr,,[],,1654,[],0.387,0.77300,0.0650,,0.21,3QesnFSEEgZ1wwyox1ne1f,218XV2o9997yzQZr2NA5Xr,7zdmbPudNX4SQJXnYIuCTC
70395,1PXIsSyC3YSAtuI0U9dTRk,Fragile Line,3.9251,7lAi1Cv19DsukgGjbZQxFg,1KTtC7NK3AaIcQpOmC1NsH,,[],,23304,"[post-hardcore, metalcore, rock, emo, Progress...",0.442,0.99600,0.0704,,0.23,1PXIsSyC3YSAtuI0U9dTRk,1KTtC7NK3AaIcQpOmC1NsH,7lAi1Cv19DsukgGjbZQxFg
19831,530J9GupahE2O4a1iVkQxB,Classic Man - Remix,3.91743,4TsHKU8l8Wq7n7OPVikirn,0b2j47HuTRqlzvROQktVtg,,[],,82309,"[Hip-Hop, rnb, 10s, 2015, kendrick lamar, rnba...",0.851,0.65700,0.6440,,0.52,530J9GupahE2O4a1iVkQxB,0b2j47HuTRqlzvROQktVtg,4TsHKU8l8Wq7n7OPVikirn
75702,0qNbCmjavn69YUdV4c1BDn,Vacations - 2010 Demo,12.4197,2Dr2zHoPEuASW1kOODc5I2,0mxli2Cs2vyf9PQenKM70L,,[],,212,[],0.264,0.97700,0.3200,,0.03,0qNbCmjavn69YUdV4c1BDn,0mxli2Cs2vyf9PQenKM70L,2Dr2zHoPEuASW1kOODc5I2
20847,5HhQMt1RKJZCY1FbNw6Od6,Drive Me Crazy,4.80282,6DvlemMOlMH6VD9tB7BPRf,2iW0Ey7bbLqOGicMDPo4yM,,[],,72,[],0.625,0.69200,0.3230,,0.28,5HhQMt1RKJZCY1FbNw6Od6,2iW0Ey7bbLqOGicMDPo4yM,6DvlemMOlMH6VD9tB7BPRf


In [12]:
if CLEAN_PHASE:
    albums_df = pd.concat([albums_df, albums_post], axis=0,
                          sort=False)
    albums_df.drop_duplicates('album_id', inplace=True)
    albums_df.drop(columns=['spoti_album_id'], inplace=True)
    albums_df['album_listens'] = pd.to_numeric(albums_df['album_listens'])
    
    
    artists_df = pd.concat([artists_df, artists_post], axis=0, 
                           sort=False)
    artists_df.drop_duplicates('artist_id', inplace=True)
    artists_df.drop(columns=['spoti_artist_id', 'artist_name_x', 
                             'artist_longitude_x', 'artist_latitude_x'], 
                    inplace=True)
    artists_df.rename({'artist_latitude_y': 'artist_latitude', 
                       'artist_longitude_y': 'artist_longitude',
                       'artist_name_y': 'artist_name'}, 
                      axis='columns', 
                      inplace=True)
    
    
    tracks_df = pd.concat([tracks_df, tracks_post], axis=0, 
                           sort=False)
    tracks_df.drop_duplicates('track_id', inplace=True)
    tracks_df.drop(columns=['spoti_track_id', 'spoti_album_id',
                            'spoti_artist_id', 'track_genres_all', 
                            'track_genre_top'], inplace=True)
    tracks_df['track_duration'] = pd.to_numeric(tracks_df['track_duration'])
    tracks_df['track_listens'] = pd.to_numeric(tracks_df['track_listens'])
    tracks_df['song_hotness'] = pd.to_numeric(tracks_df['song_hotness'])
    
    
    track_genre_rel_df = pd.concat([track_genre_rel_df, track_genre_rel_post], 
                                   axis=0, sort=False)
    track_genre_rel_df.drop_duplicates(subset=['track_id', 'genre_id'], 
                                       inplace=True)
    track_genre_rel_df['genre_id'] = pd.to_numeric(track_genre_rel_df['genre_id'])
    

Finally, after all preprocessing steps, we can save the information to be used for our analysis.
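The save-or-load pattern that the next cell repeats for each dataframe could also be folded into one function. The helper name `save_or_load` is hypothetical, and the sketch writes to a temporary directory rather than to the real `PKL_DIR`.

```python
import os
import tempfile
import pandas as pd

def save_or_load(df, path, clean_phase, label):
    """Hypothetical helper mirroring the if/elif/else pattern of the notebook.

    Saves df when clean_phase is True; otherwise loads the pickle at path
    if it exists. Returns the dataframe, or None when nothing is available.
    """
    if clean_phase:
        df.to_pickle(path)
        return df
    if os.path.exists(path):
        return pd.read_pickle(path)
    print('There is no {} pandas data'.format(label))
    return None

# Usage with a temporary directory instead of the real PKL_DIR
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'genres_df.pkl')
    demo_df = pd.DataFrame({'genre_id': [1, 2],
                            'title': ['Avant-Garde', 'International']})
    saved = save_or_load(demo_df, demo_path, clean_phase=True,
                         label='genres')
    loaded = save_or_load(None, demo_path, clean_phase=False,
                          label='genres')
    missing = save_or_load(None, os.path.join(tmp_dir, 'nope.pkl'),
                           clean_phase=False, label='albums')
```

Returning `None` when no file exists makes the missing case explicit for the caller, instead of leaving the variable undefined.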

In [13]:
# Write or read files depending on CLEAN_PHASE and on whether each path exists
genres_df_path = '{dir}{file}'.format(dir=PKL_DIR, 
                                     file='genres_df.pkl')
albums_df_path = '{dir}{file}'.format(dir=PKL_DIR, 
                                     file='albums_df.pkl')
artists_df_path = '{dir}{file}'.format(dir=PKL_DIR, 
                                     file='artists_df.pkl')
tracks_df_path = '{dir}{file}'.format(dir=PKL_DIR, 
                                      file='tracks_df.pkl')
track_genre_rel_df_path = '{dir}{file}'.format(dir=PKL_DIR, 
                                               file='track_genre_rel_df.pkl')
if CLEAN_PHASE:
    genres_df.to_pickle(genres_df_path)
elif os.path.exists(genres_df_path):
    genres_df = pd.read_pickle(genres_df_path)
else:
    print('There is no genres pandas data')
    
if CLEAN_PHASE:
    albums_df.to_pickle(albums_df_path)
elif os.path.exists(albums_df_path):
    albums_df = pd.read_pickle(albums_df_path)
else:
    print('There is no albums pandas data')
    
if CLEAN_PHASE:
    artists_df.to_pickle(artists_df_path)
elif os.path.exists(artists_df_path):
    artists_df = pd.read_pickle(artists_df_path)
else:
    print('There is no artists pandas data')
    
if CLEAN_PHASE:
    tracks_df.to_pickle(tracks_df_path)
elif os.path.exists(tracks_df_path):
    tracks_df = pd.read_pickle(tracks_df_path)
else:
    print('There is no tracks pandas data')
    
if CLEAN_PHASE:
    track_genre_rel_df.to_pickle(track_genre_rel_df_path)
elif os.path.exists(track_genre_rel_df_path):
    track_genre_rel_df = pd.read_pickle(track_genre_rel_df_path)
else:
    print('There is no tracks-genres pandas data')
    

We compute some statistics for every dataframe, such as the number of missing values after preprocessing, taking echonest as one dataframe just for the analysis.
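`pretty_print` comes from the project's `helpers` module, which is not shown here. A minimal stand-in that would produce output of the same shape as the cells below might look like this; the signature and the handling of the `debug` flag are assumptions.

```python
import numpy as np
import pandas as pd

def pretty_print(df, name, debug=True, rows=5):
    """Hypothetical stand-in for helpers.pretty_print.

    Prints the row count and the per-column NaN counts, then returns the
    first rows so that a notebook displays them as the cell result.
    """
    if not debug:
        return None
    print('{} size: {}\n'.format(name, len(df)))
    print('NaN count by column:')
    print(df.isna().sum())
    return df.head(rows)

# Usage on a made-up dataframe
demo = pd.DataFrame({'track_id': [2, 3, 5],
                     'track_language': ['en', np.nan, np.nan]})
preview = pretty_print(demo, 'Tracks', debug=True, rows=2)
nan_counts = demo.isna().sum()
```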

In [14]:
pretty_print(genres_df, 'Genres', DEBUG)


Genres size: 163

NaN count by column:
genre_id        0
#tracks         0
parent          0
title           0
top_level       0
genre_parent    0
dtype: int64


Unnamed: 0,genre_id,#tracks,parent,title,top_level,genre_parent
0,1,8693,38,Avant-Garde,38,Experimental
1,2,5271,0,International,2,International
2,3,1752,0,Blues,3,Blues
3,4,4126,0,Jazz,4,Jazz
4,5,4106,0,Classical,5,Classical


In [15]:
pretty_print(artists_df, 'Artists', DEBUG)


Artists size: 21966

NaN count by column:
artist_id               0
artist_discovery    19090
city                18130
state               18130
country             18130
artist_family       19090
artist_hotness      13465
artist_latitude     18130
artist_longitude    18130
artist_name             0
dtype: int64


Unnamed: 0,artist_id,artist_discovery,city,state,country,artist_family,artist_hotness,artist_latitude,artist_longitude,artist_name
0,1,0.38899,Lakehurst,New Jersey,US,0.38674,0.40637,40.0583,-74.4057,AWOL
1,6,0.557339,,,,0.614272,0.798387,,,Kurt Vile
2,54,0.388923,Providence,Rhode Island,US,0.330784,0.4063,41.824,-71.4128,Alec K. Redfearn & the Eyesores
3,56,0.252381,Providence,Rhode Island,US,0.238153,0.263657,41.824,-71.4128,Amoebic Ensemble
4,60,0.262717,Philadelphia,Pennsylvania,US,0.173315,0.274455,39.9523,-75.1638,Arc and Sender


In [16]:
pretty_print(albums_df, 'Albums', DEBUG)


Albums size: 19621

NaN count by column:
album_id                  0
album_date_created     5226
album_date_released    5404
album_title               1
album_listens            34
dtype: int64


Unnamed: 0,album_id,album_date_created,album_date_released,album_title,album_listens
0,1,2008-11-26 01:44:45,2009-01-05,AWOL - A Way Of Life,6073.0
1,6,2008-11-26 01:45:08,2008-02-06,Constant Hitmaker,47632.0
2,60,2008-11-26 01:49:57,2009-01-16,Every Man For Himself,1304.0
3,61,2008-11-26 01:49:59,2007-05-22,The Blind Spot,1300.0
4,62,2008-11-26 01:50:03,2005-01-25,The Quiet Room,845.0


In [17]:
pretty_print(tracks_df, 'Tracks', DEBUG)


Tracks size: 285078

NaN count by column:
track_id               0
track_title            1
track_duration         0
artist_id              0
album_id               0
track_language    270054
track_listens      14409
track_tags             0
danceability       93445
energy             93445
valence            93445
song_currency     271949
song_hotness       93445
dtype: int64


Unnamed: 0,track_id,track_title,track_duration,artist_id,album_id,track_language,track_listens,track_tags,danceability,energy,valence,song_currency,song_hotness
0,2,Food,168.0,1,1,en,1293.0,[],0.675894,0.634476,0.576661,0.0,0.0
1,3,Electric Ave,237.0,1,1,en,514.0,[],0.528643,0.817461,0.26924,0.0,0.0
2,5,This World,206.0,1,1,en,1151.0,[],0.745566,0.70147,0.621661,0.0,0.0
3,10,Freeway,161.0,6,6,en,50135.0,[],0.658179,0.924525,0.96359,0.005158,0.354516
4,134,Street Music,207.0,1,1,en,943.0,[],0.513238,0.56041,0.894072,0.0,0.0


In [18]:
pretty_print(track_genre_rel_df, 'Track-Genre', DEBUG)


Track-Genre size: 287767

NaN count by column:
track_id    0
genre_id    0
dtype: int64


Unnamed: 0,track_id,genre_id
0,2,21
1,3,21
2,5,21
3,10,10
4,20,76
