# Music Recommendation System
In this project, I'll be building a recommendation system that uses Spotify's and Pitchfork's datasets to recommend music to users through content-based and collaborative filtering techniques.

## Datasets

### Spotify
The [Spotify dataset](https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks) contains more than 170,000 songs collected through the Spotify Web API. (Credits: Yamaç Eren Ay)

**Primary**:
- id (Id of track generated by Spotify)

**Numerical**:
- acousticness (Ranges from 0 to 1)
- danceability (Ranges from 0 to 1)
- energy (Ranges from 0 to 1)
- duration_ms (Integer typically ranging from 200k to 300k)
- instrumentalness (Ranges from 0 to 1)
- valence (Ranges from 0 to 1)
- popularity (Ranges from 0 to 100)
- tempo (Float typically ranging from 50 to 150)
- liveness (Ranges from 0 to 1)
- loudness (Float typically ranging from -60 to 0)
- speechiness (Ranges from 0 to 1)
- year (Ranges from 1921 to 2020)

**Dummy**:
- mode (0 = Minor, 1 = Major)
- explicit (0 = No explicit content, 1 = Explicit content)

**Categorical**:
- key (All keys on octave encoded as values ranging from 0 to 11, starting on C as 0, C# as 1 and so on…)
- artists (List of artists mentioned)
- release_date (Date of release mostly in yyyy-mm-dd format, however precision of date may vary)
- name (Name of the song)

### Pitchfork
The [Pitchfork dataset](https://www.kaggle.com/nolanbconaway/pitchfork-data) contains 19,000+ reviews from the music-centric online magazine, going back as early as January 1999. (Credits: Nolan Conaway)

**Numerical**:
- best (Considered best album 0 or 1)
- score (Review score 0 to 10)

**Categorical**:
- album (Album name)
- artist (Artist name)
- genre (Genre type - 'electronic', 'hip-hop', etc.)
- review (text)
- date (Album release date)


In [157]:
review_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19555 entries, 1 to 19555
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   album   19550 non-null  string        
 1   artist  19555 non-null  string        
 2   best    19555 non-null  int64         
 3   date    19555 non-null  datetime64[ns]
 4   genre   19555 non-null  string        
 5   review  19554 non-null  string        
 6   score   19555 non-null  float64       
dtypes: datetime64[ns](1), float64(1), int64(1), string(4)
memory usage: 1.2 MB


## Import relevant modules

In [1]:
import pandas as pd
import numpy as np

# Visualization
import plotly.graph_objects as go
import plotly.express as px

# Display more columns and rows
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)

## Import files

In [165]:
music_df = pd.read_csv('./dataset/spotify_data.csv')

# Use index_col=0 so that no 'Unnamed: 0' column is created, and set a different encoding to read the reviews data
review_df = pd.read_csv('./dataset/pitchfork_reviews.csv', encoding = "ISO-8859-1", index_col=0)

## 2) Explore and Clean Spotify Dataset
- Inspect the datasets for shape, column datatypes, summary statistics, and missing values.

### 2.1 Inspect datasets
#### Spotify dataset

In [166]:
music_df.shape # 170,653 rows and 19 columns

(170653, 19)

In [167]:
music_df.describe()

Unnamed: 0,valence,year,acousticness,danceability,duration_ms,energy,explicit,instrumentalness,key,liveness,loudness,mode,popularity,speechiness,tempo
count,170653.0,170653.0,170653.0,170653.0,170653.0,170653.0,170653.0,170653.0,170653.0,170653.0,170653.0,170653.0,170653.0,170653.0,170653.0
mean,0.528587,1976.787241,0.502115,0.537396,230948.3,0.482389,0.084575,0.16701,5.199844,0.205839,-11.46799,0.706902,31.431794,0.098393,116.86159
std,0.263171,25.917853,0.376032,0.176138,126118.4,0.267646,0.278249,0.313475,3.515094,0.174805,5.697943,0.455184,21.826615,0.16274,30.708533
min,0.0,1921.0,0.0,0.0,5108.0,0.0,0.0,0.0,0.0,0.0,-60.0,0.0,0.0,0.0,0.0
25%,0.317,1956.0,0.102,0.415,169827.0,0.255,0.0,0.0,2.0,0.0988,-14.615,0.0,11.0,0.0349,93.421
50%,0.54,1977.0,0.516,0.548,207467.0,0.471,0.0,0.000216,5.0,0.136,-10.58,1.0,33.0,0.045,114.729
75%,0.747,1999.0,0.893,0.668,262400.0,0.703,0.0,0.102,8.0,0.261,-7.183,1.0,48.0,0.0756,135.537
max,1.0,2020.0,0.996,0.988,5403500.0,1.0,1.0,1.0,11.0,1.0,3.855,1.0,100.0,0.97,243.507


In [168]:
# Check for any null or missing values
music_df.isnull().sum() / music_df.shape[0] # Check whether spotify dataset has null value

valence             0.0
year                0.0
acousticness        0.0
artists             0.0
danceability        0.0
duration_ms         0.0
energy              0.0
explicit            0.0
id                  0.0
instrumentalness    0.0
key                 0.0
liveness            0.0
loudness            0.0
mode                0.0
name                0.0
popularity          0.0
release_date        0.0
speechiness         0.0
tempo               0.0
dtype: float64

In [133]:
np.where(music_df.applymap(lambda x: x == '')) # Check for empty string/value

(array([], dtype=int64), array([], dtype=int64))

The Spotify dataset contains no null or missing values and no empty strings.

### 2.2 Format Values
- Artists column - convert each list-like string to a list of lowercased artist names
- Name (track name) column - lowercase all track names and rename the column to 'track_name'
- Release Date column - convert to the default datetime format (Year-Month-Day)
- Numerical columns with a range outside 0 to 1 (duration_ms, loudness, popularity, tempo) - scale to the 0 to 1 range for uniformity
- Re-order columns
- Get genres per song (row)

In [134]:
music_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 170653 entries, 0 to 170652
Data columns (total 19 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   valence           170653 non-null  float64
 1   year              170653 non-null  int64  
 2   acousticness      170653 non-null  float64
 3   artists           170653 non-null  object 
 4   danceability      170653 non-null  float64
 5   duration_ms       170653 non-null  int64  
 6   energy            170653 non-null  float64
 7   explicit          170653 non-null  int64  
 8   id                170653 non-null  object 
 9   instrumentalness  170653 non-null  float64
 10  key               170653 non-null  int64  
 11  liveness          170653 non-null  float64
 12  loudness          170653 non-null  float64
 13  mode              170653 non-null  int64  
 14  name              170653 non-null  object 
 15  popularity        170653 non-null  int64  
 16  release_date      17

#### Artists Column

In [135]:
# Inspecting artists column's unique values.
music_df['artists'].unique()

array(["['Sergei Rachmaninoff', 'James Levine', 'Berliner Philharmoniker']",
       "['Dennis Day']",
       "['KHP Kridhamardawa Karaton Ngayogyakarta Hadiningrat']", ...,
       "['Najma Wallin']",
       "['Anuel AA', 'Daddy Yankee', 'KAROL G', 'Ozuna', 'J Balvin']",
       "['KEVVO', 'J Balvin']"], dtype=object)

In [136]:
# Converting array string to array
import ast

def convertToArr(artists):
    artists = ast.literal_eval(artists) # Parse the list literal; drops the brackets and the quotes around each name
    artists = [artist.lower() for artist in artists] # Lowercase all characters
    return artists

music_df['artists'] = music_df['artists'].apply(convertToArr)
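
As a quick check of this conversion, the sketch below parses one of the raw values shown above with the standard library's `ast.literal_eval`, which reads the string as a Python list literal, so the quotes and spaces around each name are not carried into the elements. The helper name `parse_artists` is hypothetical and not part of the notebook, and the apostrophe example is an illustrative value rather than a row taken from the dataset.

```python
import ast

def parse_artists(raw):
    """Parse a string such as "['A', 'B']" into a list of lowercased names."""
    names = ast.literal_eval(raw)  # interpret the string as a Python list literal
    return [name.lower() for name in names]

raw = "['Anuel AA', 'Daddy Yankee', 'KAROL G', 'Ozuna', 'J Balvin']"
print(parse_artists(raw))

# For comparison: splitting on commas keeps the quote characters and leading spaces
naive = raw.replace('[', '').replace(']', '').split(',')
print(repr(naive[1]))

# Illustrative (hypothetical) value: a name containing an apostrophe is written
# with double quotes in a list literal, which literal_eval also handles
print(parse_artists("[\"Guns N' Roses\"]"))
```

Comparing the two printed results shows why stripping brackets and splitting on commas is not enough on its own: the second element of the naive split still begins with a space and a quote character.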

#### Name (Track Name) Column

In [137]:
# Inspect name column's unique values
music_df['name'].unique()

array(['Piano Concerto No. 3 in D Minor, Op. 30: III. Finale. Alla breve',
       'Clancy Lowered the Boom', 'Gati Bali', ...,
       'Halloweenie III: Seven Days', 'AYA',
       'Billetes Azules (with J Balvin)'], dtype=object)

In [138]:
# Convert to string dtype and lowercase all values
music_df['name'] = music_df['name'].astype(str).str.lower()
music_df['name'].unique()

array(['piano concerto no. 3 in d minor, op. 30: iii. finale. alla breve',
       'clancy lowered the boom', 'gati bali', ...,
       'halloweenie iii: seven days', 'aya',
       'billetes azules (with j balvin)'], dtype=object)

In [139]:
# Change 'name' to 'track_name'
music_df.rename(columns={"name": "track_name"}, inplace=True)

#### Release Date Column

In [140]:
# Inspect release date column's unique values
music_df['release_date'].unique()

array(['1921', '1921-03-20', '1921-03-27', ..., '2020-04-15',
       '2020-05-25', '2020-11-03'], dtype=object)

In [141]:
# Convert date column to datetime
music_df['release_date'] = pd.to_datetime(music_df['release_date'])
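
Because the column mixes year-only values such as '1921' with full dates, one way to make the parsing explicit is to pad every value to day precision first. This is a minimal sketch rather than the notebook's method: the helper `pad_release_date` is hypothetical, and it assumes every value is in `yyyy`, `yyyy-mm`, or `yyyy-mm-dd` form.

```python
from datetime import datetime

# Suffix needed to bring each precision up to yyyy-mm-dd (keyed by string length)
SUFFIX_BY_LENGTH = {4: "-01-01", 7: "-01", 10: ""}

def pad_release_date(value):
    """Pad 'yyyy' or 'yyyy-mm' to 'yyyy-mm-dd' using the first month/day."""
    try:
        return value + SUFFIX_BY_LENGTH[len(value)]
    except KeyError:
        raise ValueError(f"unexpected release_date format: {value!r}")

# Sample values taken from the unique() output above
for raw in ["1921", "1921-03-20", "2020-11-03"]:
    padded = pad_release_date(raw)
    print(raw, "->", datetime.strptime(padded, "%Y-%m-%d").date())
```

With pandas, the padded strings could then be parsed in a single call, for example `pd.to_datetime(music_df['release_date'].map(pad_release_date), format='%Y-%m-%d')`.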

#### Normalize value between 0-1 for all numerical columns
Rescale the numerical columns that fall outside the 0 to 1 range so that all features share a comparable scale, which can improve convergence speed during machine learning.

In [142]:
music_df.head()

Unnamed: 0,valence,year,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,track_name,popularity,release_date,speechiness,tempo
0,0.0594,1921,0.982,"['sergei rachmaninoff', 'james levine', 'ber...",0.279,831667,0.211,0,4BJqT0PrAfrxzMOxytFOIz,0.878,10,0.665,-20.096,1,"piano concerto no. 3 in d minor, op. 30: iii. ...",4,1921-01-01,0.0366,80.954
1,0.963,1921,0.732,['dennis day'],0.819,180533,0.341,0,7xPhfUan2yNtyFG0cUWkt8,0.0,7,0.16,-12.441,1,clancy lowered the boom,5,1921-01-01,0.415,60.936
2,0.0394,1921,0.961,['khp kridhamardawa karaton ngayogyakarta hadi...,0.328,500062,0.166,0,1o6I8BglA6ylDMrIELygv1,0.913,3,0.101,-14.85,1,gati bali,5,1921-01-01,0.0339,110.339
3,0.165,1921,0.967,['frank parker'],0.275,210000,0.309,0,3ftBPsC5vPBKxYSee08FDH,2.8e-05,5,0.381,-9.316,1,danny boy,3,1921-01-01,0.0354,100.109
4,0.253,1921,0.957,['phil regan'],0.418,166693,0.193,0,4d6HGyGT8e121BsdKmw9v6,2e-06,3,0.229,-10.096,1,when irish eyes are smiling,2,1921-01-01,0.038,101.665


In [143]:
# Based on the dataframe above, the columns below need to be scaled to the 0 to 1 range
cols = ['duration_ms', 'loudness', 'popularity', 'tempo']

In [144]:
# Scale all numerical columns from 0 to 1 for uniformity.
from sklearn.preprocessing import MinMaxScaler

min_max_scaler = MinMaxScaler() # instantiate

music_df[cols] = min_max_scaler.fit_transform(music_df[cols])
music_df.head()

Unnamed: 0,valence,year,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,track_name,popularity,release_date,speechiness,tempo
0,0.0594,1921,0.982,"['sergei rachmaninoff', 'james levine', 'ber...",0.279,0.153112,0.211,0,4BJqT0PrAfrxzMOxytFOIz,0.878,10,0.665,0.624916,1,"piano concerto no. 3 in d minor, op. 30: iii. ...",0.04,1921-01-01,0.0366,0.33245
1,0.963,1921,0.732,['dennis day'],0.819,0.032496,0.341,0,7xPhfUan2yNtyFG0cUWkt8,0.0,7,0.16,0.744797,1,clancy lowered the boom,0.05,1921-01-01,0.415,0.250243
2,0.0394,1921,0.961,['khp kridhamardawa karaton ngayogyakarta hadi...,0.328,0.091685,0.166,0,1o6I8BglA6ylDMrIELygv1,0.913,3,0.101,0.707071,1,gati bali,0.05,1921-01-01,0.0339,0.453125
3,0.165,1921,0.967,['frank parker'],0.275,0.037954,0.309,0,3ftBPsC5vPBKxYSee08FDH,2.8e-05,5,0.381,0.793736,1,danny boy,0.03,1921-01-01,0.0354,0.411113
4,0.253,1921,0.957,['phil regan'],0.418,0.029932,0.193,0,4d6HGyGT8e121BsdKmw9v6,2e-06,3,0.229,0.781521,1,when irish eyes are smiling,0.02,1921-01-01,0.038,0.417503
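
The scaled values in the first row can be reproduced by hand. With its default range, `MinMaxScaler` maps each value x to (x - min) / (max - min), using the column's minimum and maximum. The sketch below plugs in the `min` and `max` reported by `music_df.describe()` earlier together with the raw values of row 0; the function name `min_max` is only illustrative.

```python
def min_max(x, col_min, col_max):
    """Rescale x to the [0, 1] range given the column's minimum and maximum."""
    return (x - col_min) / (col_max - col_min)

# (raw value in row 0, column min, column max), all taken from the outputs above
row0 = {
    "duration_ms": (831667, 5108.0, 5403500.0),
    "loudness": (-20.096, -60.0, 3.855),
    "popularity": (4, 0.0, 100.0),
    "tempo": (80.954, 0.0, 243.507),
}

for col, (x, lo, hi) in row0.items():
    print(f"{col}: {min_max(x, lo, hi):.6f}")
```

The printed values can be compared against row 0 of the scaled dataframe above.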


In [145]:
# Change 'duration_ms' to 'duration'
music_df.rename(columns={'duration_ms': 'duration'}, inplace=True)

In [147]:
# Re-order columns
first_cols = ['id', 'artists', 'track_name', 'release_date', 'year', 'key', 'explicit', 'mode']
second_cols = [col for col in music_df.columns if col not in first_cols]
music_df = music_df[first_cols + second_cols]

# Finalized music dataframe
music_df.head()

Unnamed: 0,id,artists,track_name,release_date,year,key,explicit,mode,valence,acousticness,danceability,duration,energy,instrumentalness,liveness,loudness,popularity,speechiness,tempo
0,4BJqT0PrAfrxzMOxytFOIz,"['sergei rachmaninoff', 'james levine', 'ber...","piano concerto no. 3 in d minor, op. 30: iii. ...",1921-01-01,1921,10,0,1,0.0594,0.982,0.279,0.153112,0.211,0.878,0.665,0.624916,0.04,0.0366,0.33245
1,7xPhfUan2yNtyFG0cUWkt8,['dennis day'],clancy lowered the boom,1921-01-01,1921,7,0,1,0.963,0.732,0.819,0.032496,0.341,0.0,0.16,0.744797,0.05,0.415,0.250243
2,1o6I8BglA6ylDMrIELygv1,['khp kridhamardawa karaton ngayogyakarta hadi...,gati bali,1921-01-01,1921,3,0,1,0.0394,0.961,0.328,0.091685,0.166,0.913,0.101,0.707071,0.05,0.0339,0.453125
3,3ftBPsC5vPBKxYSee08FDH,['frank parker'],danny boy,1921-01-01,1921,5,0,1,0.165,0.967,0.275,0.037954,0.309,2.8e-05,0.381,0.793736,0.03,0.0354,0.411113
4,4d6HGyGT8e121BsdKmw9v6,['phil regan'],when irish eyes are smiling,1921-01-01,1921,3,0,1,0.253,0.957,0.418,0.029932,0.193,2e-06,0.229,0.781521,0.02,0.038,0.417503


### 2.3 Spotify EDA
- Find correlations among columns (heatmap)
- Is a track more likely to be explicit when its speechiness is high?
- Identify characteristics of songs that have high and low popularity scores (distplot of all numerical columns)
- Identify music trends (line chart)
- Find the mean duration and identify characteristics of songs that have a high duration value
- Identify popular artists and songs based on popularity scores
- Song characteristics of each genre
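
For the first item, a correlation heatmap is a picture of pairwise Pearson correlation coefficients. As an illustration of the quantity being plotted, the sketch below computes the coefficient for `energy` and `loudness` using only the five rows displayed by `music_df.head()` earlier. Five rows are far too few to draw any conclusion about the dataset, and the function name `pearson` is illustrative.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Values copied from the first five rows shown by music_df.head() before scaling
energy = [0.211, 0.341, 0.166, 0.309, 0.193]
loudness = [-20.096, -12.441, -14.85, -9.316, -10.096]

print(f"energy vs loudness (5 rows only): {pearson(energy, loudness):.3f}")
```

On the full dataframe, recent pandas versions can compute the whole matrix with `music_df.corr(numeric_only=True)`, and `px.imshow` is one option for rendering it as a heatmap.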

#### Pitchfork reviews dataset

In [148]:
review_df.shape # 19,555 rows and 7 columns

(19555, 7)

In [149]:
review_df.describe()

Unnamed: 0,best,score
count,19555.0,19555.0
mean,0.053183,7.027446
std,0.224405,1.277544
min,0.0,0.0
25%,0.0,6.5
50%,0.0,7.3
75%,0.0,7.8
max,1.0,10.0


In [150]:
music_df.isnull().sum() / music_df.shape[0] # Check whether spotify dataset has null value

id                  0.0
artists             0.0
track_name          0.0
release_date        0.0
year                0.0
key                 0.0
explicit            0.0
mode                0.0
valence             0.0
acousticness        0.0
danceability        0.0
duration            0.0
energy              0.0
instrumentalness    0.0
liveness            0.0
loudness            0.0
popularity          0.0
speechiness         0.0
tempo               0.0
dtype: float64

In [151]:
np.where(music_df.applymap(lambda x: x == '')) # Check for empty string/value

(array([], dtype=int64), array([], dtype=int64))

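The two checks above were run on `music_df` rather than `review_df`, so they only reconfirm that the Spotify dataset has no null or empty values; rerunning them with `review_df` in place of `music_df` would check the Pitchfork data. The Pitchfork dataset does contain missing values: the `review_df.info()` output below reports 19,550 non-null `album` entries and 19,554 non-null `review` entries out of 19,555 rows, i.e. 5 missing album names and 1 missing review.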

In [152]:
review_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19555 entries, 1 to 19555
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   album   19550 non-null  object 
 1   artist  19555 non-null  object 
 2   best    19555 non-null  int64  
 3   date    19555 non-null  object 
 4   genre   19555 non-null  object 
 5   review  19554 non-null  object 
 6   score   19555 non-null  float64
dtypes: float64(1), int64(1), object(5)
memory usage: 1.2+ MB


In [153]:
review_df.head()

Unnamed: 0,album,artist,best,date,genre,review,score
1,A.M./Being There,Wilco,1,December 6 2017,Rock,Best new reissue 1 / 2 Albums Newly reissued a...,7.0
2,No Shame,Hopsin,0,December 6 2017,Rap,"On his corrosive fifth album, the rapper takes...",3.5
3,Material Control,Glassjaw,0,December 6 2017,Rock,"On their first album in 15 years, the Long Isl...",6.6
4,Weighing of the Heart,Nabihah Iqbal,0,December 6 2017,Pop/R&B,"On her debut LP, British producer Nabihah Iqba...",7.7
5,The Visitor,Neil Young / Promise of the Real,0,December 5 2017,Rock,"While still pointedly political, Neil Youngs ...",6.7


Based on the row values, the album, artist, genre, and review columns should be converted to the 'string' datatype, and the date column should be converted from the 'object' datatype to 'datetime'.

In [154]:
# Convert album, artist, genre, and review to string datatype and lowercase all characters.
cols = ['album', 'artist', 'genre', 'review']
review_df[cols] = review_df[cols].astype('string')

# Lowercase all row values
for col in cols:
    review_df[col] = review_df[col].str.lower()
    
# Convert date column to datetime
review_df['date'] = pd.to_datetime(review_df['date'])

In [155]:
review_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19555 entries, 1 to 19555
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   album   19550 non-null  string        
 1   artist  19555 non-null  string        
 2   best    19555 non-null  int64         
 3   date    19555 non-null  datetime64[ns]
 4   genre   19555 non-null  string        
 5   review  19554 non-null  string        
 6   score   19555 non-null  float64       
dtypes: datetime64[ns](1), float64(1), int64(1), string(4)
memory usage: 1.2 MB


In [156]:
review_df.head()

Unnamed: 0,album,artist,best,date,genre,review,score
1,a.m./being there,wilco,1,2017-12-06,rock,best new reissue 1 / 2 albums newly reissued a...,7.0
2,no shame,hopsin,0,2017-12-06,rap,"on his corrosive fifth album, the rapper takes...",3.5
3,material control,glassjaw,0,2017-12-06,rock,"on their first album in 15 years, the long isl...",6.6
4,weighing of the heart,nabihah iqbal,0,2017-12-06,pop/r&b,"on her debut lp, british producer nabihah iqba...",7.7
5,the visitor,neil young / promise of the real,0,2017-12-05,rock,"while still pointedly political, neil youngs ...",6.7


All columns now have the correct datatypes, and the date column uses the default datetime format.

In [160]:
review_df['review'][1]

'best new reissue 1 / 2 albums newly reissued and remastered, the group\x92s first two albums find jeff tweedy and his chicago band transforming themselves from alt-country also-rans into a formidable rock\x91n\x92roll outfit. the nuclear detonation of uncle tupelo launched an alt-country arms race, with the band\x92s two chief singer-songwriters mutating from old friends into bitter enemies trying to outdo each other with their follow-up records. jay farrar started son volt with tupelo\x92s drummer, mike heidorn, and released trace, which yielded the radio hit \x93drown\x94 and found him greeted as a visionary. jeff tweedy, on the other hand, rushed into the studio to record a set of demos with his new band, wilco, barely a couple months after his old band had played its final show. nearly a year later they released their first album, a.m., which was greeted with a big shrug from critics and fans alike. tweedy had managed to retain almost every member of uncle tupelo\x92s expanded lin
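
The `\x91` to `\x94` characters in the text above are what the bytes 0x91 to 0x94 become when decoded as ISO-8859-1, where they map to control characters. In Windows-1252 the same bytes are curly quotes and apostrophes, so the file was possibly written in that encoding; this is an assumption, not something the dataset documents. A minimal sketch of repairing an already-loaded string follows (the helper name `repair_cp1252` is hypothetical):

```python
def repair_cp1252(text):
    """Reinterpret text that was decoded as ISO-8859-1 as Windows-1252."""
    # Encoding back with latin-1 is lossless here: it maps code points 0-255
    # one-to-one onto the original bytes
    return text.encode("latin-1").decode("cp1252")

# Fragments copied from the review text above
sample = "the group\x92s first two albums ... the radio hit \x93drown\x94"
print(repair_cp1252(sample))
```

Bytes that Windows-1252 leaves undefined (for example 0x81) would raise `UnicodeDecodeError` in this sketch, so real data may need `errors='replace'` or a per-row fallback. Passing `encoding='cp1252'` to `pd.read_csv` is another option worth trying at load time.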

In [171]:
music_df[music_df['artists'] == "['Dua Lipa']"]

Unnamed: 0,valence,year,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo
19165,0.608,2017,0.00261,['Dua Lipa'],0.762,209320,0.7,0,2ekn2ttSfGqwhhate0LSR0,1.6e-05,9,0.153,-6.021,0,New Rules,80,2017-06-02,0.0694,116.073
19180,0.51,2017,0.0403,['Dua Lipa'],0.836,217947,0.544,1,76cy1WJvNGJTj78UqeA5zr,0.0,7,0.0824,-5.975,1,IDGAF,80,2017-06-02,0.0943,97.028
19488,0.677,2019,0.0125,['Dua Lipa'],0.794,183290,0.793,0,6WrI0LAC5M1Rw2MnX2ZvEg,0.0,11,0.0952,-4.521,0,Don't Start Now,86,2019-10-31,0.0842,123.941
38583,0.467,2020,0.167,['Dua Lipa'],0.73,221820,0.729,0,017PF4Q3l4DBUiWoXk4OWT,1e-06,4,0.349,-3.434,0,Break My Heart,85,2020-03-27,0.0883,113.013
56451,0.491,2016,0.0188,['Dua Lipa'],0.654,178583,0.796,1,7kJlTKjNZVT26iwiDUVhRm,0.0,2,0.0948,-4.761,0,Blow Your Mind (Mwah),71,2016-08-26,0.122,108.854
57340,0.679,2020,0.0123,['Dua Lipa'],0.793,183290,0.793,0,3PfIrDoz19wz7qK7tYeu62,0.0,11,0.0951,-4.521,0,Don't Start Now,84,2020-03-27,0.083,123.95
75236,0.467,2020,0.167,['Dua Lipa'],0.73,221820,0.729,0,1raaNykBg1bDnWENUiglUA,1e-06,4,0.349,-3.434,0,Break My Heart,79,2020-03-25,0.0886,113.012
91224,0.368,2015,0.117,['Dua Lipa'],0.661,202915,0.651,0,1ixphys4A3NEXp6MDScfih,1.3e-05,7,0.056,-3.771,0,Be the One,71,2015-10-30,0.0499,87.46
92265,0.746,2020,0.0137,['Dua Lipa'],0.647,193829,0.844,0,3AzjcOeAmA57TIOr9zF1ZW,0.000658,0,0.102,-3.756,1,Physical,81,2020-03-27,0.0457,146.967
108809,0.627,2020,0.033,['Dua Lipa'],0.627,208505,0.69,0,1nYeVF5vIBxMxfPoL0SIWg,0.0,10,0.0742,-5.396,0,Hallucinate,80,2020-03-27,0.139,122.053
