# Sentiment Analysis and Recommender Systems Part 4 - Exercises with Results

## Exercise 1

#### Task 1 
##### Load libraries that are used in this module.

#### Result:

In [1]:
import os
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import wordcloud
from wordcloud import WordCloud, STOPWORDS
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.metrics import mean_squared_error
from math import sqrt
from scipy.sparse.linalg import svds
from surprise import Reader
from surprise import Dataset
from surprise import SVD
from surprise.model_selection import cross_validate

#### Task 2 
##### Set the working directory to the folder where the dataset is stored.

#### Result:

In [2]:
from pathlib import Path
home_dir = Path(".").resolve()
main_dir = home_dir.parent.parent

data_dir = str(main_dir) + "/data"


#### Task 3
##### Read in 'lastfm_ratings.csv' dataset to a dataframe named 'fm_ratings' and 'lastfm_artists.csv' as 'fm_artists'.

#### Result:

In [3]:
fm_ratings = pd.read_csv(data_dir + '/lastfm_ratings.csv')
fm_artists = pd.read_csv(data_dir + '/lastfm_artists.csv')

#### Task 4
##### Transform `fm_ratings` into a table with userID as the rows, artist_name as the columns, and rating as the values. Save it as `userRating`.
##### Compute the correlation matrix between artists (the artist_name columns). Do not use min_periods as we did in the module, since our dataset here is small.
##### Computing the correlation matrix can take a long time, so you can instead load the precomputed copy saved in our data_dir.
##### Load `corrMatrix_ex.csv` as corrMatrix, and set its first column, `artist_name`, as the index of the dataframe.

#### Result:

In [4]:
userRating = fm_ratings.pivot_table(index = ['userID'],
                                    columns = ['artist_name'], values = 'rating')

userRating.head()

artist_name,(hed) Planet Earth,*NSYNC,...And The Earth Swarmed With Them,...And You Will Know Us by the Trail of Dead,.38 Special,.crrust,1-800-ZOMBIE,10 Years,100 Bitches,100DEADRABBITS!!!,...,Башня Rowan,МакSим,Мультfильмы,Розовые Очки От Ferre,Розовые очки от ferre,аутside,℃-ute,けちゃっぷmania,月島きらり starring 久住小春(モーニング娘。),雅-MIYAVI-
userID
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
6,,,,,,,,,,,...,,,,,,,,,,
7,,,,,,,,,,,...,,,,,,,,,,


In [5]:
# corrMatrix = userRating.corr(method = 'pearson')
corrMatrix = pd.read_csv(data_dir + "/corrMatrix_ex.csv")
corrMatrix.head()

,artist_name,(hed) Planet Earth,*NSYNC,...And The Earth Swarmed With Them,...And You Will Know Us by the Trail of Dead,.38 Special,.crrust,1-800-ZOMBIE,10 Years,100 Bitches,...,Башня Rowan,МакSим,Мультfильмы,Розовые Очки От Ferre,Розовые очки от ferre,аутside,℃-ute,けちゃっぷmania,月島きらり starring 久住小春(モーニング娘。),雅-MIYAVI-
0,(hed) Planet Earth,1.0,,,,,,,,,...,,,,,,,,,,
1,*NSYNC,,1.0,,,,,,,,...,,,,,,,,,,
2,...And The Earth Swarmed With Them,,,,,,,,,,...,,,,,,,,,,
3,...And You Will Know Us by the Trail of Dead,,,,1.0,,,,,,...,,,,,,,,,,
4,.38 Special,,,,,1.0,,,,,...,,,,,,,,,,


In [6]:
corrMatrix = corrMatrix.set_index('artist_name')
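
The pivot and correlation steps can be tried on a small table first. The sketch below uses made-up ratings (the `toy` names and values are assumptions, not the Last.fm data) to show how `pivot_table` leaves unrated cells as NaN, how `corr` compares artist columns over the users they share, and what `min_periods` changes.

```python
import pandas as pd

# Hypothetical toy ratings (not the Last.fm data).
toy = pd.DataFrame({
    "userID":      [1, 1, 2, 2, 3, 3, 4],
    "artist_name": ["A", "B", "A", "B", "A", "B", "C"],
    "rating":      [1.0, 2.0, 2.0, 4.0, 3.0, 6.0, 5.0],
})

# One row per user, one column per artist; missing ratings become NaN.
toy_ratings = toy.pivot_table(index="userID", columns="artist_name", values="rating")

# Pairwise Pearson correlation between artist columns,
# computed over the users who rated both artists.
toy_corr = toy_ratings.corr(method="pearson")

# With min_periods, pairs with fewer overlapping raters are set to NaN.
toy_corr_strict = toy_ratings.corr(method="pearson", min_periods=4)

print(toy_corr)
print(toy_corr_strict)
```

Artists A and B share three raters and are perfectly correlated, A and C share none and give NaN, and with `min_periods=4` even the A/B pair is discarded. On a small dataset, a high `min_periods` can therefore empty most of the matrix.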

#### Task 5
##### We will find recommendations for the 25th user in `userRating`. Assign user_id as 24, the zero-based row position used with `.iloc`, and use the same steps we did in the module to find the artist recommendations.
##### First, create a list of all artists with all correlations multiplied by rating.
##### Group by artist name and sum the scores to remove the duplicates.

#### Result:

In [7]:
# user_id is a positional row index for .iloc (row 24 is the 25th row of userRating).
user_id = 24

# The artists this user has rated, with their ratings.
user_ratings = userRating.iloc[user_id].dropna()

# Create a list of all artists with all correlations multiplied by rating.
# (Series.append was removed in pandas 2.0, so the pieces are combined with pd.concat.)
user_corr = pd.concat([corrMatrix[artist].dropna() * rating
                       for artist, rating in user_ratings.items()])

# Group by artist name and sum the scores to remove duplicates.
user_corr = user_corr.groupby(user_corr.index).sum()
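
To see what the scoring step computes, here is a minimal sketch on a hand-made 3x3 correlation matrix. All names and numbers (`toy_corr`, `toy_user`) are hypothetical. Each rated artist contributes its correlation column scaled by the user's rating, and contributions to the same artist are summed.

```python
import pandas as pd

# Hypothetical artist-artist correlations (symmetric), not the Last.fm values.
toy_corr = pd.DataFrame(
    [[1.0, 0.5, 0.2],
     [0.5, 1.0, 0.8],
     [0.2, 0.8, 1.0]],
    index=["A", "B", "C"], columns=["A", "B", "C"],
)

# Hypothetical ratings of one user: rated A and B, has not rated C.
toy_user = pd.Series({"A": 4.0, "B": 2.0})

# For each rated artist, scale its correlation column by the user's rating.
parts = [toy_corr[artist].dropna() * rating for artist, rating in toy_user.items()]

# Stack the pieces and sum the entries that share an artist name.
scores = pd.concat(parts).groupby(level=0).sum()

print(scores)
```

The score for C is 0.2 * 4 + 0.8 * 2 = 2.4: it is the only unrated artist, so it is the only candidate left once A and B are removed.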

#### Task 6
##### Find the list of artists that the user has already listened to and remove them.
##### Give the top 10 recommendations.

#### Result:

In [8]:
# Create a list of artists the user has already listened to and remove them.
listened = userRating.iloc[user_id].dropna().index
title_list = [artist for artist in listened if artist in user_corr.index]
user_corr = user_corr.drop(title_list)

In [9]:
print('Hi! These are the artists you have already listened to: \n')
for i in userRating.iloc[user_id].dropna().index:
    print(i)

Hi! These are the artists you have already listened to: 

3 Doors Down
3OH!3
AC/DC
Anarbor
Avril Lavigne
Backstreet Boys
Bon Jovi
Boys Like Girls
Breathe Carolina
Britney Spears
Bruno Mars
Bullet for My Valentine
Christina Aguilera
David Guetta
Enrique Iglesias
Evanescence
Forever the Sickest Kids
Fresno
Glee Cast
Gloria
Hilary Duff
Jessie J
Katy Perry
Ke$ha
Keri Hilson
Kerli
Kylie Minogue
Lady Gaga
Miley Cyrus
My Chemical Romance
New Found Glory
Nickelback
Nicki Minaj
Nicole Scherzinger
No Doubt
P!nk
Paramore
Pixie Lott
Rihanna
Runner Runner
Selena Gomez & the Scene
Simple Plan
Slipknot
Taio Cruz
Teen Hearts
The Maine
The Pretty Reckless
The Used
Vanessa Hudgens


In [10]:
for i in user_corr.sort_values(ascending = False).index[:10]:
    print(i)

System of a Down
Pitty
Fergie
Gwen Stefani
Alesha Dixon
Jordin Sparks
Owl City
Kings of Leon
La Roux
Goldfrapp
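
The filter-then-rank step can also be written without an explicit membership loop. The following is a sketch on made-up scores; `scores` and `already_heard` are hypothetical names. It relies on two pandas calls: `Series.drop` with `errors="ignore"`, and `Series.nlargest`.

```python
import pandas as pd

# Hypothetical summed scores per artist and the artists already heard.
scores = pd.Series({"A": 5.0, "B": 4.0, "C": 2.4, "D": 3.1})
already_heard = ["A", "B", "Z"]  # "Z" has no score at all

# errors="ignore" skips labels that are missing from the index.
candidates = scores.drop(labels=already_heard, errors="ignore")

# nlargest returns the n highest values in descending order.
top = candidates.nlargest(2)

for artist in top.index:
    print(artist)
```

Whether this reads better than the loop is a matter of taste. It should rank the same artists as sorting in descending order and slicing, although the order among tied scores may differ.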


## Exercise 2

#### Task 1
##### Find the total number of users and artists.
##### Transform our fm_ratings dataset using pivot_table where we have 1 row per user and 1 column per artist.

#### Result:

In [11]:
# Find total number of unique users and artists.
n_users = fm_ratings.userID.unique().shape[0]
n_artists = fm_ratings.artistID.unique().shape[0]
print('Number of users = ' + str(n_users) + ' | Number of artists = ' + str(n_artists))

Number of users = 1836 | Number of artists = 11065


In [12]:
Ratings = fm_ratings.pivot(index = 'userID',columns ='artistID', 
                           values = 'rating').fillna(0)

Ratings.head()

artistID,1,2,3,4,5,6,7,8,9,10,...,18719,18721,18722,18723,18724,18737,18739,18740,18741,18744
userID
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Task 2
##### Convert the pivot table into a matrix and find the sparsity percentage.

#### Result:

In [13]:
R = Ratings.to_numpy()

user_ratings_mean = np.mean(R, axis = 1)
Ratings_demeaned = R - user_ratings_mean.reshape(-1, 1)

# Check the percentage of sparsity.
sparsity = round(1.0 - len(fm_ratings) / float(n_users * n_artists), 3)
print('The sparsity level of lastfm dataset is ' +  str(sparsity * 100) + '%')

The sparsity level of lastfm dataset is 99.7%
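
A small worked example makes the two quantities in this cell concrete. The matrix below is hypothetical. Counting non-zero cells equals the number of ratings only under the assumption that no real rating is 0. As in the cell above, the row means are taken over all cells, including the 0 placeholders.

```python
import numpy as np

# Hypothetical 3 users x 4 artists matrix; 0 marks "no rating", as after fillna(0).
toy_R = np.array([
    [5.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 4.0, 0.0],
    [0.0, 2.0, 0.0, 0.0],
])

n_rated = np.count_nonzero(toy_R)          # 4 observed ratings
toy_sparsity = 1.0 - n_rated / toy_R.size  # share of empty cells

# Subtract each user's row mean; reshape(-1, 1) makes it broadcast per row.
row_means = toy_R.mean(axis=1)
toy_demeaned = toy_R - row_means.reshape(-1, 1)

print(round(toy_sparsity * 100, 1), "% of cells are empty")
print(toy_demeaned)
```

With 4 ratings in 12 cells the sparsity is 1 - 4/12, about 66.7%, and every de-meaned row sums to zero.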


#### Task 3
##### Fetch the first 50 latent features and return the list of artists the user has already rated.

#### Result:

In [14]:
U, sigma, Vt = svds(Ratings_demeaned, k = 50)

# Convert the sigma matrix into the diagonal matrix form.
sigma = np.diag(sigma)

In [15]:
all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) + user_ratings_mean.reshape(-1, 1)

In [16]:
preds = pd.DataFrame(all_user_predicted_ratings, columns = Ratings.columns)
preds.head()

artistID,1,2,3,4,5,6,7,8,9,10,...,18719,18721,18722,18723,18724,18737,18739,18740,18741,18744
0,0.007284,0.002137,0.005183,0.004242,0.00423,0.001238,-0.075847,-0.001194,-0.007429,0.005201,...,0.007793,0.007793,0.007793,0.007793,0.007793,0.01056,0.009471,0.009471,0.008927,0.008927
1,0.046128,0.105739,-0.020564,0.030255,-0.01039,-0.019251,-0.283393,0.185779,0.132054,0.023294,...,-0.005091,-0.005091,-0.005091,-0.005091,-0.005091,-0.009523,-0.008847,-0.008847,-0.008508,-0.008508
2,0.043913,-0.058395,0.006666,0.014857,-0.035397,0.036515,0.621666,0.033331,-0.071096,-0.071283,...,-0.006928,-0.006928,-0.006928,-0.006928,-0.006928,-0.015935,-0.013086,-0.013086,-0.011662,-0.011662
3,0.005726,0.000141,0.000836,0.001847,0.005236,-0.002575,-0.02177,0.005871,0.003249,-0.001754,...,0.003492,0.003492,0.003492,0.003492,0.003492,0.000979,0.001637,0.001637,0.001966,0.001966
4,-0.020761,-0.018145,0.010787,0.03729,-0.060241,0.042211,-0.141564,-0.03884,-0.103246,-0.016038,...,0.000691,0.000691,0.000691,0.000691,0.000691,0.007617,0.005339,0.005339,0.004201,0.004201
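
The shapes returned by `svds` and the reconstruction step are easier to check on a tiny matrix. This sketch builds a hypothetical rank-2 matrix from random numbers (nothing here comes from the Last.fm data) and keeps `k = 2` components, so the product of the three factors should recover the matrix up to numerical error.

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)

# Hypothetical rank-2 matrix: 6 "users" x 5 "artists".
toy = rng.normal(size=(6, 2)) @ rng.normal(size=(2, 5))

# k must be smaller than the smaller matrix dimension.
U, s, Vt = svds(toy, k=2)

# s is a 1-D array of singular values; np.diag turns it into a k x k matrix.
approx = U @ np.diag(s) @ Vt

print(U.shape, s.shape, Vt.shape)
print(np.allclose(approx, toy, atol=1e-6))
```

With fewer components than the true rank the product is only an approximation, which is what makes the predicted ratings differ from the observed ones. The SciPy documentation does not guarantee the order in which the singular values are returned, so sort them if the order matters.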


#### Task 4
##### The following function has been modified from the one we used in the module to recommend artists.
##### Use this function to recommend 20 new artists to the user with ID 400.

In [17]:
def recommend_songs(predictions, user, artists, original_ratings, num_recommendations):
    
    # Get and sort the user's predictions.
    # Note: (user - 1) is used as the positional row in `predictions`, which
    # matches the user only if userIDs are consecutive and start at 1.
    user_row_number = user - 1
    sorted_user_predictions = predictions.iloc[user_row_number].sort_values(ascending=False)
    
    # Get the user's data and merge in the artist information.
    user_data = original_ratings[original_ratings.userID == (user)]
    user_full = (user_data.merge(artists, how = 'left', left_on = 'artistID', right_on = 'artistID').
                     sort_values(['rating'], ascending=False)
                 )

    print('User {0} has already rated {1} artists.'.format(user, user_full.shape[0]))
    print('Recommending highest {0} predicted rating artists not already rated.'.format(num_recommendations))
    
    # Recommend the highest predicted rating artists that the user hasn't listened to yet.
    recommendations = (artists[~artists['artistID'].isin(user_full['artistID'])].
         merge(pd.DataFrame(sorted_user_predictions).reset_index(), how = 'left',
               left_on = 'artistID',
               right_on = 'artistID').
         rename(columns = {user_row_number: 'Predictions'}).
         sort_values('Predictions', ascending = False).
                       iloc[:num_recommendations, :-1]
                      )

    return user_full, recommendations

#### Result:

In [18]:
already_rated, predictions = recommend_songs(preds, 400, fm_artists, fm_ratings, 20)

User 400 has already rated 46 artists.
Recommending highest 20 predicted rating artists not already rated.


In [19]:
# Artists that User 400 has already rated, sorted by rating (highest first).
already_rated

,userID,artistID,rating,artist_name_x,genre_x,artist_name_y,genre_y
16,400,333,9.0,Avril Lavigne,pop| rock| pop rock| pop| female vocalists| ho...,Avril Lavigne,pop| rock| pop rock| pop| female vocalists| ho...
14,400,318,8.0,Hilary Duff,pop| dance| rock| electro pop| disney| pop| di...,Hilary Duff,pop| dance| rock| electro pop| disney| pop| di...
9,400,300,8.0,Katy Perry,pop| pop rock| alternative rock| electro pop| ...,Katy Perry,pop| pop rock| alternative rock| electro pop| ...
37,400,1458,8.0,Miranda Cosgrove,pop| female vocalists| electro pop| teen pop| ...,Miranda Cosgrove,pop| female vocalists| electro pop| teen pop| ...
27,400,686,7.0,Selena Gomez & the Scene,electro pop| disney| stand out be proud| pop| ...,Selena Gomez & the Scene,electro pop| disney| stand out be proud| pop| ...
4,400,288,6.0,Rihanna,seen live| hit| pop| dance| rnb| pop| dance| e...,Rihanna,seen live| hit| pop| dance| rnb| pop| dance| e...
28,400,701,6.0,Shakira,specials to 3mmey| pop| rock| copa| latin| pop...,Shakira,specials to 3mmey| pop| rock| copa| latin| pop...
34,400,1037,5.0,Nicki Minaj,pop| dance| hip-hop| rap| female vocalists| rn...,Nicki Minaj,pop| dance| hip-hop| rap| female vocalists| rn...
11,400,306,5.0,Black Eyed Peas,pop| hip-hop| rap| rnb| electronic| pop| hip-h...,Black Eyed Peas,pop| hip-hop| rap| rnb| electronic| pop| hip-h...
2,400,89,4.0,Lady Gaga,pop| electronic| pop| dance| electronic| pop| ...,Lady Gaga,pop| electronic| pop| dance| electronic| pop| ...


In [20]:
predictions

,artistID,artist_name,genre
137,229,The Killers,rock| rock| alternative rock| alternative| ind...
13,65,Coldplay,lastfm elitist repellent| ballad| pissbass| al...
136,228,Kings of Leon,rock| rock| indie rock| awesome| southern rock...
777,982,Foo Fighters,favorite| rock| alternative rock| electronica|...
146,238,Massive Attack,electronic| electronica| trip-hop| instrumenta...
134,226,Queens of the Stone Age,stoner rock| hard rock| rock| alternative rock...
884,1098,Björk,electronic| alternative| indie| electronic| al...
109,199,Arcade Fire,indie| good music| indie rock| marisa mix| alt...
19,72,Depeche Mode,electronic| electronic| industrial| new wave| ...
12,64,Röyksopp,chillout| electronic| dance| norwegian| chillo...
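
The function locates a user's row as `user - 1`, while the `Ratings.head()` output earlier shows userIDs starting at 3. The sketch below uses a hypothetical three-user table to compare that offset with a lookup by label through `Index.get_loc`. Whether the two agree on the real data depends on how its userIDs are numbered, which is not shown here.

```python
import pandas as pd

# Hypothetical pivoted ratings whose userIDs start at 3 and skip values.
toy_ratings = pd.DataFrame(
    [[5.0, 0.0], [0.0, 3.0], [4.0, 1.0]],
    index=pd.Index([3, 4, 7], name="userID"),
    columns=pd.Index([10, 11], name="artistID"),
)

# A predictions frame built from a bare array gets a fresh 0..n-1 index.
toy_preds = pd.DataFrame(toy_ratings.to_numpy(), columns=toy_ratings.columns)

user = 7
row_by_label = toy_ratings.index.get_loc(user)  # position of userID 7 in the pivot
row_by_offset = user - 1                         # assumes IDs are 1, 2, 3, ...

user_preds = toy_preds.iloc[row_by_label]

print(row_by_label, row_by_offset)
print(user_preds)
```

Here the label lookup gives row 2, while the offset gives 6, which is past the end of the table. An alternative design is to build the predictions frame with `index=Ratings.index` and select rows with `.loc`.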


#### Task 5
##### Create a `Reader` object and use it to load the `fm_ratings` dataset.

#### Result:

In [21]:
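# Create a Reader. Note: Reader() defaults to rating_scale=(1, 5), while ratings in
# this dataset go higher (e.g. 9.0 above), so estimates are clipped to that range
# unless rating_scale is set explicitly.
reader = Reader()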

# Load ratings dataset with Dataset library.
data = Dataset.load_from_df(fm_ratings[['userID', 'artistID', 'rating']], reader)


#### Task 6
##### Build the SVD algorithm and evaluate it on our data using the RMSE metric with 5-fold cross-validation.
##### What are the RMSE values?
##### Build the full training set and fit the algorithm on it.

#### Result:

In [22]:
# Use the SVD algorithm.
svd = SVD()

# Compute the RMSE of the SVD algorithm.
evaluate_model = cross_validate(svd, data, measures=['RMSE'], cv=5, verbose=True)

Evaluating RMSE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.7423  1.7538  1.7487  1.7647  1.7658  1.7551  0.0091  
Fit time          2.96    2.80    3.00    2.91    3.01    2.94    0.08    
Test time         0.08    0.12    0.09    0.13    0.09    0.10    0.02    
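
To make the table above less of a black box, here is a sketch of what "RMSE over 5 folds" means, written with NumPy only. The ratings are random, and the "model" is a stand-in that predicts the training mean. Neither is Surprise's SVD, and the `rmse` helper is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ratings on a 1-5 scale.
ratings = rng.integers(1, 6, size=100).astype(float)

def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized arrays."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Shuffle the positions and cut them into 5 folds.
folds = np.array_split(rng.permutation(len(ratings)), 5)

fold_rmse = []
for i, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # Baseline stand-in for a model: predict the training mean everywhere.
    prediction = np.full(len(test_idx), ratings[train_idx].mean())
    fold_rmse.append(rmse(ratings[test_idx], prediction))

mean_rmse = float(np.mean(fold_rmse))
std_rmse = float(np.std(fold_rmse))
print(fold_rmse, mean_rmse, std_rmse)
```

Each fold is held out once while the other four are used for fitting. The Mean and Std columns in the output above summarize the five test-set errors in the same way.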


In [23]:
trainset = data.build_full_trainset()
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7fe8a10fc6a0>

#### Task 7
##### Using our algorithm, predict what would be the rating given by user 1200 to artist ID 400.

#### Result:

In [24]:
# Ratings that user 1200 has already given.
fm_ratings[fm_ratings['userID'] == 1200].head() 

,userID,artistID,rating,artist_name,genre
902,1200,65,4.0,Coldplay,lastfm elitist repellent| ballad| pissbass| al...
1564,1200,72,5.0,Depeche Mode,electronic| electronic| industrial| new wave| ...
1946,1200,88,3.0,Gorillaz,electronic| rock| alternative| indie| hip-hop|...
2844,1200,154,3.0,Radiohead,alternative| winter| electronic| rock| alterna...
3175,1200,157,3.0,Michael Jackson,pop| pop| legend| king of pop| the king of pop...


In [25]:
svd.predict(1200, 400)

Prediction(uid=1200, iid=400, r_ui=None, est=4.632766232680058, details={'was_impossible': False})
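
Surprise's `SVD` estimates a rating as the global mean, plus a user bias and an item bias, plus the dot product of the user and item factor vectors. By default, `predict` then clips the result to the reader's rating scale. The numbers below are hypothetical and only illustrate that arithmetic; they are not the fitted parameters of the model above.

```python
import numpy as np

# All values below are hypothetical, chosen only to illustrate the arithmetic.
global_mean = 3.5
b_u, b_i = 0.4, 0.9               # user bias and item bias
p_u = np.array([0.5, -0.2, 1.0])  # user factors
q_i = np.array([0.8, 0.1, 0.6])   # item factors

# Unclipped estimate: mean + biases + factor interaction.
raw = global_mean + b_u + b_i + float(q_i @ p_u)

# Clip to the rating scale (1 to 5 is the Reader default).
low, high = 1.0, 5.0
est = float(np.clip(raw, low, high))

print(raw, est)
```

In this example the raw estimate of 5.78 is clipped to 5.0. If the true ratings extend beyond the reader's scale, estimates near the upper bound may be hiding larger raw values.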