# RECOMMENDATION ENGINES - Group Assignment

### *Group E* - Amos M., Li Z., Pablo B., Sarang Z., Vania C.

## Amazon product data

The dataset contains user reviews (a numerical rating and a text comment) of Amazon products in the **Digital Music** category.

## 1. Read Data

In [1]:
#pwd

In [1]:
#Load the reviews

import os
import json
import gzip
import pandas as pd
import numpy as np

data = []
with gzip.open(r'C:\Master\Recommendation Engines\Group Assignment\reviews_Digital_Music_5.json.gz') as f:
# with gzip.open(r'../reviews_Digital_Music_5.json.gz') as f:
    for l in f:
        data.append(json.loads(l.strip()))
       
    reviewsdf = pd.DataFrame.from_dict(data)   

In [2]:
# Load Surprise libraries
from surprise import Reader
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import train_test_split
from surprise.model_selection import cross_validate
from surprise.model_selection import GridSearchCV

# Algos
from surprise import NormalPredictor
from surprise import BaselineOnly
from surprise import KNNBasic
from surprise import KNNWithMeans
from surprise import KNNWithZScore
from surprise import KNNBaseline
from surprise import SVD
from surprise import SVDpp
from surprise import NMF
from surprise import SlopeOne
from surprise import CoClustering

In [3]:
reviewsdf.head(1)

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
0,A3EBHHCZO6V2A4,5555991584,"Amaranth ""music fan""","[3, 3]","It's hard to believe ""Memory of Trees"" came ou...",5.0,Enya's last great album,1158019200,"09 12, 2006"


In [4]:
from datetime import datetime
from datetime import timedelta

# Create the date object and sort the reviews by time for each user
def parse_date(x):
    new_date = datetime.strptime(x, "%m %d, %Y")
    return new_date
    
reviewsdf['parse_review_time'] = reviewsdf.reviewTime.apply(parse_date)

reviewsdf = reviewsdf.groupby(["reviewerID"]).apply(lambda x: x.sort_values(["parse_review_time"], ascending = True)).reset_index(drop=True)


In [5]:
ratingsdf = reviewsdf[['reviewerID', 'asin', 'overall']]
ratingsdf = ratingsdf.rename({'reviewerID':'userID', 'asin':'itemID', 'overall':'rating'}, axis='columns')

In [6]:
print('Dataset shape: {}'.format(ratingsdf.shape))
print('-Dataset examples-')
print(ratingsdf.iloc[::20000, :])

Dataset shape: (64706, 3)
-Dataset examples-
                     userID      itemID  rating
0      A08161909WK3HU7UYTMW  B0041WLBEC     4.0
20000        A25NAXA5TGI078  B00000JXSA     5.0
40000         A3CFB16J8HWHG  B006OITIWS     5.0
60000         AP18F4HV5L3BG  B00005PJFV     4.0


## 2. Analyze the data

### a) Ratings distribution

In [7]:
from plotly.offline import init_notebook_mode, plot, iplot
import plotly.graph_objs as go
init_notebook_mode(connected=True)

# Count the number of times each rating appears in the dataset
data = ratingsdf['rating'].value_counts().sort_index(ascending=False)

# Create the histogram
trace = go.Bar(x = data.index,
               text = ['{:.1f} %'.format(val) for val in (data.values / ratingsdf.shape[0] * 100)],
               textposition = 'auto',
               textfont = dict(color = '#000000'),
               y = data.values,
               )
# Create layout
layout = dict(title = 'Distribution Of {} album ratings'.format(ratingsdf.shape[0]),
              xaxis = dict(title = 'Rating'),
              yaxis = dict(title = 'Count'))
# Create plot
fig = go.Figure(data=[trace], layout=layout)
iplot(fig)

- The rating scale is 1-5.
- The distribution of ratings is skewed towards high ratings.
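The skew can also be quantified directly rather than read off the chart. The sketch below uses a small hypothetical rating series (`toy_ratings` is made up for illustration); in the notebook the same calls would be applied to `ratingsdf['rating']`.

```python
import pandas as pd

# Hypothetical toy ratings; the notebook's real column is ratingsdf['rating']
toy_ratings = pd.Series([5.0, 5.0, 4.0, 5.0, 3.0, 1.0, 4.0, 5.0])

# Share of each rating value in percent, ordered from 5 stars down to 1
share = toy_ratings.value_counts(normalize=True).sort_index(ascending=False) * 100

# Combined share of 4- and 5-star ratings
high_share = share[share.index >= 4].sum()

print(share.round(1))
print("4-5 star share: {:.1f} %".format(high_share))
```

A large combined share for 4 and 5 stars means that even a constant "predict the mean" model will look reasonable on RMSE, which is worth keeping in mind when comparing algorithms later.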

### b) Numbers of ratings per album

In [8]:
# Number of ratings per album
data = ratingsdf.groupby('itemID')['rating'].count()

# Create trace
trace = go.Histogram(x = data.values,
                     name = 'Ratings',
                     xbins = dict(start = 0,size = 2))
# Create layout
layout = go.Layout(title = 'Distribution Of Number of Ratings Per Album',
                   xaxis = dict(title = 'Number of Ratings Per Album'),
                   yaxis = dict(title = 'Count'),
                   bargap = 0.2)

# Create plot
fig = go.Figure(data=[trace], layout=layout)
iplot(fig)

In [11]:
min(data)

5

* Most albums received fewer than 10 ratings
* Each album has at least 5 ratings

### c) Number of ratings per user

In [9]:
# Number of ratings per user
data = ratingsdf.groupby('userID')['rating'].count()

# Create trace
trace = go.Histogram(x = data.values,
                     name = 'Ratings',
                     xbins = dict(start = 0, size = 2))
# Create layout
layout = go.Layout(title = 'Distribution Of Number of Ratings Per User',
                   xaxis = dict(title = 'Ratings Per User'),
                   yaxis = dict(title = 'Count'),
                   bargap = 0.2)

# Create plot
fig = go.Figure(data=[trace], layout=layout)
iplot(fig)

In [10]:
min(data)

5

* Most users rated fewer than 7 albums.
* All users have rated at least 5 albums, which should satisfy the minimum number of ratings per user.

### d) Are there more users or more albums?

In [12]:
print('There are {} unique users in the dataset'.format(len(pd.unique(ratingsdf.userID))))
print('There are {} unique albums in the dataset'.format(len(pd.unique(ratingsdf.itemID))))

There are 5541 unique users in the dataset
There are 3568 unique albums in the dataset


There are more users than items in the dataset, so we could focus on an item-item CF method.
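To put the user and item counts in context, the density of the user-item matrix and the average number of ratings per row and per column can be derived from the figures printed above. This is only a back-of-the-envelope sketch; the variable names are chosen for illustration.

```python
# Figures printed by the notebook cells above
n_ratings = 64706
n_users = 5541
n_items = 3568

# Share of the user-item matrix that actually holds a rating
density = n_ratings / (n_users * n_items)

# Average number of ratings behind each user (row) and each item (column)
ratings_per_user = n_ratings / n_users
ratings_per_item = n_ratings / n_items

print("Density: {:.2%}".format(density))
print("Ratings per user: {:.1f}".format(ratings_per_user))
print("Ratings per item: {:.1f}".format(ratings_per_item))
```

With fewer items than users, each item vector carries more ratings on average than each user vector, which is one common argument for computing similarities between items.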

## 3. Data selection and preprocessing

Two train/test split methods were tried: a random split and a split based on review time. The split based on review time was implemented, since the predictions are aimed at future interest.
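As an illustration of the time-based idea, here is a vectorised sketch that avoids looping over users. It is not the cell used in this notebook: the function name `time_based_split`, the `time` column and the toy data are assumptions made for the example.

```python
import numpy as np
import pandas as pd

def time_based_split(df, user_col="userID", time_col="time", train_ratio=0.8):
    """Per user, keep the oldest share of reviews for training and the rest for testing."""
    ordered = df.sort_values([user_col, time_col])
    # Position of each review within its user's history (0 = oldest)
    position = ordered.groupby(user_col).cumcount()
    # Number of reviews per user, broadcast back to every row
    n_reviews = ordered.groupby(user_col)[user_col].transform("count")
    n_train = np.round(n_reviews * train_ratio)
    is_train = position < n_train
    return ordered[is_train], ordered[~is_train]

# Hypothetical toy data: two users with five reviews each
toy = pd.DataFrame({
    "userID": ["u1"] * 5 + ["u2"] * 5,
    "itemID": list("abcdefghij"),
    "rating": [5, 4, 5, 3, 4, 2, 5, 5, 4, 1],
    "time": pd.to_datetime(["2014-01-0{}".format(d) for d in [3, 1, 5, 2, 4]] * 2),
})

train, test = time_based_split(toy)
print(len(train), len(test))
print(test[["userID", "itemID", "time"]])
```

Each user's most recent reviews end up in the test set, so the model is always evaluated on ratings that come after the ones it was trained on.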

In [117]:
# Random train / test split

# train_set = pd.DataFrame(columns=('userID','itemID','rating'))
# test_set = pd.DataFrame(columns=('userID','itemID','rating'))

# train_ratio = 0.8
# users = ratingsdf.userID.unique()

# for i in range(len(users)):
#     temp = ratingsdf[ratingsdf.userID == users[i]]
#     split_cut = np.int(np.round(len(temp) * train_ratio))
#     train = temp.sample(split_cut)
#     test = temp[~temp.index.isin(train.index)]

#     train_set = train_set.append(train)
#     test_set = test_set.append(test)

In [13]:
# Review time based train/test split

train_set = pd.DataFrame(columns=('userID','itemID','rating'))
test_set = pd.DataFrame(columns=('userID','itemID','rating'))

train_ratio = 0.8
users = ratingsdf.userID.unique()

# Since the dataframe has been sorted by review time, the training set holds each user's older reviews
# and the test set holds the most recent ones

train_parts = []
test_parts = []

for user in users:
    temp = ratingsdf[ratingsdf.userID == user]
    # np.int was removed in NumPy 1.24; the built-in int does the same job
    split_cut = int(np.round(len(temp) * train_ratio))
    train_parts.append(temp.iloc[:split_cut])
    test_parts.append(temp.iloc[split_cut:])

# DataFrame.append was removed in pandas 2.0; concatenate once instead
train_set = pd.concat(train_parts)
test_set = pd.concat(test_parts)


In [14]:
print("The training set has {} rows and {} columns".format(train_set.shape[0],train_set.shape[1]))
print("The test set has {} rows and {} columns".format(test_set.shape[0],test_set.shape[1]))

The training set has 51999 rows and 3 columns
The test set has 12707 rows and 3 columns


In [15]:
# Read the user-item matrix for train and test set 

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(train_set[['userID','itemID', 'rating']], reader)
test_data = Dataset.load_from_df(test_set[['userID','itemID', 'rating']], reader)

## 4. Collaborative Filtering Recommender System

This section builds a Collaborative Filtering model on the train set and predicts on the test set, using the Surprise library. The results are evaluated by RMSE, both in cross-validation and on the test-set predictions.
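For reference, the two error measures reported in the tables below can be written out in a few lines. This is a plain-Python sketch on hypothetical (true rating, estimate) pairs, not the Surprise implementation.

```python
import math

def rmse(pairs):
    """Root mean squared error over (true_rating, estimate) pairs."""
    return math.sqrt(sum((r - est) ** 2 for r, est in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (true_rating, estimate) pairs."""
    return sum(abs(r - est) for r, est in pairs) / len(pairs)

# Hypothetical predictions: two small misses and one large one
toy_pairs = [(5.0, 4.5), (4.0, 3.5), (1.0, 3.0)]

print("RMSE: {:.4f}".format(rmse(toy_pairs)))
print("MAE:  {:.4f}".format(mae(toy_pairs)))
```

RMSE squares the errors before averaging, so the single large miss weighs more heavily than it does in MAE. This is why the two measures can rank algorithms differently.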

### a. Start from benchmarking all the algorithms

In [16]:
benchmark = []
# Iterate over all algorithms
for algorithm in [NormalPredictor(), BaselineOnly(), KNNBasic(), KNNWithMeans(), KNNWithZScore(), KNNBaseline(), SVD(), SVDpp(), NMF(), SlopeOne(), CoClustering()]:
    
    print("Testing {}".format(algorithm))
    # Perform cross validation
    results = cross_validate(algorithm, data, measures=["rmse", "mae"], cv=3, verbose=False) 
    
    # Get results & append algorithm name
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    tmp = pd.concat([tmp, pd.Series([str(algorithm).split(' ')[0].split('.')[-1]], index=['Algorithm'])])
    benchmark.append(tmp)
    
pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse')  

Testing <surprise.prediction_algorithms.random_pred.NormalPredictor object at 0x000002B7789B2708>
Testing <surprise.prediction_algorithms.baseline_only.BaselineOnly object at 0x000002B7789B2348>
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Testing <surprise.prediction_algorithms.knns.KNNBasic object at 0x000002B7789B2488>
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Testing <surprise.prediction_algorithms.knns.KNNWithMeans object at 0x000002B7789B23C8>
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Testing <surprise.prediction_algorithms.knns.KNNWithZScore object at 0x000002B7789B24C8>
Computing the

Algorithm,test_rmse,test_mae,fit_time,test_time
SVDpp,0.926874,0.687193,10.272077,0.453483
SVD,0.938544,0.703616,1.630323,0.119481
BaselineOnly,0.94302,0.72042,0.061834,0.055518
KNNBaseline,1.000143,0.719467,0.675816,0.621912
CoClustering,1.016788,0.698612,1.152195,0.066157
KNNWithMeans,1.026834,0.707292,0.669882,0.522265
KNNWithZScore,1.044924,0.710567,0.801681,0.596149
SlopeOne,1.050363,0.729746,0.292965,0.255508
KNNBasic,1.082201,0.776066,0.653243,0.578467
NMF,1.103948,0.835922,2.056582,0.100101


 - Matrix factorization algorithms provide the best results.
 - The baseline model provides very good performance too. 
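To make the baseline model less of a black box, the sketch below estimates a global mean plus regularised user and item biases with alternating updates. It follows our understanding of the ALS-style baseline described in the Surprise documentation, but it is an illustrative simplification: the function names, the toy ratings and the default regularisation values are assumptions, not the library's code.

```python
def fit_baselines(ratings, n_epochs=3, reg_i=4, reg_u=5):
    """ratings: list of (user, item, rating) tuples. Returns (mu, b_u, b_i)."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    b_u = {u: 0.0 for u, _, _ in ratings}
    b_i = {i: 0.0 for _, i, _ in ratings}
    for _ in range(n_epochs):
        # Update item biases while holding user biases fixed
        for item in b_i:
            devs = [r - mu - b_u[u] for u, i, r in ratings if i == item]
            b_i[item] = sum(devs) / (reg_i + len(devs))
        # Update user biases while holding item biases fixed
        for user in b_u:
            devs = [r - mu - b_i[i] for u, i, r in ratings if u == user]
            b_u[user] = sum(devs) / (reg_u + len(devs))
    return mu, b_u, b_i

def predict(mu, b_u, b_i, user, item, low=1.0, high=5.0):
    """Baseline estimate mu + b_u + b_i, clipped to the rating scale."""
    est = mu + b_u.get(user, 0.0) + b_i.get(item, 0.0)
    return min(high, max(low, est))

# Hypothetical toy ratings: user "a" rates high, user "b" rates low
toy = [("a", "x", 5.0), ("a", "y", 5.0), ("b", "x", 1.0), ("b", "y", 1.0)]
mu, b_u, b_i = fit_baselines(toy)
print(mu, b_u, b_i)
print(predict(mu, b_u, b_i, "a", "x"))
print(predict(mu, b_u, b_i, "unknown_user", "unknown_item"))
```

Because the model only needs one number per user and one per item, it is cheap to fit, and an unseen user or item simply falls back to the global mean.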

Before disregarding the **k-NN** inspired algorithms, let's try them again, this time computing the similarities between **items**.
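The log output mentions the `msd` similarity. As a rough sketch of what an item-item version of it computes (assuming the mean-squared-difference definition given in the Surprise documentation; the function and the toy albums are illustrative only):

```python
def msd_item_similarity(ratings_a, ratings_b):
    """Similarity between two items, each given as a dict of userID -> rating."""
    common = set(ratings_a) & set(ratings_b)
    if not common:
        # No user rated both items, so nothing can be compared
        return 0.0
    # Mean squared difference over the users who rated both items
    msd = sum((ratings_a[u] - ratings_b[u]) ** 2 for u in common) / len(common)
    # Map the distance to a similarity in (0, 1]
    return 1.0 / (msd + 1.0)

# Hypothetical albums rated by a handful of users
album_x = {"u1": 5.0, "u2": 4.0, "u3": 1.0}
album_y = {"u1": 5.0, "u2": 4.0, "u4": 2.0}
album_z = {"u1": 1.0, "u2": 2.0}

print(msd_item_similarity(album_x, album_y))
print(msd_item_similarity(album_x, album_z))
```

Only users who rated both albums contribute, which is why the `min_support` option tuned later matters: similarities built on very few common users are unreliable.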

In [17]:
#KNN algorithms item-based
benchmark2 = []
# Iterate over all algorithms
for algorithm in [KNNBasic(sim_options={"user_based": False}), KNNWithMeans(sim_options={"user_based": False}), KNNWithZScore(sim_options={"user_based": False}), KNNBaseline(sim_options={"user_based": False})]:
    
    print("Testing {}".format(algorithm))
    # Perform cross validation
    results = cross_validate(algorithm, data, measures=["rmse", "mae"], cv=3, verbose=False)
    
    
    # Get results & append algorithm name
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    tmp = pd.concat([tmp, pd.Series([str(algorithm).split(' ')[0].split('.')[-1]], index=['Algorithm'])])
    benchmark2.append(tmp)
    
pd.DataFrame(benchmark2).set_index('Algorithm').sort_values('test_rmse')

Testing <surprise.prediction_algorithms.knns.KNNBasic object at 0x000002B778C75888>
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Testing <surprise.prediction_algorithms.knns.KNNWithMeans object at 0x000002B778C75808>
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Testing <surprise.prediction_algorithms.knns.KNNWithZScore object at 0x000002B778C756C8>
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Testing <surprise.prediction_algorithms.knns.KNNBaseline object at 0x000002B778C75

Algorithm,test_rmse,test_mae,fit_time,test_time
KNNBaseline,0.985761,0.691273,0.340746,0.527267
KNNWithMeans,1.010046,0.703256,0.305358,0.414911
KNNWithZScore,1.021268,0.703723,0.394436,0.482056
KNNBasic,1.071707,0.735543,0.268939,0.426537


The **item-based** approach performs better. Among the KNN algorithms, **KNNBaseline** seems to be the best.

Before deciding to go with the Matrix factorization algorithms we will **tune** the **BaselineOnly** and **KNNBaseline** approaches to use them as benchmarks.

### b. Parameter tuning

In [18]:
# Tuning BaselineOnly

bsl_options = {
    "n_epochs": [1, 2, 3],
    "reg_i": [3, 4, 5],
    "reg_u": [5, 7, 10]
}

param_grid = {"bsl_options": bsl_options}

gs = GridSearchCV(BaselineOnly, param_grid, measures=["rmse", "mae"], cv=3)
gs.fit(data)


print(gs.best_score["rmse"])
print(gs.best_params["rmse"])

Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimati

In [19]:
# Tuning KNNBaseline

sim_options = {
    "name": ["msd", "cosine"],
    "min_support": [20, 21, 22],
    "user_based": [False, True],
}

param_grid = {"sim_options": sim_options}

gs = GridSearchCV(KNNBaseline, param_grid, measures=["rmse", "mae"], cv=3)
gs.fit(data)

print(gs.best_score["rmse"])
print(gs.best_params["rmse"])

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matr

In [21]:
# Tuning SVDpp

from surprise.model_selection import GridSearchCV

param_grid = {'n_factors':[10, 15, 20, 30],'n_epochs':[16, 18, 20, 24],  'lr_all':[0.01, 0.015, 0.02, 0.04],'reg_all':[0.1, 0.15, 0.2, 0.4]}

gs = GridSearchCV(SVDpp, param_grid, measures=['rmse'], cv=3)
gs.fit(data)

print(gs.best_score["rmse"])
print(gs.best_params["rmse"])

0.919500912964848
{'n_factors': 20, 'n_epochs': 18, 'lr_all': 0.015, 'reg_all': 0.15}
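An exhaustive grid grows multiplicatively with every parameter added, so it helps to count the model fits before launching a search. The sketch below enumerates the grid used above with `itertools.product`; it only counts combinations and does not run Surprise.

```python
from itertools import product

# Same grid as in the SVDpp tuning cell
param_grid = {
    "n_factors": [10, 15, 20, 30],
    "n_epochs": [16, 18, 20, 24],
    "lr_all": [0.01, 0.015, 0.02, 0.04],
    "reg_all": [0.1, 0.15, 0.2, 0.4],
}
cv = 3

# Every combination of one value per parameter
combinations = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

# Each combination is fitted once per cross-validation fold
n_fits = len(combinations) * cv

print("Combinations:", len(combinations))
print("Model fits:  ", n_fits)
```

At roughly 10 seconds per SVDpp fit in the benchmark table, a search of this size is likely to take on the order of a couple of hours, which is a rough estimate only since fit time varies with `n_factors` and `n_epochs`.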


**In summary:**

In [22]:
benchmark3 = []
# Iterate over all algorithms
for algorithm in [KNNBaseline(sim_options={"name":  "cosine", "min_support": 20, "user_based": False}),
                  BaselineOnly({"n_epochs": 3, "reg_i": 4, "reg_u": 5}),
                  SVDpp(n_factors=20, n_epochs=18, lr_all= 0.015, reg_all=0.15)]:
    
    print("Testing {}".format(algorithm))
    # Perform cross validation
    results = cross_validate(algorithm, data, measures=["rmse", "mae"], cv=3, verbose=False)
    
    
    # Get results & append algorithm name
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    tmp = pd.concat([tmp, pd.Series([str(algorithm).split(' ')[0].split('.')[-1]], index=['Algorithm'])])
    benchmark3.append(tmp)
    
pd.DataFrame(benchmark3).set_index('Algorithm').sort_values('test_rmse')

Testing <surprise.prediction_algorithms.knns.KNNBaseline object at 0x000002B77FC33DC8>
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the cosine similarity matrix...
Done computing similarity matrix.
Testing <surprise.prediction_algorithms.baseline_only.BaselineOnly object at 0x000002B77FC33448>
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Testing <surprise.prediction_algorithms.matrix_factorization.SVDpp object at 0x000002B77FC33F48>


Algorithm,test_rmse,test_mae,fit_time,test_time
SVDpp,0.920526,0.677143,9.072074,0.457182
BaselineOnly,0.920961,0.686394,0.040558,0.087442
KNNBaseline,0.943116,0.718372,0.456128,0.440845


The tuned **SVDpp** has a slightly smaller RMSE than the **BaselineOnly** algorithm; however, the latter is much faster. Therefore, **BaselineOnly** will be used to predict the test set.

### c. Prediction on test set

In [23]:
from surprise import accuracy

# BaselineOnly with tuned parameters

blo = BaselineOnly({"n_epochs": 3, "reg_i": 4, "reg_u": 5})
blo.fit(data.build_full_trainset())

test_set = test_data.build_full_trainset()
predictions = blo.test(test_set.build_testset())

accuracy.rmse(predictions)

Estimating biases using als...
RMSE: 0.9208


0.9207844197992255

In [24]:
predictions[:5]

[Prediction(uid='A08161909WK3HU7UYTMW', iid='B00000053B', r_ui=5.0, est=4.528058605392337, details={'was_impossible': False}),
 Prediction(uid='A1020L7BWW9RAX', iid='B0006ZQ9BS', r_ui=4.0, est=3.601264775181595, details={'was_impossible': False}),
 Prediction(uid='A10323WWTFPSGP', iid='B00006690F', r_ui=5.0, est=4.006148588829333, details={'was_impossible': False}),
 Prediction(uid='A103KNDW8GN92L', iid='B000002PBH', r_ui=5.0, est=4.664262337815387, details={'was_impossible': False}),
 Prediction(uid='A103KNDW8GN92L', iid='B00000IPAC', r_ui=5.0, est=4.952900777657577, details={'was_impossible': False})]

**Inspect Results**

In [25]:
from collections import defaultdict


def get_top_n(predictions, n=10):
    """Return the top-N recommendation for each user from a set of predictions.
    reference: https://surprise.readthedocs.io/en/stable/FAQ.html#how-to-get-the-top-n-recommendations-for-each-user

    """

    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, r_ui, est,_ in predictions:
        top_n[uid].append((iid, est))
        
    # Then sort the predictions for each user and retrieve the k highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n

In [26]:
top_n = get_top_n(predictions, n=5)

# Print the recommended items for each user

user_ls = []
item_ls = []
for uid, user_ratings in top_n.items():
    user_ls.append(uid)
    item_ls.append([iid for (iid, _) in user_ratings])
#     print(uid, [iid for (iid, _) in user_ratings])

In [27]:
data = {'UserID':user_ls, 'Recommended Items': item_ls}
final_predictions_cf = pd.DataFrame(data,columns=['UserID','Recommended Items'])
final_predictions_cf.head()

Unnamed: 0,UserID,Recommended Items
0,A08161909WK3HU7UYTMW,[B00000053B]
1,A1020L7BWW9RAX,[B0006ZQ9BS]
2,A10323WWTFPSGP,[B00006690F]
3,A103KNDW8GN92L,"[B00000IPAC, B00000253N, B000000HRP, B000002PB..."
4,A103W7ZPKGOCC9,"[B000002OME, B000001EOG, B000002KH1, B000001E5..."


## 5. Content-based Recommender System

A topic modelling strategy is used to create the content-based recommender system:

1. Use only the positive reviews (ratings of 3 or higher)
2. Extract and pre-process the review texts: remove stopwords and special characters, lemmatize, etc.
3. Generate item-item similarities using the LSI algorithm provided by the Gensim library; the input texts are first transformed into a TF-IDF weighted corpus

reference: https://radimrehurek.com/gensim/auto_examples/core/run_topics_and_transformations.html#available-transformations
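The last step of the strategy can be pictured with a small NumPy stand-in for what LSI does: a truncated SVD of the TF-IDF matrix, followed by cosine similarity between albums in the reduced space. This is a sketch under stated assumptions (a hypothetical 3-album, 4-term matrix and a helper named `lsi_item_similarity`); it is not Gensim's `LsiModel`.

```python
import numpy as np

# Hypothetical TF-IDF matrix: rows are albums, columns are terms
tfidf_matrix = np.array([
    [0.9, 0.4, 0.0, 0.0],   # album 0 uses terms 0 and 1
    [0.8, 0.6, 0.0, 0.0],   # album 1 uses the same terms
    [0.0, 0.0, 0.7, 0.7],   # album 2 uses different terms
])

def lsi_item_similarity(matrix, num_topics=2):
    """Project items onto the strongest latent topics and compare them by cosine."""
    # Truncated SVD: keep the num_topics largest singular values
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    latent = u[:, :num_topics] * s[:num_topics]
    # Normalise each item vector, guarding against all-zero rows
    norms = np.linalg.norm(latent, axis=1, keepdims=True)
    unit = latent / np.where(norms == 0, 1.0, norms)
    # Cosine similarity between every pair of items
    return unit @ unit.T

sim = lsi_item_similarity(tfidf_matrix)
print(np.round(sim, 3))
```

Albums whose reviews share vocabulary land close together in the latent space, so for a given album the most similar rows of `sim` are natural content-based recommendations.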


In [28]:
reviewsdf.head()

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime,parse_review_time
0,A08161909WK3HU7UYTMW,B0041WLBEC,"Jody G. Collyer ""Jody G Collyer""","[0, 0]",DID LIKE THIS GOOD I WILL LOOK AND SEE WHAT I...,4.0,DID LIKE THIS GOOD,1356739200,"12 29, 2012",2012-12-29
1,A08161909WK3HU7UYTMW,B000002H1C,"Jody G. Collyer ""Jody G Collyer""","[0, 1]",Eagles Greatest Hits Volume 2I WANT TO SEND TH...,3.0,eagles greatest hits i like a few songs on it ...,1358467200,"01 18, 2013",2013-01-18
2,A08161909WK3HU7UYTMW,B0000032XY,"Jody G. Collyer ""Jody G Collyer""","[0, 0]",i have just learned about this man as a favori...,5.0,i also like his singing and what ever songs he...,1389484800,"01 12, 2014",2014-01-12
3,A08161909WK3HU7UYTMW,B000002HRC,"Jody G. Collyer ""Jody G Collyer""","[0, 0]",this cd is greatly appreciated and you should ...,5.0,thanks i like this cd is greatly appreciated,1398729600,"04 29, 2014",2014-04-29
4,A08161909WK3HU7UYTMW,B00000053B,"Jody G. Collyer ""Jody G Collyer""","[0, 0]",i like these cd that he has in amazon we re...,5.0,i like this cd i will get moreof his cds,1401580800,"06 1, 2014",2014-06-01


In [28]:
#pd.set_option('max_colwidth', None)

### Load Gensim

In [29]:
# Gensim
import gensim
import gensim.corpora as corpora
from gensim import similarities
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
from gensim import models
import pprint
from pprint import pprint

import re
import nltk


# spacy for lemmatization
import spacy

# Plotting tools
# import pyLDAvis
# import pyLDAvis.gensim  # don't skip this
# import matplotlib.pyplot as plt
# %matplotlib inline
import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)

### Text pre-processing

In [30]:
# Focus on positive reviews only (ratings of 3 or higher)
content = reviewsdf[(reviewsdf.overall == 3) |
                    (reviewsdf.overall == 4) |
                    (reviewsdf.overall == 5)]

In [31]:
#  Join the review texts on the same item
content = content.groupby(['asin'])['reviewText'].apply(''.join).reset_index()
content.head()

Unnamed: 0,asin,reviewText
0,5555991584,"[One CD, with a running time of 44 minutes.] A..."
1,B0000000ZW,"I first bought the CD-Single of ""Stroke You Up..."
2,B00000016T,The Cars was one of the most original bands to...
3,B00000016W,The only way I could write anything about this...
4,B00000017R,"Back in 1962, Stan Getz (tenor sax) and Charli..."


In [32]:
# Convert to list
data = content.reviewText.values.tolist()

# Remove emails
data = [re.sub(r'\S*@\S*\s?', '', sent) for sent in data]

# Collapse newlines and repeated whitespace into single spaces
data = [re.sub(r'\s+', ' ', sent) for sent in data]

# Remove distracting single quotes
data = [re.sub(r"'", "", sent) for sent in data]

pprint(data[:1])

['[One CD, with a running time of 44 minutes.] Ah...this wonderful CD shows '
 'off Enyas silky smooth voice. The eleven tunes are hauntingly beautiful. I '
 'wont try to categorize her music, it is simply too rich and complex.The '
 'paperwork includes the words to ten of the songs (The Memory of Trees is a '
 'musical number), and two nice color pictures of Enya. (I love extras!) So, '
 'if you like Enyas past hits, then I think that you will like this one as '
 'well. Buy it!Enya has one of the most beautiful voices of this day. Her '
 'voice is fitting for the style of music she sings. Clear as a bell, her '
 'voice is airy and light, as she sings in English, Latin, or Gaelic. There is '
 'a lot of power in her voice. The music she chooses supports her voice '
 'appropriately, and adds to her sound. She has the ability to conjure up '
 'images with each note she sings. Like all of her previous albums, this one '
 'is again excellent.There are a many great songs on this album. "Chin

In [33]:
def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))  # deacc=True strips accent marks; tokenization drops punctuation

data_words = list(sent_to_words(data))

print(data_words[:1])

[['one', 'cd', 'with', 'running', 'time', 'of', 'minutes', 'ah', 'this', 'wonderful', 'cd', 'shows', 'off', 'enyas', 'silky', 'smooth', 'voice', 'the', 'eleven', 'tunes', 'are', 'hauntingly', 'beautiful', 'wont', 'try', 'to', 'categorize', 'her', 'music', 'it', 'is', 'simply', 'too', 'rich', 'and', 'complex', 'the', 'paperwork', 'includes', 'the', 'words', 'to', 'ten', 'of', 'the', 'songs', 'the', 'memory', 'of', 'trees', 'is', 'musical', 'number', 'and', 'two', 'nice', 'color', 'pictures', 'of', 'enya', 'love', 'extras', 'so', 'if', 'you', 'like', 'enyas', 'past', 'hits', 'then', 'think', 'that', 'you', 'will', 'like', 'this', 'one', 'as', 'well', 'buy', 'it', 'enya', 'has', 'one', 'of', 'the', 'most', 'beautiful', 'voices', 'of', 'this', 'day', 'her', 'voice', 'is', 'fitting', 'for', 'the', 'style', 'of', 'music', 'she', 'sings', 'clear', 'as', 'bell', 'her', 'voice', 'is', 'airy', 'and', 'light', 'as', 'she', 'sings', 'in', 'english', 'latin', 'or', 'gaelic', 'there', 'is', 'lot', '

In [34]:
stop_words = nltk.corpus.stopwords.words('english')
newStopWords = ['album', 'cd', 'song', 'lyric', 'track', 'also', 'music', 'band', 'sound', 'already', 'production']
stop_words.extend(newStopWords)

def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

# def make_bigrams(texts):
#     return [bigram_mod[doc] for doc in texts]

# def make_trigrams(texts):
#     return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

In [35]:
# Remove Stop Words
data_words_nostops = remove_stopwords(data_words)

# Form Bigrams
# data_words_bigrams = make_bigrams(data_words_nostops)

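# Initialize the spaCy English model, disabling the parser and NER components (for efficiency)
# The 'en' shortcut was removed in spaCy v3; the small English model is assumed here:
# python3 -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])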

# Do lemmatization keeping only noun, adj, vb, adv
data_lemmatized = lemmatization(data_words_nostops, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

print(data_lemmatized[:1])

[['running', 'time', 'minute', 'wonderful', 'show', 'enyas', 'smooth', 'voice', 'tune', 'hauntingly', 'beautiful', 'will', 'try', 'categorize', 'simply', 'rich', 'complex', 'paperwork', 'include', 'word', 'song', 'memory', 'tree', 'musical', 'number', 'nice', 'color', 'picture', 'love', 'extra', 'enyas', 'past', 'hit', 'think', 'well', 'buy', 'beautiful', 'voice', 'day', 'voice', 'fitting', 'style', 'sing', 'clear', 'bell', 'sing', 'english', 'latin', 'choose', 'support', 'voice', 'appropriately', 'add', 'ability', 'conjure', 'image', 'note', 'sing', 'previous', 'album', 'excellent', 'many', 'great', 'song', 'rose', 'well', 'know', 'sing', 'back', 'long', 'note', 'voice', 'crisp', 'tone', 'almost', 'ring', 'latin', 'sound', 'utilize', 'voice', 'singing', 'background', 'add', 'counterpoint', 'lyric', 'anywhere', 'sound', 'perfect', 'single', 'much', 'upbeat', 'song', 'feel', 'make', 'energetic', 'opposite', 'end', 'spectrum', 'voice', 'haunting', 'quality', 'make', 'even', 'organ', 'pla

### Creating the Corpus

In [36]:
# Create Dictionary
id2word = corpora.Dictionary(data_lemmatized)

# Create Corpus
texts = data_lemmatized

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

# View
print(corpus[:1])

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 9), (6, 1), (7, 1), (8, 1), (9, 4), (10, 2), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 9), (18, 1), (19, 2), (20, 1), (21, 14), (22, 3), (23, 1), (24, 11), (25, 1), (26, 1), (27, 1), (28, 1), (29, 9), (30, 2), (31, 1), (32, 1), (33, 1), (34, 1), (35, 5), (36, 1), (37, 18), (38, 1), (39, 1), (40, 3), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 5), (49, 2), (50, 12), (51, 1), (52, 1), (53, 2), (54, 4), (55, 1), (56, 1), (57, 1), (58, 2), (59, 1), (60, 2), (61, 1), (62, 1), (63, 2), (64, 1), (65, 1), (66, 1), (67, 1), (68, 5), (69, 1), (70, 6), (71, 3), (72, 1), (73, 1), (74, 2), (75, 1), (76, 2), (77, 1), (78, 13), (79, 2), (80, 4), (81, 22), (82, 1), (83, 3), (84, 4), (85, 4), (86, 1), (87, 1), (88, 1), (89, 2), (90, 2), (91, 2), (92, 8), (93, 1), (94, 2), (95, 1), (96, 1), (97, 1), (98, 1), (99, 1), (100, 1), (101, 1), (102, 1), (103, 1), (104, 1), (105, 1), (106, 1), (107, 2), (108, 1), (109, 2), (1

### Creating a transformation
We use our corpus to initialize (train) the transformation model

In [37]:
tfidf = models.TfidfModel(corpus)  # step 1 -- initialize a model

### Transforming vectors
From now on, tfidf is treated as a read-only object that can be used to convert any vector from the old representation (bag-of-words integer counts) to the new representation (TfIdf real-valued weights):
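To see what the conversion from counts to weights involves, here is a plain-Python sketch of a TF-IDF transformation. It assumes what we understand to be Gensim's default scheme (term count times log2 of the inverse document frequency, followed by normalisation to unit length); the function `tfidf_transform` and the toy corpus are illustrative, not part of the notebook.

```python
import math
from collections import Counter

def tfidf_transform(corpus_bow):
    """corpus_bow: list of documents, each a list of (term_id, count) pairs."""
    n_docs = len(corpus_bow)
    # Number of documents in which each term appears
    doc_freq = Counter(term_id for doc in corpus_bow for term_id, _ in doc)
    transformed = []
    for doc in corpus_bow:
        # Raw weight: term count times log2 inverse document frequency
        weights = [(t, c * math.log2(n_docs / doc_freq[t])) for t, c in doc]
        # Scale each document to unit length and drop zero-weight terms
        norm = math.sqrt(sum(w * w for _, w in weights))
        transformed.append([(t, w / norm) for t, w in weights if norm > 0 and w > 0])
    return transformed

# Hypothetical bag-of-words corpus: term 0 occurs in every document
toy_corpus = [
    [(0, 1), (1, 2)],
    [(0, 1), (2, 1)],
    [(0, 3), (3, 1)],
]

result = tfidf_transform(toy_corpus)
for doc in result:
    print(doc)
```

A term that occurs in every document receives a weight of zero and disappears, which is the same effect the custom stopword list aims for with words such as "album" or "cd".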

In [38]:
doc_bow = [(0, 1), (1, 1)]
print(tfidf[doc_bow])  # step 2 -- use the model to transform vectors

[(0, 0.9829432740221139), (1, 0.1839089993847159)]


In [39]:
#apply the transformation to a whole corpus
from itertools import islice

corpus_tfidf = tfidf[corpus]
# Show the second (term_id, weight) pair of the first 10 documents; documents
# with fewer than two terms are skipped because doc[1] would raise IndexError
for doc in islice(corpus_tfidf, 10):
    if len(doc) > 1:
        print(doc[1])

(1, 0.0043208049574999475)
(21, 0.0034222073001661385)
(4, 0.010596263076751421)
(3, 0.011597042917179879)
(8, 0.026619758780259607)
(5, 0.06964952986252561)
(19, 0.030681809086407295)
(48, 0.02180097412280645)
(21, 0.033514689616414524)
(6, 0.015797312156428833)

### LSI

#### Item profiles generation
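Here the TF-IDF corpus is transformed via Latent Semantic Indexing (LSI). Like Latent Dirichlet Allocation, LSI projects each document into a low-dimensional topic space (10 topics here), but it does so with a truncated singular value decomposition rather than a probabilistic model.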

In [41]:
lsi = models.LsiModel(corpus_tfidf, id2word=id2word, num_topics=10)  # initialize an LSI transformation
corpus_lsi = lsi[corpus_tfidf]  # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi

In [42]:
lsi.print_topics(10)

[(0,
  '0.087*"rock" + 0.083*"rap" + 0.080*"jazz" + 0.078*"quot" + 0.077*"blue" + 0.077*"beat" + 0.074*"funk" + 0.072*"country" + 0.071*"soul" + 0.067*"hit"'),
 (1,
  '0.356*"rap" + 0.255*"rapper" + 0.210*"beat" + 0.180*"hip" + 0.179*"hop" + 0.148*"rhyme" + 0.147*"game" + 0.122*"cube" + 0.114*"guest" + -0.108*"blue"'),
 (2,
  '0.431*"funk" + 0.431*"jazz" + 0.166*"soul" + 0.138*"funky" + -0.138*"elton" + -0.131*"country" + 0.129*"groove" + 0.112*"blue" + 0.105*"stevie" + 0.101*"amp"'),
 (3,
  '0.733*"elton" + 0.126*"country" + -0.111*"punk" + -0.110*"band" + -0.105*"metal" + 0.100*"blue" + 0.085*"captain" + 0.081*"duet" + 0.079*"hit" + 0.079*"brick"'),
 (4,
  '-0.391*"elton" + 0.336*"country" + -0.166*"metal" + -0.162*"jazz" + -0.135*"band" + -0.127*"punk" + 0.111*"eagle" + -0.103*"funk" + -0.098*"rock" + 0.092*"carpenter"'),
 (5,
  '-0.446*"blue" + 0.255*"funk" + -0.229*"jazz" + -0.216*"country" + 0.202*"dance" + 0.199*"elton" + -0.174*"moody" + -0.124*"muddy" + 0.100*"trance" + 0.100*

In [43]:
# Create the similarity matrix to prepare for similarity queries 

index = similarities.MatrixSimilarity(corpus_lsi)
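
`MatrixSimilarity` answers a query by computing the cosine similarity between the query's LSI vector and every indexed document. The block below is a minimal pure-Python sketch of that computation on sparse `(topic_id, weight)` vectors. The helper name `cosine` and the toy vectors are assumptions made for illustration; they are not part of gensim or of this notebook's data.

```python
from math import sqrt

def cosine(vec_a, vec_b):
    """Cosine similarity between two sparse vectors given as (id, weight) pairs."""
    a, b = dict(vec_a), dict(vec_b)
    dot = sum(weight * b.get(idx, 0.0) for idx, weight in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_a * norm_b)

# Hypothetical topic vectors, in the same format as `lsi[vec_bow]`
query = [(0, 2.0), (1, -1.0)]
docs = [
    [(0, 4.0), (1, -2.0)],  # same direction as the query
    [(0, 1.0), (1, 2.0)],   # orthogonal to the query
    [(2, 3.0)],             # no topic in common with the query
]
scores = [cosine(query, doc) for doc in docs]
```

Because cosine similarity ignores vector length, a long and a short review about the same genre can still score close to 1.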

In [44]:
# Inspect results based on one review
doc = content.reviewText[0]
vec_bow = id2word.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow]  # convert the query to LSI space
print(vec_lsi)

[(0, 55.361544074535125), (1, -6.222430849686605), (2, -8.84208883617526), (3, -2.8008733295286206), (4, 4.190181660060438), (5, 4.301014143850823), (6, 8.713980952328496), (7, 5.378645473943266), (8, 5.530323984630668), (9, 1.7931475465101232)]


In [45]:
sims = index[vec_lsi]
sims = sorted(enumerate(sims), key=lambda item: -item[1])

print("Examples of the (index,similarity score) tuples for the example review are below: ")
print(sims[:10])

Examples of the (index,similarity score) tuples for the example review are below: 
[(2665, 0.9947796), (1933, 0.9946005), (1986, 0.9942534), (2502, 0.9940952), (1971, 0.9939951), (1667, 0.9935789), (323, 0.99344754), (2440, 0.99312836), (1775, 0.99291706), (2075, 0.99212855)]


In [46]:
# Create a dataframe to store the item profiles

item_rows = []

for i in range(len(content)):

    doc = content.reviewText[i]
    vec_bow = id2word.doc2bow(doc.lower().split())
    vec_lsi = lsi[vec_bow]

    sims = index[vec_lsi]
    sims = sorted(enumerate(sims), key=lambda item: -item[1])
    sims = sims[:5] # retrieve the top 5 similar items

    # Map the positions returned by the index back to album IDs (asin)
    lsi_doc = [content.asin[doc_position] for doc_position, _ in sims]

    item_rows.append({'Item': content.asin[i], 'RecommendedList': lsi_doc})

# DataFrame.append was removed in pandas 2.0, so the rows are collected in a list first
lsi_item_profile = pd.DataFrame(item_rows, columns=['Item','RecommendedList'])

In [47]:
lsi_item_profile.head()

Unnamed: 0,Item,RecommendedList
0,5555991584,"[B000LXHGBC, B00005LODB, B00005QW3E, B000AGTQJ..."
1,B0000000ZW,"[B00000053X, B0000025YB, B0059H09DC, B002X063L..."
2,B00000016T,"[B00000DTIF, B000002LSJ, B00000JMIQ, B00000016..."
3,B00000016W,"[B0000025R3, B000002GIG, B00000DFFQ, B000002TS..."
4,B00000017R,"[B000000ORJ, B0006VXF4G, B000002AGW, B00005A0R..."


In [48]:
# Check that there is a recommended list for every unique album

lsi_item_profile.shape

(3567, 2)
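
Because each album's own review text is part of the index, an album can appear in its own top-5 list. The following sketch shows one way to skip the query position while taking the top N, using `heapq.nlargest` instead of sorting every score. The function name `top_n_similar` and the sample scores are hypothetical.

```python
import heapq

def top_n_similar(sims, query_position, n=5):
    """Return the n highest-scoring (position, score) pairs, skipping the query itself."""
    candidates = (
        (pos, score) for pos, score in enumerate(sims) if pos != query_position
    )
    return heapq.nlargest(n, candidates, key=lambda pair: pair[1])

# Hypothetical similarity scores for the item stored at position 2
sims = [0.10, 0.80, 1.00, 0.95, 0.30]
top = top_n_similar(sims, query_position=2, n=3)
```

Under this assumption the loop that builds the item profiles would pass its own index `i` as `query_position`.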

#### User Profile Generation

Following the approach in `Practical Recommender Systems`, Chapter 10, Section 10.9.4, user profiles are generated from the reviews of items the user liked, which here means a rating of 5. Each profile is then used as a query against the item index, and the most similar items are recommended to that user.

In [49]:
# Items that users liked, rating = 5 (since the ratings are skewed towards high values)

userDf = reviewsdf[reviewsdf.overall==5]
userContent = userDf.groupby(['reviewerID'])['reviewText'].apply(''.join).reset_index()

# Use user profiles to filter recommendations

user_rows = []

for i in range(len(userContent)):

    doc = userContent.reviewText[i]
    vec_bow = id2word.doc2bow(doc.lower().split())
    vec_lsi = lsi[vec_bow]

    sims = index[vec_lsi]
    sims = sorted(enumerate(sims), key=lambda item: -item[1])
    sims = sims[:5] # retrieve the top 5 similar items

    # Map the positions returned by the index back to album IDs (asin)
    user_doc = [content.asin[doc_position] for doc_position, _ in sims]

    user_rows.append({'reviewerID': userContent.reviewerID[i], 'RecommendedList': user_doc})

# DataFrame.append was removed in pandas 2.0, so the rows are collected in a list first
cbRecommendation = pd.DataFrame(user_rows, columns=['reviewerID','RecommendedList'])


In [50]:
cbRecommendation.head()

Unnamed: 0,reviewerID,RecommendedList
0,A08161909WK3HU7UYTMW,"[B00A7ZWW3G, B00AHXIDA4, B001CJOHG6, B005QJZ5F..."
1,A10323WWTFPSGP,"[B00009VRDI, B0012X9KLO, B002X063LA, B000G1ALR..."
2,A103KNDW8GN92L,"[B000002P1B, B003DZM54I, B000002BOJ, B000002HK..."
3,A103W7ZPKGOCC9,"[B000002OQ3, B000002P2E, B000002PHV, B00136Q3H..."
4,A105188E1HFWRX,"[B0032CYG9Y, B00006GO8V, B004948NSO, B007PSPRV..."


In [51]:
cbRecommendation.shape

(5326, 2)
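
The content-based lists above are not checked against the user's own history, so they may include albums the user has already reviewed. A possible post-processing step is sketched below. The function `drop_seen` and the sample IDs are hypothetical, and the sketch assumes the caller retrieves more than N candidates so that N remain after filtering.

```python
def drop_seen(recommended, seen, n=5):
    """Keep up to n recommended items the user has not rated yet, preserving order."""
    unseen = [item for item in recommended if item not in seen]
    return unseen[:n]

# Hypothetical data: albums already rated by one user, and an over-fetched candidate list
already_rated = {"A": 5.0, "B": 4.0}
candidates = ["A", "C", "B", "D", "E"]

filtered = drop_seen(candidates, already_rated, n=2)
```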

#### Evaluation

In [52]:
def load_users_ratings(df):
    users_ratings = {}
    for _, row in df.iterrows():
        if row["reviewerID"] not in users_ratings:
            users_ratings[row["reviewerID"]] = {}
        users_ratings[row["reviewerID"]][row["asin"]] = row["overall"]
    return users_ratings

user_rating = load_users_ratings(reviewsdf)

#### User-item evaluation

In [53]:
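from math import sqrt

def evaluation(user_id, results=cbRecommendation):
    """RMSE between 5 (the 'liked' rating) and the ratings all users gave to a recommended album."""
    squared_error = 0.0
    total = 0
    recommendations = results[results.reviewerID == user_id]['RecommendedList']
    # NOTE: `recommendations` holds a single row, so this loop runs once and
    # only the first album of the recommended list is scored
    for i in range(len(recommendations)):
        item = recommendations.iloc[0][i]
        for ratings in user_rating.values():
            if item in ratings:
                squared_error += (int(ratings[item]) - 5) ** 2
                total += 1

    # Avoid a ZeroDivisionError when nobody has rated the recommended album
    return sqrt(squared_error / total) if total else float('nan')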



evaluation("A10323WWTFPSGP",cbRecommendation)

2.7063488918327447
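
The `evaluation` function above iterates over `range(len(recommendations))`, where `recommendations` is a one-row Series, so only the first recommended album contributes to the score. The sketch below scores every album in the list instead. It is an illustrative variant on toy data, not the computation that produced the figure above; the name `rmse_against_liked` and the sample ratings are assumptions.

```python
from math import sqrt

def rmse_against_liked(recommended_items, ratings_by_user, liked_rating=5):
    """RMSE between `liked_rating` and every known rating of the recommended items."""
    squared_error = 0.0
    total = 0
    for item in recommended_items:
        for ratings in ratings_by_user.values():
            if item in ratings:
                squared_error += (ratings[item] - liked_rating) ** 2
                total += 1
    # No rating at all for the recommended items: the score is undefined
    return sqrt(squared_error / total) if total else float("nan")

# Hypothetical ratings in the same shape as `user_rating`: {reviewerID: {asin: rating}}
ratings_example = {
    "U1": {"A": 5.0, "B": 3.0},
    "U2": {"A": 4.0},
    "U3": {"C": 1.0},
}
score = rmse_against_liked(["A", "B"], ratings_example)
```

A lower value means the recommended albums were, on average, rated closer to 5 by the users who reviewed them.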

#### Item-Item evaluation

In [57]:
CBrecommendation = pd.merge(userDf, lsi_item_profile, how='inner', left_on='asin', right_on='Item')
CBrecommendation = CBrecommendation[['reviewerID','asin','RecommendedList']]
CBrecommendation.head()

Unnamed: 0,reviewerID,asin,RecommendedList
0,A08161909WK3HU7UYTMW,B0000032XY,"[B000002VFA, B0007OY456, B000JU8HHE, B00005N8V..."
1,A14W8HXP3RM3ZS,B0000032XY,"[B000002VFA, B0007OY456, B000JU8HHE, B00005N8V..."
2,A18EPAQ44YJTW5,B0000032XY,"[B000002VFA, B0007OY456, B000JU8HHE, B00005N8V..."
3,A1BVCH82W0M2W2,B0000032XY,"[B000002VFA, B0007OY456, B000JU8HHE, B00005N8V..."
4,A1JIW8GOSSGUQR,B0000032XY,"[B000002VFA, B0007OY456, B000JU8HHE, B00005N8V..."


In [131]:
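def evaluation(user_item, results=lsi_item_profile):
    """RMSE between 5 and the ratings all users gave to the albums recommended for `user_item`."""
    squared_error = 0.0
    total = 0
    # Filter on `results` itself rather than on the global lsi_item_profile
    recommendations = results[results.Item == user_item]['RecommendedList']
    for item in recommendations.iloc[0][:5]:
        for ratings in user_rating.values():
            if item in ratings:
                squared_error += (int(ratings[item]) - 5) ** 2
                total += 1

    # Avoid a ZeroDivisionError when nobody has rated the recommended albums
    return sqrt(squared_error / total) if total else float('nan')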



evaluation("B0000032XY",lsi_item_profile)

1.5509369443679537

### 6. Hybrid Recommender System

As an extra, you can propose a hybrid recommender system joining the operation of the
2 previously developed systems. To that end, you can make use of any of the ideas
explained in class.

#### Mixed Recommender System

As seen in class, a Mixed Recommender System combines the results of different recommendation logics and presents them according to business rules; Netflix and Spotify are examples. This method should also work well in our music recommendation scenario.

We will follow the rules below:

1. The primary recommendation section will be based on results from collaborative filtering
2. The secondary recommendation section will be based on results from content-based filtering
3. The third section will be non-personalized recommendations, i.e. the highest-rated and the trending albums
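
The three rules can be sketched as a small function that fills the sections in priority order. Showing each album only once across sections is an added assumption, not a rule stated above, and the function name `build_mixed_page` and the album IDs are hypothetical.

```python
def build_mixed_page(cf_items, cb_items, trending_items, per_section=3):
    """Assemble the three sections in priority order, showing each album once."""
    sections = [
        ("Users with similar interests also listened to", cf_items),
        ("Albums similar to your taste", cb_items),
        ("Trending albums", trending_items),
    ]
    shown = set()
    page = {}
    for title, items in sections:
        picks = []
        for item in items:
            # Skip albums already placed in a higher-priority section
            if item not in shown and len(picks) < per_section:
                picks.append(item)
                shown.add(item)
        page[title] = picks
    return page

# Hypothetical album IDs for one user
page = build_mixed_page(
    cf_items=["A", "B", "C"],
    cb_items=["B", "D", "E", "F"],
    trending_items=["A", "G"],
    per_section=2,
)
```

Giving collaborative filtering the first pick means an album found by two methods appears in the section the rules rank highest.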

In [63]:
user_id = 'AZPWAXJG9OJXV'

# from CF - 'Users with similar interests also listened to'
rs_1 = final_predictions_cf[final_predictions_cf.UserID==user_id]
print(rs_1)

# from content based - 'Albums similar to your taste'
rs_2 = cbRecommendation[cbRecommendation['reviewerID']==user_id][['reviewerID','RecommendedList']]
print(rs_2)

# Trending albums - 'Albums that have received the highest average rating in the latest month'

             UserID                     Recommended Items
5529  AZPWAXJG9OJXV  [B0000058MS, B0000058MY, B00005QXY9]
         reviewerID                                    RecommendedList
5314  AZPWAXJG9OJXV  [B00006A6ZO, B00005OL56, B004NYNGTQ, B00006562...


In [64]:
from datetime import datetime
from datetime import timedelta

def parse_date(x):
    new_date = datetime.strptime(x, "%m %d, %Y")
    return new_date
    
reviewsdf['parse_review_time'] = reviewsdf.reviewTime.apply(parse_date)

reviewsdf.head()

Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime,parse_review_time
0,A08161909WK3HU7UYTMW,B0041WLBEC,"Jody G. Collyer ""Jody G Collyer""","[0, 0]",DID LIKE THIS GOOD I WILL LOOK AND SEE WHAT I...,4.0,DID LIKE THIS GOOD,1356739200,"12 29, 2012",2012-12-29
1,A08161909WK3HU7UYTMW,B000002H1C,"Jody G. Collyer ""Jody G Collyer""","[0, 1]",Eagles Greatest Hits Volume 2I WANT TO SEND TH...,3.0,eagles greatest hits i like a few songs on it ...,1358467200,"01 18, 2013",2013-01-18
2,A08161909WK3HU7UYTMW,B0000032XY,"Jody G. Collyer ""Jody G Collyer""","[0, 0]",i have just learned about this man as a favori...,5.0,i also like his singing and what ever songs he...,1389484800,"01 12, 2014",2014-01-12
3,A08161909WK3HU7UYTMW,B000002HRC,"Jody G. Collyer ""Jody G Collyer""","[0, 0]",this cd is greatly appreciated and you should ...,5.0,thanks i like this cd is greatly appreciated,1398729600,"04 29, 2014",2014-04-29
4,A08161909WK3HU7UYTMW,B00000053B,"Jody G. Collyer ""Jody G Collyer""","[0, 0]",i like these cd that he has in amazon we re...,5.0,i like this cd i will get moreof his cds,1401580800,"06 1, 2014",2014-06-01


In [65]:
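# Cutoff date: 30 days before the most recent review in the dataset
l30d = reviewsdf.parse_review_time.max() - timedelta(days=30)

# Average rating per album over that 30-day window
l30d_rating = reviewsdf[reviewsdf.parse_review_time >= l30d].groupby('asin')['overall'].mean()

l30d_rating.sort_values(ascending=False)[:10]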

asin
B0002CHI4C    5.0
B00006SM86    5.0
B00005Y1XY    5.0
B00005UWL9    5.0
B00005N9CV    5.0
B000059T1U    5.0
B000051XVN    5.0
B00004ZE8C    5.0
B00004UART    5.0
B00004OCFE    5.0
Name: overall, dtype: float64
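
Every album in this top 10 has a mean of exactly 5.0, which a single 5-star review in the window is enough to produce. One possible refinement, sketched here on toy data, is to require a minimum number of recent reviews before an album counts as trending. The threshold `min_reviews`, the function name `trending` and the sample tuples are assumptions, not values from the notebook.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def trending(reviews, days=30, min_reviews=2, top_n=10):
    """Rank albums by mean rating over the last `days` days, requiring `min_reviews` ratings."""
    latest = max(date for _, _, date in reviews)
    cutoff = latest - timedelta(days=days)
    ratings = defaultdict(list)
    for asin, rating, date in reviews:
        if date >= cutoff:
            ratings[asin].append(rating)
    means = {
        asin: sum(values) / len(values)
        for asin, values in ratings.items()
        if len(values) >= min_reviews
    }
    # Sort by mean rating, then by number of recent reviews, both descending
    ranked = sorted(means, key=lambda a: (means[a], len(ratings[a])), reverse=True)
    return ranked[:top_n]

# Hypothetical (asin, rating, review date) tuples
reviews_example = [
    ("A", 5.0, datetime(2014, 7, 1)),
    ("B", 5.0, datetime(2014, 7, 10)),
    ("B", 4.0, datetime(2014, 7, 20)),
    ("C", 3.0, datetime(2014, 7, 5)),
    ("C", 4.0, datetime(2014, 7, 15)),
    ("D", 5.0, datetime(2014, 1, 1)),
    ("D", 5.0, datetime(2014, 1, 2)),
]
top = trending(reviews_example)
```

In this toy data album A is dropped because it has one recent review, and album D because its reviews fall outside the 30-day window.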