# Movie Recommendation System Using Different Models

## Introduction

Recommendation systems are among the most widespread machine learning applications in industry. They are built not only for movie and music streaming but also for products and services such as e-commerce and news. Companies like Netflix, YouTube, and Amazon use recommendation systems to learn user preferences and surface more relevant goods and services, which improves the user experience and generates recurring revenue. A successful recommendation system is an indispensable part of their business.

This assignment walks through building the necessary elements of a recommender engine, implementing an ensemble of machine learning algorithms, and making use of an AWS SageMaker endpoint. The final pipeline offers specific movie suggestions for different users based on the combined results of three models: ALS, AWS SageMaker and Scikit-Surprise.


## Dataset

The dataset used in this assignment is the well-known MovieLens dataset. Several versions are available online (https://grouplens.org/datasets/movielens/); for ease of use, the 100k version is used for all three models mentioned above. Its files include user ID, movie ID, rating, and timestamp (u.data), along with movie genres (u.item) and user demographics (u.user).

## MODEL 1: ALS Estimator Using Spark

### Step 1 : Set up the environment / necessary housekeeping

In [22]:
!pip install --upgrade pip
!pip install -q findspark

Requirement already up-to-date: pip in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (20.0.2)


In [23]:
#Python library PySpark is imported to interface with Spark
import pyspark
 # get a spark context
sc = pyspark.SparkContext.getOrCreate()
print(sc)
# get a spark session
spark = pyspark.sql.SparkSession.builder.getOrCreate()
print(spark)
spark.version

<SparkContext master=local[*] appName=pyspark-shell>
<pyspark.sql.session.SparkSession object at 0x7f0b47595198>


'2.3.4'

In [26]:
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql.types import StructType
from pyspark.sql.types import StructField
from pyspark.sql.types import *
import pandas as pd
import numpy as np

### Step 2 : Data preprocessing

In [40]:
#download data from grouplens
!wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
!unzip -o ml-100k.zip
#shuffle the data
%cd ml-100k
!shuf ua.base -o ua.base.shuffled

--2020-04-28 14:06:19--  http://files.grouplens.org/datasets/movielens/ml-100k.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4924029 (4.7M) [application/zip]
Saving to: ‘ml-100k.zip.1’


2020-04-28 14:06:20 (8.23 MB/s) - ‘ml-100k.zip.1’ saved [4924029/4924029]

Archive:  ml-100k.zip
  inflating: ml-100k/allbut.pl       
  inflating: ml-100k/mku.sh          
  inflating: ml-100k/README          
  inflating: ml-100k/u.data          
  inflating: ml-100k/u.genre         
  inflating: ml-100k/u.info          
  inflating: ml-100k/u.item          
  inflating: ml-100k/u.occupation    
  inflating: ml-100k/u.user          
  inflating: ml-100k/u1.base         
  inflating: ml-100k/u1.test         
  inflating: ml-100k/u2.base         
  inflating: ml-100k/u2.test         
  inflating: ml-100k/u3.base         
  inflating:

In [41]:
#load training data
train_movie_ratings= pd.read_csv('ua.base.shuffled', sep='\t', 
                                       index_col=False, names=['user_id' , 'movie_id' , 'rating'])
#load test data
test_movie_ratings= pd.read_csv('ua.test', sep='\t', 
                                      index_col=False, names=['user_id' , 'movie_id' , 'rating'])
#train_movie_ratings.head()
#test_movie_ratings.head()

In [251]:
# create a SparkSession
spark = SparkSession.builder.getOrCreate()  
# create the training and test dataframe
training=spark.createDataFrame(train_movie_ratings)
test=spark.createDataFrame(test_movie_ratings)
#show the column names of the dataset
training.printSchema() 
# count length of the dataset
print(training.count()) 

root
 |-- user_id: long (nullable = true)
 |-- movie_id: long (nullable = true)
 |-- rating: long (nullable = true)

90570


### Step 3 : Train ALS estimator and perform cross validation

In [253]:
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder, TrainValidationSplit
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS

# Build the recommendation model using ALS on the training data
als = ALS(maxIter=3, rank=10, regParam=0.1, userCol="user_id", itemCol="movie_id", ratingCol="rating",coldStartStrategy="drop")

#Set the parameter grid
paramGrid = ParamGridBuilder() \
  .addGrid(als.regParam, [0.03,0.1,0.3]) \
  .addGrid(als.rank, [3,10,30]).build()

regEval = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")

#Start the cross validation and fit on the training dataset 
crossVal = CrossValidator(estimator=als, estimatorParamMaps=paramGrid, evaluator=regEval)
cvModel = crossVal.fit(training)

In [254]:
# Show the metrics from the CrossValidation
print(cvModel.avgMetrics) 

# Gives us the parameter combinations
print(cvModel.getEstimatorParamMaps()) 
paramMap = list(zip(cvModel.getEstimatorParamMaps(),cvModel.avgMetrics))

# Print the parameter that gives us the smallest rmse
paramMin = min(paramMap, key=lambda x: x[1])
print(paramMin)

[0.9736478590531773, 1.0256734372601422, 1.0732557393493907, 0.9479794111368625, 0.9499695766544693, 0.9504500279730561, 0.9713087649862429, 1.0234699008431651, 1.0448014103739727]
[{Param(parent='ALS_446f9135931edb765cfb', name='regParam', doc='regularization parameter (>= 0).'): 0.03, Param(parent='ALS_446f9135931edb765cfb', name='rank', doc='rank of the factorization'): 3}, {Param(parent='ALS_446f9135931edb765cfb', name='regParam', doc='regularization parameter (>= 0).'): 0.03, Param(parent='ALS_446f9135931edb765cfb', name='rank', doc='rank of the factorization'): 10}, {Param(parent='ALS_446f9135931edb765cfb', name='regParam', doc='regularization parameter (>= 0).'): 0.03, Param(parent='ALS_446f9135931edb765cfb', name='rank', doc='rank of the factorization'): 30}, {Param(parent='ALS_446f9135931edb765cfb', name='regParam', doc='regularization parameter (>= 0).'): 0.1, Param(parent='ALS_446f9135931edb765cfb', name='rank', doc='rank of the factorization'): 3}, {Param(parent='ALS_446f91

In [256]:
#Implement the als model with the best parameter
als = ALS(maxIter=3, rank=3, regParam=0.1, userCol="user_id", itemCol="movie_id", ratingCol="rating",
          coldStartStrategy="drop")
model = als.fit(training)

In [257]:
# Evaluate the model by computing the RMSE on the test data
predictions = model.transform(test)
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                                predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print("RMSE = " + str(rmse))

RMSE = 0.9568521793447027


To train an ALS estimator, we can start with the default parameters or set specific values, while specifying the user, item, and rating columns. For the cold-start strategy, the options are "nan" (the default) and "drop". Because a NaN prediction would make the evaluation metric NaN as well, we set the coldStartStrategy parameter to "drop", which removes rows containing NaN predictions from the prediction DataFrame.
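The effect of the cold-start setting on the metric can be illustrated without Spark. The sketch below is not Spark's implementation; it uses made-up predictions in plain Python to show that a single NaN prediction makes the RMSE undefined, while dropping such rows first (which is what `coldStartStrategy="drop"` does to the prediction DataFrame) yields a usable number. The helper name `rmse` and the sample values are illustrative assumptions.

```python
import math

def rmse(pairs):
    """Root-mean-square error over (rating, prediction) pairs."""
    return math.sqrt(sum((r - p) ** 2 for r, p in pairs) / len(pairs))

# Hypothetical (rating, prediction) pairs; the NaN stands for a user or
# movie that was absent from the training data (a cold-start case).
scored = [(4.0, 3.5), (2.0, 3.0), (5.0, float("nan"))]

# Keeping the NaN row makes the whole metric NaN.
rmse_with_nan = rmse(scored)

# Dropping NaN predictions first mirrors the "drop" strategy.
kept = [(r, p) for r, p in scored if not math.isnan(p)]
rmse_dropped = rmse(kept)

print(rmse_with_nan, rmse_dropped)
```

The trade-off is that dropped rows are excluded from the metric entirely, so the reported RMSE only covers users and movies seen during training.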

Next, cross-validation with parameter tuning is applied to improve the ALS model. We define a parameter grid, run cross-validation over it, and select the combination with the smallest average RMSE (regParam=0.1, rank=3). With these parameters, the RMSE computed on the test data is 0.957.
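The selection step can be reproduced by hand from the printed metrics. The following sketch copies the average RMSE values from the `cvModel.avgMetrics` output (rounded to four decimals) and pairs them with the grid values. It assumes the metrics are ordered with regParam varying slowest and rank fastest, which matches the visible part of the printed parameter maps; the variable names are illustrative.

```python
from itertools import product

# Grid values from the ParamGridBuilder call
reg_params = [0.03, 0.1, 0.3]
ranks = [3, 10, 30]

# Average cross-validation RMSE per combination, rounded from avgMetrics
avg_rmse = [0.9736, 1.0257, 1.0733,
            0.9480, 0.9500, 0.9505,
            0.9713, 1.0235, 1.0448]

# Assumption: regParam-major, rank-minor ordering of the parameter maps
grid = list(product(reg_params, ranks))
(best_reg, best_rank), best_rmse = min(zip(grid, avg_rmse), key=lambda t: t[1])

print(best_reg, best_rank, best_rmse)
```

Under that ordering assumption, the smallest value belongs to regParam=0.1 and rank=3, the combination used to refit the model.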

The prediction accuracy of this first model may not be satisfactory, but it comes from a single ALS model. We can expect further improvement later on when several effective models are aggregated.

In [258]:
#Get the prediction result
predictions.show()
predictions.createOrReplaceTempView('predictions') 

+-------+--------+------+----------+
|user_id|movie_id|rating|prediction|
+-------+--------+------+----------+
|    251|     148|     2|   3.18209|
|    580|     148|     4| 3.5047975|
|    602|     148|     4|  3.725699|
|    372|     148|     5|  4.218792|
|    274|     148|     2| 3.4845352|
|    923|     148|     4| 3.8626158|
|    447|     148|     4|  2.895263|
|    586|     148|     3| 3.4014761|
|    761|     148|     5|  3.065056|
|    677|     148|     4| 3.8198245|
|    930|     148|     1| 3.0983498|
|     51|     148|     3| 3.8849201|
|    403|     148|     5| 3.7467852|
|    361|     148|     1| 2.5771089|
|    893|     148|     3| 3.3657274|
|    396|     148|     4|  3.575336|
|    203|     148|     3| 3.2016485|
|    186|     148|     4| 3.5464697|
|      6|     463|     4| 3.6850455|
|     23|     463|     4| 3.6496463|
+-------+--------+------+----------+
only showing top 20 rows



### Step 4 : Movie recommendation results for each user

In [259]:
# Get top 5 movie recommendations for each user
userRecs = model.recommendForAllUsers(5)
userRecs.show()

+-------+--------------------+
|user_id|     recommendations|
+-------+--------------------+
|    471|[[1554, 7.181269]...|
|    463|[[1155, 4.350897]...|
|    833|[[1643, 5.1402636...|
|    496|[[1536, 4.9627852...|
|    148|[[1463, 5.825232]...|
|    540|[[814, 5.399788],...|
|    392|[[1463, 6.0887423...|
|    243|[[1536, 5.3301077...|
|    623|[[1463, 5.2869897...|
|    737|[[1558, 6.0107007...|
|    897|[[814, 5.7407727]...|
|    858|[[1536, 5.438197]...|
|     31|[[1463, 5.5697336...|
|    516|[[814, 5.6217227]...|
|    580|[[1155, 5.105934]...|
|    251|[[1463, 5.6510725...|
|    451|[[1554, 6.0504746...|
|     85|[[1463, 5.261311]...|
|    137|[[1554, 6.4420385...|
|    808|[[1463, 6.94685],...|
+-------+--------------------+
only showing top 20 rows



In [261]:
# Get top 5 movie recommendations for the users that appear in the test set
users = test.select(als.getUserCol()).distinct()
userSubsetRecs = model.recommendForUserSubset(users, 5)
userSubsetRecs.show()

+-------+--------------------+
|user_id|     recommendations|
+-------+--------------------+
|    463|[[1155, 4.350897]...|
|    496|[[1536, 4.9627852...|
|    833|[[1643, 5.1402636...|
|    471|[[1554, 7.181269]...|
|    148|[[1463, 5.825232]...|
|    243|[[1536, 5.3301077...|
|    858|[[1536, 5.438197]...|
|    737|[[1558, 6.0107007...|
|    897|[[814, 5.7407727]...|
|    623|[[1463, 5.2869897...|
|    392|[[1463, 6.0887423...|
|    540|[[814, 5.399788],...|
|     31|[[1463, 5.5697336...|
|    516|[[814, 5.6217227]...|
|     85|[[1463, 5.261311]...|
|    451|[[1554, 6.0504746...|
|    580|[[1155, 5.105934]...|
|    137|[[1554, 6.4420385...|
|    251|[[1463, 5.6510725...|
|    808|[[1463, 6.94685],...|
+-------+--------------------+
only showing top 20 rows



### Step 5 : Adjust prediction result format 

In [262]:
#convert spark dataframe to pandas dataframe for ease of use
pred_als=userSubsetRecs.toPandas()
pred_als.head()

Unnamed: 0,user_id,recommendations
0,463,"[(1155, 4.350896835327148), (1242, 4.192982673..."
1,496,"[(1536, 4.962785243988037), (1639, 4.897671222..."
2,833,"[(1643, 5.140263557434082), (1536, 5.043638706..."
3,471,"[(1554, 7.18126916885376), (1472, 6.2251939773..."
4,148,"[(1463, 5.825232028961182), (1536, 5.598376750..."


In [270]:
# expand recommendation results into a separate dataframe
rmd=pred_als['recommendations'].apply(pd.Series)

# rename each feature and add userid
rmd=rmd.rename(columns = lambda x : 'recommendations_' + str(x))
rmd=pd.concat([pred_als["user_id"], rmd[:]], axis=1)
rmd.head()

Unnamed: 0,user_id,recommendations_0,recommendations_1,recommendations_2,recommendations_3,recommendations_4
0,463,"(1155, 4.350896835327148)","(1242, 4.1929826736450195)","(1639, 4.190911293029785)","(1645, 4.1444854736328125)","(1651, 4.1444854736328125)"
1,496,"(1536, 4.962785243988037)","(1639, 4.897671222686768)","(1643, 4.802825450897217)","(1558, 4.748034954071045)","(884, 4.703214168548584)"
2,833,"(1643, 5.140263557434082)","(1536, 5.043638706207275)","(1463, 4.988624095916748)","(1558, 4.986846446990967)","(1512, 4.692383766174316)"
3,471,"(1554, 7.18126916885376)","(1472, 6.225193977355957)","(868, 5.669477462768555)","(74, 5.483709812164307)","(867, 5.446508407592773)"
4,148,"(1463, 5.825232028961182)","(1536, 5.598376750946045)","(1643, 5.589471817016602)","(814, 5.552923202514648)","(1449, 5.295812129974365)"


In [271]:
# place different recommendations in different rows
# (DataFrame.append was removed in pandas 2.0, so rows are collected in a list first)
rows = [{"userId": row["user_id"], "recommendations": row["recommendations_%d" % k]}
        for _, row in rmd.iterrows() for k in range(5)]
user_rmd = pd.DataFrame(rows, columns=["userId", "recommendations"])

user_rmd.head()

Unnamed: 0,userId,recommendations
0,463,"(1155, 4.350896835327148)"
1,463,"(1242, 4.1929826736450195)"
2,463,"(1639, 4.190911293029785)"
3,463,"(1645, 4.1444854736328125)"
4,463,"(1651, 4.1444854736328125)"


In [357]:
# expand recommendations column into its own dataframe
recommendations = user_rmd['recommendations'].apply(pd.Series)
# rename those two columns
recommendations.columns=["movieId", "predicted_rating"]
# Get the final format
als_recommendations=pd.concat([user_rmd["userId"], recommendations[:]], axis=1)
als_recommendations.head()

Unnamed: 0,userId,movieId,predicted_rating
0,463,1155.0,4.350897
1,463,1242.0,4.192983
2,463,1639.0,4.190911
3,463,1645.0,4.144485
4,463,1651.0,4.144485


### Step 6 : Recommendation results for given user list

In [272]:
# Specified user list 
user_list=[198,11,314,184,163,710,881,504,267,653]

In [553]:
als_user=als_recommendations[als_recommendations.userId.isin(user_list)]
als_user

Unnamed: 0,userId,movieId,predicted_rating
1350,881,1463.0,5.182528
1351,881,814.0,5.130446
1352,881,1554.0,4.892687
1353,881,1536.0,4.831121
1354,881,1500.0,4.821637
1565,163,1554.0,4.327571
1566,163,814.0,4.225963
1567,163,1463.0,4.040522
1568,163,1500.0,4.017696
1569,163,1293.0,3.912068


In [360]:
#define a function to get the recommendation stored in a dictionary
def GetAlsRec(userId):
    df_als=als_user[als_user.userId==userId]
    rec=df_als.movieId.values.astype('int').tolist()
    rec_dic={'userId':userId, 'als_recommendations':rec}
    return rec_dic
GetAlsRec(184)

{'userId': 184, 'als_recommendations': [1536, 1463, 1643, 814, 1449]}

In [361]:
als_user_rec=[GetAlsRec(i) for i in user_list]
als_user_result=pd.DataFrame(als_user_rec)
als_user_result

Unnamed: 0,als_recommendations,userId
0,"[1463, 1643, 1536, 814, 1449]",198
1,"[814, 1463, 1536, 1449, 1500]",11
2,"[1554, 1662, 814, 1472, 867]",314
3,"[1536, 1463, 1643, 814, 1449]",184
4,"[1554, 814, 1463, 1500, 1293]",163
5,"[1463, 1643, 1536, 814, 1449]",710
6,"[1463, 814, 1554, 1536, 1500]",881
7,"[814, 1463, 1449, 1500, 1293]",504
8,"[1463, 1536, 1643, 814, 1449]",267
9,"[814, 1463, 1554, 1500, 1449]",653


## MODEL 2: Scikit-Surprise

Surprise is a Python scikit package for building recommendation systems; the name stands for Simple Python RecommendatIon System Engine. It works with built-in datasets such as MovieLens as well as custom datasets. The package ships with various ready-to-use algorithms, including collaborative filtering and matrix factorization, and it is simple and easy to use. However, Surprise does not support content-based information or implicit ratings.

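The example below demonstrates the process of building a simple recommendation system using Surprise. The RMSE of the predictions is 0.675, which is much lower than that of the ALS estimator. Note, however, that the test set here is built from the full training set, so this is an in-sample error and is not directly comparable with the held-out RMSE reported for ALS. A function is then defined to find the top movie recommendations for each user.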

### Step 1 : Install and import Surprise package

In [156]:
#relevant surprise imports
!pip install scikit-surprise
import surprise
from collections import defaultdict
from surprise import SVD
from surprise import Dataset
from surprise import accuracy



### Step 2 : Fit SVD model and compute accuracy

In [147]:
#load the dataset; it is downloaded first if it is not already available locally
#note: build_testset() returns the ratings of the trainset itself, so the model is evaluated on its own training data
data = Dataset.load_builtin('ml-100k')
trainset = data.build_full_trainset()
testset = trainset.build_testset()

# train SVD algorithm on the movielens dataset.
alg_svd = SVD()
alg_svd.fit(trainset)
pred_svd = alg_svd.test(testset)

# Get the RMSE of the prediction result
accuracy.rmse(pred_svd, verbose=True)

RMSE: 0.6750


0.6750463451886863

In [155]:
def get_top_n(predictions, n=5):
    
    # map the predictions to each userid
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # order the predictions for each user and keep the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n
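A small usage sketch of the same idea on made-up data. The tuples below mimic the five fields unpacked in `get_top_n` (user id, item id, true rating, estimate, details); the values are illustrative, and `top_n_per_user` is a hypothetical variant that uses `heapq.nlargest` instead of sorting each user's full list.

```python
import heapq
from collections import defaultdict

def top_n_per_user(predictions, n=5):
    """Return {user_id: [(item_id, estimate), ...]} with the n highest estimates."""
    by_user = defaultdict(list)
    for uid, iid, _true_r, est, _details in predictions:
        by_user[uid].append((iid, est))
    # nlargest avoids sorting each user's full list when n is small
    return {uid: heapq.nlargest(n, items, key=lambda x: x[1])
            for uid, items in by_user.items()}

# Hypothetical predictions: (uid, iid, true rating, estimate, details)
toy_predictions = [
    ("196", "285", 5.0, 4.28, None),
    ("196", "663", 4.0, 4.27, None),
    ("196", "8",   5.0, 4.08, None),
    ("186", "98",  5.0, 4.98, None),
    ("186", "71",  5.0, 4.47, None),
]

toy_top = top_n_per_user(toy_predictions, n=2)
print(toy_top)
```

Because the ranking only covers the (user, item) pairs passed in, the result depends on which pairs were scored. Pairs taken from already-rated movies rank known items, not unseen ones.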

In [148]:
# retrieve top recommendations for all user and movie pairs (u, i) in the test set.
top_n = get_top_n(pred_svd, n=5)

In [306]:
# Print the recommended movieid for each user
sp_recommendations=[]
for uid, user_ratings in top_n.items():
    sp_recommendations.append({'userId':uid, 'sp_recomendations':[iid for (iid, _) in user_ratings]})
sp_recommendations

[{'userId': '196', 'sp_recomendations': ['285', '663', '655', '1007', '8']},
 {'userId': '186', 'sp_recomendations': ['98', '71', '939', '79', '300']},
 {'userId': '22', 'sp_recomendations': ['173', '50', '96', '144', '174']},
 {'userId': '244', 'sp_recomendations': ['475', '9', '56', '208', '169']},
 {'userId': '166', 'sp_recomendations': ['313', '300', '258', '347', '322']},
 {'userId': '298', 'sp_recomendations': ['318', '174', '496', '483', '132']},
 {'userId': '115', 'sp_recomendations': ['56', '127', '100', '511', '137']},
 {'userId': '253', 'sp_recomendations': ['318', '496', '64', '427', '483']},
 {'userId': '305', 'sp_recomendations': ['474', '462', '134', '709', '483']},
 {'userId': '6', 'sp_recomendations': ['511', '474', '134', '135', '483']},
 {'userId': '62', 'sp_recomendations': ['56', '12', '173', '50', '172']},
 {'userId': '286', 'sp_recomendations': ['83', '588', '707', '315', '133']},
 {'userId': '200', 'sp_recomendations': ['174', '50', '172', '483', '169']},
 {'use

In [307]:
user_list_sp=[]

for i in sp_recommendations:
    if int(i["userId"]) in user_list:
        user_list_sp.append(i)
user_list_sp        

[{'userId': '267', 'sp_recomendations': ['114', '50', '474', '408', '169']},
 {'userId': '11', 'sp_recomendations': ['83', '318', '9', '286', '603']},
 {'userId': '198', 'sp_recomendations': ['50', '474', '127', '172', '173']},
 {'userId': '184', 'sp_recomendations': ['488', '134', '285', '483', '9']},
 {'userId': '314', 'sp_recomendations': ['64', '121', '692', '282', '143']},
 {'userId': '163', 'sp_recomendations': ['316', '318', '272', '64', '357']},
 {'userId': '504', 'sp_recomendations': ['318', '127', '735', '258', '98']},
 {'userId': '653', 'sp_recomendations': ['195', '22', '174', '50', '746']},
 {'userId': '710', 'sp_recomendations': ['127', '483', '134', '192', '23']},
 {'userId': '881', 'sp_recomendations': ['174', '22', '651', '121', '265']}]

In [308]:
sp_user_result=pd.DataFrame(user_list_sp)
sp_user_result

Unnamed: 0,sp_recomendations,userId
0,"[114, 50, 474, 408, 169]",267
1,"[83, 318, 9, 286, 603]",11
2,"[50, 474, 127, 172, 173]",198
3,"[488, 134, 285, 483, 9]",184
4,"[64, 121, 692, 282, 143]",314
5,"[316, 318, 272, 64, 357]",163
6,"[318, 127, 735, 258, 98]",504
7,"[195, 22, 174, 50, 746]",653
8,"[127, 483, 134, 192, 23]",710
9,"[174, 22, 651, 121, 265]",881


In [380]:
# Get the test pandas dataframe
test_pd=test.toPandas()
test_pd.columns=["userId","movieId","rating"]
test_pd.head()

Unnamed: 0,userId,movieId,rating
0,1,20,4
1,1,33,4
2,1,61,4
3,1,117,3
4,1,155,2


In [475]:
# rearrange the output format
df_sp=pd.DataFrame(top_n)
df_sp2 = pd.DataFrame(df_sp.values.T,index=df_sp.columns, columns=df_sp.index)
df_sp2.head()

Unnamed: 0,0,1,2,3,4
196,"(285, 4.282994600470375)","(663, 4.265198850470498)","(655, 4.1866216612752485)","(1007, 4.157957746470623)","(8, 4.0801881534084945)"
186,"(98, 4.98048269419873)","(71, 4.471490033280152)","(939, 4.383274280970822)","(79, 4.3635237627152055)","(300, 4.238584715449606)"
22,"(173, 5)","(50, 4.8279578514880095)","(96, 4.642193943427386)","(144, 4.62467283054835)","(174, 4.617472844047995)"
244,"(475, 5)","(9, 5)","(56, 5)","(208, 4.8161568425113295)","(169, 4.755425133689048)"
166,"(313, 4.788111641230368)","(300, 4.621002939116737)","(258, 4.254044671164396)","(347, 4.21435984535225)","(322, 4.051733576954385)"


In [562]:
# adjust the output dataframe to the right format 
rmd = df_sp2.rename(columns = lambda x : 'rmd_' + str(x))
rmd_list=["rmd_0","rmd_1","rmd_2","rmd_3","rmd_4"]

rmd_sp = rmd['rmd_0'].apply(pd.Series)
rmd_sp.columns=["movieId","score"]
rmd_sp["userId"]=df_sp2.index
for i in range(4):
    df_rmd = rmd[rmd_list[i+1]].apply(pd.Series)
    df_rmd.columns=["movieId","score"]
    df_rmd["userId"]=df_sp2.index
    rmd_sp=pd.concat([rmd_sp, df_rmd], ignore_index=True)

rmd_sp[['userId']]= rmd_sp[['userId']].astype(int)
rmd_sp[['movieId']]= rmd_sp[['movieId']].astype(int)
rmd_sp.head()

Unnamed: 0,movieId,score,userId
0,285,4.282995,196
1,98,4.980483,186
2,173,5.0,22
3,475,5.0,244
4,313,4.788112,166


In [563]:
#merge with the real rating
sp_user_df=pd.merge(rmd_sp,test_pd,on=['userId','movieId'])
sp_user_df.head()

Unnamed: 0,movieId,score,userId,rating
0,100,4.34455,63,5
1,1084,4.201571,50,5
2,127,5.0,157,5
3,22,4.98351,278,5
4,661,5.0,7,5


In [564]:
sp_user=rmd_sp[rmd_sp["userId"].isin(user_list)]
sp_user.head()

rmd_sp.columns=["movieId","predict_score","userId"]
rmd_sp_final = rmd_sp[['userId', 'movieId', 'predict_score']]
rmd_sp_final.head()

Unnamed: 0,userId,movieId,predict_score
0,196,285,4.282995
1,186,98,4.980483
2,22,173,5.0
3,244,475,5.0
4,166,313,4.788112


In [540]:
# get the user prediction summary dataframe in the user list
sp_user_final=rmd_sp_final.merge(test_pd,on=['userId','movieId'])
sp_user_final=sp_user_final[sp_user_final.userId.isin(user_list)]
sp_user_final

Unnamed: 0,userId,movieId,predict_score,rating
271,163,318,4.130945,4
445,314,692,4.953824,5
630,163,64,3.882521,4
680,653,50,4.214062,5


## MODEL 3: SageMaker's Factorization Machines

Another model built for this recommendation system is Factorization Machines (FM), a well-known built-in algorithm of Amazon SageMaker that is widely used for recommendation systems (for example, FM models are used by Netflix to recommend movies to users). Factorization Machines are one class of collaborative filtering algorithms. As the name suggests, FM uses matrix factorization to reduce the dimensionality of the problem, which greatly improves computational efficiency on large sparse datasets.

In the MovieLens dataset (and in real-world practice), the numbers of users and items are often large, whereas each user normally rates only a small portion of the available movies. The number of observed ratings is therefore small relative to all possible user-movie pairs, resulting in a large sparse dataset. The basic idea of factorization in the FM model is that a sparse rating matrix can be decomposed into a dense user matrix and a dense item matrix of lower dimension. Another benefit of using FM is that matrix factorization can also fill in the blank values of the rating matrix, which means we can recommend new items to users.
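To make the factorization idea concrete, here is a minimal NumPy sketch of the standard second-order FM score (bias, linear terms, and pairwise interactions expressed through latent factor vectors). It is not SageMaker's implementation: the function name `fm_score`, the toy sizes, and the random parameters are assumptions for illustration only. For a one-hot encoded (user, movie) pair, the score reduces to a bias plus two weights plus the dot product of the user and movie factor vectors.

```python
import numpy as np

def fm_score(x, w0, w, V):
    """Second-order FM score for one feature vector x.

    w0: global bias, w: linear weights (n,), V: latent factors (n, k).
    """
    linear = w0 + w @ x
    # Pairwise term via the O(n*k) identity:
    # sum_{i<j} <V_i, V_j> x_i x_j = 0.5 * sum_f [(V^T x)_f^2 - ((V^2)^T x^2)_f]
    xv = V.T @ x
    pairwise = 0.5 * np.sum(xv ** 2 - (V ** 2).T @ (x ** 2))
    return linear + pairwise

rng = np.random.default_rng(0)
n_users, n_movies, k = 3, 4, 2           # toy sizes, not the real dataset
n_features = n_users + n_movies
w0, w = 0.1, rng.normal(size=n_features)
V = rng.normal(size=(n_features, k))     # random placeholder parameters

# One-hot encode "user 2 rated movie 3" (0-based indices)
x = np.zeros(n_features)
user_idx, movie_idx = 2, n_users + 3
x[user_idx] = 1.0
x[movie_idx] = 1.0

score = fm_score(x, w0, w, V)
# With exactly two active features the score reduces to
# bias + user weight + movie weight + <user factors, movie factors>
expected = w0 + w[user_idx] + w[movie_idx] + V[user_idx] @ V[movie_idx]
print(score, expected)
```

For a binary classifier, a sigmoid would typically be applied to this score to obtain a probability. That step is omitted here.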

### Step 1 : Preprocessing and creating sparse matrix

In [39]:
#necessary imports
import sagemaker
import sagemaker.amazon.common as smac
from sagemaker import get_execution_role
from sagemaker.predictor import json_deserializer
from sagemaker.amazon.amazon_estimator import get_image_uri

from scipy.sparse import lil_matrix
import boto3, io, os

In [42]:
# get the number of unique users and movies
n_users= train_movie_ratings['user_id'].max()
n_movies=train_movie_ratings['movie_id'].max()
# In sparse matrix, features are one-hot encoded
#number of features should be the sum of number of movies and users
n_features=n_users+n_movies

#reference the training and test dataframes created earlier for the ALS model
n_test_ratings=len(test_movie_ratings.index)
n_train_ratings=len(train_movie_ratings.index)

In [43]:
# demonstrate the result
print (" number of users: ", n_users)
print (" number of movies: ", n_movies)
print (" Training Count: ", n_train_ratings)
print (" Test Count: ", n_test_ratings)
print (" Features (number of users + number of movies): ", n_features)

 number of users:  943
 number of movies:  1682
 Training Count:  90570
 Test Count:  9430
 Features (number of users + number of movies):  2625


In [44]:
#define a function to convert dataframe into sparse matrix
def loadDataset(data, lines, columns):
    # Features are one-hot encoded in a sparse matrix
    X = lil_matrix((lines, columns)).astype('float32')
    # Labels are stored in a vector
    y = []
    line=0
    for index, row in data.iterrows():
            X[line,row['user_id']-1] = 1
            X[line, n_users+(row['movie_id']-1)] = 1
            if int(row['rating']) >= 4:
                y.append(1)
            else:
                y.append(0)
            line=line+1

    y=np.array(y).astype('float32')            
    return X,y

#derive the sparse matrix for the test and training datasets
X_train, y_train = loadDataset(train_movie_ratings, n_train_ratings, n_features)
X_test, y_test = loadDataset(test_movie_ratings, n_test_ratings, n_features)

In [316]:
#show the shape of sparse matrix
print(X_test.shape)
print(y_test.shape)
assert X_test.shape  == (n_test_ratings, n_features)
assert y_test.shape  == (n_test_ratings, )

#check how balanced the test labels are
#np.count_nonzero counts the ones (ratings >= 4), not the zeros
one_labels = np.count_nonzero(y_test)
zero_labels = n_test_ratings - one_labels
print("Test labels: %d zeros, %d ones" % (zero_labels, one_labels))
print("Ratio of ones to zeros:", round(one_labels/zero_labels, 2) )

(9430, 2625)
(9430,)
Test labels: 3961 zeros, 5469 ones
Ratio of ones to zeros: 1.38
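
The row-by-row loop in `loadDataset` can become slow on larger datasets. The sketch below shows one possible vectorized construction with `scipy.sparse.csr_matrix` on toy data. The function name `build_one_hot` and the sample ratings are illustrative assumptions; it is meant to produce the same layout as above (user columns first, then movie columns, and label 1 for ratings of 4 or more).

```python
import numpy as np
from scipy.sparse import csr_matrix

def build_one_hot(user_ids, movie_ids, ratings, n_users, n_movies):
    """One-hot encode (user, movie) pairs and binarize ratings (>= 4 -> 1)."""
    user_ids = np.asarray(user_ids)
    movie_ids = np.asarray(movie_ids)
    n_rows = len(user_ids)
    rows = np.repeat(np.arange(n_rows), 2)            # two non-zeros per row
    cols = np.column_stack([user_ids - 1,             # user block: 0 .. n_users-1
                            n_users + movie_ids - 1]  # movie block follows
                           ).ravel()
    data = np.ones(2 * n_rows, dtype="float32")
    X = csr_matrix((data, (rows, cols)), shape=(n_rows, n_users + n_movies))
    y = (np.asarray(ratings) >= 4).astype("float32")
    return X, y

# Toy ratings (1-based IDs, as in MovieLens); values are made up
X_toy, y_toy = build_one_hot(user_ids=[1, 2, 3], movie_ids=[2, 1, 2],
                             ratings=[5, 3, 4], n_users=3, n_movies=2)
print(X_toy.toarray())
print(y_toy)
```

Building the matrix from index arrays in one call avoids per-element writes; whether it matters in practice depends on the dataset size.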


### Step 2 : Store the data to S3 in protobuf format

In [46]:
#specify my personal bucket name 
bucket = 'movierecom'
prefix = 'fm'

#write the key and prefix for train, test and output
train_key      = 'train.protobuf'
train_prefix   = '{}/{}'.format(prefix, 'train')

test_key       = 'test.protobuf'
test_prefix    = '{}/{}'.format(prefix, 'test')

output_prefix  = 's3://{}/{}/output'.format(bucket, prefix)

In [47]:
# define a function to convert data to Protobuf format
def writeDatasetToProtobuf(X, bucket, prefix, key, d_type, y=None):
    Pbuf = io.BytesIO()
    if d_type == "sparse":
        smac.write_spmatrix_to_sparse_tensor(Pbuf, X, labels=y)
    else:
        smac.write_numpy_to_dense_tensor(Pbuf, X, labels=y)
        
    Pbuf.seek(0)
    obj = '{}/{}'.format(prefix, key)
    boto3.resource('s3').Bucket(bucket).Object(obj).upload_fileobj(Pbuf)
    return 's3://{}/{}'.format(bucket,obj)

# show the storage path in S3
fm_train_data_path = writeDatasetToProtobuf(X_train, bucket, train_prefix, train_key, "sparse", y_train)    
fm_test_data_path  = writeDatasetToProtobuf(X_test, bucket, test_prefix, test_key, "sparse", y_test)    
  
print ("Training data S3 path: ",fm_train_data_path)
print ("Test data S3 path: ",fm_test_data_path)
print ("FM model output S3 path: {}".format(output_prefix))

Training data S3 path:  s3://movierecom/fm/train/train.protobuf
Test data S3 path:  s3://movierecom/fm/test/test.protobuf
FM model output S3 path: s3://movierecom/fm/output


### Step 3 : Fit SageMaker factorization machines

In [48]:
fm = sagemaker.estimator.Estimator(get_image_uri(boto3.Session().region_name, "factorization-machines"),
                                   get_execution_role(), 
                                   train_instance_count=1, 
                                   train_instance_type='ml.m5.large',
                                   output_path=output_prefix,
                                   sagemaker_session=sagemaker.Session())

In [49]:
# set an initial group of hyperparameters
fm.set_hyperparameters(feature_dim=n_features,
                      predictor_type='binary_classifier',
                      mini_batch_size=1000,
                      num_factors=64,
                      epochs=50)

In [50]:
#fit the fm model specifying storage path
fm.fit({'train': fm_train_data_path, 'test': fm_test_data_path})

2020-04-28 14:06:55 Starting - Starting the training job...
2020-04-28 14:06:56 Starting - Launching requested ML instances...
2020-04-28 14:07:54 Starting - Preparing the instances for training......
2020-04-28 14:08:40 Downloading - Downloading input data...
2020-04-28 14:09:18 Training - Downloading the training image..
Docker entrypoint called with argument(s): train
Running default environment configuration script
  from numpy.testing import nosetester
[04/28/2020 14:09:33 INFO 140478314145600] Reading default configuration from /opt/amazon/lib/python2.7/site-packages/algorithm/resources/default-conf.json: {u'factors_lr': u'0.0001', u'linear_init_sigma': u'0.01', u'epochs': 1, u'_wd': u'1.0', u'_num_kv_servers': u'auto', u'use_bias': u'true', u'factors_init_sigma': u'0.001', u'_log_level': u'info', u'bias_init_method': u'normal', u'linear_init_method': u'normal', u'linear_lr': u'0.001', u'factors_init_method': u'normal', u'_tuning_objective_metric': u'', 


2020-04-28 14:09:32 Training - Training image download completed. Training in progress.
[2020-04-28 14:09:44.764] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 22, "duration": 893, "num_examples": 91, "num_bytes": 5796480}
[04/28/2020 14:09:44 INFO 140478314145600] #quality_metric: host=algo-1, epoch=10, train binary_classification_accuracy <score>=0.709252747253
[04/28/2020 14:09:44 INFO 140478314145600] #quality_metric: host=algo-1, epoch=10, train binary_classification_cross_entropy <loss>=0.610058519301
[04/28/2020 14:09:44 INFO 140478314145600] #quality_metric: host=algo-1, epoch=10, train binary_f_1.000 <score>=0.763163077143
#metrics {"Metrics": {"update.time": {"count": 1, "max": 896.2259292602539, "sum": 896.2259292602539, "min": 896.2259292602539}}, "EndTime": 1588082984.764859, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 158808298

[2020-04-28 14:09:54.673] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 44, "duration": 893, "num_examples": 91, "num_bytes": 5796480}
[04/28/2020 14:09:54 INFO 140478314145600] #quality_metric: host=algo-1, epoch=21, train binary_classification_accuracy <score>=0.732340659341
[04/28/2020 14:09:54 INFO 140478314145600] #quality_metric: host=algo-1, epoch=21, train binary_classification_cross_entropy <loss>=0.573220805745
[04/28/2020 14:09:54 INFO 140478314145600] #quality_metric: host=algo-1, epoch=21, train binary_f_1.000 <score>=0.770401093463
#metrics {"Metrics": {"update.time": {"count": 1, "max": 895.8101272583008, "sum": 895.8101272583008, "min": 895.8101272583008}}, "EndTime": 1588082994.674255, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 1588082993.777332}
[04/28/2020 14:09:54 INFO 140478314145600] #progress_metric: host=al

[34m[2020-04-28 14:10:04.443] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 66, "duration": 886, "num_examples": 91, "num_bytes": 5796480}[0m
[34m[04/28/2020 14:10:04 INFO 140478314145600] #quality_metric: host=algo-1, epoch=32, train binary_classification_accuracy <score>=0.73821978022[0m
[34m[04/28/2020 14:10:04 INFO 140478314145600] #quality_metric: host=algo-1, epoch=32, train binary_classification_cross_entropy <loss>=0.554142837021[0m
[34m[04/28/2020 14:10:04 INFO 140478314145600] #quality_metric: host=algo-1, epoch=32, train binary_f_1.000 <score>=0.77219959072[0m
[34m#metrics {"Metrics": {"update.time": {"count": 1, "max": 888.3960247039795, "sum": 888.3960247039795, "min": 888.3960247039795}}, "EndTime": 1588083004.443954, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 1588083003.554575}
[0m
[34m[04/28/2020 14:10:04 INFO 140478314145600] #progress_metric: host=algo


2020-04-28 14:10:24 Uploading - Uploading generated training model[34m[2020-04-28 14:10:09.896] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 78, "duration": 896, "num_examples": 91, "num_bytes": 5796480}[0m
[34m[04/28/2020 14:10:09 INFO 140478314145600] #quality_metric: host=algo-1, epoch=38, train binary_classification_accuracy <score>=0.739956043956[0m
[34m[04/28/2020 14:10:09 INFO 140478314145600] #quality_metric: host=algo-1, epoch=38, train binary_classification_cross_entropy <loss>=0.54694677399[0m
[34m[04/28/2020 14:10:09 INFO 140478314145600] #quality_metric: host=algo-1, epoch=38, train binary_f_1.000 <score>=0.772776156092[0m
[34m#metrics {"Metrics": {"update.time": {"count": 1, "max": 899.2660045623779, "sum": 899.2660045623779, "min": 899.2660045623779}}, "EndTime": 1588083009.897833, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "factorization-machines"}, "StartTime": 1588083008.997622}
[0m
[34m[0


2020-04-28 14:10:31 Completed - Training job completed
Training seconds: 111
Billable seconds: 111


### Step 4 : Create hyperparameter tuning jobs in SageMaker 

In [51]:
# Use the best hyperparameters that give us the highest binary classification accuracy
fm.set_hyperparameters(feature_dim=n_features,
                      predictor_type='binary_classifier',
                      mini_batch_size=359,
                      num_factors=64,
                      epochs=403)

### Step 5 : Deploy fm model and predict

In [None]:
#deploy fm model and get the prediction 
fm_predictor = fm.deploy(initial_instance_count=1,
                         instance_type='ml.t2.medium')

------------!

In [171]:
# Serialize request data as JSON and parse the JSON responses returned by the endpoint
import json
from sagemaker.predictor import json_deserializer

def fm_serializer(data):
    """Convert a 2-D array of feature rows into the JSON request body sent to the endpoint."""
    js = {'instances': []}
    for row in data:
        js['instances'].append({'features': row.tolist()})
    return json.dumps(js)

fm_predictor.content_type = 'application/json'
fm_predictor.serializer = fm_serializer
fm_predictor.deserializer = json_deserializer
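
To make the request format concrete, the sketch below builds the same JSON body from plain Python lists, so it runs without an endpoint or NumPy. The helper name `to_fm_request` and the two sample rows are illustrative assumptions, not part of the notebook.

```python
import json

def to_fm_request(rows):
    """Build the JSON body: one {"features": [...]} entry per input row."""
    return json.dumps({"instances": [{"features": list(row)} for row in rows]})

# Two hypothetical one-hot encoded (user, movie) rows, for illustration only
rows = [[0, 1, 0, 1, 0], [1, 0, 0, 0, 1]]
payload = to_fm_request(rows)
print(payload)
```

Each row of the sparse test matrix becomes one entry under `instances`, which is why the notebook converts the rows with `.toarray()` before calling `predict`.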

In [172]:
# Predict a single test record through the endpoint and compare it with the true label
sample = X_test[1000].toarray()
result = fm_predictor.predict(sample)

print(y_test[1000])
print(result)

0.0
{'predictions': [{'score': 0.5855739712715149, 'predicted_label': 1.0}]}


In [356]:
# Predict labels for a batch of 100 test records (rows 1000-1099)
predictions = []
for array in np.array_split(X_test[1000:1100].toarray(), 1):  # a single chunk is enough for 100 rows
    result = fm_predictor.predict(array)
    predictions += [r['predicted_label'] for r in result['predictions']]

predictions = np.array(predictions)

In [174]:
# Confusion matrix for the same 100 records (rows: actual labels, columns: predicted labels)
pd.crosstab(y_test[1000:1100], predictions, rownames=['actuals'], colnames=['predictions'])

actuals (rows) / predictions (columns),0.0,1.0
0.0,49,12
1.0,22,17
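
The counts in the confusion matrix are enough to work out the usual classification metrics by hand. The sketch below is a plain-Python calculation using only the four numbers shown above; the variable names are illustrative.

```python
# Counts read from the confusion matrix above
tn, fp = 49, 12   # actual 0.0: predicted 0.0 / predicted 1.0
fn, tp = 22, 17   # actual 1.0: predicted 0.0 / predicted 1.0

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# prints: accuracy=0.660 precision=0.586 recall=0.436 f1=0.500
```

On this 100-record sample the model is noticeably better at identifying movies a user will not like (49 of 61) than movies a user will like (17 of 39).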


### Step 6 : Get the recommendation result for each user in user list

In [468]:
# Get the positions (not movie IDs) of the top recommended movies for a user
def GetRecIndex(user_id):
    # The code relies on each user occupying 10 consecutive rows of X_test
    loc_low = (user_id - 1) * 10
    loc_high = loc_low + 10

    # Score the user's (user, movie) pairs through the endpoint
    result = fm_predictor.predict(X_test[loc_low:loc_high].toarray())
    pred = result["predictions"]

    # Record each pair's position so the movie ID can be looked up later
    for position, item in enumerate(pred):
        item["index"] = position

    # Keep only the pairs with predicted_label equal to 1 (recommend)
    pred_filter = [item for item in pred if item["predicted_label"] == 1]

    # Sort by score, highest first, and keep the top 5
    pred_filter.sort(key=lambda x: x['score'], reverse=True)
    top5_index = pred_filter[:5]
    return top5_index
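
The filter-and-rank step can be tried offline. The sketch below applies the same idea to a hand-made response in the shape returned by the endpoint; the function name `top_recommendations` and the scores are hypothetical and only illustrate the logic.

```python
def top_recommendations(predictions, k=5):
    """Return the k highest-scoring entries labelled 1, each tagged with its position."""
    tagged = [dict(p, index=i) for i, p in enumerate(predictions)]
    recommended = [p for p in tagged if p["predicted_label"] == 1]
    return sorted(recommended, key=lambda p: p["score"], reverse=True)[:k]

# Hypothetical endpoint response for four (user, movie) pairs
response = {"predictions": [
    {"score": 0.62, "predicted_label": 1.0},
    {"score": 0.31, "predicted_label": 0.0},
    {"score": 0.88, "predicted_label": 1.0},
    {"score": 0.55, "predicted_label": 1.0},
]}
top2 = top_recommendations(response["predictions"], k=2)
print([p["index"] for p in top2])  # positions of the two best recommended pairs
```

Because only pairs labelled 1 survive the filter, a user can end up with fewer than five recommendations, as happens for several users in the results below.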

In [469]:
# Map the recommended positions back to movie IDs
def GetRecMovieID(user_id):
    # rows of the test set that belong to this user
    df_user = test_movie_ratings[test_movie_ratings.user_id == user_id]
    # positions of the top recommendations for this user
    top5_index = GetRecIndex(user_id)
    loc = [i["index"] for i in top5_index]
    # look up the movie ID at each position
    recommendation = df_user.reset_index().movie_id[loc].values.tolist()
    return {'userId': user_id, 'fm_recommendations': recommendation}

In [466]:
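# Build a per-user summary of recommended movies, predicted scores and actual ratings
def UserFmDf(userId):
    # predicted scores of the top recommendations
    agg_score = [i["score"] for i in GetRecIndex(userId)]
    a = pd.DataFrame(agg_score, columns=["predicted_score"])
    # matching movie IDs, in the same order
    b = pd.DataFrame(GetRecMovieID(userId))
    c = pd.concat([a, b], axis=1)
    c.rename(columns={'fm_recommendations': 'movieId'}, inplace=True)
    # attach the actual ratings from the test set
    fm_result = pd.merge(c, test_pd, on=['userId', 'movieId'])
    fm_result = fm_result[["userId", "movieId", "predicted_score", "rating"]]
    return fm_result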

In [443]:
# Get the recommended movie IDs for user 198
GetRecMovieID(198)

{'userId': 198, 'fm_recommendations': [100, 179, 498, 135, 7]}

In [467]:
#get the user recommendation summary dataframe
UserFmDf(198)

Unnamed: 0,userId,movieId,predicted_score,rating
0,198,100,0.766967,1
1,198,179,0.720465,4
2,198,498,0.717518,3
3,198,135,0.70093,5
4,198,7,0.506485,4


In [228]:
user_list=[198,11,314,184,163,710,881,504,267,653]

In [444]:
# Show the recommendation result for each user in the list
fm_user_list=[GetRecMovieID(i) for i in user_list]
fm_user_list

[{'userId': 198, 'fm_recommendations': [100, 179, 498, 135, 7]},
 {'userId': 11, 'fm_recommendations': [425, 558]},
 {'userId': 314, 'fm_recommendations': [28, 95, 692, 417, 1518]},
 {'userId': 184, 'fm_recommendations': [98, 191, 187, 153, 602]},
 {'userId': 163, 'fm_recommendations': [98, 318, 64]},
 {'userId': 710, 'fm_recommendations': [50, 197, 22, 116, 200]},
 {'userId': 881, 'fm_recommendations': [180, 423, 663, 133]},
 {'userId': 504, 'fm_recommendations': [66, 163, 581, 72, 179]},
 {'userId': 267, 'fm_recommendations': [423, 238, 518, 980, 403]},
 {'userId': 653, 'fm_recommendations': [50, 272]}]

In [445]:
fm_user_result=pd.DataFrame(fm_user_list)
fm_user_result

Unnamed: 0,fm_recommendations,userId
0,"[100, 179, 498, 135, 7]",198
1,"[425, 558]",11
2,"[28, 95, 692, 417, 1518]",314
3,"[98, 191, 187, 153, 602]",184
4,"[98, 318, 64]",163
5,"[50, 197, 22, 116, 200]",710
6,"[180, 423, 663, 133]",881
7,"[66, 163, 581, 72, 179]",504
8,"[423, 238, 518, 980, 403]",267
9,"[50, 272]",653


In [557]:
final_fm_user = []
for i in user_list:
    data = UserFmDf(i)
    # store DataFrame in list
    final_fm_user.append(data)
# see pd.concat documentation for more info
final_fm_user = pd.concat(final_fm_user)
final_fm_user.head(10)

Unnamed: 0,userId,movieId,predicted_score,rating
0,198,100,0.766967,1
1,198,179,0.720465,4
2,198,498,0.717518,3
3,198,135,0.70093,5
4,198,7,0.506485,4
0,11,425,0.545703,4
1,11,558,0.517921,3
0,314,28,0.776762,5
1,314,95,0.749725,5
2,314,692,0.741075,5


In [None]:
# Uncomment to delete the endpoint once it is no longer needed
#import sagemaker
#sagemaker.Session().delete_endpoint(fm_predictor.endpoint)

## Model Combination : Vote Strategy

The scoring strategy shown below takes into account both the RMSE of each model and the scores each model predicts. Because the three models score on different scales, the predicted scores of each model are first standardized; only the relative distance between scores within a model matters. Next, a weight is computed for every recommendation as the standardized score divided by the RMSE of the model that produced it, so a high score from a more accurate model counts for more. The larger the weight, the more reliable the recommendation is taken to be. Then, the weights of each (userId, movieId) pair generated by the different models are added together. Sorting the pairs by weight gives the final recommendation result.
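
The strategy described above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions rather than the notebook's implementation: the function names `standardize` and `vote`, the model names and the toy scores are all hypothetical, and the sketch assumes each model produces at least two distinct scores so the standard deviation is not zero.

```python
from collections import defaultdict
from statistics import mean, pstdev

def standardize(values):
    """Zero-mean, unit-variance scaling using the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def vote(model_outputs, top_n=5):
    """model_outputs maps a model name to (rmse, [(user, movie, raw_score), ...])."""
    totals = defaultdict(float)
    for rmse, rows in model_outputs.values():
        scaled = standardize([score for _, _, score in rows])
        for (user, movie, _), z in zip(rows, scaled):
            totals[(user, movie)] += z / rmse  # weight = standardized score / RMSE
    ranked = defaultdict(list)
    for (user, movie), weight in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
        if len(ranked[user]) < top_n:
            ranked[user].append((movie, round(weight, 4)))
    return dict(ranked)

# Hypothetical toy scores, not taken from the notebook
toy = {
    "model_a": (1.0, [(1, 10, 5.0), (1, 20, 3.0), (1, 30, 1.0)]),
    "model_b": (0.5, [(1, 20, 0.9), (1, 40, 0.5), (1, 10, 0.1)]),
}
result = vote(toy, top_n=2)
print(result)
```

In the toy data, movie 20 ranks first for user 1 because the model with the lower RMSE scores it highest, while movie 10 drops out because the two models disagree about it.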

In [354]:
# print the prediction result of three models 
print(als_user_result)
print(sp_user_result)
print(fm_user_result)

             als_recommendations  userId
0  [1463, 1643, 1536, 814, 1449]     198
1  [814, 1463, 1536, 1449, 1500]      11
2   [1554, 1662, 814, 1472, 867]     314
3  [1536, 1463, 1643, 814, 1449]     184
4  [1554, 814, 1463, 1500, 1293]     163
5  [1463, 1643, 1536, 814, 1449]     710
6  [1463, 814, 1554, 1536, 1500]     881
7  [814, 1463, 1449, 1500, 1293]     504
8  [1463, 1536, 1643, 814, 1449]     267
9  [814, 1463, 1554, 1500, 1449]     653
          sp_recomendations userId
0  [114, 50, 474, 408, 169]    267
1    [83, 318, 9, 286, 603]     11
2  [50, 474, 127, 172, 173]    198
3   [488, 134, 285, 483, 9]    184
4  [64, 121, 692, 282, 143]    314
5  [316, 318, 272, 64, 357]    163
6  [318, 127, 735, 258, 98]    504
7   [195, 22, 174, 50, 746]    653
8  [127, 483, 134, 192, 23]    710
9  [174, 22, 651, 121, 265]    881
          fm_recommendations  userId
0    [100, 179, 498, 135, 7]     198
1                 [425, 558]      11
2   [28, 95, 692, 417, 1518]     314
3   [98, 191, 18

In [558]:
final_fm_user.head()

Unnamed: 0,userId,movieId,predicted_score,rating
0,198,100,0.766967,1
1,198,179,0.720465,4
2,198,498,0.717518,3
3,198,135,0.70093,5
4,198,7,0.506485,4


In [550]:
als_user.head()

Unnamed: 0,userId,movieId,predicted_rating
1350,881,1463.0,5.182528
1351,881,814.0,5.130446
1352,881,1554.0,4.892687
1353,881,1536.0,4.831121
1354,881,1500.0,4.821637


In [565]:
sp_user.head()

Unnamed: 0,movieId,score,userId
71,114,5.0,267
72,83,4.558399,11
85,50,4.537311,198
162,488,4.746187,184
264,64,5.0,314


In [554]:
import sklearn.preprocessing

# Standardize the predicted ratings (zero mean, unit variance)
als_user = als_user.copy()  # work on a copy to avoid SettingWithCopyWarning
x = als_user.loc[:, ["predicted_rating"]].values
als_user["predicted_rating"] = sklearn.preprocessing.StandardScaler().fit_transform(x).ravel()
# Recall that the RMSE of the ALS estimator on the test data is around 0.957
als_user["weight"] = als_user["predicted_rating"] * (1 / 0.957)
als_user.head()


Unnamed: 0,userId,movieId,predicted_rating,weight
1350,881,1463.0,0.298999,0.312434
1351,881,814.0,0.213841,0.223449
1352,881,1554.0,-0.174917,-0.182777
1353,881,1536.0,-0.275582,-0.287964
1354,881,1500.0,-0.29109,-0.304169


In [525]:
# Calculate the RMSE for the FM prediction result
from math import sqrt
import sklearn.preprocessing
from sklearn.metrics import mean_squared_error

# Standardize both columns so that predicted scores and ratings are on a comparable scale
x = final_fm_user.loc[:, ["predicted_score", "rating"]].values
x = sklearn.preprocessing.StandardScaler().fit_transform(x)
RMSE_table = pd.DataFrame(x, columns=["predicted_score", "rating"])

rmse = sqrt(mean_squared_error(RMSE_table["predicted_score"], RMSE_table["rating"]))
print('Root mean squared error of the test_data for fm result: %.4f' % rmse)

Root mean squared error of the test_data for fm result: 1.3624
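
Because both columns are standardized before the error is computed, this RMSE depends only on how strongly the scores and ratings are correlated: for z-scores, RMSE = sqrt(2 * (1 - r)), where r is the Pearson correlation. A value of 1.3624 therefore corresponds to r of roughly 0.07. The sketch below checks the identity on made-up numbers using only the standard library; the helper names and the sample values are illustrative assumptions.

```python
from math import sqrt
from statistics import mean, pstdev

def standardize(values):
    """Zero-mean, unit-variance scaling using the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def rmse(a, b):
    """Root mean squared error between two equally long sequences."""
    return sqrt(mean([(x - y) ** 2 for x, y in zip(a, b)]))

# Hypothetical scores and ratings, for illustration only
scores = [0.77, 0.72, 0.70, 0.51, 0.55]
ratings = [1, 4, 3, 4, 5]
z_scores, z_ratings = standardize(scores), standardize(ratings)

r = mean([x * y for x, y in zip(z_scores, z_ratings)])  # Pearson correlation of the two columns
print(round(rmse(z_scores, z_ratings), 4), round(sqrt(2 * (1 - r)), 4))  # the two values agree
```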


In [559]:
# Apply the same standardization to the FM recommendation scores
x = final_fm_user.loc[:, ["predicted_score"]].values
final_fm_user["predicted_score"] = sklearn.preprocessing.StandardScaler().fit_transform(x).ravel()
# Weight = standardized score / RMSE of the FM result (about 1.36)
final_fm_user["weight"] = final_fm_user["predicted_score"] * (1 / 1.36)
final_fm_user.head()

Unnamed: 0,userId,movieId,predicted_score,rating,weight
0,198,100,0.758106,1,0.557431
1,198,179,0.377026,4,0.277225
2,198,498,0.352875,3,0.259467
3,198,135,0.216937,5,0.159513
4,198,7,-1.376539,4,-1.012161


In [566]:
# Apply the same standardization to the Surprise scores
sp_user = sp_user.copy()  # work on a copy to avoid SettingWithCopyWarning
x = sp_user.loc[:, ["score"]].values
sp_user["score"] = sklearn.preprocessing.StandardScaler().fit_transform(x).ravel()
sp_user["score"] = sp_user["score"] * 0.675
sp_user.rename(columns={'score': 'predicted_score'}, inplace=True)
# Note: the factor 0.675 applied above cancels the division here, so the weight equals the standardized score
sp_user["weight"] = sp_user["predicted_score"] * (1 / 0.675)
sp_user.head()


Unnamed: 0,movieId,predicted_score,userId,weight
71,114,1.062711,267,1.574387
72,83,0.075229,11,0.111451
85,50,0.028073,198,0.04159
162,488,0.495149,184,0.733555
264,64,1.062711,314,1.574387


In [573]:
# Concatenate the weighted results of the three models into one DataFrame
# (sort=True keeps the alphabetical column order and avoids the pandas FutureWarning)
final_recom = pd.concat([als_user, final_fm_user, sp_user], ignore_index=True, sort=True)
final_recom


Unnamed: 0,movieId,predicted_rating,predicted_score,rating,userId,weight
0,1463.0,0.298999,,,881,0.312434
1,814.0,0.213841,,,881,0.223449
2,1554.0,-0.174917,,,881,-0.182777
3,1536.0,-0.275582,,,881,-0.287964
4,1500.0,-0.291090,,,881,-0.304169
5,1554.0,-1.098930,,,163,-1.148308
6,814.0,-1.265069,,,163,-1.321911
7,1463.0,-1.568281,,,163,-1.638747
8,1500.0,-1.605603,,,163,-1.677746
9,1293.0,-1.778316,,,163,-1.858219


In [574]:
# Drop the columns that are no longer needed and keep only the weight for comparing (userId, movieId) pairs
final_recom['movieId']=final_recom['movieId'].astype(int)
final_recom=final_recom.drop(['predicted_score',"rating","predicted_rating"], axis=1)
final_recom.head()

Unnamed: 0,movieId,userId,weight
0,1463,881,0.312434
1,814,881,0.223449
2,1554,881,-0.182777
3,1536,881,-0.287964
4,1500,881,-0.304169


In [578]:
summary=final_recom.sort_values(["userId","weight"],ascending=[True,False])
summary.head(10)

Unnamed: 0,movieId,userId,weight
35,814,11,0.182833
92,83,11,0.111451
36,1463,11,0.033692
102,318,11,-0.086218
37,1536,11,-0.359141
38,1449,11,-0.398001
39,1500,11,-0.448859
112,9,11,-0.53344
122,286,11,-0.574012
132,603,11,-0.628978


In [583]:
# Take the five highest-weighted recommendations for each user in the list
top5_per_user = [summary[summary.userId == u].head(5) for u in user_list]

In [589]:
# Stack the per-user results into the final recommendation table
df_result = pd.concat(top5_per_user)
df_result


Unnamed: 0,movieId,userId,weight
50,100,198,0.557431
40,1463,198,0.294939
51,179,198,0.277225
52,498,198,0.259467
53,135,198,0.159513
35,814,11,0.182833
92,83,11,0.111451
36,1463,11,0.033692
102,318,11,-0.086218
37,1536,11,-0.359141


In [None]:
df_result.to_csv("recommendation_for_user_list.csv")

## GitHub

https://github.com/Lyhq1996/LHQ
