## Collaborative Filtering Movie Recommendation System with Explicit Rating

In [1]:
## 1. Content-based (product-based/user-based) recommendation and collaborative filtering are the two major approaches to recommendation systems. Collaborative filtering is the more common and widely used of the two.
## 2. Use the alternating least squares (ALS) method to estimate the rating matrix.
## 3. Depending on the number of latent factors, the number of free parameters is usually very large and likely to lead to overfitting. Regularization can be added to penalize large parameters.
## 4. Common difficulties in rating estimation: a. sparsity, b. cold start, c. computational intensity
### a. Sparsity: choose smart rating measures: explicit ratings (review, rating, like/dislike) and implicit ratings (# of views, length of time, etc.)
####    challenges of implicit feedback: no negative feedback, noisy, no preference or order, can't be evaluated by RMSE (fine for optimization)
### b. Cold start: needs to be handled differently in validation and production
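To make points 2 and 3 concrete, here is a minimal NumPy sketch of regularized ALS on explicit ratings. It is an illustration under stated assumptions, not Spark's implementation: the function names, the zero-based integer indices, and the plain L2 penalty are choices made for this example (Spark's ALS additionally scales the penalty by the number of ratings per user or item).

```python
import numpy as np

def regularized_loss(ratings, U, V, reg):
    """Squared error on observed ratings plus an L2 penalty on both factor matrices."""
    err = sum((r - U[u] @ V[i]) ** 2 for u, i, r in ratings)
    return err + reg * (np.sum(U ** 2) + np.sum(V ** 2))

def als_explicit(ratings, n_users, n_items, rank=2, reg=0.1, n_iters=10, seed=0):
    """ratings: list of (user_index, item_index, rating) triples, zero-based."""
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(n_users, rank))  # user latent factors
    V = rng.normal(scale=0.1, size=(n_items, rank))  # item latent factors

    # Group the observed ratings by user and by item.
    by_user, by_item = {}, {}
    for u, i, r in ratings:
        by_user.setdefault(u, []).append((i, r))
        by_item.setdefault(i, []).append((u, r))

    def solve_side(fixed, observed, target):
        # With the other side held fixed, each row is a small ridge regression:
        # (F^T F + reg * I) x = F^T r, using only that row's observed ratings.
        for idx, pairs in observed.items():
            cols = [c for c, _ in pairs]
            r = np.array([val for _, val in pairs])
            F = fixed[cols]
            A = F.T @ F + reg * np.eye(rank)
            target[idx] = np.linalg.solve(A, F.T @ r)

    for _ in range(n_iters):
        solve_side(V, by_user, U)  # update users with items fixed
        solve_side(U, by_item, V)  # update items with users fixed
    return U, V
```

Because each half-step solves its sub-problem exactly, the regularized loss cannot increase from one iteration to the next; a larger `reg` shrinks the factors and trades training fit for less overfitting.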

### 1. Initiate App and Load Raw Data

In [2]:
from pyspark.sql import SparkSession

spark=SparkSession\
    .builder\
    .appName('Collaborative Filtering Movie Recommendation System')\
    .getOrCreate()

In [3]:
ratingRawData=spark.read.format('csv').option('header','true').load('02/demos/datasets/movielens/ratings.csv')

In [4]:
ratingRawData.toPandas().head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [5]:
## select all columns except timestamp
from pyspark.sql.functions import col

dataset=ratingRawData.select(col('userId').cast('int'),
                             col('movieId').cast('int'),
                             col('rating').cast('float'))

dataset.toPandas().head()

Unnamed: 0,userId,movieId,rating
0,1,31,2.5
1,1,1029,3.0
2,1,1061,3.0
3,1,1129,2.0
4,1,1172,4.0


In [6]:
## It's a pretty clean explicit rating dataset, no need for further feature engineering.
## Check the distribution of the dataset

dataset.select('rating').toPandas().describe()

Unnamed: 0,rating
count,100004.0
mean,3.543608
std,1.058064
min,0.5
25%,3.0
50%,4.0
75%,4.0
max,5.0


In [7]:
## Split into training and test datasets

(trainingData, testData)=dataset.randomSplit([0.8,0.2])

### 2. Define CF model with ALS-WR method

In [8]:
##maxIter: the maximum number of iterations (defaults to 10)
##regParam: regularization parameter in ALS (defaults to 0.1)
##coldStartStrategy: 'nan' (default) or 'drop'
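
The choice of `coldStartStrategy` matters for evaluation: with `'nan'`, a user or movie that never appeared in the training split receives a NaN prediction, and a single NaN makes the RMSE itself NaN. The plain-Python sketch below illustrates the effect; the three scored rows and the helper name `rmse` are hypothetical and exist only for this example.

```python
import math

def rmse(pairs):
    """Root mean square error over (label, prediction) pairs."""
    squared = [(label - pred) ** 2 for label, pred in pairs]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical scored rows; the NaN stands for a user or movie unseen in training.
scored = [(4.0, 3.5), (3.0, 3.0), (5.0, float('nan'))]

rmse_with_nan = rmse(scored)  # NaN propagates through the sum and the square root
rmse_dropped = rmse([(y, p) for y, p in scored if not math.isnan(p)])  # finite
```

Dropping such rows, as `'drop'` does, keeps the metric defined, at the cost of silently excluding cold-start cases from the test score.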

In [9]:
from pyspark.ml.recommendation import ALS

als=ALS(maxIter=10,
        regParam=0.1,
        userCol='userId',
        itemCol='movieId',
        ratingCol='rating',
        coldStartStrategy='drop'
       )

In [10]:
## build the ALS Model with training dataset

model=als.fit(trainingData)

In [11]:
## transform test dataset with predictions

predictions=model.transform(testData)
predictions.toPandas().head(10)

Unnamed: 0,userId,movieId,rating,prediction
0,311,463,3.0,2.884489
1,602,471,3.0,4.247396
2,274,471,5.0,3.304128
3,440,471,3.0,3.431053
4,30,471,4.0,3.813798
5,184,471,5.0,4.225024
6,294,833,2.0,1.978942
7,516,1088,3.0,3.546981
8,372,1088,4.0,3.662085
9,54,1088,5.0,2.948514


### 3. Model Evaluation

In [12]:
## Compare the distribution of values for true ratings and predictions
### Predicted ratings are unconstrained: they can be negative or greater than 5.

predictions.select('rating','prediction').toPandas().describe()

Unnamed: 0,rating,prediction
count,19226.0,19226.0
mean,3.558176,3.390774
std,1.050238,0.749404
min,0.5,-0.352382
25%,3.0,2.960795
50%,4.0,3.472612
75%,4.0,3.909958
max,5.0,5.514994
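
Since the true ratings lie between 0.5 and 5.0, one common post-processing step is to clip predictions into that range. The notebook does not do this; the sketch below is one possible approach, with `clip_rating` a hypothetical helper applied to the minimum, median, and maximum predictions from the table above.

```python
def clip_rating(prediction, low=0.5, high=5.0):
    """Clamp a predicted rating into the valid rating range [low, high]."""
    return max(low, min(high, prediction))

# Minimum, median, and maximum predictions taken from the describe() output.
clipped = [clip_rating(p) for p in (-0.352382, 3.472612, 5.514994)]
# clipped == [0.5, 3.472612, 5.0]
```

When every true rating lies inside the range, clipping moves an out-of-range prediction closer to the label, so it cannot increase the squared error. In Spark the same idea could be expressed with `least` and `greatest` from `pyspark.sql.functions`.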


In [13]:
## Get the root mean square error (RMSE) on the test data
## Explicit ratings can be evaluated with RMSE, but implicit datasets can't

from pyspark.ml.evaluation import RegressionEvaluator

evaluator=RegressionEvaluator(metricName='rmse',
                              labelCol='rating',
                              predictionCol='prediction'
                              )

rmse=evaluator.evaluate(predictions)

In [14]:
rmse

0.9152329483582983
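
For reference, the metric reported above can be written as follows. The notation is an assumption made for this note: $T$ is the set of (user, movie) pairs in the test predictions, $r_{ui}$ is the true rating, and $\hat{r}_{ui}$ is the predicted rating.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{|T|} \sum_{(u,i) \in T} \left( r_{ui} - \hat{r}_{ui} \right)^2}
$$

On the 0.5 to 5.0 rating scale, a value of about 0.92 means the predictions deviate from the true ratings by a little under one star in the root-mean-square sense.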

### 4. Movie Recommendation

#### 4.1 Recommendations for all users/items

In [15]:
## 3 recommendations for each user
recForUsers=model.recommendForAllUsers(3)
recForUsers.toPandas().head()

Unnamed: 0,userId,recommendations
0,471,"[(54328, 4.962196350097656), (65037, 4.9458522..."
1,463,"[(83411, 5.183130741119385), (67504, 5.1831307..."
2,496,"[(1680, 5.401086330413818), (4427, 5.089951992..."
3,148,"[(83411, 5.777069568634033), (67504, 5.7770695..."
4,540,"[(5791, 5.752669811248779), (59684, 5.72042131..."


In [16]:
## top 3 users for each movie
userForMovie=model.recommendForAllItems(3)
userForMovie.toPandas().head()

Unnamed: 0,movieId,recommendations
0,1580,"[(113, 4.927398681640625), (46, 4.915234565734..."
1,5300,"[(296, 6.016067981719971), (469, 5.59177589416..."
2,6620,"[(331, 4.8599534034729), (52, 4.80920648574829..."
3,7340,"[(113, 4.468500137329102), (621, 4.43926239013..."
4,32460,"[(298, 4.934330940246582), (46, 4.754263877868..."


#### 4.2 Recommendations for a specific user

In [17]:
userMovieList=recForUsers.filter(recForUsers.userId==148).select('recommendations')
recMovieList=userMovieList.collect()[0].recommendations

In [18]:
recMovieDF=spark.createDataFrame(recMovieList)
recMovieDF.toPandas()

Unnamed: 0,movieId,rating
0,83411,5.77707
1,67504,5.77707
2,54328,5.337588


In [19]:
## Load movie info

movieDF=spark.read.csv('02/demos/datasets/movielens/movies.csv',header=True,ignoreLeadingWhiteSpace=True)
movieDF.toPandas().head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [20]:
## join recMovieDF and movieDF 

recMovieFinalDF=movieDF.join(recMovieDF, on=['movieId']).orderBy('rating',ascending=False).select('title','genres','rating')
recMovieFinalDF.toPandas()

Unnamed: 0,title,genres,rating
0,Cops (1922),Comedy,5.77707
1,Land of Silence and Darkness (Land des Schweig...,Documentary,5.77707
2,My Best Friend (Mon meilleur ami) (2006),Comedy,5.337588


#### 4.3 Recommendation Engine  

In [21]:
## Combine 4.1 and 4.2 into a recommendation engine for movie recommendation
## This project runs on Spark 2.2; Spark 2.3 adds an ALSModel method, recommendForUserSubset,
## which is more flexible in this case

def getMovieRecommendationsForUser(userId,numRecs):
    allUserRecs=model.recommendForAllUsers(numRecs)
    
    userMovieList=allUserRecs.filter(allUserRecs.userId==userId).select('recommendations')
    recMovieList=userMovieList.collect()[0].recommendations
    recMovieDF=spark.createDataFrame(recMovieList)
    
    recMovieFinalDF=movieDF.join(recMovieDF, on=['movieId']).orderBy('rating',ascending=False).select('title','genres','rating')
    
    return recMovieFinalDF

In [22]:
getMovieRecommendationsForUser(219,5).toPandas()

Unnamed: 0,title,genres,rating
0,Ben X (2007),Drama,5.393658
1,Lake of Fire (2006),Documentary,5.393658
2,Hachiko: A Dog's Story (a.k.a. Hachi: A Dog's ...,Drama,5.360955
3,The Imitation Game (2014),Drama|Thriller|War,5.308814
4,"Electric Horseman, The (1979)",Comedy|Western,5.299702
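
The note in 4.3 mentions `recommendForUserSubset`, available from Spark 2.3. A hedged sketch of how the engine might look with it is shown below. It was not run as part of this notebook, which uses Spark 2.2; the function name and its explicit `spark`, `model`, and `movie_df` parameters are choices made for this example.

```python
def get_recs_for_user_subset(spark, model, movie_df, user_id, num_recs):
    """Sketch: top-N titles for one user via ALSModel.recommendForUserSubset (Spark 2.3+)."""
    # One-row DataFrame holding the user of interest; the column name must match userCol.
    users = spark.createDataFrame([(user_id,)], ['userId'])
    recs = model.recommendForUserSubset(users, num_recs)
    # Flatten the array of (movieId, rating) structs into one row per recommendation.
    flat = (recs.selectExpr('explode(recommendations) AS rec')
                .select('rec.movieId', 'rec.rating'))
    # Attach titles and genres, highest predicted rating first.
    return (movie_df.join(flat, on=['movieId'])
                    .orderBy('rating', ascending=False)
                    .select('title', 'genres', 'rating'))
```

A call such as `get_recs_for_user_subset(spark, model, movieDF, 219, 5)` would score only the requested user rather than every user. It also avoids the `collect()[0]` lookup, which raises an `IndexError` when the user has no recommendations; here such a user would simply produce an empty result.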
