# Exercise 2: Recommender System using Apache Spark MLlib

We implement a recommender system using Apache Spark. The data is first read into a pandas DataFrame and converted to a Spark DataFrame. The timestamp column is dropped because it is not used in making recommendations, and the rating column is cast to double so that it has the numeric type expected by the built-in ALS function.

In [2]:
#Setting up spark
from pyspark import SparkContext
from pyspark.sql import SQLContext
import pandas as pd

sc = SparkContext()
sqlContext = SQLContext(sc)

In [4]:
from pyspark.sql.functions import col
from pyspark.sql.types import DoubleType

#Reading data
rating = pd.read_csv(r'/home/kritz/Documents/DDL/Ex10/movieLens/ratings.csv')
ratings = sqlContext.createDataFrame(rating)

#Dropping timestamp as it is not necessary
columns_to_drop = ['timestamp']
rate = ratings.drop(*columns_to_drop)
rate.show(5)

#Renaming columns and converting to double type for function to use
rate = rate.select(col("userId").alias("user"),
                   col("movieId").alias("item"),
                   col("rating"))
newrate = rate.withColumn("rating", rate["rating"].cast(DoubleType()))

+------+-------+------+
|userId|movieId|rating|
+------+-------+------+
|     1|     31|   2.5|
|     1|   1029|   3.0|
|     1|   1061|   3.0|
|     1|   1129|   2.0|
|     1|   1172|   4.0|
+------+-------+------+
only showing top 5 rows



After the data is prepared, it is split randomly into training and test sets, with weights of 80% for training and the remaining 20% for testing.

In [9]:
#train and test split
train,test = newrate.randomSplit([0.8, 0.2])
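
The split above is random, so the resulting sizes are only approximately 80% and 20%. The following is a minimal pure-Python sketch of that idea, assuming each row is assigned by an independent random draw. It is an illustration and not Spark's implementation; `random_split` and the `rows` list are hypothetical names used only here.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split by an independent random draw."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds of each split on the interval [0, 1).
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()
        for k, bound in enumerate(bounds):
            # The last split also catches any floating-point leftovers.
            if x < bound or k == len(bounds) - 1:
                splits[k].append(row)
                break
    return splits

rows = list(range(1000))  # hypothetical stand-in for the rating rows
train_rows, test_rows = random_split(rows, [0.8, 0.2], seed=42)
print(len(train_rows), len(test_rows))
```

In Spark itself, passing a seed, e.g. `newrate.randomSplit([0.8, 0.2], seed=42)`, makes the split reproducible across runs.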

To make recommendations by matrix factorization, we use ALS (Alternating Least Squares). It trains a matrix factorization model from a DataFrame of ratings given by users for a subset of items. The ratings matrix is approximated as the product of two lower-rank matrices of a given rank (the number of latent features). To solve for these features, ALS runs iteratively, alternating between the user factors and the item factors, with a configurable level of parallelism.
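
The alternating updates can be illustrated with a small pure-Python sketch. It assumes rank 1, explicit ratings, and a hypothetical toy data set, so each least-squares subproblem reduces to a scalar division. It is not Spark's implementation, and `als_rank1` and `squared_error` are names made up for this example.

```python
# Hypothetical toy data: (user, item, rating) triples; user 2 has not rated item 0.
ratings = [(0, 0, 4.0), (0, 1, 2.0), (1, 0, 2.0), (1, 1, 1.0), (2, 1, 3.0)]

def als_rank1(ratings, n_users, n_items, reg=0.01, iters=20):
    """Rank-1 ALS: alternate closed-form updates of user and item factors."""
    u = [1.0] * n_users  # one latent feature per user
    v = [1.0] * n_items  # one latent feature per item
    for _ in range(iters):
        # Fix the item factors and solve the regularized least squares for each user.
        for i in range(n_users):
            num = sum(r * v[j] for (a, j, r) in ratings if a == i)
            den = reg + sum(v[j] ** 2 for (a, j, r) in ratings if a == i)
            u[i] = num / den
        # Fix the user factors and solve for each item.
        for j in range(n_items):
            num = sum(r * u[i] for (i, b, r) in ratings if b == j)
            den = reg + sum(u[i] ** 2 for (i, b, r) in ratings if b == j)
            v[j] = num / den
    return u, v

def squared_error(ratings, u, v):
    """Sum of squared errors over the observed ratings only."""
    return sum((r - u[i] * v[j]) ** 2 for (i, j, r) in ratings)

u, v = als_rank1(ratings, n_users=3, n_items=2)
print(squared_error(ratings, u, v))  # error on the observed ratings
print(u[2] * v[0])                   # predicted rating for the unobserved pair
```

With a higher rank, each update becomes a small linear system per user or item instead of a single division, which is the part Spark distributes.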

The hyperparameters tuned in cross-validation are the rank, the maximum number of iterations, the regularization parameter and alpha. An RMSE evaluator is used, i.e. cross-validation selects the parameter combination with the lowest RMSE. The best combination found is then used to make predictions, from which the RMSE is computed for the train and test datasets.
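
To get a feel for the cost of this search, the sketch below enumerates the combinations in plain Python, using the same candidate values as the parameter grid in this exercise. The fold count is an assumption: it is `CrossValidator`'s default `numFolds` of 3, which applies here only because no other value is passed.

```python
from itertools import product

# Candidate values for each hyperparameter, as in the exercise's grid.
grid = {
    "rank": [20, 100],
    "maxIter": [10, 15],
    "regParam": [0.01, 1.0],
    "alpha": [10.0, 40.0],
}

# Every combination of one value per hyperparameter.
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]

num_folds = 3  # assumption: CrossValidator's default numFolds
fits_during_cv = len(combinations) * num_folds

print(len(combinations), fits_during_cv)  # 16 48
```

Each of these fits trains a full ALS model, so adding one more value to any hyperparameter noticeably increases the run time.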

In [12]:
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator

#ALS model (implicitPrefs=True treats the ratings as implicit-feedback strengths)
alsImplicit = ALS(implicitPrefs=True)

#Param grid for cv
paramMapImplicit = ParamGridBuilder() \
                    .addGrid(alsImplicit.rank, [20, 100]) \
                    .addGrid(alsImplicit.maxIter, [10, 15]) \
                    .addGrid(alsImplicit.regParam, [0.01, 1.0]) \
                    .addGrid(alsImplicit.alpha, [10.0, 40.0]) \
                    .build()

#RMSE Evaluator
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                                predictionCol="prediction")

#CV 
cvEstimator= CrossValidator(estimator=alsImplicit, 
                            estimatorParamMaps=paramMapImplicit, evaluator=evaluator)

#Fitting CV
cvModel=cvEstimator.fit(train)

print(cvModel)

CrossValidatorModel_4c408ac6fee24df5c0ed


In [30]:
# Evaluate the model on training data
predictions = cvModel.transform(train)
rmse = evaluator.evaluate(predictions)
print("Root Mean Squared Error in Train data = " + str(rmse))

# Evaluate the model on test data
prediction = cvModel.transform(test)
rmse = evaluator.evaluate(prediction)
print("Root Mean Squared Error in Test data = " + str(rmse))

Root Mean Squared Error in Train data = 2.8648101624433124
Root Mean Squared Error in Test data = 1.6589547812569874


It can be seen that the RMSE for the test data is 1.66, which is higher than the 0.98 baseline reported at "http://www.mymedialite.net". Since we select our own hyperparameters and use a random split of the data, some deviation is to be expected. Compared with the values from the previous implementation, these RMSE values are much lower, as a wider combination of hyperparameters is tested. The MyMediaLite baseline is still the better model. The predictions are as follows:

In [32]:
#Prediction
predsImplicit = cvModel.bestModel.transform(test)
predsImplicit.show(5)

+----+----+------+----------+
|user|item|rating|prediction|
+----+----+------+----------+
| 452| 463|   2.0|0.69587874|
|  85| 471|   3.0|0.85212296|
| 588| 471|   3.0|0.73714733|
| 548| 471|   4.0|  1.001869|
| 452| 471|   3.0| 0.3340096|
+----+----+------+----------+
only showing top 5 rows
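
To make the metric concrete, RMSE can be computed by hand on the five rows shown above. This is a minimal sketch in plain Python; it covers only these five rows, so its value is not comparable to the RMSE over the whole test set. The helper name `rmse` is introduced just for this example.

```python
import math

# (rating, prediction) pairs copied from the five rows shown above.
rows = [
    (2.0, 0.69587874),
    (3.0, 0.85212296),
    (3.0, 0.73714733),
    (4.0, 1.001869),
    (3.0, 0.3340096),
]

def rmse(pairs):
    """Root mean squared error between ratings and predictions."""
    squared = [(rating - prediction) ** 2 for rating, prediction in pairs]
    return math.sqrt(sum(squared) / len(squared))

print(rmse(rows))
```

Every prediction in these rows lies well below its rating, which is why the error on this small sample is large.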

