* https://spark.apache.org/docs/2.1.0/ml-classification-regression.html
* https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html
* https://spark.apache.org/docs/2.2.0/ml-collaborative-filtering.html

----------------------------
https://medium.com/@patelneha1495/recommendation-system-in-python-using-als-algorithm-and-apache-spark-27aca08eaab3

----------------------------------

In [2]:
#from pyspark import SparkConf, SparkContext


#from pyspark.ml.classification import LogisticRegression
#from pyspark.ml.regression import LinearRegression

from pyspark.sql import Row, SparkSession

from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS


In [7]:
# set up environment
#conf = SparkConf() \
#    .setAppName("MovieLensALS") \
#    .set("spark.executor.memory", "2g")
#sc = SparkContext(conf=conf)

spark = SparkSession.builder.appName('Recommendation_system').getOrCreate()

In [10]:
path_data = "/home/p5hngk/Downloads/GitHub/SD_701---Data_Mining/ml-latest-small"

df = spark.read.format("csv").option("header", "true").load(path_data+"/ratings.csv")
df.show(10)

+------+-------+------+---------+
|userId|movieId|rating|timestamp|
+------+-------+------+---------+
|     1|      1|   4.0|964982703|
|     1|      3|   4.0|964981247|
|     1|      6|   4.0|964982224|
|     1|     47|   5.0|964983815|
|     1|     50|   5.0|964982931|
|     1|     70|   3.0|964982400|
|     1|    101|   5.0|964980868|
|     1|    110|   4.0|964982176|
|     1|    151|   5.0|964984041|
|     1|    157|   5.0|964984100|
+------+-------+------+---------+
only showing top 10 rows



In [11]:
df1 = df.select(df['userId'],df['movieId'],df['rating'])
df1.show(10)

+------+-------+------+
|userId|movieId|rating|
+------+-------+------+
|     1|      1|   4.0|
|     1|      3|   4.0|
|     1|      6|   4.0|
|     1|     47|   5.0|
|     1|     50|   5.0|
|     1|     70|   3.0|
|     1|    101|   5.0|
|     1|    110|   4.0|
|     1|    151|   5.0|
|     1|    157|   5.0|
+------+-------+------+
only showing top 10 rows



In [17]:
df1.printSchema()

root
 |-- userId: string (nullable = true)
 |-- movieId: string (nullable = true)
 |-- rating: string (nullable = true)




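Before building an ALS model, note that ALS needs numeric columns: the user and item IDs must be numeric with values in the integer range, and the rating must be numeric as well. Here every column was read as a string, so we have to cast them. Be aware that the cast of `rating` to `IntegerType` below does not preserve half-star ratings such as 3.5, which MovieLens allows; a floating-point type would keep them.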

In [18]:
from pyspark.sql.types import IntegerType

df1 = df1.withColumn("userId", df1["userId"].cast(IntegerType()))
df1 = df1.withColumn("movieId", df1["movieId"].cast(IntegerType()))
df1 = df1.withColumn("rating", df1["rating"].cast(IntegerType()))

df1.printSchema()

root
 |-- userId: integer (nullable = true)
 |-- movieId: integer (nullable = true)
 |-- rating: integer (nullable = true)



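We can now split the data into training (80%) and test (20%) sets. The split is random and unseeded, so the rows in each set, and the RMSE below, vary from run to run.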

In [19]:
(training,test) = df1.randomSplit([0.8, 0.2])

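Now we create the ALS estimator and fit it to the training data. `coldStartStrategy = "drop"` removes test rows whose prediction would be NaN (users or movies not seen during training), so the RMSE below is computed only on rows the model can score.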

In [20]:
als = ALS(maxIter=5, regParam=0.09, rank=25, userCol = "userId", itemCol = "movieId", ratingCol = "rating", coldStartStrategy = "drop", nonnegative=True)
model = als.fit(training)
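To give an idea of what "alternating least squares" means, here is a minimal NumPy sketch of the basic technique: fix the item factors and solve a small ridge regression per user, then swap roles. It is a simplified illustration, not Spark's implementation. It uses plain L2 regularization and no non-negativity constraint, and the function names, the toy matrix and all parameter values are assumptions.

```python
import numpy as np


def als_sketch(R, mask, rank=2, reg=0.1, iters=10, seed=0):
    """Factor R ~ U @ V.T using only the entries where mask is True."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.random((n_users, rank))
    V = rng.random((n_items, rank))
    ridge = reg * np.eye(rank)
    for _ in range(iters):
        # Item factors fixed: solve a regularized least-squares problem per user.
        for u in range(n_users):
            seen = mask[u]
            Vs = V[seen]
            U[u] = np.linalg.solve(Vs.T @ Vs + ridge, Vs.T @ R[u, seen])
        # User factors fixed: solve the same kind of problem per item.
        for i in range(n_items):
            seen = mask[:, i]
            Us = U[seen]
            V[i] = np.linalg.solve(Us.T @ Us + ridge, Us.T @ R[seen, i])
    return U, V


def observed_rmse(R, mask, U, V):
    """RMSE of the reconstruction on the observed entries only."""
    err = (R - U @ V.T)[mask]
    return float(np.sqrt(np.mean(err ** 2)))


# Toy ratings matrix (rows: users, columns: items); 0 marks a missing rating.
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
], dtype=float)
mask = R > 0

U, V = als_sketch(R, mask)
fit_rmse = observed_rmse(R, mask, U, V)
print("RMSE on observed entries:", fit_rmse)
```

In this sketch, `rank` plays the role of the `rank` parameter above, `reg` loosely corresponds to `regParam`, and `iters` to `maxIter`.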

We can now generate predictions on the test set and evaluate the root-mean-square error (RMSE).

In [21]:
evaluator = RegressionEvaluator(metricName = "rmse", labelCol = "rating", predictionCol = "prediction")
predictions = model.transform(test)
rmse = evaluator.evaluate(predictions)

print("RMSE = "+str(rmse))
predictions.show(10)

RMSE = 0.9270372961175919
+------+-------+------+----------+
|userId|movieId|rating|prediction|
+------+-------+------+----------+
|   372|    471|     3| 2.8045597|
|   603|    471|     4| 3.2363825|
|   182|    471|     4| 3.5148435|
|    57|    471|     3| 3.5302713|
|   555|    471|     3|  4.063595|
|   176|    471|     5| 3.3719444|
|    32|    471|     3| 3.5954008|
|   469|    471|     5| 3.0325816|
|   426|    471|     5|  2.927421|
|    44|    833|     2| 2.0427682|
+------+-------+------+----------+
only showing top 10 rows
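To make the metric concrete, the sketch below computes RMSE by hand, as the square root of the mean squared difference between rating and prediction. The function name `rmse` is hypothetical. The sample pairs are the ten rows printed above, so the result covers only those rows and will not match the RMSE over the whole test set.

```python
import math


def rmse(pairs):
    """Root-mean-square error over (rating, prediction) pairs."""
    squared_errors = [(rating - prediction) ** 2 for rating, prediction in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# (rating, prediction) pairs copied from the ten rows shown above.
sample = [
    (3, 2.8045597), (4, 3.2363825), (4, 3.5148435), (3, 3.5302713),
    (3, 4.063595), (5, 3.3719444), (3, 3.5954008), (5, 3.0325816),
    (5, 2.927421), (2, 2.0427682),
]
sample_rmse = rmse(sample)
print("RMSE on the ten displayed rows:", sample_rmse)
```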



## Providing Recommendations

In [22]:
# show() returns None, so keep the DataFrame and display it separately
user_recs = model.recommendForAllUsers(20)
user_recs.show(10)

+------+--------------------+
|userId|     recommendations|
+------+--------------------+
|   471|[[6818, 4.666887]...|
|   463|[[33649, 5.091219...|
|   496|[[858, 4.4274006]...|
|   148|[[98491, 4.589981...|
|   540|[[171495, 5.02515...|
|   392|[[6818, 6.2692513...|
|   243|[[33090, 6.492032...|
|    31|[[1411, 5.455585]...|
|   516|[[27611, 4.918539...|
|   580|[[6300, 4.9749503...|
+------+--------------------+
only showing top 10 rows



------------------------------------
-----------------------------------