# Recommendation Engine (MovieLens 100k data)

## pyspark.ml Implementation using Alternating Least Squares (ALS) Matrix Factorization
ALS factorizes the user-item rating matrix into two low-rank matrices, one of user factors and one of item (product) factors, whose product should approximate the observed ratings. It alternates between the two: with the item factors held fixed it solves a least-squares problem for the user factors, then holds the user factors fixed and solves for the item factors, repeating until the iteration limit is reached.
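
The alternation can be illustrated with a minimal NumPy sketch on a toy dense matrix. This is not how pyspark.ml implements ALS internally (Spark's version is distributed); the function name `als_sketch`, the toy ratings, and the boolean `mask` of observed entries are assumptions made for illustration. Scaling the ridge term by the number of observed ratings is likewise an assumption, modelled on the regParam scaling this notebook mentions.

```python
import numpy as np

def als_sketch(R, mask, rank=2, reg=0.1, n_iter=10, seed=0):
    """Toy ALS: R is a dense rating matrix, mask is True where a rating is observed."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, rank))  # user factors
    V = rng.normal(scale=0.1, size=(n_items, rank))  # item factors
    eye = np.eye(rank)
    for _ in range(n_iter):
        # Step 1: hold item factors fixed, solve a ridge regression per user
        for u in range(n_users):
            obs = mask[u]
            if obs.any():
                Vo = V[obs]
                A = Vo.T @ Vo + reg * obs.sum() * eye
                U[u] = np.linalg.solve(A, Vo.T @ R[u, obs])
        # Step 2: hold user factors fixed, solve a ridge regression per item
        for i in range(n_items):
            obs = mask[:, i]
            if obs.any():
                Uo = U[obs]
                A = Uo.T @ Uo + reg * obs.sum() * eye
                V[i] = np.linalg.solve(A, Uo.T @ R[obs, i])
    return U, V

# Hypothetical 4x3 rating matrix; 0.0 stands for "not rated"
R = np.array([[5.0, 3.0, 0.0],
              [4.0, 0.0, 1.0],
              [1.0, 1.0, 5.0],
              [0.0, 1.0, 4.0]])
mask = R > 0
U, V = als_sketch(R, mask)
pred = U @ V.T  # reconstructed ratings, including the unobserved cells
rmse_observed = np.sqrt(np.mean((pred[mask] - R[mask]) ** 2))
print("RMSE on observed entries:", rmse_observed)
```

Each inner solve is an ordinary regularised least-squares problem, which is why fixing one factor matrix at a time makes the otherwise non-convex factorization tractable.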

**Assumptions and FYI:**
- Treating the entries in the user-item matrix as explicit feedback
- Using the DataFrame-based API (pyspark.sql.DataFrame) so that the machine learning pipelines in pyspark.ml can be used
- The regularisation parameter is scaled when solving each least-squares problem in ALS (this makes regParam less dependent on the scale of the data)
- The cold start strategy is set to drop any rows in the DataFrame of predictions that contain NaN values
- Model evaluation will be based on root-mean-square error (RMSE) of rating prediction
- Not creating a test-train split (as specified in the question)
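
The scaled regularisation in the list above corresponds, as far as the Spark documentation describes it, to the weighted-λ variant of ALS (often called ALS-WR). A sketch of the objective follows; the symbols $x_u$, $y_i$, $n_u$, $n_i$ and $\mathcal{K}$ are notation introduced here, not taken from this notebook:

$$
\min_{X,\,Y} \sum_{(u,i) \in \mathcal{K}} \left( r_{ui} - x_u^\top y_i \right)^2 + \lambda \left( \sum_u n_u \lVert x_u \rVert^2 + \sum_i n_i \lVert y_i \rVert^2 \right)
$$

Here $\mathcal{K}$ is the set of observed (user, item) pairs, $r_{ui}$ is the observed rating, $x_u$ and $y_i$ are the user and item factor vectors, $n_u$ and $n_i$ are the number of ratings given by user $u$ and received by item $i$, and $\lambda$ plays the role of regParam.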

!["Image"](https://i.pinimg.com/originals/ba/f0/c8/baf0c80a9fea91e79365630709a1fa5c.png)

### Importing relevant libraries

In [1]:
from pyspark.sql import SparkSession  #entry point to Spark's DataFrame API
from pyspark import SparkContext  #used to read the raw text file as an RDD
from pyspark.sql import Row
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator
import pandas as pd

### Setting up "SparkSession" (and "SparkContext" to read the raw file)
**Note**: SparkSession provides a single point of entry to the underlying Spark functionality and
allows programming Spark with the DataFrame and Dataset APIs.

In [2]:
spark = SparkSession.builder \
        .master("local") \
        .appName("RecommendationSystems") \
        .config("spark.some.config.option", "some-value") \
        .getOrCreate()
        
sc=SparkContext.getOrCreate()

### Importing file as RDD and cleaning it; Mapping data to Row objects; Making ADS_SDF
**Note**: SDF is a Spark DataFrame (creating it for convenient further processing)

In [3]:
#Importing file from the current directory into an RDD object
movielens_RDD = sc.textFile("ml-100k/u.data")

#Cleaning up (MovieLens RDD is tab separated)
#Note: RDD schema ~ "user" "product" "rating" "timestamp"
movielens_RDD_clean=movielens_RDD.map(lambda x:x.split('\t'))

#Mapping cleaned RDD to Row objects (timestamp is dropped)
#Note: ADS RDD schema ~ "user"(int) "product"(int) "rating"(float)
ADS_RDD=movielens_RDD_clean.map(lambda x: Row(int(x[0]),
        int(x[1]), float(x[2])))
ADS_SDF=spark.createDataFrame(ADS_RDD, ['userId', 'movieId', 'rating'] )
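
The split-and-cast step above can be checked without a Spark cluster. The following is a small plain-Python sketch of the same per-line logic; the helper name `parse_rating` and the sample line are illustrative assumptions, chosen only to match the tab-separated "user, product, rating, timestamp" layout described in the comments.

```python
def parse_rating(line):
    """Split one tab-separated u.data line into (userId, movieId, rating)."""
    user, movie, rating, _timestamp = line.split("\t")
    # Same casts as the Row(...) mapping: integer IDs, float rating
    return int(user), int(movie), float(rating)

sample = "196\t242\t3\t881250949"  # illustrative line in the u.data layout
print(parse_rating(sample))
```

Unpacking into exactly four names also makes malformed lines fail loudly with a `ValueError` instead of silently producing a short row.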

### Training Model (using pyspark.ml's ALS model)

In [9]:
#Setting up the parameters for ALS
rank_=7   #Number of latent factors
maxIter_=10   #Maximum number of ALS iterations
regParam_= 0.1   #Regularization parameter in ALS

#Instantiating ALS and fitting the model to the whole dataset
ALS_obj = ALS(rank=rank_, maxIter=maxIter_, regParam=regParam_, 
          userCol="userId", itemCol="movieId", ratingCol="rating", 
          coldStartStrategy="drop")
ALS_model = ALS_obj.fit(ADS_SDF)

### Model Evaluations (based on RMSE)

In [10]:
#Evaluating the model by computing the RMSE on the whole data
predictions = ALS_model.transform(ADS_SDF)
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                                predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print("Root-mean-square error = " + str(rmse))

Root-mean-square error = 0.8000418417915021
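
For reference, the metric that `RegressionEvaluator` reports here can be written out by hand. This is a minimal plain-Python sketch of the RMSE formula, not Spark's implementation; the function name `rmse` and the sample values are assumptions for illustration. Because the model above is scored on the same data it was fitted to, the reported figure is a training error.

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error between two equal-length sequences of ratings."""
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical predicted vs. actual ratings
print(rmse([3.5, 4.0, 2.0], [4.0, 4.0, 1.0]))
```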


### Top 10 movie recommendations for all users

In [11]:
#Generate top 10 movie recommendations for each user
userRecs_SDF = ALS_model.recommendForAllUsers(10)

#Reshaping the SDF: one row per user with the recommended movie IDs as a comma-separated string
user_movieRecs_SDF=userRecs_SDF.select("userId","recommendations.movieId")
user_movieRecs_DF=user_movieRecs_SDF.toPandas().sort_values("userId")
user_movieRecs_DF['movieId(s)']=[str(i)[1:-1] for i in user_movieRecs_DF.movieId]
user_movieRecs_DF.drop('movieId',axis=1,inplace=True)

#Exporting to a tab-separated text file
user_movieRecs_DF.to_csv("Swapnil_Parkhe.txt", sep='\t', index=False)
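
The `str(i)[1:-1]` idiom used above strips the brackets from the string form of a Python list. Depending on the Spark and pandas versions in use, `toPandas()` may instead return array columns as NumPy arrays, whose string form has no commas; that is an assumption worth checking in your own environment. A hedged alternative that does not depend on the container's string form is sketched here; the helper name `format_ids` and the sample IDs are hypothetical.

```python
def format_ids(ids):
    """Render a sequence of movie IDs as a comma-separated string."""
    return ", ".join(str(int(i)) for i in ids)

recs = [50, 172, 181]  # hypothetical recommendations for one user
print(format_ids(recs))     # explicit join
print(str(recs)[1:-1])      # bracket-stripping idiom on a plain list
```

For a plain list both lines print the same text, so swapping in the explicit join should leave the exported file unchanged in that case.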