# Matrix Factorization

We will experiment with the recent MovieLens 25M Dataset and build a recommender system using two approaches:
* Factorizing the user-item matrix using the Spark ALS implementation
* Factorizing the item-item PMI matrix using randomized SVD

In both settings we will index the item embeddings and inspect their quality using KNN queries.

# Part 1

### Download the dataset

In [0]:
!wget http://files.grouplens.org/datasets/movielens/ml-25m.zip
!unzip ml-25m
dbutils.fs.ls("file:/databricks/driver/ml-25m/")
dbutils.fs.mv("file:/databricks/driver/ml-25m/", "dbfs:/ml-25m/", recurse=True)

In [0]:
dbutils.fs.ls("dbfs:/ml-25m/")

### Loading the ratings dataset

In [0]:
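# Movie metadata (movieId, title, genres), kept under its own name
movies = spark.read.csv('dbfs:/ml-25m/movies.csv', header=True, inferSchema=True).cache()
# Ratings (userId, movieId, rating, timestamp)
df = spark.read.csv('dbfs:/ml-25m/ratings.csv', header=True, inferSchema=True)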

In [0]:
df.head(5)

### Split the dataset
We want to randomly split the dataset into a train part (80%) and a test part (20%).

In [0]:
train, test = df.cache().randomSplit([0.8, 0.2], seed=12345)

### Build ALS model
Using the Spark ALS implementation described at https://spark.apache.org/docs/2.2.0/ml-collaborative-filtering.html, build a model on the ml-25m dataset.

How long does the training take? Change the rank (i.e. the dimension of the vectors) from 10 to 20. How does that affect training speed?
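A minimal sketch of how the timing comparison could be set up. The helper `timed` is a hypothetical name, not part of Spark: it wraps any zero-argument callable and returns its result together with the elapsed wall-clock time. The commented usage assumes a running Spark session and the `train` split defined above.

```python
import time

def timed(fn):
    """Run fn() and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start

# Hypothetical usage in this notebook (needs Spark and the `train` DataFrame):
# from pyspark.ml.recommendation import ALS
# for rank in (10, 20):
#     als = ALS(rank=rank, maxIter=5, regParam=0.01, userCol="userId",
#               itemCol="movieId", ratingCol="rating", coldStartStrategy="drop")
#     model, seconds = timed(lambda: als.fit(train))
#     print(f"rank={rank}: {seconds:.1f} s")
```

Because the ratings DataFrame is cached lazily, the first fit may also pay for reading and caching the data. Running an action such as `train.count()` before timing should make the two measurements more comparable.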

In [0]:
from pyspark.ml.recommendation import ALS

In [0]:
# rank=10 is the ALS default; it is set explicitly so it is easy to change to 20
als = ALS(rank=10, maxIter=5, regParam=0.01, userCol="userId", itemCol="movieId",
          ratingCol="rating", coldStartStrategy="drop")
model = als.fit(train)

### Evaluation
Using the code from the Spark documentation, evaluate how well your model performs on the test set.
The goal is to predict the held-out ratings.
A good metric could be RMSE or MAE.
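For reference, both metrics can be written in a few lines of plain Python. This only illustrates the definitions on toy numbers and is not how Spark computes them; the function names are made up for this sketch. In Spark, `RegressionEvaluator` accepts `metricName="mae"` as well as `metricName="rmse"`.

```python
import math

def rmse(predictions, labels):
    """Root-mean-square error: sqrt(mean((p - y)^2))."""
    squared = [(p - y) ** 2 for p, y in zip(predictions, labels)]
    return math.sqrt(sum(squared) / len(squared))

def mae(predictions, labels):
    """Mean absolute error: mean(|p - y|)."""
    absolute = [abs(p - y) for p, y in zip(predictions, labels)]
    return sum(absolute) / len(absolute)

# Toy example: three predicted ratings against the held-out ratings.
predicted = [3.5, 4.0, 2.0]
actual = [4.0, 4.0, 1.0]
print(rmse(predicted, actual))
print(mae(predicted, actual))  # 0.5
```

RMSE squares each error before averaging, so it penalizes a few large mistakes more heavily than MAE does.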

In [0]:
#Evaluate the model by computing the RMSE on the test data
from pyspark.ml.evaluation import RegressionEvaluator

predictions = model.transform(test)
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                                predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print("Root-mean-square error = " + str(rmse))

# Generate top 10 movie recommendations for each user
userRecs = model.recommendForAllUsers(10)
# Generate top 10 user recommendations for each movie
movieRecs = model.recommendForAllItems(10)

### Inspecting the results

Retrieve the movie vectors from the learned model object (the property is called `itemFactors`) and `collect` all these vectors in a list.

In [0]:
list_items_factors=model.itemFactors.collect()


In [0]:
print(list_items_factors[0])
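Each collected element is a `Row` with an `id` field and a `features` field. For nearest-neighbour queries it is often convenient to stack these into a NumPy matrix, as sketched below. `factors_to_matrix` is a hypothetical helper, and the `namedtuple` stands in for Spark's `Row` so that the snippet runs without a cluster.

```python
from collections import namedtuple
import numpy as np

def factors_to_matrix(rows):
    """Return (ids, matrix) where matrix[i] is the factor vector of ids[i]."""
    ids = np.array([row.id for row in rows])
    matrix = np.array([row.features for row in rows], dtype=np.float32)
    return ids, matrix

# Stand-in for the rows returned by model.itemFactors.collect().
FakeRow = namedtuple("FakeRow", ["id", "features"])
rows = [FakeRow(1, [0.1, 0.2]), FakeRow(50, [0.3, 0.4]), FakeRow(7, [0.5, 0.6])]

ids, matrix = factors_to_matrix(rows)
print(ids.shape, matrix.shape)  # (3,) (3, 2)
```

Keeping `ids` alongside the matrix matters because the movie ids are not contiguous: row `i` of the matrix corresponds to movie `ids[i]`, not to movie `i`.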

### Using Nearest neighbours

Pick a few movies and, for each of them, find the top 5 nearest neighbours. This is very similar to an optional question of the PLSA project.

In [0]:
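# Attach titles to the item factors: itemFactors uses "id", movies.csv uses "movieId"
movies = spark.read.csv('dbfs:/ml-25m/movies.csv', header=True, inferSchema=True)
movie_vectors_df = model.itemFactors.join(movies.withColumnRenamed("movieId", "id"), on="id")
# We will run the KNN with a query that is a vector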

Make sure your KNN algorithm is fast enough. Try to understand why some results are not so good.
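One possible brute-force approach, assuming the item vectors fit in memory as a NumPy matrix with at least two rows: normalize the rows once, then answer each query with a single matrix-vector product and a partial sort. Cosine similarity is a choice made for this sketch, not something the exercise prescribes, and `normalize_rows` and `top_k_neighbours` are hypothetical names. The toy matrix stands in for the real item factors.

```python
import numpy as np

def normalize_rows(matrix):
    """Scale every row to unit L2 norm (all-zero rows are left at zero)."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.maximum(norms, 1e-12)

def top_k_neighbours(unit_matrix, query_index, k=5):
    """Indices of the k rows most cosine-similar to row query_index, excluding itself."""
    scores = unit_matrix @ unit_matrix[query_index]  # cosine similarity to every row
    scores[query_index] = -np.inf                    # never return the query itself
    k = min(k, len(scores) - 1)
    candidates = np.argpartition(-scores, k - 1)[:k]     # unordered top k
    return candidates[np.argsort(-scores[candidates])]  # best match first

# Toy stand-in for the item factor matrix.
vectors = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
])
unit = normalize_rows(vectors)
print(top_k_neighbours(unit, 0, k=2))  # [1 2]
```

`np.argpartition` avoids sorting all scores: only the k selected candidates are fully sorted, which keeps each query close to linear in the number of movies.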