A.	Perform a five-fold cross validation of ALS-based recommendation on the rating data ratings.csv with two versions of ALS to compare: one with the ALS setting used in Lab 7 notebook, and another different setting decided by you with a brief explanation of why. 

For each split, find the top 10% of users in the training set who have rated the most movies, calling them HotUsers, and the bottom 10% of users who have rated the fewest movies (but at least one movie), calling them CoolUsers. 

Compute the Root Mean Square Error (RMSE) on the test set for the HotUsers and CoolUsers separately, for each of the five splits and each ALS version. Put these RMSE results in one table in the report (2 versions x 5 splits x 2 user groups = 20 numbers in total). Visualise these 20 numbers in ONE single figure. [6 marks]



B.	After ALS, each movie is modelled with some factors. Use k-means with k=10 to cluster the movie factors (hint: see itemFactors in ALS API) learned with the ALS setting in Lab 7 notebook in A for each of the five splits. Note that each movie is associated with several tags.

For each of the five splits, use genome-scores.csv to find the top tag (with the most movies) and bottom tag (with the least movies; if there are ties, randomly pick one of them) for the two largest clusters (i.e., 4 tags in total for each split). Find the names of these top/bottom tags using genome-tags.csv. 

For each cluster and each split, report the two tags (one top one bottom) in one table (so 2 clusters x 5 splits x 2 tags = 20 tags to report in total). Hint: For each cluster, sum up tag scores for all movies in it; find the largest/smallest scores and their indexes; go to genome-tags to find their names (not manually but writing code to do so).

You can use any information provided by the dataset to answer the question. [6 marks]


C.	Discuss the two most interesting observations from A & B above, each in three sentences: 1) What is the observation? 2) What are the possible causes of the observation? 3) How useful is this observation to a movie website such as Netflix? [2 marks]
D.	Your report must be clearly written and your code must be well documented so that it is clear what each step is doing. [1 mark]

In [1]:
import pyspark
from pyspark.sql import SparkSession
import re
import pandas as pd
import numpy as np


# Raw string so that "\t" in the Windows path is not read as a tab character
spark = SparkSession.builder.master("local[6]").appName("Scalab_Q2_Part2").config("spark.local.dir", r"C:\temp").getOrCreate()



In [2]:
import matplotlib 
matplotlib.use('Agg') # Must be before importing matplotlib.pyplot or pylab! 
import matplotlib.pyplot as plt


In [3]:
import pyspark.sql.functions as F

ratFile = spark.read\
    .option("header", "true")\
    .option("inferSchema", "true")\
    .csv("ml-25m/ratings.csv").cache() 

In [4]:
type(ratFile)

pyspark.sql.dataframe.DataFrame

In [5]:
ratFile.count()

25000095

In [6]:
ratFile.printSchema()

root
 |-- userId: integer (nullable = true)
 |-- movieId: integer (nullable = true)
 |-- rating: double (nullable = true)
 |-- timestamp: integer (nullable = true)



In [12]:
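from pyspark.sql.functions import col

# Count ratings per user, most active users first.
# Note: this uses the full ratings file; the task asks for the training set of each split.
HotUsersdf = (ratFile.groupBy("userId")
                      .count()
                      .sort("count", ascending=False)
                      .select(col("userId"), 
                              col("count").alias("most reviewed")))

# Same per-user counts, least active users first.
ColdUsersdf = (ratFile.groupBy("userId")
                      .count()
                      .sort("count", ascending=True)
                      .select(col("userId"), 
                              col("count").alias("Least reviewed")))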


In [13]:
HotUsersdf.cache()

# Number of distinct users, and the size of the top 10% group
n_users = HotUsersdf.count()
print(n_users)
print(int(n_users * 0.1))

# Keep only the top 10% of users by number of ratings
HotUsersdf = HotUsersdf.limit(int(n_users * 0.1))

HotUsersdf.show(30, truncate=False)

HotUsersdf.count()

162541
16254
+------+-------------+
|userId|most reviewed|
+------+-------------+
|72315 |32202        |
|80974 |9178         |
|137293|8913         |
|33844 |7919         |
|20055 |7488         |
|109731|6647         |
|92046 |6564         |
|49403 |6553         |
|30879 |5693         |
|115102|5649         |
|110971|5633         |
|75309 |5525         |
|78849 |5276         |
|61010 |5244         |
|29803 |5219         |
|122011|5160         |
|57548 |5066         |
|93855 |5045         |
|103611|4861         |
|34987 |4831         |
|162047|4780         |
|136310|4764         |
|36618 |4710         |
|8619  |4689         |
|143049|4663         |
|132651|4578         |
|17783 |4569         |
|97452 |4553         |
|85757 |4505         |
|162516|4489         |
+------+-------------+
only showing top 30 rows



16254

In [14]:
ColdUsersdf.cache()

# Number of distinct users, and the size of the bottom 10% group
n_users = ColdUsersdf.count()
print(n_users)
print(int(n_users * 0.1))

# Keep only the bottom 10% of users by number of ratings
ColdUsersdf = ColdUsersdf.limit(int(n_users * 0.1))

ColdUsersdf.show(30, truncate=False)

ColdUsersdf.count()

162541
16254
+------+--------------+
|userId|Least reviewed|
+------+--------------+
|4101  |20            |
|8389  |20            |
|31367 |20            |
|31951 |20            |
|59355 |20            |
|69478 |20            |
|70355 |20            |
|81501 |20            |
|96261 |20            |
|97186 |20            |
|99168 |20            |
|103357|20            |
|108460|20            |
|111300|20            |
|120706|20            |
|134924|20            |
|150300|20            |
|8928  |20            |
|19200 |20            |
|38707 |20            |
|60835 |20            |
|61766 |20            |
|62880 |20            |
|71709 |20            |
|72546 |20            |
|76169 |20            |
|83769 |20            |
|86400 |20            |
|93691 |20            |
|100615|20            |
+------+--------------+
only showing top 30 rows



16254
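
The two cells above pick the top and bottom 10% of users from the whole ratings file, whereas the task asks for them per training split. The selection logic itself can be sketched in plain Python on toy data, as below. The helper name `hot_and_cool_users`, the toy ratings, and the tie-break by `userId` are assumptions made for illustration; they are not part of the notebook. The tie-break matters because many users share the minimum count of 20, as the output above shows.

```python
from collections import Counter

def hot_and_cool_users(ratings, fraction=0.1):
    """Return (hot, cool) user-id lists from (userId, movieId, rating) tuples.

    hot  = the `fraction` of users with the most ratings,
    cool = the `fraction` of users with the fewest ratings (each has >= 1).
    """
    counts = Counter(user for user, _movie, _rating in ratings)
    # Sort by count descending; ties are broken by userId (an assumption)
    # so that the result is deterministic.
    ranked = sorted(counts, key=lambda u: (-counts[u], u))
    n = int(len(ranked) * fraction)  # same truncation as int(count * 0.1)
    hot = ranked[:n]
    cool = ranked[-n:] if n > 0 else []
    return hot, cool

# Toy data (made up): user u has rated u movies, for u = 1..20.
toy = [(u, m, 3.0) for u in range(1, 21) for m in range(u)]
hot, cool = hot_and_cool_users(toy)
print(hot, cool)
```

Applied per split, `ratings` would be the rows of that split's training set only.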

In [15]:
myseed=8744
(training, test) = ratFile.randomSplit([0.8, 0.2], myseed)  # TODO: replace with 0.2, 0.2, 0.2, 0.2, 0.2
training = training.cache()
test = test.cache()

In [16]:
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
# from pyspark.ml.tuning import CrossValidator

als = ALS(userCol="userId", itemCol="movieId", seed=myseed, coldStartStrategy="drop")

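# rank: the number of latent factors in the model.
# NOTE: rank=10 is the ALS default, so als2 is currently identical to als;
# the second version still needs a genuinely different setting.
als2 = ALS(rank=10, userCol="userId", itemCol="movieId", seed=myseed, coldStartStrategy="drop")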

In [None]:
print("Test ALS fitting and transforming...")

model = als.fit(training)

predictions = model.transform(test)


In [None]:
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",predictionCol="prediction")

rmse = evaluator.evaluate(predictions)

print("Test ALS Root-mean-square-error = " + str(rmse))
# Root-mean-square error = 0.920957307650525

In [17]:
from sklearn.model_selection import KFold
 
np.random.seed(8744)
# create the range 1 to 19 (a toy index to see how KFold splits)
rn = range(1,20)   
    
kf5 = KFold(n_splits=5, shuffle=False)

In [18]:
# to get the values from our data, we use np.take() to access a value at particular index
for train_index, test_index in kf5.split(rn):
    print(np.take(rn,train_index), np.take(rn,test_index))


    


[ 5  6  7  8  9 10 11 12 13 14 15 16 17 18 19] [1 2 3 4]
[ 1  2  3  4  9 10 11 12 13 14 15 16 17 18 19] [5 6 7 8]
[ 1  2  3  4  5  6  7  8 13 14 15 16 17 18 19] [ 9 10 11 12]
[ 1  2  3  4  5  6  7  8  9 10 11 12 17 18 19] [13 14 15 16]
[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16] [17 18 19]


In [19]:
# Failed attempt: sklearn's KFold expects an array-like, not a Spark DataFrame (see the TypeError below)

i = 1
for train_index, test_index in kf5.split(ratFile):
    X_train = ratFile.iloc[train_index].loc[:, features]
    X_test = ratFile.iloc[test_index][features]
    
    #Train the model
    fold_model = als.fit(X_train, y_train) #Training the model
    print(f"Accuracy for the fold no. {i} on the test set: {accuracy_score(y_test, model.predict(X_test))}")
    i += 1

TypeError: Expected sequence or array-like, got <class 'pyspark.sql.dataframe.DataFrame'>

In [None]:
#use LOOCV to evaluate model
# scores = cross_val_score(model, X, y, scoring='neg_mean_squared_error',
#                          cv=cv, n_jobs=-1)

#view RMSE
# sqrt(mean(absolute(scores)))

In [None]:
# Not needed
# cv = CrossValidator(estimator=als, evaluator=evaluator, numFolds=5)

In [None]:
# Fit the cross validator to the 'train' dataset
# model = cv.fit(train)


In [None]:
# Extract the best model from the cv model above
# best_model = model.bestModel

# View the predictions
# test_predictions = best_model.transform(test)

# RMSE = evaluator.evaluate(test_predictions)

# print(RMSE)

In [None]:
# Generate n Recommendations for all users
# recommendations = best_model.recommendForAllUsers(5)
# recommendations.show()

In [None]:
# movies = ratFile.select(als.getItemCol()).distinct().limit(3)

# movieSubSetRecs = model.recommendForItemSubset(movies, 10)

# movies.show()
# +-------+
# |movieId|
# +-------+
# |    474|
# |     29|
# |     26|
# +-------+
# movieSubSetRecs.show(3,False)
# 20*5

for fold i in 1..k (k=5):
    test set = fold i
    training set = union of the other k-1 folds
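
The loop outlined above can be sketched in plain Python on the same toy range 1..19 used with KFold earlier. This is only an illustration of the fold logic, not Spark code: the function name `five_fold_splits` is hypothetical. On the Spark side, the comment next to `randomSplit` in the notebook suggests splitting `ratFile` into five equal parts and combining four of them with `union` as the training set for each fold.

```python
import random

def five_fold_splits(rows, k=5, seed=8744):
    """Yield (fold_index, training_rows, test_rows) for k-fold cross validation."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    # Deal the shuffled rows into k folds of near-equal size.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        # Training set = union of the other k-1 folds.
        training = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield i, training, test

rows = list(range(1, 20))  # toy index 1..19
for i, training, test in five_fold_splits(rows):
    print(i, len(training), len(test))
```

Each row lands in exactly one test fold, so every rating is used for testing once and for training four times.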


A.	Perform a five-fold cross validation of ALS-based recommendation on the rating data ratings.csv with two versions of ALS to compare: one with the ALS setting used in Lab 7 notebook, and another different setting decided by you with a brief explanation of why.

For each split, find the top 10% of users in the training set who have rated the most movies, calling them HotUsers, and the bottom 10% of users in the training set who have rated the fewest movies (but at least one movie), calling them CoolUsers. 

Compute the Root Mean Square Error (RMSE) on the test set for the HotUsers and CoolUsers separately, for each of the FIVE splits and each ALS version. 

 Put these RMSE results in one Table in the report (2 versions x 5 splits x 2 user groups = 20 numbers in total). Visualise these 20 numbers in ONE single figure. [6 marks]
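
As a minimal sketch of the per-group evaluation described above, the plain-Python snippet below computes RMSE separately for two user groups from a list of predictions. The function names and the toy predictions are made up for illustration; in the notebook, the predictions would come from `model.transform(test)`, restricted to the HotUsers or CoolUsers of that split before evaluation.

```python
import math

def rmse(pairs):
    """Root mean square error over (actual, predicted) pairs."""
    pairs = list(pairs)
    if not pairs:
        return float("nan")  # the group has no ratings in the test set
    return math.sqrt(sum((a - p) ** 2 for a, p in pairs) / len(pairs))

def rmse_by_group(predictions, hot_users, cool_users):
    """predictions: iterable of (userId, actual_rating, predicted_rating)."""
    hot, cool = set(hot_users), set(cool_users)
    predictions = list(predictions)
    return {
        "HotUsers": rmse((a, p) for u, a, p in predictions if u in hot),
        "CoolUsers": rmse((a, p) for u, a, p in predictions if u in cool),
    }

# Made-up predictions, for illustration only.
preds = [(1, 4.0, 3.0), (1, 2.0, 3.0), (2, 5.0, 5.0), (3, 1.0, 4.0)]
result = rmse_by_group(preds, hot_users=[1], cool_users=[3])
print(result)
```

Running this once per split and per ALS version would give the 2 x 5 x 2 = 20 numbers the task asks for.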

In [None]:
# five-fold cross validation 

#ALS lab 7

#ALS mine


B.	After ALS, each movie is modelled with some factors. Use k-means with k=10 to cluster the movie factors (hint: see itemFactors in ALS API) learned with the ALS setting in Lab 7 notebook in A for each of the five splits. 

Note that each movie is associated with several tags. 

For each of the five splits, find the top tag (with the most movies) and bottom tag (with the least movies, if there are ties, randomly pick one from them) for the top two largest clusters (i.e., 4 tags in total for each split).

For each cluster and each split, report the two tags (one top one bottom) in one table (so 2 clusters x 5 splits x 2 tags = 20 tags to report in total). You can use any information provided by the dataset to answer the question. [6 marks]
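
The hint in the task statement (sum the tag scores over all movies in a cluster, take the largest and smallest totals, then look up the tag names) can be sketched in plain Python as follows. This follows the hint's score-sum reading of "top" and "bottom" tag, which is an interpretation. The helper name `top_and_bottom_tag` and the toy rows are hypothetical; the row layout assumes genome-scores.csv has columns movieId, tagId, relevance and genome-tags.csv maps tagId to tag.

```python
import random
from collections import defaultdict

def top_and_bottom_tag(cluster_movies, genome_scores, tag_names, seed=8744):
    """Return (top_tag_name, bottom_tag_name) for one cluster.

    cluster_movies: set of movieIds assigned to the cluster
    genome_scores:  iterable of (movieId, tagId, relevance) rows
    tag_names:      dict mapping tagId -> tag name
    """
    totals = defaultdict(float)
    for movie_id, tag_id, relevance in genome_scores:
        if movie_id in cluster_movies:
            totals[tag_id] += relevance  # sum tag scores over the cluster
    if not totals:
        return None, None
    hi, lo = max(totals.values()), min(totals.values())
    rng = random.Random(seed)
    # Ties are broken at random, as the task allows; candidates are sorted
    # first so that the seeded choice is reproducible.
    top = rng.choice(sorted(t for t, s in totals.items() if s == hi))
    bottom = rng.choice(sorted(t for t, s in totals.items() if s == lo))
    return tag_names[top], tag_names[bottom]

# Toy rows and tag names, made up for illustration.
scores = [(1, 10, 0.9), (1, 11, 0.1), (2, 10, 0.8),
          (2, 11, 0.2), (3, 10, 0.1), (3, 11, 0.9)]
names = {10: "toy tag A", 11: "toy tag B"}
print(top_and_bottom_tag({1, 2}, scores, names))
```

For each split, this would be called once for each of the two largest k-means clusters.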