--------------------------------
# <center>- Spark Project - SD 701 - Data Mining-</center>
---------------------------------

***TO BE REVISED***

After evaluating the list of recommended movies, we quickly identified two obvious limitations of our KNN approach: popularity bias and the item cold-start problem. A third limitation, scalability, appears once the underlying training data is too large to fit on a single machine.

* popularity bias: the system recommends the movies with the most interactions, without any personalization
* item cold-start problem: movies newly added to the catalogue have few or no interactions, while the recommender relies on a movie's interactions to make recommendations
* scalability issue: the approach does not scale to much larger datasets as more and more users and movies are added to the database

All three are typical challenges for collaborative filtering recommenders. They arise naturally from the user-movie (or movie-user) interaction matrix, in which each entry records an interaction between a user i and a movie j. In a real-world setting, the vast majority of movies receive very few ratings, or none at all, so the matrix is extremely sparse: most of its entries are missing.
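
To make the notion of sparsity concrete, here is a minimal plain-Python sketch that measures the fraction of empty cells in a user-movie matrix. The `sparsity` helper and the toy triples are illustrative assumptions, not part of the original notebook.

```python
# Toy (userId, movieId, rating) triples; illustrative values, not taken from the dataset
ratings = [
    (1, 10, 4.0),
    (1, 20, 5.0),
    (2, 10, 3.0),
    (3, 30, 2.5),
]

def sparsity(triples):
    """Fraction of empty cells in the user-movie matrix built from the triples."""
    users = {u for u, _, _ in triples}
    movies = {m for _, m, _ in triples}
    observed = {(u, m) for u, m, _ in triples}
    return 1.0 - len(observed) / (len(users) * len(movies))

print(f"sparsity = {sparsity(ratings):.3f}")
```

On a real ratings table the same ratio is typically much closer to 1, which is what makes neighbourhood-based methods fragile.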

In [1]:
import pyspark
#from pyspark import SparkConf, SparkContext

#from pyspark.ml.classification import LogisticRegression
#from pyspark.ml.regression import LinearRegression

from pyspark.sql import Row, SparkSession

from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS


### 1. Initializing the Spark session

In [2]:
spark = SparkSession.builder.appName('Recommendation_system').getOrCreate()

### 2. Loading and preparing the movie ratings data

In [3]:
path_data = "/home/p5hngk/Downloads/GitHub/SD_701---Data_Mining/ml-latest-small"

df_ratings = spark.read.format("csv").option("header", "true").load(path_data+"/ratings.csv")
df_ratings.show(10)

+------+-------+------+---------+
|userId|movieId|rating|timestamp|
+------+-------+------+---------+
|     1|      1|   4.0|964982703|
|     1|      3|   4.0|964981247|
|     1|      6|   4.0|964982224|
|     1|     47|   5.0|964983815|
|     1|     50|   5.0|964982931|
|     1|     70|   3.0|964982400|
|     1|    101|   5.0|964980868|
|     1|    110|   4.0|964982176|
|     1|    151|   5.0|964984041|
|     1|    157|   5.0|964984100|
+------+-------+------+---------+
only showing top 10 rows



In [4]:
df_ratings1 = df_ratings.select(df_ratings['userId'], df_ratings['movieId'], df_ratings['rating'])
df_ratings1.show(10)

+------+-------+------+
|userId|movieId|rating|
+------+-------+------+
|     1|      1|   4.0|
|     1|      3|   4.0|
|     1|      6|   4.0|
|     1|     47|   5.0|
|     1|     50|   5.0|
|     1|     70|   3.0|
|     1|    101|   5.0|
|     1|    110|   4.0|
|     1|    151|   5.0|
|     1|    157|   5.0|
+------+-------+------+
only showing top 10 rows



In [5]:
df_ratings1.printSchema()

root
 |-- userId: string (nullable = true)
 |-- movieId: string (nullable = true)
 |-- rating: string (nullable = true)



Before fitting an ALS model, all of our columns must be of integer or float type so that the computations can be carried out, which is not currently the case. Let us start by fixing that.

In [6]:
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS

In [7]:
from pyspark.sql.types import IntegerType, FloatType

df_ratings1 = df_ratings1.withColumn("userId", df_ratings1["userId"].cast(IntegerType())) \
                .withColumn("movieId", df_ratings1["movieId"].cast(IntegerType())) \
                .withColumn("rating", df_ratings1["rating"].cast(FloatType()))

df_ratings1.printSchema()

root
 |-- userId: integer (nullable = true)
 |-- movieId: integer (nullable = true)
 |-- rating: float (nullable = true)



### 3. Creating the training and test sets

In [8]:
(training,test) = df_ratings1.randomSplit([0.8, 0.2])
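
The split above is random, so each run produces different training and test sets and the metrics are not directly comparable from one run to the next. `randomSplit` accepts an optional `seed` argument for this purpose. The plain-Python sketch below only illustrates the idea of a seeded split; the `random_split` helper is a hypothetical stand-in, not Spark's implementation.

```python
import random

def random_split(rows, train_fraction=0.8, seed=None):
    """Assign each row to train or test independently, with a fixed seed if given."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(1000))
train_a, test_a = random_split(rows, seed=42)
train_b, test_b = random_split(rows, seed=42)

# Same seed, same partition: results become comparable across runs
print(train_a == train_b and test_a == test_b)
print(len(train_a), len(test_a))
```

As with `randomSplit`, the proportions are only approximate because each row is assigned independently.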

### 4. Creating and training the ALS model

In [9]:
als = ALS(maxIter=5, regParam=0.15, rank=25, userCol = "userId", itemCol = "movieId", ratingCol = "rating", coldStartStrategy = "drop", nonnegative=True)
model = als.fit(training)

### 5. Evaluating the model

In [10]:
evaluator = RegressionEvaluator(metricName = "rmse", labelCol = "rating", predictionCol = "prediction")
predictions = model.transform(test)
rmse = evaluator.evaluate(predictions)

In [11]:
print(f"RMSE = {round(rmse,3)}")
predictions.show(10)

RMSE = 0.88
+------+-------+------+----------+
|userId|movieId|rating|prediction|
+------+-------+------+----------+
|   385|    471|   4.0| 3.0520854|
|   176|    471|   5.0|  3.415569|
|   608|    471|   1.5| 3.0144289|
|   426|    471|   5.0| 2.8086097|
|   260|    471|   4.5| 3.0675466|
|   104|    471|   4.5| 3.0483193|
|    44|    833|   2.0| 1.6910638|
|   492|    833|   4.0|  2.280677|
|   599|   1088|   2.5| 2.4249523|
|   169|   1088|   4.5| 4.1103616|
+------+-------+------+----------+
only showing top 10 rows
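
As a sanity check on what the evaluator reports, the RMSE can be recomputed by hand. The sketch below applies a hypothetical `rmse` helper to the ten (rating, prediction) pairs displayed above; since this is a tiny sample of the test set, its value is not expected to match the global figure.

```python
import math

def rmse(pairs):
    """Root mean squared error over (rating, prediction) pairs."""
    return math.sqrt(sum((r - p) ** 2 for r, p in pairs) / len(pairs))

# (rating, prediction) pairs copied from the ten rows shown above
shown_rows = [
    (4.0, 3.0520854), (5.0, 3.415569), (1.5, 3.0144289), (5.0, 2.8086097),
    (4.5, 3.0675466), (4.5, 3.0483193), (2.0, 1.6910638), (4.0, 2.280677),
    (2.5, 2.4249523), (4.5, 4.1103616),
]

print(f"RMSE on the ten displayed rows = {rmse(shown_rows):.3f}")
```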



In [12]:
predictions.describe().show()

+-------+------------------+------------------+------------------+------------------+
|summary|            userId|           movieId|            rating|        prediction|
+-------+------------------+------------------+------------------+------------------+
|  count|             19287|             19287|             19287|             19287|
|   mean| 324.5970342717893|17643.650489967335|3.5159693057499872| 3.306959727854456|
| stddev|181.07904248765624|33098.318950057925| 1.044572880679155|0.6646586576818803|
|    min|                 1|                 1|               0.5|        0.14063323|
|    max|               610|            188301|               5.0|          5.376093|
+-------+------------------+------------------+------------------+------------------+



In [13]:
predictions.orderBy(predictions.userId.asc()).show()

+------+-------+------+----------+
|userId|movieId|rating|prediction|
+------+-------+------+----------+
|     1|   3527|   4.0| 4.1148076|
|     1|   1220|   5.0|  4.333788|
|     1|   4006|   4.0| 3.8586867|
|     1|   3243|   3.0|  2.811433|
|     1|   2105|   4.0| 3.6724505|
|     1|   3793|   5.0|  4.114033|
|     1|   3439|   4.0| 3.1371949|
|     1|   1206|   5.0| 4.3812394|
|     1|    919|   5.0|  4.474517|
|     1|    163|   5.0| 3.8790586|
|     1|   1927|   5.0|  4.417093|
|     1|   1198|   5.0| 4.8583612|
|     1|    157|   5.0| 3.0494716|
|     1|   2094|   5.0| 3.8036983|
|     1|   1127|   4.0| 4.1047835|
|     1|   2542|   5.0|  4.624957|
|     1|     47|   5.0|   4.46125|
|     1|      1|   4.0|  4.445528|
|     1|    673|   3.0| 2.8302224|
|     1|   1291|   5.0|   4.67087|
+------+-------+------+----------+
only showing top 20 rows



In [14]:
# Register the predictions as a temporary view so that they can be queried with SQL
predictions.createOrReplaceTempView("predictions_table")

In [15]:
spark.sql('SELECT * FROM predictions_table WHERE userId = 1 ORDER BY movieId ASC').show()

+------+-------+------+----------+
|userId|movieId|rating|prediction|
+------+-------+------+----------+
|     1|      1|   4.0|  4.445528|
|     1|     47|   5.0|   4.46125|
|     1|    157|   5.0| 3.0494716|
|     1|    163|   5.0| 3.8790586|
|     1|    216|   5.0| 3.7912571|
|     1|    231|   5.0| 3.2257411|
|     1|    349|   4.0|  4.043511|
|     1|    457|   5.0| 4.6259546|
|     1|    527|   5.0| 4.6705804|
|     1|    552|   4.0| 3.6677642|
|     1|    608|   5.0| 4.5700874|
|     1|    673|   3.0| 2.8302224|
|     1|    919|   5.0|  4.474517|
|     1|    940|   5.0|  4.874849|
|     1|   1092|   5.0| 3.6729481|
|     1|   1097|   5.0|   4.58688|
|     1|   1127|   4.0| 4.1047835|
|     1|   1198|   5.0| 4.8583612|
|     1|   1206|   5.0| 4.3812394|
|     1|   1220|   5.0|  4.333788|
+------+-------+------+----------+
only showing top 20 rows



### 6. Improving the model

Can we improve the model with better hyperparameters? Here we look at the influence of the regularization parameter (`regParams`) and the number of latent factors (`ranks`) used by the model. We aim to minimize the root mean squared error (***RMSE***).
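
For reference, the two quantities involved can be written as follows. The first line is the textbook regularized matrix-factorization objective, given here as an assumption about what ALS minimizes; Spark's implementation may weight the regularization term differently. The symbols $x_u$, $y_i$, $k$, $\lambda$, $\mathcal{K}$ and $\mathcal{T}$ are notation introduced for this sketch, not taken from the notebook.

```latex
\[
\min_{X,\,Y} \sum_{(u,i) \in \mathcal{K}} \left( r_{ui} - x_u^\top y_i \right)^2
  + \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right),
\qquad x_u,\, y_i \in \mathbb{R}^k
\]
\[
\mathrm{RMSE} = \sqrt{ \frac{1}{|\mathcal{T}|} \sum_{(u,i) \in \mathcal{T}} \left( r_{ui} - \hat{r}_{ui} \right)^2 },
\qquad \hat{r}_{ui} = x_u^\top y_i
\]
```

In this notation $k$ plays the role of `rank` and $\lambda$ that of `regParam`; $\mathcal{K}$ is the set of observed training ratings and $\mathcal{T}$ the set of held-out ratings.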

In [16]:
def tune_ALS(train_data, validation_data, maxIter, regParams, ranks):
    """
    grid search function to select the best model based on RMSE of
    validation data
    Parameters
    ----------
    train_data: spark DF with columns ['userId', 'movieId', 'rating']
    
    validation_data: spark DF with columns ['userId', 'movieId', 'rating']
    
    maxIter: int, max number of learning iterations
    
    regParams: list of float, one dimension of hyper-param tuning grid
    
    ranks: list of int, one dimension of hyper-param tuning grid
    
    Return
    ------
    The best fitted ALS model with lowest RMSE score on validation data
    """
    # initial
    min_error = float('inf')
    best_rank = -1
    best_regularization = 0
    best_model = None
    for rank in ranks:
        print("\n")
        for reg in regParams:
            # get ALS model
            als = ALS(userCol = "userId", itemCol = "movieId", ratingCol = "rating", coldStartStrategy = "drop", nonnegative=True).setMaxIter(maxIter).setRank(rank).setRegParam(reg)
            # train ALS model
            model = als.fit(train_data)
            # evaluate the model by computing the RMSE on the validation data
            predictions = model.transform(validation_data)
            evaluator = RegressionEvaluator(metricName="rmse",
                                            labelCol="rating",
                                            predictionCol="prediction")
            rmse = evaluator.evaluate(predictions)
            print('{} latent factors and regularization = {}: '
                  'validation RMSE is {}'.format(rank, reg, rmse))
            if rmse < min_error:
                min_error = rmse
                best_rank = rank
                best_regularization = reg
                best_model = model
    print('\nThe best model has {} latent factors and '
          'regularization = {}'.format(best_rank, best_regularization))
    return best_model

In [17]:
regParams = [0.10, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17]
ranks = [15, 20, 25, 30]

tune_ALS(training, test, 5, regParams, ranks)



15 latent factors and regularization = 0.1: validation RMSE is 0.8865500927388011
15 latent factors and regularization = 0.11: validation RMSE is 0.8834463507949386
15 latent factors and regularization = 0.12: validation RMSE is 0.881677029646818
15 latent factors and regularization = 0.13: validation RMSE is 0.8810576468880305
15 latent factors and regularization = 0.14: validation RMSE is 0.8815189804640466
15 latent factors and regularization = 0.15: validation RMSE is 0.8828434795303572
15 latent factors and regularization = 0.16: validation RMSE is 0.8849297777626169
15 latent factors and regularization = 0.17: validation RMSE is 0.8876244772649461
15 latent factors and regularization = 0.18: validation RMSE is 0.8908040463354645
15 latent factors and regularization = 0.19: validation RMSE is 0.8943659737958967
15 latent factors and regularization = 0.2: validation RMSE is 0.8982350124138224


20 latent factors and regularization = 0.1: validation RMSE is 0.8806326858710526
20 l

ALS_9ed8adb55e94

Let us push further to see whether we can do better.

In [18]:
regParams = [0.09, 0.10, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16]
ranks = [35, 40]

tune_ALS(training, test, 5, regParams, ranks)



35 latent factors and regularization = 0.09: validation RMSE is 0.8849047802900074
35 latent factors and regularization = 0.1: validation RMSE is 0.8794574303735211
35 latent factors and regularization = 0.11: validation RMSE is 0.8760990976430557
35 latent factors and regularization = 0.12: validation RMSE is 0.8745186834609052
35 latent factors and regularization = 0.13: validation RMSE is 0.8743248082212346
35 latent factors and regularization = 0.14: validation RMSE is 0.8752999003755378
35 latent factors and regularization = 0.15: validation RMSE is 0.8771750481574169
35 latent factors and regularization = 0.16: validation RMSE is 0.8798217273400061


40 latent factors and regularization = 0.09: validation RMSE is 0.8848668695349206
40 latent factors and regularization = 0.1: validation RMSE is 0.8801980314125837
40 latent factors and regularization = 0.11: validation RMSE is 0.8772271384977226
40 latent factors and regularization = 0.12: validation RMSE is 0.8757884261793065
40

ALS_d344abe761b1

We did indeed manage to do slightly better.
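
The selection made by the grid search can be checked directly from the log. The sketch below copies only the rows that are fully visible in the second output (the rank-40 block is truncated, so this covers what is shown, not the whole grid) and picks the lowest validation RMSE.

```python
# (rank, regParam) -> validation RMSE, copied from the complete rows of the log above
results = {
    (35, 0.09): 0.8849047802900074,
    (35, 0.10): 0.8794574303735211,
    (35, 0.11): 0.8760990976430557,
    (35, 0.12): 0.8745186834609052,
    (35, 0.13): 0.8743248082212346,
    (35, 0.14): 0.8752999003755378,
    (35, 0.15): 0.8771750481574169,
    (35, 0.16): 0.8798217273400061,
    (40, 0.09): 0.8848668695349206,
    (40, 0.10): 0.8801980314125837,
    (40, 0.11): 0.8772271384977226,
    (40, 0.12): 0.8757884261793065,
}

# Key with the smallest RMSE
best_params = min(results, key=results.get)
best_rank, best_reg = best_params
print(best_rank, best_reg, round(results[best_params], 3))  # → 35 0.13 0.874
```

The RMSE curve is quite flat around its minimum: neighbouring values of `regParam` differ only in the third or fourth decimal place.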

### 7. Creating the model with the best hyperparameters

In [19]:
best_als = ALS(maxIter=5, regParam=0.13, rank=35, userCol = "userId", itemCol = "movieId", ratingCol = "rating", coldStartStrategy = "drop", nonnegative=True)
best_model = best_als.fit(training)

In [20]:
evaluator = RegressionEvaluator(metricName = "rmse", labelCol = "rating", predictionCol = "prediction")
predictions = best_model.transform(test)
rmse = evaluator.evaluate(predictions)

### 8. Evaluating the new model

In [30]:
print(f"RMSE = {round(rmse,3)}")

RMSE = 0.874


In [22]:
predictions.show(5)

+------+-------+------+----------+-----------------+
|userId|movieId|rating|prediction|rating-prediction|
+------+-------+------+----------+-----------------+
|   385|    471|   4.0| 3.0621886|        0.9378114|
|   176|    471|   5.0| 3.5212595|        1.4787405|
|   608|    471|   1.5| 3.1267812|       -1.6267812|
|   426|    471|   5.0| 2.8944874|        2.1055126|
|   260|    471|   4.5| 3.6802127|       0.81978726|
+------+-------+------+----------+-----------------+
only showing top 5 rows



In [23]:
type(predictions)

pyspark.sql.dataframe.DataFrame

In [38]:
import pyspark.sql.functions as F
from pyspark.sql.functions import when

def transform_df(df):
    # round the prediction to the nearest 0.5 point
    df = df.withColumn("prediction2", F.round(df['prediction']*2)/2)
    # absolute difference between the user's rating and the rounded prediction
    df = df.withColumn('rating-prediction2', F.abs(df.rating - df.prediction2))
    # flag the prediction as good when it is within 0.5 point of the rating
    df = df.withColumn("good_pred", when(df['rating-prediction2'] <= 0.5, 1).otherwise(0))

    return df

In [38]:
predictions = transform_df(predictions)

predictions.show(5)

+------+-------+------+----------+-----------+------------------+---------+
|userId|movieId|rating|prediction|prediction2|rating-prediction2|good_pred|
+------+-------+------+----------+-----------+------------------+---------+
|   385|    471|   4.0| 3.0621886|        3.0|               1.0|        0|
|   176|    471|   5.0| 3.5212595|        3.5|               1.5|        0|
|   608|    471|   1.5| 3.1267812|        3.0|               1.5|        0|
|   426|    471|   5.0| 2.8944874|        3.0|               2.0|        0|
|   260|    471|   4.5| 3.6802127|        3.5|               1.0|        0|
+------+-------+------+----------+-----------+------------------+---------+
only showing top 5 rows



In [39]:
predictions.describe().show()

+-------+------------------+------------------+------------------+------------------+------------------+------------------+-------------------+
|summary|            userId|           movieId|            rating|        prediction|       prediction2|rating-prediction2|          good_pred|
+-------+------------------+------------------+------------------+------------------+------------------+------------------+-------------------+
|  count|             19287|             19287|             19287|             19287|             19287|             19287|              19287|
|   mean| 324.5970342717893|17643.650489967335|3.5159693057499872| 3.324669946632712|3.3243894851454345|0.6668740602478354| 0.6394462591382797|
| stddev|181.07904248765624|33098.318950057925| 1.044572880679155|0.6805324376949697|0.6962703855299508|0.5823800625312185|0.48017360956793553|
|    min|                 1|                 1|               0.5|        0.15309522|               0.0|               0.0|             

In [40]:
predictions.groupBy().avg('good_pred').show()

+------------------+
|    avg(good_pred)|
+------------------+
|0.6394462591382797|
+------------------+



This means that our model makes about **63.9 %** correct predictions on the test set, *i.e.* predictions that, once rounded to the nearest half star, fall within 0.5 star of the actual rating.
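
The "correct prediction" criterion can be reproduced row by row in plain Python. The helpers below are a hypothetical re-implementation for illustration; they assume non-negative predictions and round ties upwards.

```python
import math

def round_half_star(x):
    """Round a non-negative score to the nearest 0.5, ties rounding up."""
    return math.floor(x * 2 + 0.5) / 2

def good_pred(rating, prediction):
    """1 if the half-star-rounded prediction is within 0.5 of the rating, else 0."""
    return 1 if abs(rating - round_half_star(prediction)) <= 0.5 else 0

# (rating, prediction) pairs taken from the notebook outputs
print(good_pred(4.0, 3.0621886))  # prediction rounds to 3.0, gap of 1.0
print(good_pred(3.0, 3.291853))   # prediction rounds to 3.5, gap of 0.5

# Hypothetical pair: the raw error is 0.74, yet the rounded gap is only 0.5
print(good_pred(3.0, 3.74))
```

Because the prediction is rounded before the comparison, a raw error of up to almost 0.75 star can still count as a correct prediction.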

### 9. Redefining the model with a test set similar to the KNN model's in order to compare results

Let us start by computing the list of movies rated by at least 50 users.

In [56]:
# list of movies rated by at least 50 users
list_film_test = df_ratings1.groupBy('movieId').count()
list_film_test = list_film_test.withColumn("count", list_film_test["count"].cast(IntegerType())).filter(list_film_test['count'] >= 50)

list_film_test.describe().show()

+-------+------------------+-----------------+
|summary|           movieId|            count|
+-------+------------------+-----------------+
|  count|               450|              450|
|   mean|11245.017777777777|91.91111111111111|
| stddev| 23457.08417001459|46.09838525896851|
|    min|                 1|               50|
|    max|            122904|              329|
+-------+------------------+-----------------+



We therefore have **450 movies** rated by at least 50 users.

Let us now compute the list of users who rated at least 250 movies.

In [57]:
# users who rated at least 250 movies
list_user_test = df_ratings1.groupBy('userId').count()
list_user_test = list_user_test.withColumn("count", list_user_test["count"].cast(IntegerType())).filter(list_user_test['count'] >= 250)

list_user_test.describe().show()

+-------+-----------------+-----------------+
|summary|           userId|            count|
+-------+-----------------+-----------------+
|  count|              105|              105|
|   mean|309.6190476190476|591.1619047619048|
| stddev|184.4623883360039|433.8280659921517|
|    min|                6|              250|
|    max|              610|             2698|
+-------+-----------------+-----------------+



And we have **105 users** who rated at least 250 movies.

To compare with the previous model, let us build a test sample from the ratings given by users with at least 250 ratings to movies with at least 50 ratings. We start by renaming the columns of interest to make the join clearer.

In [97]:
from pyspark.sql.functions import col

list_film_test = list_film_test.select(col("movieId").alias("mov"))
list_user_test = list_user_test.select(col("userId").alias("us"))

list_user_test.show(5)

+---+
| us|
+---+
|580|
|597|
|368|
| 28|
|596|
+---+
only showing top 5 rows



In [120]:
df_test = df_ratings1.join(list_user_test, df_ratings1.userId == list_user_test.us)
df_test = df_test.join(list_film_test, df_test.movieId == list_film_test.mov)
df_test = df_test.drop('mov', 'us')
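
The two joins above keep only the ratings whose user is "active" and whose movie is "popular", with both counts computed on the full ratings table. The plain-Python sketch below mirrors that logic on toy data; the `dense_subset` helper, its thresholds and the triples are illustrative assumptions.

```python
from collections import Counter

def dense_subset(triples, min_movie_count, min_user_count):
    """Keep ratings given by active users to popular movies."""
    movie_counts = Counter(m for _, m, _ in triples)
    user_counts = Counter(u for u, _, _ in triples)
    popular = {m for m, c in movie_counts.items() if c >= min_movie_count}
    active = {u for u, c in user_counts.items() if c >= min_user_count}
    return [(u, m, r) for u, m, r in triples if u in active and m in popular]

# Toy (userId, movieId, rating) triples
toy = [
    (1, 10, 4.0), (1, 20, 3.0), (1, 30, 5.0),
    (2, 10, 2.0), (2, 20, 4.5),
    (3, 10, 3.5),
]

subset = dense_subset(toy, min_movie_count=2, min_user_count=2)
print(subset)
```

In Spark, a `left_semi` join on `userId` and then on `movieId` should give the same filtering without the renaming and `drop` steps, although the notebook's approach works just as well.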

In [121]:
df_test.describe().show()

+-------+-----------------+-----------------+-----------------+
|summary|           userId|          movieId|           rating|
+-------+-----------------+-----------------+-----------------+
|  count|            20041|            20041|            20041|
|   mean|326.3430966518637|9365.241804301182|3.650541390150192|
| stddev|184.1573170106317|  20340.317624469|0.954459974859408|
|    min|                6|                1|              0.5|
|    max|              610|           122904|              5.0|
+-------+-----------------+-----------------+-----------------+



In [122]:
df_test.printSchema()

root
 |-- userId: integer (nullable = true)
 |-- movieId: integer (nullable = true)
 |-- rating: float (nullable = true)



In [123]:
essai1 = df_test.groupBy('movieId').count()
essai1.describe().show()

+-------+------------------+------------------+
|summary|           movieId|             count|
+-------+------------------+------------------+
|  count|               450|               450|
|   mean|11245.017777777777|44.535555555555554|
| stddev| 23457.08417001459|15.062596273004987|
|    min|                 1|                 9|
|    max|            122904|                98|
+-------+------------------+------------------+



In [124]:
essai2 = df_test.groupBy('userId').count()
essai2.describe().show()

+-------+-----------------+------------------+
|summary|           userId|             count|
+-------+-----------------+------------------+
|  count|              105|               105|
|   mean|309.6190476190476|190.86666666666667|
| stddev|184.4623883360039| 75.28549082130716|
|    min|                6|                71|
|    max|              610|               429|
+-------+-----------------+------------------+



We now have our test sample `df_test`; let us create our training sample.

In [125]:
df_training = df_ratings1.subtract(df_test)

df_training.describe().show()

+-------+------------------+------------------+------------------+
|summary|            userId|           movieId|            rating|
+-------+------------------+------------------+------------------+
|  count|             80795|             80795|             80795|
|   mean| 326.0741011201188|21933.147694783092|3.4646017699115044|
| stddev|182.23588575047182| 37968.26498508466| 1.060015311728062|
|    min|                 1|                 1|               0.5|
|    max|               610|            193609|               5.0|
+-------+------------------+------------------+------------------+



In [126]:
df_training.printSchema()

root
 |-- userId: integer (nullable = true)
 |-- movieId: integer (nullable = true)
 |-- rating: float (nullable = true)



In [127]:
df_ratings1.describe().show()

+-------+------------------+----------------+------------------+
|summary|            userId|         movieId|            rating|
+-------+------------------+----------------+------------------+
|  count|            100836|          100836|            100836|
|   mean|326.12756356856676|19435.2957177992| 3.501556983616962|
| stddev| 182.6184914635004|35530.9871987003|1.0425292390606342|
|    min|                 1|               1|               0.5|
|    max|               610|          193609|               5.0|
+-------+------------------+----------------+------------------+
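
The row counts reported by the three `describe()` outputs are consistent with a clean partition. Note that `subtract` behaves like SQL's `EXCEPT DISTINCT`, so duplicated rows would be collapsed; the check below suggests that none were lost here. The toy sets are illustrative only.

```python
# Row counts reported by the describe() outputs in this section
n_all, n_test, n_train = 100836, 20041, 80795
assert n_all - n_test == n_train

# Toy illustration of the same set difference on (userId, movieId, rating) rows
all_rows = {(1, 10, 4.0), (1, 20, 3.0), (2, 10, 2.0), (3, 10, 3.5)}
test_rows = {(1, 10, 4.0), (2, 10, 2.0)}
train_rows = all_rows - test_rows

print(sorted(train_rows))
```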



We now have our training sample `df_training`, so let us prepare the model by searching for the best hyperparameters for this new data.

In [141]:
regParams = [0.28, 0.30, 0.31, 0.32, 0.33, 0.34, 0.35]
ranks = [20, 25, 30]

tune_ALS(df_training, df_test, 5, regParams, ranks)



20 latent factors and regularization = 0.28: validation RMSE is 1.0060292388857517
20 latent factors and regularization = 0.3: validation RMSE is 1.0034668392364237
20 latent factors and regularization = 0.31: validation RMSE is 1.0028625201145691
20 latent factors and regularization = 0.32: validation RMSE is 1.002663154061755
20 latent factors and regularization = 0.33: validation RMSE is 1.0028353922945956
20 latent factors and regularization = 0.34: validation RMSE is 1.0033547120230608
20 latent factors and regularization = 0.35: validation RMSE is 1.0041890187441462


25 latent factors and regularization = 0.28: validation RMSE is 1.0022107874840231
25 latent factors and regularization = 0.3: validation RMSE is 0.9995251202677113
25 latent factors and regularization = 0.31: validation RMSE is 0.9989480600730523
25 latent factors and regularization = 0.32: validation RMSE is 0.9988026436401769
25 latent factors and regularization = 0.33: validation RMSE is 0.999050609777553
25 l

ALS_8627431fe1bd

The grid search above gives the best hyperparameters for this split, which we use to refit the model.

In [142]:
als2 = ALS(maxIter=5, regParam=0.32, rank=25, userCol = "userId", itemCol = "movieId", ratingCol = "rating", coldStartStrategy = "drop", nonnegative=True)
model2 = als2.fit(df_training)

In [143]:
evaluator2 = RegressionEvaluator(metricName = "rmse", labelCol = "rating", predictionCol = "prediction")
predictions2 = model2.transform(df_test)
rmse2 = evaluator2.evaluate(predictions2)

In [144]:
print(f"RMSE = {round(rmse2,3)}")
predictions2.show(10)

RMSE = 0.999
+------+-------+------+----------+
|userId|movieId|rating|prediction|
+------+-------+------+----------+
|   580|   1580|   4.0| 2.8656528|
|   597|   1580|   3.0|  3.291853|
|   368|   1580|   3.0| 2.4771736|
|    28|   1580|   3.0| 2.4780116|
|   606|   1580|   2.5| 3.0528028|
|    91|   1580|   3.5| 2.9021473|
|   232|   1580|   3.5| 2.8981044|
|   599|   1580|   3.0| 2.3114977|
|   111|   1580|   3.0| 2.9209921|
|   140|   1580|   3.0| 3.0011435|
+------+-------+------+----------+
only showing top 10 rows



In [145]:
predictions2 = transform_df(predictions2)

predictions2.show(5)

+------+-------+------+----------+-----------+------------------+---------+
|userId|movieId|rating|prediction|prediction2|rating-prediction2|good_pred|
+------+-------+------+----------+-----------+------------------+---------+
|   580|   1580|   4.0| 2.8656528|        3.0|               1.0|        0|
|   597|   1580|   3.0|  3.291853|        3.5|               0.5|        1|
|   368|   1580|   3.0| 2.4771736|        2.5|               0.5|        1|
|    28|   1580|   3.0| 2.4780116|        2.5|               0.5|        1|
|   606|   1580|   2.5| 3.0528028|        3.0|               0.5|        1|
+------+-------+------+----------+-----------+------------------+---------+
only showing top 5 rows



In [146]:
predictions2.describe().show()

+-------+------------------+-----------------+------------------+-------------------+------------------+------------------+------------------+
|summary|            userId|          movieId|            rating|         prediction|       prediction2|rating-prediction2|         good_pred|
+-------+------------------+-----------------+------------------+-------------------+------------------+------------------+------------------+
|  count|             20041|            20041|             20041|              20041|             20041|             20041|             20041|
|   mean| 326.3430966518637|9365.241804301182| 3.650541390150192|  3.087565660833581| 3.087869866773115|0.8208422733396538|0.5027693228880794|
| stddev|184.15731701063086|20340.31762446896|0.9544599748594105|0.47425025987645064|0.4979379217160119| 0.586056554056174|0.5000048054948569|
|    min|                 6|                1|               0.5|          1.3049136|               1.5|               0.0|                 0|

In [147]:
predictions2.groupBy().avg('good_pred').show()

+------------------+
|    avg(good_pred)|
+------------------+
|0.5027693228880794|
+------------------+



We therefore obtain about **50.3 %** correct predictions, *i.e.* within 0.5 star, on the test set. These results are weaker than those observed previously because training was deprived of the ratings that the most active users gave to the most-rated movies. From this perspective, the result is quite satisfactory.


### 10. Improving the model by centering the initial data

Let us now try another route for improvement: starting again from the initial data, but centering it before feeding it to the model.

In [158]:
df_rating_mean = df_ratings1.groupBy().mean('rating')
df_rating_mean.show()

+-----------------+
|      avg(rating)|
+-----------------+
|3.501556983616962|
+-----------------+



In [159]:
df_rating_mean.printSchema()

root
 |-- avg(rating): double (nullable = true)



In [162]:
# Retrieve the mean rating
rating_mean = df_rating_mean.collect()[0][0]
rating_mean

3.501556983616962

In [166]:
df_ratings2 = df_ratings1.withColumn('rating', df_ratings1['rating'] - rating_mean)
df_ratings2.show(5)

+------+-------+-----------------+
|userId|movieId|           rating|
+------+-------+-----------------+
|     1|      1|0.498443016383038|
|     1|      3|0.498443016383038|
|     1|      6|0.498443016383038|
|     1|     47|1.498443016383038|
|     1|     50|1.498443016383038|
+------+-------+-----------------+
only showing top 5 rows



In [167]:
(training3,test3) = df_ratings2.randomSplit([0.8, 0.2])

In [170]:
regParams = [0.07, 0.08, 0.09, 0.10, 0.11, 0.12, 0.13, 0.14]
ranks = [20, 25, 30]

tune_ALS(training3, test3, 5, regParams, ranks)



20 latent factors and regularization = 0.07: validation RMSE is 0.9690075709861126
20 latent factors and regularization = 0.08: validation RMSE is 0.9677804506553153
20 latent factors and regularization = 0.09: validation RMSE is 0.9671321155148779
20 latent factors and regularization = 0.1: validation RMSE is 0.9670064219586998
20 latent factors and regularization = 0.11: validation RMSE is 0.9674524463800205
20 latent factors and regularization = 0.12: validation RMSE is 0.9682254307096025
20 latent factors and regularization = 0.13: validation RMSE is 0.969177054410368
20 latent factors and regularization = 0.14: validation RMSE is 0.9702810319701718


25 latent factors and regularization = 0.07: validation RMSE is 0.9670064116156765
25 latent factors and regularization = 0.08: validation RMSE is 0.9658297778804819
25 latent factors and regularization = 0.09: validation RMSE is 0.9654430566045946
25 latent factors and regularization = 0.1: validation RMSE is 0.9656224562128436
25 

ALS_12127f6089c7

In [171]:
als3 = ALS(maxIter=5, regParam=0.09, rank=25, userCol = "userId", itemCol = "movieId", ratingCol = "rating", coldStartStrategy = "drop", nonnegative=True)
model3 = als3.fit(training3)

In [172]:
evaluator3 = RegressionEvaluator(metricName = "rmse", labelCol = "rating", predictionCol = "prediction")
predictions3 = model3.transform(test3)
rmse3 = evaluator3.evaluate(predictions3)

In [173]:
print(f"RMSE = {round(rmse3,3)}")
predictions3.show(5)

RMSE = 0.965
+------+-------+--------------------+-----------+
|userId|movieId|              rating| prediction|
+------+-------+--------------------+-----------+
|   372|    471|  -0.501556983616962|        0.0|
|   182|    471|   0.998443016383038|0.104577005|
|   462|    471|  -1.001556983616962| 0.09827592|
|   171|    471|  -0.501556983616962| 0.59138185|
|   541|    471|  -0.501556983616962| 0.45794845|
|   357|    471|-0.00155698361696...| 0.29788032|
|   104|    471|   0.998443016383038|0.073140234|
|    44|    833|  -1.501556983616962| 0.10454204|
|   599|   1088|  -1.001556983616962|        0.0|
|   169|   1088|   0.998443016383038| 0.20165491|
+------+-------+--------------------+-----------+
only showing top 10 rows



In [180]:
# Now de-center the ratings and predictions

def transform_df2(df):
    df = df.withColumn("rating", df['rating'] + rating_mean).withColumn("prediction", df['prediction'] + rating_mean)
    
    return df


In [181]:
predictions3 = transform_df2(predictions3)
predictions3 = transform_df(predictions3)

predictions3.show(5)

+------+-------+------+------------------+-----------+------------------+---------+
|userId|movieId|rating|        prediction|prediction2|rating-prediction2|good_pred|
+------+-------+------+------------------+-----------+------------------+---------+
|   372|    471|   3.0| 3.501556983616962|        3.5|               0.5|        1|
|   182|    471|   4.5|3.6061339885264774|        3.5|               1.0|        0|
|   462|    471|   2.5|3.5998329058557887|        3.5|               1.0|        0|
|   171|    471|   3.0| 4.092938831475391|        4.0|               1.0|        0|
|   541|    471|   3.0|3.9595054298907657|        4.0|               1.0|        0|
+------+-------+------+------------------+-----------+------------------+---------+
only showing top 5 rows
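
The columns `prediction2`, `rating-prediction2` and `good_pred` come from `transform_df`, which is defined earlier in the notebook and not shown in this section. Judging only from the output above, the logic appears to be: round the prediction to the nearest half star, take the absolute error, and flag the prediction as good when the error is at most 0.5. The plain-Python sketch below is an inferred reconstruction under that assumption, not the notebook's actual function; `round_to_half`, `score_prediction` and `tolerance` are hypothetical names.

```python
def round_to_half(value):
    """Round to the nearest multiple of 0.5, the step of the rating scale.

    Exact ties follow Python's round-half-to-even rule.
    """
    return round(value * 2) / 2

def score_prediction(rating, prediction, tolerance=0.5):
    """Return (rounded prediction, absolute error, good-prediction flag)."""
    prediction2 = round_to_half(prediction)
    error = abs(rating - prediction2)
    good_pred = 1 if error <= tolerance else 0
    return prediction2, error, good_pred

# (rating, prediction) pairs copied from the five rows displayed above.
rows = [
    (3.0, 3.501556983616962),
    (4.5, 3.6061339885264774),
    (2.5, 3.5998329058557887),
    (3.0, 4.092938831475391),
    (3.0, 3.9595054298907657),
]

results = [score_prediction(rating, prediction) for rating, prediction in rows]
for (rating, prediction), result in zip(rows, results):
    print(rating, prediction, result)

# Share of good predictions on this tiny sample.
accuracy = sum(good for _, _, good in results) / len(results)
print(accuracy)
```

On these five rows the sketch reproduces the displayed `prediction2`, `rating-prediction2` and `good_pred` values, which supports the assumed threshold of 0.5.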



In [182]:
predictions3.describe().show()

+-------+------------------+-----------------+------------------+------------------+-------------------+------------------+------------------+
|summary|            userId|          movieId|            rating|        prediction|        prediction2|rating-prediction2|         good_pred|
+-------+------------------+-----------------+------------------+------------------+-------------------+------------------+------------------+
|  count|             19332|            19332|             19332|             19332|              19332|             19332|             19332|
|   mean| 323.8162631905649|17218.57366025243|3.5148199875853505| 3.722910325494733|  3.701272501551831|0.7226619077177736|0.6422511897372233|
| stddev|181.95828537786656|32964.03660919268|1.0364023577076933|0.2600996262186089|0.29831911429862595|0.6454659081565471|0.4793500650137123|
|    min|                 1|                1|               0.5| 3.501556983616962|                3.5|               0.0|                 0|

In [183]:
predictions3.groupBy().avg('good_pred').show()

+------------------+
|    avg(good_pred)|
+------------------+
|0.6422511897372233|
+------------------+



With centered data, we obtain about **64.2%** correct predictions. This is slightly better than with our first model, but the difference may simply come from the different initial train/test split.
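
One detail of the summary statistics deserves a note: the minimum `prediction` is 3.501556983616962, which is exactly the offset between the centered and original ratings (a rating of 3.0 appears as -0.501556983616962 once centered). A possible explanation, offered as an assumption that the notebook does not verify: with `nonnegative=True`, ALS constrains the latent factors to be nonnegative, so every centered prediction is at least 0 and every un-centered prediction is at least the mean. The sketch below illustrates this with random nonnegative factors in plain Python; the factor values are hypothetical and are not those of the fitted model.

```python
import random

random.seed(0)
rank = 25                          # same rank as the ALS call above
rating_mean = 3.501556983616962    # offset inferred from the tables above

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical nonnegative latent factors for 100 users and 50 movies.
user_factors = [[random.uniform(0.0, 0.3) for _ in range(rank)] for _ in range(100)]
item_factors = [[random.uniform(0.0, 0.3) for _ in range(rank)] for _ in range(50)]

# A prediction is the dot product of a user vector and a movie vector.
centered_predictions = [dot(u, v) for u in user_factors for v in item_factors]
uncentered_predictions = [p + rating_mean for p in centered_predictions]

# Nonnegative factors can only produce nonnegative dot products.
print(min(centered_predictions) >= 0.0)
print(min(uncentered_predictions) >= rating_mean)
```

If this is indeed what happens, ratings below the mean can never be predicted exactly. Training on centered ratings without `nonnegative=True` would be one thing to try, although whether it improves the score would need to be checked.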

### 11. Joining with the movie title data
To make our results easier to interpret, we can join the predictions with the data from `movies.csv`.

In [68]:
path_data = "/home/p5hngk/Downloads/GitHub/SD_701---Data_Mining/ml-latest-small"

df_movies = spark.read.format("csv").option("header", "true").load(path_data+"/movies.csv")
df_movies.show(10)

+-------+--------------------+--------------------+
|movieId|               title|              genres|
+-------+--------------------+--------------------+
|      1|    Toy Story (1995)|Adventure|Animati...|
|      2|      Jumanji (1995)|Adventure|Childre...|
|      3|Grumpier Old Men ...|      Comedy|Romance|
|      4|Waiting to Exhale...|Comedy|Drama|Romance|
|      5|Father of the Bri...|              Comedy|
|      6|         Heat (1995)|Action|Crime|Thri...|
|      7|      Sabrina (1995)|      Comedy|Romance|
|      8| Tom and Huck (1995)|  Adventure|Children|
|      9| Sudden Death (1995)|              Action|
|     10|    GoldenEye (1995)|Action|Adventure|...|
+-------+--------------------+--------------------+
only showing top 10 rows



In [88]:
predictions = predictions.join(df_movies, on=['movieId'], how='left_outer')
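
A `left_outer` join keeps every row of `predictions`, even when a `movieId` has no match in `df_movies`; in that case the added columns are null. The plain-Python sketch below mimics that behaviour on a few rows. The helper `left_outer_join` and the prediction rows are hypothetical; only the two titles are copied from the `movies.csv` preview above.

```python
# Titles copied from the `movies.csv` preview above, keyed by movieId.
movies = {"1": "Toy Story (1995)", "2": "Jumanji (1995)"}

# Hypothetical prediction rows; movieId "999999" is deliberately absent from `movies`.
predictions = [
    {"movieId": "1", "userId": "596", "prediction": 3.27},
    {"movieId": "2", "userId": "12", "prediction": 3.10},
    {"movieId": "999999", "userId": "7", "prediction": 2.80},
]

def left_outer_join(rows, lookup, key, new_column):
    """Keep every row; fill `new_column` with None when the key has no match."""
    return [{**row, new_column: lookup.get(row[key])} for row in rows]

joined = left_outer_join(predictions, movies, "movieId", "title")
for row in joined:
    print(row)
```

The unmatched row survives with a `title` of `None`, whereas an inner join would have dropped it.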

In [89]:
predictions.orderBy(predictions.movieId.asc()).show(50)

+-------+------+------+----------+-----------------+---------+----------------+--------------------+
|movieId|userId|rating|prediction|rating-prediction|good_pred|           title|              genres|
+-------+------+------+----------+-----------------+---------+----------------+--------------------+
|      1|   596|     4| 3.2742593|        0.7257407|        0|Toy Story (1995)|Adventure|Animati...|
|      1|   525|     4| 3.2849152|        0.7150848|        0|Toy Story (1995)|Adventure|Animati...|
|      1|   332|     4| 3.4803617|        0.5196383|        0|Toy Story (1995)|Adventure|Animati...|
|      1|   156|     4| 3.4410477|       0.55895233|        0|Toy Story (1995)|Adventure|Animati...|
|      1|   534|     4|  3.994771|     0.0052289963|        1|Toy Story (1995)|Adventure|Animati...|
|      1|    71|     5| 3.7085998|        1.2914002|        0|Toy Story (1995)|Adventure|Animati...|
|      1|   541|     3| 3.7809315|        0.7809315|        0|Toy Story (1995)|Adventure|An

### 12. Making recommendations

In [67]:
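def make_recommendations(self, fav_movie, n_recommendations):
    """
    Print the top n movie recommendations for a given movie.

    Written as a method: `self` must provide `model`, `_prep_data()`
    and `_inference()`, none of which are defined in this section.

    Parameters
    ----------
    fav_movie: str, name of the user's input movie
    n_recommendations: int, number of recommendations to print
    """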
    # get data
    movie_user_mat_sparse, hashmap = self._prep_data()
    # get recommendations
    raw_recommends = self._inference(
        self.model, movie_user_mat_sparse, hashmap,
        fav_movie, n_recommendations)
    # print results
    reverse_hashmap = {v: k for k, v in hashmap.items()}
    print('Recommendations for {}:'.format(fav_movie))
    for i, (idx, dist) in enumerate(raw_recommends):
        print('{0}: {1}, with distance '
              'of {2}'.format(i+1, reverse_hashmap[idx], dist))

In [101]:
# Generate top 10 movie recommendations for each user
# (`show()` returns None, so keep the DataFrame and display it separately)
userRecs = best_model.recommendForAllUsers(10)
userRecs.show(10)

+------+--------------------+
|userId|     recommendations|
+------+--------------------+
|   471|[[7842, 4.5132], ...|
|   463|[[7842, 5.144905]...|
|   496|[[7842, 4.533911]...|
|   148|[[8477, 4.655421]...|
|   540|[[7842, 5.5218005...|
|   392|[[8477, 4.842426]...|
|   243|[[67618, 5.862327...|
|    31|[[33649, 5.740833...|
|   516|[[4429, 4.7700167...|
|   580|[[7842, 5.212381]...|
+------+--------------------+
only showing top 10 rows
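
Each entry of the `recommendations` column is a list of `[movieId, predicted rating]` pairs sorted by decreasing score. Conceptually, the score of a movie for a user is the dot product of their latent vectors, and the top n movies are the n highest scores. The plain-Python sketch below illustrates that idea with hypothetical 2-dimensional factors; it is not Spark's implementation, and `top_n_for_user` is a name introduced for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def top_n_for_user(user_vector, item_vectors, n):
    """Return the n (item_id, score) pairs with the highest scores."""
    scores = [(item_id, dot(user_vector, vec)) for item_id, vec in item_vectors.items()]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:n]

# Hypothetical latent factors; the movie ids are borrowed from the output above.
item_vectors = {7842: [1.0, 2.0], 8477: [2.0, 0.5], 4429: [0.5, 0.5]}
user_vector = [1.0, 1.5]

recommendations = top_n_for_user(user_vector, item_vectors, 2)
print(recommendations)  # [(7842, 4.0), (8477, 2.75)]
```

Nothing bounds these scores to the rating scale, which is why some predicted ratings in the output above exceed 5.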



In [102]:
# Generate top 10 user recommendations for each movie
# (`show()` returns None, so keep the DataFrame and display it separately)
movieRecs = best_model.recommendForAllItems(10)
movieRecs.show(10)

+-------+--------------------+
|movieId|     recommendations|
+-------+--------------------+
|   1580|[[53, 4.7668843],...|
|   4900|[[53, 4.532554], ...|
|   5300|[[53, 4.2717924],...|
|   6620|[[191, 4.574413],...|
|   7340|[[53, 3.979438], ...|
|  32460|[[53, 5.301957], ...|
|  54190|[[53, 5.637684], ...|
|    471|[[53, 4.8591766],...|
|   1591|[[37, 3.6290343],...|
|   1342|[[171, 3.5377822]...|
+-------+--------------------+
only showing top 10 rows



In [None]:
# Generate top 10 movie recommendations for a specified set of users
users = ratings.select(als.getUserCol()).distinct().limit(3)
userSubsetRecs = best_model.recommendForUserSubset(users, 10)

In [None]:
# Generate top 10 user recommendations for a specified set of movies
movies = ratings.select(als.getItemCol()).distinct().limit(3)
movieSubSetRecs = best_model.recommendForItemSubset(movies, 10)
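
The subset variants return the same nested structure as `recommendForAllUsers` and `recommendForAllItems`: one row per user (or movie) with a list of recommendations. To read the results next to movie titles, the nested lists have to be flattened into one row per recommendation; in PySpark this would typically be done with `explode` from `pyspark.sql.functions`. The plain-Python sketch below shows the same reshaping on hypothetical data. The scores are invented, the helper `flatten` is illustrative, and only the two titles come from the `movies.csv` preview.

```python
# Hypothetical nested output, shaped like the `recommendations` column.
nested = [
    {"userId": 471, "recommendations": [(1, 4.5), (2, 4.1)]},
    {"userId": 463, "recommendations": [(2, 4.8)]},
]

# Titles copied from the `movies.csv` preview, keyed by movieId.
titles = {1: "Toy Story (1995)", 2: "Jumanji (1995)"}

def flatten(nested_rows, title_lookup):
    """Turn one row per user into one (userId, movieId, score, title) row per recommendation."""
    flat = []
    for row in nested_rows:
        for movie_id, score in row["recommendations"]:
            flat.append((row["userId"], movie_id, score, title_lookup.get(movie_id)))
    return flat

flat_recs = flatten(nested, titles)
for rec in flat_recs:
    print(rec)
```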

------------------------------------
-----------------------------------