<div style="font-size:18pt; padding-top:20px; text-align:center"><b>Rating Matrix Factorization and </b> <span style="font-weight:bold; color:green">Spark MLlib</span></div><hr>
<div style="text-align:right;">Папулин С.Ю. <span style="font-style: italic;font-weight: bold;">(papulin_bmstu@mail.ru)</span></div>

<a name="0"></a>
<div><span style="font-size:14pt; font-weight:bold">Contents</span>
    <ol>
        <li><a href="#1">Importing libraries and creating the Spark context</a></li>
        <li><a href="#2">Loading the source data</a></li>
        <li><a href="#3">Exploring the source data</a></li>
        <li><a href="#4">Calculating benchmark for rating prediction</a></li>
        <li><a href="#5">Item-based collaborative filtering</a></li>
        <li><a href="#6">Rating matrix factorization</a></li>
        <li><a href="#7">Shutting down</a></li>
        <li><a href="#8">References</a></li>
    </ol>
</div>

<p>Loading the stylesheet</p>

In [None]:
%%html
<link href="css/style.css" rel="stylesheet" type="text/css">

<a name="1"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">1. Importing libraries and creating the Spark context</div>
        <div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">Back to contents</a></div>
    </div>
</div>

<p>[OPTIONAL] <b>Environment setup</b></p>

In [None]:
import os
import sys

os.environ["SPARK_HOME"]="/usr/lib/spark"
os.environ["PYSPARK_PYTHON"]="/opt/anaconda3/bin/python"
os.environ["PYSPARK_DRIVER_PYTHON"]="/opt/anaconda3/bin/python"

spark_home = os.environ.get("SPARK_HOME")
sys.path.insert(0, os.path.join(spark_home, "python"))
sys.path.insert(0, os.path.join(spark_home, "python/lib/py4j-0.10.7-src.zip"))
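
<p>The py4j archive name above is tied to one Spark release, so the hardcoded version may not match your installation. A minimal sketch, assuming the archive follows the usual <span class="code-font">py4j-*-src.zip</span> naming under <span class="code-font">python/lib</span>; the helper name <span class="code-font">find_py4j_zip</span> is hypothetical:</p>

```python
import glob
import os
import sys


def find_py4j_zip(spark_home):
    """Return the path to the py4j source archive under spark_home, or None."""
    pattern = os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")
    matches = sorted(glob.glob(pattern))
    # If several archives are present, take the last one in sorted order
    return matches[-1] if matches else None


# The default path mirrors the SPARK_HOME value used in this notebook
py4j_zip = find_py4j_zip(os.environ.get("SPARK_HOME", "/usr/lib/spark"))
if py4j_zip is not None:
    sys.path.insert(0, py4j_zip)
```

<p>If no archive is found the helper returns <span class="code-font">None</span> and <span class="code-font">sys.path</span> is left unchanged.</p>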

<p>Starting the Spark session</p>

In [None]:
import pyspark
from pyspark.sql import SparkSession

In [None]:
conf = pyspark.SparkConf() \
        .setAppName("moviewRecomApp")

In [None]:
conf = pyspark.SparkConf() \
        .set("spark.executor.memory", "1g") \
        .set("spark.executor.cores", "2") \
        .setAppName("moviewRecomApp") \
        .setMaster("local[4]")

In [None]:
spark = SparkSession \
    .builder \
    .appName("moviewRecomApp") \
    .config(conf=conf) \
    .getOrCreate()

<a name="2"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">2. Loading the source data</div>
        <div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">Back to contents</a></div>
    </div>
</div>

<p>[IF NEEDED]</p>

In [None]:
!hdfs dfs -mkdir -p data/spark_mllib

In [None]:
!hdfs dfs -copyFromLocal data/movie-recommendation-data/ml-latest-small/movies.csv data/spark_mllib/movies.csv
!hdfs dfs -copyFromLocal data/movie-recommendation-data/ml-latest-small/ratings.csv data/spark_mllib/ratings.csv

In [None]:
!hdfs dfs -ls data/spark_mllib

<p>Data paths</p>

In [None]:
# Access to the small dataset
movie_data_path = "YOUR_PATH/data/movie-recommendation-data/ml-latest-small/movies.csv"
ratings_data_path = "YOUR_PATH/data/movie-recommendation-data/ml-latest-small/ratings.csv"

# Databricks
#movie_data_path = "/FileStore/tables/movies.csv"
#ratings_data_path = "/FileStore/tables/ratings.csv"

# Access to the full dataset in HDFS
# movie_data_path = "hdfs:///data/recom/large/movies.csv"
# ratings_data_path = "hdfs:///data/recom/large/ratings.csv"

<div class="msg-block msg-info">
  <div class="msg-text-info">
      <p>The data must be downloaded beforehand. Links to the data:<br>
      1. Small dataset: http://files.grouplens.org/datasets/movielens/ml-latest-small.zip <br>
      2. Full dataset: http://files.grouplens.org/datasets/movielens/ml-latest.zip
      </p>
  </div>
</div>

<div class="msg-block msg-info">
  <div class="msg-text-info">
      <p>1. Run the steps below on the small dataset first, and only then on the full one.<br>
         2. To load data from the local file system, use the <span class="code-font">file:///</span> prefix
      </p>
  </div>
</div>
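
<p>A small sketch of how a local path can be turned into a <span class="code-font">file:///</span> URI with the standard library; the helper name <span class="code-font">to_file_uri</span> is hypothetical, and the path is the one used in this notebook:</p>

```python
from pathlib import Path


def to_file_uri(local_path):
    """Convert a local (possibly relative) path to an absolute file:// URI."""
    # as_uri() requires an absolute path, so resolve it first
    return Path(local_path).resolve().as_uri()


print(to_file_uri("data/movie-recommendation-data/ml-latest-small/movies.csv"))
```

<p>The resulting string can be assigned to <span class="code-font">movie_data_path</span> or <span class="code-font">ratings_data_path</span> when the files live on the driver's local disk.</p>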

<p>Movies</p>

<div class="msg-block msg-warning">
  <p class="msg-text-warn">When the <span class="code-font">file:///</span> prefix is used, the <span class="code-font">sqlContext.read.load()</span> method may raise an error</p>
</div>

In [None]:
df_movies = spark.read.load(movie_data_path, 
                            format="csv", 
                            header="true", 
                            inferSchema="true", 
                            sep=",")
print("Total number of movies:", df_movies.count())
df_movies.show(5)

<p>Ratings</p>

In [None]:
df_ratings = spark.read.load(ratings_data_path,          
                             format="csv", 
                             header="true", 
                             inferSchema="true", 
                             sep=",")
print("Total number of ratings:", df_ratings.count())
df_ratings.show(5)

<a name="3"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">3. Exploring the source data</div>
        <div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">Back to contents</a></div>
    </div>
</div>

<p>Most-watched movies (by number of ratings)</p>

In [None]:
group = df_ratings.groupBy("movieId")

In [None]:
df_movie_count = group.agg({"movieId": "count", "rating":"mean"})
df_movie_count.show(10)

In [None]:
df_movie_count_sorted = df_movie_count.sort("count(movieId)", ascending=False)
df_movie_count_sorted.show(10)

In [None]:
df_movie_count_title = df_movie_count_sorted.join(other=df_movies, on="movieId", how="inner")
df_movie_count_title.show(10)

<p>Least-watched movies (by number of ratings)</p>

In [None]:
df_movie_count_title.sort("count(movieId)", ascending=True).show(10)

<p>Movies with the highest average ratings</p>

In [None]:
df_movie_count_title.sort("avg(rating)", ascending=False).show(10)

In [None]:
df_movie_count_title.sort("avg(rating)", ascending=False) \
                    .filter(df_movie_count_title["count(movieId)"] > 100) \
                    .show(10)

<p>Movies with the lowest average ratings</p>

In [None]:
df_movie_count_title.sort("avg(rating)", ascending=True).show(10)

In [None]:
df_movie_count_title.sort("avg(rating)", ascending=True) \
                    .filter(df_movie_count_title["count(movieId)"] > 50) \
                    .show(10)
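
<p>The count filters above matter because a movie with a single rating can top (or bottom) the list purely by chance. A plain-Python sketch of the same aggregation on hypothetical toy data; the function names and the <span class="code-font">(userId, movieId, rating)</span> tuples are illustrative only:</p>

```python
from collections import defaultdict


def movie_stats(ratings):
    """Map movieId -> (number of ratings, mean rating)."""
    totals = defaultdict(lambda: [0, 0.0])
    for _user, movie, rating in ratings:
        totals[movie][0] += 1
        totals[movie][1] += rating
    return {m: (n, s / n) for m, (n, s) in totals.items()}


def top_rated(stats, min_count, k):
    """Top-k movies by mean rating among those with more than min_count ratings."""
    eligible = [(m, n, avg) for m, (n, avg) in stats.items() if n > min_count]
    return sorted(eligible, key=lambda t: t[2], reverse=True)[:k]


# Hypothetical (userId, movieId, rating) records
ratings = [(1, 10, 5.0), (1, 20, 4.0), (2, 20, 4.5),
           (3, 20, 3.5), (2, 30, 2.0), (3, 30, 3.0)]
stats = movie_stats(ratings)
print(top_rated(stats, min_count=0, k=1))  # movie 10 wins on a single rating
print(top_rated(stats, min_count=1, k=1))  # movie 20 wins once it is filtered out
```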

<a name="4"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">4. Calculating benchmark for rating prediction</div>
        <div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">Back to contents</a></div>
    </div>
</div>

<p><b>Train and test subsets</b></p>

In [None]:
df_train, df_test = df_ratings.randomSplit([0.9, 0.1], seed=12)
df_train.persist(); df_test.persist()

In [None]:
df_train.count(), df_test.count()

In [None]:
df_train.rdd.getNumPartitions()

In [None]:
df_train = df_train.repartition(4)

In [None]:
df_train.rdd.getNumPartitions()

<p><b>Benchmark based on average rating</b></p>

In [None]:
import pyspark.sql.functions as F

In [None]:
from pyspark.ml.evaluation import RegressionEvaluator

<p>Calculating the average rating</p>

<p><i>Approach 1</i></p>

In [None]:
mean_movie_rating = df_train.select(F.mean("rating").alias("avr")).collect()[0]["avr"]
mean_movie_rating

<p><i>Approach 2</i></p>

In [None]:
df_train_descr = df_train.select("rating").describe()
df_train_descr.show()

In [None]:
df_train_descr.collect()

In [None]:
df_train_descr.toPandas()

<p><b>Testing</b></p>

<p>To calculate RMSE we can use the RegressionEvaluator class from MLlib. It expects a dataframe with two columns: the actual ratings and the predictions (estimated ratings).</p>

In [None]:
df_test_pred_bl = df_test.withColumn("prediction", F.lit(mean_movie_rating)) # attach the mean rating as a constant column
df_test_pred_bl.show(5)

<p>Calculate RMSE</p>

In [None]:
eval_rmse = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")
eval_rmse.evaluate(df_test_pred_bl)
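
<p>For reference, RMSE is the square root of the mean squared difference between actual and predicted ratings. A plain-Python sketch of the same baseline on hypothetical toy ratings (the numbers are not taken from the dataset):</p>

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical ratings
train = [4.0, 3.0, 5.0, 2.0]
test = [3.0, 5.0]

# The baseline predicts the training mean for every test rating
mean_rating = sum(train) / len(train)
baseline = [mean_rating] * len(test)
print(rmse(test, baseline))
```

<p>Any model evaluated later should achieve a lower RMSE than this constant predictor to be considered useful.</p>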

<a name="5"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">5. Item-based collaborative filtering</div>
        <div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">Back to contents</a></div>
    </div>
</div>

<p>Add the module path to the Python module search path (<span class="code-font">sys.path</span>)</p>

In [None]:
sys.path.insert(0, "/YOUR_PATH/lib/python/recommend")

<p>Import a class for the item-based collaborative filtering</p>

In [None]:
from itemrecom import ItemBasedRecommend

<p>Create an instance of the class to train and predict ratings </p>

In [None]:
item_based = ItemBasedRecommend(spark)

<p><b>Training</b></p>

<p>Calculate similarities between items (e.g. movies)</p>

In [None]:
item_based.train(df_train, user_column_name="userId", item_column_name="movieId")
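
<p>The internals of <span class="code-font">ItemBasedRecommend</span> are not shown in this notebook. As an illustration only, one common choice of item similarity is the cosine between sparse rating vectors; the sketch below assumes that measure, treats missing ratings as zeros, and uses hypothetical toy data:</p>

```python
import math


def cosine_similarity(item_a, item_b):
    """Cosine similarity of two sparse rating vectors given as {userId: rating}."""
    common = set(item_a) & set(item_b)
    dot = sum(item_a[u] * item_b[u] for u in common)
    norm_a = math.sqrt(sum(r * r for r in item_a.values()))
    norm_b = math.sqrt(sum(r * r for r in item_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical ratings keyed by userId
movie_x = {1: 5.0, 2: 3.0}
movie_y = {1: 5.0, 2: 3.0, 3: 4.0}
movie_z = {3: 2.0}

print(cosine_similarity(movie_x, movie_y))  # high: rated alike by shared users
print(cosine_similarity(movie_x, movie_z))  # 0.0: no users in common
```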

<p><b>Predicting</b></p>

In [None]:
USER = 1
TOP_N_RATINGS = 10

In [None]:
df_recom = item_based.recommend(user_ids=[USER], top_N_ratings=TOP_N_RATINGS, grouped=False)
df_recom.show()

<p><b>Testing</b></p>

<p>Collect user ids from the test subset </p>

In [None]:
U = {el["userId"] for el in df_test[[F.col("userId")]].distinct().collect()}

<p>Get a dataframe with predicted ratings for selected users</p>

In [None]:
df_recom = item_based.recommend(user_ids=U, top_N_ratings=10000, grouped=False).persist()
df_recom.show()

<p>Join true ratings and estimated ones</p>

In [None]:
df_true_pred = df_test.join(df_recom, on=[(F.col("userId")==F.col("user")), 
                                          (F.col("movieId")==F.col("item"))]).persist()
df_true_pred.show()

<p>Calculate RMSE</p>

In [None]:
eval_rmse = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="rating_pred")
eval_rmse.evaluate(df_true_pred)

<a name="6"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">6. Matrix factorization using ALS</div>
        <div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">Back to contents</a></div>
    </div>
</div>

In [None]:
from pyspark.ml.recommendation import ALS

<p><b>Training</b></p>

<p>Using ALS to predict ratings</p>

In [None]:
als_inst = ALS(rank=10, maxIter=10, seed=123, regParam=0.1, 
               numUserBlocks=10, numItemBlocks=10, 
               userCol="userId", itemCol="movieId", ratingCol="rating")

m_als = als_inst.fit(df_train)
m_als
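
<p>To make the idea behind alternating least squares concrete, here is a minimal plain-Python sketch with a single latent factor (rank 1), where each update has a closed form. It is not Spark's implementation: it uses a plain L2 penalty, hypothetical toy data, and illustrative names such as <span class="code-font">als_rank1</span>.</p>

```python
import math


def als_rank1(ratings, n_iter=10, reg=0.1):
    """Fit r_ui ~ p[u] * q[i] by alternating closed-form least-squares updates.

    ratings is a list of (user, item, rating) triples.
    """
    users = {u for u, _, _ in ratings}
    items = {i for _, i, _ in ratings}
    p = dict.fromkeys(users, 1.0)
    q = dict.fromkeys(items, 1.0)
    for _ in range(n_iter):
        # Fix item factors and solve for every user factor
        for u in users:
            rated = [(i, r) for uu, i, r in ratings if uu == u]
            p[u] = sum(r * q[i] for i, r in rated) / (reg + sum(q[i] ** 2 for i, _ in rated))
        # Fix user factors and solve for every item factor
        for i in items:
            rated = [(u, r) for u, ii, r in ratings if ii == i]
            q[i] = sum(r * p[u] for u, r in rated) / (reg + sum(p[u] ** 2 for u, _ in rated))
    return p, q


def rmse(ratings, p, q):
    """RMSE of the rank-1 reconstruction on the observed ratings."""
    return math.sqrt(sum((r - p[u] * q[i]) ** 2 for u, i, r in ratings) / len(ratings))


# Hypothetical observed ratings; the pair ("b", "y") is unobserved
toy = [("a", "x", 4.0), ("a", "y", 3.0), ("b", "x", 2.0)]
p0, q0 = als_rank1(toy, n_iter=0)   # initial factors, all ones
p, q = als_rank1(toy, n_iter=10)
print(rmse(toy, p0, q0), rmse(toy, p, q))
print(p["b"] * q["y"])              # predicted rating for the unobserved pair
```

<p>With <span class="code-font">rank=10</span>, as in the cell above, each update becomes a small regularized linear system instead of a scalar division, but the alternation between user and item factors is the same.</p>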

<p>Predictions on the training subset</p>

In [None]:
train_pred = m_als.transform(df_train)
train_pred.show(5)

<p>RMSE</p>

In [None]:
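# eval_rmse was last configured for "rating_pred"; ALS writes to "prediction"
eval_rmse.setPredictionCol("prediction").evaluate(train_pred)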

<p><b>Evaluation on the test subset</b></p>

<p>Predictions on the test subset</p>

In [None]:
test_pred = m_als.transform(df_test)
test_pred.show(5)

<p>Problem</p>

In [None]:
eval_rmse.evaluate(test_pred)

<p>Cause</p>

In [None]:
test_pred.sort("prediction", ascending=False).show(10)

<p>One possible fix</p>

In [None]:
test_pred_with_mean = test_pred.fillna(mean_movie_rating, subset=["prediction"])

In [None]:
test_pred_with_mean.sort("prediction", ascending=False).show(10)

<p>RMSE</p>

In [None]:
eval_rmse.evaluate(test_pred_with_mean)
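
<p>The effect shown above can be reproduced without Spark: a single NaN prediction (a user or movie absent from the training subset) makes the whole RMSE NaN, and replacing it with the mean rating restores a finite value. The numbers below are hypothetical. Note that ALS also has a <span class="code-font">coldStartStrategy</span> parameter; setting it to <span class="code-font">"drop"</span> removes such rows instead of filling them.</p>

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical values; NaN stands for a cold-start user or movie
actual = [4.0, 3.0, 5.0]
predicted = [3.5, float("nan"), 4.5]
print(rmse(actual, predicted))  # NaN propagates through the sum

# Replace missing predictions with a (hypothetical) mean rating
mean_rating = 3.5
filled = [mean_rating if math.isnan(p) else p for p in predicted]
print(rmse(actual, filled))
```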

<p>Recommendations for a user</p>

In [None]:
user_id = 463
num_recom = 10 # number of recommendations

<p>The user's ratings</p>

In [None]:
df_ratings.where(F.col("userId")==user_id) \
            .join(other=df_movies, on="movieId", how="inner") \
            .sort("rating", ascending=False) \
            .show()

<p>Recommendations</p>

In [None]:
recom_movies_for_users_df = m_als.recommendForAllUsers(num_recom)
recom_movies_for_users_df.show(5)

In [None]:
recom_movies_for_users_df.where(F.col("userId")==user_id) \
                            .select("recommendations.movieId", "recommendations.rating") \
                            .show()

In [None]:
movies_pandas = recom_movies_for_users_df.where(F.col("userId")==user_id)\
                    .select("recommendations.movieId").toPandas()
movies_pandas["movieId"]

In [None]:
movies_list = movies_pandas["movieId"][0]
movies_list

In [None]:
sc = spark.sparkContext
movies_list_broadcast = sc.broadcast(movies_list)

In [None]:
df_movies.filter(F.col("movieId").isin(movies_list_broadcast.value)).show()
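
<p>Conceptually, a top-N recommendation from a factorization model ranks movies by the dot product of the user's and each movie's latent factors. A plain-Python sketch with hypothetical two-dimensional factors; it also skips movies the user has already rated, which is worth checking against the list returned above. All names and values are illustrative.</p>

```python
import heapq


def recommend_top_n(user_factors, item_factors, seen, n):
    """Return the n unseen item ids with the highest predicted scores."""
    scores = (
        (sum(a * b for a, b in zip(user_factors, factors)), item)
        for item, factors in item_factors.items()
        if item not in seen
    )
    return [item for _score, item in heapq.nlargest(n, scores)]


# Hypothetical latent factors keyed by movieId
user_vec = [1.0, 0.5]
movie_vecs = {10: [4.0, 0.0], 20: [1.0, 2.0], 30: [2.0, 2.0], 40: [0.0, 1.0]}
already_rated = {10}

print(recommend_top_n(user_vec, movie_vecs, already_rated, n=2))
```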

<a name="7"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">7. Shutting down</div>
        <div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">Back to contents</a></div>
    </div>
</div>

In [None]:
spark.stop()