<a href="https://colab.research.google.com/github/Ricardo-Jaramillo/PySpark/blob/main/13_RecommenderSystems.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Recommender Systems

The two most common types of recommender systems are:
* **Content-Based** and
* **Collaborative Filtering (CF)**

Content-based systems focus on the attributes of the items and give you recommendations based on the similarity between them.
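As a rough illustration of the content-based idea, the sketch below compares items by the cosine similarity of their attribute vectors. The attribute names and values are invented for this example (only the meal names come from the notebook's data), and cosine similarity is just one common choice of similarity measure.

```python
import math

# Hypothetical item attributes: [spicy, served_hot, seafood]
items = {
    "Chicken Curry": [1.0, 1.0, 0.0],
    "Spicy Beef Plate": [1.0, 1.0, 0.0],
    "Hamburger": [0.0, 1.0, 0.0],
    "Sushi Plate": [0.0, 0.0, 1.0],
}

def cosine(a, b):
    """Cosine of the angle between two attribute vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(name):
    """Return (item, similarity) for the item closest to `name`."""
    target = items[name]
    others = [(other, cosine(target, vec))
              for other, vec in items.items() if other != name]
    return max(others, key=lambda pair: pair[1])

print(most_similar("Chicken Curry"))
```

A user who liked "Chicken Curry" would be pointed to the item whose attributes overlap most, without using any other user's ratings.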

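CF builds its recommendations from the ratings that many users have given, rather than from item attributes. It is more commonly used because it usually gives better results and is relatively easy to understand.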

A CF algorithm can also do feature learning on its own: it learns for itself which features to use, instead of relying on hand-picked item attributes.
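To give a feel for how such features can be learned from ratings alone, here is a minimal pure-Python sketch of alternating least squares with a single latent factor per user and per item. The ratings, the regularization value, and the number of iterations are all made up for illustration. This is not Spark's implementation, which learns several factors per user and item and runs distributed.

```python
# Hypothetical observed ratings: (user, item) -> rating
ratings = {
    (0, 0): 5.0, (0, 1): 4.0,
    (1, 0): 4.0, (1, 2): 1.0,
    (2, 1): 5.0, (2, 2): 2.0,
}
n_users, n_items = 3, 3
reg = 0.1  # L2 regularization strength (illustrative value)

user_f = [1.0] * n_users  # one learned factor per user
item_f = [1.0] * n_items  # one learned factor per item

def loss():
    """Squared error on observed ratings plus the L2 penalty."""
    squared_error = sum((r - user_f[u] * item_f[i]) ** 2
                        for (u, i), r in ratings.items())
    penalty = reg * (sum(x * x for x in user_f) + sum(x * x for x in item_f))
    return squared_error + penalty

history = [loss()]
for _ in range(10):
    # Hold item factors fixed; each user factor has a closed-form ridge solution.
    for u in range(n_users):
        seen = [(i, r) for (uu, i), r in ratings.items() if uu == u]
        user_f[u] = (sum(r * item_f[i] for i, r in seen)
                     / (sum(item_f[i] ** 2 for i, r in seen) + reg))
    # Hold user factors fixed; solve for each item factor the same way.
    for i in range(n_items):
        seen = [(u, r) for (u, ii), r in ratings.items() if ii == i]
        item_f[i] = (sum(r * user_f[u] for u, r in seen)
                     / (sum(user_f[u] ** 2 for u, r in seen) + reg))
    history.append(loss())

print(history[0], history[-1])
```

Each half-step minimizes the objective exactly for one side while the other is fixed, so the loss never increases from one iteration to the next. Nobody tells the model what the factor means; it is whatever best explains the observed ratings.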

> **NOTE:** The data needs to be in a specific format to work with Spark's Alternating Least Squares (ALS) recommendation algorithm.
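Judging from the data used later in this notebook, that format is one row per rating, with a numeric user ID, a numeric item ID, and the rating itself. The sketch below shows one way to get there from records keyed by strings. The user names and rows are hypothetical, and in Spark itself a transformer such as `StringIndexer` would typically do this job.

```python
# Hypothetical raw records keyed by strings: (user, item, rating)
raw = [
    ("alice", "Chicken Curry", 3.0),
    ("alice", "Hamburger", 2.0),
    ("bob", "Chicken Curry", 4.0),
]

def build_index(values):
    """Assign each distinct value an integer ID in first-seen order."""
    index = {}
    for value in values:
        index.setdefault(value, len(index))
    return index

user_ids = build_index(user for user, _, _ in raw)
item_ids = build_index(item for _, item, _ in raw)

# One (userId, itemId, rating) row per observed rating
rows = [(user_ids[u], item_ids[i], r) for u, i, r in raw]
print(rows)
```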

## Install pyspark and download the data file

In [1]:
# Install pyspark
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.0.tar.gz (316.9 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.0-py2.py3-none-any.whl size=317425344 sha256=0b0cf1d710e67849d90aafb750bed66b3f9b8d279244910f31da948a898f7402
  Stored in directory: /root/.cache/pip/wheels/41/4e/10/c2cf2467f71c678cfc8a6b9ac9241e5e44a01940da8fbb17fc
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.0


In [13]:
# Download the necessary data files
!wget https://raw.githubusercontent.com/Ricardo-Jaramillo/PySpark/main/datasets/RecommenderSystems/movielens_ratings.csv
!wget https://raw.githubusercontent.com/Ricardo-Jaramillo/PySpark/main/datasets/RecommenderSystems/Meal_Info.csv

--2023-10-04 18:45:09--  https://raw.githubusercontent.com/Ricardo-Jaramillo/PySpark/main/datasets/RecommenderSystems/movielens_ratings.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14373 (14K) [text/plain]
Saving to: ‘movielens_ratings.csv.2’


2023-10-04 18:45:09 (64.0 MB/s) - ‘movielens_ratings.csv.2’ saved [14373/14373]

--2023-10-04 18:45:09--  https://raw.githubusercontent.com/Ricardo-Jaramillo/PySpark/main/datasets/RecommenderSystems/Meal_Info.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26307 (26K) [text/plain

## Read in the data

In [3]:
# Import libraries
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

In [4]:
# Create a session
spark = SparkSession.builder.appName('rec').getOrCreate()

In [15]:
# Read in the data file
data = spark.read.csv('movielens_ratings.csv', header=True, inferSchema=True)
data_info = spark.read.csv('Meal_Info.csv', header=True, inferSchema=True)

In [16]:
# Show data info
data_info.show()

+-------+------+------+--------+--------------------+
|movieId|rating|userId|mealskew|           meal_name|
+-------+------+------+--------+--------------------+
|      2|   3.0|     0|     2.0|       Chicken Curry|
|      3|   1.0|     0|     3.0|Spicy Chicken Nug...|
|      5|   2.0|     0|     5.0|           Hamburger|
|      9|   4.0|     0|     9.0|       Taco Surprise|
|     11|   1.0|     0|    11.0|            Meatloaf|
|     12|   2.0|     0|    12.0|        Ceaser Salad|
|     15|   1.0|     0|    15.0|            BBQ Ribs|
|     17|   1.0|     0|    17.0|         Sushi Plate|
|     19|   1.0|     0|    19.0|Cheesesteak Sandw...|
|     21|   1.0|     0|    21.0|             Lasagna|
|     23|   1.0|     0|    23.0|      Orange Chicken|
|     26|   3.0|     0|    26.0|    Spicy Beef Plate|
|     27|   1.0|     0|    27.0|Salmon with Mashe...|
|     28|   1.0|     0|    28.0| Penne Tomatoe Pasta|
|     29|   1.0|     0|    29.0|        Pork Sliders|
|     30|   1.0|     0|    3

In [17]:
# Show data
data.show()

+-------+------+------+
|movieId|rating|userId|
+-------+------+------+
|      2|   3.0|     0|
|      3|   1.0|     0|
|      5|   2.0|     0|
|      9|   4.0|     0|
|     11|   1.0|     0|
|     12|   2.0|     0|
|     15|   1.0|     0|
|     17|   1.0|     0|
|     19|   1.0|     0|
|     21|   1.0|     0|
|     23|   1.0|     0|
|     26|   3.0|     0|
|     27|   1.0|     0|
|     28|   1.0|     0|
|     29|   1.0|     0|
|     30|   1.0|     0|
|     31|   1.0|     0|
|     34|   1.0|     0|
|     37|   1.0|     0|
|     41|   2.0|     0|
+-------+------+------+
only showing top 20 rows



In [19]:
# Describe data
data.describe().show()

+-------+------------------+------------------+------------------+
|summary|           movieId|            rating|            userId|
+-------+------------------+------------------+------------------+
|  count|              1501|              1501|              1501|
|   mean| 49.40572951365756|1.7741505662891406|14.383744170552964|
| stddev|28.937034065088994| 1.187276166124803| 8.591040424293272|
|    min|                 0|               1.0|                 0|
|    max|                99|               5.0|                29|
+-------+------------------+------------------+------------------+



## Split data

In [26]:
# Split into train and test data
train_data, test_data = data.randomSplit([0.8, 0.2])
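Because the split is random, the rows in each set (and every metric computed later) change from run to run. `randomSplit` accepts an optional `seed` argument that pins this down. The pure-Python analogy below, with a hypothetical `random_split` helper and made-up rows, shows the idea. It is not how Spark implements the split.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to a split by drawing against cumulative weights."""
    rng = random.Random(seed)
    total = sum(weights)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random() * total
        cumulative = 0.0
        for k, weight in enumerate(weights):
            cumulative += weight
            if draw < cumulative:
                splits[k].append(row)
                break
        else:
            # Guard against floating-point rounding at the upper edge
            splits[-1].append(row)
    return splits

rows = list(range(100))  # stand-in for the rating rows
train_a, test_a = random_split(rows, [0.8, 0.2], seed=42)
train_b, test_b = random_split(rows, [0.8, 0.2], seed=42)

print(len(train_a), len(test_a))
print(train_a == train_b and test_a == test_b)
```

The weights are target proportions, not exact counts, so the two sets will only be approximately 80% and 20% of the data.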

## Create the Model and Train

In [21]:
# Create ALS object
als = ALS(maxIter=5, regParam=0.01, userCol='userId', itemCol='movieId', ratingCol='rating')

In [22]:
# Fit object
model = als.fit(train_data)

In [23]:
# Predict on model with test_data
predictions = model.transform(test_data)
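One thing to watch for at this step: by default, Spark's ALS model returns `NaN` as the prediction for any user or item that did not appear in the training data, and a single `NaN` makes the RMSE `NaN` as well. Setting `coldStartStrategy='drop'` on the `ALS` object removes those rows from the output. The sketch below mimics that effect in plain Python on made-up values.

```python
import math

# Hypothetical (rating, prediction) pairs; NaN marks a user or item
# that was not seen during training
scored = [(3.0, 2.5), (1.0, float("nan")), (4.0, 3.0)]

def rmse(pairs):
    """Root mean squared error over (rating, prediction) pairs."""
    return math.sqrt(sum((r - p) ** 2 for r, p in pairs) / len(pairs))

# Drop rows without a usable prediction before evaluating
kept = [(r, p) for r, p in scored if not math.isnan(p)]

print(rmse(scored))  # the NaN propagates into the metric
print(rmse(kept))
```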

In [24]:
# Show predictions
predictions.show()

+-------+------+------+----------+
|movieId|rating|userId|prediction|
+-------+------+------+----------+
|      0|   1.0|    26|-0.5986326|
|      3|   1.0|    26| 3.0940518|
|      2|   1.0|    12| 0.3111529|
|      0|   1.0|     6|-0.6660721|
|      1|   1.0|     6|-2.5745487|
|      2|   3.0|     6| 1.9826515|
|      0|   1.0|     3| 1.7167076|
|      2|   1.0|     3| 1.8655846|
|      0|   1.0|    20|0.30153486|
|      0|   1.0|     5|0.50251114|
|      0|   1.0|    19| 1.2818692|
|      0|   1.0|    15| 0.5357379|
|      5|   2.0|    15| 1.1678762|
|      2|   3.0|     9| 1.7838337|
|      0|   1.0|     8| 0.8033804|
|      4|   1.0|    23| 1.2910298|
|      0|   3.0|    10| 0.6119093|
|      3|   1.0|    29| 1.5830213|
|      4|   1.0|    14| 3.1710868|
|      1|   1.0|    18| 3.8213944|
+-------+------+------+----------+
only showing top 20 rows
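Several of these predictions are negative, even though the ratings in this dataset only range from 1.0 to 5.0. ALS does not constrain its output to the rating scale. One common post-processing option, which is not part of the original notebook, is to clip predictions to the valid range. A minimal sketch:

```python
# Ratings in this dataset range from 1.0 to 5.0 (see the describe() output)
MIN_RATING, MAX_RATING = 1.0, 5.0

def clip(prediction):
    """Force a predicted rating into the valid rating range."""
    return max(MIN_RATING, min(MAX_RATING, prediction))

# First five predictions from the table above
raw_predictions = [-0.5986326, 3.0940518, 0.3111529, -0.6660721, -2.5745487]
clipped = [clip(p) for p in raw_predictions]
print(clipped)
```

Since every true rating lies inside the range, clipping can only move a prediction closer to it or leave it unchanged.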



## Evaluate the model

In [27]:
# Create evaluator object
evaluator = RegressionEvaluator(metricName='rmse', labelCol='rating', predictionCol='prediction')

In [28]:
# Evaluate on the prediction made
rmse = evaluator.evaluate(predictions)

In [29]:
print(f'RMSE: {rmse}')

RMSE: 1.61688713446252


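An RMSE of about 1.6 is not a great number on a rating scale of 1 to 5: the ratings themselves have a standard deviation of about 1.19, which is roughly the RMSE you would get by always predicting the mean rating. As an example, though, it works fine.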
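To make the metric concrete, here is a minimal sketch of what RMSE computes, applied by hand to the first five rows of the predictions table shown earlier. Because it uses only five rows, the result will not match the RMSE reported for the full test set.

```python
import math

# (rating, prediction) for the first five rows of the predictions table
pairs = [
    (1.0, -0.5986326),
    (1.0, 3.0940518),
    (1.0, 0.3111529),
    (1.0, -0.6660721),
    (1.0, -2.5745487),
]

# Square each error, average the squares, then take the square root
squared_errors = [(rating - prediction) ** 2 for rating, prediction in pairs]
rmse_subset = math.sqrt(sum(squared_errors) / len(squared_errors))

print(f"RMSE on these five rows: {rmse_subset:.4f}")
```

Squaring means that a few large misses, such as the prediction of about -2.57 for a rating of 1.0, dominate the score.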

## Make predictions on test data
Now that we have a trained model, let's recommend movies to users!

In [38]:
# Select all movies rated by a single user (userId 11) in the test set
single_user = test_data.filter(test_data['userId']==11).select(['movieId', 'userId'])

In [39]:
# Show movies
single_user.show()

+-------+------+
|movieId|userId|
+-------+------+
|      6|    11|
|      9|    11|
|     11|    11|
|     13|    11|
|     18|    11|
|     20|    11|
|     27|    11|
|     38|    11|
|     62|    11|
|     78|    11|
|     81|    11|
|     82|    11|
|     94|    11|
|     97|    11|
+-------+------+



In [41]:
# Make recommendations to that specific user
recommendations = model.transform(single_user)
recommendations.orderBy('prediction', ascending=False).show()

+-------+------+----------+
|movieId|userId|prediction|
+-------+------+----------+
|     18|    11| 4.9869757|
|     27|    11|  4.928551|
|     81|    11|  4.496086|
|     38|    11|  3.886926|
|     13|    11| 3.8325024|
|     97|    11| 2.9742541|
|     82|    11| 2.5462859|
|     94|    11| 2.1185591|
|      6|    11| 1.6827623|
|     62|    11| 1.0523381|
|      9|    11| 1.0406839|
|     11|    11| 0.9801813|
|     78|    11| 0.9736101|
|     20|    11|0.90763056|
+-------+------+----------+
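Turning these scores into a short recommendation list is a matter of keeping the top few rows. The sketch below does that in plain Python on the values printed above. The cutoff of three is an arbitrary choice for the example.

```python
import heapq

# (movieId, prediction) rows for userId 11, copied from the output above
scored = [
    (18, 4.9869757), (27, 4.928551), (81, 4.496086), (38, 3.886926),
    (13, 3.8325024), (97, 2.9742541), (82, 2.5462859), (94, 2.1185591),
    (6, 1.6827623), (62, 1.0523381), (9, 1.0406839), (11, 0.9801813),
    (78, 0.9736101), (20, 0.90763056),
]

# Keep the three items with the highest predicted rating
top_three = heapq.nlargest(3, scored, key=lambda row: row[1])
top_ids = [movie_id for movie_id, _ in top_three]
print(top_ids)
```

Note that these are items user 11 has already rated in the test set, which is convenient for checking the model but is not how recommendations would normally be served. To score items a user has not rated yet, the fitted model also offers `recommendForAllUsers`, which returns the top items per user.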

