### Recommender Systems


**Item-based collaborative filtering**

**(Recommends items similar to what a user has rated highly.)**

In item-based collaborative filtering, the focus is on calculating the similarity between items rather than between users. This is typically done using similarity measures such as cosine similarity or the Jaccard index, computed from the patterns of user interactions with items.

Key Idea: "If many users who watched Movie A also watched Movie B, then Movie B is likely to be recommended to users who watched Movie A."
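
A minimal sketch of this idea, assuming a tiny hypothetical watch matrix (rows are users, columns are movies). The names `watched`, `column`, and `cosine` are illustrative and do not come from any library.

```python
import math

# Hypothetical watch matrix: rows = users, columns = movies A, B, C.
# 1 means the user watched the movie, 0 means they did not.
watched = [
    [1, 1, 0],  # user 0 watched A and B
    [1, 1, 0],  # user 1 watched A and B
    [1, 0, 1],  # user 2 watched A and C
    [0, 0, 1],  # user 3 watched C
]

def column(matrix, j):
    """Return column j, i.e. the interaction vector of one item."""
    return [row[j] for row in matrix]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

movie_a, movie_b, movie_c = (column(watched, j) for j in range(3))
sim_ab = cosine(movie_a, movie_b)  # A and B share two viewers
sim_ac = cosine(movie_a, movie_c)  # A and C share one viewer
```

Because `sim_ab` is larger than `sim_ac`, Movie B would be suggested ahead of Movie C to someone who watched Movie A.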

**Collaborative Filtering (CF)**

**(Recommends items liked by similar users.)**

Collaborative filtering (CF) is a popular machine learning technique used in recommendation systems. It relies on user interactions (such as ratings, clicks, or purchases) to suggest items to users based on the preferences of similar users or similar items.

It produces recommendations from what is known about users' attitudes toward items; that is, it uses the "wisdom of the crowd" to recommend items.

Key Idea: "If many users liked Item A and also liked Item B, a new user who likes Item A will likely enjoy Item B too."
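
The key idea can be sketched as simple co-occurrence counting. The data and the `recommend` helper below are hypothetical; production systems normally use weighted similarities rather than raw counts.

```python
from collections import Counter

# Hypothetical "liked" sets per existing user.
liked = {
    "u1": {"A", "B"},
    "u2": {"A", "B", "C"},
    "u3": {"A"},
    "u4": {"C"},
}

def recommend(seed_items, liked_by_user, top_n=1):
    """Score unseen items by how often they were liked together with a seed item."""
    scores = Counter()
    for items in liked_by_user.values():
        overlap = len(items & seed_items)
        if overlap:  # this user shares at least one liked item with the new user
            for item in items - seed_items:
                scores[item] += overlap
    return [item for item, _ in scores.most_common(top_n)]

new_user_likes = {"A"}
suggestion = recommend(new_user_likes, liked)
```

Here two of the users who liked Item A also liked Item B, and only one also liked Item C, so Item B is ranked first.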

**Content-Based Filtering**

**(Recommends items similar to what the user liked, based on item attributes.)**

Content-Based Filtering is a recommendation technique that suggests items based on their intrinsic characteristics (features, metadata, or content) rather than relying on user interactions or preferences from other users.

Key Idea: "If you liked Item A, you'll probably like Item B because they share similar characteristics."
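
A small sketch of this, assuming each movie is described by a hypothetical set of genre tags and that overlap is measured with the Jaccard index. The `genres` catalog and both helper functions are made up for illustration.

```python
# Hypothetical item metadata: movie -> set of genre tags.
genres = {
    "Movie A": {"sci-fi", "action", "thriller"},
    "Movie B": {"sci-fi", "action"},
    "Movie C": {"romance", "comedy"},
}

def jaccard(a, b):
    """Jaccard index: size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def most_similar(liked_title, catalog):
    """Rank every other title by attribute overlap with the liked one."""
    scores = {
        title: jaccard(catalog[liked_title], tags)
        for title, tags in catalog.items()
        if title != liked_title
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranked = most_similar("Movie A", genres)
```

No interaction data from other users is needed: the ranking depends only on the attributes of the items themselves.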

Collaborative filtering struggles with the cold start problem (new users or items have no interaction history), while content-based filtering (CBF) remains valuable for personalized recommendations when interaction data is limited.
In practice, hybrid systems combining both approaches often yield the best results.

**Alternating Least Squares (ALS)**

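Alternating Least Squares (ALS) is a matrix factorization algorithm commonly used for Collaborative Filtering (CF) in recommendation systems. It approximates the sparse user-item rating matrix as the product of two low-rank factor matrices (one for users, one for items) and alternates between solving for one while holding the other fixed.

It is available in Apache Spark (`pyspark.ml.recommendation.ALS`) and scales to large datasets because each alternating step can be computed in parallel.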
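
To make the alternating updates concrete, here is a pure-Python sketch on a toy problem. It is not Spark's implementation. It assumes a plain L2 penalty `lam`, rank 2, and a handful of made-up `(user, item, rating)` triples, and the helpers `solve`, `update`, and `loss` are hypothetical names.

```python
import random

def solve(A, b):
    """Solve the small dense system A x = b by Gaussian elimination."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]  # augmented matrix [A | b]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

def update(fixed, observed, rank, lam):
    """Least-squares update of one factor matrix while `fixed` is held constant."""
    result = []
    for obs in observed:  # obs: list of (index into fixed, rating)
        A = [[lam if i == j else 0.0 for j in range(rank)] for i in range(rank)]
        b = [0.0] * rank
        for idx, r in obs:
            v = fixed[idx]
            for i in range(rank):
                b[i] += r * v[i]
                for j in range(rank):
                    A[i][j] += v[i] * v[j]
        result.append(solve(A, b))  # solves (sum v v^T + lam I) x = sum r v
    return result

def loss(ratings, U, V, lam):
    """Squared error on observed ratings plus the L2 penalty on all factors."""
    err = sum((r - sum(a * b for a, b in zip(U[u], V[i]))) ** 2 for u, i, r in ratings)
    reg = sum(x * x for row in U + V for x in row)
    return err + lam * reg

# Toy data (assumed): 4 users, 3 items, 8 observed ratings.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0),
           (2, 1, 1.0), (2, 2, 5.0), (3, 0, 1.0), (3, 2, 4.0)]
n_users, n_items, rank, lam = 4, 3, 2, 0.1

rng = random.Random(0)
U = [[rng.random() for _ in range(rank)] for _ in range(n_users)]
V = [[rng.random() for _ in range(rank)] for _ in range(n_items)]
by_user = [[(i, r) for u, i, r in ratings if u == uu] for uu in range(n_users)]
by_item = [[(u, r) for u, i, r in ratings if i == ii] for ii in range(n_items)]

history = [loss(ratings, U, V, lam)]
for _ in range(10):
    U = update(V, by_user, rank, lam)  # fix item factors, solve for users
    V = update(U, by_item, rank, lam)  # fix user factors, solve for items
    history.append(loss(ratings, U, V, lam))
```

Each half-step solves its least-squares subproblem exactly, so the values in `history` should never increase. Once the factors are learned, the predicted rating for a missing entry is the dot product of the corresponding rows of `U` and `V`.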


#### Movie Recommendation

The MovieLens ratings.csv dataset represents user-movie interactions in a User-Movie Matrix, where users provide explicit ratings for movies.

This dataset is commonly used for building recommendation systems.
- The User-Movie Matrix is the core of Collaborative Filtering.
- It is sparse, requiring techniques like Matrix Factorization (ALS) to predict missing values.
- User-Based CF finds similar users; Item-Based CF finds similar movies.
- The public MovieLens benchmark files are static snapshots (the small `ml-latest-small` release dates from 2018), so they do not reflect ongoing user activity.
- For real-time recommendation research, companies therefore typically rely on their own in-house interaction data or on synthetic data.
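
The points about the User-Movie Matrix and its sparsity can be illustrated with a few hypothetical `(userId, movieId, rating)` triples. The values below are made up and are not taken from the dataset.

```python
# Hypothetical (userId, movieId, rating) triples in the same layout as ratings.csv.
ratings = [
    (0, 0, 4.0),
    (0, 2, 1.0),
    (1, 1, 5.0),
    (2, 0, 3.0),
    (2, 3, 2.0),
]

users = sorted({u for u, _, _ in ratings})
movies = sorted({m for _, m, _ in ratings})

# Dense user-movie matrix; None marks a rating the model would have to predict.
matrix = [[None] * len(movies) for _ in users]
for u, m, r in ratings:
    matrix[users.index(u)][movies.index(m)] = r

missing = sum(cell is None for row in matrix for cell in row)
sparsity = missing / (len(users) * len(movies))  # fraction of empty cells
```

With 5 ratings in a 3 x 4 matrix, 7 of the 12 cells are empty. Real rating matrices are usually far sparser, which is why the missing cells are predicted by a factorization model rather than stored densely.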


In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Movie Recommendation').getOrCreate()

In [2]:
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

The MovieLens data used in this code is a small sample: 1,501 ratings from 30 users on 100 movies.

In [3]:
data = spark.read.csv('Datasets/movielens_ratings.csv', inferSchema=True, header=True)
data.show()

+-------+------+------+
|movieId|rating|userId|
+-------+------+------+
|      2|   3.0|     0|
|      3|   1.0|     0|
|      5|   2.0|     0|
|      9|   4.0|     0|
|     11|   1.0|     0|
|     12|   2.0|     0|
|     15|   1.0|     0|
|     17|   1.0|     0|
|     19|   1.0|     0|
|     21|   1.0|     0|
|     23|   1.0|     0|
|     26|   3.0|     0|
|     27|   1.0|     0|
|     28|   1.0|     0|
|     29|   1.0|     0|
|     30|   1.0|     0|
|     31|   1.0|     0|
|     34|   1.0|     0|
|     37|   1.0|     0|
|     41|   2.0|     0|
+-------+------+------+
only showing top 20 rows



In [4]:
data.describe().show()

+-------+------------------+------------------+------------------+
|summary|           movieId|            rating|            userId|
+-------+------------------+------------------+------------------+
|  count|              1501|              1501|              1501|
|   mean| 49.40572951365756|1.7741505662891406|14.383744170552964|
| stddev|28.937034065088994| 1.187276166124803| 8.591040424293272|
|    min|                 0|               1.0|                 0|
|    max|                99|               5.0|                29|
+-------+------------------+------------------+------------------+



**Train/test split**

![SplitGuide](Img/TrainTest.png)

**Rule of Thumb**

More data → Can afford a larger test set for a more reliable evaluation.

Less data → Keep the test set smaller so that enough data remains for training.

In [5]:
training, test = data.randomSplit([0.8, 0.2])  # 80/20 split, a common choice for a small dataset

**Create ALS model**

Set the number of optimization iterations to 5 (more iterations improve convergence but lengthen training) and the regularization parameter to 0.01 to reduce overfitting.

In [6]:
als = ALS(maxIter=5, regParam=0.01, userCol='userId', itemCol='movieId', ratingCol='rating')

In [7]:
model = als.fit(training)

In [9]:
predictions = model.transform(test)
predictions.show()

+-------+------+------+-----------+
|movieId|rating|userId| prediction|
+-------+------+------+-----------+
|      1|   1.0|    26|  2.2733474|
|      1|   1.0|     4|  0.2018598|
|      1|   1.0|    14|  0.6019948|
|      1|   1.0|    18|  1.4008311|
|      6|   1.0|    12|    1.00248|
|      3|   2.0|    22|  1.7812891|
|      3|   1.0|     1|  2.1199567|
|      3|   1.0|    13|  0.8448553|
|      3|   1.0|    29|  0.8821652|
|      3|   1.0|    21|  1.0327353|
|      3|   3.0|    14|  1.4928458|
|      4|   2.0|     1|  1.1636225|
|      4|   2.0|    13|  2.4776533|
|      4|   2.0|    20|  2.2458305|
|      4|   3.0|    10|  0.6677833|
|      4|   1.0|    14|  1.8901378|
|      2|   1.0|    17|0.049916774|
|      2|   1.0|    23|-0.51905584|
|      2|   2.0|     7|  1.7966807|
|      2|   4.0|    10| 0.11767332|
+-------+------+------+-----------+
only showing top 20 rows



**Evaluating the model using RMSE**

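For a 1-5 rating scale, an RMSE of about 1.68 is high. It exceeds the standard deviation of the ratings themselves (about 1.19), which is roughly the error a baseline that always predicts the mean rating would make, so the model's predictions are not very accurate.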

In [10]:
evaluator = RegressionEvaluator(metricName='rmse', labelCol='rating', predictionCol='prediction')
rmse = evaluator.evaluate(predictions)
print('RMSE:',rmse)

RMSE: 1.683271256558634
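
For reference, the metric itself is simple to compute by hand. The sketch below applies it to the first five rows of the predictions table above, so its result describes only those five rows, not the full test set. The `rmse` helper is a hypothetical name.

```python
import math

# First five (rating, prediction) pairs from the predictions table above.
pairs = [
    (1.0, 2.2733474),
    (1.0, 0.2018598),
    (1.0, 0.6019948),
    (1.0, 1.4008311),
    (1.0, 1.00248),
]

def rmse(pairs):
    """Root mean squared error: sqrt(mean((rating - prediction)^2))."""
    squared = [(actual - predicted) ** 2 for actual, predicted in pairs]
    return math.sqrt(sum(squared) / len(squared))

sample_rmse = rmse(pairs)  # about 0.72 for these five rows
```

Squaring the errors before averaging means a few large misses dominate the score, which is one reason the full test-set RMSE can be much higher than a handful of rows suggests.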


**Apply recommendations (example: user ID 11)**

In [11]:
single_user = test.filter(test['userId']==11)
single_user.show()

+-------+------+------+
|movieId|rating|userId|
+-------+------+------+
|      9|   1.0|    11|
|     10|   1.0|    11|
|     13|   4.0|    11|
|     36|   2.0|    11|
|     38|   4.0|    11|
|     61|   1.0|    11|
|     69|   5.0|    11|
|     79|   5.0|    11|
|     86|   1.0|    11|
|     88|   1.0|    11|
|     89|   1.0|    11|
+-------+------+------+



In [12]:
recommendations = model.transform(single_user)
recommendations.orderBy('prediction',ascending=False).show()

+-------+------+------+----------+
|movieId|rating|userId|prediction|
+-------+------+------+----------+
|     10|   1.0|    11|  3.590342|
|     36|   2.0|    11|  3.009136|
|     13|   4.0|    11| 2.6791563|
|     88|   1.0|    11|  1.974485|
|     61|   1.0|    11| 1.6488669|
|     89|   1.0|    11| 1.6134372|
|      9|   1.0|    11| 1.5793567|
|     69|   5.0|    11|0.73788416|
|     79|   5.0|    11|0.69312084|
|     38|   4.0|    11| 0.4809743|
|     86|   1.0|    11|0.10031758|
+-------+------+------+----------+



In [15]:
from pyspark.sql.functions import col
# Get unique users and unique movies
unique_users = data.select("userId").distinct()
unique_movies = data.select("movieId").distinct()
# Generate all possible user-movie pairs using Cartesian Join
user_movie_pairs = unique_users.crossJoin(unique_movies)
# Predict ratings for all user-movie combinations
full_predictions = model.transform(user_movie_pairs)
# Collect predictions for every pair (including movies a user has already rated), sorted by predicted rating
df_predictions = full_predictions.select("userId", "movieId", "prediction").orderBy(col("prediction").desc()).toPandas()
df_predictions

      userId  movieId  prediction
0         17       27    7.060535
1         17       32    5.990928
2         17       23    5.894359
3         26       98    5.850349
4         14       85    5.823034
...      ...      ...         ...
2995      27       49   -2.740512
2996      16       87   -3.037732
2997      10       46   -3.094625
2998      24       64   -3.213073


In [16]:
df_filtered = df_predictions.loc[df_predictions["userId"] == 11]
df_filtered

      userId  movieId  prediction
5         11       46    5.762852
23        11       27    5.071353
34        11       48    5.020074
41        11       23    4.968108
43        11       32    4.945819
...      ...      ...         ...
2517      11       38    0.480974
2566      11       14    0.358040
2611      11       28    0.252553
2632      11       49    0.215124
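
The table above still contains movies user 11 has already rated (movie 38, for example) and scores outside the 1-5 scale. A possible post-processing step, sketched in plain Python with the visible rows copied from the output, is to drop seen movies, rank by raw score, and clip for display. The `top_n_unseen` helper and the clipping bounds are assumptions, and `already_rated` only covers the test-split ratings shown earlier.

```python
# Predictions for user 11 copied from the output above: (movieId, prediction).
user_11_predictions = [
    (46, 5.762852), (27, 5.071353), (48, 5.020074), (23, 4.968108),
    (32, 4.945819), (38, 0.480974), (14, 0.358040), (28, 0.252553),
    (49, 0.215124),
]

# Movies user 11 rated in the test split shown earlier. Training-split ratings
# are not listed in this document, so this set is incomplete by assumption.
already_rated = {9, 10, 13, 36, 38, 61, 69, 79, 86, 88, 89}

def top_n_unseen(predictions, seen, n=3, low=1.0, high=5.0):
    """Drop seen movies, rank by raw score, then clip scores to the rating scale."""
    unseen = [(m, p) for m, p in predictions if m not in seen]
    best = sorted(unseen, key=lambda mp: mp[1], reverse=True)[:n]
    return [(m, min(max(p, low), high)) for m, p in best]

top3 = top_n_unseen(user_11_predictions, already_rated)
```

In PySpark the same filtering could be done before collecting to pandas, for instance with a `left_anti` join of the user-movie pairs against the ratings DataFrame.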
