# User-Based Recommender System

A recommender system is an algorithm or model that takes in information about a user and suggests an item — new to them — that is likely to be of interest. There are several approaches to building such a system, and this notebook will focus on **user-based methods**. Let's begin by starting the PySpark session:


In [2]:
# Load libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import helper_functions as hf
import pyspark.sql.functions as f
from pyspark.sql.types import FloatType, StringType, ArrayType
from pyspark.ml.feature import VectorAssembler, MinMaxScaler
from pyspark.ml import Pipeline

# Initialize Spark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("CBRS").getOrCreate()
spark

In [3]:
# Ratings (Load Dataset)
df = spark.read.parquet('data/cleaned/ratings/')
display(df.limit(3).toPandas())

Unnamed: 0,userId,movieId,rating
0,1,110,1.0
1,1,147,4.5
2,1,858,5.0


## Introduction


User-Based Recommender Systems (UBRS) use the preferences of similar users, or of users with opposite tastes, to make recommendations. The core idea is that if two users' preferences are strongly correlated, positively or negatively, then one user's future preferences can be predicted from the other user's past behavior. Preferences are not limited to product ratings: they can also include interactions with the webpage, such as time spent reading an article, items clicked in the current session, or photos viewed. For this reason, a UBRS can provide a customized experience even within a single session, although the reliability of such data becomes the main concern.

On the other hand, this kind of system struggles in several situations:
- **Cold Start Problem:** when there is little to no information about a customer, the recommendations become unreliable, as the user cannot show correlation with any other.
- **Computation:** when the database is large, the process becomes expensive.
- **Changes of opinion:** while the item data is theoretically unchanging, the customer's opinion about the rated items may vary over time.

### Cosine Similarity

Cosine similarity computes a value from the *angle* between two *vectors*. Here, each *vector* holds one user's movie ratings, and the cosine of the *angle* between two such vectors measures how similar the users are: it is positive when their tastes are similar and negative when they are opposite, and it always lies in $[-1,1]$.
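
As a small illustration of the formula $\text{sim}(A,B) = \frac{A \cdot B}{\|A\|\,\|B\|}$, the sketch below computes the similarity for three made-up users in plain Python. The names `alice`, `bob` and `carol` and their rating vectors are hypothetical and assumed to be already mean-centered; they are not taken from the dataset.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical mean-centered ratings of three users over four movies
alice = [1.0, -1.0, 0.5, 0.0]
bob = [2.0, -2.0, 1.0, 0.0]     # same direction as alice, larger magnitude
carol = [-1.0, 1.0, -0.5, 0.0]  # opposite direction

sim_same = cosine(alice, bob)
sim_opposite = cosine(alice, carol)
print(round(sim_same, 6), round(sim_opposite, 6))
```

Only the direction of the vectors matters: `bob` rates more extremely than `alice`, yet the two point the same way, so their similarity is at the top of the range, while `carol` lands at the bottom.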

The main problem with this approach is memory consumption, caused by the large number of unique movies and users in the dataset. The implementation is still worth understanding, so we will continue with a limited dataset of 10 users:

In [10]:
# Filter out users with an Id higher than 10
filtered_df = df.filter(f.col('userId') <= 10)

# Build the user-movie matrix: 
#   userId as rows, movieId as columns and user rating as content.
user_movie = filtered_df.groupBy('userId').pivot('movieId').agg(f.first('rating'))

With the memory constraints solved, the next problem is that most users have not rated every movie under consideration, possibly because they have not watched it. As a result, the user-movie matrix contains many missing values (`NA`). These cannot simply be replaced with a number, because doing so would treat that number as the user's opinion of the movie.
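
To make the sparsity concrete, here is a minimal plain-Python sketch of the same pivot on a handful of made-up `(userId, movieId, rating)` triples. The data and the variable names are hypothetical; `None` plays the role of `NA`.

```python
# Hypothetical (userId, movieId, rating) triples, not taken from the dataset
ratings = [
    (1, 110, 1.0), (1, 858, 5.0),
    (2, 110, 4.0), (2, 147, 3.5),
    (3, 858, 4.5),
]

movie_ids = sorted({movie for _, movie, _ in ratings})
lookup = {(user, movie): rating for user, movie, rating in ratings}

# One row per user, one column per movie; None marks an unrated movie
matrix = {
    user: [lookup.get((user, movie)) for movie in movie_ids]
    for user in sorted({user for user, _, _ in ratings})
}

missing = sum(value is None for row in matrix.values() for value in row)
total = len(matrix) * len(movie_ids)
print(matrix)
print(f"{missing} of {total} cells are missing")
```

Even with three users and three movies, a large share of the cells is empty, and the share tends to grow as more movies are added.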

There are several approaches to manage this:

- **Impute the missing values as 0:** which is simple, fast and allows for computing cosine similarity, at the expense of some reliability loss.

- **Use Pearson Correlation:** which compares only co-rated items, i.e. movies rated by both users being compared, and is the classic Netflix approach. However, it can be unstable when there are few overlapping ratings, and it is harder to scale efficiently in PySpark than vectorized cosine similarity.

- **Mean centering:** by subtracting the mean rating, 0 becomes a neutral rating for a movie, so the bias introduced by imputing missing values as 0 is less noticeable. The issue is that, in PySpark, missing values make it difficult to build dense vectors and operate on them; imputing them with the mean rating before centering has the same effect, since those entries become 0 after centering.

The approach we will follow next is, consequently, the use of mean centering along with imputation by the mean.
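
The idea can be sketched in plain Python on two made-up users. This sketch assumes the mean is taken per user, which is only one possible choice; the helper `center_and_impute` and the rating lists are hypothetical.

```python
def center_and_impute(row):
    """Subtract the user's mean rating; unrated movies (None) become 0.0."""
    rated = [r for r in row if r is not None]
    if not rated:
        return [0.0] * len(row)  # no information about this user
    mean = sum(rated) / len(rated)
    # Imputing a gap with the mean and then centering yields 0.0
    return [0.0 if r is None else r - mean for r in row]

# Hypothetical ratings on a 0.5-5.0 scale; None marks an unrated movie
user_a = [5.0, None, 3.0, 1.0]
user_b = [None, 4.0, None, 2.0]

centered_a = center_and_impute(user_a)
centered_b = center_and_impute(user_b)
print(centered_a)
print(centered_b)
```

After this step, a missing rating contributes nothing to a dot product, so it neither raises nor lowers the similarity between two users.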

In [14]:
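# Column names for the imputed outputs. A list comprehension is needed
# because a str cannot be concatenated to a list directly.
[col_name + '_imputed' for col_name in user_movie.columns[1:]]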

In [18]:
def prepare_user_ratings(user_movie = user_movie):
    from pyspark.ml.feature import Imputer
    from pyspark.ml.feature import StandardScaler

    # Impute missing values with the column mean (each movie's mean rating)
    inputCols = user_movie.columns[1:]
    outputCols = [col_name + '_imputed' for col_name in inputCols]

    imputer = Imputer(
        inputCols=inputCols, outputCols=outputCols, strategy='mean'
    )

    # Create a dense vector representation for each user
    vector_assembler = VectorAssembler(
        inputCols=outputCols, outputCol='vectors'
    )

    # Mean centering and scaling to unit variance (per movie column)
    scaler = StandardScaler(withMean=True, withStd=True,
                            inputCol='vectors', outputCol='features')

    # Pipeline
    p = Pipeline(stages=[imputer, vector_assembler, scaler]).fit(user_movie)

    # Transform data
    user_vector = p.transform(user_movie).select('userId','features')
    user_vector.show(3, truncate=True)

    return user_vector

user_vector = prepare_user_ratings()

+------+--------------------+
|userId|            features|
+------+--------------------+
|     1|[7.53644380168213...|
|     6|[7.53644380168213...|
|     3|[7.53644380168213...|
+------+--------------------+
only showing top 3 rows



In [19]:
# Cross join: cartesian product to form pairs.
# Keep both orderings of each pair, (1,2) and (2,1), so that left.userId
# can always be the target user and right.userId the neighbor.
# Filter out self-pairs: (1,1), (2,2), etc.
user_cross = user_vector.alias('left').\
    crossJoin(user_vector.alias('right')).\
    filter(f.col('left.userId') != f.col('right.userId'))

display(user_cross.limit(3).toPandas())

Unnamed: 0,userId,features,userId.1,features.1
0,1,"[7.536443801682132e-15, 0.0, 0.0, 0.0, 0.0, 0....",6,"[7.536443801682132e-15, 0.0, 0.0, 0.0, 0.0, 0...."
1,1,"[7.536443801682132e-15, 0.0, 0.0, 0.0, 0.0, 0....",3,"[7.536443801682132e-15, 0.0, 0.0, 0.0, 0.0, 0...."
2,1,"[7.536443801682132e-15, 0.0, 0.0, 0.0, 0.0, 0....",5,"[7.536443801682132e-15, 0.0, 2.121320343559642..."


In [20]:
# Cosine Similarity UDF
def cosine_similarity(v1,v2):
    # Formula:
    #   Sim = A·B / |A||B|

    # Numerator: scalar product
    num = sum(c1*c2 for (c1,c2) in zip(v1,v2))
    
    # Denominator: product of the vector norms
    mod_a = np.sqrt(sum(c1**2 for c1 in v1))
    mod_b = np.sqrt(sum(c2**2 for c2 in v2))
    den = mod_a * mod_b

    # Similarity
    return float(num) / float(den) if den != 0.0 else 0.0

cosine_udf = f.udf(cosine_similarity, FloatType())

# Apply the UDF to the pair of users in every row
df_similarities = user_cross.\
    withColumn('similarity', cosine_udf(f.col('left.features'), f.col('right.features'))).\
    select(f.col('left.userId').alias('userId_1'), 
           f.col('right.userId').alias('userId_2'), 
           'similarity')

display(df_similarities.limit(10).toPandas())

Unnamed: 0,userId_1,userId_2,similarity
0,1,6,-1.393876e-16
1,1,3,-0.2185532
2,1,5,0.04157351
3,1,9,-0.1120754
4,1,4,-0.03896361
5,1,8,-0.09855314
6,1,7,-0.5094177
7,1,10,-1.393876e-16
8,1,2,-1.207132e-16
9,6,1,-1.393876e-16


In User-Based Recommender Systems, the last step is to find each user's nearest neighbors and use their ratings to decide which movies to recommend (collaborative filtering):

In [21]:
from pyspark.sql import Window

# Create data window:
# Assume left.userId is the target user and right.userId the neighbor
window = Window.\
    partitionBy('userId_1').\
    orderBy(f.col('similarity').desc())

# Get the top N most similar neighbors for each left.userId
top_n_neighbors = df_similarities.\
    withColumn('rank', f.row_number().over(window)).\
    filter(f.col('rank') <= 5)

display(top_n_neighbors.limit(10).toPandas())

Unnamed: 0,userId_1,userId_2,similarity,rank
0,1,5,0.04157351,1
1,1,2,-1.207132e-16,2
2,1,6,-1.393876e-16,3
3,1,10,-1.393876e-16,4
4,1,4,-0.03896361,5
5,2,8,0.1620802,1
6,2,9,1.762172e-15,2
7,2,7,9.362936e-16,3
8,2,6,3.845925e-16,4
9,2,10,3.845925e-16,5
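
The ranking step above amounts to "group by target user, sort by similarity, keep the first N". A minimal plain-Python sketch of that logic follows; the triples and the helper `top_n_neighbors` are hypothetical and only mirror the shape of `df_similarities`.

```python
from collections import defaultdict

# Hypothetical (target, neighbor, similarity) triples
similarities = [
    (1, 2, 0.10), (1, 3, -0.40), (1, 4, 0.75),
    (2, 1, 0.10), (2, 3, 0.30), (2, 4, -0.20),
]

def top_n_neighbors(pairs, n):
    """Group pairs by target user and keep the n highest similarities."""
    by_target = defaultdict(list)
    for target, neighbor, sim in pairs:
        by_target[target].append((neighbor, sim))
    return {
        target: sorted(cands, key=lambda c: c[1], reverse=True)[:n]
        for target, cands in by_target.items()
    }

neighbors = top_n_neighbors(similarities, n=2)
print(neighbors)
```

Note that with a small user base, the top N may include neighbors whose similarity is close to zero or negative, as the output above also shows.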


Join each neighbor's ratings to the (target user, neighbor) pairs:

In [22]:
neighbor_ranking = top_n_neighbors.join(
    other=df,
    on=(top_n_neighbors.userId_2 == df.userId),
    how='inner'
).select(
    f.col('userId_1').alias('targetUser'),
    f.col('movieId'),
    f.col('similarity'),
    f.col('rating').alias('neighbor_rating')
)

display(neighbor_ranking.limit(3).toPandas())

Unnamed: 0,targetUser,movieId,similarity,neighbor_rating
0,8,110,-0.098553,1.0
1,5,110,0.041574,1.0
2,4,110,-0.038964,1.0


However, we are only interested in the movies that the target user has not seen yet:

In [None]:
# Movies seen (already rated by the target user)
targetUser_rated = df.select(f.col('userId').alias('targetUser'), 'movieId')

# Keep only unseen movies
unseen_movies = neighbor_ranking.join(
    other=targetUser_rated,
    on=['targetUser', 'movieId'],
    how='left_anti' # keep only rows with no match in targetUser_rated
)

# Compute the similarity-weighted average rating.
# Normalizing by sum(abs(similarity)) keeps the sign of the prediction
# meaningful when some neighbors have negative similarity.
recommendations = unseen_movies.withColumn(
    colName='weighted_rating',
    col=f.col('similarity') * f.col('neighbor_rating')
).groupBy(
    'targetUser', 'movieId'
).agg(
    f.expr('sum(weighted_rating) / sum(abs(similarity))').\
        alias('predicted_score')
)

# Get the top K recommendations per user
window = Window.partitionBy('targetUser').orderBy(f.col('predicted_score').desc())

top_k_recommendations = recommendations.\
    withColumn('rank', f.row_number().over(window)).\
    filter(f.col('rank') <= 5)

display(top_k_recommendations.limit(10).toPandas())
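
The scoring logic can be checked by hand on a tiny example. The sketch below covers a single target user with made-up neighbor ratings. It assumes the common convention of dividing by the sum of absolute similarities, $\hat{r}_{u,i} = \frac{\sum_{v} s_{uv}\, r_{v,i}}{\sum_{v} |s_{uv}|}$, which is a choice made for this illustration; all names and values are hypothetical.

```python
from collections import defaultdict

# Hypothetical inputs for a single target user
neighbor_ratings = [
    # (movieId, similarity, neighbor_rating)
    (10, 0.8, 5.0), (10, 0.2, 3.0),
    (20, 0.5, 4.0),
    (30, 0.8, 2.0),
]
already_rated = {30}  # movies the target user has already rated

num = defaultdict(float)
den = defaultdict(float)
for movie, sim, rating in neighbor_ratings:
    if movie in already_rated:
        continue  # same effect as the left anti join
    num[movie] += sim * rating
    den[movie] += abs(sim)  # assumption: normalize by absolute similarity

# Skip movies whose neighbors all have zero similarity
predicted = {m: num[m] / den[m] for m in num if den[m] > 0}
ranking = sorted(predicted, key=predicted.get, reverse=True)
print(predicted, ranking)
```

Movie 10 ends up ahead of movie 20 because its most similar neighbor rated it highest, and movie 30 never appears because the target user has already rated it.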

In [25]:
spark.stop()