# Worksheet 17

Name: Woohyeon Her <br>
UID: U88838753

### Topics

- Recommender Systems

### Recommender Systems

In the class example of recommending movies to users, we used the movie rating as a measure of similarity between users and movies, so the predicted rating for a user is a proxy for how highly a movie should be recommended: the higher the predicted rating, the stronger the recommendation.

a) Consider a streaming platform that only has "like" or "dislike" (no 1-5 rating). Describe how you would build a recommender system in this case.

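Data Collection: Gather each user's interactions with movies in binary form, encoding a like as 1 and a dislike as 0; movies the user has not reacted to stay missing rather than counting as dislikes. <br>
Preprocessing: Clean the data, remove duplicate reactions, and check the balance between likes and dislikes. <br>
Model Selection: Treat the task as predicting the probability of a like instead of a 1-5 rating, for example with user-user or item-item similarity on the binary vectors or with matrix factorization; choose based on the available computational resources, data characteristics, and desired recommendation quality. <br>
Training and Validation: Train the model on historical data and validate it on a holdout set or with cross-validation, using a classification or ranking metric rather than RMSE. <br>
Deployment: Integrate the recommender into the platform's backend and rank each user's unseen movies by predicted like probability to serve real-time recommendations. <br>
Feedback Loop: Continuously update the model with new likes and dislikes to refine and improve recommendations. <br>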
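
One way to make the model selection and training steps above concrete is matrix factorization with a logistic loss, so that the model outputs the probability of a like. The sketch below is only an illustration under assumptions that are not part of the worksheet: the toy `interactions` data, the function names, and the hyperparameters (`rank`, `lr`, `reg`, `epochs`) are all made up, and a real system would use a library rather than pure-Python loops.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def like_probability(P, Q, user, item):
    """Predicted probability that `user` likes `item`."""
    return sigmoid(sum(p * q for p, q in zip(P[user], Q[item])))

def train(interactions, n_users, n_items, rank=2, lr=0.1, reg=0.01, epochs=500, seed=0):
    """Fit user factors P and item factors Q by SGD on the logistic loss.

    Only observed (user, item, liked) triples are used; unseen pairs are ignored.
    """
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(epochs):
        for user, item, liked in interactions:
            err = liked - like_probability(P, Q, user, item)
            for k in range(rank):
                p, q = P[user][k], Q[item][k]
                P[user][k] += lr * (err * q - reg * p)
                Q[item][k] += lr * (err * p - reg * q)
    return P, Q

def recommend(P, Q, user, seen, n_items, top_n=2):
    """Rank the movies the user has not reacted to by predicted like probability."""
    unseen = [i for i in range(n_items) if i not in seen]
    ranked = sorted(unseen, key=lambda i: like_probability(P, Q, user, i), reverse=True)
    return ranked[:top_n]

# Hypothetical toy data: (user, movie, liked) with liked = 1 for like, 0 for dislike.
# Users 0 and 1 like movies 0 and 1; users 2 and 3 like movies 2 and 3.
interactions = [
    (0, 0, 1), (0, 1, 1), (0, 2, 0),
    (1, 0, 1), (1, 2, 0),
    (2, 0, 0), (2, 2, 1), (2, 3, 1),
    (3, 1, 0), (3, 2, 1), (3, 3, 1),
]
P, Q = train(interactions, n_users=4, n_items=4)

# User 1 has only reacted to movies 0 and 2, so rank the remaining ones.
print(recommend(P, Q, user=1, seen={0, 2}, n_items=4))
```

Because the output is a probability, the ranking step is the same as in the rating case: the higher the predicted like probability, the stronger the recommendation.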

b) Describe 3 challenges of building a recommender system


Building a recommender system presents several challenges. First, data sparsity and scalability: each user interacts with only a tiny fraction of the catalog, so the user-item matrix is extremely sparse and, at platform scale, computationally intensive to store and process. Second, the cold start problem: new users or items have no historical interaction data, which makes accurate recommendations for them difficult and can hurt user satisfaction. Third, diversity and serendipity: recommendations must avoid repetitiveness and introduce unexpected choices to keep users engaged.
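
The first two challenges can be measured directly on a train/test split. The sketch below is a minimal illustration; the helper name `interaction_stats` and the toy pairs are assumptions, not part of the worksheet. It computes the density of the user-item matrix and lists the test pairs that involve a user or item never seen in training, which are the rows that the ALS setting `coldStartStrategy="drop"` used later in this worksheet would leave out of the predictions.

```python
def interaction_stats(train, test):
    """Return (density, cold_start_pairs) for lists of (user, item) pairs.

    density is the fraction of the train user-item matrix that is observed;
    cold_start_pairs are test pairs whose user or item never appears in train.
    """
    users = {u for u, _ in train}
    items = {i for _, i in train}
    observed = len(set(train))
    density = observed / (len(users) * len(items))
    cold = [(u, i) for u, i in test if u not in users or i not in items]
    return density, cold

# Hypothetical toy data
train_pairs = [("a", "x"), ("a", "y"), ("b", "x"), ("c", "z")]
test_pairs = [("a", "z"), ("d", "x"), ("b", "w")]

density, cold = interaction_stats(train_pairs, test_pairs)
print(f"density = {density:.2f}, cold-start test pairs = {cold}")
```

On real rating data the density is typically far below one percent, which is why dense matrix methods do not scale to this problem.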

c) Why is SVD not an option for collaborative filtering?


Plain SVD (Singular Value Decomposition) is generally not suitable for collaborative filtering, for several reasons. First, SVD is only defined for a fully specified matrix, whereas in collaborative filtering each user interacts with a small subset of items, so most entries are unknown; filling them with zeros or averages treats missing ratings as real ones and distorts the factors. Second, SVD's computational cost is high, which makes it impractical for the large datasets typical of recommender systems and for real-time recommendations. Finally, SVD is a linear model and is sensitive to noise, so it may not capture complex, non-linear interactions in user-item data. Alternatives such as ALS, matrix factorization trained with stochastic gradient descent, and SVD++ are often preferred because they fit only the observed entries and address these limitations.
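
For reference, ALS and SGD-based matrix factorization minimize an objective of roughly the following form. The notation is assumed here rather than taken from the worksheet: $\Omega$ is the set of observed (user, item) pairs, $r_{ui}$ the observed rating, $p_u$ and $q_i$ the user and item factor vectors, and $\lambda$ the regularization strength.

$$
\min_{P,\,Q} \sum_{(u,i)\in\Omega} \left(r_{ui} - p_u^\top q_i\right)^2 + \lambda\left(\sum_u \lVert p_u\rVert^2 + \sum_i \lVert q_i\rVert^2\right)
$$

The sum runs only over $\Omega$, so missing entries never enter the loss, which is exactly what plain SVD cannot do. ALS alternates between solving for $P$ with $Q$ fixed and for $Q$ with $P$ fixed, while SGD updates $p_u$ and $q_i$ one observed rating at a time. In the Spark code below, `rank` sets the length of $p_u$ and $q_i$, and `regParam` plays the role of $\lambda$.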

d) Use the code below to train a recommender system on a dataset of amazon movies

In [2]:
# Note: requires py3.10
import findspark
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, confusion_matrix

from pyspark.sql import SparkSession
from pyspark import SparkConf, SparkContext
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator  # used by the commented-out cross-validation block below
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

findspark.init()
conf = SparkConf()
conf.set("spark.executor.memory","28g")
conf.set("spark.driver.memory", "28g")
conf.set("spark.driver.cores", "8")
sc = SparkContext.getOrCreate(conf)
spark = SparkSession.builder.getOrCreate()

init_df = pd.read_csv("./train.csv").dropna()
init_df['UserId_fact'] = init_df['UserId'].astype('category').cat.codes
init_df['ProductId_fact'] = init_df['ProductId'].astype('category').cat.codes

# Split training set into training and testing set
X_train_processed, X_test_processed, Y_train, Y_test = train_test_split(
        init_df.drop(['Score'], axis=1),
        init_df['Score'],
        test_size=1/4.0,
        random_state=0
    )

X_train_processed['Score'] = Y_train
df = spark.createDataFrame(X_train_processed[['UserId_fact', 'ProductId_fact', 'Score']])
als = ALS(
    userCol="UserId_fact",
    itemCol="ProductId_fact",
    ratingCol="Score",
    coldStartStrategy="drop",
    nonnegative=True,
    rank=100
)
# param_grid = (
#     ParamGridBuilder()
#     .addGrid(als.rank, [10, 50])
#     .addGrid(als.regParam, [.1])
#     .addGrid(als.maxIter, [10])
#     .build()
# )
# evaluator = RegressionEvaluator(
#     metricName="rmse",
#     labelCol="Score",
#     predictionCol="prediction",
# )
# cv = CrossValidator(estimator=als, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=3, parallelism=6)
# cv_fit = cv.fit(df)
# rec_sys = cv_fit.bestModel

rec_sys = als.fit(df)
# rec_sys.save('rec_sys.obj') # so we don't have to re-train it
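# Keep the true Score next to each test row: coldStartStrategy="drop" removes rows for
# unseen users/products and Spark does not preserve row order, so positions cannot be trusted
test_pdf = X_test_processed[['UserId_fact', 'ProductId_fact']].copy()
test_pdf['Score'] = Y_test
rec = rec_sys.transform(spark.createDataFrame(test_pdf)).toPandas()

print("Kaggle RMSE = ", mean_squared_error(rec['Score'], rec['prediction']) ** 0.5)

# confusion_matrix needs discrete labels, so round predictions (assumes integer 1-5 Score)
pred_labels = rec['prediction'].round().clip(1, 5).astype(int)
cm = confusion_matrix(rec['Score'].astype(int), pred_labels, normalize='true')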
sns.heatmap(cm, annot=True)
plt.title('Confusion matrix of the classifier')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()

ModuleNotFoundError: No module named 'findspark'

In [3]:
# Import necessary libraries
import findspark
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, confusion_matrix

from pyspark.sql import SparkSession
from pyspark import SparkConf, SparkContext
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# Initialize Spark
findspark.init()
conf = SparkConf().setAppName("MovieRecommender")
conf.set("spark.executor.memory", "28g")
conf.set("spark.driver.memory", "28g")
conf.set("spark.driver.cores", "8")
sc = SparkContext.getOrCreate(conf)
spark = SparkSession.builder.config(conf=conf).getOrCreate()

# Load and preprocess data
init_df = pd.read_csv("./train.csv").dropna()
init_df['UserId_fact'] = init_df['UserId'].astype('category').cat.codes
init_df['ProductId_fact'] = init_df['ProductId'].astype('category').cat.codes

# Split the data into training and test sets
X_train_processed, X_test_processed, Y_train, Y_test = train_test_split(
    init_df[['UserId_fact', 'ProductId_fact']], 
    init_df['Score'],
    test_size=0.25,
    random_state=0
)

# Prepare Spark DataFrame from the training data
X_train_processed['Score'] = Y_train
train_df = spark.createDataFrame(X_train_processed)

# Define ALS model
als = ALS(
    userCol="UserId_fact",
    itemCol="ProductId_fact",
    ratingCol="Score",
    coldStartStrategy="drop",
    nonnegative=True,
    rank=100
)

# Build parameter grid for model tuning
param_grid = ParamGridBuilder() \
    .addGrid(als.rank, [10, 50, 100]) \
    .addGrid(als.regParam, [0.01, 0.1]) \
    .addGrid(als.maxIter, [5, 10]) \
    .build()

# Define evaluator as RMSE
evaluator = RegressionEvaluator(
    metricName="rmse",
    labelCol="Score"
)

# Set up cross-validation
cv = CrossValidator(
    estimator=als,
    estimatorParamMaps=param_grid,
    evaluator=evaluator,
    numFolds=3,
    parallelism=6
)

# Fit ALS model
cv_model = cv.fit(train_df)
best_model = cv_model.bestModel

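# Prepare test data, carrying the true Score along so each prediction stays paired with its label
test_pdf = X_test_processed[['UserId_fact', 'ProductId_fact']].copy()
test_pdf['Score'] = Y_test
test_df = spark.createDataFrame(test_pdf)

# Predict on test data; coldStartStrategy="drop" omits users/products unseen in training
# and Spark does not preserve row order, so evaluate on the returned rows directly
predictions = best_model.transform(test_df).toPandas()

# Evaluate the model
rmse = mean_squared_error(predictions['Score'], predictions['prediction']) ** 0.5
print("Kaggle RMSE =", rmse)

# Plot confusion matrix; predictions are continuous, so round them to the nearest
# rating first (assumes Score is an integer 1-5 star rating)
pred_labels = predictions['prediction'].round().clip(1, 5).astype(int)
cm = confusion_matrix(predictions['Score'].astype(int), pred_labels, normalize='true')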
sns.heatmap(cm, annot=True, fmt=".2f")
plt.title('Confusion Matrix of the Classifier')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()

# Stop Spark session
spark.stop()


ModuleNotFoundError: No module named 'findspark'
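
Both cells stop at the first import because `findspark` is not installed in the kernel's environment, so none of the Spark code actually ran. A small standard-library check like the one below can report every missing dependency at once before any Spark work starts. The helper name `missing_modules` and the `REQUIRED` list are made up for illustration; the list simply mirrors the top-level imports in the cells above.

```python
import importlib.util

# Top-level import names used by the cells above
REQUIRED = ["findspark", "pandas", "seaborn", "matplotlib", "sklearn", "pyspark"]

def missing_modules(names):
    """Return the names that cannot be imported in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_modules(REQUIRED)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

Import names and package names can differ: `sklearn`, for example, is installed as `scikit-learn`.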