# Capstone 4: Pre-processing And Modeling
### By Joshua Dytko

In [1]:
import pandas as pd
import numpy as np

# The custom train/test split defined below still uses sklearn's train_test_split internally
from sklearn.model_selection import train_test_split

# Surprise imports
from surprise import SVD, Dataset, Reader
from surprise import KNNBasic

from collections import defaultdict

# Evaluate RMSE
from surprise import accuracy

I will build three collaborative filtering models: K-Nearest Neighbors (KNN) and Singular Value Decomposition (SVD) from the Surprise library, and Alternating Least Squares (ALS) from PySpark. Surprise is a library for building recommender systems.

Surprise has its own train/test split function, but that function does not guarantee that all products and users are represented in both the training and test splits. Because these are collaborative filtering models and the data offers nothing to handle the cold start problem, every product and user must be present in both sets for the predictions to be useful. The custom train/test split does exactly that by applying sklearn's train_test_split to per-user groups generated with pandas groupby.

In [2]:
#Create train test splits that contain all items and users
def custom_train_test_split(df, test_size=0.2, random_state=42):
    """
    Splits the DataFrame into training and test sets such that every user and item
    appears in both sets.

    Parameters:
    - df: pandas DataFrame with columns ['user_id', 'asin', 'rating']
    - test_size: float between 0.0 and 1.0, represents the proportion of the dataset to include in the test split

    Returns:
    - train_df: Training DataFrame
    - test_df: Test DataFrame
    """
    
    train_list = []
    test_list = []

    # Group by user_id
    user_group = df.groupby('user_id')

    for user_id, group in user_group:
        if len(group) >= 2:
            # If the user has more than one interaction, split their data
            train, test = train_test_split(group, test_size=test_size, random_state=random_state)
            train_list.append(train)
            test_list.append(test)
        else:
            # If the user has only one interaction, include it in both train and test sets
            train_list.append(group)
            test_list.append(group)

    # Concatenate all user splits
    train_df = pd.concat(train_list).reset_index(drop=True)
    test_df = pd.concat(test_list).reset_index(drop=True)

    # Ensure all items are in both sets
    # Find items not in train_df
    missing_items_in_train = set(df['asin'].unique()) - set(train_df['asin'].unique())
    # Add missing items to train_df
    if missing_items_in_train:
        items_to_add = df[df['asin'].isin(missing_items_in_train)]
        train_df = pd.concat([train_df, items_to_add]).drop_duplicates().reset_index(drop=True)

    # Repeat for test_df
    missing_items_in_test = set(df['asin'].unique()) - set(test_df['asin'].unique())
    if missing_items_in_test:
        items_to_add = df[df['asin'].isin(missing_items_in_test)]
        test_df = pd.concat([test_df, items_to_add]).drop_duplicates().reset_index(drop=True)

    # Remove any duplicates that might have been introduced from the previous 2 steps
    train_df = train_df.drop_duplicates(subset=['user_id', 'asin', 'rating'])
    test_df = test_df.drop_duplicates(subset=['user_id', 'asin', 'rating'])

    return train_df, test_df
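
A quick way to sanity-check the guarantee described above is to compare the users and items in each split against the full data. The sketch below is illustrative only: `split_report` is a hypothetical helper that is not part of the notebook, and the toy frame stands in for the real ratings. It also counts rows that land in both splits, since single-interaction users are deliberately copied into both.

```python
import pandas as pd

def split_report(df, train_df, test_df):
    """Hypothetical helper: report coverage and overlap for a train/test split."""
    keys = ['user_id', 'asin']
    report = {}
    for name, part in (('train', train_df), ('test', test_df)):
        # Users and items from the full data that never appear in this split
        report[f'users_missing_in_{name}'] = len(set(df['user_id']) - set(part['user_id']))
        report[f'items_missing_in_{name}'] = len(set(df['asin']) - set(part['asin']))
    # User-item pairs that appear in both splits
    shared = train_df[keys].merge(test_df[keys], on=keys, how='inner').drop_duplicates()
    report['rows_in_both'] = len(shared)
    return report

# Toy data (made up for illustration)
toy = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u2'],
    'asin': ['a', 'b', 'a'],
    'rating': [5.0, 4.0, 3.0],
})
toy_train = toy.iloc[[0, 2]]  # u1-a, u2-a
toy_test = toy.iloc[[1, 2]]   # u1-b, u2-a (u2 has one rating, so it is in both)

report = split_report(toy, toy_train, toy_test)
print(report)
```

Any non-zero "missing" count means the split did not deliver full coverage, and `rows_in_both` shows how many test rows the model will also have seen during training.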

The next cell defines the top-N function, the precision and recall at k function, and the relevance threshold, which is set to 4.

In [3]:
# Define relevance threshold (e.g., ratings >= 4)
relevance_threshold = 4.0

# Return the top-N recommendation for each user from a set of predictions
def get_top_n(predictions, n=10):
    # Map the predictions to each user
    top_n = defaultdict(list)
    for pred in predictions:
        user_id = pred.uid
        item_id = pred.iid
        est = pred.est
        top_n[user_id].append((item_id, est))
    
    # Sort the predictions for each user and retrieve the top n
    for user_id, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[user_id] = user_ratings[:n]
    
    return top_n
    
# Compute Precision@K and Recall@K for each user
def compute_precision_recall_at_k(top_n, actual_ratings, k=10, threshold=4.0):
    precisions = dict()
    recalls = dict()
    
    for user_id, user_predictions in top_n.items():
        # Get the set of recommended items
        recommended_items = set([item_id for (item_id, _) in user_predictions[:k]])
        
        # Get the set of relevant items from actual ratings
        relevant_items = set(actual_ratings[user_id])
        
        # Compute precision and recall
        n_relevant_and_recommended = len(recommended_items & relevant_items)
        precisions[user_id] = n_relevant_and_recommended / k
        recalls[user_id] = n_relevant_and_recommended / len(relevant_items) if relevant_items else 0.0
    
    # Compute average precision and recall
    average_precision = sum(precisions.values()) / len(precisions)
    average_recall = sum(recalls.values()) / len(recalls)
    
    return average_precision, average_recall
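
To make the two metrics concrete, here is a small per-user sketch that mirrors the formulas above (precision divides by k, recall divides by the number of relevant items). The function name and the toy items are made up for illustration. It shows one consequence of dividing by k: a user with fewer than k candidate items can never reach a precision of 1.

```python
def precision_recall_at_k(recommended, relevant, k=10):
    """Precision@k and recall@k for one user, dividing precision by k as above."""
    top_k = set(recommended[:k])
    hits = len(top_k & set(relevant))
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# A user with only two held-out items, both relevant and both recommended
p, r = precision_recall_at_k(['a', 'b'], {'a', 'b'}, k=10)
print(p, r)  # 0.2 1.0
```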


Read in the cleaned data created in the previous step of the capstone.

In [4]:
df = pd.read_csv('Cleaned data.csv', index_col = False)

Confirm that the data types read in are what is expected.

In [5]:
df.dtypes

rating      int64
user_id    object
asin       object
dtype: object

I found that ALS failed if rating was not set to a float data type.

In [6]:
df['rating'] = df['rating'].astype(float)

Create the train and test split using the custom train/test split function. The random state is set to 42 to generate repeatable results.

In [7]:
# Split data into training and test sets
train_df, test_df = custom_train_test_split(df, test_size=0.2, random_state=42)

# Reset index
train_df = train_df.reset_index(drop=True)
test_df = test_df.reset_index(drop=True)

Before switching to Surprise, I tried using sparse matrices, which required mapping the user ids and asins to integer indices. Surprise does not need the data mapped, but ALS requires the mapping.

In [8]:
# Create mappings
user_id_mapping = {id: idx for idx, id in enumerate(df['user_id'].unique())}
item_id_mapping = {id: idx for idx, id in enumerate(df['asin'].unique())}

# Reverse mappings - Will be used for the ALS PySpark model
user_index_to_id = {idx: id for id, idx in user_id_mapping.items()}
item_index_to_id = {idx: id for id, idx in item_id_mapping.items()}

# Add mapped indices to dataframes
for dataset in [train_df, test_df]:
    dataset['user_index'] = dataset['user_id'].map(user_id_mapping)
    dataset['asin_index'] = dataset['asin'].map(item_id_mapping)
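
The mapping idea can be seen on a tiny example. The sketch below uses made-up ids and plain Python instead of pandas; it only illustrates that each distinct id gets one integer index and that the reverse mapping recovers the original id.

```python
raw_ids = ['B00X', 'A123', 'B00X', 'C9']  # hypothetical ids, with one repeat

# dict.fromkeys keeps first-seen order and drops repeats, similar to pandas unique()
id_to_index = {raw: idx for idx, raw in enumerate(dict.fromkeys(raw_ids))}
index_to_id = {idx: raw for raw, idx in id_to_index.items()}

encoded = [id_to_index[raw] for raw in raw_ids]
decoded = [index_to_id[idx] for idx in encoded]

print(encoded)  # [0, 1, 0, 2]
print(decoded == raw_ids)  # True
```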

The following code block builds the data object used by the Surprise KNN and SVD models.

In [9]:
# Combine train and test data for Surprise (required format)
data_for_surprise = pd.concat([train_df, test_df])

# Define the rating scale
rating_scale = (df['rating'].min(), df['rating'].max())

# Create a Reader object
reader = Reader(rating_scale=rating_scale)

# Load data into Surprise dataset
data = Dataset.load_from_df(data_for_surprise[['user_id', 'asin', 'rating']], reader)


### The Models
Model order:
1. KNN
2. SVD
3. ALS

## KNN - K-Nearest Neighbors

In the similarity options for KNN, I had to switch to item-based filtering because my computer did not have enough memory to run the user-based version of the model.

In [10]:
# Build full training set (Surprise uses the entire dataset for training unless specified)
trainset = data.build_full_trainset()


# Define similarity options
sim_options = {
    'name': 'cosine',  
    'user_based': False  # Set to False for item-based filtering
}

# Initialize KNN model
algo_knn = KNNBasic(sim_options=sim_options)

# Train the model
algo_knn.fit(trainset)

Computing the cosine similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBasic at 0x2a884ac6b90>
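
For intuition about what the item-based cosine option compares, the sketch below computes cosine similarity between two items from the ratings of users who rated both. It is a simplified illustration with made-up ratings, not Surprise's implementation, which also handles details such as minimum support.

```python
import math

def item_cosine(ratings_i, ratings_j):
    """Cosine similarity between two items using only users who rated both.

    ratings_i, ratings_j: dicts mapping user id -> rating.
    """
    common = ratings_i.keys() & ratings_j.keys()
    if not common:
        return 0.0
    dot = sum(ratings_i[u] * ratings_j[u] for u in common)
    norm_i = math.sqrt(sum(ratings_i[u] ** 2 for u in common))
    norm_j = math.sqrt(sum(ratings_j[u] ** 2 for u in common))
    if norm_i == 0 or norm_j == 0:
        return 0.0
    return dot / (norm_i * norm_j)

# Toy ratings (made up): u3 rated only item_a, so it is ignored
item_a = {'u1': 5.0, 'u2': 3.0, 'u3': 4.0}
item_b = {'u1': 4.0, 'u2': 3.0}
sim = item_cosine(item_a, item_b)
print(round(sim, 3))
```

An item-based matrix holds one entry per pair of items and a user-based matrix one entry per pair of users, which is why the choice matters for memory when users greatly outnumber items.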

In [11]:
#Create testset for Surprise
testset_surprise = list(zip(test_df['user_id'], test_df['asin'], test_df['rating']))

# Get predictions
predictions_knn = algo_knn.test(testset_surprise)

rmse_knn = accuracy.rmse(predictions_knn)

RMSE: 0.9219
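
RMSE is the square root of the mean squared difference between actual and predicted ratings, so it is expressed in rating units and lower is better. A minimal sketch with made-up numbers:

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Errors of 0, 1 and 2 stars give squared errors 0, 1 and 4
value = rmse([5.0, 4.0, 3.0], [5.0, 3.0, 5.0])
print(value)
```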


In [12]:
def recommend_knn(user_id, n=10):
    # Get a list of all asins
    all_items = set(data_for_surprise['asin'].unique())
    # Get items the user has already interacted with
    user_items = set(train_df[train_df['user_id'] == user_id]['asin'])
    # Get items the user hasn't interacted with yet
    items_to_predict = all_items - user_items
    # Predict ratings for the items
    predictions = []
    for item_id in items_to_predict:
        pred = algo_knn.predict(str(user_id), str(item_id))
        predictions.append((item_id, pred.est))
    # Sort predictions by estimated rating
    predictions.sort(key=lambda x: x[1], reverse=True)
    # Return top N asin
    top_n_items = [item for item, rating in predictions[:n]]
    return top_n_items

user_id = train_df['user_id'].iloc[0]  # Use the user in position 0
recommended_items_knn = recommend_knn(user_id)
print(f'Top recommended items for user {user_id} using KNN: {recommended_items_knn}')

Top recommended items for user A0009988MRFQ3TROTQPI using KNN: ['6300209830', 'B004L1DB8C', 'B00O2R6NX0', '6302970040', 'B011OCMQHW', 'B00079I09W', 'B00007JZO3', 'B000EOTV98', 'B00KNVF2SG', 'B0018LX9SA']


In [13]:
# Get top N recommendations for KNN
top_n_knn = get_top_n(predictions_knn, n=10)

# Get actual relevant items for each user from test set
actual_ratings = defaultdict(set)
for _, row in test_df.iterrows():
    if row['rating'] >= relevance_threshold:
        actual_ratings[str(row['user_id'])].add(str(row['asin']))

# Compute Precision@K and Recall@K for KNN
precision_knn, recall_knn = compute_precision_recall_at_k(top_n_knn, actual_ratings, k=10, threshold=relevance_threshold)
f1_score_knn = ((2*precision_knn * recall_knn)/ (recall_knn + precision_knn))
print(f'KNN Precision@10: {precision_knn:.4f}')
print(f'KNN Recall@10: {recall_knn:.4f}')
print(f'KNN F1 Score@10: {f1_score_knn:.4}')

KNN Precision@10: 0.2037
KNN Recall@10: 0.9094
KNN F1 Score@10: 0.3329
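
The F1 score is the harmonic mean of precision and recall, so it is pulled toward the smaller of the two values. The sketch below uses a hypothetical helper name and adds a guard for the case where both inputs are zero, which the inline formula above would turn into a division by zero.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall, returning 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# High recall cannot compensate for low precision
score = f1_from(0.2, 1.0)
print(round(score, 4))  # 0.3333
```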


## SVD - Singular Value Decomposition

In [14]:
# Initialize SVD model
algo_svd = SVD(random_state=42)

# Train the model
algo_svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x2a918c43290>
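
Surprise's SVD estimates a rating as the global mean plus a user bias, an item bias, and the dot product of the learned user and item factor vectors, with the result clipped to the rating scale. The sketch below reproduces only that prediction rule; the numbers are made up, and the (1.0, 5.0) scale is an assumption standing in for the scale the notebook derives from the data.

```python
def svd_estimate(global_mean, user_bias, item_bias, user_factors, item_factors,
                 rating_scale=(1.0, 5.0)):
    """Biased matrix-factorization estimate, clipped to the rating scale."""
    dot = sum(p * q for p, q in zip(user_factors, item_factors))
    raw = global_mean + user_bias + item_bias + dot
    low, high = rating_scale
    return min(high, max(low, raw))

# Made-up parameters for one user-item pair with two latent factors
est = svd_estimate(4.2, 0.3, -0.1, [0.5, -0.2], [0.4, 0.1])
print(round(est, 2))  # 4.58

# A raw estimate above the scale is clipped to the maximum
clipped = svd_estimate(4.8, 0.5, 0.2, [], [])
print(clipped)  # 5.0
```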

In [15]:
# Get predictions for SVD
predictions_svd = algo_svd.test(testset_surprise)

# Evaluate RMSE
rmse_svd = accuracy.rmse(predictions_svd)
#print(f'SVD RMSE: {rmse_svd}')

RMSE: 0.6227


In [16]:
# Function to recommend top N items for a user
def recommend_svd(user_id, n=10):
    # Get a list of all asins
    all_items = set(data_for_surprise['asin'].unique())
    # Get items the user has already interacted with
    user_items = set(train_df[train_df['user_id'] == user_id]['asin'])
    # Get items the user hasn't interacted with yet
    items_to_predict = all_items - user_items
    # Predict ratings for the items
    predictions = []
    for item_id in items_to_predict:
        pred = algo_svd.predict(str(user_id), str(item_id))
        predictions.append((item_id, pred.est))
    # Sort predictions by estimated rating
    predictions.sort(key=lambda x: x[1], reverse=True)
    # Return top N asins
    top_n_items = [item for item, rating in predictions[:n]]
    return top_n_items

recommended_items_svd = recommend_svd(user_id)
print(f'Top recommended items for user {user_id} using SVD: {recommended_items_svd}')

Top recommended items for user A0009988MRFQ3TROTQPI using SVD: ['B00006FD8X', 'B00IJJBHW4', 'B0054JELS4', 'B00VU4YPR4', 'B00005QCV8', 'B0009VDJQ2', 'B00404ME0G', 'B001CMSDVS', 'B000P4ZL5K', 'B00CYQXE10']


In [17]:
# Get top N recommendations for SVD
top_n_svd = get_top_n(predictions_svd, n=10)

# Get actual relevant items for each user from test set
actual_ratings = defaultdict(set)
for _, row in test_df.iterrows():
    if row['rating'] >= relevance_threshold:
        actual_ratings[str(row['user_id'])].add(str(row['asin']))

# Compute Precision@K and Recall@K for SVD
precision_svd, recall_svd = compute_precision_recall_at_k(top_n_svd, actual_ratings, k=10, threshold=relevance_threshold)
f1_score_svd = ((2*precision_svd * recall_svd)/ (recall_svd + precision_svd))
print(f'SVD Precision@10: {precision_svd:.4f}')
print(f'SVD Recall@10: {recall_svd:.4f}')
print(f'SVD F1 Score@10: {f1_score_svd:.4}')

SVD Precision@10: 0.2054
SVD Recall@10: 0.9107
SVD F1 Score@10: 0.3352


## PySpark ALS - Alternating Least Squares

In [18]:
# PySpark imports
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark import StorageLevel
import findspark

# Mean squared error from sklearn: I learned later that sklearn now has a root_mean_squared_error function, but it did not run for me and I did not want to update my code this far into the project
from sklearn.metrics import mean_squared_error

#Pyspark
from pyspark.ml.recommendation import ALS, ALSModel
from pyspark.ml.evaluation import RegressionEvaluator

#Imports for environmental variables to be set so that PySpark can run
import os
import sys


findspark.init() 

os.environ["JAVA_HOME"] = r"C:\Program Files\Java\jdk-22"  # raw string so the backslashes are not treated as escape sequences
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable




In [19]:
spark = SparkSession.builder.getOrCreate()

In [20]:
# Convert Pandas DataFrame to Spark DataFrame
train_spark_df = spark.createDataFrame(train_df[['user_index', 'asin_index', 'rating']])
test_spark_df = spark.createDataFrame(test_df[['user_index', 'asin_index', 'rating']])

In [21]:
# Initialize ALS model
als = ALS(
    maxIter=10,
    regParam=0.1,
    rank=50,
    userCol='user_index',
    itemCol='asin_index',
    ratingCol='rating',
    coldStartStrategy='drop',
    nonnegative=True,
    implicitPrefs=False
)

# Train the model
model_als = als.fit(train_spark_df)
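
ALS alternates between two steps: hold the item factors fixed and solve for each user's factors, then hold the user factors fixed and solve for each item's factors. With one side fixed, each solve is a regularized least squares problem. The NumPy sketch below shows a single user update on made-up data. It is a plain ridge solve for illustration; Spark's distributed implementation differs in details, such as the nonnegative constraint requested above.

```python
import numpy as np

def solve_user_factors(item_factors, ratings, reg):
    """Ridge solution for one user's factors with the item factors held fixed.

    item_factors: (n_rated, rank) factors of the items this user rated.
    ratings: (n_rated,) observed ratings for those items.
    reg: regularization strength (plays the role of regParam in this sketch).
    """
    rank = item_factors.shape[1]
    gram = item_factors.T @ item_factors + reg * np.eye(rank)
    return np.linalg.solve(gram, item_factors.T @ ratings)

rng = np.random.default_rng(42)
items = rng.normal(size=(6, 3))          # toy item factors, rank 3
true_user = np.array([1.0, -0.5, 2.0])   # toy user factors
observed = items @ true_user             # noise-free toy ratings

# With almost no regularization the true factors are recovered
user = solve_user_factors(items, observed, reg=1e-8)
# Stronger regularization shrinks the solution toward zero
user_shrunk = solve_user_factors(items, observed, reg=10.0)
print(user, user_shrunk)
```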

In [22]:
# Generate predictions
predictions_als = model_als.transform(test_spark_df)

# Evaluate RMSE
evaluator = RegressionEvaluator(
    metricName='rmse',
    labelCol='rating',
    predictionCol='prediction'
)
rmse_als = evaluator.evaluate(predictions_als)
print(f'ALS RMSE: {rmse_als}')

ALS RMSE: 0.9968222284871232


In [23]:
# Convert predictions to Pandas DataFrame
predictions_als_pandas = predictions_als.toPandas()

# Merge with test_df to align user_id and asin
als_results = test_df.merge(
    predictions_als_pandas,
    on=['user_index', 'asin_index'],
    how='left'
)

# Rename the prediction column to `als_pred`
als_results = als_results.rename(columns={'prediction': 'als_pred'})

In [24]:
# Create mapping from item_index to asin
item_index_to_id = {v: k for k, v in item_id_mapping.items()}

In [25]:
# Function to recommend top N items for a user from the ALS model
def recommend_als(user_id, n=10):
    # Get user_index
    if user_id not in user_id_mapping:
        print(f'User {user_id} not found in training set.')
        return []
    user_index = user_id_mapping[user_id]
    # Get items the user has already interacted with
    user_items = set(train_df[train_df['user_id'] == user_id]['asin_index'])
    # Get items the user hasn't interacted with yet
    all_items = set(range(len(item_id_mapping)))
    items_to_predict = list(all_items - user_items)
    # Create a Spark DataFrame of user-asin pairs
    user_item_pairs = spark.createDataFrame(
        [(user_index, item_idx) for item_idx in items_to_predict],
        ['user_index', 'asin_index']
    )
    # Generate predictions
    predictions = model_als.transform(user_item_pairs)
    # Drop NaN values (if any)
    predictions = predictions.na.drop()
    # Get top N recommendations
    predictions_df = predictions.toPandas()
    top_n = predictions_df.sort_values('prediction', ascending=False).head(n)
    # Map item_index back to asin
    top_n['asin'] = top_n['asin_index'].map(item_index_to_id)
    return top_n['asin'].tolist()

# Example usage
recommended_items_als = recommend_als(user_id)
print(f'Top recommended items for user {user_id} using ALS: {recommended_items_als}')


Top recommended items for user A0009988MRFQ3TROTQPI using ALS: ['1561270563', 'B005IFK660', 'B00012FX4A', 'B0064S0354', 'B000YI7LLY', 'B00006672Z', 'B000XJM7YA', 'B00009B1RV', 'B00005M2G1', 'B0012KK6R4']


In [26]:
# Generate top N recommendations for each user using ALS
user_recs = model_als.recommendForAllUsers(10)
user_recs_pandas = user_recs.toPandas()

# Map recommendations to user_ids
top_n_als = defaultdict(list)
for index, row in user_recs_pandas.iterrows():
    user_index = row['user_index']
    user_id = user_index_to_id[user_index]
    recommendations = row['recommendations']
    for rec in recommendations:
        item_index = rec['asin_index']
        est = rec['rating']
        item_id = item_index_to_id[item_index]
        top_n_als[user_id].append((item_id, est))

In [27]:
# Compute Precision@K and Recall@K for ALS
precision_als, recall_als = compute_precision_recall_at_k(top_n_als, actual_ratings, k=10, threshold=relevance_threshold)
f1_score_als = ((2*precision_als * recall_als)/ (recall_als + precision_als))
print(f'ALS Precision@10: {precision_als:.4f}')
print(f'ALS Recall@10: {recall_als:.4f}')
print(f'ALS F1 Score@10: {f1_score_als:.4}')

ALS Precision@10: 0.0013
ALS Recall@10: 0.0073
ALS F1 Score@10: 0.002258
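
When reading these numbers, note that the ALS lists are not drawn from the same candidates as the KNN and SVD lists: `get_top_n` ranks only the user-item pairs that appear in the test set, while `recommendForAllUsers` ranks the entire catalog. The toy sketch below, with made-up scores and item names, shows how one scorer can look very different under the two setups. It illustrates the evaluation setup only and is not a diagnosis of the ALS model.

```python
def top_k(scores, candidates, k):
    """Return the k highest-scoring items among the given candidates."""
    return sorted(candidates, key=lambda item: scores[item], reverse=True)[:k]

def recall_at_k(recommended, relevant):
    """Share of relevant items that appear in the recommended list."""
    return len(set(recommended) & set(relevant)) / len(relevant)

# Hypothetical predicted scores for one user over a six-item catalog
scores = {'a': 4.9, 'b': 4.8, 'c': 4.7, 'd': 4.6, 'e': 4.1, 'f': 3.9}
held_out = ['e', 'f']   # the user's test-set items
relevant = {'e'}        # held-out items rated at or above the threshold

test_only = top_k(scores, held_out, k=2)          # candidates limited to test pairs
full_catalog = top_k(scores, list(scores), k=2)   # candidates are every item

print(recall_at_k(test_only, relevant))     # 1.0
print(recall_at_k(full_catalog, relevant))  # 0.0
```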


## Ensemble Model

After completing the previous models, I wanted to see how an ensemble model would perform. I had initially planned to use all three models, but after seeing how poorly ALS performed I opted to use only KNN and SVD in the ensemble.

In [28]:
# Convert KNN predictions to DataFrame
knn_pred_df = pd.DataFrame([(pred.uid, pred.iid, pred.est) for pred in predictions_knn],
                           columns=['user_id', 'asin', 'knn_pred'])

# Merge with test_df
ensemble_df = test_df.merge(knn_pred_df, on=['user_id', 'asin'], how='left')

In [29]:
# Convert SVD predictions to DataFrame
svd_pred_df = pd.DataFrame([(pred.uid, pred.iid, pred.est) for pred in predictions_svd],
                           columns=['user_id', 'asin', 'svd_pred'])

# Merge with ensemble_df
ensemble_df = ensemble_df.merge(svd_pred_df, on=['user_id', 'asin'], how='left')

In [30]:
# Fill any missing predictions with the mean rating
# (assigning the result back avoids the chained inplace fillna that pandas warns about)
mean_rating = train_df['rating'].mean()
ensemble_df['knn_pred'] = ensemble_df['knn_pred'].fillna(mean_rating)
ensemble_df['svd_pred'] = ensemble_df['svd_pred'].fillna(mean_rating)


In [31]:
#Weight the models by the inverse of their RMSE
weight_svd = 1 / rmse_svd
weight_knn = 1 / rmse_knn

total_weight = weight_svd + weight_knn

# Generate the ratio each model's weight has on the total weight
weight_svd /= total_weight
weight_knn /= total_weight

# The weights
print(f'Weights:')
print(f'KNN Weight: {weight_knn:.4f}')
print(f'SVD Weight: {weight_svd:.4f}')

Weights:
KNN Weight: 0.4031
SVD Weight: 0.5969
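
The weighting scheme can be checked by hand. The sketch below wraps the same arithmetic in a hypothetical helper, `inverse_error_weights`, feeds it the RMSE values reported in this notebook, and then blends one pair of predictions taken from the comparison table near the end of the notebook (a KNN estimate of 5.0 and an SVD estimate of 4.707128).

```python
def inverse_error_weights(errors):
    """Normalize 1/error so the weights sum to 1; lower error means higher weight."""
    inverse = {name: 1.0 / err for name, err in errors.items()}
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}

# RMSE values reported by the notebook for KNN and SVD
weights = inverse_error_weights({'knn': 0.9219278522943515, 'svd': 0.6226858510489055})
print({name: round(w, 4) for name, w in weights.items()})  # {'knn': 0.4031, 'svd': 0.5969}

# Weighted blend of one KNN and one SVD prediction
blended = 5.0 * weights['knn'] + 4.707128 * weights['svd']
print(round(blended, 4))  # 4.8252
```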


In [32]:
# Apply weights to ensemble predictions
ensemble_df['ensemble_pred'] = (
    ensemble_df['knn_pred'] * weight_knn +
    ensemble_df['svd_pred'] * weight_svd
)

In [33]:
# Actual ratings
true_ratings = ensemble_df['rating'].values

# Ensemble predictions
ensemble_predictions = ensemble_df['ensemble_pred'].values

# Compute RMSE as the square root of MSE (the squared=False argument is deprecated in newer sklearn releases)
ensemble_rmse = np.sqrt(mean_squared_error(true_ratings, ensemble_predictions))
print(f'Ensemble RMSE: {ensemble_rmse}')

Ensemble RMSE: 0.7148613431740194


The initial ensemble recommendation function took hours to run. After doing some research and learning how to batch the predictions, it became significantly faster.

In [34]:
# Function to recommend top N items for a user using the ensemble
def recommend_ensemble_batch(user_id, n=10):
    # Get items the user hasn't interacted with
    user_items = set(train_df[train_df['user_id'] == user_id]['asin'])
    all_items = set(data_for_surprise['asin'].unique())
    items_to_predict = list(all_items - user_items)
    items_to_predict_str = [str(item_id) for item_id in items_to_predict]
    user_id_str = str(user_id)

    # Prepare user-item pairs with a placeholder rating (e.g., 0)
    user_item_pairs = [(user_id_str, item_id_str, 0) for item_id_str in items_to_predict_str]

    # Batch predictions with KNN
    predictions_knn = algo_knn.test(user_item_pairs)
    knn_preds = {pred.iid: pred.est for pred in predictions_knn}

    # Batch predictions with SVD
    predictions_svd = algo_svd.test(user_item_pairs)
    svd_preds = {pred.iid: pred.est for pred in predictions_svd}

    # Combine predictions
    combined_predictions = []
    for item_id_str in items_to_predict_str:
        preds = []
        if item_id_str in knn_preds:
            preds.append((knn_preds[item_id_str], weight_knn))
        if item_id_str in svd_preds:
            preds.append((svd_preds[item_id_str], weight_svd))
        if preds:
            # Weighted average of predictions
            weighted_sum = sum(pred * weight for pred, weight in preds)
            weight_sum = sum(weight for _, weight in preds)
            combined_pred = weighted_sum / weight_sum
            combined_predictions.append((item_id_str, combined_pred))

    # Sort predictions by estimated rating
    combined_predictions.sort(key=lambda x: x[1], reverse=True)

    # Return top N item_ids
    top_n_items = [item for item, rating in combined_predictions[:n]]
    return top_n_items


recommended_items_ensemble = recommend_ensemble_batch(user_id)
print(f'Top recommended items for user {user_id} using the Ensemble model: {recommended_items_ensemble}')

Top recommended items for user A3T1AUE5RSBOS6 using the Ensemble model: ['B001JTRKHW', '6302484286', 'B00D3PYS6Q', 'B00005T30I', 'B000096IBI', 'B000EHQU12', '837255773X', 'B011SDC12M', 'B0047UJBMC', 'B000069HXD']


In [35]:
# Get top N recommendations for Ensemble
top_n_ensemble = defaultdict(list)
ensemble_predictions = []

for index, row in ensemble_df.iterrows():
    user_id = str(row['user_id'])
    item_id = str(row['asin'])
    est = row['ensemble_pred']
    ensemble_predictions.append((user_id, item_id, est))

# Map predictions to each user
for user_id, item_id, est in ensemble_predictions:
    top_n_ensemble[user_id].append((item_id, est))

# Sort and get top N
for user_id, user_ratings in top_n_ensemble.items():
    user_ratings.sort(key=lambda x: x[1], reverse=True)
    top_n_ensemble[user_id] = user_ratings[:10]

# Compute Precision@K and Recall@K for Ensemble
precision_ensemble, recall_ensemble = compute_precision_recall_at_k(top_n_ensemble, actual_ratings, k=10, threshold=relevance_threshold)
f1_score_ensemble = ((2*precision_ensemble * recall_ensemble)/ (recall_ensemble + precision_ensemble))
print(f'Ensemble Precision@10: {precision_ensemble:.4f}')
print(f'Ensemble Recall@10: {recall_ensemble:.4f}')
print(f'Ensemble F1 Score@10: {f1_score_ensemble:.4f}')

Ensemble Precision@10: 0.2054
Ensemble Recall@10: 0.9107
Ensemble F1 Score@10: 0.3352


### Results Summary

In [36]:

print("\nModel Performance Summary:")
print(f"KNN RMSE: {rmse_knn}")
print(f"SVD RMSE: {rmse_svd}")
print(f"ALS RMSE: {rmse_als}")
print(f"Ensemble RMSE: {ensemble_rmse}")

print("\nRecommendation Quality Metrics:")
print(f'KNN Precision@10: {precision_knn:.4f}, Recall@10: {recall_knn:.4f}, F1Score@10: {f1_score_knn:.4}')
print(f'SVD Precision@10: {precision_svd:.4f}, Recall@10: {recall_svd:.4f}, F1Score@10: {f1_score_svd:.4}')
print(f'ALS Precision@10: {precision_als:.4f}, Recall@10: {recall_als:.4f}, F1Score@10: {f1_score_als:.4}')
print(f'Ensemble Precision@10: {precision_ensemble:.4f}, Recall@10: {recall_ensemble:.4f}, F1Score@10: {f1_score_ensemble:.4}')


Model Performance Summary:
KNN RMSE: 0.9219278522943515
SVD RMSE: 0.6226858510489055
ALS RMSE: 0.9968222284871232
Ensemble RMSE: 0.7148613431740194

Recommendation Quality Metrics:
KNN Precision@10: 0.2037, Recall@10: 0.9094, F1Score@10: 0.3329
SVD Precision@10: 0.2054, Recall@10: 0.9107, F1Score@10: 0.3352
ALS Precision@10: 0.0013, Recall@10: 0.0073, F1Score@10: 0.002258
Ensemble Precision@10: 0.2054, Recall@10: 0.9107, F1Score@10: 0.3352


In [37]:
# Create a copy of test_df with actual ratings
comparison_df = test_df[['user_id', 'asin', 'rating']].rename(columns={'rating': 'actual_rating'})

In [38]:
#Merging all the predictions into one df so that the predictions can be compared side by side
comparison_df = comparison_df.merge(knn_pred_df[['user_id', 'asin', 'knn_pred']], on=['user_id', 'asin'], how='left')
comparison_df = comparison_df.merge(svd_pred_df[['user_id', 'asin', 'svd_pred']], on=['user_id', 'asin'], how='left')
comparison_df = comparison_df.merge(als_results[['user_id', 'asin', 'als_pred']], on=['user_id', 'asin'], how='left')
comparison_df = comparison_df.merge(ensemble_df[['user_id', 'asin', 'ensemble_pred']], on=['user_id', 'asin'], how='left')

In [39]:
comparison_df

| | user_id | asin | actual_rating | knn_pred | svd_pred | als_pred | ensemble_pred |
|---|---|---|---|---|---|---|---|
| 0 | A0009988MRFQ3TROTQPI | B000BMSUBI | 5.0 | 5.000000 | 5.000000 | 4.199226 | 5.000000 |
| 1 | A0009988MRFQ3TROTQPI | B009INAMA8 | 5.0 | 5.000000 | 4.707128 | 4.471426 | 4.825195 |
| 2 | A00311542N70JGNHUZPI | B0014Z3OQW | 5.0 | 5.000000 | 4.783487 | 4.855718 | 4.870771 |
| 3 | A0040548BPHKXMHH3NTI | B00X7SIALI | 4.0 | 4.108956 | 4.188138 | 4.399876 | 4.156217 |
| 4 | A0040548BPHKXMHH3NTI | B0092QDMQ2 | 4.0 | 4.108956 | 4.184728 | 4.376638 | 4.154182 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1199045 | A2AZIQJGBLU7WN | B01HGRJUGE | 3.0 | 3.375000 | 3.331971 | 3.061921 | 3.349317 |
| 1199046 | A2ZNT49VN00YA3 | B01HH20HHE | 4.0 | 4.113785 | 3.878240 | 3.878754 | 3.973196 |
| 1199047 | A1GPDRU8VBWN3E | B01HH20HHE | 5.0 | 4.425000 | 4.733786 | 4.710001 | 4.609304 |
| 1199048 | A2NDDQLMA8XAUZ | B01HH20HHE | 5.0 | 5.000000 | 4.714126 | 4.783737 | 4.829371 |


In [40]:
# Stopping the PySpark session
spark.stop()