# Recommender System based on Amazon Product Reviews


I am going to build a recommender system that suggests products to users based on their past ratings.
The dataset I will be using can be found at https://www.kaggle.com/datasets/saurav9786/amazon-product-reviews and it was uploaded by the user 'Saurav Anand'.
The dataset contains 7.82 million data points, and 4 attributes:
* __userId__ : Every user identified with a unique id;
* __productId__ : Every product identified with a unique id;
* __Rating__ : Rating of the corresponding product by the corresponding user;
* __timestamp__ : Time of the rating.

I want to highlight that this notebook was created purely for my own learning, so the methods used might not be the best. The recommender system itself is very simple and rudimentary, built just to learn a few things about recommender systems in general.

I will keep reading about recommender systems as there still is much I don't know; in the future I might create a much better and more complex version of a recommender system.

I have uploaded the .csv file to my Google Drive, so I will start by reading it directly from there.

In [None]:
import pandas as pd
from google.colab import drive

drive.mount('/content/drive')

column_names = ['user_id', 'product_id', 'rating', 'timestamp']

file_path = '/content/drive/My Drive/ratings_Electronics.csv'

df = pd.read_csv(file_path, names=column_names)

print(df.head())

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
          user_id  product_id  rating   timestamp
0   AKM1MP6P0OYPR  0132793040     5.0  1365811200
1  A2CX7LUOHB2NDG  0321732944     5.0  1341100800
2  A2NWSAGRHCP8N5  0439886341     1.0  1367193600
3  A2WNBOD3WNDNKT  0439886341     3.0  1374451200
4  A1GI0U4ZRJA8WN  0439886341     1.0  1334707200


From the description of the dataset on Kaggle I already know that there are no missing values, but I will check anyway, just to make sure.

In [None]:
print(df.isnull().sum())

user_id       0
product_id    0
rating        0
timestamp     0
dtype: int64


I am going to use the 'pd.to_datetime' function to convert the 'timestamp' attribute from Unix seconds into a consistent datetime format. I won't really need it here, but datetimes are usually more convenient for date arithmetic and other date-related tasks.

In [None]:
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='s')
df.head()

          user_id  product_id  rating  timestamp
0   AKM1MP6P0OYPR  0132793040     5.0 2013-04-13
1  A2CX7LUOHB2NDG  0321732944     5.0 2012-07-01
2  A2NWSAGRHCP8N5  0439886341     1.0 2013-04-29
3  A2WNBOD3WNDNKT  0439886341     3.0 2013-07-22
4  A1GI0U4ZRJA8WN  0439886341     1.0 2012-04-18
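
As a quick sanity check of what `unit='s'` does, here is a minimal sketch that converts the first two raw timestamps shown earlier. The variable names are only illustrative and are not part of the notebook.

```python
import pandas as pd

# Raw Unix timestamps (seconds since 1970-01-01), copied from the first two rows of the dataset
raw = pd.Series([1365811200, 1341100800])

# unit='s' tells pandas the integers are seconds rather than nanoseconds (the default)
converted = pd.to_datetime(raw, unit='s')
dates = converted.dt.strftime('%Y-%m-%d').tolist()
print(dates)

# Datetime values support date arithmetic directly
gap_days = (converted[0] - converted[1]).days
print(gap_days)
```

Subtracting two datetimes gives a `Timedelta`, which is the kind of date arithmetic that is awkward to do on raw integers.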


Here I am going to drop the users and products that appear fewer than 5 times in the dataset. This is done for a few reasons. Firstly, I am trying to build a recommender system, so it is useful to have at least a few observations for each user and product to base my recommendations on. Also, I don't have the resources to build a model with millions of data points, so this will thin out the observations in the dataset.

In [None]:
user_counts = df['user_id'].value_counts()
product_counts = df['product_id'].value_counts()

df = df[df['user_id'].isin(user_counts[user_counts >= 5].index)]
df = df[df['product_id'].isin(product_counts[product_counts >= 5].index)]
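
To make the filtering step easier to follow, here is a small sketch on a toy table. The ids and the threshold of 3 are made up for illustration; the notebook itself uses a threshold of 5.

```python
import pandas as pd

# Toy ratings table (hypothetical ids): 'u1' has 3 ratings, the other users have 1 each
toy = pd.DataFrame({
    'user_id':    ['u1', 'u1', 'u1', 'u2', 'u3'],
    'product_id': ['p1', 'p2', 'p3', 'p1', 'p1'],
    'rating':     [5.0, 4.0, 3.0, 2.0, 1.0],
})
min_count = 3  # illustrative threshold, not the notebook's value

# Counts are computed once, on the unfiltered table
user_counts = toy['user_id'].value_counts()
product_counts = toy['product_id'].value_counts()

# Keep rows whose user AND product both reach the threshold
keep_users = user_counts[user_counts >= min_count].index
keep_products = product_counts[product_counts >= min_count].index
filtered = toy[toy['user_id'].isin(keep_users) & toy['product_id'].isin(keep_products)]
print(filtered)
```

Only the row for 'u1' and 'p1' survives. Because the counts are taken before any rows are removed, a surviving user or product can end up with fewer ratings than the threshold after filtering.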

We no longer have 7.82 million observations; we are down to just over 2.1 million. This is still too many for the simple recommender system I am going to create, as I will also need some memory for building the interaction matrix, so I will sample 1% of the observations to construct this simple model.

In [None]:
df.shape

(2109869, 4)

In [None]:
df_sampled = df.sample(frac=0.01, random_state=42)
df_sampled.shape

(21099, 4)

I will convert the data types of the first 3 attributes to save memory. The two id attributes become 'category', and the rating uses 'float32' instead of 'float64', which halves the memory footprint of that column. The 'timestamp' attribute is not used by the model, so it could be dropped to save even more, but I won't be doing that here.

In [None]:
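# Convert the sampled columns themselves (not the full df) so the categories only cover the sample
df_sampled['user_id'] = df_sampled['user_id'].astype('category')
df_sampled['product_id'] = df_sampled['product_id'].astype('category')
df_sampled['rating'] = df_sampled['rating'].astype('float32')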
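
The effect of these conversions can be checked with `memory_usage`. This is a minimal sketch on made-up data, not a measurement of the actual dataset.

```python
import numpy as np
import pandas as pd

# 1000 ratings stored as float64 and as float32
ratings64 = pd.Series(np.ones(1000, dtype='float64'))
ratings32 = ratings64.astype('float32')

# index=False counts only the values, not the index
bytes64 = ratings64.memory_usage(index=False)
bytes32 = ratings32.memory_usage(index=False)
print(bytes64, bytes32)

# A string column with few distinct values shrinks a lot as 'category'
ids = pd.Series(['A', 'B'] * 500)
ids_cat = ids.astype('category')
bytes_str = ids.memory_usage(index=False, deep=True)
bytes_cat = ids_cat.memory_usage(index=False, deep=True)
print(bytes_str, bytes_cat)
```

The saving from 'category' depends on how often values repeat. A column where almost every id is unique will benefit much less than this toy example suggests.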

Now we start to get into the fun part...

I am going to create a sparse matrix for the training data, which is a memory-efficient way of storing and processing large datasets. Although we only have about 17K data points in the training dataset, I think this is still good practice for working with larger datasets.

A sparse matrix is simply a matrix in which most of the elements are zero. Only the non-zero elements and their indices are stored, which saves a good amount of memory. It also speeds up matrix operations, as the zero entries can be skipped.

In [None]:
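from sklearn.model_selection import train_test_split
from scipy.sparse import csr_matrix

# Hold out 20% of the sampled ratings for testing
train_data, test_data = train_test_split(df_sampled, test_size=0.2, random_state=42)

# Users as rows, products as columns; unrated pairs are filled with 0
train_interaction_matrix = train_data.pivot(index='user_id', columns='product_id', values='rating').fillna(0)

# Convert the dense pivot table to compressed sparse row (CSR) format
sparse_train_matrix = csr_matrix(train_interaction_matrix.values)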
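
The pivot above first builds a dense table and only then converts it. As an alternative sketch (not what this notebook does), a CSR matrix can be built directly from the rating triples by using categorical codes as row and column positions. The toy ids below are hypothetical.

```python
import pandas as pd
from scipy.sparse import csr_matrix

toy = pd.DataFrame({
    'user_id':    ['u1', 'u1', 'u2', 'u3'],
    'product_id': ['p1', 'p3', 'p2', 'p3'],
    'rating':     [5.0, 3.0, 4.0, 1.0],
})

# Map each id to an integer position via categorical codes
users = toy['user_id'].astype('category')
products = toy['product_id'].astype('category')
rows = users.cat.codes.to_numpy()
cols = products.cat.codes.to_numpy()

# Only the (value, row, column) triples are stored; everything else is an implicit zero
shape = (len(users.cat.categories), len(products.cat.categories))
sparse = csr_matrix((toy['rating'].to_numpy(), (rows, cols)), shape=shape)

print(sparse.shape, sparse.nnz)
print(sparse.toarray())
```

This avoids ever allocating the full users-by-products table, which matters once the number of users and products grows.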

Here is the part that I find most confusing...

I am going to use truncated SVD to reduce dimensionality and extract the most important latent features. By keeping only the k largest singular values and their vectors, we can capture the most important patterns in the data while getting rid of less important information. In a recommender system, truncated SVD can improve predictions by helping the model generalize and avoid overfitting.

In [None]:
from scipy.sparse.linalg import svds
import numpy as np

U, sigma, Vt = svds(sparse_train_matrix, k=10)
sigma = np.diag(sigma)

predicted_ratings_train_matrix = np.dot(np.dot(U, sigma), Vt)
predicted_ratings_train_df = pd.DataFrame(predicted_ratings_train_matrix, columns=train_interaction_matrix.columns, index=train_interaction_matrix.index)
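
To get a feel for what the truncation does, here is a small sketch on a random dense matrix (made-up data, not the ratings). It checks that `svds` returns the k largest singular values and that the reconstruction error equals the energy in the discarded ones.

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
A = rng.random((6, 8))
k = 2

# Truncated SVD: only k singular triplets are computed
U, s, Vt = svds(A, k=k)
A_k = U @ np.diag(s) @ Vt  # rank-k approximation of A

# Full SVD for comparison; singular values come back in descending order
full_s = np.linalg.svd(A, compute_uv=False)

# Frobenius-norm error of the approximation
error = np.linalg.norm(A - A_k)
expected_error = np.sqrt(np.sum(full_s[k:] ** 2))
print(error, expected_error)
```

The order in which `svds` returns the singular values should not be relied upon, so it is safer to sort them before comparing with the full decomposition.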

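Lastly, in the last few cells of this notebook, I am going to compute the Root Mean Squared Error (RMSE) to check the performance of the model. I also iterate over different values of k to see which performs best, and I perform cross-validation to check robustness.

Note that the first cell measures the error only on the ratings that actually exist in the training data, while the later cells compare against the zero-filled matrices. Their RMSE values therefore include every unrated cell and are not directly comparable with the first one.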

In [None]:
from sklearn.metrics import mean_squared_error
from math import sqrt

# Pivot without filling, so that missing ratings stay NaN and can be masked out
train_actual_ratings = train_data.pivot(index='user_id', columns='product_id', values='rating')
observed = train_actual_ratings.notna().values
predicted_ratings = predicted_ratings_train_df.values[observed]

# RMSE over the observed training ratings only
train_rmse = sqrt(mean_squared_error(train_actual_ratings.values[observed], predicted_ratings))
print(f'Training RMSE: {train_rmse}')

Training RMSE: 4.241323702773579
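
The choice of mask has a large effect on the reported RMSE. The following sketch uses a made-up 2x2 example to compare the error over observed ratings only with the error over a zero-filled matrix.

```python
import numpy as np

# Two observed ratings, two missing ones
actual = np.array([[5.0, np.nan],
                   [np.nan, 3.0]])
predicted = np.array([[4.0, 0.1],
                      [0.2, 3.0]])

# RMSE over observed entries only
observed = ~np.isnan(actual)
rmse_observed = np.sqrt(np.mean((actual[observed] - predicted[observed]) ** 2))

# RMSE over every cell after filling the missing ratings with 0
actual_filled = np.nan_to_num(actual, nan=0.0)
rmse_all_cells = np.sqrt(np.mean((actual_filled - predicted) ** 2))

print(rmse_observed, rmse_all_cells)
```

In a very sparse matrix almost every cell is an unrated zero, and predictions close to zero there pull the all-cells RMSE down without saying much about how well real ratings are predicted.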


In [None]:
# Build the test interaction matrix (not defined earlier; assumed zero-filled like the training matrix)
test_interaction_matrix = test_data.pivot(index='user_id', columns='product_id', values='rating').fillna(0)

# Reindex the predictions from the most recent SVD run to the test users/products;
# pairs that were not seen in training get a prediction of 0
test_predicted_ratings_df = predicted_ratings_train_df.reindex(index=test_interaction_matrix.index, columns=test_interaction_matrix.columns).fillna(0)

# Because the matrix is zero-filled, notna() is True everywhere, so every cell is compared
test_mask = test_interaction_matrix.notna().values
test_predicted_ratings = test_predicted_ratings_df.values[test_mask]
test_actual_ratings = test_interaction_matrix.values[test_mask]

# Calculate RMSE for the test set
test_rmse = sqrt(mean_squared_error(test_actual_ratings, test_predicted_ratings))
print(f'Test RMSE: {test_rmse}')


Test RMSE: 0.0730544417842926
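
The `reindex` call is what lines up training predictions with the test set. Here is a minimal sketch with hypothetical ids showing what happens to users and products that never appeared in training.

```python
import pandas as pd

# Predictions learned on the training users/products (made-up values)
train_pred = pd.DataFrame([[4.5, 1.0],
                           [2.0, 3.5]],
                          index=['u1', 'u2'], columns=['p1', 'p2'])

# The test set contains one known user/product and one unseen user/product
test_users = ['u2', 'u9']
test_products = ['p2', 'p7']

# Unseen labels produce NaN rows/columns
aligned = train_pred.reindex(index=test_users, columns=test_products)
print(aligned)

# Filling with 0 means "no prediction" is scored as a rating of 0
aligned_filled = aligned.fillna(0)
```

Only the ('u2', 'p2') cell carries a real prediction. Everything else is a cold-start case that the model cannot say anything about.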


In [None]:
k_values = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
for k in k_values:
    U, sigma, Vt = svds(sparse_train_matrix, k=k)
    sigma = np.diag(sigma)
    predicted_ratings_train_matrix = np.dot(np.dot(U, sigma), Vt)
    predicted_ratings_train_df = pd.DataFrame(predicted_ratings_train_matrix, columns=train_interaction_matrix.columns, index=train_interaction_matrix.index)

    predicted_ratings_train = predicted_ratings_train_df.values[train_interaction_matrix.notna()]
    train_rmse = sqrt(mean_squared_error(train_interaction_matrix.values[train_interaction_matrix.notna()], predicted_ratings_train))
    print(f'k={k}, Training RMSE: {train_rmse}')

    test_predicted_ratings_df = predicted_ratings_train_df.reindex(index=test_interaction_matrix.index, columns=test_interaction_matrix.columns).fillna(0)
    test_predicted_ratings = test_predicted_ratings_df.values[test_interaction_matrix.notna()]
    test_rmse = sqrt(mean_squared_error(test_interaction_matrix.values[test_interaction_matrix.notna()], test_predicted_ratings))
    print(f'k={k}, Test RMSE: {test_rmse}')


k=10, Training RMSE: 0.040826054276471874
k=10, Test RMSE: 0.07292480047307036
k=20, Training RMSE: 0.04056975719090962
k=20, Test RMSE: 0.07296191313161031
k=30, Training RMSE: 0.040371866620199696
k=30, Test RMSE: 0.07297314822054247
k=40, Training RMSE: 0.040205532390612385
k=40, Test RMSE: 0.07302565993300913
k=50, Training RMSE: 0.04006566495573171
k=50, Test RMSE: 0.0730544417842926
k=60, Training RMSE: 0.03993772703120798
k=60, Test RMSE: 0.07307285443536923
k=70, Training RMSE: 0.0398136940075101
k=70, Test RMSE: 0.07313917226984641
k=80, Training RMSE: 0.039696211173304366
k=80, Test RMSE: 0.07316548145484424
k=90, Training RMSE: 0.03958587061259218
k=90, Test RMSE: 0.07318070692319747
k=100, Training RMSE: 0.03947921152529604
k=100, Test RMSE: 0.07319203248962101
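
One simple way to read these results is to pick the k with the lowest test RMSE. The sketch below copies the test values printed above into a dictionary (the dictionary name is only illustrative).

```python
# Test RMSE per k, copied from the loop output
test_rmse_by_k = {
    10: 0.07292480047307036,
    20: 0.07296191313161031,
    30: 0.07297314822054247,
    40: 0.07302565993300913,
    50: 0.0730544417842926,
    60: 0.07307285443536923,
    70: 0.07313917226984641,
    80: 0.07316548145484424,
    90: 0.07318070692319747,
    100: 0.07319203248962101,
}

# k with the smallest test error
best_k = min(test_rmse_by_k, key=test_rmse_by_k.get)
spread = max(test_rmse_by_k.values()) - min(test_rmse_by_k.values())
print(best_k, spread)
```

The training RMSE keeps falling as k grows while the test RMSE slowly rises, which is the usual sign that larger k mostly fits the training data more closely. The differences are very small, though, so this ranking should be read with caution.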


In [None]:
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
train_rmse_list = []
test_rmse_list = []

for train_index, val_index in kf.split(df_sampled):
    train_data = df_sampled.iloc[train_index]
    val_data = df_sampled.iloc[val_index]

    train_interaction_matrix = train_data.pivot(index='user_id', columns='product_id', values='rating').fillna(0)
    val_interaction_matrix = val_data.pivot(index='user_id', columns='product_id', values='rating').fillna(0)

    sparse_train_matrix = csr_matrix(train_interaction_matrix.values)

    U, sigma, Vt = svds(sparse_train_matrix, k=50)
    sigma = np.diag(sigma)

    predicted_ratings_train_matrix = np.dot(np.dot(U, sigma), Vt)
    predicted_ratings_train_df = pd.DataFrame(predicted_ratings_train_matrix, columns=train_interaction_matrix.columns, index=train_interaction_matrix.index)

    predicted_ratings_train = predicted_ratings_train_df.values[train_interaction_matrix.notna()]
    train_rmse = sqrt(mean_squared_error(train_interaction_matrix.values[train_interaction_matrix.notna()], predicted_ratings_train))
    train_rmse_list.append(train_rmse)

    val_predicted_ratings_df = predicted_ratings_train_df.reindex(index=val_interaction_matrix.index, columns=val_interaction_matrix.columns).fillna(0)
    val_predicted_ratings = val_predicted_ratings_df.values[val_interaction_matrix.notna()]
    val_rmse = sqrt(mean_squared_error(val_interaction_matrix.values[val_interaction_matrix.notna()], val_predicted_ratings))
    test_rmse_list.append(val_rmse)

print(f'Average Training RMSE: {np.mean(train_rmse_list)}')
print(f'Average Validation RMSE: {np.mean(test_rmse_list)}')


Average Training RMSE: 0.04003342498475809
Average Validation RMSE: 0.07316010442896775
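
To see how `KFold` splits the rows, here is a small sketch on ten dummy row positions, using the same settings as the cell above.

```python
import numpy as np
from sklearn.model_selection import KFold

rows = np.arange(10)  # stand-in for the row positions of a DataFrame
kf = KFold(n_splits=5, shuffle=True, random_state=42)

seen = []
fold_sizes = []
for fold, (train_idx, val_idx) in enumerate(kf.split(rows)):
    print(f'fold {fold}: train={train_idx}, val={val_idx}')
    seen.extend(val_idx.tolist())
    fold_sizes.append((len(train_idx), len(val_idx)))
```

Each row lands in the validation set exactly once, so averaging the per-fold RMSE uses every observation for validation while never validating on rows the model was fitted on in that fold.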
