# Header
- **Created by    :** Robby Lysander Aurelio
- **Creation date :** September 13, 2024

In [1]:
# !pip install scikit-surprise
# !pip install torch

In [2]:
# Load all the necessary libraries
import pandas as pd
import numpy as np
import random
from surprise import Reader, Dataset, SVD, KNNWithMeans
from surprise.model_selection import cross_validate
import torch
import torch.nn as nn
import torch.nn.functional as F
import gc
from sklearn.feature_extraction.text import TfidfVectorizer

In [3]:
# Load the training set
df = pd.read_csv('./data/train.csv') # Labeled dataset for training the models

In [4]:
# Check the dataset
df.head()

Unnamed: 0,user_id,product_id,product_name,rating,votes,helpful_votes,ID
0,1813,154533,Beautiful Thing,5,10,8,0
1,1944,192838,Almost Famous,5,4,2,1
2,534,202590,A Clockwork Orange,5,5,5,2
3,1811,140456,Great Expectations (Wordsworth Classics),4,1,0,3
4,102,154278,Phenomenon,5,0,0,4


In [5]:
# Check the structure of the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 745889 entries, 0 to 745888
Data columns (total 7 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   user_id        745889 non-null  int64 
 1   product_id     745889 non-null  int64 
 2   product_name   745889 non-null  object
 3   rating         745889 non-null  int64 
 4   votes          745889 non-null  int64 
 5   helpful_votes  745889 non-null  int64 
 6   ID             745889 non-null  int64 
dtypes: int64(6), object(1)
memory usage: 39.8+ MB


In [6]:
# Check the summary of the dataset
df.describe()
df.describe(include=['O'])

Unnamed: 0,product_name
count,745889
unique,178037
top,The Hobbit
freq,783


In [7]:
# Check the number of unique user and product IDs
num_users = len(df.user_id.unique())
num_items = len(df.product_id.unique())
print(num_users, num_items)

2000 201325


## Data Wrangling

The exploration above shows 201,325 unique product IDs but only 178,037 unique product names. This mismatch suggests a data integrity issue, since each product should have exactly one ID.

In [8]:
# Check data integrity
product_variations = df.groupby('product_name')['product_id'].nunique()
product_variations[product_variations > 1]

Unnamed: 0_level_0,product_id
product_name,Unnamed: 1_level_1
"""Extra"" Work for Brain Surgeons",3
"""Fire! Fire!"" Said Mrs. McGuire",2
"""O"" Is for Outlaw",3
"""O"" Is for Outlaw (Kinsey Millhone Mysteries (Audio))",2
#1,2
...,...
johns,2
"sex, lies, and videotape",2
"¡Corre, perro, corre!",2
¿Dónde está Spot?,2


The results above confirm that some products appear under several IDs even though they share the same name, so this needs to be fixed first. A simple remedy is to keep the most frequent ID for each product name.

In [9]:
# Fix products that share a name but have different IDs
# by keeping the most frequent ID for each name
train_mapping = df.groupby('product_name').product_id.agg(lambda x: pd.Series.mode(x)[0]).to_dict()
df['product_id'] = df['product_name'].apply(lambda x: train_mapping[x])
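
To see what this mapping does, here is a minimal, dependency-free sketch on made-up rows (the names and IDs below are illustrative, not taken from the dataset). It keeps the most frequent ID per product name and breaks ties by the smallest ID, which mirrors `pd.Series.mode(x)[0]` because `mode` returns its values sorted. The helper name `most_frequent_id_mapping` is hypothetical.

```python
from collections import Counter, defaultdict


def most_frequent_id_mapping(rows):
    """Map each product name to its most frequent ID (ties -> smallest ID)."""
    counts = defaultdict(Counter)
    for name, product_id in rows:
        counts[name][product_id] += 1
    return {
        name: min(id_counts, key=lambda pid: (-id_counts[pid], pid))
        for name, id_counts in counts.items()
    }


# Illustrative (name, product_id) pairs, not real data
rows = [
    ("The Hobbit", 11),
    ("The Hobbit", 11),
    ("The Hobbit", 42),
    ("Phenomenon", 7),
    ("Phenomenon", 5),
]

mapping = most_frequent_id_mapping(rows)
# Rewrite every row so that each name uses a single canonical ID
canonical_rows = [(name, mapping[name]) for name, _ in rows]
print(mapping)
```

After this step every name maps to exactly one ID, so the number of unique names and unique IDs agree.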

## Model Selection

To build a recommender system that predicts the rating a user gives to a specific product, three types of algorithms are considered:
- **K-Nearest Neighbours (KNN):** a collaborative filtering model that uses the ratings from the most similar users or items.
- **Singular Value Decomposition (SVD):** another collaborative filtering model that assumes the properties of both users and items can be described in a low-dimensional space of latent factors. It uses an inner product of these factors to produce the output.
- **Neural Networks:** introduce non-linearity to produce the output, so they can learn more complex relationships between users and items.

### KNN

Note that a grid search was run beforehand to find the optimal parameters for the model. The hyperparameters tuned include the similarity measure and the minimum and maximum number of neighbours (K).
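
The grid search itself is not part of this notebook. As a rough, library-free sketch of the idea, the code below tries every combination in a parameter grid and keeps the one with the lowest score. The grid values and the `toy_rmse` scoring function are placeholders invented for illustration; in practice the score would come from cross-validating the model with each parameter set (Surprise also ships a `GridSearchCV` class in `surprise.model_selection` for this purpose).

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Exhaustively evaluate all combinations; lower score is better."""
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid over the KNN hyperparameters mentioned above
param_grid = {
    "k": [10, 20, 40],
    "min_k": [1, 4],
    "sim_name": ["MSD", "cosine"],
}


def toy_rmse(params):
    # Placeholder objective, NOT a real model evaluation
    penalty = 0.0 if params["sim_name"] == "MSD" else 0.05
    return 0.01 * abs(params["k"] - 10) + 0.01 * abs(params["min_k"] - 4) + penalty


best_params, best_score = grid_search(param_grid, toy_rmse)
print(best_params, best_score)
```

The cost grows with the product of the grid sizes (here 3 x 2 x 2 = 12 evaluations), which is why the search is usually run once and only the chosen values are kept in the notebook.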

In [10]:
# Initialize the seed for the RNG
random.seed(79)
np.random.seed(79)

# Create a dataset from the training set only using the user and product ID pairings and their rating
data = Dataset.load_from_df(df[['user_id', 'product_id', 'rating']], Reader())

# Initialize the KNN model
knn = KNNWithMeans(k = 10,
                   min_k = 4,
                   sim_options = {'name': 'MSD'},
                   verbose = False)

# Do a 5-fold cross-validation train and evaluate the model's performance
cross_validate(knn, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm KNNWithMeans on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8585  0.8572  0.8587  0.8578  0.8526  0.8570  0.0022  
MAE (testset)     0.6119  0.6099  0.6111  0.6105  0.6090  0.6105  0.0010  
Fit time          1.34    1.74    1.68    1.65    1.71    1.62    0.15    
Test time         9.31    8.94    8.94    8.41    9.58    9.03    0.40    


{'test_rmse': array([0.85848359, 0.85716239, 0.85872703, 0.85777069, 0.8526204 ]),
 'test_mae': array([0.61192058, 0.6098989 , 0.61112399, 0.61052454, 0.60896675]),
 'fit_time': (1.3351709842681885,
  1.7351608276367188,
  1.682624101638794,
  1.6471624374389648,
  1.7080376148223877),
 'test_time': (9.310630798339844,
  8.937234163284302,
  8.93796157836914,
  8.407681465148926,
  9.57626461982727)}
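
For intuition about what `k`, `min_k`, and the MSD similarity control, the following is a simplified pure-Python sketch of a user-based, mean-centred KNN prediction. It is an illustration under assumptions (toy ratings, hypothetical function names), not Surprise's actual implementation: neighbours are the users who rated the item, ranked by MSD similarity, and if fewer than `min_k` of them contribute, the prediction falls back to the user's mean rating.

```python
def msd_similarity(ratings_a, ratings_b):
    """1 / (mean squared difference + 1) over the items both users rated."""
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return 0.0
    msd = sum((ratings_a[i] - ratings_b[i]) ** 2 for i in common) / len(common)
    return 1.0 / (msd + 1.0)


def predict_knn_with_means(user, item, ratings, k=10, min_k=4):
    """Mean-centred, similarity-weighted average over the k nearest neighbours."""
    means = {u: sum(r.values()) / len(r) for u, r in ratings.items()}
    # Candidate neighbours: other users who rated the target item
    neighbours = [
        (msd_similarity(ratings[user], r), other)
        for other, r in ratings.items()
        if other != user and item in r
    ]
    neighbours = sorted(neighbours, key=lambda t: t[0], reverse=True)[:k]
    used = [(sim, other) for sim, other in neighbours if sim > 0]
    if len(used) < min_k:
        return means[user]  # not enough neighbours: fall back to the user mean
    num = sum(sim * (ratings[other][item] - means[other]) for sim, other in used)
    den = sum(sim for sim, _ in used)
    return means[user] + num / den


# Toy ratings: user -> {item: rating}
toy_ratings = {
    "a": {"x": 5, "y": 3},
    "b": {"x": 5, "y": 3, "z": 4},
    "c": {"x": 1, "y": 5, "z": 2},
}

pred_enough = predict_knn_with_means("a", "z", toy_ratings, k=10, min_k=1)
pred_fallback = predict_knn_with_means("a", "z", toy_ratings, k=10, min_k=4)
print(pred_enough, pred_fallback)
```

User `b` agrees perfectly with `a` (similarity 1), while `c` disagrees strongly (similarity 1/11), so the first prediction stays close to the mean rating of `a`. With `min_k=4` only two neighbours are available and the sketch returns the mean itself.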

### SVD

Note that a grid search was run beforehand to find the optimal parameters for the model. The hyperparameters tuned include the number of latent factors, the number of epochs, the learning rate, and the regularization rate.

In [11]:
# Initialize the seed for the RNG
random.seed(79)
np.random.seed(79)

# Initialize the SVD model
svd = SVD(n_factors = 350,
          n_epochs = 150,
          lr_all = 0.1,
          reg_all = 0.02,
          random_state = 79)

# Do a 5-fold cross-validation train and evaluate the model's performance
cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.7955  0.7947  0.7960  0.7946  0.7901  0.7942  0.0021  
MAE (testset)     0.5224  0.5205  0.5205  0.5211  0.5213  0.5211  0.0007  
Fit time          247.03  257.77  265.30  255.13  248.60  254.76  6.60    
Test time         2.03    1.85    1.87    1.82    2.15    1.94    0.13    


{'test_rmse': array([0.79547422, 0.79473424, 0.79604107, 0.79464712, 0.79008606]),
 'test_mae': array([0.52237509, 0.52046822, 0.52052395, 0.52107455, 0.52125977]),
 'fit_time': (247.0275740623474,
  257.77093267440796,
  265.2962739467621,
  255.13149666786194,
  248.59572052955627),
 'test_time': (2.0273091793060303,
  1.8466849327087402,
  1.874511480331421,
  1.821594476699829,
  2.1515326499938965)}
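
To make the roles of `n_factors`, `lr_all`, and `reg_all` more concrete, here is a small pure-Python sketch of one stochastic gradient descent update for a biased matrix factorization model, where the prediction is the global mean plus user bias, item bias, and the inner product of the latent factors. This follows the update rule described in the Surprise documentation as I understand it; the function name and the toy numbers are assumptions for illustration.

```python
def sgd_step(r, mu, b_u, b_i, p_u, q_i, lr=0.1, reg=0.02):
    """One SGD update for a single observed rating r of user u on item i."""
    pred = mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))
    err = r - pred
    # Biases and factors move along the error, shrunk by the regularization term
    new_b_u = b_u + lr * (err - reg * b_u)
    new_b_i = b_i + lr * (err - reg * b_i)
    new_p_u = [p + lr * (err * q - reg * p) for p, q in zip(p_u, q_i)]
    new_q_i = [q + lr * (err * p - reg * q) for p, q in zip(p_u, q_i)]
    return new_b_u, new_b_i, new_p_u, new_q_i, err


# Toy setup: 2 latent factors (the notebook uses 350), one rating of 5
mu = 4.0
b_u, b_i = 0.0, 0.0
p_u, q_i = [0.1, 0.1], [0.1, 0.1]

errors = []
for _ in range(5):
    b_u, b_i, p_u, q_i, err = sgd_step(5.0, mu, b_u, b_i, p_u, q_i)
    errors.append(abs(err))
print(errors)
```

Each epoch repeats this update over all training ratings, so a larger `n_epochs` or `n_factors` directly increases the fit time reported above.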

### Neural Network

For the Neural Network model, here we will use the Multi-Layer Perceptron (MLP) algorithm, in which the interaction function consists of several non-linear functions to learn the relationship between the users and items.

Here, we will use the TF-IDF representation of the product name instead of the product ID to accommodate unseen products in the test set (products that do not appear in the training set).
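
The reason TF-IDF helps with unseen products can be sketched without any library. The code below is a simplified stand-in for `TfidfVectorizer`: it assumes whitespace tokenization, the smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization (which I believe matches scikit-learn's defaults), and it omits the stop-word removal, `min_df`/`max_df` filtering, and `max_features` cap used in this notebook. The product names are illustrative.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Learn smoothed IDF weights from the training product names."""
    n = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc.lower().split()))
    return {tok: math.log((1 + n) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}


def transform(doc, idf):
    """Encode a name with the learned vocabulary; unknown words are ignored."""
    tf = Counter(tok for tok in doc.lower().split() if tok in idf)
    vec = {tok: count * idf[tok] for tok, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {tok: w / norm for tok, w in vec.items()} if norm else {}


train_names = ["The Hobbit", "The Lord of the Rings", "Great Expectations"]
idf = fit_idf(train_names)

# A product that never appeared in training still gets a usable vector
unseen_vec = transform("The Hobbit Illustrated Edition", idf)
# A name sharing no words with the training vocabulary maps to an empty vector
empty_vec = transform("Zzz", idf)
print(unseen_vec, empty_vec)
```

An unseen product ID has no learned representation at all, whereas an unseen name usually shares some words with training names and therefore lands near related products in the TF-IDF space.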

In [12]:
# References: class tutorial

def encode_data(df, train=None):
    '''
    Encodes the dataset with contiguous (0-based) user IDs

    Parameters
    ----------
    df : Pandas dataframe
      the dataframe to be encoded
    train : Pandas dataframe
      the training dataframe (for reference)

    Returns
    -------
    The encoded dataframe
    '''

    # Create a copy of the dataset
    df = df.copy()
    # Reset the training column
    train_col = None
    # If training set is provided, use it as a reference
    # Else, use the dataset directly
    if train is not None:
        uniq = train["user_id"].unique()
    else:
        uniq = df["user_id"].unique()
    # Create the encoding mapping
    name2idx = {o:i for i,o in enumerate(uniq)}
    # Encode the user IDs
    df["user_id"] = np.array([name2idx.get(x, -1) for x in df["user_id"]])
    # Remove the unseen IDs
    df = df[df["user_id"] >= 0]
    return df

class CollabFNet(nn.Module):
    '''
    The neural collaborative filtering model.

    Parameters
    ----------
    num_users : int
      the number of users (size of the input layer)
    tfidf_dim : int
      the number of dimensions in the item's TF-IDF
    emb_size : int
      the size of the embedding layer (default: 100)
    n_hidden : int
      the size of the hidden layer (default: 10)

    Attributes
    ----------
    user_emb : torch.nn object
      the user embedding layer
    item_tfidf : torch.nn object
      the item TF-IDF embedding layer
    lin1 : torch.nn object
      the first linear layer (embedding to hidden)
    lin2 : torch.nn object
      the second linear layer (hidden to output)
    drop1 : torch.nn object
      the dropout layer
    '''

    def __init__(self, num_users, tfidf_dim, emb_size=100, n_hidden=10):
        '''
        Constructs all the necessary attributes for the CollabFNet object.

        Parameters
        ----------
        num_users : int
          the number of users (size of the input layer)
        tfidf_dim : int
          the number of dimensions in the item's TF-IDF
        emb_size : int
          the size of the embedding layer (default: 100)
        n_hidden : int
          the size of the hidden layer (default: 10)
        '''

        super(CollabFNet, self).__init__()
        self.user_emb = nn.Embedding(num_users, emb_size)
        self.item_tfidf = nn.Linear(tfidf_dim, emb_size)
        self.lin1 = nn.Linear(emb_size*2, n_hidden)
        self.lin2 = nn.Linear(n_hidden, 1)
        self.drop1 = nn.Dropout(0.1)

    def forward(self, u, v):
        '''
        Implements the feedforward step to produce an output
        from the inputted data.

        Parameters
        ----------
        u : tensor
          user id to be processed
        v : tensor
          the item TF-IDF to be processed

        Returns
        -------
        Predicted ratings for the given user-item pairings.
        '''

        # Determine the user and item embeddings
        U = self.user_emb(u)
        V = self.item_tfidf(v)
        # Concatenate the embeddings and feed into the model
        x = F.relu(torch.cat([U, V], dim=1))
        x = self.drop1(x)
        x = F.relu(self.lin1(x))
        x = self.lin2(x)
        return x

def train_epochs(model, epochs=10, lr=0.01, wd=0.0, unsqueeze=False):
    '''
    Trains the model for the specified number of epochs.

    Parameters
    ----------
    model : torch.nn object
      the model to be trained
    epochs : int
      the number of epochs to train for (default: 10)
    lr : float
      the learning rate (default: 0.01)
    wd : float
      the weight decay (default: 0.0)
    unsqueeze : bool
      whether to unsqueeze the ratings (default: False)

    Return
    ------
    None
    '''

    # Initiate the optimizer
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=wd)
    # Set the mode to training
    model.train()

    # Save the data to tensors
    users = torch.LongTensor(df_train.user_id.values)
    items_tfidf = torch.FloatTensor(tfidf_train)
    ratings = torch.FloatTensor(df_train.rating.values)
    # Unsqueeze the ratings if asked
    if unsqueeze:
        ratings = ratings.unsqueeze(1)

    # Train the model
    for i in range(epochs):
        # Do the prediction
        y_hat = model(users, items_tfidf)
        # Calculate the loss
        loss = F.mse_loss(y_hat, ratings)
        # Reset the gradients
        optimizer.zero_grad()
        # Capture the gradients
        loss.backward()
        # Update the weights
        optimizer.step()
        # Print the epoch loss
        print('Epoch', i+1, 'loss:', loss.item()**0.5)

    # Due to memory issue, here we will delete the tensors after usage
    # This will clear up some memories
    del users
    del items_tfidf
    del ratings
    del y_hat
    del loss
    gc.collect()

    evaluate(model, unsqueeze)

def evaluate(model, unsqueeze=False):
    '''
    Evaluates the model on the validation set

    Parameters
    ----------
    model : torch.nn object
      the model to be evaluated
    unsqueeze : bool
      whether to unsqueeze the ratings (default: False)

    Return
    ------
    None
    '''

    # Set the mode to evaluation
    model.eval()
    # Save the data to tensors
    users = torch.LongTensor(df_val.user_id.values)
    items_tfidf = torch.FloatTensor(tfidf_val)
    ratings = torch.FloatTensor(df_val.rating.values)
    # Unsqueeze the ratings if asked
    if unsqueeze:
        ratings = ratings.unsqueeze(1)
    # Do the predictions without tracking gradients (saves memory)
    with torch.no_grad():
        y_hat = model(users, items_tfidf)
        # Calculate the loss (RMSE)
        loss = F.mse_loss(y_hat, ratings)**0.5
    print("Validation loss %.3f " % loss.item())

    # Due to memory issue, here we will delete the tensors after usage
    # This will clear up some memories
    del users
    del items_tfidf
    del ratings
    del y_hat
    del loss
    gc.collect()

Exploring the training and test sets showed that some products in the test set do not appear in the training set. This creates an unseen-label problem when the model is applied to the test set. To solve it, we learn the item representation from the product name directly instead of from the product ID.

In [13]:
# Split the dataset into train and validation
np.random.seed(79)
sample = np.random.rand(len(df)) < 0.8
train = df[sample].copy()
val = df[~sample].copy()

# Encode the train and validation data
df_train = encode_data(train)
df_val = encode_data(val, train)

# Initiate the vectorizer for the product name
# Here, we use the TF-IDF vectorizer
vectorizer = TfidfVectorizer(lowercase = True,
                             analyzer = 'word',
                             ngram_range = (1, 1),
                             min_df = 5,
                             max_df = 0.95,
                             stop_words = 'english',
                             max_features = 1000)

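# Extract the features from the encoded training and validation sets
# (using df_train/df_val keeps the TF-IDF rows aligned with the encoded user IDs,
# even if encode_data drops validation rows with unseen users)
tfidf_train = vectorizer.fit_transform(df_train['product_name']).toarray()
tfidf_val = vectorizer.transform(df_val['product_name']).toarray()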

# Get the number of users and the TF-IDF dimension
num_users = len(df_train.user_id.unique())
tfidf_dim = tfidf_train.shape[1]
print(num_users, tfidf_dim)

2000 1000


Note that a grid search was run beforehand to find the optimal parameters for the model. The hyperparameters tuned include the embedding size, hidden layer size, number of epochs, learning rate, and weight decay.

In [14]:
# Initialize the seed for RNG
torch.manual_seed(79)
# Define the NN model
neural = CollabFNet(num_users,
                    tfidf_dim,
                    emb_size = 200,
                    n_hidden = 75)
# Train the NN model
train_epochs(neural, epochs = 40, lr = 0.01, wd = 1e-3, unsqueeze = True)

Epoch 1 loss: 4.005619393408904
Epoch 2 loss: 1.6626528648139098
Epoch 3 loss: 2.4986924566805957
Epoch 4 loss: 2.1373888042594706
Epoch 5 loss: 1.2573596778980476
Epoch 6 loss: 1.3014115647889763
Epoch 7 loss: 1.6985490553506593
Epoch 8 loss: 1.820989327401021
Epoch 9 loss: 1.6851783605972237
Epoch 10 loss: 1.412345265040454
Epoch 11 loss: 1.1832779566921625
Epoch 12 loss: 1.1873010669520643
Epoch 13 loss: 1.356000065077721
Epoch 14 loss: 1.4610742515166106
Epoch 15 loss: 1.4130294567069526
Epoch 16 loss: 1.27133430966575
Epoch 17 loss: 1.159181641446068
Epoch 18 loss: 1.1536467346233812
Epoch 19 loss: 1.2217514541489307
Epoch 20 loss: 1.289591177986866
Epoch 21 loss: 1.3119884129716197
Epoch 22 loss: 1.2823483481173084
Epoch 23 loss: 1.2206608391113458
Epoch 24 loss: 1.1602220612942926
Epoch 25 loss: 1.1344920368612819
Epoch 26 loss: 1.1524748646773488
Epoch 27 loss: 1.1866668991202938
Epoch 28 loss: 1.2040868605112942
Epoch 29 loss: 1.1902959934731558
Epoch 30 loss: 1.15951417565738

### Summary

Here is a summary of the validation RMSE of the three models (mean over the 5 cross-validation folds for KNN and SVD, hold-out validation set for the neural network):

| Model  | Validation RMSE |
| ------ | --------------- |
| KNN    |      0.857      |
| SVD    |      0.794      |
| Neural |      1.118      |

The SVD model has the lowest validation RMSE, so it is chosen as the final model. A detailed comparison of the models is provided in the report.

## Final Model Development

Now that the best-performing algorithm is known, we can build the final model for the prediction task by training it on the full training set.

In [15]:
data = Dataset.load_from_df(df[['user_id', 'product_id', 'rating']], Reader())
# Build the trainset
trainset = data.build_full_trainset()

# Initiate the model
svd = SVD(n_factors = 350,
          n_epochs = 150,
          lr_all = 0.1,
          reg_all = 0.02,
          random_state = 79)

# Train the model
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x79578941fca0>

## Prediction Task

Lastly, we will apply the best model to the test set to predict the ratings given the user-item pairings.

In [16]:
# Load the test data
test = pd.read_csv('./data/test.csv')

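# Fix the data integrity issue in the test set:
# 1. For product names seen in training, reuse the ID chosen in train_mapping
test['product_id'] = test.apply(lambda x: train_mapping[x.product_name] if x.product_name in train_mapping else x.product_id, axis = 1)
# 2. Unify any remaining duplicate IDs by keeping the most frequent ID per name within the test set
test_mapping = test.groupby('product_name').product_id.agg(lambda x: pd.Series.mode(x)[0]).to_dict()
test['product_id'] = test['product_name'].apply(lambda x: test_mapping[x])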

# Apply the model to the test set to predict the rating
test['rating'] = test.apply(lambda x: svd.predict(x.user_id, x.product_id).est, axis = 1)
# Save the prediction result to a new CSV file
test[['ID', 'rating']].to_csv('./labeled_test.csv', index = False)
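
Since some test products never appear in the training set, it helps to know what a biased matrix factorization model can return for them. The sketch below reflects my understanding of how Surprise's SVD handles unknown users or items: the estimate starts from the global mean, adds whichever biases are known, adds the factor inner product only when both sides are known, and is clipped to the rating scale (1 to 5 for the default `Reader()`). The function name, parameter dictionaries, and numbers are hypothetical.

```python
def predict_with_fallback(user, item, mu, user_bias, item_bias,
                          user_factors, item_factors, low=1.0, high=5.0):
    """Biased MF estimate that degrades gracefully for unseen users/items."""
    est = mu
    known_user = user in user_bias
    known_item = item in item_bias
    if known_user:
        est += user_bias[user]
    if known_item:
        est += item_bias[item]
    if known_user and known_item:
        est += sum(p * q for p, q in zip(user_factors[user], item_factors[item]))
    # Clip to the valid rating range
    return min(high, max(low, est))


# Toy learned parameters (illustrative values only)
mu = 4.2
user_bias = {1813: 0.3}
item_bias = {154533: -0.1}
user_factors = {1813: [0.5, -0.2]}
item_factors = {154533: [0.4, 0.1]}

args = (mu, user_bias, item_bias, user_factors, item_factors)
seen_pair = predict_with_fallback(1813, 154533, *args)
unseen_item = predict_with_fallback(1813, 999999, *args)
unseen_both = predict_with_fallback(-1, 999999, *args)
print(seen_pair, unseen_item, unseen_both)
```

Under these assumptions an unseen product is scored only from the global mean and the user's bias, so predictions for such pairs carry no item-specific information.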