# Modeling

In this notebook, we will build the recommendation models using the prepared datasets.
We will be using the Surprise library to implement both SVD and KNNBaseline algorithms.
The choice of these algorithms is based on their effectiveness in collaborative filtering tasks with minimal preprocessing requirements.

## 1. Importing Libraries

In this cell, we import the libraries and functions needed for model building and for loading data from disk.

In [1]:
from surprise import SVD, KNNBaseline
import pickle as plk
from pathlib import Path as pth
import IPython
import gc

## 2. Loading Prepared Data from the Previous Notebook

In this cell, we will load the prepared training datasets from disk using the pickle format.
This will allow us to easily access the datasets for model training while avoiding redundant data preparation steps.

In [2]:
# Defining the paths to load the prepared datasets
train_1m_path = pth.cwd().parent / 'data' / 'prepared-1m' / 'train_1m.pkl'
train_100k_path = pth.cwd().parent / 'data' / 'prepared-100k' / 'train_100k.pkl'

# Loading the 1M training dataset
with open(train_1m_path, 'rb') as f:
    train_1m = plk.load(f)
# Loading the 100k training dataset
with open(train_100k_path, 'rb') as f:
    train_100k = plk.load(f)
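
The loading pattern above can also be wrapped in a small helper that fails with a clear message when a file is missing. This is a minimal sketch, not part of the original notebook: `load_pickle` is a hypothetical name, and it should only be used on pickle files you created yourself, since unpickling untrusted data can execute arbitrary code.

```python
import pickle
from pathlib import Path
from typing import Any


def load_pickle(path: Path) -> Any:
    """Load one pickled object from `path` (hypothetical helper)."""
    path = Path(path)
    if not path.is_file():
        # Path.cwd() depends on where the kernel was started, so report the
        # fully resolved path to make a wrong working directory easy to spot.
        raise FileNotFoundError(f"Prepared dataset not found: {path.resolve()}")
    with open(path, "rb") as f:
        return pickle.load(f)
```

With such a helper, the loading cell could read `train_1m = load_pickle(train_1m_path)`.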

## 3. Training the models

In these cells, we will train the SVD and KNNBaseline models on both the 1M and 100k datasets.
We will use hyperparameter values reported online as performing well instead of tuning them with grid search, which avoids possible out-of-memory (OOM) issues.

We will also save the trained models to disk for later evaluation in the next notebook.

### 3.1 Training SVD on 1M dataset

In this cell, we will train the SVD model on the 1M dataset using the best hyperparameters found online and save the trained model to disk.

In [3]:
# Defining the best hyperparameters for SVD on 1M dataset
svd_1m_params = {
    'n_factors': 100,
    'n_epochs': 20
}

# Initializing the SVD model on 1M dataset
svd_1m = SVD(**svd_1m_params)

# Fitting the model to the training data
svd_1m.fit(train_1m)

# Saving the trained SVD model to disk
svd_1m_path = pth.cwd().parent / 'models' / 'svd_1m.pkl'
with open(svd_1m_path, 'wb') as f:
    plk.dump(svd_1m, f)
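
The save step above relies on the `models` directory already existing; opening a file for writing inside a missing directory raises `FileNotFoundError`. A hedged sketch of a save helper that creates the directory first (`save_pickle` is a hypothetical name, not a Surprise or notebook function):

```python
import pickle
from pathlib import Path
from typing import Any


def save_pickle(obj: Any, path: Path) -> Path:
    """Pickle `obj` to `path`, creating parent directories if needed."""
    path = Path(path)
    # exist_ok=True makes this a no-op when the directory is already there
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return path
```

A call such as `save_pickle(svd_1m, svd_1m_path)` could then stand in for the `with open(...)` block.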

#### 3.1.1 Cleanup after training SVD on 1M dataset

In this cell, we will delete the trained SVD model for the 1M dataset from memory and run garbage collection to free up memory.

In [4]:
# Deleting the trained SVD model for 1M dataset to save memory
del svd_1m

# Deleting the training hyperparameters dictionary
del svd_1m_params

# Deleting the trained model path variable
del svd_1m_path

# calling garbage collector to free up memory
gc.collect()

64
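
The number printed after `gc.collect()` is the count of unreachable objects the collector found in that pass. In CPython, `del` on an object with no other references frees it immediately through reference counting, so `gc.collect()` only has work to do when reference cycles exist. The following sketch illustrates the difference with a hypothetical `Node` class; it is not tied to Surprise.

```python
import gc
import weakref


class Node:
    """Hypothetical object that can point at another Node."""

    def __init__(self):
        self.other = None


# Case 1: no cycle. Reference counting frees the object as soon as
# the last name bound to it is deleted (CPython behavior).
plain = Node()
plain_ref = weakref.ref(plain)
del plain
freed_without_gc = plain_ref() is None

# Case 2: a reference cycle. The two nodes keep each other alive,
# so only the cyclic garbage collector can reclaim them.
gc.collect()  # clear any garbage left over from earlier code
a, b = Node(), Node()
a.other, b.other = b, a
cycle_ref = weakref.ref(a)
del a, b
collected = gc.collect()  # number of unreachable objects found
freed_after_gc = cycle_ref() is None

print(freed_without_gc, collected > 0, freed_after_gc)
```

This is consistent with the cleanup cells in this notebook that report `0`: the collector found no cyclic garbage in that pass, which suggests the `del` statements had already released the objects.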

### 3.2 Training KNNBaseline on 1M dataset

In this cell, we will train the KNNBaseline model on the 1M dataset using the best hyperparameters found online and save the trained model to disk.

In [5]:
# Defining the best hyperparameters for KNNBaseline on 1M dataset
knn_1m_params = {
    'k': 40,
    'sim_options': {
        'name': 'pearson_baseline',
        'user_based': False  # Item-based collaborative filtering to save memory
    }
}

# Initializing the KNNBaseline model on 1M dataset
knn_1m = KNNBaseline(**knn_1m_params)

# Fitting the model to the training data
knn_1m.fit(train_1m)

# Saving the trained KNNBaseline model to disk
knn_1m_path = pth.cwd().parent / 'models' / 'knn_1m.pkl'
with open(knn_1m_path, 'wb') as f:
    plk.dump(knn_1m, f)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
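
The memory argument behind `'user_based': False` can be made concrete with a rough estimate. A neighborhood model keeps a square similarity matrix whose side is the number of users (user-based) or the number of items (item-based). The sketch below assumes a dense matrix of 8-byte floats and, purely as an assumption, MovieLens-1M-like sizes of about 6,040 users and 3,706 rated items; neither the storage layout nor the counts are stated in this notebook, and `similarity_matrix_bytes` is a hypothetical helper.

```python
def similarity_matrix_bytes(n_entities: int, bytes_per_value: int = 8) -> int:
    """Rough size in bytes of a dense n x n similarity matrix."""
    return n_entities * n_entities * bytes_per_value


# Assumed sizes (not taken from the notebook): MovieLens-1M-like counts
n_users, n_items = 6040, 3706

user_based = similarity_matrix_bytes(n_users)
item_based = similarity_matrix_bytes(n_items)

print(f"user-based: {user_based / 2**20:.0f} MiB")
print(f"item-based: {item_based / 2**20:.0f} MiB")
```

Because the size grows with the square of the entity count, whichever side has fewer entities gives the smaller matrix; under the assumed counts that is the item side.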


#### 3.2.1 Cleanup after training KNNBaseline on 1M dataset

In this cell, we will delete the trained KNNBaseline model for the 1M dataset from memory and run garbage collection to free up memory.

In [6]:
# Deleting the trained KNNBaseline model for 1M dataset to save memory
del knn_1m

# Deleting the training hyperparameters dictionary
del knn_1m_params

# Deleting the trained model path variable
del knn_1m_path

# calling garbage collector to free up memory
gc.collect()

0

### 3.3 Cleanup after training on 1M dataset

In this cell, we will delete the 1M training dataset from memory and run garbage collection to free up memory.

In [7]:
# Deleting the 1M training dataset to save memory
del train_1m

# calling garbage collector to free up memory
gc.collect()

0

### 3.4 Training SVD on 100k dataset

In this cell, we will train the SVD model on the 100k dataset using the best hyperparameters found online and save the trained model to disk.

In [8]:
# Defining the best hyperparameters for SVD on 100k dataset
svd_100k_params = {
    'n_factors': 50,
    'n_epochs': 15
}

# Initializing the SVD model on 100k dataset
svd_100k = SVD(**svd_100k_params)

# Fitting the model to the training data
svd_100k.fit(train_100k)

# Saving the trained SVD model to disk
svd_100k_path = pth.cwd().parent / 'models' / 'svd_100k.pkl'
with open(svd_100k_path, 'wb') as f:
    plk.dump(svd_100k, f)

#### 3.4.1 Cleanup after training SVD on 100k dataset

In this cell, we will delete the trained SVD model for the 100k dataset from memory and run garbage collection to free up memory.

In [9]:
# Deleting the trained SVD model for 100k dataset to save memory
del svd_100k

# Deleting the training hyperparameters dictionary
del svd_100k_params

# Deleting the trained model path variable
del svd_100k_path

# calling garbage collector to free up memory
gc.collect()

0

### 3.5 Training KNNBaseline on 100k dataset

In this cell, we will train the KNNBaseline model on the 100k dataset using the best hyperparameters found online and save the trained model to disk.

In [10]:
# Defining the best hyperparameters for KNNBaseline on 100k dataset
knn_100k_params = {
    'k': 30,
    'sim_options': {
        'name': 'pearson_baseline',
        'user_based': True  # User-based collaborative filtering is feasible on the smaller dataset without OOM risk and may give better recommendations
    }
}

# Initializing the KNNBaseline model on 100k dataset
knn_100k = KNNBaseline(**knn_100k_params)

# Fitting the model to the training data
knn_100k.fit(train_100k)

# Saving the trained KNNBaseline model to disk
knn_100k_path = pth.cwd().parent / 'models' / 'knn_100k.pkl'
with open(knn_100k_path, 'wb') as f:
    plk.dump(knn_100k, f)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.


#### 3.5.1 Cleanup after training KNNBaseline on 100k dataset

In this cell, we will delete the trained KNNBaseline model for the 100k dataset from memory and run garbage collection to free up memory.

In [11]:
# Deleting the trained KNNBaseline model for 100k dataset to save memory
del knn_100k

# Deleting the training hyperparameters dictionary
del knn_100k_params

# Deleting the trained model path variable
del knn_100k_path

# calling garbage collector to free up memory
gc.collect()

0

### 3.6 Cleanup after training on 100k dataset

In this cell, we will delete the 100k training dataset from memory and run garbage collection to free up memory.

In [12]:
# Deleting the 100k training dataset to save memory
del train_100k

# calling garbage collector to free up memory
gc.collect()

0

## 4. Jupyter notebook shutdown

In [13]:
# Shutdown the Jupyter notebook kernel programmatically
print("Shutting down the Jupyter notebook kernel for this notebook...")
IPython.get_ipython().kernel.do_shutdown(restart=False)

Shutting down the Jupyter notebook kernel for this notebook...


{'status': 'ok', 'restart': False}

## 5. Conclusion

In this notebook, we successfully built and trained recommendation models using the SVD and KNNBaseline algorithms from the Surprise library.
We utilized the prepared datasets from the previous notebook and saved the trained models to disk for later evaluation.
The next step will involve evaluating these models to assess their performance and effectiveness in making recommendations.