# Data Preparation

In this notebook, we prepare the data for training and testing.
Since we are using two datasets (1M and 100K), we prepare each of them separately.
We use functions from the utils folder for data splitting and preparation, together with the DataFrames created in the previous notebook.

## 0. Utils folder path setup for imports

Before we start, we need to add the project root to the system path so that we can import the functions defined in the utils folder.

In [1]:
import sys
from pathlib import Path as pth
import gc
import IPython

# Add the project root to the system path so that the utils package can be imported
project_root = pth.cwd().parent  # Should go from ML_Project/notebooks to ML_Project

sys.path.insert(0, str(project_root))

print("Utils folder added to system path for imports.")

Utils folder added to system path for imports.
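The cell above assumes the notebook is started from `ML_Project/notebooks`, so that `pth.cwd().parent` is the project root. As a hedged alternative, the sketch below searches upward for a directory that contains `utils` and avoids inserting the same path twice when the cell is re-run. The names `find_project_root` and `add_to_sys_path` are hypothetical and are not part of the project's utils folder.

```python
import sys
from pathlib import Path


def find_project_root(start: Path, marker: str = "utils") -> Path:
    """Return `start` or its nearest ancestor that contains a `marker` directory."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No directory named {marker!r} found at or above {start}")


def add_to_sys_path(root: Path) -> None:
    """Prepend `root` to sys.path only if it is not already there."""
    root_str = str(root)
    if root_str not in sys.path:
        sys.path.insert(0, root_str)
```

In the notebook this would be called as `add_to_sys_path(find_project_root(Path.cwd()))`, which works from any subfolder of the project.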


## 1. Importing Necessary Libraries and Functions

This cell imports all the necessary libraries and functions for data preparation.

In [2]:
import pandas as pd
from utils.preprocessing import (
    format_rating_data,
    format_movies_data,
    load_surprise_dataset,
    split_surprise_dataset,
    load_movies_data,
    load_rating_data,
    merge_datasets,
)
from pathlib import Path as pth

## 2. Load the Datasets

This cell loads the movies and ratings datasets (1M and 100K), then cleans, formats, and merges them.
This is done with predefined functions from the utils folder.

In [3]:
# Declaring the 1M dataset file paths
ratings_1m_path = pth.cwd().parent / 'data' / 'ml-1m' / 'ratings.dat'
movies_1m_path = pth.cwd().parent / 'data' / 'ml-1m' / 'movies.dat'

# Declaring the 100K dataset file paths
ratings_100k_path = pth.cwd().parent / 'data' / 'ml-latest-small' / 'ratings.csv'
movies_100k_path = pth.cwd().parent / 'data' / 'ml-latest-small' / 'movies.csv'

# Loading the 1M dataset
ratings_1m = load_rating_data(ratings_1m_path,'dat')
movies_1m = load_movies_data(movies_1m_path,'dat')

# Cleaning and formatting 1M dataset using predefined functions
formatted_movies_1m = format_movies_data(movies_1m)
formatted_ratings_1m = format_rating_data(ratings_1m)

# Loading the 100K dataset
ratings_100k = load_rating_data(ratings_100k_path, 'csv')
movies_100k = load_movies_data(movies_100k_path, 'csv')

# Cleaning and formatting 100K dataset using predefined functions
formatted_movies_100k = format_movies_data(movies_100k)
formatted_ratings_100k = format_rating_data(ratings_100k)

# Merging datasets
merged_1m = merge_datasets(formatted_movies_1m, formatted_ratings_1m)
merged_100k = merge_datasets(formatted_movies_100k, formatted_ratings_100k)
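The two datasets ship in different file formats, which is why the loaders take a `'dat'` or `'csv'` argument: the 1M files are `::`-separated with no header row, while the 100K files are regular CSV files with a header. The implementation of the utils functions is not shown in this notebook, so the following is only a plain-Python sketch of the idea. The names `parse_ratings` and `merge_on_movie_id` are hypothetical, and the real helpers return pandas DataFrames rather than lists of dictionaries.

```python
import csv
import io

RATING_FIELDS = ["userId", "movieId", "rating", "timestamp"]


def parse_ratings(text: str, file_type: str) -> list[dict]:
    """Parse ratings from 'dat' text ('::'-separated, no header) or 'csv' text (header row)."""
    if file_type == "dat":
        rows = [dict(zip(RATING_FIELDS, line.split("::"))) for line in text.splitlines() if line]
    elif file_type == "csv":
        rows = list(csv.DictReader(io.StringIO(text)))
    else:
        raise ValueError(f"Unsupported file type: {file_type!r}")
    # Drop the timestamp, since the merged frames above do not contain it.
    # Ratings become floats here for uniformity; the notebook keeps the 1M ratings as integers.
    return [
        {"userId": int(r["userId"]), "movieId": int(r["movieId"]), "rating": float(r["rating"])}
        for r in rows
    ]


def merge_on_movie_id(movies: dict[int, dict], ratings: list[dict]) -> list[dict]:
    """Inner join: attach movie fields to every rating whose movieId is in `movies`."""
    return [{**movies[r["movieId"]], **r} for r in ratings if r["movieId"] in movies]
```

Each merged record then carries the same five fields as the merged DataFrames: `movieId`, `title`, `genres`, `userId`, and `rating`.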

### 2.1 Memory Cleanup

To optimize memory usage, we delete the unformatted and formatted datasets and the path variables, keeping only the merged datasets, and then run garbage collection.

In [4]:
# Deleting unformatted datasets to save memory
del ratings_1m
del movies_1m
del ratings_100k
del movies_100k

# Deleting formatted datasets to save memory
del formatted_movies_1m
del formatted_ratings_1m
del formatted_movies_100k
del formatted_ratings_100k

# Deleting path variables
del ratings_1m_path
del movies_1m_path
del ratings_100k_path
del movies_100k_path

# Running the garbage collector to free up memory
gc.collect()

87
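The number printed under the cell is the return value of `gc.collect()`, which is the number of unreachable objects the collector found. `del` only removes a name; an object is freed as soon as nothing references it any more, and `gc.collect()` additionally cleans up objects that keep each other alive through reference cycles. The small sketch below illustrates this with a hypothetical `Node` class that is unrelated to the project code.

```python
import gc


class Node:
    """Tiny object that can point at another Node, used to build a reference cycle."""

    def __init__(self):
        self.other = None


def collect_after_cycle() -> int:
    """Create a two-object reference cycle, drop the names, and return gc.collect()'s count."""
    gc.collect()              # clear earlier garbage so it is not counted below
    a, b = Node(), Node()
    a.other, b.other = b, a   # cycle: the reference counts cannot drop to zero on their own
    del a, b                  # the names are gone, but the cycle keeps both objects alive
    return gc.collect()       # the cycle detector finds and frees them


found = collect_after_cycle()
print("Unreachable objects found:", found)
```

The exact count depends on the Python version and on what else is in memory, which is why the values shown in this notebook differ from cell to cell.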

### 2.2 Verifying Loaded Datasets

This cell displays the shape and the info of the two merged DataFrames to verify that they were loaded correctly.

In [5]:
# Displaying the shape of the merged 1M dataset
print("1M Merged Dataset Shape:", merged_1m.shape)

# Displaying the shape of the merged 100K dataset
print("100K Merged Dataset Shape:", merged_100k.shape)

# Displaying the info of the merged 1M dataset
print("\n1M Merged Dataset Info:")
merged_1m.info()

# Displaying the info of the merged 100K dataset
print("\n100k Merged Dataset Info:")
merged_100k.info()

1M Merged Dataset Shape: (1000209, 5)
100K Merged Dataset Shape: (100836, 5)

1M Merged Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000209 entries, 0 to 1000208
Data columns (total 5 columns):
 #   Column   Non-Null Count    Dtype 
---  ------   --------------    ----- 
 0   movieId  1000209 non-null  int64 
 1   title    1000209 non-null  object
 2   genres   1000209 non-null  object
 3   userId   1000209 non-null  int64 
 4   rating   1000209 non-null  int64 
dtypes: int64(3), object(2)
memory usage: 38.2+ MB

100k Merged Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   movieId  100836 non-null  int64  
 1   title    100836 non-null  object 
 2   genres   100836 non-null  object 
 3   userId   100836 non-null  int64  
 4   rating   100836 non-null  float64
dtypes: float64(1), int64(2), object(2)
memory usage

## 3. Data Preparation for Training and Testing

In this cell, we prepare the datasets for training and testing using functions from the utils folder.
We start by converting them from pandas DataFrames to the Dataset format of the Surprise library.
Afterwards, we split each dataset into a training set and a testing set.
The split is 80% training and 20% testing.

In [6]:
# Preparing the 1M dataset for the train test split
surprise_1m = load_surprise_dataset(merged_1m)

# Preparing the 100k dataset for the train test split
surprise_100k = load_surprise_dataset(merged_100k)

# Splitting the 1M dataset into training and testing sets
train_1m, test_1m = split_surprise_dataset(surprise_1m, test_size=0.2)

# Splitting the 100k dataset into training and testing sets
train_100k, test_100k = split_surprise_dataset(surprise_100k, test_size=0.2)
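The body of `split_surprise_dataset` is not shown here; it presumably wraps the `train_test_split` function that Surprise provides in `surprise.model_selection`. Conceptually, a random 80/20 split shuffles the ratings and sets aside one fifth of them for testing. The sketch below illustrates that idea in plain Python and is not Surprise's implementation. The function name `split_ratings` and the `seed` parameter are assumptions, since the notebook does not show whether the split is seeded.

```python
import random


def split_ratings(ratings, test_size=0.2, seed=42):
    """Shuffle (user, item, rating) tuples and hold out a `test_size` share as the test set."""
    if not 0 < test_size < 1:
        raise ValueError("test_size must be strictly between 0 and 1")
    shuffled = list(ratings)               # copy, so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)  # a fixed seed makes the split reproducible
    n_test = round(len(shuffled) * test_size)
    n_train = len(shuffled) - n_test
    return shuffled[:n_train], shuffled[n_train:]


# Ten made-up ratings: (user id, movie id, rating)
sample = [(user, movie, 3.0) for user in range(2) for movie in range(5)]
train, test = split_ratings(sample, test_size=0.2)
print(len(train), len(test))
```

If the split is not seeded, re-running the notebook produces different training and testing sets, which makes results harder to compare between runs.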

### 3.1 Memory Cleanup After Preparation

To optimize memory usage, we delete the merged DataFrames and the Surprise datasets, which are no longer needed after the split, and run garbage collection.

In [7]:
# Deleting merged datasets to save memory
del merged_1m
del merged_100k
# Deleting surprise datasets to save memory
del surprise_1m
del surprise_100k
# Calling garbage collector to free up memory
gc.collect()

60

## 4. Saving Prepared Data to Disk for Access in the Training Notebook

In this cell, we will save the prepared training and testing datasets to disk using the pickle format.
This will allow us to easily load the datasets in the training notebook without having to repeat the preparation steps.

In [8]:
import pickle as plk

# Defining the paths to save the prepared datasets
train_1m_path = project_root / 'data' / 'prepared-1m' / 'train_1m.pkl'
test_1m_path = project_root / 'data' / 'prepared-1m' / 'test_1m.pkl'
train_100k_path = project_root / 'data' / 'prepared-100k' / 'train_100k.pkl'
test_100k_path = project_root / 'data' / 'prepared-100k' / 'test_100k.pkl'

# Saving the 1M training dataset
with open(train_1m_path, 'wb') as f:
    plk.dump(train_1m, f)
# Saving the 1M testing dataset
with open(test_1m_path, 'wb') as f:
    plk.dump(test_1m, f)
# Saving the 100k training dataset
with open(train_100k_path, 'wb') as f:
    plk.dump(train_100k, f)
# Saving the 100k testing dataset
with open(test_100k_path, 'wb') as f:
    plk.dump(test_100k, f)
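Note that `open(path, 'wb')` raises `FileNotFoundError` if the target folder (for example `data/prepared-1m`) does not exist yet. A minimal sketch of a save/load pair that creates the folder first is shown below. The helper names `save_pickle` and `load_pickle` are hypothetical and are not part of the utils folder.

```python
import pickle
from pathlib import Path


def save_pickle(obj, path: Path) -> None:
    """Pickle `obj` to `path`, creating the parent directory if it is missing."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_pickle(path: Path):
    """Load an object written by save_pickle. Only unpickle files from a trusted source."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

In the training notebook, the same paths can be passed to `load_pickle` to restore the training and testing sets. The library that defines the pickled objects (here Surprise) must be installed in that environment as well, because pickle stores a reference to each object's class rather than the class itself.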

## 5. Final Cleanup

After saving the prepared datasets, we delete the remaining variables and run garbage collection to free the memory they used.

In [9]:
# Cleaning up the memory used by the training and testing datasets
del train_1m
del test_1m
del train_100k
del test_100k
# Cleaning up any other variables that are no longer needed
del project_root
del train_1m_path
del test_1m_path
del train_100k_path
del test_100k_path
# Calling garbage collector to free up memory
gc.collect()

0

## 6. Jupyter notebook shutdown

In [10]:
# Shutdown the Jupyter notebook kernel programmatically
print("Shutting down the Jupyter notebook kernel for this notebook...")
IPython.get_ipython().kernel.do_shutdown(restart=False)

Shutting down the Jupyter notebook kernel for this notebook...


{'status': 'ok', 'restart': False}

## 7. Conclusion

In this notebook, we successfully prepared the 1M and 100K datasets for training and testing.
We formatted the datasets, split them into training and testing sets, and saved them to disk for easy access in the training notebook.