## Introduction

We now build the test dataset. We load the cleaned interaction matrix (small_matrix) and, for each interaction, attach the previously engineered user and video feature vectors. We also keep the "watch_ratio" of each interaction, which is the "y" value to predict.

One important detail: to simulate a real-life scenario, we assume the recommender system is deployed after the learning process has finished. This is why, for each video, we take the most recent (date-wise) feature vector computed from our training dataset. As a result, each video in the test dataset has exactly one feature vector.
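The "latest feature vector per video" idea can be illustrated on toy data. This is only a sketch: the dates and values are made up, and it uses `drop_duplicates(keep="last")` as one possible way to select the most recent row per `video_id`, which is not necessarily the exact call used in the notebook.

```python
import pandas as pd

# Toy stand-in for video_df: several dated feature vectors per video.
# Dates and trend_score values are invented for illustration.
toy_video_df = pd.DataFrame({
    "video_id": [1, 1, 2, 2, 2],
    "date": pd.to_datetime(
        ["2020-07-05", "2020-08-01", "2020-07-10", "2020-07-20", "2020-07-15"]
    ),
    "trend_score": [0.1, 0.4, 0.9, 0.7, 0.8],
})

# Sort chronologically, then keep the last (most recent) row of each video.
latest_video_df = (
    toy_video_df.sort_values("date")
    .drop_duplicates(subset="video_id", keep="last")
    .reset_index(drop=True)
)
print(latest_video_df)
```

Each `video_id` now appears exactly once, so a later left merge on `video_id` cannot duplicate interactions.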

## Imports

In [1]:
import pandas as pd
import os
import joblib

## Loading the data

In [2]:
export_dir = "./exports/feature_engineered_data/"
small_matrix_cleaned = pd.read_parquet("./exports/cleaned_data/small_matrix_cleaned.pq")
user_df = pd.read_parquet(export_dir + "user_df.pq")
video_df = pd.read_parquet(export_dir + "video_df.pq")

## Test dataset creation

### Step 1: Merging everything on our interaction matrix (small matrix)

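Here we first attach the user features to each interaction. We then sort the video feature vectors by date and keep only the most recent one per video, for the reason explained earlier, before merging them on `video_id`.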

In [3]:
small_matrix_cleaned = small_matrix_cleaned.drop(columns=["video_duration", "time"])
small_matrix_cleaned = small_matrix_cleaned.merge(user_df, on='user_id', how='left')

# Keep only the most recent video feature vector:
video_df['date'] = pd.to_datetime(video_df['date']).dt.date
video_df = video_df.sort_values("date", ascending=False).groupby("video_id").head(1).reset_index(drop=True)

small_matrix_cleaned = small_matrix_cleaned.merge(video_df, on=['video_id'], how='left')
small_matrix_cleaned = small_matrix_cleaned.drop(columns=["date"])
small_matrix_cleaned

index,user_id,video_id,watch_ratio,avg_feat_0,avg_feat_1,avg_feat_2,avg_feat_3,avg_feat_4,avg_feat_5,avg_feat_6,...,category_30,category_31,category_32,category_33,category_34,category_35,category_36,category_37,category_38,category_39
0,14,148,0.722103,3.807366,1.006094,1.317424,0.000000,1.171372,1.335805,0.921278,...,0,0,0,0,0,0,0,0,0,0
1,14,183,1.907377,3.807366,1.006094,1.317424,0.000000,1.171372,1.335805,0.921278,...,0,0,0,0,0,0,0,0,0,0
2,14,3649,2.063311,3.807366,1.006094,1.317424,0.000000,1.171372,1.335805,0.921278,...,0,0,0,0,0,0,0,0,0,0
3,14,5262,0.566388,3.807366,1.006094,1.317424,0.000000,1.171372,1.335805,0.921278,...,0,0,0,0,0,0,0,0,0,0
4,14,8234,0.418364,3.807366,1.006094,1.317424,0.000000,1.171372,1.335805,0.921278,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3837173,7162,9177,0.142857,0.323295,1.300315,0.665100,0.714526,0.898477,1.414654,0.976267,...,0,0,0,0,0,0,0,0,0,0
3837174,7162,4987,1.234848,0.323295,1.300315,0.665100,0.714526,0.898477,1.414654,0.976267,...,0,0,0,0,0,0,0,0,0,0
3837175,7162,7988,1.024412,0.323295,1.300315,0.665100,0.714526,0.898477,1.414654,0.976267,...,0,0,0,0,0,0,0,0,0,0
3837176,7162,6533,0.273750,0.323295,1.300315,0.665100,0.714526,0.898477,1.414654,0.976267,...,0,0,0,0,0,0,0,0,0,0
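Because both merges are left joins, an interaction whose video has no feature vector would silently end up with missing values. A possible sanity check, sketched on toy data, uses the `validate` and `indicator` parameters of `DataFrame.merge`. The frames and values below are hypothetical; only the column names `user_id`, `video_id`, `watch_ratio` and `trend_score` come from the notebook.

```python
import pandas as pd

# Toy interactions; video 999 deliberately has no feature vector.
interactions = pd.DataFrame({
    "user_id": [14, 14, 7162],
    "video_id": [148, 183, 999],
    "watch_ratio": [0.72, 1.91, 0.14],
})
# Toy "latest feature vector per video" table (one row per video_id).
latest_videos = pd.DataFrame({
    "video_id": [148, 183],
    "trend_score": [0.3, 0.8],
})

merged = interactions.merge(
    latest_videos,
    on="video_id",
    how="left",
    validate="many_to_one",  # raises MergeError if a video_id repeats on the right
    indicator=True,          # adds a _merge column: "both" or "left_only"
)

# Interactions that found no matching video feature vector.
unmatched = merged[merged["_merge"] == "left_only"]
print(f"{len(unmatched)} interaction(s) without a video feature vector")
```

With `validate="many_to_one"`, the merge fails loudly if the deduplication step left more than one row per video, instead of quietly multiplying interactions.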


### Step 2: Extracting and scaling our features

Since the test dataset is much smaller than the training dataset, we can apply all the transformations directly on the DataFrame without converting it to NumPy arrays.

Note that we load the scalers that were fitted on the training dataset instead of fitting new ones, so that the test features are scaled consistently with the training features.

In [4]:
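# ('user_id') is a plain string, not a tuple, so `in` would do a substring test; compare directly instead.
user_feature_cols = [col for col in user_df.columns if col != 'user_id']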
video_feature_cols = [col for col in video_df.columns if col not in ('video_id', 'date')]

export_dir = "./exports/scalers/"

user_scaler = joblib.load(export_dir + "user_scaler.pkl")
small_matrix_cleaned[user_feature_cols] = user_scaler.transform(small_matrix_cleaned[user_feature_cols])

video_scaler = joblib.load(export_dir + "video_scaler.pkl")
small_matrix_cleaned[["video_duration", "trend_score"]] = video_scaler.transform(small_matrix_cleaned[["video_duration", "trend_score"]])
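The fit-once, reuse-later pattern behind this cell can be sketched end to end as follows. This is a minimal illustration under assumptions: the scaler type (`StandardScaler`) and the toy values are not taken from the notebook, which only shows that the scalers are loaded with `joblib` and applied with `transform`.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.preprocessing import StandardScaler  # assumed scaler type

cols = ["video_duration", "trend_score"]
# Toy training and test features (invented values).
train = pd.DataFrame({"video_duration": [10.0, 20.0, 30.0],
                      "trend_score": [0.0, 0.5, 1.0]})
test = pd.DataFrame({"video_duration": [20.0, 40.0],
                     "trend_score": [0.5, 1.5]})

with tempfile.TemporaryDirectory() as tmp_dir:
    scaler_path = os.path.join(tmp_dir, "video_scaler.pkl")
    # Training time: fit on the training data only, then persist the scaler.
    joblib.dump(StandardScaler().fit(train[cols]), scaler_path)
    # Test time: reload the fitted scaler; do not call fit again.
    video_scaler = joblib.load(scaler_path)

# Apply the training statistics (mean and standard deviation) to the test data.
test_scaled = test.copy()
test_scaled[cols] = video_scaler.transform(test[cols])
print(test_scaled)
```

Calling `fit` on the test data instead would compute different statistics, so the same raw value would map to different scaled values at training and test time.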

## Saving the data

In [5]:
export_dir = "./exports/test_data/"
os.makedirs(export_dir, exist_ok=True)  # create the directory if it does not exist yet
small_matrix_cleaned.to_parquet(export_dir + "test_dataset.pq")
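Downstream, the saved dataset would typically be split into model inputs and the "watch_ratio" target. The sketch below uses a small in-memory frame instead of the parquet file, and it assumes that `user_id` and `video_id` are identifiers rather than model inputs, which the notebook does not state.

```python
import pandas as pd

# Toy stand-in for the saved test dataset (a few columns only, invented trend_score values).
test_dataset = pd.DataFrame({
    "user_id": [14, 14],
    "video_id": [148, 183],
    "watch_ratio": [0.722103, 1.907377],
    "avg_feat_0": [3.807366, 3.807366],
    "trend_score": [0.3, 0.8],
})

id_cols = ["user_id", "video_id"]  # assumption: identifiers are not fed to the model
y_test = test_dataset["watch_ratio"]
X_test = test_dataset.drop(columns=id_cols + ["watch_ratio"])
print(X_test.shape, y_test.shape)
```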