# Feature Engineering & Model Development

Please run eda.ipynb first

### Data Loading
Load necessary libraries and datasets.

In [15]:
import numpy as np
import pandas as pd

import seaborn as sns
from sklearn.preprocessing import OneHotEncoder, MultiLabelBinarizer, StandardScaler
from sklearn.metrics import mean_absolute_error, ndcg_score, precision_score, recall_score
from sklearn.decomposition import PCA

import tensorflow as tf
import keras
from keras.api.layers import Concatenate, Dense, Dot, Dropout, Input
from keras.api.models import Model, Sequential
from keras.api.optimizers import Adam

In [2]:
base_dir = "processed_data/"

big_matrix = pd.read_csv(base_dir + "big_matrix.csv")
small_matrix = pd.read_csv(base_dir + "small_matrix.csv")
social_network = pd.read_csv(base_dir + "social_network.csv")
user_features = pd.read_csv(base_dir + "user_features.csv")
item_daily_features = pd.read_csv(base_dir + "item_daily_features.csv")
item_categories = pd.read_csv(base_dir + "item_categories.csv")

## Categorical Data Preprocessing with MultiLabelBinarizer

- The sklearn preprocessing and evaluation tools (OneHotEncoder, MultiLabelBinarizer, StandardScaler, PCA, and various metrics) were imported in the first cell.
- Initialize a MultiLabelBinarizer to convert lists of multiple labels into binary vectors (each column represents a category, with 1 indicating presence and 0 indicating absence).
- Apply the MultiLabelBinarizer to the "feat" column of item_categories, which holds the list of categories for each video. MultiLabelBinarizer iterates over each entry, so the entries must be real Python lists; if read_csv returned them as strings such as "[27, 9]", parse them first, otherwise every character is treated as a label.
- Result: a DataFrame where each column is a unique category extracted from all lists, and each row shows a 0 or 1 indicating whether that category is present for a given video_id.
- Reset the index so that rows are addressed by position, then cast the DataFrame’s columns to int16 to reduce memory usage.

In [3]:
mlb = MultiLabelBinarizer()

item_categories = pd.DataFrame(mlb.fit_transform(item_categories["feat"]),
                  columns=mlb.classes_,
                  index=item_categories["video_id"])

item_categories.reset_index(drop=True, inplace=True)
item_categories[item_categories.columns] = item_categories[item_categories.columns].astype("int16")
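
As a minimal sketch of the binarization step, the toy table below assumes the "feat" lists arrive as strings after a CSV round trip; the frame `toy` and its values are made up for illustration and are not taken from the dataset.

```python
import ast

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical stand-in for item_categories as it might look after pd.read_csv:
# the "feat" lists are strings, not Python lists.
toy = pd.DataFrame({"video_id": [0, 1, 2], "feat": ["[8]", "[27, 9]", "[9]"]})

# Parse a value only when it is still a string.
parsed = toy["feat"].apply(lambda v: ast.literal_eval(v) if isinstance(v, str) else v)

mlb_toy = MultiLabelBinarizer()
encoded = pd.DataFrame(
    mlb_toy.fit_transform(parsed),
    columns=mlb_toy.classes_,
    index=toy["video_id"],
).astype("int16")

print(encoded)
```

Each distinct category becomes one column, and a video with two categories gets two 1s in its row.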

## Definition of Feature Groups

- text_features: list of column names representing categorical or textual variables (e.g., video type, upload type).

- integer_feature: list of column names corresponding to integer-valued variables, such as durations, dimensions, identifiers, or video-related statistics (e.g., video duration, view count, number of users). Note that the identifiers (music_id, video_tag_id) are treated as plain numbers here, not as categories.

This separation allows different processing depending on the data type (encoding for categorical features, scaling or other transformations for numerical features).

In [4]:
text_features = ["video_type", "upload_type"]
integer_feature = ['video_duration','video_width', 'video_height', 'music_id', 'video_tag_id','show_cnt', 'show_user_num', 'play_cnt', 'play_user_num',
                'play_duration', 'complete_play_cnt', 'complete_play_user_num', 'valid_play_cnt', 'valid_play_user_num', 'long_time_play_cnt',
                'long_time_play_user_num', 'short_time_play_cnt', 'short_time_play_user_num', 'play_progress']

## Extraction and Encoding of Daily Video Features

- Filter item_daily_features to keep only the row with the latest date (maximum date) for each video_id. This selects the most recent daily observation for each video.
- Select the categorical/textual columns (text_features) from this subset.
- Apply One-Hot Encoding (OneHotEncoder) to these text columns; handle_unknown="ignore" encodes categories not seen during fitting as all zeros instead of raising an error.
- Convert the encoded result to a NumPy array, then to a pandas DataFrame with explicit column names (e.g., video_type_A, upload_type_B).
- Reconstruct the final item_daily_features DataFrame by concatenating the numerical columns (integer_feature) with the encoded columns.
- Concatenate the result column-wise with the binarized categories (item_categories) to create a comprehensive item_features_map table that aggregates all video features. The concatenation aligns on the row index, so it assumes both tables list the same videos in the same order.
- Explicitly cast all column names to strings and store the list of column names in item_features_columns for easier future access.

In [None]:
item_daily_features = item_daily_features.loc[item_daily_features.groupby("video_id")["date"].idxmax()].reset_index(drop=True)

text_daily_features = item_daily_features[text_features]
onehotter = OneHotEncoder(handle_unknown="ignore")
onehot_array = onehotter.fit_transform(text_daily_features).toarray()

text_daily_features = pd.DataFrame(
    onehot_array,
    columns=onehotter.get_feature_names_out(text_features),
    index=item_daily_features.index)

item_daily_features = pd.concat([item_daily_features[integer_feature], text_daily_features], axis=1)

In [5]:
item_features_map = pd.concat([item_daily_features, item_categories], axis=1)
item_features_map.columns = item_features_map.columns.map(str)
item_features_columns = item_features_map.columns.tolist()

## Selection and Formatting of User Features

- Define a list user_features_columns containing the names of the columns of interest from the user_features DataFrame. These include binary indicators (e.g., is_lowactive_period, is_live_streamer, is_video_author) and several one-hot encoded variables (onehot_featX).

- Create a new DataFrame user_features_map that retains only these columns, making a copy to avoid modifying the original.

- Cast all columns explicitly to int16 to reduce memory usage, assuming the values are binary flags or other small integers.

In [6]:
user_features_columns = [
    "is_lowactive_period","is_live_streamer", "is_video_author",
    "onehot_feat0", "onehot_feat1", "onehot_feat2", "onehot_feat3",
    "onehot_feat4", "onehot_feat5", "onehot_feat6", "onehot_feat7",
    "onehot_feat8", "onehot_feat9", "onehot_feat10", "onehot_feat11",
    "onehot_feat12", "onehot_feat13", "onehot_feat14", "onehot_feat15",
    "onehot_feat16", "onehot_feat17"
]
user_features_map = user_features[user_features_columns].copy()

user_features_map[user_features_map.columns] = user_features_map[user_features_map.columns].astype("int16")

## Preparation of Training and Test Sets for User–Video Interactions

- Define the constant `INTERACTION_N` as 2,500,000, the desired sample size.  
- Randomly sample `INTERACTION_N` rows from `big_matrix` (with a fixed `random_state` for reproducibility) to create `y_train_df`, the training set.  
- Extract the unique users (`train_users`) and videos (`train_videos`) present in `y_train_df`.  
- Filter `small_matrix` to retain only interactions whose user is in `train_users` and whose video is in `train_videos`, storing the result in `filtered_small`.  
- Randomly sample from `filtered_small` to create `y_test_df`, with size equal to the minimum of `INTERACTION_N` and `filtered_small.shape[0]`.  
- Split the `user_id` and `video_id` columns into separate variables for both sets:  
  - Training: `user_ids_train`, `video_ids_train`  
  - Testing: `user_ids_test`, `video_ids_test`  
- Extract the target values `y_train` and `y_test` from `y_train_df` and `y_test_df`, respectively.  


In [7]:
INTERACTION_N = 2500000

# Step 1: Sample training interactions from big_matrix
y_train_df = big_matrix.sample(n=INTERACTION_N, random_state=42).copy()

# Step 2: Get user and video IDs seen in training
train_users = set(y_train_df["user_id"])
train_videos = set(y_train_df["video_id"])

# Step 3: Filter small_matrix to rows whose user AND video both appear in training
filtered_small = small_matrix[
    small_matrix["user_id"].isin(train_users) &
    small_matrix["video_id"].isin(train_videos)
].copy()

# Optional: Limit test size to INTERACTION_N or smaller
y_test_df = filtered_small.sample(n=min(INTERACTION_N, len(filtered_small)), random_state=42)

# Step 4: Extract IDs for later use
user_ids_train, video_ids_train = y_train_df["user_id"], y_train_df["video_id"]
user_ids_test, video_ids_test = y_test_df["user_id"], y_test_df["video_id"]

# Step 5: Extract targets
y_train = y_train_df[["watch_ratio"]].values
y_test = y_test_df[["watch_ratio"]].values

## Extraction of User and Video Features for Training and Test Sets

- Use `user_ids_train` to select matching user feature vectors from `user_features_map` with `.iloc`, storing them in `user_features_train`.  
- Perform the same operation for training videos: extract item features from `item_features_map` using `video_ids_train`, storing them in `item_features_train`.  
- Repeat this process for the test set with `user_ids_test` and `video_ids_test`, producing `user_features_test` and `item_features_test`.  
- Convert the results to NumPy arrays via `.values` so they can be passed directly to the model.  

Because `.iloc` selects by row position, this lookup assumes that row *i* of each feature table holds the features of ID *i*, i.e. the IDs are contiguous, zero-based, and the tables are sorted by ID.  

This step prepares the input features required to train and evaluate a user–item interaction prediction model.  


In [8]:
user_features_train = user_features_map.iloc[user_ids_train].values
item_features_train = item_features_map.iloc[video_ids_train].values
user_features_test = user_features_map.iloc[user_ids_test].values
item_features_test = item_features_map.iloc[video_ids_test].values
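
The positional lookup above only returns the right rows when row order and IDs coincide. The sketch below, on a made-up table `toy_users`, contrasts `.iloc` with a label-based `.loc` lookup on an ID index; it is offered as a possible safeguard, not as what the notebook does.

```python
import pandas as pd

# Hypothetical feature table whose row order does NOT match the IDs.
toy_users = pd.DataFrame({"user_id": [2, 0, 1], "is_video_author": [1, 0, 0]})
ids = pd.Series([0, 2, 2]).to_numpy()

# Positional lookup: returns rows 0, 2, 2 regardless of the user_id stored there.
by_position = toy_users.iloc[ids]["is_video_author"].tolist()

# Label lookup: returns the rows belonging to users 0, 2, 2.
by_label = toy_users.set_index("user_id").loc[ids, "is_video_author"].tolist()

print(by_position, by_label)
```

When the table is sorted by a contiguous zero-based ID, both lookups agree, which is the assumption the cell above relies on.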

## Normalization of Features and Targets with StandardScaler

- Instantiate a `StandardScaler` for item features (`scalerItem`), which standardizes data to zero mean and unit variance.  
  - Fit and transform on the training set (`item_features_train`).  
  - Transform the test set (`item_features_test`) using the same scaler (no refitting).  
- Apply the same process to user features using `scalerUser`.  
- Use a separate `StandardScaler` (`scalerTarget`) to normalize target values `y_train` and `y_test`.  

Standardization matters for gradient-based models such as the neural network trained later in this notebook: it keeps features with large raw scales (e.g., play counts) from dominating the weight updates.  


In [9]:
scalerItem = StandardScaler()
item_features_train = scalerItem.fit_transform(item_features_train)
item_features_test = scalerItem.transform(item_features_test)

In [10]:
scalerUser = StandardScaler()
user_features_train = scalerUser.fit_transform(user_features_train)
user_features_test = scalerUser.transform(user_features_test)

In [11]:
scalerTarget = StandardScaler()
y_train = scalerTarget.fit_transform(y_train)
y_test = scalerTarget.transform(y_test)
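
A minimal sketch of the fit-on-train, transform-on-test pattern, using made-up numbers. It also shows that `inverse_transform` undoes the scaling, which is what later allows predictions to be mapped back to the original target scale.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical one-column data; shapes are (n_samples, n_features).
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # statistics come from the training data only
test_scaled = scaler.transform(test)        # the same mean and scale are reused here

# Round trip back to the original units.
restored = scaler.inverse_transform(test_scaled)

print(scaler.mean_, test_scaled, restored)
```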

## Dimensionality Reduction with PCA (Principal Component Analysis)

- Initialize a PCA with `n_components=0.95`, selecting enough components to explain 95% of the data variance.  
- Apply PCA to item features:  
  - Fit and transform on the training set (`fit_transform`).  
  - Transform the test set using the same model (`transform`).  
- Apply the same process to user features.  

This step reduces the dimensionality of the data while retaining the majority of the information, helping to streamline downstream models and reduce computational cost.  


In [12]:
pca_item = PCA(n_components=0.95)
item_features_train = pca_item.fit_transform(item_features_train)
item_features_test = pca_item.transform(item_features_test)

In [13]:
pca_user = PCA(n_components=0.95)
user_features_train = pca_user.fit_transform(user_features_train)
user_features_test = pca_user.transform(user_features_test)
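
To illustrate what a fractional `n_components` does, the sketch below builds synthetic data (an assumption made purely for illustration) in which ten observed columns are noisy mixes of two latent factors, then lets PCA pick the number of components needed to reach 95% explained variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 10 observed features driven by 2 latent factors plus small noise.
latent = rng.normal(size=(500, 2))
X = latent @ rng.normal(size=(2, 10)) + 0.01 * rng.normal(size=(500, 10))

pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

# n_components_ is the number of components PCA actually kept.
print(pca.n_components_, pca.explained_variance_ratio_.sum())
```

Inspecting `n_components_` on the real `pca_item` and `pca_user` objects would show how many dimensions each tower actually receives.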

## Construction of a Neural Network to Encode Item Features

- TensorFlow, Keras, and the required layers and classes (`Input`, `Dense`, `Dropout`, `Sequential`, etc.) were imported in the first cell.  
- Define input dimensions for users (`user_dim`) and items (`item_dim`) based on the training data after PCA.  
- Create a Keras input `input_item` corresponding to a vector of size `item_dim` (reduced item features).  
- Define a sequential model `item_NN` to encode item features:  
  1. Dense layer with 128 units and ReLU activation  
  2. Dropout layer with rate = 0.2 for regularization  
  3. Dense layer with 64 units and ReLU activation  
  4. Final Dense layer with 32 units and ReLU activation to produce the **item embedding**, a compact dense representation of the item  
- Apply `item_NN` to `input_item` to obtain `item_embedding`.  

This architecture learns a dense representation of items from their features, facilitating subsequent interaction modeling.  


In [16]:
user_dim = user_features_train.shape[1]
item_dim = item_features_train.shape[1]

input_item = Input(shape=(item_dim,), name="item_input")

item_NN = Sequential(
    [
        Dense(128, activation="relu", name="item_x"),
        Dropout(0.2),
        Dense(64, activation="relu"),
        Dense(32, activation="relu", name="item_embedding"),
    ], name="item_NN"
)

item_embedding = item_NN(input_item)

## Construction of a Neural Network to Encode User Features

- Create a Keras input `input_user` with shape `user_dim`, representing the user feature vectors.  
- Define a sequential model `user_NN` analogous to the item encoder:  
  1. Dense layer with 128 units and ReLU activation  
  2. Dropout layer with rate 0.2 to reduce overfitting  
  3. Dense layer with 64 units and ReLU activation  
  4. Final Dense layer with 32 units and ReLU activation to produce the **user embedding**, a compact dense representation of user characteristics  
- Apply the `user_NN` model to `input_user` to obtain `user_embedding`.  

This step learns a dense numerical representation of users, comparable to item embeddings, for modeling their interactions.  


In [17]:
input_user = Input(shape=(user_dim,), name="user_input")

user_NN = Sequential(
    [
        Dense(128, activation="relu", name="user_x"),
        Dropout(0.2),
        Dense(64, activation="relu"),
        Dense(32, activation="relu", name="user_embedding"),
    ], name="user_NN"
)
user_embedding = user_NN(input_user)

## Construction and Compilation of the Final Prediction Model (“Two-Tower”)

- Merge embeddings: concatenate the item embedding (item_embedding) and the user embedding (user_embedding) to form a joint representation.
- Dense network: pass the combined vector through a sequence of dense layers:
  - Dense layer with 64 neurons and ReLU activation
  - Dropout layer (rate = 0.2) to mitigate overfitting
  - Dense layer with 32 neurons and ReLU activation
  - Output Dense layer with 1 neuron (no activation) to predict a continuous target (e.g., watch ratio)
- Model definition: instantiate a Keras Model named Two_Tower_Model, specifying two inputs (input_user, input_item) and the single output tensor.
- Compilation:
  - Optimizer: Adam with a learning rate of 0.001
  - Loss: mean squared error (mse), suitable for regression
  - Metric: mean absolute error (mae) for performance monitoring
- Architecture summary: call model.summary() to display the full layer-by-layer structure.

In [18]:
combined = Concatenate()([item_embedding, user_embedding])
x = Dense(64, activation="relu")(combined)
x = Dropout(0.2)(x)
x = Dense(32, activation="relu")(x)
output = Dense(1)(x)


model = Model(inputs=[input_user, input_item], outputs=output, name="Two_Tower_Model")

model.compile(
    optimizer=Adam(learning_rate=1e-3),
    loss="mse",
    metrics=["mae"],
)

model.summary()
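
The layer sizes above imply a fixed sequence of tensor shapes. The NumPy sketch below walks through them with random weights and no biases; it is a shape illustration only, not the Keras model, and the batch size and input dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def dense(x, units, relu=True):
    """One randomly initialised dense layer (no bias), for shape illustration only."""
    w = rng.normal(scale=0.1, size=(x.shape[1], units))
    out = x @ w
    return np.maximum(out, 0.0) if relu else out


def tower(x):
    # 128 -> 64 -> 32 units, all ReLU; dropout is inactive at inference time.
    return dense(dense(dense(x, 128), 64), 32)


batch, toy_user_dim, toy_item_dim = 4, 10, 20  # hypothetical sizes
user_emb = tower(rng.normal(size=(batch, toy_user_dim)))
item_emb = tower(rng.normal(size=(batch, toy_item_dim)))

# Concatenating two 32-dimensional embeddings gives a 64-dimensional joint vector.
combined = np.concatenate([item_emb, user_emb], axis=1)

# Head: 64 -> 32 -> 1, with a linear output for regression.
prediction = dense(dense(dense(combined, 64), 32), 1, relu=False)

print(user_emb.shape, item_emb.shape, combined.shape, prediction.shape)
```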

## Model Training with Early Stopping

Call the Keras fit method to train the model on the training data:
- Inputs: user and item features (user_features_train, item_features_train)
- Targets: normalized values y_train
- Validation data: the test split (user_features_test, item_features_test, y_test), evaluated after each epoch

Training parameters:
- Epochs: up to 20
- Batch size: 512 examples per iteration
- EarlyStopping callback: halts training when the monitored quantity has not improved for 3 consecutive epochs and restores the best weights seen so far.

As written, the callback monitors `loss`, i.e. the training loss, so it stops training once optimization stalls. Guarding against overfitting would instead require monitoring `val_loss`.
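
The patience rule can be sketched in plain Python. The helper `early_stopping_epoch` is hypothetical and only imitates the default behaviour of the callback (strict improvement, no minimum delta); the loss values are made up.

```python
def early_stopping_epoch(losses, patience=3):
    """Return (stop_epoch, best_epoch), both 1-based, for a list of per-epoch losses."""
    best, best_epoch, wait = float("inf"), 0, 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            # Improvement: remember it and reset the patience counter.
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                # Training stops here; weights from best_epoch would be restored.
                return epoch, best_epoch
    return len(losses), best_epoch


# Hypothetical monitored values: the best epoch is 2, then three epochs without improvement.
toy_losses = [1.0, 0.8, 0.85, 0.9, 0.82, 0.7]
stop_epoch, best_epoch = early_stopping_epoch(toy_losses, patience=3)
print(stop_epoch, best_epoch)
```

In this toy run training stops at epoch 5 and never sees the lower value at epoch 6, which is the trade-off a small patience makes.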

In [19]:
history = model.fit(
    x=[user_features_train, item_features_train],
    y=y_train,
    validation_data=([user_features_test, item_features_test], y_test),
    epochs=20,
    batch_size=512,
    callbacks=[keras.callbacks.EarlyStopping(monitor="loss", patience=3, restore_best_weights=True)]
)

Epoch 1/20
3907/3907 ━━━━━━━━━━━━━━━━━━━━ 44s 10ms/step - loss: 1.0394 - mae: 0.3544 - val_loss: 0.5847 - val_mae: 0.2328
Epoch 2/20
3907/3907 ━━━━━━━━━━━━━━━━━━━━ 38s 10ms/step - loss: 0.9141 - mae: 0.3299 - val_loss: 0.5835 - val_mae: 0.2310
Epoch 3/20
3907/3907 ━━━━━━━━━━━━━━━━━━━━ 38s 10ms/step - loss: 0.9248 - mae: 0.3269 - val_loss: 0.5844 - val_mae: 0.2454
Epoch 4/20
3907/3907 ━━━━━━━━━━━━━━━━━━━━ 38s 10ms/step - loss: 0.9620 - mae: 0.3280 - val_loss: 0.5848 - val_mae: 0.2503
Epoch 5/20
3907/3907 ━━━━━━━━━━━━━━━━━━━━ 38s 10ms/step - loss: 0.9394 - mae: 0.3243 - val_loss: 0.5830 - val_mae: 0.2421
Epoch 6/20
3907/3907 ━━━━━━━━━━━━━━━━━━━━ 39s 10ms/step - loss: 0.9284 - mae: 0.3244 - val_loss: 0.5825 - val_mae: 0.2426
Epoch 7/20
3907/3907 ━━━━━━━━━━━━━━━━━━━━

In [25]:
model.save("proposed_solution.h5")



## Predictions on the Test Set and Rescaling

- Use the trained model to generate predictions on the user–item pairs in the test set.

- The test_predictions are produced in the normalized space (since the target y_train was standardized prior to training).

- Apply scalerTarget.inverse_transform() to convert these normalized predictions back to their original scale (denormalization), yielding y_pred.

This step makes the predictions interpretable in the target’s original units (here, the actual watch ratio).

In [27]:
test_predictions = model.predict([user_features_test, item_features_test])
y_pred = scalerTarget.inverse_transform(test_predictions)

62500/62500 ━━━━━━━━━━━━━━━━━━━━ 130s 2ms/step


## Preparation of Data for Top-K Performance Evaluation

- Use a **threshold `threshold = 0.85`** to define video **relevance**: a video is considered relevant (`relevant = 1`) if its `true_watch_ratio` is greater than or equal to 0.85.  
- Assign a rank to each test row per user via `groupby("user_id").cumcount()`. Because `results_df` is not sorted by predicted score, this rank reflects the row order of the sampled test set, not the model's ordering.  
- Keep the first K rows per user with `results_df[results_df["rank"] < K]`. These are K sampled interactions per user, not the K highest-scored videos.  
- Construct two matrices:  
  - `y_true_matrix`: a binary matrix (0 or 1) indicating for each rank whether the video is actually relevant.  
  - `y_score_matrix`: a matrix of predicted scores (`predicted_watch_ratio`) for each user and each rank.  
- Users with fewer than K test rows are padded with 0 in both matrices.  

These matrices enable the computation of standard evaluation metrics for top-K recommendation systems, such as **NDCG**, **precision**, **recall**, etc.  


In [28]:
threshold = 0.85
K = 25

results_df = pd.DataFrame({
    "user_id": user_ids_test,
    "video_id": video_ids_test,
    "true_watch_ratio":scalerTarget.inverse_transform(y_test).flatten(),
    "predicted_watch_ratio": y_pred.flatten()
})

results_df["distance"] = abs(results_df["predicted_watch_ratio"] - results_df["true_watch_ratio"])

results_df["relevant"] = (results_df["true_watch_ratio"] >= threshold).astype(int)

results_df["rank"] = results_df.groupby("user_id").cumcount()

results_top_k = results_df[results_df["rank"] < K]

y_true_matrix = results_top_k.pivot(index="user_id", columns="rank", values="relevant").fillna(0).astype(int)
y_score_matrix = results_top_k.pivot(index="user_id", columns="rank", values="predicted_watch_ratio").fillna(0)
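
If the intent is to evaluate the K highest-scored videos per user, one option is to sort by predicted score before ranking. The sketch below shows that variant on a made-up frame `toy_results`; it is an alternative under that assumption, not the procedure used in the cell above.

```python
import pandas as pd

# Hypothetical predictions for two users.
toy_results = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "video_id": [10, 11, 12, 10, 13],
    "predicted_watch_ratio": [0.4, 1.2, 0.9, 0.3, 0.7],
    "true_watch_ratio": [0.5, 1.0, 0.2, 0.9, 0.6],
})
toy_k = 2

# Sort each user's rows by descending predicted score, then number them 0, 1, 2, ...
ranked = toy_results.sort_values(
    ["user_id", "predicted_watch_ratio"], ascending=[True, False]
)
ranked["rank"] = ranked.groupby("user_id").cumcount()

# Rank 0 is now the highest-scored video for each user.
top_k = ranked[ranked["rank"] < toy_k]
print(top_k[["user_id", "video_id", "rank"]])
```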

## Final Model Evaluation and Comparison with a Naive Baseline

- Use `model.evaluate(...)` to assess the model on the test set:  
  - `test_loss[0]` is the **MSE** on normalized data.  
  - `test_loss[1]` is the **MAE** on normalized data.  
- Compute the **mean of the training targets** (`mean_train`) on the original scale to serve as a **baseline**.  
- Create a constant prediction vector `baseline_pred_orig` filled with `mean_train`, simulating a model that always predicts the training mean.  
- Calculate the **baseline MAE** on the original scale using `mean_absolute_error`.  
- Print:  
  - The training target mean.  
  - The model’s test MSE.  
  - The baseline MAE (simple reference).  
  - The actual model MAE on the original scale (average absolute difference between predictions and true values).  

This comparison quantifies the model’s performance against a naive strategy to confirm its added value.  
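
The baseline comparison can be checked by hand on a few made-up numbers. All arrays below are hypothetical; only `mean_absolute_error` and `np.full_like` mirror the calls used in this notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical targets and predictions on the original scale.
y_train_toy = np.array([0.5, 1.0, 1.5])
y_test_toy = np.array([0.0, 1.0, 2.0])
y_model_toy = np.array([0.2, 0.9, 1.6])

# Naive baseline: always predict the training mean (1.0 here).
baseline_toy = np.full_like(y_test_toy, fill_value=y_train_toy.mean())

baseline_mae_toy = mean_absolute_error(y_test_toy, baseline_toy)  # (1 + 0 + 1) / 3
model_mae_toy = mean_absolute_error(y_test_toy, y_model_toy)      # (0.2 + 0.1 + 0.4) / 3

print(f"baseline MAE: {baseline_mae_toy:.4f}, model MAE: {model_mae_toy:.4f}")
```

A model is only adding value over the constant prediction when its MAE is clearly below the baseline MAE.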


In [29]:
test_loss = model.evaluate([user_features_test, item_features_test], y_test)

mean_train = scalerTarget.inverse_transform(y_train).mean()

baseline_pred_orig = np.full_like(results_df['true_watch_ratio'], fill_value=mean_train)

baseline_mae = mean_absolute_error(baseline_pred_orig, results_df['true_watch_ratio'])

print(f"Mean value (train): {mean_train:.4f}")
print(f"Test loss (model, scaled): {test_loss[0]:.4f}")
print(f"Baseline MAE (original scale): {baseline_mae:.4f}")
print(f"Model MAE (original scale): {abs(results_df['predicted_watch_ratio'] - results_df['true_watch_ratio']).mean():.4f}")

62500/62500 ━━━━━━━━━━━━━━━━━━━━ 161s 3ms/step - loss: 0.6026 - mae: 0.2527
Mean value (train): 0.9479
Test loss (model, scaled): 0.5843
Baseline MAE (original scale): 0.4956
Model MAE (original scale): 0.4241


## Calculation of Top-K Metrics: NDCG, Precision, and Recall

- **NDCG@K (Normalized Discounted Cumulative Gain)**  
  - Assesses the quality of the ranked recommendations by accounting for item positions.  
  - A highly relevant video appearing near the top of the list contributes more to the score.  
  - Computed via `ndcg_score(...)` on the Top-K score matrices.

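- **Precision@K**  
  - Measures the proportion of recommended items among the top K that are actually relevant.  
  - Here a video counts as predicted relevant when its score in `y_score_matrix` reaches the 0.85 relevance threshold; the flattened predictions are compared against ground truth relevance.  
  - With `average="micro"` on binary labels, scikit-learn pools both classes, so the reported value equals plain accuracy.

- **Recall@K**  
  - Measures the proportion of all truly relevant videos for a user that appear in the top K recommendations.  
  - The same micro-averaging applies, which is why precision and recall print identical values below.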

- **Reporting**  
  - Each metric is printed rounded to four decimal places for clear, side-by-side comparison.

These Top-K metrics are crucial for evaluating recommendation quality in a ranking context, complementing regression errors by focusing on relevance and ordering.


In [35]:
ndcg = ndcg_score(y_true_matrix.values, y_score_matrix.values, k=K)
# `threshold` is the relevance cutoff defined in the Top-K preparation cell.
# average="micro" on binary labels pools both classes, so both scores equal accuracy.
precision = precision_score(y_true_matrix.values.flatten(), y_score_matrix.values.flatten() >= threshold, average="micro")
recall = recall_score(y_true_matrix.values.flatten(), y_score_matrix.values.flatten() >= threshold, average="micro")

print(f"NDCG@{K}: {ndcg:.4f}")
print(f"Precision@{K}: {precision:.4f}")
print(f"Recall@{K}: {recall:.4f}")

NDCG@25: 0.8952
Precision@25: 0.5266
Recall@25: 0.5266
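
The identical precision and recall values follow from the averaging mode. The sketch below, on four made-up labels, compares `average="micro"` with `average="binary"`, which scores only the positive ("relevant") class; the arrays are hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical relevance labels and thresholded predictions.
y_true_toy = np.array([1, 1, 1, 0])
y_hat_toy = np.array([1, 0, 0, 0])

# Micro-averaging pools true/false positives over both classes.
micro_p = precision_score(y_true_toy, y_hat_toy, average="micro")
micro_r = recall_score(y_true_toy, y_hat_toy, average="micro")
acc = accuracy_score(y_true_toy, y_hat_toy)

# Binary averaging scores the positive class only.
binary_p = precision_score(y_true_toy, y_hat_toy, average="binary")
binary_r = recall_score(y_true_toy, y_hat_toy, average="binary")

print(micro_p, micro_r, acc, binary_p, binary_r)
```

With `average="binary"`, precision and recall separate: the single predicted positive is correct, but only one of the three relevant items is recovered.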
