<div style="text-align: center; background-color: #559cff; font-family: 'Trebuchet MS', Arial, sans-serif; color: white; padding: 20px; font-size: 40px; font-weight: bold; border-radius: 0 0 0 0; box-shadow: 0px 6px 8px rgba(0, 0, 0, 0.2);">
  Lab 02 - Introduction To Data Science @ FIT-HCMUS, VNU-HCM 📌
</div>

<div style="text-align: center; background-color: #b1d1ff; font-family: 'Trebuchet MS', Arial, sans-serif; color: white; padding: 20px; font-size: 40px; font-weight: bold; border-radius: 0 0 0 0; box-shadow: 0px 6px 8px rgba(0, 0, 0, 0.2);">
  Stage 4.0 - Data modelling
</div>

# Import

In [1]:
import ast
import re
from joblib import dump, load

import numpy as np
import pandas as pd
from scipy import stats

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.ensemble import RandomForestRegressor

from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.metrics import mean_absolute_error, mean_squared_error

random_state = 42

# Data preparation

In [2]:
df_videos = pd.read_csv('../data/processed/df_videos_processed.csv')

# Feature engineering

This section builds model features from two raw columns, `duration` and `tags`, with the aim of predicting `view_count`. The goal is to turn the raw values into numeric features that a predictive model can use.

We first convert `duration` and `tags` to appropriate data types and clean the tags.

In [3]:
def clean_text(text):
  text = text.replace('#', '').replace('$', '')
  text = re.sub(r'\b\d+\.\d+\b', '', text)
  text = text.lower()
  if text == '':
    return np.nan
  return text
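
To make the effect of `clean_text` concrete, the sketch below repeats the function and applies it to a few made-up tags. The sample strings are illustrative assumptions, not values taken from the dataset.

```python
import re
import numpy as np

def clean_text(text):
    # Same logic as the cell above: strip '#' and '$', drop decimal
    # numbers such as '2.0', lowercase, and map empty results to NaN.
    text = text.replace('#', '').replace('$', '')
    text = re.sub(r'\b\d+\.\d+\b', '', text)
    text = text.lower()
    return np.nan if text == '' else text

samples = ['#Python', '$Crypto', '2.0', 'Data Science']  # hypothetical tags
cleaned = [clean_text(s) for s in samples]
print(cleaned)
```

A tag that is reduced to an empty string becomes `NaN` rather than being dropped at this point; it is excluded later, when only the selected tags are kept.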

In [4]:
df_process = df_videos[df_videos['tags'].notnull()][['view_count', 'duration', 'tags']]
df_process['duration'] = df_process['duration'].apply(pd.to_timedelta).dt.total_seconds()
df_process['tags'] = df_process['tags'].apply(lambda x: ast.literal_eval(x) if pd.notnull(x) else np.nan)
df_process['tags'] = df_process['tags'].apply(lambda x: list(set([clean_text(tag) for tag in x])))
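
The two conversions above can be tried on a toy frame. This is a minimal sketch that assumes durations are stored as timedelta-like strings and tags as stringified Python lists; the actual storage format of the CSV is not shown in this notebook.

```python
import ast
import pandas as pd

raw = pd.DataFrame({
    'duration': ['0 days 00:01:30', '0 days 01:00:00'],  # assumed format
    'tags': ["['Python', 'ML']", "['vlog']"],            # lists stored as strings
})

# Parse the strings into timedeltas, then express them in seconds
raw['duration'] = pd.to_timedelta(raw['duration']).dt.total_seconds()

# Safely evaluate the string representation back into a Python list
raw['tags'] = raw['tags'].apply(ast.literal_eval)
print(raw)
```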

How many unique tags are there?

In [5]:
df_process['tags'].explode().nunique()

23161

With 23,161 unique tags, encoding every tag as its own feature would give a very wide and sparse matrix, so we keep only the tags that occur more than 100 times.

In [6]:
selected_tag = df_process['tags'].explode().value_counts()
threshold = 100
selected_tag = selected_tag[selected_tag > threshold]
len(selected_tag)

99
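
The frequency filter can be illustrated on a small made-up example. The data and the threshold of 2 below are assumptions chosen so the result is easy to check by hand.

```python
import pandas as pd

toy = pd.DataFrame({'tags': [['python', 'ml'], ['python'],
                             ['ml', 'vlog'], ['python', 'ml']]})

# One row per (video, tag) pair, then count how often each tag occurs
counts = toy['tags'].explode().value_counts()

toy_threshold = 2                           # hypothetical threshold for the toy data
frequent = counts[counts > toy_threshold]   # strict '>' as in the cell above
print(frequent)
```

Here `python` and `ml` each occur three times and survive, while `vlog` occurs once and is dropped.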

Save these tags for deployment.

In [7]:
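# Build the sorted tag list directly from the index so the cell does not depend
# on the column name that reset_index() produces (it differs across pandas versions)
pd.Series(sorted(selected_tag.index)).to_json('../deploy/materials/tags.json')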

Remove tags that are not in the selected tags.

In [8]:
df_process['clean_tags'] = df_process['tags'].apply(lambda x: [tag for tag in x if tag in selected_tag])

In [9]:
df_process = df_process[df_process['clean_tags'].apply(len) > 0]

Remove outliers: keep only rows whose z-score is below 3 in absolute value for both `view_count` and `duration`.

In [10]:
df_process = df_process[(np.abs(stats.zscore(df_process[['view_count', 'duration']])) < 3).all(axis=1)].reset_index(drop = True)
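
The z-score filter can be sketched on toy data as follows. The values are made up; one `view_count` is deliberately extreme so that exactly one row is removed.

```python
import numpy as np
import pandas as pd
from scipy import stats

toy = pd.DataFrame({
    'view_count': [10] * 19 + [1000],   # one extreme value
    'duration': list(range(60, 80)),    # no extreme values
})

# Column-wise z-scores; a row is kept only if every column is within 3 std devs
z = np.abs(stats.zscore(toy[['view_count', 'duration']]))
kept = toy[(z < 3).all(axis=1)].reset_index(drop=True)
print(len(toy), '->', len(kept))
```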

# Training model

## Setting up

In [11]:
X = df_process[['duration', 'clean_tags']]
y = df_process['view_count']

In [12]:
mlb = MultiLabelBinarizer()
X = X.join(pd.DataFrame(mlb.fit_transform(X.pop('clean_tags')), columns = mlb.classes_, index = X.index))
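
As an illustration of what the binarizer produces, here is a minimal sketch on two made-up rows. The names `toy` and `toy_mlb` are hypothetical and exist only for this example.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

toy = pd.DataFrame({'duration': [60.0, 300.0],
                    'clean_tags': [['ml', 'python'], ['vlog']]})

toy_mlb = MultiLabelBinarizer()
# One 0/1 column per tag; classes_ holds the tag names in sorted order
encoded = pd.DataFrame(toy_mlb.fit_transform(toy.pop('clean_tags')),
                       columns=toy_mlb.classes_, index=toy.index)
toy = toy.join(encoded)
print(toy)
```

Each tag list becomes a row of indicator columns, so the model receives `duration` plus one binary feature per selected tag.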

In [13]:
kfold = KFold(n_splits = 5, shuffle = True, random_state = random_state)

In [14]:
regr = RandomForestRegressor(random_state = random_state)

In [15]:
param_grid = {'n_estimators': [50, 100, 150, 200],
              'criterion': ['squared_error', 'friedman_mse', 'poisson'],
              'max_depth': [None],
              'max_features': [None, 'sqrt']}
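
The size of the search can be checked before running it. The sketch below uses scikit-learn's `ParameterGrid` to count the combinations in the grid; the fold count of 5 mirrors the `KFold` object defined earlier.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {'n_estimators': [50, 100, 150, 200],
              'criterion': ['squared_error', 'friedman_mse', 'poisson'],
              'max_depth': [None],
              'max_features': [None, 'sqrt']}

n_candidates = len(ParameterGrid(param_grid))  # 4 * 3 * 1 * 2 combinations
n_fits = n_candidates * 5                      # one fit per candidate per fold
print(n_candidates, n_fits)
```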

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = random_state)

## Cross validation and hyperparameter tuning

In [17]:
grid_cv = GridSearchCV(estimator = regr, param_grid = param_grid, cv = kfold,
                       scoring = 'neg_mean_absolute_error', n_jobs = 1, verbose = 3)

In [18]:
grid_cv.fit(X_train, y_train)

Fitting 5 folds for each of 24 candidates, totalling 120 fits
[CV 1/5] END criterion=squared_error, max_depth=None, max_features=None, n_estimators=50;, score=-101575.907 total time=   2.0s
[CV 2/5] END criterion=squared_error, max_depth=None, max_features=None, n_estimators=50;, score=-101913.826 total time=   1.9s
[CV 3/5] END criterion=squared_error, max_depth=None, max_features=None, n_estimators=50;, score=-117279.964 total time=   1.9s
[CV 4/5] END criterion=squared_error, max_depth=None, max_features=None, n_estimators=50;, score=-114403.358 total time=   1.6s
[CV 5/5] END criterion=squared_error, max_depth=None, max_features=None, n_estimators=50;, score=-112637.043 total time=   1.7s
[CV 1/5] END criterion=squared_error, max_depth=None, max_features=None, n_estimators=100;, score=-101961.399 total time=   6.2s
[CV 2/5] END criterion=squared_error, max_depth=None, max_features=None, n_estimators=100;, score=-101674.661 total time=   5.2s
[CV 3/5] END criterion=squared_error, ma

## Train the best model from GridSearchCV

In [19]:
best_params = grid_cv.best_params_
print(best_params)

{'criterion': 'poisson', 'max_depth': None, 'max_features': 'sqrt', 'n_estimators': 150}
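
scikit-learn maximises scores, so with `scoring = 'neg_mean_absolute_error'` the reported values are negated MAEs. The sketch below, run on synthetic data with a much smaller grid (both are assumptions for illustration), shows how to recover the MAE of the best candidate and how to rank all candidates.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in for the real training data
X_toy, y_toy = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=42)

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={'n_estimators': [10, 20], 'max_features': [None, 'sqrt']},
    cv=KFold(n_splits=3, shuffle=True, random_state=42),
    scoring='neg_mean_absolute_error',
)
search.fit(X_toy, y_toy)

best_mae = -search.best_score_  # flip the sign to get the mean CV MAE
results = pd.DataFrame(search.cv_results_)[['params', 'mean_test_score', 'rank_test_score']]
print(search.best_params_, round(best_mae, 2))
print(results.sort_values('rank_test_score'))
```

With the default `refit=True`, `best_estimator_` is already fitted on all the data passed to `fit`.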


In [20]:
model = grid_cv.best_estimator_

In [21]:
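# GridSearchCV already refits the best estimator on the training data (refit=True
# by default); the explicit fit is kept to make the training step visible
model.fit(X_train, y_train)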

# Evaluation

In [22]:
predictions = model.predict(X_test)
print('Mean Absolute Error (MAE):', mean_absolute_error(y_test, predictions))
print('Mean Squared Error (MSE):', mean_squared_error(y_test, predictions))
print('Root Mean Squared Error (RMSE):', np.sqrt(mean_squared_error(y_test, predictions)))

Mean Absolute Error (MAE): 108575.62881281701
Mean Squared Error (MSE): 69849458368.53297
Root Mean Squared Error (RMSE): 264290.4810403375
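
For reference, the three metrics can be computed by hand. The sketch below uses three made-up values and checks the manual results against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([100.0, 200.0, 300.0])  # hypothetical actual view counts
y_pred = np.array([110.0, 190.0, 330.0])  # hypothetical predictions

errors = y_pred - y_true
mae = np.mean(np.abs(errors))   # average absolute deviation
mse = np.mean(errors ** 2)      # average squared deviation
rmse = np.sqrt(mse)             # back in the units of the target

print(mae, mse, rmse)
print(np.isclose(mae, mean_absolute_error(y_true, y_pred)),
      np.isclose(mse, mean_squared_error(y_true, y_pred)))
```

Because errors are squared before averaging, RMSE weights large errors more heavily than MAE does.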


# Prediction

In [23]:
def make_data_predict(duration, tags):
  new_data = pd.DataFrame({ 'duration': [duration], 'tags': [tags]})
  return new_data.join(pd.DataFrame(mlb.transform(new_data.pop('tags')), columns = mlb.classes_, index = new_data.index))

In [24]:
print(model.predict(make_data_predict(60, ['machine learning'])))

[24214.74371429]
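
Input tags that the binarizer has not seen trigger a warning and are ignored by `transform`. One possible way to handle this is sketched below. `make_data_predict_safe` is a hypothetical variant, not part of the original notebook; it only lowercases tags (a simplification of `clean_text`) and keeps those the binarizer knows.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy binarizer standing in for the fitted `mlb`
toy_mlb = MultiLabelBinarizer().fit([['machine learning', 'python'], ['vlog']])

def make_data_predict_safe(duration, tags, binarizer):
    """Hypothetical helper: normalise tags and keep only known ones."""
    known = set(binarizer.classes_)
    tags = [t.lower() for t in tags if t.lower() in known]
    encoded = binarizer.transform([tags])
    row = pd.DataFrame(encoded, columns=binarizer.classes_)
    row.insert(0, 'duration', [duration])  # same column order as in training
    return row

row = make_data_predict_safe(60, ['Machine Learning', 'unknown tag'], toy_mlb)
print(row)
```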


# Results analysis

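MAE, MSE, and RMSE quantify how far the model's predictions deviate from the actual values; lower is better. A test MAE of about 108,576 views and an RMSE of about 264,290 views indicate significant prediction errors, and the gap between the two suggests that a few very large errors dominate.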

Possible causes:
- Overfitting: the model captures noise in the training data and fails to generalise to new examples.
- Underfitting: the model lacks the complexity, or the features, to capture the underlying patterns in the data.
- Data quality issues: undetected noise or outliers in the data can distort what the model learns.
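
One way to tell these causes apart is to compare the training error, the test error, and a trivial baseline. The sketch below does this on synthetic data; the dataset and settings are assumptions for illustration, not results from this lab.

```python
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features and target
X_toy, y_toy = make_regression(n_samples=300, n_features=8, noise=20.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)

forest = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr)
baseline = DummyRegressor(strategy='median').fit(X_tr, y_tr)  # always predicts the median

train_mae = mean_absolute_error(y_tr, forest.predict(X_tr))
test_mae = mean_absolute_error(y_te, forest.predict(X_te))
baseline_mae = mean_absolute_error(y_te, baseline.predict(X_te))
print(train_mae, test_mae, baseline_mae)
```

A training error far below the test error points towards overfitting, while a test error close to the baseline suggests that the features carry little signal or that the model underfits.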

# Conclusion

Video duration and tags carry some signal about viewership, but on their own they are not enough for accurate forecasts. More careful feature engineering, feature selection, and hyperparameter optimisation are needed before the model can reliably estimate a video's potential audience.

# Saving models for deployment

In [25]:
model.fit(X, y)

In [26]:
dump(model, '../models/model.joblib')
dump(mlb, '../deploy/materials/mlb.joblib')

['mlb.joblib']
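
On the deployment side, the saved artifacts can be read back with `joblib.load`. The sketch below demonstrates the round trip with a toy model and a temporary directory; the toy data and file location are assumptions so that the example runs on its own.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.ensemble import RandomForestRegressor

# Toy model standing in for the trained regressor
rng = np.random.default_rng(42)
X_toy = rng.random((50, 3))
y_toy = X_toy.sum(axis=1)
toy_model = RandomForestRegressor(n_estimators=10, random_state=42).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'model.joblib')
    dump(toy_model, path)   # serialise to disk
    restored = load(path)   # read it back as a fitted estimator

# The restored model should reproduce the original predictions
same = np.allclose(toy_model.predict(X_toy), restored.predict(X_toy))
print(same)
```

Loading works most reliably when the deployment environment uses the same scikit-learn version as the one used for training.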