# Engagement Score Prediction

## Problem Statement

ABC is an online content-sharing platform that lets users create, upload, and share videos. The videos span genres such as entertainment, education, sports, and technology. The maximum video duration is 10 minutes.

Users can like, comment on, and share videos on the platform.

Based on a user’s interaction with a video, an engagement score is assigned to that video for that user. The engagement score indicates how engaging the user finds the video’s content.

Understanding engagement scores helps the platform improve user interaction: the scores indicate which types of content appeal to a user and engage a larger audience.


## Objective
The main objective is to develop a machine learning approach that predicts the engagement score of a video at the user level.

In [1]:
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# warnings
import warnings
warnings.filterwarnings('ignore')

In [2]:
# import machine learning libraries
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import r2_score

# import linear regression
from sklearn.linear_model import LinearRegression

# import xgboost
from xgboost import XGBRegressor

# import GridSearchCV and RandomizedSearchCV for cross-validated hyperparameter search
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

In [3]:
# import data
train_data = pd.read_csv('data/train_0OECtn8.csv')
train_data.head()

Unnamed: 0,row_id,user_id,category_id,video_id,age,gender,profession,followers,views,engagement_score
0,1,19990,37,128,24,Male,Student,180,1000,4.33
1,2,5304,32,132,14,Female,Student,330,714,1.79
2,3,1840,12,24,19,Male,Student,180,138,4.35
3,4,12597,23,112,19,Male,Student,220,613,3.77
4,5,13626,23,112,27,Male,Working Professional,220,613,3.13


In [4]:
# shape 
train_data.shape

(89197, 10)

In [5]:
# let's check for null values
train_data.isna().sum()

row_id              0
user_id             0
category_id         0
video_id            0
age                 0
gender              0
profession          0
followers           0
views               0
engagement_score    0
dtype: int64

In [6]:
# unique values in a column
train_data.nunique()

row_id              89197
user_id             27734
category_id            47
video_id              175
age                    58
gender                  2
profession              3
followers              17
views                  43
engagement_score      229
dtype: int64

## Data Preparation

In [7]:
train_data.head()

Unnamed: 0,row_id,user_id,category_id,video_id,age,gender,profession,followers,views,engagement_score
0,1,19990,37,128,24,Male,Student,180,1000,4.33
1,2,5304,32,132,14,Female,Student,330,714,1.79
2,3,1840,12,24,19,Male,Student,180,138,4.35
3,4,12597,23,112,19,Male,Student,220,613,3.77
4,5,13626,23,112,27,Male,Working Professional,220,613,3.13


In [8]:
# remove unwanted features/columns to assign X and y
X = train_data.drop(['row_id', 'user_id','category_id','video_id', 'engagement_score'], axis=1)
y = train_data['engagement_score']

In [9]:
# label encoder for categorical features
encoder = LabelEncoder()

cat_cols = ['gender', 'profession']

for col in cat_cols:
    X[col] = encoder.fit_transform(X[col])

X.head()

Unnamed: 0,age,gender,profession,followers,views
0,24,1,1,180,1000
1,14,0,1,330,714
2,19,1,1,180,138
3,19,1,1,220,613
4,27,1,2,220,613
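
The loop above reuses a single `LabelEncoder`, so once it finishes only the mapping for the last column (`profession`) is retained. A minimal sketch of one alternative, run on a small made-up frame rather than the competition data, keeps one fitted encoder per column in a dictionary so the same mapping can be applied to new data later. The names `encoders`, `toy_train`, and `toy_new`, and the `'Other'` category, are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# toy data standing in for the training frame (values are illustrative only)
toy_train = pd.DataFrame({
    'gender': ['Male', 'Female', 'Male'],
    'profession': ['Student', 'Working Professional', 'Other'],
})
# toy data standing in for unseen rows
toy_new = pd.DataFrame({
    'gender': ['Female', 'Male'],
    'profession': ['Other', 'Student'],
})

cat_cols = ['gender', 'profession']

# fit one encoder per column and keep it for later reuse
encoders = {}
for col in cat_cols:
    encoders[col] = LabelEncoder()
    toy_train[col] = encoders[col].fit_transform(toy_train[col])

# apply the stored mappings with transform (no refitting on new data)
for col in cat_cols:
    toy_new[col] = encoders[col].transform(toy_new[col])

print(toy_train)
print(toy_new)
```

With this layout, `transform` raises an error if new data contains a category that was not seen during fitting, which surfaces a mismatch instead of silently assigning different codes.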


In [10]:
X.nunique()

age           58
gender         2
profession     3
followers     17
views         43
dtype: int64

## Train Test Split

In [11]:
# train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [12]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(62437, 5)
(62437,)
(26760, 5)
(26760,)


### Data Standardization

In [13]:
# # MinMaxScaler
# scaler = MinMaxScaler()

# StandardScaler
scaler = StandardScaler()

# columns to scale (all feature columns)
all_cols = X_train.columns

# fit the scaler on the training split only, then apply it to both splits
X_train[all_cols] = scaler.fit_transform(X_train[all_cols])
X_test[all_cols] = scaler.transform(X_test[all_cols])

In [14]:
X_train.head()

Unnamed: 0,age,gender,profession,followers,views
6159,0.350986,0.839918,-1.280995,-0.055484,-1.204521
45351,1.690816,-1.190592,-1.280995,-0.272351,-1.022034
19559,0.350986,0.839918,-1.280995,-0.272351,-0.694301
83744,0.909248,0.839918,-1.280995,1.679457,0.784218
23113,0.797596,-1.190592,-1.280995,-0.489219,-0.500641


In [15]:
X_train.describe()

Unnamed: 0,age,gender,profession,followers,views
count,62437.0,62437.0,62437.0,62437.0,62437.0
mean,1.88433e-15,1.474584e-16,-1.244541e-15,5.011927e-16,1.021175e-15
std,1.000008,1.000008,1.000008,1.000008,1.000008
min,-1.65876,-1.190592,-1.280995,-2.007292,-1.763155
25%,-0.7655395,-1.190592,-1.280995,-0.4892191,-1.022034
50%,-0.207277,0.8399184,0.1470418,-0.2723514,-0.1356668
75%,0.7975955,0.8399184,0.1470418,0.5951191,0.7842179
max,4.817086,0.8399184,1.575079,2.33006,1.849348
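
The `std` row reads 1.000008 rather than exactly 1 because `StandardScaler` divides by the population standard deviation (`ddof=0`), while `DataFrame.describe()` reports the sample standard deviation (`ddof=1`). The two differ by a factor of sqrt(n / (n - 1)), which for the 62437 training rows is about 1.000008. A small sketch on synthetic data (the random values below are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# synthetic column standing in for a real feature
rng = np.random.default_rng(0)
values = pd.DataFrame({'x': rng.normal(loc=50.0, scale=10.0, size=1000)})

scaled = pd.DataFrame(StandardScaler().fit_transform(values), columns=['x'])

pop_std = scaled['x'].std(ddof=0)     # the quantity StandardScaler normalizes to 1
sample_std = scaled['x'].std(ddof=1)  # the quantity describe() reports

n = len(scaled)
expected_ratio = np.sqrt(n / (n - 1))
print(pop_std, sample_std, expected_ratio)

# the same ratio for the training split size used in this notebook
n_train = 62437
train_ratio = float(np.sqrt(n_train / (n_train - 1)))
print(round(train_ratio, 6))
```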


## Linear Regression

In [16]:
le_regg = LinearRegression()

le_regg.fit(X_train, y_train)

LinearRegression()

In [17]:
y_train_pred = le_regg.predict(X_train)
y_test_pred = le_regg.predict(X_test)

In [18]:
print('Train score:', r2_score(y_train, y_train_pred))
print('Test Score:',r2_score(y_test, y_test_pred))

Train score: 0.22873778028867142
Test Score: 0.23078575916618294
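
`r2_score` reports the fraction of the target's variance that the model explains, `R2 = 1 - SS_res / SS_tot`, so a test score of about 0.23 means the linear model accounts for roughly 23% of the variance in `engagement_score`. A minimal sketch that computes the same quantity by hand: the true values are the five scores shown in `train_data.head()`, while the predictions are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([4.33, 1.79, 4.35, 3.77, 3.13])
y_pred = np.array([4.0, 2.5, 4.0, 3.5, 3.5])  # hypothetical predictions

# residual sum of squares: error left after using the model
ss_res = np.sum((y_true - y_pred) ** 2)
# total sum of squares: error of always predicting the mean
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1.0 - ss_res / ss_tot
r2_sklearn = r2_score(y_true, y_pred)
print(r2_manual, r2_sklearn)

# a model that always predicts the mean scores exactly 0
r2_mean = r2_score(y_true, np.full_like(y_true, y_true.mean()))
```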


## XGBoost Regressor

In [20]:
xgb_reg = XGBRegressor(n_estimators = 800, max_depth = 8, learning_rate=0.01)

xgb_reg.fit(X_train, y_train)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
             gamma=0, gpu_id=-1, importance_type=None,
             interaction_constraints='', learning_rate=0.01, max_delta_step=0,
             max_depth=8, min_child_weight=1, missing=nan,
             monotone_constraints='()', n_estimators=800, n_jobs=8,
             num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,
             reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',
             validate_parameters=1, verbosity=None)

In [21]:
y_train_pred = xgb_reg.predict(X_train)
y_test_pred = xgb_reg.predict(X_test)

In [22]:
print('Train score:', r2_score(y_train, y_train_pred))
print('Test Score:',r2_score(y_test, y_test_pred))

Train score: 0.38667672834521194
Test Score: 0.3528540644392151


### Hyperparameter tuning with XGBoost & RandomizedSearchCV - 1

In [22]:
# import RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV

In [27]:
xgb_reg_cv = XGBRegressor(random_state=42)

params = {
    'learning_rate': [0.005, 0.01, 0.02, 0.03, 0.04, 0.05],
    'max_depth': [3, 5, 8, 10, 12],
    'n_estimators': [100, 250, 400, 500, 700, 800]
}

# # grid search
# grid_model = GridSearchCV(estimator=xgb_reg_cv, param_grid=params, n_jobs=-1, cv=4,
#                           verbose=2, return_train_score=True, scoring=['r2'])

# randomized search
random_model = RandomizedSearchCV(estimator=xgb_reg_cv, param_distributions=params, 
                                  n_jobs=-1, cv=4,
                          verbose=2, return_train_score=True)

In [28]:
# # fit grid model
# grid_model.fit(X_train, y_train)

# fit random model
random_model.fit(X_train, y_train)

Fitting 4 folds for each of 10 candidates, totalling 40 fits


RandomizedSearchCV(cv=4,
                   estimator=XGBRegressor(base_score=None, booster=None,
                                          colsample_bylevel=None,
                                          colsample_bynode=None,
                                          colsample_bytree=None,
                                          enable_categorical=False, gamma=None,
                                          gpu_id=None, importance_type=None,
                                          interaction_constraints=None,
                                          learning_rate=None,
                                          max_delta_step=None, max_depth=None,
                                          min_child_weight=None, missing=nan,
                                          monotone_constraints=...
                                          num_parallel_tree=None,
                                          predictor=None, random_state=42,
                                          reg_alph

In [49]:
random_model.best_params_

{'n_estimators': 700, 'max_depth': 5, 'learning_rate': 0.05}
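
`RandomizedSearchCV` tries `n_iter=10` parameter combinations by default, which matches the "10 candidates" in the fitting log, while the `params` grid contains 6 × 5 × 6 = 180 combinations. Only a small fraction of the grid was therefore evaluated, and because the search above does not set `random_state`, a rerun may sample different candidates. The sketch below uses scikit-learn's `ParameterSampler` to show what such a draw looks like; the seed is an assumption chosen for illustration, so the sampled candidates are not the ones from the run above.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

params = {
    'learning_rate': [0.005, 0.01, 0.02, 0.03, 0.04, 0.05],
    'max_depth': [3, 5, 8, 10, 12],
    'n_estimators': [100, 250, 400, 500, 700, 800],
}

# size of the full grid that GridSearchCV would have to cover
n_total = len(ParameterGrid(params))
print('combinations in full grid:', n_total)

# draw 10 candidates; with list-valued parameters the draw is without replacement
candidates = list(ParameterSampler(params, n_iter=10, random_state=42))
for candidate in candidates:
    print(candidate)
```

Passing `random_state` to `RandomizedSearchCV` itself would make the search repeatable in the same way.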

In [29]:
best_model = random_model.best_estimator_
best_model

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
             gamma=0, gpu_id=-1, importance_type=None,
             interaction_constraints='', learning_rate=0.05, max_delta_step=0,
             max_depth=5, min_child_weight=1, missing=nan,
             monotone_constraints='()', n_estimators=700, n_jobs=8,
             num_parallel_tree=1, predictor='auto', random_state=42,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

In [30]:
y_train_pred = best_model.predict(X_train)
y_test_pred = best_model.predict(X_test)

In [31]:
print('Train score:', r2_score(y_train, y_train_pred))
print('Test Score:',r2_score(y_test, y_test_pred))

Train score: 0.3752077290138349
Test Score: 0.35364993096676844


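The tuned model in `best_model` has a marginally higher test R2 than the first XGBoost model (0.3536 vs. 0.3529) and a smaller gap between train and test scores (0.3752 vs. 0.3536, compared with 0.3867 vs. 0.3529).

Since it generalizes slightly better than the first XGBoost model, let's use it to predict scores for the test data set.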

## Test Data

**Import the test data and apply the same preprocessing steps used for the training data.**

In [34]:
# import data
test_data = pd.read_csv('data/test_1zqHu22.csv')

# info of data
print("Shape of Test data:", test_data.shape)
print("Columns present in Test data:",test_data.columns)

# preprocess the data
new_test_data = test_data.drop(['row_id', 'user_id','category_id','video_id'], axis=1)

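# encoding categorical columns
# note: the encoder is refit on the test data here; LabelEncoder assigns codes
# in sorted class order, so this reproduces the training mapping only if the
# test set contains the same categories as the training set
cat_cols = ['gender', 'profession']
for col in cat_cols:
    new_test_data[col] = encoder.fit_transform(new_test_data[col])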

# standardize the data
all_test_cols = new_test_data.columns
new_test_data[all_test_cols] = scaler.transform(new_test_data[all_test_cols])

# predict the engagement score
target = best_model.predict(new_test_data)

print(target[:10])

Shape of Test data: (11121, 9)
Columns present in Test data: Index(['row_id', 'user_id', 'category_id', 'video_id', 'age', 'gender',
       'profession', 'followers', 'views'],
      dtype='object')
[4.1037755 3.7482507 2.7046568 3.8991432 2.6361735 3.9714184 3.7615764
 3.8025105 2.6055307 4.040778 ]
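
The test-time steps above have to mirror the training steps by hand. One alternative, sketched here, is to bundle encoding, scaling, and the model in a scikit-learn `Pipeline` so that everything is fit once on training data and reapplied unchanged. This is not the notebook's method: it swaps `LabelEncoder` for `OrdinalEncoder`, uses `LinearRegression` as a stand-in for the tuned XGBoost model, and runs on a tiny frame built from the five rows shown in `train_data.head()`. The names `toy_train`, `toy_target`, and `toy_test` are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

cat_cols = ['gender', 'profession']
num_cols = ['age', 'followers', 'views']

# tiny frame with the same feature columns as the notebook
toy_train = pd.DataFrame({
    'age': [24, 14, 19, 19, 27],
    'gender': ['Male', 'Female', 'Male', 'Male', 'Male'],
    'profession': ['Student', 'Student', 'Student', 'Student',
                   'Working Professional'],
    'followers': [180, 330, 180, 220, 220],
    'views': [1000, 714, 138, 613, 613],
})
toy_target = [4.33, 1.79, 4.35, 3.77, 3.13]
toy_test = toy_train.head(2).copy()  # stand-in for unseen rows

# encode the categorical columns, pass the numeric ones through untouched
preprocess = ColumnTransformer([
    ('cat', OrdinalEncoder(), cat_cols),
    ('num', 'passthrough', num_cols),
])

model = Pipeline([
    ('preprocess', preprocess),
    ('scale', StandardScaler()),
    ('regressor', LinearRegression()),
])

# fit learns the encoding, the scaling statistics, and the model together
model.fit(toy_train, toy_target)

# predict reuses the fitted encoding and scaling; nothing is refit
toy_pred = model.predict(toy_test)
print(toy_pred)
```

Because the fitted pipeline carries its own preprocessing, it can be handed raw feature columns at prediction time, which removes the need to repeat the encoding and scaling cells for the test set.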


In [47]:
# submission dataframe
submission_df = pd.DataFrame({'row_id': test_data.row_id.values,
                            'engagement_score': target})

submission_df.head()

Unnamed: 0,row_id,engagement_score
0,89198,4.103776
1,89199,3.748251
2,89200,2.704657
3,89201,3.899143
4,89202,2.636173


In [48]:
# save to submission.csv
submission_df.to_csv('submission.csv', index=False)
print("File saved successfully!!")

File saved successfully!!
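
Before uploading, a few quick checks can catch a malformed file. The sketch below is one possible set of checks, not a requirement of the competition: the helper name `check_submission` is hypothetical, and the stand-in frame reuses the first three rows of `submission_df.head()`.

```python
import pandas as pd

def check_submission(df, expected_rows):
    """Return a list of problems found; an empty list means the frame looks fine."""
    problems = []
    if list(df.columns) != ['row_id', 'engagement_score']:
        problems.append('unexpected columns')
        return problems  # remaining checks rely on these column names
    if len(df) != expected_rows:
        problems.append('unexpected row count')
    if not df['row_id'].is_unique:
        problems.append('duplicate row_id values')
    if df['engagement_score'].isna().any():
        problems.append('missing predictions')
    return problems

# stand-in for submission_df
toy_submission = pd.DataFrame({
    'row_id': [89198, 89199, 89200],
    'engagement_score': [4.103776, 3.748251, 2.704657],
})

problems = check_submission(toy_submission, expected_rows=3)
print(problems)
```

In the notebook this could be called as `check_submission(submission_df, len(test_data))`.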
