# Engagement Score Prediction

## Problem Statement

ABC is an online content-sharing platform that lets users create, upload, and share content in the form of videos. It hosts videos from genres such as entertainment, education, sports, and technology. The maximum video duration is 10 minutes.

Users can like, comment on, and share videos on the platform.

Based on a user’s interaction with a video, an engagement score is assigned to that video for that user. The engagement score measures how engaging the video’s content is.

Understanding a video’s engagement score helps improve the user’s interaction with the platform. It indicates which type of content appeals to the user and which content engages a larger audience.


## Objective
The main objective is to develop a machine learning approach that predicts the engagement score of a video at the user level.

In [10]:
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# warnings
import warnings
warnings.filterwarnings('ignore')

In [11]:
# import machine learning libraries
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split

In [12]:
# import data
train_data = pd.read_csv('data/train_0OECtn8.csv')
train_data.head()

Unnamed: 0,row_id,user_id,category_id,video_id,age,gender,profession,followers,views,engagement_score
0,1,19990,37,128,24,Male,Student,180,1000,4.33
1,2,5304,32,132,14,Female,Student,330,714,1.79
2,3,1840,12,24,19,Male,Student,180,138,4.35
3,4,12597,23,112,19,Male,Student,220,613,3.77
4,5,13626,23,112,27,Male,Working Professional,220,613,3.13


In [13]:
# shape (rows, columns)
train_data.shape

(89197, 10)

In [14]:
# check for null values in each column
train_data.isna().sum()

row_id              0
user_id             0
category_id         0
video_id            0
age                 0
gender              0
profession          0
followers           0
views               0
engagement_score    0
dtype: int64

In [15]:
# number of unique values per column
train_data.nunique()

row_id              89197
user_id             27734
category_id            47
video_id              175
age                    58
gender                  2
profession              3
followers              17
views                  43
engagement_score      229
dtype: int64

## Data Preparation

In [16]:
train_data.head()

Unnamed: 0,row_id,user_id,category_id,video_id,age,gender,profession,followers,views,engagement_score
0,1,19990,37,128,24,Male,Student,180,1000,4.33
1,2,5304,32,132,14,Female,Student,330,714,1.79
2,3,1840,12,24,19,Male,Student,180,138,4.35
3,4,12597,23,112,19,Male,Student,220,613,3.77
4,5,13626,23,112,27,Male,Working Professional,220,613,3.13


In [17]:
# remove unwanted features/columns to assign X and y
X = train_data.drop(['row_id', 'engagement_score'], axis=1)
y = train_data['engagement_score']

In [18]:
# label-encode the categorical features
encoder = LabelEncoder()

cat_cols = ['gender', 'profession']

for col in cat_cols:
    X[col] = encoder.fit_transform(X[col])

X.head()

Unnamed: 0,user_id,category_id,video_id,age,gender,profession,followers,views
0,19990,37,128,24,1,1,180,1000
1,5304,32,132,14,0,1,330,714
2,1840,12,24,19,1,1,180,138
3,12597,23,112,19,1,1,220,613
4,13626,23,112,27,1,2,220,613
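
The loop above reuses a single `LabelEncoder`, so after it finishes only the mapping for the last column is still available. A minimal sketch of an alternative, assuming one encoder per column is wanted so the integer codes can be mapped back to labels later. The toy frame and the `"Other"` profession value are hypothetical placeholders, not values confirmed by the dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with the same two categorical columns as the notebook.
# "Other" is a hypothetical third profession used only for illustration.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Male"],
    "profession": ["Student", "Working Professional", "Other"],
})

# Keep one fitted encoder per column so every mapping survives the loop.
encoders = {}
for col in ["gender", "profession"]:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

# LabelEncoder assigns integer codes following the sorted order of the labels.
gender_classes = list(encoders["gender"].classes_)

# Map the integer codes back to the original labels.
decoded_gender = list(encoders["gender"].inverse_transform(toy["gender"]))
```

Keeping the fitted encoders also makes it possible to apply the same mapping to new data with `transform` instead of refitting.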


In [19]:
X.nunique()

user_id        27734
category_id       47
video_id         175
age               58
gender             2
profession         3
followers         17
views             43
dtype: int64

## Train Test Split

In [20]:
# train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [21]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(62437, 8)
(62437,)
(26760, 8)
(26760,)


### Data Normalization

In [22]:
# # MinMaxScaler
# scaler = MinMaxScaler()

# StandardScaler
scaler = StandardScaler()

# all columns to all_cols
all_cols = X_train.columns

# fit the scaler on X_train and standardize it in place (X_test is not transformed here)
X_train[all_cols] = scaler.fit_transform(X_train[all_cols])

In [23]:
X_train.head()

Unnamed: 0,user_id,category_id,video_id,age,gender,profession,followers,views
6159,-1.175375,-0.629432,-1.128619,0.350986,0.839918,-1.280995,-0.055484,-1.204521
45351,0.333129,-1.142575,0.645257,1.690816,-1.190592,-1.280995,-0.272351,-1.022034
19559,-1.422511,0.567902,1.264052,0.350986,0.839918,-1.280995,-0.272351,-0.694301
83744,1.082518,1.166569,0.294607,0.909248,0.839918,-1.280995,1.679457,0.784218
23113,-0.714394,0.054759,0.892775,0.797596,-1.190592,-1.280995,-0.489219,-0.500641


In [24]:
X_train.describe()

Unnamed: 0,user_id,category_id,video_id,age,gender,profession,followers,views
count,62437.0,62437.0,62437.0,62437.0,62437.0,62437.0,62437.0,62437.0
mean,2.7696450000000003e-17,3.494561e-16,2.083315e-16,1.88433e-15,1.474584e-16,-1.244541e-15,5.011927e-16,1.021175e-15
std,1.000008,1.000008,1.000008,1.000008,1.000008,1.000008,1.000008,1.000008
min,-1.732491,-1.484671,-1.582402,-1.65876,-1.190592,-1.280995,-2.007292,-1.763155
25%,-0.8666413,-0.8860036,-0.9017282,-0.7655395,-1.190592,-1.280995,-0.4892191,-1.022034
50%,-0.001415008,-0.2018126,-0.03541621,-0.207277,0.8399184,0.1470418,-0.2723514,-0.1356668
75%,0.8679261,0.6534262,0.8515222,0.7975955,0.8399184,0.1470418,0.5951191,0.7842179
max,1.725546,2.449428,2.006605,4.817086,0.8399184,1.575079,2.33006,1.849348
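
Only `X_train` is standardized in this section, while the models later predict on `X_test` as it came out of the split. A model fitted on standardized features generally expects test features on the same scale, so the mismatch is a likely contributor to poor test scores. The sketch below uses synthetic data (an assumption, not the competition data) to show the usual pattern: fit the scaler on the training split and reuse its statistics on the test split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for large-magnitude feature columns.
X_demo = rng.uniform(0, 1000, size=(200, 3))

X_tr, X_te = train_test_split(X_demo, test_size=0.3, random_state=42)

scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)  # learn mean/std from the training split
X_te_scaled = scaler_demo.transform(X_te)      # reuse those statistics on the test split
```

With the notebook's own names, the equivalent step would presumably be `X_test[all_cols] = scaler.transform(X_test[all_cols])`, run before any call to `predict`.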


## Linear Regression

In [25]:
from sklearn.linear_model import LinearRegression

In [26]:
le_regg = LinearRegression()

le_regg.fit(X_train, y_train)

LinearRegression()

In [27]:
from sklearn.metrics import r2_score

In [28]:
y_train_pred = le_regg.predict(X_train)
y_test_pred = le_regg.predict(X_test)

In [29]:
r2_score(y_test, y_test_pred)

-4009.705480292893

In [30]:
r2_score(y_train, y_train_pred)

0.23665917760202815
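
A test R² of about -4009.7 next to a train R² of about 0.24 is worth interpreting. R² is defined as 1 - SS_res / SS_tot, so it equals 0 for a model that always predicts the mean of the targets and goes negative when the predictions are worse than that baseline. A minimal sketch of the definition on toy numbers (the helper `r2_manual` and the values are illustrative assumptions):

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared error of the predictions
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of the mean baseline
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])

# Always predicting the mean gives exactly 0.
r2_mean_baseline = r2_manual(y_true, np.full(4, y_true.mean()))

# Predictions on a completely wrong scale give a large negative value.
r2_far_off = r2_manual(y_true, y_true + 100.0)
```

A score thousands below zero therefore suggests predictions far outside the range of the targets, which usually points to a mismatch between training and test inputs rather than to a weak model alone.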

## XGBoost Regressor

In [31]:
# import libraries
from xgboost import XGBRegressor

In [32]:
xgb_reg = XGBRegressor(n_estimators=500, max_depth=12, learning_rate=0.01)

xgb_reg.fit(X_train, y_train)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
             gamma=0, gpu_id=-1, importance_type=None,
             interaction_constraints='', learning_rate=0.01, max_delta_step=0,
             max_depth=12, min_child_weight=1, missing=nan,
             monotone_constraints='()', n_estimators=500, n_jobs=8,
             num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,
             reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',
             validate_parameters=1, verbosity=None)

In [33]:
y_train_pred = xgb_reg.predict(X_train)
y_test_pred = xgb_reg.predict(X_test)

In [34]:
print('Train score:', r2_score(y_train, y_train_pred))
print('Test Score:', r2_score(y_test, y_test_pred))

Train score: 0.5946972939640556
Test Score: -0.2792050348028223
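
One way to make sure training and test features always pass through the same preprocessing is to bundle the scaler and the regressor in a scikit-learn pipeline. The sketch below is run on synthetic data and uses `GradientBoostingRegressor` as a stand-in for `XGBRegressor`, so that it depends only on scikit-learn. Both choices are assumptions for illustration, and the resulting scores say nothing about the engagement data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic features and a target that depends on the first feature plus noise.
X_demo = rng.uniform(0, 1000, size=(400, 3))
y_demo = 0.004 * X_demo[:, 0] + rng.normal(0, 0.3, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# The pipeline fits the scaler on the training data only and applies the
# same transform automatically whenever predict() is called.
model = make_pipeline(
    StandardScaler(),
    GradientBoostingRegressor(n_estimators=50, random_state=0),
)
model.fit(X_tr, y_tr)

train_r2 = r2_score(y_tr, model.predict(X_tr))
test_r2 = r2_score(y_te, model.predict(X_te))
```

Because `XGBRegressor` follows the scikit-learn estimator interface, it should be usable in place of the stand-in regressor inside the same pipeline.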


# Neither model can be used for prediction: both have negative R² on the test set