#Problem Statement

ABC is an online content-sharing platform that lets users create, upload, and share content in the form of videos. It hosts videos from genres such as entertainment, education, sports, and technology. The maximum duration of a video is 10 minutes.

Users can like, comment on, and share videos on the platform.

Based on a user’s interaction with a video, an engagement score is assigned to that video for that user. The engagement score measures how engaging the video’s content is.

Understanding a video’s engagement score helps improve users’ interaction with the platform: it indicates which types of content appeal to a user and engage a larger audience.

#Objective
The main objective is to develop a machine learning approach that predicts the engagement score of a video at the user level.

In [1]:
# Import the required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [2]:
# Read the train and test files from Google Drive
train = pd.read_csv("/content/drive/My Drive/Engagement Score Prediction/train.csv")
test = pd.read_csv("/content/drive/My Drive/Engagement Score Prediction/test.csv")
# train['data'] = 'train'
# test['data'] = 'test'
test, train

(       row_id  user_id  category_id  ...            profession  followers views
 0       89198     7986           12  ...               Student        180   138
 1       89199    11278           34  ...               Student        230   840
 2       89200    17245            8  ...  Working Professional        280   628
 3       89201     9851           16  ...               Student        270   462
 4       89202    16008           34  ...                 Other        230   840
 ...       ...      ...          ...  ...                   ...        ...   ...
 11116  100314    26336           25  ...               Student        240   317
 11117  100315     6772            8  ...               Student        280   628
 11118  100316     2042           16  ...               Student        270   462
 11119  100317    24626            8  ...                 Other        280   628
 11120  100318      967            8  ...  Working Professional        280   628
 
 [11121 rows x 9 columns],

In [3]:
# Target variable (label)
outcome = train['engagement_score']
outcome

0        4.33
1        1.79
2        4.35
3        3.77
4        3.13
         ... 
89192    3.91
89193    3.56
89194    4.23
89195    3.77
89196    4.31
Name: engagement_score, Length: 89197, dtype: float64

In [4]:
# Drop the label from the training features
train = train.drop(['engagement_score'], axis =1)
test_rowid = test['row_id'] # reference for the final output

In [5]:
# Concatenate train and test so both are preprocessed the same way
data = pd.concat([train, test], axis=0).reset_index(drop=True)

In [6]:
# EDA: preview the combined data
data

        row_id  user_id  category_id  video_id  age  gender            profession  followers  views
0            1    19990           37       128   24    Male               Student        180   1000
1            2     5304           32       132   14  Female               Student        330    714
2            3     1840           12        24   19    Male               Student        180    138
3            4    12597           23       112   19    Male               Student        220    613
4            5    13626           23       112   27    Male  Working Professional        220    613
...        ...      ...          ...       ...  ...     ...                   ...        ...    ...
100313  100314    26336           25       140   21    Male               Student        240    317
100314  100315     6772            8       100   19  Female               Student        280    628
100315  100316     2042           16        98   22    Male               Student        270    462
100316  100317    24626            8        16   33    Male                 Other        280    628


In [7]:
# check the shape of the data
data.shape

(100318, 9)

In [8]:
# 100318 observations, 9 variables
# Column types and non-null counts of the dataset
# pd.set_option('display.max_rows', None)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100318 entries, 0 to 100317
Data columns (total 9 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   row_id       100318 non-null  int64 
 1   user_id      100318 non-null  int64 
 2   category_id  100318 non-null  int64 
 3   video_id     100318 non-null  int64 
 4   age          100318 non-null  int64 
 5   gender       100318 non-null  object
 6   profession   100318 non-null  object
 7   followers    100318 non-null  int64 
 8   views        100318 non-null  int64 
dtypes: int64(7), object(2)
memory usage: 6.9+ MB


In [9]:
# Summary statistics for the numerical columns
data.describe()

             row_id      user_id  category_id   video_id        age   followers       views
count      100318.0     100318.0     100318.0   100318.0   100318.0    100318.0    100318.0
mean        50159.5  13875.67935    18.029157   77.94011  24.849229  252.153253  497.656861
std    28959.456489  8005.079041    11.562197  48.499456   8.955318    45.32458  266.974474
min             1.0          1.0          1.0        1.0       10.0       160.0        30.0
25%        25080.25      6938.25          8.0       35.0       18.0       230.0       229.0
50%         50159.5      13889.0         16.0       76.0       23.0       240.0       467.0
75%        75238.75      20813.0         26.0      121.0       32.0       280.0       709.0
max        100318.0      27734.0         47.0      175.0       68.0       360.0      1000.0


In [10]:
# No missing values (see the non-null counts above)
# Check for duplicate rows
duplicate = data.duplicated()
duplicate

0         False
1         False
2         False
3         False
4         False
          ...  
100313    False
100314    False
100315    False
100316    False
100317    False
Length: 100318, dtype: bool
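
A boolean Series of this length is hard to read, so a count is usually easier to interpret. The following is a minimal sketch on a made-up toy frame (the name `toy` and its values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy frame standing in for `data`; the values are made up.
toy = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "views": [100, 200, 200, 300],
})

# Number of rows that are exact copies of an earlier row.
n_duplicates = int(toy.duplicated().sum())

# Number of missing values per column.
n_missing = toy.isnull().sum().to_dict()

print(n_duplicates)
print(n_missing)
```

The same two calls could be applied to `data` to confirm the "no missing values, no duplicates" observation with a single number each.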

In [11]:
# Label encoder for the categorical columns 'gender' and 'profession'
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [12]:
# Encode the categorical columns as integers, since the model only accepts numerical values
data['gender'] = le.fit_transform(data['gender'])
data['profession'] = le.fit_transform(data['profession'])
data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100318 entries, 0 to 100317
Data columns (total 9 columns):
 #   Column       Non-Null Count   Dtype
---  ------       --------------   -----
 0   row_id       100318 non-null  int64
 1   user_id      100318 non-null  int64
 2   category_id  100318 non-null  int64
 3   video_id     100318 non-null  int64
 4   age          100318 non-null  int64
 5   gender       100318 non-null  int64
 6   profession   100318 non-null  int64
 7   followers    100318 non-null  int64
 8   views        100318 non-null  int64
dtypes: int64(9)
memory usage: 6.9 MB
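
Reusing one `LabelEncoder` for both columns works, but after the second `fit_transform` the encoder only remembers the `profession` labels. A possible alternative, sketched here on a toy frame with the labels visible in the outputs above, keeps one encoder per column so the integer codes can be inspected or reversed later. The names `toy` and `encoders` are assumptions made for this example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame using labels that appear in the outputs above.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Male"],
    "profession": ["Student", "Working Professional", "Other"],
})

# One fitted encoder per column, kept in a dict for later inspection.
encoders = {}
for column in ["gender", "profession"]:
    encoder = LabelEncoder()
    toy[column] = encoder.fit_transform(toy[column])
    encoders[column] = encoder

# LabelEncoder assigns integer codes in sorted order of the labels.
gender_codes = {str(label): code for code, label in enumerate(encoders["gender"].classes_)}
profession_codes = {str(label): code for code, label in enumerate(encoders["profession"].classes_)}

print(gender_codes)
print(profession_codes)
```

Because the codes follow sorted label order, they carry no real ordering between categories; a linear model will still treat them as numeric magnitudes.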


In [13]:
# Convert the DataFrame to a NumPy array with .values before fitting the model
temp = data.values
temp

array([[     1,  19990,     37, ...,      1,    180,   1000],
       [     2,   5304,     32, ...,      1,    330,    714],
       [     3,   1840,     12, ...,      1,    180,    138],
       ...,
       [100316,   2042,     16, ...,      1,    270,    462],
       [100317,  24626,      8, ...,      0,    280,    628],
       [100318,    967,      8, ...,      2,    280,    628]])

In [14]:
# The data is now a NumPy array; split it back into the train and test sets provided by the job-a-thon event (the first 89197 rows are the training set)
train = temp[:89197]
test = temp[89197:]
train, test

(array([[    1, 19990,    37, ...,     1,   180,  1000],
        [    2,  5304,    32, ...,     1,   330,   714],
        [    3,  1840,    12, ...,     1,   180,   138],
        ...,
        [89195, 13655,    16, ...,     1,   270,   462],
        [89196, 24840,     9, ...,     2,   230,   819],
        [89197, 27183,    25, ...,     1,   240,   317]]),
 array([[ 89198,   7986,     12, ...,      1,    180,    138],
        [ 89199,  11278,     34, ...,      1,    230,    840],
        [ 89200,  17245,      8, ...,      2,    280,    628],
        ...,
        [100316,   2042,     16, ...,      1,    270,    462],
        [100317,  24626,      8, ...,      0,    280,    628],
        [100318,    967,      8, ...,      2,    280,    628]]))
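
The split point 89197 is the number of training labels. One way to avoid hard-coding it, sketched below with small made-up arrays (`combined` and `labels` are hypothetical stand-ins for `temp` and `outcome`), is to derive the boundary from the label count:

```python
import numpy as np

# Hypothetical stand-ins: 5 training rows stacked on top of 3 test rows.
combined = np.arange(8 * 2).reshape(8, 2)
labels = np.array([4.3, 1.8, 4.4, 3.8, 3.1])

# The training rows come first, so the label count marks the boundary.
n_train = len(labels)
train_part = combined[:n_train]
test_part = combined[n_train:]

print(train_part.shape, test_part.shape)
```

This relies on the concatenation order used earlier (train first, then test) and on the index having been reset.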

In [15]:
# Hold out 30% of the labelled training data for validation
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(train, outcome, test_size = 0.3, random_state = 0)
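
To see what `test_size = 0.3` does to the 89197 labelled rows, the sketch below repeats the split on placeholder arrays of zeros with the same row count. The arrays and the `_val` names are assumptions for illustration; no real data is involved.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the labelled training data.
features = np.zeros((89197, 9))
targets = np.zeros(89197)

# Same arguments as in the cell above.
X_fit, X_val, y_fit, y_val = train_test_split(
    features, targets, test_size=0.3, random_state=0
)

print(X_fit.shape, X_val.shape)
```

Fixing `random_state` makes the split reproducible, so the validation score can be compared across runs.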

In [16]:
# The label is a numerical value, so this is a regression problem; we start with a linear regression model
from sklearn.linear_model import LinearRegression

In [18]:
# Fit the model on the training split, then predict on the held-out validation split
linear_regression = LinearRegression()
linear_regression.fit(X_train, y_train)
predictions = linear_regression.predict(X_test)

In [25]:
# Import r2_score to evaluate the model
from sklearn.metrics import r2_score
# Compare the predicted and actual engagement scores on the validation split
score = r2_score(y_test, predictions)
score

0.24273396883727627
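
An R² of about 0.24 means the linear model explains roughly 24% of the variance in the held-out engagement scores. As a reminder of what the metric computes, the sketch below reproduces `r2_score` by hand as 1 - SS_res / SS_tot on invented toy numbers, and also reports RMSE as a complementary error measure in the units of the score.

```python
import numpy as np
from sklearn.metrics import r2_score

# Invented toy values, not taken from the dataset.
y_true = np.array([4.0, 2.0, 3.0, 5.0])
y_pred = np.array([3.5, 2.5, 3.0, 4.5])

# Residual sum of squares and total sum of squares around the mean.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# Root mean squared error, in the same units as the target.
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

print(r2_manual, r2_score(y_true, y_pred), rmse)
```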

In [21]:
# Predict engagement scores for the competition test set
prediction = linear_regression.predict(test)
prediction

array([4.07531396, 3.8121079 , 2.60246022, ..., 3.85896581, 3.69906607,
       3.54098126])

In [22]:
# Convert the prediction array to a DataFrame and prepare the row_id column
prediction = pd.DataFrame(prediction)
prediction.columns = ['engagement_score']
df = pd.DataFrame()
df['row_id'] = test_rowid
prediction

       engagement_score
0              4.075314
1              3.812108
2              2.602460
3              3.962550
4              2.451964
...                 ...
11116          3.843967
11117          3.245489
11118          3.858966
11119          3.699066


In [23]:
# Concatenate the row_id column and the predictions side by side
prediction = pd.concat([df,prediction], axis=1)
prediction

       row_id  engagement_score
0       89198          4.075314
1       89199          3.812108
2       89200          2.602460
3       89201          3.962550
4       89202          2.451964
...       ...               ...
11116  100314          3.843967
11117  100315          3.245489
11118  100316          3.858966
11119  100317          3.699066


In [24]:
# Finally, export the predictions to a CSV file
prediction.to_csv('/content/drive/My Drive/Engagement Score Prediction/testpred.csv')
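
As written, `to_csv` also writes the DataFrame index as an unnamed first column. Whether the submission format accepts that is not stated here; if only `row_id` and `engagement_score` are expected, passing `index=False` drops it. The sketch below writes to in-memory buffers (an assumption made to keep the example self-contained) and compares the two header lines, using the first two prediction rows shown above.

```python
import io
import pandas as pd

# First two prediction rows from the output above.
submission = pd.DataFrame({
    "row_id": [89198, 89199],
    "engagement_score": [4.075314, 3.812108],
})

# Default behaviour: the index is written as an extra, unnamed column.
buffer_with_index = io.StringIO()
submission.to_csv(buffer_with_index)
header_with_index = buffer_with_index.getvalue().splitlines()[0]

# With index=False only the two named columns are written.
buffer_without_index = io.StringIO()
submission.to_csv(buffer_without_index, index=False)
header_without_index = buffer_without_index.getvalue().splitlines()[0]

print(repr(header_with_index))
print(repr(header_without_index))
```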