# Creator suggestion for influencer sponsorship

## Part 5: Use Auto-ML on the outreach of the brand on positive videos
---

### Content Workflow:

- [Instantiate libraries](#Instantiate-libraries)
- [Creating a predictive model on the popularity of video released](#Creating-a-predictive-model-on-the-popularity-of-video-released)

## Instantiate libraries

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Import progress bar
from tqdm.notebook import tqdm_notebook
tqdm_notebook.pandas()

# Import tqdm
from tqdm import tqdm

# Modelling
from sklearn.model_selection import train_test_split

# Expand display of the dataframe
pd.options.display.max_colwidth = 500

In [3]:
### Load Saved datasets
df_m = pd.read_csv('./output/df_m.csv')

## Creating a predictive model on the popularity of video released

In [4]:
df_m.drop(columns=['Unnamed: 0', 'Title'], inplace=True)
print(df_m.shape)
print(df_m.info())
df_m.head()

(15718, 10)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15718 entries, 0 to 15717
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Channel_name      15718 non-null  object 
 1   View_count        15716 non-null  float64
 2   Like_count        15717 non-null  float64
 3   Comment_count     15718 non-null  int64  
 4   View_per_day      15716 non-null  float64
 5   Like_per_day      15717 non-null  float64
 6   Comment_per_day   15718 non-null  float64
 7   Engagement_ratio  15716 non-null  float64
 8   Popular_target    15718 non-null  int64  
 9   Growth            15718 non-null  float64
dtypes: float64(7), int64(2), object(1)
memory usage: 1.2+ MB
None


| | Channel_name | View_count | Like_count | Comment_count | View_per_day | Like_per_day | Comment_per_day | Engagement_ratio | Popular_target | Growth |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Tina Huang | 16757.0 | 1113.0 | 143 | 9509.002236 | 631.587963 | 81.14742 | 0.853375 | 1 | 36.992072 |
| 1 | Tina Huang | 46006.0 | 2457.0 | 321 | 3555.394838 | 189.879692 | 24.807237 | 0.697735 | 1 | 36.992072 |
| 2 | Tina Huang | 77435.0 | 4551.0 | 379 | 2985.21075 | 175.446428 | 14.610898 | 0.489443 | 1 | 36.992072 |
| 3 | Tina Huang | 44254.0 | 2683.0 | 196 | 1306.3021 | 79.197554 | 5.785583 | 0.442898 | 1 | 36.992072 |
| 4 | Tina Huang | 207709.0 | 11628.0 | 799 | 4726.001495 | 264.571807 | 18.179642 | 0.384673 | 1 | 36.992072 |
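
The `info()` output above shows a handful of missing values (for example, `View_count` has 15716 non-null entries out of 15718). As a minimal sketch of how such gaps could be inspected before modelling, the snippet below counts missing values per column on a small toy frame. The toy values are hypothetical, and dropping incomplete rows is only one possible choice, not necessarily what was done for `df_m`.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with the same kind of gaps as df_m.
toy = pd.DataFrame({
    "View_count": [16757.0, np.nan, 77435.0, 44254.0],
    "Like_count": [1113.0, 2457.0, np.nan, 2683.0],
    "Comment_count": [143, 321, 379, 196],
})

# Count missing values per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# One option among several: keep only the rows with no missing values.
toy_complete = toy.dropna()
print(toy_complete.shape)
```

The same two calls, `isna().sum()` and `dropna()`, could be applied to `df_m` directly.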


## Modelling on Numerical Data 

In [4]:
features = [col for col in df_m.columns if col != 'Popular_target']
X = df_m[features]
y = df_m['Popular_target']

In [9]:
features = [col for col in df_m.columns if col != 'Channel_name']
X1 = df_m[features]
y1 = df_m['Channel_name']

In [6]:
# Creating holdout set data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05,
                                                    stratify=y, random_state=42)

In [10]:
# Creating holdout set data
# Note: this split is stratified on y (Popular_target), not on y1 (Channel_name)
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size=0.05,
                                                        stratify=y, random_state=42)
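
To check that a stratified split keeps the class balance, one could compare the share of positive targets in the training and holdout parts. The sketch below does this on toy data; the frame, its column names and the 30% positive rate are assumptions made for illustration only and do not come from `df_m`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 1,000 rows, 30% of them with target == 1.
toy = pd.DataFrame({
    "feature": np.arange(1000),
    "target": np.array([1] * 300 + [0] * 700),
})

# Same settings as the notebook: 5% holdout, stratified on the target.
toy_train, toy_test = train_test_split(
    toy, test_size=0.05, stratify=toy["target"], random_state=42
)

# With stratification, both rates should be close to each other.
train_rate = toy_train["target"].mean()
test_rate = toy_test["target"].mean()
print(f"train: {train_rate:.3f}, test: {test_rate:.3f}")
```

Running the same comparison on `y_train` and `y_test` would show whether the holdout set is representative of `Popular_target`.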

## Creating Pycaret dataset

In [None]:
#!pip install pycaret

In [7]:
from pycaret.regression import *

In [11]:
pyca_data1 = pd.concat([X1_train, y1_train], axis=1)

In [8]:
pyca_data = pd.concat([X_train, y_train], axis=1)

In [9]:
pyca_model = setup(data=pyca_data,
                   target='Popular_target',
                   session_id=123)

| | Description | Value |
|---|---|---|
| 0 | session_id | 123 |
| 1 | Target | Popular_target |
| 2 | Original Data | (14932, 10) |
| 3 | Missing Values | True |
| 4 | Numeric Features | 8 |
| 5 | Categorical Features | 1 |
| 6 | Ordinal Features | False |
| 7 | High Cardinality Features | False |
| 8 | High Cardinality Method | |
| 9 | Transformed Train Set | (10452, 70) |


In [10]:
best = compare_models(exclude=['ransac'])

| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
|---|---|---|---|---|---|---|---|---|
| xgboost | Extreme Gradient Boosting | 0.122 | 0.0394 | 0.1984 | 0.8315 | 0.1407 | 0.1603 | 2.776 |
| lightgbm | Light Gradient Boosting Machine | 0.1247 | 0.041 | 0.2023 | 0.8247 | 0.1434 | 0.1661 | 0.178 |
| rf | Random Forest Regressor | 0.123 | 0.0458 | 0.214 | 0.804 | 0.1516 | 0.1651 | 3.207 |
| knn | K Neighbors Regressor | 0.0985 | 0.0482 | 0.2192 | 0.794 | 0.1508 | 0.1555 | 0.2 |
| et | Extra Trees Regressor | 0.1807 | 0.0808 | 0.2841 | 0.6545 | 0.1997 | 0.2419 | 2.859 |
| gbr | Gradient Boosting Regressor | 0.2407 | 0.0923 | 0.3038 | 0.6049 | 0.2155 | 0.314 | 1.916 |
| catboost | CatBoost Regressor | 0.0871 | 0.0264 | 0.1358 | 0.5868 | 0.097 | 0.1146 | 2.598 |
| dt | Decision Tree Regressor | 0.0991 | 0.0991 | 0.3146 | 0.5758 | 0.2181 | 0.135 | 0.102 |
| ada | AdaBoost Regressor | 0.3972 | 0.1781 | 0.422 | 0.2376 | 0.3019 | 0.4779 | 0.476 |
| omp | Orthogonal Matching Pursuit | 0.4058 | 0.2005 | 0.4477 | 0.1422 | 0.3117 | 0.5518 | 0.032 |


Thus, LightGBM is chosen for the production model: its $R^2$ of 0.8247 is close to the best score (0.8315, XGBoost), while its training time is far shorter (0.178 s versus 2.776 s).
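
The trade-off between score and training time can be written as a simple rule: keep the models whose $R^2$ is within a small tolerance of the best one, then take the fastest. The sketch below applies that rule to the top four rows of the comparison table. The tolerance of 0.01 is a hypothetical choice made for illustration, not a value used in the original analysis.

```python
import pandas as pd

# Scores copied from the top four rows of the comparison table above.
results = pd.DataFrame(
    {
        "R2": [0.8315, 0.8247, 0.8040, 0.7940],
        "TT (Sec)": [2.776, 0.178, 3.207, 0.200],
    },
    index=["xgboost", "lightgbm", "rf", "knn"],
)

# Hypothetical tolerance: how much R2 we accept losing in exchange for speed.
R2_TOLERANCE = 0.01

# Keep the models whose R2 is within the tolerance of the best R2.
best_r2 = results["R2"].max()
candidates = results[results["R2"] >= best_r2 - R2_TOLERANCE]

# Among those, pick the one with the shortest training time.
chosen = candidates["TT (Sec)"].idxmin()
print(chosen)
```

A stricter tolerance would leave only XGBoost, so the outcome depends on how much accuracy one is willing to trade for speed.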

## Conclusion

The prediction model is intended to let the client check the likely popularity of a video before releasing it on social media.
The LightGBM model will be fine-tuned further to improve its score.
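
As a rough sketch of what that fine-tuning step might look like with PyCaret's regression module, the helper below creates a LightGBM model, tunes it and scores it on held-out data. The function name `tune_lightgbm`, the `n_iter=25` budget and the choice of $R^2$ as the metric to optimise are all assumptions, and the helper assumes `setup(...)` has already been run in the same session.

```python
def tune_lightgbm(holdout):
    """Tune LightGBM in the active PyCaret experiment and score it on holdout data.

    Hypothetical helper: assumes pycaret.regression.setup(...) was already called.
    `holdout` is a DataFrame with the same columns as the data passed to setup().
    """
    # Imported lazily so that defining the helper does not require PyCaret.
    from pycaret.regression import create_model, predict_model, tune_model

    # Cross-validated baseline LightGBM model.
    base = create_model("lightgbm")

    # Hyperparameter search; n_iter=25 is an arbitrary budget.
    tuned = tune_model(base, optimize="R2", n_iter=25)

    # Predictions of the tuned model on data it has not seen during training.
    holdout_predictions = predict_model(tuned, data=holdout)
    return tuned, holdout_predictions
```

It could be called as `tune_lightgbm(pd.concat([X_test, y_test], axis=1))`, which would also put the 5% holdout set created earlier to use.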