# Instagram Like Prediction @310ai Competition - Modeling, Visualization & Evaluation

This notebook continues the ***Data*** notebook. In this part of the project we train the predictive model, analyze it, visualize its output, and evaluate it. Splitting the work this way keeps each notebook easier to follow and less dependent on the other, which matters here because Instagram has shown that it regularly changes its API and bot-mitigation methods. Several stages of the project save checkpoints, so this notebook should run without issues.

As noted in the previous notebook, we use **XGBoost 1.7** as the algorithm. Without further ado, let's dive in.

In [3]:
import pandas as pd
import numpy as np
import xgboost as xgb
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

First, we load the data. The dataset contains three identifying features, `shortcode`, `username` and `id`, which should not contribute to the accuracy of the model, so we drop them altogether, along with the leftover index column `Unnamed: 0`.

In [4]:
df = pd.read_csv('Data/main_dataset.csv')
df.drop(columns=['Unnamed: 0', 'shortcode', 'username', 'id'], inplace=True)
df.head()

| | post_type | like | comment | object_1 | object_2 | object_3 | object_4 | object_5 | object_6 | object | ... | reel_count | reel_avg_view | reel_avg_comment | reel_avg_like | reel_avg_duration | reel_frequency | media_count | media_avg_comment | media_avg_like | media_frequency |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GraphSidecar | 363269 | 6844 | No Object | No Object | No Object | No Object | No Object | No Object | palace | ... | 1256 | 9406907.25 | 13544.5 | 634500.7 | 92.466333 | 18.554887 | 7393 | 8635.166667 | 410027.25 | 1.090777 |
| 1 | GraphSidecar | 546578 | 11354 | No Object | No Object | No Object | No Object | No Object | No Object | bikini, two-piece | ... | 1256 | 9406907.25 | 13544.5 | 634500.7 | 92.466333 | 18.554887 | 7393 | 8635.166667 | 410027.25 | 1.090777 |
| 2 | GraphImage | 734124 | 11937 | No Object | No Object | No Object | No Object | No Object | No Object | balance_beam, beam | ... | 1256 | 9406907.25 | 13544.5 | 634500.7 | 92.466333 | 18.554887 | 7393 | 8635.166667 | 410027.25 | 1.090777 |
| 3 | GraphSidecar | 7720568 | 42249 | 2 people | people playing football | people playing soccer | ball | No Object | No Object | basketball | ... | 387 | 18249237.58 | 28318.83333 | 3607850.0 | 32.616333 | 17.03613 | 3475 | 52543.16667 | 7980593.083 | 2.203446 |
| 4 | GraphImage | 14451079 | 103278 | 1 person | baby | No Object | No Object | No Object | No Object | crib, cot | ... | 387 | 18249237.58 | 28318.83333 | 3607850.0 | 32.616333 | 17.03613 | 3475 | 52543.16667 | 7980593.083 | 2.203446 |


Now we can separate the independent and dependent variables. By convention, the independent variables are called `X` and the dependent variable `y`.

In [5]:
# X holds every column except the target `like`; y keeps `like` as a one-column DataFrame
X, y = df.loc[:, df.columns != 'like'].copy(), df[['like']].copy()

As mentioned in the previous notebook, XGBoost 1.7 can work with categorical variables directly. Before passing them to the model, however, we must convert those features to the pandas `category` dtype.

In [4]:
categorical_features = X.select_dtypes(exclude=np.number).columns.tolist()
for feature in categorical_features:
    X[feature] = X[feature].astype('category')
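
As a minimal sketch of the conversion above, here is the same loop on a toy frame. The column names and values are made up for illustration and are not read from the dataset; the point is only to show which columns `select_dtypes` picks up and what dtype they end up with.

```python
import numpy as np
import pandas as pd

# Toy frame; the values are hypothetical and only mimic the dataset's shape
toy = pd.DataFrame({
    'post_type': ['GraphImage', 'GraphSidecar', 'GraphImage'],
    'object': ['palace', 'crib, cot', 'palace'],
    'comment': [10, 20, 30],
    'media_avg_like': [1.5, 2.5, 3.5],
})

# Every non-numeric column is treated as a categorical feature
toy_categorical = toy.select_dtypes(exclude=np.number).columns.tolist()
for feature in toy_categorical:
    toy[feature] = toy[feature].astype('category')

print(toy_categorical)
print(toy.dtypes)
```

Numeric columns are left untouched, so only the text columns change dtype.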

Now we can split the data into training and validation sets. We hold out 25% of the data as the validation set, which is the default of `train_test_split` when `test_size` is not given.

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=69)
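
To make the split proportions concrete, the sketch below runs the same call on arbitrary toy arrays (not the project data) and prints the resulting sizes. With 100 rows and no `test_size`, scikit-learn holds out 25 rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 3 features (values are arbitrary)
X_toy = np.arange(300).reshape(100, 3)
y_toy = np.arange(100)

# No test_size is given, so scikit-learn uses its default of 0.25
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=69)

print(len(X_tr), len(X_te))
```

Fixing `random_state` makes the split reproducible between runs.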

### What architecture and pre-trained models, if any, did you use?

In this cell we define the hyperparameter values for the model. Keep in mind that, because we run a cross-validated grid search, each parameter is given as a list of candidate values, and the search evaluates every combination to find the best one.

In [6]:
model = xgb.XGBRegressor(tree_method='hist', enable_categorical=True)

params = {
    'booster': ['gbtree', 'dart'],
    'eta': np.arange(0.25, 1.1, 0.25),
    'subsample': np.arange(0.25, 1.1, 0.25),
    'refresh_leaf': [0, 1],
    'n_estimators': np.arange(1000, 1500, 100),
}
grid = GridSearchCV(model, param_grid=params, return_train_score=True, cv=5, verbose=1)
# Search on the training split only, so the held-out validation set stays unseen
grid.fit(X_train, y_train)
# For a regressor, GridSearchCV's default score is R^2, not accuracy
print(f'{grid.best_estimator_} best hyperparameters are: {grid.best_params_} with a mean cross-validated R^2 of: {grid.best_score_:.2f}')

Fitting 5 folds for each of 320 candidates, totalling 1600 fits


KeyboardInterrupt: 
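
The candidate count in the log can be checked by hand. The sketch below rebuilds the same grid under a different name (`search_space`, chosen here only to avoid clobbering `params`) and multiplies the number of values per parameter, then by the number of folds.

```python
import numpy as np

# Same grid as in the cell above, under a separate name for illustration
search_space = {
    'booster': ['gbtree', 'dart'],
    'eta': np.arange(0.25, 1.1, 0.25),          # 0.25, 0.5, 0.75, 1.0
    'subsample': np.arange(0.25, 1.1, 0.25),    # 0.25, 0.5, 0.75, 1.0
    'refresh_leaf': [0, 1],
    'n_estimators': np.arange(1000, 1500, 100), # 1000, 1100, ..., 1400
}

# A grid search tries every combination, so the sizes multiply
n_candidates = 1
for values in search_space.values():
    n_candidates *= len(values)

n_folds = 5
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)
```

This gives 2 × 4 × 4 × 2 × 5 = 320 candidates and 1600 fits, each training at least 1000 trees, so the search is expensive. Trimming any one list shrinks the total multiplicatively.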

In [None]:
results = pd.DataFrame(grid.cv_results_)
results.to_csv('xgb-results.csv')
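
Once a search has finished, `cv_results_` can be ranked to compare candidates. The sketch below is an illustration under stated assumptions: it uses a small `Ridge` model on synthetic data as a stand-in (not the XGBoost search above), only to show the columns that `GridSearchCV` produces.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data and model; not the project's dataset or estimator
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
demo_grid = GridSearchCV(
    Ridge(),
    param_grid={'alpha': [0.1, 1.0, 10.0]},
    return_train_score=True,
    cv=5,
)
demo_grid.fit(X_demo, y_demo)

# cv_results_ is a dict of arrays, one entry per candidate
demo_results = pd.DataFrame(demo_grid.cv_results_)
columns = ['params', 'mean_train_score', 'mean_test_score', 'std_test_score', 'rank_test_score']

# Rank 1 is the candidate with the best mean cross-validated score
ranked = demo_results[columns].sort_values('rank_test_score')
print(ranked)
```

Comparing `mean_train_score` with `mean_test_score` for each row is a quick way to spot candidates that overfit.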