References:
- [Hyperparameter tuning in XGBoost](https://blog.cambridgespark.com/hyperparameter-tuning-in-xgboost-4ff9100a3b2f)

# Instagram Like Prediction @310ai Competition - Modeling, Visualization & Evaluation

This notebook is a continuation of the ***Data*** notebook. In this part of the project we train the predictive model, analyze it, visualize its output and evaluate it. Separating the work this way makes the notebooks easier to understand and less dependent on each other, which matters here because Instagram regularly changes its API and bot-mitigation methods. Multiple stages of this project have checkpoints, so the notebook should run without issues.

As we pointed out in the previous notebook, we will use **XGBoost 1.7** as the algorithm. Without further ado, let's dive in.

In [52]:
import pandas as pd
pd.options.display.float_format = '{:,.2f}'.format
import numpy as np
import xgboost as xgb
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error

First things first: we load the data. The dataset contains some identifying features, `shortcode`, `username` and `id`, which should not contribute to the accuracy of the model, so we can drop them altogether.

In [2]:
df = pd.read_csv('Data/main_dataset.csv')
df.drop(columns=['Unnamed: 0', 'shortcode', 'username', 'id'], inplace=True)
df.head()

Unnamed: 0,post_type,like,comment,object_1,object_2,object_3,object_4,object_5,object_6,object,...,reel_count,reel_avg_view,reel_avg_comment,reel_avg_like,reel_avg_duration,reel_frequency,media_count,media_avg_comment,media_avg_like,media_frequency
0,GraphSidecar,363269,6844,No Object,No Object,No Object,No Object,No Object,No Object,palace,...,1256,9406907.25,13544.5,634500.7,92.466333,18.554887,7393,8635.166667,410027.25,1.090777
1,GraphSidecar,546578,11354,No Object,No Object,No Object,No Object,No Object,No Object,"bikini, two-piece",...,1256,9406907.25,13544.5,634500.7,92.466333,18.554887,7393,8635.166667,410027.25,1.090777
2,GraphImage,734124,11937,No Object,No Object,No Object,No Object,No Object,No Object,"balance_beam, beam",...,1256,9406907.25,13544.5,634500.7,92.466333,18.554887,7393,8635.166667,410027.25,1.090777
3,GraphSidecar,7720568,42249,2 people,people playing football,people playing soccer,ball,No Object,No Object,basketball,...,387,18249237.58,28318.83333,3607850.0,32.616333,17.03613,3475,52543.16667,7980593.083,2.203446
4,GraphImage,14451079,103278,1 person,baby,No Object,No Object,No Object,No Object,"crib, cot",...,387,18249237.58,28318.83333,3607850.0,32.616333,17.03613,3475,52543.16667,7980593.083,2.203446


Now we can separate the independent and dependent variables. By convention, the independent variables are called X and the dependent variable y.

In [35]:
# drop the object labels from Instagram's own detection algorithm and keep only the EfficientNet labels
exclude_columns = [
    'like',
    'object_1',
    'object_2',
    'object_3',
    'object_4',
    'object_5',
    'object_6',
]
X, y = df.loc[:, ~df.columns.isin(exclude_columns)].copy(), df[['like']].copy()

As we said in the previous notebook, XGBoost has been able to work with categorical variables since version 1.7, but before passing those variables to the model we must change the type of the corresponding features to `category` in pandas.

In [36]:
categorical_features = X.select_dtypes(exclude=np.number).columns.tolist()
for feature in categorical_features:
    X[feature] = X[feature].astype('category')
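The effect of this conversion can be checked on a small made-up frame. The sketch below uses illustrative values (not rows from the dataset) and the hypothetical names `toy` and `toy_categorical`; it only shows that non-numeric columns end up with the `category` dtype while numeric ones are left alone.

```python
import numpy as np
import pandas as pd

# Toy frame: one text column and one numeric column (illustrative values only).
toy = pd.DataFrame({
    'post_type': ['GraphImage', 'GraphSidecar', 'GraphImage'],
    'media_count': [10, 20, 30],
})

# Same logic as the cell above: every non-numeric column becomes a pandas category.
toy_categorical = toy.select_dtypes(exclude=np.number).columns.tolist()
for feature in toy_categorical:
    toy[feature] = toy[feature].astype('category')

# Inspect the resulting dtypes.
print(toy.dtypes)
```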

Now we can split the data into training and validation sets. We will use 25% of the data as the validation set, which is the default of `train_test_split` when `test_size` is not given.

In [37]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=69)

dtrain = xgb.DMatrix(X_train, label=y_train, enable_categorical=True)
dtest = xgb.DMatrix(X_test, label=y_test, enable_categorical=True)
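Since the split above does not pass `test_size`, the 25% share comes from the scikit-learn default. A quick way to convince yourself, sketched on 100 dummy samples (the names `rows`, `train_rows` and `test_rows` are made up for this check):

```python
from sklearn.model_selection import train_test_split

# 100 dummy samples standing in for the real rows.
rows = list(range(100))

# No test_size given, so scikit-learn falls back to its default split.
train_rows, test_rows = train_test_split(rows, random_state=69)
print(len(train_rows), len(test_rows))
```

With 100 samples this should leave 75 rows for training and 25 for validation.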

### What architecture and pre-trained models, if any, did you use?

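In this section we define the hyperparameter values for the model. Keep in mind that, since we will perform cross-validation, each hyperparameter is later expanded into a list of candidate values so that we can find the best combination. Before that, we compute a baseline that always predicts the mean of the training labels, which gives us a reference MAE.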

In [43]:
# creating a baseline model
mean_train = np.mean(np.asarray(y_train))
baseline_predictions = np.ones(np.asarray(y_test).shape) * mean_train
mae_baseline = mean_absolute_error(np.asarray(y_test), baseline_predictions)
print(f'Baseline MAE: {mae_baseline:.2f}')

Baseline MAE: 1832111.19
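
The arithmetic behind this baseline is easy to follow on a handful of made-up numbers. The sketch below uses invented like counts (not taken from the dataset) and hypothetical `toy_*` names:

```python
import numpy as np

# Made-up like counts, only to illustrate the arithmetic.
toy_train = np.array([100.0, 200.0, 300.0])
toy_test = np.array([150.0, 450.0])

# The baseline "model" predicts the training mean for every test row.
toy_mean = toy_train.mean()                       # (100 + 200 + 300) / 3
toy_baseline = np.full(toy_test.shape, toy_mean)

# MAE is the average absolute gap between truth and prediction.
toy_mae = np.abs(toy_test - toy_baseline).mean()  # (50 + 250) / 2
print(f'Toy baseline MAE: {toy_mae:.2f}')
```

Any trained model should beat this number by a clear margin to be worth keeping.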


In [56]:
params = {
    'max_depth': 6,             # maximum depth of a tree
    'min_child_weight': 1,      # minimum sum of weights of all observations required in a child
    'eta': .25,                 # learning rate
    'subsample': 1,             # fraction of observations randomly sampled for each tree
}
num_boost_rounds = 1500

Tuning the `max_depth`, `min_child_weight`, `eta` and `subsample` hyperparameters.

TODO:
- test mse and rmse for metrics
- find way to translate mae to score out of 100
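
As a starting point for these two items, here is a minimal pure-Python sketch of the three metrics together with one possible way to map an MAE onto a 0-100 scale. The `relative_score` helper is hypothetical and is only one convention among many (share of the baseline error removed); it is not something defined by the competition or by the rest of this notebook, and the sample values are invented.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error, in the same unit as the target."""
    return math.sqrt(mse(y_true, y_pred))

def relative_score(model_mae, baseline_mae):
    """Hypothetical 0-100 score: share of the baseline error removed, clipped at 0."""
    return max(0.0, 100.0 * (1.0 - model_mae / baseline_mae))

# Invented values, only to exercise the functions.
y_true = [100.0, 200.0, 300.0]
y_pred = [110.0, 190.0, 330.0]
print(mae(y_true, y_pred), mse(y_true, y_pred), rmse(y_true, y_pred))
print(relative_score(25.0, 100.0))
```

Because RMSE squares the errors before averaging, it penalizes a few very large misses more heavily than MAE does, which may matter for accounts with extreme like counts.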

In [62]:
grid_search_params = [
    (max_depth, min_child_weight, eta, subsample)
    for max_depth in range(5, 14)
    for min_child_weight in range(3, 10)
    for eta in [.3, .2, .1, .05, .01, .005]
    for subsample in [i/10. for i in range(7, 11)]
]

min_mae = float('Inf')
best_params = None
for max_depth, min_child_weight, eta, subsample in grid_search_params:
    print(f'CV with max_depth: {max_depth}, min_child_weight: {min_child_weight}, eta: {eta}, subsample: {subsample}')

    # updating parameters dictionary
    params['max_depth'] = max_depth
    params['min_child_weight'] = min_child_weight
    params['eta'] = eta
    params['subsample'] = subsample

    # performing cv
    cv_results = xgb.cv(
        params,
        dtrain,
        num_boost_round=num_boost_rounds,
        seed=69,
        nfold=5,
        metrics={'mae'},
        early_stopping_rounds=10,
    )

    # updating the best mae
    mean_mae = cv_results['test-mae-mean'].min()
    boost_rounds = cv_results['test-mae-mean'].argmin()
    print(f'\tMAE: {mean_mae:,.2f}, rounds: {boost_rounds}')
    if mean_mae < min_mae:
        min_mae = mean_mae
        best_params = (max_depth, min_child_weight, eta, subsample)

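# best hyperparameters combination (read from best_params, not from the last loop iteration)
best_max_depth, best_min_child_weight, best_eta, best_subsample = best_params
print('Best hyperparameters:')
print(f'\tmax_depth: {best_max_depth}')
print(f'\tmin_child_weight: {best_min_child_weight}')
print(f'\teta: {best_eta}')
print(f'\tsubsample: {best_subsample}')
print(f'\tmae: {min_mae}')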


CV with max_depth: 5, min_child_weight: 3, eta: 0.3, subsample: 0.7
	MAE: 430,303.75, rounds: 25
CV with max_depth: 5, min_child_weight: 3, eta: 0.3, subsample: 0.8
	MAE: 442,774.24, rounds: 25
CV with max_depth: 5, min_child_weight: 3, eta: 0.3, subsample: 0.9
	MAE: 441,058.63, rounds: 16
CV with max_depth: 5, min_child_weight: 3, eta: 0.3, subsample: 1.0
	MAE: 420,741.64, rounds: 15
CV with max_depth: 5, min_child_weight: 3, eta: 0.2, subsample: 0.7
	MAE: 424,710.19, rounds: 25
CV with max_depth: 5, min_child_weight: 3, eta: 0.2, subsample: 0.8
	MAE: 416,182.94, rounds: 27
CV with max_depth: 5, min_child_weight: 3, eta: 0.2, subsample: 0.9
	MAE: 434,100.67, rounds: 21
CV with max_depth: 5, min_child_weight: 3, eta: 0.2, subsample: 1.0
	MAE: 414,795.39, rounds: 14
CV with max_depth: 5, min_child_weight: 3, eta: 0.1, subsample: 0.7
	MAE: 429,602.15, rounds: 32
CV with max_depth: 5, min_child_weight: 3, eta: 0.1, subsample: 0.8
	MAE: 420,045.38, rounds: 34
CV with max_depth: 5, min_chil

KeyboardInterrupt: 
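
The run above was interrupted, which is not surprising given the size of the grid: every combination triggers a 5-fold cross-validation. The sketch below counts the combinations and, as one possible alternative that the original notebook does not use, draws a random subset of them (random search). The budget of 50 and the names `full_grid`, `rng` and `sampled_grid` are assumptions for illustration.

```python
from itertools import product
import random

# Same candidate values as in the grid search cell above.
max_depths = range(5, 14)
min_child_weights = range(3, 10)
etas = [.3, .2, .1, .05, .01, .005]
subsamples = [i / 10. for i in range(7, 11)]

# Every combination of the four hyperparameters.
full_grid = list(product(max_depths, min_child_weights, etas, subsamples))
print(f'Combinations in the full grid: {len(full_grid)}')

# Hypothetical budget: evaluate only a random subset of the grid.
rng = random.Random(69)
sampled_grid = rng.sample(full_grid, k=50)
print(f'Combinations in the sampled grid: {len(sampled_grid)}')
```

Each tuple in `sampled_grid` has the same `(max_depth, min_child_weight, eta, subsample)` layout as `grid_search_params`, so it could be fed to the same loop.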

In [57]:
cv_results = xgb.cv(
    params,
    dtrain,
    num_boost_round=num_boost_rounds,
    seed=69,
    nfold=5,
    metrics={'mae'},
    early_stopping_rounds=10
)

cv_results

Unnamed: 0,train-mae-mean,train-mae-std,test-mae-mean,test-mae-std
0,1197139.97,45194.99,1194992.06,199849.55
1,923895.6,36381.61,931012.96,180584.51
2,718165.43,29864.97,748795.43,168947.6
3,566510.06,23838.79,629835.11,145385.01
4,453778.08,17964.92,542740.3,130413.59
5,371318.83,14313.96,486688.8,123371.6
6,310465.06,13327.39,456983.24,124113.82
7,265641.19,11908.19,445200.73,121137.34
8,232097.8,11069.45,441394.45,124558.04
9,206850.12,9066.81,440732.22,128053.72
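
To read this table, the sketch below takes the `test-mae-mean` column exactly as printed above, locates the best round, and compares it with the baseline MAE printed earlier. The comparison is rough: the baseline was computed on the held-out split, while the cross-validation figure comes from folds of the training data. The variable names are made up for this illustration.

```python
# test-mae-mean column copied from the table above.
test_mae_mean = [1194992.06, 931012.96, 748795.43, 629835.11, 542740.30,
                 486688.80, 456983.24, 445200.73, 441394.45, 440732.22]

# Index of the round with the lowest mean validation MAE.
best_round = min(range(len(test_mae_mean)), key=test_mae_mean.__getitem__)
best_mae = test_mae_mean[best_round]
print(f'Best round index: {best_round}, CV MAE: {best_mae:,.2f}')

# Baseline MAE printed earlier in this notebook.
baseline_mae = 1832111.19
reduction = 1 - best_mae / baseline_mae
print(f'Reduction vs. baseline: {reduction:.1%}')
```

Since round indices start at 0, the number of trees to train is `best_round + 1`.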


In [None]:
# NOTE: `grid` is not defined anywhere above; this cell assumes a fitted GridSearchCV object
results = pd.DataFrame(grid.cv_results_)
results.to_csv('xgb-results.csv')