## XGBM

In this notebook, I wanted to try the XGBoost model (referred to here as XGBM). It is probably too powerful a model for my small dataset, but it is worth a try.

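**Extreme Gradient Boosting: how boosting works.**

 - Boosting builds an ensemble sequentially: each new base model concentrates on the observations that the previous models predicted poorly. In weight-based boosting (AdaBoost), this is done by assigning those observations higher weights, so they have more influence on the error of the next model; the first base estimator is fit with uniform weights.
 - Gradient boosting, which XGBoost implements, reaches the same goal differently: each new tree is fit to the gradient of the loss with respect to the current predictions (for squared error, the residuals) rather than to reweighted observations.
 - In AdaBoost the final prediction is a weighted vote, where each base model's weight depends on its training error or misclassification rate. In XGBoost the final prediction is the base score plus the sum of all trees' outputs, each scaled by the learning rate.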

 - ***Because this is a regression problem, I evaluate the model with MSE and RMSE, and compare the train and test scores to check for overfitting.***
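To make the boosting idea above more concrete, here is a minimal sketch of gradient boosting for squared error in plain Python. Each round fits a one-split tree (a stump) to the residuals of the current ensemble and adds a shrunken copy of it to the predictions. This is a toy under my own assumptions, not XGBoost's actual implementation: the helper names (`fit_stump`, `boost`, `predict`, `training_mse`) and the data are made up for illustration, and XGBoost additionally uses regularization, second-order gradient information, and deeper trees.

```python
# Toy gradient boosting for squared error with depth-1 trees (stumps).
# Illustrative only; all names and data here are hypothetical.

def fit_stump(x, residuals):
    """Return (threshold, left_mean, right_mean) minimizing squared error."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # every split that leaves both sides non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def stump_predict(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean

def boost(x, y, n_estimators=50, learning_rate=0.1):
    """Fit stumps sequentially, each one to the current residuals."""
    base_score = sum(y) / len(y)          # start from the mean of the target
    preds = [base_score] * len(y)
    stumps = []
    for _ in range(n_estimators):
        # For squared error, the negative gradient is just the residual.
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        preds = [pi + learning_rate * stump_predict(stump, xi)
                 for xi, pi in zip(x, preds)]
    return base_score, learning_rate, stumps

def predict(model, xi):
    base_score, learning_rate, stumps = model
    return base_score + learning_rate * sum(stump_predict(s, xi) for s in stumps)

def training_mse(model, x, y):
    return sum((yi - predict(model, xi)) ** 2 for xi, yi in zip(x, y)) / len(y)

# Made-up one-feature data with a jump between x = 4 and x = 5
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

model_5 = boost(x, y, n_estimators=5)
model_50 = boost(x, y, n_estimators=50)
print("training MSE after 5 rounds: ", training_mse(model_5, x, y))
print("training MSE after 50 rounds:", training_mse(model_50, x, y))
```

The training error keeps shrinking as rounds are added, which is also why a large `n_estimators` can overfit a small dataset: the training score alone says little, so the test score has to be checked as well.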

In [None]:
# Windows installation
# pip install xgboost

In [1]:
# Mac installation

# 1. brew install libomp
# 2. pip install xgboost

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.impute import SimpleImputer

import xgboost as xgb

In [4]:
# Check xgboost

print(xgb.__version__)

1.4.2


### 1. Import data

In [5]:
df = pd.read_csv('data/P5clean.csv')
df.head(2)

| Unnamed: 0 | year | overall_score | property_rights | government_integrity | tax_burden | government_spending | business_freedom | labor_freedom | monetary_freedom | trade_freedom | ... | log_gdp_per_capita | social_support | healthy_life_expectancy_at_birth | choice_freedom | generosity | perceptions_of_corruption | positive_affect | negative_affect | country_name | life_ladder |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2020.0 | 66.9 | 57.1 | 38.8 | 85.9 | 74.6 | 65.7 | 52.1 | 81.2 | 88.4 | ... | 9.497 | 0.71 | 69.3 | 0.754 | 0.007 | 0.891 | 0.679 | 0.265 | Albania | 5.365 |
| 1 | 2020.0 | 53.1 | 50.5 | 49.7 | 69.6 | 50.7 | 60.2 | 46.5 | 53.7 | 69.2 | ... | 9.85 | 0.897 | 69.2 | 0.823 | -0.122 | 0.816 | 0.764 | 0.342 | Argentina | 5.901 |



### 2. Modeling

In [6]:
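# Use every column except the country label and the target as a feature.
# Note: this keeps the 'Unnamed: 0' index column and 'year' as features.
exclude = ['country_name', 'life_ladder']
X_features = [i for i in df if i not in exclude]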

In [7]:
# Preprocessing: separate features and target, then split into train and test sets

X = df[X_features]
y = df['life_ladder']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

In [8]:
# Instantiate and fit XGBM
# Note: assigning to `xgb` rebinds the module alias to the model instance

xgb = xgb.XGBRegressor(base_score = 0.25,
                       learning_rate = 0.05, 
                       max_depth=3, 
                       min_child_weight = 1,
                       n_estimators = 900,
                       n_jobs = 1,
                       objective='reg:squarederror', 
                       random_state = 0,
                       reg_alpha = 0)
                       

#fit the model
xgb.fit(X_train, y_train)

XGBRegressor(base_score=0.25, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.05, max_delta_step=0, max_depth=3,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=900, n_jobs=1, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

In [9]:
# Prediction and MSE score

pred = xgb.predict(X_test)
mse = mean_squared_error(y_test, pred)
mse

0.179471618962201

In [10]:
# .score() returns R^2 for regressors; a large train/test gap suggests overfitting
print('train score:', xgb.score(X_train,y_train))
print('test score:', xgb.score(X_test,y_test))
print("RMSE: %.2f" % (mse**(1/2.0)))

train score: 0.9798099828836212
test score: 0.8573147683420658
RMSE: 0.42
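
To show what the numbers above measure, here is a small sketch of MSE, RMSE, and R² written out by hand in plain Python. The helper names (`mse`, `rmse`, `r2`) and the toy values are my own and are not taken from the dataset; the last few lines only redo the arithmetic on the figures already printed in this notebook. The "score" reported for a scikit-learn style regressor is R², so the train/test comparison is a comparison of R² values.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error: average of squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: same units as the target."""
    return math.sqrt(mse(y_true, y_pred))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up toy values (hypothetical, not from P5clean.csv)
y_true = [5.0, 6.0, 7.0, 4.0]
y_pred = [5.5, 5.5, 6.5, 4.5]
print("MSE: ", mse(y_true, y_pred))
print("RMSE:", rmse(y_true, y_pred))
print("R^2: ", r2(y_true, y_pred))

# Arithmetic on the values reported in this notebook
reported_mse = 0.179471618962201
reported_rmse = math.sqrt(reported_mse)
r2_gap = 0.9798099828836212 - 0.8573147683420658
print("RMSE: %.2f" % reported_rmse)
print("train - test R^2 gap: %.3f" % r2_gap)
```

The square root of the reported MSE reproduces the RMSE of 0.42, which means a typical prediction error of about 0.42 points on the `life_ladder` scale. The gap of roughly 0.12 between the train and test R² is the overfitting signal to keep an eye on.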
