# eXtreme Gradient Boosting - XGBoost

XGBoost is a version of GBM optimized for speed and predictive performance; it is scalable and can be integrated into different platforms.

* It can be used with R, Python, Hadoop, Scala, and Julia.
* It is scalable.
* It is fast.
* Its predictive accuracy is high.
* It has proven itself in many Kaggle competitions.

## XGBoost

In [1]:
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV,cross_val_score
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale 
from sklearn import model_selection
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import BaggingRegressor

# suppress warnings
from warnings import filterwarnings
filterwarnings('ignore')

# show all estimator parameters when a model is printed
from sklearn import set_config
set_config(print_changed_only=False)

In [2]:
import pandas as pd
hit = pd.read_csv("Hitters.csv")
df = hit.copy()
df = df.dropna()
dms = pd.get_dummies(df[['League', 'Division', 'NewLeague']])
y = df["Salary"]
X_ = df.drop(['Salary', 'League', 'Division', 'NewLeague'], axis=1).astype('float64')
X = pd.concat([X_, dms[['League_N', 'Division_W', 'NewLeague_N']]], axis=1)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.25, 
                                                    random_state=42)

In [3]:
# !pip install xgboost



In [17]:
import xgboost as xgb

With xgboost, using its own data structure instead of a pandas DataFrame or a NumPy array can give better performance.

In [18]:
# DMatrix is xgboost's own data structure
# data holds the independent variables, label holds the dependent variable
DM_train = xgb.DMatrix(data = X_train, label=y_train) # training set

In [19]:
# data holds the independent variables, label holds the dependent variable
DM_test = xgb.DMatrix(data = X_test, label=y_test) # test set

In [20]:
from xgboost import XGBRegressor

In [29]:
xgb_model = XGBRegressor()

In [30]:
xgb_model.fit(X_train, y_train)

XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=None, max_bin=None,
             max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=None, max_leaves=None,
             min_child_weight=None, missing=nan, monotone_constraints=None,
             n_estimators=100, n_jobs=None, num_parallel_tree=None,
             objective='reg:squarederror', predictor=None, ...)

## XGBoost - Prediction

In [35]:
y_pred = xgb_model.predict(X_test)

In [36]:
np.sqrt(mean_squared_error(y_test, y_pred))

355.46515176059927
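The number above is the root mean squared error (RMSE) on the test set, expressed in the same units as the target. As a small sketch of what `np.sqrt(mean_squared_error(...))` computes, here is the formula written out with NumPy on made-up values (these numbers are not taken from the Hitters data):

```python
import numpy as np

# Made-up true and predicted values, for illustration only.
y_true_demo = np.array([100.0, 200.0, 300.0])
y_pred_demo = np.array([110.0, 190.0, 330.0])

# RMSE = sqrt(mean((y - y_hat)^2))
squared_errors = (y_true_demo - y_pred_demo) ** 2   # [100., 100., 900.]
rmse_demo = float(np.sqrt(squared_errors.mean()))   # sqrt(1100 / 3)
print(round(rmse_demo, 4))
```

Because the errors are squared before averaging, a single large miss (30 here) dominates the result.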

## XGBoost - Model Tuning

In [37]:
xgb_model

XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None,
             eval_set=[(<xgboost.core.DMatrix object at 0x7fa3b2408f10>,
                        'train')],
             feature_types=None, gamma=None, gpu_id=None, grow_policy=None,
             importance_type=None, interaction_constraints=None,
             learning_rate=None, max_bin=None, max_cat_threshold=None,
             max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
             max_leaves=None, min_child_weight=None, missing=nan,
             monotone_constraints=None, n_estimators=100, n_jobs=None,
             num_parallel_tree=None, objective='reg:squarederror', ...)

* booster: we will use a tree-based booster.
* colsample_bytree: the fraction of variables (columns) sampled for each tree.
* learning_rate: the shrinkage step size, used to prevent overfitting. It takes a value between 0 and 1.
* max_depth: the maximum tree depth; a complexity parameter that helps prevent overfitting.
* n_estimators: the number of trees.

In [38]:
xgb_grid = {
    "colsample_bytree":[0.4,0.5,0.6,0.9,1],
    "n_estimators":[100,200,500,1000],
    "max_depth":[2,3,4,5,6],
    "learning_rate":[0.1,0.01,0.5]
}
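A grid search fits one model per parameter combination and per fold, so its cost grows multiplicatively with every list in the grid. A quick sketch that counts the work implied by a grid with the same values as above, assuming 10-fold cross-validation (the names `grid`, `n_candidates`, and `n_fits` are illustrative):

```python
from itertools import product

# Same candidate values as the grid above.
grid = {
    "colsample_bytree": [0.4, 0.5, 0.6, 0.9, 1],
    "n_estimators": [100, 200, 500, 1000],
    "max_depth": [2, 3, 4, 5, 6],
    "learning_rate": [0.1, 0.01, 0.5],
}

# Every combination of one value per parameter is one candidate model.
candidates = list(product(*grid.values()))
n_candidates = len(candidates)   # 5 * 4 * 5 * 3

# Each candidate is refit once per cross-validation fold.
n_folds = 10
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)
```

Adding one more value to any list multiplies the total, which is worth checking before launching a long search.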

In [39]:
# named xgb_reg so that it does not shadow the `xgb` module imported earlier
xgb_reg = XGBRegressor()

In [40]:
xgb_cv_model = GridSearchCV(xgb_reg, 
                            param_grid = xgb_grid, 
                            cv=10, 
                            n_jobs=-1, 
                            verbose=2)

In [41]:
xgb_cv_model.fit(X_train, y_train)

Fitting 10 folds for each of 300 candidates, totalling 3000 fits
[CV] END colsample_bytree=0.4, learning_rate=0.1, max_depth=2, n_estimators=100; total time=   0.0s
[CV] END colsample_bytree=0.4, learning_rate=0.1, max_depth=2, n_estimators=200; total time=   0.0s
[CV] END colsample_bytree=0.4, learning_rate=0.1, max_depth=2, n_estimators=500; total time=   0.1s
[CV] END colsample_bytree=0.4, learning_rate=0.1, max_depth=2, n_estimators=1000; total time=   0.2s
[CV] END colsample_bytree=0.4, learning_rate=0.1, max_depth=3, n_estimators=100; total time=   0.0s
[CV] END colsample_bytree=0.4, learning_rate=0.1, max_depth=3, n_estimators=200; total time=   0.0s
[CV] END colsample_bytree=0.4, learning_rate=0.1, max_depth=3, n_estimators=500; total time=   0.1s
[CV] END colsample_bytree=0.4, learning_rate=0.1, max_depth=3, n_estimators=1000; total time=   0.2s
[CV] END colsample_bytree=0.4, learning_rate=0.1, max_depth=3, n_estimators=1000; total time=   0.2s
[CV] END colsample_bytree=0.4, l

GridSearchCV(cv=10, error_score=nan,
             estimator=XGBRegressor(base_score=None, booster=None,
                                    callbacks=None, colsample_bylevel=None,
                                    colsample_bynode=None,
                                    colsample_bytree=None,
                                    early_stopping_rounds=None,
                                    enable_categorical=False, eval_metric=None,
                                    feature_types=None, gamma=None, gpu_id=None,
                                    grow_policy=None, importance_type=None,
                                    interaction_constraints=None,
                                    lea...
                                    monotone_constraints=None, n_estimators=100,
                                    n_jobs=None, num_parallel_tree=None,
                                    objective='reg:squarederror',
                                    predictor=None, ...),
             n

In [43]:
xgb_cv_model.best_params_

{'colsample_bytree': 0.5,
 'learning_rate': 0.1,
 'max_depth': 2,
 'n_estimators': 1000}

In [44]:
# build the final model with the best parameters found
xgb_tuned = XGBRegressor(colsample_bytree = 0.5,
                        learning_rate = 0.1,
                        max_depth = 2,
                        n_estimators = 1000)

In [45]:
xgb_tuned = xgb_tuned.fit(X_train, y_train)

In [47]:
y_pred = xgb_tuned.predict(X_test)

In [48]:
np.sqrt(mean_squared_error(y_test, y_pred))

357.4251457287713

Some notes on how to approach hyperparameter optimization in a more advanced way:
* Identify the parameters that matter most for the model and leave the others at their default values. Search over the values of the most important parameter first, then repeat for the remaining parameters in order of importance.
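The one-parameter-at-a-time strategy described above can be sketched as follows. This is a minimal illustration, not the notebook's method: `sequential_search` and `toy_score` are hypothetical names, and the toy objective merely stands in for a cross-validated error such as RMSE.

```python
def sequential_search(score_fn, defaults, search_space):
    """Tune parameters one at a time, in the order given by search_space.

    score_fn: callable taking a dict of parameters, returning a score to minimise
    defaults: dict with the starting value of every parameter
    search_space: dict of parameter name -> candidate values,
                  ordered from most to least important (an assumption of the caller)
    """
    best = dict(defaults)
    for name, values in search_space.items():
        # Vary only this parameter; keep the others at their current best values.
        scores = {value: score_fn({**best, name: value}) for value in values}
        best[name] = min(scores, key=scores.get)
    return best


# Toy objective (hypothetical): smallest at learning_rate=0.1, max_depth=3.
def toy_score(params):
    return (params["learning_rate"] - 0.1) ** 2 + (params["max_depth"] - 3) ** 2


defaults = {"learning_rate": 0.3, "max_depth": 6}
search_space = {
    "learning_rate": [0.01, 0.1, 0.5],
    "max_depth": [2, 3, 4, 5, 6],
}

best_params = sequential_search(toy_score, defaults, search_space)
print(best_params)
```

This evaluates 3 + 5 = 8 settings instead of the 3 × 5 = 15 a full grid would need, at the price of possibly missing interactions between parameters. In practice `score_fn` could wrap a cross-validation call on the training data.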