
# XGBoost model exploration
### Batchimeg Battur

##### Import the libraries:

In [None]:
import xgboost as xgb
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import warnings
import io
from google.colab import files
warnings.filterwarnings("ignore")
pd.set_option("display.max_columns", 50)

##### Dataset:

We import the dataset and create dummy variables for the categorical variables. Because we are going to predict the species of the penguins, the 'species' column is assigned as the target variable. Finally, we scale the predictor variables to get the dataset ready for modeling.

In [None]:
# Load the data and drop rows with missing values
dataset = pd.read_csv('/penguins_size.csv')
dataset = dataset.dropna()
df = dataset.copy()
target = 'species'
encode = ['sex', 'island']

# One-hot encode the categorical predictors
for col in encode:
    dummy = pd.get_dummies(df[col], prefix=col)
    df = pd.concat([df, dummy], axis=1)
    del df[col]

# Map the species names to integer class labels
target_mapper = {'Adelie': 0, 'Gentoo': 1, 'Chinstrap': 2}
def target_encode(val):
    return target_mapper[val]

df['species'] = df['species'].apply(target_encode)
X = df.drop('species', axis=1)
y = df['species']

# Standardize the predictors (zero mean, unit variance)
from sklearn import preprocessing
X = preprocessing.scale(X)
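
To make the two preprocessing steps concrete, here is a minimal standard-library sketch of what dummy encoding and scaling compute. The rows are made-up toy values and the helpers `one_hot` and `standardize` are hypothetical names used only for illustration. The notebook itself relies on `pd.get_dummies` and `preprocessing.scale`.

```python
from statistics import mean, pstdev

# Toy rows with made-up values; column names mirror the penguin dataset.
rows = [
    {"sex": "MALE", "island": "Torgersen", "body_mass_g": 3750.0},
    {"sex": "FEMALE", "island": "Biscoe", "body_mass_g": 3250.0},
    {"sex": "MALE", "island": "Dream", "body_mass_g": 4250.0},
]

def one_hot(rows, column):
    """Replace `column` with one 0/1 indicator per distinct value."""
    levels = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for level in levels:
            new_row[f"{column}_{level}"] = 1 if row[column] == level else 0
        encoded.append(new_row)
    return encoded

def standardize(values):
    """Centre to mean 0 and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

encoded = rows
for col in ["sex", "island"]:
    encoded = one_hot(encoded, col)

scaled_mass = standardize([row["body_mass_g"] for row in encoded])
print(sorted(encoded[0]))
print([round(v, 3) for v in scaled_mass])
```

Each categorical column becomes one indicator column per level, and each numeric column ends up with mean 0 and standard deviation 1.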

##### Split the dataset to train and test sets:
The dataset is split in an 80-20 ratio: 80% of the data is used to train the model and the remaining 20% is used for testing.

In [None]:
# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

##### **XGBoost model:**
To explore the parameters, we compare five models, each of which changes at most one parameter from its default, as shown below:

In [None]:
model1 = xgb.XGBClassifier()
model2 = xgb.XGBClassifier(max_depth=0)
model3 = xgb.XGBClassifier(learning_rate=0.0002)
model4 = xgb.XGBClassifier(min_child_weight=24)
model5 = xgb.XGBClassifier(gamma=50)

###### **Model 1: default parameters**
Model 1 uses all the default parameters of the XGBoost model. The printed representation below lists only parameters that were set explicitly, so none appear for this model.

In [None]:
print(model1)

XGBClassifier()


###### **Model 2: max_depth**
The default max_depth for model 2 has been changed to 0 to see how it affects the prediction.  

- max_depth [default=6]: Maximum depth of a tree. Increasing this value will make the model more complex and more likely to overfit. 0 indicates no limit on depth. Beware that XGBoost aggressively consumes memory when training a deep tree. The exact tree method requires a non-zero value. Range: [0,∞]

The parameters set for model 2 are shown below:

In [None]:
print(model2)

XGBClassifier(max_depth=0)


###### **Model 3: learning_rate** 
The default learning_rate for model 3 has been changed to 0.0002 to see how it affects the prediction.  

- learning_rate [default=0.3, alias: eta]: Step size shrinkage used in each update to prevent overfitting. After each boosting step, we can directly get the weights of new features, and eta shrinks the feature weights to make the boosting process more conservative. Range: [0,1]
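
To see why a learning rate of 0.0002 is so conservative, the sketch below tracks how much of an initial error remains after repeated shrunken updates. It is an idealised illustration, not XGBoost's actual algorithm: the helper `remaining_error` is hypothetical, each round is assumed to predict the residual perfectly, and 100 rounds is assumed to match the wrapper's usual default number of trees.

```python
def remaining_error(eta, n_rounds, start_error=1.0):
    """Error left after n_rounds when each round corrects a fraction eta of it."""
    error = start_error
    for _ in range(n_rounds):
        # An idealised weak learner predicts the full residual;
        # shrinkage applies only eta times that correction.
        error -= eta * error
    return error

for eta in (0.3, 0.0002):
    left = remaining_error(eta, 100)
    print(f"eta={eta}: remaining error after 100 rounds = {left:.4f}")
```

With eta=0.3 essentially no error remains, while with eta=0.0002 about 98% of it is still there after 100 rounds. Class predictions depend only on which class has the highest score, so even such small updates can still rank the classes correctly.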

The parameters set for model 3 are shown below:

In [None]:
print(model3)

XGBClassifier(learning_rate=0.0002)


###### **Model 4: min_child_weight**
The default min_child_weight for model 4 has been changed to 24 to see how it affects the prediction.  

- min_child_weight [default=1]: Minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with a sum of instance weight less than min_child_weight, the building process gives up further partitioning. In a linear regression task, this simply corresponds to the minimum number of instances needed in each node. The larger min_child_weight is, the more conservative the algorithm will be. Range: [0,∞]
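
The "sum of instance weight (hessian)" can be illustrated with a small sketch. It assumes the binary logistic loss, whose per-instance hessian is p(1 - p), as a simplified stand-in for the multiclass objective used in this notebook. The helper names `hessian_sum_logistic` and `child_is_allowed` are hypothetical.

```python
def hessian_sum_logistic(probabilities):
    """Sum of p * (1 - p), the per-instance hessian of the logistic loss."""
    return sum(p * (1.0 - p) for p in probabilities)

def child_is_allowed(hessian_sum, min_child_weight):
    """A child node is kept only if its hessian sum reaches the threshold."""
    return hessian_sum >= min_child_weight

# Squared error: every instance has hessian 1, so the sum is just a count.
print(child_is_allowed(hessian_sum=24 * 1.0, min_child_weight=24))

# Logistic loss at p = 0.5: each instance contributes only 0.25,
# so 24 instances are not enough but 96 are.
print(child_is_allowed(hessian_sum_logistic([0.5] * 24), 24))
print(child_is_allowed(hessian_sum_logistic([0.5] * 96), 24))
```

Under this simplification, a threshold of 24 can require far more than 24 rows per child in a classification task, which sharply limits how many splits a small dataset can support.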

The parameters set for model 4 are shown below:

In [None]:
print(model4)

XGBClassifier(min_child_weight=24)


###### **Model 5: gamma**
The default gamma for model 5 has been changed to 50 to see how it affects the prediction.  

- gamma [default=0, alias: min_split_loss]: Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger gamma is, the more conservative the algorithm will be. Range: [0,∞]
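
The effect of gamma can be sketched with the split gain formula from the XGBoost "Introduction to Boosted Trees" tutorial, where G and H are the sums of gradients and hessians in each child. The gradient and hessian values below are made up for illustration, `split_gain` is a hypothetical helper, and `reg_lambda=1.0` is assumed to be the L2 regularisation default.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Regularised loss reduction of a candidate split, minus the gamma penalty."""
    def score(g, h):
        return g * g / (h + reg_lambda)

    parent = score(g_left + g_right, h_left + h_right)
    children = score(g_left, h_left) + score(g_right, h_right)
    return 0.5 * (children - parent) - gamma

# Made-up statistics for one candidate split.
stats = dict(g_left=-10.0, h_left=5.0, g_right=10.0, h_right=5.0)
for gamma in (0, 50):
    gain = split_gain(**stats, gamma=gamma)
    print(f"gamma={gamma}: gain={gain:.2f}, split kept: {gain > 0}")
```

The same candidate split is worthwhile with gamma=0 but is rejected with gamma=50, because the loss reduction no longer exceeds the required minimum.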

The parameters set for model 5 are shown below:

In [None]:
print(model5)

XGBClassifier(gamma=50)


All models are fitted to the training set:

In [None]:
train_model1 = model1.fit(X_train, y_train)
train_model2 = model2.fit(X_train, y_train)
train_model3 = model3.fit(X_train, y_train)
train_model4 = model4.fit(X_train, y_train)
train_model5 = model5.fit(X_train, y_train)

The species of the penguins in the test set are then predicted, and the accuracy of each model is computed as shown below:

In [None]:
predicted_y1 = train_model1.predict(X_test)
predicted_y2 = train_model2.predict(X_test)
predicted_y3 = train_model3.predict(X_test)
predicted_y4 = train_model4.predict(X_test)
predicted_y5 = train_model5.predict(X_test)

accuracy1 = accuracy_score(y_test,predicted_y1)
accuracy2 = accuracy_score(y_test,predicted_y2)
accuracy3 = accuracy_score(y_test,predicted_y3)
accuracy4 = accuracy_score(y_test,predicted_y4)
accuracy5 = accuracy_score(y_test,predicted_y5)

print("Accuracy for model 1: %.2f%%" %(accuracy1*100))
print("Accuracy for model 2: %.2f%%" %(accuracy2*100))
print("Accuracy for model 3: %.2f%%" %(accuracy3*100))
print("Accuracy for model 4: %.2f%%" %(accuracy4*100))
print("Accuracy for model 5: %.2f%%" %(accuracy5*100))

Accuracy for model 1: 100.00%
Accuracy for model 2: 41.79%
Accuracy for model 3: 98.51%
Accuracy for model 4: 97.01%
Accuracy for model 5: 74.63%
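
The repeated fit, predict and score steps above could also be written as one loop. The sketch below shows the idea with the standard library only: `accuracy` mirrors what `accuracy_score` computes, while `MajorityClassifier`, `evaluate` and the toy data are hypothetical stand-ins. In the notebook, the dictionary would hold `model1` to `model5` instead.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

class MajorityClassifier:
    """Hypothetical stand-in: always predicts the most common training label."""

    def fit(self, X, y):
        self.label_ = max(set(y), key=list(y).count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]

def evaluate(models, X_train, y_train, X_test, y_test):
    """Fit each named model and return its accuracy on the test set."""
    return {
        name: accuracy(y_test, model.fit(X_train, y_train).predict(X_test))
        for name, model in models.items()
    }

# Made-up toy data, named differently from the notebook's variables.
toy_X_train, toy_y_train = [[0], [1], [2], [3]], [0, 0, 1, 2]
toy_X_test, toy_y_test = [[4], [5], [6], [7]], [0, 1, 0, 2]

scores = evaluate({"baseline": MajorityClassifier()},
                  toy_X_train, toy_y_train, toy_X_test, toy_y_test)
for name, score in scores.items():
    print("Accuracy for %s: %.2f%%" % (name, score * 100))
```

Any object with `fit` and `predict` methods fits this pattern, so adding a sixth model would only mean adding one dictionary entry.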


###### **Model 1: default parameters**
Model 1 uses the default parameters of the XGBoost model, which according to the parameter documentation are max_depth=6, learning_rate=0.3, min_child_weight=1 and gamma=0. Its accuracy on the test set is 100%.

###### **Model 2: max_depth**
For model 2, max_depth was changed from its default to 0 to see how it affects the prediction. The accuracy fell to 41.79%, by far the lowest of the five models. Hence, setting max_depth to 0 makes the model perform badly in this experiment. Because the documentation notes that the exact tree method requires a non-zero value, intermediate depths would need to be tested before concluding how accuracy changes as max_depth is decreased.

###### **Model 3: learning_rate**
For model 3, learning_rate was changed from its default to 0.0002 to see how it affects the prediction. As the learning rate decreased, the accuracy decreased slightly to 98.51%. A drop in accuracy was observed only once the learning rate reached values at the third decimal place or smaller.

###### **Model 4: min_child_weight**
For model 4, min_child_weight was changed from its default to 24 to see how it affects the prediction. As the results show, the accuracy dropped to 97.01% as min_child_weight increased.

###### **Model 5: gamma**
For model 5, gamma was changed from its default to 50 to see how it affects the prediction. The accuracy decreased to 74.63% as gamma was increased from 0 to 50.

#### **Comparison:**
In the previous practice activity, the default AdaBoost model reached an accuracy of 93%, which improved to 99% when the base estimator was changed from a decision tree to SVC.

Even with that improvement, the default XGBoost model outperforms AdaBoost, as it reaches 100% accuracy.

Some links for further reading:

1. https://www.datacamp.com/community/tutorials/xgboost-in-python
2. https://xgboost.readthedocs.io/en/stable/parameter.html
3. https://blog.paperspace.com/adaboost-optimizer/