
**Gradient Boosting, AdaBoost, and XGBoost** are all ensemble learning techniques that build models in stages, with each new model being trained to correct the errors made by the previous ones. However, they differ in their approach and handling of various aspects:

**Gradient Boosting:**

**Approach:** Builds an additive model in a forward stage-wise fashion. It allows for the optimization of arbitrary differentiable loss functions.

**Weak Learners:** Typically uses decision trees as the weak learners.

**Optimization:** Each tree is fit on the residual errors made by the previous ones. The idea is to minimize the loss function (like mean squared error for regression).

**Regularization:** It incorporates several regularization techniques, like shrinkage (learning rate) and subsampling, to prevent overfitting.

**Flexibility:** Can be used for both regression and classification problems.

**AdaBoost (Adaptive Boosting):**

**Approach:** Focuses on classification problems and aims to convert a set of weak learners into a strong one.

**Weak Learners:** Typically uses short decision trees (stumps) as the weak learners.

**Optimization:** Adjusts the weights of the training instances based on the errors of the previous model; more weight is given to misclassified instances.

**Sequential Learning:** Each weak learner is forced to concentrate on the examples that were missed by the previous ones.

**Performance:** Tends to improve rapidly with the addition of new models at the beginning but then plateaus.
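For reference, the reweighting described above is commonly written as follows for binary labels $y_i \in \{-1, +1\}$. This is the standard discrete AdaBoost formulation; the symbols $w_i$ (instance weight), $\epsilon_m$ (weighted error of learner $h_m$), and $\alpha_m$ (learner weight) are notation introduced here, not taken from the notebook.

$$
\epsilon_m = \frac{\sum_i w_i \,\mathbb{1}[h_m(x_i) \neq y_i]}{\sum_i w_i},
\qquad
\alpha_m = \tfrac{1}{2} \ln\frac{1 - \epsilon_m}{\epsilon_m},
\qquad
w_i \leftarrow w_i \exp\bigl(-\alpha_m \, y_i \, h_m(x_i)\bigr)
$$

After normalizing the weights, misclassified instances (where $y_i h_m(x_i) = -1$) carry more weight in the next round. The final prediction is $H(x) = \operatorname{sign}\bigl(\sum_m \alpha_m h_m(x)\bigr)$.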

**XGBoost (eXtreme Gradient Boosting):**

**Approach:** An optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable.

**Improvements:** Implements machine learning algorithms under the Gradient Boosting framework but with enhancements like tree pruning, regularization, and handling of missing values.

**Scalability and Speed:** Provides parallel tree boosting (also known as GBDT or GBM) that solves many data science problems in a fast and accurate way.

**Regularization:** Includes L1 and L2 regularization, which improves the model's ability to generalize.

**Handling of Missing Data:** Has a built-in routine to handle missing values.

**Flexibility:** Allows users to define custom optimization objectives and evaluation criteria.
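To make the regularization point concrete, the objective that XGBoost minimizes is usually written as a training loss plus a complexity penalty on each tree $f_k$. The notation below ($T$ for the number of leaves, $w_j$ for the leaf weights) is introduced here for illustration; $\gamma$, $\lambda$, and $\alpha$ correspond to the `gamma`, `reg_lambda`, and `reg_alpha` parameters.

$$
\text{Obj} = \sum_i l(y_i, \hat{y}_i) + \sum_k \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2 + \alpha \sum_{j=1}^{T} |w_j|
$$

Larger values of $\gamma$ discourage extra leaves, while $\lambda$ (L2) and $\alpha$ (L1) shrink the leaf weights.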

**Comparison:**

**Accuracy:** XGBoost often achieves better accuracy than traditional Gradient Boosting and AdaBoost, though the margin depends on the dataset and on tuning.

**Speed and Efficiency:** XGBoost is generally faster and more efficient, as it utilizes both hardware optimization and algorithmic enhancements.

**Flexibility:** XGBoost offers more hyperparameter tuning options than AdaBoost and can handle a wider range of data types and structures.

**Regularization:** XGBoost includes explicit L1 and L2 penalty terms in its objective function, which helps to prevent overfitting. Traditional Gradient Boosting has no such penalty terms and relies on shrinkage and subsampling instead; AdaBoost relies mainly on the learning rate and the simplicity of its weak learners.

**Scalability:** XGBoost can handle larger datasets and can be run on distributed systems, unlike typical single-machine implementations of traditional Gradient Boosting and AdaBoost.

In practice, the choice between these methods will depend on the specific problem, the nature of the data, the required model performance, and the computational resources available. XGBoost is often a go-to due to its performance and speed but might be more complex to tune due to the larger number of hyperparameters. Gradient Boosting is a solid choice for a variety of problems, particularly when data is well-behaved and not too large. AdaBoost can be more intuitive and easier to implement but might not provide performance as high as the others in more complex scenarios.

Short decision trees, often referred to as "stumps," are decision trees with a very limited depth, typically just one level deep. A stump makes a decision based on a single input feature. Here's what characterizes them:

**Structure:** A stump consists of one root node (the decision point) and two leaf nodes (the outcomes). It makes a decision based on a single attribute and its threshold. For example, if you were deciding whether to play tennis based on the weather, a stump might make that decision based on just the attribute "Is it raining?".

**Simplicity:** Because they are so simple, stumps are weak learners. They typically do only slightly better than random guessing and are generally not strong predictive models on their own.

**Usage in AdaBoost:** In the AdaBoost algorithm, multiple stumps are combined to create a more accurate and robust model. Each stump is created focusing on the errors of the previous ones, iteratively improving the model's accuracy.

**Advantages:** Stumps are computationally cheap to build and easy to understand. In ensemble methods, they keep each individual learner simple, which helps limit overfitting (although a large ensemble can still overfit).

**Limitations:** On their own, they are very weak and can't capture complex patterns. They are sensitive to noise and can be significantly impacted by small changes in the training data.

In the context of AdaBoost and other boosting methods, the simplicity of stumps is actually a benefit, as it allows the ensemble method to sequentially focus on correcting errors, leading to a model that combines the strengths of many weak learners into a strong predictive model.
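The idea of combining stumps can be sketched in a few lines of NumPy. This is a minimal illustration of discrete AdaBoost with labels in {-1, +1}, not the scikit-learn implementation; the helper names (`fit_stump`, `stump_predict`, `adaboost_fit`, `adaboost_predict`) and the toy data are assumptions made for this example.

```python
import numpy as np

def fit_stump(X, y, w):
    """Return (feature, threshold, polarity, weighted_error) of the best stump.

    A stump predicts `polarity` where X[:, feature] <= threshold, else -polarity.
    """
    best = (0, 0.0, 1, np.inf)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for polarity in (1, -1):
                pred = np.where(X[:, j] <= t, polarity, -polarity)
                err = w[pred != y].sum()          # weighted misclassification
                if err < best[3]:
                    best = (j, t, polarity, err)
    return best

def stump_predict(X, stump):
    j, t, polarity, _ = stump
    return np.where(X[:, j] <= t, polarity, -polarity)

def adaboost_fit(X, y, n_rounds=3):
    w = np.full(len(y), 1.0 / len(y))             # start with uniform weights
    ensemble = []
    for _ in range(n_rounds):
        stump = fit_stump(X, y, w)
        err = np.clip(stump[3], 1e-10, 1 - 1e-10) # avoid log(0) / division by 0
        alpha = 0.5 * np.log((1 - err) / err)     # weight of this stump
        pred = stump_predict(X, stump)
        w = w * np.exp(-alpha * y * pred)         # up-weight the mistakes
        w /= w.sum()
        ensemble.append((alpha, stump))
    return ensemble

def adaboost_predict(X, ensemble):
    score = sum(alpha * stump_predict(X, s) for alpha, s in ensemble)
    return np.where(score >= 0, 1, -1)

# Toy 1-D data: the positive class sits in the middle, so no single
# threshold can separate it from the negatives on both sides.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([-1, -1, 1, 1, -1, -1])

single = adaboost_fit(X, y, n_rounds=1)
boosted = adaboost_fit(X, y, n_rounds=3)
acc_single = (adaboost_predict(X, single) == y).mean()
acc_boosted = (adaboost_predict(X, boosted) == y).mean()
print(acc_single, acc_boosted)
```

On this toy data a single stump cannot classify all six points, while the weighted vote of three stumps does, which is the "many weak learners into a strong one" effect in miniature.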

In gradient boosting, the term "gradient" refers to the gradient of the loss function with respect to the model's predictions. Understanding its use and how it works is key to understanding how gradient boosting algorithms, like Gradient Boosting Machines (GBM) and XGBoost, build predictive models.

**Use of Gradient in Gradient Boosting**

The primary use of the gradient in gradient boosting is to minimize the loss function, which measures the difference between the actual target values and the predicted values. The gradient essentially indicates the direction and rate of change of the loss function. By iteratively fitting new models to the negative gradient of the loss function, gradient boosting aims to reduce the overall prediction error.
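As a short worked example of why "fitting the negative gradient" and "fitting the residuals" often mean the same thing, consider the squared-error loss for one observation with target $y$ and current prediction $F$ (the factor $\tfrac{1}{2}$ is a common convention assumed here):

$$
L(y, F) = \tfrac{1}{2}(y - F)^2
\quad\Longrightarrow\quad
-\frac{\partial L}{\partial F} = y - F
$$

The negative gradient is exactly the residual. For binary classification with log loss, where $F$ is the log-odds and $p = 1 / (1 + e^{-F})$, the same calculation gives

$$
L(y, F) = -\bigl[y \ln p + (1 - y) \ln(1 - p)\bigr]
\quad\Longrightarrow\quad
-\frac{\partial L}{\partial F} = y - p
$$

so each new tree is fit to the gap between the label and the currently predicted probability.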

**How Gradient Boosting Works**

Here's a step-by-step explanation of how gradient boosting works:

1. **Initialization:** Start with an initial model $F_0$, which could be as simple as predicting the mean of the target values for regression problems or a baseline probability for classification problems.
2. **Compute Residuals:** Compute the residuals (errors) between the actual target values and the predictions made by the current model. These residuals indicate how much the current model's predictions need to be corrected.
3. **Fit a New Model:** Fit a new model $h_m$ to the residuals. This new model is typically a weak learner, such as a shallow decision tree (a stump).
4. **Update the Model:** Update the existing model by adding the new model, scaled by a learning rate $\nu$, to the current model. The learning rate controls the contribution of each new model to prevent overfitting. The updated model is $F_m = F_{m-1} + \nu \cdot h_m$.
5. **Iterate:** Repeat the process for a predetermined number of iterations or until the model's performance stops improving.
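The steps above can be sketched for a one-dimensional regression problem with squared-error loss. This is a minimal NumPy illustration rather than how scikit-learn or XGBoost implement boosting; the function names (`fit_regression_stump`, `gradient_boost`), the synthetic sine data, and the choice of 50 rounds with $\nu = 0.1$ are assumptions made for the example.

```python
import numpy as np

def fit_regression_stump(x, r):
    """Best single-split stump for 1-D inputs x and targets r (squared error)."""
    best = None
    for t in np.unique(x)[:-1]:                    # candidate thresholds
        left, right = r[x <= t], r[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, left_value, right_value = best
    return lambda z: np.where(z <= t, left_value, right_value)

def gradient_boost(x, y, n_rounds=50, nu=0.1):
    f0 = y.mean()                                  # step 1: F_0 predicts the mean
    pred = np.full_like(y, f0, dtype=float)
    learners, losses = [], []
    for _ in range(n_rounds):                      # step 5: iterate
        residuals = y - pred                       # step 2: negative gradient of 1/2 (y - F)^2
        h = fit_regression_stump(x, residuals)     # step 3: fit h_m to the residuals
        pred = pred + nu * h(x)                    # step 4: F_m = F_{m-1} + nu * h_m
        learners.append(h)
        losses.append(float(np.mean((y - pred) ** 2)))
    predict = lambda z: f0 + nu * sum(h(z) for h in learners)
    return predict, losses

rng = np.random.default_rng(0)
x = np.linspace(0, 6, 80)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)

predict, losses = gradient_boost(x, y)
print(f"training MSE after round 1: {losses[0]:.4f}, after round 50: {losses[-1]:.4f}")
```

Because each stump is a least-squares fit to the current residuals and $0 < \nu < 2$, the training error cannot increase from one round to the next; a smaller $\nu$ slows that descent, which is the shrinkage effect mentioned earlier.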

In [2]:
import pandas as pd
df = pd.read_csv("/content/default of credit card clients.csv")
#read 1st index row as header
df.head()

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
0,1,20000,2,2,1,24,2,2,-1,-1,...,0,0,0,0,689,0,0,0,0,1
1,2,120000,2,2,2,26,-1,2,0,0,...,3272,3455,3261,0,1000,1000,1000,0,2000,1
2,3,90000,2,2,2,34,0,0,0,0,...,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
3,4,50000,2,2,1,37,0,0,0,0,...,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
4,5,50000,1,2,1,57,-1,0,-1,0,...,20940,19146,19131,2000,36681,10000,9000,689,679,0


In [3]:
#drop the ID column
df = df.drop("ID", axis=1)
df.head()

Unnamed: 0,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default payment next month
0,20000,2,2,1,24,2,2,-1,-1,-2,...,0,0,0,0,689,0,0,0,0,1
1,120000,2,2,2,26,-1,2,0,0,0,...,3272,3455,3261,0,1000,1000,1000,0,2000,1
2,90000,2,2,2,34,0,0,0,0,0,...,14331,14948,15549,1518,1500,1000,1000,1000,5000,0
3,50000,2,2,1,37,0,0,0,0,0,...,28314,28959,29547,2000,2019,1200,1100,1069,1000,0
4,50000,1,2,1,57,-1,0,-1,0,0,...,20940,19146,19131,2000,36681,10000,9000,689,679,0


In [4]:
#renaming of dependent variable
df.rename({"default payment next month": "default"}, axis=1, inplace=True)
df.head(2)

Unnamed: 0,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default
0,20000,2,2,1,24,2,2,-1,-1,-2,...,0,0,0,0,689,0,0,0,0,1
1,120000,2,2,2,26,-1,2,0,0,0,...,3272,3455,3261,0,1000,1000,1000,0,2000,1


In [5]:
#check dtypes and non-null counts to see whether any values are missing
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 24 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   LIMIT_BAL  30000 non-null  int64
 1   SEX        30000 non-null  int64
 2   EDUCATION  30000 non-null  int64
 3   MARRIAGE   30000 non-null  int64
 4   AGE        30000 non-null  int64
 5   PAY_0      30000 non-null  int64
 6   PAY_2      30000 non-null  int64
 7   PAY_3      30000 non-null  int64
 8   PAY_4      30000 non-null  int64
 9   PAY_5      30000 non-null  int64
 10  PAY_6      30000 non-null  int64
 11  BILL_AMT1  30000 non-null  int64
 12  BILL_AMT2  30000 non-null  int64
 13  BILL_AMT3  30000 non-null  int64
 14  BILL_AMT4  30000 non-null  int64
 15  BILL_AMT5  30000 non-null  int64
 16  BILL_AMT6  30000 non-null  int64
 17  PAY_AMT1   30000 non-null  int64
 18  PAY_AMT2   30000 non-null  int64
 19  PAY_AMT3   30000 non-null  int64
 20  PAY_AMT4   30000 non-null  int64
 21  PAY_AMT5   3

We see that every column is `int64`. This is good, since it tells us that no column mixes letters and numbers. In other words, there are no **NA** values, or other character-based placeholders for missing data, in **df**.

That said, we should still make sure each column contains acceptable values. The list below describes which values are allowed in each column, based on the column descriptions on the **[Credit Card Default](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients)** webpage.

- **LIMIT_BAL**, The amount of available credit **Integer**
- **SEX**, **Category**
  - 1 = male
  - 2 = female
- **EDUCATION**, **Category**
  - 1 = graduate school
  - 2 = university
  - 3 = high school
  - 4 = others
- **MARRIAGE**, **Category**
  - 1 = Married
  - 2 = Single
  - 3 = Other
- **AGE**, **Integer**
- **PAY_**, When the last 6 bills were paid **Category**
  - -1 = Paid on time
  - 1 = Payment delayed by 1 month
  - 2 = Payment delayed by 2 months
  - ...
  - 8 = Payment delayed by 8 months
  - 9 = Payment delayed by 9 or more months
- **BILL_AMT**, What the last 6 bills were **Integer**
- **PAY_AMT**, How much the last payments were **Integer**
- **DEFAULT**, Whether or not a person defaulted on the next payment **CATEGORY**
  - 0 = Did not default
  - 1 = Defaulted
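
One way to check the categorical columns against the list above is to count the rows whose codes fall outside the documented sets. The sketch below uses a tiny stand-in frame with made-up rows; the helper name `count_unexpected` is an assumption for this example, and in the notebook the same function could be applied to `df`.

```python
import pandas as pd

# Allowed codes taken from the list above
allowed = {
    "SEX": {1, 2},
    "EDUCATION": {1, 2, 3, 4},
    "MARRIAGE": {1, 2, 3},
}

def count_unexpected(frame, allowed_values):
    """Return {column: number of rows whose value is outside the allowed set}."""
    return {
        col: int((~frame[col].isin(list(values))).sum())
        for col, values in allowed_values.items()
    }

# Tiny stand-in frame (hypothetical rows); in the notebook this would be `df`
toy = pd.DataFrame({
    "SEX": [1, 2, 2, 1],
    "EDUCATION": [1, 5, 0, 2],
    "MARRIAGE": [1, 2, 0, 3],
})

unexpected = count_unexpected(toy, allowed)
print(unexpected)
```

Any non-zero count points to codes that the documentation does not describe and that need a decision: drop, recode, or keep.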

In [6]:
df.EDUCATION.value_counts()

| EDUCATION | count |
|---|---|
| 2 | 14030 |
| 1 | 10585 |
| 3 | 4917 |
| 5 | 280 |
| 4 | 123 |
| 6 | 51 |
| 0 | 14 |


In [7]:
df.MARRIAGE.value_counts()

| MARRIAGE | count |
|---|---|
| 2 | 15964 |
| 1 | 13659 |
| 3 | 323 |
| 0 | 54 |


In [8]:
len(df[(df['EDUCATION'] == 0) | (df['MARRIAGE'] == 0)])

68

In [9]:
df.default.value_counts()
#imbalanced data

| default | count |
|---|---|
| 0 | 23364 |
| 1 | 6636 |
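
To put the imbalance into numbers, the counts above can be turned into a positive-class share, the accuracy of a trivial "never defaults" classifier, and the negative-to-positive ratio. The counts are copied from the output above (before any rows are filtered); the variable names are assumptions for this example. The ratio is the value XGBoost's documentation suggests as a typical starting point for `scale_pos_weight`.

```python
# Class counts copied from the value_counts() output above
n_negative = 23364   # default == 0
n_positive = 6636    # default == 1
total = n_negative + n_positive

positive_share = n_positive / total
majority_baseline_accuracy = n_negative / total   # always predict "no default"
neg_pos_ratio = n_negative / n_positive

print(f"positive share: {positive_share:.3f}")
print(f"majority-class accuracy: {majority_baseline_accuracy:.3f}")
print(f"negative/positive ratio: {neg_pos_ratio:.2f}")
```

Since predicting "no default" for everyone already scores about 78% accuracy, accuracy alone says little here; the per-class precision and recall in the classification reports are more informative.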


In [10]:
df = df[(df['EDUCATION'] != 0) & (df['MARRIAGE'] != 0)]
len(df)

29932

In [11]:
df = df[(df['EDUCATION'] != 5) & (df['EDUCATION'] != 6)]
len(df)

29601

In [12]:
df.EDUCATION.value_counts()

| EDUCATION | count |
|---|---|
| 2 | 14024 |
| 1 | 10581 |
| 3 | 4873 |
| 4 | 123 |


In [14]:
df.MARRIAGE.value_counts()

| MARRIAGE | count |
|---|---|
| 2 | 15806 |
| 1 | 13477 |
| 3 | 318 |


In [15]:
x = df.drop("default", axis=1).copy()
y = df["default"].copy()

In [16]:
X_encoded = pd.get_dummies(x, columns=['SEX',
                                       'EDUCATION',
                                       'MARRIAGE',
                                       'PAY_0',
                                       'PAY_2',
                                       'PAY_3',
                                       'PAY_4',
                                       'PAY_5',
                                       'PAY_6'])
X_encoded.shape

(29601, 87)

In [18]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X_encoded,y,random_state=42,stratify=y)

In [19]:
from sklearn.ensemble import GradientBoostingClassifier

clf_gbm = GradientBoostingClassifier(random_state=42)

In [21]:
#dictionary of the hyperparameters to be tuned
params = {
    'max_depth': [2,3,5,10],
    'min_samples_leaf': [5,10,20,50],
    'n_estimators': [10,15],
    'learning_rate': [0.1,0.2,0.3],
    'subsample': [0.1,0.2,0.3]
}

In [None]:
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(estimator=clf_gbm ,
                           param_grid=params,
                           cv = 4,
                           scoring="accuracy")

grid_search.fit(x_train, y_train)
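
Before launching a grid search it helps to estimate how many models it will train. The sketch below repeats the `params` grid from above and multiplies the number of candidate combinations by the number of folds; the variable names `cv_folds`, `n_candidates`, and `n_fits` are assumptions for this example.

```python
from math import prod

# Same grid as the `params` dictionary above
params = {
    'max_depth': [2, 3, 5, 10],
    'min_samples_leaf': [5, 10, 20, 50],
    'n_estimators': [10, 15],
    'learning_rate': [0.1, 0.2, 0.3],
    'subsample': [0.1, 0.2, 0.3],
}
cv_folds = 4

n_candidates = prod(len(values) for values in params.values())
n_fits = n_candidates * cv_folds   # excludes the final refit on the full training set
print(n_candidates, n_fits)
```

With 288 combinations and 4 folds this search trains 1152 models, which is why the later rounds narrow the grid around the best values instead of widening it.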

In [None]:
clf_gbm = grid_search.best_estimator_
clf_gbm

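Best estimator:

- `learning_rate`: 0.2
- `min_samples_leaf`: 5
- `n_estimators`: 15
- `random_state`: 42
- `subsample`: 0.2
- `max_depth`: not displayed (presumably the scikit-learn default of 3, since the grid contains no `None`)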

In [None]:
clf_gbm2 = GradientBoostingClassifier(random_state=42)

In [None]:
#dictionary of the hyperparameters to be tuned
params = {
    'min_samples_leaf': range(5,11),
    'n_estimators': range(15,21),
    'learning_rate': [0.2],
    'subsample': [0.2]
}

In [None]:
grid_search = GridSearchCV(estimator=clf_gbm2 ,
                           param_grid=params,
                           cv = 4,
                           scoring="accuracy")

In [None]:
grid_search.fit(x_train, y_train)

In [None]:
clf_gbm = grid_search.best_estimator_
clf_gbm

Best estimator:

- `learning_rate`: 0.2
- `min_samples_leaf`: 5
- `n_estimators`: 15
- `random_state`: 42
- `subsample`: 0.2

In [None]:
#retrain the model using best HPs
clf_gbm_final = GradientBoostingClassifier(learning_rate=0.2, min_samples_leaf=5, \
                                           n_estimators = 15, subsample=0.2, random_state=42)

clf_gbm_final.fit(x_train,y_train)

In [None]:
y_pred = clf_gbm_final.predict(x_test)
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

In [None]:
y_pred_train = clf_gbm_final.predict(x_train)
print(classification_report(y_train, y_pred_train))

**Adaptive Boosting (Adaboost)**

In [None]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

#re-create the same stratified split (same random_state), so results are comparable with Gradient Boosting
x_train, x_test, y_train, y_test = train_test_split(X_encoded,y,random_state=42,stratify=y)

In [None]:
# Initialize the AdaBoost classifier with a DecisionTreeClassifier as the base estimator

ada_clf = AdaBoostClassifier(estimator = DecisionTreeClassifier(), random_state=42)

In [None]:
param_grid = {
    'n_estimators': [10, 50, 100, 200],
    'learning_rate': [0.01, 0.1, 1, 10],
    'estimator__max_depth': [1, 2, 3, 4]
}

In [None]:
# Setup grid search with cross-validation
grid_search = GridSearchCV(estimator=ada_clf, param_grid=param_grid, cv=3, \
                           verbose=1, scoring='accuracy')

In [None]:
# Fit grid search
grid_search.fit(x_train, y_train)

In [None]:
# Best parameters and best score
print("Best parameters:", grid_search.best_params_)
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))


In [None]:
# Test set evaluation
best_model = grid_search.best_estimator_
y_pred = best_model.predict(x_test)
print(classification_report(y_test, y_pred))

In [None]:
y_predtrain = best_model.predict(x_train)
print(classification_report(y_train, y_predtrain))

**Dealing with Imbalance Using SMOTE**

In [None]:
from imblearn.over_sampling import SMOTE

# Instantiate SMOTE
smote = SMOTE(random_state=42)

# Apply SMOTE to generate synthetic samples
X_resampled, y_resampled = smote.fit_resample(X_encoded, y)

In [None]:
y_resampled.value_counts()

In [None]:
#
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X_resampled,y_resampled,random_state=42)
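
The cells above resample the full dataset before splitting it, so synthetic points interpolated from rows that later land in the test set can end up in the training set, which tends to inflate test scores. A common alternative is to split first and resample only the training portion; with imblearn that would mean calling `fit_resample` on the training split only. The sketch below illustrates that ordering on synthetic data with a simplified, hand-rolled SMOTE-style interpolation. `smote_like` is a hypothetical helper written for this example, not imblearn's implementation.

```python
import numpy as np

def smote_like(X, y, minority=1, k=3, seed=0):
    """Simplified SMOTE-style oversampling: interpolate between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = X[y == minority]
    n_new = int((y != minority).sum()) - len(X_min)   # rows needed to balance
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)                    # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]      # k nearest minority neighbours
    base = rng.integers(0, len(X_min), size=n_new)
    partner = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                      # position along the segment
    synthetic = X_min[base] + gap * (X_min[partner] - X_min[base])
    return np.vstack([X, synthetic]), np.concatenate([y, np.full(n_new, minority)])

# Synthetic stand-in data with roughly 20% positives
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = (rng.random(200) < 0.2).astype(int)

# 1) Split first, so the test rows stay untouched real data
order = rng.permutation(len(y))
test_idx, train_idx = order[:50], order[50:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

# 2) Oversample the training portion only
X_train_bal, y_train_bal = smote_like(X_train, y_train)

print(np.bincount(y_train_bal), len(y_test))
```

The test set keeps its original size and class balance, so the metrics computed on it reflect the distribution the model would see in practice.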

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
clf_gbm = GradientBoostingClassifier(random_state=42)

In [None]:
#dictionary of the hyperparameters to be tuned
params = {
    'max_depth': [2,3,5,10],
    'min_samples_leaf': [5,10,20,50],
    'n_estimators': [10,15],
    'learning_rate': [0.1,0.2,0.3],
    'subsample': [0.1,0.2,0.3]
}

In [None]:
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(estimator=clf_gbm ,
                           param_grid=params,
                           cv = 4,
                           scoring="accuracy")

In [None]:
grid_search.fit(x_train, y_train)

In [None]:
# Best parameters and best score
print("Best parameters:", grid_search.best_params_)
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))

Best estimator:

- `max_depth`: 5
- `learning_rate`: 0.3
- `min_samples_leaf`: 50
- `n_estimators`: 15
- `random_state`: 42
- `subsample`: 0.3

In [None]:
clf_gbm2 = GradientBoostingClassifier(random_state=42)

In [None]:
#dictionary of the hyperparameters to be tuned
params = {
    'min_samples_leaf': range(5,11),
    'n_estimators': [15,20],
    'learning_rate': [0.3,0.4,0.5],
    'subsample': [0.3,0.4,0.5],
    'max_depth': [4,5,6]
}

In [None]:
grid_search = GridSearchCV(estimator=clf_gbm2 ,
                           param_grid=params,
                           cv = 4,
                           scoring="accuracy")

In [None]:
grid_search.fit(x_train, y_train)

In [None]:
clf_gbm = grid_search.best_estimator_
clf_gbm

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
clf_gbm3 = GradientBoostingClassifier(random_state=42)

In [None]:
#dictionary of the hyperparameters to be tuned
params = {
    'min_samples_leaf': [6],
    'n_estimators': [20,25,30],
    'learning_rate': [0.3],
    'subsample': [0.5,0.6,0.7],
    'max_depth': [5]
}

In [None]:
from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(estimator=clf_gbm3 ,
                           param_grid=params,
                           cv = 4,
                           scoring="accuracy")

In [None]:
grid_search.fit(x_train, y_train)

In [None]:
clf_gbm = grid_search.best_estimator_
clf_gbm

In [None]:
from sklearn.metrics import classification_report
# Test set evaluation
y_pred = clf_gbm.predict(x_test)
print(classification_report(y_test, y_pred))

In [None]:
# Training set evaluation (to check for overfitting)
y_predtrain = clf_gbm.predict(x_train)
print(classification_report(y_train, y_predtrain))

In [None]:
clf_gbm4 = GradientBoostingClassifier(random_state=42)
#dictionary of the hyperparameters to be tuned
params = {
    'min_samples_leaf': [6],
    'n_estimators': range(15,26),
    'learning_rate': [0.3],
    'subsample': [0.6],
    'max_depth': [5]
}
grid_search = GridSearchCV(estimator=clf_gbm4 ,
                           param_grid=params,
                           cv = 4,
                           scoring="accuracy")
grid_search.fit(x_train, y_train)

In [None]:
clf_gbm = grid_search.best_estimator_
# Test set evaluation
y_pred = clf_gbm.predict(x_test)
print(classification_report(y_test, y_pred))

In [None]:
# Training set evaluation (to check for overfitting)
y_predtrain = clf_gbm.predict(x_train)
print(classification_report(y_train, y_predtrain))

**XGBoost**

In [None]:
!pip install scikit-learn==1.3.0 xgboost==1.7.6
#restart session after install

In [None]:
y.value_counts()

In [None]:
from imblearn.over_sampling import SMOTE

# Instantiate SMOTE
smote = SMOTE(random_state=42)

# Apply SMOTE to generate synthetic samples
X_resampled, y_resampled = smote.fit_resample(X_encoded, y)

In [None]:
y_resampled.value_counts()

In [None]:
#
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X_resampled,y_resampled,random_state=42)

In [None]:
import xgboost as xgb # XGBoost stuff

clf_xgb = xgb.XGBClassifier(objective='binary:logistic',
                            eval_metric="logloss", ## this avoids a warning...
                            seed=42)
clf_xgb.fit(x_train, y_train)

In [None]:
y_pred = clf_xgb.predict(x_test)

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test,y_pred))

In [None]:
#check overfitting
y_pred_train = clf_xgb.predict(x_train)
print(classification_report(y_train,y_pred_train))

**XGBoost with Hyperparameter Tuning**

In [None]:
#
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X_encoded,y,random_state=42)

In [None]:
from sklearn.model_selection import GridSearchCV
#Round 1
param_grid = {'max_depth': [3, 4, 5],'learning_rate': [0.1, 0.01, 0.05],
              'gamma': [0, 0.25, 1.0],'reg_lambda': [0, 1.0, 10.0],
              'scale_pos_weight': [1, 3, 5]}
#n_estimators; default is 100 (no. of boosting rounds), subsample, colsample_bytree, alpha

optimal_params = GridSearchCV(estimator=xgb.XGBClassifier(objective='binary:logistic',seed=42,
                                                          subsample=0.9,colsample_bytree=0.5,early_stopping_rounds=2, eval_metric='auc',
                                                          use_label_encoder=False),param_grid=param_grid,
                              scoring='roc_auc', verbose=0,cv = 3)#multi:softmax in multiclass problems

optimal_params.fit(x_train,y_train,eval_set=[(x_test, y_test)])
#,early_stopping_rounds=10
#,eval_set=[(X_test, y_test)]
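
The search above passes the test split as `eval_set`, so early stopping is driven by the same data that is later used for the final report, and the test metrics may be optimistic. One possible alternative is to carve a separate validation split out of the training data and use that for early stopping. The sketch below shows only the splitting step on stand-in data; all `_demo` names are assumptions for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data (hypothetical); in the notebook this would be X_encoded and y
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 5))
y_demo = (rng.random(1000) < 0.22).astype(int)

# Hold out the test set first, then carve a validation set out of the remainder
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo)
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=0.2, random_state=42, stratify=y_rest)

# The validation split, not the test split, would then be passed for early stopping:
#   optimal_params.fit(X_train_demo, y_train_demo, eval_set=[(X_val_demo, y_val_demo)])
print(len(y_train_demo), len(y_val_demo), len(y_test_demo))
```

The test split is then touched only once, for the final classification report.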


In [None]:
optimal_params.best_params_['reg_lambda']

In [None]:
optimal_params.best_params_['scale_pos_weight']

In [None]:
optimal_params.best_params_['max_depth']

In [None]:
optimal_params.best_params_['learning_rate']

In [None]:
optimal_params.best_params_['gamma']

In [None]:
param_grid = {'max_depth': [4],'learning_rate': [0.1, 0.5, 1],\
              'gamma': [1,10,100],'reg_lambda': [10.0, 20, 100],'scale_pos_weight': [3]}

optimal_params2 = GridSearchCV(estimator=xgb.XGBClassifier(objective='binary:logistic',seed=42,
                                                          subsample=0.9,colsample_bytree=0.5,early_stopping_rounds=2, eval_metric='auc',
                                                          use_label_encoder=False),param_grid=param_grid,
                              scoring='roc_auc', verbose=0,cv = 3)#multi:softmax in multiclass problems

optimal_params2.fit(x_train,y_train,eval_set=[(x_test, y_test)])

In [None]:
optimal_params2.best_estimator_

In [None]:
optimal_params2.best_params_['reg_lambda']

In [None]:
optimal_params2.best_params_['scale_pos_weight']

In [None]:
optimal_params2.best_params_['max_depth']

In [None]:
optimal_params2.best_params_['learning_rate']

In [None]:
optimal_params2.best_params_['gamma']

In [None]:
param_grid = {'max_depth': [4],'learning_rate': [0.1],\
              'gamma': [1,5,10],'reg_lambda': [10.0, 15, 20],'scale_pos_weight': [3]}

optimal_params3 = GridSearchCV(estimator=xgb.XGBClassifier(objective='binary:logistic',seed=42,
                                                          subsample=0.9,colsample_bytree=0.5,early_stopping_rounds=2, eval_metric='auc',
                                                          use_label_encoder=False),param_grid=param_grid,
                              scoring='roc_auc', verbose=0,cv = 3)#multi:softmax in multiclass problems

optimal_params3.fit(x_train,y_train,eval_set=[(x_test, y_test)])

In [None]:
optimal_params3.best_params_['reg_lambda']

In [None]:
optimal_params3.best_params_['gamma']

Final model: `max_depth=4`, `learning_rate=0.1`, `gamma=5`, `reg_lambda=10`, `scale_pos_weight=3`

In [None]:
y_pred = optimal_params3.predict(x_test)

In [None]:
print(classification_report(y_test,y_pred))

In [None]:
y_predtrain = optimal_params3.predict(x_train)
print(classification_report(y_train,y_predtrain))

**XGBoost with SMOTE**

In [None]:
from imblearn.over_sampling import SMOTE

# Instantiate SMOTE
smote = SMOTE(random_state=42)

# Apply SMOTE to generate synthetic samples
X_resampled, y_resampled = smote.fit_resample(X_encoded, y)

In [None]:
#
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X_resampled,y_resampled,random_state=42)

In [None]:
#Round 1
param_grid = {'max_depth': [3, 4, 5],'learning_rate': [0.1, 0.01, 0.05],
              'gamma': [0, 0.25, 1.0],'reg_lambda': [0, 1.0, 10.0],
              'scale_pos_weight': [1, 3, 5]}
#n_estimators; default is 100 (no. of boosting rounds), subsample, colsample_bytree, alpha

optimal_params = GridSearchCV(estimator=xgb.XGBClassifier(objective='binary:logistic',seed=42,
                                                          subsample=0.9,colsample_bytree=0.5,early_stopping_rounds=2, eval_metric='auc',
                                                          use_label_encoder=False),param_grid=param_grid,
                              scoring='roc_auc', verbose=0,cv = 3)#multi:softmax in multiclass problems

optimal_params.fit(x_train,y_train,eval_set=[(x_test, y_test)])

In [None]:
optimal_params.best_params_['reg_lambda']

In [None]:
optimal_params.best_params_['scale_pos_weight']

In [None]:
optimal_params.best_params_['max_depth']

In [None]:
optimal_params.best_params_['learning_rate']

In [None]:
optimal_params.best_params_['gamma']

In [None]:
#Round 2
param_grid = {'max_depth': [5,10,15],'learning_rate': [0.1, 0.5, 1],
              'gamma': [0.25, 0.50, 0.75],'reg_lambda': [10,15,20],
              'scale_pos_weight': [1]}
#n_estimators; default is 100 (no. of boosting rounds), subsample, colsample_bytree, alpha

optimal_params2 = GridSearchCV(estimator=xgb.XGBClassifier(objective='binary:logistic',seed=42,
                                                          subsample=0.9,colsample_bytree=0.5,early_stopping_rounds=2, eval_metric='auc',
                                                          use_label_encoder=False),param_grid=param_grid,
                              scoring='roc_auc', verbose=0,cv = 3)#multi:softmax in multiclass problems

optimal_params2.fit(x_train,y_train,eval_set=[(x_test, y_test)])

In [None]:
optimal_params2.best_params_['reg_lambda']

In [None]:
optimal_params2.best_params_['max_depth']

In [None]:
optimal_params2.best_params_['learning_rate']

In [None]:
optimal_params2.best_params_['gamma']

In [None]:
#Round 3
param_grid = {'max_depth': [15,25,40,50],'learning_rate': [0.1],
              'gamma': [0.50],'reg_lambda': [10],
              'scale_pos_weight': [1]}
#n_estimators; default is 100 (no. of boosting rounds), subsample, colsample_bytree, alpha

optimal_params3 = GridSearchCV(estimator=xgb.XGBClassifier(objective='binary:logistic',seed=42,
                                                          subsample=0.9,colsample_bytree=0.5,early_stopping_rounds=2, eval_metric='auc',
                                                          use_label_encoder=False),param_grid=param_grid,
                              scoring='roc_auc', verbose=0,cv = 3)#multi:softmax in multiclass problems

optimal_params3.fit(x_train,y_train,eval_set=[(x_test, y_test)])

In [None]:
optimal_params3.best_params_['max_depth']

In [None]:
#Round 4
param_grid = {'max_depth': range(15,25),'learning_rate': [0.1],
              'gamma': [0.50],'reg_lambda': [10],
              'scale_pos_weight': [1]}
#n_estimators; default is 100 (no. of boosting rounds), subsample, colsample_bytree, alpha

optimal_params4 = GridSearchCV(estimator=xgb.XGBClassifier(objective='binary:logistic',seed=42,
                                                          subsample=0.9,colsample_bytree=0.5,early_stopping_rounds=2, eval_metric='auc',
                                                          use_label_encoder=False),param_grid=param_grid,
                              scoring='roc_auc', verbose=0,cv = 3)#multi:softmax in multiclass problems

optimal_params4.fit(x_train,y_train,eval_set=[(x_test, y_test)])

In [None]:
optimal_params4.best_params_['max_depth']

In [None]:
y_pred = optimal_params4.predict(x_test)
print(classification_report(y_test,y_pred))

In [None]:
y_predtrain = optimal_params4.predict(x_train)
print(classification_report(y_train,y_predtrain))

In [None]:
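#feature importance plot, similar to random forest feature importances
#plot_importance expects a Booster or XGBModel, so pass the refit best estimator rather than the GridSearchCV object
xgb.plot_importance(optimal_params4.best_estimator_,max_num_features=25,height=0.05)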
