## Ensemble Learning - Boosting
Boosting is an ensemble learning method that trains multiple weak learners sequentially, so that each new weak learner improves on the errors (misclassifications) of the one before it.

With each iteration, the weak rules from the individual classifiers are combined into a single strong prediction rule.

### Steps of Boosting
1. Weak Learners: Boosting algorithms start by creating a weak base learner. Shallow decision trees (often depth-1 "stumps") are commonly used as weak learners.

2. Sequential Learning: Each subsequent model focuses on the examples that the previous ones misclassified. More weight is assigned to the misclassified examples, forcing the new model to concentrate on getting these instances right.

3. Weight Adjustment: The models' predictions are combined through a weighted majority vote (for classification) or a weighted sum (for regression). Each model's weight is determined by its performance, so models that perform well are given higher weights. Separately, the samples a model misclassifies are given larger sample weights in the next round.
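
The three steps above can be sketched in plain Python for the binary case. This is a minimal illustration of classic discrete AdaBoost with one-dimensional threshold stumps, not the scikit-learn implementation; the toy data and the helper names (`fit_stump`, `stump_predict`, `adaboost`, `predict`) are made up for the example.

```python
import math

def stump_predict(x, threshold, polarity):
    """A depth-1 rule: predict `polarity` on one side of the threshold, the opposite on the other."""
    return polarity if x <= threshold else -polarity

def fit_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the stump with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, threshold, polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost(xs, ys, n_rounds):
    n = len(xs)
    weights = [1.0 / n] * n              # start with uniform sample weights
    ensemble = []
    for _ in range(n_rounds):
        # Step 1: fit a weak learner on the weighted data
        err, threshold, polarity = fit_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)   # guard against log(0)
        # Step 3: the model's say in the final vote grows as its error shrinks
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, threshold, polarity))
        # Step 2: raise the weights of misclassified samples, lower the rest
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Weighted majority vote of all stumps."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1

# Toy data (an assumption for illustration): no single threshold separates the classes
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

first_round_error = fit_stump(xs, ys, [0.1] * 10)[0]
ensemble = adaboost(xs, ys, n_rounds=3)
predictions = [predict(ensemble, x) for x in xs]
print("best single stump error:", round(first_round_error, 2))
print("ensemble predictions match labels:", predictions == ys)
```

On this toy set the best single stump still misclassifies three of the ten points, while the weighted vote of three stumps classifies all of them correctly.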

### Bagging vs Boosting

Bagging
- Models are trained in parallel (independently)
- Each model is trained on a bootstrapped sample of the dataset
- Works well to reduce high variance (overfitting) in low-bias models
- Final prediction is made by voting or averaging the individual outcomes

Boosting
- Models are trained sequentially
- The entire dataset is used for each base model, but more focus is given to the samples that were misclassified
- Works well to reduce high bias (underfitting)
- Final prediction is made using a weighted vote or weighted average
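
The difference in how the two methods feed data to their base models can be sketched with the standard library alone. The row indices, the set of misclassified rows, and the doubling factor below are illustrative assumptions, not values from this notebook.

```python
import random

rows = list(range(10))          # stand-in for the row indices of a dataset
rng = random.Random(0)          # fixed seed so the sketch is repeatable

# Bagging: each model gets its own bootstrap sample, drawn with replacement,
# so some rows repeat and others are left out.
bootstrap_samples = [rng.choices(rows, k=len(rows)) for _ in range(3)]

# Boosting: every model sees every row; only the per-row weights change.
weights = [1.0 / len(rows)] * len(rows)
misclassified = {2, 7}          # hypothetical errors of the previous model
boosted_weights = [w * (2.0 if i in misclassified else 1.0)   # arbitrary factor
                   for i, w in enumerate(weights)]
total = sum(boosted_weights)
boosted_weights = [w / total for w in boosted_weights]        # renormalise to sum to 1

for sample in bootstrap_samples:
    print("bootstrap sample:", sorted(sample))
print("boosted weights:", [round(w, 3) for w in boosted_weights])
```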


### Evaluation Metrics

Cross-validation: methods such as K-fold cross-validation estimate how well the model generalises to new data.

Evaluation metrics: accuracy, and the confusion matrix (for classification models; more informative than accuracy alone on imbalanced datasets).
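
A small pure-Python sketch of these ideas follows. The labels and predictions are invented to show an imbalanced case, and the helpers (`confusion_matrix`, `accuracy`, `k_fold_indices`) are simplified stand-ins written for this example rather than the scikit-learn functions of similar names.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

def accuracy(matrix):
    """Share of samples on the diagonal (correct predictions)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) for k contiguous folds, without shuffling."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size

# Imbalanced toy labels: a model that always predicts the majority class
y_true = ["Standard"] * 8 + ["Poor"] * 2
y_pred = ["Standard"] * 10

matrix = confusion_matrix(y_true, y_pred, labels=["Poor", "Standard"])
print("confusion matrix:", matrix)
print("accuracy:", accuracy(matrix))

folds = list(k_fold_indices(n_samples=10, k=5))
print("validation folds:", [val for _, val in folds])
```

Here accuracy looks respectable at 0.8, yet the first row of the confusion matrix shows that every "Poor" sample was misclassified, which is why the matrix is the more useful view when classes are imbalanced.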


AdaBoost: https://www.youtube.com/watch?v=UgxHn8W4usI


In [None]:
src: https://www.kaggle.com/datasets/clkmuhammed/creditscoreclassification?select=train.csv

In [1]:
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer

In [2]:
train = pd.read_csv("/content/train.csv")
test = pd.read_csv("/content/test.csv")

In [3]:
train.shape

(79876, 28)

In [4]:
test.shape

(50000, 27)

In [5]:
train.head()

Unnamed: 0,ID,Customer_ID,Month,Name,Age,SSN,Occupation,Annual_Income,Monthly_Inhand_Salary,Num_Bank_Accounts,...,Credit_Mix,Outstanding_Debt,Credit_Utilization_Ratio,Credit_History_Age,Payment_of_Min_Amount,Total_EMI_per_month,Amount_invested_monthly,Payment_Behaviour,Monthly_Balance,Credit_Score
0,5634,3392,1,Aaron Maashoh,23.0,821000265.0,Scientist,19114.12,1824.843333,3.0,...,Good,809.98,26.82262,265.0,No,49.574949,21.46538,High_spent_Small_value_payments,312.494089,Good
1,5635,3392,2,Aaron Maashoh,23.0,821000265.0,Scientist,19114.12,1824.843333,3.0,...,Good,809.98,31.94496,266.0,No,49.574949,21.46538,Low_spent_Large_value_payments,284.629162,Good
2,5636,3392,3,Aaron Maashoh,23.0,821000265.0,Scientist,19114.12,1824.843333,3.0,...,Good,809.98,28.609352,267.0,No,49.574949,21.46538,Low_spent_Medium_value_payments,331.209863,Good
3,5637,3392,4,Aaron Maashoh,23.0,821000265.0,Scientist,19114.12,1824.843333,3.0,...,Good,809.98,31.377862,268.0,No,49.574949,21.46538,Low_spent_Small_value_payments,223.45131,Good
4,5638,3392,5,Aaron Maashoh,23.0,821000265.0,Scientist,19114.12,1824.843333,3.0,...,Good,809.98,24.797347,269.0,No,49.574949,21.46538,High_spent_Medium_value_payments,341.489231,Good


In [6]:
test.head()

Unnamed: 0,ID,Customer_ID,Month,Name,Age,SSN,Occupation,Annual_Income,Monthly_Inhand_Salary,Num_Bank_Accounts,...,Num_Credit_Inquiries,Credit_Mix,Outstanding_Debt,Credit_Utilization_Ratio,Credit_History_Age,Payment_of_Min_Amount,Total_EMI_per_month,Amount_invested_monthly,Payment_Behaviour,Monthly_Balance
0,5642,3392,9,Aaron Maashoh,23.0,821000265.0,Scientist,19114.12,1824.843333,3.0,...,4.0,Good,809.98,35.030402,273.0,No,49.574949,21.46538,Low_spent_Small_value_payments,186.266702
1,5643,3392,10,Aaron Maashoh,24.0,821000265.0,Scientist,19114.12,1824.843333,3.0,...,4.0,Good,809.98,33.053114,274.0,No,49.574949,21.46538,High_spent_Medium_value_payments,361.444004
2,5644,3392,11,Aaron Maashoh,24.0,821000265.0,Scientist,19114.12,1824.843333,3.0,...,4.0,Good,809.98,33.811894,275.0,No,49.574949,21.46538,Low_spent_Medium_value_payments,264.675446
3,5645,3392,12,Aaron Maashoh,24.0,821000265.0,Scientist,19114.12,1824.843333,3.0,...,4.0,Good,809.98,32.430559,276.0,No,49.574949,21.46538,High_spent_Medium_value_payments,343.826873
4,5654,8625,9,Rick Rothackerj,28.0,4075839.0,Teacher,34847.84,3037.986667,2.0,...,5.0,Good,605.03,25.926822,327.0,No,18.816215,39.684018,High_spent_Large_value_payments,485.298434


In [7]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 79876 entries, 0 to 79875
Data columns (total 28 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   ID                        79876 non-null  int64  
 1   Customer_ID               79876 non-null  int64  
 2   Month                     79876 non-null  int64  
 3   Name                      79876 non-null  object 
 4   Age                       79876 non-null  float64
 5   SSN                       79876 non-null  float64
 6   Occupation                79876 non-null  object 
 7   Annual_Income             79876 non-null  float64
 8   Monthly_Inhand_Salary     79876 non-null  float64
 9   Num_Bank_Accounts         79876 non-null  float64
 10  Num_Credit_Card           79875 non-null  float64
 11  Interest_Rate             79875 non-null  float64
 12  Num_of_Loan               79875 non-null  float64
 13  Type_of_Loan              79875 non-null  object 
 14  Delay_

In [8]:
train.drop(['Name', 'SSN', 'Num_Credit_Inquiries', 'ID'], inplace=True, axis=1)

In [9]:
train.isnull().sum()

Customer_ID                 0
Month                       0
Age                         0
Occupation                  0
Annual_Income               0
Monthly_Inhand_Salary       0
Num_Bank_Accounts           0
Num_Credit_Card             1
Interest_Rate               1
Num_of_Loan                 1
Type_of_Loan                1
Delay_from_due_date         1
Num_of_Delayed_Payment      1
Changed_Credit_Limit        1
Credit_Mix                  1
Outstanding_Debt            1
Credit_Utilization_Ratio    1
Credit_History_Age          1
Payment_of_Min_Amount       1
Total_EMI_per_month         1
Amount_invested_monthly     1
Payment_Behaviour           1
Monthly_Balance             1
Credit_Score                1
dtype: int64

In [10]:
train = train.dropna(subset=['Credit_Score'])

In [11]:
train.isnull().sum()

Customer_ID                 0
Month                       0
Age                         0
Occupation                  0
Annual_Income               0
Monthly_Inhand_Salary       0
Num_Bank_Accounts           0
Num_Credit_Card             0
Interest_Rate               0
Num_of_Loan                 0
Type_of_Loan                0
Delay_from_due_date         0
Num_of_Delayed_Payment      0
Changed_Credit_Limit        0
Credit_Mix                  0
Outstanding_Debt            0
Credit_Utilization_Ratio    0
Credit_History_Age          0
Payment_of_Min_Amount       0
Total_EMI_per_month         0
Amount_invested_monthly     0
Payment_Behaviour           0
Monthly_Balance             0
Credit_Score                0
dtype: int64

In [12]:
train.shape

(79875, 24)

In [13]:
train.dtypes

Customer_ID                   int64
Month                         int64
Age                         float64
Occupation                   object
Annual_Income               float64
Monthly_Inhand_Salary       float64
Num_Bank_Accounts           float64
Num_Credit_Card             float64
Interest_Rate               float64
Num_of_Loan                 float64
Type_of_Loan                 object
Delay_from_due_date         float64
Num_of_Delayed_Payment      float64
Changed_Credit_Limit        float64
Credit_Mix                   object
Outstanding_Debt            float64
Credit_Utilization_Ratio    float64
Credit_History_Age          float64
Payment_of_Min_Amount        object
Total_EMI_per_month         float64
Amount_invested_monthly     float64
Payment_Behaviour            object
Monthly_Balance             float64
Credit_Score                 object
dtype: object

In [14]:
from sklearn.preprocessing import LabelEncoder

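# Categorical (object-dtype) columns, including the target, to convert to integer codes
cols = ['Occupation', 'Type_of_Loan', 'Credit_Mix', 'Payment_of_Min_Amount', 'Payment_Behaviour', 'Credit_Score']

label_encoder = LabelEncoder()

# The encoder is refit on every column, so afterwards it only holds the mapping of the last one
for col in cols:
    train[col] = label_encoder.fit_transform(train[col])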
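
To make the encoding step concrete, here is a plain-Python sketch of the mapping that label encoding produces: the distinct values are sorted and numbered from 0. The helper names `label_encode` and `label_decode` are hypothetical, and the class names "Poor" and "Standard" are assumed for illustration (only "Good" appears in the output above).

```python
def label_encode(values):
    """Map each distinct value to an integer code, in sorted order of the values."""
    classes = sorted(set(values))
    mapping = {value: code for code, value in enumerate(classes)}
    return [mapping[v] for v in values], classes

def label_decode(codes, classes):
    """Turn integer codes back into the original labels."""
    return [classes[c] for c in codes]

scores = ["Good", "Standard", "Poor", "Good"]   # illustrative values
codes, classes = label_encode(scores)

print("classes:", classes)   # sorted alphabetically
print("codes:", codes)
print("decoded:", label_decode(codes, classes))
```

Keeping the `classes` list for each column is what makes it possible to translate integer predictions back into readable labels later.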

## Method 1: AdaBoost

In [19]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

In [18]:
X = train.iloc[:, :-1]

In [17]:
y = train['Credit_Score']

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=42)

In [21]:
base = DecisionTreeClassifier(max_depth=1)

In [36]:
model = AdaBoostClassifier(estimator=base, n_estimators=100, random_state=42, learning_rate=0.5)

In [37]:
model.fit(X_train, y_train)

In [38]:
y_pred = model.predict(X_test)

In [39]:
acc = accuracy_score(y_test, y_pred)

In [40]:
print(acc)

0.6431924882629108


In [42]:
model = AdaBoostClassifier(estimator=base, n_estimators=100, random_state=42, learning_rate=0.1)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(acc)  # lower than with learning_rate=0.5: likely underfitting

0.6123317683881064


In [50]:
model = AdaBoostClassifier(estimator=base, n_estimators=200, random_state=42, learning_rate=10)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(acc)  # accuracy collapses: a learning rate of 10 is far too large

0.38253521126760565


In [51]:
model = AdaBoostClassifier(estimator=base, n_estimators=150, random_state=42, learning_rate=0.5)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(acc)  # best of the settings tried

0.6475743348982785
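
One way to build intuition for why the learning rate matters so much is to look at the textbook multiclass AdaBoost (SAMME) update, where the learning rate scales each model's weight and therefore how hard misclassified samples are up-weighted. The sketch below assumes a weak learner with an error rate of 0.4 on three classes; both numbers are illustrative, and the exact update scikit-learn applied in the runs above depends on the installed version and its algorithm setting.

```python
import math

def samme_alpha(error, n_classes, learning_rate):
    """Model weight in the textbook SAMME update, scaled by the learning rate."""
    return learning_rate * (math.log((1 - error) / error) + math.log(n_classes - 1))

def misclassified_multiplier(error, n_classes, learning_rate):
    """Factor applied to the weight of each misclassified sample (before renormalising)."""
    return math.exp(samme_alpha(error, n_classes, learning_rate))

error, n_classes = 0.4, 3     # assumed values for illustration
multipliers = {lr: misclassified_multiplier(error, n_classes, lr)
               for lr in (0.1, 0.5, 10)}

for lr, factor in multipliers.items():
    print(f"learning_rate={lr}: misclassified weights multiplied by {factor:.3f}")
```

Under these assumptions the multiplier is 3 raised to the learning rate: roughly 1.1 at 0.1, roughly 1.7 at 0.5, and about 59049 at 10. With a factor that large, a handful of misclassified samples would dominate the next round, which is consistent with the collapse in accuracy seen at `learning_rate=10`.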


## Method 2: Gradient Boosting

In [47]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

In [53]:
gb_clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
gb_clf.fit(X_train, y_train)

y_pred = gb_clf.predict(X_test)

TypeError: ignored

In [55]:
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

Accuracy: 0.7433


Gradient boosting is more accurate than AdaBoost here, but it takes far longer to train.
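
The notebook does not record training times, so the comparison is qualitative. A minimal way to measure it, assuming only that the model exposes a `fit(X, y)` method, is sketched below; `timed_fit` and `DummyModel` are hypothetical names used so the example runs without any dataset.

```python
import time

def timed_fit(model, X, y):
    """Fit the model and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    model.fit(X, y)
    return time.perf_counter() - start

class DummyModel:
    """Stand-in with a fit() method, used only to exercise timed_fit."""
    def fit(self, X, y):
        self.total_ = sum(sum(row) for row in X)   # some throwaway work
        return self

X_demo = [[i, i + 1] for i in range(1000)]
y_demo = [i % 2 for i in range(1000)]

model_demo = DummyModel()
elapsed = timed_fit(model_demo, X_demo, y_demo)
print(f"fit took {elapsed:.6f} s")
```

The same helper could be called as `timed_fit(gb_clf, X_train, y_train)` in place of a bare `fit` call to put a number on the difference between models.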

## Method 3: XGBoost

In [54]:
import xgboost as xgb

# XGBoost's default base learner is a gradient-boosted tree (booster="gbtree", max_depth=6), not a depth-1 stump
xgb_clf = xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)

xgb_clf.fit(X_train, y_train)

y_pred = xgb_clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

Accuracy: 0.7433


XGBoost took much less time than scikit-learn's gradient boosting to achieve the same accuracy. Let's try to improve the accuracy further.

In [None]:
from sklearn.model_selection import GridSearchCV
import xgboost as xgb

xgb_clf = xgb.XGBClassifier(random_state=42)

param_grid = {
    'n_estimators': [50, 100, 150],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 4, 5]
}

grid_search = GridSearchCV(xgb_clf, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_
best_score = grid_search.best_score_

print("Best Parameters:", best_params)
print("Best Accuracy:", best_score)

best_model = grid_search.best_estimator_
test_accuracy = best_model.score(X_test, y_test)
print("Test Accuracy of Best Model:", test_accuracy)
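
Before running the search it helps to know how much work it implies. The sketch below enumerates the same parameter grid with the standard library and counts the model fits for 5-fold cross-validation; it only counts, it does not train anything.

```python
from itertools import product

param_grid = {
    'n_estimators': [50, 100, 150],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 4, 5],
}
cv = 5

# Every combination of one value per parameter is one candidate model
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
n_cv_fits = len(candidates) * cv

print(len(candidates), "candidates,", n_cv_fits, "cross-validation fits")
print("first candidate:", candidates[0])
```

With its default `refit=True`, `GridSearchCV` also trains the best candidate once more on the full training set, so the total is one fit higher than the count printed here. Passing `n_jobs=-1` to `GridSearchCV` runs the fits in parallel.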