
## **Machine Learning - II**

---


### **Implementation of Boosting Using the AdaBoost and Gradient Boosting Approaches**


---

**Aim:** To implement boosting with two approaches, AdaBoost and Gradient Boosting, each on a different dataset, and to evaluate the resulting models.

In [16]:
# Importing necessary libraries
import pandas as pd  # For data manipulation
from sklearn.datasets import load_iris  # For loading the Iris dataset
from sklearn.metrics import accuracy_score  # For model evaluation using accuracy score

# Loading the Iris dataset for AdaBoost implementation
dataset = load_iris()
X = dataset.data  # Feature variables
y = dataset.target  # Target variable

In [17]:
# Displaying the shape of features and target
print(X.shape)  # Shape of X (features)
print(y.shape)  # Shape of y (target)

(150, 4)
(150,)


In [18]:
# Splitting the dataset into training and test sets for model evaluation
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
print(X_train.shape)  # Shape of training data
print(X_test.shape)  # Shape of test data
print(y_train.shape)  # Shape of training labels
print(y_test.shape)  # Shape of test labels

(105, 4)
(45, 4)
(105,)
(45,)


### **AdaBoost Implementation**

In [19]:
from sklearn.ensemble import AdaBoostClassifier  # Importing AdaBoost Classifier

# Defining the AdaBoost model with 50 estimators
model = AdaBoostClassifier(n_estimators=50)
model.get_params()  # Displaying model parameters

{'algorithm': 'SAMME.R',
 'estimator': None,
 'learning_rate': 1.0,
 'n_estimators': 50,
 'random_state': None}

In [20]:
# Fitting the AdaBoost model on the training data
model.fit(X_train, y_train)



In [21]:
# Predicting on the test set
y_pred = model.predict(X_test)

In [22]:
# Evaluating model accuracy
print("AdaBoost Model Accuracy:", accuracy_score(y_test, y_pred))  # Accuracy score closer to 1 is better

AdaBoost Model Accuracy: 0.9555555555555556
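
With only 45 test samples, each misclassified flower moves the accuracy by about 2.2 percentage points, so a single split gives a fairly noisy estimate. A minimal sketch of a cross-validated estimate follows; the 5 folds and the `random_state=0` seed are assumptions made for illustration, not settings used in the notebook.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# random_state=0 is an assumed seed so repeated runs build the same models
model = AdaBoostClassifier(n_estimators=50, random_state=0)

# With an integer cv and a classifier, scikit-learn uses stratified folds
scores = cross_val_score(model, X, y, cv=5)

print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```

The spread of the fold accuracies gives a rough sense of how much the single-split figure could vary.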


### **Gradient Boosting Implementation**


In [23]:
# Loading the dataset for Gradient Boosting implementation
train_data = pd.read_csv('train.csv')  # Reading the training data (e.g., Titanic dataset)
test_data = pd.read_csv('test.csv')  # Reading the test data

In [24]:
# Displaying the first few rows of the test data
print(test_data.head())

   PassengerId  Pclass                                          Name     Sex  \
0          892       3                              Kelly, Mr. James    male   
1          893       3              Wilkes, Mrs. James (Ellen Needs)  female   
2          894       2                     Myles, Mr. Thomas Francis    male   
3          895       3                              Wirz, Mr. Albert    male   
4          896       3  Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female   

    Age  SibSp  Parch   Ticket     Fare Cabin Embarked  
0  34.5      0      0   330911   7.8292   NaN        Q  
1  47.0      1      0   363272   7.0000   NaN        S  
2  62.0      0      0   240276   9.6875   NaN        Q  
3  27.0      0      0   315154   8.6625   NaN        S  
4  22.0      1      1  3101298  12.2875   NaN        S  


In [25]:
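# Dropping the columns that are not used as features in this experiment
# (PassengerId is kept, so it remains in the feature set alongside Pclass, Sex and Fare)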
drop_columns = ['Name', 'Age', 'SibSp', 'Parch', 'Ticket', 'Cabin', 'Embarked']
train_data.drop(labels=drop_columns, axis=1, inplace=True)
test_data.drop(labels=drop_columns, axis=1, inplace=True)

In [26]:
# Encoding categorical data (Sex column)
from sklearn.preprocessing import LabelEncoder
lb = LabelEncoder()
train_data['Sex'] = lb.fit_transform(train_data['Sex'])  # Encoding 'Sex' column in training data
test_data['Sex'] = lb.transform(test_data['Sex'])  # Reusing the mapping learned from the training data
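
A label encoder should learn its category-to-integer mapping from the training data only and then reuse that mapping on any other data, so that the same string always maps to the same code. The sketch below illustrates this on two small hypothetical frames (`train_toy` and `test_toy` are stand-ins, not the notebook's CSV files).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frames standing in for train.csv and test.csv
train_toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})
test_toy = pd.DataFrame({"Sex": ["female", "male"]})

encoder = LabelEncoder()
train_codes = encoder.fit_transform(train_toy["Sex"])  # learns classes_ from the training data
test_codes = encoder.transform(test_toy["Sex"])        # reuses the learned mapping

print(list(encoder.classes_))  # classes are stored in sorted order
print(train_codes, test_codes)
```

Calling `transform` on a value that was not seen during fitting raises an error, which is a useful signal that the two files disagree on their categories.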

In [27]:
# Separating the target variable 'Survived' and feature variables
ytrain = train_data['Survived']  # Target variable
train_data.drop(labels="Survived", axis=1, inplace=True)  # Dropping 'Survived' from features
xtrain = train_data  # Feature variables

In [28]:
# Splitting data into training and validation sets for Gradient Boosting
from sklearn.model_selection import train_test_split
state = 12  # Random state for reproducibility
test_size = 0.30  # 30% data for validation
xtrain, xval, ytrain, yval = train_test_split(xtrain, ytrain, test_size=test_size, random_state=state)

In [29]:
# Defining the Gradient Boosting Classifier with different learning rates
from sklearn.ensemble import GradientBoostingClassifier  # Importing Gradient Boosting Classifier
lr_list = [0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1]  # List of learning rates to experiment with

# Looping through each learning rate to check model performance
for learning_rate in lr_list:
    gb_clf = GradientBoostingClassifier(n_estimators=20, learning_rate=learning_rate,
                                        max_features=2, max_depth=2, random_state=0)
    gb_clf.fit(xtrain, ytrain)  # Training the model
    print(f"Learning Rate: {learning_rate:.3f}\tAccuracyScore(Training): {gb_clf.score(xtrain, ytrain):.3f}\t AccuracyScore(Validation): {gb_clf.score(xval, yval):.3f}")

Learning Rate: 0.050	AccuracyScore(Training): 0.804	 AccuracyScore(Validation): 0.739
Learning Rate: 0.075	AccuracyScore(Training): 0.822	 AccuracyScore(Validation): 0.731
Learning Rate: 0.100	AccuracyScore(Training): 0.815	 AccuracyScore(Validation): 0.761
Learning Rate: 0.250	AccuracyScore(Training): 0.841	 AccuracyScore(Validation): 0.757
Learning Rate: 0.500	AccuracyScore(Training): 0.865	 AccuracyScore(Validation): 0.795
Learning Rate: 0.750	AccuracyScore(Training): 0.878	 AccuracyScore(Validation): 0.780
Learning Rate: 1.000	AccuracyScore(Training): 0.881	 AccuracyScore(Validation): 0.746


In [32]:
# Final Gradient Boosting Model with a chosen learning rate
gb_clf2 = GradientBoostingClassifier(n_estimators=20, learning_rate=0.5,  # Final model with learning rate 0.5
                                     max_features=2, max_depth=2, random_state=0)
gb_clf2.fit(xtrain, ytrain)  # Fitting the model

In [33]:
# Predicting on the validation set
predictions = gb_clf2.predict(xval)

In [34]:
# Evaluating model performance using confusion matrix and classification report
from sklearn.metrics import confusion_matrix, classification_report  # Importing metrics

# Displaying the confusion matrix
print("Confusion Matrix")
print(confusion_matrix(yval, predictions))

Confusion Matrix
[[149  12]
 [ 43  64]]


In [35]:
# Displaying the classification report (Precision, Recall, F1-score)
print(classification_report(yval, predictions))

              precision    recall  f1-score   support

           0       0.78      0.93      0.84       161
           1       0.84      0.60      0.70       107

    accuracy                           0.79       268
   macro avg       0.81      0.76      0.77       268
weighted avg       0.80      0.79      0.79       268



### **Conclusion and Interpretation of Results:**

---
#### **AdaBoost Results:**
- **Model Accuracy:** 95.56% on the Iris dataset.
---  
#### **Interpretation:**
---
- AdaBoost reached an accuracy of **95.56%** on the Iris test set, meaning that 43 of the 45 test samples were classified correctly.
- **AdaBoost** combines several weak learners (by default in scikit-learn, depth-1 decision trees, also called decision stumps) into a strong classifier, reweighting the training samples so that each new learner concentrates on the examples the earlier ones got wrong. The high accuracy here is consistent with Iris being a small dataset with well-separated classes.

---
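The reweighting idea behind AdaBoost can be sketched by hand. The code below is a minimal illustration of the discrete multi-class SAMME variant, not the `SAMME.R` variant reported by `get_params()` above and not scikit-learn's internal code. The helper names `fit_samme` and `predict_samme` are hypothetical, the split mirrors the notebook's (`test_size=0.3`, `random_state=1`), and class labels are assumed to be the integers 0 to K-1.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def fit_samme(X, y, n_rounds=50):
    """Fit decision stumps with SAMME-style sample reweighting (illustrative)."""
    n_classes = len(np.unique(y))
    weights = np.full(len(y), 1.0 / len(y))  # start with uniform sample weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1, random_state=0)
        stump.fit(X, y, sample_weight=weights)
        miss = stump.predict(X) != y
        err = weights[miss].sum()  # weighted error (weights sum to 1)
        if err >= 1.0 - 1.0 / n_classes:
            break  # no better than random guessing, stop adding learners
        err = max(err, 1e-10)  # guard against division by zero
        # Learner weight: larger when the weighted error is smaller
        alpha = np.log((1.0 - err) / err) + np.log(n_classes - 1.0)
        stumps.append(stump)
        alphas.append(alpha)
        # Increase the weight of misclassified samples, then renormalise
        weights = weights * np.exp(alpha * miss)
        weights = weights / weights.sum()
    return stumps, alphas


def predict_samme(stumps, alphas, X, n_classes):
    """Predict by an alpha-weighted vote over the stumps."""
    votes = np.zeros((len(X), n_classes))
    rows = np.arange(len(X))
    for stump, alpha in zip(stumps, alphas):
        votes[rows, stump.predict(X)] += alpha
    return votes.argmax(axis=1)


X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

stumps, alphas = fit_samme(X_train, y_train, n_rounds=50)
y_pred = predict_samme(stumps, alphas, X_test, n_classes=3)
acc = float(np.mean(y_pred == y_test))
print("Stumps used:", len(stumps), "Test accuracy:", acc)
```

Each stump on its own can separate at most two groups, so any accuracy above what a single stump achieves comes from the weighted vote.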

#### **Gradient Boosting Results:**
---
- The Gradient Boosting model was evaluated on a different dataset, loaded from `train.csv` (its columns match the Titanic passenger data). The results for each learning rate are:

| **Learning Rate** | **Training Accuracy** | **Validation Accuracy** |
|-------------------|-----------------------|-------------------------|
| 0.05              | 80.4%                 | 73.9%                   |
| 0.075             | 82.2%                 | 73.1%                   |
| 0.1               | 81.5%                 | 76.1%                   |
| 0.25              | 84.1%                 | 75.7%                   |
| 0.5               | 86.5%                 | 79.5%                   |
| 0.75              | 87.8%                 | 78.0%                   |
| 1.0               | 88.1%                 | 74.6%                   |

---
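The gap between training and validation accuracy in the table can also be examined one boosting stage at a time. The following is a sketch under stated assumptions: it uses synthetic data from `make_classification` in place of `train.csv`, and a learning rate of 1.0 chosen only to make any divergence easier to see.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; the notebook's train.csv is not used here)
X, y = make_classification(n_samples=600, n_features=8, n_informative=4,
                           flip_y=0.1, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=12)

clf = GradientBoostingClassifier(n_estimators=20, learning_rate=1.0,
                                 max_features=2, max_depth=2, random_state=0)
clf.fit(X_tr, y_tr)

# staged_predict yields the ensemble's predictions after each added tree
train_curve = [accuracy_score(y_tr, p) for p in clf.staged_predict(X_tr)]
val_curve = [accuracy_score(y_val, p) for p in clf.staged_predict(X_val)]

for stage, (tr, va) in enumerate(zip(train_curve, val_curve), start=1):
    print(f"trees={stage:2d}  train={tr:.3f}  validation={va:.3f}")

best_stage = int(np.argmax(val_curve)) + 1
print("Best number of trees on this validation split:", best_stage)
```

If training accuracy keeps climbing while validation accuracy stalls or drops, the later trees are mostly fitting noise in the training set.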
#### **Interpretation:**
---
- **Optimal Learning Rate:** The model with a learning rate of **0.5** yielded the best validation accuracy, **79.5%**, making it the best of the seven rates tried on this validation split. It was therefore used for the final model.
- **Overfitting Risk:** Training accuracy rises almost steadily with the learning rate (with one small dip at 0.1) and reaches **88.1%** at a learning rate of 1.0, while validation accuracy peaks at 0.5 and falls to 74.6% at 1.0. The gap between the two grows from 7.0 points at 0.5 to 13.5 points at 1.0, which suggests the model is **overfitting** at higher learning rates: it fits the training set better but generalizes worse to unseen data.
  
- **Confusion Matrix Analysis:**
  - Rows are the actual classes and columns are the predicted classes. Treating class 1 as the positive class, the matrix shows:
    - **64 true positives** (class 1 correctly predicted as class 1)
    - **149 true negatives** (class 0 correctly predicted as class 0)
    - **12 false positives** (class 0 predicted as class 1)
    - **43 false negatives** (class 1 predicted as class 0)
  
  The model identifies class 0, the majority class, well (149 of 161 cases) but misses 43 of the 107 actual class 1 cases.

- **Classification Report:**
  - **Precision for class 0:** 0.78 (78%) - Out of all predictions for class 0, 78% were correct.
  - **Recall for class 0:** 0.93 (93%) - The model was able to identify 93% of actual class 0 cases.
  - **Precision for class 1:** 0.84 (84%) - Out of all predictions for class 1, 84% were correct.
  - **Recall for class 1:** 0.60 (60%) - The model was able to identify only 60% of actual class 1 cases, showing it struggles with this minority class.
  
  **F1-Scores:**
  - F1-Score for class 0: **0.84** (the harmonic mean of precision 0.78 and recall 0.93)
  - F1-Score for class 1: **0.70** (lower than for class 0 because of the low recall of 0.60)
---
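The figures in the classification report follow directly from the four confusion-matrix counts. As a worked check, with class 1 treated as the positive class and the matrix values copied from the output above:

```python
import numpy as np

# Confusion matrix from the notebook output: rows = actual, columns = predicted
cm = np.array([[149, 12],
               [43, 64]])
tn, fp, fn, tp = cm.ravel()

precision_1 = tp / (tp + fp)  # 64 / 76   ≈ 0.84
recall_1 = tp / (tp + fn)     # 64 / 107  ≈ 0.60
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)  # ≈ 0.70

precision_0 = tn / (tn + fn)  # 149 / 192 ≈ 0.78
recall_0 = tn / (tn + fp)     # 149 / 161 ≈ 0.93
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)  # ≈ 0.84

accuracy = (tp + tn) / cm.sum()  # 213 / 268 ≈ 0.79

print(f"class 0: precision={precision_0:.2f} recall={recall_0:.2f} f1={f1_0:.2f}")
print(f"class 1: precision={precision_1:.2f} recall={recall_1:.2f} f1={f1_1:.2f}")
print(f"accuracy={accuracy:.2f}")
```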
**Overall Performance:**
- **Accuracy:** The overall validation accuracy of the final Gradient Boosting model is **79%** (213 of 268 samples), which is decent but leaves room for improvement.
- **Class Imbalance:** The validation set contains 161 class 0 samples and 107 class 1 samples, so the data is moderately imbalanced. The model does better on the majority class (recall 0.93) than on the minority class (recall 0.60).

---
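One common way to respond to a class imbalance like the one noted above is to weight training samples inversely to their class frequency. The sketch below shows the mechanics only; it runs on synthetic data with an assumed 60/40 class split, so its numbers say nothing about the notebook's dataset, and reweighting is not guaranteed to improve minority-class recall.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic stand-in data with roughly a 60/40 class split (an assumption)
X, y = make_classification(n_samples=900, n_features=6, n_informative=3,
                           weights=[0.6, 0.4], flip_y=0.05, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3,
                                            stratify=y, random_state=12)

params = dict(n_estimators=20, learning_rate=0.5, max_depth=2, random_state=0)

# Baseline: every training sample counts equally
plain = GradientBoostingClassifier(**params).fit(X_tr, y_tr)

# "balanced" gives each sample a weight inversely proportional to its class frequency
balanced_weights = compute_sample_weight(class_weight="balanced", y=y_tr)
weighted = GradientBoostingClassifier(**params).fit(X_tr, y_tr,
                                                    sample_weight=balanced_weights)

recall_plain = recall_score(y_val, plain.predict(X_val))
recall_weighted = recall_score(y_val, weighted.predict(X_val))
print(f"Class 1 recall, unweighted: {recall_plain:.3f}")
print(f"Class 1 recall, balanced weights: {recall_weighted:.3f}")
```

Any gain in minority-class recall typically comes at some cost in precision, so both should be compared before adopting the weights.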

#### **Key Takeaways:**
1. **AdaBoost** achieved excellent accuracy (95.56%) on the small, simple Iris dataset.
2. **Gradient Boosting** is a flexible model, but its performance depends on tuning parameters such as the learning rate, tree depth, and number of estimators.
3. **Model Overfitting:** Higher learning rates may lead to overfitting, as seen when the Gradient Boosting validation accuracy fell from 79.5% at a learning rate of 0.5 to 74.6% at 1.0.
4. **Class Imbalance:** The Gradient Boosting model performed worse on the minority class (class 1), with lower recall (0.60) and F1-score (0.70) than on the majority class.
---
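
The tuning mentioned in the second takeaway can be done jointly over several parameters with a cross-validated grid search. This is a minimal sketch: the grid values, the 3 folds, and the synthetic data from `make_classification` are all assumptions chosen to keep the example small, not settings taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption; the notebook's train.csv is not used here)
X, y = make_classification(n_samples=400, n_features=6, n_informative=3,
                           random_state=0)

# A deliberately small, illustrative grid over the three parameters
param_grid = {
    "learning_rate": [0.1, 0.5],
    "max_depth": [2, 3],
    "n_estimators": [20, 50],
}

search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid, cv=3, scoring="accuracy")
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
```

Because every combination is scored on several folds, the selected setting depends less on one particular train/validation split than a single loop over learning rates does.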