# Ensemble Learning



## A. Implement a Random Forest Classifier model to predict the acceptability of a car.
Dataset link: https://www.kaggle.com/datasets/elikplim/car-evaluation-data-set


### Step 0: Import Necessary Libraries

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, accuracy_score

### Step 1: Load the dataset

In [None]:
data = pd.read_csv('car_evaluation.csv', header=None)

In [None]:
data.shape

(1728, 7)

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       1728 non-null   object
 1   1       1728 non-null   object
 2   2       1728 non-null   object
 3   3       1728 non-null   object
 4   4       1728 non-null   object
 5   5       1728 non-null   object
 6   6       1728 non-null   object
dtypes: object(7)
memory usage: 94.6+ KB


### Step 2: Add headers

In [None]:
data.columns = ['buying_price', 'maintenance_cost', 'number_of_doors', 'number_of_persons', 'lug_boot', 'safety', 'decision']
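Steps 1 and 2 can also be combined, since `pd.read_csv` accepts the column names through its `names` parameter. A minimal sketch, using an in-memory two-row sample in place of `car_evaluation.csv` (the second row is made up for illustration):

```python
import io
import pandas as pd

# Two rows in the same headerless layout as car_evaluation.csv.
# The second row is an illustrative value, not taken from the dataset.
raw = io.StringIO(
    "vhigh,vhigh,2,2,small,low,unacc\n"
    "low,med,4,more,big,high,vgood\n"
)

columns = ['buying_price', 'maintenance_cost', 'number_of_doors',
           'number_of_persons', 'lug_boot', 'safety', 'decision']

# header=None says the file has no header row; names= supplies the headers
demo = pd.read_csv(raw, header=None, names=columns)
print(demo)
```

With the real file, the `io.StringIO` object would simply be replaced by the path to the CSV.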

In [None]:
data.head()

|   | buying_price | maintenance_cost | number_of_doors | number_of_persons | lug_boot | safety | decision |
|---|---|---|---|---|---|---|---|
| 0 | vhigh | vhigh | 2 | 2 | small | low | unacc |
| 1 | vhigh | vhigh | 2 | 2 | small | med | unacc |
| 2 | vhigh | vhigh | 2 | 2 | small | high | unacc |
| 3 | vhigh | vhigh | 2 | 2 | med | low | unacc |
| 4 | vhigh | vhigh | 2 | 2 | med | med | unacc |


In [None]:
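# The summary recorded below is of the label-encoded values (Step 4); on raw string columns describe() reports count/unique/top/freq.
data.describe()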

|   | buying_price | maintenance_cost | number_of_doors | number_of_persons | lug_boot | safety | decision |
|---|---|---|---|---|---|---|---|
| count | 1728.0 | 1728.0 | 1728.0 | 1728.0 | 1728.0 | 1728.0 | 1728.0 |
| mean | 1.5 | 1.5 | 1.5 | 1.0 | 1.0 | 1.0 | 1.553241 |
| std | 1.118358 | 1.118358 | 1.118358 | 0.816733 | 0.816733 | 0.816733 | 0.875948 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.75 | 0.75 | 0.75 | 0.0 | 0.0 | 0.0 | 1.0 |
| 50% | 1.5 | 1.5 | 1.5 | 1.0 | 1.0 | 1.0 | 2.0 |
| 75% | 2.25 | 2.25 | 2.25 | 2.0 | 2.0 | 2.0 | 2.0 |
| max | 3.0 | 3.0 | 3.0 | 2.0 | 2.0 | 2.0 | 3.0 |


### Step 3: Initialize LabelEncoder

In [None]:
le = LabelEncoder()

### Step 4: Encode all categorical features, including the target variable

In [None]:
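# Refitting `le` on each column overwrites its mapping, so afterwards it only remembers the last column's classes
for column in data.columns:
    data[column] = le.fit_transform(data[column])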
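`LabelEncoder` assigns codes in alphabetical order of the class names, which does not match the natural order of values such as `low < med < high < vhigh`. Tree-based models cope with this, but if the order matters, scikit-learn's `OrdinalEncoder` accepts an explicit category order per column. A minimal sketch on a toy frame (the rows and the chosen orderings are assumptions for illustration, not the full dataset):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Tiny stand-in for two columns of the car data (illustrative rows only)
toy = pd.DataFrame({
    "buying_price": ["vhigh", "high", "med", "low"],
    "safety": ["low", "med", "high", "med"],
})

# LabelEncoder sorts classes alphabetically: high=0, low=1, med=2, vhigh=3
label_codes = LabelEncoder().fit_transform(toy["buying_price"]).tolist()

# OrdinalEncoder takes the intended order for each column explicitly
ordered = OrdinalEncoder(categories=[
    ["low", "med", "high", "vhigh"],  # buying_price
    ["low", "med", "high"],           # safety
])
ordinal_codes = ordered.fit_transform(toy).astype(int)

# The fitted encoder keeps every column's mapping and can invert it
restored = ordered.inverse_transform(ordinal_codes)

print(label_codes)
print(ordinal_codes.tolist())
```

Because one `OrdinalEncoder` holds the categories of all columns, the encoded frame can be mapped back to the original strings later, which a single reused `LabelEncoder` cannot do.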

### Step 5: Separate features (X) and target (y)


In [None]:
X = data.drop('decision', axis=1)  # Features
y = data['decision']  # Target (Encoded)

### Step 6: Train-test split (80% training, 20% testing)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
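The class supports in the Step 10 report (83, 11, 235, and 17) show that the target is imbalanced, and a plain random split can shift those proportions between the training and test sets. Passing `stratify=y` to `train_test_split` keeps them aligned. A small sketch on hypothetical labels (the 70/20/10 class mix is an assumption chosen to make the arithmetic easy to check):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 700 of class 0, 200 of class 1, 100 of class 2
y_demo = np.array([0] * 700 + [1] * 200 + [2] * 100)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

# stratify=y_demo preserves the 70/20/10 mix in both splits
_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print("train:", train_counts)
print("test: ", test_counts)
```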

### Step 7: Initialize the Random Forest Classifier with 10,000 trees

In [None]:
rf_classifier = RandomForestClassifier(n_estimators=10000, random_state=42)
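Ten thousand trees is likely far more than a dataset of 1,728 rows needs: forest accuracy usually levels off after a few hundred trees, while training time keeps growing with every tree added. One way to check is to grow a forest incrementally with `warm_start=True` and watch the test score. The sketch below uses synthetic data from `make_classification` as a stand-in, so the numbers it prints say nothing about the car dataset itself:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic 3-class problem (an assumption, standing in for the real data)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, n_informative=4, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# warm_start=True keeps the trees already grown and only fits the new ones
forest = RandomForestClassifier(n_estimators=10, warm_start=True, random_state=42)

scores = {}
for n_trees in (10, 50, 100, 200):
    forest.set_params(n_estimators=n_trees)
    forest.fit(X_tr, y_tr)
    scores[n_trees] = forest.score(X_te, y_te)

print(scores)
```

If the score stops improving at some forest size, that size is a reasonable value for `n_estimators`.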

### Step 8: Train the model on the training data

In [None]:
rf_classifier.fit(X_train, y_train)

### Step 9: Make predictions on the test data

In [None]:
y_pred = rf_classifier.predict(X_test)

### Step 10: Evaluate the model's performance

In [None]:
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

In [None]:
print(f'Accuracy: {accuracy:.2f}')
print('Classification Report:')
print(report)

Accuracy: 0.97
Classification Report:
              precision    recall  f1-score   support

           0       0.99      0.89      0.94        83
           1       0.58      1.00      0.73        11
           2       1.00      1.00      1.00       235
           3       1.00      0.94      0.97        17

    accuracy                           0.97       346
   macro avg       0.89      0.96      0.91       346
weighted avg       0.98      0.97      0.97       346
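In the report above, class `1` has a precision of 0.58 but a recall of 1.00: every true class-`1` sample was found, but some samples predicted as `1` actually belong to other classes. Both numbers come straight from the confusion matrix, as the sketch below shows on a small set of hypothetical labels (not the model's actual predictions):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true and predicted labels, chosen so that class 1 is over-predicted
y_true_demo = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])
y_pred_demo = np.array([0, 1, 0, 1, 1, 2, 2, 2, 1, 2])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)

# Precision: correct predictions of a class / all predictions of that class (column sum)
precision = np.diag(cm) / cm.sum(axis=0)
# Recall: correct predictions of a class / all true members of that class (row sum)
recall = np.diag(cm) / cm.sum(axis=1)

print(cm)
print("precision:", precision.round(2))
print("recall:   ", recall.round(2))
```

Here class `1` ends up with a precision of 0.5 and a recall of 1.0, the same pattern as in the report.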



## **B: Use different voting mechanisms and apply AdaBoost (Adaptive Boosting), Gradient Tree Boosting (GBM), and XGBoost classification to the Iris dataset, then compare the performance of the three models using different evaluation measures.**
Dataset Link: https://www.kaggle.com/datasets/uciml/iris

### 1. Import Necessary Libraries

In [None]:
# Importing libraries for data manipulation and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Importing libraries for model building and evaluation
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Importing XGBoost
from xgboost import XGBClassifier

# Importing Label Encoder
from sklearn.preprocessing import LabelEncoder

### 2. Load the Iris Dataset

In [None]:
# Load the Iris dataset
url = "Iris.csv"  # Provide the correct path where the dataset is located
iris_data = pd.read_csv(url)

# Display the first few rows of the dataset
iris_data.head()

|   | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|---|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |



---

#### Observations:
- The Kaggle `Iris.csv` file has 150 samples and 6 columns: an `Id` row identifier, 4 features, and 1 target. The target, `Species`, is categorical with three classes (`Iris-setosa`, `Iris-versicolor`, and `Iris-virginica`).
- The features are continuous numerical values giving the sepal and petal length and width in centimetres.

---

### 3. Check for Missing Values

In [None]:
# Check for missing values
iris_data.isnull().sum()

|   | 0 |
|---|---|
| Id | 0 |
| SepalLengthCm | 0 |
| SepalWidthCm | 0 |
| PetalLengthCm | 0 |
| PetalWidthCm | 0 |
| Species | 0 |


---

#### Observations:
- No column has missing values, so we can proceed with the dataset as is.
- Had any been found, we would need to handle them, for example by imputation or by removing the affected rows.

---


### 4. Encode the Target Variable

The target variable (`Species`) is categorical, so we encode it as integer labels before training.


In [None]:
# Label encode the target variable (Species)
le = LabelEncoder()
iris_data['Species'] = le.fit_transform(iris_data['Species'])

# Check the encoding
iris_data['Species'].unique()

array([0, 1, 2])

---

#### Observations:
- The three classes are now represented as the integers `0`, `1`, and `2`. scikit-learn classifiers can also work with string class labels directly, but `XGBClassifier` expects integer labels starting at 0, so encoding the target once keeps the three models consistent.

---
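To read the numeric labels in later reports, it helps to know which species each integer stands for. The fitted encoder stores this in `classes_`, and `inverse_transform` converts codes back to names. A minimal sketch on a short hand-written list of species names (a stand-in for the `Species` column):

```python
from sklearn.preprocessing import LabelEncoder

# Stand-in for the Species column
species = ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(species)

# classes_ is sorted alphabetically; a class's position is its integer code
mapping = {name: int(code) for code, name in enumerate(le_demo.classes_)}

# Convert predicted codes back to readable names
decoded = le_demo.inverse_transform([2, 0]).tolist()

print(mapping)
print(decoded)
```

The same `classes_` array can be passed as `target_names` to `classification_report` to label its rows by species.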


### 5. Split the Data into Training and Test Sets

We will split the dataset into training and test sets. Typically, we use an 80/20 split, where 80% of the data is used for training the model and 20% is reserved for testing its performance.


In [None]:
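# Split the data into features (X) and target (y).
# 'Id' is dropped too: rows in Iris.csv are ordered by species, so the row number leaks the label.
# The outputs recorded below come from a run that still had 'Id' in X, so a fresh run may differ.
X = iris_data.drop(['Id', 'Species'], axis=1)
y = iris_data['Species']

# Perform an 80/20 split for training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)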

---

#### Observations:
- **Stratification**: `stratify=y` keeps the class proportions in the training and test sets equal to those in the full dataset. Iris is balanced, so the 30-sample test set gets exactly 10 samples per class; on an imbalanced dataset, stratification guards against a split that under-represents a rare class.

---
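One thing to watch in this split: `Iris.csv` includes an `Id` column, and its rows are ordered by species, so a row identifier left among the features can reveal the class by itself. The sketch below illustrates the effect with scikit-learn's bundled copy of Iris, which has the same ordering, and a synthetic identifier standing in for `Id` (both are assumptions made so that the example runs without the CSV):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Synthetic row identifier 1..150, standing in for the CSV's Id column
row_id = np.arange(1, len(iris.target) + 1).reshape(-1, 1)

id_train, id_test, y_tr, y_te = train_test_split(
    row_id, iris.target, test_size=0.2, random_state=42, stratify=iris.target
)

# A shallow tree that sees only the row identifier, no flower measurements
id_only_tree = DecisionTreeClassifier(max_depth=2, random_state=42)
id_only_tree.fit(id_train, y_tr)

id_only_accuracy = id_only_tree.score(id_test, y_te)
print(f"Accuracy from the row identifier alone: {id_only_accuracy:.2f}")
```

A high score here means the identifier carries the label, so test accuracy obtained with such a column among the features overstates how well a model has learned from the measurements.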


### 6. Train the Models (AdaBoost, Gradient Boosting, XGBoost)

We will now train the models using **AdaBoost**, **Gradient Boosting (GBM)**, and **XGBoost**. These models are all ensemble methods, but they have different boosting techniques.



#### 6.1 AdaBoost Classifier

In [None]:
# Initialize and train AdaBoost Classifier
ada_model = AdaBoostClassifier(n_estimators=100, random_state=42)
ada_model.fit(X_train, y_train)

# Predict on the test data
y_pred_ada = ada_model.predict(X_test)




#### 6.2 Gradient Boosting Classifier


In [None]:
# Initialize and train Gradient Boosting Classifier
gb_model = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb_model.fit(X_train, y_train)

# Predict on the test data
y_pred_gb = gb_model.predict(X_test)

#### 6.3 XGBoost Classifier


In [None]:
# Initialize and train XGBoost Classifier
xgb_model = XGBClassifier(n_estimators=100, eval_metric='mlogloss', random_state=42)
xgb_model.fit(X_train, y_train)

# Predict on the test data
y_pred_xgb = xgb_model.predict(X_test)



---

#### Observations:
- **AdaBoost**: Trains weak learners sequentially, increasing the weights of misclassified instances after each round so that later learners focus on hard-to-classify examples.
- **Gradient Boosting**: Fits each new tree to the residual errors (more generally, the negative gradient of the loss) of the ensemble built so far, which lets it optimize any differentiable loss function.
- **XGBoost**: An optimized implementation of gradient boosting that adds L1/L2 regularization to the objective to curb overfitting, and uses second-order gradient information and parallelized tree construction for speed.
- All three classifiers are trained on the training set using 100 estimators.
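
The residual-fitting idea behind gradient boosting can be shown in a few lines. The sketch below is a hand-rolled version for regression with squared-error loss on a synthetic noisy sine curve. It is meant only to illustrate the mechanism, and is not how `GradientBoostingClassifier` or XGBoost are implemented internally; the data, tree depth, and learning rate are all assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem: a noisy sine curve
rng = np.random.default_rng(0)
X_demo = np.linspace(0, 6, 200).reshape(-1, 1)
y_demo = np.sin(X_demo).ravel() + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y_demo, y_demo.mean())  # start from a constant model
errors = [np.mean((y_demo - prediction) ** 2)]    # training MSE per round

for _ in range(50):
    # For squared-error loss, the negative gradient is the residual (up to a constant factor)
    residuals = y_demo - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, residuals)
    # Move the ensemble a small step towards the residuals
    prediction += learning_rate * tree.predict(X_demo)
    errors.append(np.mean((y_demo - prediction) ** 2))

print(f"Training MSE: {errors[0]:.4f} -> {errors[-1]:.4f}")
```

Each round shrinks the training error a little; the learning rate controls how large each correction is.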

---


### 7. Evaluate the Models

We will evaluate the performance of each model using common classification metrics like **accuracy**, **confusion matrix**, and **classification report** (which includes precision, recall, and F1-score).


#### 7.1 AdaBoost Evaluation

In [None]:
# Evaluate AdaBoost Classifier
ada_accuracy = accuracy_score(y_test, y_pred_ada)
ada_report = classification_report(y_test, y_pred_ada)
ada_cm = confusion_matrix(y_test, y_pred_ada)

print(f"AdaBoost Accuracy: {ada_accuracy:.2f}")
print("AdaBoost Classification Report:\n", ada_report)
print("AdaBoost Confusion Matrix:\n", ada_cm)

AdaBoost Accuracy: 1.00
AdaBoost Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00        10
           2       1.00      1.00      1.00        10

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30

AdaBoost Confusion Matrix:
 [[10  0  0]
 [ 0 10  0]
 [ 0  0 10]]


#### 7.2 Gradient Boosting Evaluation


In [None]:
# Evaluate Gradient Boosting Classifier
gb_accuracy = accuracy_score(y_test, y_pred_gb)
gb_report = classification_report(y_test, y_pred_gb)
gb_cm = confusion_matrix(y_test, y_pred_gb)

print(f"Gradient Boosting Accuracy: {gb_accuracy:.2f}")
print("Gradient Boosting Classification Report:\n", gb_report)
print("Gradient Boosting Confusion Matrix:\n", gb_cm)


Gradient Boosting Accuracy: 1.00
Gradient Boosting Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00        10
           2       1.00      1.00      1.00        10

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30

Gradient Boosting Confusion Matrix:
 [[10  0  0]
 [ 0 10  0]
 [ 0  0 10]]


#### 7.3 XGBoost Evaluation


In [None]:
# Evaluate XGBoost Classifier
xgb_accuracy = accuracy_score(y_test, y_pred_xgb)
xgb_report = classification_report(y_test, y_pred_xgb)
xgb_cm = confusion_matrix(y_test, y_pred_xgb)

print(f"XGBoost Accuracy: {xgb_accuracy:.2f}")
print("XGBoost Classification Report:\n", xgb_report)
print("XGBoost Confusion Matrix:\n", xgb_cm)

XGBoost Accuracy: 1.00
XGBoost Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00        10
           2       1.00      1.00      1.00        10

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30

XGBoost Confusion Matrix:
 [[10  0  0]
 [ 0 10  0]
 [ 0  0 10]]


---

#### Observations:
- In the recorded run, all three models reach an **accuracy** of 1.00 on the 30-sample test set, so this split alone does not separate them. XGBoost's regularization often helps on larger, noisier datasets, but it offers no visible advantage here.
- The **classification report** gives precision, recall, and F1-score per class, which matter most when the class distribution is uneven or when specific error types (e.g., false positives vs. false negatives) carry different costs.
- The **confusion matrix** shows how many samples of each class were predicted as each class; here every off-diagonal entry is 0, meaning there are no misclassifications.

---
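Individual models can also be combined through voting. scikit-learn's `VotingClassifier` supports two mechanisms: `voting='hard'` takes the majority of the predicted class labels, while `voting='soft'` averages the predicted class probabilities and picks the class with the highest average. The sketch below is one possible setup, not a definitive one: it uses scikit-learn's bundled Iris data and a random forest as the third member so that it runs with scikit-learn alone.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Bundled copy of Iris (an assumption, used instead of Iris.csv)
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# VotingClassifier fits clones of these estimators, so the list can be reused
members = [
    ("ada", AdaBoostClassifier(n_estimators=100, random_state=42)),
    ("gb", GradientBoostingClassifier(n_estimators=100, random_state=42)),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
]

voting_accuracy = {}
for mode in ("hard", "soft"):
    # hard: majority vote on labels; soft: argmax of averaged class probabilities
    ensemble = VotingClassifier(estimators=members, voting=mode)
    ensemble.fit(X_tr, y_tr)
    voting_accuracy[mode] = accuracy_score(y_te, ensemble.predict(X_te))

print(voting_accuracy)
```

Soft voting requires every member to implement `predict_proba`, which all three estimators used here do.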



### 8. Compare the Results

Now, we will compare the results from the three models to understand which performs the best overall.



In [None]:
# Compare the accuracy scores of the three models
print(f"AdaBoost Accuracy: {ada_accuracy:.2f}")
print(f"Gradient Boosting Accuracy: {gb_accuracy:.2f}")
print(f"XGBoost Accuracy: {xgb_accuracy:.2f}")

AdaBoost Accuracy: 1.00
Gradient Boosting Accuracy: 1.00
XGBoost Accuracy: 1.00


---

#### Observations:
- In the recorded run, **AdaBoost**, **Gradient Boosting**, and **XGBoost** all tie at an accuracy of 1.00, so the test set gives no basis for ranking them.
- Each model has its own strengths and trade-offs: AdaBoost is the simplest of the three, Gradient Boosting supports arbitrary differentiable loss functions, and XGBoost adds regularization and speed optimizations that tend to pay off on larger, more complex datasets.
- With accuracy tied, the choice comes down to the other needs of the problem, such as computational efficiency, ease of tuning, or the ability to handle large datasets.

---
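A single 30-sample test set is a coarse yardstick, since one misclassified flower moves accuracy by more than three percentage points. Stratified k-fold cross-validation averages over several splits and gives a steadier basis for comparison. A hedged sketch using scikit-learn's bundled Iris data and the two scikit-learn boosters (the choice of 5 folds is an assumption; any other scikit-learn-compatible classifier could be added to the dictionary in the same way):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Bundled copy of Iris (an assumption, used instead of Iris.csv)
X_demo, y_demo = load_iris(return_X_y=True)

# 5 stratified folds: every sample is used for testing exactly once
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    "AdaBoost": AdaBoostClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
}

cv_results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Differences between models that are smaller than the fold-to-fold standard deviation should not be read as a real ranking.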


### Conclusion:

1. We imported and preprocessed the **Iris dataset**, splitting it into training and testing sets.
2. We trained three different boosting algorithms: **AdaBoost**, **Gradient Boosting**, and **XGBoost**.
3. We evaluated the models on accuracy, classification reports, and confusion matrices.
4. We compared the results of each model: all three reached an accuracy of 1.00 on the test set, so none stands out on this dataset, and the simplest model that meets the problem's needs is a reasonable choice.
