

---

## **Machine Learning - II**

---

### **Implementation of Bagging Classifier and Comparison with Decision Tree and Random Forest**

---

#### **Objective:**

To implement and compare the performance of Bagging Classifier with Decision Tree and Random Forest classifiers on the Iris dataset.

In [None]:
# Importing necessary libraries
from sklearn.datasets import load_iris
import pandas as pd

In [None]:
# Loading the Iris dataset
data = load_iris()

# Creating a DataFrame for the Iris dataset
df = pd.DataFrame(data.data, columns=data.feature_names)

# Adding the target column to the DataFrame
df['Species'] = data.target

# Displaying the DataFrame
df

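| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | Species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | 2 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | 2 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | 2 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | 2 |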


### **Data Information and Summary**

In [None]:
# Displaying information about the dataset such as number of columns and data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
 4   Species            150 non-null    int64  
dtypes: float64(4), int64(1)
memory usage: 6.0 KB


In [None]:
# Displaying statistical summary of the dataset
df.describe()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | Species |
|---|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.057333 | 3.758 | 1.199333 | 1.0 |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 | 0.819232 |
| min | 4.3 | 2.0 | 1.0 | 0.1 | 0.0 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 | 0.0 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 | 1.0 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 | 2.0 |
| max | 7.9 | 4.4 | 6.9 | 2.5 | 2.0 |


### **Splitting the Data into Training and Testing Sets**

In [None]:
# Defining features (X) and target (y)
x = df[data.feature_names]
y = df['Species']

In [None]:
# Splitting the data into training and testing sets (70% train, 30% test)
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.3, random_state=0)

# Printing the shape of the training and testing sets to confirm split
print(xtrain.shape)
print(xtest.shape)
print(ytrain.shape)
print(ytest.shape)

(105, 4)
(45, 4)
(105,)
(45,)


### **Training and Predicting using Decision Tree Classifier**

In [None]:
# Importing DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier

# Creating a DecisionTreeClassifier object
dtc = DecisionTreeClassifier()

# Training the Decision Tree classifier on the training data
dtc.fit(xtrain, ytrain)

In [None]:
# Making predictions on the testing data
predictiondt = dtc.predict(xtest)

# Displaying the predictions of the Decision Tree classifier
predictiondt

array([2, 1, 0, 2, 0, 2, 0, 1, 1, 1, 2, 1, 1, 1, 1, 0, 1, 1, 0, 0, 2, 1,
       0, 0, 2, 0, 0, 1, 1, 0, 2, 1, 0, 2, 2, 1, 0, 2, 1, 1, 2, 0, 2, 0,
       0])

sklearn.ensemble.RandomForestClassifier: parameters that we can set to adjust this model

1. n_estimators - number of decision trees in the ensemble
2. criterion - function used to measure the quality of a split, e.g. Gini impurity
3. max_depth - maximum depth of each decision tree
4. max_leaf_nodes - maximum number of leaf nodes allowed in each decision tree
5. bootstrap - True to train each tree on a bootstrap sample (random sampling with replacement); False to train each tree on the whole training set
6. n_jobs - number of jobs to run in parallel; -1 uses all the processors
7. random_state - seed for the random sampling, set to get reproducible results
8. max_samples - number of samples drawn to train each tree when bootstrap is True
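
As an illustration of the parameters listed above, here is a minimal sketch that sets each of them explicitly. The specific values (200 trees, depth 4, 8 leaf nodes, 80% samples) and the names `X_train`, `rfc_tuned`, and `test_accuracy` are illustrative assumptions, not tuned settings or part of the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Same 70/30 split as in the notebook
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# All values below are illustrative assumptions, not recommended settings
rfc_tuned = RandomForestClassifier(
    n_estimators=200,    # number of trees in the forest
    criterion="gini",    # split quality measure
    max_depth=4,         # cap on the depth of each tree
    max_leaf_nodes=8,    # cap on the number of leaves per tree
    bootstrap=True,      # train each tree on a bootstrap sample
    max_samples=0.8,     # each bootstrap sample draws 80% of the training rows
    n_jobs=-1,           # use all available processors
    random_state=0,      # fixed seed for reproducible results
)
rfc_tuned.fit(X_train, y_train)

test_accuracy = rfc_tuned.score(X_test, y_test)
print("Number of fitted trees:", len(rfc_tuned.estimators_))
print("Test accuracy:", test_accuracy)
```

Constraining `max_depth` and `max_leaf_nodes` makes each tree simpler; whether that helps or hurts on a given dataset has to be checked empirically.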

### **Training and Predicting using Random Forest Classifier**

In [None]:
# Importing RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier

# Creating a RandomForestClassifier object
rfc = RandomForestClassifier()

# Training the Random Forest classifier on the training data
rfc.fit(xtrain, ytrain)

In [None]:
# Making predictions on the testing data
predictionrf = rfc.predict(xtest)

# Displaying the predictions of the Random Forest classifier
predictionrf

array([2, 1, 0, 2, 0, 2, 0, 1, 1, 1, 2, 1, 1, 1, 1, 0, 1, 1, 0, 0, 2, 1,
       0, 0, 2, 0, 0, 1, 1, 0, 2, 1, 0, 2, 2, 1, 0, 2, 1, 1, 2, 0, 2, 0,
       0])

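sklearn.ensemble.BaggingClassifier: important parameters that we can tune to adjust the model

1. estimator - base model fitted on each random subset; a decision tree is used when none is given
2. n_estimators - number of base estimators in the ensemble
3. max_samples - number (or fraction) of samples drawn to train each base estimator
4. max_features - number (or fraction) of features drawn to train each base estimator
5. bootstrap - True when samples are drawn with replacement and False when they are drawn without replacement
6. bootstrap_features - whether the features (columns) are drawn with replacement
7. n_jobs - number of jobs to run in parallel; -1 uses all the processors
8. random_state - seed for the random sampling, set to get reproducible results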
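
The parameters above can be combined as in the following sketch. The values chosen (50 estimators, 80% of the rows, 3 of the 4 features) and the names `bag` and `test_accuracy` are assumptions made for illustration. The base estimator is passed positionally because, as far as I know, the keyword was renamed from `base_estimator` to `estimator` in scikit-learn 1.2, and the positional form works on either side of that change.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same 70/30 split as in the notebook
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# All values below are illustrative assumptions, not recommended settings
bag = BaggingClassifier(
    DecisionTreeClassifier(random_state=0),  # base estimator (first positional argument)
    n_estimators=50,           # number of base estimators
    max_samples=0.8,           # each estimator sees 80% of the training rows
    max_features=3,            # each estimator sees 3 of the 4 features
    bootstrap=True,            # rows are drawn with replacement
    bootstrap_features=False,  # features are drawn without replacement
    random_state=0,            # fixed seed for reproducible results
)
bag.fit(X_train, y_train)

test_accuracy = bag.score(X_test, y_test)
print("Number of fitted estimators:", len(bag.estimators_))
print("Feature indices used by the first estimator:", bag.estimators_features_[0])
print("Test accuracy:", test_accuracy)
```

Setting `max_features` below the total number of features gives each tree a random subset of columns, which moves the ensemble a step closer to what a Random Forest does at every split.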

### **Training and Predicting using Bagging Classifier**

In [None]:
# Importing BaggingClassifier
from sklearn.ensemble import BaggingClassifier

# Creating a BaggingClassifier object
model = BaggingClassifier()

# Note: You can customize the Bagging Classifier by specifying the base estimator, number of estimators, etc.
# Example: BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=100, max_samples=0.8)
# (the estimator parameter was named base_estimator before scikit-learn 1.2)

# Training the Bagging classifier on the training data
model.fit(xtrain, ytrain)

In [None]:
# Making predictions on the testing data
predictionb = model.predict(xtest)

# Displaying the predictions of the Bagging classifier
predictionb

array([2, 1, 0, 2, 0, 2, 0, 1, 1, 1, 2, 1, 1, 1, 1, 0, 1, 1, 0, 0, 2, 1,
       0, 0, 2, 0, 0, 1, 1, 0, 2, 1, 0, 2, 2, 1, 0, 2, 1, 1, 2, 0, 2, 0,
       0])

### **Evaluating the Accuracy of the Classifiers**

In [None]:
# Importing accuracy_score for evaluating model performance
from sklearn.metrics import accuracy_score

# accuracy_score expects the true labels first, then the predictions

# Calculating and displaying the accuracy of the Decision Tree classifier
print("Accuracy Score of DecisionTreeClassifier = ", accuracy_score(ytest, predictiondt))

# Calculating and displaying the accuracy of the Random Forest classifier
print("Accuracy Score of RandomForestClassifier = ", accuracy_score(ytest, predictionrf))

# Calculating and displaying the accuracy of the Bagging classifier
print("Accuracy Score of BaggingClassifier = ", accuracy_score(ytest, predictionb))

Accuracy Score of DecisionTreeClassifier =  0.9777777777777777
Accuracy Score of RandomForestClassifier =  0.9777777777777777
Accuracy Score of BaggingClassifier =  0.9777777777777777
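
A single accuracy number does not show which classes are confused with each other. The sketch below, which assumes the same 70/30 split and a Decision Tree with a fixed `random_state` (an addition for reproducibility, not part of the original cell), prints a confusion matrix and recovers the accuracy from its diagonal.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0
)

# random_state is fixed here only so that repeated runs give the same tree
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
y_pred = tree.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
print(cm)

# Correct predictions lie on the diagonal, so accuracy = trace / total
n_correct = cm.trace()
n_total = cm.sum()
print(f"{n_correct} of {n_total} correct, accuracy = {accuracy_score(y_test, y_pred):.4f}")
```

Any off-diagonal entry identifies a misclassification by its true class (row) and predicted class (column).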




---


### **Interpretation of Output**


---


**1. Decision Tree Classifier Accuracy: 97.78%**
   - The Decision Tree classifier achieved an accuracy of approximately 97.78% on the test set.
   - This means that the model correctly classified 44 of the 45 test instances.
   - A high accuracy suggests that the Decision Tree model performed well on this dataset, likely capturing the underlying patterns effectively.

**2. Random Forest Classifier Accuracy: 97.78%**
   - The Random Forest classifier also achieved an accuracy of approximately 97.78%.
   - Random Forest is an ensemble method that combines multiple Decision Trees to improve performance and reduce overfitting.
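   - The identical accuracy to the Decision Tree indicates that, on this train/test split, combining many trees did not improve on the single tree. Both models were trained with default hyperparameters, so the tie more likely reflects how easily the Iris classes separate than any tuning of the Decision Tree.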

**3. Bagging Classifier Accuracy: 97.78%**
   - The Bagging Classifier, an ensemble that by default trains Decision Trees on bootstrap samples of the training data, achieved the same accuracy of 97.78%.
   - All three models returned identical predictions for the 45 test instances, so they misclassify the same single instance. This suggests that the dataset has little complexity or noise, making it relatively easy for different models to perform equally well.
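
Since one 45-instance test split cannot separate the three models, a cross-validated comparison gives a steadier picture. The following is a minimal sketch, assuming 5-fold stratified cross-validation and fixed seeds; the fold count, the seeds, and the names `models` and `cv_results` are illustrative choices, not part of the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Stratified folds keep the three species balanced in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
    "BaggingClassifier": BaggingClassifier(random_state=0),
}

cv_results = {}
for name, clf in models.items():
    # One accuracy value per fold
    scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
    cv_results[name] = scores
    print(f"{name}: mean = {scores.mean():.4f}, std = {scores.std():.4f}")
```

If the mean accuracies differ by less than their standard deviations, the data do not support ranking one model above the others.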


---



### **Conclusion**
- All three models (Decision Tree, Random Forest, and Bagging) achieved the same accuracy of 97.78% (44 of 45 test instances) on the Iris dataset.
- The high accuracy across models suggests that the data is well-structured and that the four measured features are effective in predicting the target class.
- While ensemble methods like Random Forest and Bagging typically outperform a single Decision Tree, on this split the performance is identical, suggesting that a single Decision Tree is sufficient for this dataset. A 45-instance test set is too small to rank the models, and because none of them was given a fixed random_state, rerunning the notebook may change the scores slightly.
---