# **Case**
In this case we use a boosting method, AdaBoost, to classify Iris flowers. The exercise uses the widely used Iris dataset and predicts three species, Iris Setosa, Iris Versicolor, and Iris Virginica, from the length and width of the sepals and petals.

We will compare the performance of the Decision Tree and AdaBoost algorithms on this task.

# **Import Libraries and Load Data**

In [3]:
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import LabelEncoder

# Load the data
df = pd.read_csv('../Data/iris.csv')
# show the first 5 rows of the data
df.head()

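|   | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|----|---------------|--------------|---------------|--------------|---------|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |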


# **Check for Null Values**

In [4]:
df.isnull().sum()

Id               0
SepalLengthCm    0
SepalWidthCm     0
PetalLengthCm    0
PetalWidthCm     0
Species          0
dtype: int64
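
The Iris data has no missing values, so every count above is zero. As a minimal sketch of what the same check looks like when values are missing, the example below uses a small hypothetical DataFrame (`toy` is not part of the exercise data) and drops the incomplete rows with `dropna()`, which is one simple option among several.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two missing entries (not the Iris data)
toy = pd.DataFrame({
    'PetalLengthCm': [1.4, np.nan, 4.7],
    'Species': ['Iris-setosa', 'Iris-setosa', None],
})

# Count missing values per column, as in the cell above
null_counts = toy.isnull().sum()
print(null_counts)

# Drop every row that contains at least one missing value
clean = toy.dropna()
print(len(clean))
```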

# **Feature Extraction and Label Encoding**

In [7]:
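# Feature Extraction
# iloc[:, 2:-1] selects columns 'SepalWidthCm' through 'PetalWidthCm' (3 features)
# Note: 'SepalLengthCm' (column index 1) is left out by this slice;
# use df.iloc[:, 1:-1] to include all four measurements
X = df.iloc[:, 2 : -1]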
y = df['Species']

# Label Encoding
y = LabelEncoder().fit_transform(y)

# check the shape of the feature matrix
print(X.shape)

# check label value
print(y)

(150, 3)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
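
The printed array shows only integer codes, so it can help to see which species each code stands for. The sketch below keeps a reference to the fitted encoder (the cell above discards it) and uses a small hypothetical sample of labels in place of `df['Species']`.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of species labels, standing in for df['Species']
species = np.array(['Iris-virginica', 'Iris-setosa', 'Iris-versicolor', 'Iris-setosa'])

# Keep the fitted encoder so the mapping can be inspected later
encoder = LabelEncoder()
encoded = encoder.fit_transform(species)

# classes_ holds the original labels in sorted order; the position is the integer code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform converts integer codes back into species names
decoded = encoder.inverse_transform(encoded)
print(decoded)
```

Keeping the encoder around is mainly useful at prediction time, when integer predictions need to be reported as species names.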


# **Split Data**

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
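
The split above is purely random, so the three species may not be equally represented in the test set. A possible variation, sketched here with scikit-learn's bundled copy of the Iris data (`load_iris`) rather than the CSV file, passes `stratify` so that each class keeps the same proportion in both parts. The names `X_demo`, `y_demo`, `X_tr`, and `X_te` are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Bundled Iris data (assumption: used here instead of '../Data/iris.csv')
X_demo, y_demo = load_iris(return_X_y=True)

# stratify=y_demo keeps the class proportions identical in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

print(X_tr.shape, X_te.shape)

# Number of test samples per class
test_counts = np.bincount(y_te)
print(test_counts)
```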

# **Train Model**

## **Decision Tree**

In [9]:
# By default, scikit-learn's DecisionTreeClassifier uses "gini" as the
# split criterion
# Several other hyperparameters can be tuned; please read the
# documentation
# In this case we will use the default parameters
dt = DecisionTreeClassifier()

# fit data 
dt.fit(X_train, y_train)

# predict test set
y_pred_dt = dt.predict(X_test)

# calculate test data accuracy score
acc_dt = accuracy_score(y_test, y_pred_dt)
print(f'Test set accuracy {acc_dt}')
print('Round Test set accuracy: {:.2f}'.format(acc_dt))

Test set accuracy 0.9666666666666667
Round Test set accuracy: 0.97
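
Accuracy alone does not show which species are confused with each other. `classification_report` is already imported at the top of this exercise but not used, so here is a minimal sketch of how it could be applied, together with `confusion_matrix`. It loads the bundled Iris data with all four features and fixes `random_state` on the tree, both of which are assumptions that differ from the cells above, so the numbers need not match.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Bundled Iris data (assumption: stands in for the CSV used in this exercise)
iris = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=1
)

# random_state makes the fitted tree reproducible between runs
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_tr, y_tr)
pred = tree.predict(X_te)

# Per-class precision, recall and F1
report = classification_report(
    y_te, pred, labels=[0, 1, 2], target_names=iris.target_names
)
print(report)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, pred, labels=[0, 1, 2])
print(cm)
```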


## **AdaBoost**

In [23]:
# In this case we will use AdaBoostClassifier with its default base estimator
# (a decision tree of depth 1)
# For detailed parameters (hyperparameters), please check the documentation
# With the default parameters, this environment (Python 3.12) raises a warning
# about the 'SAMME.R' algorithm
# so we will use n_estimators=3 and algorithm='SAMME' instead
ada = AdaBoostClassifier(n_estimators=3, algorithm='SAMME')

# fit data
ada.fit(X_train, y_train)

# predict test set
y_pred_ada = ada.predict(X_test)

# calculate test data accuracy score
acc_ada = accuracy_score(y_test, y_pred_ada)
print(f'Test set accuracy {acc_ada}')
print('Round Test set accuracy: {:.2f}'.format(acc_ada))

Test set accuracy 0.9666666666666667
Round Test set accuracy: 0.97
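
Both models reach the same accuracy on this single 30-sample test set, which is too small to separate them reliably. One way to get a steadier comparison is k-fold cross-validation. The sketch below is only an illustration: it uses the bundled Iris data with all four features, fixes `random_state`, and omits the `algorithm` argument because its handling differs between scikit-learn versions (some versions may print a warning).

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Bundled Iris data (assumption: stands in for the CSV used in this exercise)
X_demo, y_demo = load_iris(return_X_y=True)

models = {
    'DecisionTree': DecisionTreeClassifier(random_state=0),
    'AdaBoost': AdaBoostClassifier(n_estimators=3, random_state=0),
}

# 5-fold cross-validated accuracy for each model
cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    cv_scores[name] = scores
    print(f'{name}: mean={scores.mean():.3f} std={scores.std():.3f}')
```

Comparing the mean and spread across folds gives a better sense of whether any difference between the two models is more than noise.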
