# **IRIS FLOWER CLASSIFICATION**


Submitted by: **Abhinaya A.S.**

This project was completed as part of a one-month work-from-home internship (10 Aug - 10 Sept 2024) offered to me by **CognoRise InfoTech Pvt. Ltd**.

## **OVERVIEW**

### **Knowing the Dataset**
* A multivariate data set
* Introduced by the British statistician and biologist **Ronald Fisher** in his 1936 paper **The use of multiple measurements in taxonomic problems**
* Sometimes called **Anderson's Iris data set**, because Edgar Anderson collected the data to quantify the morphological variation of Iris flowers of three related species
* Consists of 50 samples from each of the 3 species of Iris (Iris setosa, Iris virginica, and Iris versicolor)
* 4 features were measured from each sample: the length and the width of the sepals and of the petals, in cm
* No. of rows = 150, No. of columns = 5 (the 4 features and the species label)
* Format = csv

### **Problem Statement**
To create a machine learning model that categorizes an iris flower into its species based on four flower features: sepal length, sepal width, petal length and petal width. The performance of the model is then evaluated using relevant metrics.

### **Problem Solving Approach**
1. Learn the theory behind the **Decision Tree Classification** algorithm
2. Build a decision tree classifier class from scratch in Python, using object-oriented programming
3. Use this class to create a decision tree classifier model
4. Train the model on one portion of the iris dataset
5. Test the model on the remaining portion of the iris dataset and evaluate it using appropriate performance metrics
6. Discuss the implications of the performance metrics

## **SOLUTION**

### **STEP 1 : Importing Required Dependencies**

In [1]:
import pandas as pd 
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score
from sklearn.metrics import recall_score, f1_score

### **STEP 2 : Defining the Decision tree based Classifier Class**

### **Node**

In [2]:
# defining the class for a node in our decision tree
class Node():
    def __init__(self, factor_idx = None, cut_off = None, first = None, second = None, impurity_loss = None, type = None ):
        # for decision node
        self.factor_idx = factor_idx # index of the column in which the deciding factor viz. a feature is present
        self.cut_off = cut_off # the value of the deciding feature upon which the splitting is done
        self.first = first # first child of the node; receives every data point whose deciding-feature value is less than or equal to the cut-off
        self.second = second # second child of the node; receives every data point whose deciding-feature value is greater than the cut-off
        self.impurity_loss = impurity_loss # the reduction in impurity of the data when moving from the node to its children

        # for leaf node
        self.type = type # the category (class) to which the majority of the data points in the node belong
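
To make the two roles of a node concrete, here is a minimal sketch that builds a tiny tree by hand and walks it. The `Node` fields are repeated so the snippet runs on its own; `toy_tree`, `walk` and the two thresholds are hypothetical and chosen only for illustration, they are not the splits learned later in this notebook.

```python
class Node:
    """Same fields as the Node class above, repeated so the sketch is self-contained."""
    def __init__(self, factor_idx=None, cut_off=None, first=None,
                 second=None, impurity_loss=None, type=None):
        self.factor_idx = factor_idx
        self.cut_off = cut_off
        self.first = first
        self.second = second
        self.impurity_loss = impurity_loss
        self.type = type


# Hypothetical hand-built tree; the thresholds are illustrative, not learned.
# Feature order: sepal_length, sepal_width, petal_length, petal_width.
toy_tree = Node(
    factor_idx=2, cut_off=2.5,                  # decision node: petal_length <= 2.5 ?
    first=Node(type="Iris-setosa"),             # leaf node
    second=Node(
        factor_idx=3, cut_off=1.7,              # decision node: petal_width <= 1.7 ?
        first=Node(type="Iris-versicolor"),
        second=Node(type="Iris-virginica"),
    ),
)


def walk(x, node):
    """Follow decision nodes until a leaf (a node with a type) is reached."""
    while node.type is None:
        node = node.first if x[node.factor_idx] <= node.cut_off else node.second
    return node.type


print(walk([5.1, 3.5, 1.4, 0.2], toy_tree))     # Iris-setosa
```

Decision nodes carry `factor_idx` and `cut_off` and leave `type` as `None`; leaf nodes carry only `type`. That is why a single class can represent both kinds of node.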

### **Tree**

In [17]:
# defining the class for the decision tree which does the job of classification
class DecisionTree_Catogerizer():
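    def __init__(self, min_data = 2, max_depth = 2):

        # the root node of the tree; it is set later by the fit method
        self.root = None
        # minimum number of data points a node must hold to be split further
        self.min_data = min_data
        # deepest level at which a node may still be split
        self.max_depth = max_depth
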

    # defining the class methods 

    '''METHOD 1 - entropy
    used for calculating the entropy in the data within a node'''
    def entropy(self, y):

        '''y : a 1D numpy array holding the target-variable values of the data inside a node.
        Its length equals the number of data points present in the node.'''
        
        types = np.unique(y) # the distinct categories to which the data points in the node belong
        entropy = 0
        # iterate over each category
        for cls in types:
            cls_proportion = len(y[ y==cls ])/len(y)
            entropy += -cls_proportion* np.log2(cls_proportion)
        return entropy
    
    '''METHOD 2 - gini_index
    used for calculating the Gini index of the data within a node. The method is similar to entropy'''
    def gini_index(self, y):
        types = np.unique(y)
        gini = 0
        for cls in types:
            cls_proportion = len(y[ y==cls ])/len(y)
            gini += cls_proportion**2
        return 1-gini
    
    '''METHOD 3 - cleave
    used to divide the data in a node between its two child nodes based on the cut-off value of the deciding factor'''
    def cleave(self, data, factor_idx, cut_off):

        '''data : a 2D numpy array corresponding to the subset of the dataset
          present in the node which is to be split'''
        data_first = np.array([row for row in data if row[ factor_idx ] <= cut_off])
        data_second = np.array([row for row in data if row[ factor_idx ] > cut_off])
        
        return data_first, data_second

    '''METHOD 4 - info_gain
    used for calculating the information gain associated with a specific split of a node.
    Each split is characterised by a deciding factor and a cut-off value for that factor'''
    def info_gain(self, parent, f_child, s_child, mode = 'gini'):

        '''parent  : represents the target variable column of the data in the node which was split
           f_child : represents the target variable column of the data in the first_child node obtained after split
           s_child : represents the target variable column of the data in the second_child node obtained after the split'''
        # assigning weights to the child nodes
        w_f = len(f_child)/len(parent)
        w_s = len(s_child)/len(parent)

        if mode == 'gini':
            # calculating the weighted average of impurity in the child nodes
            weighted_avg_impurity = w_f*self.gini_index(f_child) + w_s*self.gini_index(s_child)

            # subtracting this from impurity of the parent node
            gain = self.gini_index(parent) - weighted_avg_impurity
            return gain
        else :
            weighted_avg_impurity = w_f*self.entropy(f_child) + w_s*self.entropy(s_child)
            gain = self.entropy(parent) - weighted_avg_impurity
            return gain
        
    '''METHOD 5 - perfect_cleave
    returns a dictionary of information regarding the best possible cleavage of a node in the decision tree'''
    def perfect_cleave(self, data, num_factors):

        '''data : a 2D numpy array corresponding to the subset of the dataset present in a node.
                  Its last column consists of the target variable values. The rest of the columns
                  correspond to the features    '''

        # initializes an empty dictionary
        perfect_cleave = {}
        # sets value for maximum information gain which will be updated later
        max_info_gain = -float('inf')

        # outer loop to iterate over possible deciding factors
        for idx in range(num_factors):
            factor_values = data[:,idx]
            cut_off_values = np.unique(factor_values)

            # inner loop to iterate over possible cut_off values of the current deciding factor
            for cut_off in cut_off_values:
                # splitting the data in the node based on the current factor and current threshold
                data_f, data_s = self.cleave(data, idx, cut_off)
                if len(data_f)>0 and len(data_s)>0:
                    parent_y = data[:,-1]
                    f_child_y = data_f[:,-1]
                    s_child_y = data_s[:,-1]
                    current_info_gain = self.info_gain(parent_y, f_child_y, s_child_y)
                    # checking if the information gain of the current split is more than that of previous split
                    if current_info_gain > max_info_gain:
                        # update the dictionary values
                        perfect_cleave['factor_idx'] = idx
                        perfect_cleave['cut_off'] = cut_off
                        perfect_cleave['data_f'] = data_f
                        perfect_cleave['data_s'] = data_s
                        perfect_cleave['impurity_loss'] = current_info_gain
                        max_info_gain = current_info_gain

        return perfect_cleave
    
    '''METHOD 6 - find_leaf_type
    to determine the category represented by a leaf node, i.e. the most frequent class among its data points'''
    def find_leaf_type(self, y):
        y = list(y)
        return max(y, key = y.count)

    
    '''METHOD 7 - make_tree
    for building the decision tree by making nodes, splitting nodes and so on.
    This is a recursive function'''
    def make_tree(self, data, depth = 0):

        '''data : a 2D array. Its last column consists of target variable value.
                  Rest of the columns correspond to various features in the dataset'''
        
        # separating the feature matrix and the target column from the data
        X, y = data[:,:-1], data[:,-1]
        num_samples, num_factors = np.shape(X)

        # checking for stopping conditions
        if num_samples >= self.min_data and depth <= self.max_depth:
            split_dict = self.perfect_cleave(data, num_factors)
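            # perfect_cleave returns an empty dict when no valid split exists, so use .get to avoid a KeyError
            if split_dict.get('impurity_loss', 0) > 0: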
                first_subtree = self.make_tree(split_dict['data_f'],depth = depth + 1)
                second_subtree = self.make_tree(split_dict['data_s'], depth = depth + 1)
                return Node(split_dict['factor_idx'], split_dict['cut_off'], first_subtree, second_subtree, split_dict['impurity_loss'])
        
        # create leaf node
        leaf_type = self.find_leaf_type(y)
        return Node(type = leaf_type)
    
    '''METHOD 8 - fit
    the method for training the classifier model using the training dataset'''
    def fit(self, X, y):
        '''X : represents the feature matrix of training dataset.
               It is a 2D numpy array.
           y : represents the column matrix of target variable in the training set.
               It is a 2D numpy array'''
        
        # joining X,y to get complete dataset
        dataset = np.concatenate((X,y), axis = 1)
        # setting up root node of the tree by feeding entire training set into make_tree function
        self.root = self.make_tree(dataset)
    
    

    '''METHOD 9 - find_type
    used for predicting the category of a given feature vector'''
    
    def find_type(self, x, tree):
        if tree.type is not None:
            return tree.type
        factor_value = x[tree.factor_idx]
        if factor_value <= tree.cut_off:
            return self.find_type(x, tree.first)
        else:
            return self.find_type(x, tree.second)

    '''METHOD 10 - predict
    used for predicting the types of data points for a given dataset'''
    def predict(self, X):
        predictions = [self.find_type(x, self.root) for x in X]
        return predictions 
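
The impurity measures and the information gain used by the class can be checked by hand on a small example. The sketch below re-expresses `entropy`, `gini_index` and `info_gain` as standalone functions over plain Python lists (an assumption made so the snippet runs without the class or numpy); the label lists `parent`, `perfect` and `useless` are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini_index(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())


def info_gain(parent, first, second, impurity=gini_index):
    """Parent impurity minus the size-weighted impurity of the two children."""
    w_f, w_s = len(first) / len(parent), len(second) / len(parent)
    return impurity(parent) - (w_f * impurity(first) + w_s * impurity(second))


# Hypothetical node holding 4 samples of class "a" and 4 of class "b".
parent = ["a"] * 4 + ["b"] * 4
perfect = (["a"] * 4, ["b"] * 4)                        # both children are pure
useless = (["a", "a", "b", "b"], ["a", "a", "b", "b"])  # children mirror the parent

print(entropy(parent), gini_index(parent))   # 1.0 0.5
print(info_gain(parent, *perfect))           # 0.5
print(info_gain(parent, *useless))           # 0.0
```

A split that separates the classes completely recovers the full parent impurity as gain, while a split whose children have the same class mix as the parent gains nothing. This is why `make_tree` only keeps a split when its gain is greater than zero.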

### **STEP 3 : Loading the Dataset**

Viewing the dataset to understand it better

In [4]:
data = pd.read_csv('IRIS.csv')
data.head()

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


In [5]:
data.dtypes

sepal_length    float64
sepal_width     float64
petal_length    float64
petal_width     float64
species          object
dtype: object

Checking the shape of the dataset and looking for missing values

In [6]:
data.shape

(150, 5)

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


### **STEP 4 : Train-Test split**

In [8]:
X = data.iloc[:,:-1].values
y = data.iloc[:,-1].values.reshape(-1,1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=392024)

### **STEP 5 : Initializing the Classification Model**

In [9]:
# creating an instance of 'DecisionTree_Catogerizer' class and storing it in the variable 'classify'
classify = DecisionTree_Catogerizer(min_data = 3, max_depth = 3)

### **STEP 6 : Training the Model**

In [10]:
# training the 'classify' model using the training dataset
classify.fit(X_train, y_train)

### **STEP 7 : Testing the Model**

In [11]:
# testing the model on the unseen test dataset and collecting the predictions made by the model
y_pred = classify.predict(X_test)

### **STEP 8 : Model Evaluation**

**Confusion Matrix**

* for a multi-class classification problem with n classes, it is an n by n matrix
* it summarizes the classifications made by the model: rows correspond to the actual classes and columns to the predicted classes
* diagonal elements give the number of correct predictions for each class
* off-diagonal elements give the number of incorrect predictions
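
As a minimal sketch of how such a matrix is assembled, the snippet below counts (actual, predicted) pairs in plain Python, with rows for actual classes and columns for predicted classes. The function name `confusion` and the label lists are hypothetical; they are not the test split used in this notebook.

```python
def confusion(y_true, y_pred, labels):
    """Build a confusion matrix: rows are actual classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return matrix


# Hypothetical labels, for illustration only.
labels = ["setosa", "versicolor", "virginica"]
y_true = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
y_hat = ["setosa", "setosa", "versicolor", "virginica", "virginica", "versicolor"]

for row in confusion(y_true, y_hat, labels):
    print(row)
# [2, 0, 0]
# [0, 1, 1]
# [0, 1, 1]
```

The two off-diagonal entries are the two mistakes: one versicolor predicted as virginica and one virginica predicted as versicolor.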


In [12]:
performance_matrix = confusion_matrix(y_test, y_pred)
print(performance_matrix)

[[ 7  0  0]
 [ 0 11  0]
 [ 0  0 12]]


**Values in confusion matrix** :

To interpret the values in the confusion matrix with respect to any one class in a multiclass problem, that class is regarded as positive and all the remaining classes are regarded as negative. In the terms below, 'True/False' indicates whether the model's classification was correct, while 'Positive/Negative' indicates the class predicted by the model.
* **True Positives (TP)** : the number of instances of that class that the model correctly assigned to it
* **False Positives (FP)** : the number of instances that belong to other classes but were incorrectly predicted by the model as that class
* **False Negatives (FN)** : the number of instances of that class that the model incorrectly assigned to other classes
* **True Negatives (TN)** : the number of instances of other classes that the model did not assign to that class
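
These four counts can be read off the confusion matrix one class at a time. The sketch below does this for the matrix printed earlier in this section; the helper name `per_class_counts` is hypothetical.

```python
def per_class_counts(matrix):
    """Return one (TP, FP, FN, TN) tuple per class of a square confusion matrix."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    counts = []
    for k in range(n):
        tp = matrix[k][k]
        fp = sum(matrix[r][k] for r in range(n)) - tp  # rest of column k
        fn = sum(matrix[k]) - tp                       # rest of row k
        tn = total - tp - fp - fn                      # everything else
        counts.append((tp, fp, fn, tn))
    return counts


# Confusion matrix from the test run (rows = actual class, columns = predicted class).
matrix = [[7, 0, 0],
          [0, 11, 0],
          [0, 0, 12]]

for counts in per_class_counts(matrix):
    print(counts)
# (7, 0, 0, 23)
# (11, 0, 0, 19)
# (12, 0, 0, 18)
```

For every class the four counts add up to 30, the size of the test set.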



**Accuracy Score**

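* for any single class in a multiclass problem, accuracy is determined by $\frac{TP + TN}{TP + FP + FN + TN}$
* the overall accuracy reported by `accuracy_score` is the fraction of all test samples classified correctly, i.e. the sum of the diagonal of the confusion matrix divided by the total number of samples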
* it is a value between 0 and 1
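
`accuracy_score`, used in the next cell, reports the fraction of samples that were predicted correctly. The sketch below computes that value from a confusion matrix and, for comparison, the average of the per-class accuracies given by the formula above. The two function names and the `imperfect` matrix are hypothetical; the two quantities coincide for a perfect classifier but can differ otherwise.

```python
def overall_accuracy(matrix):
    """Fraction of all samples that lie on the diagonal (correct predictions)."""
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)


def mean_per_class_accuracy(matrix):
    """Average of (TP + TN) / total, taking each class in turn as positive."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    scores = []
    for k in range(n):
        tp = matrix[k][k]
        fp = sum(matrix[r][k] for r in range(n)) - tp
        fn = sum(matrix[k]) - tp
        tn = total - tp - fp - fn
        scores.append((tp + tn) / total)
    return sum(scores) / n


perfect = [[7, 0, 0], [0, 11, 0], [0, 0, 12]]  # matrix from this notebook's test run
imperfect = [[5, 0, 0], [0, 4, 1], [0, 2, 3]]  # hypothetical, for contrast

print(overall_accuracy(perfect), mean_per_class_accuracy(perfect))  # 1.0 1.0
print(round(overall_accuracy(imperfect), 3),
      round(mean_per_class_accuracy(imperfect), 3))                 # 0.8 0.867
```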


In [13]:
print(f"Accuracy of the 'classify' model is {accuracy_score(y_test, y_pred)}")

Accuracy of the 'classify' model is 1.0


A high accuracy score in a classifier indicates that the model is making correct predictions most of the time.

**Precision**

* for any class in a multiclass problem, precision is determined by $\frac{TP}{TP + FP}$
* precision of the model is determined by averaging precisions over all classes
* it is a value between 0 and 1

In [14]:
print(f"Precision of the 'classify' model is {precision_score(y_test, y_pred, average='macro')}")


Precision of the 'classify' model is 1.0


**Recall Score**

* for any class in a multiclass problem, recall is determined by $\frac{TP}{TP + FN}$
* recall of the model is determined by averaging recalls over all classes
* it is a value between 0 and 1

In [15]:
print(f"Recall of the 'classify' model is {recall_score(y_test, y_pred, average='macro')}")

Recall of the 'classify' model is 1.0


**F1 - Score**

* for any class in a multiclass problem, F1-score is determined by $\frac{2 \times Precision \times Recall}{Precision + Recall}$
* F1-score of the model is determined by averaging F1-scores over all classes
* it is a value between 0 and 1
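
Because every metric equals 1.0 for this notebook's test run, the macro-averaging step is easier to see on an imperfect result. The sketch below computes macro-averaged precision, recall and F1-score from a hypothetical confusion matrix; the function name `macro_scores` is also hypothetical, and treating an undefined ratio as 0 is an assumption of this sketch.

```python
def macro_scores(matrix):
    """Macro-averaged (precision, recall, F1) from a square confusion matrix."""
    n = len(matrix)
    precisions, recalls, f1s = [], [], []
    for k in range(n):
        tp = matrix[k][k]
        predicted_k = sum(matrix[r][k] for r in range(n))  # TP + FP
        actual_k = sum(matrix[k])                          # TP + FN
        p = tp / predicted_k if predicted_k else 0.0       # assumption: undefined -> 0
        r = tp / actual_k if actual_k else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Hypothetical confusion matrix (rows = actual class, columns = predicted class).
imperfect = [[5, 0, 0],
             [0, 4, 1],
             [0, 2, 3]]

print([round(v, 3) for v in macro_scores(imperfect)])  # [0.806, 0.8, 0.798]
```

Each class contributes equally to a macro average regardless of how many samples it has, which matches the `average='macro'` argument passed to the scikit-learn functions in this notebook.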

In [16]:
print(f"F1-Score of the 'classify' model is {f1_score(y_test, y_pred, average ='macro')}")

F1-Score of the 'classify' model is 1.0


Our model classified all 30 samples in the test set correctly, so accuracy, precision, recall and F1-score all reach their maximum value of 1.0 on this split.