# Exploring classification algorithms applied on the iris flower data set.

2023/24

By Trish O'Grady

***

# Introduction

## The Iris Dataset

In 1936, Ronald Fisher, a statistician and biologist, developed a linear function to differentiate Iris plant species based on the morphology of their flowers. Fisher’s Iris data set contains 50 samples from each of 3 Iris species: Iris setosa, Iris virginica and Iris versicolor. The dataset contains 4 features: the lengths and widths of the sepals and petals. It is regularly used to test machine learning algorithms.[0]
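
The structure described above can be checked with a short sketch. It assumes the copy of the dataset bundled with scikit-learn (`load_iris`) rather than a CSV file, and the variable name `samples_per_class` is only illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Rows are flowers (samples), columns are measurements (features)
print(iris.data.shape)

# The four measurements recorded for every flower
print(iris.feature_names)

# The three species; in this copy they are stored as integer codes 0, 1 and 2
print(list(iris.target_names))

# Count how many samples belong to each species
samples_per_class = np.bincount(iris.target)
print(samples_per_class)
```

In this bundled copy the species appear as integer codes with a separate list of names, whereas a CSV export may store the species names directly.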

## What is Supervised Learning?

Supervised learning is a type of machine learning that uses labeled training data to teach an algorithm how to make predictions on its own. It is known as a "supervised" learning method because it includes a training phase in which the algorithm is given a dataset that contains both input data and the appropriate output or target values.[1] The algorithm then learns, by generalizing from the training data, how to map the input data to the desired output.
There are two types of supervised learning. The first is classification, where the objective is to assign input data to one of a number of predefined classes or labels.[2] For example, each flower in the iris dataset belongs to one of three classes (Versicolor, Setosa or Virginica) and is described by four features: sepal length, sepal width, petal length, and petal width. The classification of iris flowers aims to predict the species of a flower from these measurements. The second is regression, which predicts a numerical value or quantity from the input data, for example the value of a house based on variables such as its location, whether it is semi-detached and whether it has a private garden.[2]

The key components of supervised learning are as follows:

- Input Data or Features: These are the parameters or variables that define the input. In a classification problem these could be, for example, the measurements of an object. In a regression problem they might be the variables that influence a numerical result.[3]
- Output Data or Labels: The output data represents the predictions that the algorithm is attempting to make. Labeled data refers to training data that has already been associated with the desired result. For instance, the output in a classification problem might be a category label, while in a regression problem it might be a number.[3]
- Training Data: The input-output pairs make up the training dataset, which is used to train the machine learning model. The algorithm uses this information to discover patterns and connections between the input features and the related output.[3]
- Model Learning: Using the training data, the machine learning model learns the underlying patterns, relationships, and rules that map input data to output data. The objective is to develop a model that generalizes well and makes accurate predictions about new data.[3]
- Prediction: Once the model has been trained, it can be used on new data. The model takes the input data, analyses it, and generates a prediction as its output.[3]
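
The components listed above can be mapped onto a few lines of code. This is a minimal sketch, assuming scikit-learn's bundled iris data and a decision tree as the model. Holding back every second flower as "new" data is an arbitrary choice made only for illustration, and the variable names are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Input data (features): four measurements per flower
features = iris.data
# Output data (labels): the known species of each flower, coded 0, 1 or 2
labels = iris.target

# Training data: input-output pairs (here, every second flower)
train_features, train_labels = features[::2], labels[::2]
# The remaining flowers stand in for new, unseen data
new_features, new_labels = features[1::2], labels[1::2]

# Model learning: fit the model to the training pairs
model = DecisionTreeClassifier(random_state=0)
model.fit(train_features, train_labels)

# Prediction: apply the trained model to the unseen flowers
predicted = model.predict(new_features)
accuracy = (predicted == new_labels).mean()
print(f"Share of unseen flowers classified correctly: {accuracy:.2f}")
```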

Overall, supervised learning is a fundamental and effective machine learning approach that serves as the foundation for numerous practical applications.

# What are classification algorithms?

Machine learning applications use classification algorithms, a subset of supervised learning algorithms, to divide data into discrete classes or categories based on input features. These algorithms are designed to recognize and learn from patterns in the data, which can then be used to assign data points to predetermined groups.[4]
In its basic form, logistic regression is employed for binary classification problems, where there are only two possible classes for the output. The logistic function is used to model the probability that an input belongs to a given class. It can be extended to handle more than two classes, as in the iris dataset.[4]
To make decisions, decision trees recursively divide the data into subsets according to the most informative feature. They can be used for both binary and multi-class classification.[5]
Random Forest is an ensemble learning technique that combines multiple decision trees to increase accuracy and reduce overfitting. It works well for both classification and regression tasks.[5]
The choice of classification algorithm depends on the kind of problem, the type of data, and the specific requirements. It is normal practice to try out various algorithms to see which one performs best on the task.[4]
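
One way to try out several algorithms is to score each with cross-validation. The sketch below is an illustration rather than a definitive benchmark. It assumes scikit-learn's bundled iris data, 5-fold cross-validation, and settings (`max_iter=1000`, `random_state=0`, `n_estimators=100`) chosen here only for convenience.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, Y = load_iris(return_X_y=True)

# The three algorithms described above, with illustrative settings
candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

mean_accuracy = {}
for name, model in candidates.items():
    # 5-fold cross-validation: train on four fifths, score on the remaining fifth
    scores = cross_val_score(model, X, Y, cv=5)
    mean_accuracy[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```

On a dataset this small the differences between the models are likely to be slight, so the comparison is mainly useful as a template.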

# An overview of the scikit-learn Python library

Scikit-learn is a free machine learning library for Python. It provides tools for data analysis and modelling, offers a variety of methods for classification, regression, clustering, and dimensionality reduction, and supports both supervised and unsupervised machine learning.[6]

The basic function of classification in machine learning is to give an input a label or category based on its features.
Scikit-learn provides a number of algorithms for classification, including Logistic Regression, Decision Trees and Random Forests; the tree-based methods can also be used for regression problems. Principal Component Analysis (PCA), which the library also provides, is a dimensionality reduction technique rather than a classifier, and is typically used to preprocess the features.[7]

Before choosing an algorithm for classification, the data needs to be prepared and labelled so that each sample has a known class. After choosing a suitable algorithm, such as Logistic Regression, the model is trained so that it can find patterns within the dataset. Metrics are then applied to evaluate its performance. Once the model is trained and evaluated, it can be used to make predictions.[7]
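
The evaluation step can be sketched with two common metrics, accuracy and the confusion matrix. This is a hedged example: it assumes scikit-learn's bundled iris data, and the 80/20 split, `random_state=0` and `max_iter=1000` are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Prepared, labelled data: features X and known classes Y
X, Y = load_iris(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, stratify=Y, random_state=0
)

# Train the chosen algorithm on the training portion only
model = LogisticRegression(max_iter=1000)
model.fit(X_train, Y_train)

# Evaluate on data the model has not seen
Y_pred = model.predict(X_test)
accuracy = accuracy_score(Y_test, Y_pred)
matrix = confusion_matrix(Y_test, Y_pred)

print(f"Accuracy: {accuracy:.3f}")
# Rows are the true classes, columns are the predicted classes
print(matrix)
```

Off-diagonal entries in the confusion matrix show which species are mistaken for one another, which accuracy alone does not reveal.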

Like classification, regression is a supervised machine learning task, but it aims to predict a continuous numeric output variable from the input features. It is used for estimating values or modelling relationships between variables. The workflow is the same: the data is prepared, an algorithm is chosen, the model is trained, and metrics are applied before making predictions.[6]
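
For comparison, a regression task on the same data might look like the sketch below. It assumes scikit-learn's bundled iris data and predicts petal width from petal length, a pairing picked here purely for illustration. The score is computed on the training data itself, so it is optimistic.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

iris = load_iris()

# Feature columns are ordered: sepal length, sepal width, petal length, petal width
petal_length = iris.data[:, [2]]  # kept 2-D, as scikit-learn expects for features
petal_width = iris.data[:, 3]     # continuous numeric target

reg = LinearRegression()
reg.fit(petal_length, petal_width)

# R^2 measures how much of the variation in petal width the line explains
predicted_width = reg.predict(petal_length)
r2 = r2_score(petal_width, predicted_width)
print(f"Slope: {reg.coef_[0]:.3f}, R^2: {r2:.3f}")
```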

In conclusion, scikit-learn is a flexible and popular Python framework for data analysis and machine learning. It is a crucial component of the toolset for data scientists and machine learning practitioners since it offers a complete set of tools for developing, assessing, and deploying machine learning models.[6]

In [3]:
!pip install scikit-learn
!pip install numpy



In [4]:
import sklearn
import pandas as pd
import numpy as np

In [5]:
from sklearn import datasets

iris = datasets.load_iris()

In [6]:
csv_path = r"C:\Users\Trish\repo\MachineLearningandStatistics\Machine_Learning_And_Statistics\iris_dataset.csv"

In [7]:
df=pd.read_csv(csv_path)
df

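|     | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|-----|-------------------|------------------|-------------------|------------------|--------|
| 0   | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1   | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2   | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3   | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4   | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica |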


In [8]:
df.isnull().sum()

sepal length (cm)    0
sepal width (cm)     0
petal length (cm)    0
petal width (cm)     0
target               0
dtype: int64

In [9]:
df.describe()

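|       | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|-------|-------------------|------------------|-------------------|------------------|
| count | 150.0    | 150.0    | 150.0    | 150.0    |
| mean  | 5.843333 | 3.054    | 3.758667 | 1.198667 |
| std   | 0.828066 | 0.433594 | 1.76442  | 0.763161 |
| min   | 4.3      | 2.0      | 1.0      | 0.1      |
| 25%   | 5.1      | 2.8      | 1.6      | 0.3      |
| 50%   | 5.8      | 3.0      | 4.35     | 1.3      |
| 75%   | 6.4      | 3.3      | 5.1      | 1.8      |
| max   | 7.9      | 4.4      | 6.9      | 2.5      |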


# Machine Learning 

Machine Learning is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence (AI) based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention.[8]

Machine learning involves 'training' a model on a subset of a dataset. The performance of the model is unknown until it is 'tested' on additional data that was not available during training, referred to as the test set. In this situation, the goal of machine learning is to achieve the best results on the test set.[8]
The goal of (supervised) machine learning is to create a model that can generate accurate, repeatable predictions. Machine learning is all about results, i.e. the output. Statistical modelling, on the other hand, is more concerned with identifying relationships between variables and their significance, although it can also make predictions.[9]

After many different simulations have been run to test the conceivable situations, the generated data can be used to train the chosen machine learning model to make predictions in the real world.[10]

Statistical testing starts from a null hypothesis, an initial statement that there is no association between two measured events. If the data provides enough evidence against it, the null hypothesis is rejected in favour of an alternative hypothesis, meaning that the data is taken to be drawn from a different distribution. The decision uses a probability threshold: the null hypothesis is rejected when the p-value falls below that threshold. The aim of the iris flower classification is to predict the species of a flower based on its specific features.
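
As an illustration of a null hypothesis and a p-value on this data, the sketch below runs a one-way ANOVA. This test is not part of the original analysis and is chosen here only as an example. It assumes SciPy is installed, uses scikit-learn's bundled iris data, and takes 0.05 as a conventional threshold. The null hypothesis is that mean petal length is the same for all three species.

```python
from scipy.stats import f_oneway
from sklearn.datasets import load_iris

iris = load_iris()
petal_length = iris.data[:, 2]

# One group of petal lengths per species (codes 0, 1 and 2)
groups = [petal_length[iris.target == code] for code in range(3)]

# One-way ANOVA: tests whether the group means are all equal
f_statistic, p_value = f_oneway(*groups)

alpha = 0.05  # illustrative probability threshold
reject_null = p_value < alpha
print(f"F = {f_statistic:.1f}, p-value = {p_value:.3g}, reject null: {reject_null}")
```

Rejecting the null hypothesis here suggests that petal length differs between species, which is consistent with it being a useful feature for classification.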



## Identify the Distribution of the Target Variable


In [10]:
df['target'].value_counts()

target
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: count, dtype: int64

# Splitting the features and target

The target variable is split from the other variables so that it can be predicted from them. The feature variables, such as sepal length and sepal width, are used to determine what the target will be. The target column is removed from the dataframe and stored separately. Two variables are created: all the features are stored in variable X (the target column is dropped from this variable) and the target is stored in variable Y.

In [11]:
# Features: every column except the target
X = df.drop(columns='target')

# Target: the species label for each flower
Y = df['target']

print(X)

     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                  5.1               3.5                1.4               0.2
1                  4.9               3.0                1.4               0.2
2                  4.7               3.2                1.3               0.2
3                  4.6               3.1                1.5               0.2
4                  5.0               3.6                1.4               0.2
..                 ...               ...                ...               ...
145                6.7               3.0                5.2               2.3
146                6.3               2.5                5.0               1.9
147                6.5               3.0                5.2               2.0
148                6.2               3.4                5.4               2.3
149                5.9               3.0                5.1               1.8

[150 rows x 4 columns]


In [13]:
print(Y)

0         Iris-setosa
1         Iris-setosa
2         Iris-setosa
3         Iris-setosa
4         Iris-setosa
            ...      
145    Iris-virginica
146    Iris-virginica
147    Iris-virginica
148    Iris-virginica
149    Iris-virginica
Name: target, Length: 150, dtype: object


# Logistic Regression: the classification algorithm implemented using scikit-learn

In its most basic form, logistic regression is a statistical model that uses a logistic function to model a binary dependent variable.[11]

The logistic model is used in statistics to model the probability of a specific class or event, such as pass/fail or win/lose. It can be extended to model several classes of events, such as determining whether an image contains a cat, a dog, or another animal. Each detected object in the image would be assigned a probability between 0 and 1, with the probabilities summing to one.[12]
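
The two ideas in this paragraph can be written out numerically. The sketch below assumes NumPy. The softmax function is one common way of turning several class scores into probabilities that sum to one and is used here as an illustration, and the scores for cat, dog and other are made-up values.

```python
import numpy as np


def logistic(z):
    """Logistic (sigmoid) function: maps any real score to a value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))


def softmax(scores):
    """Turn a vector of class scores into probabilities that sum to one."""
    shifted = scores - np.max(scores)  # subtracting the max avoids overflow
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()


# Binary case: a score of 0 sits exactly between the two classes
binary_probability = logistic(0.0)

# Multi-class case: hypothetical scores for cat, dog and other
class_scores = np.array([2.0, 1.0, 0.1])
class_probabilities = softmax(class_scores)

print(binary_probability)
print(class_probabilities, class_probabilities.sum())
```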

First, a variable is declared to hold the model, and a logistic regression model is assigned to it. The model is then trained on the training data using the model.fit function, which finds the relationship between the feature variables in the data set and the target variable used for prediction. The trained model is then used to predict new outcomes.[11]
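
These steps might be sketched as follows. The example assumes scikit-learn's bundled iris data, with the species names used as labels to resemble the `target` column of the CSV. `max_iter=1000` is an illustrative setting to give the solver room to converge, and `new_flower` is a hypothetical measurement.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = iris.data
# Species names as labels; scikit-learn classifiers accept string labels
Y = iris.target_names[iris.target]

# Declare a variable and assign a logistic regression model to it
model = LogisticRegression(max_iter=1000)

# Train the model: learn the relationship between the features and the target
model.fit(X, Y)

# Predict the species of a new flower from its four measurements
new_flower = [[5.1, 3.5, 1.4, 0.2]]
predicted_species = model.predict(new_flower)
probabilities = model.predict_proba(new_flower)

print(predicted_species[0])
# One probability per class, in the order given by model.classes_
print(dict(zip(model.classes_, probabilities[0].round(3))))
```

For simplicity this sketch trains on all 150 flowers; in practice the model would be fitted on the training portion only.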

# Split data into training and test data

Four variables are created: X_train, X_test, Y_train and Y_test. The X and Y variables are each split into training and test data. The test size is the proportion of the data held out for testing, in this case 20%. Stratify=Y keeps the proportion of each class in Y the same in the training and test sets.
The data is split randomly, so the split (and any statistic calculated from it, such as the mean) changes each time the code is run. To prevent this and get the same split each time, the random_state parameter is set to a fixed value.

The shape attribute returns the number of data points in each array. X.shape is the shape of the original data; the training data is 80% of it and the test data is 20%.
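
A minimal sketch of the split described above, assuming scikit-learn's bundled iris data in place of the CSV. The value `random_state=2` is an arbitrary illustrative choice, since any fixed integer gives a repeatable split.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, Y = load_iris(return_X_y=True)

# Hold out 20% for testing, keep class proportions equal, fix the random split
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, stratify=Y, random_state=2
)

# Original data, training data (80%) and test data (20%)
print(X.shape, X_train.shape, X_test.shape)  # (150, 4) (120, 4) (30, 4)

# Stratification: the test set holds the same number of flowers from each species
test_counts = np.bincount(Y_test)
print(test_counts)

# Repeating the call with the same random_state reproduces the same split
X_train_again, _, _, _ = train_test_split(
    X, Y, test_size=0.2, stratify=Y, random_state=2
)
same_split = np.array_equal(X_train, X_train_again)
print(same_split)
```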

In [14]:
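from sklearn.feature_selection import RFE  # recursive feature elimination; not used in the cells shown
from sklearn.linear_model import LogisticRegression

# Create an untrained logistic regression classifier with default settings
logreg = LogisticRegression()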

In [31]:
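import statsmodels.api as sm
# sm.Logit needs a numeric target; Y holds species names (dtype object), which causes the ValueError below
logit_model=sm.Logit(Y,(np.asarray(X)))
result=logit_model.fit()
print(result.summary2())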


ValueError: Pandas data cast to numpy dtype of object. Check input data with np.asarray(data).
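
The error above appears to come from the target column: `Y` contains species names as text, which cannot be cast to numbers. One possible remedy is to encode the labels as integer codes first. The sketch below is an assumption-laden illustration rather than a fix of the statsmodels cell: it rebuilds `X` and `Y` from scikit-learn's bundled iris data, uses `pd.factorize` for the encoding, and fits scikit-learn's `LogisticRegression`. `sm.Logit` models a binary outcome, so a three-class target would also need a multinomial model there.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
# Species names as text, mimicking the 'target' column of the CSV
Y = pd.Series(iris.target_names[iris.target], name="target")

# Encode each species name as an integer code, in order of first appearance
codes, species = pd.factorize(Y)
print(dict(enumerate(species)))

# The numeric codes can now be used as the target
model = LogisticRegression(max_iter=1000)
model.fit(X, codes)

# Accuracy on the training data itself (an optimistic measure)
accuracy = model.score(X, codes)
print(f"Training accuracy: {accuracy:.3f}")
```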

In [22]:
Y

0         Iris-setosa
1         Iris-setosa
2         Iris-setosa
3         Iris-setosa
4         Iris-setosa
            ...      
145    Iris-virginica
146    Iris-virginica
147    Iris-virginica
148    Iris-virginica
149    Iris-virginica
Name: target, Length: 150, dtype: object

In [24]:
X

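|     | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|-----|-------------------|------------------|-------------------|------------------|
| 0   | 5.1 | 3.5 | 1.4 | 0.2 |
| 1   | 4.9 | 3.0 | 1.4 | 0.2 |
| 2   | 4.7 | 3.2 | 1.3 | 0.2 |
| 3   | 4.6 | 3.1 | 1.5 | 0.2 |
| 4   | 5.0 | 3.6 | 1.4 | 0.2 |
| ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 |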


# References
0. https://en.wikipedia.org/wiki/Iris_flower_data_set
1. https://data-flair.training/blogs/iris-flower-classification/
2. https://www.geeksforgeeks.org/regression-classification-supervised-machine-learning/
3. https://www.sciencedirect.com/topics/computer-science/supervised-learning
4. https://scikit-learn.org/stable/auto_examples/classification/index.html
5. https://www.datacamp.com/blog/classification-machine-learning
6. https://scikit-learn.org/stable/index.html
7. https://machinelearningmastery.com/a-gentle-introduction-to-scikit-learn-a-python-machine-learning-library/
8. https://www.geeksforgeeks.org/data-science-vs-machine-learning/
9. https://towardsdatascience.com/the-actual-difference-between-statistics-and-machine-learning-64b49f07ea3
10. https://towardsdatascience.com/modelling-and-simulations-in-data-science-b3f546a953d1
11. https://en.wikipedia.org/wiki/Logistic_regression
12. Dytham, C. (2011). Choosing and Using Statistics: A Biologist's Guide, 3rd Edition.