# Exploring classification algorithms applied to the iris flower data set

2023/24

By Trish O'Grady

***

# Introduction

## What is Supervised Learning?

Supervised learning is a type of machine learning that uses labeled training data to teach an algorithm how to make predictions on its own. It is called "supervised" because it includes a training phase in which the algorithm is given a dataset containing both the input data and the corresponding output, or target, values.[1] The algorithm then learns, by generalizing from the training data, how to map input data to the desired output.
There are two types of supervised learning. The first is classification, where the objective is to assign input data to one of a number of predefined classes or labels.[2] For example, the iris dataset contains three flower classes (Setosa, Versicolor, and Virginica), and each flower is described by four features: sepal length, sepal width, petal length, and petal width. Classifying iris flowers means predicting a flower's species from these measurements. The second type is regression, where the objective is to predict a numerical value or quantity from the input data, for example predicting the value of a house based on variables such as its location, whether it is semi-detached, and whether it has a private garden.[2]

The key components of supervised learning are as follows:

- Input data or features: These are the parameters or variables that define the input. In a classification problem, they could be, for example, the measurements of an object. In a regression problem, they might be the variables that influence a numerical result.[3]
- Output data or labels: The output data represents the predictions that the algorithm is attempting to make. Labeled data is training data that has already been associated with the desired result. For instance, the output in a classification problem might be a category label, while in a regression problem it might be a number.[3]
- Training data: The input-output pairs make up the training dataset, which is used to train the machine learning model. The algorithm uses it to discover patterns and connections between the input features and the related output.[3]
- Model learning: Using the training data, the machine learning model learns the underlying patterns, relationships, and rules that map input data to output data. The objective is to develop a model that generalizes well and makes accurate predictions about new data.[3]
- Prediction: Once the model has been trained, it can be used to make predictions on new data. The model takes the input data, analyses it, and generates a prediction as its output.[3]
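
The components listed above can be sketched in a few lines of scikit-learn. This is only a minimal illustration, assuming the built-in copy of the iris data and a logistic regression model; the 30% hold-out size, `random_state=42`, and `max_iter=1000` are arbitrary choices made for this example and are not taken from the text.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Input data (features) and output data (labels)
iris = load_iris()
X, y = iris.data, iris.target  # X: 150 x 4 measurements, y: 150 class labels (0, 1, 2)

# Training data: keep 70% of the input-output pairs for training,
# and hold back 30% that the model never sees while learning (assumed split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Model learning: fit the model to the training pairs
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Prediction: ask the trained model to label the held-back flowers
predictions = model.predict(X_test)
print(iris.target_names[predictions[:5]])
```

Holding back part of the data is one common way to check whether the model generalizes to flowers it has not seen during training.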

Overall, supervised learning is a fundamental and effective machine learning approach that serves as the foundation for numerous practical applications.

# What are classification algorithms?

Classification algorithms are a subset of supervised learning algorithms that machine learning applications use to divide data into discrete classes or categories based on input features. They are designed to recognize and learn patterns in data, which can then be used to assign new data points to predetermined groups.[4]
Logistic regression is typically used for binary classification problems, where the output has only two possible classes. The logistic function is used to model the probability that an input belongs to a given class.[4]
Decision trees make decisions by repeatedly splitting the data into subsets according to the most informative feature. They can be used for both binary and multi-class classification.[5]
Random Forest is an ensemble learning technique that combines multiple decision trees to increase accuracy and reduce overfitting. It works well for both classification and regression tasks.[5]
The choice of classification algorithm depends on the kind of problem, the type of data, and the specific requirements. It is common practice to try several algorithms to see which one performs best on the task.[4]
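
As a rough illustration of trying several algorithms on the same task, the sketch below compares the three classifiers mentioned above on the built-in iris data using 5-fold cross-validation. The hyperparameters (`max_iter=1000`, `n_estimators=100`, `random_state=0`) and the number of folds are assumptions made for this example only.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Candidate classifiers (settings are illustrative, not tuned)
candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean accuracy over 5 cross-validation folds for each candidate
scores = {}
for name, clf in candidates.items():
    fold_scores = cross_val_score(clf, X, y, cv=5)
    scores[name] = fold_scores.mean()
    print(f"{name}: mean accuracy {scores[name]:.3f}")
```

On a data set this small the differences between the models may be slight, so the comparison is best read as a sanity check rather than a ranking.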

# An overview of the scikit-learn Python library

Scikit-learn is a free machine learning library for Python. It provides tools for data analysis and modelling, offers a variety of methods for classification, regression, clustering, and dimensionality reduction, and supports both supervised and unsupervised machine learning.[6]

The basic function of classification in machine learning is to assign a label or category to an input based on its features.
Scikit-learn provides a number of classification algorithms, including Logistic Regression, Decision Trees and Random Forests; decision trees and random forests can also be used for regression problems. It also provides Principal Component Analysis (PCA), which is a dimensionality reduction technique rather than a classifier and can be applied to the features before classification.[7]

Before choosing a classification algorithm, the data needs to be prepared and labelled so that each sample has a known class. After choosing a suitable algorithm such as Logistic Regression, the model is trained so that it can find patterns within the dataset. Metrics are then applied to evaluate its performance. Once the model is trained and evaluated, it can be used to make predictions.[7]
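
The workflow described above (prepare, train, evaluate, predict) might look something like the following sketch. It assumes the built-in iris data; the feature scaling step, the 25% test split, and the choice of accuracy and a confusion matrix as metrics are illustrative assumptions rather than steps prescribed by the text.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Prepare: load labelled data and set aside a test set (assumed 25%)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Train: scale the features, then fit a logistic regression classifier
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

# Evaluate: compare predictions on the test set with the true labels
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
matrix = confusion_matrix(y_test, y_pred)

print(f"accuracy: {accuracy:.3f}")
print(matrix)  # rows: true class, columns: predicted class
```

The confusion matrix shows which species, if any, are being mistaken for one another, which a single accuracy figure cannot reveal.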

Like classification, regression is a supervised machine learning task, but it aims to predict a continuous numeric output variable from the input features. It is used for estimating values or modelling relationships between variables. It follows the same workflow: the data is prepared, an algorithm is chosen, the model is trained, and metrics are applied before making predictions.[6]
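
To show the same workflow for regression, the sketch below predicts a continuous value instead of a class. Using petal width as the target and the other three iris measurements as inputs is purely an assumption made for illustration, as are the linear model, the 25% test split, and the two metrics.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Illustrative setup: predict petal width (cm) from the other three measurements
X_all, _ = load_iris(return_X_y=True)
features = X_all[:, :3]    # sepal length, sepal width, petal length
petal_width = X_all[:, 3]  # continuous target

X_train, X_test, y_train, y_test = train_test_split(
    features, petal_width, test_size=0.25, random_state=0
)

# Train a linear regression model and predict on the held-back rows
reg = LinearRegression()
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)

# Regression metrics: mean squared error and coefficient of determination
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.3f}, R^2: {r2:.3f}")
```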

In conclusion, scikit-learn is a flexible and popular Python library for data analysis and machine learning. It is a key part of the toolset for data scientists and machine learning practitioners, since it offers a comprehensive set of tools for developing, assessing, and deploying machine learning models.[6]

In [1]:
import pandas as pd
import sklearn


ModuleNotFoundError: No module named 'sklearn'

In [1]:
!pip install scikit-learn

Collecting scikit-learn
  Downloading scikit_learn-1.3.2-cp310-cp310-win_amd64.whl (9.3 MB)
     ---------------------------------------- 9.3/9.3 MB 982.9 kB/s eta 0:00:00
Collecting joblib>=1.1.1
  Downloading joblib-1.3.2-py3-none-any.whl (302 kB)
     ---------------------------------------- 302.2/302.2 kB 1.2 MB/s eta 0:00:00
Collecting threadpoolctl>=2.0.0
  Downloading threadpoolctl-3.2.0-py3-none-any.whl (15 kB)
Installing collected packages: threadpoolctl, joblib, scikit-learn
Successfully installed joblib-1.3.2 scikit-learn-1.3.2 threadpoolctl-3.2.0





In [5]:
import sklearn
import pandas as pd

In [6]:
from sklearn import datasets

iris = datasets.load_iris()
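
The object returned by `load_iris()` can be inspected, and converted to a pandas DataFrame, along the following lines. This is a small sketch; the variable name `frame` and the `target` column name are chosen here for illustration, and note that scikit-learn's class names are `setosa`, `versicolor` and `virginica`, without the `Iris-` prefix that appears in the CSV file.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# The four measurement names and the three species names
print(iris.feature_names)
print(iris.target_names)

# Build a DataFrame with one row per flower and a readable species column
frame = pd.DataFrame(iris.data, columns=iris.feature_names)
frame["target"] = iris.target_names[iris.target]
print(frame.shape)
```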

In [9]:
csv_path = r"C:\Users\Trish\repo\MachineLearningandStatistics\Machine_Learning_And_Statistics\iris_dataset.csv"

In [10]:
df=pd.read_csv(csv_path)
df

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica


# Logistic Regression: the classification algorithm implemented using scikit-learn

In its most basic form, logistic regression is a statistical model that uses a logistic function to model a binary dependent variable.[8]
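
The logistic function itself is short enough to write out directly. The sketch below uses NumPy; the function name `logistic` and the sample input values are chosen for illustration only.

```python
import numpy as np

def logistic(z):
    """Logistic (sigmoid) function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Large negative inputs map close to 0, zero maps to 0.5, large positive inputs close to 1
sample_inputs = np.array([-4.0, 0.0, 4.0])
probabilities = logistic(sample_inputs)
print(probabilities)
```

Because the output always lies between 0 and 1, it can be read as the probability of belonging to one of the two classes.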

In statistics, the logistic model is used to model the probability of a specific class or event, such as pass/fail or win/lose. It can be extended to more than two classes, for example to determine whether an image contains a cat, a dog, or another animal. Each class would then be assigned a probability between 0 and 1, with the probabilities summing to one.[9]
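
The per-class probabilities can be seen with scikit-learn's `predict_proba` method. The sketch below assumes the built-in iris data rather than images, and for brevity fits the model on the whole data set, which would not be done when evaluating a model.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# One row per flower, one column per class (setosa, versicolor, virginica)
probs = model.predict_proba(X[:3])
print(probs.round(3))

# Each row of probabilities sums to one
print(probs.sum(axis=1))
```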

To use it in scikit-learn, a variable is first declared to hold the model, and a logistic regression model is assigned to it. The model is then trained on the training data using the model.fit function. During training it finds the relationship between the input variables and the target variable; in a heart-disease data set, for example, it would relate variables such as sex and cholesterol to whether someone has a heart defect. The trained model is then used to predict new outcomes.[8]
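
These steps could be sketched as follows. Since the heart data mentioned above is not part of this notebook, the example assumes the built-in iris data instead; the single measurement passed to `predict` is simply a typical setosa-like flower used for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()

# Declare a variable and assign a logistic regression model to it
model = LogisticRegression(max_iter=1000)

# Train the model: learn how the four measurements relate to the species
model.fit(iris.data, iris.target)

# Predict the species of one flower:
# sepal length, sepal width, petal length, petal width (all in cm)
new_flower = [[5.1, 3.5, 1.4, 0.2]]
predicted = iris.target_names[model.predict(new_flower)[0]]
print(predicted)

# One row of learned coefficients per class, one column per feature
print(model.coef_.shape)
```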

In [3]:
# 'tableconvert_csv_o6an2r.csv' is not in the working directory,
# so read the iris CSV from csv_path defined earlier instead
df = pd.read_csv(csv_path)

print(df)

# References
1. https://data-flair.training/blogs/iris-flower-classification/
2. https://www.geeksforgeeks.org/regression-classification-supervised-machine-learning/
3. https://www.sciencedirect.com/topics/computer-science/supervised-learning
4. https://scikit-learn.org/stable/auto_examples/classification/index.html
5. https://www.datacamp.com/blog/classification-machine-learning
6. https://scikit-learn.org/stable/index.html
7. https://machinelearningmastery.com/a-gentle-introduction-to-scikit-learn-a-python-machine-learning-library/
8. https://en.wikipedia.org/wiki/Logistic_regression
9. Dytham, C. (2011). Choosing and Using Statistics: A Biologist's Guide, 3rd Edition.