# Project Two: Predicting Iris Flowers

<div>
<img src="../images/flower.jpg" alt="flower image" width="20%"/>
</div>

## Introduction
---

Flowers play an essential role in our lives, from expressing emotions to contributing to environmental sustainability. Among them, the Iris flower holds a special place, not just in nature but also in the world of data science. The Iris dataset is one of the best-known datasets in the field, first introduced in R.A. Fisher's 1936 paper, <i>The Use of Multiple Measurements in Taxonomic Problems</i>. It has since become a fundamental resource in machine learning and statistical classification, is widely used for testing algorithms, and can be accessed through the UCI Machine Learning Repository.

We'll be using the [iris dataset](https://www.kaggle.com/datasets/uciml/iris) to predict flower species and evaluate how accurate those predictions are. I also want to explore other aspects of the dataset, such as which flowers have the largest petals and sepals and what that tells us about each species. The questions guiding this project are:
- Which classification model performs best in predicting flower species?
- How do petal and sepal sizes vary among different species?
- Can we use petal and sepal measurements to group flowers using clustering methods?

The dataset itself is straightforward: 150 rows, each with four measurement features, an Id column, and a Species label. The three species are evenly represented (50 samples per class). Here are all the columns in detail:
<table>
    <tr>
        <td><strong>Feature</strong></td>
        <td><strong>Description</strong></td>
    </tr>
    <tr>
        <td>Id</td>
        <td>Row Number</td>
    </tr>
    <tr>
        <td>SepalLengthCm</td>
        <td>Length of the sepal, in centimeters</td>
    </tr>
    <tr>
        <td>SepalWidthCm</td>
        <td>Width of the sepal, in centimeters</td>
    </tr>
    <tr>
        <td>PetalLengthCm</td>
        <td>Length of the petal, in centimeters</td>
    </tr>
    <tr>
        <td>PetalWidthCm</td>
        <td>Width of the petal, in centimeters</td>
    </tr>
    <tr>
        <td>Species</td>
        <td>Species of the iris: Iris-setosa, Iris-versicolor, or Iris-virginica</td>
    </tr>
</table>
As a little refresher, the sepal is the leaf-like part that encloses and protects the flower bud, and the petal is the modified leaf that surrounds the reproductive parts of a flower and usually contains the color.

In [9]:
#Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn as sk

In [10]:
#dataset
df = pd.read_csv('iris.csv')
df.head()

   Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
0   1            5.1           3.5            1.4           0.2  Iris-setosa
1   2            4.9           3.0            1.4           0.2  Iris-setosa
2   3            4.7           3.2            1.3           0.2  Iris-setosa
3   4            4.6           3.1            1.5           0.2  Iris-setosa
4   5            5.0           3.6            1.4           0.2  Iris-setosa


## Pre-processing
---

Because this dataset is so widely used for machine learning, and is small with few features, it is likely already clean. Still, let's make sure the Kaggle copy has no import errors and wasn't uploaded incorrectly.

### Checking for nulls

In [5]:
df.isnull().sum()

Id               0
SepalLengthCm    0
SepalWidthCm     0
PetalLengthCm    0
PetalWidthCm     0
Species          0
dtype: int64

There are no nulls in any column, so nothing needs to be dropped or filled in.

### Duplicates

In [6]:
print(df.duplicated().value_counts())

False    150
dtype: int64


No duplicates either, which is nice. Let's move on to more interesting steps.
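A caveat on the duplicate check: `df.duplicated()` compares whole rows, and the `Id` column is different on every row, so two flowers with identical measurements would never be flagged. The sketch below shows the difference on a small hypothetical frame (the values are made up for illustration and are not read from `iris.csv`); running the same `drop(columns="Id")` check on `df` would show whether the real data has repeated measurements.

```python
import pandas as pd

# Hypothetical three-row frame, NOT taken from iris.csv: rows with Id 2 and 3
# have identical measurements and species but different Id values.
sample = pd.DataFrame({
    "Id": [1, 2, 3],
    "SepalLengthCm": [5.1, 4.9, 4.9],
    "SepalWidthCm": [3.5, 3.1, 3.1],
    "PetalLengthCm": [1.4, 1.5, 1.5],
    "PetalWidthCm": [0.2, 0.1, 0.1],
    "Species": ["Iris-setosa"] * 3,
})

# With Id included, every row is unique by construction.
dupes_with_id = int(sample.duplicated().sum())

# Without Id, the repeated measurements show up as a duplicate.
dupes_without_id = int(sample.drop(columns="Id").duplicated().sum())

print(dupes_with_id, dupes_without_id)
```

Repeated measurements are not necessarily errors, since two different flowers can share the same sizes, so whether to keep them is a judgment call.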

### Checking for unusual types

In [7]:
df.dtypes

Id                 int64
SepalLengthCm    float64
SepalWidthCm     float64
PetalLengthCm    float64
PetalWidthCm     float64
Species           object
dtype: object

The column types are what we'd expect too, which suggests the dataset is in good shape, at least for now.

## Modeling
---

Since this is a classification problem, we'll use k-Nearest Neighbors (KNN), which is a reasonable choice for a small dataset like this.
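To make the idea behind KNN concrete, here is a minimal from-scratch sketch using only the standard library. It is not the model this project ends up using (scikit-learn's `KNeighborsClassifier` would be the more practical option); it only shows the mechanics: measure the distance from a new flower to every labeled flower, take the `k` closest, and let them vote. The function name `knn_predict`, the choice of `k=3`, and the training rows are assumptions made for this illustration. The setosa rows match the `df.head()` output, while the versicolor and virginica rows are illustrative values rather than rows read from `iris.csv`.

```python
from collections import Counter
from math import dist

# (sepal length, sepal width, petal length, petal width), all in cm.
# Setosa rows come from the df.head() output; the versicolor and virginica
# rows are illustrative values for this sketch, not read from iris.csv.
train = [
    ((5.1, 3.5, 1.4, 0.2), "Iris-setosa"),
    ((4.9, 3.0, 1.4, 0.2), "Iris-setosa"),
    ((4.7, 3.2, 1.3, 0.2), "Iris-setosa"),
    ((6.4, 3.2, 4.5, 1.5), "Iris-versicolor"),
    ((6.9, 3.1, 4.9, 1.5), "Iris-versicolor"),
    ((5.5, 2.3, 4.0, 1.3), "Iris-versicolor"),
    ((6.3, 3.3, 6.0, 2.5), "Iris-virginica"),
    ((7.1, 3.0, 5.9, 2.1), "Iris-virginica"),
    ((6.5, 3.0, 5.8, 2.2), "Iris-virginica"),
]


def knn_predict(sample, training_rows, k=3):
    """Return the majority label among the k training rows closest to sample."""
    # Sort labeled rows by Euclidean distance to the sample and keep the k nearest.
    nearest = sorted(training_rows, key=lambda row: dist(sample, row[0]))[:k]
    # Each neighbor casts one vote for its own label.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Classify the fifth row shown by df.head().
prediction = knn_predict((5.0, 3.6, 1.4, 0.2), train)
print(prediction)
```

Because KNN relies on raw distances, features with larger ranges influence the vote more. All four iris measurements are in centimeters on similar scales, which is part of why KNN tends to suit this dataset.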

## Visualization
---

## Evaluation
---

## Impact & Implications
---

## Storytelling
---

## References
---

- [Iris Dataset](https://www.kaggle.com/datasets/uciml/iris)
- [Introduction facts](https://www.geeksforgeeks.org/iris-dataset/)