### Project Title: Iris Flower Classification
#### Done By: Nozipho Sithembiso Ndebele
---

<div style="text-align: center;">
<img src="https://miro.medium.com/v2/resize:fit:1200/0*KQboQDi8ywWIryIP.jpg" alt="Iris Image" width="1000"/>
</div>

---

## Table of Contents

<a href=#BC> Background Context</a>

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Data Collection and Description</a>

<a href=#three>3. Loading Data </a>

<a href=#four>4. Data Cleaning and Filtering</a>

<a href=#five>5. Exploratory Data Analysis (EDA)</a>

<a href=#six>6. Modeling </a>

<a href=#seven>7. Evaluation and Validation</a>

<a href=#eight>8. Final Model</a>

<a href=#nine>9. Conclusion and Future Work</a>

<a href=#ten>10. References</a>

---
 <a id="BC"></a>
## **Background Context**

The Iris flower dataset is a classic multivariate dataset introduced by Ronald A. Fisher in 1936. It has been widely used in statistical pattern recognition and machine learning as a benchmark for classification problems. The data was originally collected by Edgar Anderson to study the morphological variation in Iris flowers of three different species.

### Purpose
The aim of this project is to build a classification model that can accurately identify the species of an Iris flower based on its sepal and petal dimensions. This provides a practical example of supervised machine learning, specifically multi-class classification.

### Significance
The Iris dataset is often used for:

- Demonstrating the application of classification algorithms (e.g., Logistic Regression, KNN, SVM, Decision Trees, Random Forest, etc.)

- Teaching exploratory data analysis (EDA) and data visualization.

- Exploring feature importance and decision boundaries in classification.

- Benchmarking new machine learning models and pipelines.

It is an ideal dataset due to its simplicity, balanced classes, and non-trivial classification challenge.


### Problem Domain
The project focuses on predicting the species of an Iris flower based on measurable characteristics. The three species—Setosa, Versicolor, and Virginica—exhibit unique patterns in petal and sepal dimensions.

We aim to build a machine learning model that can:

- Accurately classify a flower into one of the three species.

- Understand the relationship between features and the target class.

- Evaluate the performance of different classification models.

### Challenges
- Class Overlap: Setosa is linearly separable from the other two species, whereas Versicolor and Virginica overlap in feature space.

- Feature Selection: Identifying which features contribute most to model accuracy.

- Model Evaluation: Ensuring that metrics like accuracy, precision, recall, and confusion matrix are considered for multi-class classification.

- Bias and Overfitting: Avoiding overfitting on a small dataset by favoring simple models.
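
The class-overlap point can be checked directly from the petal measurements. The following is a minimal sketch, assuming scikit-learn's bundled copy of the dataset (`load_iris`) rather than the notebook's `IRIS.csv`; the 2.5 cm threshold is an illustrative choice, not a value taken from this project.

```python
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data stands in for IRIS.csv
iris = load_iris(as_frame=True)
df_iris = iris.frame.copy()
df_iris["species"] = df_iris["target"].map(lambda i: str(iris.target_names[i]))

# Range of petal length per species
petal = df_iris.groupby("species")["petal length (cm)"].agg(["min", "max"])
print(petal)

# Setosa: a single threshold on petal length isolates it (2.5 cm is illustrative)
below_threshold = df_iris["petal length (cm)"] < 2.5
is_setosa = df_iris["species"] == "setosa"
setosa_separable = bool((below_threshold == is_setosa).all())

# Versicolor and Virginica: their petal-length ranges overlap
ranges_overlap = bool(petal.loc["versicolor", "max"] > petal.loc["virginica", "min"])

print("Setosa separable by threshold:", setosa_separable)
print("Versicolor/Virginica ranges overlap:", ranges_overlap)
```

A single feature is enough to isolate Setosa here, which is why the harder part of the problem is telling Versicolor and Virginica apart.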

### Key Questions
- Which features are most important in distinguishing Iris species?

- How well can we classify the flowers using basic machine learning models?

- Which algorithm performs best on this dataset?

- Can we visualize the decision boundaries for interpretability?
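
One way to approach the "which algorithm performs best" question is to compare a few baseline classifiers under the same cross-validation splits. The sketch below is an assumption about how such a comparison might look, not the notebook's actual modeling code: it uses scikit-learn's bundled Iris data, and the three models, their hyperparameters, and the 5-fold setup are all illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Assumption: bundled data instead of IRIS.csv
X, y = load_iris(return_X_y=True)

# Illustrative baseline models; scaling is applied inside the pipeline
# so that it is fitted on the training folds only
models = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "decision_tree": DecisionTreeClassifier(random_state=42),
}

# Stratified folds keep the class balance in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
    for name, model in models.items()
}

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: mean accuracy = {score:.3f}")
```

With only 150 samples, differences of a percentage point or two between models are within the noise of the splits, so the ranking should be read cautiously.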

---
<a id="one"></a>
## **Importing Packages**

### Purpose
To set up the Python environment with the necessary libraries for data manipulation, visualization, and machine learning. These libraries will facilitate data preprocessing, feature extraction, model training, and evaluation.

### Details
* Pandas: For handling and analyzing data.

* NumPy: For numerical operations.

* Matplotlib/Seaborn: For data visualization to understand trends and patterns.

* scikit-learn: For building and evaluating machine learning models.

* Plotly: For interactive plots.

---

In [1]:
# Import necessary packages

# Data manipulation and analysis
import pandas as pd  # Pandas for data handling
import numpy as np  # NumPy for numerical operations

# Data visualization
import matplotlib.pyplot as plt  # Matplotlib for static plots
import seaborn as sns  # Seaborn for statistical visualization
import plotly.express as px  # Plotly for interactive plots

# Configure visualization settings
sns.set(style='whitegrid')  # Set the default style for Seaborn plots
plt.rcParams['figure.figsize'] = (10, 6)  # Set default figure size for Matplotlib

# Suppress warnings
import warnings  # Import the warnings module
warnings.filterwarnings('ignore')  # Ignore all warning messages




---
<a id="two"></a>
## **Data Collection and Description**
### Purpose
Understanding the dataset structure and feature relationships is essential for performing meaningful classification and evaluation.

### Details
The dataset consists of 150 observations, with 50 samples from each of the three Iris species.

- Source: UCI Machine Learning Repository

- Format: CSV file with 150 rows and 5 columns

- Class Distribution: Balanced (50 samples per species)

### Types of Data
- `sepal_length`: Length of the sepal (in cm)
- `sepal_width`: Width of the sepal (in cm)
- `petal_length`: Length of the petal (in cm)
- `petal_width`: Width of the petal (in cm)
- `species`: Species of the flower (Setosa, Versicolor, or Virginica)
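
The column description above can be turned into a small schema check that runs right after loading. This is a minimal sketch: the helper name `validate_schema` is hypothetical, and the three-row frame is a tiny illustrative sample (one row per species), not the full dataset.

```python
import pandas as pd

FEATURE_COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
TARGET_COLUMN = "species"


def validate_schema(df):
    """Hypothetical helper: check column names and dtypes, return class counts."""
    expected = FEATURE_COLUMNS + [TARGET_COLUMN]
    missing = [col for col in expected if col not in df.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")

    non_numeric = [
        col for col in FEATURE_COLUMNS
        if not pd.api.types.is_numeric_dtype(df[col])
    ]
    if non_numeric:
        raise TypeError(f"Feature columns are not numeric: {non_numeric}")

    # Class distribution, e.g. to confirm the classes are balanced
    return df[TARGET_COLUMN].value_counts().to_dict()


# Illustrative sample only; species labels assume the 'Iris-<name>' pattern
sample = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3],
    "sepal_width": [3.5, 3.2, 3.3],
    "petal_length": [1.4, 4.7, 6.0],
    "petal_width": [0.2, 1.4, 2.5],
    "species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

class_counts = validate_schema(sample)
print(class_counts)
```

On the full dataset, the same call would be expected to report 50 rows per species if the file matches the description above.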



---
<a id="three"></a>
## **Loading Data**
### Purpose
The purpose of this section is to load the dataset into the notebook for further manipulation and analysis. This is the first step in working with the data, and it allows us to inspect the raw data and get a sense of its structure.

### Details
In this section, we will load the dataset into a Pandas DataFrame and display the first few rows to understand what the raw data looks like. This will help in planning the next steps of data cleaning and analysis.
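
As a small self-contained illustration of the loading step, the sketch below reads CSV text from memory instead of a file so that it runs anywhere. The five rows are the first rows of the dataset; passing explicit dtypes is an optional, assumed refinement rather than something the notebook does.

```python
import io

import pandas as pd

# In-memory stand-in for the CSV file (first five rows of the dataset)
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
"""

feature_columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# Explicit dtypes make read_csv fail loudly if a measurement is not numeric
df_demo = pd.read_csv(
    io.StringIO(csv_text),
    dtype={col: "float64" for col in feature_columns},
)

print(df_demo.head())
print(df_demo.shape)
```

With a real file, the `io.StringIO(...)` argument would simply be replaced by the file path.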


---

In [2]:
# Load the dataset into a Pandas DataFrame

# The dataset is stored in a CSV file named 'IRIS.csv'
df = pd.read_csv('IRIS.csv')

In [3]:
# df is the original dataset (DataFrame), this creates a copy of it
df_copy = df.copy()

# Now 'df_copy' is an independent copy of 'df'. Changes to 'df_copy' won't affect 'df'.


In [4]:
# Display the first few rows of the dataset to get a sense of what the raw data looks like
df_copy.head()

|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


In [5]:
# Display the number of rows and columns in the dataset to understand its size
df_copy.shape

(150, 5)

In [6]:
# Check the structure of the dataset
df_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


---
<a id="four"></a>
## **Data Cleaning and Filtering**
### Purpose
Before analyzing the data, it is crucial to clean and filter it. This process involves handling missing values, removing outliers, correcting errors, and possibly reducing the data by filtering out irrelevant features. These steps ensure that the analysis is based on accurate and reliable data.

### Details
In this section, we will:

* Check for Missing Values: Identify if there are any missing values in the dataset and handle them accordingly.
* Remove Duplicates: Ensure there are no duplicate rows that could skew the analysis.
* Correct Errors: Look for and correct any obvious data entry errors.
* Filter Data: Depending on the analysis requirements, filter the data to include only relevant records.
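
The "Correct Errors" step can be sketched as a pair of sanity checks: measurements should be positive, and species labels should come from a known set. The helper name `find_suspect_rows` is hypothetical, the `Iris-<name>` label set is assumed from the label pattern in the data, and the demo frame contains deliberately corrupted values for illustration only.

```python
import pandas as pd

NUMERIC_COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
# Assumption: labels follow the 'Iris-<name>' pattern
VALID_SPECIES = {"Iris-setosa", "Iris-versicolor", "Iris-virginica"}


def find_suspect_rows(df):
    """Hypothetical helper: return rows with non-positive measurements or unknown labels."""
    non_positive = (df[NUMERIC_COLUMNS] <= 0).any(axis=1)
    unknown_label = ~df["species"].isin(VALID_SPECIES)
    return df[non_positive | unknown_label]


# Synthetic demo: row 1 has a negative width, row 2 has a misspelled label
demo = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0],
    "sepal_width": [3.5, -3.0, 3.2],
    "petal_length": [1.4, 1.4, 4.7],
    "petal_width": [0.2, 0.2, 1.4],
    "species": ["Iris-setosa", "Iris-setosa", "iris versicolor"],
})

suspect = find_suspect_rows(demo)
print(suspect)
```

Rows flagged this way still need a manual decision (fix, drop, or keep); the check only surfaces candidates.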

In [7]:
# 1. Check for missing values in the dataset

def check_missing_values(df):
    """
    Check for missing values in the dataset and display the number of missing values per column.

    Parameters:
    df (pandas.DataFrame): The dataset to check for missing values.

    Returns:
    pandas.Series: A series showing the number of missing values for each column.
    """
    # Check for missing values in the dataset and display them
    print("Missing values per column:")
    missing_values = df.isnull().sum()
    print(missing_values)
    return missing_values


In [8]:
# Run the missing-value check on the working copy of the dataset
missing_values = check_missing_values(df_copy)


Missing values per column:
sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64


After examining the dataset, no missing values were found across any of the columns. This ensures data completeness and eliminates the need for imputation or further cleaning related to missing data.


In [9]:
def remove_duplicates(df):
    """
    Checks for duplicate rows in the dataset and removes them if any are found.

    Args:
    df (pandas.DataFrame): The dataframe to check for duplicate rows.

    Returns:
    pandas.DataFrame: The dataframe with duplicate rows removed, if any existed.
    """
    # Check for duplicate rows
    duplicate_rows = df.duplicated().sum()
    print(f"\nNumber of duplicate rows: {duplicate_rows}")
    
    # Remove duplicates if any exist
    if duplicate_rows > 0:
        df.drop_duplicates(inplace=True)
        print(f"Duplicate rows removed. Updated dataframe has {len(df)} rows.")
    else:
        print("No duplicate rows found.")
    
    return df

In [10]:
df_copy = remove_duplicates(df_copy)


Number of duplicate rows: 3
Duplicate rows removed. Updated dataframe has 147 rows.


Upon reviewing the dataset, 3 duplicate rows were found and removed, leaving 147 unique rows. No further deduplication is required.
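
Before dropping duplicates, it can help to look at which rows are affected, since two different flowers may legitimately share identical measurements. The sketch below is illustrative and uses a small hand-built frame rather than the notebook's data; it also shows a non-mutating alternative to `inplace=True`.

```python
import pandas as pd

# Illustrative rows only; rows 1 and 2 are exact duplicates of each other
demo = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.9, 5.8],
    "sepal_width": [3.5, 3.1, 3.1, 2.7],
    "petal_length": [1.4, 1.5, 1.5, 5.1],
    "petal_width": [0.2, 0.1, 0.1, 1.9],
    "species": ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-virginica"],
})

# keep=False marks every member of a duplicate group, not just the repeats
duplicate_groups = demo[demo.duplicated(keep=False)]
print(duplicate_groups)

# Non-mutating removal: the original frame is left untouched
deduped = demo.drop_duplicates().reset_index(drop=True)
print(f"{len(demo)} rows before, {len(deduped)} rows after")
```

Whether duplicates should be removed at all is a judgment call for measurement data; inspecting them first makes that decision explicit.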


## **Saving the Cleaned Dataset**
### Purpose

This section outlines how to save the cleaned dataset for future use. Saving the dataset ensures that the data cleaning process does not need to be repeated and allows for consistent use in subsequent analyses.

### Details

We will save the cleaned dataset as a CSV file.

In [11]:
# Save the cleaned dataset to a new CSV file

def save_cleaned_dataset(df, filename='cleaned_IRIS.csv'):
    """
    Saves the cleaned dataframe to a CSV file.

    Args:
    df (pandas.DataFrame): The cleaned dataframe to save.
    filename (str): The name of the file to save the dataframe to (default is 'cleaned_IRIS.csv').

    Returns:
    None
    """
    # Save the cleaned dataset to a CSV file
    df.to_csv(filename, index=False)
    print(f"Cleaned dataset saved successfully as '{filename}'.")


In [12]:
save_cleaned_dataset(df_copy)


Cleaned dataset saved successfully as 'cleaned_IRIS.csv'.
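
A quick way to confirm that the saved file is usable is to read it back and compare it with the frame that was written. The following is a minimal sketch under stated assumptions: it writes a small illustrative frame to a temporary directory so that it does not touch the real `cleaned_IRIS.csv`.

```python
import os
import tempfile

import pandas as pd

# Illustrative frame standing in for the cleaned dataset
demo = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3],
    "sepal_width": [3.5, 3.2, 3.3],
    "petal_length": [1.4, 4.7, 6.0],
    "petal_width": [0.2, 1.4, 2.5],
    "species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleaned_IRIS.csv")
    demo.to_csv(path, index=False)  # index=False avoids an extra unnamed column
    reloaded = pd.read_csv(path)

# Raises AssertionError if columns, dtypes, or values differ beyond tolerance
pd.testing.assert_frame_equal(demo, reloaded)
print("Round trip OK:", reloaded.shape)
```

Saving with `index=False` matters here: without it, the reloaded file would gain an extra index column and no longer match the original shape.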


---
<a id="five"></a>
## **Exploratory Data Analysis (EDA)**

Exploratory data analysis (EDA) is the process of analyzing a dataset to summarize its key features, often through visualization. It aims to discover patterns, spot anomalies, and formulate hypotheses that inform subsequent analysis.
#### Advantages

- Helps in understanding the data before modeling.
- Provides insights that guide feature selection and engineering.
- Assists in choosing appropriate modeling techniques.
- Uncovers potential data quality issues early.
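
As a minimal sketch of the kind of summary that supports these points, the code below computes per-species feature means and the feature correlation matrix. It assumes scikit-learn's bundled copy of the dataset, whose column names (for example `petal length (cm)`) differ from those in `IRIS.csv`.

```python
from sklearn.datasets import load_iris

# Assumption: bundled data instead of the cleaned CSV
iris = load_iris(as_frame=True)
df_eda = iris.frame.copy()
df_eda["species"] = df_eda["target"].map(lambda i: str(iris.target_names[i]))
df_eda = df_eda.drop(columns="target")

# Mean of each measurement per species
summary = df_eda.groupby("species").mean().round(2)
print(summary)

# Pairwise correlation between the four numeric features
corr = df_eda.drop(columns="species").corr()
print(corr.round(2))
```

Large gaps between species means, together with strongly correlated feature pairs, are the sort of findings that guide feature selection before modeling.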

`The following methods were employed to communicate our objective:`



---
<a id="nine"></a>
## **Conclusion and Future Work**


##### Conclusion



##### Future Work

To build upon this study, future work could focus on the following areas:



---
<a id="ten"></a>
## **References**

## Additional Sections to Consider

**Contributors**: Nozipho Sithembiso Ndebele
