# Pands Project_Iris Data Set Analysis 

-----------------

**Author**: Alec Reid

**Student Number**: G00411003

**Date Commenced**: 03/01/2025

---------------

### Table of Contents

**1. Introduction**

- Project Overview
- Objectives
- Dataset Description

**2. Data Exploration**

- Imports
- Loading the Dataset
- Data Structure and Summary Statistics
- Class Distribution

**3. Data Visualization**

- Histograms and Boxplots
- Pairplots / Scatter Matrix
- Correlation Heatmap
- Class Separation Visuals (e.g., PCA or t-SNE)

**4. Data Preprocessing** 
- Missing Values Check
- Feature Scaling / Normalization
- Train-Test Split

**5. Conclusion**
- Summary of Findings
- Limitations
- Future Work

**6. References**
- Bibliography
- Where AI was used in this project

--------

**References (*Move to Bibliography at End of Project*)**

**Dataset Sources:**
- Dua, D., & Graff, C. (2017). Iris Data Set. UCI Machine Learning Repository. Retrieved from https://www.kaggle.com/datasets/uciml/iris
- Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2), 179–188. DOI: 10.1111/j.1469-1809.1936.tb02137.x

**Websites:**
- text

**Articles:**
- text

**Repositories:**
- text

**Where AI was used in this Project:**
- Table of Content Headings - OpenAI. (2025). ChatGPT (May 3 version) [Large language model]. https://chat.openai.com/
- Iris Data Set Description - OpenAI. (2025). ChatGPT (May 3 version) [Large language model]. https://chat.openai.com/

-----------

## 1. Introduction

### (1.1) Project Overview 
In this project we perform exploratory data analysis on the Iris dataset.

### (1.2) Project Objectives
We want to perform the following data analysis tasks on the Iris dataset in this project:
- Loading the Dataset
- Data Structure and Summary Statistics
- Class Distribution
- Histograms and Boxplots
- Pairplots / Scatter Matrix
- Correlation Heatmap
- Class Separation Visuals (e.g., PCA or t-SNE)
- Missing Values Check
- Feature Scaling / Normalization
- Train-Test Split

### (1.3) Dataset Description
The Iris dataset is one of the best-known and most widely used datasets in pattern recognition and machine learning. It was introduced by the British statistician Ronald A. Fisher in 1936 as an example of linear discriminant analysis. It contains three classes of 50 instances each, making it a balanced dataset.

![Iris flowers.png](<attachment:Iris flowers.png>)

**Overview:**
- Purpose: Classification of iris species based on flower measurements

- Instances (rows): 150

- Features (columns): 4 numerical features (all continuous)

- Classes (target variable): 3 species of Iris flowers

**Features (Attributes):**
Each row represents an iris flower sample with the following measurements:

- Sepal length (in cm)

- Sepal width (in cm)

- Petal length (in cm)

- Petal width (in cm)

**Classes (Target labels):** 
- Iris setosa

- Iris versicolor

- Iris virginica

**Applications in Data Analytics:**
- Classification algorithms (e.g., k-NN, SVM, Decision Trees)

- Data visualization (e.g., scatter plots, pair plots)

- Dimensionality reduction techniques (e.g., PCA, LDA)

- Teaching basic machine learning concepts

**The original citation for the Iris dataset is:**

Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2), 179–188.
DOI: 10.1111/j.1469-1809.1936.tb02137.x
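
As a minimal sketch, the structure described above (150 rows, four numeric features, three species) can be written down as a small check to run once the data is loaded. The column names and species labels are assumed to follow the Kaggle CSV used later in this notebook, and `check_iris_schema` is a hypothetical helper, not part of pandas.

```python
import pandas as pd

# Assumed column names and labels, matching the Kaggle CSV layout.
FEATURE_COLUMNS = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]
SPECIES = {"Iris-setosa", "Iris-versicolor", "Iris-virginica"}


def check_iris_schema(df: pd.DataFrame, expected_rows: int = 150) -> list:
    """Return a list of readable problems; an empty list means the frame matches."""
    problems = []
    if len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(df)}")
    for col in FEATURE_COLUMNS:
        if col not in df.columns:
            problems.append(f"missing feature column: {col}")
        elif not pd.api.types.is_numeric_dtype(df[col]):
            problems.append(f"column is not numeric: {col}")
    if "Species" not in df.columns:
        problems.append("missing target column: Species")
    else:
        unexpected = set(df["Species"].unique()) - SPECIES
        if unexpected:
            problems.append(f"unexpected species labels: {sorted(unexpected)}")
    return problems
```

Calling `check_iris_schema(IrisData)` after the loading step should return an empty list if the file matches the description.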

----------------

## 2. Data Exploration 

### (2.1) Imports




In [11]:
# Imports 

import pandas as pd # We are going to use this module for Data loading, Data Summarisation and Data Cleaning
import numpy as np # We are going to use this module for array manipulation 
import os # We are going to use this module to interact with the file system environment
import matplotlib.pyplot as plt # We are going to use this module for Data Visualisation 



### (2.2) Loading the Iris Dataset

In [12]:
# Get current working directory and path to iris dataset
cwd = os.getcwd()

print("Current working directory:", cwd)

# Name of the IrisDataset.csv file
filename = "IrisDataset.csv"  

# Get absolute path to the file
file_path = os.path.abspath(filename)

print("Full path to the file:", file_path)

Current working directory: c:\AlecProjects\pands-project
Full path to the file: c:\AlecProjects\pands-project\IrisDataset.csv
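
The same path lookup can be sketched with `pathlib`, which also makes it easy to fail early with a clear message when the file is not where the notebook expects it. `resolve_dataset_path` is a hypothetical helper name; the default filename is the one used above.

```python
from pathlib import Path


def resolve_dataset_path(filename: str = "IrisDataset.csv", base_dir=None) -> Path:
    """Return the absolute path to `filename` under `base_dir` (default: current working directory)."""
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    path = (base / filename).resolve()
    # Fail early with a readable message if the file is not there.
    if not path.is_file():
        raise FileNotFoundError(f"Dataset not found at {path}")
    return path
```

The returned `Path` can be passed straight to `pd.read_csv`.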


In [13]:
# Read Iris Dataset
IrisData = pd.read_csv('IrisDataset.csv')
IrisData.head()

   Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
0   1            5.1           3.5            1.4           0.2  Iris-setosa
1   2            4.9           3.0            1.4           0.2  Iris-setosa
2   3            4.7           3.2            1.3           0.2  Iris-setosa
3   4            4.6           3.1            1.5           0.2  Iris-setosa
4   5            5.0           3.6            1.4           0.2  Iris-setosa


In [16]:
# Search for any null values 
null_values = IrisData.isnull().sum()
print(null_values)

Id               0
SepalLengthCm    0
SepalWidthCm     0
PetalLengthCm    0
PetalWidthCm     0
Species          0
dtype: int64
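
Because the Iris data has no gaps, the check above only ever prints zeros. As an illustration of what the same call reports when values are missing, here is a small sketch on a made-up three-row frame. The numbers are invented for the example and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Invented toy data with two deliberately missing entries.
toy = pd.DataFrame(
    {
        "SepalLengthCm": [5.1, np.nan, 4.7],
        "PetalLengthCm": [1.4, 1.4, np.nan],
        "Species": ["Iris-setosa", "Iris-setosa", "Iris-setosa"],
    }
)

# Count missing values per column, as in the cell above.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Select the rows that contain at least one missing value.
rows_with_missing = toy[toy.isnull().any(axis=1)]
print(rows_with_missing)
```

Listing the affected rows, not only the per-column counts, helps when deciding whether to drop or fill them.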


### (2.3) Class Distribution

In [24]:
# Show the number and names of the unique iris species in the Species column
unique_species_count = IrisData['Species'].nunique()
unique_species_names = IrisData['Species'].unique().tolist()
print("Number of unique species:", unique_species_count)
print("Species names:", unique_species_names)

Number of unique species: 3
Species names: ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']


In [None]:
# Show the number of irises in each species 
# Reuse the DataFrame loaded earlier instead of reading the CSV again
print(IrisData['Species'].value_counts())

Species
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: count, dtype: int64
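
A balanced dataset means every class has the same share of the rows. As a rough sketch, `value_counts(normalize=True)` reports those shares directly. The helper name `class_shares` is made up for this example, and the labels below are a stand-in built to mirror the 50/50/50 counts printed above, not a read of the CSV itself.

```python
import pandas as pd


def class_shares(labels: pd.Series) -> pd.Series:
    """Return the fraction of rows belonging to each class."""
    return labels.value_counts(normalize=True)


# Stand-in labels mirroring the counts printed above.
labels = pd.Series(
    ["Iris-setosa"] * 50 + ["Iris-versicolor"] * 50 + ["Iris-virginica"] * 50,
    name="Species",
)

shares = class_shares(labels)
is_balanced = shares.nunique() == 1  # True when all classes have the same share
print(shares)
print("Balanced:", is_balanced)
```

In the notebook, `class_shares(IrisData['Species'])` would give the same result on the real data.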


### (2.4) Data Structure and Summary Statistics

In [15]:
# Describe Iris Dataset
DatasetSummary = IrisData.describe()
print(DatasetSummary)

               Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
count  150.000000     150.000000    150.000000     150.000000    150.000000
mean    75.500000       5.843333      3.054000       3.758667      1.198667
std     43.445368       0.828066      0.433594       1.764420      0.763161
min      1.000000       4.300000      2.000000       1.000000      0.100000
25%     38.250000       5.100000      2.800000       1.600000      0.300000
50%     75.500000       5.800000      3.000000       4.350000      1.300000
75%    112.750000       6.400000      3.300000       5.100000      1.800000
max    150.000000       7.900000      4.400000       6.900000      2.500000


The summary above shows the mean, standard deviation, minimum, maximum and quartile values across all iris species combined. Let's separate the table by species to get a better idea of how they differ from each other.

In [36]:
# Filter the dataset for rows where Species is 'Iris-setosa'
setosa_data = IrisData[IrisData['Species'] == 'Iris-setosa']

# Drop the 'Id' column from the filtered dataset as it is an identifier, not a measurement
setosa_data = setosa_data.drop('Id', axis=1)

# Describe the filtered dataset
setosa_summary = setosa_data.describe()
print(setosa_summary)

       SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
count       50.00000     50.000000      50.000000      50.00000
mean         5.00600      3.418000       1.464000       0.24400
std          0.35249      0.381024       0.173511       0.10721
min          4.30000      2.300000       1.000000       0.10000
25%          4.80000      3.125000       1.400000       0.20000
50%          5.00000      3.400000       1.500000       0.20000
75%          5.20000      3.675000       1.575000       0.30000
max          5.80000      4.400000       1.900000       0.60000
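
Some quartiles in the table, such as the 25% value of 3.125 for SepalWidthCm, have more decimal places than the one-decimal measurements shown in the first rows of the dataset. That is because `describe()` interpolates linearly between neighbouring sorted values by default. A minimal sketch on six invented values shows the arithmetic, and also that the `std` row is the sample standard deviation (`ddof=1`).

```python
import numpy as np
import pandas as pd

# Six invented values, already sorted.
values = pd.Series([2.9, 3.0, 3.1, 3.2, 3.4, 3.5])

summary = values.describe()

# Linear interpolation: the 25% quantile sits at position 0.25 * (n - 1) = 1.25,
# a quarter of the way from the value at index 1 to the value at index 2.
position = 0.25 * (len(values) - 1)
lower = values.iloc[int(position)]
upper = values.iloc[int(position) + 1]
manual_q1 = lower + (position - int(position)) * (upper - lower)

# describe() reports the sample standard deviation (divides by n - 1).
manual_std = np.std(values.to_numpy(), ddof=1)

print(summary["25%"], manual_q1)
print(summary["std"], manual_std)
```

Both printed pairs should agree up to floating-point rounding.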


In [42]:
# Filter the dataset for rows where Species is 'Iris-versicolor'
versicolor_data = IrisData[IrisData['Species'] == 'Iris-versicolor']

# Drop the 'Id' column from the filtered dataset as it is an identifier, not a measurement
versicolor_data = versicolor_data.drop('Id', axis=1)

# Describe the filtered dataset
versicolor_summary = versicolor_data.describe()
print(versicolor_summary)

       SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
count      50.000000     50.000000      50.000000     50.000000
mean        5.936000      2.770000       4.260000      1.326000
std         0.516171      0.313798       0.469911      0.197753
min         4.900000      2.000000       3.000000      1.000000
25%         5.600000      2.525000       4.000000      1.200000
50%         5.900000      2.800000       4.350000      1.300000
75%         6.300000      3.000000       4.600000      1.500000
max         7.000000      3.400000       5.100000      1.800000


In [40]:
# Filter the dataset for rows where Species is 'Iris-virginica'
virginica_data = IrisData[IrisData['Species'] == 'Iris-virginica']

# Drop the 'Id' column from the filtered dataset as it is an identifier, not a measurement
virginica_data = virginica_data.drop('Id', axis=1)

# Describe the filtered dataset
virginica_summary = virginica_data.describe()
print(virginica_summary)

       SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
count       50.00000     50.000000      50.000000      50.00000
mean         6.58800      2.974000       5.552000       2.02600
std          0.63588      0.322497       0.551895       0.27465
min          4.90000      2.200000       4.500000       1.40000
25%          6.22500      2.800000       5.100000       1.80000
50%          6.50000      3.000000       5.550000       2.00000
75%          6.90000      3.175000       5.875000       2.30000
max          7.90000      3.800000       6.900000       2.50000
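
The three per-species cells above repeat the same filter, drop and describe steps. One possible way to condense them is a single `groupby`, sketched below on a handful of illustrative rows (two features only) so the block runs on its own. In the notebook the same calls could be applied to `IrisData`.

```python
import pandas as pd

# Small illustrative sample with the same column layout as the CSV (two features only).
sample = pd.DataFrame(
    {
        "Id": [1, 2, 3, 4, 5, 6],
        "SepalLengthCm": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
        "PetalLengthCm": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
        "Species": [
            "Iris-setosa", "Iris-setosa",
            "Iris-versicolor", "Iris-versicolor",
            "Iris-virginica", "Iris-virginica",
        ],
    }
)

# Drop the identifier, then summarise every feature once per species.
features = sample.drop(columns="Id")
per_species = features.groupby("Species").describe()

# Pull out one statistic across species for a side-by-side comparison.
mean_by_species = features.groupby("Species").mean()
print(mean_by_species)
```

`per_species` has one row per species and a two-level column index (feature, statistic), so a single value can be read with, for example, `per_species.loc["Iris-setosa", ("PetalLengthCm", "mean")]`.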
