# Iris Dataset Analysis

Author: Galal Abdelaziz

This notebook presents my analysis of the well-known [Fisher’s Iris dataset](https://archive.ics.uci.edu/dataset/53/iris) for the Programming and Scripting module project.

***

![Banner](img/Iris.png)

***

## Introduction:

The __Iris dataset__ is one of the most well-known and commonly used datasets in the field of machine learning and statistics. Here's an overview:

* The Iris dataset consists of __150 samples__ of iris flowers from three different species: __Setosa__, __Versicolor__, and __Virginica__.

* Each sample includes __four features__:

    * __Sepal length__: The length of the iris flower’s sepals (the green leaf-like structures that encase the flower bud).
    * __Sepal width__: The width of the iris flower’s sepals.
    * __Petal length__: The length of the iris flower’s petals (the colored structures of the flower).
    * __Petal width__: The width of the iris flower’s petals.

* Species in the Iris Dataset:

    The target variable represents the species of the iris flower and has three classes:

    * __Iris Setosa__: The smallest of the three, with short, narrow petals that set it apart from the other two species.
    * __Iris Versicolor__: Moderate in size, with petal dimensions falling between those of Iris setosa and Iris virginica.
    * __Iris Virginica__: Generally the largest, with the longest and widest petals.

* Notes:

    * The dataset was introduced by the British biologist and statistician __Ronald Fisher__ in 1936 as an example of __discriminant analysis__.
    * Researchers and data scientists use the features of the iris flowers to classify each sample into one of the three species.
    * The dataset is particularly popular due to its __simplicity__ and the __good separation__ of the species on the provided features: Setosa is clearly distinct, while Versicolor and Virginica partially overlap.

* Historical Context:

    * The Iris dataset played a foundational role in statistical analysis and machine learning.
    * Ronald Fisher’s work on the dataset paved the way for the development of many classification algorithms still in use today.
    * It continues to be a benchmark for testing new machine learning models.

* Role in Machine Learning:

    * The Iris dataset serves as a standard benchmark for testing __classification algorithms__.
    * Researchers use it to compare the performance of different algorithms and evaluate their accuracy, precision, and recall.
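
As a rough illustration of this benchmarking role, the sketch below trains a k-nearest-neighbours classifier and reports accuracy, precision, and recall. It is not part of the project's own code: the choice of classifier, the 70/30 split, the `random_state` value, and the use of scikit-learn's bundled copy of the dataset are all assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: scikit-learn's bundled copy of the dataset is used.
iris = load_iris()
X, y = iris.data, iris.target

# Hold out 30% of the samples, keeping the species balanced in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Illustrative choice of classifier; any scikit-learn classifier would fit here.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Per-species precision and recall.
print(classification_report(y_test, y_pred, target_names=iris.target_names))
```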

***

## Variables: 

The Iris dataset contains two types of variables:

* Numeric Variables:

    * Sepal length.
    * Sepal width.
    * Petal length.
    * Petal width.

* Categorical Variable:
    * The only categorical variable in the dataset is the variety/species of iris flowers. It includes three classes:

        * Setosa.
        * Versicolor.
        * Virginica.
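
The split into numeric and categorical variables can be checked with pandas. This is a minimal sketch that assumes scikit-learn's bundled copy of the dataset stands in for the project's data file; the short column names are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled copy stands in for the project's data file.
iris = load_iris(as_frame=True)
df = iris.data.copy()

# Illustrative short column names (the bundled names include units, e.g. "sepal length (cm)").
df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# Store the species as a categorical column built from the integer class codes.
df["species"] = pd.Categorical.from_codes(
    iris.target.to_numpy(), categories=iris.target_names.tolist()
)

numeric_vars = df.select_dtypes(include="number").columns.tolist()
categorical_vars = df.select_dtypes(include="category").columns.tolist()

print("Numeric variables:", numeric_vars)
print("Categorical variables:", categorical_vars)
print("Classes:", list(df["species"].cat.categories))
```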

***

## Analysis 


![Pie](Program_run_results/number_of_samples.png)


### The pie chart illustrates an equal distribution between all three species (50 each).
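
A chart like this can be produced from the per-species counts. The sketch below is one possible way to do it, assuming scikit-learn's bundled copy of the dataset and matplotlib; it is not necessarily how the figure above was generated.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to display the figure interactively
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled copy of the dataset is used.
iris = load_iris(as_frame=True)
species = iris.target.map(dict(enumerate(iris.target_names.tolist())))

# Count the samples in each species.
counts = species.value_counts().sort_index()
print(counts)

fig, ax = plt.subplots()
ax.pie(counts.to_numpy(), labels=counts.index.tolist(), autopct="%1.1f%%")
ax.set_title("Number of samples per species")
```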

![Histogram](Program_run_results/variable_histograms.png)

### The histogram illustrates the following:

* The distribution of the values of each feature across all samples.
* The sepal length and sepal width histograms show significant overlap, while the petal length and petal width histograms show minimal overlap.
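
One way to inspect overlap is to draw each feature's histogram with the species overlaid. The sketch below assumes that the overlap of interest is between species, and uses scikit-learn's bundled copy of the dataset; the bin count and layout are arbitrary choices.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to display the figure interactively
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled copy of the dataset is used.
iris = load_iris(as_frame=True)
df = iris.data.copy()
df["species"] = iris.target.map(dict(enumerate(iris.target_names.tolist())))

features = list(iris.data.columns)
fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# One panel per feature, with a semi-transparent histogram per species.
for ax, feature in zip(axes.ravel(), features):
    for name, group in df.groupby("species"):
        ax.hist(group[feature], bins=10, alpha=0.5, label=name)
    ax.set_title(feature)
    ax.set_ylabel("count")

axes[0, 0].legend()
fig.tight_layout()
```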

![Scatterplots](Program_run_results/variable_scatterplots.png)

### The scatter plot depicts the following:

* __Species Clusters__: Some species distinctly group together based on specific features, like petal length and width distinguishing Iris setosa from Iris versicolor and Iris virginica.
* __Feature Relationships__: These plots reveal correlations between features, like the positive relationship between petal length and width.
* __Outliers__: Scatter plots aid in identifying outliers, data points that significantly deviate from others.
* __Notes__:
    * The graph visualizes the relationships between the various pairs of features in the Iris dataset.
    * Since plotting a feature against itself would not be informative, the diagonal panels show histograms instead.
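
A grid of pairwise scatter plots with histograms on the diagonal can be sketched with `pandas.plotting.scatter_matrix`. This is an illustrative alternative, not necessarily the method used for the figure above, and it assumes scikit-learn's bundled copy of the dataset.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to display the figure interactively
from pandas.plotting import scatter_matrix
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled copy of the dataset is used.
iris = load_iris(as_frame=True)

# Off-diagonal panels: scatter plots for each pair of features.
# Diagonal panels: a histogram of the single feature.
axes = scatter_matrix(
    iris.data,
    c=iris.target,      # colour each point by its species code (0, 1, 2)
    diagonal="hist",
    figsize=(10, 10),
    alpha=0.8,
)
```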

![Correlation](Program_run_results/variable_correlation_matrix.png)

### The heat map shows the following:

* The data analysis reveals strong correlations between the following: 
    * Petal Width and Petal Length.
    * Petal Length and Sepal Length.
    * Petal Width and Sepal Length.
    
* Petal length and petal width show a strong positive correlation across the dataset as a whole: as one increases, the other tends to as well.
* Sepal measurements also show correlations, albeit typically weaker than those of petals.
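
The correlation matrix behind a heat map like this can be computed with `DataFrame.corr`. The sketch below assumes Pearson correlation (the pandas default), scikit-learn's bundled copy of the dataset, and illustrative short column names.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to display the figure interactively
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled copy of the dataset is used.
iris = load_iris(as_frame=True)
df = iris.data.copy()
df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]  # illustrative names

# Pairwise Pearson correlation between the four numeric features.
corr = df.corr()
print(corr.round(2))

# Draw the matrix as a heat map on a fixed -1..1 colour scale.
fig, ax = plt.subplots()
im = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr)))
ax.set_yticklabels(corr.index)
fig.colorbar(im, ax=ax)
fig.tight_layout()
```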

![sepal_length_vs_sepal_width](Program_run_results/sepal_length_vs_sepal_width.png)

### The scatter plot depicts the following:

* The Setosa species has shorter but wider sepals than the other two species.
* The Versicolor species has sepals of intermediate length and, on average, the narrowest widths.
* The Virginica species has the longest sepals, which are narrower than those of Setosa.
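
The per-species sepal dimensions can be compared numerically with a group-wise mean. This is a minimal sketch assuming scikit-learn's bundled copy of the dataset and illustrative column names.

```python
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled copy of the dataset is used.
iris = load_iris(as_frame=True)
df = iris.data.copy()
df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]  # illustrative names
df["species"] = iris.target.map(dict(enumerate(iris.target_names.tolist())))

# Mean sepal length and width for each species.
sepal_means = df.groupby("species")[["sepal_length", "sepal_width"]].mean()
print(sepal_means.round(2))
```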

![petal_length_vs_petal_width](Program_run_results/petal_length_vs_petal_width.png)

### The scatter plot depicts the following:

* The Setosa species features smaller petal lengths and widths.
* The Versicolor species displays petal dimensions that are intermediate, with moderate lengths and widths.
* The Virginica species is distinguished by the largest petal lengths and widths.
* Across the dataset as a whole, there is a strong correlation between petal length and width.
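
To see how the petal length and width relationship holds both overall and inside each species, the correlation can be computed on the pooled data and per group. This sketch assumes Pearson correlation, scikit-learn's bundled copy of the dataset, and illustrative column names.

```python
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled copy of the dataset is used.
iris = load_iris(as_frame=True)
df = iris.data.copy()
df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]  # illustrative names
df["species"] = iris.target.map(dict(enumerate(iris.target_names.tolist())))

# Correlation over all 150 samples.
pooled = df["petal_length"].corr(df["petal_width"])

# Correlation computed separately inside each species.
within = {
    name: group["petal_length"].corr(group["petal_width"])
    for name, group in df.groupby("species")
}

print(f"Pooled: {pooled:.2f}")
for name, value in within.items():
    print(f"{name}: {value:.2f}")
```

The pooled figure is driven largely by the differences between species, so the within-species values can be noticeably lower.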

***

## Analysis Key Findings:

* Distribution:

    * The dataset has an equal distribution across all three species (50 samples each).
    
* Species Separation:

    * The dataset exhibits clear separation between Setosa and the other two species, while Versicolor and Virginica partially overlap.
    * Visualizing pairs of features (such as sepal length vs. sepal width or petal length vs. petal width) reveals a distinct Setosa cluster; the petal plot separates Versicolor and Virginica better than the sepal plot.

* Feature Importance:

    * Petal dimensions (length and width) play a crucial role in species differentiation.
    * Setosa has the smallest petals, while Virginica has the largest.
    * Sepal dimensions are less discriminative but still contribute to species classification.

* Scatter Plots:

    * Scatter plots of petal length vs. petal width show clear boundaries between species.
    * Setosa has the smallest petals, forming a tight cluster.
    * Versicolor and Virginica overlap more, but their petal dimensions still allow separation.

* Relationships:

    * The data analysis reveals strong correlations between the following pairs, which aids species classification:
    
         * Petal Width and Petal Length.
         * Petal Length and Sepal Length.
         * Petal Width and Sepal Length.

* Outlier Detection: 

    * Some outliers are present, potentially indicating unique or anomalous specimens within the dataset.
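
One common way to flag outliers is the 1.5 × IQR rule, the same rule box plots use for their whiskers. The sketch below applies it to each feature over the pooled data; the rule, the threshold, and the use of scikit-learn's bundled copy of the dataset are assumptions, since the analysis above does not state how outliers were identified.

```python
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled copy of the dataset is used.
iris = load_iris(as_frame=True)
df = iris.data.copy()
df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]  # illustrative names

# Interquartile range per feature.
q1 = df.quantile(0.25)
q3 = df.quantile(0.75)
iqr = q3 - q1

# A value is flagged if it lies more than 1.5 * IQR outside the quartiles.
is_outlier = (df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)

outlier_counts = is_outlier.sum()
print(outlier_counts)

# Rows with at least one flagged value.
print(df[is_outlier.any(axis=1)])
```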

***

## End