# Iris Dataset Analysis

Author: Galal Abdelaziz

This notebook presents my analysis of the well-known [Fisher’s Iris dataset](https://archive.ics.uci.edu/dataset/53/iris) for the Programming and Scripting module project.

***

![Banner](img/Iris.png)

***

## Introduction:

The __Iris dataset__ is one of the most well-known and commonly used datasets in the field of machine learning and statistics.

* The Iris dataset consists of __150 samples__ of iris flowers from three different species: [Setosa](https://en.wikipedia.org/wiki/Iris_setosa), [Versicolor](https://en.wikipedia.org/wiki/Iris_versicolor), and [Virginica](https://en.wikipedia.org/wiki/Iris_virginica).

* Each sample includes __four features__:

    * __Sepal length__: The length of the iris flower’s sepals (the green leaf-like structures that encase the flower bud).
    * __Sepal width__: The width of the iris flower’s sepals.
    * __Petal length__: The length of the iris flower’s petals (the colored structures of the flower).
    * __Petal width__: The width of the iris flower’s petals.

* Species in the Iris Dataset:

    The target variable denotes the species of the iris flower, featuring three classes:

    * __Iris Setosa__: Identified by its relatively small size and distinctive sepal and petal dimensions.
    * __Iris Versicolor__: Exhibiting moderate size, with characteristics intermediate between Setosa and Virginica.
    * __Iris Virginica__: Generally larger, with notable differences in sepal and petal dimensions compared to the other species.

* Historical Notes:

    * The dataset was introduced by the British biologist and statistician [Ronald Fisher](https://en.wikipedia.org/wiki/Ronald_Fisher) in 1936 as an example of [discriminant analysis](https://en.wikipedia.org/wiki/Linear_discriminant_analysis).
    * The Iris dataset became one of the standard teaching examples in statistical analysis and machine learning.
    * Ronald Fisher’s work on the dataset paved the way for the development of many classification algorithms still utilized today.
    * It remains a common first benchmark for checking that a new classification model behaves sensibly.

* Role in Machine Learning:

    * The dataset's popularity stems from its __simplicity__ and the __clear separation__ of Setosa from the other two species, with Versicolor and Virginica overlapping only partially on the provided features.
    * The Iris dataset serves as a standard benchmark for testing [classification algorithms](https://en.wikipedia.org/wiki/Statistical_classification#Algorithms).
    * Researchers and data scientists use the features of the iris flowers to classify each sample into one of the three species.
    * Researchers employ it to compare algorithmic performance and assess accuracy, precision, and recall.

***

## Variables: 

The Iris dataset contains two types of variables:

* Numeric Variables:

    * Sepal length.
    * Sepal width.
    * Petal length.
    * Petal width.

* Categorical Variable:
    * The only categorical variable in the dataset is the variety/species of iris flowers. It includes three classes:

        * Setosa.
        * Versicolor.
        * Virginica.
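
The split between numeric and categorical variables can be sketched with the standard library alone. The column names and the six sample rows below are assumptions made for illustration; the project's actual file layout is not shown here.

```python
import csv
import io

# A few rows in the layout of the Iris data; the column names are assumed.
SAMPLE_CSV = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.4,3.2,4.5,1.5,versicolor
6.3,3.3,6.0,2.5,virginica
5.8,2.7,5.1,1.9,virginica
"""

NUMERIC_COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
CATEGORICAL_COLUMN = "species"


def load_rows(text):
    """Parse CSV text into dicts with float features and a string species."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        # Numeric variables are converted to float (centimetres).
        row = {name: float(record[name]) for name in NUMERIC_COLUMNS}
        # The categorical variable stays a string label.
        row[CATEGORICAL_COLUMN] = record[CATEGORICAL_COLUMN]
        rows.append(row)
    return rows


rows = load_rows(SAMPLE_CSV)
species_classes = sorted({row[CATEGORICAL_COLUMN] for row in rows})
print(species_classes)
```

Keeping the numeric columns as floats and the species as a plain label mirrors the two variable types listed above.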

***

## Analysis 


![Pie](Program_run_results/number_of_samples.png)


### The pie chart illustrates an equal distribution between all three species (50 each).
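
The class balance behind the pie chart can be checked with a simple count. This is a minimal sketch: the `labels` list is a hypothetical stand-in for the species column, built to mirror the 50 samples per species stated above.

```python
from collections import Counter

# Hypothetical label column mirroring the counts stated above (50 per species).
labels = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50


def class_shares(labels):
    """Return {species: (count, percentage of all samples)}."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {name: (n, 100.0 * n / total) for name, n in counts.items()}


shares = class_shares(labels)
# The dataset is balanced when every species has the same count.
is_balanced = len({count for count, _ in shares.values()}) == 1
```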

![Histogram](Program_run_results/variable_histograms.png)

### The histogram illustrates the following:

* Sepal Length:

    * Setosa (Red): Most values range between 4.5 and 5.5 cm.
    * Versicolor (Green): Values range from around 5.0 to 7.0 cm, with a peak around 5.5 to 6.0 cm.
    * Virginica (Blue): Values range from about 6.0 to 7.5 cm, with a peak around 6.5 cm.
    * Insight: Setosa has a distinctly shorter sepal length compared to Versicolor and Virginica.

* Sepal Width:

    * Setosa (Red): Most values range between 3.0 and 4.0 cm, with a peak around 3.5 cm.
    * Versicolor (Green): Values range from about 2.0 to 3.5 cm, with a peak around 2.8 cm.
    * Virginica (Blue): Values range from approximately 2.2 to 3.5 cm, with a peak around 3.0 cm.
    * Insight: Setosa tends to have a wider sepal compared to Versicolor and Virginica, which have overlapping but slightly narrower distributions.

* Petal Length:

    * Setosa (Red): Values are very tightly clustered around 1.0 to 1.8 cm.
    * Versicolor (Green): Values range from about 3.0 to 5.0 cm, with a peak around 4.0 cm.
    * Virginica (Blue): Values range from approximately 4.5 to 7.0 cm, with a peak around 5.5 cm.
    * Insight: Setosa has significantly shorter petals compared to the other species. Versicolor and Virginica have longer petals, but Virginica's petal length tends to be the longest.

* Petal Width:

    * Setosa (Red): Values are tightly clustered around 0.2 to 0.6 cm.
    * Versicolor (Green): Values range from about 1.0 to 1.8 cm, with a peak around 1.3 cm.
    * Virginica (Blue): Values range from approximately 1.4 to 2.5 cm, with a peak around 2.0 cm.
    * Insight: Similar to petal length, Setosa has much narrower petals. Versicolor and Virginica have wider petals, with Virginica generally having the widest.

* Overall Insights from Histograms:

    * __Setosa__ is easily distinguishable from Versicolor and Virginica based on petal dimensions (length and width), having significantly shorter and narrower petals.
    * __Sepal dimensions__ overlap between species, but Setosa’s wider sepals and shorter sepal length help in differentiation.
    * __Versicolor and Virginica__ have overlapping ranges in many features, but petal dimensions (both length and width) can help distinguish them, with Virginica generally exhibiting larger values.
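
The per-species ranges read off the histograms can also be computed directly. The sketch below groups values by species and reports minimum, maximum, and mean; the `samples` list holds hypothetical petal lengths chosen to fall inside the ranges described above, not the full dataset.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (species, petal_length_cm) pairs within the ranges described above.
samples = [
    ("setosa", 1.4), ("setosa", 1.3), ("setosa", 1.7),
    ("versicolor", 4.7), ("versicolor", 4.0), ("versicolor", 3.5),
    ("virginica", 6.0), ("virginica", 5.1), ("virginica", 5.6),
]


def summarize(samples):
    """Group values by species and return min, max, and mean per group."""
    grouped = defaultdict(list)
    for species, value in samples:
        grouped[species].append(value)
    return {
        species: {"min": min(values), "max": max(values), "mean": mean(values)}
        for species, values in grouped.items()
    }


summary = summarize(samples)
for species, stats in summary.items():
    print(species, stats)
```

A gap between one species' maximum and another's minimum, as between Setosa and Versicolor here, is what makes a feature useful for telling them apart.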

![Scatterplots](Program_run_results/variable_scatterplots.png)

### The scatter pairplot depicts the following:

__Diagonal Elements (Histograms):__

* The diagonal elements of the pairplot show histograms for the individual features (sepal length, sepal width, petal length, petal width) for each species.
* Setosa (Red), Versicolor (Green), and Virginica (Blue) distributions are displayed, helping to visualize the spread and central tendencies of each feature within each species.

__Off-Diagonal Elements (Scatter Plots):__

* The scatter plots show the relationship between every pair of features, with different colors representing different species.

* Sepal Length vs. Sepal Width (First Row, Second Column):

    * Setosa (Red): Shows a clear cluster with generally higher sepal widths and shorter sepal lengths.
    * Versicolor (Green) and Virginica (Blue): Overlap significantly but can be somewhat distinguished by their slight separation in sepal length.

* Sepal Length vs. Petal Length (First Row, Third Column):

    * Clear separation of Setosa, which has shorter petal lengths.
    * Versicolor and Virginica show a linear relationship, with Virginica generally having longer petal lengths.

* Sepal Length vs. Petal Width (First Row, Fourth Column):

    * Setosa is distinctly separated with narrower petal widths.
    * Versicolor and Virginica show overlapping distributions but Virginica tends to have slightly wider petals.

* Sepal Width vs. Petal Length (Second Row, Third Column):

    * Setosa is again clearly separated by having shorter petal lengths.
    * Versicolor and Virginica are more intermixed but can still be distinguished based on the range of petal lengths.

* Sepal Width vs. Petal Width (Second Row, Fourth Column):

    * Setosa is clearly distinct, with narrower petals.
    * Versicolor and Virginica overlap, but Virginica tends to have slightly wider petals.

* Petal Length vs. Petal Width (Third Row, Fourth Column):

    * This plot shows the most distinct separation among all species.
    * Setosa is clearly separated with both shorter and narrower petals.
    * Versicolor and Virginica show a linear relationship, but Virginica tends to have longer and wider petals compared to Versicolor.

__Overall Insights from the Pairplot__

* Setosa:

    * Easily distinguishable from Versicolor and Virginica in almost all pairwise comparisons, especially in features related to petal dimensions.
    * Generally has shorter and narrower petals and slightly wider sepals.

* Versicolor and Virginica:

    * These two species overlap in many feature pairs but can still be differentiated through closer examination of the plots.
    * Versicolor tends to have intermediate values, while Virginica usually has the largest values for petal dimensions.
    * The petal length vs. petal width plot shows the clearest separation between these two species, although a few samples near the boundary still overlap.

* Summary:

    * The pairplot effectively visualizes the relationships between each pair of features, highlighting the distinguishing characteristics of the three species of iris flowers.
    * It is particularly useful for identifying which features are most effective in distinguishing between species, aiding in tasks such as classification and pattern recognition.
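
To illustrate how petal dimensions alone can drive a classification, here is a deliberately simple threshold rule. It is a sketch, not the method used in this project: the cut-offs (2.4 cm and 1.6 cm) are assumptions taken as rough midpoints of the gap and overlap in the ranges reported above, and samples in the Versicolor/Virginica overlap will sometimes be misclassified.

```python
# Hypothetical cut-offs derived from the ranges described in the histograms.
SETOSA_MAX_PETAL_LENGTH_CM = 2.4   # midpoint of the gap between 1.8 and 3.0
VERSICOLOR_MAX_PETAL_WIDTH_CM = 1.6  # midpoint of the 1.4 to 1.8 overlap


def classify_by_petal(petal_length_cm, petal_width_cm):
    """Assign a species using two fixed petal thresholds."""
    if petal_length_cm < SETOSA_MAX_PETAL_LENGTH_CM:
        return "setosa"
    if petal_width_cm < VERSICOLOR_MAX_PETAL_WIDTH_CM:
        return "versicolor"
    return "virginica"


print(classify_by_petal(1.4, 0.2))
print(classify_by_petal(4.0, 1.3))
print(classify_by_petal(5.5, 2.0))
```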

![Correlation](Program_run_results/variable_correlation_matrix.png)

### The correlation matrix heatmap shows the following:

* Each cell in the matrix represents the correlation coefficient between two features, which ranges from -1 to 1. This value indicates the strength and direction of the linear relationship:

    * 1: Perfect positive correlation
    * -1: Perfect negative correlation
    * 0: No linear correlation
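
The coefficient in each cell is the Pearson correlation. A minimal sketch of how it is computed follows; the function name `pearson` and the small input lists are illustrative only.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


print(pearson([1, 2, 3], [2, 4, 6]))         # values rise together
print(pearson([1, 2, 3], [6, 4, 2]))         # one rises as the other falls
print(pearson([1, 2, 3, 4], [1, -1, -1, 1])) # related, but not linearly
```

The last call shows why a value of 0 only rules out a linear relationship: the two sequences are clearly related, yet their deviations cancel out.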

__Diagonal Elements:__

* These represent the correlation of each feature with itself, which is always 1.

__Off-Diagonal Elements:__ 

* These represent the correlation between different pairs of features.

* Sepal Length and Sepal Width (-0.12)

    * Very weak negative correlation.
    * This indicates that sepal length and sepal width have almost no linear relationship across the dataset as a whole.

* Sepal Length and Petal Length (0.87)

    * Strong positive correlation.
    * As sepal length increases, petal length also tends to increase.

* Sepal Length and Petal Width (0.82)

    * Strong positive correlation.
    * As sepal length increases, petal width tends to increase as well.

* Sepal Width and Petal Length (-0.43)

    * Moderate negative correlation.
    * As sepal width increases, petal length tends to decrease.

* Sepal Width and Petal Width (-0.37)

    * Moderate negative correlation.
    * As sepal width increases, petal width tends to decrease.

* Petal Length and Petal Width (0.96)

    * Very strong positive correlation.
    * As petal length increases, petal width almost always increases as well. This is the strongest correlation observed in the matrix.

__Overall Insights from the Correlation Matrix Heatmap__

* Strong Correlations:

    * The strongest correlations are observed between petal length and petal width (0.96), followed by sepal length with petal length (0.87) and sepal length with petal width (0.82). These strong positive correlations suggest that these feature pairs are highly related and tend to increase together.

* Moderate to Weak Correlations:

    * Sepal width shows moderate negative correlations with petal length (-0.43) and petal width (-0.37), indicating that wider sepals are somewhat associated with shorter and narrower petals.
    * The correlation between sepal length and sepal width is very weak (-0.12), indicating little linear association between these features.

* Importance in Analysis

    * Feature Selection: Understanding these correlations is crucial for feature selection in machine learning. Highly correlated features (e.g., petal length and petal width) might be redundant and can be considered for dimensionality reduction techniques.
    * Data Interpretation: These correlations help interpret the biological relationships between different parts of the iris flowers, providing insights into how certain features are likely to vary together.
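
As a sketch of how the correlations could feed into feature selection, the snippet below flags feature pairs whose absolute correlation meets a cut-off. The coefficients are the ones reported above; the feature names and the default threshold of 0.9 are assumptions for illustration.

```python
# Coefficients as reported in the heatmap discussion above.
CORRELATIONS = {
    ("sepal_length", "sepal_width"): -0.12,
    ("sepal_length", "petal_length"): 0.87,
    ("sepal_length", "petal_width"): 0.82,
    ("sepal_width", "petal_length"): -0.43,
    ("sepal_width", "petal_width"): -0.37,
    ("petal_length", "petal_width"): 0.96,
}


def redundant_pairs(correlations, threshold=0.9):
    """Return pairs with |r| >= threshold, strongest first."""
    selected = [pair for pair, r in correlations.items() if abs(r) >= threshold]
    return sorted(selected, key=lambda pair: -abs(correlations[pair]))


print(redundant_pairs(CORRELATIONS))
print(redundant_pairs(CORRELATIONS, threshold=0.8))
```

With the default cut-off only the petal length and petal width pair is flagged; lowering it to 0.8 also brings in the two sepal length pairs.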

![sepal_length_vs_sepal_width](Program_run_results/sepal_length_vs_sepal_width.png)

### The scatter plot depicts the following:

* The Setosa species features shorter sepal lengths alongside wider sepal widths.
* The Versicolor species displays sepal dimensions that are intermediate, with moderate lengths and widths.
* The Virginica species is distinguished by longer sepal lengths and narrower sepal widths.
* Across all samples, sepal length and sepal width show almost no linear relationship.


![petal_length_vs_petal_width](Program_run_results/petal_length_vs_petal_width.png)

### The scatter plot depicts the following:

* The Setosa species features smaller petal lengths and widths.
* The Versicolor species displays petal dimensions that are intermediate, with moderate lengths and widths.
* The Virginica species is distinguished by the largest petal lengths and widths.
* There is a strong positive correlation between petal length and petal width across the dataset as a whole (0.96 in the correlation matrix).

***

## Analysis Key Findings:

* Distribution:

    * The dataset is evenly distributed across the three species (50 samples each).
    
* Species Separation:

    * Setosa is clearly separated from the other two species, while Versicolor and Virginica overlap partially.
    * Visualizing the feature distributions reveals a distinct Setosa cluster; petal length vs. petal width separates all three species best, whereas sepal length vs. sepal width leaves Versicolor and Virginica largely intermixed.

* Feature Importance:

    * Petal dimensions (length and width) play a crucial role in species differentiation.
    * Setosa has the smallest petals, while Virginica has the largest.
    * Sepal dimensions are less discriminative but still contribute to species classification.

* Scatter Plots:

    * Scatter plots of petal length vs. petal width show clear boundaries between species.
    * Setosa has the smallest petals, forming a tight cluster.
    * Versicolor and Virginica overlap more, but their petal dimensions still allow separation.

* Relationships:

    * The data analysis reveals strong positive correlations between the following pairs, which aid in species classification:

        * Petal Width and Petal Length.
        * Petal Length and Sepal Length.
        * Petal Width and Sepal Length.

* Outlier Detection: 

    * Some outliers are present, potentially indicating unique or anomalous specimens within the dataset.
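
One common way to flag outliers is the 1.5 × IQR rule used by box plots. The sketch below is an assumption about how such points could be identified, not a record of how they were found in this analysis, and the `sepal_widths` values are hypothetical.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # first, second, and third quartile
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical sepal widths (cm) with one unusually wide specimen.
sepal_widths = [2.8, 2.9, 3.0, 3.0, 3.1, 3.2, 3.3, 4.4]
print(iqr_outliers(sepal_widths))
```

Applying the rule per species rather than to the pooled data avoids flagging a sample merely because it belongs to a smaller or larger species.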

***

## End