# IRIS

*EXPLORATORY DATA ANALYSIS (EDA)*

Author: Marcin Kaminski

## About the data

The dataset contains information about three species of iris: Iris setosa, Iris versicolor, and Iris virginica.

The dataset includes measurements of four traits: sepal length and width, and petal length and width.

Each row in the dataset represents a single flower, and the measurements are given in centimeters.

The dataset consists of 150 samples, 50 per species, and is widely used in data science and machine learning as an introductory dataset for testing classification algorithms.

Columns:

* **sepal length** - Sepal length in cm
* **sepal width** - Sepal width in cm
* **petal length** - Petal length in cm
* **petal width** - Petal width in cm
* **class** - Iris class (setosa, versicolor, virginica)

## 1. GENERAL OVERVIEW

In [2]:
import pandas as pd

In [3]:
df = pd.read_csv('iris.csv', sep=",")  # Read the data from a CSV file

In [4]:
df

Unnamed: 0,sepal length,sepal width,petal length,petal width,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica


In [5]:
df.head(10) # First 10 rows

Unnamed: 0,sepal length,sepal width,petal length,petal width,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
5,5.4,3.9,1.7,0.4,Iris-setosa
6,4.6,3.4,1.4,0.3,Iris-setosa
7,5.0,3.4,1.5,0.2,Iris-setosa
8,4.4,2.9,1.4,0.2,Iris-setosa
9,4.9,3.1,1.5,0.1,Iris-setosa


In [6]:
df.tail(10) # Last 10 rows

Unnamed: 0,sepal length,sepal width,petal length,petal width,class
140,6.7,3.1,5.6,2.4,Iris-virginica
141,6.9,3.1,5.1,2.3,Iris-virginica
142,5.8,2.7,5.1,1.9,Iris-virginica
143,6.8,3.2,5.9,2.3,Iris-virginica
144,6.7,3.3,5.7,2.5,Iris-virginica
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica
149,5.9,3.0,5.1,1.8,Iris-virginica


In [7]:
df.sample(10) # Random 10 rows

Unnamed: 0,sepal length,sepal width,petal length,petal width,class
140,6.7,3.1,5.6,2.4,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica
143,6.8,3.2,5.9,2.3,Iris-virginica
33,5.5,4.2,1.4,0.2,Iris-setosa
54,6.5,2.8,4.6,1.5,Iris-versicolor
7,5.0,3.4,1.5,0.2,Iris-setosa
16,5.4,3.9,1.3,0.4,Iris-setosa
76,6.8,2.8,4.8,1.4,Iris-versicolor
88,5.6,3.0,4.1,1.3,Iris-versicolor
114,5.8,2.8,5.1,2.4,Iris-virginica
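Because `df.sample(10)` draws rows at random, the output above changes on every run. The following is a minimal sketch, assuming a reproducible draw is wanted: passing `random_state` fixes the selection. The seed value 42 is arbitrary, and `demo` is a hypothetical stand-in built from a few rows shown earlier, not the full dataset.

```python
import pandas as pd

# Hypothetical stand-in for df, built from the first rows displayed above.
demo = pd.DataFrame({
    "sepal length": [5.1, 4.9, 4.7, 4.6, 5.0, 5.4],
    "class": ["Iris-setosa"] * 6,
})

# The same seed yields the same rows on every call.
first_draw = demo.sample(3, random_state=42)
second_draw = demo.sample(3, random_state=42)
print(first_draw.equals(second_draw))
```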


In [9]:
df.info() # Information about columns in DataFrame

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   sepal length   150 non-null    float64
 1    sepal width   150 non-null    float64
 2    petal length  150 non-null    float64
 3    petal width   150 non-null    float64
 4    class         150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
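The `df.info()` output above appears to show every column name except the first with a leading space, which would make a lookup such as `df["sepal width"]` fail. The following is a minimal sketch of one way to normalize the names, assuming the CSV header has a space after each comma. The in-memory text `raw` is a hypothetical two-row excerpt, not the real file.

```python
import io
import pandas as pd

# Hypothetical excerpt mimicking a header with a space after each comma.
raw = io.StringIO(
    "sepal length, sepal width, petal length, petal width, class\n"
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
)

demo = pd.read_csv(raw)
# Remove leading and trailing whitespace from every column name.
demo.columns = demo.columns.str.strip()
print(list(demo.columns))
```

After this step every column can be selected by its plain name. Passing `skipinitialspace=True` to `pd.read_csv` may be an alternative worth trying.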


In [11]:
df.describe() # Summary of all numeric columns

Unnamed: 0,sepal length,sepal width,petal length,petal width
count,150.0,150.0,150.0,150.0
mean,5.843333,3.054,3.758667,1.198667
std,0.828066,0.433594,1.76442,0.763161
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


The mean sepal length across the 150 flowers is 5.84 cm and the mean sepal width is 3.05 cm.
The mean petal length is 3.76 cm and the mean petal width is about 1.20 cm.
Petal length varies the most, with the highest standard deviation (1.76 cm).
Sepal width varies the least (std = 0.43 cm).

Half of the flowers have a sepal length of at least 5.8 cm (the median), with a maximum of 7.9 cm. The median lies closer to the minimum (4.3 cm) than to the maximum, so the upper half of the sepal lengths is more spread out than the lower half. Likewise, half of the flowers have sepals at least 3.0 cm wide, with a maximum of 4.4 cm. Half of the flowers have petals at least 4.35 cm long and at least 1.3 cm wide.
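When reading the 50% row of `describe()`, note that the median guarantees that at least half of the values are at or above it. If several flowers share the median value, the share strictly above it can be smaller than 50%. The following is a small sketch of that check, using only the ten sepal lengths shown by `df.head(10)` as an illustrative excerpt. The variable names are assumptions.

```python
import pandas as pd

# Sepal lengths of the first ten rows shown by df.head(10);
# a small excerpt, not the full dataset.
sepal_length = pd.Series([5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9])

median = sepal_length.median()
share_above = (sepal_length > median).mean()         # strictly greater than the median
share_at_or_above = (sepal_length >= median).mean()  # median itself included

print(median, share_above, share_at_or_above)
```

In this excerpt the two shares differ because two flowers have exactly the median length. The same comparison can be run on the full `df["sepal length"]` column.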

In [13]:
df.nunique() # Unique values in each column

sepal length     35
 sepal width     23
 petal length    43
 petal width     22
 class            3
dtype: int64

In [15]:
df.sort_values(by = "sepal length") # Sorting DataFrame by column "sepal length", ascending

Unnamed: 0,sepal length,sepal width,petal length,petal width,class
13,4.3,3.0,1.1,0.1,Iris-setosa
42,4.4,3.2,1.3,0.2,Iris-setosa
38,4.4,3.0,1.3,0.2,Iris-setosa
8,4.4,2.9,1.4,0.2,Iris-setosa
41,4.5,2.3,1.3,0.3,Iris-setosa
...,...,...,...,...,...
122,7.7,2.8,6.7,2.0,Iris-virginica
118,7.7,2.6,6.9,2.3,Iris-virginica
117,7.7,3.8,6.7,2.2,Iris-virginica
135,7.7,3.0,6.1,2.3,Iris-virginica


In [17]:
df.sort_values(by = "sepal length", ascending = False) # Sorting DataFrame by column "sepal length", descending

Unnamed: 0,sepal length,sepal width,petal length,petal width,class
131,7.9,3.8,6.4,2.0,Iris-virginica
135,7.7,3.0,6.1,2.3,Iris-virginica
122,7.7,2.8,6.7,2.0,Iris-virginica
117,7.7,3.8,6.7,2.2,Iris-virginica
118,7.7,2.6,6.9,2.3,Iris-virginica
...,...,...,...,...,...
41,4.5,2.3,1.3,0.3,Iris-setosa
42,4.4,3.2,1.3,0.2,Iris-setosa
38,4.4,3.0,1.3,0.2,Iris-setosa
8,4.4,2.9,1.4,0.2,Iris-setosa


In the sorted views above, the flowers with the shortest sepals are all "Iris setosa", while those with the longest sepals are all "Iris virginica".
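The sorted views show only the extremes, so a per-species summary gives a fuller picture of how sepal length differs between classes. The following is a minimal sketch, assuming a `groupby` on the `class` column. The `demo` frame is a hypothetical nine-row excerpt assembled from rows displayed in this notebook (indices 13, 0, 33, 50, 54, 88, 131, 106, 148), so its numbers do not describe the full dataset.

```python
import pandas as pd

# Hypothetical excerpt: three rows per species taken from the outputs above.
demo = pd.DataFrame({
    "sepal length": [4.3, 5.1, 5.5, 7.0, 6.5, 5.6, 7.9, 4.9, 6.2],
    "class": ["Iris-setosa"] * 3 + ["Iris-versicolor"] * 3 + ["Iris-virginica"] * 3,
})

# Minimum, median and maximum sepal length for each species.
summary = demo.groupby("class")["sepal length"].agg(["min", "median", "max"])
print(summary)
```

Run on the full `df`, the same call would show whether the ranges of the three species overlap.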

In [20]:
df2 = df[df["sepal length"] > 5.8] # Only flowers with sepal length above 5.8 cm (median for the entire dataset)

In [21]:
df2

Unnamed: 0,sepal length,sepal width,petal length,petal width,class
50,7.0,3.2,4.7,1.4,Iris-versicolor
51,6.4,3.2,4.5,1.5,Iris-versicolor
52,6.9,3.1,4.9,1.5,Iris-versicolor
54,6.5,2.8,4.6,1.5,Iris-versicolor
56,6.3,3.3,4.7,1.6,Iris-versicolor
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica


Could it be that no flowers of the species "Iris setosa" have sepals longer than 5.8 cm?

We will verify this later.

In [24]:
df3 = df[df["sepal length"] < 5.8] # Only flowers with sepal length below 5.8 cm (median for the entire dataset)

In [23]:
df3

Unnamed: 0,sepal length,sepal width,petal length,petal width,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
98,5.1,2.5,3.0,1.1,Iris-versicolor
99,5.7,2.8,4.1,1.3,Iris-versicolor
106,4.9,2.5,4.5,1.7,Iris-virginica
113,5.7,2.5,5.0,2.0,Iris-virginica
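Both filters use strict comparisons (`>` and `<`), so flowers whose sepal length is exactly 5.8 cm end up in neither `df2` nor `df3`. The following is a small sketch of how to check this and count the species in each subset. The `demo` frame is a hypothetical six-row excerpt from rows shown in this notebook (indices 0, 33, 50, 99, 142, 148), so it cannot settle the question about "Iris setosa" for the full data.

```python
import pandas as pd

# Hypothetical excerpt built from rows displayed in the outputs above.
demo = pd.DataFrame({
    "sepal length": [5.1, 5.5, 7.0, 5.7, 5.8, 6.2],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
              "Iris-versicolor", "Iris-virginica", "Iris-virginica"],
})

threshold = 5.8  # median sepal length reported by describe() for the full dataset
above = demo[demo["sepal length"] > threshold]
below = demo[demo["sepal length"] < threshold]
on_threshold = demo[demo["sepal length"] == threshold]

# Rows equal to the threshold belong to neither strict subset.
print(len(above), len(below), len(on_threshold))
# Number of flowers per species among those above the threshold.
print(above["class"].value_counts())
```

Replacing one of the comparisons with `>=` or `<=` would assign the boundary rows to a subset, if that is the intent.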
