# Exploratory Data Analysis (EDA) on the Iris Dataset

Exploratory Data Analysis (EDA) is a foundational step in the data analysis process. Before delving into complex modeling, it's imperative to understand the characteristics of the data we're working with.

## Definition

> **Exploratory Data Analysis (EDA)** is an approach to analyzing data sets in order to summarize their main characteristics, often using statistical graphics, plots, and information tables.

## Purpose of EDA

1. **Detecting Outliers:** Identify unusual or unexpected data points.
2. **Testing Assumptions:** Check whether assumptions about the data, such as normality or independence, actually hold.
3. **Preliminary Selection of Models:** Inform the choice of suitable statistical models.
4. **Determining Relationships:** Identify patterns, relationships, or anomalies.

## Techniques Used in EDA

### 1. Descriptive Statistics

- **Mean:** Average of all data points.
- **Median:** Middle value when data is sorted.
- **Mode:** Most frequently occurring value.
- **Standard Deviation:** Measures how far values typically spread around the mean.
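
As a minimal sketch, the four statistics above can be computed with Python's standard `statistics` module. The list below holds the first eight `SepalLengthCm` values that appear later in this notebook; the variable names are illustrative only.

```python
import statistics

# First eight SepalLengthCm values from the iris data used in this notebook.
sepal_length = [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0]

mean = statistics.mean(sepal_length)        # arithmetic average
median = statistics.median(sepal_length)    # middle of the sorted values
modes = statistics.multimode(sepal_length)  # every value tied for most frequent
std = statistics.stdev(sepal_length)        # sample standard deviation (divides by n - 1)

print(f"mean={mean:.4f} median={median:.2f} modes={modes} std={std:.4f}")
```

`multimode` is used instead of `mode` because this sample has two values tied for most frequent. The sample standard deviation (dividing by n - 1) is also the default used by pandas.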

### 2. Visualization

- **Histograms:** Show the distribution of a dataset.
- **Box Plots:** Summarize a distribution through its median, quartiles, and range, and flag potential outliers.
- **Scatter Plots:** Show relationships between two variables.
- **Pair Plots:** Visualize pairwise relationships in a dataset.
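
To make the box plot's outlier markers concrete, here is a small worked sketch of the arithmetic behind them. It assumes the common 1.5 × IQR convention for the whisker limits, and uses the `SepalWidthCm` quartiles, minimum, and maximum reported by `iris.describe()` in this notebook.

```python
# Quartiles and extremes of SepalWidthCm as reported by iris.describe().
q1, q3 = 2.8, 3.3
observed_min, observed_max = 2.0, 4.4

iqr = q3 - q1                  # interquartile range: the height of the box
lower_fence = q1 - 1.5 * iqr   # values below this are drawn as outliers
upper_fence = q3 + 1.5 * iqr   # values above this are drawn as outliers

print(f"IQR={iqr:.2f}, fences=({lower_fence:.2f}, {upper_fence:.2f})")
print("min flagged as outlier:", observed_min < lower_fence)
print("max flagged as outlier:", observed_max > upper_fence)
```

Under this convention both the smallest and the largest sepal width fall outside the fences, so a box plot of that column would draw them as individual points.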

### 3. Correlation

Correlation is a statistical measure of the extent to which two variables change together. The most common version, the Pearson coefficient, ranges from -1 to 1.

- **Positive Correlation:** As one variable increases, the other tends to increase as well.
- **Negative Correlation:** As one variable increases, the other tends to decrease.
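
The sketch below computes the Pearson coefficient by hand for the sepal length and width of the first eight rows of the iris data. The helper `pearson` is a hypothetical name written for illustration; in practice pandas offers `DataFrame.corr()` for the same purpose.

```python
import math

# Sepal length and width (cm) for the first eight rows of the iris data.
length = [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0]
width = [3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

r = pearson(length, width)
print(f"r = {r:.3f}")  # a positive r means the two measurements rise together
```

Eight rows of a single species are far too few to draw conclusions from; the point is only to show what the coefficient measures.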

### 4. Handling Missing Data

Handling missing values matters because many algorithms either fail outright or give misleading results when the data contains them.

- **Drop:** Remove records with missing values.
- **Impute:** Replace missing values using methods like mean, median, or mode.
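
A minimal sketch of both strategies in pandas follows. The iris data has no missing values, so the frame `toy` below is a hypothetical example with gaps added on purpose.

```python
import pandas as pd

# Hypothetical toy frame with gaps; the iris data itself has no missing values.
toy = pd.DataFrame({
    "SepalLengthCm": [5.1, None, 4.7, 4.6],
    "SepalWidthCm": [3.5, 3.0, None, 3.1],
})

dropped = toy.dropna()              # Drop: keep only the fully observed rows
imputed = toy.fillna(toy.median())  # Impute: fill each gap with its column median

print(dropped)
print(imputed)
```

Dropping is simple but discards two of the four rows here, while imputing keeps every row at the cost of inventing plausible values. Which trade-off is acceptable depends on how much data is missing and why.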

## Benefits of EDA

- **Better Understanding of Data:** Deep insights into the nature and structure of data.
- **Informed Decision Making:** Equips analysts and stakeholders with a thorough understanding to make data-driven decisions.
- **Building Better Models:** Helps in feature engineering and model selection for better prediction and classification tasks.

## Conclusion

EDA is a critical step in the data science pipeline. A well-executed EDA informs subsequent steps and can often be the difference between a successful model and a flawed one. Always prioritize EDA to ensure a deep, thorough understanding of any dataset you're working with.


In [6]:
# Importing libraries in Python
import sklearn.datasets as datasets
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Loading the iris dataset
iris = pd.read_csv('iris.csv')

- `df.head()`: Returns the first few rows (default is 5) of the DataFrame `df`.

In [7]:
iris.head()   # first 5 rows (not displayed: a cell shows only its last expression)
iris.head(8)  # first 8 rows

| | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|---|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| 5 | 6 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
| 6 | 7 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
| 7 | 8 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa |


- `df.tail()`: Returns the last few rows (default is 5) of the DataFrame `df`.

In [8]:
iris.tail()   # last 5 rows (not displayed: a cell shows only its last expression)
iris.tail(8)  # last 8 rows

| | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|---|
| 142 | 143 | 5.8 | 2.7 | 5.1 | 1.9 | Iris-virginica |
| 143 | 144 | 6.8 | 3.2 | 5.9 | 2.3 | Iris-virginica |
| 144 | 145 | 6.7 | 3.3 | 5.7 | 2.5 | Iris-virginica |
| 145 | 146 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
| 146 | 147 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
| 147 | 148 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
| 148 | 149 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
| 149 | 150 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica |


- `df.shape`: Provides a tuple representing the dimensionality (rows, columns) of the DataFrame `df`.


In [11]:
iris.shape

(150, 6)

- `df.info()`: Prints a concise summary of the DataFrame, including each column's data type and non-null count, plus the overall memory usage.


In [12]:
iris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             150 non-null    int64  
 1   SepalLengthCm  150 non-null    float64
 2   SepalWidthCm   150 non-null    float64
 3   PetalLengthCm  150 non-null    float64
 4   PetalWidthCm   150 non-null    float64
 5   Species        150 non-null    object 
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB


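- `df.describe()`: Provides descriptive statistics (count, mean, standard deviation, minimum, quartiles, maximum) for the numeric columns of the DataFrame `df`. Here `Id` is only a row identifier, so its statistics carry no meaning.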


In [13]:
iris.describe()

| | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm |
|---|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 75.5 | 5.843333 | 3.054 | 3.758667 | 1.198667 |
| std | 43.445368 | 0.828066 | 0.433594 | 1.76442 | 0.763161 |
| min | 1.0 | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 38.25 | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 75.5 | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 112.75 | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 150.0 | 7.9 | 4.4 | 6.9 | 2.5 |


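- `df.isnull()`: Returns a DataFrame of the same shape as `df` but with `True` for missing values and `False` for non-missing values. Chaining `.sum()` adds up the `True` entries in each column, giving the number of missing values per column.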


In [17]:
iris.isnull().sum()

Id               0
SepalLengthCm    0
SepalWidthCm     0
PetalLengthCm    0
PetalWidthCm     0
Species          0
dtype: int64

- `df.drop_duplicates()`: Removes duplicate rows from the DataFrame `df`, keeping the first occurrence by default.


In [20]:
df1 = iris.drop_duplicates()
print(df1.shape)
iris.shape

(150, 6)


(150, 6)
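
Because `Id` appears to number the rows from 1 to 150, no two complete rows can be identical, so the check above cannot reveal repeated measurements. The sketch below uses a hypothetical toy frame to show how the `subset` parameter compares only the columns of interest.

```python
import pandas as pd

# Hypothetical toy frame: rows 0 and 2 share the same measurements but differ in Id.
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "SepalLengthCm": [5.1, 4.9, 5.1],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-setosa"],
})

measure_cols = ["SepalLengthCm", "Species"]

all_cols = toy.drop_duplicates()                    # Id keeps every row distinct
n_repeats = toy.duplicated(subset=measure_cols).sum()  # repeats once Id is ignored
deduped = toy.drop_duplicates(subset=measure_cols)  # keeps the first occurrence

print(len(all_cols), n_repeats, deduped["Id"].tolist())
```

Whether repeated measurements should be removed at all is a modelling decision: two different flowers can legitimately have the same dimensions.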

- `df.value_counts()`: Returns a Series with the count of each unique row, or of each unique combination of the columns passed as `subset`, sorted in descending order.


In [26]:
iris.value_counts('Species')


Species
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: count, dtype: int64
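
Counts are often easier to compare as proportions. As a small sketch, `value_counts(normalize=True)` returns relative frequencies instead; the Series `species` below is a hypothetical toy column, not the real data.

```python
import pandas as pd

# Hypothetical toy labels; the real column has 50 rows per species.
species = pd.Series(
    ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-virginica"],
    name="Species",
)

counts = species.value_counts()                # absolute frequencies
shares = species.value_counts(normalize=True)  # relative frequencies, summing to 1

print(counts)
print(shares)
```

On the real dataset each species would have a share of one third, which confirms that the classes are balanced.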

- `df.sample()`: Returns a random sample of items (rows by default) from the DataFrame `df`.


In [27]:
iris.sample(10)

| | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|---|
| 43 | 44 | 5.0 | 3.5 | 1.6 | 0.6 | Iris-setosa |
| 67 | 68 | 5.8 | 2.7 | 4.1 | 1.0 | Iris-versicolor |
| 111 | 112 | 6.4 | 2.7 | 5.3 | 1.9 | Iris-virginica |
| 61 | 62 | 5.9 | 3.0 | 4.2 | 1.5 | Iris-versicolor |
| 37 | 38 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa |
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 116 | 117 | 6.5 | 3.0 | 5.5 | 1.8 | Iris-virginica |
| 93 | 94 | 5.0 | 2.3 | 3.3 | 1.0 | Iris-versicolor |
| 119 | 120 | 6.0 | 2.2 | 5.0 | 1.5 | Iris-virginica |
| 89 | 90 | 5.5 | 2.5 | 4.0 | 1.3 | Iris-versicolor |
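
Each call to `sample()` draws different rows, so the table above will not be reproduced on a rerun. A minimal sketch of making the draw repeatable with the `random_state` parameter, using a hypothetical toy frame in place of `iris`:

```python
import pandas as pd

# Hypothetical toy frame standing in for iris.
toy = pd.DataFrame({"Id": range(1, 11), "value": [x * 0.5 for x in range(10)]})

first = toy.sample(n=3, random_state=42)         # fixed seed: same rows on every run
second = toy.sample(n=3, random_state=42)
fraction = toy.sample(frac=0.2, random_state=0)  # sample 20% of the rows instead of a count

print(first.equals(second))
print(len(fraction))
```

The seed values 42 and 0 are arbitrary; any fixed integer gives a repeatable sample.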


- `df.nlargest(n, 'column')`: Returns the `n` rows with the largest values in the specified column, sorted in descending order.


In [31]:
iris.nlargest(10, "SepalLengthCm")

| | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|---|
| 131 | 132 | 7.9 | 3.8 | 6.4 | 2.0 | Iris-virginica |
| 117 | 118 | 7.7 | 3.8 | 6.7 | 2.2 | Iris-virginica |
| 118 | 119 | 7.7 | 2.6 | 6.9 | 2.3 | Iris-virginica |
| 122 | 123 | 7.7 | 2.8 | 6.7 | 2.0 | Iris-virginica |
| 135 | 136 | 7.7 | 3.0 | 6.1 | 2.3 | Iris-virginica |
| 105 | 106 | 7.6 | 3.0 | 6.6 | 2.1 | Iris-virginica |
| 130 | 131 | 7.4 | 2.8 | 6.1 | 1.9 | Iris-virginica |
| 107 | 108 | 7.3 | 2.9 | 6.3 | 1.8 | Iris-virginica |
| 109 | 110 | 7.2 | 3.6 | 6.1 | 2.5 | Iris-virginica |
| 125 | 126 | 7.2 | 3.2 | 6.0 | 1.8 | Iris-virginica |


- `df.nsmallest(n, 'column')`: Returns the `n` rows with the smallest values in the specified column, sorted in ascending order.


In [30]:
iris.nsmallest(10, "SepalLengthCm")

| | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|---|
| 13 | 14 | 4.3 | 3.0 | 1.1 | 0.1 | Iris-setosa |
| 8 | 9 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
| 38 | 39 | 4.4 | 3.0 | 1.3 | 0.2 | Iris-setosa |
| 42 | 43 | 4.4 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 41 | 42 | 4.5 | 2.3 | 1.3 | 0.3 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 6 | 7 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
| 22 | 23 | 4.6 | 3.6 | 1.0 | 0.2 | Iris-setosa |
| 47 | 48 | 4.6 | 3.2 | 1.4 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
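
Both `nlargest` and `nsmallest` cut off tied values at `n` rows by default. The sketch below shows how the `keep` parameter changes that, using the six largest `SepalLengthCm` values from the `nlargest` output, where four rows tie at 7.7. The frame `toy` is a reduced, illustrative copy of those rows.

```python
import pandas as pd

# Six largest SepalLengthCm values from the iris data, with four rows tied at 7.7.
toy = pd.DataFrame(
    {"SepalLengthCm": [7.9, 7.7, 7.7, 7.7, 7.7, 7.6]},
    index=[131, 117, 118, 122, 135, 105],
)

top_first = toy.nlargest(2, "SepalLengthCm")            # default keep="first": exactly 2 rows
top_all = toy.nlargest(2, "SepalLengthCm", keep="all")  # keeps every row tied with the cutoff

print(top_first.index.tolist())
print(top_all.index.tolist())
```

With `keep="all"` the result can contain more than `n` rows, which avoids silently dropping rows that are just as extreme as the ones returned.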


- `df.loc[]`: Accesses a group of rows and columns by labels or boolean array.


In [45]:
iris.loc[iris["SepalLengthCm"]<7.9]

| | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|---|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| ... | ... | ... | ... | ... | ... | ... |
| 145 | 146 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
| 146 | 147 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
| 147 | 148 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
| 148 | 149 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |


- `df.iloc[]`: Accesses a group of rows and columns by zero-based integer position.


In [35]:
iris.iloc[1:4,1:3]

| | SepalLengthCm | SepalWidthCm |
|---|---|---|
| 1 | 4.9 | 3.0 |
| 2 | 4.7 | 3.2 |
| 3 | 4.6 | 3.1 |
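
One difference between the two indexers is easy to miss: a slice in `loc` includes its end label, while a slice in `iloc` excludes its end position. The sketch below contrasts them on the first five rows of the iris data and also shows a combined boolean filter. The frame `toy` is a reduced copy built only for illustration.

```python
import pandas as pd

# First five rows of the iris data (three columns only).
toy = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5],
    "SepalLengthCm": [5.1, 4.9, 4.7, 4.6, 5.0],
    "SepalWidthCm": [3.5, 3.0, 3.2, 3.1, 3.6],
})

by_label = toy.loc[1:3, ["SepalLengthCm", "SepalWidthCm"]]  # labels 1, 2, 3 (end included)
by_position = toy.iloc[1:3, 1:3]                            # positions 1, 2 (end excluded)

# Boolean masks combine with & and |; each condition needs its own parentheses.
mask = (toy["SepalLengthCm"] < 5.0) & (toy["SepalWidthCm"] > 3.0)
filtered_ids = toy.loc[mask, "Id"]

print(by_label.index.tolist(), by_position.index.tolist(), filtered_ids.tolist())
```

The row labels and positions coincide here only because the frame uses the default `RangeIndex`; after filtering or sorting they diverge, and the choice between `loc` and `iloc` starts to matter.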
