This project explores the Iris dataset to understand its structure, summarize key statistics, and practice foundational data analysis using Python and Pandas.

In [2]:
# importing necessary libraries.

import pandas as pd
import matplotlib.pyplot as plt

# loading data from a CSV file.

data = pd.read_csv('iris.csv')
data.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


# What each row and column represents.

**Rows**
- **Observation**: Each row is one measured iris flower (a single sample/record).

**Columns**
- `sepal_length`: Numeric (cm) — length of the sepal.
- `sepal_width`: Numeric (cm) — width of the sepal.
- `petal_length`: Numeric (cm) — length of the petal.
- `petal_width`: Numeric (cm) — width of the petal.
- `species`: Categorical — iris species label (e.g., setosa, versicolor, virginica).

In [None]:
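# understanding the structure of the dataset.
# Note: a notebook cell only displays the value of its last expression,
# so data.shape and data.columns are evaluated here but not shown.

data.shape    # (number of rows, number of columns)
data.columns  # column labels
data.dtypes   # data type of each column (displayed below)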


sepal_length    float64
sepal_width     float64
petal_length    float64
petal_width     float64
species          object
dtype: object

# Why data types matter

Choosing correct data types helps ensure accurate analysis, efficient computation, and proper handling by visualizations and models.

- **Correct operations**: Numeric columns enable aggregation and statistics (mean, sum, std); categorical columns support grouping and counts.
- **Accurate analysis**: Incorrect types (e.g., numbers stored as strings) can lead to misleading summaries or runtime errors.
- **Visualization & modeling**: Plots and machine learning pipelines expect appropriate types; types determine encoding and scaling decisions.
- **Performance & storage**: Suitable types reduce memory usage and improve computation speed.

**Numeric columns**:
- `sepal_length`, `sepal_width`, `petal_length`, `petal_width` (all in cm)

**Categorical column**:
- `species` — species label (e.g., setosa, versicolor, virginica)
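
The points above can be illustrated with a minimal sketch. It uses a small hypothetical DataFrame (`raw`, not the Iris file) in which a measurement column was read as text, and shows one way to restore suitable types with `pd.to_numeric` and `astype("category")`. The column values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: petal_width was stored as text by mistake.
raw = pd.DataFrame({
    "petal_width": ["0.2", "1.3", "2.5"],
    "species": ["setosa", "versicolor", "virginica"],
})

# A text column is not numeric, so describe() and similar tools skip or mishandle it.
is_numeric_before = pd.api.types.is_numeric_dtype(raw["petal_width"])

# Convert to suitable types: float for measurements, category for labels.
fixed = raw.copy()
fixed["petal_width"] = pd.to_numeric(fixed["petal_width"])
fixed["species"] = fixed["species"].astype("category")

is_numeric_after = pd.api.types.is_numeric_dtype(fixed["petal_width"])
mean_width = fixed["petal_width"].mean()

print(is_numeric_before, is_numeric_after)
print(fixed.dtypes)
```

In the Iris data shown earlier, the four measurement columns are already `float64`, so only `species` would be a candidate for conversion to `category`.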

In [None]:
# basic statistics.

data.describe()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
count,150.0,150.0,150.0,150.0
mean,5.843333,3.054,3.758667,1.198667
std,0.828066,0.433594,1.76442,0.763161
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


In [8]:
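# Only the last expression in a cell is displayed, so the output below is the median.
data.mean(numeric_only=True)    # column means (evaluated but not displayed)
data.median(numeric_only=True)  # column medians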


sepal_length    5.80
sepal_width     3.00
petal_length    4.35
petal_width     1.30
dtype: float64

# Understanding Mean and Median.

**Mean (arithmetic average)** — the sum of all values divided by the number of observations. Because it uses every data point, the mean summarizes central tendency well when values are roughly symmetric, but it is sensitive to extreme values.

**Median (50th percentile)** — the middle value when observations are sorted. The median represents the central position and is robust to extreme values (outliers).

# Why the mean and median can differ

- **Outliers:** Extreme values pull the mean toward them but have little effect on the median.
- **Skewed distributions:** For right-skewed data the mean is typically greater than the median; for left-skewed data the mean is typically less than the median.
- **Multimodality:** When data have multiple peaks, the mean can lie between modes and may not reflect a typical observation; the median may better reflect a central split, depending on context.
- **Sample size and discreteness:** In small or discrete samples a single value can shift the mean more than the median.
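
As a small illustration of the outlier effect described above, the sketch below compares two hypothetical series (values chosen for the example, not taken from `iris.csv`) that differ in a single extreme value.

```python
import pandas as pd

# Hypothetical values for illustration only.
symmetric = pd.Series([4.0, 5.0, 6.0, 7.0, 8.0])
with_outlier = pd.Series([4.0, 5.0, 6.0, 7.0, 80.0])  # last value replaced by an outlier

# For symmetric data the mean and median coincide.
sym_mean, sym_median = symmetric.mean(), symmetric.median()

# One extreme value pulls the mean upward; the median does not move.
out_mean, out_median = with_outlier.mean(), with_outlier.median()

print(sym_mean, sym_median)
print(out_mean, out_median)
```

Here the single outlier raises the mean from 6.0 to about 20.4 while the median stays at 6.0.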

# Practical guidance.
- Report both: give the mean (with standard deviation) and the median (with interquartile range) to summarize center and spread.
- Prefer the median when data are skewed or contain outliers; prefer the mean for symmetric distributions and for methods built on the mean (e.g., t-tests, ordinary least squares).
- Use visual checks (histogram, boxplot) to decide which measure better represents your data.
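
One possible way to report both pairs of statistics is sketched below. The helper name `center_and_spread` and the sample values are hypothetical; in the notebook it could be applied to a column such as `data['petal_length']`.

```python
import pandas as pd

def center_and_spread(values: pd.Series) -> dict:
    """Return mean/std and median/IQR for a numeric Series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    return {
        "mean": values.mean(),
        "std": values.std(),      # sample standard deviation (ddof=1)
        "median": values.median(),
        "iqr": q3 - q1,           # interquartile range
    }

# Hypothetical sample for illustration.
sample = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
summary = center_and_spread(sample)
print(summary)
```

A large gap between the mean and the median in such a summary is a hint to inspect a histogram or boxplot before choosing which measure to report.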

In [9]:
# frequencies of each species.

data['species'].value_counts()

species
setosa        50
versicolor    50
virginica     50
Name: count, dtype: int64

# Why Frequency and Balance Matter

**Frequency (counts and proportions)** — the number of observations in each category and the share they represent in the dataset. Reporting frequency is the first step in understanding your data: it reveals which categories are common, which are rare, and whether sampling or measurement biases may be present.

Why frequency matters:
- **Representativeness:** Counts show whether your sample adequately covers the population or subgroups of interest.
- **Statistical power:** Estimates for rare categories rest on few observations, so they have higher variance, and comparisons involving them have lower power to detect effects or differences.
- **Estimator stability:** Many statistics (means, variances) and model parameters are unstable when computed from very small groups.
- **Model behavior:** Supervised models tend to favor majority classes unless measures are taken (class weighting, resampling).

**Balance in categories** — when categories have similar counts (or proportions). Balance is desirable in many settings because it reduces bias and gives models and statistical tests adequate information for each group.

What balance (or imbalance) implies:
- **Balanced:** Models can learn patterns for each class reliably; evaluation metrics reflect performance across classes.
- **Imbalanced:** A model may achieve high overall accuracy by predicting the majority class while performing poorly on minority classes; common metrics (accuracy) become misleading.
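
The imbalanced case can be made concrete with a short sketch. The labels below are hypothetical (95 "common" and 5 "rare"), and the "model" is simply a rule that always predicts the majority class; balanced accuracy is computed here as the average of per-class recall.

```python
# Hypothetical imbalanced labels, for illustration only.
y_true = ["common"] * 95 + ["rare"] * 5
# A trivial "model" that always predicts the majority class.
y_pred = ["common"] * 100

# Plain accuracy: fraction of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(label):
    """Fraction of true `label` observations that were predicted as `label`."""
    predictions = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in predictions) / len(predictions)

# Balanced accuracy: average recall over the classes.
balanced_accuracy = (recall("common") + recall("rare")) / 2

print(accuracy, balanced_accuracy)
```

Accuracy comes out at 0.95 even though no "rare" observation is ever identified, while balanced accuracy is 0.5, the same as guessing.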

Practical guidance:
- Always report raw counts and proportions (e.g., `value_counts()` and `value_counts(normalize=True)`).
- Visualize frequencies with bar plots to surface imbalance quickly.
- For modeling, use stratified train/test splits, consider class weights, or apply resampling (oversample minority / undersample majority) when appropriate.
- Use evaluation metrics robust to imbalance (precision/recall, F1-score, balanced accuracy, confusion matrix).

Checking frequency and balance early prevents incorrect conclusions and guides appropriate preprocessing and modeling choices.
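
The guidance above can be sketched as follows. The DataFrame `df` is a hypothetical stand-in with the same class counts as the notebook's `data`, and the stratified split uses pandas' `groupby(...).sample(...)` as one possible approach (it assumes pandas 1.1 or later); other tools, such as a dedicated train/test split utility, would work as well.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `data`, with 50 rows per species.
df = pd.DataFrame(
    {"species": ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50}
)

# Raw counts and proportions per category.
counts = df["species"].value_counts()
proportions = df["species"].value_counts(normalize=True)

# Stratified 20% test split: draw the same fraction from every species.
test = df.groupby("species").sample(frac=0.2, random_state=0)
train = df.drop(test.index)

print(counts)
print(proportions)
print(test["species"].value_counts())
```

Because the same fraction is drawn from each species, the test set keeps the class proportions of the full dataset.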