# Descriptive Statistics with Python

Descriptive statistics summarize and describe the main features of a dataset.
Measures of central tendency show where the data is centered, and measures of dispersion show how widely it is spread.

In this notebook, we will explore descriptive statistics using the famous **Iris dataset**, which contains measurements of three species of iris flowers. We will use `pandas` and `numpy` for calculations, and `matplotlib` and `seaborn` for visualizations.

Let's begin!


## Understanding the Dataset

The **Iris dataset** contains information about the sepal length, sepal width, petal length, and petal width of 150 iris flowers. 
These flowers belong to three species: *setosa*, *versicolor*, and *virginica*.

We will calculate descriptive statistics for each feature and visualize the data.


## Measures of Central Tendency

Central tendency measures identify the center of a dataset. 
The **mean**, **median**, and **mode** are the most commonly used measures.

- **Mean**: The arithmetic average of all values. It is sensitive to extreme values.
- **Median**: The middle value when the data is sorted (the average of the two middle values when the count is even). It is robust to extreme values.
- **Mode**: The most frequently occurring value in the dataset. A dataset can have more than one mode.

Let's calculate these measures for the sepal lengths of the flowers.
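
As a minimal sketch of these three measures, the cell below uses only the five sepal lengths visible in this notebook's `df.head()` output, so that it runs on its own. That small sample is an assumption made for illustration; on the full dataset you would use `df['sepal_length']` instead.

```python
import pandas as pd

# Sepal lengths from the five rows shown in the notebook's df.head() output.
# Assumption: this tiny sample stands in for the full df['sepal_length'] column.
sepal_length = pd.Series([5.1, 4.9, 4.7, 4.6, 5.0], name="sepal_length")

mean_value = sepal_length.mean()      # arithmetic average
median_value = sepal_length.median()  # middle value after sorting
mode_values = sepal_length.mode()     # a Series, because ties produce several modes

print(f"Mean:    {mean_value:.2f}")
print(f"Median:  {median_value:.2f}")
print(f"Mode(s): {mode_values.tolist()}")
```

Note that `Series.mode()` returns every value tied for the highest count. In this sample each value occurs once, so all five come back, which is a reminder that the mode is only informative when some values repeat.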


## Measures of Dispersion

Dispersion measures describe how data points are spread around the central value. 
Important metrics include:

- **Range**: The difference between the maximum and minimum values.
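- **Variance**: The average of squared differences from the mean. The sample variance divides by n − 1 instead of n; `pandas` uses n − 1 by default, while `numpy` uses n.
- **Standard Deviation**: The square root of the variance. It is expressed in the same units as the data and shows how far values typically lie from the mean.
- **Interquartile Range (IQR)**: The difference between the third quartile (75th percentile) and the first quartile (25th percentile).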

These metrics quantify data variability. Low dispersion means the values cluster tightly around the center, while high dispersion means they are spread widely.
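
The metrics above can be sketched as follows. As an assumption for illustration, the cell uses only the five sepal lengths visible in this notebook's `df.head()` output, so the numbers describe that sample and not the full dataset.

```python
import numpy as np
import pandas as pd

# Assumption: five sepal lengths taken from the notebook's df.head() output.
sepal_length = pd.Series([5.1, 4.9, 4.7, 4.6, 5.0], name="sepal_length")

# Range: maximum minus minimum
value_range = sepal_length.max() - sepal_length.min()

# Variance: pandas divides by n - 1 (ddof=1) by default,
# while np.var divides by n (ddof=0) by default.
sample_variance = sepal_length.var()
population_variance = np.var(sepal_length.to_numpy())

# Standard deviation: square root of the (sample) variance
sample_std = sepal_length.std()

# Interquartile range: Q3 - Q1
q1, q3 = sepal_length.quantile([0.25, 0.75])
iqr = q3 - q1

print(f"Range:               {value_range:.3f}")
print(f"Sample variance:     {sample_variance:.4f}")
print(f"Population variance: {population_variance:.4f}")
print(f"Sample std dev:      {sample_std:.4f}")
print(f"IQR:                 {iqr:.3f}")
```

The two variance values differ only in the divisor, which matters most for small samples like this one.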


## Percentiles and Quartiles

Percentiles and quartiles divide data into intervals:

- **Percentiles**: Data is divided into 100 equal parts. The p-th percentile is the value below which roughly p% of the observations fall.
- **Quartiles**: Data is divided into 4 equal parts.
  - Q1: 25th percentile
  - Q2: 50th percentile (median)
  - Q3: 75th percentile

The **Interquartile Range (IQR)**, defined as Q3 − Q1, is commonly used to identify outliers: values that fall below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR are flagged as potential outliers.
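
Here is a small sketch of quartiles and IQR-based outlier detection. It follows a common convention (Tukey's rule) that flags values more than 1.5 × IQR below Q1 or above Q3. The helper `iqr_outliers` and its parameter `k` are hypothetical names introduced for this example, and the data is assumed to be the five sepal lengths visible in this notebook's `df.head()` output.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, outliers) using the k * IQR rule.

    Hypothetical helper for illustration; k=1.5 is the usual convention.
    """
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    mask = (values < lower) | (values > upper)
    return lower, upper, values[mask]

# Assumption: five sepal lengths taken from the notebook's df.head() output.
sepal_length = [5.1, 4.9, 4.7, 4.6, 5.0]

# Quartiles are the 25th, 50th, and 75th percentiles
q1, q2, q3 = np.percentile(sepal_length, [25, 50, 75])
lower, upper, outliers = iqr_outliers(sepal_length)

print(f"Q1={q1:.2f}, Q2 (median)={q2:.2f}, Q3={q3:.2f}")
print(f"Fences: [{lower:.2f}, {upper:.2f}]")
print(f"Potential outliers: {outliers.tolist()}")
```

In this sample no value falls outside the fences, so the outlier list is empty. `np.percentile` interpolates linearly between data points by default, so other tools may report slightly different quartiles on small samples.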


## Visualizing Data with Plots

Visualizations are essential for understanding data distribution and patterns. 
We will use the following plots:

- **Histograms**: Show the frequency distribution of numerical data.
- **Box Plots**: Highlight the median, quartiles, and potential outliers.
- **Scatter Plots**: Display relationships between two variables.
- **Pie Charts**: Represent proportions of categories in a dataset.

Visualizations help in identifying trends and anomalies.
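
The four plot types can be sketched with `matplotlib` alone (a simplification; `seaborn` offers higher-level equivalents). As an assumption for illustration, the cell rebuilds a tiny DataFrame from the five rows visible in this notebook's `df.head()` output, so the pie chart shows a single species; with the full `df` it would show all three.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Assumption: the five rows shown in the notebook's df.head() output.
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 4.6, 5.0],
    "petal_length": [1.4, 1.4, 1.3, 1.5, 1.4],
    "species": ["setosa"] * 5,
})

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Histogram: frequency distribution of one numerical feature
axes[0, 0].hist(df["sepal_length"].to_numpy(), bins=5)
axes[0, 0].set_title("Histogram of sepal length")

# Box plot: median, quartiles, and potential outliers
axes[0, 1].boxplot(df["sepal_length"].to_numpy())
axes[0, 1].set_title("Box plot of sepal length")

# Scatter plot: relationship between two variables
axes[1, 0].scatter(df["sepal_length"], df["petal_length"])
axes[1, 0].set_xlabel("sepal_length")
axes[1, 0].set_ylabel("petal_length")
axes[1, 0].set_title("Sepal length vs. petal length")

# Pie chart: proportion of each category
species_counts = df["species"].value_counts()
axes[1, 1].pie(species_counts.to_numpy(),
               labels=species_counts.index.tolist(),
               autopct="%1.0f%%")
axes[1, 1].set_title("Species proportions")

fig.tight_layout()
# In a notebook the figure renders automatically; in a script, call plt.show()
# with an interactive backend or fig.savefig(...) to write it to disk.
```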


## Summary

In this notebook, descriptive statistics were applied to the **Iris dataset**. 
We calculated measures of central tendency and dispersion, analyzed percentiles and quartiles, and created visualizations.

Understanding these concepts is essential for data exploration and decision-making. They form the foundation for advanced statistical methods and machine learning models.


```python
from sklearn.datasets import load_iris
import pandas as pd

# Load Iris dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target

# Rename columns for clarity
df.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

# Add species names
df['species'] = df['species'].apply(lambda x: iris.target_names[x])

# Display the first few rows
df.head()
```


|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
