<a href="https://colab.research.google.com/github/Skidmark156/username-DataScience-2025/blob/main/completed/02_Summary_Statistics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📈 Summary Statistics

Summary statistics help us describe a dataset with just a few numbers.
Instead of staring at thousands of rows, we can quickly understand **center, spread, and shape**.

This notebook covers:
- Measures of central tendency (mean, median, mode)
- Measures of spread (range, variance, standard deviation, IQR)
- Skewness and kurtosis (just enough to impress your friends)
- Quick summaries with Pandas


## 1. Central Tendency

These describe the 'middle' of the data:
- **Mean** – arithmetic average (sum of the values divided by their count)
- **Median** – middle value of the sorted data
- **Mode** – most frequent value

In [None]:
import numpy as np
import pandas as pd
from scipy import stats

data = [5, 7, 8, 5, 10, 12, 7, 7, 6, 9]

print("Mean:", np.mean(data))
print("Median:", np.median(data))
print("Mode:", stats.mode(data, keepdims=True)[0][0])

Mean: 7.6
Median: 7.0
Mode: 7
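
A dataset can have more than one mode. As a minimal sketch (the `tied` list is made-up illustration data, not part of this notebook), `pd.Series.mode()` returns every value tied for the highest count, whereas `stats.mode` reports only one of them:

```python
import pandas as pd

# Hypothetical data in which 1 and 2 both appear twice
tied = [1, 1, 2, 2, 3]

# Series.mode() returns all values sharing the highest count, in sorted order
modes = pd.Series(tied).mode().tolist()
print("All modes:", modes)
```

If you only need a single representative value, taking the first element of this result is one reasonable choice.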


👉 **Question:** Which is more robust to outliers, the mean or the median?

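The median is more robust to outliers. It depends only on the middle of the sorted data, so a single extreme value barely moves it, whereas the mean is pulled toward every extreme value.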

## 2. Spread

Spread tells us how variable the data are.
- **Range**: max – min
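- **Variance**: average squared deviation from the mean (with `ddof=1`, the sample variance, the sum is divided by n – 1 instead of n)
- **Standard Deviation**: square root of variance (in original units)
- **Interquartile Range (IQR)**: width of the middle 50% of the data (Q3 – Q1)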

In [None]:
print("Range:", np.max(data) - np.min(data))
print("Variance:", np.var(data, ddof=1))
print("Standard Deviation:", np.std(data, ddof=1))
print("IQR:", stats.iqr(data))
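
To make the definitions above concrete, here is a rough sketch that computes the variance and IQR by hand with plain NumPy. The variable names (`squared_dev`, `pop_var`, `sample_var`) are illustrative choices, and the quartiles assume NumPy's default linear interpolation:

```python
import numpy as np

data = [5, 7, 8, 5, 10, 12, 7, 7, 6, 9]
x = np.asarray(data, dtype=float)
n = x.size

# Squared deviation of every point from the mean
squared_dev = (x - x.mean()) ** 2

pop_var = squared_dev.sum() / n           # population variance (ddof=0)
sample_var = squared_dev.sum() / (n - 1)  # sample variance (ddof=1)
sample_std = np.sqrt(sample_var)          # back in the original units

# IQR as the distance between the 75th and 25th percentiles
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1

print("Population variance:", pop_var)
print("Sample variance:", sample_var)
print("Sample std dev:", sample_std)
print("IQR:", iqr)
```

The squared deviations sum to 44.4, so dividing by n – 1 gives 44.4 / 9 ≈ 4.93, while dividing by n gives 4.44. The gap shrinks as the dataset grows.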

👉 **Exercise:** Add an extreme outlier (e.g., 100) to the dataset and see how the mean, median, and standard deviation change.

In [None]:
# Original dataset
import numpy as np
data = [5, 7, 8, 5, 10, 12, 7, 7, 6, 9]

print("Mean:", np.mean(data))
print("Median:", np.median(data))
print("Std Dev:", np.std(data, ddof=1))

# Add an outlier
data_outlier = data + [100]
print("\nWith Outlier:")
print("Mean:", np.mean(data_outlier))
print("Median:", np.median(data_outlier))
print("Std Dev:", np.std(data_outlier, ddof=1))

Mean: 7.6
Median: 7.0
Std Dev: 2.2211108331943574

With Outlier:
Mean: 16.0
Median: 7.0
Std Dev: 27.939219745726614
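
The same experiment can be extended to the IQR. The sketch below uses a small helper, `iqr`, which is a hypothetical convenience function written for this example (it assumes NumPy's default percentile interpolation):

```python
import numpy as np

data = [5, 7, 8, 5, 10, 12, 7, 7, 6, 9]
data_outlier = data + [100]

def iqr(values):
    """Hypothetical helper: Q3 - Q1 using NumPy's default linear interpolation."""
    q1, q3 = np.percentile(values, [25, 75])
    return q3 - q1

iqr_before = iqr(data)
iqr_after = iqr(data_outlier)
std_before = np.std(data, ddof=1)
std_after = np.std(data_outlier, ddof=1)

print("IQR:", iqr_before, "->", iqr_after)
print("Std Dev:", std_before, "->", std_after)
```

The IQR moves only from 2.5 to 3.0, while the standard deviation grows more than tenfold. Like the median, the IQR is a robust statistic.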


## 3. Shape: Skewness & Kurtosis

- **Skewness**: measures asymmetry (left/right tail)
- **Kurtosis**: measures how heavy the tails are (often loosely described as 'peakedness')

Most real-life datasets are *not* perfectly normal, so these help describe the difference.

In [None]:
print("Skewness:", stats.skew(data))
print("Kurtosis:", stats.kurtosis(data))

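👉 **Note:** High kurtosis means heavier tails, so extreme values are more likely; low kurtosis means lighter tails. `stats.kurtosis` reports *excess* kurtosis by default, so a normal distribution scores 0.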
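
For readers curious where these numbers come from, the following sketch computes both from central moments. It assumes SciPy's default settings (`bias=True`, and `fisher=True` for kurtosis); the names `m2`, `m3`, and `m4` are just labels for the second, third, and fourth central moments:

```python
import numpy as np
from scipy import stats

data = [5, 7, 8, 5, 10, 12, 7, 7, 6, 9]
x = np.asarray(data, dtype=float)
dev = x - x.mean()

# Central moments, each averaged over n (the "biased" estimators)
m2 = np.mean(dev ** 2)
m3 = np.mean(dev ** 3)
m4 = np.mean(dev ** 4)

skew_manual = m3 / m2 ** 1.5        # positive: longer right tail
excess_kurt = m4 / m2 ** 2 - 3.0    # subtract 3 so a normal distribution gives 0

print("Skewness:", skew_manual, "| SciPy:", stats.skew(data))
print("Excess kurtosis:", excess_kurt, "| SciPy:", stats.kurtosis(data))
```

The skewness comes out positive for this dataset, which matches the longer tail on the right (the value 12).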

## 4. Quick Summaries with Pandas

Instead of calling each function separately, you can let Pandas compute the most common statistics in one step with `.describe()`.

In [None]:
df = pd.DataFrame({"Values": data})
df.describe()

👉 **Task:** Use `.describe()` on another dataset (e.g., `penguins` from Seaborn or your own CSV).

In [None]:
import pandas as pd

df = pd.DataFrame({"Values": data})
df.describe()

| | Values |
|:--|--:|
| count | 10.0 |
| mean | 7.6 |
| std | 2.221111 |
| min | 5.0 |
| 25% | 6.25 |
| 50% | 7.0 |
| 75% | 8.75 |
| max | 12.0 |
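
To see how the `.describe()` rows relate to the functions used earlier, here is a small cross-check. The `checks` dictionary is an illustrative construct for this example. Note that the `std` row corresponds to the sample standard deviation (`ddof=1`):

```python
import numpy as np
import pandas as pd

data = [5, 7, 8, 5, 10, 12, 7, 7, 6, 9]
summary = pd.DataFrame({"Values": data})["Values"].describe()

# Each describe() row paired with the equivalent NumPy call
checks = {
    "mean": np.mean(data),
    "std": np.std(data, ddof=1),    # sample standard deviation
    "25%": np.percentile(data, 25),
    "50%": np.median(data),
    "75%": np.percentile(data, 75),
}

for name, expected in checks.items():
    print(name, summary[name], np.isclose(summary[name], expected))
```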


---
✅ That’s it for summary stats! Next up → [Matplotlib Basics](03-Matplotlib_Basics.ipynb)