<a href="https://colab.research.google.com/github/Riley-Hoang/3603-Programming-for-Data-Science/blob/main/Assignments/07-Describing_and_Visualizing_Data%20/02_Summary_Statistics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📈 Summary Statistics

Summary statistics help us describe a dataset with just a few numbers.
Instead of staring at thousands of rows, we can quickly understand **center, spread, and shape**.

This notebook covers:
- Measures of central tendency (mean, median, mode)
- Measures of spread (range, variance, standard deviation, IQR)
- Skewness and kurtosis (just enough to impress your friends)
- Quick summaries with Pandas


## 1. Central Tendency

These describe the 'middle' of the data:
- **Mean** – arithmetic average
- **Median** – middle value of the sorted data
- **Mode** – most frequent value

In [1]:
import numpy as np
import pandas as pd
from scipy import stats

data = [5, 7, 8, 5, 10, 12, 7, 7, 6, 9]

print("Mean:", np.mean(data))
print("Median:", np.median(data))
print("Mode:", stats.mode(data, keepdims=True)[0][0])

Mean: 7.6
Median: 7.0
Mode: 7
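As a cross-check, the same three numbers can be computed with Python's built-in `statistics` module. This is only a sketch using the standard library, not the notebook's own method; the `tied` list is a made-up example showing that a dataset can have more than one mode.

```python
import statistics

data = [5, 7, 8, 5, 10, 12, 7, 7, 6, 9]

mean = statistics.mean(data)
median = statistics.median(data)
modes = statistics.multimode(data)  # list of every value tied for most frequent

print("Mean:", mean)
print("Median:", median)
print("Modes:", modes)

# Hypothetical data with two equally common values: both are reported.
tied = [1, 1, 2, 2, 3]
tied_modes = statistics.multimode(tied)
print("Modes of tied data:", tied_modes)
```

`multimode` returns a list, so ties are visible instead of silently resolved to a single value.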


👉 **Question:** Which is more robust to outliers, the mean or the median?

<p><strong>Answer:</strong> The <em>median</em> is more robust to outliers. Unlike the mean, which can be heavily influenced by extreme values, the median remains largely unaffected by unusually large or small numbers in the dataset.</p>


## 2. Spread

Spread tells us how variable the data are.
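- **Range**: max – min
- **Variance**: average squared deviation from the mean (the code below passes `ddof=1`, which gives the sample variance and divides by n − 1 instead of n)
- **Standard Deviation**: square root of the variance (in the original units)
- **Interquartile Range (IQR)**: width of the middle 50% of the data (Q3 – Q1)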

In [2]:
print("Range:", np.max(data) - np.min(data))
print("Variance:", np.var(data, ddof=1))
print("Standard Deviation:", np.std(data, ddof=1))
print("IQR:", stats.iqr(data))

Range: 7
Variance: 4.933333333333334
Standard Deviation: 2.2211108331943574
IQR: 2.5
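To see what these functions compute, here is a from-scratch sketch using only the standard library. The variable names are illustrative. It assumes quartiles are found by linear interpolation between sorted values, which is what `method="inclusive"` does in `statistics.quantiles` and what NumPy does by default.

```python
import math
import statistics

data = [5, 7, 8, 5, 10, 12, 7, 7, 6, 9]
n = len(data)
mean = sum(data) / n

# Sum of squared deviations from the mean.
ss = sum((x - mean) ** 2 for x in data)

pop_var = ss / n            # population variance (ddof=0)
sample_var = ss / (n - 1)   # sample variance (ddof=1)
sample_sd = math.sqrt(sample_var)

# Quartiles by linear interpolation; the IQR is Q3 minus Q1.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print("Range:", max(data) - min(data))
print("Population variance:", pop_var)
print("Sample variance:", sample_var)
print("Sample SD:", sample_sd)
print("IQR:", iqr)
```

The only difference between the two variances is the divisor: n for the population version, n − 1 for the sample version.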


👉 **Exercise:** Add an extreme outlier (e.g., 100) to the dataset and see how the mean, median, and standard deviation change.

In [6]:
import numpy as np
import pandas as pd
from scipy import stats

data = [5, 7, 8, 5, 10, 12, 7, 7, 6, 9]

# Original stats
print("Original Mean:", np.mean(data))
print("Original Median:", np.median(data))
print("Original SD:", np.std(data, ddof=1))

# Add outlier
data_out = data + [100]

print("\nWith Outlier Mean:", np.mean(data_out))
print("With Outlier Median:", np.median(data_out))
print("With Outlier SD:", np.std(data_out, ddof=1))


Original Mean: 7.6
Original Median: 7.0
Original SD: 2.2211108331943574

With Outlier Mean: 16.0
With Outlier Median: 7.0
With Outlier SD: 27.939219745726614
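The standard deviation reacts strongly to the outlier, while the IQR from Section 2 barely moves. The sketch below compares the two using only the standard library; the `iqr` helper is a hypothetical convenience function, not part of any library.

```python
import statistics

def iqr(values):
    """Q3 minus Q1, using linear interpolation between sorted values."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

data = [5, 7, 8, 5, 10, 12, 7, 7, 6, 9]
data_out = data + [100]

sd_before = statistics.stdev(data)      # sample SD (divides by n - 1)
sd_after = statistics.stdev(data_out)
iqr_before = iqr(data)
iqr_after = iqr(data_out)

print("SD:  before =", sd_before, " after =", sd_after)
print("IQR: before =", iqr_before, " after =", iqr_after)
```

Like the median, the IQR depends on the ordering of the middle values rather than on how far the extremes lie, which is why one outlier changes it so little.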


## 3. Shape: Skewness & Kurtosis

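- **Skewness**: measures asymmetry (positive = longer right tail, negative = longer left tail)
- **Kurtosis**: measures how heavy the tails are compared with a normal distribution (often loosely called 'peakedness')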

Most real-life datasets are *not* perfectly normal, so these help describe the difference.

In [3]:
print("Skewness:", stats.skew(data))
print("Kurtosis:", stats.kurtosis(data))

Skewness: 0.6618453051463379
Kurtosis: -0.4119795471146812


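👉 **Note:** `stats.kurtosis` returns *excess* kurtosis by default, so a normal distribution scores 0. Positive values mean heavier tails (more extreme outliers); negative values mean lighter tails and flatter, more boring data.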
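Both statistics are ratios of central moments. The sketch below rebuilds them by hand, assuming the SciPy defaults (biased moments that divide by n, and excess kurtosis that subtracts 3). `central_moment` is an illustrative helper, not a library function.

```python
data = [5, 7, 8, 5, 10, 12, 7, 7, 6, 9]
n = len(data)
mean = sum(data) / n

def central_moment(k):
    """Average of (x - mean)**k over the data, dividing by n."""
    return sum((x - mean) ** k for x in data) / n

m2 = central_moment(2)  # population variance
m3 = central_moment(3)
m4 = central_moment(4)

skewness = m3 / m2 ** 1.5          # third moment, standardized
excess_kurtosis = m4 / m2 ** 2 - 3  # fourth moment, standardized, minus 3

print("Skewness:", skewness)
print("Excess kurtosis:", excess_kurtosis)
```

Subtracting 3 shifts the scale so that a normal distribution lands at 0, which is why the value printed for this dataset is negative rather than a little under 3.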

## 4. Quick Summaries with Pandas

Instead of calling a separate function for each statistic, let Pandas compute them all at once with `.describe()`.

In [4]:
df = pd.DataFrame({"Values": data})
df.describe()

| | Values |
|---|---|
| count | 10.0 |
| mean | 7.6 |
| std | 2.221111 |
| min | 5.0 |
| 25% | 6.25 |
| 50% | 7.0 |
| 75% | 8.75 |
| max | 12.0 |
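The quartile rows are only the default. As a small sketch, `.describe()` also accepts a `percentiles` argument; the 10th and 90th percentiles here are an arbitrary choice for illustration.

```python
import pandas as pd

data = [5, 7, 8, 5, 10, 12, 7, 7, 6, 9]
df = pd.DataFrame({"Values": data})

# Ask for the 10th, 50th, and 90th percentiles instead of the quartiles.
summary = df.describe(percentiles=[0.1, 0.5, 0.9])
print(summary)

# Individual entries can be looked up by row label and column name.
median = summary.loc["50%", "Values"]
print("Median from describe():", median)
```

The result is an ordinary DataFrame, so its rows can be selected, rounded, or exported like any other table.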


👉 **Task:** Use `.describe()` on another dataset (e.g., `penguins` from Seaborn or your own CSV).

In [7]:
df = pd.read_csv("/content/sample_data/penguins.csv")
df.describe()

| | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
|---|---|---|---|---|
| count | 342.0 | 342.0 | 342.0 | 342.0 |
| mean | 43.92193 | 17.15117 | 200.915205 | 4201.754386 |
| std | 5.459584 | 1.974793 | 14.061714 | 801.954536 |
| min | 32.1 | 13.1 | 172.0 | 2700.0 |
| 25% | 39.225 | 15.6 | 190.0 | 3550.0 |
| 50% | 44.45 | 17.3 | 197.0 | 4050.0 |
| 75% | 48.5 | 18.7 | 213.0 | 4750.0 |
| max | 59.6 | 21.5 | 231.0 | 6300.0 |


---
✅ That’s it for summary stats! Next up → [Matplotlib Basics](03-Matplotlib_Basics.ipynb)