# Module 1.3: Basic Statistics for Data Science

Statistics is the science of collecting, analyzing, and interpreting data. It's the heart of data science, allowing us to move from raw numbers to meaningful insights. 📊

In this notebook, we'll focus on **Descriptive Statistics**, which is all about summarizing and describing the main features of a dataset.

**Goal of this Notebook:**
We'll learn how to calculate and interpret the most common measures used to understand data:

1.  **Measures of Central Tendency:** Mean, Median, and Mode.
2.  **Measures of Spread/Dispersion:** Variance and Standard Deviation.

Let's start by creating a simple dataset of people's ages using NumPy.

In [None]:
import numpy as np

# A dataset of ages
ages = np.array([25, 28, 22, 35, 45, 30, 28, 25, 60, 22])

print(f"Our dataset of ages: {ages}")

## 1. Measures of Central Tendency

These measures give us a single value that represents the 'center' or 'typical' value of a dataset.

### Mean (Average)
The mean is the most common measure of central tendency. It's calculated by summing all values and dividing by the number of values.

Formula: $ \text{Mean} = \frac{\sum_{i=1}^{n} x_i}{n} $

In [None]:
mean_age = np.mean(ages)
print(f"The mean (average) age is: {mean_age}")

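To connect the formula to the code, here is a minimal sketch that computes the mean step by step and compares it with `np.mean`. It redefines the same `ages` array so it can run on its own; the names `total`, `n`, and `manual_mean` are illustrative and not part of the original notebook.

```python
import numpy as np

# Same ages as the dataset defined earlier in this notebook
ages = np.array([25, 28, 22, 35, 45, 30, 28, 25, 60, 22])

total = ages.sum()        # numerator of the formula: the sum of all x_i
n = ages.size             # denominator: the number of values
manual_mean = total / n   # illustrative name, not from the original notebook

print(f"Sum = {total}, n = {n}, manual mean = {manual_mean}")
# The hand-computed value should agree with NumPy's result
print(f"Matches np.mean: {np.isclose(manual_mean, np.mean(ages))}")
```
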
### Median (Middle Value)
The middle value of a dataset when it's sorted. It's less affected by extreme outliers than the mean.

Example: For `[1, 2, 9]`, the median is `2`. For `[1, 2, 8, 10]`, which has an even number of values, it's the average of the two middle values `2` and `8`, which is `5`.

In [None]:
median_age = np.median(ages)
print(f"The median age is: {median_age}")
print(f"Sorted ages: {np.sort(ages)}")

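The sort-then-pick-the-middle rule can also be written out in plain Python. The sketch below is only an illustration of that rule (the helper name `median_by_hand` is hypothetical, not a NumPy or SciPy function), and it assumes a non-empty list of numbers.

```python
def median_by_hand(values):
    """Illustrative helper: median of a non-empty sequence of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value
        return ordered[mid]
    # Even count: the average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_example = median_by_hand([1, 2, 9])
even_example = median_by_hand([1, 2, 8, 10])
ages_median = median_by_hand([25, 28, 22, 35, 45, 30, 28, 25, 60, 22])

print(odd_example, even_example, ages_median)
```
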
> **Note:** The mean age is 32, but the median is 28. The single high value (60) pulled the mean up. This is why the median is often a better representation of the 'typical' value in skewed datasets.

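The effect described in the note can be checked directly by dropping the value 60 and recomputing both measures. This is a small sketch on the same ages; the variable name `ages_without_60` is introduced here only for illustration.

```python
import numpy as np

ages = np.array([25, 28, 22, 35, 45, 30, 28, 25, 60, 22])

# Remove the largest value to see how each measure reacts
ages_without_60 = ages[ages != 60]

mean_with = np.mean(ages)
mean_without = np.mean(ages_without_60)
median_with = np.median(ages)
median_without = np.median(ages_without_60)

print(f"Mean:   {mean_with:.2f} -> {mean_without:.2f}")
print(f"Median: {median_with:.2f} -> {median_without:.2f}")
```

The mean shifts noticeably when the high value is removed, while the median does not move.
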
### Mode (Most Frequent Value)
The value that appears most often in the dataset. A dataset can have more than one mode when several values tie for the highest count.

In [None]:
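# NumPy doesn't have a built-in mode function, so we'll use SciPy
from scipy import stats

mode_age = stats.mode(ages)
# The result object holds one mode and the number of times it occurs.
# In this dataset 22, 25 and 28 each appear twice; when values tie,
# stats.mode returns only the smallest of them.
print(f"The mode of the ages is: {mode_age.mode} (appears {mode_age.count} times)")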

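In this dataset several ages tie for the highest count, so a single reported mode does not tell the whole story. One way to list every tied value, sketched here with NumPy only, is to count occurrences with `np.unique(..., return_counts=True)` and keep the values whose count equals the maximum. The variable names are illustrative.

```python
import numpy as np

ages = np.array([25, 28, 22, 35, 45, 30, 28, 25, 60, 22])

# Distinct values (sorted) and how often each one occurs
values, counts = np.unique(ages, return_counts=True)

# Keep every value whose count equals the highest count
highest_count = counts.max()
all_modes = values[counts == highest_count]

print(f"Values tied for most frequent: {all_modes}")
print(f"Each appears {highest_count} times")
```
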
## 2. Measures of Spread (Dispersion)

These measures tell us how spread out our data is. Are the values all clustered together, or are they widely scattered?

### Variance
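The average of the squared differences from the mean. A higher variance means the data is more spread out. Because the differences are squared, the variance is expressed in squared units (here, years squared).

Formula (population variance, where $\mu$ is the mean and $n$ is the number of values): $ \sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n} $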

In [None]:
# By default np.var divides by n (ddof=0), i.e. it computes the population variance
variance_age = np.var(ages)
print(f"The variance of the ages is: {variance_age:.2f}")

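As a rough sketch of what `np.var` does, the block below computes the variance by hand as the mean of squared deviations, then compares it with `np.var`. It also shows the `ddof=1` option, which divides by `n - 1` and is commonly used when the data is a sample rather than the whole population. The variable names are illustrative.

```python
import numpy as np

ages = np.array([25, 28, 22, 35, 45, 30, 28, 25, 60, 22])

# Population variance by hand: mean of the squared deviations from the mean
deviations = ages - ages.mean()
manual_var = np.mean(deviations ** 2)

population_var = np.var(ages)        # default ddof=0, divides by n
sample_var = np.var(ages, ddof=1)    # ddof=1 divides by n - 1

print(f"By hand:             {manual_var:.2f}")
print(f"np.var (ddof=0):     {population_var:.2f}")
print(f"np.var (ddof=1):     {sample_var:.2f}")
```

The sample variance is always a little larger than the population variance on the same data, and the gap shrinks as the dataset grows.
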
### Standard Deviation
This is the square root of the variance. It's the most common measure of spread and is easier to interpret because it's in the **same units as the original data**.

A simple interpretation: A low standard deviation means values are close to the mean. A high standard deviation means values are spread out over a wider range.

Formula: $ \sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}} $

In [None]:
# Like np.var, np.std divides by n by default (ddof=0)
std_dev_age = np.std(ages)
print(f"The standard deviation of the ages is: {std_dev_age:.2f}")
print(f"Most ages lie within about one standard deviation of the mean: {mean_age:.2f} ± {std_dev_age:.2f} years.")

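To make the "close to the mean" interpretation concrete, this sketch checks that the standard deviation is the square root of the variance and counts how many ages fall within one standard deviation of the mean. The interval bounds and variable names are illustrative choices, not part of the original notebook.

```python
import numpy as np

ages = np.array([25, 28, 22, 35, 45, 30, 28, 25, 60, 22])

mean_age = ages.mean()
std_from_var = np.sqrt(np.var(ages))   # square root of the population variance

# Bounds of the "mean ± one standard deviation" interval
lower = mean_age - std_from_var
upper = mean_age + std_from_var

# Count how many ages fall inside that interval
inside = (ages >= lower) & (ages <= upper)
n_inside = int(inside.sum())

print(f"np.std agrees with sqrt(np.var): {np.isclose(std_from_var, np.std(ages))}")
print(f"Interval: {lower:.2f} to {upper:.2f} years")
print(f"{n_inside} of {ages.size} ages fall inside it")
```

Here the two ages outside the interval are 45 and 60, the same high values that pull the mean above the median.
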
## ✅ What's Next?

Fantastic! You've now completed the entire **`01_Python_and_Math_Foundations`** module. You've covered the basics of Python programming, the mathematical objects we use (vectors/matrices), and the statistical methods to describe them.

With this foundation, we are ready to tackle the most important libraries for data manipulation. In the next module, **`02_Data_Analysis_and_Wrangling`**, we'll dive deep into **NumPy** and **Pandas**.