# Module 01: Descriptive Statistics

**Difficulty**: ⭐ Beginner

**Estimated Time**: 90 minutes

**Prerequisites**: 
- Module 00: Setup and Introduction
- Basic Python and NumPy knowledge

## Learning Objectives

By the end of this notebook, you will be able to:
1. Calculate and interpret measures of central tendency (mean, median, mode)
2. Calculate and interpret measures of dispersion (variance, standard deviation, range, IQR)
3. Understand and identify different types of data distributions
4. Visualize data distributions using histograms and box plots
5. Apply descriptive statistics to analyze real-world datasets
6. Detect outliers using statistical methods

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Configure visualization
%matplotlib inline
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("husl")

# Set random seed for reproducibility
np.random.seed(42)

# Display options
np.set_printoptions(precision=4, suppress=True)
pd.set_option('display.precision', 4)

print("Setup complete!")

## 1. Measures of Central Tendency

Central tendency measures describe where the "center" of a dataset is located. The three main measures are:

### Mean (Average)
The sum of all values divided by the number of values.

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

**Use when**: Data is roughly symmetric and has no extreme outliers.

### Median (Middle Value)
The middle value when data is sorted. If there are an even number of values, it's the average of the two middle values.

**Use when**: Data has outliers or is skewed.
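
The odd/even rule above can be sketched by hand and checked against NumPy. This is a minimal illustration, not part of the original notebook; the helper name `median_by_hand` and the sample values are hypothetical.

```python
import numpy as np

def median_by_hand(values):
    """Median via sorting; averages the two middle values when n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: a single middle value exists
        return float(ordered[mid])
    # Even count: average the two values straddling the middle
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_sample = [7, 1, 4]        # hypothetical values, sorted: 1, 4, 7
even_sample = [7, 1, 4, 10]   # hypothetical values, sorted: 1, 4, 7, 10

print(median_by_hand(odd_sample), np.median(odd_sample))
print(median_by_hand(even_sample), np.median(even_sample))
```

Both lines should print matching pairs, since `np.median` applies the same rule.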

### Mode (Most Frequent)
The value that appears most frequently in the dataset.

**Use when**: You need to know the most common value, especially for categorical data.
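
For categorical data, where a mean or median is undefined, the mode is often the only usable measure of center. The sketch below uses pandas `Series.mode()`, which returns every tied value rather than a single one; the survey data is made up for illustration.

```python
import pandas as pd

# Hypothetical survey responses (categorical data)
colors = pd.Series(["blue", "green", "blue", "red", "green", "blue"])
color_modes = colors.mode()   # Series holding the most frequent value(s)
print(color_modes.tolist())

# When several values tie for most frequent, all of them are returned (sorted)
tied = pd.Series([3, 1, 3, 1, 2])
tied_modes = tied.mode()
print(tied_modes.tolist())
```

Returning all tied values makes it easy to spot data that has no single mode.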

In [None]:
# Example: Student test scores
test_scores = np.array([78, 85, 92, 88, 76, 95, 89, 84, 91, 87])

print("Test Scores:", test_scores)
print("\n=== Measures of Central Tendency ===")

# Mean
mean_score = np.mean(test_scores)
print(f"Mean: {mean_score:.2f}")

# Median
median_score = np.median(test_scores)
print(f"Median: {median_score:.2f}")

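# Mode (using scipy.stats; the keepdims argument needs SciPy >= 1.9)
# Note: every score in this sample is unique, so there is no meaningful mode.
# stats.mode reports only one of the tied values (here the smallest, 76).
mode_result = stats.mode(test_scores, keepdims=True)
mode_score = mode_result.mode[0]
print(f"Mode: {mode_score:.2f} (all values are unique, so this is not informative)")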

print(f"\nInterpretation: The average test score is {mean_score:.1f}, ")
print(f"with half the students scoring above {median_score:.1f}.")

In [None]:
# Example showing why median is better for skewed data
# Salaries at a small company (in thousands)

salaries = np.array([45, 50, 52, 48, 51, 49, 47, 53, 500])  # CEO salary = 500k

print("Salaries (in thousands):", salaries)
print("\n=== Impact of Outliers ===")

mean_salary = np.mean(salaries)
median_salary = np.median(salaries)

print(f"Mean salary: ${mean_salary:.2f}k")
print(f"Median salary: ${median_salary:.2f}k")

print(f"\nNotice: The CEO's salary (${salaries[-1]}k) pulls the mean up significantly.")
print(f"The median (${median_salary:.0f}k) better represents the 'typical' employee salary.")

## 2. Measures of Dispersion (Spread)

Dispersion measures describe how spread out the data is. Knowing only the center isn't enough; we also need to understand the variability.

### Range
The difference between the maximum and minimum values.

$$\text{Range} = \max(x) - \min(x)$$

### Variance
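The average squared deviation from the mean. The formula below divides by $n$ (the population variance, which is NumPy's default); the sample variance divides by $n - 1$ instead (`ddof=1` in NumPy, and the default in pandas).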

$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

### Standard Deviation
The square root of variance (in the same units as the original data).

$$\sigma = \sqrt{\sigma^2}$$

### Interquartile Range (IQR)
The range of the middle 50% of the data (Q3 - Q1).
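
As a worked check of the formulas above, the sketch below computes each measure step by step on a small made-up sample and compares the results with NumPy. The data values are hypothetical, chosen so the arithmetic is easy to follow.

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])   # hypothetical values; the mean is 5

# Squared deviations from the mean: [9, 1, 1, 1, 0, 0, 4, 16], which sum to 32
squared_dev = (data - data.mean()) ** 2

pop_var = squared_dev.sum() / len(data)            # 32 / 8 = 4.0 (divide by n)
sample_var = squared_dev.sum() / (len(data) - 1)   # 32 / 7, about 4.57 (divide by n - 1)
pop_std = np.sqrt(pop_var)                         # 2.0, in the data's own units

data_range = data.max() - data.min()               # 9 - 2 = 7
q1, q3 = np.percentile(data, [25, 75])             # NumPy's default linear interpolation
iqr = q3 - q1

print(f"Population variance: {pop_var:.3f}  (np.var: {np.var(data):.3f})")
print(f"Sample variance:     {sample_var:.3f}  (np.var, ddof=1: {np.var(data, ddof=1):.3f})")
print(f"Population std dev:  {pop_std:.3f}")
print(f"Range: {data_range}, IQR: {iqr}")
```

The gap between the two variance figures shrinks as the sample grows, but on small samples the choice of `ddof` is visible in the results.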

In [None]:
# Using the test scores from earlier
print("Test Scores:", test_scores)
print("\n=== Measures of Dispersion ===")

# Range
data_range = np.ptp(test_scores)  # ptp = peak to peak
print(f"Range: {data_range}")
print(f"  (Max: {np.max(test_scores)}, Min: {np.min(test_scores)})")

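# Variance (population form, ddof=0, matching the formula above)
variance = np.var(test_scores)
print(f"\nVariance: {variance:.2f}")

# Standard Deviation (also population form; pass ddof=1 for the sample version)
std_dev = np.std(test_scores)
print(f"Standard Deviation: {std_dev:.2f}")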

# Quartiles and IQR
q1 = np.percentile(test_scores, 25)
q2 = np.percentile(test_scores, 50)  # This is the median
q3 = np.percentile(test_scores, 75)
iqr = q3 - q1

print(f"\nQuartiles:")
print(f"  Q1 (25th percentile): {q1}")
print(f"  Q2 (50th percentile/Median): {q2}")
print(f"  Q3 (75th percentile): {q3}")
print(f"  IQR (Q3 - Q1): {iqr}")

In [None]:
# Visualizing the relationship between mean and standard deviation

# Create two datasets with same mean but different standard deviations
data_low_std = np.random.normal(loc=50, scale=5, size=1000)
data_high_std = np.random.normal(loc=50, scale=15, size=1000)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Low standard deviation
axes[0].hist(data_low_std, bins=30, edgecolor='black', alpha=0.7)
axes[0].axvline(np.mean(data_low_std), color='red', linestyle='--', 
                linewidth=2, label=f'Mean = {np.mean(data_low_std):.2f}')
axes[0].set_title(f'Low Spread (σ = {np.std(data_low_std):.2f})', 
                  fontsize=13, fontweight='bold')
axes[0].set_xlabel('Value')
axes[0].set_ylabel('Frequency')
axes[0].legend()

# High standard deviation
axes[1].hist(data_high_std, bins=30, edgecolor='black', alpha=0.7)
axes[1].axvline(np.mean(data_high_std), color='red', linestyle='--', 
                linewidth=2, label=f'Mean = {np.mean(data_high_std):.2f}')
axes[1].set_title(f'High Spread (σ = {np.std(data_high_std):.2f})', 
                  fontsize=13, fontweight='bold')
axes[1].set_xlabel('Value')
axes[1].set_ylabel('Frequency')
axes[1].legend()

plt.tight_layout()
plt.show()

print("Notice: Both datasets have approximately the same mean (~50),")
print("but the right dataset has much higher variability (spread).")

## 3. Understanding Data Distributions

The **distribution** of data describes how values are spread across the range. Common distribution shapes:

- **Normal (Bell Curve)**: Symmetric, most values cluster around the mean
- **Skewed Right**: Long tail on the right; the mean is typically greater than the median
- **Skewed Left**: Long tail on the left; the mean is typically less than the median
- **Uniform**: All values equally likely
- **Bimodal**: Two peaks (modes)
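
The mean-versus-median rule of thumb for skew can also be checked numerically. The sketch below is one possible way to do so, using `scipy.stats.skew` (the sample skewness coefficient) on simulated data. It draws from a local generator so the notebook's global seed is untouched; the sample sizes and parameters are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)   # local generator, independent of np.random.seed

symmetric = rng.normal(loc=50, scale=10, size=5000)
right_tail = rng.exponential(scale=20, size=5000)
left_tail = 100 - right_tail      # mirror image of the right-skewed sample

for name, sample in [("symmetric", symmetric),
                     ("right", right_tail),
                     ("left", left_tail)]:
    gap = np.mean(sample) - np.median(sample)
    print(f"{name:>10}: mean - median = {gap:7.2f}, skewness = {stats.skew(sample):6.2f}")
```

A positive skewness coefficient goes with a right tail and a negative one with a left tail; values near zero suggest a roughly symmetric sample.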

In [None]:
# Visualizing different distribution types

fig, axes = plt.subplots(2, 3, figsize=(15, 10))

# Normal distribution
normal_data = np.random.normal(loc=50, scale=10, size=1000)
axes[0, 0].hist(normal_data, bins=30, edgecolor='black', alpha=0.7)
axes[0, 0].axvline(np.mean(normal_data), color='red', linestyle='--', 
                   label=f'Mean={np.mean(normal_data):.1f}')
axes[0, 0].axvline(np.median(normal_data), color='blue', linestyle='--', 
                   label=f'Median={np.median(normal_data):.1f}')
axes[0, 0].set_title('Normal Distribution', fontweight='bold')
axes[0, 0].legend()

# Right-skewed distribution
right_skew = np.random.exponential(scale=20, size=1000)
axes[0, 1].hist(right_skew, bins=30, edgecolor='black', alpha=0.7)
axes[0, 1].axvline(np.mean(right_skew), color='red', linestyle='--', 
                   label=f'Mean={np.mean(right_skew):.1f}')
axes[0, 1].axvline(np.median(right_skew), color='blue', linestyle='--', 
                   label=f'Median={np.median(right_skew):.1f}')
axes[0, 1].set_title('Right-Skewed Distribution', fontweight='bold')
axes[0, 1].legend()

# Left-skewed distribution
left_skew = 100 - np.random.exponential(scale=20, size=1000)
axes[0, 2].hist(left_skew, bins=30, edgecolor='black', alpha=0.7)
axes[0, 2].axvline(np.mean(left_skew), color='red', linestyle='--', 
                   label=f'Mean={np.mean(left_skew):.1f}')
axes[0, 2].axvline(np.median(left_skew), color='blue', linestyle='--', 
                   label=f'Median={np.median(left_skew):.1f}')
axes[0, 2].set_title('Left-Skewed Distribution', fontweight='bold')
axes[0, 2].legend()

# Uniform distribution
uniform_data = np.random.uniform(low=0, high=100, size=1000)
axes[1, 0].hist(uniform_data, bins=30, edgecolor='black', alpha=0.7)
axes[1, 0].axvline(np.mean(uniform_data), color='red', linestyle='--', 
                   label=f'Mean={np.mean(uniform_data):.1f}')
axes[1, 0].set_title('Uniform Distribution', fontweight='bold')
axes[1, 0].legend()

# Bimodal distribution
bimodal = np.concatenate([np.random.normal(30, 5, 500), 
                          np.random.normal(70, 5, 500)])
axes[1, 1].hist(bimodal, bins=30, edgecolor='black', alpha=0.7)
axes[1, 1].axvline(np.mean(bimodal), color='red', linestyle='--', 
                   label=f'Mean={np.mean(bimodal):.1f}')
axes[1, 1].set_title('Bimodal Distribution', fontweight='bold')
axes[1, 1].legend()

# Remove empty subplot
fig.delaxes(axes[1, 2])

plt.tight_layout()
plt.show()

## 4. Box Plots: Visualizing Distribution Summary

Box plots (box-and-whisker plots) show the five-number summary:
1. Minimum (excluding outliers)
2. Q1 (25th percentile)
3. Median (Q2, 50th percentile)
4. Q3 (75th percentile)
5. Maximum (excluding outliers)

**Outliers** are typically shown as individual points beyond the whiskers.
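
To make the whisker rule concrete, the sketch below computes the quantities a box plot draws, assuming the common 1.5 × IQR convention (Matplotlib's default `whis=1.5`). The data values are hypothetical.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])   # hypothetical values

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Fences: the farthest a whisker is allowed to reach
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points that are still inside the fences
inside = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything outside the fences is drawn as an individual point
fliers = data[(data < lower_fence) | (data > upper_fence)]

print(f"Q1 = {q1}, median = {median}, Q3 = {q3}, IQR = {iqr}")
print(f"Whiskers: {whisker_low} to {whisker_high}")
print(f"Points drawn individually: {fliers}")
```

Note that the whiskers end at actual data values (1 and 9 here), not at the fences themselves.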

In [None]:
# Create sample data with outliers
data_with_outliers = np.concatenate([
    np.random.normal(50, 10, 100),
    [5, 95, 100]  # Outliers
])

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Histogram
axes[0].hist(data_with_outliers, bins=30, edgecolor='black', alpha=0.7)
axes[0].set_title('Histogram View', fontsize=13, fontweight='bold')
axes[0].set_xlabel('Value')
axes[0].set_ylabel('Frequency')

# Box plot
box_plot = axes[1].boxplot(data_with_outliers, vert=True, patch_artist=True)
box_plot['boxes'][0].set_facecolor('lightblue')
axes[1].set_title('Box Plot View', fontsize=13, fontweight='bold')
axes[1].set_ylabel('Value')
axes[1].set_xticklabels(['Data'])

# Add annotations to box plot
q1 = np.percentile(data_with_outliers, 25)
median = np.median(data_with_outliers)
q3 = np.percentile(data_with_outliers, 75)

axes[1].text(1.15, q1, f'Q1 = {q1:.1f}', fontsize=10)
axes[1].text(1.15, median, f'Median = {median:.1f}', fontsize=10, fontweight='bold')
axes[1].text(1.15, q3, f'Q3 = {q3:.1f}', fontsize=10)

plt.tight_layout()
plt.show()

print("The box plot shows the injected outliers (5, 95, 100) as individual points beyond the whiskers.")
print("The box represents the middle 50% of the data (IQR).")

## 5. Detecting Outliers

Outliers are data points that are significantly different from other observations. We can detect them using:

### IQR Method
- Lower bound: Q1 - 1.5 × IQR
- Upper bound: Q3 + 1.5 × IQR
- Any values outside these bounds are outliers

### Z-Score Method
- Calculate how many standard deviations a point is from the mean
- Typically, |z| > 3 indicates an outlier
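
The two rules do not always agree, especially on small samples. An extreme value inflates the standard deviation it is measured against. With the population standard deviation, no |z| can exceed √(n − 1), so the |z| > 3 rule cannot flag anything when n ≤ 10. The sketch below illustrates this on the nine salaries from the earlier example.

```python
import numpy as np

# Same nine salaries (in thousands) as the earlier mean-vs-median example
salaries = np.array([45, 50, 52, 48, 51, 49, 47, 53, 500])

# Z-score rule (population standard deviation, NumPy's default)
z = np.abs((salaries - salaries.mean()) / salaries.std())

# IQR rule
q1, q3 = np.percentile(salaries, [25, 75])
iqr = q3 - q1
iqr_flags = (salaries < q1 - 1.5 * iqr) | (salaries > q3 + 1.5 * iqr)

print(f"Largest |z|: {z.max():.2f} (upper limit for n = 9: {np.sqrt(len(salaries) - 1):.2f})")
print(f"Flagged by |z| > 3: {salaries[z > 3]}")
print(f"Flagged by IQR rule: {salaries[iqr_flags]}")
```

Here the IQR rule flags the 500k salary while the z-score rule flags nothing, which is one reason the IQR method is often preferred for small or heavily skewed samples.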

In [None]:
# Outlier detection using IQR method

def detect_outliers_iqr(data):
    """
    Detect outliers using the IQR method.
    Returns: tuple of (outlier_indices, lower_bound, upper_bound)
    """
    q1 = np.percentile(data, 25)
    q3 = np.percentile(data, 75)
    iqr = q3 - q1
    
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    
    # Find outliers
    outlier_mask = (data < lower_bound) | (data > upper_bound)
    outlier_indices = np.where(outlier_mask)[0]
    
    return outlier_indices, lower_bound, upper_bound

# Apply to our data
outliers, lower, upper = detect_outliers_iqr(data_with_outliers)

print("=== IQR Outlier Detection ===")
print(f"Lower bound: {lower:.2f}")
print(f"Upper bound: {upper:.2f}")
print(f"\nNumber of outliers: {len(outliers)}")
print(f"Outlier values: {data_with_outliers[outliers]}")

In [None]:
# Outlier detection using Z-score method

def detect_outliers_zscore(data, threshold=3):
    """
    Detect outliers using the Z-score method.
    Returns: outlier_indices
    """
    mean = np.mean(data)
    std = np.std(data)
    
    # Calculate Z-scores
    z_scores = np.abs((data - mean) / std)
    
    # Find outliers (|z| > threshold)
    outlier_indices = np.where(z_scores > threshold)[0]
    
    return outlier_indices

# Apply to our data
outliers_z = detect_outliers_zscore(data_with_outliers)

print("=== Z-Score Outlier Detection ===")
print(f"Number of outliers (|z| > 3): {len(outliers_z)}")
print(f"Outlier values: {data_with_outliers[outliers_z]}")

## 6. Real-World Example: Analyzing Student Performance

Let's apply what we've learned to analyze a realistic dataset of student exam scores.

In [None]:
# Create a realistic student performance dataset
np.random.seed(42)

# Simulate 100 students' scores (0-100)
# Most students score between 60-90, with some variation
student_scores = np.concatenate([
    np.random.normal(75, 10, 85),  # 85 typical students
    np.random.normal(50, 5, 10),   # 10 struggling students
    np.random.normal(95, 2, 5)     # 5 top performers
])

# Clip to valid range [0, 100]
student_scores = np.clip(student_scores, 0, 100)

# Create a DataFrame for better analysis
df_scores = pd.DataFrame({
    'score': student_scores,
    'student_id': range(1, len(student_scores) + 1)
})

print("=== Student Performance Summary ===")
print(df_scores['score'].describe())
print(f"\nTotal students: {len(df_scores)}")

In [None]:
# Comprehensive visualization

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Histogram with mean and median
axes[0, 0].hist(student_scores, bins=20, edgecolor='black', alpha=0.7)
axes[0, 0].axvline(np.mean(student_scores), color='red', linestyle='--', 
                   linewidth=2, label=f'Mean = {np.mean(student_scores):.1f}')
axes[0, 0].axvline(np.median(student_scores), color='blue', linestyle='--', 
                   linewidth=2, label=f'Median = {np.median(student_scores):.1f}')
axes[0, 0].set_title('Distribution of Scores', fontsize=13, fontweight='bold')
axes[0, 0].set_xlabel('Score')
axes[0, 0].set_ylabel('Number of Students')
axes[0, 0].legend()

# Box plot
bp = axes[0, 1].boxplot(student_scores, vert=True, patch_artist=True)
bp['boxes'][0].set_facecolor('lightgreen')
axes[0, 1].set_title('Box Plot of Scores', fontsize=13, fontweight='bold')
axes[0, 1].set_ylabel('Score')
axes[0, 1].set_xticklabels(['All Students'])

# Cumulative distribution
sorted_scores = np.sort(student_scores)
cumulative = np.arange(1, len(sorted_scores) + 1) / len(sorted_scores) * 100
axes[1, 0].plot(sorted_scores, cumulative, linewidth=2)
axes[1, 0].set_title('Cumulative Distribution', fontsize=13, fontweight='bold')
axes[1, 0].set_xlabel('Score')
axes[1, 0].set_ylabel('Cumulative Percentage')
axes[1, 0].grid(True, alpha=0.3)

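# Grade distribution (F: <60, D: 60-69, C: 70-79, B: 80-89, A: 90+)
# right=False makes each bin include its lower edge, so a score of exactly 90
# counts as an A and a score of 0 is not dropped; the top edge is 101 so 100 fits.
grades = pd.cut(student_scores, bins=[0, 60, 70, 80, 90, 101],
                labels=['F', 'D', 'C', 'B', 'A'], right=False)
grade_counts = grades.value_counts().sort_index()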
axes[1, 1].bar(grade_counts.index, grade_counts.values, edgecolor='black', alpha=0.7)
axes[1, 1].set_title('Grade Distribution', fontsize=13, fontweight='bold')
axes[1, 1].set_xlabel('Grade')
axes[1, 1].set_ylabel('Number of Students')

plt.tight_layout()
plt.show()

## 7. Practice Exercises

### Exercise 1: Calculate Descriptive Statistics

Given the following monthly salaries (in thousands) of 15 employees:

`[45, 52, 48, 51, 49, 47, 53, 46, 50, 48, 51, 49, 52, 47, 120]`

Calculate:
1. Mean, median, and mode
2. Range, variance, and standard deviation
3. Q1, Q2, Q3, and IQR
4. Which measure of central tendency best represents the typical salary? Why?

In [None]:
# Your code here
salaries_ex = np.array([45, 52, 48, 51, 49, 47, 53, 46, 50, 48, 51, 49, 52, 47, 120])

print("=== Exercise 1 Solution ===")
print(f"Salaries: {salaries_ex}")

# 1. Central tendency
mean = np.mean(salaries_ex)
median = np.median(salaries_ex)
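# Several values tie for most frequent (47, 48, 49, 51 and 52 each appear twice).
# stats.mode reports only one of them, so list every tied mode explicitly as well.
mode_result = stats.mode(salaries_ex, keepdims=True)
mode = mode_result.mode[0]
values, counts = np.unique(salaries_ex, return_counts=True)
all_modes = values[counts == counts.max()]

print(f"\n1. Central Tendency:")
print(f"   Mean: ${mean:.2f}k")
print(f"   Median: ${median:.2f}k")
print(f"   Mode: ${mode:.2f}k (from stats.mode; all tied modes: {all_modes})")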

# 2. Dispersion
data_range = np.ptp(salaries_ex)
variance = np.var(salaries_ex)
std_dev = np.std(salaries_ex)

print(f"\n2. Dispersion:")
print(f"   Range: ${data_range:.2f}k")
print(f"   Variance: {variance:.2f}")
print(f"   Standard Deviation: ${std_dev:.2f}k")

# 3. Quartiles
q1 = np.percentile(salaries_ex, 25)
q2 = np.percentile(salaries_ex, 50)
q3 = np.percentile(salaries_ex, 75)
iqr = q3 - q1

print(f"\n3. Quartiles:")
print(f"   Q1: ${q1:.2f}k")
print(f"   Q2: ${q2:.2f}k")
print(f"   Q3: ${q3:.2f}k")
print(f"   IQR: ${iqr:.2f}k")

# 4. Best measure
print(f"\n4. Best Measure of Central Tendency:")
print(f"   The MEDIAN (${median:.2f}k) best represents the typical salary.")
print(f"   Reason: The value 120k is an outlier that pulls the mean up to ${mean:.2f}k,")
print(f"   which doesn't represent the typical employee. The median is resistant to outliers.")

### Exercise 2: Identify the Distribution Type

Create visualizations for the following datasets and identify their distribution types:

1. `dataset_a = np.random.normal(50, 15, 1000)`
2. `dataset_b = np.random.exponential(scale=30, size=1000)`
3. `dataset_c = np.random.uniform(0, 100, 1000)`

For each, create a histogram and box plot, then describe the distribution shape.

In [None]:
# Your code here
np.random.seed(42)

dataset_a = np.random.normal(50, 15, 1000)
dataset_b = np.random.exponential(scale=30, size=1000)
dataset_c = np.random.uniform(0, 100, 1000)

fig, axes = plt.subplots(3, 2, figsize=(14, 12))

datasets = [dataset_a, dataset_b, dataset_c]
labels = ['Dataset A', 'Dataset B', 'Dataset C']

for i, (data, label) in enumerate(zip(datasets, labels)):
    # Histogram
    axes[i, 0].hist(data, bins=30, edgecolor='black', alpha=0.7)
    axes[i, 0].axvline(np.mean(data), color='red', linestyle='--', 
                       label=f'Mean={np.mean(data):.1f}')
    axes[i, 0].axvline(np.median(data), color='blue', linestyle='--', 
                       label=f'Median={np.median(data):.1f}')
    axes[i, 0].set_title(f'{label} - Histogram', fontweight='bold')
    axes[i, 0].legend()
    
    # Box plot
    bp = axes[i, 1].boxplot(data, vert=True, patch_artist=True)
    bp['boxes'][0].set_facecolor('lightblue')
    axes[i, 1].set_title(f'{label} - Box Plot', fontweight='bold')
    axes[i, 1].set_xticklabels([label])

plt.tight_layout()
plt.show()

print("=== Distribution Types ===")
print("Dataset A: NORMAL (symmetric, bell-shaped, mean ≈ median)")
print("Dataset B: RIGHT-SKEWED (long tail on right, mean > median, exponential)")
print("Dataset C: UNIFORM (flat, all values equally likely, mean ≈ median)")

### Exercise 3: Outlier Detection in Practice

You're analyzing daily website traffic (visitors per day) for the past month:

```python
traffic = [1523, 1489, 1567, 1601, 1534, 1578, 1612, 1545, 1590, 1523, 
           1567, 1598, 1534, 1589, 1601, 1578, 1545, 1623, 1590, 1556,
           1601, 1578, 1612, 1545, 1589, 1623, 5234, 1590, 1567, 1601]
```

Tasks:
1. Detect outliers using the IQR method
2. Create a box plot to visualize the outliers
3. What might the outlier represent? Should it be removed?

In [None]:
# Your code here
traffic = np.array([1523, 1489, 1567, 1601, 1534, 1578, 1612, 1545, 1590, 1523, 
                    1567, 1598, 1534, 1589, 1601, 1578, 1545, 1623, 1590, 1556,
                    1601, 1578, 1612, 1545, 1589, 1623, 5234, 1590, 1567, 1601])

print("=== Exercise 3 Solution ===")

# 1. Detect outliers using IQR
outliers, lower, upper = detect_outliers_iqr(traffic)

print(f"1. IQR Outlier Detection:")
print(f"   Lower bound: {lower:.0f} visitors")
print(f"   Upper bound: {upper:.0f} visitors")
print(f"   Number of outliers: {len(outliers)}")
print(f"   Outlier values: {traffic[outliers]} visitors")
print(f"   Outlier date: Day {outliers[0] + 1}")

# 2. Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Line plot showing the spike
axes[0].plot(range(1, len(traffic) + 1), traffic, marker='o', linewidth=2)
axes[0].scatter(outliers + 1, traffic[outliers], color='red', s=100, 
                zorder=5, label='Outlier')
axes[0].set_title('Daily Website Traffic', fontsize=13, fontweight='bold')
axes[0].set_xlabel('Day of Month')
axes[0].set_ylabel('Visitors')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Box plot
bp = axes[1].boxplot(traffic, vert=True, patch_artist=True)
bp['boxes'][0].set_facecolor('lightcoral')
axes[1].set_title('Traffic Distribution', fontsize=13, fontweight='bold')
axes[1].set_ylabel('Visitors')
axes[1].set_xticklabels(['Daily Traffic'])

plt.tight_layout()
plt.show()

# 3. Interpretation
print(f"\n3. Interpretation:")
print(f"   The outlier (5,234 visitors on Day {outliers[0] + 1}) is more than 3x the typical daily traffic.")
print(f"   Possible causes:")
print(f"   - Viral content or successful marketing campaign")
print(f"   - Media coverage or social media mention")
print(f"   - Bot traffic or data error")
print(f"\n   Should it be removed?")
print(f"   - NO if it represents a real event (viral spike, campaign success)")
print(f"   - YES if it's a data collection error or bot attack")
print(f"   - INVESTIGATE first before deciding!")

## 8. Summary and Key Takeaways

In this module, you learned:

✅ **Measures of Central Tendency**
- Mean: Average of all values (sensitive to outliers)
- Median: Middle value (robust to outliers)
- Mode: Most frequent value

✅ **Measures of Dispersion**
- Range: Max - Min
- Variance: Average squared deviation from mean
- Standard Deviation: Square root of variance
- IQR: Range of middle 50% of data

✅ **Distribution Shapes**
- Normal, skewed (left/right), uniform, bimodal
- Relationship between mean and median indicates skewness

✅ **Visualization Tools**
- Histograms: Show distribution shape
- Box plots: Show five-number summary and outliers

✅ **Outlier Detection**
- IQR method: Values beyond Q1 - 1.5×IQR or Q3 + 1.5×IQR
- Z-score method: Values with |z| > 3

### What's Next?

In **Module 02: Probability Fundamentals**, you'll learn:
- Basic probability concepts and rules
- Conditional probability and Bayes' Theorem
- Probability distributions (binomial, normal, Poisson)
- Random variables and expected values

### Additional Resources

- [Khan Academy - Statistics and Probability](https://www.khanacademy.org/math/statistics-probability)
- [Practical Statistics for Data Scientists](https://www.oreilly.com/library/view/practical-statistics-for/9781491952955/)
- [StatQuest YouTube - Statistics](https://www.youtube.com/c/joshstarmer)

---

**Excellent work!** You now understand how to describe and summarize data using statistics.

**Next**: Proceed to `02_probability_fundamentals.ipynb`