# Advanced Exercises: Python `statistics` Module

This notebook contains a collection of **advanced, realistic problems** built around the standard-library [`statistics`](https://docs.python.org/3/library/statistics.html) module.

Each problem is followed by a **worked solution** that:

- Emphasizes good software engineering practices (clear naming, small functions, docstrings, type hints).
- Uses `statistics` tools when they are appropriate.
- Highlights numerical and statistical subtleties (e.g. population vs sample variance, robustness to outliers, modeling with `NormalDist`).

In [1]:
from __future__ import annotations
from dataclasses import dataclass
from typing import Iterable, List, Sequence, Tuple, Dict, Any
import math
import random
import statistics as stats


## Problem 1 — Robustness of Mean vs Median

A data scientist is analyzing **annual incomes (in thousands of $)** for a group of people in a city district.

They collected the following sample (already sorted):

```python
incomes = [25, 27, 28, 29, 30, 32, 33, 34, 36, 40]
```

1. Compute the **mean**, **median**, **first quartile (Q1)**, **third quartile (Q3)** and the **interquartile range (IQR)** for this sample, using the `statistics` module only.
2. Now, suppose a wealthy entrepreneur joins the district with an income of `500` (thousands). Recompute the same statistics for the updated list.
3. Implement a function `summarize_incomes(incomes: Sequence[float]) -> dict[str, float]` that returns a dictionary with keys `"mean"`, `"median"`, `"q1"`, `"q3"`, and `"iqr"`. Use `statistics.quantiles` appropriately.
4. Based on your results, briefly explain *numerically* why the **median** is more robust to outliers than the **mean** in this context.

### Solution 1

In [2]:
def summarize_incomes(incomes: Sequence[float]) -> Dict[str, float]:
    """
    Compute basic robust statistics for a sequence of numeric incomes.

    Returns a dictionary with:
        - 'mean': arithmetic mean
        - 'median': median
        - 'q1': first quartile
        - 'q3': third quartile
        - 'iqr': interquartile range = q3 - q1
    """
    if not incomes:
        raise ValueError("incomes must be a non-empty sequence")

    mean_value = stats.fmean(incomes)
    median_value = stats.median(incomes)

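    # "inclusive" treats the minimum and maximum as the 0th and 100th
    # percentiles; the default "exclusive" method generally yields
    # different quartiles for the same data.
    quartiles = stats.quantiles(incomes, n=4, method="inclusive")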
    q1 = quartiles[0]
    q3 = quartiles[2]
    iqr = q3 - q1

    return {
        "mean": mean_value,
        "median": median_value,
        "q1": q1,
        "q3": q3,
        "iqr": iqr
    }

incomes = [25, 27, 28, 29, 30, 32, 33, 34, 36, 40]
summary_original = summarize_incomes(incomes)

incomes_with_outlier = incomes + [500]
summary_with_outlier = summarize_incomes(incomes_with_outlier)

summary_original, summary_with_outlier

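({'mean': 31.4, 'median': 31.0, 'q1': 28.25, 'q3': 33.75, 'iqr': 5.5},
 {'mean': 74.0, 'median': 32, 'q1': 28.5, 'q3': 35.0, 'iqr': 6.5})

#### Discussion

Adding the single income of 500 moves the mean from 31.4 to 74.0, because every observation contributes its full magnitude to the sum. The median moves only from 31.0 to 32: it depends on the rank order of the values, so one extreme observation can push it no further than a neighboring data point. The quartiles and the IQR (5.5 to 6.5) are similarly stable.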
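
To check the robustness numerically, the sketch below measures how far the mean and median move when the outlier is added. It then repeats the experiment with an even more extreme income of `5000`, a hypothetical value used only for illustration and not part of the problem data.

```python
import statistics as stats

incomes = [25, 27, 28, 29, 30, 32, 33, 34, 36, 40]
with_outlier = incomes + [500]

# How far does each statistic move when the single outlier is added?
mean_shift = stats.fmean(with_outlier) - stats.fmean(incomes)
median_shift = stats.median(with_outlier) - stats.median(incomes)
print(f"mean shift:   {mean_shift:.1f}")
print(f"median shift: {median_shift:.1f}")

# Hypothetical: an outlier ten times larger (5000 is not in the problem data).
with_extreme = incomes + [5000]
extreme_mean = stats.fmean(with_extreme)
extreme_median = stats.median(with_extreme)
print(f"mean with 5000:   {extreme_mean:.1f}")
print(f"median with 5000: {extreme_median}")
```

The median is identical for `500` and `5000`, since only the position of the outlier in the sorted order matters, while the mean keeps growing with its magnitude.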

## Problem 2 — Modes and Multimodality in Survey Data

You run a survey asking people which **operating system** they primarily use:

```python
responses = [
    "Windows", "Linux", "macOS", "Windows", "Linux",
    "Windows", "Windows", "macOS", "Linux", "Linux",
    "Linux", "Linux", "macOS", "Windows", "Other"
]
```

1. Use `statistics.mode` to find the single most common OS.
2. Use `statistics.multimode` to find *all* operating systems that share the maximum frequency.
3. Write a function

```python
def count_modes(values: Sequence[Any]) -> dict[str, Any]:
    ...
```

that returns a dictionary containing:
- `'mode'`: the value returned by `stats.mode`
- `'multimode'`: the list returned by `stats.multimode`
- `'frequency'`: the frequency of the most common value(s)

Your implementation must **not** manually compute frequency tables.
4. Explain what happens if the list contains *no* repeated values. What does `multimode` return?

### Solution 2

In [3]:
def count_modes(values: Sequence[Any]) -> Dict[str, Any]:
    """
    Compute mode-related statistics for a sequence of (hashable) values.
    Returns:
        {
            'mode': the single mode (statistics.mode),
            'multimode': list of all modes (statistics.multimode),
            'frequency': frequency of any of the modes
        }
    """
    if not values:
        raise ValueError("values must be non-empty")

    single_mode = stats.mode(values)
    all_modes = stats.multimode(values)
    representative_mode = all_modes[0]
    freq = values.count(representative_mode)

    return {
        'mode': single_mode,
        'multimode': all_modes,
        'frequency': freq
    }

responses = [
    'Windows', 'Linux', 'macOS', 'Windows', 'Linux',
    'Windows', 'Windows', 'macOS', 'Linux', 'Linux',
    'Linux', 'Linux', 'macOS', 'Windows', 'Other'
]

mode_info = count_modes(responses)
mode_info

{'mode': 'Linux', 'multimode': ['Linux'], 'frequency': 6}

#### Discussion

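- `mode` returns a single most common value. When several values tie, it returns the first one encountered (Python 3.8+; earlier versions raised `StatisticsError`).
- `multimode` returns a list of **all** values that share the highest frequency, in order of first occurrence.

If every value occurs exactly once, each value is a mode, so `multimode` returns all of them in their original order.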
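
These edge cases can be checked with a short sketch. The list `unique_values` is made up for illustration and is not survey data. The tie-breaking behavior of `mode` shown here assumes Python 3.8 or later.

```python
import statistics as stats

unique_values = ["Windows", "Linux", "macOS"]  # hypothetical: no value repeats

all_modes = stats.multimode(unique_values)  # every value ties at frequency 1
first_mode = stats.mode(unique_values)      # Python 3.8+: first value encountered
empty_modes = stats.multimode([])           # empty input gives an empty list

print(all_modes)
print(first_mode)
print(empty_modes)

# mode() has no sensible answer for empty input and raises instead.
try:
    stats.mode([])
except stats.StatisticsError as exc:
    print(f"mode([]) raised: {exc}")
```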

## Problem 3 — Sample vs Population Variance

A factory measures widget lengths:

```python
lengths = [100.1, 99.8, 100.4, 100.0, 99.9, 100.2, 100.3, 99.7]
```

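1. Compute the population variance and standard deviation (`pvariance`, `pstdev`).
2. Compute the sample variance and standard deviation (`variance`, `stdev`).
3. Implement a function `summarize_variability(values)` that returns all four quantities in a dictionary.
4. Explain why the sample variance is never smaller than the population variance for the same data.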

### Solution 3

In [4]:
def summarize_variability(values: Sequence[float]) -> Dict[str, float]:
    """
    Summarize variability using population and sample formulas.
    """
    if len(values) < 2:
        raise ValueError("Need at least two values.")

    pstd = stats.pstdev(values)
    pvar = stats.pvariance(values)
    sstd = stats.stdev(values)
    svar = stats.variance(values)

    return {
        'pstdev': pstd,
        'pvariance': pvar,
        'stdev': sstd,
        'variance': svar
    }

lengths = [100.1, 99.8, 100.4, 100.0, 99.9, 100.2, 100.3, 99.7]
var_summary = summarize_variability(lengths)
difference_variance = var_summary['variance'] - var_summary['pvariance']

var_summary, difference_variance

({'pstdev': 0.22912878474779216,
  'pvariance': 0.052500000000000074,
  'stdev': 0.24494897427831797,
  'variance': 0.06000000000000008},
 0.007500000000000007)

#### Discussion

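Both variances use the same sum of squared deviations from the mean. The population variance divides that sum by *n*, while the sample variance divides it by *n − 1*. The sample variance is therefore larger by the factor *n / (n − 1)* (here 8/7, which takes 0.0525 to 0.06). The two are equal only when all values are identical and both variances are zero.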
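
The relationship can be verified against the raw definitions. This is a minimal sketch: `sum_sq` and `ratio` are names introduced here, and the comparisons use `math.isclose` because a plain floating-point sum may round slightly differently from the module's internal computation.

```python
import math
import statistics as stats

lengths = [100.1, 99.8, 100.4, 100.0, 99.9, 100.2, 100.3, 99.7]
n = len(lengths)

# Sum of squared deviations from the mean, shared by both formulas.
mean_length = stats.fmean(lengths)
sum_sq = sum((x - mean_length) ** 2 for x in lengths)

# Population variance divides by n, sample variance by n - 1.
print(math.isclose(stats.pvariance(lengths), sum_sq / n))
print(math.isclose(stats.variance(lengths), sum_sq / (n - 1)))

# Their ratio is n / (n - 1), which is always greater than 1.
ratio = stats.variance(lengths) / stats.pvariance(lengths)
print(ratio, n / (n - 1))
```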

## Problem 4 — Modeling Exam Scores with `NormalDist`

### Solution 4

In [5]:
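# Model exam scores as a normal distribution with mean 72 and standard deviation 10.
exam_dist = stats.NormalDist(mu=72, sigma=10)
# P(score >= 90) is the upper tail beyond 90.
prob_at_least_90 = 1 - exam_dist.cdf(90)
# Score at the 95th percentile.
score_95th = exam_dist.inv_cdf(0.95)
# Curve: add 5 points, then scale by 1.1 (NormalDist supports both operations).
curved_dist = 1.1 * (exam_dist + 5)
new_mean = curved_dist.mean
new_sigma = curved_dist.stdev
curved_score_95th = curved_dist.inv_cdf(0.95)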

prob_at_least_90, score_95th, (new_mean, new_sigma), curved_score_95th

(0.03593031911292577, 88.44853626951472, (84.7, 11.0), 102.79338989646618)
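
Arithmetic on a `NormalDist` is an affine transformation: adding a constant shifts the mean, and multiplying by a constant scales both the mean and the standard deviation. The sketch below checks this for the curve used above and contrasts it with applying the two steps in the opposite order. The names `scale_after_shift` and `shift_after_scale` are introduced here for illustration.

```python
import statistics as stats

exam_dist = stats.NormalDist(mu=72, sigma=10)

# Curve used in the solution: add 5 points, then scale by 1.1.
scale_after_shift = 1.1 * (exam_dist + 5)
# Alternative order for comparison: scale by 1.1, then add 5 points.
shift_after_scale = 1.1 * exam_dist + 5

# The order changes the mean but not the standard deviation.
print(scale_after_shift.mean, scale_after_shift.stdev)
print(shift_after_scale.mean, shift_after_scale.stdev)

# Quantiles transform the same way as individual scores.
p95_direct = scale_after_shift.inv_cdf(0.95)
p95_mapped = 1.1 * (exam_dist.inv_cdf(0.95) + 5)
print(p95_direct, p95_mapped)
```

Because the transformation is monotonic, mapping the original 95th-percentile score through the curve gives the same result as asking the curved distribution for its 95th percentile.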

## Problem 5 — Separation of Two Normal Populations

### Solution 5

In [6]:
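healthy_dist = stats.NormalDist(mu=0, sigma=1)
patient_dist = stats.NormalDist(mu=2, sigma=1.5)
# Overlapping coefficient: shared area under the two density curves (between 0 and 1).
overlap_hp = healthy_dist.overlap(patient_dist)
# Threshold at the 90th percentile of the healthy distribution.
t = healthy_dist.inv_cdf(0.90)
# Fraction of the patient distribution lying above that threshold.
prob_patient_above_t = 1 - patient_dist.cdf(t)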

overlap_hp, t, prob_patient_above_t

(0.4098862963387525, 1.2815515655446008, 0.684018457558181)
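
The value returned by `overlap` is the area under the pointwise minimum of the two density curves. As a rough cross-check, the sketch below approximates that area with a midpoint-rule sum. The integration range `[-10, 15]` and the step count are assumptions, chosen so that essentially all of the probability mass of both distributions is covered.

```python
import statistics as stats

healthy_dist = stats.NormalDist(mu=0, sigma=1)
patient_dist = stats.NormalDist(mu=2, sigma=1.5)

# Assumed integration grid (not part of the original problem).
lower, upper, steps = -10.0, 15.0, 25_000
width = (upper - lower) / steps

# Midpoint rule applied to min(pdf_healthy, pdf_patient).
numeric_overlap = sum(
    min(healthy_dist.pdf(x), patient_dist.pdf(x)) * width
    for x in (lower + (i + 0.5) * width for i in range(steps))
)

builtin_overlap = healthy_dist.overlap(patient_dist)
print(f"numeric: {numeric_overlap:.6f}  overlap(): {builtin_overlap:.6f}")
```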

## Problem 6 — Estimating a Normal Model from Data

### Solution 6

In [7]:
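random.seed(42)  # fixed seed so the results are reproducible
# Draw 500 samples from a normal distribution with mean 5.0 and standard deviation 2.0.
samples = [random.gauss(5.0, 2.0) for _ in range(500)]
# from_samples estimates mu and sigma from the data (sigma is the sample standard deviation).
fitted_dist = stats.NormalDist.from_samples(samples)
sample_mean = stats.fmean(samples)
sample_stdev = stats.stdev(samples)
params_comparison = {
    'sample_mean': sample_mean,
    'sample_stdev': sample_stdev,
    'fitted_mean': fitted_dist.mean,
    'fitted_stdev': fitted_dist.stdev
}
# P(X < 3) under the fitted model.
prob_less_than_3 = fitted_dist.cdf(3)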

params_comparison, prob_less_than_3

({'sample_mean': 5.0823505198505305,
  'sample_stdev': 1.997686398908547,
  'fitted_mean': 5.082350519850531,
  'fitted_stdev': 1.997686398908547},
 0.14861751577771526)
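
One way to sanity-check the fitted model is to compare its probability with the raw fraction of samples below 3 and with the value implied by the generating distribution. This is a sketch rather than a formal goodness-of-fit test. The names `model_prob`, `empirical_prob` and `true_prob` are introduced here for illustration.

```python
import random
import statistics as stats

random.seed(42)
samples = [random.gauss(5.0, 2.0) for _ in range(500)]
fitted_dist = stats.NormalDist.from_samples(samples)

# Probability of a value below 3 according to the fitted model.
model_prob = fitted_dist.cdf(3)
# Fraction of the generated samples that actually fall below 3.
empirical_prob = sum(x < 3 for x in samples) / len(samples)
# Probability under the distribution the samples were drawn from.
true_prob = stats.NormalDist(mu=5.0, sigma=2.0).cdf(3)

print(f"fitted model: {model_prob:.4f}")
print(f"empirical:    {empirical_prob:.4f}")
print(f"true model:   {true_prob:.4f}")
```

With 500 samples the three values should be close but not identical, since both the fitted parameters and the empirical fraction carry sampling error.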