# Data 6: Summary Statistics

* Percentiles
* Center: Mean, Median
* Spread: Range, IQR, Standard Deviation
* Box plots

Source: [Data 8 Fall 2025 Lecture 08](https://github.com/data-8/materials-fa25/blob/main/lec/lec08/lec08.ipynb) and [Lecture 04](https://github.com/data-8/materials-fa25/blob/main/lec/lec04/lec04.ipynb)

In [None]:
from datascience import *
import numpy as np
import warnings
warnings.filterwarnings("ignore")

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
plots.rcParams["patch.force_edgecolor"] = True

import seaborn as sns

## Histograms: Practice

`top_movies`:

* `Title`: title of the movie
* `Studio`: name of the studio that produced the movie
* `Gross`: domestic box office gross in dollars
* `Gross (Adjusted)`: the gross amount that would have been earned from ticket sales at 2016 prices
* `Year`: release year of the movie
* `Age`: years since the movie was released

In [None]:
top_movies = Table.read_table('data/top_movies_2017.csv')
ages = 2025 - top_movies.column('Year')
top_movies = top_movies.with_column('Age', ages)
top_movies

In [None]:
equally_spaced_bins = np.arange(0, 120, 20)
unequally_spaced_bins = make_array(0, 20, 40, 80, 100)

### Which of these visualizations is a histogram? Select all.

In [None]:
top_movies_binned = top_movies.bin('Age', bins=equally_spaced_bins)
top_movies_binned

In [None]:
top_movies.hist('Age', bins=equally_spaced_bins)

In [None]:
top_movies.hist('Age', bins=equally_spaced_bins, density=False)

In [None]:
top_movies_binned = top_movies.bin('Age', bins=unequally_spaced_bins)
top_movies_binned

In [None]:
top_movies.hist('Age', bins=unequally_spaced_bins)

In [None]:
top_movies.hist('Age', bins=unequally_spaced_bins, density=False)
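The difference between the density and count versions above matters most when bins have unequal widths. Here is a minimal plain-Python sketch, assuming (as in the Data 8 text) that a histogram bar's height is percent per unit, so that its area equals the percent of entries in the bin. The `ages` list is made up for illustration, and the helper `bin_counts` is hypothetical; it assumes each bin includes its left edge and only the last bin includes its right edge.

```python
# Hypothetical ages, not taken from top_movies
ages = [5, 12, 18, 25, 33, 38, 45, 52, 67, 95]
bin_edges = [0, 20, 40, 80, 100]  # same edges as unequally_spaced_bins


def bin_counts(data, edges):
    """Count entries per bin: [lo, hi), except the last bin, which is [lo, hi]."""
    counts = [0] * (len(edges) - 1)
    for x in data:
        for i in range(len(counts)):
            lo, hi = edges[i], edges[i + 1]
            is_last_bin = i == len(counts) - 1
            if lo <= x < hi or (is_last_bin and x == hi):
                counts[i] += 1
                break
    return counts


counts = bin_counts(ages, bin_edges)
widths = [hi - lo for lo, hi in zip(bin_edges, bin_edges[1:])]
percents = [100 * c / len(ages) for c in counts]

# Density height: percent per unit on the horizontal axis
heights = [p / w for p, w in zip(percents, widths)]

# Bar areas (height * width) recover the percents and add up to 100
total_area = sum(h * w for h, w in zip(heights, widths))

print("counts: ", counts)
print("heights:", heights)
print("area:   ", total_area)
```

In this sketch the 40 to 80 bin holds as many entries as the 20 to 40 bin, but it is twice as wide, so its density height is half as tall. A plot of raw counts would draw the two bars at the same height and hide that difference.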

## Percentiles

### Percentiles help us describe ordered lists

**Discussion Questions** 

- Which statements are true when `s = array([1, 5, 7, 3, 9])`?

1. The 50th percentile of `s` is 5.
2. The 10th percentile of `s` is 6.
3. The 39th percentile of `s` is the same as the 40th percentile of `s`. 
4. The 40th percentile of `s` is the same as the 41st percentile of `s`. 

In [None]:
s = make_array(1,5,7,3,9)

In [None]:
percentile(50, s) == 5

In [None]:
percentile(10, s) == 6

In [None]:
percentile(39, s) == percentile(40, s)

In [None]:
percentile(40, s) == percentile(41, s)

---

**Bonus practice**

In [None]:
t = make_array(1,3,3,7,9)

In [None]:
percentile(40, t)

In [None]:
percentile(60, t)
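To check the answers above by hand, here is a minimal sketch of a percentile computation, assuming the definition in the Data 8 text: the `p`th percentile is the smallest value in the collection that is at least as large as `p`% of the values. The function name `nearest_rank_percentile` is hypothetical and is not part of the `datascience` library.

```python
import math


def nearest_rank_percentile(p, values):
    """Smallest element that is at least as large as p% of the elements."""
    ordered = sorted(values)
    # Position (counting from 1) in the sorted list; round up when p% of n is fractional
    k = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[k - 1]


s = [1, 5, 7, 3, 9]
t = [1, 3, 3, 7, 9]

for p in (10, 39, 40, 41, 50):
    print(f"percentile {p} of s:", nearest_rank_percentile(p, s))

for p in (40, 60):
    print(f"percentile {p} of t:", nearest_rank_percentile(p, t))
```

With five elements, 40% of the list is exactly 2 elements, so the 39th and 40th percentiles both land on the 2nd sorted value, while the 41st rounds up to the 3rd.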

## Box Plots

From your text:

> [SF OpenData](https://datasf.org/opendata/) is a website where the City and County of San Francisco make some of their data publicly available. One of the data sets contains compensation data for employees of the City. These include medical professionals at City-run hospitals, police officers, fire fighters, transportation workers, elected officials, and all other employees of the City.

For teaching purposes, we will treat this dataset as the **population** today. In general, when doing inference, you will not have the population, only a (random) sample from it.

We will consider only employees whose salary is above a minimum for part-time workers:

    $15/hr × 20 hr/wk × 50 weeks = $15,000 per year

In [None]:
population = Table.read_table('data/san_francisco_2019.csv') 

In [None]:
min_salary = 15 * 20 * 50
min_salary

In [None]:
# "are" predicates to be covered next time!
population = population.where('Salary', are.above(min_salary))
population

In [None]:
population.hist('Total Compensation', bins = np.arange(0, 800000, 25000))

- Population parameter for today: The *median* total compensation of all City employees of San Francisco (in 2019).
- If you have the entire population, just calculate the parameter. 

In [None]:
population.num_rows

In [None]:
pop_median = percentile(50, population.column('Total Compensation'))
pop_median

In [None]:
public_health = population.where("Department", "Public Health")
public_health.num_rows

In [None]:
public_health_median = percentile(50, public_health.column("Total Compensation"))
public_health_median

Hm...

## Total compensation disaggregated by department

In [None]:
population.group("Department").sort("count", descending=True)

In [None]:
pop_top5 = population.where("Department", are.contained_in(["Public Health", "Municipal Transportation Agcy", "Police"]))
pop_top5.show(5)

In [None]:
pop_top5.group("Department", np.mean)

In [None]:
pop_top5.group("Department", np.median)

### Box plot

You do not need to know how to draw box plots, only how to interpret them. A box plot displays five summary values:
* Minimum
* First quartile (25th percentile)
* Median (50th percentile)
* Third quartile (75th percentile)
* Maximum
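The five values listed above can be computed directly from a collection of numbers. Below is a minimal sketch using a made-up list of compensations (in thousands of dollars) and the same percentile definition as the Data 8 text. The helper names `nearest_rank_percentile` and `five_number_summary` are hypothetical.

```python
import math


def nearest_rank_percentile(p, values):
    """Smallest element that is at least as large as p% of the elements."""
    ordered = sorted(values)
    k = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[k - 1]


def five_number_summary(values):
    """Return the five values that a box plot displays."""
    return {
        "min": min(values),
        "q1": nearest_rank_percentile(25, values),
        "median": nearest_rank_percentile(50, values),
        "q3": nearest_rank_percentile(75, values),
        "max": max(values),
    }


# Hypothetical compensations in thousands of dollars, not from the SF data
compensation = [40, 55, 60, 72, 85, 90, 120, 300]
summary = five_number_summary(compensation)
print(summary)
```

The box spans the first to the third quartile, the line inside it marks the median, and with `whis=(0, 100)` the whiskers extend all the way to the minimum and maximum.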

In [None]:
# just run this cell
plots.figure(figsize=(12, 5))
sns.boxplot(data=pop_top5, x="Total Compensation", hue="Department", whis=(0, 100))
print("Box plot of Total Compensation by Department")

In [None]:
pop_top5.where("Department", "Police").hist("Total Compensation")
print("Police Total Compensation")

In [None]:
pop_top5.where("Department", "Municipal Transportation Agcy").hist("Total Compensation")
print("Municipal Transportation Agency Total Compensation")

In [None]:
pop_top5.where("Department", "Public Health").hist("Total Compensation")
print("Public Health Total Compensation")

## Center: The mean/average helps us quantify "center"

In [None]:
values = make_array(2, 3, 3, 9)
values

In [None]:
np.sum(values)/len(values)

In [None]:
np.average(values)

In [None]:
np.mean(values)

In [None]:
(2 + 3 + 3 + 9)/4

In [None]:
2*(1/4) + 3*(2/4) + 9*(1/4)

In [None]:
2*0.25 + 3*0.5 + 9*0.25

In [None]:
values_table = Table().with_columns('value', values)
values_table

In [None]:
bins_for_display = np.arange(0.5, 10.6, 1)
values_table.hist('value', bins = bins_for_display)

The average depends on the proportions in which the distinct values appear, not on the number of items in the collection!
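The point above can be checked with a short sketch that computes the mean as a weighted sum of the distinct values, each weighted by its proportion. The function name `mean_from_proportions` is hypothetical, and the two lists are chosen so that the values 2, 3, and 9 appear in the same proportions (1/4, 1/2, 1/4) but with different counts.

```python
from collections import Counter


def mean_from_proportions(values):
    """Mean as the sum over distinct values of value * proportion."""
    n = len(values)
    return sum(value * count / n for value, count in Counter(values).items())


small = [2, 3, 3, 9]                    # 4 items
large = [2] * 10 + [3] * 20 + [9] * 10  # 40 items, same proportions

small_mean = mean_from_proportions(small)
large_mean = mean_from_proportions(large)

print(small_mean, large_mean)
print(sum(small) / len(small), sum(large) / len(large))
```

Both collections have the same mean even though one is ten times as long, because only the proportions enter the calculation.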

In [None]:
## Make array of 10 2s, 20 3s, and 10 9s

new_vals = make_array(2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
                      3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
                      9, 9, 9, 9, 9, 9, 9, 9, 9, 9)

In [None]:
Table().with_column('value', new_vals).hist(bins = bins_for_display)

In [None]:
np.average(new_vals)

In [None]:
Table().with_column('value', new_vals).hist(bins = bins_for_display)
plots.ylim(-0.04, 0.5)
plots.plot([0, 10], [0, 0], color='grey', lw=2)
plots.scatter(4.25, -0.015, marker='^', color='red', s=100)
plots.title('Average as a Center of Mass');

Bonus material (optional) - weighted means!

In [None]:
np.average(make_array(2, 3, 3, 9))

In [None]:
np.average(make_array(2, 3, 9), weights=(1, 2, 1))

## Spread: Standard deviation helps us quantify “variability”

This class does not focus much on standard deviation; we prefer the IQR where possible. Take Data 8 to learn more about standard deviations.

In [None]:
sd_table = Table().with_columns('Value', values)
sd_table

In [None]:
average_value = np.mean(values)
average_value

In [None]:
deviations = values - average_value
sd_table = sd_table.with_column('Deviation', deviations)
sd_table

In [None]:
sum(deviations)

In [None]:
np.mean(np.abs(deviations))

In [None]:
sd_table = sd_table.with_column('Squared Deviation', deviations ** 2)
sd_table

**Variance** of the data: the mean squared deviation from the average

In [None]:
variance = np.mean(deviations ** 2)
variance

**Standard Deviation** (SD): root mean squared deviation from average = square root of the variance

In [None]:
sd = variance ** 0.5
sd

In [None]:
np.std(values)
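The steps above can be collected into one plain-Python sketch. It divides by the number of values, which is also what `np.std` does by default, and it compares the result with `statistics.pstdev` from the standard library. The variable names here are illustrative only.

```python
import math
import statistics

values = [2, 3, 3, 9]

# Step 1: the average
mean = sum(values) / len(values)

# Step 2: deviations from the average (these always sum to zero)
deviations = [v - mean for v in values]

# Step 3: variance is the mean of the squared deviations
variance = sum(d ** 2 for d in deviations) / len(values)

# Step 4: SD is the square root of the variance
sd = math.sqrt(variance)

print("mean:", mean)
print("sum of deviations:", sum(deviations))
print("variance:", variance)
print("SD:", sd)
print("statistics.pstdev:", statistics.pstdev(values))
```

Note that `statistics.stdev` (without the `p`) divides by one less than the number of values, so it gives a slightly larger answer than the SD defined here.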

### Compare to IQR

Which is more "stable," i.e., less impacted by outliers?

In [None]:
first_quart = percentile(25, values)
first_quart

In [None]:
third_quart = percentile(75, values)
third_quart

In [None]:
iqr = third_quart - first_quart
iqr

In [None]:
values
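One way to explore the stability question is to change a single value into an extreme outlier and see how each measure of spread responds. This is a minimal sketch under the same percentile definition as the Data 8 text; the helper names and the outlier value 900 are made up for illustration.

```python
import math
import statistics


def nearest_rank_percentile(p, values):
    """Smallest element that is at least as large as p% of the elements."""
    ordered = sorted(values)
    k = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[k - 1]


def iqr(values):
    """Interquartile range: third quartile minus first quartile."""
    return nearest_rank_percentile(75, values) - nearest_rank_percentile(25, values)


original = [2, 3, 3, 9]
with_outlier = [2, 3, 3, 900]  # hypothetical: the largest value becomes extreme

for name, data in (("original", original), ("with outlier", with_outlier)):
    print(name, "IQR:", iqr(data), "SD:", statistics.pstdev(data))
```

In this example the IQR is identical for both lists, because the quartiles ignore how far the largest value sits from the rest, while the SD grows by more than a factor of 100.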