[Allen Downey: The Inspection Paradox is Everywhere | PyData New York 2019](https://www.youtube.com/watch?v=cXWTHfvycyM)

https://towardsdatascience.com/the-inspection-paradox-is-everywhere-2ef1c2e9d709

## What is the inspection paradox?

A form of biased sampling: the way a population is observed makes some of its members more likely to be seen than others

The Inspection Paradox accounts for
- train companies losing money on empty trains while passengers complain about how their train is always full
- call centers always experiencing higher than normal call volumes
- why you always wait longer for the train
- why your friends have more friends than you

## Example - determining average family size

A seemingly reasonable approach would be to ask a sample of people "How many children does your mother have?"

There are two problems here:
- families with no children are not represented
- families with many children are over-represented

Families of $x$ children are over-represented by a factor of $x$!

We can show this using a simple dataset:

In [None]:
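from collections import namedtuple

import numpy as np
import pandas as pd

# one record per child; family_size is the number of children in that child's family
Person = namedtuple('Person', ['name', 'family', 'family_size'])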

data = pd.DataFrame([
    Person('eve', 'zero', 1),
    
    Person('adam', 'one', 2),
    Person('rebecca', 'one', 2),
    
    Person('jane', 'two', 3),
    Person('james', 'two', 3),
    Person('john', 'two', 3),
    
    Person('bob', 'three', 3),
    Person('bella', 'three', 3),
    Person('Bianca', 'three', 3),
])

data

What is the size of the average family?

In [None]:
(1 + 2 + 3 + 3) / 4

In [None]:
# mean over people: each family is counted once per child
data.loc[:, 'family_size'].mean()

In [None]:
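# mean over families: each family is counted once
# (select the numeric column first; averaging the 'name' column fails in recent pandas)
data.groupby('family')['family_size'].mean().mean()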
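
The gap between the per-person and per-family averages can be worked out by hand. As a minimal sketch (the `sizes` list simply restates the four families in the dataset above): a family of size $x$ contributes $x$ respondents who each answer $x$, so the per-person mean is $\sum x^2 / \sum x$ rather than $\sum x / n$.

```python
# Family sizes from the example dataset, one entry per family.
sizes = [1, 2, 3, 3]

# Mean over families: every family counts once.
family_mean = sum(sizes) / len(sizes)

# Mean over children: a family of size x appears x times, each time reporting x.
person_mean = sum(x * x for x in sizes) / sum(sizes)

print(family_mean)            # 2.25
print(round(person_mean, 3))  # 2.556
```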

## Length biased sampling

Length-biased sampling covers any process where members of a population are sampled in proportion to their size, duration, length, etc.

The probability of being sampled depends on size
- so the probability of being sampled is not independent of the quantity being measured
- -> biased sampling!

The Inspection Paradox is the resulting surprise: a seemingly small change in who or what gets sampled gives a very different answer

The degree of oversampling is proportional to size: an item twice as large is twice as likely to be sampled
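
A small simulation can illustrate the proportionality. This is only a sketch: the `sizes` array, the seed and the number of draws are arbitrary choices made for illustration, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily

# Hypothetical item sizes (could be class sizes, durations, lengths, ...).
sizes = np.array([1, 2, 3, 4])

# Length-biased sampling: the chance of picking an item is proportional to its size.
prob = sizes / sizes.sum()
draws = rng.choice(sizes, size=100_000, p=prob)

# Share of draws that landed on each size.
observed = np.array([(draws == s).mean() for s in sizes])

# Relative to uniform sampling (1/4 each), oversampling grows linearly with size.
oversampling = observed / (1 / len(sizes))
print(oversampling.round(2))
```

The printed factors should come out roughly in the ratio 1 : 2 : 3 : 4, matching the sizes.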

## Example - class sizes

Ask teachers how big their classes are: 31

Ask students how big their classes are: 56

Both are correct!
- they measure different things
- they are reporting averages over different populations
- one describes the teachers' experience, the other the students'
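
The two viewpoints can be reproduced with made-up numbers. In this sketch `class_sizes` is hypothetical (it is not the data behind the 31 and 56 above); the students' view is built by letting every student in a class report that class's size.

```python
import numpy as np

# Hypothetical class sizes, one entry per class (not the data from the talk).
class_sizes = np.array([10, 20, 30, 150])

# Teachers' view: each class is reported once.
teacher_mean = class_sizes.mean()

# Students' view: a class of size x is reported by x students.
student_reports = np.repeat(class_sizes, class_sizes)
student_mean = student_reports.mean()

print(teacher_mean)            # 52.5
print(round(student_mean, 1))  # 113.8
```

One large class is enough to pull the students' average far above the teachers' average, even though both are computed from the same classes.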

In [None]:
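# one row per family, so every family is counted exactly once
# (select the numeric column; averaging the 'name' column fails in recent pandas)
unbiased_sample = data.groupby('family')[['family_size']].mean()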

unbiased_sample

In [None]:
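def resample_weighted(sample, weights):
    """Draw len(sample) values from sample, with replacement, with probability proportional to weights."""
    # flatten both inputs to 1-D arrays
    sample = np.array(sample).reshape(-1)
    weights = np.array(weights).reshape(-1)

    # normalize the weights into probabilities
    prob = weights / np.sum(weights)
    return np.random.choice(sample, len(sample), p=prob)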

In [None]:
# size-weighted resampling (see the talk at around 8:44)
np.mean(resample_weighted(unbiased_sample.values, unbiased_sample.values))

In [None]:
# equal weights: every family is equally likely to be drawn
np.mean(resample_weighted(unbiased_sample.values, np.ones_like(unbiased_sample.values, dtype=float)))

## Exercise

It is possible to use a biased sample to recover the unbiased distribution
- resample using the inverse weights ($1/x$ for an item of size $x$)
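
One possible way to approach the exercise, as a sketch only: the `biased` array restates the family size reported by each of the nine children in the dataset above, and a weighted mean is used in place of random resampling so that the result is deterministic.

```python
import numpy as np

# Biased sample: family size as reported by each of the nine children in the dataset.
biased = np.array([1, 2, 2, 3, 3, 3, 3, 3, 3])

# A family of size x is seen x times, so weight each report by 1/x.
weights = 1 / biased

# Weighted mean of the biased sample recovers the per-family mean.
unbiased_mean = np.average(biased, weights=weights)

print(biased.mean())   # biased, per-person mean (about 2.56)
print(unbiased_mean)   # approximately 2.25, the per-family mean
```

The same `weights` could be passed to a resampling function to draw an approximately unbiased sample instead of only correcting the mean.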

## Takeaway

If the class size is $x$, it will be over-represented by a factor of $x$!