In [None]:
from collections import namedtuple

import numpy as np
import pandas as pd

# The Inspection Paradox

References (Allen Downey, *The Inspection Paradox is Everywhere*):
- [PyData New York 2019 talk](https://www.youtube.com/watch?v=cXWTHfvycyM)
- [Medium post](https://towardsdatascience.com/the-inspection-paradox-is-everywhere-2ef1c2e9d709)

## What is the inspection paradox?

The inspection paradox is a form of **biased sampling**
- the sample is biased by the size / length / duration of the things being sampled

The inspection paradox accounts for
- train companies losing money on empty trains while passengers complain that their train is always full
- why you always seem to wait longer than average for the train
- why call centers always seem to be experiencing higher than normal call volumes when you call
- why your friends have more friends than you

## Example - determining average family size

A seemingly reasonable approach would be to ask people the question "How many children does your mother have?"

There are two problems here:
- families with no children are not represented
- families with many children are over-represented

A family with $x$ children is counted $x$ times, so it is over-represented by a factor of $x$!
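The factor of $x$ can be written down explicitly. The symbols $p(x)$ and $q(x)$ are introduced here for illustration only. Assume $p(x)$ is the fraction of families with $x$ children and $q(x)$ is the fraction of surveyed people who answer $x$. Each family with $x$ children contributes $x$ answers, so

$$
q(x) = \frac{x \, p(x)}{\sum_k k \, p(k)}
$$

The denominator only normalises the distribution. The ratio $q(x) / p(x)$ is proportional to $x$, and $q(0) = 0$, which is why families with no children never show up in the survey.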

We can show this using a simple dataset:

In [None]:
Person = namedtuple('Person', ['name', 'family', 'family_size'])

data = pd.DataFrame([
    Person('eve', 'zero', 1),
    
    Person('adam', 'one', 2),
    Person('rebecca', 'one', 2),
    
    Person('jane', 'two', 3),
    Person('james', 'two', 3),
    Person('john', 'two', 3),
    
    Person('bob', 'three', 3),
    Person('bella', 'three', 3),
    Person('bianca', 'three', 3),
])

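# Cast to float. Plain column assignment replaces the column, whereas
# `.loc[:, col] = ...` can write in place and keep the integer dtype.
data['family_size'] = data['family_size'].astype(float)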

data

What is the size of the average family?  

We can ask each person and report the mean
- families with more children will be over-represented!

In [None]:
data.loc[:, 'family_size'].mean()

Or we can take one sample per family:

In [None]:
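# Select the numeric column first: averaging the string column 'name'
# raises a TypeError in recent pandas versions.
data.groupby('family')['family_size'].mean().mean()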

## Length biased sampling

Length-biased sampling covers any sampling process in which items are sampled in proportion to their size, duration, length, etc.

The probability of sampling an item depends on its size
- so the probability of being sampled is not independent of the quantity being measured
- this makes the sample biased
- the degree of oversampling is proportional to the class size
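Sampling in proportion to size is the same as taking a size-weighted average. The sketch below is not part of the original notebook. It hard-codes the four family sizes from the example dataset, and the variable names are illustrative. It shows that the length-biased mean equals $E[x^2] / E[x]$.

```python
import numpy as np

# Family sizes from the example dataset, one entry per family
# (families 'zero', 'one', 'two', 'three').
family_sizes = np.array([1.0, 2.0, 3.0, 3.0])

# Unbiased view: every family counts once.
per_family_mean = family_sizes.mean()

# Length-biased view: each family is sampled in proportion to its size,
# so the weights are the sizes themselves.
per_child_mean = np.average(family_sizes, weights=family_sizes)

# The same quantity written as E[x^2] / E[x].
ratio_of_moments = (family_sizes ** 2).mean() / family_sizes.mean()

print(per_family_mean, per_child_mean, ratio_of_moments)
```

The last two values agree (23/9, roughly 2.56) and both exceed the per-family mean of 2.25. Unless every class has the same size, the length-biased mean is always the larger of the two.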

Returning to our example - which average family size is correct?

Both are correct!
- they measure different things
- they report averages over different populations
- one describes the experience of children, the other the average family size

Our unbiased sample is the average family size:

In [None]:
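# One value per family. Select the numeric column so the string column
# 'name' is not averaged (a TypeError in recent pandas versions).
unbiased_sample = data.groupby('family')['family_size'].mean()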

unbiased_sample

In [None]:
unbiased_sample.mean()

Our biased sample is the experience of children:

In [None]:
biased_sample = data.loc[:, 'family_size']

biased_sample.mean()
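If only the biased sample is available, the per-family average can still be estimated by weighting each answer by the inverse of its size. This is a minimal sketch and goes slightly beyond the original notebook. It assumes every answer is at least 1, and the per-person answers are hard-coded from the example dataset so the block runs on its own.

```python
import numpy as np

# Per-child answers from the example dataset (one entry per person).
biased_sample = np.array([1, 2, 2, 3, 3, 3, 3, 3, 3], dtype=float)

# Each answer x came from a family that was counted x times,
# so weight each answer by 1 / x to undo the over-representation.
weights = 1.0 / biased_sample

# Weighted mean of the answers: an estimate of the per-family average.
corrected_mean = np.average(biased_sample, weights=weights)

# The weights sum to the number of families that were reached.
estimated_n_families = weights.sum()

print(corrected_mean, estimated_n_families)
```

The corrected mean matches the per-family average of 2.25. Families with no children remain invisible to this correction, because nobody in the sample can report them.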

## Takeaway

If a class has size $x$, it will be over-represented by a factor of $x$!
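The takeaway can be checked directly on the example data by counting how many people report each size and dividing by the number of families of that size. This sketch is self-contained. The `people` list repeats the `(family, family_size)` pairs from the dataset above, and the other names are illustrative.

```python
from collections import Counter

# (family, family_size) for each person in the example dataset.
people = [
    ('zero', 1),
    ('one', 2), ('one', 2),
    ('two', 3), ('two', 3), ('two', 3),
    ('three', 3), ('three', 3), ('three', 3),
]

# How many people report each size (the biased, per-child view).
people_per_size = Counter(size for _, size in people)

# How many distinct families have each size (the unbiased, per-family view).
families_per_size = Counter(size for _, size in set(people))

# Over-representation factor: people counted per family of that size.
factors = {
    size: people_per_size[size] / families_per_size[size]
    for size in sorted(families_per_size)
}

print(factors)
```

Each factor equals the family size itself: size 1 is counted once per family, size 2 twice and size 3 three times.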