# Exercise 2 (Sampling Schemes)

A child is selected at random from a group of children. What is the probability that
the child is the firstborn in its family? For simplicity, we consider only families that have
children. (Who is the firstborn in a childless family?) Note that even with multiple births, e.g., twins, one child is born before the others.
The answer depends on the sampling scheme. Consider the two following schemes:

1. There is an urn for each family, which contains all the children in that family. A family is selected at random, and then a child is selected randomly from the family urn.

Let $N$ be the number of families, and $s_i$ the size of family $i$.
\begin{align}
p_1&=\sum_{i=1}^{N}\mathbb{P}(\text{family $i$ is selected})\times \frac{1}{s_i}\\
&=\sum_{i=1}^{N}\frac{1}{N}\times \frac{1}{s_i}\\
&=\frac{1}{N}\sum_{i=1}^{N}\frac{1}{s_i}. \square
\end{align}

2. All the children are put into one urn, and a child is selected at random.

One of these schemes always yields a greater probability of finding a firstborn child than the other, with equality only if all families are the same size. Which one? Prove it.

Note that there is a 1-to-1 correspondence between the set of families and the set of firstborns, so exactly $N$ of the children in the urn are firstborns. Letting $S:=\sum_{i=1}^{N}s_i$ denote the total number of children, we arrive at the simple expression
\begin{equation}
p_2=\frac{N}{S}. \square
\end{equation}
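Before simulating, the two formulas can be evaluated exactly on a small case. The following is a minimal sketch, assuming a hypothetical list of family sizes `example_sizes = [1, 2, 3]`; the helper names `p_scheme_1` and `p_scheme_2` are introduced here for illustration only.

```python
from fractions import Fraction


def p_scheme_1(sizes):
    """Exact p_1: pick a family uniformly, then a child uniformly within it."""
    n = len(sizes)
    return Fraction(1, n) * sum(Fraction(1, s) for s in sizes)


def p_scheme_2(sizes):
    """Exact p_2: pick a child uniformly from the single pooled urn."""
    return Fraction(len(sizes), sum(sizes))


# Hypothetical family sizes, chosen only for illustration.
example_sizes = [1, 2, 3]
p1 = p_scheme_1(example_sizes)
p2 = p_scheme_2(example_sizes)
print(p1, p2)  # 11/18 1/2
```

Using `Fraction` avoids floating-point rounding, so the two probabilities can be compared exactly.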

In [149]:
import random

In [242]:
# empirical check
num_families = 10
max_family_size = 2
families = {i+1:list(range(1,random.randint(1,max_family_size)+1)) for i in range(num_families)}
#print(families)
#print([1./len(value) for value in families.values()])

In [243]:
num_experiments = 10000
# scheme 1
counter = 0
for experiment in range(num_experiments):
    # choose family first
    family_index = random.randint(1,num_families)
    # probability that child is first born is 1/family_size
    counter += int(random.choice(families[family_index])==1)
empirical_p = counter/num_experiments
print("Empirical result = {0}".format(empirical_p))
theoretical_p = (1/num_families)*sum([1./len(value) for value in families.values()])
print("Theoretical result = {0}".format(theoretical_p))

Empirical result = 0.8001
Theoretical result = 0.8


In [244]:
# scheme 2
child_labels = [label for value in families.values() for label in value]
num_experiments = 10000
# count how often the sampled child is a firstborn (label 1)
counter = 0
for experiment in range(num_experiments):
    # sample child from single urn with all children
    counter += int(random.choice(child_labels)==1)
empirical_p = counter/num_experiments
print("Empirical result = {0}".format(empirical_p))
theoretical_p = num_families/len(child_labels)
print("Theoretical result = {0}".format(theoretical_p))

Empirical result = 0.7149
Theoretical result = 0.7142857142857143


The empirical and theoretical results above suggest that scheme 1 yields the greater probability, i.e. $p_1\geq p_2$. Let's prove this.
\begin{align}
\frac{p_1}{p_2}&=\frac{\frac{1}{N}\sum_{i=1}^{N}\frac{1}{s_i}}{\frac{N}{S}}\\
&=\frac{S\sum_{i=1}^{N}\frac{1}{s_i}}{N^{2}}.
\end{align}

Our aim now is to show $S\sum_{i=1}^{N}\frac{1}{s_i}\geq N^2$.
\begin{align}
S\sum_{i=1}^{N}\frac{1}{s_i}&=\sum_{i=1}^{N}s_i\sum_{j=1}^{N}\frac{1}{s_j}\\
&=\sum_{i=1}^{N}\sum_{j=1}^{N}\frac{s_i}{s_j}\\
&=\sum_{i\neq j}\frac{s_i}{s_j}+\sum_{i=j}\frac{s_i}{s_j}\\
&=\sum_{i\neq j}\frac{s_i}{s_j}+N.
\end{align}

Our goal now is to bound $\sum_{i\neq j}\frac{s_i}{s_j}$ from below. Pairing each term with its reciprocal gives
\begin{align}
\sum_{i\neq j}\frac{s_i}{s_j}&=\sum_{i>j}\left(\frac{s_i}{s_j}+\frac{s_j}{s_i}\right)\\
&=\sum_{i>j}\left(\frac{s_is_i}{s_js_i}+\frac{s_js_j}{s_is_j}\right)\\
&=\sum_{i>j}\frac{s_i^2+s_j^2}{s_js_i}.
\end{align}

Observe that $0\leq (s_i-s_j)^2=s_i^2+s_j^2-2s_is_j$, which implies $s_i^2+s_j^2\geq 2s_is_j$. Since there are $\frac{N^2-N}{2}$ pairs with $i>j$, this gives a lower bound for the expression above:
\begin{align}
\sum_{i>j}\frac{s_i^2+s_j^2}{s_js_i}&\geq \sum_{i>j}\frac{2s_is_j}{s_js_i}\\
&=\sum_{i>j}2\\
&=2\left(\frac{N^2-N}{2}\right)\\
&=N^2-N.
\end{align}
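
As a sanity check on this bound, here is a small worked example. The sizes $(s_1,s_2,s_3)=(1,2,3)$ are an arbitrary choice for illustration, not taken from the simulation above.
\begin{align}
\sum_{i>j}\frac{s_i^2+s_j^2}{s_js_i}&=\frac{2^2+1^2}{1\cdot 2}+\frac{3^2+1^2}{1\cdot 3}+\frac{3^2+2^2}{2\cdot 3}\\
&=\frac{5}{2}+\frac{10}{3}+\frac{13}{6}\\
&=8\geq 6=N^2-N.
\end{align}

Adding the $N=3$ diagonal terms gives $S\sum_{i=1}^{N}\frac{1}{s_i}=6\times\frac{11}{6}=11\geq 9=N^2$, as expected.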

Putting it all together, we finally get
\begin{align}
\frac{p_1}{p_2}=\frac{\overbrace{S\sum_{i=1}^{N}\frac{1}{s_i}}^{\geq N^2-N+N}}{N^2}\geq1.
\end{align}

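Equality holds exactly when every family has the same size, $s_i\equiv s$: the only inequality used was $(s_i-s_j)^2\geq 0$, which is tight for every pair $i>j$ if and only if all the $s_i$ coincide. In that case, $S=sN$ and $\sum_{i=1}^{N}\frac{1}{s_i}=\frac{N}{s}$, so the above expression yields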
\begin{align}
\frac{p_1}{p_2}=\frac{Ns\times\frac{N}{s}}{N^2}=1.\square
\end{align}
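
The result can also be checked exhaustively over a small range of family sizes. The sketch below is only a sanity check, not a substitute for the proof; the helper names `ratio` and `check_all` and the search limits `max_families=4`, `max_size=5` are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product


def ratio(sizes):
    """Exact p_1 / p_2 = S * sum(1/s_i) / N^2 for the given family sizes."""
    n = len(sizes)
    total = sum(sizes)
    return Fraction(total, n * n) * sum(Fraction(1, s) for s in sizes)


def check_all(max_families=4, max_size=5):
    """Return True if p_1 >= p_2 for every size tuple in the search range,
    with equality exactly when all families have the same size."""
    for n in range(1, max_families + 1):
        for sizes in product(range(1, max_size + 1), repeat=n):
            r = ratio(sizes)
            all_equal = len(set(sizes)) == 1
            if r < 1 or (r == 1) != all_equal:
                return False
    return True


print(check_all())  # True
```

Exact rational arithmetic matters here, since the equality case would be unreliable to detect with floats.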