# Statistical Data Management Session 8: Central Limit Theorem and Confidence Intervals (chapters 6-7 in McClave & Sincich)

## 1. Illustrating the Central Limit Theorem

1. Run the code below. It defines a uniform distribution and draws 1000 samples of size ``n = 1``. The means of these samples (for size 1, simply the sampled values themselves) are collected in the array ``means`` and plotted as a histogram. The histogram closely resembles the probability density of the uniform distribution, as expected.
2. Set ``n = 2`` and re-run. Now 1000 samples of size 2 are drawn and the sample means recalculated. The histogram already starts to resemble a normal curve: the central limit theorem in action.
3. Uncomment another distribution and re-run the code with ``n = 2``. Because these distributions are asymmetric (unlike the uniform distribution), a sample size of 2 is not sufficient to see the effect of the CLT. Increase the sample size to see the effect. Note that the histogram covers a narrower range as ``n`` grows, because the sample mean is approximately normally distributed with standard deviation $\sigma/\sqrt{n}$, where $\sigma$ is the population standard deviation.

In [None]:
import sqlite3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats as sts
%matplotlib inline

distr = sts.uniform(0,2)
# alternatives, don't worry about them, we won't use them further, just here for illustration purposes!
#distr = sts.expon()
#distr = sts.gibrat()  # spelled sts.gilbrat() in SciPy versions before 1.9
#distr = sts.triang(0.99)
print(distr.mean())

interval = np.linspace(-0.5,2.5,10000)
plt.plot(interval, distr.pdf(interval), linewidth=3)
plt.show()
plt.close()

n = 1

means = np.empty(1000)  # empty array to save sample means

for i in range(1000):
    vals = distr.rvs(size = n) # rvs = random variates => draw sample of given size at random from distribution
    means[i] = vals.mean() # save sample mean

plt.hist(means, bins='sqrt')
plt.show()
plt.close()
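
The $\sigma/\sqrt{n}$ shrinkage mentioned in step 3 can also be checked numerically. The sketch below is a minimal illustration, not part of the original exercise: the helper ``sample_mean_std`` and the fixed seed are assumptions introduced here. It compares the empirical standard deviation of 1000 sample means with the theoretical value for a few sample sizes.

```python
import numpy as np
from scipy import stats as sts

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
distr = sts.uniform(0, 2)       # same uniform distribution as in the cell above


def sample_mean_std(n, n_samples=1000):
    """Empirical standard deviation of n_samples sample means, each from a sample of size n."""
    samples = distr.rvs(size=(n_samples, n), random_state=rng)
    return samples.mean(axis=1).std(ddof=1)


for n in (1, 2, 10, 50):
    theoretical = distr.std() / np.sqrt(n)  # sigma / sqrt(n)
    print(f"n={n:3d}  empirical={sample_mean_std(n):.4f}  sigma/sqrt(n)={theoretical:.4f}")
```

The two columns should agree more and more closely in relative terms only if more samples are drawn; increasing ``n`` mainly shows both values shrinking together.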

## 2. Central Limit Theorem *(based on ex 6.62 from the book)*

**In order to fully grasp the CLT, we will abstract away from the concrete story given in the book.**

A certain population quantity has an unknown distribution with a mean of 5.1 and a standard deviation of 6.1. Consider a random sample of size $n=150$. Let $\bar{x}$ represent the sample mean.
1. Give the expected value and standard deviation of the sampling distribution of $\bar{x}$.

2. Will the sampling distribution of $\bar{x}$ be approximately normal? Explain.

3. Find $P(\bar{x} > 5.5)$.

4. Find $P(4< \bar{x} <5)$.

A second population has a distribution with a mean of 5.4 and a standard deviation of 0.5. Again, consider a random sample of size $n=150$.

5. Say you have a sample with sample mean 5.5. Is it more likely that this sample was drawn from the first or the second population?
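
One possible way to check hand calculations for this exercise is sketched below. It assumes the CLT approximation holds, so that $\bar{x}$ is treated as normal with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$. The variable names are chosen here for illustration, and comparing densities in question 5 is only one of several reasonable approaches (comparing z-scores is another).

```python
import numpy as np
from scipy import stats as sts

mu, sigma, n = 5.1, 6.1, 150            # values from the exercise
se = sigma / np.sqrt(n)                 # standard deviation of the sample mean
xbar = sts.norm(loc=mu, scale=se)       # CLT approximation of the sampling distribution

p_above = xbar.sf(5.5)                  # P(xbar > 5.5); sf = 1 - cdf
p_between = xbar.cdf(5) - xbar.cdf(4)   # P(4 < xbar < 5)
print(se, p_above, p_between)

# Question 5: density of a sample mean of 5.5 under each population
xbar2 = sts.norm(loc=5.4, scale=0.5 / np.sqrt(n))
print(xbar.pdf(5.5), xbar2.pdf(5.5))
```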

## 3. Confidence interval: proportion of smokers

In the 2018 Belgian Health Interview Survey (HIS) (https://www.sciensano.be/nl/projecten/gezondheidsenquete-0#levensstijl), 19.4% of the people interviewed answered "yes" when asked whether they smoke. In total, 10700 people were interviewed.

1. Construct a 95% confidence interval for the true proportion of smokers. Do this both using the z-table on Toledo and using Python. Hint: use the method ``ppf(...)`` on a standard normal distribution to find the z-value given a confidence level. ``ppf`` is the inverse of ``cdf``.

2. For prevention purposes, such a high accuracy is not strictly necessary. Suppose that for its next survey the government wants an estimate within 2% (0.02) of the true proportion. How many interviews are needed, using this year's estimate as the starting point?
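
The Python part of this exercise could look roughly like the sketch below. It uses the large-sample normal approximation for a proportion and makes two assumptions that the text does not state explicitly: that "2%" means a margin of error of 0.02 (2 percentage points), and that the new survey keeps the 95% confidence level.

```python
import numpy as np
from scipy import stats as sts

p_hat, n = 0.194, 10700                  # figures quoted in the exercise
level = 0.95
z = sts.norm.ppf(1 - (1 - level) / 2)    # two-sided critical value from the inverse cdf

se = np.sqrt(p_hat * (1 - p_hat) / n)    # estimated standard error of p_hat
ci = (p_hat - z * se, p_hat + z * se)
print(ci)

# Sample size for a margin of error of 0.02 (assumed: percentage points, 95% level),
# using this year's p_hat as the planning value
margin = 0.02
n_needed = int(np.ceil(z**2 * p_hat * (1 - p_hat) / margin**2))
print(n_needed)
```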

## 4. Confidence interval: birthweight

The file ``birthweights.csv`` in the ``shared`` folder contains the birthweight in kg of a sample of 42 new-born babies (https://www.sheffield.ac.uk/mash/statistics/datasets). Run the following cell of code and construct a 90% confidence interval for the true mean birthweight of new-born babies.

In [None]:
df_birthweights = pd.read_csv("../../shared/birthweights.csv")
print(df_birthweights["Birthweight"])
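
With 42 observations, a large-sample interval based on the normal distribution is one reasonable choice. The sketch below shows the mechanics only: the helper ``z_interval`` is introduced here for illustration, and the data are made-up numbers, not the contents of ``birthweights.csv``.

```python
import numpy as np
from scipy import stats as sts


def z_interval(values, level=0.90):
    """Large-sample confidence interval for the mean, using s as an estimate of sigma."""
    values = np.asarray(values, dtype=float)
    n = values.size
    se = values.std(ddof=1) / np.sqrt(n)       # ddof=1 gives the sample standard deviation
    z = sts.norm.ppf(1 - (1 - level) / 2)
    return values.mean() - z * se, values.mean() + z * se


# Made-up weights in kg, only to show the call
rng = np.random.default_rng(1)
demo = rng.normal(loc=3.3, scale=0.6, size=42)
print(z_interval(demo))
```

To apply it to the real data, pass ``df_birthweights["Birthweight"]`` instead of ``demo``.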

## 5. Confidence interval: birthweight smokers

The file ``birthweights_smokers.csv`` in the ``shared`` folder contains the birthweight in kg of a sample of 22 new-born babies from mothers who smoked during pregnancy (a subset of the sample from exercise 4).

1. Run the following cell of code and construct a 90% confidence interval for the true mean birthweight of new-born babies from smokers. You may assume that birthweights follow a normal distribution. Show with a QQ plot and/or histogram that this assumption is not unrealistic.

In [None]:
df_birthweights_smokers = pd.read_csv("../../shared/birthweights_smokers.csv")
print(df_birthweights_smokers["Birthweight"])

2. Repeat the analysis but this time, find the t-values with the aid of the tables provided on Toledo.

3. What would change if you knew the population standard deviation of birthweights?
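
For a sample of 22, a t-based interval and a normality check might be set up as in the sketch below. The helper ``t_interval`` and the data are illustrative assumptions (made-up numbers, not the contents of the CSV file). ``sts.probplot`` returns the QQ-plot coordinates together with a least-squares fit.

```python
import numpy as np
from scipy import stats as sts


def t_interval(values, level=0.90):
    """Small-sample confidence interval for the mean, assuming a normal population."""
    values = np.asarray(values, dtype=float)
    n = values.size
    se = values.std(ddof=1) / np.sqrt(n)
    t_crit = sts.t.ppf(1 - (1 - level) / 2, df=n - 1)   # n - 1 degrees of freedom
    return values.mean() - t_crit * se, values.mean() + t_crit * se


# Made-up weights in kg, only to show the calls
rng = np.random.default_rng(2)
demo = rng.normal(loc=3.1, scale=0.6, size=22)
print(t_interval(demo))

# QQ-plot data: ordered sample values against theoretical normal quantiles.
# Passing plot=plt (with matplotlib.pyplot imported as plt) additionally draws the plot.
(theoretical_q, ordered_vals), (slope, intercept, r) = sts.probplot(demo, dist="norm")
print(r)  # close to 1 when the points lie near a straight line
```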

## 6. SQL recap

The file ``birthweights.sql`` provided on Toledo contains the information used in exercises 4 and 5: a table listing baby IDs and their corresponding birthweight, and a table listing the smoker status of their mother (0 for non-smoking, 1 smoking). Import the file using MySQL Workbench and write the appropriate queries to retrieve the relevant information. Re-run your analysis (without running the cell which defined the dataframe!) to check whether you have the correct information.

In [None]:
conn = sqlite3.connect("../../shared/birthweights.db")
df_birthweights = pd.read_sql_query(" <your query here> ", conn)

query = """
<your query here>
"""
df_birthweights_smokers = pd.read_sql_query(query, conn)
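
As a reminder of the join syntax, here is a self-contained sketch on an in-memory SQLite database. The table names (``weights``, ``mothers``), the column names, and the three rows are hypothetical. Check the actual schema in ``birthweights.sql`` before adapting the query.

```python
import sqlite3
import pandas as pd

# Hypothetical schema and data: names and values are assumptions for illustration only
conn_demo = sqlite3.connect(":memory:")
conn_demo.executescript("""
CREATE TABLE weights (baby_id INTEGER PRIMARY KEY, Birthweight REAL);
CREATE TABLE mothers (baby_id INTEGER, smoker INTEGER);
INSERT INTO weights VALUES (1, 3.2), (2, 2.9), (3, 3.6);
INSERT INTO mothers VALUES (1, 0), (2, 1), (3, 1);
""")

# Birthweights of babies whose mother smoked (smoker = 1)
query_demo = """
SELECT w.baby_id, w.Birthweight
FROM weights AS w
JOIN mothers AS m ON m.baby_id = w.baby_id
WHERE m.smoker = 1
ORDER BY w.baby_id
"""
df_demo = pd.read_sql_query(query_demo, conn_demo)
print(df_demo)
conn_demo.close()
```

The names end in ``_demo`` so that running the sketch does not overwrite ``conn``, ``query`` or the dataframes from the exercise.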