---
title: "Topics in Econometrics and Data Science: Tutorial 5"

---

#### General Note

You will very likely find the solution to these exercises online. We, however, strongly encourage you to work on these exercises without doing so. Understanding someone else’s solution is very different from coming up with your own. Use the lecture notes and try to solve the exercises independently.

# Section 1: Basic Statistics

## Exercise 1: Normal distribution

### A)
A hard-drive manufacturer would like to ensure that the mean time between failures (MTBF) for its new hard drive is $1$ million hours. A stress test is designed that can simulate the workload at a much faster pace. The testers assume that a test lasting $10$ days correlates with the failure time exceeding the $1$-million-hour mark. In stress tests of $15$ hard drives they found an average of $9.5$ days, with a standard deviation of $1$ day. Does a $90\%$ confidence interval for the mean include $10$ days?

**Hint**: You can use [`scipy.stats.norm.ppf`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html) to get the percent point function (critical value) of the normal distribution for a given confidence level.
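As a minimal sketch of how `norm.ppf` yields a critical value, the snippet below builds a two-sided normal-based interval. The helper name `normal_ci` and the numbers passed to it are made up for illustration and are deliberately not the ones from the exercise.

```python
import numpy as np
from scipy.stats import norm

def normal_ci(mean, sd, n, level):
    """Two-sided normal-based confidence interval for a mean (hypothetical helper)."""
    alpha = 1 - level
    z = norm.ppf(1 - alpha / 2)          # two-sided critical value
    half_width = z * sd / np.sqrt(n)
    return mean - half_width, mean + half_width

# Made-up summary statistics, not the hard-drive data
lower, upper = normal_ci(mean=20.0, sd=2.0, n=25, level=0.95)
print(lower, upper)
```

Note that the two-sided interval uses the quantile at $1-\alpha/2$, not at $1-\alpha$.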

In [None]:
import scipy.stats as stats
import numpy as np

### B)
Let $X\sim \mathcal{N}(\mu,\sigma^2)$. We have seen that
$$\sqrt{n}\frac{\bar{X}_n-\mu}{\sigma}\sim \mathcal{N}(0,1)$$ 
and that by replacing the unknown parameter $\sigma^2$ by $\hat{S}^2=\frac{1}{n-1}\sum_{i=1}^n(X_i-\bar{X}_n)^2$ we obtain
$$Z_n:=\sqrt{n}\frac{\bar{X}_n-\mu}{\hat{S}} \xrightarrow[]{\mathcal{D}} Z\sim \mathcal{N}(0,1).$$
Therefore we can construct asymptotic confidence intervals for $\mu$. This means that the confidence interval satisfies the confidence level when $n$ is 'large'. But what happens when $n$ is 'small' and what do we mean by 'large'?
The distribution of $Z_n$ is called the $t$-distribution with $n-1$ degrees of freedom.
Compare the densities of the normal distribution and the $t$-distribution for $n=3,10,25$ and $50$. What value of $n$ seems 'large' enough to say that the two distributions are essentially the same? 

To answer these questions, first save $\mu$ and $\sigma$ and use [`np.linspace`](https://numpy.org/doc/stable/reference/generated/numpy.linspace.html) to generate evenly spaced x-values from -3 to 3, along which you will plot the distributions.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import t
from scipy.stats import norm

Next, use [`scipy.stats.norm.pdf`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html) and [`scipy.stats.t.pdf`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.t.html) to evaluate the densities of the normal distribution and of the $t$-distributions for the different $n$ at the x-values. Save them as separate y-variables, e.g. `y1` to `y5`.

Finally, plot the distributions in one figure using [`plt.figure`](https://matplotlib.org/stable/api/figure_api.html) and the associated plotting functions.
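As a numerical complement to the visual comparison, one possible check is the largest gap between the two densities on the grid. This is only a sketch: the helper `max_density_gap`, the grid resolution, and the degrees of freedom used are illustrative choices, not part of the exercise.

```python
import numpy as np
from scipy.stats import norm, t

x = np.linspace(-3, 3, 601)

def max_density_gap(df):
    """Largest absolute difference between the t(df) and N(0,1) densities on the grid."""
    return np.max(np.abs(t.pdf(x, df) - norm.pdf(x)))

# Degrees of freedom chosen for illustration only
for df in (5, 100):
    print(df, max_density_gap(df))
```

The gap shrinks as the degrees of freedom grow, which is what the plot should show as well.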

### C)
Our goal is to use a simulation to compare the resulting confidence intervals of Exercise 1B) for $\mu=0$ (choose $\sigma^2=5$) with confidence level $95\%$. We will go through it step by step:

1. Load the necessary modules and save $\mu$, $\sigma$, and the critical value of the normal distribution for our given $\alpha$ (where $\alpha$ is $1$ minus the confidence level).

In [None]:
import scipy.stats as stats
import numpy as np
import math

2. Calculate the critical value of the $t$-distribution for sample size $n=3$ (that is, $n-1=2$ degrees of freedom) and our given $\alpha$.
Use the [`stats.t.ppf`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.t.html) function.
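The following sketch shows the call pattern for both critical values side by side. The sample size `n_example` is a hypothetical value chosen so that the output differs from the one asked for in this step.

```python
from scipy.stats import norm, t

alpha = 0.05
n_example = 10                                    # hypothetical sample size
z_crit = norm.ppf(1 - alpha / 2)                  # normal critical value
t_crit = t.ppf(1 - alpha / 2, df=n_example - 1)   # t critical value, n - 1 degrees of freedom
print(z_crit, t_crit)                             # the t value is the larger of the two
```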

3. Set the random seed to 123 with [`np.random.seed`](https://numpy.org/doc/2.0/reference/random/generated/numpy.random.seed.html). Create one normally distributed sample with mean $\mu$, standard deviation $\sigma$, and sample size $n=3$ using [`np.random.normal`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.normal.html). Calculate the sample mean and standard deviation.\
**Hint**: Note that `np.var()` by default calculates $\frac{1}{n}\sum_{i=1}^n(X_i-\bar{X}_n)^2$, not $\hat{S}^2$; pass `ddof=1` to divide by $n-1$.
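The effect of the `ddof` argument can be seen on a small toy array. The data below are made up and are not a simulated sample from the exercise.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # toy data

var_biased = np.var(x)               # divides by n
var_unbiased = np.var(x, ddof=1)     # divides by n - 1
sd_unbiased = np.std(x, ddof=1)      # square root of the n - 1 version
print(var_biased, var_unbiased, sd_unbiased)
```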

In [None]:
np.random.seed(123)

4. Calculate confidence intervals based on the normal distribution and the t-distribution for $n=3$. Check whether the true parameter $\mu$ is contained in the confidence intervals.

5. Repeat the procedure for $1000$ samples. How often does the true mean $\mu$ lie within each confidence interval?

6. Finally, repeat the simulation for each sample size $n$ from Exercise 1B) ($n=3,10,25,50$): draw $1000$ independent samples and count how often the true parameter $\mu$ is contained in the confidence interval.\
**Hint**: Loop over $n$ (outer loop) and then over the samples $l=1,\dots,1000$ (inner loop).
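One way to organise such a coverage simulation is sketched below. It is a vectorised variant rather than the nested loop suggested in the hint, it uses `np.random.default_rng` instead of `np.random.seed`, and the function name `coverage` as well as all parameter values are hypothetical and differ from those in the exercise.

```python
import numpy as np
from scipy.stats import norm, t

def coverage(n, reps, mu, sigma, alpha, rng):
    """Share of simulated samples whose normal- and t-based intervals contain mu."""
    samples = rng.normal(mu, sigma, size=(reps, n))   # one sample per row
    means = samples.mean(axis=1)
    se = samples.std(axis=1, ddof=1) / np.sqrt(n)     # uses the n - 1 estimator
    z_crit = norm.ppf(1 - alpha / 2)
    t_crit = t.ppf(1 - alpha / 2, df=n - 1)
    hit_z = np.abs(means - mu) <= z_crit * se
    hit_t = np.abs(means - mu) <= t_crit * se
    return hit_z.mean(), hit_t.mean()

rng = np.random.default_rng(2024)   # arbitrary seed
cover_z, cover_t = coverage(n=5, reps=2000, mu=1.0, sigma=2.0, alpha=0.05, rng=rng)
print(cover_z, cover_t)
```

Because the $t$ critical value is larger than the normal one, every sample covered by the normal-based interval is also covered by the $t$-based interval.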

## Exercise 2: Hypothesis Testing (I)

Load the [`brain_size.csv`](https://alexandragibbon.github.io/StatProg-HHU/data/brain_size.csv) data set from last week and repeat the data cleaning steps (remove the column `'Unnamed: 0'` and all missing values).

In [None]:
import os
os.chdir("[INSERT YOUR PATH HERE!]")
import pandas as pd

brain_size = pd.read_csv('data/brain_size.csv', sep=';', na_values=".")
brain_size.drop(['Unnamed: 0'], axis = 1, inplace = True)

print(brain_size.head())
print(f"Dimensions of dataset: {brain_size.shape}")

First, we compute some more descriptive statistics.

### 1)
Plot the weight as a histogram using [`plt.hist`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html).

In [None]:
import matplotlib.pyplot as plt

Assume that the VIQ, weight and height are normally distributed.

### 2) 
What can you say about the hypothesis that the mean of the VIQ is 100? Perform a one-sample t-test. 

**Hint**: Use [`stats.ttest_1samp`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_1samp.html).
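The call pattern of `stats.ttest_1samp` is sketched here on toy data that have nothing to do with the brain-size data set. The result object exposes the test statistic and the two-sided $p$-value.

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])       # toy data with sample mean 3
result = stats.ttest_1samp(x, popmean=3.0)    # H0: the population mean equals 3
print(result.statistic, result.pvalue)        # sample mean equals popmean, so the statistic is 0
```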

In [None]:
import scipy.stats as stats
import numpy as np

### 3)
Create two variables containing VIQ only for men and for women, respectively. Can you reject the hypothesis that the mean of the VIQ is the same for women and men? Test the same hypothesis for weight and height.

**Hint**: Perform the t-test for the means of two independent samples using [`ttest_ind`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html#scipy.stats.ttest_ind). Familiarise yourself with the function using `help(stats.ttest_ind)`.
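A minimal sketch of the two-sample call on made-up data follows. By default `ttest_ind` assumes equal variances in both groups; setting `equal_var=False` gives Welch's t-test instead. Which variant is appropriate for the exercise is left open here.

```python
import numpy as np
from scipy import stats

group_a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # toy data
group_b = np.array([2.0, 3.0, 4.0, 5.0, 6.0])   # toy data, shifted by 1

pooled = stats.ttest_ind(group_a, group_b)                   # assumes equal variances
welch = stats.ttest_ind(group_a, group_b, equal_var=False)   # Welch's t-test
print(pooled.statistic, pooled.pvalue)
print(welch.statistic, welch.pvalue)
```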

In [None]:
help(stats.ttest_ind)

### 4) 
Visually inspect the VIQ data. Is the assumption of the normal distribution of the VIQ justified? 

**Hint**: Use [`stats.shapiro`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.shapiro.html) to perform the Shapiro-Wilk test for normality. Here, the null hypothesis is that the population is normally distributed.
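To get a feeling for how the Shapiro-Wilk test reacts, the sketch below applies it to two synthetic samples, one drawn from a normal and one from an exponential distribution. The seed, sample sizes, and distribution parameters are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)                             # arbitrary seed
normal_sample = rng.normal(loc=100, scale=15, size=200)    # synthetic, normal
skewed_sample = rng.exponential(scale=15, size=200)        # synthetic, clearly non-normal

stat_norm, p_norm = stats.shapiro(normal_sample)
stat_skew, p_skew = stats.shapiro(skewed_sample)
print(p_norm, p_skew)   # a small p-value speaks against normality
```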

### 5) 
Is the result for VIQ in 2) still valid if we drop the assumption of normality? Which null hypothesis are we actually testing?

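**Hint**: Use [`stats.wilcoxon`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wilcoxon.html) to calculate the Wilcoxon signed-rank test. In its paired form this test needs two samples of equal size; for a single sample, pass the differences between the observations and the hypothesised value.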
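A sketch of the one-sample use of `stats.wilcoxon` on toy data: the differences from a hypothesised centre `m0` are passed as the single argument. Both `m0` and the scores are made up for illustration.

```python
import numpy as np
from scipy import stats

m0 = 100.0                            # hypothesised centre (illustrative)
scores = m0 + np.arange(1.0, 11.0)    # toy data: 101, 102, ..., 110
res = stats.wilcoxon(scores - m0)     # one-sample use: pass the differences
print(res.statistic, res.pvalue)
```

All differences are positive here, so the test statistic is 0 and the $p$-value is small.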

### 6)
Can we reject the assumption that males and females have the same distribution of VIQ? Compare two histograms to interpret your result!\
**Hint**: Perform the two-sample Kolmogorov-Smirnov test for goodness of fit [`stats.ks_2samp`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ks_2samp.html).
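The call pattern of `stats.ks_2samp` is sketched below on two synthetic samples whose distributions differ by a location shift. The seed and all parameters are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)         # arbitrary seed
a = rng.normal(0.0, 1.0, size=100)     # synthetic sample
b = rng.normal(2.0, 1.0, size=100)     # synthetic sample, shifted by 2

res = stats.ks_2samp(a, b)             # H0: both samples come from the same distribution
print(res.statistic, res.pvalue)
```

The statistic is the largest vertical distance between the two empirical distribution functions, so it always lies between 0 and 1.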

## Exercise 3: Hypothesis Testing (II)
Now we want to test whether the proportion (probability of success) in a group equals a certain given value.

1. Use `help(stats.binomtest)` to make yourself familiar with the [`binomtest`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.binomtest.html) function.

In [None]:
import scipy.stats as stats
help(stats.binomtest)

2. A poll taken in $2003$ among $200$ Europeans found that only $16\%$ favored the policies of the United States. Perform a test of significance to see whether this is significantly different from the $50\%$ proportion of Americans in favor of these policies.

3. A new drug therapy is tested. Of $50$ patients in the study, $40$ had no recurrence of their illness after $18$ months. With no drug therapy, the expected percentage of no recurrence would have been $75\%$. Does the data support the hypothesis that this percentage has increased? What is the $p$-value? Why is it useful to consider a one-sided test?
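The difference between a two-sided and a one-sided call of `stats.binomtest` can be sketched with made-up counts. Note that the function expects the number of successes `k`, so a reported percentage first has to be converted into a count.

```python
from scipy import stats

# Made-up example: 7 successes in 10 trials, H0: p = 0.5
res_two = stats.binomtest(k=7, n=10, p=0.5)                             # two-sided (default)
res_greater = stats.binomtest(k=7, n=10, p=0.5, alternative="greater")  # H1: p > 0.5
print(res_two.pvalue, res_greater.pvalue)
```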

## Exercise 4: Bootstrap
Load the [`lightbulb_lifetime.csv`](https://alexandragibbon.github.io/StatProg-HHU/data/lightbulb_lifetime.csv) data, which contains a (simulated) sample of lifespans of lightbulbs measured in hours. 

In [None]:
import pandas as pd
import numpy as np
import math
import scipy.stats as stats
import os
os.chdir("[INSERT YOUR PATH HERE!]")

lightbulb_lifetime = pd.read_csv('data/lightbulb_lifetime.csv', sep=';', na_values=".")
lightbulb_lifetime.head()

1. Estimate the median lifetime of a lightbulb.\
**Hint**: You can use `.values` to convert pandas dataframes into arrays and `.flatten()` to create vectors from arrays.

2. The sample median is an asymptotically normally distributed estimator of the median. Estimate the variance of the sample median with the bootstrap.  
**Hint**: Use `np.random.choice()`, additionally `np.empty()` could be helpful.
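The resampling idea can be sketched on synthetic data as follows. This is an illustration under assumptions: the data are simulated rather than loaded from the CSV file, the number of bootstrap replications `n_boot` is arbitrary, and `rng.choice` from `np.random.default_rng` is used in place of `np.random.choice`.

```python
import numpy as np

rng = np.random.default_rng(0)                  # arbitrary seed
data = rng.exponential(scale=50.0, size=200)    # synthetic stand-in for the lifetimes

n_boot = 1000                                   # number of bootstrap replications (arbitrary)
boot_medians = np.empty(n_boot)
for b in range(n_boot):
    # Resample with replacement, same size as the original sample
    resample = rng.choice(data, size=data.size, replace=True)
    boot_medians[b] = np.median(resample)

boot_var = boot_medians.var(ddof=1)             # bootstrap estimate of the variance of the median
print(np.median(data), boot_var)
```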

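3. Show with a simulation that a confidence interval built from your bootstrap variance estimate asymptotically attains the confidence level. Use standard normally distributed and exponentially distributed (with parameter 0.01) random variables in your simulation.  
**Hint**: Use `np.random.normal()` and `np.random.exponential()`. Note that `np.random.exponential()` is parameterised by the scale $1/\lambda$ rather than the rate $\lambda$.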
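A possible layout for such a coverage simulation is sketched below for the standard normal case only. The helper `bootstrap_se_median`, the sample size, the number of simulation runs, and the number of bootstrap replications are all hypothetical choices, and the interval is a plain normal-approximation interval around the sample median.

```python
import numpy as np
from scipy.stats import norm

def bootstrap_se_median(sample, n_boot, rng):
    """Bootstrap standard error of the sample median (hypothetical helper)."""
    idx = rng.integers(0, sample.size, size=(n_boot, sample.size))   # resampled indices
    return np.median(sample[idx], axis=1).std(ddof=1)

rng = np.random.default_rng(7)                   # arbitrary seed
n, n_sim, n_boot, level = 100, 300, 200, 0.95    # hypothetical settings
z = norm.ppf(1 - (1 - level) / 2)
true_median = 0.0                                # median of N(0, 1)

hits = 0
for _ in range(n_sim):
    sample = rng.normal(size=n)
    med = np.median(sample)
    se = bootstrap_se_median(sample, n_boot, rng)
    hits += (med - z * se <= true_median <= med + z * se)

coverage = hits / n_sim
print(coverage)
```

For the exponential case the true median changes: an exponential distribution with rate $\lambda$ has median $\ln(2)/\lambda$.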