
# Lab 4: _SciPy_, statistical inference and hypothesis testing

In this lab, we explore statistical inference and hypothesis testing techniques with _SciPy_. Before you start, make sure the following file is in the same folder as this notebook:
"basketball_heights.csv"


If you have not already done so, I would recommend going through the
1. [_NumPy_ tutorial](https://numpy.org/numpy-tutorials/)
2. [_matplotlib_ tutorial](https://matplotlib.org/stable/tutorials/index.html)
3. [_seaborn_ tutorial](https://seaborn.pydata.org/tutorial.html)
4. [_pandas_ tutorial](https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html)

If there is something you are not familiar with, remember to consult the
1. [_NumPy_ documentation](https://numpy.org/doc/stable/reference/index.html#reference)
2. [_matplotlib_ documentation](https://matplotlib.org/stable/api/index)
3. [_seaborn_ documentation](https://seaborn.pydata.org/api.html)
4. [_pandas_ documentation](https://pandas.pydata.org/docs/reference/index.html)
5. [_SciPy_ documentation](https://docs.scipy.org/doc/scipy/reference/index.html#scipy-api)

- In the lecture, I focused on the key concepts in statistical inference and hypothesis testing. These are the important parts, and they are often under-appreciated and hard to understand, Google, or ask ChatGPT about, which leads to bad data interpretation.
- The other side is actually carrying out the hypothesis tests and inference. We haven't explicitly covered in lectures how to do everything the questions ask. No problem! You have the internet at your fingertips.
- The skill of going from concept to implementation without a helping hand is an important one for a programmer or data scientist. Remember, there are hundreds of hypothesis tests, and you might have to use any one of them in real life.

In [None]:
import pandas as pd
import numpy as np
from scipy import stats

## Exercise 1 - testing mean with known variance

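basketball_heights.csv contains three columns, one for each of three regions. We want to see whether their means are different. The population distributions, written as $N(\mu, \sigma)$ with the second parameter being the true standard deviation used in Exercise 1c, are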

\begin{align*}
X_1 &\sim N(1.80, 0.10) \\
X_2 &\sim N(1.75, 0.05) \\
X_3 &\sim N(1.90, 0.10)
\end{align*}

### Exercise 1a) Read in basketball_heights.csv

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

file_path = '/content/drive/My Drive/Colab Notebooks/basketball_heights.csv'
df = pd.read_csv(file_path)
print(df)

Mounted at /content/drive
    region_1  region_2  region_3
0   1.834604  1.817755  1.833082
1   1.809238  1.713023  1.993077
2   1.643073  1.786917  1.910538
3   1.821316  1.802915  1.891940
4   1.819650  1.707106  1.758786
5   1.870947  1.723106  2.025066
6   1.675446  1.663705  1.868202
7   1.915565  1.769477  2.022259
8   1.879801  1.734851  1.646696
9   1.825850  1.763273  1.989867
10       NaN  1.738801  1.988110
11       NaN  1.732850  1.942570
12       NaN  1.850859  1.951662
13       NaN  1.706773  1.971812
14       NaN  1.702660  2.075395
15       NaN  1.639003       NaN
16       NaN  1.790037       NaN
17       NaN  1.711083       NaN
18       NaN  1.833889       NaN
19       NaN  1.656471       NaN


### Exercise 1b) Calculate the sample mean and sample size for each region

In [None]:
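mean = df.mean()
# count() ignores NaN, so it gives the number of observations in each region
# (df.sample() would instead return one randomly chosen row)
sampleSize = df.count()
print(mean)
print(sampleSize)

region_1    1.809549
region_2    1.742228
region_3    1.924604
dtype: float64
region_1    10
region_2    20
region_3    15
dtype: int64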


### Exercise 1c) Save the sample mean, sample size, true mean and true standard deviation into a dictionary

In [None]:
region_stats = {
    'region1': {
        'sample_mean': df['region_1'].mean(),
        'sample_size': len(df['region_1'].dropna()),
        'true_mean': 1.80,
        'true_std': 0.10
    },
    'region2': {
        'sample_mean': df['region_2'].mean(),
        'sample_size': len(df['region_2'].dropna()),
        'true_mean': 1.75,
        'true_std': 0.05
    },
    'region3': {
        'sample_mean': df['region_3'].mean(),
        'sample_size': len(df['region_3'].dropna()),
        'true_mean': 1.90,
        'true_std': 0.10
    }
}

print(region_stats['region1'])
print(region_stats['region2'])
print(region_stats['region3'])

{'sample_mean': 1.8095492091389798, 'sample_size': 10, 'true_mean': 1.8, 'true_std': 0.1}
{'sample_mean': 1.7422276451800596, 'sample_size': 20, 'true_mean': 1.75, 'true_std': 0.05}
{'sample_mean': 1.92460411015202, 'sample_size': 15, 'true_mean': 1.9, 'true_std': 0.1}


### Exercise 1d) Calculate the standard error for each region and save it into the dictionary

In [None]:
# The population standard deviation is known, so the standard error of the
# mean is true_std / sqrt(n), where n counts only the non-NaN observations.
for region in region_stats.values():
    region['standard_error'] = region['true_std'] / np.sqrt(region['sample_size'])

for name, region in region_stats.items():
    print(name, round(region['standard_error'], 4))

region1 0.0316
region2 0.0112
region3 0.0258


### Exercise 1e) Two-tail z-test on the mean

I want to conduct a two-tailed z-test on region 1, checking whether the sample mean is significantly different from the region's population mean at the 1% significance level. Fill in the correct alternative hypothesis in the following mathematical statement:

\begin{align*}
\mathbf{H}_0 \quad &: \quad \mu = 1.8 \\
\mathbf{H}_1 \quad &: \quad \mu \neq 1.8
\end{align*}

_You'll need to double click on this cell and edit the latex using one of the following latex commands:_
1. \lt
2. \gt
3. \neq

Now conduct the test

In [None]:
mu_0 = region_stats['region1']['true_mean']
sample_mean = region_stats['region1']['sample_mean']
# Known population standard deviation (this is what makes it a z-test)
pop_std_dev = region_stats['region1']['true_std']
sample_size = region_stats['region1']['sample_size']

z_statistic = (sample_mean - mu_0) / (pop_std_dev / np.sqrt(sample_size))

# Two-tailed p-value: probability of a |z| at least this large under H0
p_value = 2 * (1 - stats.norm.cdf(abs(z_statistic)))

# 1% significance level
alpha = 0.01

# Output results
print(f"Z-statistic: {z_statistic}")
print(f"P-value: {p_value}")
if p_value < alpha:
    print("Reject the null hypothesis (H₀)")
else:
    print("Fail to reject the null hypothesis (H₀)")

Z-statistic: 0.3019725073247129
P-value: 0.7626730211598407
Fail to reject the null hypothesis (H₀)
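
As a check on the numbers above, the same two-tailed test can be written out by hand. This is a worked sketch using the region 1 values from Exercise 1c (sample mean rounded to 1.80955), where $\Phi$ denotes the standard normal CDF, i.e. what `stats.norm.cdf` computes:

\begin{align*}
z &= \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} = \frac{1.80955 - 1.80}{0.10 / \sqrt{10}} \approx 0.302 \\
p &= 2 \, \bigl(1 - \Phi(|z|)\bigr) \approx 0.763
\end{align*}

Since $p \approx 0.763$ is far above $\alpha = 0.01$, we fail to reject $\mathbf{H}_0$.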


### Exercise 1f) One-tail z-test on the mean

I want to conduct a one-tailed z-test on region 2, checking whether the sample mean is significantly smaller than the region's population mean at the 1% significance level. Fill in the correct alternative hypothesis in the following mathematical statement:

\begin{align*}
\mathbf{H}_0 \quad &: \quad \mu = 1.75 \\
\mathbf{H}_1 \quad &: \quad \mu \lt 1.75
\end{align*}

_You'll need to double click on this cell and edit the latex using one of the following latex commands:_
1. \lt
2. \gt
3. \neq

Now conduct the test
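
One possible way to run this test is sketched below. It assumes the alternative hypothesis is $\mu \lt 1.75$ (a lower-tail test) and, to stay self-contained, hard-codes the region 2 summary values printed in Exercise 1c rather than reading them from `region_stats`.

```python
import numpy as np
from scipy import stats

# Summary values copied from the printed output of Exercise 1c (region 2)
sample_mean = 1.7422276451800596
sample_size = 20
mu_0 = 1.75    # hypothesised population mean
sigma = 0.05   # known population standard deviation
alpha = 0.01   # 1% significance level

z_statistic = (sample_mean - mu_0) / (sigma / np.sqrt(sample_size))

# Lower-tail p-value: P(Z <= z) under H0
p_value = stats.norm.cdf(z_statistic)

print(f"Z-statistic: {z_statistic}")
print(f"P-value: {p_value}")
if p_value < alpha:
    print("Reject the null hypothesis (H₀)")
else:
    print("Fail to reject the null hypothesis (H₀)")
```

The only change from the two-tailed test in 1e is the p-value: only the lower tail counts as evidence against the null, so the CDF is used directly without doubling.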

### Exercise 1g) One-tail z-test on the mean

I want to conduct a one-tailed z-test on region 3, checking whether the sample mean is significantly larger than the region's population mean at the 1% significance level. Fill in the correct alternative hypothesis in the following mathematical statement:

\begin{align*}
\mathbf{H}_0 \quad &: \quad \mu = 1.9 \\
\mathbf{H}_1 \quad &: \quad \mu \gt 1.9
\end{align*}

_You'll need to double click on this cell and edit the latex using one of the following latex commands:_
1. \lt
2. \gt
3. \neq

Now conduct the test
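
A minimal sketch of the upper-tail version follows, assuming the alternative hypothesis is $\mu \gt 1.9$. The region 3 summary values are hard-coded from the Exercise 1c output so the cell runs on its own. It uses `stats.norm.sf` (the survival function, $1 - \Phi$), which is numerically more stable than `1 - stats.norm.cdf(...)` for large z.

```python
import numpy as np
from scipy import stats

# Summary values copied from the printed output of Exercise 1c (region 3)
sample_mean = 1.92460411015202
sample_size = 15
mu_0 = 1.90    # hypothesised population mean
sigma = 0.10   # known population standard deviation
alpha = 0.01   # 1% significance level

z_statistic = (sample_mean - mu_0) / (sigma / np.sqrt(sample_size))

# Upper-tail p-value: P(Z >= z) under H0
p_value = stats.norm.sf(z_statistic)

print(f"Z-statistic: {z_statistic}")
print(f"P-value: {p_value}")
if p_value < alpha:
    print("Reject the null hypothesis (H₀)")
else:
    print("Fail to reject the null hypothesis (H₀)")
```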

### Exercise 1h) Two-tail z-test on the difference of means

I want to conduct a two-tailed z-test on whether the difference between the means of region 1 and region 2 is different from 0, using a significance level of 0.1%. Fill in the correct alternative hypothesis in the following mathematical statement:

\begin{align*}
\mathbf{H}_0 \quad &: \quad \mu_1 - \mu_2 = 0 \\
\mathbf{H}_1 \quad &: \quad \mu_1 - \mu_2 \neq 0
\end{align*}

_You'll need to double click on this cell and edit the latex using one of the following latex commands:_
1. \lt
2. \gt
3. \neq

Now conduct the test.
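
A sketch of a two-sample z-test is given below, assuming the two samples are independent and the alternative is $\mu_1 - \mu_2 \neq 0$. With known population standard deviations, the standard error of the difference of sample means is $\sqrt{\sigma_1^2 / n_1 + \sigma_2^2 / n_2}$. The summary values are hard-coded from the Exercise 1c output.

```python
import numpy as np
from scipy import stats

# Summary values copied from the printed output of Exercise 1c
mean_1, sigma_1, n_1 = 1.8095492091389798, 0.10, 10   # region 1
mean_2, sigma_2, n_2 = 1.7422276451800596, 0.05, 20   # region 2
alpha = 0.001  # 0.1% significance level

# Standard error of the difference of two independent sample means
se_diff = np.sqrt(sigma_1**2 / n_1 + sigma_2**2 / n_2)

# Under H0 the true difference is 0
z_statistic = (mean_1 - mean_2 - 0) / se_diff

# Two-tailed p-value
p_value = 2 * stats.norm.sf(abs(z_statistic))

print(f"Z-statistic: {z_statistic}")
print(f"P-value: {p_value}")
if p_value < alpha:
    print("Reject the null hypothesis (H₀)")
else:
    print("Fail to reject the null hypothesis (H₀)")
```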

### Exercise 1i) One-tail z-test on the difference of means

I want to conduct a one-tailed z-test on whether the mean of region 2 minus the mean of region 3 is smaller than 0, using a significance level of 0.1%. Fill in the correct alternative hypothesis in the following mathematical statement:

\begin{align*}
\mathbf{H}_0 \quad &: \quad \mu_2 - \mu_3 = 0 \\
\mathbf{H}_1 \quad &: \quad \mu_2 - \mu_3 \lt 0
\end{align*}

_You'll need to double click on this cell and edit the latex using one of the following latex commands:_
1. \lt
2. \gt
3. \neq

Now conduct the test.
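
The lower-tail version for regions 2 and 3 might look like the following sketch. It assumes independent samples and the alternative $\mu_2 - \mu_3 \lt 0$, with summary values hard-coded from the Exercise 1c output.

```python
import numpy as np
from scipy import stats

# Summary values copied from the printed output of Exercise 1c
mean_2, sigma_2, n_2 = 1.7422276451800596, 0.05, 20   # region 2
mean_3, sigma_3, n_3 = 1.92460411015202, 0.10, 15     # region 3
alpha = 0.001  # 0.1% significance level

# Standard error of the difference of two independent sample means
se_diff = np.sqrt(sigma_2**2 / n_2 + sigma_3**2 / n_3)
z_statistic = (mean_2 - mean_3) / se_diff

# Lower-tail p-value: P(Z <= z) under H0
p_value = stats.norm.cdf(z_statistic)

print(f"Z-statistic: {z_statistic}")
print(f"P-value: {p_value}")
if p_value < alpha:
    print("Reject the null hypothesis (H₀)")
else:
    print("Fail to reject the null hypothesis (H₀)")
```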

### Exercise 1j) One-tail z-test on the difference of means

I want to conduct a one-tailed z-test on whether the mean of region 3 minus the mean of region 1 is larger than 0, using a significance level of 0.1%. Fill in the correct alternative hypothesis in the following mathematical statement:

\begin{align*}
\mathbf{H}_0 \quad &: \quad \mu_3 - \mu_1 = 0 \\
\mathbf{H}_1 \quad &: \quad \mu_3 - \mu_1 \gt 0
\end{align*}

_You'll need to double click on this cell and edit the latex using one of the following latex commands:_
1. \lt
2. \gt
3. \neq

Now conduct the test.
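
For the upper-tail comparison of regions 3 and 1, a sketch under the same assumptions (independent samples, alternative $\mu_3 - \mu_1 \gt 0$, summary values hard-coded from the Exercise 1c output) could be:

```python
import numpy as np
from scipy import stats

# Summary values copied from the printed output of Exercise 1c
mean_3, sigma_3, n_3 = 1.92460411015202, 0.10, 15     # region 3
mean_1, sigma_1, n_1 = 1.8095492091389798, 0.10, 10   # region 1
alpha = 0.001  # 0.1% significance level

# Standard error of the difference of two independent sample means
se_diff = np.sqrt(sigma_3**2 / n_3 + sigma_1**2 / n_1)
z_statistic = (mean_3 - mean_1) / se_diff

# Upper-tail p-value: P(Z >= z) under H0
p_value = stats.norm.sf(z_statistic)

print(f"Z-statistic: {z_statistic}")
print(f"P-value: {p_value}")
if p_value < alpha:
    print("Reject the null hypothesis (H₀)")
else:
    print("Fail to reject the null hypothesis (H₀)")
```

Note how strict the 0.1% level is here: a z-statistic of about 2.8 would be significant at the 1% level but not at 0.1%.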

### Exercise 1k) Bootstrap confidence intervals

Build a 95% confidence interval for the mean height of basketball players from region 2, using the bootstrap method. The following functions might help you:
- [sampling with replacement](https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html)
- [removing NaNs](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html)
- [finding percentiles of a vector](https://numpy.org/doc/2.0/reference/generated/numpy.percentile.html)

Comment on the difference between the upper and lower bounds of the bootstrap confidence interval, and the 2.5 / 97.5 percentiles of the raw data for region 2. Why are they so different?
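
A minimal sketch of a percentile bootstrap follows, assuming the interval is meant for the mean height. To keep the cell self-contained, the region 2 values are copied from the DataFrame printout in Exercise 1a (rounded to 6 decimals there); in the notebook itself they would come from `df['region_2'].dropna()`. The number of resamples and the seed are arbitrary choices, and `Generator.choice` is the newer NumPy equivalent of the `numpy.random.choice` function linked above.

```python
import numpy as np

# Region 2 heights copied from the printout in Exercise 1a (no NaNs to drop)
region_2 = np.array([
    1.817755, 1.713023, 1.786917, 1.802915, 1.707106,
    1.723106, 1.663705, 1.769477, 1.734851, 1.763273,
    1.738801, 1.732850, 1.850859, 1.706773, 1.702660,
    1.639003, 1.790037, 1.711083, 1.833889, 1.656471,
])

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility
n_boot = 10_000                 # number of bootstrap resamples (an assumption)

# Each row is one resample of the same size as the data, drawn with replacement
resamples = rng.choice(region_2, size=(n_boot, region_2.size), replace=True)
boot_means = resamples.mean(axis=1)

# Percentile bootstrap: 95% interval for the mean
ci_lower, ci_upper = np.percentile(boot_means, [2.5, 97.5])

# Percentiles of the raw heights, for comparison
raw_lower, raw_upper = np.percentile(region_2, [2.5, 97.5])

print(f"Bootstrap 95% CI for the mean: ({ci_lower:.3f}, {ci_upper:.3f})")
print(f"Raw 2.5 / 97.5 percentiles:    ({raw_lower:.3f}, {raw_upper:.3f})")
```

As a hint for the comparison: the bootstrap interval describes the uncertainty in the mean, which shrinks roughly like $1/\sqrt{n}$, whereas the raw percentiles describe the spread of individual heights.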