## Q

Import `numpy`, `pandas`, `pingouin`, `seaborn`, and the `stats` module from `scipy`.

## A

In [None]:
import numpy as np
import pandas as pd
from scipy import stats
import seaborn as sns
import pingouin as pg

# Descriptive statistics

## Q

Load the `mi.csv` data file located in the `../data` directory into a DataFrame (the first column in the file is an index column), and inspect its content by printing:
* the first rows, with column names,
* a summary table of all the variables,
* a summary table of the categorical variables only.

## A

In [None]:
df = pd.read_csv('../data/mi.csv', index_col=0)

In [None]:
df.head()

In [None]:
df.describe().T

In [None]:
df.describe(exclude='number').T

## Q

Inspect the relationship between variables `Age` and `OwnsHouse`. Which type of plot is most suitable?

## A

In [None]:
# categorical vs continuous => boxplot, violinplot
sns.boxplot(x='OwnsHouse', y='Age', data=df);

In [None]:
sns.boxplot(x='OwnsHouse', y='Age', data=df);
sns.swarmplot(x='OwnsHouse', y='Age', hue='OwnsHouse', data=df, linewidth=1, size=3, legend=False);

The "Yes" and "No" levels might have been interchanged.

## Q

Draw a box plot (or violin plot) of variable `Age` for two categorical variables, say `OwnsHouse` and `LivesWithKids`.

## A

In [None]:
ax = sns.boxplot(x='OwnsHouse', y='Age', data=df, hue="LivesWithKids")
#sns.move_legend(ax, bbox_to_anchor=(1.02, 1), loc='upper left', borderaxespad=0);

## Q

Isolate the house-owners group from the others, and report their mean age(s) as $99\%$ confidence interval(s).

## A

In [None]:
owns_house = df.groupby('OwnsHouse').groups
house_owners_age = df.loc[owns_house['Yes'], 'Age']
others_age = df.loc[owns_house['No'], 'Age']

In [None]:
mean = np.mean(house_owners_age)
sem = stats.sem(house_owners_age)
distribution_of_the_mean = stats.norm(mean, sem)

In [None]:
distribution_of_the_mean.interval(.99)

### Bonus

Alternative calculation, for both groups:

First we need to evaluate the inverse survival function of the standard normal distribution at $0.005$.

In [None]:
alpha = 0.01
z = stats.norm().isf(alpha / 2)

In [None]:
for group_name, group_age in (
    ('House owners', house_owners_age),
    ('Others', others_age),
):
    m = np.mean(group_age)
    z_times_sem = z * stats.sem(group_age)
    print(f'{group_name}: {m:.2f} ± {z_times_sem:.2f} years old on average')
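
Both calculations should give the same interval, since `interval(.99)` of a normal distribution centred on the mean, with the SEM as scale, is the mean ± z × SEM. The sketch below checks this on synthetic data; the `ages` array and the seed are made up for illustration and are not taken from `mi.csv`. It also shows the Student's *t* variant, which is a common alternative for small samples and not the method used above.

```python
import numpy as np
from scipy import stats

# synthetic sample standing in for one group's ages (assumption, not the real data)
rng = np.random.default_rng(0)
ages = rng.normal(loc=45, scale=12, size=200)

mean = np.mean(ages)
sem = stats.sem(ages)

# method 1: central 99% interval of the normal distribution of the mean
low1, high1 = stats.norm(mean, sem).interval(0.99)

# method 2: mean ± z * SEM, with z the upper 0.005 quantile of the standard normal
z = stats.norm().isf(0.01 / 2)
low2, high2 = mean - z * sem, mean + z * sem

print(np.allclose([low1, high1], [low2, high2]))

# alternative: Student's t distribution, which yields a slightly wider interval
low_t, high_t = stats.t(df=len(ages) - 1, loc=mean, scale=sem).interval(0.99)
print(f'normal: {low1:.2f} to {high1:.2f}; t: {low_t:.2f} to {high_t:.2f}')
```

With a couple of hundred observations the two intervals are nearly identical; the difference matters mostly for small groups.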

# Tests on single variables

Let us consider the logarithm of variable `HeartRate`.

In [None]:
df['logHeartRate'] = np.log(df['HeartRate'])
sns.histplot(df, x='logHeartRate', hue='Sex');

## Q

Draw Q-Q plots for variables `HeartRate` and `logHeartRate`.

## A

In [None]:
pg.qqplot(df['HeartRate']);

In [None]:
pg.qqplot(df['logHeartRate']);

## Q

Perform an omnibus normality test (`normaltest`) on the `logHeartRate` variable for the different levels of variable `Sex`.

## A

In [None]:
pg.normality(df, dv='logHeartRate', group='Sex', method='normaltest')

### Bonus

Normality test with SciPy (more convenient if your data are a NumPy array):

In [None]:
sex = df.groupby('Sex').groups
logHeartRate_female = df.loc[sex['Female'], 'logHeartRate']
stats.normaltest(logHeartRate_female)
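
To see what the omnibus test combines, the following sketch rebuilds the `normaltest` statistic from SciPy's separate skewness and kurtosis tests. The sample is synthetic (a made-up log-normal variable), so the numbers say nothing about `HeartRate` itself.

```python
import numpy as np
from scipy import stats

# made-up right-skewed variable; its logarithm is normal by construction
rng = np.random.default_rng(1)
sample = rng.lognormal(mean=4.2, sigma=0.15, size=300)
log_sample = np.log(sample)

# omnibus test (D'Agostino and Pearson)
k2, pvalue = stats.normaltest(log_sample)

# the two component tests: one z-score for skewness, one for kurtosis
z_skew, _ = stats.skewtest(log_sample)
z_kurt, _ = stats.kurtosistest(log_sample)

# the omnibus statistic is the sum of the squared z-scores...
print(np.isclose(k2, z_skew**2 + z_kurt**2))
# ...and is compared to a chi-squared distribution with 2 degrees of freedom
print(np.isclose(pvalue, stats.chi2(2).sf(k2)))
```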

## Q

Perform a Welch *t*-test on `logHeartRate` between males and females.

## A

In [None]:
# define your significance level first!
significance_level = 0.05

# define the two groups whose means are to be compared
logHeartRate_female = df.loc[sex['Female'], 'logHeartRate']
logHeartRate_male = df.loc[sex['Male'], 'logHeartRate']

# test whether the group means are equal or differ
pg.ttest(logHeartRate_female, logHeartRate_male, confidence=1-significance_level)

Note: T>0 implies the first group's mean is greater than the second group's mean.

With SciPy, usage is very similar, although group variances are assumed equal by default. The Welch *t*-test can be selected with `equal_var=False`:

In [None]:
test_result = stats.ttest_ind(logHeartRate_female, logHeartRate_male, equal_var=False)
test_result

In [None]:
test_result.pvalue <= significance_level
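
For reference, the Welch statistic and its Welch-Satterthwaite degrees of freedom can be computed by hand. This is a minimal sketch on made-up samples (`group_a` and `group_b` are hypothetical names, not the notebook's data), checked against `stats.ttest_ind`.

```python
import numpy as np
from scipy import stats

# two made-up groups with different sizes and variances
rng = np.random.default_rng(2)
group_a = rng.normal(4.30, 0.15, size=120)
group_b = rng.normal(4.25, 0.20, size=80)

n_a, n_b = len(group_a), len(group_b)
# squared standard error of each group mean (unbiased variance / sample size)
se2_a = np.var(group_a, ddof=1) / n_a
se2_b = np.var(group_b, ddof=1) / n_b

# Welch t statistic: difference of means over the combined standard error
t = (np.mean(group_a) - np.mean(group_b)) / np.sqrt(se2_a + se2_b)
# Welch-Satterthwaite approximation of the degrees of freedom
dof = (se2_a + se2_b) ** 2 / (se2_a**2 / (n_a - 1) + se2_b**2 / (n_b - 1))
# two-sided p-value from Student's t distribution
pvalue = 2 * stats.t(dof).sf(abs(t))

reference = stats.ttest_ind(group_a, group_b, equal_var=False)
print(np.isclose(t, reference.statistic), np.isclose(pvalue, reference.pvalue))
```

The sign of `t` follows the order of the arguments, which is consistent with the note above about T>0.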

## Q

Instead of taking the log of `HeartRate` and performing a parametric *t*-test, we could have performed a non-parametric Mann-Whitney *U* test, for example.

Check that this latter test also finds a difference between males and females.

## A

In [None]:
HeartRate_female = df.loc[sex['Female'], 'HeartRate']
HeartRate_male = df.loc[sex['Male'], 'HeartRate']
pg.mwu(HeartRate_female, HeartRate_male)
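
SciPy offers the same test as `stats.mannwhitneyu`. The sketch below, on made-up samples `x` and `y`, also computes *U* by counting the pairs in which the first group's value exceeds the second's, which shows that the test compares ranks rather than means.

```python
import numpy as np
from scipy import stats

# made-up samples standing in for the two groups
rng = np.random.default_rng(3)
x = rng.normal(75, 10, size=40)
y = rng.normal(70, 10, size=50)

# U for the first sample: number of (x_i, y_j) pairs with x_i > y_j, ties counting half
wins = np.sum(x[:, None] > y[None, :])
ties = np.sum(x[:, None] == y[None, :])
u_x = wins + 0.5 * ties

result = stats.mannwhitneyu(x, y, alternative='two-sided')
# recent SciPy versions report U for the first sample; older ones may report the complement
print(u_x, result.statistic, result.pvalue)
```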

# Comparing two distributions

Now let us proceed to comparing age between people living with kids and those living without kids.

In [None]:
sns.boxplot(x='LivesWithKids', y='Age', data=df)
sns.swarmplot(x='LivesWithKids', y='Age', hue='LivesWithKids', data=df, linewidth=1, size=3, legend=False);

## Q

Looking at the distributions, it does not make sense to compare the group means. Let us perform a two-sample goodness-of-fit test instead.

Bin the two groups from 20 to 70 years (included) with 5-year-wide bins (hint: use Pandas' `cut` function) and perform a $\chi^2$ test of homogeneity.

## A

In [None]:
lives_with_kids = df.groupby('LivesWithKids').groups
bins = np.arange(20, 70+1, 5) # note the +1 so that the edge 70 is included; any increment greater than 0 and up to 5 works

Using Pandas' `cut`, we create a new column `BinnedAge` in the dataframe:

In [None]:
labels = [f"{lower_bound}-{lower_bound + 5}" for lower_bound in bins[:-1]]
df['BinnedAge'] = pd.cut(df['Age'], bins=bins, labels=labels, right=False)
df['BinnedAge']
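
With `right=False`, each bin includes its lower edge and excludes its upper edge. A short sketch with a few made-up ages (not from the data file) shows where boundary values land, and that a value equal to the last edge is left unbinned:

```python
import numpy as np
import pandas as pd

bins = np.arange(20, 70 + 1, 5)
labels = [f"{lower}-{lower + 5}" for lower in bins[:-1]]

# made-up ages chosen to sit on or near bin edges
ages = pd.Series([20, 24.9, 25, 69.9, 70])
binned = pd.cut(ages, bins=bins, labels=labels, right=False)

# 20 and 24.9 fall in '20-25', 25 in '25-30', 69.9 in '65-70';
# 70 equals the excluded upper edge of the last bin and becomes NaN
print(pd.DataFrame({'age': ages, 'bin': binned}))
```

If ages of exactly 70 must be kept, one option is to extend the last edge slightly; whether that is needed depends on the data.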

We perform the test (note the warning):

In [None]:
_, frequencies, results = pg.chi2_independence(df, x='BinnedAge', y='LivesWithKids')
results

We have too few parents younger than 25:

In [None]:
frequencies

Before we move on, to achieve a similar result, we can use SciPy's `chi2_contingency`. We can reuse the above `frequencies` dataframe, or compute it using NumPy's `histogram` instead:

In [None]:
parents_age = df.loc[lives_with_kids['Yes'], 'Age']
others_age = df.loc[lives_with_kids['No'], 'Age']
parents_age_freqs, _ = np.histogram(parents_age, bins)
others_age_freqs, _ = np.histogram(others_age, bins)
freqs = np.stack((parents_age_freqs, others_age_freqs))
freqs

In [None]:
chi2, pvalue, dof, _ = stats.chi2_contingency(freqs)
print(f'χ²({dof}) = {chi2:.1f}, p-value = {pvalue:.3g}')
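
The last value returned by `chi2_contingency`, discarded above, holds the expected frequencies under the null hypothesis. A common rule of thumb is that the $\chi^2$ approximation becomes unreliable when expected counts fall below 5, which is presumably what the earlier warning is about. A sketch with a hypothetical 2 × 4 table (the counts are invented):

```python
import numpy as np
from scipy import stats

# hypothetical counts: rows are the two groups, columns are age bins
observed = np.array([
    [1, 15, 30, 12],
    [6, 25, 18, 40],
])

chi2, pvalue, dof, expected = stats.chi2_contingency(observed)

# expected counts under homogeneity: row total * column total / grand total
row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
manual_expected = np.outer(row_totals, col_totals) / observed.sum()
print(np.allclose(expected, manual_expected))

# flag the cells that break the rule of thumb
print(expected.round(1))
print('cells with expected count < 5:', int(np.sum(expected < 5)))
```

Here both cells of the first column fall below 5, which is the kind of situation that merging bins can resolve.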

## Q

Repeat the procedure with 10-year bins.

## A

In [None]:
bins = np.arange(20, 70+1, 10)
labels = [f"{lower_bound}-{lower_bound + 10}" for lower_bound in bins[:-1]]
df['BinnedAge'] = pd.cut(df['Age'], bins=bins, labels=labels, right=False)
_, frequencies, results = pg.chi2_independence(df, x='BinnedAge', y='LivesWithKids')
frequencies

In [None]:
results

### Bonus

To avoid binning, we can also perform a two-sample Kolmogorov-Smirnov test, available in SciPy only:

In [None]:
stats.ks_2samp(parents_age, others_age)
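
The statistic reported by the two-sample Kolmogorov-Smirnov test is the largest vertical distance between the two empirical cumulative distribution functions (ECDFs). A minimal sketch on made-up samples `a` and `b`, compared with `ks_2samp`:

```python
import numpy as np
from scipy import stats

# made-up samples with different locations and spreads
rng = np.random.default_rng(4)
a = rng.normal(40, 8, size=150)
b = rng.normal(45, 15, size=200)

# evaluate both ECDFs at every observed value
pooled = np.sort(np.concatenate([a, b]))
ecdf_a = np.searchsorted(np.sort(a), pooled, side='right') / len(a)
ecdf_b = np.searchsorted(np.sort(b), pooled, side='right') / len(b)

# KS statistic: maximum absolute difference between the two ECDFs
d = np.max(np.abs(ecdf_a - ecdf_b))

result = stats.ks_2samp(a, b)
print(np.isclose(d, result.statistic), result.pvalue)
```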

# Multiway ANOVA

## Q

Explain variations in heart rate using age and sex as factors (beware: there is a trap!)

## A

In [None]:
pg.anova(data=df, dv='HeartRate', between=['BinnedAge', 'Sex'])

## Q

We find a significant interaction, while the effect of age narrowly fails to reach significance. Let us first draw an interaction plot.

## A

In [None]:
pg.plot_paired(df, 'HeartRate', 'Sex', 'BinnedAge');

In [None]:
pg.plot_paired(df, 'HeartRate', 'BinnedAge', 'Sex');

## Q

For the purpose of performing multiple comparisons and some *p*-value correction, let us conduct separate *t*-tests for each age interval and organise the results into a dataframe.

If comfortable enough with Python, define a "Pingouin-like" function `stratified_ttests` that takes a dataframe `data`, a dependent variable name `dv`, a between factor name `between` and a stratum factor `strata`.

## A

In [None]:
def stratified_ttests(data, dv, between, strata):
    # convert `strata` into a dictionary with the levels as keys and the corresponding rows as values
    strata = data.groupby(strata).groups
    # list the levels of the between factor, so that we input the groups in `ttest` in a consistent order
    levels = data[between].unique()
    # the present function only supports binary between factors, because we call `ttest`
    assert len(levels) == 2
    level1, level2 = levels
    # loop over the different strata
    results = []
    for stratum, rows in strata.items():
        # pick the corresponding rows of data
        stratum_data = data.loc[rows]
        # make the two groups
        group1 = stratum_data.loc[stratum_data[between]==level1, dv]
        group2 = stratum_data.loc[stratum_data[between]==level2, dv]
        # perform the test
        result = pg.ttest(group1, group2)
        # `result` is a single-row dataframe; set the index label so that we can concatenate the rows afterwards
        result.index = [stratum]
        results.append(result)
    # concatenate the rows and return the resulting dataframe
    return pd.concat(results)

In [None]:
results = stratified_ttests(df, 'HeartRate', 'Sex', 'BinnedAge')
results

Simpler but non-reusable implementation:

In [None]:
results = []
for stratum in ('20-30', '30-40', '40-50', '50-60', '60-70'):
    stratum_data = df[df['BinnedAge']==stratum]
    group1 = stratum_data.loc[stratum_data['Sex']=='Female', 'HeartRate']
    group2 = stratum_data.loc[stratum_data['Sex']=='Male', 'HeartRate']
    result = pg.ttest(group1, group2)
    result.index = [stratum]
    results.append(result)
results = pd.concat(results)
results

## Q

Correct the *p*-values, for example using the Holm procedure.

## A

In [None]:
significance, corrected_pvalues = pg.multicomp(results['p-val'], method='holm')
results['corrected p-val'], results['significance'] = corrected_pvalues, significance
results
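
To make the Holm procedure concrete, here is a NumPy-only sketch of the step-down adjustment. The function name `holm_correction`, its `alpha` argument and the example *p*-values are all hypothetical; this illustrates the procedure and is not Pingouin's implementation.

```python
import numpy as np

def holm_correction(pvalues, alpha=0.05):
    """Return (reject, adjusted p-values) following the Holm step-down procedure."""
    pvalues = np.asarray(pvalues, dtype=float)
    m = len(pvalues)
    # rank the p-values from smallest to largest
    order = np.argsort(pvalues)
    # multiply the smallest by m, the next by m - 1, and so on down to 1
    scaled = pvalues[order] * (m - np.arange(m))
    # adjusted p-values must not decrease along the ranking, and are capped at 1
    adjusted_sorted = np.minimum(np.maximum.accumulate(scaled), 1.0)
    # put the adjusted values back in the original order
    adjusted = np.empty(m)
    adjusted[order] = adjusted_sorted
    return adjusted <= alpha, adjusted

# made-up p-values, one per stratum
reject, adjusted = holm_correction([0.01, 0.04, 0.03, 0.005])
print(adjusted)  # approximately [0.03, 0.06, 0.06, 0.02]
print(reject)
```

The smallest *p*-value is penalised as in a Bonferroni correction, while larger ones receive progressively milder multipliers.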