## Q

Import `numpy`, `pandas`, the `pyplot` module from `matplotlib`, `seaborn`, and the `stats` module from `scipy`.

## A

In [None]:
import numpy as np
import pandas as pd
from scipy import stats
from matplotlib import pyplot as plt
import seaborn as sns

# Comparison of two group means

## Q

Load the `mi.csv` data file, located in the `data` directory of the course repository, into a DataFrame, and show the column names and the first 5 rows.

## A

In [None]:
df = pd.read_csv('../data/mi.csv', index_col=0)

In [None]:
# in Jupyter-lab, pandas is set to display dataframes with a limited number of columns
pd.options.display.max_columns = None
pd.options.display.max_rows = 6

df.head()

## Q

Show a summary table for these data, numerical AND categorical.

Hint: look for optional arguments to the method you know, to get info about the categorical variables.

## A

In [None]:
df.describe()

In [None]:
df.describe(exclude='number')

## Q

Inspect the relationship between variables `Age` and `OwnsHouse`. What type of plot is most suitable?

## A

In [None]:
# categorical vs continuous => boxplot, violinplot
sns.boxplot(x='OwnsHouse', y='Age', data=df);

In [None]:
sns.boxplot(x='OwnsHouse', y='Age', data=df);
sns.swarmplot(x='OwnsHouse', y='Age', hue='OwnsHouse', data=df, linewidth=1, size=3, legend=False);

In [None]:
!cowsay "Where on Earth do younger people own a house while older people do not?"

Alternative representation:

In [None]:
sns.histplot(hue='OwnsHouse', y='Age', data=df, kde=True);

## Q

Draw a boxplot or violinplot of `Age` for two categorical variables, say `OwnsHouse` and `LivesWithKids`.

## A

In [None]:
ax = sns.boxplot(x='OwnsHouse', y='Age', data=df, hue="LivesWithKids")
#sns.move_legend(ax, bbox_to_anchor=(1.02, 1), loc='upper left', borderaxespad=0);

## Q

Isolate the house-owners group from the others, and report their mean age(s) as $99\%$ confidence interval(s).

## A

In [None]:
group = df.groupby('OwnsHouse').groups
house_owners = group['Yes']
others = group['No']
house_owners_age = df.loc[house_owners, 'Age']
others_age = df.loc[others, 'Age']

In [None]:
mean = np.mean(house_owners_age)
sem = stats.sem(house_owners_age)
distribution_of_the_mean = stats.norm(mean, sem)

In [None]:
distribution_of_the_mean.interval(.99)

Alternative calculation, for both groups:

First we need to evaluate the inverse survival function of the standard normal distribution at $0.5\%$.

In [None]:
alpha = 0.01
z = stats.norm().isf(alpha / 2)
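
As a side note, the inverse survival function evaluated at $\alpha/2$ is the quantile function (`ppf` in `scipy`) evaluated at $1-\alpha/2$. A minimal standalone check, using only the standard normal distribution:

```python
from scipy import stats

alpha = 0.01
standard_normal = stats.norm()

# upper-tail critical value, computed in two equivalent ways
z_from_isf = standard_normal.isf(alpha / 2)
z_from_ppf = standard_normal.ppf(1 - alpha / 2)

print(z_from_isf, z_from_ppf)
```

Both expressions give the familiar critical value of about $2.58$ for a two-sided $99\%$ interval.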

In [None]:
for group_name, group_age in (
    ('House owners', house_owners_age),
    ('Others', others_age),
):
    m = np.mean(group_age)
    z_times_sem = z * stats.sem(group_age)
    print(f'{group_name}: {m:.2f} ± {z_times_sem:.2f} years old on average')

## Q

Check whether the age is normally distributed in a group, first following a graphical approach.

## A

In [None]:
stats.probplot(house_owners_age, fit=True, plot=plt);

`probplot` does not allow customizing the plot, but conveniently provides the elements to reproduce the plot with lower-level functions.

In [None]:
help(stats.probplot)

In [None]:
(theoretical_quantiles, observed_quantiles), (slope, intercept, _) = stats.probplot(house_owners_age, fit=True)
# blue crosses
plt.scatter(theoretical_quantiles, observed_quantiles, marker='+', color='b')
# red line
plt.axline((0, intercept), slope=slope, color='r')
# axis labels
plt.xlabel('theoretical quantiles')
plt.ylabel('ordered observations (age)');

The red line is a least-squares fit to all the blue points, and it does not align well with the part of the plot where the points follow a straight line.

We can seek confirmation with a normality test, although it is already clear the age is not normally distributed in our sample:

In [None]:
stats.normaltest(house_owners_age)

Here, the sample sizes are comfortably large, so these departures from normality should have little effect on the validity of the $t$-test used below.
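
One way to probe how much non-normality matters at such sample sizes is a small simulation. The sketch below is not part of the original analysis: it draws two groups from the same, clearly skewed (exponential) distribution, with hypothetical group sizes of 300 and 500, and estimates how often a $t$-test wrongly rejects the null hypothesis at the $5\%$ level.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed, for reproducibility
n_simulations = 2000
n1, n2 = 300, 500  # hypothetical group sizes, not those of the mi.csv data

# both groups come from the same skewed distribution: the null hypothesis holds
group1 = rng.exponential(scale=10, size=(n_simulations, n1))
group2 = rng.exponential(scale=10, size=(n_simulations, n2))

# one t-test per simulated pair of samples (one pair per row)
_, pvalues = stats.ttest_ind(group1, group2, axis=1)

# proportion of false positives at the 5% significance level
false_positive_rate = np.mean(pvalues <= 0.05)
print(f'false positive rate: {false_positive_rate:.3f}')
```

A rate close to $0.05$ suggests that, at these sample sizes, the test keeps its nominal type I error rate despite the skewness.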

## Q

Are the sample size and variance of the two groups similar enough for running a standard $t$ test?

## A

In [None]:
len(house_owners_age), len(others_age), np.std(house_owners_age), np.std(others_age)

The standard $t$-test, which `ttest_ind` runs by default, tolerates standard deviation ratios [up to $2$](https://en.wikipedia.org/wiki/Student%27s_t-test#Equal_or_unequal_sample_sizes,_similar_variances_(1/2_%3C_sX1/sX2_%3C_2)).
The groups can have different sample sizes.
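
When the standard deviations differ by more than that, a common fallback is Welch's $t$-test, which `ttest_ind` runs when `equal_var=False`. The sketch below illustrates the decision on synthetic data; the group sizes, means and standard deviations are made up for illustration and do not come from `mi.csv`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# synthetic groups with made-up parameters; the second one is 3 times more dispersed
sample1 = rng.normal(loc=40, scale=5, size=200)
sample2 = rng.normal(loc=42, scale=15, size=400)

# ratio of the sample standard deviations (ddof=1: unbiased variance estimator)
std_ratio = np.std(sample1, ddof=1) / np.std(sample2, ddof=1)

# Student's t-test if the ratio lies within (1/2, 2), Welch's t-test otherwise
similar_variances = bool(0.5 < std_ratio < 2)
result = stats.ttest_ind(sample1, sample2, equal_var=similar_variances)
print(f'std ratio: {std_ratio:.2f}, equal_var={similar_variances}, p-value: {result.pvalue:.3g}')
```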

## Q

Test whether the group mean ages are equal.

## A

In [None]:
# define your significance level first!
significance_level = 0.05

# run a t-test for independent samples
stats.ttest_ind(house_owners_age, others_age)

In [None]:
_, pvalue = stats.ttest_ind(house_owners_age, others_age)
pvalue <= significance_level

## Q

How would you report the result of this test, with extensive information about the effect?

## A

In [None]:
# we need:
# * the number of degrees of freedom, to give a complete report of the outcome of the t-test,
n1, n2 = len(house_owners_age), len(others_age)
degrees_of_freedom = n1 + n2 - 2

# * the mean difference (an unstandardized effect size: it is not scaled by the associated variability),
mean_difference = np.mean(house_owners_age) - np.mean(others_age)

# * and the effect size.
t, _ = stats.ttest_ind(house_owners_age, others_age)
cohen_d = t * np.sqrt(1/n1 + 1/n2)

#   alternatively, for the lazy people:
import pingouin as pg
cohen_d_again = pg.compute_effsize(house_owners_age, others_age)

degrees_of_freedom, mean_difference, cohen_d, cohen_d_again

«**In our study**, house owners ($n=288$) were found to be significantly younger than the other surveyed people ($n=528$; $10.3$ years younger on average, $t(814)=-10.9$, $p<0.05$). This effect was found to be large (Cohen's $d \approx 0.8$).»

Note: as we report the sample size for each group, we may omit the (still nice-to-have) information of the number of degrees of freedom.
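
The conversion from $t$ to Cohen's $d$ used above follows from the definition of the statistic: $t = (\bar{x}_1 - \bar{x}_2) / (s_p \sqrt{1/n_1 + 1/n_2})$, where $s_p$ is the pooled standard deviation, so that $d = (\bar{x}_1 - \bar{x}_2) / s_p = t \sqrt{1/n_1 + 1/n_2}$. A minimal check of this identity on synthetic data (made-up samples that merely reuse the group sizes reported above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# synthetic samples with made-up means and standard deviations
x1 = rng.normal(loc=40, scale=12, size=288)
x2 = rng.normal(loc=50, scale=13, size=528)
n1, n2 = len(x1), len(x2)

# pooled variance, from the unbiased group variances
pooled_var = ((n1 - 1) * np.var(x1, ddof=1) + (n2 - 1) * np.var(x2, ddof=1)) / (n1 + n2 - 2)
d_from_definition = (np.mean(x1) - np.mean(x2)) / np.sqrt(pooled_var)

# same quantity, derived from the t statistic
t, _ = stats.ttest_ind(x1, x2)
d_from_t = t * np.sqrt(1 / n1 + 1 / n2)

print(d_from_definition, d_from_t)
```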

In [None]:
help(pg.compute_effsize)

In [None]:
pg.compute_effsize(house_owners_age, others_age, eftype='hedges')

# Comparing two distributions

Now let us proceed to compare age between people living with kids and those living without kids.
Plot the data.

In [None]:
sns.boxplot(x='LivesWithKids', y='Age', data=df)
sns.swarmplot(x='LivesWithKids', y='Age', hue='LivesWithKids', data=df, linewidth=1, size=3, legend=False);

## Q

How do the common descriptive statistics (mean, variance) compare?

## A

In [None]:
lives_with_kids = df['Age'][df['LivesWithKids']=='Yes']
lives_without_kids = df['Age'][df['LivesWithKids']=='No']
np.mean(lives_without_kids), np.mean(lives_with_kids), np.std(lives_without_kids), np.std(lives_with_kids)

A difference in group means is very unlikely, and the ratio of the group standard deviations is large but $<2$.

The main feature to notice is the double mode in the *lives without kids* group.
Similar samples drawn from the same population could be (more) biased towards older or younger people, and this could result in mean differences, in one direction or the other, even to the point that such differences become significant.

In [None]:
sns.histplot(lives_without_kids);

As a consequence, there is no point in comparing the two groups in terms of central tendency (means). A $t$-test is not suitable.

## Q

How can we compare the two groups to state they differ from one another?

## A (with nested Q&A)

We need a two-sample goodness-of-fit test.

This can be done in two ways:

* with a $\chi^2$ test of homogeneity, binning the age;
* with a two-sample Kolmogorov-Smirnov test.

### Q

Bin the two groups, first with 5-year-wide bins, extract frequencies and proceed to performing a $\chi^2$ test.

### A

In [None]:
bins = np.arange(20, 70+1, 5)
lives_without_kids_freqs, _ = np.histogram(lives_without_kids, bins)
lives_with_kids_freqs, _ = np.histogram(lives_with_kids, bins)
lives_without_kids_freqs, lives_with_kids_freqs

Let us check we did not miss any observation:

In [None]:
assert np.sum(lives_without_kids_freqs) + np.sum(lives_with_kids_freqs) == len(df)
len(df)

Note that we have fewer than 5 observations in one combination of factor levels. In principle we should revise the binning so that all bins contain at least 5 observations.
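
This rule of thumb is often stated in terms of the expected frequencies under the null hypothesis rather than the observed ones. `chi2_contingency` returns the expected frequencies as its fourth output, so the check can be sketched as follows, here with a made-up frequency table rather than the actual counts:

```python
import numpy as np
from scipy import stats

# made-up frequency table: 2 groups (rows) x 4 bins (columns)
observed = np.array([
    [30, 45, 12, 3],
    [25, 50, 40, 20],
])

chi2, pvalue, dof, expected = stats.chi2_contingency(observed)

# bins where at least one group has an expected frequency below 5
low_expected = (expected < 5).any(axis=0)
print(expected.round(1))
print('bins with low expected frequencies:', np.flatnonzero(low_expected))
```

In this made-up table, one observed count is below 5 although all the expected frequencies are above 5.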

For the purpose of comparing the impact of the binning, let us run the test anyway.

In [None]:
chi2, pvalue, dof, _ = stats.chi2_contingency(np.stack((lives_with_kids_freqs, lives_without_kids_freqs), axis=0))
print(f'χ²({dof}) = {chi2:.1f}, p-value = {pvalue:.3g}')

### Q

Are all the assumptions met? Adjust the procedure if necessary. Any interpretation?

### A

In [None]:
bins = np.arange(20, 70+1, 10)
lives_without_kids_freqs, _ = np.histogram(lives_without_kids, bins)
lives_with_kids_freqs, _ = np.histogram(lives_with_kids, bins)
lives_without_kids_freqs, lives_with_kids_freqs

In [None]:
chi2, pvalue, dof, _ = stats.chi2_contingency(np.stack((lives_with_kids_freqs, lives_without_kids_freqs), axis=0))
print(f'χ²({dof}) = {chi2:.1f}, p-value = {pvalue:.3g}')

The low frequency in one group, in the first case, did not affect the outcome of the test because of the relatively large number of bins.

Although there is no doubt we do have an effect here, we can also run a two-sample Kolmogorov-Smirnov test. It is good practice to seek confirmation with different but equivalent tests.

In [None]:
stats.ks_2samp(lives_with_kids, lives_without_kids)
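
The two-sample Kolmogorov-Smirnov statistic is the largest vertical distance between the empirical cumulative distribution functions (ECDF) of the two samples. The sketch below recomputes it by hand on synthetic data (made-up samples, one unimodal and one bimodal) and compares the result with `ks_2samp`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# made-up samples: one unimodal, one bimodal
sample1 = rng.normal(loc=40, scale=5, size=300)
sample2 = np.concatenate((rng.normal(30, 4, size=250), rng.normal(60, 4, size=250)))

# evaluate both ECDFs at every observed value
all_values = np.sort(np.concatenate((sample1, sample2)))
ecdf1 = np.searchsorted(np.sort(sample1), all_values, side='right') / len(sample1)
ecdf2 = np.searchsorted(np.sort(sample2), all_values, side='right') / len(sample2)

# the KS statistic is the maximum absolute difference between the two ECDFs
ks_by_hand = np.max(np.abs(ecdf1 - ecdf2))

result = stats.ks_2samp(sample1, sample2)
print(ks_by_hand, result.statistic)
```

Because the statistic looks at the whole distribution, it can detect differences in shape, such as a second mode, even when the group means are close.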

# ...