# How do I make inferential statistics?

In [None]:
# Setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

We will use the ```cats``` dataset from the R package MASS, which records the sex, body weight (```Bwt```, in kg), and heart weight (```Hwt```, in g) of a sample of cats. MASS is not available in Python, so we load a CSV copy of the dataset from the Rdatasets repository:

In [None]:
# Load cats dataset
cats = pd.read_csv("https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/MASS/cats.csv")
cats = cats.drop(columns=['rownames'])

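Inferential statistics asks the same kinds of questions as descriptive statistics, but answers them in a mathematically rigorous way: with formal hypothesis tests and estimates of uncertainty, so that what we see in the sample can be generalised to the population it was drawn from.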

For example, to test whether body weights differ between male and female cats, we can run a two-sample t-test.

In [None]:
cats_male = cats[cats["Sex"] == "M"]
cats_female = cats[cats["Sex"] == "F"]

stats.ttest_ind(cats_male["Bwt"], cats_female["Bwt"], equal_var=False)
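
To make the calculation behind this test concrete, here is a minimal sketch that rebuilds the Welch t statistic by hand and checks it against ```stats.ttest_ind```. It uses simulated values rather than the cats data, so the group sizes, means, and spreads below are illustrative assumptions and not figures from the dataset.

```python
import numpy as np
from scipy import stats

# Simulated data (assumption: made-up groups, NOT the cats dataset)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=2.9, scale=0.5, size=90)
group_b = rng.normal(loc=2.4, scale=0.3, size=45)

# Welch t statistic: difference in means divided by the unpooled standard error
se = np.sqrt(
    group_a.var(ddof=1) / group_a.size +
    group_b.var(ddof=1) / group_b.size
)
t_manual = (group_a.mean() - group_b.mean()) / se

# SciPy returns the same statistic together with a two-sided p-value
result = stats.ttest_ind(group_a, group_b, equal_var=False)
print(t_manual, result.statistic, result.pvalue)
```

The returned object exposes the test statistic as ```result.statistic``` and the two-sided p-value as ```result.pvalue```.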

## Exercise 1

Is there a difference in mean heart weight between male and female cats? Perform a Welch t-test by passing ```equal_var=False``` (SciPy assumes equal variances by default).

In [None]:
# Exercise 1 - Your Answer


You may have noticed that we ran Welch's t-test by setting ```equal_var=False```. Why might this be a good choice?

## Exercise 2

Re-run the t-test assuming equal variances (```equal_var=True```). Compare the confidence intervals of both tests. Which is more conservative?

In [None]:
# Exercise 2 - Your Answer


The t-test output above does not show a confidence interval, so we compute one by hand:

In [None]:
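# 95% CI for the difference in mean heart weight (Welch)
n_m, n_f = len(cats_male), len(cats_female)
var_m = cats_male["Hwt"].var(ddof=1) / n_m   # squared standard error, males
var_f = cats_female["Hwt"].var(ddof=1) / n_f # squared standard error, females

mean_diff = cats_male["Hwt"].mean() - cats_female["Hwt"].mean()
se = np.sqrt(var_m + var_f)

# Welch-Satterthwaite degrees of freedom, as used by ttest_ind(equal_var=False)
df = (var_m + var_f)**2 / (var_m**2/(n_m - 1) + var_f**2/(n_f - 1))

tcrit = stats.t.ppf(0.975, df=df)
(lower, upper) = (mean_diff - tcrit*se, mean_diff + tcrit*se)
(lower, upper)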
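
As a cross-check on a hand-computed interval, the sketch below compares a manual Welch interval with the one SciPy can report itself. To the best of my knowledge the t-test result only gained a ```confidence_interval``` method in newer SciPy releases, so the call is guarded and skipped if your version lacks it. The data are simulated (an assumption for illustration, not the cats dataset).

```python
import numpy as np
from scipy import stats

# Simulated data (assumption: illustrative values, NOT the cats dataset)
rng = np.random.default_rng(1)
x = rng.normal(loc=11.0, scale=2.5, size=90)
y = rng.normal(loc=9.0, scale=1.5, size=45)

# Manual Welch interval for mean(x) - mean(y)
vx, vy = x.var(ddof=1) / x.size, y.var(ddof=1) / y.size
se = np.sqrt(vx + vy)
df = (vx + vy) ** 2 / (vx ** 2 / (x.size - 1) + vy ** 2 / (y.size - 1))
tcrit = stats.t.ppf(0.975, df)
diff = x.mean() - y.mean()
manual_ci = (diff - tcrit * se, diff + tcrit * se)

result = stats.ttest_ind(x, y, equal_var=False)
# confidence_interval may be missing in older SciPy versions, so guard the call
if hasattr(result, "confidence_interval"):
    ci = result.confidence_interval(confidence_level=0.95)
    scipy_ci = (ci.low, ci.high)
else:
    scipy_ci = None

print(manual_ci, scipy_ci)
```

If both intervals are printed, they should agree to within floating-point error.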

## Exercise 3

Write out the null and alternative hypotheses for the t-test above formally.

## Exercise 4

We assumed that the sampling distribution of the mean is approximately normal. Which mathematical theorem allows this?

## Exercise 5

Do you expect an association between body and heart weight? (Write your reasoning.)

## Exercise 6

Look at the scatter plot below. Describe the relationship between body weight and heart weight.

In [None]:
sns.scatterplot(data=cats, x="Bwt", y="Hwt")
plt.xlabel("Cat body weight (kg)")
plt.ylabel("Cat heart weight (g)")
plt.show()

## Exercise 7

Run a Pearson correlation test using the ```stats.pearsonr``` function. Write out your hypotheses and interpret the p-value below.

In [None]:
# Exercise 7 - Your Answer
