## Penguins

The `seaborn` package ships with several example data sets, such as the `penguins` data set.

In [None]:
import seaborn as sns
penguins = sns.load_dataset("penguins")
penguins.head()

In [None]:
penguins.shape

In [None]:
from collections import Counter
Counter(penguins.island)

We can plot this using `countplot`.

In [None]:
sns.countplot(x="island", data=penguins)

Using `groupby` we can calculate summary statistics across groups.

In [None]:
# numeric_only=True skips the string columns (species, sex), which have no mean.
penguins.groupby("island").mean(numeric_only=True)

This also works for other functions, e.g. `std`.

In [None]:
penguins.groupby("island").std(numeric_only=True)
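
Several statistics can also be requested in one call. The following is a minimal sketch on a small made-up frame (the values and the island labels `X` and `Y` are hypothetical, not taken from `penguins`), using `agg` on a single numeric column.

```python
import pandas as pd

# Hypothetical toy data; the numbers are invented for illustration only.
toy = pd.DataFrame({
    "island": ["X", "X", "Y", "Y", "Y"],
    "bill_length_mm": [40.0, 42.0, 45.0, 47.0, 49.0],
    "species": ["a", "a", "b", "b", "b"],
})

# Select the numeric column first, then compute several statistics per group.
summary = toy.groupby("island")["bill_length_mm"].agg(["mean", "std", "count"])
print(summary)
```

Selecting the column before aggregating avoids any trouble with non-numeric columns such as `species`.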

Now let's run a regression! Our categorical covariates are the islands; our response is the bill length.

In [None]:
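import statsmodels.formula.api as smf

# One indicator column per island; "- 1" removes the intercept.
fit = smf.ols("""bill_length_mm ~ I(1 * (island == 'Biscoe'))
  + I(1 * (island == 'Dream'))
  + I(1 * (island == 'Torgersen')) - 1""",
  data=penguins).fit()

fit.summary()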

In [None]:
fit = smf.ols("bill_length_mm ~ C(island) - 1", data = penguins).fit()
fit.summary()

In [None]:
fit.pvalues

Here we see that all of the beta coefficients are highly significant. But what does that mean? It only means that the mean bill length is unlikely to be $0$ on any of the islands! What we typically care about is whether there is a difference between the islands, say Biscoe and Dream.

In [None]:
fit2 = smf.ols("bill_length_mm ~ C(island)", data = penguins).fit()
fit2.summary()

In this run of the regression model, there is no `C(island)[Biscoe]` because this coefficient has been absorbed into a *baseline*, the `Intercept`. The remaining island coefficients now measure differences from Biscoe.



In [None]:
fit.params

In [None]:
fit2.params


How do we recover the mean for `Dream`? Add the baseline to its coefficient!

In [None]:
# Use .iloc for positional access; the parameters are indexed by name.
fit2.params.iloc[0] + fit2.params.iloc[1]


In [None]:
fit.params.iloc[1]
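
The link between the two parameterizations can be checked without `statsmodels`. Below is a minimal sketch on made-up data (the group labels `A`, `B`, `C` and the response values are hypothetical) that builds both design matrices by hand and solves the least-squares problem with `numpy`.

```python
import numpy as np

# Hypothetical toy data: three groups with invented response values.
groups = np.array(["A", "A", "B", "B", "B", "C", "C"])
y = np.array([1.0, 3.0, 4.0, 5.0, 6.0, 10.0, 12.0])
levels = ["A", "B", "C"]

# Coding 1: one indicator column per level, no intercept (like "- 1").
X_cells = np.column_stack([(groups == g).astype(float) for g in levels])
beta_cells, *_ = np.linalg.lstsq(X_cells, y, rcond=None)

# Coding 2: intercept plus indicators for every level except the baseline "A".
X_base = np.column_stack(
    [np.ones(len(y))] + [(groups == g).astype(float) for g in levels[1:]]
)
beta_base, *_ = np.linalg.lstsq(X_base, y, rcond=None)

group_means = np.array([y[groups == g].mean() for g in levels])

print(beta_cells)  # one coefficient per group: the group means
print(beta_base)   # baseline mean, then differences from the baseline
```

Without an intercept, each coefficient is a group mean. With an intercept, the first coefficient is the baseline group's mean and the others are differences from it, so adding the baseline recovers the group means.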


## Titanic

In [None]:
import seaborn as sns
titanic = sns.load_dataset("titanic")
titanic.head()

In [None]:
titanic.shape

In [None]:
titanic.groupby("sex").mean(numeric_only=True)

In [None]:
smf.ols("survived ~ C(sex)", data = titanic).fit().summary()

How about $k$ categories then?

In [None]:
from collections import Counter
Counter(titanic["class"])

In [None]:
smf.ols("survived ~ Q('class')", data = titanic).fit().summary()

Since both *p*-values are very small, class appears to have an effect on survival.

In [None]:
smf.ols("survived ~ Q('class')", data = titanic.sample(n = 20, random_state=1)).fit().summary()

## The $F$ test

In [None]:
import pandas as pd
import statsmodels.formula.api as smf
# Example taken from https://towardsdatascience.com/anova-test-with-python-cfbf4013328b.
students = pd.read_csv("https://userpage.fu-berlin.de/soga/200/2010_data_sets/students.csv")
students.head()

In [None]:
students.major.dtype

The column holds strings, which the formula interface treats as categorical data automatically, so no `C(...)` is needed!

In [None]:
smf.ols("salary ~ major - 1", data = students).fit().summary()

This data contains more categorical variables, e.g. `minor`.

In [None]:
smf.ols("salary ~ major + minor - 1", data = students).fit().summary()

Can we find the $F$ statistic for the minor too? Yes, by using `anova_lm`! (With argument `typ=3`.)

In [None]:
from statsmodels.stats.anova import anova_lm
# The keyword is `typ`; a misspelled keyword such as `type` is ignored.
anova_lm(smf.ols("salary ~ major + minor - 1", data=students).fit(), typ=3)
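
To see where an $F$ statistic for a whole factor comes from, it can help to compute one by hand. The following is a minimal sketch of the nested-model $F$ test on simulated data, assuming two made-up factors `a` and `b` (stand-ins for something like `major` and `minor`). The helper names `dummies` and `rss` are hypothetical, and the sketch is not meant to reproduce the `anova_lm` table exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 120

# Hypothetical factors with 3 and 4 levels, and a simulated response.
a = rng.integers(0, 3, size=n)
b = rng.integers(0, 4, size=n)
y = 2.0 * a + 0.5 * b + rng.normal(size=n)


def dummies(codes, drop_first):
    """Indicator columns for a factor, optionally dropping the first level."""
    levels = np.unique(codes)
    if drop_first:
        levels = levels[1:]
    return np.column_stack([(codes == lv).astype(float) for lv in levels])


def rss(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)


# Reduced model: y ~ a - 1.  Full model: y ~ a + b - 1.
X_reduced = dummies(a, drop_first=False)
X_full = np.column_stack([X_reduced, dummies(b, drop_first=True)])

q = X_full.shape[1] - X_reduced.shape[1]   # number of extra coefficients
df_resid = n - X_full.shape[1]             # residual degrees of freedom
rss_reduced = rss(X_reduced, y)
rss_full = rss(X_full, y)

# F compares the drop in RSS per extra coefficient to the residual variance.
F = ((rss_reduced - rss_full) / q) / (rss_full / df_resid)
print(q, df_resid, F)
```

Under the null hypothesis that `b` has no effect, this statistic follows an $F$ distribution with `q` and `df_resid` degrees of freedom, which is where the *p*-value comes from.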

We can continue on with this, testing, e.g., the influence of religion.

In [None]:
smf.ols("salary ~ major + minor + religion - 1", data = students).fit().summary()

In [None]:
anova_lm(smf.ols("salary ~ major + minor + religion - 1", data=students).fit(), typ=3)

It also works for numerical covariates.

In [None]:
anova_lm(smf.ols("salary ~ major + minor + religion + age - 1", data=students).fit(), typ=3)