# In-class exercises

### 1. t-test

Consider the data in `birthweight.csv`.

|Variable|Description|Data type|
|---|---|---|
|ID|Baby number| (meta) |
|length|Length of baby (cm)|Scale|
|Birthweight|Weight of baby (kg)|Scale|
| headcirumference|Head Circumference|Scale|
| Gestation|Gestation (weeks)|Scale|
| smoker|Mother smokes: 1 = smoker, 0 = non-smoker| Binary|
| motherage|Maternal age|Scale|
| mnocig|Number of cigarettes smoked per day by mother|Scale|
| mheight|Mother's height (cm)| Scale|
| mppwt|Mother's pre-pregnancy weight (kg)| Scale|
| fage|Father's age| Scale|
|fedyrs|Father’s years in education|Scale|
| fnocig|Number of cigarettes smoked per day by father|Scale|
| fheight|Father's height (cm)| Scale|
| lowbwt|Low birth weight: 0 = no, 1 = yes| Binary|
|mage35|Mother over 35: 0 = no, 1 = yes|Binary|


Does a mother's smoking have any influence on their baby's birthweight?

In [3]:
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

In [3]:
data = pd.read_csv("birthweight.csv")

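H0: mean birthweight is the same for babies of smoking and non-smoking mothers

H1: mean birthweight differs between the two groups (two-tailed test, since we have no prior expectation about the direction of the change)

$\alpha$ = 0.05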

In [7]:
bw_smoker = data[ data.smoker==1 ].Birthweight
bw_nonsmoker = data[ data.smoker==0 ].Birthweight

mean_smoker = bw_smoker.mean()
mean_nonsmoker = bw_nonsmoker.mean()

print('smoker:', mean_smoker)
print('nonsmoker:', mean_nonsmoker)
print('difference:', mean_smoker - mean_nonsmoker)

smoker: 3.1340909090909093
nonsmoker: 3.5095000000000005
difference: -0.37540909090909125


In [8]:
stats.ttest_ind( bw_smoker, bw_nonsmoker )

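Ttest_indResult(statistic=-2.093431541991207, pvalue=0.04269624654559367)

Since p ≈ 0.043 < $\alpha$ = 0.05, we reject the null hypothesis of equal means: in this sample, babies of smoking mothers are on average about 0.38 kg lighter, and the difference is statistically significant.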
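To see what `stats.ttest_ind` computes with its default settings (equal variances assumed), the statistic can be rebuilt by hand from the pooled variance. This is a minimal sketch on synthetic data: the arrays `group_a` and `group_b`, their sizes, and the distribution parameters are made up for illustration and are not the values in `birthweight.csv`.

```python
import numpy as np
from scipy import stats

# Hypothetical samples (kg), generated only so the sketch runs on its own;
# they are NOT the birthweights from birthweight.csv.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=3.1, scale=0.6, size=22)
group_b = rng.normal(loc=3.5, scale=0.6, size=20)

n_a, n_b = len(group_a), len(group_b)
df = n_a + n_b - 2  # degrees of freedom of the pooled test

# Pooled variance: the two sample variances (ddof=1), weighted by their
# degrees of freedom.
pooled_var = ((n_a - 1) * group_a.var(ddof=1)
              + (n_b - 1) * group_b.var(ddof=1)) / df

# t statistic: difference in means divided by its standard error.
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# Two-tailed p-value: probability mass in both tails of the t distribution.
p_manual = 2 * stats.t.sf(abs(t_manual), df)

# Reference values from SciPy (equal_var=True is the default).
t_scipy, p_scipy = stats.ttest_ind(group_a, group_b)

print("manual:", t_manual, p_manual)
print("scipy: ", t_scipy, p_scipy)
```

If the equal-variance assumption is doubtful, passing `equal_var=False` to `stats.ttest_ind` runs Welch's t-test instead.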

### 2. ANOVA


Consider the data in `diet.csv`.

|Variable|Description|Data type|
|---|---|---|
|Person|Participant number||
|Age|Age (years)|Scale|
|Height|Height (cm)|Scale|
|preweight|Weight before the diet (kg)|Scale|
|Diet|Diet|Nominal|
|weight6weeks|Weight after 6 weeks (kg)|Scale|

Does weight change depend on the diet followed?

H0: mean weight changes for all diets are equal

H1: mean weight changes are not all equal

$\alpha$ = 0.05

In [14]:
data = pd.read_csv("diet.csv")

diet1 = data[ data.Diet==1 ]
diet2 = data[ data.Diet==2 ]
diet3 = data[ data.Diet==3 ]

change1 = diet1.weight6weeks - diet1.preweight
change2 = diet2.weight6weeks - diet2.preweight
change3 = diet3.weight6weeks - diet3.preweight

In [22]:
stats.f_oneway(change1, change2, change3 )

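F_onewayResult(statistic=6.197447453165349, pvalue=0.0032290142385893524)

Since p ≈ 0.003 < $\alpha$ = 0.05, we reject H0: the mean weight change is not the same for all three diets. The test does not tell us which diets differ from each other.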
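The F statistic returned by `stats.f_oneway` is the ratio of the between-group to the within-group mean square. The sketch below rebuilds it by hand, assuming three synthetic groups: the values in `groups`, their sizes, and their means are invented for illustration and are not the weight changes from `diet.csv`.

```python
import numpy as np
from scipy import stats

# Hypothetical weight changes (kg) for three groups; NOT the data in diet.csv.
rng = np.random.default_rng(1)
groups = [
    rng.normal(loc=-3.0, scale=2.3, size=24),
    rng.normal(loc=-3.0, scale=2.3, size=27),
    rng.normal(loc=-5.0, scale=2.3, size=27),
]

k = len(groups)                          # number of groups
n_total = sum(len(g) for g in groups)    # total number of observations
grand_mean = np.concatenate(groups).mean()

# Between-group sum of squares: spread of the group means around the grand mean.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# Within-group sum of squares: spread of the observations around their own group mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within.
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))

# p-value: upper tail of the F distribution with (k - 1, n_total - k) degrees of freedom.
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = stats.f_oneway(*groups)

print("manual:", f_manual, p_manual)
print("scipy: ", f_scipy, p_scipy)
```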

### 3. Chi-squared test

Consider the data in `crime.csv`.

|Variable|Description|Data type|
|---|---|---|
|CrimeRate|Crime rate (number of offences per million population)|Continuous|
|Youth|Young males (number of males aged 18-24 per 1000)|Discrete|
|Southern|Southern state 1 = yes, 0 = no|Binary|
|Education|Education time (average number of years schooling up to 25)|Discrete|
|ExpenditureYear0|Expenditure (per capita expenditure on police), skewed|Continuous|
|LabourForce|Youth labour force (males employed 18-24 per 1000)|Discrete|
|Males|Males (per 1000 females)|Discrete|
|MoreMales|More males identified per 1000 females 1 = yes, 0 = no|Binary|
|StateSize|State size (in hundred thousands)|Discrete|
|YouthUnemployment|Youth Unemployment (number of males aged 18-24 per 1000), skewed|Discrete|
|MatureUnemployment|Mature Unemployment (number of males aged 35-39 per 1000)|Discrete|
|HighYouthUnemploy|High Youth Unemployment 1 = yes, 0 = no (high if Youth >3*Mature )|Binary|
|Wage|Wage (median weekly wage)|Continuous|
|BelowWage|Below Wage (number of families below half wage per 1000)|Discrete|


Is there a relationship between Southern states and high youth unemployment?

H0: High youth unemployment is independent of Southern

H1: High youth unemployment is not independent of Southern

$\alpha$ = 0.05

In [30]:
data = pd.read_csv("crime.csv")

obs = pd.crosstab( data.Southern, data.HighYouthUnemploy )
obs

|Southern (rows) / HighYouthUnemploy (columns)|0|1|
|---|---|---|
|0|17|14|
|1|15|1|


In [31]:
stats.contingency.expected_freq(obs)

array([[21.10638298,  9.89361702],
       [10.89361702,  5.10638298]])
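Under independence, each expected count is the product of its row total and column total divided by the grand total. As a small sketch, the same array can be reproduced with NumPy from the observed counts shown above (the variable names here are only illustrative):

```python
import numpy as np

# Observed counts copied from the crosstab above
# (rows: Southern = 0, 1; columns: HighYouthUnemploy = 0, 1).
observed = np.array([[17, 14],
                     [15, 1]])

row_totals = observed.sum(axis=1)   # [31, 16]
col_totals = observed.sum(axis=0)   # [32, 15]
grand_total = observed.sum()        # 47

# Expected count in cell (i, j) = row_total[i] * col_total[j] / grand_total.
expected = np.outer(row_totals, col_totals) / grand_total
print(expected)
```

The smallest expected count is about 5.1, which is worth checking because the chi-squared approximation is usually considered reliable only when expected counts are not too small.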

In [32]:
p_value = stats.chi2_contingency(obs)[1]
p_value

0.017240599419156625
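For a 2x2 table (one degree of freedom), `stats.chi2_contingency` applies Yates' continuity correction by default, so the p-value above is the corrected one. The following sketch recomputes the statistic by hand with and without the correction, using the observed counts from the crosstab; `correction=False` is the SciPy argument that switches the correction off.

```python
import numpy as np
from scipy import stats

# Observed counts copied from the crosstab above.
observed = np.array([[17, 14],
                     [15, 1]])
expected = stats.contingency.expected_freq(observed)

# Yates' correction: shrink each |observed - expected| by 0.5 before squaring.
chi2_yates = ((np.abs(observed - expected) - 0.5) ** 2 / expected).sum()
p_yates = stats.chi2.sf(chi2_yates, df=1)

# Plain Pearson chi-squared statistic, no correction.
chi2_plain = ((observed - expected) ** 2 / expected).sum()
p_plain = stats.chi2.sf(chi2_plain, df=1)

# SciPy's results for comparison; index 1 of the result is the p-value.
p_default = stats.chi2_contingency(observed)[1]
p_uncorrected = stats.chi2_contingency(observed, correction=False)[1]

print("with Yates:   ", p_yates, p_default)
print("without Yates:", p_plain, p_uncorrected)
```

The uncorrected p-value is smaller than the corrected one; both are below 0.05 for this table.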

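The chi-squared test is an approximation that requires sufficiently large expected counts (a common rule of thumb is at least 5 in every cell; here the smallest is about 5.1). For a 2x2 table, we can sidestep this requirement with [Fisher's exact test](https://en.wikipedia.org/wiki/Fisher%27s_exact_test), which computes the p-value exactly.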


In [33]:
p_value = stats.fisher_exact(obs)[1]
p_value

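0.00804930920522039

Both tests give p < $\alpha$ = 0.05, so we reject H0: high youth unemployment is not independent of Southern. In this sample it is less common in Southern states (1 of 16) than in the other states (14 of 31).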
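`stats.fisher_exact` returns two values, of which the code above keeps only the p-value. The first is the sample odds ratio, which gives the direction and size of the association. A short sketch using the observed counts from the crosstab:

```python
from scipy import stats

# Observed counts copied from the crosstab above
# (rows: Southern = 0, 1; columns: HighYouthUnemploy = 0, 1).
observed = [[17, 14],
            [15, 1]]

# The default alternative is two-sided, matching the hypotheses stated earlier.
odds_ratio, p_value = stats.fisher_exact(observed)

# For a 2x2 table [[a, b], [c, d]] the sample odds ratio is (a * d) / (b * c).
print("odds ratio:", odds_ratio)
print("p-value:   ", p_value)
```

An odds ratio well below 1 here means that the odds of high youth unemployment are lower in Southern states than in the other states in this sample.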