# Lab 4.05 - Bivariate Analysis of Qualitative Data


In [1]:
# Importing the necessary packages
import numpy as np                                  # "Scientific computing"
import scipy.stats as stats                         # Statistical tests

import pandas as pd                                 # Data Frame
from pandas.api.types import CategoricalDtype

import matplotlib.pyplot as plt                     # Basic visualisation
from statsmodels.graphics.mosaicplot import mosaic  # Mosaic diagram
import seaborn as sns                               # Advanced data visualisation

## Exercise 5 - Survey of Australian Students

Load the data file `data/survey.csv`. It contains the results of a survey of students at an Australian university.

We want to investigate the relationship between some discrete (nominal or ordinal) variables in this dataset. For each pair of variables listed below, follow these steps:

* First, think about what exactly you expect for the given combination of variables.
* Make a frequency table for the two variables. The (presumably) independent variable comes first.
* Plot a graph visualizing the relationship between the two variables.
* Looking at the chart, do you expect a rather high or rather low value for the $\chi^2$ statistic? Why?
* Run the $\chi^2$ test to determine whether there is a relationship between the two variables. Calculate the $\chi^2$ statistic, the critical value $g$ and the $p$ value, each for significance level $\alpha = 0.05$.
* Should we reject the null hypothesis or not? What exactly does that mean for the relationship between the two variables? In other words, formulate an answer to the research question.
* Calculate Cramér's V. Do you come to a similar conclusion as with the $\chi^2$ test?
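
The last step can be sketched as follows. This is a minimal illustration, assuming the usual definition $V = \sqrt{\chi^2 / (n \cdot (k - 1))}$ with $n$ the total count and $k$ the smaller of the number of rows and columns. The helper name `cramers_v` and the toy counts are hypothetical and not taken from `survey.csv`. The uncorrected $\chi^2$ statistic is used here, which is an assumption about the intended convention.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats


def cramers_v(observed: pd.DataFrame) -> float:
    """Cramér's V for a contingency table of observed counts (hypothetical helper)."""
    # Assumption: use the chi-squared statistic without Yates' continuity correction
    chi2 = stats.chi2_contingency(observed, correction=False)[0]
    n = observed.to_numpy().sum()   # total number of observations
    k = min(observed.shape)         # smaller of (number of rows, number of columns)
    return float(np.sqrt(chi2 / (n * (k - 1))))


# Hypothetical counts, only for illustration
toy = pd.DataFrame(
    [[20, 5, 5], [5, 20, 5], [5, 5, 20]],
    index=["a", "b", "c"],
    columns=["x", "y", "z"],
)
v = cramers_v(toy)
print(f"V = {v:.3f}")
```

In the exercise itself, the argument would be the result of `pd.crosstab` for the pair of variables under investigation.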


The variables to be investigated:

| Independent variable           | Dependent variable                         |
|:------------------------------ |:-------------------------------------------|
| `Exer` (practicing sports)     | `Smoke`                                    |
| `Sex` (gender)                 | `Smoke`                                    |
| `W.Hnd` (dominant hand)        | `Fold` (top hand when you cross your arms) |
| `Sex`                          | `W.Hnd`                                    |

Results of the main calculations (rounded to 3 decimal places):

- `Exer/Smoke`: χ² ≈ 5.489, g ≈ 12.592, p ≈ 0.483
- `W.Hnd/Fold`: χ² ≈ 1.581, g ≈ 5.991, p ≈ 0.454
- `Sex/Smoke`: χ² ≈ 3.554, g ≈ 7.815, p ≈ 0.314
- `Sex/W.Hnd`: χ² ≈ 0.236, g ≈ 3.841, p ≈ 0.627
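
The critical values in this list depend only on $\alpha$ and the degrees of freedom, $(r - 1)(c - 1)$ for a table with $r$ rows and $c$ columns. The sketch below recomputes them. The table shapes are an assumption based on the number of categories each variable appears to have in this dataset (3 for `Exer`, 4 for `Smoke`, 3 for `Fold`, 2 for `Sex` and `W.Hnd`).

```python
import scipy.stats as stats

alpha = 0.05

# Assumed (rows, columns) of each contingency table
shapes = {
    "Exer/Smoke": (3, 4),
    "W.Hnd/Fold": (2, 3),
    "Sex/Smoke": (2, 4),
    "Sex/W.Hnd": (2, 2),
}

critical = {}
for name, (r, c) in shapes.items():
    dof = (r - 1) * (c - 1)                           # degrees of freedom
    critical[name] = stats.chi2.ppf(1 - alpha, dof)   # critical value g
    print(f"{name}: dof = {dof}, g = {critical[name]:.3f}")
```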

Read the dataset.

In [24]:
df = pd.read_csv('../data/survey.csv', index_col=0)
df.Sex = df.Sex.astype('category')
df['W.Hnd'] = df['W.Hnd'].astype('category')
df['Fold'] = df['Fold'].astype('category')
df['Clap'] = df['Clap'].astype('category')
df['M.I'] = df['M.I'].astype('category')
df.head()

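|   | Sex | Wr.Hnd | NW.Hnd | W.Hnd | Fold | Pulse | Clap | Exer | Smoke | Height | M.I | Age |
|:--|:----|:-------|:-------|:------|:-----|:------|:-----|:-----|:------|:-------|:----|:----|
| 1 | Female | 18.5 | 18.0 | Right | R on L | 92.0 | Left | Some | Never | 173.0 | Metric | 18.25 |
| 2 | Male | 19.5 | 20.5 | Left | R on L | 104.0 | Left | | Regul | 177.8 | Imperial | 17.583 |
| 3 | Male | 18.0 | 13.3 | Right | L on R | 87.0 | Neither | | Occas | | | 16.917 |
| 4 | Male | 18.8 | 18.9 | Right | R on L | | Neither | | Never | 160.0 | Metric | 20.333 |
| 5 | Male | 20.0 | 20.0 | Right | Neither | 35.0 | Right | Some | Never | 165.0 | Metric | 23.667 |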


What are the different values of `Exer` and `Smoke`?  
Convert both to ordinal variables with an explicit category order.

In [26]:
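# The category label 'None' is parsed as a missing value by read_csv, so restore it first
df.Exer = df.Exer.fillna('None')
# Ordered categories, listed from lowest to highest level
df.Exer = df.Exer.astype(CategoricalDtype(['None', 'Some', 'Freq'], ordered=True))
df.Smoke = df.Smoke.astype(CategoricalDtype(['Heavy', 'Regul', 'Occas', 'Never'], ordered=True))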
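
To answer the question about the distinct values, and to see what the ordered type changes, a small sketch on a hypothetical mini-sample (not the survey data) may help. It lists the unique values first, then converts to an ordered categorical so that counts and comparisons follow the chosen order instead of alphabetical order.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical mini-sample, only for illustration
smoke = pd.Series(["Never", "Occas", "Never", "Heavy", "Regul", "Never"])

# Step 1: inspect which values occur
print(smoke.unique())

# Step 2: convert to an ordinal variable with an explicit order
smoke_type = CategoricalDtype(["Heavy", "Regul", "Occas", "Never"], ordered=True)
smoke = smoke.astype(smoke_type)

# Counts sorted by category order rather than by frequency
counts = smoke.value_counts().sort_index()
print(counts)

# With ordered=True, min and max follow the category order
lowest, highest = smoke.min(), smoke.max()
print(lowest, highest)
```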

In [37]:
alpha = 0.05

Exer/Smoke: χ² ≈ 5.489, g ≈ 12.592, p ≈ 0.483

In [43]:
observed = pd.crosstab(df.Exer, df.Smoke)
chi2, p, dof, _ = stats.chi2_contingency(observed)
g = stats.chi2.ppf(1 - alpha, dof)
print(f"χ² = {chi2:.3f}")
print(f"g = {g:.3f}")
print(f"p-value = {p:.3f}")

χ² = 5.489
g = 12.592
p-value = 0.483
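
The decision step for this pair can be sketched with the values printed above. Both criteria, comparing $\chi^2$ with the critical value $g$ and comparing $p$ with $\alpha$, should lead to the same conclusion. The variable names are chosen for illustration only.

```python
alpha = 0.05
chi2, g, p = 5.489, 12.592, 0.483   # values reported for Exer/Smoke

# Criterion 1: reject H0 if the test statistic exceeds the critical value
reject_by_critical_value = chi2 > g
# Criterion 2: reject H0 if the p-value is below the significance level
reject_by_p_value = p < alpha

if reject_by_critical_value:
    decision = "reject H0"
else:
    decision = "do not reject H0"
print(decision)
```

Under these numbers the null hypothesis is not rejected, so the sample gives no evidence of an association between `Exer` and `Smoke` at the 5% level.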


W.Hnd/Fold: χ² ≈ 1.581, g ≈ 5.991, p ≈ 0.454

In [39]:
observed = pd.crosstab(df['W.Hnd'], df.Fold)
chi2, p, dof, _ = stats.chi2_contingency(observed)
g = stats.chi2.ppf(1 - alpha, dof)
print(f"χ² = {chi2:.3f}")
print(f"g = {g:.3f}")
print(f"p-value = {p:.3f}")

χ² = 1.581
g = 5.991
p-value = 0.454


Sex/Smoke: χ² ≈ 3.554, g ≈ 7.815, p ≈ 0.314

In [40]:
observed = pd.crosstab(df.Sex, df.Smoke)
chi2, p, dof, _ = stats.chi2_contingency(observed)
g = stats.chi2.ppf(1 - alpha, dof)
print(f"χ² = {chi2:.3f}")
print(f"g = {g:.3f}")
print(f"p-value = {p:.3f}")

χ² = 3.554
g = 7.815
p-value = 0.314


Sex/W.Hnd: χ² ≈ 0.236, g ≈ 3.841, p ≈ 0.627

In [41]:
observed = pd.crosstab(df.Sex, df['W.Hnd'])
chi2, p, dof, _ = stats.chi2_contingency(observed)
g = stats.chi2.ppf(1 - alpha, dof)
print(f"χ² = {chi2:.3f}")
print(f"g = {g:.3f}")
print(f"p-value = {p:.3f}")

χ² = 0.236
g = 3.841
p-value = 0.627
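
One detail worth knowing for this last pair: with one degree of freedom (a 2×2 table), `stats.chi2_contingency` applies Yates' continuity correction by default, so the statistic reported here is presumably the corrected one. The sketch below compares both variants on hypothetical counts (not the survey data).

```python
import pandas as pd
import scipy.stats as stats

# Hypothetical 2x2 counts, only for illustration
toy = pd.DataFrame([[12, 8], [5, 15]], index=["A", "B"], columns=["yes", "no"])

# Default behaviour: Yates' continuity correction is applied when dof == 1
chi2_yates, p_yates, dof, _ = stats.chi2_contingency(toy)
# Same test without the correction
chi2_plain, p_plain, _, _ = stats.chi2_contingency(toy, correction=False)

print(f"dof = {dof}")
print(f"with correction:    χ² = {chi2_yates:.3f}, p = {p_yates:.3f}")
print(f"without correction: χ² = {chi2_plain:.3f}, p = {p_plain:.3f}")
```

The corrected statistic is the smaller of the two, which makes the test slightly more conservative.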
