# Lab 4.05 - Bivariate Analysis of Qualitative Data


In [1]:
# Importing the necessary packages
import numpy as np                                  # "Scientific computing"
import scipy.stats as stats                         # Statistical tests

import pandas as pd                                 # Data Frame
from pandas.api.types import CategoricalDtype

import matplotlib.pyplot as plt                     # Basic visualisation
from statsmodels.graphics.mosaicplot import mosaic  # Mosaic diagram
import seaborn as sns                               # Advanced data visualisation

## Exercise 5 - Survey of Australian Students

Load the data file `data/survey.csv`. It contains the results of a survey of students at an Australian university.

We want to investigate the relationship between some discrete (nominal or ordinal) variables in this dataset. For each pair of variables listed below, follow these steps:

* First, think about what exactly you expect for the given combination of variables.
* Make a frequency table for the two variables. The (presumably) independent variable comes first.
* Plot a graph visualizing the relationship between the two variables.
* Looking at the chart, do you expect a rather high or rather low value for the $\chi^2$ statistic? Why?
* Run the $\chi^2$ test to determine whether there is a relationship between the two variables. Calculate the $\chi^2$ statistic, the critical value $g$ and the $p$ value, each for significance level $\alpha = 0.05$.
* Should we reject the null hypothesis or not? What exactly does that mean for the relationship between the two variables? In other words, formulate an answer to the research question.
* Calculate Cramér's V. Do you come to a similar conclusion as with the $\chi^2$ test?
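
The steps above can be sketched end to end on a small, made-up table. The data frame `toy` and its columns `group` and `answer` are hypothetical and are not taken from `survey.csv`; Cramér's V is computed here as $\sqrt{\chi^2 / (n(k-1))}$ with $k$ the smaller of the number of rows and columns.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

# Hypothetical toy data (NOT from survey.csv): 100 respondents in two groups
toy = pd.DataFrame({
    "group":  ["A"] * 40 + ["B"] * 60,
    "answer": ["yes"] * 20 + ["no"] * 10 + ["maybe"] * 10
            + ["yes"] * 20 + ["no"] * 30 + ["maybe"] * 10,
})
alpha = 0.05

# Frequency table, independent variable in the rows
observed = pd.crosstab(toy["group"], toy["answer"])
# Row proportions: the numbers a stacked or clustered bar chart would show
row_share = pd.crosstab(toy["group"], toy["answer"], normalize="index")

# Chi-squared test of independence
chi2, p, dof, expected = stats.chi2_contingency(observed)
g = stats.chi2.isf(alpha, df=dof)  # critical value for the chosen alpha

# Cramér's V: effect size between 0 (no association) and 1
n = observed.to_numpy().sum()
k = min(observed.shape)
cramers_v = np.sqrt(chi2 / (n * (k - 1)))

print(observed)
print(row_share.round(2))
print("chi2 = %.4f, dof = %d, g = %.4f, p = %.4f" % (chi2, dof, g, p))
print("Reject H0:", chi2 > g)
print("Cramér's V = %.3f" % cramers_v)
```

Comparing `chi2` with `g` and comparing `p` with `alpha` always lead to the same decision; they are two views of the same test.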


The variables to be investigated:

| Independent variable           | Dependent variable                         |
|:------------------------------ |:-------------------------------------------|
| `Exer` (practicing sports)     | `Smoke`                                    |
| `Sex` (gender)                 | `Smoke`                                    |
| `W.Hnd` (dominant hand)        | `Fold` (top hand when you cross your arms) |
| `Sex`                          | `W.Hnd`                                    |

Results of the main calculations (rounded up to 3 decimal places):

- `Exer/Smoke`: χ² ≈ 5.489, g ≈ 12.592, p ≈ 0.483
- `W.Hnd/Fold`: χ² ≈ 1.581, g ≈ 5.992, p ≈ 0.454
- `Sex/Smoke`: χ² ≈ 3.554, g ≈ 7.815, p ≈ 0.314
- `Sex/W.Hnd`: χ² ≈ 0.236, g ≈ 3.842, p ≈ 0.627
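
The critical values in this list depend only on $\alpha$ and on the degrees of freedom, $(r-1)(c-1)$ for a table with $r$ rows and $c$ columns. The sketch below recomputes them; the table shapes are an assumption based on the category levels visible in the data (3 levels for `Exer`, 4 for `Smoke`, 2 for `Sex` and `W.Hnd`, 3 for `Fold`), with missing values excluded.

```python
import scipy.stats as stats

alpha = 0.05

# Assumed (rows, columns) of each contingency table, missing values excluded
shapes = {
    "Exer/Smoke": (3, 4),
    "W.Hnd/Fold": (2, 3),
    "Sex/Smoke":  (2, 4),
    "Sex/W.Hnd":  (2, 2),
}

critical = {}
for name, (r, c) in shapes.items():
    dof = (r - 1) * (c - 1)
    critical[name] = stats.chi2.isf(alpha, df=dof)
    print("%-11s dof = %d, g = %.4f" % (name, dof, critical[name]))
```

This gives 12.5916, 5.9915, 7.8147 and 3.8415, which match the listed values once 5.9915 and 3.8415 are rounded up to 5.992 and 3.842. A test that reports a different number of degrees of freedom than expected is therefore a sign that the contingency table has an extra row or column.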

Read the dataset.

In [12]:
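# keep_default_na=False keeps the Exer level 'None' as a string; recent pandas
# versions would otherwise parse it as NaN. Side effect: 'NA' entries also stay strings.
df = pd.read_csv('https://raw.githubusercontent.com/HoGentTIN/dsai-labs/main/data/survey.csv', keep_default_na=False)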
df.head(10)

Unnamed: 0.1,Unnamed: 0,Sex,Wr.Hnd,NW.Hnd,W.Hnd,Fold,Pulse,Clap,Exer,Smoke,Height,M.I,Age
0,1,Female,18.5,18.0,Right,R on L,92.0,Left,Some,Never,173.0,Metric,18.25
1,2,Male,19.5,20.5,Left,R on L,104.0,Left,,Regul,177.8,Imperial,17.583
2,3,Male,18.0,13.3,Right,L on R,87.0,Neither,,Occas,,,16.917
3,4,Male,18.8,18.9,Right,R on L,,Neither,,Never,160.0,Metric,20.333
4,5,Male,20.0,20.0,Right,Neither,35.0,Right,Some,Never,165.0,Metric,23.667
5,6,Female,18.0,17.7,Right,L on R,64.0,Right,Some,Never,172.72,Imperial,21.0
6,7,Male,17.7,17.7,Right,L on R,83.0,Right,Freq,Never,182.88,Imperial,18.833
7,8,Female,17.0,17.3,Right,R on L,74.0,Right,Freq,Never,157.0,Metric,35.833
8,9,Male,20.0,19.5,Right,R on L,72.0,Right,Some,Never,175.0,Metric,19.0
9,10,Male,18.5,18.5,Right,R on L,90.0,Right,Some,Never,167.0,Metric,22.333


What are the different values for Exer and Smoke?  
Change both variables to ordinal variables with a specific order.

In [13]:
# Inspect the raw values before converting
print(df.Exer.value_counts())
print(df.Smoke.value_counts())
# Ordered categories; values not listed here (such as 'NA') become NaN after the cast
exer_type = CategoricalDtype(categories=['Freq', 'Some', 'None'], ordered=True)
smoke_type = CategoricalDtype(categories=['Never', 'Occas', 'Regul', 'Heavy'], ordered=True)
df.Exer = df.Exer.astype(exer_type)
df.Smoke = df.Smoke.astype(smoke_type)

Exer
Freq    115
Some     98
None     24
Name: count, dtype: int64
Smoke
Never    189
Occas     19
Regul     17
Heavy     11
NA         1
Name: count, dtype: int64
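
The effect of the cast on a label such as `NA` can be checked on a small example. The series below is a hypothetical mini-sample, not the survey data; it uses the same `smoke_type` definition as the cell above.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical mini-sample containing one 'NA' label
smoke = pd.Series(["Never", "Occas", "NA", "Heavy", "Never", "Regul"])
smoke_type = CategoricalDtype(categories=["Never", "Occas", "Regul", "Heavy"],
                              ordered=True)
smoke_cat = smoke.astype(smoke_type)

# 'NA' is not a listed category, so it turns into a missing value
n_missing = int(smoke_cat.isna().sum())

# Counts sorted by category order instead of by frequency
counts = smoke_cat.value_counts().sort_index()

# An ordered dtype allows comparisons; missing values compare as False
below_regul = smoke_cat < "Regul"

print("Missing after cast:", n_missing)
print(counts)
print("Smokes less than 'Regul':", int(below_regul.sum()))
```

Because `pd.crosstab` leaves out rows with missing values by default, a label removed this way no longer shows up as a row or column of the contingency table.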


Exer/Smoke: χ² ≈ 5.489, g ≈ 12.592, p ≈ 0.483

In [18]:
# Chi-squared test for independence based on a contingency table
observed = pd.crosstab(df.Exer, df.Smoke)
chi2, p, dof, expected = stats.chi2_contingency(observed)

print("Chi-squared       : %.4f" % chi2)
print("Degrees of freedom: %d" % dof)
print("P-value           : %.4f" % p)
g = stats.chi2.isf(0.05, df = dof)
print("Critical value     : %.4f" % g)

Chi-squared       : 5.4885
Degrees of freedom: 6
P-value           : 0.4828
Critical value     : 12.5916


In [None]:
# p > 0.05, so H0 is not rejected
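
The decision and Cramér's V for `Exer`/`Smoke` can be worked out from the printed output above. This is a sketch under one assumption: the table is taken to contain n = 236 observations, that is, the 237 `Exer` answers minus the one respondent whose `Smoke` value is missing.

```python
import math

# Values copied from the printed output of the Exer/Smoke test
chi2, dof, p, g = 5.4885, 6, 0.4828, 12.5916
alpha = 0.05

# Both criteria lead to the same decision
reject_by_critical_value = chi2 > g
reject_by_p_value = p < alpha

# Assumption: 236 complete observations in a 3 x 4 table
n = 236
k = min(3, 4)
cramers_v = math.sqrt(chi2 / (n * (k - 1)))

print("Reject H0 (critical value):", reject_by_critical_value)
print("Reject H0 (p value)       :", reject_by_p_value)
print("Cramér's V                : %.3f" % cramers_v)
```

Under that assumption V is roughly 0.11, a weak association, which is in line with not rejecting the null hypothesis.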

W.Hnd/Fold: χ² ≈ 1.581, g ≈ 5.992, p ≈ 0.454

In [19]:
# Chi-squared test for independence based on a contingency table
observed = pd.crosstab(df['W.Hnd'], df['Fold'])
chi2, p, dof, expected = stats.chi2_contingency(observed)

print("Chi-squared       : %.4f" % chi2)
print("Degrees of freedom: %d" % dof)
print("P-value           : %.4f" % p)
g = stats.chi2.isf(0.05, df = dof)
print("Critical value     : %.4f" % g)

Chi-squared       : 2.9786
Degrees of freedom: 4
P-value           : 0.5614
Critical value     : 9.4877
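
The output above reports 4 degrees of freedom, while the reference critical value g ≈ 5.992 corresponds to 2. A likely explanation, stated here as an assumption, is that the string `NA` in `W.Hnd` is counted as a third category because the file was read with `keep_default_na=False`. The same pattern appears in the `Sex`/`Smoke` output further down (6 degrees of freedom rather than 3). The sketch below reproduces the effect on hypothetical data and shows one way to exclude the label.

```python
import pandas as pd
import scipy.stats as stats

# Hypothetical data (NOT from survey.csv) with one 'NA' label in W.Hnd
toy = pd.DataFrame({
    "W.Hnd": ["Right"] * 6 + ["Left"] * 5 + ["NA"],
    "Fold":  ["R on L", "L on R", "Neither"] * 4,
})

# 'NA' forms its own row: 3 x 3 table, so dof = 4
raw = pd.crosstab(toy["W.Hnd"], toy["Fold"])
_, _, dof_raw, _ = stats.chi2_contingency(raw)

# Dropping the 'NA' rows first gives a 2 x 3 table, so dof = 2
clean = toy[toy["W.Hnd"] != "NA"]
filtered = pd.crosstab(clean["W.Hnd"], clean["Fold"])
_, _, dof_clean, _ = stats.chi2_contingency(filtered)

print("dof with 'NA' as a category:", dof_raw)
print("dof after excluding 'NA'   :", dof_clean)
```

An alternative with the same effect would be to read the file with `keep_default_na=False, na_values=['NA']`, so that only `NA` is treated as missing and the `Exer` level `None` is still kept as a string.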


Sex/Smoke: χ² ≈ 3.554, g ≈ 7.815, p ≈ 0.314

In [20]:
# Chi-squared test for independence based on a contingency table
observed = pd.crosstab(df.Sex, df.Smoke)
chi2, p, dof, expected = stats.chi2_contingency(observed)

print("Chi-squared       : %.4f" % chi2)
print("Degrees of freedom: %d" % dof)
print("P-value           : %.4f" % p)
g = stats.chi2.isf(0.05, df = dof)
print("Critical value     : %.4f" % g)


Chi-squared       : 3.8161
Degrees of freedom: 6
P-value           : 0.7015
Critical value     : 12.5916


Sex/W.Hnd: χ² ≈ 0.236, g ≈ 3.842, p ≈ 0.627
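
With two levels for each variable, the `Sex`/`W.Hnd` table is 2 × 2 and has 1 degree of freedom. For that case `stats.chi2_contingency` applies Yates' continuity correction by default (`correction=True`), which lowers the statistic. The sketch below shows the difference on hypothetical counts, not on the survey data; whether the reference value 0.236 was computed with or without the correction is not stated here.

```python
import numpy as np
import scipy.stats as stats

# Hypothetical 2 x 2 table of counts (NOT from survey.csv)
observed = np.array([[10, 20],
                     [20, 10]])

# Default behaviour: Yates' continuity correction is applied when dof == 1
chi2_corr, p_corr, dof, _ = stats.chi2_contingency(observed)

# Same test without the correction
chi2_raw, p_raw, _, _ = stats.chi2_contingency(observed, correction=False)

g = stats.chi2.isf(0.05, df=dof)

print("Degrees of freedom      : %d" % dof)
print("Chi-squared (corrected) : %.4f" % chi2_corr)
print("Chi-squared (uncorrected): %.4f" % chi2_raw)
print("Critical value          : %.4f" % g)
```

For larger tables the `correction` argument has no effect, so the other three variable pairs are not affected by this choice.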