# Lab 4.05 - Bivariate Analysis of Qualitative Data


In [1]:
# Importing the necessary packages
import numpy as np                                  # "Scientific computing"
import scipy.stats as stats                         # Statistical tests

import pandas as pd                                 # Data Frame
from pandas.api.types import CategoricalDtype

import matplotlib.pyplot as plt                     # Basic visualisation
from statsmodels.graphics.mosaicplot import mosaic  # Mosaic diagram
import seaborn as sns                               # Advanced data visualisation



## Exercise 5 - Survey of Australian Students

Load the data file data/survey.csv. It contains the result of a survey of students from an Australian university.

We want to investigate the relationship between some discrete (nominal or ordinal) variables in this dataset. For each pair of variables listed below, follow these steps:

* First, think about what exactly you expect for the given combination of variables.
* Make a frequency table for the two variables. The (presumably) independent variable comes first.
* Plot a graph visualizing the relationship between the two variables.
* Looking at the chart, do you expect a rather high or rather low value for the $\chi^2$ statistic? Why?
* Run the $\chi^2$ test to determine whether there is a relationship between the two variables. Calculate the $\chi^2$ statistic, the critical limit $g$ and the $p$ value, each for significance level $\alpha = 0.05$.
* Should we reject the null hypothesis, or is there insufficient evidence to do so? What exactly does that mean for the relationship between the two variables? In other words, formulate an answer to the research question.
* Calculate Cramér's V. Do you come to a similar conclusion as with the $\chi^2$ test?


The variables to be investigated:

| Independent variable           | Dependent variable                         |
|:------------------------------ |:-------------------------------------------|
| `Exer` (practicing sports)     | `Smoke`                                    |
| `Sex` (gender)                 | `Smoke`                                    |
| `W.Hnd` (dominant hand)        | `Fold` (top hand when you cross your arms) |
| `Sex`                          | `W.Hnd`                                    |

Results of the main calculations (rounded to 3 decimal places):

- `Exer/Smoke`: χ² ≈ 5.489, g ≈ 12.592, p ≈ 0.483
- `W.Hnd/Fold`: χ² ≈ 1.581, g ≈ 5.991, p ≈ 0.454
- `Sex/Smoke`: χ² ≈ 3.554, g ≈ 7.815, p ≈ 0.314
- `Sex/W.Hnd`: χ² ≈ 0.236, g ≈ 3.841, p ≈ 0.627
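
The critical limits $g$ depend only on $\alpha$ and the degrees of freedom, $(r-1)(c-1)$ for a table with $r$ rows and $c$ columns. The sketch below recomputes them. The table shapes are an assumption inferred from the category counts in this dataset (3 levels for `Exer`, 4 for `Smoke`, 3 for `Fold`, 2 for `Sex` and `W.Hnd`, missing values excluded).

```python
import scipy.stats as stats

alpha = 0.05

# Assumed (rows, columns) of each contingency table, missing values excluded
shapes = {
    "Exer/Smoke": (3, 4),
    "W.Hnd/Fold": (2, 3),
    "Sex/Smoke": (2, 4),
    "Sex/W.Hnd": (2, 2),
}

critical = {}
for name, (r, c) in shapes.items():
    dof = (r - 1) * (c - 1)                         # degrees of freedom
    critical[name] = stats.chi2.isf(alpha, df=dof)  # right-tail critical value
    print(f"{name}: dof = {dof}, g = {critical[name]:.3f}")
```

A test statistic below $g$ is equivalent to $p > \alpha$, so both criteria always lead to the same decision.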

Read the dataset.

In [4]:
df = pd.read_csv('https://raw.githubusercontent.com/JelleLeus1996/data_science/main/data/survey.csv', keep_default_na=False)
df.head()

Unnamed: 0.1,Unnamed: 0,Sex,Wr.Hnd,NW.Hnd,W.Hnd,Fold,Pulse,Clap,Exer,Smoke,Height,M.I,Age
0,1,Female,18.5,18.0,Right,R on L,92.0,Left,Some,Never,173.0,Metric,18.25
1,2,Male,19.5,20.5,Left,R on L,104.0,Left,,Regul,177.8,Imperial,17.583
2,3,Male,18.0,13.3,Right,L on R,87.0,Neither,,Occas,,,16.917
3,4,Male,18.8,18.9,Right,R on L,,Neither,,Never,160.0,Metric,20.333
4,5,Male,20.0,20.0,Right,Neither,35.0,Right,Some,Never,165.0,Metric,23.667


What are the different values for Exer and Smoke?  
Change both variables to ordinal variables with a specific order.

In [7]:
# Is there a relationship between exercising & smoking?
# 2 qualitative variables => chi-squared test
# Both need to become categorical variables. What are their values?
df.Exer.value_counts()



Exer
Freq    115
Some     98
None     24
Name: count, dtype: int64

In [23]:
df.Smoke.value_counts()

Smoke
Never    189
Occas     19
Regul     17
Heavy     11
NA         1
Name: count, dtype: int64

In [8]:
exer_type = CategoricalDtype(categories=['None','Some','Freq'], ordered=True)
df.Exer = df.Exer.astype(exer_type)

In [9]:
smoke_type = CategoricalDtype(categories=['Never','Occas','Regul','Heavy'], ordered=True)
df.Smoke = df.Smoke.astype(smoke_type)
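
The frequency table and the plot asked for in the exercise could be produced along the following lines. This is only a sketch on a small made-up data frame (the `toy` values are hypothetical, not survey results), so that it runs on its own; with the real data, `toy` would be replaced by `df`.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data, NOT the survey results
toy = pd.DataFrame({
    "Exer":  ["None", "Some", "Freq", "Freq", "Some", "Freq", "None", "Some"],
    "Smoke": ["Never", "Never", "Never", "Occas", "Regul", "Never", "Heavy", "Never"],
})

# Frequency table: independent variable (Exer) in the rows
counts = pd.crosstab(toy.Exer, toy.Smoke)

# Row proportions: distribution of Smoke within each Exer level
props = pd.crosstab(toy.Exer, toy.Smoke, normalize="index")

# Stacked bar chart: similar-looking bars suggest a low chi-squared value
ax = props.plot(kind="bar", stacked=True)
ax.set_ylabel("Proportion within Exer level")
```

Normalising per row makes groups of different sizes comparable, which helps when judging whether the statistic is likely to be high or low.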

In [11]:
# Chi-squared test for independence based on a contingency table
# Short version: chi2_contingency returns the statistic, p-value,
# degrees of freedom and expected frequencies in a single call
observed = pd.crosstab(df.Exer, df.Smoke)  # independent variable first
chi2, p, dof, expected = stats.chi2_contingency(observed)

print("Chi-squared       : %.4f" % chi2)
print("Degrees of freedom: %d" % dof)
print("P-value           : %.4f" % p)

Chi-squared       : 5.4885
Degrees of freedom: 6
P-value           : 0.4828


In [None]:
# Conclusion
# p-value: 0.48 > 0.05 => there is insufficient reason to reject H0
# The sample shows no evidence of an association between Smoke and Exer
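
Cramér's V, the last step of the exercise, is not computed in the cells of this notebook. A minimal sketch follows, assuming the usual definition $V = \sqrt{\chi^2 / (n\,(k-1))}$ with $k$ the smaller of the number of rows and columns. The helper name `cramers_v` is hypothetical, and $n = 236$ is derived from the value counts shown earlier (237 respondents minus the single `NA` in `Smoke`).

```python
import numpy as np
import scipy.stats as stats

def cramers_v(observed):
    """Cramér's V for a contingency table (DataFrame or 2-D array)."""
    table = np.asarray(observed)
    chi2, _, _, _ = stats.chi2_contingency(table)
    n = table.sum()        # number of observations
    k = min(table.shape)   # smaller of the two dimensions
    return np.sqrt(chi2 / (n * (k - 1)))

# Plugging in the numbers reported above for Exer/Smoke:
# chi2 = 5.4885, n = 236 (assumed, see lead-in), k = min(3, 4) = 3
v_exer_smoke = np.sqrt(5.4885 / (236 * (3 - 1)))
print("Cramér's V: %.3f" % v_exer_smoke)
```

With the real data, `cramers_v(observed)` can be called on the crosstab directly. A value of roughly 0.11 points to a weak association at most, which agrees with the outcome of the $\chi^2$ test.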


In [27]:
# W.Hnd is the independent variable, Fold the dependent one
w_hand_type = CategoricalDtype(categories=['Right', 'Left'], ordered=False)
df['W.Hnd'] = df['W.Hnd'].astype(w_hand_type)
observed = pd.crosstab(df['W.Hnd'], df.Fold)
chi2, p, dof, expected = stats.chi2_contingency(observed)
alpha = 0.05
g = stats.chi2.isf(alpha, df=dof)  # critical value
print(g)
print("Chi squared is : %.4f" % chi2)
print("p value is : %.4f" % p)
print("Degrees of freedom : %d" % dof)


5.991464547107983
Chi squared is : 1.5814
p value is : 0.4535
Degrees of freedom : 2


In [29]:
# Sex is the independent variable, Smoke the dependent one
sex_type = CategoricalDtype(categories=['Female', 'Male'])
df.Sex = df.Sex.astype(sex_type)
observed = pd.crosstab(df.Sex, df.Smoke)
chi2, p, dof, expected = stats.chi2_contingency(observed)
alpha = 0.05
g = stats.chi2.isf(alpha, df=dof)  # critical value
print(g)
print("Chi squared is : %.4f" % chi2)
print("p value is : %.4f" % p)
print("Degrees of freedom : %d" % dof)

7.814727903251178
Chi squared is : 3.5536
p value is : 0.3139
Degrees of freedom : 3



In [30]:
# Sex is the independent variable, W.Hnd the dependent one
observed = pd.crosstab(df.Sex, df['W.Hnd'])
chi2, p, dof, expected = stats.chi2_contingency(observed)
alpha = 0.05
g = stats.chi2.isf(alpha, df=dof)  # critical value
print(g)
print("Chi squared is : %.4f" % chi2)
print("p value is : %.4f" % p)
print("Degrees of freedom : %d" % dof)

3.8414588206941285
Chi squared is : 0.2356
p value is : 0.6274
Degrees of freedom : 1
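
One detail worth knowing for 2×2 tables such as `Sex`/`W.Hnd`: when there is a single degree of freedom, `stats.chi2_contingency` applies Yates' continuity correction by default (`correction=True`), so the statistic above was presumably computed with that correction. The sketch below shows the effect on a made-up table (the counts in `toy` are hypothetical, not the survey counts).

```python
import numpy as np
import scipy.stats as stats

# Hypothetical 2x2 table, NOT the survey counts
toy = np.array([[12, 8],
                [7, 13]])

# Default behaviour: Yates' correction is applied because dof == 1
chi2_yates, p_yates, dof, _ = stats.chi2_contingency(toy)

# Same table without the continuity correction
chi2_plain, p_plain, _, _ = stats.chi2_contingency(toy, correction=False)

print("dof                : %d" % dof)
print("With correction    : chi2 = %.4f, p = %.4f" % (chi2_yates, p_yates))
print("Without correction : chi2 = %.4f, p = %.4f" % (chi2_plain, p_plain))
```

The correction shrinks the statistic and raises the p-value, which makes the test more conservative. Results computed by hand without the correction will therefore differ slightly from the scipy output.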
