Download the data file adult.data from the UCI Machine Learning Repository. This is a selection of the Census data from 1994, and it has 48842 instances described by 14 categorical, real, and integer attributes.

Compute the contingency matrix for variables education and race, and compute the chi-square statistic using your own function, i.e., write a function that takes as input two categorical column-vectors, and returns the chi-square value and its p-value. At the 99% confidence level, are education and race dependent?

# Dataset Information

In [1]:
!type data\adult.names

| This data was extracted from the census bureau database found at
| http://www.census.gov/ftp/pub/DES/www/welcome.html
| Donor: Ronny Kohavi and Barry Becker,
|        Data Mining and Visualization
|        Silicon Graphics.
|        e-mail: ronnyk@sgi.com for questions.
| Split into train-test using MLC++ GenCVFiles (2/3, 1/3 random).
| 48842 instances, mix of continuous and discrete    (train=32561, test=16281)
| 45222 if instances with unknown values are removed (train=30162, test=15060)
| Duplicate or conflicting instances : 6
| Class probabilities for adult.all file
| Probability for the label '>50K'  : 23.93% / 24.78% (without unknowns)
| Probability for the label '<=50K' : 76.07% / 75.22% (without unknowns)
|
| Extraction was done by Barry Becker from the 1994 Census database.  A set of
|   reasonably clean records was extracted using the following conditions:
|   ((AAGE>16) && (AGI>100) && (AFNLWGT>1)&& (HRSWK>0))
|
| Prediction task is to determine whether a person makes over

# Import Libraries and Dataset

In [2]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import itertools




In [3]:
df = pd.read_csv('data/adult.data', header=None).drop(columns=14)
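
One caveat worth checking: in the raw `adult.data` file, fields appear to be separated by a comma followed by a space, so string values may be read with a leading space (for example `' State-gov'`). The sketch below uses two illustrative rows held in memory (an assumption standing in for the real file) to show how the `skipinitialspace=True` option of `pd.read_csv` strips that space. Whether the local copy of the file needs this should be verified before relying on it.

```python
import io

import pandas as pd

# Two illustrative rows in the "comma + space" layout (not read from the real file)
sample = (
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, "
    "Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
)

# Default parsing keeps the space that follows each comma
raw = pd.read_csv(io.StringIO(sample), header=None)

# skipinitialspace=True drops whitespace right after the delimiter
clean = pd.read_csv(io.StringIO(sample), header=None, skipinitialspace=True)

print(repr(raw.loc[0, 1]), repr(clean.loc[0, 1]))
```

Stripping the space does not change the contingency counts, but it makes label comparisons such as `df['race'] == 'White'` behave as expected.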

In [4]:
df.columns = [
    'age',
    'workclass',
    'fnlwgt',
    'education',
    'education_num',
    'marital_status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'capital_gain',
    'capital_loss',
    'hours_per_week',
    'native_country'
]

In [5]:
df.nunique()

age                  73
workclass             9
fnlwgt            21648
education            16
education_num        16
marital_status        7
occupation           15
relationship          6
race                  5
sex                   2
capital_gain        119
capital_loss         92
hours_per_week       94
native_country       42
dtype: int64

In [6]:
cols = [
    'workclass', 
    'education', 
    'marital_status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'native_country'
]

for col in cols:
    df[col] = df[col].astype('category')

In [7]:
df.head()

| | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba |


# Contingency Analysis
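
For reference, the quantities computed in this section can be written as follows, using the same names as the code ($m_1$ and $m_2$ for the number of distinct values of each variable, $q$ for the degrees of freedom). The symbols $n_{ij}$, $n_{i\cdot}$, $n_{\cdot j}$, $n$ and $e_{ij}$ are notation introduced here for the observed counts, row totals, column totals, grand total and expected counts.

$$
e_{ij} = \frac{n_{i\cdot}\, n_{\cdot j}}{n},
\qquad
\chi^2 = \sum_{i=1}^{m_1} \sum_{j=1}^{m_2} \frac{(n_{ij} - e_{ij})^2}{e_{ij}},
\qquad
q = (m_1 - 1)(m_2 - 1)
$$

$$
p\text{-value} = P\left(X \ge \chi^2\right), \qquad X \sim \chi^2_q
$$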

In [8]:
def contingency_analysis(cat1='education', cat2='race'):
    """
    Compute the chi-square statistic and its p-value for two categorical
    columns of the global dataframe `df`.
    
    Parameters:
    ----------
    cat1: str (default='education')
        Name of the first categorical column in the dataframe
    cat2: str (default='race')
        Name of the second categorical column in the dataframe
    
    Returns:
    ----------
    Two scalars, the chi-square value and its p-value.
    """
    
    if not isinstance(df[cat1].dtype, pd.CategoricalDtype):
        print('The first column must be categorical')
        return
    if not isinstance(df[cat2].dtype, pd.CategoricalDtype):
        print('The second column must be categorical')
        return
    
    # Observed counts, with row and column totals in the 'Total' margins
    contingency_table = df.pivot_table(
        index=cat1, columns=cat2, aggfunc='count',
        values='age', margins=True, margins_name='Total'
    )
    
    # Number of distinct values of each variable (excluding the 'Total' margin)
    m1, m2 = contingency_table.shape
    m1 = m1 - 1
    m2 = m2 - 1
    
    # Degrees of freedom
    q = (m1 - 1) * (m2 - 1)
    
    X = stats.chi2(q)
    
    # Expected count per cell: column total * row total / grand total
    expected_table = contingency_table.apply(
        lambda x: x.iloc[-1] * contingency_table.iloc[:, -1] / contingency_table.iloc[-1, -1]
    )
    # Margin cells have observed == expected, so they add nothing to the sum
    chi_square_value = ((contingency_table - expected_table) ** 2 / expected_table).sum().sum()
    
    # Note: 1 - cdf rounds to 0.0 for very small p-values; X.sf(chi_square_value)
    # gives the same upper-tail probability without that loss of precision
    p_value = 1 - X.cdf(chi_square_value)
    
    return chi_square_value, p_value
    
    

In [9]:
contingency_analysis()

(730.6712962254584, 0.0)
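
The assignment asks for a function that takes two categorical column-vectors, whereas `contingency_analysis` takes column names and reads the global `df`. A minimal sketch of a vector-based variant is shown here. The name `chi_square_test` and the toy data are hypothetical, `pd.crosstab` is swapped in for `pivot_table`, and `stats.chi2.sf` is used for the tail probability instead of `1 - cdf`.

```python
import numpy as np
import pandas as pd
from scipy import stats


def chi_square_test(x, y):
    """Return (chi-square value, p-value, degrees of freedom) for two categorical vectors."""
    # Observed counts; only category combinations' rows/columns that occur are included
    observed = pd.crosstab(np.asarray(x), np.asarray(y)).to_numpy()

    # Expected counts under independence: row total * column total / grand total
    row_totals = observed.sum(axis=1)
    col_totals = observed.sum(axis=0)
    expected = np.outer(row_totals, col_totals) / observed.sum()

    chi_square_value = ((observed - expected) ** 2 / expected).sum()
    dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

    # Survival function = upper-tail probability P(X >= chi_square_value)
    p_value = stats.chi2.sf(chi_square_value, dof)
    return chi_square_value, p_value, dof


# Toy data (made up for illustration): 3 groups x 2 outcomes
x = ["a"] * 10 + ["b"] * 10 + ["c"] * 10
y = ["u"] * 7 + ["v"] * 3 + ["u"] * 4 + ["v"] * 6 + ["u"] * 2 + ["v"] * 8

chi2_value, p_value, dof = chi_square_test(x, y)
print(chi2_value, p_value, dof)
```

On the notebook's data this would be called as `chi_square_test(df['education'], df['race'])`.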

In [10]:
cols = [
    'workclass', 
    'education', 
    'marital_status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'native_country'
]

In [11]:
for col1, col2 in itertools.combinations(cols, 2):
    chi, p = contingency_analysis(col1, col2)
    print(f'{col1} {col2} {chi} {p}')

workclass education 2572.1048319192323 0.0
workclass marital_status 1413.5586972450483 0.0
workclass occupation 41676.64797359808 0.0
workclass relationship 1586.371842937078 0.0
workclass race 412.5365923831655 0.0
workclass sex 768.9139234794312 0.0
workclass native_country 546.641620533581 3.65152352799214e-13
education marital_status 1638.1373573227854 0.0
education occupation 15997.777225542155 0.0
education relationship 2449.2301947700535 0.0
education race 730.6712962254584 0.0
education sex 297.71500372503687 0.0
education native_country 8592.458433297217 0.0
marital_status occupation 3466.8872612903842 0.0
marital_status relationship 38765.198041121424 0.0
marital_status race 923.8086971524608 0.0
marital_status sex 6944.747255715985 0.0
marital_status native_country 1043.1135058773248 0.0
occupation relationship 5194.643196156871 0.0
occupation race 850.8664424960738 0.0
occupation sex 5863.7547861768935 0.0
occupation native_country 2437.2965238719535 0.0
relationship race 1

# Validate

In [12]:
contingency_table = df.pivot_table(
    index='education', columns='race', 
    aggfunc='count', values='age'
)

In [13]:
contingency_table

| race | Amer-Indian-Eskimo | Asian-Pac-Islander | Black | Other | White |
|---|---|---|---|---|---|
| **education** | | | | | |
| 10th | 16 | 13 | 133 | 9 | 762 |
| 11th | 14 | 21 | 153 | 10 | 977 |
| 12th | 5 | 9 | 70 | 14 | 335 |
| 1st-4th | 4 | 5 | 16 | 9 | 134 |
| 5th-6th | 2 | 18 | 21 | 13 | 279 |
| 7th-8th | 9 | 11 | 56 | 17 | 553 |
| 9th | 5 | 9 | 89 | 8 | 403 |
| Assoc-acdm | 8 | 29 | 107 | 8 | 915 |
| Assoc-voc | 19 | 38 | 112 | 6 | 1207 |
| Bachelors | 21 | 289 | 330 | 33 | 4682 |


In [14]:
stat, p, dof, expected = stats.chi2_contingency(contingency_table)

In [15]:
p

5.547319569858434e-116

In [16]:
stat

730.6712962254584
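
The statistic matches `contingency_analysis()`, but the p-values differ: the custom function returned `0.0`, while `chi2_contingency` reported about 5.5e-116. The likely cause is floating-point rounding in `1 - cdf`, since the cdf is so close to 1 that the subtraction loses the tail. The sketch below compares the two computations and applies the 99% confidence level from the task. The degrees of freedom assume the 16 education levels and 5 race categories shown by `df.nunique()`.

```python
from scipy import stats

chi_square_value = 730.6712962254584  # statistic reported above for education vs. race
dof = (16 - 1) * (5 - 1)              # assumes a 16 x 5 contingency table

# Subtracting a cdf that is numerically 1.0 loses the tiny tail probability
p_from_cdf = 1 - stats.chi2.cdf(chi_square_value, dof)

# The survival function computes the upper tail directly
p_from_sf = stats.chi2.sf(chi_square_value, dof)

alpha = 0.01  # 99% confidence level
reject_independence = p_from_sf < alpha

print(p_from_cdf, p_from_sf, reject_independence)
```

Either way the p-value is far below 0.01, so the hypothesis that education and race are independent is rejected at the 99% confidence level.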