# CHI-SQUARE TEST OF INDEPENDENCE

The table below shows exit-poll data with the joint responses to two categorical variables: the respondents' age group (18–29, 30–44, 45–64 and >65 years) and their political affiliation (“Conservative”, “Socialist” or “Other”). Is there any evidence of a relationship between age group and political affiliation at the 5% significance level?
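The question above can be written as a pair of hypotheses together with the usual test statistic. The notation here is an assumption introduced for illustration: $O_{ij}$ is the observed count in row $i$ and column $j$, $R_i$ and $C_j$ are the row and column totals, and $N$ is the grand total.

$$
H_0: \text{age group and political affiliation are independent} \qquad H_1: \text{they are not independent}
$$

$$
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad E_{ij} = \frac{R_i \, C_j}{N}, \qquad \text{df} = (r-1)(c-1)
$$

With $r = 4$ age groups and $c = 3$ affiliations, the statistic is compared against a chi-square distribution with $(4-1)(3-1) = 6$ degrees of freedom.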

In [12]:
import pandas as pd
import scipy.stats as stats

In [3]:
# Create sample data
data = [['18-29', 'Conservative'] for i in range(141)] + \
        [['18-29', 'Socialist'] for i in range(68)] + \
        [['18-29', 'Other'] for i in range(4)] + \
        [['30-44', 'Conservative'] for i in range(179)] + \
        [['30-44', 'Socialist'] for i in range(159)] + \
        [['30-44', 'Other'] for i in range(7)] + \
        [['45-65', 'Conservative'] for i in range(220)] + \
        [['45-65', 'Socialist'] for i in range(216)] + \
        [['45-65', 'Other'] for i in range(4)] + \
        [['65 & older', 'Conservative'] for i in range(86)] + \
        [['65 & older', 'Socialist'] for i in range(101)] + \
        [['65 & older', 'Other'] for i in range(4)]
df = pd.DataFrame(data, columns = ['Age Group', 'Political Affiliation']) 

In [5]:
# Create contingency table
data_crosstab = pd.crosstab(df['Age Group'],
                            df['Political Affiliation'],
                           margins=True, margins_name="Total")
data_crosstab

Age Group,Conservative,Other,Socialist,Total
18-29,141,4,68,213
30-44,179,7,159,345
45-65,220,4,216,440
65 & older,86,4,101,191
Total,626,19,544,1189


In [7]:
# significance level
alpha = 0.05

In [9]:
# Calculation of the chi-square statistic
chi_square = 0
rows = df['Age Group'].unique()
columns = df['Political Affiliation'].unique()
for i in columns:
    for j in rows:
        O = data_crosstab[i][j]
        E = data_crosstab[i]['Total'] * data_crosstab['Total'][j] / data_crosstab['Total']['Total']
        chi_square += (O-E)**2/E
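The loop above can be cross-checked without pandas. The following is a minimal sketch using only the standard library; the names `observed`, `expected`, `chi_square_check` and `small_cells` are introduced here for illustration and are not part of the notebook. It also flags cells whose expected count is below 5, a common rule of thumb for when the chi-square approximation may be less reliable.

```python
# Observed counts copied from the contingency table above
# (column order: Conservative, Other, Socialist).
observed = {
    "18-29": [141, 4, 68],
    "30-44": [179, 7, 159],
    "45-65": [220, 4, 216],
    "65 & older": [86, 4, 101],
}

row_totals = {age: sum(counts) for age, counts in observed.items()}
col_totals = [sum(col) for col in zip(*observed.values())]
grand_total = sum(row_totals.values())

# Expected count under independence: row total * column total / grand total
expected = {
    age: [row_totals[age] * ct / grand_total for ct in col_totals]
    for age in observed
}

# Chi-square statistic summed over all 12 cells
chi_square_check = sum(
    (o - e) ** 2 / e
    for age in observed
    for o, e in zip(observed[age], expected[age])
)

# (age group, column index) of cells with an expected count below 5
small_cells = [
    (age, j)
    for age in observed
    for j, e in enumerate(expected[age])
    if e < 5
]

print(chi_square_check)
print(small_cells)
```

In this table the two flagged cells are both in the “Other” column (expected counts of roughly 3.4 and 3.1), which is worth keeping in mind when interpreting the result.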

In [14]:
# The p-value approach
print("Approach 1: The p-value approach to hypothesis testing in the decision rule")
print("\n")
p_value = 1 - stats.chi2.cdf(chi_square, (len(rows)-1)*(len(columns)-1))
conclusion = "Failed to reject the null hypothesis."
if p_value <= alpha:
    conclusion = "Null Hypothesis is rejected."
        
print("chisquare-score is:", chi_square, " and p value is:", p_value)
print("\n")
print(conclusion)

Approach 1: The p-value approach to hypothesis testing in the decision rule


chisquare-score is: 24.367421717305202  and p value is: 0.0004469083391495099


Null Hypothesis is rejected.


In [16]:
# The critical value approach
print("\n")
print("Approach 2: The critical value approach to hypothesis testing in the decision rule")
critical_value = stats.chi2.ppf(1-alpha, (len(rows)-1)*(len(columns)-1))
conclusion = "Failed to reject the null hypothesis."
if chi_square > critical_value:
    conclusion = "Null Hypothesis is rejected."



Approach 2: The critical value approach to hypothesis testing in the decision rule
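The cell above computes `conclusion` but its output shows only the heading, so the comparison itself is not visible. As a hedged cross-check, the sketch below recomputes the p-value and the critical value with the standard library only. It relies on the closed-form upper-tail probability of the chi-square distribution, which holds only for an even number of degrees of freedom (6 here). The helper names `chi2_sf_even_df` and `chi2_critical_even_df` are hypothetical and not part of SciPy.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) of a chi-square variable; even df only."""
    if df % 2 != 0:
        raise ValueError("this closed form requires an even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    # Sum (x/2)^k / k! for k = 0 .. df/2 - 1
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

def chi2_critical_even_df(alpha, df, hi=1000.0):
    """Find x with upper-tail probability alpha by bisection."""
    lo = 0.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_sf_even_df(mid, df) > alpha:
            lo = mid  # tail still too heavy, move right
        else:
            hi = mid
    return (lo + hi) / 2.0

chi_square_stat = 24.367421717305202  # statistic printed by Approach 1 above
df_indep = (4 - 1) * (3 - 1)          # 4 age groups, 3 affiliations
alpha_level = 0.05

p_value_check = chi2_sf_even_df(chi_square_stat, df_indep)
critical_value_check = chi2_critical_even_df(alpha_level, df_indep)
reject_null = chi_square_stat > critical_value_check

print(p_value_check, critical_value_check, reject_null)
```

The critical value for 6 degrees of freedom at the 5% level is about 12.59, so the statistic of about 24.37 exceeds it and both approaches lead to rejecting the null hypothesis.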


# CHI-SQUARE GOODNESS-OF-FIT TEST

The table below displays the result of the 2013 German federal election, in which more than 44 million people voted: 41.5% of voters chose the Christian Democratic Union (CDU), 25.7% the Social Democratic Party (SPD), and the remaining 32.8% are grouped as Others.

Assume a researcher takes a random sample of 123 students at FU Berlin and asks them about their party affiliation. Of these, 57 vote for CDU, 26 for SPD and 40 for Others. These numbers are the observed frequencies.
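Stated formally, the test asks whether the students' preferences are consistent with the national vote shares. The symbols below are assumptions introduced for illustration: $p_i$ is the national share of party $i$, $O_i$ the observed count, $n = 123$ the sample size and $k = 3$ the number of categories.

$$
H_0: p_{\text{CDU}} = 0.415,\; p_{\text{SPD}} = 0.257,\; p_{\text{Others}} = 0.328 \qquad H_1: \text{at least one share differs}
$$

$$
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}, \qquad E_i = n \, p_i, \qquad \text{df} = k - 1 = 2
$$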

In [19]:
# Creation of data
data = [['CDU', 0.415, 57], ['SPD', 0.257, 26], ['Others', 0.328, 40]] 
df = pd.DataFrame(data, columns = ['Varname', 'prob_dist', 'observed_freq']) 
df['expected_freq'] = df['observed_freq'].sum() * df['prob_dist']
df

,Varname,prob_dist,observed_freq,expected_freq
0,CDU,0.415,57,51.045
1,SPD,0.257,26,31.611
2,Others,0.328,40,40.344


In [20]:
# significance level
alpha = 0.05


In [21]:
# Calculation of the chi-square statistic
chi_square = 0
for i in range(len(df)):
    O = df.loc[i, 'observed_freq']
    E = df.loc[i, 'expected_freq']
    chi_square += (O-E)**2/E

In [22]:
# The p-value approach
print("Approach 1: The p-value approach to hypothesis testing in the decision rule")
p_value = 1 - stats.chi2.cdf(chi_square, df['Varname'].nunique() - 1)
conclusion = "Failed to reject the null hypothesis."
if p_value <= alpha:
    conclusion = "Null Hypothesis is rejected."
        
print("chisquare-score is:", chi_square, " and p value is:", p_value)
print(conclusion)
    

Approach 1: The p-value approach to hypothesis testing in the decision rule
chisquare-score is: 1.693614940576721  and p value is: 0.42878164729702506
Failed to reject the null hypothesis.


In [23]:
# The critical value approach
print("\n--------------------------------------------------------------------------------------")
print("Approach 2: The critical value approach to hypothesis testing in the decision rule")
critical_value = stats.chi2.ppf(1-alpha, df['Varname'].nunique() - 1)
conclusion = "Failed to reject the null hypothesis."
if chi_square > critical_value:
    conclusion = "Null Hypothesis is rejected."
        
print("chisquare-score is:", chi_square, " and critical value is:", critical_value)
print(conclusion)


--------------------------------------------------------------------------------------
Approach 2: The critical value approach to hypothesis testing in the decision rule
chisquare-score is: 1.693614940576721  and critical value is: 5.991464547107979
Failed to reject the null hypothesis.
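With 2 degrees of freedom, both decision rules can be verified by hand, because the chi-square upper-tail probability reduces to $e^{-x/2}$ and the critical value to $-2\ln\alpha$. The sketch below is a standard-library cross-check of the numbers printed above; the variable names are introduced here for illustration only, and the shortcut does not apply to other degrees of freedom.

```python
import math

# Observed counts and national vote shares from the example above
observed_counts = {"CDU": 57, "SPD": 26, "Others": 40}
vote_shares = {"CDU": 0.415, "SPD": 0.257, "Others": 0.328}

n = sum(observed_counts.values())
expected_counts = {party: n * share for party, share in vote_shares.items()}

# Chi-square goodness-of-fit statistic
chi_square_gof = sum(
    (observed_counts[party] - expected_counts[party]) ** 2 / expected_counts[party]
    for party in observed_counts
)

alpha_gof = 0.05
# Closed forms that hold only for df = 2
p_value_gof = math.exp(-chi_square_gof / 2)
critical_value_gof = -2 * math.log(alpha_gof)

reject_gof = chi_square_gof > critical_value_gof
print(chi_square_gof, p_value_gof, critical_value_gof, reject_gof)
```

Both rules agree with the notebook output: the statistic is well below the critical value and the p-value is well above 0.05, so the sample gives no evidence that the students' preferences differ from the national shares.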
