## Task 2 - Chi Squared

The Chi-squared test for independence is a statistical hypothesis test like a t-test. It is used to analyse whether two categorical variables are independent. The Wikipedia article gives the table below as an example, stating that the Chi-squared value based on it is approximately 24.6. Use scipy.stats to verify this value and calculate the associated p-value. You should include a short note with references justifying your analysis in a markdown cell.


### Introduction

"The Chi-Squared test is a statistical hypothesis test that assumes (the null hypothesis) that the observed frequencies for a categorical variable match the expected frequencies for the categorical variable. The test calculates a statistic that has a chi-squared distribution, named for the Greek capital letter Chi (X) pronounced “ki” as in kite." [1] In the exampl used from Wikipedia, a random sample of 650 residents of a city was taken and their occupation was recorded as "white collar" "blue collar" or "no collar". The null hypothesis is that each person's neighbourhood of residence is independent of the person's occupation classification.


![Image](https://github.com/NiamhOL/Machine-Learning-and-Statistics-2020-Assignments/blob/main/ChiSq.JPG)


Under that hypothesis, we should "expect" the number of "white collar" workers in neighbourhood A to be:
![Image](https://github.com/NiamhOL/Machine-Learning-and-Statistics-2020-Assignments/commit/07b0f7268eda4bcd09c95f93c1d64565338c975e)

In that "cell" of the table we have

![Image](https://github.com/NiamhOL/Machine-Learning-and-Statistics-2020-Assignments/blob/main/chistep2.JPG)
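
As a quick check of the two steps above, the expected count and the contribution of the "white collar", neighbourhood A cell can be reproduced with plain arithmetic. This is a minimal sketch; the variable names are illustrative and the totals are read from the table above.

```python
# Totals and observed count read from the table above
row_total = 349     # all "white collar" residents
col_total = 150     # all residents sampled in neighbourhood A
grand_total = 650   # full sample size
observed = 90       # "white collar" residents in neighbourhood A

# Expected count under independence: column total * row total / grand total
expected = col_total * row_total / grand_total

# Contribution of this cell to the test statistic: (O - E)^2 / E
cell_term = (observed - expected) ** 2 / expected

print(f"expected = {expected:.2f}, cell term = {cell_term:.2f}")
```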

The sum of these quantities over all of the cells is the test statistic; in this case, ≈ 24.6. Under the null hypothesis, this sum has approximately a chi-squared distribution whose number of degrees of freedom is

![Image](https://github.com/NiamhOL/Machine-Learning-and-Statistics-2020-Assignments/blob/main/chistep3.JPG) 
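
For a table with r rows and c columns, the degrees of freedom for a test of independence are (r - 1)(c - 1). A minimal sketch for the 3 x 4 table used here, with illustrative variable names:

```python
n_rows = 3  # occupation categories: white, blue and no collar
n_cols = 4  # neighbourhoods A, B, C and D

# Degrees of freedom for a test of independence on an r x c table
dof = (n_rows - 1) * (n_cols - 1)
print(dof)  # 6
```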

If the test statistic is improbably large according to that chi-squared distribution, then one rejects the null hypothesis of independence.[2]


In [1]:
# Import the modules used for storing and analysing the data
import pandas as pd # for dataframe work
import scipy.stats as stats # for calculating the chi-squared value

# Create df with table data. Code adapted from
# https://www.geeksforgeeks.org/different-ways-to-create-pandas-dataframe/ and
# https://stackoverflow.com/a/60909202
data = {'A':[90, 30, 30, 150], 'B':[60, 50, 40, 150], 'C':[104, 51, 45, 200],
        'D':[95, 20, 35, 150], 'Total':[349, 151, 150, 650]}
df = pd.DataFrame(data,index=['White collar', 'Blue collar', 'No Collar', 'Total'])

# Display the Observed results dataframe table
print("Data Table")
df

Data Table


|  | A | B | C | D | Total |
|---|---|---|---|---|---|
| White collar | 90 | 60 | 104 | 95 | 349 |
| Blue collar | 30 | 50 | 51 | 20 | 151 |
| No Collar | 30 | 40 | 45 | 35 | 150 |
| Total | 150 | 150 | 200 | 150 | 650 |


In [2]:
# Create observed results df. Code adapted from [2.3]
df_obs = df.iloc[0:3, 0:4].copy()

# Display the Observed results dataframe table
print("Observed Results Table")
df_obs

Observed Results Table


|  | A | B | C | D |
|---|---|---|---|---|
| White collar | 90 | 60 | 104 | 95 |
| Blue collar | 30 | 50 | 51 | 20 |
| No Collar | 30 | 40 | 45 | 35 |


In [3]:
# Copy the data table to form the expected results table dataframe. Code adapted from
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.copy.html
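df_exp = df.copy().astype(float)  # floats, so expected counts are not written into integer columns

# Expected count = column total * row total / grand total
for i in range(3):      # 3 occupation rows
    for j in range(4):  # 4 neighbourhood columns
        df_exp.iloc[i, j] = df_exp.iloc[-1, j] * df_exp.iloc[i, -1] / df_exp.iloc[-1, -1]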
    
# Drop total column and total row. Code adapted from
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html
df_exp = df_exp.drop(['Total'], axis=1).drop(['Total'], axis=0)

# Display Expected results dataframe Table
print("Expected Results Table")
df_exp.round(2)

Expected Results Table


|  | A | B | C | D |
|---|---|---|---|---|
| White collar | 80.54 | 80.54 | 107.38 | 80.54 |
| Blue collar | 34.85 | 34.85 | 46.46 | 34.85 |
| No Collar | 34.62 | 34.62 | 46.15 | 34.62 |
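
The expected counts depend only on the row and column totals, so they can also be derived directly from the observed counts without the stored Total row and column. The sketch below is a dependency-free cross-check in plain Python, not the method used in the cell above; `observed`, `row_totals` and `col_totals` are illustrative names.

```python
# Observed counts: rows are white, blue and no collar; columns are A, B, C, D
observed = [
    [90, 60, 104, 95],
    [30, 50, 51, 20],
    [30, 40, 45, 35],
]

row_totals = [sum(row) for row in observed]        # totals per occupation
col_totals = [sum(col) for col in zip(*observed)]  # totals per neighbourhood
grand_total = sum(row_totals)

# Expected count for each cell: row total * column total / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

for row in expected:
    print([round(value, 2) for value in row])
```

The printed rows should agree with the expected results table above to two decimal places.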


In [4]:
# Calculate Partial Chi-squared values
df_chi = ((df_obs - df_exp)**2)/df_exp

# Display partial Chi-squared value dataframe Table
print("Partial Chi-squared value Results Table")  
df_chi.round(2)

Partial Chi-squared value Results Table


|  | A | B | C | D |
|---|---|---|---|---|
| White collar | 1.11 | 5.24 | 0.11 | 2.6 |
| Blue collar | 0.67 | 6.59 | 0.44 | 6.33 |
| No Collar | 0.62 | 0.84 | 0.03 | 0.0 |


In [5]:
# Calculate full Chi-squared value. Code adapted from
# https://www.geeksforgeeks.org/python-pandas-dataframe-sum and
# https://medium.com/@nhan.tran/the-chi-square-statistic-p3-programming-with-python-87eb079f36af
chi2_man = df_chi.sum().sum()

print(f"The manually calculated chi-square value is ~{chi2_man:.1f}")

The manually calculated chi-square value is ~24.6


In [6]:
# Calculate Chi-square value using scipy.stats library. Code adapted from
# https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html
chi2_ss, p_ss, dof_ss, exp_ss = stats.chi2_contingency(df_obs)

# Display the results of the Chi-square test
print(f"""The scipy.stats calculated chi-squared test value is ~{chi2_ss:.1f}
with a p-value of {p_ss:.6f} and degrees of freedom of {dof_ss}.""")

The scipy.stats calculated chi-squared test value is ~24.6
with a p-value of 0.000410 and degrees of freedom of 6.
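
The p-value is the upper-tail probability of the chi-squared distribution beyond the test statistic. When the degrees of freedom are even, as here, that tail probability has a closed form, which offers a way to sanity-check the scipy result using the standard library alone. The helper `chi2_sf_even_dof` is a hypothetical name for this sketch and is not part of scipy; the statistic is recomputed from the observed counts rather than taken from `chi2_ss`.

```python
import math

def chi2_sf_even_dof(x, dof):
    """Upper-tail probability P(X > x) for a chi-squared variable with even dof."""
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("this closed form only holds for positive, even dof")
    half = x / 2
    # exp(-x/2) * sum_{i=0}^{dof/2 - 1} (x/2)^i / i!
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(dof // 2))

# Recompute the statistic from the observed counts in the table above
observed = [[90, 60, 104, 95], [30, 50, 51, 20], [30, 40, 45, 35]]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Sum of (O - E)^2 / E over every cell, with E = row total * column total / grand total
statistic = sum(
    (obs - r * c / grand_total) ** 2 / (r * c / grand_total)
    for row, r in zip(observed, row_totals)
    for obs, c in zip(row, col_totals)
)

p_value = chi2_sf_even_dof(statistic, dof=6)
print(f"statistic = {statistic:.2f}, p-value = {p_value:.6f}")
```

If the sketch is right, the printed p-value should match the 0.000410 reported by scipy above.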


In [7]:
# Calculate the critical value at the 5% significance level (the 0.95 quantile). Code adapted from
# https://medium.com/@nhan.tran/the-chi-square-statistic-p3-programming-with-python-87eb079f36af
crit_ss = stats.chi2.ppf(q=0.95, df=dof_ss)
print(f"The scipy.stats calculated critical value is {crit_ss:.1f}")

The scipy.stats calculated critical value is 12.6


The Chi-squared value and p-value can be calculated using scipy.stats.chi2_contingency, and the critical Chi-squared value can be calculated using stats.chi2.ppf. The calculated Chi-squared value (24.6) is higher than the critical value (12.6) for a 5% significance level and 6 degrees of freedom; equivalently, the p-value (0.00041) is well below 0.05. As such, we can reject the null hypothesis that neighbourhood of residence and occupation classification are independent of each other.

### References

1. https://machinelearningmastery.com/chi-squared-test-for-machine-learning/
2. https://en.wikipedia.org/wiki/Chi-squared_test