# Pearson's Chi-Squared Test of Independence

This test compares the categorical values of the observational classification with the results of the K-Means cluster analysis, to assess the validity of the C ratio measure. The procedure follows [this guide](https://pareonline.net/getvn.asp?v=20&n=8).

## Assumptions and Requirements

MAJOR REQUIREMENT: The contingency table must contain no NA or NaN values. They break the test.
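A minimal sketch of checking for missing values before running the test. The small `crosstab` frame is hypothetical and only stands in for the real contingency table. Filling with zero assumes that a missing cell means the combination was never observed.

```python
import numpy as np
import pandas as pd

# Hypothetical 2x3 crosstab with one missing cell (illustration only).
crosstab = pd.DataFrame(
    [[7, np.nan, 4], [1, 2, 10]],
    index=["Type A", "Type B"],
    columns=["0", "1", "2"],
)

# Count the missing cells so the problem is visible before it is fixed.
n_missing = int(crosstab.isna().sum().sum())

# Assumption: a missing cell is a count of zero, so fill with 0.
clean = crosstab.fillna(0)

# Confirm that nothing is left that could break the test.
assert not clean.isna().any().any()
print(f"filled {n_missing} missing cell(s)")
```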

In [21]:
import pandas as pd
import numpy as np
from scipy import stats

In [22]:
df = pd.read_csv('./DataOutput/ClusterCrosstab.csv', header=0, index_col=0)

In [23]:
df = df.fillna(0)
df

Type 1,0.0,1.0,2.0,3.0,4.0,5.0,6.0
Barrier Estuary,7.0,0.0,0.0,0.0,4.0,2.0,0.0
Coastal Plain,1.0,0.0,0.0,0.0,0.0,0.0,0.0
Fjord,1.0,0.0,1.0,0.0,0.0,0.0,0.0
LSE,7.0,0.0,0.0,0.0,18.0,5.0,2.0
Macrotidal,3.0,1.0,5.0,2.0,1.0,1.0,0.0
Open,1.0,2.0,10.0,7.0,4.0,3.0,0.0
Ria,0.0,0.0,0.0,0.0,6.0,0.0,0.0
Tidal Inlet,1.0,0.0,1.0,0.0,3.0,0.0,0.0


In [24]:
dfc = stats.chi2_contingency(df)
dfc

(79.25954969594676,
 0.0004463640635045948,
 42,
 array([[ 2.75757576,  0.39393939,  2.23232323,  1.18181818,  4.72727273,
          1.44444444,  0.26262626],
        [ 0.21212121,  0.03030303,  0.17171717,  0.09090909,  0.36363636,
          0.11111111,  0.02020202],
        [ 0.42424242,  0.06060606,  0.34343434,  0.18181818,  0.72727273,
          0.22222222,  0.04040404],
        [ 6.78787879,  0.96969697,  5.49494949,  2.90909091, 11.63636364,
          3.55555556,  0.64646465],
        [ 2.75757576,  0.39393939,  2.23232323,  1.18181818,  4.72727273,
          1.44444444,  0.26262626],
        [ 5.72727273,  0.81818182,  4.63636364,  2.45454545,  9.81818182,
          3.        ,  0.54545455],
        [ 1.27272727,  0.18181818,  1.03030303,  0.54545455,  2.18181818,
          0.66666667,  0.12121212],
        [ 1.06060606,  0.15151515,  0.85858586,  0.45454545,  1.81818182,
          0.55555556,  0.1010101 ]]))
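The four returned values are, in order, the $\chi^2$ statistic, the p-value, the degrees of freedom, and the array of expected frequencies. The sketch below unpacks them by name and checks the degrees of freedom by hand. The counts are copied from the crosstab above, and the variable names are illustrative rather than part of the original notebook.

```python
import numpy as np
from scipy import stats

# Observed counts copied from the crosstab above
# (rows: 8 estuary types, columns: clusters 0-6).
observed = np.array([
    [7, 0, 0, 0, 4, 2, 0],    # Barrier Estuary
    [1, 0, 0, 0, 0, 0, 0],    # Coastal Plain
    [1, 0, 1, 0, 0, 0, 0],    # Fjord
    [7, 0, 0, 0, 18, 5, 2],   # LSE
    [3, 1, 5, 2, 1, 1, 0],    # Macrotidal
    [1, 2, 10, 7, 4, 3, 0],   # Open
    [0, 0, 0, 0, 6, 0, 0],    # Ria
    [1, 0, 1, 0, 3, 0, 0],    # Tidal Inlet
])

# Unpack the result by position.
chi2, p_value, dof, expected = stats.chi2_contingency(observed)

# Degrees of freedom are (rows - 1) * (columns - 1).
n_rows, n_cols = observed.shape
dof_by_hand = (n_rows - 1) * (n_cols - 1)

print(f"chi2={chi2:.3f}, p={p_value:.6f}, dof={dof}")
```

With 8 types and 7 clusters this gives $(8 - 1)(7 - 1) = 42$ degrees of freedom, matching the output above.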

# Post Hoc

The next step is to calculate the _*'Standardized Residuals'*_ (different from the _'raw residuals'_) for each cell of the contingency table. They are calculated by:

\begin{align}
\text{Std Residual} = \frac{O - E}{\sqrt{E}}
\end{align}

This shows us the difference between the expected value $E$ and the observed value $O$, scaled by $\sqrt{E}$. The greater the difference between $E$ and $O$, the greater that cell's contribution to the overall $\chi^2$ value.
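As a worked example, the sketch below computes the residual for a single cell, Barrier Estuary in cluster 0. The row, column, and grand totals are summed from the crosstab above, and the variable names are illustrative. Squaring the residual gives that cell's share of the $\chi^2$ statistic, since $\chi^2 = \sum (O - E)^2 / E$.

```python
import math

# Barrier Estuary in cluster 0, taken from the crosstab above.
observed = 7
row_total = 13      # all Barrier Estuary sites
col_total = 21      # all sites assigned to cluster 0
grand_total = 99    # all sites in the table

# Expected count under independence: row total * column total / N.
expected = row_total * col_total / grand_total

# Standardized residual for this cell.
std_residual = (observed - expected) / math.sqrt(expected)

# The squared residual is this cell's contribution to chi-squared.
contribution = std_residual ** 2

print(round(expected, 6), round(std_residual, 6), round(contribution, 3))
```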

## Extracting the Expected Values

The expected values are returned as an array by stats.chi2_contingency(). We will extract them into a DataFrame.

In [25]:
exfreq = pd.DataFrame(dfc[3])
exfreq

Unnamed: 0,0,1,2,3,4,5,6
0,2.757576,0.393939,2.232323,1.181818,4.727273,1.444444,0.262626
1,0.212121,0.030303,0.171717,0.090909,0.363636,0.111111,0.020202
2,0.424242,0.060606,0.343434,0.181818,0.727273,0.222222,0.040404
3,6.787879,0.969697,5.494949,2.909091,11.636364,3.555556,0.646465
4,2.757576,0.393939,2.232323,1.181818,4.727273,1.444444,0.262626
5,5.727273,0.818182,4.636364,2.454545,9.818182,3.0,0.545455
6,1.272727,0.181818,1.030303,0.545455,2.181818,0.666667,0.121212
7,1.060606,0.151515,0.858586,0.454545,1.818182,0.555556,0.10101


## Calculating Standardized residuals

Now we will calculate the standardized residuals using the formula above. To do this we will create a function ***stdcalc*** that converts the expected and observed DataFrames into NumPy arrays, so that the arithmetic is applied cell by cell even though the two frames carry different row and column labels.

In [26]:
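def stdcalc(ex, ob):
    """Return the standardized residuals (O - E) / sqrt(E) as a DataFrame."""

    def stdres(E, O):
        # Element-wise residual for each cell of the contingency table.
        z = (O - E) / np.sqrt(E)
        return z

    # Plain arrays make the arithmetic positional instead of label-aligned.
    npframe = stdres(np.array(ex), np.array(ob))
    frame = pd.DataFrame(npframe)
    # Restore the estuary type names as row labels.
    frame.rename({0: 'Barrier Estuary',
                  1: 'Coastal Plain',
                  2: 'Fjord',
                  3: 'LSE',
                  4: 'Macrotidal',
                  5: 'Open',
                  6: 'Ria',
                  7: 'Tidal Inlet'}, inplace=True)
    return frame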

In [27]:
out = stdcalc(exfreq, df)
out

Unnamed: 0,0,1,2,3,4,5,6
Barrier Estuary,2.554762,-0.627646,-1.494096,-1.087115,-0.334497,0.46225,-0.512471
Coastal Plain,1.710674,-0.174078,-0.414388,-0.301511,-0.603023,-0.333333,-0.142134
Fjord,0.88396,-0.246183,1.120357,-0.426401,-0.852803,-0.471405,-0.201008
LSE,0.081417,-0.984732,-2.344131,-1.705606,1.865506,0.766032,1.683438
Macrotidal,0.145986,0.965609,1.852409,0.752618,-1.714296,-0.3698,-0.512471
Open,-1.975317,1.306549,2.490982,2.901294,-1.856828,0.0,-0.738549
Ria,-1.128152,-0.426401,-1.015038,-0.738549,2.584921,-0.816497,-0.348155
Tidal Inlet,-0.058849,-0.389249,0.152616,-0.6742,0.87646,-0.745356,-0.317821
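An alternative sketch that avoids the hard-coded row names by reusing the labels of the observed table. The helper name `std_residuals` and the small `demo` table are hypothetical and are not part of the original notebook.

```python
import numpy as np
import pandas as pd
from scipy import stats


def std_residuals(observed: pd.DataFrame) -> pd.DataFrame:
    """Standardized residuals (O - E) / sqrt(E), labelled like `observed`."""
    # Index 3 of the result holds the expected frequencies as a plain array.
    expected = stats.chi2_contingency(observed)[3]
    # Positional arithmetic on arrays, then re-wrap with the original labels.
    values = (observed.to_numpy() - expected) / np.sqrt(expected)
    return pd.DataFrame(values, index=observed.index, columns=observed.columns)


# Hypothetical 2x3 table, used only to show the call.
demo = pd.DataFrame(
    [[10, 20, 30], [20, 20, 20]],
    index=["Type A", "Type B"],
    columns=["0", "1", "2"],
)
demo_res = std_residuals(demo)
print(demo_res)
```

Because the labels come from the input, the same helper works unchanged if estuary types or clusters are added or removed.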


In [28]:
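# Save the statistic, p-value, and degrees of freedom. The expected
# frequencies (dfc[3]) are a 2-D array and cannot share a row with scalars.
np.savetxt('./DataOutput/Chi_2_numpyarr.csv', [dfc[:3]], delimiter=',', fmt='%s')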

# Results

As you can see here, most of the values are insignificant ($|\text{Std Residual}| < 2$). The significant values are ***Barrier Estuary*** in cluster **0**, ***LSE*** (negative) and ***Open*** in cluster **2**, ***Open*** in cluster **3**, and ***Ria*** in cluster **4**. This can be a little confusing, so I will show only the important values here:

In [29]:
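def notimportant(x):
    # Zero out residuals whose magnitude is below the cutoff of 2.
    if abs(x) < 2:
        return 0
    else:
        return x
# Apply the cutoff to every cell (applymap is deprecated since pandas 2.1
# in favour of DataFrame.map).
df2 = out.applymap(notimportant)
df2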

Unnamed: 0,0,1,2,3,4,5,6
Barrier Estuary,2.554762,0,0.0,0.0,0.0,0,0
Coastal Plain,0.0,0,0.0,0.0,0.0,0,0
Fjord,0.0,0,0.0,0.0,0.0,0,0
LSE,0.0,0,-2.344131,0.0,0.0,0,0
Macrotidal,0.0,0,0.0,0.0,0.0,0,0
Open,0.0,0,2.490982,2.901294,0.0,0,0
Ria,0.0,0,0.0,0.0,2.584921,0,0
Tidal Inlet,0.0,0,0.0,0.0,0.0,0,0
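The same filtering can be sketched without a per-cell function by masking the whole frame at once. The `res` frame below holds only a few residuals copied from the table above, and the names `masked` and `flagged` are illustrative.

```python
import pandas as pd

# A few residuals copied from the table above (illustration only).
res = pd.DataFrame(
    {
        "2": [-2.344131, 2.490982, -1.015038],
        "3": [-1.705606, 2.901294, -0.738549],
        "4": [1.865506, -1.856828, 2.584921],
    },
    index=["LSE", "Open", "Ria"],
)

# Keep cells with |residual| >= 2 and set every other cell to zero.
masked = res.where(res.abs() >= 2, 0)

# List the flagged cells as (type, cluster, residual) triples.
flagged = [
    (row, col, val)
    for row, series in res.iterrows()
    for col, val in series.items()
    if abs(val) >= 2
]

for row, col, val in flagged:
    print(f"{row}, cluster {col}: {val:+.3f}")
```

Listing the flagged cells keeps the sign of each residual visible, which distinguishes over-represented combinations (positive) from under-represented ones (negative).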


In [20]:
df2.to_csv('./DataOutput/Chi_2_STDResiduals.csv')