# Pearson's Chi-Squared Test of Independence

This test is primarily used to compare the categorical values of the observational classification with the results of the K-Means cluster analysis, in order to assess the validity of the C ratio measure. The analysis follows [this guide](https://pareonline.net/getvn.asp?v=20&n=8).

## Assumptions and Requirements

MAJOR REQUIREMENT: The contingency table must not contain any NA or NaN values. Missing values break the test.
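
A quick guard against missing values can be sketched as follows. The small `counts` table is made up for illustration and only mimics a crosstab in which an empty cell was stored as NaN. Filling with 0 rests on the assumption that a missing cell in a table of counts means "no observations".

```python
import numpy as np
import pandas as pd

# Hypothetical crosstab with one missing cell (illustrative data only)
counts = pd.DataFrame(
    [[7, np.nan, 5], [2, 2, 1]],
    index=["Type A", "Type B"],
    columns=[0, 1, 2],
)

# True if any cell in the table is missing
has_missing = counts.isna().any().any()

# Assumption: in a table of counts, a missing cell means zero observations,
# so 0 is a sensible fill value (this would not hold for other kinds of data)
clean = counts.fillna(0)
```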

In [111]:
import pandas as pd
import numpy as np
from scipy import stats

In [112]:
df = pd.read_csv('./DataOutput/ClusterCrosstab.csv', header=0, index_col=0)

In [113]:
df = df.fillna(0)
df

Type 1,0.0,1.0,2.0,3.0,4.0,5.0,6.0
Barrier Estuary,7.0,0.0,5.0,0.0,0.0,0.0,1.0
Coastal Plain,0.0,0.0,0.0,0.0,1.0,0.0,0.0
Fjord,0.0,0.0,1.0,0.0,1.0,0.0,0.0
LSE,19.0,0.0,4.0,1.0,1.0,0.0,4.0
Macrotidal,2.0,2.0,1.0,0.0,7.0,1.0,0.0
Open,6.0,7.0,2.0,0.0,8.0,2.0,1.0
Ria,6.0,0.0,0.0,0.0,0.0,0.0,0.0
Tidal Inlet,8.0,0.0,0.0,0.0,1.0,0.0,0.0


In [119]:
dfc = stats.chi2_contingency(df)
dfc

(76.58416336444049,
 0.0008823881331705495,
 42,
 array([[6.30303030e+00, 1.18181818e+00, 1.70707071e+00, 1.31313131e-01,
         2.49494949e+00, 3.93939394e-01, 7.87878788e-01],
        [4.84848485e-01, 9.09090909e-02, 1.31313131e-01, 1.01010101e-02,
         1.91919192e-01, 3.03030303e-02, 6.06060606e-02],
        [9.69696970e-01, 1.81818182e-01, 2.62626263e-01, 2.02020202e-02,
         3.83838384e-01, 6.06060606e-02, 1.21212121e-01],
        [1.40606061e+01, 2.63636364e+00, 3.80808081e+00, 2.92929293e-01,
         5.56565657e+00, 8.78787879e-01, 1.75757576e+00],
        [6.30303030e+00, 1.18181818e+00, 1.70707071e+00, 1.31313131e-01,
         2.49494949e+00, 3.93939394e-01, 7.87878788e-01],
        [1.26060606e+01, 2.36363636e+00, 3.41414141e+00, 2.62626263e-01,
         4.98989899e+00, 7.87878788e-01, 1.57575758e+00],
        [2.90909091e+00, 5.45454545e-01, 7.87878788e-01, 6.06060606e-02,
         1.15151515e+00, 1.81818182e-01, 3.63636364e-01],
        [4.36363636e+00, 8.1818181
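
The result above holds four values in a fixed order: the $\chi^2$ statistic, the p-value, the degrees of freedom and the array of expected frequencies. A minimal sketch of unpacking them by name is shown below, using a made-up 2 × 3 table (the `observed` counts are illustrative, not part of this analysis). The degrees of freedom are $(r - 1)(c - 1)$, which for the 8 × 7 table in this notebook gives $7 \times 6 = 42$, matching the output.

```python
import numpy as np
from scipy import stats

# Hypothetical 2x3 table of counts (illustrative data only)
observed = np.array([[12, 5, 9], [8, 14, 6]])

# The result unpacks into four values, in this order
chi2, p, dof, expected = stats.chi2_contingency(observed)

# Degrees of freedom are (rows - 1) * (columns - 1)
n_rows, n_cols = observed.shape
expected_dof = (n_rows - 1) * (n_cols - 1)
```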

# Post Hoc

The next step is to calculate the *standardized residuals* (as opposed to the *raw residuals*) for each cell of the contingency table. They are calculated as:

\begin{align}
\text{Std Residual} = \frac{O - E}{\sqrt{E}}
\end{align}

This shows us the difference between the expected value $E$ and the observed value $O$, scaled by $\sqrt{E}$. The greater the difference between $E$ and $O$, the greater that cell's contribution to the overall $\chi^2$ value.
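
The link between the residuals and the test statistic can be checked numerically: the squared standardized residuals sum to the Pearson statistic, $\chi^2 = \sum (O - E)^2 / E$. The sketch below uses a made-up 2 × 3 table (the `observed` counts are illustrative only). One caveat: for 2 × 2 tables `chi2_contingency` applies Yates' continuity correction by default, so the identity would only hold there with `correction=False`.

```python
import numpy as np
from scipy import stats

# Hypothetical 2x3 table of counts (illustrative data only)
observed = np.array([[12, 5, 9], [8, 14, 6]])

chi2, p, dof, expected = stats.chi2_contingency(observed)

# Standardized residual for every cell: (O - E) / sqrt(E)
std_res = (observed - expected) / np.sqrt(expected)

# Each cell contributes its squared residual to the overall statistic
total = (std_res ** 2).sum()
```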

## Extracting the Expected Values

The expected values are returned as an array by `stats.chi2_contingency()`. We will extract them into a DataFrame.

In [120]:
exfreq = pd.DataFrame(dfc[3])
exfreq

Unnamed: 0,0,1,2,3,4,5,6
0,6.30303,1.181818,1.707071,0.131313,2.494949,0.393939,0.787879
1,0.484848,0.090909,0.131313,0.010101,0.191919,0.030303,0.060606
2,0.969697,0.181818,0.262626,0.020202,0.383838,0.060606,0.121212
3,14.060606,2.636364,3.808081,0.292929,5.565657,0.878788,1.757576
4,6.30303,1.181818,1.707071,0.131313,2.494949,0.393939,0.787879
5,12.606061,2.363636,3.414141,0.262626,4.989899,0.787879,1.575758
6,2.909091,0.545455,0.787879,0.060606,1.151515,0.181818,0.363636
7,4.363636,0.818182,1.181818,0.090909,1.727273,0.272727,0.545455
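
The expected frequencies come back as a bare array, so the row and column labels of the crosstab are lost in the DataFrame above. One possible alternative, sketched here on a made-up table (`observed` and its labels are illustrative), is to reuse the index and columns of the observed table when building the DataFrame:

```python
import pandas as pd
from scipy import stats

# Hypothetical labelled crosstab (illustrative data only)
observed = pd.DataFrame(
    [[12, 5, 9], [8, 14, 6]],
    index=["Type A", "Type B"],
    columns=[0, 1, 2],
)

result = stats.chi2_contingency(observed)

# The expected frequencies are the fourth element of the result;
# reuse the labels of the observed table so both frames line up
expected = pd.DataFrame(result[3], index=observed.index, columns=observed.columns)
```

With matching labels, the two DataFrames can be subtracted and divided directly without converting them to arrays first.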


## Calculating Standardized residuals

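Now we will calculate the standardized residuals using the formula above. To do this we will create a function ***stdcalc***, which converts the expected and observed DataFrames into NumPy arrays. The two DataFrames carry different row and column labels, so working on plain arrays ensures the arithmetic is applied cell by cell by position. The function then relabels the rows with the estuary types.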

In [121]:
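def stdcalc(ex, ob):
    """Return the standardized residuals (O - E) / sqrt(E) as a DataFrame."""

    def stdres(E, O):
        # Standardized residual for each cell
        z = (O - E) / np.sqrt(E)
        return z

    # Work on plain arrays so the operation is positional, not label-aligned
    npframe = stdres(np.array(ex), np.array(ob))
    frame = pd.DataFrame(npframe)
    # Label the rows with the estuary types, in the same order as the crosstab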
    frame.rename({0:'Barrier Estuary',
                  1: 'Coastal Plain',
                  2: 'Fjord',
                  3: 'LSE',
                  4: 'Macrotidal',
                  5: 'Open',
                  6: 'Ria',
                  7: 'Tidal Inlet'}, inplace=True)
    return frame

In [122]:
out = stdcalc(exfreq, df)
out

Unnamed: 0,0,1,2,3,4,5,6
Barrier Estuary,0.277613,-1.087115,2.520326,-0.362372,-1.579541,-0.627646,0.238976
Coastal Plain,-0.696311,-0.301511,-0.362372,-0.100504,1.844572,-0.174078,-0.246183
Fjord,-0.984732,-0.426401,1.43886,-0.142134,0.994536,-0.246183,-0.348155
LSE,1.31726,-1.623688,0.098348,1.306416,-1.935285,-0.937437,1.691456
Macrotidal,-1.713956,0.752618,-0.541174,-0.362372,2.852127,0.965609,-0.887625
Open,-1.860599,3.015693,-0.765336,-0.512471,1.34752,1.365577,-0.458664
Ria,1.812206,-0.738549,-0.887625,-0.246183,-1.073087,-0.426401,-0.603023
Tidal Inlet,1.740777,-0.904534,-1.087115,-0.301511,-0.553372,-0.522233,-0.738549


In [131]:
np.savetxt('./DataOutput/Chi_2_numpyarr.csv', [dfc], delimiter=',', fmt='%s')

# Results

As you can see, most of the values are not significant ($|\text{Std Residual}| < 2$). The notable significant values are ***Barrier Estuary*** in cluster **2**, ***Macrotidal*** in cluster **4** and ***Open*** in cluster **1**. Because the full table can be a little confusing, I will show only the important values here:

In [135]:
def notimportant(x):
    if abs(x) < 2:
        return 0
    else:
        return x
df2 = out.applymap(notimportant)
df2

Unnamed: 0,0,1,2,3,4,5,6
Barrier Estuary,0,0.0,2.520326,0,0.0,0,0
Coastal Plain,0,0.0,0.0,0,0.0,0,0
Fjord,0,0.0,0.0,0,0.0,0,0
LSE,0,0.0,0.0,0,0.0,0,0
Macrotidal,0,0.0,0.0,0,2.852127,0,0
Open,0,3.015693,0.0,0,0.0,0,0
Ria,0,0.0,0.0,0,0.0,0,0
Tidal Inlet,0,0.0,0.0,0,0.0,0,0
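
The same filtering can also be written without a cell-by-cell function. The sketch below is an alternative, not the method used above: it relies on `DataFrame.where`, which keeps the cells where the condition holds and replaces the rest. The `residuals` table is a made-up example.

```python
import pandas as pd

# Hypothetical standardized residuals (illustrative data only)
residuals = pd.DataFrame(
    [[0.28, -1.09, 2.52], [-1.86, 3.02, -0.77]],
    index=["Type A", "Type B"],
    columns=[0, 1, 2],
)

# Keep cells whose absolute value is at least 2 and replace the rest with 0
important = residuals.where(residuals.abs() >= 2, 0)
```

Unlike the function-based version, every column stays floating point here, so the zeros are all displayed as `0.0`.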


In [136]:
df2.to_csv('./DataOutput/Chi_2_STDResiduals.csv')