# Pearson's Chi-Squared Test of Independence

This test is primarily used to compare the categorical values of the observational classification with the results of the K-Means cluster analysis, in order to assess the validity of the C ratio measure.

## Assumptions and Requirements

MAJOR REQUIREMENT: The contingency table must contain no NA or NaN values, as these break the test.
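
A quick guard before running the test might look like the sketch below. The small `counts` table and its labels are made up for illustration; only the `isna`/`fillna` pattern matters. Filling with 0 assumes that a missing cell means no observations for that combination.

```python
import numpy as np
import pandas as pd

# Hypothetical crosstab with one missing cell (labels are illustrative only).
counts = pd.DataFrame(
    [[7.0, np.nan, 4.0],
     [1.0, 3.0, 9.0]],
    index=["A", "B"],
    columns=["0", "1", "2"],
)

# Count the missing cells, then treat them as zero observations.
n_missing = int(counts.isna().sum().sum())
clean = counts.fillna(0)

# The table is only safe to test once no NaN remains.
has_nan = bool(clean.isna().to_numpy().any())
print(n_missing, has_nan)
```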

In [33]:
import pandas as pd
import numpy as np
from scipy import stats

In [34]:
df = pd.read_csv('./DataOutput/ClusterCrosstab.csv', header=0, index_col=0)

In [35]:
df = df.fillna(0)
df

Type 1,0,1,2,3,4,5,6
Barrier Estuary,0.0,7.0,2.0,0.0,4.0,0.0,0.0
Coastal Plain,0.0,1.0,0.0,0.0,0.0,0.0,0.0
Fjord,0.0,1.0,0.0,0.0,0.0,1.0,0.0
LSE,0.0,7.0,6.0,0.0,14.0,1.0,1.0
Macrotidal,2.0,3.0,1.0,1.0,1.0,5.0,0.0
Open,7.0,1.0,3.0,2.0,4.0,9.0,0.0
Ria,0.0,0.0,0.0,0.0,6.0,0.0,0.0
Tidal Inlet,0.0,1.0,0.0,0.0,7.0,1.0,0.0


In [36]:
dfc = stats.chi2_contingency(df)
dfc

(77.63798335426124,
 0.0006763321122357378,
 42,
 array([[1.18181818e+00, 2.75757576e+00, 1.57575758e+00, 3.93939394e-01,
         4.72727273e+00, 2.23232323e+00, 1.31313131e-01],
        [9.09090909e-02, 2.12121212e-01, 1.21212121e-01, 3.03030303e-02,
         3.63636364e-01, 1.71717172e-01, 1.01010101e-02],
        [1.81818182e-01, 4.24242424e-01, 2.42424242e-01, 6.06060606e-02,
         7.27272727e-01, 3.43434343e-01, 2.02020202e-02],
        [2.63636364e+00, 6.15151515e+00, 3.51515152e+00, 8.78787879e-01,
         1.05454545e+01, 4.97979798e+00, 2.92929293e-01],
        [1.18181818e+00, 2.75757576e+00, 1.57575758e+00, 3.93939394e-01,
         4.72727273e+00, 2.23232323e+00, 1.31313131e-01],
        [2.36363636e+00, 5.51515152e+00, 3.15151515e+00, 7.87878788e-01,
         9.45454545e+00, 4.46464646e+00, 2.62626263e-01],
        [5.45454545e-01, 1.27272727e+00, 7.27272727e-01, 1.81818182e-01,
         2.18181818e+00, 1.03030303e+00, 6.06060606e-02],
        [8.18181818e-01, 1.9090909

# Post Hoc

The next step is to calculate the _standardized residuals_ (as distinct from the _raw residuals_) for each cell of the contingency table. They are calculated as:

\begin{align}
\text{Std Residual} = \frac{O - E}{\sqrt{E}}
\end{align}

This shows the difference between the expected value $E$ and the observed value $O$, scaled by $\sqrt{E}$. The greater the difference between $E$ and $O$, the greater the cell's contribution to the overall $\chi^2$ value.
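
As a check on the formula, the squared residuals should sum to the $\chi^2$ statistic itself. The sketch below uses a small invented 2x3 table (not the estuary data) and passes `correction=False` so that no continuity correction is applied.

```python
import numpy as np
from scipy import stats

# Hypothetical observed counts, for illustration only.
observed = np.array([[10.0, 20.0, 30.0],
                     [20.0, 20.0, 20.0]])

chi2, p, dof, expected = stats.chi2_contingency(observed, correction=False)

# Residual for every cell, using the formula above: (O - E) / sqrt(E).
std_res = (observed - expected) / np.sqrt(expected)

# Each squared residual is that cell's contribution to chi-squared.
contributions = std_res ** 2
total = contributions.sum()
print(np.isclose(total, chi2))
```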

## Extracting the Expected Values

The expected values are returned as an array by `stats.chi2_contingency()`; we will extract them into a dataframe.

In [37]:
exfreq = pd.DataFrame(dfc[3])
exfreq

,0,1,2,3,4,5,6
0,1.181818,2.757576,1.575758,0.393939,4.727273,2.232323,0.131313
1,0.090909,0.212121,0.121212,0.030303,0.363636,0.171717,0.010101
2,0.181818,0.424242,0.242424,0.060606,0.727273,0.343434,0.020202
3,2.636364,6.151515,3.515152,0.878788,10.545455,4.979798,0.292929
4,1.181818,2.757576,1.575758,0.393939,4.727273,2.232323,0.131313
5,2.363636,5.515152,3.151515,0.787879,9.454545,4.464646,0.262626
6,0.545455,1.272727,0.727273,0.181818,2.181818,1.030303,0.060606
7,0.818182,1.909091,1.090909,0.272727,3.272727,1.545455,0.090909
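
The expected counts can be reproduced by hand: each cell is its row total times its column total divided by the grand total, and the degrees of freedom are (rows - 1) x (columns - 1). The sketch below re-enters the observed crosstab shown earlier (after `fillna(0)`) as a plain array and compares the result with `stats.chi2_contingency`; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Observed counts copied from the crosstab shown earlier (columns 0-6).
observed = np.array([
    [0, 7, 2, 0, 4, 0, 0],   # Barrier Estuary
    [0, 1, 0, 0, 0, 0, 0],   # Coastal Plain
    [0, 1, 0, 0, 0, 1, 0],   # Fjord
    [0, 7, 6, 0, 14, 1, 1],  # LSE
    [2, 3, 1, 1, 1, 5, 0],   # Macrotidal
    [7, 1, 3, 2, 4, 9, 0],   # Open
    [0, 0, 0, 0, 6, 0, 0],   # Ria
    [0, 1, 0, 0, 7, 1, 0],   # Tidal Inlet
], dtype=float)

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
grand_total = observed.sum()

# Expected count under independence: row total * column total / grand total.
expected_by_hand = row_totals * col_totals / grand_total
dof_by_hand = (observed.shape[0] - 1) * (observed.shape[1] - 1)

chi2, p, dof, expected = stats.chi2_contingency(observed)
print(dof_by_hand == dof, np.allclose(expected_by_hand, expected))
```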


## Calculating Standardized residuals

Now we will calculate the standardized residuals.

In [59]:
def stdcalc(ex, ob):
    """Return standardized residuals (O - E) / sqrt(E), labelled like `ob`."""
    # Work on plain arrays so differing row/column labels cannot misalign cells.
    E = np.asarray(ex, dtype=float)
    O = np.asarray(ob, dtype=float)
    z = (O - E) / np.sqrt(E)
    # Carry over the row and column labels of the observed table.
    return pd.DataFrame(z, index=ob.index, columns=ob.columns)

In [60]:
out = stdcalc(exfreq, df)
out
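
Once a table of standardized residuals exists, a common follow-up is to list the cells by absolute size to see which contribute most to $\chi^2$. The sketch below is one way to do that with pandas; the `residuals` frame and its labels are invented for illustration and are not results from this analysis.

```python
import pandas as pd

# Hypothetical standardized residuals (illustrative values only).
residuals = pd.DataFrame(
    [[-1.2, 2.5, 0.3],
     [0.8, -3.1, 0.1]],
    index=["A", "B"],
    columns=["0", "1", "2"],
)

# Flatten to one entry per cell, indexed by (column label, row label).
long = residuals.unstack().rename("std_residual")

# Order by absolute size, largest first, keeping the original signs.
ranked = long.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked.head(3))
```

Keeping the sign matters here: a positive residual marks a cell with more observations than expected, a negative one marks fewer.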