# Correspondence Analysis - Hair color and eye color

This notebook reproduces the results on pages 638 to 651 of [Izenman's](http://ce.aut.ac.ir/~shiry/lecture/Advanced%20Machine%20Learning/Manifold_Modern_Multivariate%20Statistical%20Techniques%20-%20Regres.pdf) *Modern Multivariate Statistical Techniques*.

## Dataset

Let's first initialize the contingency table Izenman calls $N$.

In [6]:
import pandas as pd

hair_colors = ['Fair', 'Red', 'Medium', 'Dark', 'Black']
eye_colors = ['Blue', 'Light', 'Medium', 'Dark']
values = [
    [326, 38, 241, 110, 3],
    [688, 116, 584, 188, 4],
    [343, 84, 909, 412, 26],
    [98, 48, 403, 681, 85]
]

df = pd.DataFrame(data=values, index=eye_colors, columns=hair_colors)

df

| Eye color | Fair | Red | Medium | Dark | Black |
|---|---|---|---|---|---|
| Blue | 326 | 38 | 241 | 110 | 3 |
| Light | 688 | 116 | 584 | 188 | 4 |
| Medium | 343 | 84 | 909 | 412 | 26 |
| Dark | 98 | 48 | 403 | 681 | 85 |


## Compute the CA

In [40]:
import prince

ca = prince.CA(df, n_components=4)

### Row and column sums

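What Izenman denotes $XX^t$ and $YY^t$ are simply the row and column totals of $N$ placed on the diagonal of a matrix. `ca.row_sums` and `ca.column_sums` hold relative frequencies, so multiplying by the grand total `ca.N` recovers the counts.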

In [12]:
import numpy as np

np.diag(ca.row_sums * ca.N)

array([[  718.,     0.,     0.,     0.],
       [    0.,  1580.,     0.,     0.],
       [    0.,     0.,  1774.,     0.],
       [    0.,     0.,     0.,  1315.]])

In [11]:
np.diag(ca.column_sums * ca.N)

array([[ 1455.,     0.,     0.,     0.,     0.],
       [    0.,   286.,     0.,     0.,     0.],
       [    0.,     0.,  2137.,     0.,     0.],
       [    0.,     0.,     0.,  1391.,     0.],
       [    0.,     0.,     0.,     0.,   118.]])
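
As a cross-check that does not depend on the library, the same two diagonal matrices can be built directly from the contingency table with NumPy. This is only a sketch; the variable names (`row_totals`, `D_rows`, and so on) are made up for illustration and are not part of `prince`.

```python
import numpy as np

# Contingency table N from the Dataset section
# (rows: eye colors, columns: hair colors).
N = np.array([
    [326, 38, 241, 110, 3],
    [688, 116, 584, 188, 4],
    [343, 84, 909, 412, 26],
    [98, 48, 403, 681, 85],
])

row_totals = N.sum(axis=1)     # one total per eye color
column_totals = N.sum(axis=0)  # one total per hair color

# Store the totals on the diagonal, as in the outputs above.
D_rows = np.diag(row_totals)
D_columns = np.diag(column_totals)

print(D_rows)
print(D_columns)
```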

### Correspondence matrix

$P$ is $N$ divided by its grand total, so each entry is the relative frequency of one (eye color, hair color) cell.

In [17]:
ca.P

| Eye color | Fair | Red | Medium | Dark | Black |
|---|---|---|---|---|---|
| Blue | 0.060516 | 0.007054 | 0.044737 | 0.02042 | 0.000557 |
| Light | 0.127715 | 0.021533 | 0.108409 | 0.034899 | 0.000743 |
| Medium | 0.063672 | 0.015593 | 0.16874 | 0.07648 | 0.004826 |
| Dark | 0.018192 | 0.00891 | 0.07481 | 0.126415 | 0.015779 |


By definition, the entries of the correspondence matrix sum to 1.

In [19]:
np.sum(ca.P.values)

1.0
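
To make the definition concrete, here is a minimal NumPy sketch that rebuilds $P$ from the raw counts. The names `n` and `P` are local to this example, and the snippet assumes nothing about how `prince` computes `ca.P` internally.

```python
import numpy as np

# Contingency table N from the Dataset section.
N = np.array([
    [326, 38, 241, 110, 3],
    [688, 116, 584, 188, 4],
    [343, 84, 909, 412, 26],
    [98, 48, 403, 681, 85],
])

n = N.sum()  # grand total of the table
P = N / n    # relative frequency of each cell

print(n)
print(P.round(6))
print(P.sum())
```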

### Row and column profiles

The row profiles matrix is obtained by dividing each row of $P$ by its sum, so every row sums to 1.

In [20]:
ca.row_profiles

| Eye color | Fair | Red | Medium | Dark | Black |
|---|---|---|---|---|---|
| Blue | 0.454039 | 0.052925 | 0.335655 | 0.153203 | 0.004178 |
| Light | 0.435443 | 0.073418 | 0.36962 | 0.118987 | 0.002532 |
| Medium | 0.193348 | 0.047351 | 0.512401 | 0.232244 | 0.014656 |
| Dark | 0.074525 | 0.036502 | 0.306464 | 0.517871 | 0.064639 |


In [24]:
ca.row_profiles.sum(axis='columns')

Blue      1.0
Light     1.0
Medium    1.0
Dark      1.0
dtype: float64

The column profiles matrix is obtained in the same fashion, by dividing each column of $P$ by its sum.

In [21]:
ca.column_profiles

| Eye color | Fair | Red | Medium | Dark | Black |
|---|---|---|---|---|---|
| Blue | 0.224055 | 0.132867 | 0.112775 | 0.07908 | 0.025424 |
| Light | 0.472852 | 0.405594 | 0.27328 | 0.135155 | 0.033898 |
| Medium | 0.235739 | 0.293706 | 0.425363 | 0.29619 | 0.220339 |
| Dark | 0.067354 | 0.167832 | 0.188582 | 0.489576 | 0.720339 |


In [25]:
ca.column_profiles.sum(axis='rows')

Fair      1.0
Red       1.0
Medium    1.0
Dark      1.0
Black     1.0
dtype: float64
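
Both normalizations can be reproduced with plain pandas, which may help to see what "by row" and "by column" mean here. This is a sketch with illustrative variable names, not the library's implementation.

```python
import pandas as pd

hair_colors = ['Fair', 'Red', 'Medium', 'Dark', 'Black']
eye_colors = ['Blue', 'Light', 'Medium', 'Dark']
values = [
    [326, 38, 241, 110, 3],
    [688, 116, 584, 188, 4],
    [343, 84, 909, 412, 26],
    [98, 48, 403, 681, 85],
]
df = pd.DataFrame(data=values, index=eye_colors, columns=hair_colors)

# Correspondence matrix: counts divided by the grand total.
P = df / df.values.sum()

# Row profiles: divide each row by its own sum.
row_profiles = P.div(P.sum(axis='columns'), axis='index')

# Column profiles: divide each column by its own sum.
column_profiles = P.div(P.sum(axis='index'), axis='columns')

print(row_profiles.round(6))
print(column_profiles.round(6))
```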

### Row and column masses

The row and column masses are nothing more than the row and column frequencies, which are obtained by summing $P$ row-wise and column-wise.

In [29]:
ca.row_sums

Blue      0.133284
Light     0.293299
Medium    0.329311
Dark      0.244106
dtype: float64

In [30]:
ca.column_sums

Fair      0.270095
Red       0.053091
Medium    0.396696
Dark      0.258214
Black     0.021905
dtype: float64
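
The masses can also be computed by hand from $P$. The sketch below additionally checks one consequence of the definitions: averaging the row profiles with the row masses as weights gives back the column masses. All names are local to the example.

```python
import numpy as np

# Contingency table N from the Dataset section.
N = np.array([
    [326, 38, 241, 110, 3],
    [688, 116, 584, 188, 4],
    [343, 84, 909, 412, 26],
    [98, 48, 403, 681, 85],
])
P = N / N.sum()

row_masses = P.sum(axis=1)     # sum each row of P
column_masses = P.sum(axis=0)  # sum each column of P

# Row profiles, then their mass-weighted average.
row_profiles = P / row_masses[:, np.newaxis]
weighted_average = row_masses @ row_profiles

print(row_masses.round(6))
print(column_masses.round(6))
print(np.allclose(weighted_average, column_masses))
```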

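### Expected frequencies

Under independence of eye color and hair color, the expected relative frequency of a cell is the product of its row mass and its column mass. Multiplying by the grand total `ca.N` converts these to expected counts.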

In [34]:
ca.expected_frequencies * ca.N

| Eye color | Fair | Red | Medium | Dark | Black |
|---|---|---|---|---|---|
| Blue | 193.927975 | 38.119176 | 284.827548 | 185.39781 | 15.727492 |
| Light | 426.749582 | 83.883423 | 626.779283 | 407.978467 | 34.609244 |
| Medium | 479.147949 | 94.183033 | 703.738259 | 458.072025 | 38.858734 |
| Dark | 355.174494 | 69.814368 | 521.65491 | 339.551699 | 28.804529 |
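
The expected counts above are consistent with the usual independence formula, row total times column total divided by the grand total. A minimal NumPy sketch, with names chosen for this example only:

```python
import numpy as np

# Contingency table N from the Dataset section.
N = np.array([
    [326, 38, 241, 110, 3],
    [688, 116, 584, 188, 4],
    [343, 84, 909, 412, 26],
    [98, 48, 403, 681, 85],
])
n = N.sum()

row_masses = N.sum(axis=1) / n
column_masses = N.sum(axis=0) / n

# Expected relative frequencies under independence: r_i * c_j.
expected_frequencies = np.outer(row_masses, column_masses)

# Expected counts: rescale by the grand total.
expected_counts = expected_frequencies * n

print(expected_counts.round(6))
```

The expected counts keep the same grand total as the observed table, since the masses each sum to 1.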


### Relative frequency matrix

In [44]:
ca.P - ca.expected_frequencies

| Eye color | Fair | Red | Medium | Dark | Black |
|---|---|---|---|---|---|
| Blue | 0.024517 | -2.2e-05 | -0.008136 | -0.013996 | -0.002363 |
| Light | 0.048496 | 0.005962 | -0.007941 | -0.040835 | -0.005682 |
| Medium | -0.025273 | -0.00189 | 0.038103 | -0.008552 | -0.002387 |
| Dark | -0.04774 | -0.004049 | -0.022026 | 0.063384 | 0.010432 |


### Residuals matrix

In [47]:
(ca.P - ca.expected_frequencies) * ca.N

| Eye color | Fair | Red | Medium | Dark | Black |
|---|---|---|---|---|---|
| Blue | 132.072025 | -0.119176 | -43.827548 | -75.39781 | -12.727492 |
| Light | 261.250418 | 32.116577 | -42.779283 | -219.978467 | -30.609244 |
| Medium | -136.147949 | -10.183033 | 205.261741 | -46.072025 | -12.858734 |
| Dark | -257.174494 | -21.814368 | -118.65491 | 341.448301 | 56.195471 |
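
The last two matrices differ only by a factor of the grand total: one is observed minus expected in relative frequencies, the other in counts. The sketch below recomputes both with NumPy and checks that the residuals cancel out within every row and every column. Variable names are illustrative.

```python
import numpy as np

# Contingency table N from the Dataset section.
N = np.array([
    [326, 38, 241, 110, 3],
    [688, 116, 584, 188, 4],
    [343, 84, 909, 412, 26],
    [98, 48, 403, 681, 85],
])
n = N.sum()

# Expected counts under independence.
expected_counts = np.outer(N.sum(axis=1), N.sum(axis=0)) / n

# Residuals in counts, then rescaled to relative frequencies.
residuals = N - expected_counts
relative_residuals = residuals / n

print(residuals.round(6))
print(relative_residuals.round(6))

# Observed and expected tables share the same margins,
# so residuals sum to zero along both axes.
print(np.allclose(residuals.sum(axis=0), 0))
print(np.allclose(residuals.sum(axis=1), 0))
```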


### Explained inertia

In [42]:
ca.explained_inertia

[0.86519871487199929,
 0.12945798649306905,
 0.0036760220486457156,
 7.3778217919796355e-36]
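
The fourth value is numerically zero, which is expected: a $4 \times 5$ table has at most $\min(4, 5) - 1 = 3$ non-trivial dimensions. For comparison, here is a sketch of the textbook computation, assuming the standard definition in which the principal inertias are the squared singular values of the standardized residuals $(p_{ij} - r_i c_j) / \sqrt{r_i c_j}$. Whether the library normalizes in exactly this way is an assumption, so the figures need not agree with the output above to every digit.

```python
import numpy as np

# Contingency table N from the Dataset section.
N = np.array([
    [326, 38, 241, 110, 3],
    [688, 116, 584, 188, 4],
    [343, 84, 909, 412, 26],
    [98, 48, 403, 681, 85],
])
P = N / N.sum()
r = P.sum(axis=1)  # row masses
c = P.sum(axis=0)  # column masses

# Standardized residuals: (p_ij - r_i * c_j) / sqrt(r_i * c_j).
expected = np.outer(r, c)
S = (P - expected) / np.sqrt(expected)

# Principal inertias are the squared singular values of S.
singular_values = np.linalg.svd(S, compute_uv=False)
inertias = singular_values ** 2
total_inertia = inertias.sum()

# Share of the total inertia carried by each component.
explained = inertias / total_inertia

print(total_inertia)
print(explained)
```

With this definition the total inertia equals the chi-square statistic of the table divided by the grand total.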

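Finally, the matrix `ca.svd.U` from the decomposition, with each row scaled by the square root of the corresponding row mass:

In [54]: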
np.sqrt(np.diag(ca.row_sums)) @ ca.svd.U

array([[-0.11576377,  0.10998475,  0.27288272,  0.18254029],
       [-0.28404093,  0.12454015, -0.35183023,  0.27078528],
       [ 0.03051955, -0.49560161,  0.02076243,  0.28692826],
       [ 0.38951773,  0.16423449, -0.06620127,  0.24703551]])