# Correspondence Analysis (CA)

PCA is usually applied to a dataset of numeric columns, where each row is an observation. But not all datasets have that shape. One common alternative is the contingency table: a two-way table that tallies how often each pair of values of two variables occurs together.

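CA is PCA specialized for contingency tables. The idea is to look at residuals: the differences between the observed counts and the counts that would be expected if the two variables were independent. If the variables really are independent, the residuals are small. CA describes the structure in whatever departs from independence.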
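To make the notion of residuals concrete, here is a minimal sketch using NumPy. The counts are a small subset of the election table used later in this section (three regions, four columns); the variable names are illustrative and not part of prince.

```python
import numpy as np

# Rows: Bretagne, Corse, Guadeloupe
# Columns: Le Pen, Macron, Mélenchon, Abstention
observed = np.array([
    [385393, 647172, 407527, 543425],
    [42283, 26795, 19779, 90636],
    [24204, 18137, 75862, 174592],
], dtype=float)

n = observed.sum()
row_totals = observed.sum(axis=1, keepdims=True)  # shape (3, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 4)

# Counts expected if region and vote were independent:
# (row total * column total) / grand total
expected = row_totals @ col_totals / n

# Raw residuals, and residuals scaled by the square root of the expected count
residuals = observed - expected
std_residuals = residuals / np.sqrt(expected)

print(np.round(std_residuals, 1))
```

A large positive entry means that combination of region and vote occurs more often than independence would predict; a large negative entry means it occurs less often.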

Let's apply CA to election data. For instance, here is a contingency table that counts the number of votes for each candidate by region in France:

In [6]:
import prince

tally = prince.datasets.load_french_elections()
tally


region,Arthaud,Dupont-Aignan,Hidalgo,Jadot,Lassalle,Le Pen,Macron,Mélenchon,Poutou,Pécresse,Roussel,Zemmour,Abstention,Blank
Auvergne-Rhône-Alpes,23137,98465,77570,224735,136436,943294,1175085,897434,30596,217906,96409,312916,1228490,70084
Bourgogne-Franche-Comté,10643,38691,26543,60235,49557,409639,394117,277899,12737,76654,33932,107057,456682,26381
Bretagne,12965,35116,43596,122198,58653,385393,647172,407527,19913,92808,51193,96984,543425,31867
Centre-Val de Loire,9256,31759,23162,54401,38659,347845,383851,251259,11226,71690,33590,88575,459528,23216
Corse,455,2600,1589,4801,15408,42283,26795,19779,1374,9363,4553,18936,90636,2521
Grand Est,18658,74918,40031,111960,77442,825219,762282,492439,22243,120931,47425,200265,1008344,42256
Guadeloupe,1084,2114,2266,1927,1033,24204,18137,75862,713,3979,668,3098,174592,2719
Guyane,297,717,535,940,516,6334,5101,18143,462,997,246,1573,65754,825
Hauts-de-France,20977,55439,40856,95234,62548,1015361,773221,577878,21150,107631,94831,179606,1146209,42994
La Réunion,3538,8338,5549,7994,4844,85770,62542,139604,2705,9738,3074,13070,313159,7361


In [7]:
ca = prince.CA(n_components=4)
ca = ca.fit(tally)


The quality of the CA is very good: more than 75% of the variance is captured by just two components. Clearly there is a lot of structure in this dataset.

In [23]:
ca.eigenvalues_summary


component,eigenvalue,% of variance,% of variance (cumulative)
0,0.021,40.82%,40.82%
1,0.018,36.15%,76.97%
2,0.005,10.08%,87.04%
3,0.004,8.07%,95.11%
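
The percentages in this summary can be related back to the residuals. The following is a minimal sketch under the textbook formulation of CA (prince's internals may differ): the eigenvalues are the squared singular values of the standardized residual matrix, and each percentage is an eigenvalue divided by their sum, the total inertia. It uses a 3×4 subset of the table, so its numbers will not match the summary above; all variable names are illustrative.

```python
import numpy as np

# Rows: Bretagne, Corse, Guadeloupe
# Columns: Le Pen, Macron, Mélenchon, Abstention
observed = np.array([
    [385393, 647172, 407527, 543425],
    [42283, 26795, 19779, 90636],
    [24204, 18137, 75862, 174592],
], dtype=float)

n = observed.sum()
P = observed / n          # correspondence matrix (relative frequencies)
r = P.sum(axis=1)         # row masses
c = P.sum(axis=0)         # column masses

# Standardized residuals: (observed - expected) / sqrt(expected),
# expressed in relative frequencies
independence = np.outer(r, c)
S = (P - independence) / np.sqrt(independence)

# Eigenvalues are the squared singular values of S
singular_values = np.linalg.svd(S, compute_uv=False)
eigenvalues = singular_values ** 2

total_inertia = eigenvalues.sum()
explained = eigenvalues / total_inertia

print(np.round(eigenvalues, 4))
print(np.round(100 * explained, 2))
```

With 3 rows and 4 columns there are at most two non-trivial components, so the last eigenvalue is zero up to rounding error. The total inertia equals the chi-squared statistic of the table divided by the grand total, which is why CA can be read as a decomposition of the departure from independence.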


Here it makes sense to start with a plot of the first two components.

In [25]:
ca.plot(
    tally,
    x_component=0,
    y_component=1,
    show_row_markers=True,
    show_row_labels=False,
    show_column_markers=False,
    show_column_labels=True
)


There are several insights. For instance, abstention seems to be concentrated in France's overseas departments (former colonies). This explains the x-axis in the above chart.

In [35]:
(tally['Abstention'] / tally.sum(axis='columns')).sort_values(ascending=False).map(lambda x: f'{x:.2%}')


region
Guyane                        64.19%
Mayotte                       60.47%
Martinique                    57.90%
Guadeloupe                    55.89%
La Réunion                    46.93%
Corse                         37.59%
Hauts-de-France               27.07%
Provence-Alpes-Côte d'Azur    26.29%
Grand Est                     26.23%
Centre-Val de Loire           25.14%
Normandie                     24.58%
Île-de-France                 24.04%
Pays de la Loire              23.13%
Bourgogne-Franche-Comté       23.06%
Auvergne-Rhône-Alpes          22.20%
Occitanie                     21.87%
Nouvelle-Aquitaine            21.69%
Bretagne                      21.32%
dtype: object

Interestingly, the y-axis divide is not only due to left-wing vs. right-wing preferences. It could also reflect a split between people living in Paris and the rest of France.

Correspondence analysis is not frequently seen. It originated in France and never really caught on in the USA. It was popularized by Pierre Bourdieu, who used it to study socioeconomic data. For more information, start from page 25 of [this PDF](https://www.math.univ-toulouse.fr/~xgendre/ens/m2se/DataMining.pdf).