+++
title = "Multiple correspondence analysis"
menu = "main"
weight = 3
toc = true
aliases = ["mca"]
+++

## Resources

- [*Computation of Multiple Correspondence Analysis* by Oleg Nenadić and Michael Greenacre](https://core.ac.uk/download/pdf/6591520.pdf)
- [*Multiple Correspondence Analysis* by Hervé Abdi](https://maths.cnam.fr/IMG/pdf/ClassMCA_cle825cfc.pdf)
- [*Multiple Correspondance Analysis - Introduction* by Vivek Yadav](https://vxy10.github.io/2016/06/10/intro-MCA/)
- [*Multiple Correspondence Analysis* by Julien Duval](https://www.politika.io/en/notice/multiple-correspondence-analysis)

## Data

Multiple correspondence analysis (MCA) is an extension of correspondence analysis (CA). It should be used when you have more than two categorical variables. The idea is to one-hot encode the dataset and then apply correspondence analysis to the resulting indicator matrix.
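To make the "one-hot encode, then run CA" idea concrete, here is a minimal sketch in plain NumPy and pandas. It follows the textbook computation (standardised residuals followed by an SVD) and is not prince's actual implementation, which may differ in details such as the SVD engine or sign conventions. The `toy` frame and every variable name are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "Color": ["YELLOW", "YELLOW", "PURPLE", "PURPLE"],
    "Size": ["SMALL", "LARGE", "SMALL", "LARGE"],
    "Age": ["ADULT", "CHILD", "CHILD", "ADULT"],
})

# Step 1: one-hot encode every variable into an indicator matrix.
Z = pd.get_dummies(toy, prefix_sep="__").astype(float)

# Step 2: run a correspondence analysis on the indicator matrix.
N = Z.to_numpy()
P = N / N.sum()                       # correspondence matrix
r = P.sum(axis=1)                     # row masses
c = P.sum(axis=0)                     # column masses
expected = np.outer(r, c)
S = (P - expected) / np.sqrt(expected)  # standardised residuals
U, s, Vt = np.linalg.svd(S, full_matrices=False)

eigenvalues = s ** 2                            # principal inertias
row_coords = (U * s) / np.sqrt(r)[:, None]      # principal row coordinates
col_coords = (Vt.T * s) / np.sqrt(c)[:, None]   # principal column coordinates

print(pd.DataFrame(col_coords[:, :2], index=Z.columns))
```

Trailing components with a near-zero singular value carry no information and can be ignored.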

As an example, we're going to use the [balloons dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/balloons/) taken from the [UCI datasets website](https://archive.ics.uci.edu/ml/datasets.html).

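```python
import pandas as pd

# Note: the .data file has no header row, so read_csv consumes its first
# record as column names. Pass header=None and names=[...] to keep that record.
dataset = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/balloons/adult+stretch.data')
dataset.columns = ['Color', 'Size', 'Action', 'Age', 'Inflated']
dataset.head()
```

| | Color | Size | Action | Age | Inflated |
|---|---|---|---|---|---|
| 0 | YELLOW | SMALL | STRETCH | ADULT | T |
| 1 | YELLOW | SMALL | STRETCH | CHILD | F |
| 2 | YELLOW | SMALL | DIP | ADULT | F |
| 3 | YELLOW | SMALL | DIP | CHILD | F |
| 4 | YELLOW | LARGE | STRETCH | ADULT | T |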


## Fitting

```python
import prince

mca = prince.MCA(
    n_components=3,
    n_iter=3,
    copy=True,
    check_input=True,
    engine='sklearn',
    random_state=42
)
mca = mca.fit(dataset)
```

MCA one-hot encodes the dataset and then fits a correspondence analysis to the result. If your dataset is already one-hot encoded, pass `one_hot=False` to skip the encoding step.

```python
one_hot = pd.get_dummies(dataset)

mca_no_one_hot = prince.MCA(one_hot=False)
mca_no_one_hot = mca_no_one_hot.fit(one_hot)
```

Both Benzécri and Greenacre corrections are available. No correction is applied by default.

```python
mca_without_correction = prince.MCA(correction=None)
mca_with_benzecri_correction = prince.MCA(correction='benzecri')
mca_with_greenacre_correction = prince.MCA(correction='greenacre')
```
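For intuition about what the Benzécri correction does, the sketch below applies the textbook formula from Abdi's paper (see Resources): with `K` categorical variables, an eigenvalue `λ` above `1/K` becomes `(K / (K - 1) * (λ - 1/K))²`, and the others are set to zero. The `benzecri` helper is hypothetical and written for this page. It is not part of prince, whose internals may differ. Greenacre's variant additionally rescales the percentages of variance and is not shown here.

```python
import numpy as np

def benzecri(eigenvalues, K):
    """Textbook Benzécri correction for K categorical variables (illustrative helper)."""
    lam = np.asarray(eigenvalues, dtype=float)
    corrected = (K / (K - 1) * (lam - 1 / K)) ** 2
    # Eigenvalues at or below the average eigenvalue 1/K are set to zero.
    return np.where(lam > 1 / K, corrected, 0.0)

# Rounded eigenvalues from the Eigenvalues section; the balloons dataset has K = 5 variables.
corrected = benzecri([0.402, 0.211, 0.186], K=5)
print(corrected)
```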

## Eigenvalues

```python
mca.eigenvalues_summary
```

| component | eigenvalue | % of variance | % of variance (cumulative) |
|---|---|---|---|
| 0 | 0.402 | 40.17% | 40.17% |
| 1 | 0.211 | 21.11% | 61.28% |
| 2 | 0.186 | 18.56% | 79.84% |
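The percentages above can be checked by hand. The total inertia of a one-hot encoded table is `J / K - 1`, where `J` is the number of categories and `K` the number of variables. The sketch below assumes that each percentage is the eigenvalue divided by that total. It uses the rounded eigenvalues from the summary, so the results match only up to rounding.

```python
n_variables = 5    # K: Color, Size, Action, Age, Inflated
n_categories = 10  # J: two categories per variable

# Total inertia of a one-hot encoded table.
total_inertia = n_categories / n_variables - 1

eigenvalues = [0.402, 0.211, 0.186]  # rounded values from the summary above
shares = [ev / total_inertia for ev in eigenvalues]

cumulative = 0.0
for component, share in enumerate(shares):
    cumulative += share
    print(f"{component}: {share:.1%} ({cumulative:.1%} cumulative)")
```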


## Coordinates

```python
mca.row_coordinates(dataset).head()
```

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 0.705387 | 9.121515e-15 | 0.758639 |
| 1 | -0.386586 | 8.486117e-15 | 0.626063 |
| 2 | -0.386586 | 7.146878e-15 | 0.626063 |
| 3 | -0.852014 | 7.239295e-15 | 0.562447 |
| 4 | 0.783539 | -0.6333333 | 0.130201 |

```python
mca.column_coordinates(dataset).head()
```

| | 0 | 1 | 2 |
|---|---|---|---|
| Color__PURPLE | 0.117308 | 0.6892024 | -0.64127 |
| Color__YELLOW | -0.130342 | -0.7657805 | 0.712523 |
| Size__LARGE | 0.117308 | -0.6892024 | -0.64127 |
| Size__SMALL | -0.130342 | 0.7657805 | 0.712523 |
| Action__DIP | -0.853864 | -1.320666e-15 | -0.07934 |


## Visualization

```python
mca.plot(
    dataset,
    x_component=0,
    y_component=1,
    show_column_markers=True,
    show_row_markers=True,
    show_column_labels=False,
    show_row_labels=False
)
```

## Contributions

```python
mca.row_contributions_.head().style.format('{:.0%}')
```

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 7% | 0% | 16% |
| 1 | 2% | 0% | 11% |
| 2 | 2% | 0% | 11% |
| 3 | 10% | 0% | 9% |
| 4 | 8% | 10% | 0% |

```python
mca.column_contributions_.head().style.format('{:.0%}')
```

| | 0 | 1 | 2 |
|---|---|---|---|
| Color__PURPLE | 0% | 24% | 23% |
| Color__YELLOW | 0% | 26% | 26% |
| Size__LARGE | 0% | 24% | 23% |
| Size__SMALL | 0% | 26% | 26% |
| Action__DIP | 15% | 0% | 0% |


## Cosine similarities

```python
mca.row_cosine_similarities(dataset).head()
```

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 0.461478 | 7.716676e-29 | 0.533786 |
| 1 | 0.152256 | 7.336665e-29 | 0.399316 |
| 2 | 0.152256 | 5.203713e-29 | 0.399316 |
| 3 | 0.653335 | 4.716665e-29 | 0.284712 |
| 4 | 0.592606 | 0.3871772 | 0.016363 |

```python
mca.column_cosine_similarities(dataset).head()
```

| | 0 | 1 | 2 |
|---|---|---|---|
| Color__PURPLE | 0.01529 | 0.5277778 | 0.45692 |
| Color__YELLOW | 0.01529 | 0.5277778 | 0.45692 |
| Size__LARGE | 0.01529 | 0.5277778 | 0.45692 |
| Size__SMALL | 0.01529 | 0.5277778 | 0.45692 |
| Action__DIP | 0.530243 | 1.268479e-30 | 0.004578 |


## Controlling the one-hot encoding

- `one_hot_prefix_sep` allows you to specify the separator used to prefix the one-hot encoded columns. By default, it is set to `__`.
- `one_hot_columns_to_drop` allows you to specify which one-hot encoded columns should be dropped before fitting the MCA. This is useful if you want to drop some columns that are not relevant for the analysis, or if you want to avoid multicollinearity issues. It leads to so-called "subset MCA".
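The labels passed to `one_hot_columns_to_drop` have to match the generated column names, which combine the variable name, the separator, and the category. A quick way to list the candidates is sketched below. It assumes the names follow the same pattern as `pd.get_dummies` with `prefix_sep`, which is consistent with the outputs on this page. The `toy` frame is invented for illustration.

```python
import pandas as pd

# Toy frame, invented for illustration; the balloons dataset has five columns.
toy = pd.DataFrame({
    "Color": ["YELLOW", "PURPLE"],
    "Action": ["STRETCH", "DIP"],
})

# Assumption: one-hot column names look like "<variable><separator><category>".
candidates = pd.get_dummies(toy, prefix_sep="@").columns.tolist()
print(candidates)
```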

```python
mca = prince.MCA(
    one_hot_prefix_sep="@",
    one_hot_columns_to_drop=['Color@PURPLE', 'Action@STRETCH']
)
mca = mca.fit(dataset)
mca.column_coordinates(dataset)
```

| | 0 | 1 |
|---|---|---|
| Color@YELLOW | -0.006804 | -0.114754 |
| Size@LARGE | 0.196862 | -0.938999 |
| Size@SMALL | -0.049813 | 1.080353 |
| Action@DIP | -0.562462 | 0.004154 |
| Age@ADULT | 0.823161 | 0.082031 |
| Age@CHILD | -0.941808 | -0.071144 |
| Inflated@F | -0.681104 | -0.024378 |
| Inflated@T | 1.384794 | 0.089388 |
