+++
title = "Multiple correspondence analysis"
menu = "main"
weight = 3
toc = true
aliases = ["mca"]
+++

## Resources

- [Computation of Multiple Correspondence Analysis, with code in R](https://core.ac.uk/download/pdf/6591520.pdf)

## Data

Multiple correspondence analysis (MCA) is an extension of correspondence analysis (CA). Use it when you have more than two categorical variables. The idea is to one-hot encode the dataset and then apply correspondence analysis to the resulting indicator matrix.
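To make the one-hot encoding step concrete, here is a minimal sketch using `pandas.get_dummies` on a small made-up frame. The `toy` data and variable names are hypothetical and are not part of the balloons dataset.

```python
import pandas as pd

# Hypothetical toy data with two categorical variables.
toy = pd.DataFrame({
    'Color': ['YELLOW', 'PURPLE', 'YELLOW'],
    'Size': ['SMALL', 'LARGE', 'LARGE'],
})

# One indicator column per (variable, category) pair.
indicator = pd.get_dummies(toy, dtype=int)

print(indicator)
```

Each row of the indicator matrix contains exactly one `1` per original variable, so every row sums to the number of variables.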

As an example, we're going to use the [balloons dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/balloons/) taken from the [UCI datasets website](https://archive.ics.uci.edu/ml/datasets.html).

In [1]:
import pandas as pd

dataset = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/balloons/adult+stretch.data')
dataset.columns = ['Color', 'Size', 'Action', 'Age', 'Inflated']
dataset.head()

| | Color | Size | Action | Age | Inflated |
|---|---|---|---|---|---|
| 0 | YELLOW | SMALL | STRETCH | ADULT | T |
| 1 | YELLOW | SMALL | STRETCH | CHILD | F |
| 2 | YELLOW | SMALL | DIP | ADULT | F |
| 3 | YELLOW | SMALL | DIP | CHILD | F |
| 4 | YELLOW | LARGE | STRETCH | ADULT | T |


## Fitting

In [2]:
import prince

mca = prince.MCA(
    n_components=3,
    n_iter=3,
    copy=True,
    check_input=True,
    engine='sklearn',
    random_state=42
)
mca = mca.fit(dataset)

MCA one-hot encodes the dataset and then fits a correspondence analysis on the result. If your dataset is already one-hot encoded, pass `one_hot=False` to skip the encoding step.

In [3]:
one_hot = pd.get_dummies(dataset)

mca_no_one_hot = prince.MCA(one_hot=False)
mca_no_one_hot = mca_no_one_hot.fit(one_hot)
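The two steps described above (one-hot encoding, then correspondence analysis) can be sketched by hand with NumPy. This is a minimal illustration on hypothetical toy data, using the textbook formulation of correspondence analysis as an SVD of standardized residuals. The names `toy`, `Z`, `P` and `S` are illustrative, and Prince's own implementation may differ in details.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; any DataFrame of categorical columns would do.
toy = pd.DataFrame({
    'Color': ['YELLOW', 'YELLOW', 'PURPLE', 'PURPLE'],
    'Size': ['SMALL', 'LARGE', 'SMALL', 'LARGE'],
})

# Step 1: one-hot encode into an indicator matrix.
Z = pd.get_dummies(toy).to_numpy(dtype=float)

# Step 2: correspondence analysis of the indicator matrix.
P = Z / Z.sum()        # correspondence matrix
r = P.sum(axis=1)      # row masses
c = P.sum(axis=0)      # column masses

# Standardized residuals: deviation from independence, scaled by the masses.
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

U, s, Vt = np.linalg.svd(S, full_matrices=False)
eigenvalues = s ** 2

# Principal coordinates of rows and columns.
row_coords = (U * s) / np.sqrt(r)[:, None]
col_coords = (Vt.T * s) / np.sqrt(c)[:, None]

print(eigenvalues.round(3))
```

In this toy example the eigenvalues sum to 1, which is the number of categories divided by the number of variables, minus one (4 / 2 - 1).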

## Eigenvalues

In [4]:
mca.eigenvalues_summary

| component | eigenvalue | % of variance | % of variance (cumulative) |
|---|---|---|---|
| 0 | 0.402 | 40.17% | 40.17% |
| 1 | 0.211 | 21.11% | 61.28% |
| 2 | 0.186 | 18.56% | 79.84% |
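The percentages above are consistent with dividing each eigenvalue by the total inertia. For an indicator matrix, the total inertia equals the number of categories divided by the number of variables, minus one. The sketch below redoes that arithmetic from the rounded eigenvalues, assuming each of the five variables has two categories. The variable names are illustrative, and the results only match the table up to rounding.

```python
from itertools import accumulate

eigenvalues = [0.402, 0.211, 0.186]  # rounded values from the table above
n_variables = 5
n_categories = 10  # assumption: two categories per variable

# Total inertia of an indicator matrix: J / K - 1.
total_inertia = n_categories / n_variables - 1

shares = [ev / total_inertia for ev in eigenvalues]
cumulative = list(accumulate(shares))

for component, (share, cum) in enumerate(zip(shares, cumulative)):
    print(f'{component}: {share:.1%} ({cum:.1%} cumulative)')
```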


## Coordinates

In [5]:
mca.row_coordinates(dataset).head()

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 0.705387 | 9.509676e-15 | 0.758639 |
| 1 | -0.386586 | 7.593937e-15 | 0.626063 |
| 2 | -0.386586 | 6.106546e-15 | 0.626063 |
| 3 | -0.852014 | 5.547435e-15 | 0.562447 |
| 4 | 0.783539 | -0.6333333 | 0.130201 |


In [6]:
mca.column_coordinates(dataset).head()

| | 0 | 1 | 2 |
|---|---|---|---|
| Color_PURPLE | 0.117308 | 0.6892024 | -0.64127 |
| Color_YELLOW | -0.130342 | -0.7657805 | 0.712523 |
| Size_LARGE | 0.117308 | -0.6892024 | -0.64127 |
| Size_SMALL | -0.130342 | 0.7657805 | 0.712523 |
| Action_DIP | -0.853864 | -2.712409e-15 | -0.07934 |


## Visualization

In [7]:
mca.plot(
    dataset,
    x_component=0,
    y_component=1,
    show_column_markers=True,
    show_row_markers=True,
    show_column_labels=False,
    show_row_labels=False
)

## Contributions

In [8]:
mca.row_contributions_.head().style.format('{:.0%}')

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 7% | 0% | 16% |
| 1 | 2% | 0% | 11% |
| 2 | 2% | 0% | 11% |
| 3 | 10% | 0% | 9% |
| 4 | 8% | 10% | 0% |


In [9]:
mca.column_contributions_.head().style.format('{:.0%}')

| | 0 | 1 | 2 |
|---|---|---|---|
| Color_PURPLE | 0% | 24% | 23% |
| Color_YELLOW | 0% | 26% | 26% |
| Size_LARGE | 0% | 24% | 23% |
| Size_SMALL | 0% | 26% | 26% |
| Action_DIP | 15% | 0% | 0% |


## Cosine similarities

In [10]:
mca.row_cosine_similarities(dataset).head()

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 0.461478 | 8.387409e-29 | 0.533786 |
| 1 | 0.152256 | 5.875091e-29 | 0.399316 |
| 2 | 0.152256 | 3.799023e-29 | 0.399316 |
| 3 | 0.653335 | 2.769663e-29 | 0.284712 |
| 4 | 0.592606 | 0.3871772 | 0.016363 |


In [11]:
mca.column_cosine_similarities(dataset).head()

| | 0 | 1 | 2 |
|---|---|---|---|
| Color_PURPLE | 0.01529 | 0.5277778 | 0.45692 |
| Color_YELLOW | 0.01529 | 0.5277778 | 0.45692 |
| Size_LARGE | 0.01529 | 0.5277778 | 0.45692 |
| Size_SMALL | 0.01529 | 0.5277778 | 0.45692 |
| Action_DIP | 0.530243 | 5.350665e-30 | 0.004578 |
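A squared cosine is commonly defined as the share of a point's squared distance to the origin that is carried by a single axis. The sketch below applies that definition to hypothetical coordinates. It assumes `coords` holds the principal coordinates on all axes, and it is not taken from Prince's source code.

```python
import numpy as np

# Hypothetical principal coordinates of three points on every axis (two here).
coords = np.array([
    [3.0, 4.0],
    [1.0, 0.0],
    [-2.0, 2.0],
])

squared = coords ** 2

# Share of each point's squared distance to the origin carried by each axis.
cos2 = squared / squared.sum(axis=1, keepdims=True)

print(cos2)
```

When only a subset of the components is kept, as in the tables above, the squared cosines of a point can sum to less than 1.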


## Handling unknown categories

The MCA implementation in Prince follows scikit-learn's fit/transform API, so you can use the `transform` method to project new data. New data may contain categories that were not present in the training data. By default, MCA raises an error when it encounters such a category. To change this behavior, set the `handle_unknown` parameter to `'ignore'`, in which case unknown categories are ignored.

In [12]:
dataset = pd.DataFrame({
    'var1': ['c', 'a', 'b', 'c'],
    'var2': ['x', 'y', 'y', 'z']
})

mca = prince.MCA(n_components=2, random_state=42, handle_unknown='ignore')
mca.fit(dataset[:3])
mca.transform(dataset[-1:])

| | 0 | 1 |
|---|---|---|
| 3 | 1.414214 | 3.326586e-16 |
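One way to picture what ignoring an unknown category means at the one-hot level is to align the new rows on the columns seen during training. The sketch below does this with plain pandas. It illustrates the idea only and makes no claim about how Prince implements `handle_unknown` internally. The names `train`, `new` and `train_columns` are hypothetical.

```python
import pandas as pd

train = pd.DataFrame({'var1': ['c', 'a', 'b'], 'var2': ['x', 'y', 'y']})
new = pd.DataFrame({'var1': ['c'], 'var2': ['z']}, index=[3])

# Columns produced by one-hot encoding the training data.
train_columns = pd.get_dummies(train, dtype=int).columns

# Align the new rows on the training columns: the unseen 'var2_z'
# column is dropped, and training columns absent from `new` are set to 0.
new_one_hot = (
    pd.get_dummies(new, dtype=int)
    .reindex(columns=train_columns, fill_value=0)
)

print(new_one_hot)
```

The new row keeps its known category (`var1_c`) and contributes nothing for `var2`, since `z` was never seen during training.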
