#**MCA Example**

In the example below we use a toy dataset known as the [balloons dataset](http://vxy10.github.io/2016/06/10/intro-MCA/), which was taken from the [UCI datasets](https://archive.ics.uci.edu/ml/datasets.html). This dataset follows the most common format for categorical variables, so hopefully you will grasp the idea fairly quickly.

Install the package named "mca". There is also a package called "prince", but I found that a lot of it was deprecated, and I found "mca" easier to use.

In [None]:
#!pip install  prince
!pip install  mca

Now let's import the relevant libraries.

In [None]:
import mca
import pandas as pd
import numpy as np

np.set_printoptions(formatter={'float': '{: 0.4f}'.format})
pd.set_option('display.precision', 5)
pd.set_option('display.max_columns', 25)

Reading the dataset from UCI.

In [None]:
import pandas as pd
# The file has no header row, so header=None keeps the first record as data.
X = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/balloons/adult+stretch.data', header=None)

This example reads the data straight into pandas from the UCI repository. We also use the [LabelBinarizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html) to convert the multi-class labels to binary variables. If you would like to know more about this, go to [Machine Learning Mastery](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html) for a discussion of how and why we do this.

In [None]:
from sklearn.preprocessing import LabelBinarizer
X.columns = ['Color', 'Size', 'Action', 'Age', 'Inflated']

# Every column has two categories, so LabelBinarizer returns a single 0/1 column;
# ravel() flattens the (n, 1) result before it is assigned back.
lb = LabelBinarizer()
for col in X.columns:
    X[col] = lb.fit_transform(X[col]).ravel()
print(X)
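MCA is usually described in terms of an indicator (one-hot) matrix, with one 0/1 column per category rather than one column per variable. As a minimal sketch of what that coding looks like, the snippet below applies `pd.get_dummies` to a few made-up rows. The values in `toy` are illustrative assumptions, not records from the UCI file.

```python
import pandas as pd

# Hypothetical rows shaped like the balloons data (not real records).
toy = pd.DataFrame({
    'Color': ['YELLOW', 'PURPLE', 'YELLOW', 'PURPLE'],
    'Size':  ['SMALL', 'SMALL', 'LARGE', 'LARGE'],
})

# One 0/1 column per category; each row has exactly one 1 per variable.
indicator = pd.get_dummies(toy, dtype=int)
print(indicator)

# Every row sums to the number of original variables (here 2).
print(indicator.sum(axis=1).tolist())
```

With two variables of two categories each, the indicator matrix has four columns, and each row sum equals the number of variables.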

MCA calculations are implemented via the MCA object. By default it applies the Benzécri correction to the eigenvalues, so the benzecri flag has to be set to False to get the uncorrected values. Below is the list of attributes of the MCA object:

- .L Eigenvalues.
- .F Factor scores for columns (components are linear combinations of columns).
- .G Factor scores for rows (components are linear combinations of rows).
- .expl_var Explained variance.
- .fs_r Projections onto the factor space; can also be computed by applying fs_r_sup to each of the row elements.
- .cos_r Squared cosine between the $i^{th}$ vector and the $j^{th}$ factor (or row eigenvector).
- .cont_r Contribution of individual categorical variable to the factor.
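To make the eigenvalues and factor scores listed above more concrete, here is a rough from-scratch sketch of the textbook steps: correspondence analysis of an indicator matrix via an SVD. It is not the "mca" package's implementation, and the matrix `X_ind` and the variable names are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical indicator matrix: 4 rows, K = 2 variables with 2 categories each.
X_ind = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
], dtype=float)
K = 2  # number of original categorical variables

Z = X_ind / X_ind.sum()   # correspondence matrix (entries sum to 1)
r = Z.sum(axis=1)         # row masses
c = Z.sum(axis=0)         # column masses

# Standardized residuals: D_r^{-1/2} (Z - r c^T) D_c^{-1/2}
S = (Z - np.outer(r, c)) / np.sqrt(np.outer(r, c))

U, s, Vt = np.linalg.svd(S, full_matrices=False)
eigenvalues = s ** 2                             # principal inertias
row_scores = (U * s) / np.sqrt(r)[:, None]       # row factor scores
col_scores = (Vt.T * s) / np.sqrt(c)[:, None]    # column factor scores

print(np.round(eigenvalues, 4))
# For an indicator matrix the total inertia equals J/K - 1 (J = number of categories).
print(np.round(eigenvalues.sum(), 4), X_ind.shape[1] / K - 1)
```

The identity checked in the last line is a handy sanity test for any hand-rolled version of these steps.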

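In the code below you will see how we pass the dataset to the MCA object. We have selected "benzecri=False" because, when left at True, the correction discards the factors with small eigenvalues (those at or below 1/K, where K is the number of variables) and so tends to "chop" the result down a bit too much.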

We also print the factor scores for each row, the contribution of each row to each factor, and the eigenvalue of each factor.
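To see what the Benzécri correction does to the eigenvalues, here is a small sketch of the standard formula, $\lambda_c = \left(\frac{K}{K-1}\left(\lambda - \frac{1}{K}\right)\right)^2$ for $\lambda > 1/K$. The helper `benzecri_correction` and the numbers in `raw` are hypothetical and only serve as an illustration; they are not output from this notebook.

```python
import numpy as np

def benzecri_correction(eigenvalues, K):
    """Benzécri-corrected eigenvalues for K categorical variables."""
    eigenvalues = np.asarray(eigenvalues, dtype=float)
    # Factors with eigenvalues at or below 1/K are dropped entirely.
    kept = eigenvalues[eigenvalues > 1.0 / K]
    return (K / (K - 1.0) * (kept - 1.0 / K)) ** 2

# Hypothetical raw eigenvalues for K = 5 variables (illustrative numbers only).
raw = np.array([0.40, 0.25, 0.18, 0.10, 0.07])
corrected = benzecri_correction(raw, K=5)

print(len(raw), len(corrected))   # number of factors before and after
print(np.round(corrected, 4))
```

In this made-up case only the factors above 1/K = 0.2 survive, which is the "chopping" effect that setting "benzecri=False" avoids.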

In [None]:
import mca
print(X.columns)
#cols=['Color', 'Size', 'Action','Age', 'Inflated']
# benzecri=False keeps the uncorrected eigenvalues
mca1 = mca.MCA(X, cols=['Color', 'Size', 'Action', 'Age', 'Inflated'], benzecri=False)
#mca1 = mca.MCA(X,benzecri=False)
#print(mca1.L)
print("Factor scores for each row")
print(mca1.fs_r(N=5))
print("Contribution from each row to the components")
print(mca1.cont_r(N=5))
print("Eigenvalues:", mca1.L, "Total accumulated variance:", mca1.L.sum())
#mca_fit.total_inertia_
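The eigenvalues can be turned into a share of the total variance per factor, which helps when deciding how many factors to keep. The sketch below uses made-up eigenvalues in place of `mca1.L`, and the 90% threshold is an arbitrary assumption.

```python
import numpy as np

# Hypothetical eigenvalues standing in for mca1.L (illustrative numbers only).
L = np.array([0.40, 0.25, 0.18, 0.10, 0.07])

share = L / L.sum()             # proportion of variance per factor
cumulative = np.cumsum(share)   # running total across factors

# Smallest number of factors whose cumulative share reaches the threshold.
threshold = 0.90
n_keep = int(np.searchsorted(cumulative, threshold) + 1)

print(np.round(share, 3))
print(np.round(cumulative, 3))
print(n_keep)
```

Replacing `L` with `mca1.L` would apply the same calculation to the fitted model.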

In the next piece of code we organize the results into a convenient table.

In [None]:
fs, cos, cont = 'Factor score','Squared cosines', 'Contributions x 1000'
table3 = pd.DataFrame(columns=X.index, index=pd.MultiIndex
                      .from_product([[fs, cos, cont], range(1, 3)]))

table3.loc[fs,    :] = mca1.fs_r(N=2).T
table3.loc[cos,   :] = mca1.cos_r(N=2).T
table3.loc[cont,  :] = mca1.cont_r(N=2).T * 1000

np.round(table3.astype(float), 2)
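For readers wondering how squared cosines and contributions relate to the factor scores, here is a rough sketch under simplifying assumptions: equal row masses, and squared cosines computed over only the two factors shown (with all factors retained, the denominator would be the full squared distance to the centroid). The scores in `F` are invented for illustration.

```python
import numpy as np

# Hypothetical row factor scores: 4 rows x 2 factors (illustrative numbers only).
F = np.array([
    [ 0.8,  0.1],
    [-0.8,  0.3],
    [ 0.2, -0.6],
    [-0.2,  0.2],
])
n_rows = F.shape[0]
mass = np.full(n_rows, 1.0 / n_rows)   # equal row masses

# Eigenvalue of each factor = mass-weighted sum of squared scores.
eigenvalues = (mass[:, None] * F ** 2).sum(axis=0)

# Squared cosine: share of a row's squared distance carried by each factor.
cos2 = F ** 2 / (F ** 2).sum(axis=1, keepdims=True)

# Contribution: share of a factor's inertia that is due to each row.
contrib = mass[:, None] * F ** 2 / eigenvalues

print(np.round(cos2, 3))
print(np.round(contrib * 1000))   # scaled by 1000, like the table's last block
```

Squared cosines sum to 1 across factors for each row, while contributions sum to 1 across rows for each factor, which is why the two blocks of the table read in different directions.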

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

points = table3.loc[fs].values
labels = table3.columns.values
print(points)
plt.figure()
plt.margins(0.1)
plt.axhline(0, color='gray')
plt.axvline(0, color='blue')
plt.xlabel('Factor 1')
plt.ylabel('Factor 2')
plt.scatter(*points, s=120, marker='o', c='r', alpha=1, linewidths=0)
for label, x, y in zip(labels, *points):
    plt.annotate(label, xy=(x, y), xytext=(x + .03, y + .03))
plt.show()

#**Conclusions**

We can see from our analysis that the first 4 components account for the lion's share of the variance. This means we could potentially drop the 5th component and consequently reduce the dimensionality of our dataset.

We have also shown how to create new factors that avoid correlation among the categorical variables.