# 2.3 Multiple Correspondence Analysis

---
> Erick Eduardo Aguilar Hernández:
> * mat.ErickAguilar@ciencias.unam.mx
> * isc.ErickAguilar@gmail.com

### Encoding categorical data
____

This technique creates a map of the associations among the categories of two or more categorical variables. It is a dimensionality reduction method concerned with the interrelationships among the variables. The goal is to convert the numerical information in a contingency table into a two-dimensional graphical display.

Suppose we have a categorical dataset with $n$ observations and $C$ categorical columns, where column $C_i$ takes $J_i$ possible values. The indicator matrix $\mathbf{X}$ then has $n$ rows and $J = \sum_{i=1}^{C} J_i$ columns, one per category. The entry $[\mathbf{X}]_{i,j}=1$ if the i-th individual possesses the category associated with the j-th column, and 0 otherwise. This encoding is well known as one-hot encoding. **Example**:

$$
Dataset = 
\left[ \begin{array}{cc}
\textbf{Sex}  & \textbf{Smoke}  \\
\text{Male}  & \text{Yes}  \\
\text{Male}  & \text{No}  \\
\text{Male}  & \text{Yes}  \\
\text{Female}  & \text{No}  \\
\text{Female}  & \text{Yes}  \\
\text{Female}  & \text{No}  \\
\end{array} \right]
\quad
\mathbf{X} = 
\left[ \begin{array}{cccc}
\textbf{Sex_Male} & \textbf{Sex_Female} & \textbf{Smoke_Yes} & \textbf{Smoke_No}\\
1 & 0 & 1 & 0 \\
1 & 0 & 0 & 1 \\
1 & 0 & 1 & 0 \\
0 & 1 & 0 & 1 \\
0 & 1 & 1 & 0 \\
0 & 1 & 0 & 1 \\
\end{array} \right]
$$
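
As a minimal sketch of this encoding, the toy dataset above can be turned into its indicator matrix with `pandas.get_dummies`. The names `toy_df` and `X_toy` are illustrative and not part of the original notebook, and the columns are reordered by hand only so that they match the matrix shown above.

```python
import pandas as pd

# Toy dataset from the example: two categorical variables, six individuals
toy_df = pd.DataFrame({
    'Sex':   ['Male', 'Male', 'Male', 'Female', 'Female', 'Female'],
    'Smoke': ['Yes', 'No', 'Yes', 'No', 'Yes', 'No'],
})

# One indicator column per (variable, category) pair
X_toy = pd.get_dummies(toy_df, dtype=int)

# get_dummies orders the categories alphabetically; reorder to match the text
X_toy = X_toy[['Sex_Male', 'Sex_Female', 'Smoke_Yes', 'Smoke_No']]
print(X_toy)
```

Each row sums to the number of variables (here 2), because every individual has exactly one category per variable.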

The Burt matrix contains the cross-tabulated counts of every pair of categories:

$$\mathbf{B} = \mathbf{X'X}$$

For the example above:

$$
\mathbf{B} = \mathbf{X'X} =
\left[ \begin{array}{ccccc}
& \textbf{Sex_Male} & \textbf{Sex_Female} & \textbf{Smoke_Yes} & \textbf{Smoke_No}\\
\textbf{Sex_Male} & 3 & 0 & 2 & 1\\
\textbf{Sex_Female} & 0 & 3 & 1 & 2\\
\textbf{Smoke_Yes} & 2 & 1 & 3 & 0\\
\textbf{Smoke_No} & 1 & 2 & 0 & 3\\
\end{array} \right]
$$
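
The Burt matrix of the example can be checked numerically. The sketch below assumes only NumPy and rebuilds the indicator matrix by hand; `X_toy` and `B_toy` are illustrative names.

```python
import numpy as np

# Indicator matrix of the example
# (columns: Sex_Male, Sex_Female, Smoke_Yes, Smoke_No)
X_toy = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
])

# Burt matrix: cross counts between every pair of categories
B_toy = X_toy.T @ X_toy
print(B_toy)
```

The diagonal holds the frequency of each category, and the off-diagonal blocks are the two-way contingency tables between pairs of variables.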

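The Burt matrix is symmetric and positive semidefinite, so it can be factorized through its eigendecomposition, which is linked to the singular value decomposition of $\mathbf{X}$:

$$\mathbf{B}=\mathbf{\Gamma \Lambda \Gamma'} \implies \mathbf{X}=\mathbf{U D \Gamma'}$$

Here $\mathbf{D}=\mathbf{\Lambda}^{1/2}$ holds the singular values of $\mathbf{X}$, since $\mathbf{X'X}=\mathbf{\Gamma D}^2\mathbf{\Gamma'}$ whenever $\mathbf{U}$ has orthonormal columns.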
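
The link between the two factorizations can be verified on the toy example. This is only a sketch of the algebraic relation stated above, using NumPy's `eigh` and `svd`; it does not reproduce the full MCA computation, and the variable names are illustrative.

```python
import numpy as np

# Indicator matrix of the example
# (columns: Sex_Male, Sex_Female, Smoke_Yes, Smoke_No)
X_toy = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
], dtype=float)
B_toy = X_toy.T @ X_toy

# Eigendecomposition of the symmetric Burt matrix (eigh returns ascending order)
eigvals, Gamma = np.linalg.eigh(B_toy)
eigvals_desc = eigvals[::-1]

# Thin SVD of the indicator matrix (singular values come in descending order)
U, d, Vt = np.linalg.svd(X_toy, full_matrices=False)

# The squared singular values of X should equal the eigenvalues of B
print(np.round(eigvals_desc, 6))
print(np.round(d ** 2, 6))
print(np.allclose(d ** 2, eigvals_desc))
```

All eigenvalues are non-negative up to rounding error, which is what positive semidefiniteness requires.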

In [None]:
%matplotlib inline
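import pandas as pd
import numpy as np
import prince as pr
import seaborn as sns  # needed for the sns.pairplot calls below
from matplotlib import pyplot as plt
from sklearn.preprocessing import OneHotEncoder
pd.set_option('display.max_colwidth',100)
plt.style.use('seaborn')  # renamed to 'seaborn-v0_8' in Matplotlib >= 3.6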

In [None]:
# Load the course helper functions (scatterPlot is assumed to be defined there)
%run MvaUtils.py

In [None]:
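# Categorical columns to analyse (an alternative set is kept commented out)
categoricals = ['sex_numeric','embarked_numeric','cabin_numeric']
#categoricals = ['sex_numeric','cabin_numeric','companions']
titanic_df = pd.read_csv('DataSets/TitanicCleaned.csv').dropna()
# Number of relatives aboard (siblings/spouses plus parents/children), capped at 4
titanic_df['companions'] = titanic_df['sibsp']+titanic_df['parch']
titanic_df['companions'] = titanic_df['companions'].apply(lambda row: 4 if row>=4 else row)
# Group cabin codes of 7 and above into a single level
titanic_df['cabin_numeric'] = titanic_df['cabin_numeric'].apply(lambda row: 7 if row>=7 else row)
# Cast the codes to strings so they are treated as labels, dropping the '.0' suffix
for col in categoricals:
    titanic_df[col] = titanic_df[col].apply(lambda row: str(row).replace('.0',''))
titanic_df[categoricals+['target']]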

In [None]:
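# Build the indicator matrix X with one column per category.
# In scikit-learn >= 1.2 use sparse_output=False and get_feature_names_out instead.
ohe = OneHotEncoder(sparse=False).fit(titanic_df[categoricals])
ohe_col_names = ohe.get_feature_names(categoricals)
X_df = pd.DataFrame(ohe.transform(titanic_df[categoricals]),columns=ohe_col_names)
X_df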

In [None]:
X = X_df
n_components = 4
mca_cols = ['mca_'+str(i+1) for i in range(0,n_components)]
mca_model = pr.MCA(n_components=n_components,
                   n_iter=10,
                   copy=True,
                   check_input=True,
                   engine='auto',
                   random_state=42).fit(X)

columns_profiles_df = mca_model.column_coordinates(X).reset_index()
columns_profiles_df.columns = ['category']+mca_cols
display(columns_profiles_df)
plt.figure(figsize=(12,12))
scatterPlot(plt,columns_profiles_df,'mca_1','mca_2',label='category')

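mca_df = mca_model.row_coordinates(X)
mca_df.columns = mca_cols
# Assign by position: X_df has a fresh index, while titanic_df keeps its index from after dropna()
mca_df['target'] = titanic_df['target'].values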

In [None]:
smp = sns.pairplot(mca_df[['mca_1','mca_2','mca_3','mca_4','target']].head(500),hue="target",diag_kind='hist',palette="plasma",height=5)
mca_df

In [None]:
religion_dict={0:'Catholic',1:'Other Christian',2:'Muslim',3:'Buddhist',4:'Hindu',5:'Ethnic',6:'Marxist',7:'Others'}

encoded = ['bars','stripes','red','green','blue','gold','white','black','orange','circles','crosses','saltires','quarters','sunstars','crescent','triangle','icon','animate','text']
flags_df = pd.read_csv('DataSets/Flags.csv',delimiter=",")
flags_df = flags_df.set_index('country')
flags_df['religion'] = flags_df['religion'].apply(lambda row: religion_dict[row])
flags_df['religion'] = np.where(flags_df['religion'].isin(['Other Christian','Catholic']),'Roman',flags_df['religion'])

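# Keep only the two groups being compared; copy to avoid SettingWithCopyWarning below
flags_df = flags_df[flags_df['religion'].isin(['Muslim','Roman'])].copy()
# Binarize the flag attributes: 1 if the feature appears at least once, else 0
for col in encoded:
    flags_df[col] = flags_df[col].apply(lambda row: 1 if row >=1 else 0)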
flags_df[encoded]

In [None]:
X = flags_df[encoded]
n_components = 4
mca_cols = ['mca_'+str(i+1) for i in range(0,n_components)]
mca_model = pr.MCA(n_components=n_components,
                   n_iter=10,
                   copy=True,
                   check_input=True,
                   engine='auto',
                   random_state=42).fit(X)

columns_profiles_df = mca_model.column_coordinates(X).reset_index()
columns_profiles_df.columns = ['category']+mca_cols
display(columns_profiles_df)
plt.figure(figsize=(10,10))
scatterPlot(plt,columns_profiles_df,'mca_1','mca_2',label='category')

mca_df = mca_model.row_coordinates(X)
mca_df.columns = mca_cols
mca_df['religion'] = flags_df['religion']
smp = sns.pairplot(mca_df,hue="religion",diag_kind='hist',palette="plasma",height=5)
mca_df