# Correspondence Analysis with <a href="https://cran.r-project.org/"><img src="https://cran.r-project.org/Rlogo.svg" style="max-width: 40px; display: inline" alt="R"/></a>

In [None]:
library(ggplot2)

With [`R`](https://cran.r-project.org/), the reference packages for factor analysis are (i) [`FactoMineR`](http://factominer.free.fr/index_fr.html) for the analysis and (ii) [`factoextra`](https://cran.r-project.org/web/packages/factoextra/index.html) for data visualization, both of which we have already used. Here, we will use these packages to perform a correspondence analysis (part 1) and a multiple correspondence analysis (part 2).

In [None]:
library(FactoMineR)
library(factoextra)

## **CA** - Correspondence Analysis

In this section, we look at a dataset listing the division of domestic tasks within a couple. This dataset is available in the `R` package `factoextra`. 

### The data

The data form a contingency table of 13 household tasks and how they are divided within the couple:
- rows are the different tasks
- values are the frequencies with which each task is done:
    - by the _wife_ only,
    - _alternatively_,
    - by the _husband_ only,
    - or _jointly_.
    
The data for this example deal with married women’s integration into the labour market in **1977**. Dual-wage earning, newly-wed, childless couples from five German cities were asked who was responsible for each of the following tasks:

1. taking care of the laundry, 
2. preparing main meals, 
3. making dinner, 
4. preparing breakfast, 
5. tidying the house, 
6. washing dishes, 
7. shopping, 
8. taking care of official matters (tax returns, etc.), 
9. automobile driving, 
10. taking care of financial matters, 
11. taking care of insurance matters, 
12. minor household repairs, 
13. planning trips and vacations.

Complete data were obtained from 223 couples. However, the data reported here cover only the couples whose two members, questioned independently, agreed on how each task was divided.


Further (sociological) information can be found in the original article by Thiessen, Rohlinger and Blasius (1994), and a more detailed description is available on page 379 (13) of the article by [Kroonenberg and Lombardo (1999)](https://www.researchgate.net/profile/Rosaria-Lombardo/publication/28648781_Nonsymmetric_Correspondence_Analysis_A_Tool_for_Analysing_Contingency_TablesWith_a_Dependence_Structure/links/55689ad908aec22683032b28/Nonsymmetric-Correspondence-Analysis-A-Tool-for-Analysing-Contingency-TablesWith-a-Dependence-Structure.pdf?origin=journalDetail&_tp=eyJwYWdlIjoiam91cm5hbERldGFpbCJ9): Non-symmetric correspondence analysis, A tool for analysing contingency tables with a dependence structure. Multivariate Behavioral Research 34(3).

In [None]:
data(housetasks)
housetasks

##### <span style="color:purple">**Todo:** Visualize this contingency table.</span>

- A contingency table can be visualized using the function `balloonplot` from the [`gplots`](https://cran.r-project.org/web/packages/gplots/index.html) package. This function draws a graphical matrix where each cell contains a dot whose size reflects the relative magnitude of the corresponding component;
- It is also possible to visualize a contingency table as a mosaic plot, for instance with the `mosaicplot` function from the [`graphics`](https://cran.r-project.org/web/packages/RGraphics/index.html) package.
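
As a rough sketch of the two options listed above (one possible way to do it, not the reference solution), both functions can be applied to the table after converting the data frame to a `table` object. The plot titles and display options are illustrative choices.

```r
library(factoextra)  # provides the housetasks data
library(gplots)      # provides balloonplot()

data(housetasks)

# Both plotting functions work on a table object rather than a data frame
tab <- as.table(as.matrix(housetasks))

# Balloon plot: the size of each dot reflects the magnitude of the cell count
balloonplot(t(tab), main = "housetasks", xlab = "", ylab = "",
            label = FALSE, show.margins = FALSE)

# Mosaic plot: tile areas reflect the counts, shading reflects the residuals
# with respect to the independence model
mosaicplot(tab, shade = TRUE, las = 2, main = "housetasks")
```
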

In [None]:
library(gplots)
library(graphics)

In [None]:
# solutions/ca/visualize.r

In [None]:
## TO BE COMPLETED ##

##### <span style="color:purple">**Question:** Which task did husband and wife agree on most (in terms of distribution)?</span>

Remember that the table only lists the answers for which both members of the couple agreed on the level of allocation.

In [None]:
# solutions/ca/common.r

In [None]:
## TO BE COMPLETED ##

##### <span style="color:purple">**Question:** Is the distribution of household tasks significantly unequal, _i.e._ not evenly distributed?</span>
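
A common tool for this kind of question is the chi-square test of independence between rows and columns. The sketch below works on a small hypothetical table (the counts and labels are invented for illustration only) to show how the statistic is built from observed and expected counts, and checks it against `chisq.test`. The same call applies to `housetasks`.

```r
# Hypothetical 3 x 2 table of counts, used only for illustration
toy <- matrix(c(30, 10, 20,
                10, 30, 20),
              nrow = 3,
              dimnames = list(task = c("A", "B", "C"),
                              who  = c("wife", "husband")))

n        <- sum(toy)
# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(toy), colSums(toy)) / n
# Chi-square statistic: sum of squared deviations scaled by the expected counts
chi2     <- sum((toy - expected)^2 / expected)
df       <- (nrow(toy) - 1) * (ncol(toy) - 1)
p_value  <- pchisq(chi2, df = df, lower.tail = FALSE)

# Built-in test, which should agree with the manual computation
test <- chisq.test(toy)
print(c(manual = chi2, builtin = unname(test$statistic)))
```

A small p-value leads to rejecting the hypothesis that the allocation of a task is independent of the task itself.
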

In [None]:
# solutions/ca/dependence.r

In [None]:
## TO BE COMPLETED ##

### Correspondence Analysis

With the `CA` function of [`FactoMineR`](http://factominer.free.fr/index_fr.html), we can perform Correspondence Analysis.

In [None]:
res.ca = CA(housetasks, graph=FALSE)
print(res.ca)

##### <span style="color:purple">**Question:** How many dimensions should we keep for further analysis?</span>
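
A possible starting point, given as a sketch rather than the reference solution, is to inspect the eigenvalues stored in the `CA` result and to draw a scree plot. The code also checks a standard property of CA: the total inertia equals the chi-square statistic divided by the grand total of the table.

```r
library(FactoMineR)
library(factoextra)

data(housetasks)
res.ca <- CA(housetasks, graph = FALSE)

# Eigenvalues, percentage of inertia and cumulative percentage per dimension
print(res.ca$eig)

# Total inertia of the CA versus chi-square statistic / grand total
total_inertia <- sum(res.ca$eig[, 1])
chi2 <- unname(chisq.test(housetasks)$statistic)
print(all.equal(total_inertia, chi2 / sum(housetasks)))

# Scree plot of the percentage of inertia carried by each dimension
fviz_screeplot(res.ca, addlabels = TRUE)
```

With 13 rows and 4 columns there are at most min(13 - 1, 4 - 1) = 3 dimensions, so the choice is between one, two or three axes.
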

In [None]:
# solutions/ca/dimensions.r

In [None]:
## TO BE COMPLETED ##

##### <span style="color:purple">**Todo:** View the biplot.</span>

- What conclusions can you draw?
- One can use the `fviz_ca_biplot` function.
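
A minimal sketch of the call mentioned above, with `repel = TRUE` added as an optional choice to limit label overlap:

```r
library(FactoMineR)
library(factoextra)

data(housetasks)
res.ca <- CA(housetasks, graph = FALSE)

# Symmetric biplot: rows (tasks) and columns (who does the task) on the same map
p <- fviz_ca_biplot(res.ca, repel = TRUE)
print(p)
```

In this symmetric display, distances between two rows or between two columns can be interpreted, whereas the distance between a row and a column cannot be read directly.
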

In [None]:
# solutions/ca/biplot.r

In [None]:
## TO BE COMPLETED ##

### Quality of representation

##### <span style="color:purple">**Question:** Which task is best represented by axis 1?</span>

In [None]:
# solutions/ca/representation.r

In [None]:
## TO BE COMPLETED ##

##### <span style="color:purple">**Question:** Which task is best represented in the CA plane? Which is least well represented?</span>

- View the projection of tasks in the CA plane colored according to the quality of their representation.
- One may use the `fviz_ca_row` function
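
As a sketch of the approach hinted at above, the quality of representation in the plane can be taken as the sum of the $\cos^2$ over the first two dimensions, and the same quantity can drive the colour of the points. The colour gradient is an arbitrary choice.

```r
library(FactoMineR)
library(factoextra)

data(housetasks)
res.ca <- CA(housetasks, graph = FALSE)

# Quality of representation of each task in the plane (dimensions 1 and 2)
quality <- rowSums(res.ca$row$cos2[, 1:2])
print(sort(quality, decreasing = TRUE))

# Row points colored according to their cos2
p <- fviz_ca_row(res.ca, col.row = "cos2",
                 gradient.cols = c("blue", "yellow", "red"),
                 repel = TRUE)
print(p)
```
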

In [None]:
# solutions/ca/row_representation.r

In [None]:
## TO BE COMPLETED ##

##### <span style="color:purple">**Todo:** Carry out the same study with the columns in the table.</span>

In [None]:
# solutions/ca/col_representation.r

In [None]:
## TO BE COMPLETED ##

### Contribution to the dimensions

##### <span style="color:purple">**Question:** Which task contributes most to axis 2?</span>

In [None]:
# solutions/ca/contribution.r

In [None]:
## TO BE COMPLETED ##

##### <span style="color:purple">**Todo:** How does each task contribute to axes 1 and 2?</span>

- Draw a bar plot of row contributions, to dimension 1 on the one hand, to dimension 2 on the other, and to the combination of these two dimensions (CA map).
- One can use the `fviz_contrib` function.
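
The three bar plots described above could be sketched as follows. This is one possible way to do it, not the reference solution.

```r
library(FactoMineR)
library(factoextra)

data(housetasks)
res.ca <- CA(housetasks, graph = FALSE)

# Contributions (in percent) of the rows to each dimension
contrib <- res.ca$row$contrib
print(round(contrib, 2))

# Bar plots: dimension 1, dimension 2, then both combined
print(fviz_contrib(res.ca, choice = "row", axes = 1))
print(fviz_contrib(res.ca, choice = "row", axes = 2))
print(fviz_contrib(res.ca, choice = "row", axes = 1:2))
```

The dashed reference line drawn by `fviz_contrib` marks the contribution expected if all rows contributed equally, that is 100/13, or about 7.7%, for a single axis.
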

In [None]:
# solutions/ca/row_contribution.r

In [None]:
## TO BE COMPLETED ##

##### <span style="color:purple">**Todo:** Interpret the axes</span>

In [None]:
# solutions/ca/axes.r

In [None]:
## TO BE COMPLETED ##

##### <span style="color:purple">**Todo:** View the contribution of each column to each dimension.</span>

- Use the `corrplot` function from the eponymous package to highlight the columns that contribute most to each dimension.
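
A minimal sketch of this idea: the contribution matrix is not a correlation matrix, hence `is.corr = FALSE`.

```r
library(FactoMineR)
library(factoextra)
library(corrplot)

data(housetasks)
res.ca <- CA(housetasks, graph = FALSE)

# Contributions (in percent) of the four columns to each dimension
col_contrib <- res.ca$col$contrib

# Larger, darker dots indicate larger contributions
corrplot(col_contrib, is.corr = FALSE)
```
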

In [None]:
library(corrplot)

In [None]:
# solutions/ca/col_contrib.r

In [None]:
## TO BE COMPLETED ##

---

## **MCA** - Multiple Correspondence Analysis

We now turn to Multiple Correspondence Analysis (MCA), an extension of simple Correspondence Analysis (CA) to more than two categorical variables.

### The data

We are going to use data on leisure activities. These data are taken from a 2003 INSEE survey on identity construction, known as the "life history" survey, in which 8403 people were interviewed.

The survey contains two types of variables:
- Leisure activities (18 questions): which of the following do you practise regularly? Reading, Listening to music, Cinema, Shows, Exhibitions, Computer, Sport, Walking, Travel, Playing a musical instrument, Collecting, Voluntary work, Home improvement, Gardening, Knitting, Cooking, Fishing, plus the average number of hours of TV per day.
- Additional variables (4 questions): sex, age, profession, marital status.

These data are available in the `hobbies` CSV file.

##### <span style="color:purple">**Todo:** Load the data and carry out any necessary transformations.</span>

In [None]:
# solutions/mca/data.r

In [None]:
## TO BE COMPLETED ##

hobbies = ...

##### <span style="color:purple">**Question:** Are some leisure activities more popular than others? Is any age group over-represented in this dataset?</span>

- For each variable, plot the frequency of variable categories.
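
One way to sketch this with base R is shown below. It assumes that the copy of `hobbies` bundled with `FactoMineR` can stand in for the CSV file, with the 22 qualitative variables in columns 1 to 22. If your file is organized differently, adapt the column indices.

```r
library(FactoMineR)

# Assumption: the dataset shipped with FactoMineR matches the CSV file
data(hobbies)

# One frequency table per qualitative variable, missing values counted separately
freq <- lapply(hobbies[, 1:22], function(v) table(v, useNA = "ifany"))

# Bar plots for the first four variables; change the index to see the others
op <- par(mfrow = c(2, 2))
for (name in names(freq)[1:4]) {
  barplot(freq[[name]], main = name, ylab = "Count")
}
par(op)
```
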

In [None]:
# solutions/mca/frequency.r

In [None]:
## TO BE COMPLETED ##

> _**Remark**: The graphs above can be used to identify variable categories with a very low frequency. Such categories can distort the analysis and should be removed._

### Multiple Correspondence Analysis

With the `MCA` function of [`FactoMineR`](http://factominer.free.fr/index_fr.html), we can perform Multiple Correspondence Analysis.

In [None]:
res.mca = MCA(hobbies[c(1:18)], graph=FALSE)
print(res.mca)

##### <span style="color:purple">**Todo:** Visualize the percentage of inertia explained by each MCA dimension.</span>

- What percentage of the variance do the first two axes explain?
- How many dimensions would you recommend retaining?
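
A possible sketch, assuming the copy of `hobbies` bundled with `FactoMineR` stands in for the CSV file:

```r
library(FactoMineR)
library(factoextra)

# Assumption: the dataset shipped with FactoMineR matches the CSV file
data(hobbies)
res.mca <- MCA(hobbies[, 1:18], graph = FALSE)

# Eigenvalue, percentage of inertia and cumulative percentage per dimension
print(head(res.mca$eig))

# Scree plot of the percentage of inertia carried by each dimension
p <- fviz_screeplot(res.mca, addlabels = TRUE)
print(p)
```

The third column of `res.mca$eig` gives the cumulative percentage, so its second row answers the first question directly.
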

In [None]:
# solutions/mca/inertia.r

In [None]:
## TO BE COMPLETED ##

##### <span style="color:purple">**Todo:** Visualize the MCA biplot.</span>

- One may use the `fviz_mca_biplot` function.

In [None]:
# solutions/mca/biplot.r

In [None]:
## TO BE COMPLETED ##

##### <span style="color:purple">**Question:** How do the different variables correlate with the different axes?</span>

- What can you conclude from this?
- One can refer to the `fviz_mca_var` function.
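
As a sketch, the link between each variable and each axis can be read from the squared correlation ratios stored in `res.mca$var$eta2`, which is what `fviz_mca_var` plots with `choice = "mca.cor"`. The example assumes the copy of `hobbies` bundled with `FactoMineR`.

```r
library(FactoMineR)
library(factoextra)

# Assumption: the dataset shipped with FactoMineR matches the CSV file
data(hobbies)
res.mca <- MCA(hobbies[, 1:18], graph = FALSE)

# Squared correlation ratio between each variable and the first two dimensions
eta2 <- res.mca$var$eta2[, 1:2]
print(round(eta2, 3))

# Same information as a map: one point per variable
p <- fviz_mca_var(res.mca, choice = "mca.cor", repel = TRUE)
print(p)
```
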

In [None]:
# solutions/mca/var_correlation.r

In [None]:
## TO BE COMPLETED ##

### Quality of representation

Dimensions 1 and 2 together retain only $24\%$ of the total inertia (variation) contained in the data.
It is reasonable to assume that not all points are displayed with the same quality in these first two dimensions.

##### <span style="color:purple">**Question:** Which hobby is best represented in plane 1-2? Which is least well represented?</span>

In [None]:
# solutions/mca/representation.r

In [None]:
## TO BE COMPLETED ##

##### <span style="color:purple">**Todo:** Visualize the quality of representation of the variables.</span>

- Visualize the variable map, colored according to the quality of representation $\to$ `fviz_mca_var`
- Create a bar plot of variable $\cos^2$ $\to$ `fviz_cos2`
- What do you think of the quality of the representation of the hobby of "watching TV"?
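
The two plots listed above could be sketched as follows, again assuming the copy of `hobbies` bundled with `FactoMineR`. The colour gradient is an arbitrary choice.

```r
library(FactoMineR)
library(factoextra)

# Assumption: the dataset shipped with FactoMineR matches the CSV file
data(hobbies)
res.mca <- MCA(hobbies[, 1:18], graph = FALSE)

# Map of the variable categories, colored by their cos2 in the plane
p_map <- fviz_mca_var(res.mca, col.var = "cos2",
                      gradient.cols = c("blue", "yellow", "red"),
                      repel = TRUE)
print(p_map)

# Bar plot of the cos2 of each category, summed over dimensions 1 and 2
p_bar <- fviz_cos2(res.mca, choice = "var", axes = 1:2)
print(p_bar)
```
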

In [None]:
# solutions/mca/var_representation.r

In [None]:
## TO BE COMPLETED ##

##### <span style="color:purple">**Todo:** Would you say that all individuals are uniformly well represented?</span>

In [None]:
# solutions/mca/ind_representation.r

In [None]:
## TO BE COMPLETED ##

### Contribution to the dimensions

##### <span style="color:purple">**Todo:** Evaluate the contribution of the various leisure activities to Axes 1 and 2 of the MCA.</span>

- Which 20 leisure activities contribute most to axis 1 of the MCA? 
- Which of these leisure activities make a significant contribution?
- Same questions for axis 2.
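
A sketch of one way to answer these questions with `fviz_contrib`, assuming the copy of `hobbies` bundled with `FactoMineR`:

```r
library(FactoMineR)
library(factoextra)

# Assumption: the dataset shipped with FactoMineR matches the CSV file
data(hobbies)
res.mca <- MCA(hobbies[, 1:18], graph = FALSE)

# Contributions (in percent) of the categories to each dimension
contrib <- res.mca$var$contrib

# The 20 categories contributing most to axis 1, then to axis 2
print(fviz_contrib(res.mca, choice = "var", axes = 1, top = 20))
print(fviz_contrib(res.mca, choice = "var", axes = 2, top = 20))
```

The dashed reference line marks the contribution expected if all categories contributed equally. Categories above it can be regarded as contributing more than average to the axis.
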

In [None]:
# solutions/mca/var_contrib.r

In [None]:
## TO BE COMPLETED ##

The most important (that is, most contributing) hobbies can be highlighted on the scatter plot as follows:

In [None]:
fviz_mca_var(res.mca, col.var = "contrib",
             gradient.cols = c("blue", "yellow", "red"), 
             repel = TRUE)

##### <span style="color:purple">**Question:** More generally, can we distinguish between practising a leisure activity and not practising it?</span>

In [None]:
# solutions/mca/grouped_var.r

In [None]:
## TO BE COMPLETED ##

##### <span style="color:purple">**Todo:** View the individuals on the main map of the MCA according to their hobbies.</span>

- We could concentrate on visiting exhibitions, gardening or watching TV.
- What do you think?

In [None]:
# solutions/mca/grouped_ind.r

In [None]:
## TO BE COMPLETED ##

### Supplementary elements

We will now take the additional socio-demographic variables into account in our analysis.

In [None]:
res.mca = MCA(hobbies, quali.sup=19:22, quanti.sup=23, graph=FALSE)

##### <span style="color:purple">**Question:** What do the graphs below represent? What can you tell from them?</span>

In [None]:
fviz_mca_var(res.mca)

# --- #

plot(res.mca, invisible=c("ind","var"),
     habillage = "quali", 
     palette = palette(c("blue","maroon","darkgreen","black","red")), 
     title = "Graph of the socio-demographic categories")

# col = c( rep("Sex",length(levels(hobbies$Sex))), rep("Age",length(levels(hobbies$Age))), rep("Marital.status",length(levels(hobbies$Marital.status))), rep("Profession",length(levels(hobbies$Profession))) )
# fviz_mca_var(res.mca, invisible="var", col.var = col)

# --- #

fviz_mca_var(res.mca, choice = "quanti.sup")

##### <span style="color:purple">**Question:** Are the additional socio-demographic variables well represented in the MCA space?</span>

- Assess the quality of representation of the additional qualitative variables.
- Draw a bar chart to compare these different qualities.
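
As a rough sketch, the $\cos^2$ of the supplementary categories are stored in `res.mca$quali.sup$cos2` and can be summed over the first two dimensions. The example uses a base R `barplot` rather than `ggpubr` to keep it short, and assumes the copy of `hobbies` bundled with `FactoMineR`.

```r
library(FactoMineR)

# Assumption: the dataset shipped with FactoMineR matches the CSV file
data(hobbies)
res.mca <- MCA(hobbies, quali.sup = 19:22, quanti.sup = 23, graph = FALSE)

# Quality of representation of each supplementary category in plane 1-2
quality <- sort(rowSums(res.mca$quali.sup$cos2[, 1:2]), decreasing = TRUE)

# Bar chart comparing these qualities across categories
op <- par(mar = c(10, 4, 2, 1))
barplot(quality, las = 2, cex.names = 0.7,
        ylab = "cos2 (dimensions 1 and 2)")
par(op)
```
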

In [None]:
library(tidyverse)
library(ggpubr)

In [None]:
# solutions/mca/sup_representation.r

In [None]:
## TO BE COMPLETED ##

##### <span style="color:purple">**Question:** How are the additional variables correlated with dimensions 1 and 2?</span>

In [None]:
# solutions/mca/sup_correlation.r

In [None]:
## TO BE COMPLETED ##