# Correspondence analysis

Author: O. Roustant, INSA Toulouse. April 2022.

This notebook illustrates the course on correspondence analysis. It complements the course slides.

# Illustration on the velib data

We consider the ‘Vélib’ data set, related to the bike sharing system of Paris. The data are loading profiles of the bike stations over one week, collected every hour during the period Monday 1st Sept. - Sunday 7th Sept., 2014. The loading profile of a station, or simply loading, is defined as the ratio of the number of available bikes to the number of bike docks. A loading of 1 means that the station is fully loaded, i.e. every dock holds an available bike. A loading of 0 means that the station is empty, i.e. all bikes have been rented.

From the viewpoint of data analysis, the individuals are the stations. The variables are the 168 time steps (hours in the week). The aim is to detect clusters of stations that share similar usage patterns. This clustering could then be used to predict the loading profile.

In [1]:
rm(list = ls())   # erase everything, start from scratch!

load("velib.RData")



In [None]:
# data preparation
x <- as.matrix(velib$data)
colnames(x) <- 1:ncol(x)
rownames(x) <- velib$names

n <- nrow(x)
stations <- 1:n 
coord <- velib$position[stations,]

# select exactly 7 days of data (we remove the first 13 time steps)
dates <- 14:181
x <- x[stations, dates]
colnames(x) <- seq_along(dates)

In [None]:
timeTick <- 1 + 24*(0:6)  # vector corresponding to the beginning of days
par(mfrow = c(1, 1))

options(repr.plot.width = 15, repr.plot.height = 6)

plot(x[1, ], col = "blue", type = "l", ylim = c(0, 1), 
     xlab = "Time", ylab = "Loading", main = rownames(x)[1])
abline(v = timeTick, lty = "dotted")


In [None]:
# From now on, we use numbers instead of station names, 
# in order to simplify printing
rownames(x) <- 1:nrow(x)

## Basic clustering

In [2]:
# hierarchical clustering
hc <- hclust(dist(x), method = "ward.D2")
plot(hc, xlab = "ward linkage", sub = "", cex.lab = 2, cex.main = 2)
plot(rev(hc$height)[1:15], xlab = "Number of classes", ylab = "height", 
     cex.lab = 2, cex.main = 2, cex.axis = 2, cex = 2, pch = 19)




In [None]:
# let us choose K clusters
K <- 5
reshc <- cutree(hc, k = K)

In [None]:
# k-means 
K2 <- 6
km <- kmeans(x, centers = K2, nstart = 20)
reskm <- km$cluster

In [None]:
# Comparison with a contingency table
tab <- table(reskm, reshc)
rownames(tab) <- paste("km", 1:K2, sep = "")
colnames(tab) <- paste("hc", 1:K, sep = "")
tab

## Correspondence analysis

In order to compare the clustering results, we use correspondence analysis, which consists of two PCAs with the chi2 metric: one on the "row profiles", the other on the "column profiles".

Let us build the row profiles: for each row, divide the counts by the row sum, so that each row becomes a vector of proportions summing to 1.

In [None]:
cat("Contingency table:\n")
tab
rowProf <- tab
for (i in 1:nrow(tab)){
    rowProf[i, ] <- tab[i, ] / sum(tab[i, ])
}
cat("\nRow profile table:\n\n")
print(rowProf, digits = 2)

cat("\nSanity check, each row profile sums to 1:\n")
rowSums(rowProf)

cat("\nColumn frequencies (inverse weights for chi2 distance):\n")
colsum <- colSums(tab) / sum(tab)
colsum
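
The explicit loop above can also be written with base R's `prop.table`. The following is a minimal sketch on a small made-up contingency table (`toy` and the other names are hypothetical, not the Vélib result), checking that both approaches give the same row profiles:

```r
# Hypothetical 3 x 2 contingency table, for illustration only
toy <- matrix(c(10, 20,
                30, 40,
                 5, 15), nrow = 3, byrow = TRUE,
              dimnames = list(paste0("km", 1:3), paste0("hc", 1:2)))

# Row profiles with an explicit loop, as in the cell above
rowProfLoop <- toy
for (i in 1:nrow(toy)) {
    rowProfLoop[i, ] <- toy[i, ] / sum(toy[i, ])
}

# Same computation in one call: margin = 1 divides each row by its sum
rowProfToy <- prop.table(toy, margin = 1)

# Column frequencies, used as inverse weights in the chi2 distance
colFreqToy <- colSums(toy) / sum(toy)

print(all.equal(rowProfLoop, rowProfToy))
print(rowSums(rowProfToy))
```

Using `margin = 2` instead would give the column profiles.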

Let us compute the squared chi2 distance between the first row profile and each of the other rows: a sum of squared differences, each weighted by the inverse of the corresponding column frequency. Hence the columns containing the fewest individuals have the largest weight.
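
In formulas, write $n_{ij}$ for the entries of the contingency table, $n_{i\cdot}$ and $n_{\cdot j}$ for its row and column sums, and $n$ for the grand total (this notation is introduced here and may differ from the course slides). The squared chi2 distance between the row profiles $i$ and $i'$ then reads:

$$
d_{\chi^2}^2(i, i') = \sum_{j} \frac{1}{f_{\cdot j}} \left( \frac{n_{ij}}{n_{i\cdot}} - \frac{n_{i'j}}{n_{i'\cdot}} \right)^2,
\qquad f_{\cdot j} = \frac{n_{\cdot j}}{n}.
$$

In the code, $n_{ij}/n_{i\cdot}$ is `rowProf[i, j]` and $f_{\cdot j}$ is `colsum[j]`.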

In [None]:
# loop over the rows of the table (the k-means clusters)
for (i in 2:nrow(rowProf)){
    chi2Dist2 <- sum((rowProf[1, ] - rowProf[i, ])^2 / colsum)
    cat("\nsquared Chi2 distance between rows 1 and", i, ":", chi2Dist2)
}
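
All pairwise chi2 distances can be obtained at once by rescaling the row profiles and calling `dist`. This is a sketch under the assumption of a small made-up table (`toy`, `scaled` and `chi2DistToy` are hypothetical names), with a check against the direct formula:

```r
# Hypothetical contingency table, for illustration only
toy <- matrix(c(10, 20,  5,
                30, 40, 10,
                 5, 15, 25), nrow = 3, byrow = TRUE,
              dimnames = list(paste0("km", 1:3), paste0("hc", 1:3)))

rowProfToy <- prop.table(toy, margin = 1)   # row profiles
colFreqToy <- colSums(toy) / sum(toy)       # column frequencies

# Dividing column j by sqrt(f_j) turns the chi2 distance into a Euclidean one
scaled <- sweep(rowProfToy, 2, sqrt(colFreqToy), "/")
chi2DistToy <- as.matrix(dist(scaled))      # chi2 distances (not squared)

# Check against the direct formula for rows 1 and 2
direct12 <- sum((rowProfToy[1, ] - rowProfToy[2, ])^2 / colFreqToy)
print(chi2DistToy^2, digits = 3)
print(all.equal(chi2DistToy[1, 2]^2, direct12))
```

This rescaling is also why correspondence analysis can be seen as an ordinary PCA applied to suitably weighted profiles.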

We do the same for the columns: each column is divided by its sum to obtain the column profiles.

In [None]:
colProf <- tab
for (j in 1:ncol(tab)){
    colProf[, j] <- tab[, j] / sum(tab[, j])
}
tab
print(colProf, digits = 2)
# row frequencies (inverse weights for the chi2 distance between columns)
rowSum <- rowSums(tab) / sum(tab)
rowSum

Let us now perform a PCA of the row profiles with the chi2 metric, and a PCA of the column profiles with the chi2 metric. We plot them simultaneously on the first principal axes. Explain why they correspond. Interpret the results.

In [None]:
library(FactoMineR)
ca <- CA(tab)
ca
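
To see what such an analysis computes, here is a minimal base-R sketch of correspondence analysis through a singular value decomposition of the standardized residuals. It runs on a made-up table and all names (`toy`, `P`, `r`, `cm`, `S`, `rowCoord`, `colCoord`) are assumptions for illustration. It is not FactoMineR's implementation, and signs of the axes may differ from its output:

```r
# Hypothetical contingency table, for illustration only
toy <- matrix(c(20,  5,  2,
                 6, 18,  4,
                 3,  7, 15,
                10,  9,  8), nrow = 4, byrow = TRUE,
              dimnames = list(paste0("km", 1:4), paste0("hc", 1:3)))

P  <- toy / sum(toy)       # joint frequencies
r  <- rowSums(P)           # row masses
cm <- colSums(P)           # column masses

# Standardized residuals from the independence model outer(r, cm)
S <- (P - outer(r, cm)) / sqrt(outer(r, cm))
dec <- svd(S)

# Eigenvalues (inertia per axis): at most min(nrow, ncol) - 1 are nonzero
eig <- dec$d^2
rowCoord <- (dec$u / sqrt(r)) %*% diag(dec$d)    # row principal coordinates
colCoord <- (dec$v / sqrt(cm)) %*% diag(dec$d)   # column principal coordinates

# Total inertia equals the chi2 statistic of independence divided by n
chi2stat <- suppressWarnings(chisq.test(toy, correct = FALSE)$statistic)
print(round(eig, 4))
print(all.equal(sum(eig), unname(chi2stat) / sum(toy)))
```

Row and column coordinates come from the same decomposition and share the same eigenvalues, which is a useful starting point for the question above.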

# Sociological data

We consider a dataset on 8869 students, studied by the sociologist Pierre Bourdieu and presented in the textbook of Xavier Gendre:

https://www.math.univ-toulouse.fr/~xgendre/

For each student, we know the parent's occupation:

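    EAG : Farmer (exploitant agricole)
    SAG : Farm worker (salarié agricole)
    PT : Business owner (patron)
    PLCS : Liberal profession & senior executive (profession libérale & cadre supérieur)
    CM : Middle manager (cadre moyen)
    EMP : Clerical employee (employé)
    OUV : Manual worker (ouvrier)
    OTH : Other (autre)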

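and the student's field of study:

    DR : Law (droit)
    SCE : Economics (sciences économiques)
    LET : Humanities (lettres)
    SC : Sciences (sciences)
    MD : Medicine or dentistry (médecine ou dentaire)
    PH : Pharmacy (pharmacie)
    PD : Multidisciplinary (pluridisciplinaire)
    IUT : University institute of technology (Institut Universitaire de Technologie)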

We want to investigate whether there is social reproduction, i.e. whether there is a link between the parent's occupation and the student's field of study.

**Q** Perform a correspondence analysis. What do you conclude?

**Q** Some levels have small frequencies. What are the consequences for the results? How could levels be merged? Redo the correspondence analysis with the new levels.

**R** Some levels seem to be linked, such as MD and PLCS, or IUT and EAG. The link between MD and PLCS may mean that children of doctors tend to become doctors.

**R** PH, which has a small frequency, appears quite far from the other points on the correspondence analysis factor map. After merging MD and PH (which seems natural, since both are health studies), the MDPH level lies close to the initial MD position, so MDPH and PLCS still appear linked.

In [None]:
T <- read.table("dataBourdieu.dat")
T

In [None]:
CA(T)

In [None]:
# Example for the levels MD (medicine) and PH (pharmacy):
T2 <- cbind(T, MDPH = T[, "MD"] + T[, "PH"])  # gather levels MD and PH (columns 5 and 6)
T2 <- T2[, -c(5, 6)]  # delete the original MD and PH columns
T2

In [None]:
CA(T2)

In [None]:
# Example for the levels DR (law) and SCE (economics):
T3 <- cbind(T2, DRSCE = T2[, "DR"] + T2[, "SCE"])  # gather levels DR and SCE (columns 1 and 2)
T3 <- T3[, -c(1, 2)]  # delete the original DR and SCE columns
T3

In [None]:
CA(T3)
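
The two merging steps above follow the same pattern, which could be wrapped in a small function that selects columns by name rather than by position. The helper `mergeLevels` below is hypothetical (it is not part of FactoMineR), and the counts are made up for illustration, not taken from the Bourdieu data:

```r
# Hypothetical helper: merge the columns `levels` of a contingency table
# into a single column called `newName`
mergeLevels <- function(tab, levels, newName) {
    keep <- setdiff(colnames(tab), levels)
    merged <- rowSums(tab[, levels, drop = FALSE])
    out <- cbind(tab[, keep, drop = FALSE], merged)
    colnames(out)[ncol(out)] <- newName
    out
}

# Made-up counts, for illustration only
toy <- data.frame(DR = c(8, 2), SC = c(5, 9), MD = c(6, 1), PH = c(1, 0),
                  row.names = c("PLCS", "OUV"))

toy2 <- mergeLevels(toy, c("MD", "PH"), "MDPH")
print(toy2)
```

Merging by name avoids errors when column positions shift after a previous merge.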