![logo_uam.png](attachment:logo_uam.png)

### MASTER'S IN INFORMATION SYSTEMS AND DATA ENGINEERING

### PROBABILITY AND STATISTICS FOR ARTIFICIAL INTELLIGENCE


#### Correspondence Analysis (CA; French: Analyse Factorielle des Correspondances, AFC)

Correspondence analysis (CA), known in French as *analyse factorielle des correspondances* (AFC), gives a condensed view of the salient information, or patterns, carried by a contingency table. It is a quick way to make sense of large tables. Much of its appeal comes from its graphical displays, which make it easy to see similarities and dissimilarities between profiles, and attractions and repulsions between categories. It was developed from the 1960s onward by Jean-Paul Benzécri. The resulting factors (the latent variables) are linear combinations of the category points (rows or columns), expressed as row or column profiles.

##### Conditions of application

CA applies first and foremost to contingency tables (tables of counts), but also to any other table of non-negative values for which margins (row and column sums) and profiles (row and column ratios) are meaningful.
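The margins and profiles mentioned above can be written as follows. This is only a sketch in a notation assumed here: $k_{ij}$ denotes the counts, $k_{..}$ their grand total, and $f_{ij}$ the relative frequencies.

$$
f_{ij} = \frac{k_{ij}}{k_{..}}, \qquad f_{i.} = \sum_{j} f_{ij}, \qquad f_{.j} = \sum_{i} f_{ij}
$$

$$
\text{row profile of } i:\ \left(\frac{f_{ij}}{f_{i.}}\right)_{j}, \qquad \text{column profile of } j:\ \left(\frac{f_{ij}}{f_{.j}}\right)_{i}
$$

Each profile sums to 1, so profiles compare distributions rather than raw totals.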

**Example 1**

A set of voters was asked for their department and their vote in the presidential election. Suppose there are I candidates and J departments.

These data can be arranged in a contingency table of the form: ![image.png](attachment:image.png)

![image-3.png](attachment:image-3.png): the number of people who voted for candidate i in department j.
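As a minimal sketch of how such a table can be built from raw survey records, the cell below uses `pandas.crosstab` on a small made-up sample. The `votes` DataFrame and its candidate and department labels are hypothetical and only illustrate the layout.

```python
import pandas as pd

# Hypothetical raw records: one row per voter (labels are made up for illustration)
votes = pd.DataFrame({
    "candidate": ["A", "B", "A", "C", "B", "A", "C", "A"],
    "department": ["D1", "D1", "D2", "D2", "D2", "D1", "D1", "D2"],
})

# Contingency table: cell (i, j) counts the voters of candidate i in department j
contingency = pd.crosstab(votes["candidate"], votes["department"])

# Row and column margins (sums); both add up to the sample size
row_margin = contingency.sum(axis=1)
col_margin = contingency.sum(axis=0)

print(contingency)
```

Candidates end up in rows and departments in columns, which matches the I x J layout described above.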

 

In [24]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np


**Application exercise: we study the relationship between eye colour and hair colour for 592 female subjects. The data are summarised in the table below.**
**The data are adapted from the article by Snee (1974).**

In [25]:
X=np.array([[119,26,7],
           [54,14,10],
           [29,14,16],
           [84,17,94]])

In [26]:
# Rows are eye colours, columns are hair colours (stray space removed from the 'Roux' label)
exdapp = pd.DataFrame(X, index=['Marrons', 'Noisettes', 'Verts', 'Bleus'],
                      columns=['Chatains', 'Roux', 'Blonds'])
exdapp

Unnamed: 0,Chatains,Roux,Blonds
Marrons,119,26,7
Noisettes,54,14,10
Verts,29,14,16
Bleus,84,17,94


In [39]:

afc = pd.read_excel("../TD_ACP/Data_Methodes_Factorielles_python.xlsx",sheet_name="AFC_ETUDES",index_col=0)

In [40]:
afc

Unnamed: 0_level_0,Droit,Sciences,Medecine,IUT
CSP_vs_Filiere,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
ExpAgri,80,99,65,58
Patron,168,137,208,62
CadreSup,470,400,876,79
Employe,145,133,135,54
Ouvrier,166,193,127,129


### Chi-square test
Before starting a CA, run the chi-square test of independence to check that the two variables are actually associated.

In [80]:
from scipy.stats import chi2_contingency

# Returns the statistic, the p-value, the degrees of freedom and the expected counts
khi2, pval, ddl, contingent_theorique = chi2_contingency(afc)

In [83]:
print("Chi2 = ", khi2)
print("p-value = ", pval)
print("degrees of freedom = ", ddl)

Chi2 =  320.2658717522244
p-value =  2.582612649831932e-61
degrees of freedom =  12
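To make the test less of a black box, the sketch below recomputes the statistic by hand from the expected counts under independence and compares it with SciPy. The counts are copied from the table displayed above; the variable names are chosen here for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Counts copied from the CSP x field-of-study table shown above
counts = np.array([
    [80, 99, 65, 58],      # ExpAgri
    [168, 137, 208, 62],   # Patron
    [470, 400, 876, 79],   # CadreSup
    [145, 133, 135, 54],   # Employe
    [166, 193, 127, 129],  # Ouvrier
])

n = counts.sum()

# Expected counts under independence: (row total x column total) / n
expected = np.outer(counts.sum(axis=1), counts.sum(axis=0)) / n

# Pearson statistic and its degrees of freedom, computed by hand
chi2_manual = ((counts - expected) ** 2 / expected).sum()
dof_manual = (counts.shape[0] - 1) * (counts.shape[1] - 1)

# Reference values from SciPy
chi2_scipy, pval, dof_scipy, expected_scipy = chi2_contingency(counts)

print(chi2_manual, chi2_scipy, dof_manual)
```

With a p-value this small, independence is rejected, which is what justifies looking at the structure of the association with a CA.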


### I.1. Frequencies and relative frequencies

In [41]:

# Relative frequencies: each count divided by the grand total of the table
total = afc.values.sum()
F = (afc / total).add_suffix('_freq')

F

Unnamed: 0_level_0,Droit_freq,Sciences_freq,Medecine_freq,IUT_freq
CSP_vs_Filiere,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
ExpAgri,0.021142,0.026163,0.017178,0.015328
Patron,0.044397,0.036205,0.054968,0.016385
CadreSup,0.124207,0.105708,0.231501,0.020877
Employe,0.038319,0.035148,0.035677,0.014271
Ouvrier,0.043869,0.051004,0.033562,0.034091



### II. Row and column profiles

In [96]:
profil_ligne = np.apply_along_axis(arr=F.values, axis=1, func1d= lambda x:x/np.sum(x))
profil_ligne

array([[0.26490066, 0.32781457, 0.21523179, 0.19205298],
       [0.29217391, 0.23826087, 0.36173913, 0.10782609],
       [0.25753425, 0.21917808, 0.48      , 0.04328767],
       [0.31049251, 0.28479657, 0.28907923, 0.11563169],
       [0.2699187 , 0.31382114, 0.20650407, 0.2097561 ]])

In [100]:
# Row profile: divide each row by its total
row_totals = F.sum(axis=1)
row_profil = F.div(row_totals, axis=0)

# Column profile: divide each column by its total
col_totals = F.sum(axis=0)
col_profil = F.div(col_totals, axis=1)

row_profil

Unnamed: 0_level_0,Droit_freq,Sciences_freq,Medecine_freq,IUT_freq
CSP_vs_Filiere,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
ExpAgri,0.264901,0.327815,0.215232,0.192053
Patron,0.292174,0.238261,0.361739,0.107826
CadreSup,0.257534,0.219178,0.48,0.043288
Employe,0.310493,0.284797,0.289079,0.115632
Ouvrier,0.269919,0.313821,0.206504,0.209756


In [101]:
col_profil

Unnamed: 0_level_0,Droit_freq,Sciences_freq,Medecine_freq,IUT_freq
CSP_vs_Filiere,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
ExpAgri,0.077745,0.102911,0.046067,0.151832
Patron,0.163265,0.142412,0.147413,0.162304
CadreSup,0.456754,0.4158,0.620836,0.206806
Employe,0.140914,0.138254,0.095677,0.141361
Ouvrier,0.161322,0.200624,0.090007,0.337696


Computing the row and column metrics

In [109]:
Dn = np.diag(row_totals)
Dp = np.diag(col_totals)
Dn

array([[0.07980973, 0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.1519556 , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.48229387, 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.12341438, 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.16252643]])

In [110]:
np.dot(np.linalg.inv(Dn), F)

array([[0.26490066, 0.32781457, 0.21523179, 0.19205298],
       [0.29217391, 0.23826087, 0.36173913, 0.10782609],
       [0.25753425, 0.21917808, 0.48      , 0.04328767],
       [0.31049251, 0.28479657, 0.28907923, 0.11563169],
       [0.2699187 , 0.31382114, 0.20650407, 0.2097561 ]])

In [111]:
np.linalg.inv(Dn)@F

Unnamed: 0,Droit_freq,Sciences_freq,Medecine_freq,IUT_freq
0,0.264901,0.327815,0.215232,0.192053
1,0.292174,0.238261,0.361739,0.107826
2,0.257534,0.219178,0.48,0.043288
3,0.310493,0.284797,0.289079,0.115632
4,0.269919,0.313821,0.206504,0.209756


In [112]:
np.linalg.inv(Dp)@F.T

CSP_vs_Filiere,ExpAgri,Patron,CadreSup,Employe,Ouvrier
0,0.077745,0.163265,0.456754,0.140914,0.161322
1,0.102911,0.142412,0.4158,0.138254,0.200624
2,0.046067,0.147413,0.620836,0.095677,0.090007
3,0.151832,0.162304,0.206806,0.141361,0.337696


### III. Mean row and column profiles

In [104]:
row_totals

CSP_vs_Filiere
ExpAgri     0.079810
Patron      0.151956
CadreSup    0.482294
Employe     0.123414
Ouvrier     0.162526
dtype: float64

In [107]:
col_totals

Droit_freq       0.271934
Sciences_freq    0.254228
Medecine_freq    0.372886
IUT_freq         0.100951
dtype: float64
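The column margins printed above are also the mean row profile: averaging the row profiles with the row masses as weights gives them back. The sketch below checks this on the counts copied from the table; the variable names are chosen here for illustration.

```python
import numpy as np

# Counts copied from the CSP x field-of-study table
counts = np.array([
    [80, 99, 65, 58],      # ExpAgri
    [168, 137, 208, 62],   # Patron
    [470, 400, 876, 79],   # CadreSup
    [145, 133, 135, 54],   # Employe
    [166, 193, 127, 129],  # Ouvrier
], dtype=float)

F = counts / counts.sum()               # relative frequencies f_ij
row_mass = F.sum(axis=1)                # f_i. (row margins)
col_mass = F.sum(axis=0)                # f_.j (column margins)
row_profiles = F / row_mass[:, None]    # each row divided by its own total

# Weighted average of the row profiles, using the row masses as weights
mean_row_profile = row_mass @ row_profiles

print(mean_row_profile)
print(col_mass)
```

The same holds by symmetry for the column profiles, whose weighted mean is the vector of row margins.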

#### Computing distances
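For reference, the squared chi-square distance between two row profiles is usually written as below. The notation is assumed here rather than taken from the notebook: $f_{ij}$ are the relative frequencies, $f_{i.}$ the row margins and $f_{.j}$ the column margins.

$$
d^2_{\chi^2}(i, i') = \sum_{j} \frac{1}{f_{.j}} \left( \frac{f_{ij}}{f_{i.}} - \frac{f_{i'j}}{f_{i'.}} \right)^2
$$

The ordinary squared Euclidean distance between profiles is the same sum without the $1/f_{.j}$ weights. Dividing by $f_{.j}$ keeps the most frequent columns from dominating the distance.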

In [150]:
import numpy as np

# Squared distances between the profiles of two rows of the frequency table
def calc_distance(row1, row2, col_means):
    """
    row1, row2 : rows of the frequency table F (converted to profiles below)
    col_means : column margins f_.j (the mean row profile), used as chi-square weights
    """
    # Each row is divided by its own total to obtain its profile
    profil1 = row1 / row1.sum()
    profil2 = row2 / row2.sum()

    # Ordinary squared Euclidean distance between the two profiles
    d_squared = np.sum((profil1 - profil2) ** 2)

    # Squared chi-square distance: each term is weighted by 1 / f_.j
    d_khi_squared = np.sum((profil1 - profil2) ** 2 / col_means)

    return d_squared, d_khi_squared

# Compute all pairwise distances between the rows of the frequency table
distances = []

for i in range(len(F)):
    for j in range(i + 1, len(F)):
        d_sq, d_khi_sq = calc_distance(
            F.iloc[i],
            F.iloc[j],
            F.sum(axis=0)
        )
        distances.append({
            "Row 1": i,
            "Row 2": j,
            "Squared Euclidean distance": d_sq,
            "Squared chi-square distance": d_khi_sq
        })

# Collect the results in a DataFrame for display
distances_df = pd.DataFrame(distances)
distances_df

Unnamed: 0,Row 1,Row 2,Squared Euclidean distance,Squared chi-square distance
0,0,1,0.037322,0.162117
1,0,2,0.104089,0.453847
2,0,3,0.015223,0.0874
3,0,4,0.000611,0.004172
4,1,2,0.019715,0.084611
5,1,3,0.007842,0.024514
6,1,4,0.040692,0.191823
7,2,3,0.048795,0.176847
8,2,4,0.111622,0.510901
9,3,4,0.018167,0.115413


In [71]:
# Distances between CadreSup (row 2) and Ouvrier (row 4), and between CadreSup and Patron (row 1)

# Positions of the rows in the frequency table
cadre_index = 2
ouvrier_index = 4
patron_index = 1

# Compute the distances
cadre_ouvrier_euclid, cadre_ouvrier_khi = calc_distance(
    F.iloc[cadre_index],
    F.iloc[ouvrier_index],
    F.sum(axis=0)
)

cadre_patron_euclid, cadre_patron_khi = calc_distance(
    F.iloc[cadre_index],
    F.iloc[patron_index],
    F.sum(axis=0)
)

# Results
distances_specific = pd.DataFrame({
    "Pair of rows": ["CadreSup-Ouvrier", "CadreSup-Patron"],
    "Squared Euclidean distance": [cadre_ouvrier_euclid, cadre_patron_euclid],
    "Squared chi-square distance": [cadre_ouvrier_khi, cadre_patron_khi]
})

distances_specific

Unnamed: 0,Pair of rows,Squared Euclidean distance,Squared chi-square distance
0,CadreSup-Ouvrier,0.111622,0.510901
1,CadreSup-Patron,0.019715,0.084611


### CadreSup-Ouvrier distance

In [157]:
distances = []
colon = col_totals.to_list()
# Loop over pairs of rows, so the bounds come from row_profil
for i in range(len(row_profil)):
    for j in range(i+1, len(row_profil)):
        d = np.sum((row_profil.iloc[i] - row_profil.iloc[j])**2 / colon)
        distances.append({
            'row1': i,
            'row2': j,
            'distance': d
        })

distances_df = pd.DataFrame(distances)
distances_df
#np.sum((row_profil.iloc[2] - row_profil.iloc[4])**2 / colon)

Unnamed: 0,row1,row2,distance
0,0,1,0.162117
1,0,2,0.453847
2,0,3,0.0874
3,0,4,0.004172
4,1,2,0.084611
5,1,3,0.024514
6,1,4,0.191823
7,2,3,0.176847
8,2,4,0.510901
9,3,4,0.115413
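The double loop above can also be written without explicit loops. The sketch below rescales each column of the row profiles by $1/\sqrt{f_{.j}}$, after which the ordinary squared Euclidean distance equals the chi-square distance. It relies on `scipy.spatial.distance.pdist`; the counts are copied from the table and the variable names are chosen here for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Counts copied from the CSP x field-of-study table
counts = np.array([
    [80, 99, 65, 58],      # ExpAgri
    [168, 137, 208, 62],   # Patron
    [470, 400, 876, 79],   # CadreSup
    [145, 133, 135, 54],   # Employe
    [166, 193, 127, 129],  # Ouvrier
], dtype=float)

F = counts / counts.sum()
row_profiles = F / F.sum(axis=1, keepdims=True)
col_mass = F.sum(axis=0)

# Rescale each column by 1 / sqrt(f_.j): squared Euclidean distances between
# the rescaled profiles are then the squared chi-square distances
scaled = row_profiles / np.sqrt(col_mass)
chi2_dist = squareform(pdist(scaled, metric="sqeuclidean"))

print(np.round(chi2_dist, 6))
```

The result is a symmetric 5 x 5 matrix with zeros on the diagonal; entry `[2, 4]` is the CadreSup-Ouvrier distance.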


### CadreSup-Patron distance

In [147]:
np.sum((row_profil.iloc[1] - row_profil.iloc[2])**2 / colon)

0.08461088232967058

In [130]:
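# Chi-square distance between CadreSup (row 2) and Ouvrier (row 4), computed on the row profiles
# rather than on the raw frequencies. Equivalent matrix form, up to floating-point rounding:
# d.T @ np.linalg.inv(Dp) @ d, with d the difference between the two profiles
np.sum((row_profil.iloc[2] - row_profil.iloc[4])**2 / colon)

0.5109007867776723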

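**Interpretation**

The distribution of field-of-study choices among children of senior executives (CadreSup) is closer to that of business owners' children (Patron) than to that of workers' children (Ouvrier): the chi-square distance between profiles is about 0.085 in the first case, against about 0.511 in the second.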

### IV. Independence

In [72]:
# Margins and grand total of the frequency table
row_sums = F.sum(axis=1)  # f_{i.} : row sums
col_sums = F.sum(axis=0)  # f_{.j} : column sums
total_sum = F.values.sum()  # f_{..} : grand total (1 for relative frequencies)

Z = 0

for i in range(len(row_sums)):
    for j in range(len(col_sums)):
        # Value expected under independence; .iloc avoids the positional-indexing warning on labelled Series
        Eij = (row_sums.iloc[i] * col_sums.iloc[j]) / total_sum
        Z += ((F.iloc[i, j] - Eij) ** 2) / Eij
Z

0.08463685828547156
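The value of Z is the chi-square statistic divided by the sample size n, a quantity often called phi-squared, which is also the total inertia in correspondence analysis. The sketch below checks this link on the counts copied from the table; the variable names are chosen here for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Counts copied from the CSP x field-of-study table
counts = np.array([
    [80, 99, 65, 58],      # ExpAgri
    [168, 137, 208, 62],   # Patron
    [470, 400, 876, 79],   # CadreSup
    [145, 133, 135, 54],   # Employe
    [166, 193, 127, 129],  # Ouvrier
], dtype=float)

n = counts.sum()
F = counts / n

# Frequencies expected under independence: f_i. * f_.j
independence = np.outer(F.sum(axis=1), F.sum(axis=0))

# Same quantity as Z in the loop above, in vectorised form
phi2 = ((F - independence) ** 2 / independence).sum()

# Chi-square statistic from SciPy, for comparison with n * phi2
chi2_stat, _, _, _ = chi2_contingency(counts)

print(phi2, n * phi2, chi2_stat)
```

Multiplying Z by n = 3784 gives back the statistic of the chi-square test run at the start of the notebook.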

In [73]:
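# Assumed intent: the CA matrix S = F' Dn^-1 F Dp^-1. Plain arrays are used so that
# @ multiplies positionally instead of trying to align row and column labels.
S = F.values.T @ np.linalg.inv(Dn) @ F.values @ np.linalg.inv(Dp)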
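Assuming S is meant to be the matrix usually diagonalised in correspondence analysis, $S = F^\top D_n^{-1} F D_p^{-1}$, the sketch below builds it from plain NumPy arrays and looks at its eigenvalues. The symmetric matrix `A` is introduced here only because it has the same eigenvalues as `S` and can be passed to `np.linalg.eigvalsh`; all names are chosen for illustration.

```python
import numpy as np

# Counts copied from the CSP x field-of-study table
counts = np.array([
    [80, 99, 65, 58],      # ExpAgri
    [168, 137, 208, 62],   # Patron
    [470, 400, 876, 79],   # CadreSup
    [145, 133, 135, 54],   # Employe
    [166, 193, 127, 129],  # Ouvrier
], dtype=float)

F = counts / counts.sum()
row_mass, col_mass = F.sum(axis=1), F.sum(axis=0)

# S = F' Dn^-1 F Dp^-1 (assumed definition, standard in correspondence analysis)
S = F.T @ np.diag(1 / row_mass) @ F @ np.diag(1 / col_mass)

# Symmetric matrix with the same eigenvalues as S: Dp^-1/2 F' Dn^-1 F Dp^-1/2
scale = np.diag(col_mass ** -0.5)
A = scale @ F.T @ np.diag(1 / row_mass) @ F @ scale

# eigvalsh returns eigenvalues in ascending order; reverse to get them decreasing
eigenvalues = np.linalg.eigvalsh(A)[::-1]

# The largest eigenvalue is the trivial one (equal to 1); the others are the principal inertias
trivial, inertias = eigenvalues[0], eigenvalues[1:]

print(trivial, inertias, inertias.sum())
```

The non-trivial eigenvalues add up to the quantity Z computed in the independence section, so each factor can be read as a share of the total departure from independence.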