# Analysis of S3 grades (2021-2022)

The file `S3_2022-2023.csv` contains the (anonymized) S3 results from last year. We will try to identify trends in these results.

The S3 courses (excluding the second foreign language, LV2) are:

- CSA: scientific computing (a)
- CSB: scientific computing (b)
- CAS: automatic control of systems
- CSI: industrial systems design
- MFL: fluid mechanics
- MDS: structural mechanics
- SDM: materials science
- RAY: radiation
- EPS: physical education and sports
- COM: professional communication
- EED: energy and environment: the challenges
- ANG: English
- STA: blue-collar internship

In [1]:
import numpy as np
import pandas as pd
from mlxtend.frequent_patterns import apriori

In [2]:
s3 = pd.read_csv('S3_2022-2023.csv')
s3

Unnamed: 0,CSA,CSB,CAS,CSI,MFL,MDS,SDM,RAY,EPS,COM,EED,ANG,STA
0,16.30,13.57,14.7,19.5,16.38,17.23,15.73,13.34,15.5,18.0,17,16.00,17.15
1,16.60,16.29,13.7,19.5,16.38,14.69,16.08,15.68,14.0,18.0,11,17.50,16.50
2,20.00,15.57,15.7,18.2,14.62,14.00,15.31,11.29,14.5,16.5,12,16.00,16.50
3,17.10,14.61,16.0,19.4,13.15,16.54,15.73,12.68,14.5,19.0,11,14.86,16.93
4,14.90,15.32,16.6,19.1,13.69,15.08,14.77,14.12,13.0,18.0,12,17.70,17.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...
176,1.50,9.14,12.2,12.5,12.50,7.88,8.54,11.12,15.0,13.0,7,10.00,16.75
177,4.36,8.68,11.6,10.8,11.62,8.54,6.85,10.68,14.0,16.0,9,9.50,14.50
178,2.02,12.57,9.9,10.8,9.81,6.27,8.54,6.07,12.5,14.5,9,15.00,15.75
179,4.20,9.04,14.4,10.0,11.88,9.08,9.12,9.34,15.0,16.0,9,8.90,10.00


In [3]:
s3 = (s3.fillna(20) < 10)
s3.astype(int)

Unnamed: 0,CSA,CSB,CAS,CSI,MFL,MDS,SDM,RAY,EPS,COM,EED,ANG,STA
0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
176,1,1,0,0,0,1,1,0,0,0,1,0,0
177,1,1,0,0,0,1,1,0,0,0,1,1,0
178,1,0,1,0,1,1,1,1,0,0,1,0,0
179,1,1,0,0,0,1,1,1,0,0,1,1,0


In [4]:
s3['cnt'] = s3.sum(axis=1)
s3 = s3.sort_values(by=["cnt"] + list(s3.columns), ascending=True).drop(columns=['cnt']).reset_index(drop=True)
sample = s3.tail(10)
sample.astype(int)

Unnamed: 0,CSA,CSB,CAS,CSI,MFL,MDS,SDM,RAY,EPS,COM,EED,ANG,STA
171,1,0,0,0,0,1,1,1,0,0,1,0,0
172,1,0,0,0,1,0,1,1,0,0,1,0,0
173,1,0,0,0,1,1,0,1,0,0,0,1,0
174,1,1,0,0,0,0,1,1,0,0,1,0,0
175,1,1,0,0,0,1,1,0,0,0,1,0,0
176,1,1,0,0,1,1,1,0,0,0,0,0,0
177,1,1,0,0,0,1,1,0,0,0,1,1,0
178,1,1,0,0,1,1,1,1,0,0,0,0,0
179,1,0,1,0,1,1,1,1,0,0,1,0,0
180,1,1,0,0,0,1,1,1,0,0,1,1,0


The following extract shows the failures of the ten students who failed the most courses, using a pass threshold of 10 (a missing grade counts as a pass).

|         |   CSA |   CSB |   CAS |   CSI |   MFL |   MDS |   SDM |   RAY |   EPS |   COM |   EED |   ANG |   STA |
|--------:|------:|------:|------:|------:|------:|------:|------:|------:|------:|------:|------:|------:|------:|
| **171** |     1 |     0 |     0 |     0 |     0 |     1 |     1 |     1 |     0 |     0 |     1 |     0 |     0 |
| **172** |     1 |     0 |     0 |     0 |     1 |     0 |     1 |     1 |     0 |     0 |     1 |     0 |     0 |
| **173** |     1 |     0 |     0 |     0 |     1 |     1 |     0 |     1 |     0 |     0 |     0 |     1 |     0 |
| **174** |     1 |     1 |     0 |     0 |     0 |     0 |     1 |     1 |     0 |     0 |     1 |     0 |     0 |
| **175** |     1 |     1 |     0 |     0 |     0 |     1 |     1 |     0 |     0 |     0 |     1 |     0 |     0 |
| **176** |     1 |     1 |     0 |     0 |     1 |     1 |     1 |     0 |     0 |     0 |     0 |     0 |     0 |
| **177** |     1 |     1 |     0 |     0 |     0 |     1 |     1 |     0 |     0 |     0 |     1 |     1 |     0 |
| **178** |     1 |     1 |     0 |     0 |     1 |     1 |     1 |     1 |     0 |     0 |     0 |     0 |     0 |
| **179** |     1 |     0 |     1 |     0 |     1 |     1 |     1 |     1 |     0 |     0 |     1 |     0 |     0 |
| **180** |     1 |     1 |     0 |     0 |     0 |     1 |     1 |     1 |     0 |     0 |     1 |     1 |     0 |

## Frequent itemsets

**Q1** On this extract, list the indices (i.e. the student numbers between 171 and 180) of the students who failed every course in each of the following sets:

- CSA
- CSA, MDS
- CSA, MDS, EED
- CSA, MDS, EED, MFL
- CSA, MDS, EED, MFL, ANG

In [5]:
for itemset in [{'CSA'}, {'CSA', 'MDS'}, {'CSA', 'MDS', 'EED'}, {'CSA', 'MDS', 'EED', 'MFL'}, {'CSA', 'MDS', 'EED', 'MFL', 'ANG'}]:
    print(f"{itemset}: {sample[(sample[list(itemset)]).all(axis=1)].index.to_list()}")

{'CSA'}: [171, 172, 173, 174, 175, 176, 177, 178, 179, 180]
{'MDS', 'CSA'}: [171, 173, 175, 176, 177, 178, 179, 180]
{'MDS', 'CSA', 'EED'}: [171, 175, 177, 179, 180]
{'MFL', 'MDS', 'CSA', 'EED'}: [179]
{'MFL', 'MDS', 'ANG', 'CSA', 'EED'}: []


**Q2** Deduce the support (absolute and relative) of these five itemsets.

In [6]:
fq = apriori(sample, min_support=0.1, use_colnames=True) # with ten rows, a support below 0.1 is necessarily 0
for itemset in [{'CSA'}, {'CSA', 'MDS'}, {'CSA', 'MDS', 'EED'}, {'CSA', 'MDS', 'EED', 'MFL'}, {'CSA', 'MDS', 'EED', 'MFL', 'ANG'}]:
    sup = fq[fq['itemsets'] == itemset]['support']
    sup = sup.iloc[0] if len(sup) else 0
    print(f"{itemset}: {sup}")

{'CSA'}: 1.0
{'MDS', 'CSA'}: 0.8
{'MDS', 'CSA', 'EED'}: 0.5
{'MFL', 'MDS', 'CSA', 'EED'}: 0.1
{'MFL', 'MDS', 'ANG', 'CSA', 'EED'}: 0
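The supports above can also be obtained by counting directly, without `apriori`. The sketch below is a minimal pure-Python version, assuming the ten rows of the extract are transcribed by hand as sets of failed courses; the helper names `absolute_support` and `relative_support` are illustrative and not part of mlxtend.

```python
# Failed courses per student, transcribed from the ten-row extract above.
rows = {
    171: "CSA MDS SDM RAY EED",
    172: "CSA MFL SDM RAY EED",
    173: "CSA MFL MDS RAY ANG",
    174: "CSA CSB SDM RAY EED",
    175: "CSA CSB MDS SDM EED",
    176: "CSA CSB MFL MDS SDM",
    177: "CSA CSB MDS SDM EED ANG",
    178: "CSA CSB MFL MDS SDM RAY",
    179: "CSA CAS MFL MDS SDM RAY EED",
    180: "CSA CSB MDS SDM RAY EED ANG",
}
transactions = {idx: set(items.split()) for idx, items in rows.items()}


def absolute_support(itemset):
    """Number of students who failed every course in `itemset`."""
    return sum(1 for failed in transactions.values() if itemset <= failed)


def relative_support(itemset):
    """Absolute support divided by the number of students in the extract."""
    return absolute_support(itemset) / len(transactions)


for itemset in [{"CSA"}, {"CSA", "MDS"}, {"CSA", "MDS", "EED"}]:
    print(sorted(itemset), absolute_support(itemset), relative_support(itemset))
```

The absolute support is a number of students; the relative support divides it by the size of the extract (10 here), which is the value `apriori` reports.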


**Q3** Following the Apriori algorithm, list the frequent itemsets for a minimum support of 7 (absolute, i.e. 0.7 relative), together with the support of each itemset.

In [7]:
fq = apriori(sample, min_support=0.7, use_colnames=True)
fq.sort_values("support", ascending=False)

Unnamed: 0,support,itemsets
0,1.0,(CSA)
2,0.9,(SDM)
6,0.9,"(SDM, CSA)"
1,0.8,(MDS)
5,0.8,"(MDS, CSA)"
3,0.7,(RAY)
4,0.7,(EED)
7,0.7,"(RAY, CSA)"
8,0.7,"(CSA, EED)"
9,0.7,"(SDM, MDS)"
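For the manual part of the question, the level-wise search can be sketched in plain Python. This is a simplified illustration of the Apriori idea (join, prune, count), not the mlxtend implementation; the function name `apriori_counts` and the hand-transcribed `rows` are assumptions made for this example.

```python
from itertools import combinations

# Failed courses of students 171 to 180, transcribed from the extract.
rows = [
    "CSA MDS SDM RAY EED",
    "CSA MFL SDM RAY EED",
    "CSA MFL MDS RAY ANG",
    "CSA CSB SDM RAY EED",
    "CSA CSB MDS SDM EED",
    "CSA CSB MFL MDS SDM",
    "CSA CSB MDS SDM EED ANG",
    "CSA CSB MFL MDS SDM RAY",
    "CSA CAS MFL MDS SDM RAY EED",
    "CSA CSB MDS SDM RAY EED ANG",
]
transactions = [set(r.split()) for r in rows]


def apriori_counts(transactions, min_count):
    """Return {frozenset: absolute support} for all frequent itemsets."""
    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = sorted(set().union(*transactions))
    level = {frozenset([i]): count({i}) for i in items}
    level = {s: n for s, n in level.items() if n >= min_count}
    frequent = dict(level)
    k = 2
    while level:
        # Join step: unions of two frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        # Count step: keep the candidates that reach the threshold.
        level = {c: count(c) for c in candidates}
        level = {c: n for c, n in level.items() if n >= min_count}
        frequent.update(level)
        k += 1
    return frequent


frequent = apriori_counts(transactions, min_count=7)
for itemset, n in sorted(frequent.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
    print(sorted(itemset), n)
```

The prune step is what saves work: a candidate such as {CSA, MDS, EED} is discarded without scanning the data because its subset {MDS, EED} is not frequent.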


**Q4** List the frequent itemsets for a minimum support of 8 (i.e. 0.8 relative).

When working by hand, you should of course build on the result of the previous question rather than start from scratch.  
The same holds for a computed solution: the frequent itemsets for a threshold of 0.8 are all already present in the result computed above.  
For this example, however, the computation time is very short (a few milliseconds).

In [8]:
fq[fq['support'] >= 0.8]

Unnamed: 0,support,itemsets
0,1.0,(CSA)
1,0.8,(MDS)
2,0.9,(SDM)
5,0.8,"(MDS, CSA)"
6,0.9,"(SDM, CSA)"


**Q5** Check your answers to the three previous questions by comparing them with those computed by the `apriori` function of the [mlxtend](http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/) library.

**Q6** List the maximal frequent itemsets for a minimum support of 0.7.

In [9]:
# Note: this is not the most efficient way to compute the maximal frequent itemsets; a dedicated algorithm is preferable
def isstrictsubset(itemset, itemsets):
    # True if `itemset` is a strict subset of at least one element of `itemsets`
    return any(itemset < it for it in itemsets)
fq = apriori(sample, min_support=0.7, use_colnames=True)
fq['max'] = fq['itemsets'].apply(lambda x: not isstrictsubset(x, fq['itemsets']))
maxfq = fq[fq['max']]
maxfq

Unnamed: 0,support,itemsets,max
7,0.7,"(RAY, CSA)",True
11,0.7,"(SDM, MDS, CSA)",True
12,0.7,"(SDM, CSA, EED)",True


In [10]:
from mlxtend.frequent_patterns import fpmax

fq = fpmax(sample, min_support=0.7, use_colnames=True)
fq

Unnamed: 0,support,itemsets
0,0.7,"(RAY, CSA)"
1,0.7,"(SDM, CSA, EED)"
2,0.7,"(SDM, MDS, CSA)"


**Q7** List the closed frequent itemsets for a minimum support of 0.7.

In [11]:
# Note: this is not the most efficient way to compute the closed frequent itemsets either
fq = apriori(sample, min_support=0.7, use_colnames=True)
fq['closed'] = fq.apply(lambda x: not isstrictsubset(x['itemsets'], fq[fq['support'] == x['support']]['itemsets']), axis=1)
clfq = fq[fq['closed']]
clfq

Unnamed: 0,support,itemsets,closed
0,1.0,(CSA),True
5,0.8,"(MDS, CSA)",True
6,0.9,"(SDM, CSA)",True
7,0.7,"(RAY, CSA)",True
11,0.7,"(SDM, MDS, CSA)",True
12,0.7,"(SDM, CSA, EED)",True
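The two definitions used in Q6 and Q7 can be checked without mlxtend. The sketch below enumerates the frequent itemsets of the extract by brute force (acceptable here because only nine courses are failed at least once in these ten rows) and then filters them; the variable names are illustrative.

```python
from itertools import combinations

# Failed courses of students 171 to 180, transcribed from the extract.
rows = [
    "CSA MDS SDM RAY EED",
    "CSA MFL SDM RAY EED",
    "CSA MFL MDS RAY ANG",
    "CSA CSB SDM RAY EED",
    "CSA CSB MDS SDM EED",
    "CSA CSB MFL MDS SDM",
    "CSA CSB MDS SDM EED ANG",
    "CSA CSB MFL MDS SDM RAY",
    "CSA CAS MFL MDS SDM RAY EED",
    "CSA CSB MDS SDM RAY EED ANG",
]
transactions = [set(r.split()) for r in rows]
min_count = 7  # absolute threshold, i.e. 0.7 on ten rows

items = sorted(set().union(*transactions))
frequent = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        n = sum(1 for t in transactions if set(combo) <= t)
        if n >= min_count:
            frequent[frozenset(combo)] = n

# Maximal: no strict superset is frequent.
maximal = {s for s in frequent if not any(s < t for t in frequent)}

# Closed: no strict superset has the same support
# (such a superset would itself be frequent, so searching `frequent` is enough).
closed = {s for s in frequent
          if not any(s < t and frequent[t] == frequent[s] for t in frequent)}

print("maximal:", sorted(sorted(s) for s in maximal))
print("closed: ", sorted(sorted(s) for s in closed))
```

Every maximal itemset is also closed, which is why the three maximal itemsets reappear among the six closed ones.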


**Q8** Give two comparable itemsets, i.e. such that the support of one is guaranteed to be greater than or equal to the support of the other, and two incomparable itemsets, i.e. whose supports are not related by the monotonicity property.

Two itemsets are comparable if one is included in the other, for example {CSA, MDS} and {CSA, MDS, SDM}, which implies support({CSA, MDS}) $\geq$ support({CSA, MDS, SDM}).

Otherwise the two itemsets are incomparable, for example {CSA, MDS} and {CSA, EED, SDM}.
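Comparability is simply set inclusion in one direction or the other, which Python's set operators express directly. A minimal sketch, with `comparable` as an illustrative helper name:

```python
def comparable(a, b):
    """True if one itemset is included in the other."""
    a, b = frozenset(a), frozenset(b)
    return a <= b or b <= a


print(comparable({"CSA", "MDS"}, {"CSA", "MDS", "SDM"}))  # True
print(comparable({"CSA", "MDS"}, {"CSA", "EED", "SDM"}))  # False
```

When the function returns `True`, the smaller itemset is guaranteed to have a support at least as large as the larger one; when it returns `False`, nothing can be said without looking at the data.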

**Q9** On the full set of S3 grades, compare the execution times of the `apriori`, `fpgrowth` and `fpmax` functions.  
You can use the [`%timeit`](https://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-timeit) command for this.

In [12]:
from mlxtend.frequent_patterns import fpgrowth, fpmax

In [13]:
min_sup = 0.001
%timeit apriori(s3, min_support=min_sup)
%timeit fpgrowth(s3, min_support=min_sup)
%timeit fpmax(s3, min_support=min_sup)

2.06 ms ± 5.03 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.04 ms ± 14.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
659 µs ± 3.51 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


**Q10** Check that the results returned by `apriori` and `fpgrowth` are identical.

In [14]:
fq1 = apriori(s3, min_support=min_sup)
fq2 = fpgrowth(s3, min_support=min_sup)
fq1['itemsets'] = fq1['itemsets'].apply(lambda x: tuple(sorted(x)))
fq2['itemsets'] = fq2['itemsets'].apply(lambda x: tuple(sorted(x)))
fq1 = fq1.sort_values(by='itemsets').reset_index(drop=True)
fq2 = fq2.sort_values(by='itemsets').reset_index(drop=True)
fq1.equals(fq2)

True

**Q11** Which teaching unit (UE) made up of 2, 3 or 4 courses would be the hardest?  
Could these results have been predicted from the number of failures in each course alone?

In [15]:
s3.sum().sort_values(ascending=False)

CSA    109
MDS     52
EED     51
RAY     33
MFL     17
CSB     16
SDM     15
CAS      6
ANG      5
STA      1
CSI      0
EPS      0
COM      0
dtype: int64

In [16]:
fq = fpgrowth(s3, min_support=0.01, use_colnames=True)
fq['len'] = fq['itemsets'].apply(len)
fq['support (abs.)'] = (fq['support'] * len(s3)).round(0).astype(int)  # round before casting: a bare astype(int) truncates floating-point products

In [17]:
# example for a teaching unit (UE) made up of 4 courses
fq[fq['len'] == 4].sort_values(by='support', ascending=False).head(10)

Unnamed: 0,support,itemsets,len,support (abs.)
84,0.033149,"(RAY, SDM, MDS, CSA)",4,6
77,0.027624,"(RAY, SDM, CSA, EED)",4,5
133,0.027624,"(MDS, CSA, CSB, EED)",4,5
112,0.027624,"(SDM, MDS, CSA, CSB)",4,5
78,0.027624,"(SDM, MDS, CSA, EED)",4,5
17,0.022099,"(RAY, MDS, CSA, EED)",4,4
120,0.022099,"(MFL, MDS, CSA, RAY)",4,4
104,0.022099,"(SDM, CSA, CSB, EED)",4,4
138,0.016575,"(RAY, MDS, CSA, CSB)",4,3
124,0.016575,"(MFL, MDS, CSA, EED)",4,3


Here the failures do not seem to be completely independent. However, one should check that this is not simply due to chance.
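One rough way to see the dependence is to compare the observed support of the top 4-itemset with what independence would predict. The sketch below uses only figures shown above (the per-course failure counts and the 181 students implied by `len(s3)` in Q16); it is a naive baseline, not a significance test.

```python
from math import prod

n_students = 181  # number of rows in s3, as used in Q16

# Failure counts per course, copied from the output of cell 15.
failures = {"CSA": 109, "MDS": 52, "RAY": 33, "SDM": 15}

# Absolute support of (RAY, SDM, MDS, CSA), copied from the table above.
observed = 6

# Under independence, P(failing all four) is the product of the marginals.
expected = n_students * prod(c / n_students for c in failures.values())

print(f"expected under independence: {expected:.2f} students")
print(f"observed: {observed} students")
```

The expected count is below one student, against six observed. A proper check would still need a statistical test (for instance a permutation test), since counts this small are noisy.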

## Association rules

**Q12** On the extract of the last ten transcripts, compute the support of the following itemsets:

- ANG
- MDS
- ANG, MDS
- ANG, CSB, MDS
- ANG, CSB, MFL
- CSB, MFL, MDS

In [18]:
fq = apriori(sample, min_support=0.1, use_colnames=True)
for itemset in [
    {'ANG'}, {'MDS'}, {'ANG', 'MDS'},
    {'ANG', 'CSB', 'MDS'}, {'ANG', 'CSB', 'MFL'}, {'CSB', 'MFL', 'MDS'},
]:
    sup = fq[fq['itemsets'] == itemset]['support']
    sup = sup.iloc[0] if len(sup) else 0
    print(f"{itemset}: {sup}")

{'ANG'}: 0.3
{'MDS'}: 0.8
{'MDS', 'ANG'}: 0.3
{'MDS', 'ANG', 'CSB'}: 0.2
{'MFL', 'ANG', 'CSB'}: 0
{'MFL', 'MDS', 'CSB'}: 0.2


**Q13** Deduce the confidence of the following association rules:

- ANG $\rightarrow$ MDS
- MDS $\rightarrow$ ANG
- ANG, MDS $\rightarrow$ CSB
- MFL, CSB $\rightarrow$ ANG

In [19]:
from mlxtend.frequent_patterns import association_rules

fq = fpgrowth(sample, min_support=0.1, use_colnames=True)
rl = association_rules(fq, min_threshold=0.1)
rl = pd.concat([rl[(rl['antecedents'] == x) & (rl['consequents'] == y)]
           for x, y in [
               ({'ANG'}, {'MDS'}),
               ({'MDS'}, {'ANG'}),
               ({'ANG', 'MDS'}, {'CSB'}),
               ({'MFL', 'CSB'}, {'ANG'})
           ]])
rl

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
895,(ANG),(MDS),0.3,0.8,0.3,1.0,1.25,0.06,inf,0.285714
894,(MDS),(ANG),0.8,0.3,0.3,0.375,1.25,0.06,1.12,1.0
1916,"(MDS, ANG)",(CSB),0.3,0.6,0.2,0.666667,1.111111,0.02,1.2,0.142857


The last rule (MFL, CSB $\rightarrow$ ANG) is not listed because its confidence is zero: in this case, support(X $\cup$ Y) = 0.
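The four confidences follow from the definition conf(X → Y) = support(X ∪ Y) / support(X). A minimal pure-Python check on the hand-transcribed extract, with illustrative helper names:

```python
# Failed courses of students 171 to 180, transcribed from the extract.
rows = [
    "CSA MDS SDM RAY EED",
    "CSA MFL SDM RAY EED",
    "CSA MFL MDS RAY ANG",
    "CSA CSB SDM RAY EED",
    "CSA CSB MDS SDM EED",
    "CSA CSB MFL MDS SDM",
    "CSA CSB MDS SDM EED ANG",
    "CSA CSB MFL MDS SDM RAY",
    "CSA CAS MFL MDS SDM RAY EED",
    "CSA CSB MDS SDM RAY EED ANG",
]
transactions = [set(r.split()) for r in rows]


def count(itemset):
    """Absolute support: number of students who failed all courses in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def confidence(antecedent, consequent):
    """conf(X -> Y) = support(X u Y) / support(X), computed on counts."""
    return count(antecedent | consequent) / count(antecedent)


rules = [
    ({"ANG"}, {"MDS"}),
    ({"MDS"}, {"ANG"}),
    ({"ANG", "MDS"}, {"CSB"}),
    ({"MFL", "CSB"}, {"ANG"}),
]
for x, y in rules:
    print(sorted(x), "->", sorted(y), round(confidence(x, y), 3))
```

Working on counts rather than relative supports avoids floating-point noise; the ratio is the same because the number of rows cancels out.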

**Q14** Check your answers to Q13 by comparing them with those computed by the `association_rules` function of the [mlxtend](http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/) library.

**Q15** Add the Kulczynski, all_confidence, max_confidence, cosine and IR (imbalance ratio) measures to the result DataFrame.  
You can apply the function in the cell below to the DataFrame returned by `association_rules`.

In [20]:
def compute_metrics(df):
    rl = df.copy()
    rl['Kulc'] = rl['support'] * (rl['antecedent support'] + rl['consequent support']) / (2 * rl['antecedent support'] * rl['consequent support'])
    rl['all'] = pd.concat([rl['support'] / rl['antecedent support'], rl['support'] / rl['consequent support']], axis=1).min(axis=1)
    rl['max'] = pd.concat([rl['support'] / rl['antecedent support'], rl['support'] / rl['consequent support']], axis=1).max(axis=1)
    rl['cos'] = rl['support'] / np.sqrt(rl['antecedent support'] * rl['consequent support'])
    rl['IR'] = np.abs(rl['antecedent support'] - rl['consequent support']) / (rl['antecedent support'] + rl['consequent support'] - rl['support'])
    return rl

In [21]:
rl = compute_metrics(rl)
rl

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric,Kulc,all,max,cos,IR
895,(ANG),(MDS),0.3,0.8,0.3,1.0,1.25,0.06,inf,0.285714,0.6875,0.375,1.0,0.612372,0.625
894,(MDS),(ANG),0.8,0.3,0.3,0.375,1.25,0.06,1.12,1.0,0.6875,0.375,1.0,0.612372,0.625
1916,"(MDS, ANG)",(CSB),0.3,0.6,0.2,0.666667,1.111111,0.02,1.2,0.142857,0.5,0.333333,0.666667,0.471405,0.428571
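To verify one row by hand, the same five measures can be written for a single rule in plain Python. The sketch below mirrors the formulas of `compute_metrics` and applies them to ANG → MDS, using the supports shown in the table (0.3, 0.8 and 0.3); the function name `measures` is illustrative.

```python
from math import sqrt


def measures(sup_x, sup_y, sup_xy):
    """Null-invariant measures for a rule X -> Y, from three supports."""
    conf_xy = sup_xy / sup_x  # confidence of X -> Y
    conf_yx = sup_xy / sup_y  # confidence of Y -> X
    return {
        "Kulc": (conf_xy + conf_yx) / 2,
        "all": min(conf_xy, conf_yx),
        "max": max(conf_xy, conf_yx),
        "cos": sup_xy / sqrt(sup_x * sup_y),
        "IR": abs(sup_x - sup_y) / (sup_x + sup_y - sup_xy),
    }


m = measures(sup_x=0.3, sup_y=0.8, sup_xy=0.3)
for name, value in m.items():
    print(f"{name}: {value:.6f}")
```

All five measures are symmetric in X and Y, which is why ANG → MDS and MDS → ANG share the same values in the table even though their confidences differ.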


**Q16** According to the Kulczynski measure, which rules are the most interesting over the full set of S3 results?  
Remember to adjust the support and confidence thresholds to filter out results that are not significant.

In [22]:
fq = fpgrowth(s3, min_support=5/len(s3), use_colnames=True) # derive min_support from a meaningful number of students, here 5/181 ≈ 0.028
rl = association_rules(fq, min_threshold=0.5)
rl = compute_metrics(rl)
rl['support (abs.)'] = (rl['support'] * len(s3)).round(0).astype(int)
rl.sort_values(by=['Kulc', 'IR'], ascending=False)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric,Kulc,all,max,cos,IR,support (abs.)
7,(MDS),(CSA),0.287293,0.602210,0.226519,0.788462,1.309280,0.053509,1.880462,0.331443,0.582304,0.376147,0.788462,0.544589,0.475000,41
3,(RAY),(CSA),0.182320,0.602210,0.160221,0.878788,1.459272,0.050426,3.281768,0.384902,0.572421,0.266055,0.878788,0.483535,0.672566,29
67,"(MDS, CSB)",(CSA),0.060773,0.602210,0.060773,1.000000,1.660550,0.024175,inf,0.423529,0.550459,0.100917,1.000000,0.317675,0.899083,11
30,"(SDM, MDS)",(CSA),0.055249,0.602210,0.055249,1.000000,1.660550,0.021977,inf,0.421053,0.545872,0.091743,1.000000,0.302891,0.908257,10
51,(MFL),(CSA),0.093923,0.602210,0.088398,0.941176,1.562871,0.031837,6.762431,0.397485,0.543983,0.146789,0.941176,0.371691,0.836364,16
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9,(SDM),(EED),0.082873,0.281768,0.044199,0.533333,1.892810,0.020848,1.539069,0.514307,0.345098,0.156863,0.533333,0.289241,0.620690,8
28,"(SDM, MDS)","(CSA, EED)",0.055249,0.171271,0.027624,0.500000,2.919355,0.018162,1.657459,0.695906,0.330645,0.161290,0.500000,0.283981,0.583333,5
13,"(SDM, CSA)",(EED),0.077348,0.281768,0.038674,0.500000,1.774510,0.016880,1.436464,0.473054,0.318627,0.137255,0.500000,0.261968,0.637931,7
17,"(SDM, MDS)",(EED),0.055249,0.281768,0.027624,0.500000,1.774510,0.012057,1.436464,0.461988,0.299020,0.098039,0.500000,0.221404,0.732143,5


**Q17** Are there rules that predict failure in radiation with 100% confidence? In automatic control? In the internship?

In [23]:
fq = fpgrowth(s3, min_support=0.001, use_colnames=True)
rl = association_rules(fq, min_threshold=1)
rl = compute_metrics(rl)
rl['support (abs.)'] = (rl['support'] * len(s3)).round(0).astype(int)
# For RAY, some rules cover several students
display(rl[rl['consequents'] >= {'RAY'}].sort_values(by=['support'], ascending=False).head(5))
# But not for CAS (at most one student per rule)
display(rl[rl['consequents'] >= {'CAS'}].sort_values(by=['support'], ascending=False).head(5))
# No exact rule applies to the internship, even though one student scored 0
display(rl[rl['consequents'] >= {'STA'}].sort_values(by=['support'], ascending=False).head(5))

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric,Kulc,all,max,cos,IR,support (abs.)
712,"(MFL, SDM, EED)","(RAY, CSA)",0.01105,0.160221,0.01105,1.0,6.241379,0.009279,inf,0.849162,0.534483,0.068966,1.0,0.262613,0.931034,2
710,"(MFL, SDM, CSA, EED)",(RAY),0.01105,0.18232,0.01105,1.0,5.484848,0.009035,inf,0.826816,0.530303,0.060606,1.0,0.246183,0.939394,2
705,"(MFL, SDM, EED)",(RAY),0.01105,0.18232,0.01105,1.0,5.484848,0.009035,inf,0.826816,0.530303,0.060606,1.0,0.246183,0.939394,2
10,"(SDM, CAS)",(RAY),0.005525,0.18232,0.005525,1.0,5.484848,0.004518,inf,0.822222,0.515152,0.030303,1.0,0.174078,0.969697,1
324,"(MFL, CAS, MDS, EED)","(RAY, CSA)",0.005525,0.160221,0.005525,1.0,6.241379,0.00464,inf,0.844444,0.517241,0.034483,1.0,0.185695,0.965517,1


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric,Kulc,all,max,cos,IR,support (abs.)
120,"(MFL, SDM, MDS, EED)",(CAS),0.005525,0.033149,0.005525,1.0,30.166667,0.005342,inf,0.972222,0.583333,0.166667,1.0,0.408248,0.833333,1
187,"(MFL, MDS, RAY, EED)",(CAS),0.005525,0.033149,0.005525,1.0,30.166667,0.005342,inf,0.972222,0.583333,0.166667,1.0,0.408248,0.833333,1
217,"(MFL, MDS, RAY, EED, SDM)",(CAS),0.005525,0.033149,0.005525,1.0,30.166667,0.005342,inf,0.972222,0.583333,0.166667,1.0,0.408248,0.833333,1
223,"(MFL, MDS, RAY, EED)","(SDM, CAS)",0.005525,0.005525,0.005525,1.0,181.0,0.005494,inf,1.0,1.0,1.0,1.0,1.0,0.0,1
225,"(MFL, SDM, MDS, EED)","(RAY, CAS)",0.005525,0.022099,0.005525,1.0,45.25,0.005403,inf,0.983333,0.625,0.25,1.0,0.5,0.75,1


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric,Kulc,all,max,cos,IR,support (abs.)


## Link between closed and maximal frequent itemsets and association rules

**Q18** Compute the rules that can be generated from the maximal frequent itemsets on the extract of the last ten transcripts, with a frequency threshold of 0.7.

**Q19** Check your results against those generated by the `mlxtend` library.

In [24]:
maxfq = fpmax(sample, min_support=0.7, use_colnames=True)
maxfq

Unnamed: 0,support,itemsets
0,0.7,"(RAY, CSA)"
1,0.7,"(SDM, CSA, EED)"
2,0.7,"(SDM, MDS, CSA)"


In [25]:
maxrl = association_rules(maxfq, min_threshold=0.7, support_only=True)
maxrl

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(RAY),(CSA),,,0.7,,,,,
1,(CSA),(RAY),,,0.7,,,,,
2,"(SDM, CSA)",(EED),,,0.7,,,,,
3,"(SDM, EED)",(CSA),,,0.7,,,,,
4,"(CSA, EED)",(SDM),,,0.7,,,,,
5,(SDM),"(CSA, EED)",,,0.7,,,,,
6,(CSA),"(SDM, EED)",,,0.7,,,,,
7,(EED),"(SDM, CSA)",,,0.7,,,,,
8,"(SDM, MDS)",(CSA),,,0.7,,,,,
9,"(SDM, CSA)",(MDS),,,0.7,,,,,


**Q20** Compare the association rules generated from the frequent itemsets (Q3), the maximal frequent itemsets (Q6/Q18) and the closed frequent itemsets (Q7).

In [26]:
# Frequent itemsets (Q3)
fq = apriori(sample, min_support=0.7, use_colnames=True)
rl = association_rules(fq, min_threshold=0.1)
rl

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(MDS),(CSA),0.8,1.0,0.8,1.0,1.0,0.0,inf,0.0
1,(CSA),(MDS),1.0,0.8,0.8,0.8,1.0,0.0,1.0,0.0
2,(SDM),(CSA),0.9,1.0,0.9,1.0,1.0,0.0,inf,0.0
3,(CSA),(SDM),1.0,0.9,0.9,0.9,1.0,0.0,1.0,0.0
4,(RAY),(CSA),0.7,1.0,0.7,1.0,1.0,0.0,inf,0.0
5,(CSA),(RAY),1.0,0.7,0.7,0.7,1.0,0.0,1.0,0.0
6,(CSA),(EED),1.0,0.7,0.7,0.7,1.0,0.0,1.0,0.0
7,(EED),(CSA),0.7,1.0,0.7,1.0,1.0,0.0,inf,0.0
8,(SDM),(MDS),0.9,0.8,0.7,0.777778,0.972222,-0.02,0.9,-0.222222
9,(MDS),(SDM),0.8,0.9,0.7,0.875,0.972222,-0.02,0.8,-0.125


In [27]:
# see Q7 for the computation of clfq
clfq

Unnamed: 0,support,itemsets,closed
0,1.0,(CSA),True
5,0.8,"(MDS, CSA)",True
6,0.9,"(SDM, CSA)",True
7,0.7,"(RAY, CSA)",True
11,0.7,"(SDM, MDS, CSA)",True
12,0.7,"(SDM, CSA, EED)",True


In [28]:
# Closed frequent itemsets
clrl = association_rules(clfq, min_threshold=0.1, support_only=True)
clrl

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(MDS),(CSA),,,0.8,,,,,
1,(CSA),(MDS),,,0.8,,,,,
2,(SDM),(CSA),,,0.9,,,,,
3,(CSA),(SDM),,,0.9,,,,,
4,(RAY),(CSA),,,0.7,,,,,
5,(CSA),(RAY),,,0.7,,,,,
6,"(SDM, MDS)",(CSA),,,0.7,,,,,
7,"(SDM, CSA)",(MDS),,,0.7,,,,,
8,"(MDS, CSA)",(SDM),,,0.7,,,,,
9,(SDM),"(MDS, CSA)",,,0.7,,,,,


Generating rules from the closed frequent itemsets loses no information (except that, with mlxtend, the additional measures are not computed: this is a technical limitation, not a theoretical one).

For example, the rule {EED} $\to$ {CSA} is absent because it is covered by the rule {EED} $\to$ {CSA, SDM}, which has the same support values (for both the antecedent and the rule).

For the maximal frequent itemsets, the rule {SDM} $\to$ {CSA} is absent because it is covered. We know that its confidence is at least as high as those of {SDM} $\to$ {CSA, EED} and {SDM} $\to$ {CSA, MDS}, but its exact value cannot be recovered.

**Q21** Fill in the support values of the antecedents and consequents from the supports of the associated frequent itemsets (closed or maximal).

In [29]:
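def sup(itemset, fq):
    # Largest support among the itemsets of fq that contain `itemset`
    # (exact when fq holds closed itemsets, a lower bound when it holds maximal ones)
    return fq[fq['itemsets'] >= itemset]['support'].max()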

In [30]:
# Example
it = {"CSA", "SDM"}
print(sup(it, fq))
print(sup(it, clfq))
print(sup(it, maxfq))

0.9
0.9
0.7


In [31]:
# Closed frequent itemsets
# For closed itemsets the supports are exact
# From these, all the other indicators (confidence, lift, leverage, etc.) can be computed
clrl['antecedent support'] = clrl['antecedents'].apply(lambda x: sup(x, clfq))
clrl['consequent support'] = clrl['consequents'].apply(lambda x: sup(x, clfq))
clrl

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(MDS),(CSA),0.8,1.0,0.8,,,,,
1,(CSA),(MDS),1.0,0.8,0.8,,,,,
2,(SDM),(CSA),0.9,1.0,0.9,,,,,
3,(CSA),(SDM),1.0,0.9,0.9,,,,,
4,(RAY),(CSA),0.7,1.0,0.7,,,,,
5,(CSA),(RAY),1.0,0.7,0.7,,,,,
6,"(SDM, MDS)",(CSA),0.7,1.0,0.7,,,,,
7,"(SDM, CSA)",(MDS),0.9,0.8,0.7,,,,,
8,"(MDS, CSA)",(SDM),0.8,0.9,0.7,,,,,
9,(SDM),"(MDS, CSA)",0.9,0.8,0.7,,,,,
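Once the antecedent and consequent supports are filled in, the remaining columns follow by arithmetic. The sketch below recomputes confidence and lift from the closed itemsets alone, with their supports copied from the Q7 table; it is a pure-Python illustration, and the helper names are not part of mlxtend.

```python
# Supports of the closed frequent itemsets, copied from the Q7 table.
closed = {
    frozenset({"CSA"}): 1.0,
    frozenset({"MDS", "CSA"}): 0.8,
    frozenset({"SDM", "CSA"}): 0.9,
    frozenset({"RAY", "CSA"}): 0.7,
    frozenset({"SDM", "MDS", "CSA"}): 0.7,
    frozenset({"SDM", "CSA", "EED"}): 0.7,
}


def sup(itemset):
    """Support of a frequent itemset: largest support among its closed supersets."""
    itemset = frozenset(itemset)
    return max(s for c, s in closed.items() if c >= itemset)


def confidence(x, y):
    """conf(X -> Y) = sup(X u Y) / sup(X)."""
    return sup(set(x) | set(y)) / sup(x)


def lift(x, y):
    """lift(X -> Y) = conf(X -> Y) / sup(Y)."""
    return confidence(x, y) / sup(y)


x, y = {"SDM", "CSA"}, {"MDS"}
print(f"confidence: {confidence(x, y):.6f}")
print(f"lift:       {lift(x, y):.6f}")
```

The lookup only works for itemsets that are frequent at the chosen threshold; for any other itemset no closed superset exists and `max` raises a `ValueError`.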


In [32]:
# Maximal frequent itemsets
# For maximal itemsets, the computed supports may be lower than the true values
maxrl['antecedent support'] = maxrl['antecedents'].apply(lambda x: sup(x, maxfq))
maxrl['consequent support'] = maxrl['consequents'].apply(lambda x: sup(x, maxfq))
maxrl

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(RAY),(CSA),0.7,0.7,0.7,,,,,
1,(CSA),(RAY),0.7,0.7,0.7,,,,,
2,"(SDM, CSA)",(EED),0.7,0.7,0.7,,,,,
3,"(SDM, EED)",(CSA),0.7,0.7,0.7,,,,,
4,"(CSA, EED)",(SDM),0.7,0.7,0.7,,,,,
5,(SDM),"(CSA, EED)",0.7,0.7,0.7,,,,,
6,(CSA),"(SDM, EED)",0.7,0.7,0.7,,,,,
7,(EED),"(SDM, CSA)",0.7,0.7,0.7,,,,,
8,"(SDM, MDS)",(CSA),0.7,0.7,0.7,,,,,
9,"(SDM, CSA)",(MDS),0.7,0.7,0.7,,,,,
