# PrefixSpan

**Sequential pattern mining** algorithms extract frequent patterns from sequence data. There are two main approaches: candidate generation and pattern growth.

+ **Candidate generation** algorithms, such as *AprioriAll* and *GSP* (Generalized Sequential Pattern), build candidate patterns by combining shorter frequent patterns. They then scan the database to count the occurrences of each candidate and discard the infrequent ones. These generation and filtering steps are repeated until all frequent patterns have been extracted.

+ **Pattern growth** algorithms, such as *PrefixSpan*, take a different approach. They build frequent patterns recursively, starting from frequent single items used as prefixes. Each prefix is then extended with the items that are frequent in the remainder of the sequences containing it (the projected database). This avoids generating candidates that never occur in the data and rescanning the whole database at every step, which often makes it more efficient than candidate generation.
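
To make the pattern-growth idea concrete, here is a minimal pure-Python sketch of prefix projection for sequences of single items. It is an illustration only, not the implementation used by the `prefixspan` package; the function name `prefixspan`, its parameters, and the toy database are assumptions made for this example.

```python
from collections import defaultdict


def prefixspan(sequences, min_support):
    """Return [(support, pattern), ...] for patterns found in at least min_support sequences."""
    results = []

    def grow(prefix, projected):
        # projected holds (sequence index, position right after the last matched item).
        # For each item, collect the sequences whose remaining suffix contains it.
        occurrences = defaultdict(list)
        for seq_id, start in projected:
            seen = set()
            for pos in range(start, len(sequences[seq_id])):
                item = sequences[seq_id][pos]
                if item not in seen:  # only the first occurrence in each suffix counts
                    seen.add(item)
                    occurrences[item].append((seq_id, pos + 1))
        for item, new_projected in occurrences.items():
            if len(new_projected) >= min_support:
                pattern = prefix + [item]
                results.append((len(new_projected), pattern))
                grow(pattern, new_projected)  # extend the frequent prefix further

    grow([], [(i, 0) for i in range(len(sequences))])
    return results


# Toy data (hypothetical): four short sequences over the items a, b, c.
toy_db = [["a", "b", "c"], ["a", "c"], ["b", "c"], ["a", "b"]]
patterns = prefixspan(toy_db, min_support=2)
print(patterns)
```

Only extensions that are frequent in the projected suffixes are explored, so no candidate is ever generated for a pattern that does not occur in the data.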

In [9]:
pip install prefixspan







In [10]:
from prefixspan import PrefixSpan
import pandas as pd
import numpy as np
from tqdm import tqdm
import pickle

In [11]:
%%capture capt
p_soins=pd.read_csv("../data/parcours_soins.csv")
patients=pd.read_csv("../data/profil_patient.csv")

In [12]:
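# Use the patient identifier as the index of both tables
p_soins = p_soins.rename(columns={"BEN_NIR_IDT": "CODE_PATIENT"})
p_soins = p_soins.set_index("CODE_PATIENT")
patients = patients.set_index("CODE_PATIENT")
# Index-aligned assignment: each care-pathway row receives its patient's cluster
p_soins["cluster"] = patients["cluster"]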

In [13]:
with open('../data/parcours_soins.pickle', 'rb') as handle:
    p_soin=pickle.load(handle)
    
with open('../data/parcours_soins_dp.pickle', 'rb') as handle:
    p_soin_dp=pickle.load(handle)
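
The content of the two pickles is not shown here. Judging from how they are used later (`dico[f"Cluster {cluster_p}"]` is passed directly to `PrefixSpan`), each one presumably maps a key such as `"Cluster 0"` to a list of code sequences, one per patient. The sketch below shows one way such a dictionary could be built; the helper `sequences_by_cluster`, the record layout, and the toy values are all hypothetical.

```python
from collections import defaultdict


def sequences_by_cluster(visits):
    """Group visit codes into one sequence per patient, then by cluster.

    visits: iterable of (patient_id, cluster, code), assumed to be in chronological order.
    """
    per_patient = defaultdict(list)
    cluster_of = {}
    for patient_id, cluster, code in visits:
        per_patient[patient_id].append(code)
        cluster_of[patient_id] = cluster

    grouped = defaultdict(list)
    for patient_id, codes in per_patient.items():
        grouped[f"Cluster {cluster_of[patient_id]}"].append(codes)
    return dict(grouped)


# Toy records (hypothetical): three patients, two clusters.
toy_visits = [(1, 0, "a"), (2, 1, "b"), (1, 0, "c"), (3, 0, "d")]
toy_dico = sequences_by_cluster(toy_visits)
print(toy_dico)
```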

In [14]:
def motifs_frequents(data, dico, topk):
    """For each cluster, return the topk-th most frequent pattern of minimum length 1, 2 and 3."""
    results = pd.DataFrame()

    for length in range(1, 4):
        top_effectif, top_freq, top_motif = [], [], []
        for cluster_p in range(18):
            ps = PrefixSpan(dico[f"Cluster {cluster_p}"])
            ps.minlen = length
            top = ps.topk(k=topk)  # mine once and reuse the result
            if len(top) >= topk:
                effectif, motif = top[topk - 1]
                effectif_cluster = (data.cluster == cluster_p).sum()
                top_effectif.append(effectif)
                top_freq.append(round(effectif / effectif_cluster, 3))
                top_motif.append(motif)
            else:
                # Fewer than topk patterns of this length: avoid an IndexError
                top_effectif.append(0)
                top_freq.append(0)
                top_motif.append([])

        results[f"len{length}_effectif"] = top_effectif
        results[f"len{length}_freq"] = top_freq
        results[f"len{length}_motif"] = top_motif

    return results
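
When reading the `effectif` and `freq` columns, it helps to recall what PrefixSpan counts: the support of a pattern is the number of sequences that contain its items in order, with gaps allowed, and each sequence counts at most once. The sketch below reproduces that count and the relative frequency in plain Python. The helper names are hypothetical, and `cluster_size` stands in for `effectif_cluster` above.

```python
def contains(sequence, pattern):
    """True if the items of pattern appear in sequence in order (gaps allowed)."""
    remaining = iter(sequence)
    # `item in remaining` consumes the iterator up to the first match,
    # so each pattern item must be found after the previous one.
    return all(item in remaining for item in pattern)


def support(sequences, pattern):
    """Number of sequences containing pattern; a sequence counts at most once."""
    return sum(contains(seq, pattern) for seq in sequences)


def relative_frequency(sequences, pattern, cluster_size):
    """Support divided by the cluster size, rounded to 3 decimals."""
    return round(support(sequences, pattern) / cluster_size, 3)


# Toy data (hypothetical): the pattern [x, x] needs two occurrences of x, not necessarily adjacent.
toy_sequences = [["x", "y", "x"], ["x", "y"], ["x", "x", "x"]]
print(support(toy_sequences, ["x", "x"]))
print(relative_frequency(toy_sequences, ["x", "x"], cluster_size=3))
```

Under this definition, a pattern such as `[A, A]` means the code `A` appears at least twice in a sequence, not that the two occurrences are consecutive.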

In [15]:
motifs_frequents(p_soins, p_soin, 1)

| cluster | len1_effectif | len1_freq | len1_motif | len2_effectif | len2_freq | len2_motif | len3_effectif | len3_freq | len3_motif |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 207 | 0.162 | [02C05J] | 28 | 0.022 | [05M092, 05M092] | 7 | 0.005 | [05M091, 05M092, 05M092] |
| 1 | 553 | 0.195 | [05M093] | 96 | 0.034 | [05M092, 05M092] | 18 | 0.006 | [05M092, 05M092, 05M092] |
| 2 | 38 | 0.158 | [06K04J] | 7 | 0.029 | [06K04J, 06K04J] | 4 | 0.017 | [06K04J, 06K04J, 06K04J] |
| 3 | 112 | 0.232 | [05K101] | 15 | 0.031 | [05K101, 05K061] | 3 | 0.006 | [05K101, 05K061, 05K101] |
| 4 | 180 | 0.377 | [05M092] | 83 | 0.174 | [05M092, 05M092] | 40 | 0.084 | [05M092, 05M092, 05M092] |
| 5 | 53 | 0.262 | [05K101] | 14 | 0.069 | [05K101, 05K101] | 10 | 0.05 | [11M171, 11M171, 11M171] |
| 6 | 136 | 0.35 | [05M092] | 49 | 0.126 | [05M092, 05M092] | 20 | 0.051 | [05M092, 05M093, 05M093] |
| 7 | 139 | 0.166 | [02C05J] | 13 | 0.016 | [05K101, 02C05J] | 4 | 0.005 | [05M092, 05M092, 05M092] |
| 8 | 109 | 0.304 | [05M092] | 41 | 0.115 | [05M092, 05M092] | 14 | 0.039 | [05M092, 05M092, 05M092] |
| 9 | 185 | 0.304 | [05K101] | 60 | 0.099 | [05M092, 05M092] | 28 | 0.046 | [05M092, 05M092, 05M092] |


In [16]:
motifs_frequents(p_soins, p_soin_dp, 1)

| cluster | len1_effectif | len1_freq | len1_motif | len2_effectif | len2_freq | len2_motif | len3_effectif | len3_freq | len3_motif |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 229 | 0.179 | [I500] | 58 | 0.045 | [I500, I500] | 22 | 0.017 | [I500, I500, I500] |
| 1 | 703 | 0.248 | [I500] | 156 | 0.055 | [I500, I500] | 38 | 0.013 | [I500, I500, I500] |
| 2 | 28 | 0.116 | [I500] | 6 | 0.025 | [Z098, Z098] | 2 | 0.008 | [Z098, Z098, Z098] |
| 3 | 67 | 0.139 | [I501] | 16 | 0.033 | [Z098, Z098] | 4 | 0.008 | [Z098, Z098, Z098] |
| 4 | 205 | 0.429 | [I500] | 89 | 0.186 | [I500, I500] | 52 | 0.109 | [I500, I500, I500] |
| 5 | 48 | 0.238 | [I501] | 23 | 0.114 | [I501, I501] | 14 | 0.069 | [Z940, Z940, Z940] |
| 6 | 141 | 0.362 | [I500] | 72 | 0.185 | [I500, I500] | 38 | 0.098 | [I500, I500, I500] |
| 7 | 146 | 0.175 | [I500] | 22 | 0.026 | [I500, I500] | 7 | 0.008 | [I500, I500, I500] |
| 8 | 138 | 0.385 | [I501] | 53 | 0.148 | [I501, I501] | 27 | 0.075 | [I501, I501, I501] |
| 9 | 198 | 0.326 | [I500] | 87 | 0.143 | [I500, I500] | 47 | 0.077 | [I500, I500, I500] |
