This document aims to compute the probability that a typed word was meant to be another word from our database.
We assume that we have a list of known words.
We only look at how closely the typed word resembles other existing words, not at the words around it.

In [1]:
dictionnaire = ['aime', 'aimé', 'aimer', 'aimes', 'meme', 'aimez', 'mimez', 'mime', 'même', 'mimer', 'mimes', 'mimé']
# toy example; we will generalize this once we have Quentin's data

The method reads the typed word from left to right and counts, for each word in our dictionary, the number of letters that differ.
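As a rough illustration of this left-to-right comparison, here is a minimal sketch. The helper name `count_letter_differences` and the choice to count each unmatched trailing letter as one difference are assumptions made for the example, not something the notebook specifies.

```python
def count_letter_differences(typed_word, dictionary_word):
    """Count positions, read left to right, where the two words disagree.

    Assumption: each extra letter in the longer word counts as one difference.
    """
    # Compare the overlapping part position by position
    differences = sum(a != b for a, b in zip(typed_word, dictionary_word))
    # Every unmatched trailing letter adds one more difference
    differences += abs(len(typed_word) - len(dictionary_word))
    return differences


# Small subset of the toy dictionary
dictionnaire = ['aime', 'aimé', 'aimer', 'meme', 'mime']
differences = {mot: count_letter_differences('aime', mot) for mot in dictionnaire}
print(differences)
```

A purely positional count like this is sensitive to shifts: one missing letter near the start makes every later position disagree, which is one reason to also consider an edit distance.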

The probabilities will be derived from a score. Every possible kind of error has to be taken into account. We consider:
- A difference in word length;
- The number of differing letters (with a higher or lower score depending on whether the typed letter is close to the intended one on the keyboard);
- Possible letter transpositions;
- Accent mistakes.
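The notebook does not say how a score becomes a probability. One possible mapping, given purely as an assumption, is to weight each word by `exp(-score)` and normalize so the weights sum to 1. The function name `scores_to_probabilities`, the use of `math.inf` to mark an excluded word, and the example scores are all hypothetical.

```python
import math


def scores_to_probabilities(scores):
    """Turn {word: score} into {word: probability}; a lower score is more likely.

    Assumption: weight = exp(-score). A score of math.inf gives a weight of 0.
    """
    weights = {word: math.exp(-score) for word, score in scores.items()}
    total = sum(weights.values())
    if total == 0:
        # No candidate at all (empty input or every word excluded)
        return {word: 0.0 for word in scores}
    return {word: weight / total for word, weight in weights.items()}


# Hypothetical scores for the typed word 'aime'
scores = {'aime': 0, 'aimé': 1, 'mime': 3, 'mimer': math.inf}
probabilities = scores_to_probabilities(scores)
print(probabilities)
```

Any decreasing function of the score would work here; the exponential is only a convenient choice that keeps the weights positive.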

We assume that two words can only be close if their lengths are similar enough: the typed word may be at most 2 characters longer or 1 character shorter than the dictionary word.

In [5]:
def is_len_words_near(mot_tapé, mot_dico):
    assert isinstance(mot_tapé, str)
    assert isinstance(mot_dico, str)
    n1 = len(mot_tapé)
    n2 = len(mot_dico)
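    # The typed word may be at most 1 character shorter or 2 characters longer
    # than the dictionary word (chained comparison instead of the bitwise `&`)
    return n2 - 1 <= n1 <= n2 + 2

is_len_words_near('aime', 'aimeras')

False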

Words that fail this criterion are given a probability of 0 of being a match.

We also assume that the first letter of the word is always correct (this is what spell checkers do). Words that do not start with the same letter therefore get a probability of zero as well.

In [7]:
def is_first_letter_the_same(mot_tapé, mot_dico):
    assert isinstance(mot_tapé, str)
    assert isinstance(mot_dico, str)
    return (mot_tapé[0] == mot_dico[0])

is_first_letter_the_same('aime', 'mimer')

False
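The two filters can be combined to prune the dictionary before any scoring. A minimal sketch, assuming a hypothetical helper `candidate_words` that restates both rules inline so that it runs on its own:

```python
def candidate_words(typed_word, dictionary):
    """Keep only the dictionary words that pass both filters."""
    candidates = []
    for word in dictionary:
        if not typed_word or not word:
            continue  # an empty string has no first letter to compare
        same_first_letter = typed_word[0] == word[0]
        # Typed word at most 1 character shorter or 2 longer than the dictionary word
        similar_length = len(word) - 1 <= len(typed_word) <= len(word) + 2
        if same_first_letter and similar_length:
            candidates.append(word)
    return candidates


dictionnaire = ['aime', 'aimé', 'aimer', 'aimes', 'meme', 'aimez',
                'mimez', 'mime', 'même', 'mimer', 'mimes', 'mimé']
print(candidate_words('aime', dictionnaire))
# → ['aime', 'aimé', 'aimer', 'aimes', 'aimez']
```

Every other word would receive a probability of 0 and can be skipped when computing scores.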

Counting the differing letters: we add 3 to the score if the letter is different and far away on the keyboard, 1 if it is different but nearby, and 1 if it is an accent mistake.

In [None]:
dict_proches_clavier = {"a":["&","é","z","s","q"],"z":["&","é","a","s","q","d","e","\""],"e":["\'","\"","z","s","f","d","r"],
 "t":["-","y","g","f","h","r","("],"y":["-","t","g","j","h","u","è"],"u":["_","y","k","j","h","i","è"],
 "i":["_","u","k","j","l","o","ç"],"o":["à","i","k","m","l","p","ç"],"p":["o","l","m","ù",")","à"],
 "q":["a","z","s","w","x","<"],"s":["a","z","e","q","d","w","x"],"d":["s","z","e","r","f","x","c"],
 "f":["e","r","t","d","g","c","v"],"g":["r","t","y","h","f","v","b"],"h":["t","y","u","g","j","b","n"],
 "j":["u","i","y","h","k","n",","],"k":["u","i","o","j","l",",",";"],"l":["i","o","p","k","m",";",":"],
 "m":["o","p","ù","l",":","!"],"w":["<","q","s","d","x"],"x":["q","s","d","c","w"],"c":["s","d","f","x","v"],
 "v":["f","g","c","b"],"b":["v","g","h","n"],"n":["b","h","j",","],"r":["\'","t","g","f","d","e","("]}

In [None]:
def edit_distance(s1, s2):
    # Number of operations needed to turn one word into the other (character deletion, insertion, or substitution)
    dp = [[0 for j in range(len(s2)+1)] for i in range(len(s1)+1)]
    for i in range(1, len(s2)+1):
        dp[0][i] = i
    
    for i in range(1, len(s1)+1):
        dp[i][0] = i

    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            dp[i][j] = min(min(dp[i-1][j], dp[i][j-1]) + 1, dp[i-1][j-1] + (1 if s1[i-1] != s2[j-1] else 0))
    
    return dp[len(s1)][len(s2)]
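A distance limited to deletion, insertion, and substitution counts a swap of two adjacent letters as two operations, while letter transpositions are one of the error types listed earlier. A possible variant, sketched here as an assumption rather than as the notebook's method, is the optimal string alignment distance, which counts an adjacent swap as a single operation. The name `osa_distance` is hypothetical.

```python
def osa_distance(s1, s2):
    """Optimal string alignment distance.

    Same recurrence as the edit distance, plus one extra case:
    swapping two adjacent letters costs 1.
    """
    rows, cols = len(s1) + 1, len(s2) + 1
    dp = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dp[i][0] = i  # delete every letter of s1[:i]
    for j in range(cols):
        dp[0][j] = j  # insert every letter of s2[:j]

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
            # Adjacent transposition: the last two letters are swapped
            if (i > 1 and j > 1
                    and s1[i - 1] == s2[j - 2]
                    and s1[i - 2] == s2[j - 1]):
                dp[i][j] = min(dp[i][j], dp[i - 2][j - 2] + 1)
    return dp[-1][-1]


print(osa_distance('amie', 'aime'))   # → 1 (one swap of 'm' and 'i')
print(osa_distance('aime', 'aimer'))  # → 1 (one insertion)
```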

In [None]:
def edits1(word):
    """All edits that are one edit away from 'word'"""
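    # Includes French accented letters so that words such as 'aimé' or 'même' can be generated
    letters = 'abcdefghijklmnopqrstuvwxyzàâçéèêëîïôùûü'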
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
 
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

In [None]:
def editsk(word, k):
    """All edits that are at most k edits away from 'word' (excluding 'word' itself)"""
    e = edits1(word)
    prev = e
    for i in range(1, k):
        temp = set(e2 for e1 in prev for e2 in edits1(e1))
        e.update(temp)
        prev = temp
    e.discard(word)
    return e

In [None]:
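def score_letter_diff(lettre_tapée, lettre_dico):
    score = 0
    #if lettre_tapée in dico_accents and lettre_dico in dico_accents:
        #score += 1
    # Neighbouring key: small penalty. .get() avoids a KeyError for typed
    # letters that are missing from the table (e.g. 'é')
    if lettre_dico in dict_proches_clavier.get(lettre_tapée, []):
        score += 1
    # Different letter that is far away on the keyboard: large penalty
    elif lettre_dico != lettre_tapée:
        score += 3
    return score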
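The accent check is still commented out in the function above, and `dico_accents` is not defined anywhere in the notebook. One way to detect an accent mistake without a hand-written table, offered only as an assumption, is to compare the letters once their diacritics are stripped with the standard `unicodedata` module. The helper names `strip_accents` and `is_accent_mistake` are hypothetical.

```python
import unicodedata


def strip_accents(letter):
    """Return the text without its diacritics, e.g. 'é' -> 'e'."""
    # NFD splits an accented letter into a base letter plus combining marks
    decomposed = unicodedata.normalize('NFD', letter)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))


def is_accent_mistake(typed_letter, dictionary_letter):
    """True when the two letters differ only by their accent."""
    return (typed_letter != dictionary_letter
            and strip_accents(typed_letter) == strip_accents(dictionary_letter))


print(is_accent_mistake('e', 'é'))  # → True
print(is_accent_mistake('e', 'a'))  # → False
```

With this definition a cedilla is treated like an accent, so 'c' versus 'ç' also counts as an accent mistake.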
    