# Introduction

In this essay I will try to apply functions presented in the NLP courses. At the end of this notebook I will apply the most interesting functions to our dataset and try to improve our code.

In this Jupyter notebook, we import the data and then split it into three groups: the forms filled in by the people concerned, by their caregivers, and by their relatives. We will analyze each category and compare the results.

The first analysis step is data cleaning. This step is long and is a priority in our project. We want to extract needs from texts in which only the most important information remains.

The second step is to reduce the remaining words to their root form. We want to strip these words down. We call this lemmatization.

The next step is meant to help us classify the remaining words by grouping them according to their importance, their meaning, and the sentiment the sentences may convey. We call this clustering. We will develop several different ways of obtaining results, then compare them.

The documents needed to run the Jupyter notebook are in the Colab folder of the git project named "Idlys". Simply download it, then go to the sidebar on the left of your screen in Colab, click the fourth category, named "Dossiers" or "Files", and upload your files. It is important that they are not imported via "sample_data" or from a drive, because that would change the import path and the files could not be read.

# Libraries & Data Import

**Imports**

> Basic imports

In [None]:
import math
import random
import numpy as np
import pandas as pd
import os
import json
import re
import scipy
import sklearn
import pdb
import pickle
import string
import time
import gensim
import matplotlib.pyplot as plt

> Imports needed for NLP

In [None]:
# lemmatizer spacy
!pip install spacy-lefff
import spacy
from spacy_lefff import LefffLemmatizer, POSTagger


# from nltk
import nltk.corpus
import nltk as nlp
nltk.download('wordnet')
nltk.download('twitter_samples')
nltk.download('stopwords')
nltk.data.path.append('.')
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer,PorterStemmer
from nltk.corpus import stopwords, twitter_samples
from nltk.tokenize import TweetTokenizer

# from gensim
from gensim.models import KeyedVectors

# from os 
from os import getcwd

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package twitter_samples to /root/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
!python -m spacy download fr

✔ Download and installation successful
You can now load the model via spacy.load('fr_core_news_sm')
✔ Linking successful
/usr/local/lib/python3.6/dist-packages/fr_core_news_sm -->
/usr/local/lib/python3.6/dist-packages/spacy/data/fr
You can now load the model via spacy.load('fr')


In [None]:
import spacy

fr_spaCy = spacy.load("fr")

In [None]:
#!pip install urllib3==1.25.10

#!pip install smart_open==2.0.0

> Import of the functions created in the companion files

In [None]:
from nettoyage import stopwords
from Frequence import count_words, get_words_with_nplus_frequency, count_n_grams, estimate_probability, estimate_probabilities
from Pour_aller_plus_loin import make_count_matrix, make_probability_matrix
from Neighboor import cosine_similarity, get_dict, get_document_embedding, get_document_vecs, hash_value_of_vector, make_hash_table, approximate_knn, nearest_neighbor

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


**Concatenate the data and create three dataframes, one per type of respondent**

We import our data. We have three kinds of forms.

We have:

> df1: responses from the people concerned

> df2: responses from their caregivers

> df3: responses from their relatives

In [None]:
# Reading the Excel file and turning it into a single string
'''df1 = pd.read_excel('/content/sample_data/Copie-de-réponses-proches-51-100.xlsx', header = None)
df1.head()'''

"df1 = pd.read_excel('/content/sample_data/Copie-de-réponses-proches-51-100.xlsx', header = None)\ndf1.head()"

In [None]:
'''df2 = pd.read_excel('/content/sample_data/Copie-de-réponses-proches-51-100.xlsx', header = None)
df2.head()'''

"df2 = pd.read_excel('/content/sample_data/Copie-de-réponses-proches-51-100.xlsx', header = None)\ndf2.head()"

In [None]:
df3 = pd.read_excel('Copie-de-réponses-proches-51-100.xlsx', header = None)
print(df3.head())


    0  ...                                                  6
0  N°  ...       Si seulement il existait un truc pour faire…
1  51  ...    Un fauteuil électrique avec détection de danger
2  52  ...  Militer pour que le langage des signes soit ad...
3  53  ...  OUVRIR UNE PORTE D UN PLACAR AVEC UNE TELECOMM...
4  58  ...  Il faut décomposer chaque geste construit en d...

[5 rows x 7 columns]


**Select the columns that express needs**

In this section we select the columns that may express needs.

We have three different categories of forms. We therefore create three separate datasets, one per category, containing these columns.

> cat1: category 1, person concerned

> cat2: category 2, caregivers

> cat3: category 3, relatives of the person concerned

In [None]:
'''cat1 = df1[6]
cat1 =cat1.to_string(index=False)
print(cat1)'''

'cat1 = df1[6]\ncat1 =cat1.to_string(index=False)\nprint(cat1)'

In [None]:
'''cat2 = df2[6]
cat2 =cat2.to_string(index=False)
print(cat2)'''

'cat2 = df2[6]\ncat2 =cat2.to_string(index=False)\nprint(cat2)'

In [None]:
cat3 = df3[6]
cat3 =cat3.to_string(index=False)
print(cat3)

      Si seulement il existait un truc pour faire…
   Un fauteuil électrique avec détection de danger
 Militer pour que le langage des signes soit ad...
 OUVRIR UNE PORTE D UN PLACAR AVEC UNE TELECOMM...
 Il faut décomposer chaque geste construit en d...
                                               NaN
 Attacher le fauteuil dans le véhicule sans se ...
 pour transforme run fauteuil roulant en fauteu...
 Un déambulateur performant pliable pour aider ...
                                               NaN
 Un truc pour alerter automatiquement le fourni...
 Un système pour permettre la lecture au lit (o...
 Un truc aussi pour pouvoir appeler au téléphon...
 S'il existait un truc pour aider une personne ...
 ne pas se contorsioner pour attacher le fauteu...
 Des parcs de jeux adaptés !!! DES plages avec ...
                domotique pour fermer le domicile?
                                               NaN
                                                 X
 Pas de solutions po

# Cleaning

We create a function that cleans our dataset. This function calls other functions defined in other files.

> Reminder: these functions are in the .py file named "Nettoyage".

In [None]:
def clean(data): 

  # bag of words: split the text on whitespace
  data_1 = data.split()
  #print(data_1)

  # lowercase
  data_2 = [word.lower() for word in data_1]
  #print(data_2)

  # remove all stop words
  data_3 = stopwords(data_2)
  #print(data_3)

  # remove every "nan" token from the text
  data_4 = [x for x in data_3 if str(x) != 'nan']
  #print(data_4)

  return data_4
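
As a self-contained illustration of the steps in `clean`, here is a minimal sketch that does not rely on the `nettoyage` module. The stop-word set `FRENCH_STOPWORDS`, the helper name `clean_tokens` and the stripping of edge punctuation are assumptions made for this example, not the project's actual implementation. Punctuation stripping is included because tokens such as 'faire…', '!!!' or 'domicile?' otherwise survive the cleaning.

```python
import string

# Hypothetical, deliberately tiny stop-word list (assumption, for illustration only).
FRENCH_STOPWORDS = {"un", "une", "le", "la", "les", "de", "des", "du",
                    "pour", "avec", "il", "si", "seulement"}

# Characters stripped from both ends of a token: ASCII punctuation plus the ellipsis.
EDGE_CHARS = string.punctuation + "…"


def clean_tokens(text):
    """Split on whitespace, lowercase, strip edge punctuation, drop stop words and 'nan'."""
    tokens = []
    for raw in text.split():
        token = raw.lower().strip(EDGE_CHARS)
        # Skip tokens that are empty after stripping, missing values and stop words.
        if not token or token == "nan" or token in FRENCH_STOPWORDS:
            continue
        tokens.append(token)
    return tokens


sample = "Si seulement il existait un truc pour faire… NaN Des parcs de jeux adaptés !!!"
print(clean_tokens(sample))
```

Stripping only the edges of each token keeps internal characters intact, so "s'il" or "www.changing-places.org" would not be split.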

We apply our function to all categories. We create a new dataset for each category.

> d1_net: cleaned data, category 1

> d2_net: cleaned data, category 2

> d3_net: cleaned data, category 3

In [None]:
'''d1_net = clean(cat1)
d2_net = clean(cat2)
d3_net = clean(cat3)'''

'd1_net = clean(cat1)\nd2_net = clean(cat2)\nd3_net = clean(cat3)'

In [None]:
d3_net = clean(cat3)
print(d3_net)

['existait', 'truc', 'faire…', 'fauteuil', 'électrique', 'détection', 'danger', 'militer', 'langage', 'signes', 'ad...', 'ouvrir', 'porte', 'placar', 'telecomm...', 'faut', 'décomposer', 'geste', 'construit', 'd...', 'attacher', 'fauteuil', 'véhicule', '...', 'transforme', 'run', 'fauteuil', 'roulant', 'fauteu...', 'déambulateur', 'performant', 'pliable', 'aider', '...', 'truc', 'alerter', 'automatiquement', 'fourni...', 'système', 'permettre', 'lecture', 'lit', '(o...', 'truc', 'pouvoir', 'appeler', 'téléphon...', "s'il", 'existait', 'truc', 'aider', '...', 'contorsioner', 'attacher', 'fauteu...', 'parcs', 'jeux', 'adaptés', '!!!', 'plages', '...', 'domotique', 'fermer', 'domicile?', 'solutions', 'attacher', 'correctement', 'mo...', 'www.changing-places.org', 'voir', 'favoriser', 'hand', 'sport', 'petit.', 'lew', 'parents', 'pouvait', 'aid...', 'oui', 'enlever', 'spasticité', 'membres', 'inférieu...']


# Frequency



In this section we want to compare the kinds of words found in each category.

We will look at the frequency of words in each text. We could find out whether some words recur more often than others in each category or, for example, whether one category uses certain words more than the others.

We will proceed in several steps. First we count each word, then compute its frequency. Next, we use more advanced features such as n-grams, which let us count not only single words but also pairs of words, and so analyze our text better.

Then we will see how to display word frequencies in our dataset. We first pick a keyword, a target, and compute its frequency. We then generalize this operation to the whole dataset and compute the probability of each word.
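
The counting and thresholding steps can be sketched with the standard library alone. The helper names `words_with_min_count` and `relative_frequencies` and the toy token list are assumptions for this example; the project's `count_words` and `get_words_with_nplus_frequency` live in the companion files and may behave differently.

```python
from collections import Counter


def words_with_min_count(tokens, min_count):
    """Return the words that occur at least `min_count` times, in first-seen order."""
    counts = Counter(tokens)
    return [word for word, count in counts.items() if count >= min_count]


def relative_frequencies(tokens):
    """Map each word to its count divided by the total number of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


# Toy token list (assumption), in the spirit of the cleaned responses.
tokens = ["truc", "fauteuil", "truc", "attacher", "fauteuil", "truc", "domotique"]

print(Counter(tokens).most_common(2))       # the two most frequent words with their counts
print(words_with_min_count(tokens, 2))      # words seen at least twice
print(relative_frequencies(tokens)["truc"]) # share of tokens equal to "truc"
```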

In [None]:
def frequence(data):

   # count each word
   data_1 = count_words(data)
   print("Count:", data_1)

   # words that occur at least count_threshold times
   data_2 = get_words_with_nplus_frequency(data, count_threshold=2)
   print("Frequency:", data_2)

   # N-grams
   data_3 = count_n_grams(data, 1)
   print("N-grams:", data_3)
   data_32 = count_n_grams(data, 2)
   print("N-grams for word pairs:", data_32)

   # probability of the target word
   #data_4 = estimate_probability(data)
   unique_words = data[:50] # take the first 50 tokens of the dataset passed in
   data_4 = estimate_probability("faire", "truc", data_3, data_32, len(unique_words), k=1)
   print("Probability of the target:", data_4)

   # probability of every word
   data_5 = estimate_probabilities("faire", data_3, data_32, unique_words, k=1)
   print("Probability of each word:", data_5)

   return 



Test:

In [None]:
Get_Freq = frequence(d3_net)

Count: {'existait': 2, 'truc': 4, 'faire…': 1, 'fauteuil': 3, 'électrique': 1, 'détection': 1, 'danger': 1, 'militer': 1, 'langage': 1, 'signes': 1, 'ad...': 1, 'ouvrir': 1, 'porte': 1, 'placar': 1, 'telecomm...': 1, 'faut': 1, 'décomposer': 1, 'geste': 1, 'construit': 1, 'd...': 1, 'attacher': 3, 'véhicule': 1, '...': 4, 'transforme': 1, 'run': 1, 'roulant': 1, 'fauteu...': 2, 'déambulateur': 1, 'performant': 1, 'pliable': 1, 'aider': 2, 'alerter': 1, 'automatiquement': 1, 'fourni...': 1, 'système': 1, 'permettre': 1, 'lecture': 1, 'lit': 1, '(o...': 1, 'pouvoir': 1, 'appeler': 1, 'téléphon...': 1, "s'il": 1, 'contorsioner': 1, 'parcs': 1, 'jeux': 1, 'adaptés': 1, '!!!': 1, 'plages': 1, 'domotique': 1, 'fermer': 1, 'domicile?': 1, 'solutions': 1, 'correctement': 1, 'mo...': 1, 'www.changing-places.org': 1, 'voir': 1, 'favoriser': 1, 'hand': 1, 'sport': 1, 'petit.': 1, 'lew': 1, 'parents': 1, 'pouvait': 1, 'aid...': 1, 'oui': 1, 'enlever': 1, 'spasticité': 1, 'membres': 1,

In this section we explored our data. We looked for the words that appear most often and then for their exact frequency. We also introduced n-grams, which count every word in the dataset as well as possible word pairs. It is interesting to see that some words go in pairs, such as "fauteuils roulants" or "langue signes". These pairs can have a high frequency and can give us clues about the needs or the main themes of the responses.



# Going further

> To go further and gain a better understanding, here is a short documentation of the features used in the function.

> Reminder: all functions used in this section are in the Python (.py) files named **Fréquence** and **Neighboor**

**N-gram**



In this section, we develop the n-gram language model.
- Assume that the probability of the next word depends only on the preceding n-gram.
- The preceding n-gram is the sequence of the $n$ previous words.

The conditional probability of the word at position $t$ in the sentence, given that the words preceding it are $w_{t-1}, w_{t-2} \cdots w_{t-n}$, is:

$$ P(w_t | w_{t-1}\dots w_{t-n}) \tag{1}$$

We can estimate this probability by counting the occurrences of these word sequences in the training data.
- The probability can be estimated as a ratio, where
- the numerator is the number of times the word $w_t$ appears after the words $w_{t-1}$ to $w_{t-n}$ in the training data;
- the denominator is the number of times the words $w_{t-1}$ to $w_{t-n}$ appear in the training data.

$$ \hat{P}(w_t | w_{t-1}\dots w_{t-n}) = \frac{C(w_{t-1}\dots w_{t-n}, w_t)}{C(w_{t-1}\dots w_{t-n})} \tag{2} $$

- The function $C(\cdots)$ gives the number of occurrences of the given sequence.
- The function $\hat{P}$ denotes the estimate of $P$.
- Note that the denominator of equation (2) is the number of occurrences of the $n$ previous words, and the numerator is the count of the same sequence followed by the word $w_t$.

Later, we will modify equation (2) by adding k-smoothing, which avoids errors when counts are zero.

Equation (2) tells us that, to estimate probabilities based on n-grams, we need the counts of n-grams (for the denominator) and of (n+1)-grams (for the numerator).


**Counts-grams**

We create a function that computes n-gram counts for an arbitrary number $n$.
When computing the n-gram counts, prepare the sentence beforehand by prepending $n-1$ start markers "< s >" to indicate the beginning of the sentence.

    - For example, in the bi-gram model (N=2), a sequence with two start markers "<s><s>" should predict the first word of a sentence.
    - So, if the sentence is "J'aime la nourriture", modify it to be "<s><s> J'aime la nourriture".
    - Also prepare the sentence for counting by appending an end token "<e>" so that the model can predict when to finish a sentence.

Technical note: in this implementation, the counts are stored as a dictionary.

    - The key of each key-value pair in the dictionary is a tuple of n words (not a list).
    - The value in the key-value pair is the number of occurrences.
    - A tuple is used as the key instead of a list because a Python list is a mutable object (it can be changed after it is created). A tuple is immutable: it cannot be changed after it is created, so it can be used as a dictionary key.
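
The counting scheme described above can be sketched as follows. The name `count_ngrams_sketch` is hypothetical, the input is assumed to be a list of tokenised sentences, and $n-1$ start markers are prepended as stated in the text; the project's own `count_n_grams` may pad or iterate differently.

```python
from collections import defaultdict

START_TOKEN = "<s>"
END_TOKEN = "<e>"


def count_ngrams_sketch(sentences, n):
    """Count n-grams over tokenised sentences.

    Each sentence is padded with n-1 start markers and one end marker.
    Keys are tuples of n tokens (tuples are hashable, lists are not).
    """
    counts = defaultdict(int)
    for sentence in sentences:
        padded = [START_TOKEN] * (n - 1) + list(sentence) + [END_TOKEN]
        # Slide a window of width n over the padded sentence.
        for i in range(len(padded) - n + 1):
            counts[tuple(padded[i:i + n])] += 1
    return dict(counts)


# Toy sentences (assumption), already cleaned and tokenised.
sentences = [["fauteuil", "roulant"], ["fauteuil", "électrique"]]

print(count_ngrams_sketch(sentences, 1))
print(count_ngrams_sketch(sentences, 2))
```

Using a tuple as the key is what makes `counts[tuple(...)]` legal; the same expression with a list would raise a `TypeError` because lists are unhashable.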

In [None]:
print("Uni-gram:")
print(count_n_grams(d3_net, 1))
print("Bi-gram:")
print(count_n_grams(d3_net, 2))

Uni-gram:
{('existait',): 2, ('truc',): 4, ('faire…',): 1, ('fauteuil',): 3, ('électrique',): 1, ('détection',): 1, ('danger',): 1, ('militer',): 1, ('langage',): 1, ('signes',): 1, ('ad...',): 1, ('ouvrir',): 1, ('porte',): 1, ('placar',): 1, ('telecomm...',): 1, ('faut',): 1, ('décomposer',): 1, ('geste',): 1, ('construit',): 1, ('d...',): 1, ('attacher',): 3, ('véhicule',): 1, ('...',): 4, ('transforme',): 1, ('run',): 1, ('roulant',): 1, ('fauteu...',): 2, ('déambulateur',): 1, ('performant',): 1, ('pliable',): 1, ('aider',): 2, ('alerter',): 1, ('automatiquement',): 1, ('fourni...',): 1, ('système',): 1, ('permettre',): 1, ('lecture',): 1, ('lit',): 1, ('(o...',): 1, ('pouvoir',): 1, ('appeler',): 1, ('téléphon...',): 1, ("s'il",): 1, ('contorsioner',): 1, ('parcs',): 1, ('jeux',): 1, ('adaptés',): 1, ('!!!',): 1, ('plages',): 1, ('domotique',): 1, ('fermer',): 1, ('domicile?',): 1, ('solutions',): 1, ('correctement',): 1, ('mo...',): 1, ('www.changing-places.org',): 1,

**Estimating the probability of a target word**

We want to estimate the probability of a given word, given the $n$ previous words, using the n-gram counts.

$$ \hat{P}(w_t | w_{t-1}\dots w_{t-n}) = \frac{C(w_{t-1}\dots w_{t-n}, w_t)}{C(w_{t-1}\dots w_{t-n})} \tag{2} $$

This formula does not work when the count of an n-gram is zero.
- Suppose we encounter an n-gram that did not appear in the training data.
- Then equation (2) cannot be evaluated (it becomes zero divided by zero).

One way to handle zero counts is to add k-smoothing.
- K-smoothing adds a positive constant $k$ to each numerator and $k \times |V|$ to the denominator, where $|V|$ is the number of words in the vocabulary.

$$ \hat{P}(w_t | w_{t-1}\dots w_{t-n}) = \frac{C(w_{t-1}\dots w_{t-n}, w_t) + k}{C(w_{t-1}\dots w_{t-n}) + k|V|} \tag{3} $$


For n-grams with a count of zero, equation (3) reduces to $\frac{k}{k|V|} = \frac{1}{|V|}$.
- This means that every n-gram with a zero count gets the same probability, $\frac{1}{|V|}$.

We define a function that computes the probability estimate (3) from the n-gram counts and a constant $k$.

- The function takes a dictionary "n_gram_counts", where the key is the n-gram and the value is the count of that n-gram.
- The function also takes another dictionary, "n_plus1_gram_counts", which is used to look up the count of the previous n-gram followed by the current word.
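
Equation (3) can be sketched in a few lines. The name `smoothed_probability` and the toy counts are assumptions for this example; it is not the `estimate_probability` function from the companion files. In this sketch the context must be a sequence of tokens (a list or tuple): `tuple("truc")` would split a bare string into single characters, so the lookup would find no count and fall back to $\frac{1}{|V|}$.

```python
def smoothed_probability(word, previous_ngram, ngram_counts, nplus1_counts,
                         vocabulary_size, k=1.0):
    """Add-k estimate of P(word | previous_ngram), as in equation (3)."""
    previous_ngram = tuple(previous_ngram)
    # Numerator: count of the context followed by the word, plus k.
    numerator = nplus1_counts.get(previous_ngram + (word,), 0) + k
    # Denominator: count of the context alone, plus k times the vocabulary size.
    denominator = ngram_counts.get(previous_ngram, 0) + k * vocabulary_size
    return numerator / denominator


# Toy counts (assumption): "truc" seen 4 times, once followed by "faire".
unigram_counts = {("truc",): 4, ("faire",): 1}
bigram_counts = {("truc", "faire"): 1}

seen = smoothed_probability("faire", ["truc"], unigram_counts, bigram_counts, 50)
unseen_context = smoothed_probability("faire", ["inconnu"], unigram_counts, bigram_counts, 50)
print(f"{seen:.4f}")            # (1 + 1) / (4 + 50)
print(f"{unseen_context:.4f}")  # (0 + 1) / (0 + 50), i.e. 1 / |V|
```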

In [None]:
#test
unique_words = [d3_net[i] for i in range(0, 50)]

unigram_counts = count_n_grams(d3_net, 1)
bigram_counts = count_n_grams(d3_net, 2)
tmp_prob = estimate_probability(d3_net[3], d3_net[2], unigram_counts, bigram_counts, len(unique_words), k=1)

# The message names the tokens actually passed above (0-based indices 3 and 2)
print(f"The estimated probability of the word '{d3_net[3]}', given the previous n-gram '{d3_net[2]}', is: {tmp_prob:.4f}")

The estimated probability of the word 'fauteuil', given the previous n-gram 'faire…', is: 0.0200


**Estimating the probability of all words**

The function used below loops over all the words in the vocabulary to compute the probability of every possible word.
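
A minimal sketch of such a loop is given here. The name `smoothed_distribution` is hypothetical, and two choices are assumptions of this example rather than facts about the companion code: the vocabulary is deduplicated first, and the `<e>` and `<oov>` markers (the two extra columns visible in the matrices of this notebook) are appended as candidates.

```python
def smoothed_distribution(previous_ngram, ngram_counts, nplus1_counts, vocabulary, k=1.0):
    """Add-k probability of every candidate next word after `previous_ngram`."""
    previous_ngram = tuple(previous_ngram)
    # Deduplicate while keeping order, then add the end and out-of-vocabulary markers.
    candidates = list(dict.fromkeys(vocabulary)) + ["<e>", "<oov>"]
    denominator = ngram_counts.get(previous_ngram, 0) + k * len(candidates)
    return {
        word: (nplus1_counts.get(previous_ngram + (word,), 0) + k) / denominator
        for word in candidates
    }


# Toy counts (assumption): "truc" seen twice, followed once by each of two words.
unigram_counts = {("truc",): 2, ("faire",): 1, ("alerter",): 1}
bigram_counts = {("truc", "faire"): 1, ("truc", "alerter"): 1}
vocabulary = ["truc", "faire", "alerter", "truc"]  # the duplicate is removed

probabilities = smoothed_distribution(["truc"], unigram_counts, bigram_counts, vocabulary)
for word, probability in probabilities.items():
    print(f"{word}: {probability:.4f}")
```

When the bigram counts for a context add up to the count of the context itself, as in this toy example, the returned probabilities sum to 1.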

In [None]:

#test
unique_words = [d3_net[i] for i in range(0, 50)]
unigram_counts = count_n_grams(d3_net, 1)
bigram_counts = count_n_grams(d3_net, 2)
prob_globale = estimate_probabilities(d3_net[2], unigram_counts, bigram_counts, unique_words, k=1)
print(prob_globale)

{'existait': 0.019230769230769232, 'truc': 0.019230769230769232, 'faire…': 0.019230769230769232, 'fauteuil': 0.019230769230769232, 'électrique': 0.019230769230769232, 'détection': 0.019230769230769232, 'danger': 0.019230769230769232, 'militer': 0.019230769230769232, 'langage': 0.019230769230769232, 'signes': 0.019230769230769232, 'ad...': 0.019230769230769232, 'ouvrir': 0.019230769230769232, 'porte': 0.019230769230769232, 'placar': 0.019230769230769232, 'telecomm...': 0.019230769230769232, 'faut': 0.019230769230769232, 'décomposer': 0.019230769230769232, 'geste': 0.019230769230769232, 'construit': 0.019230769230769232, 'd...': 0.019230769230769232, 'attacher': 0.019230769230769232, 'véhicule': 0.019230769230769232, '...': 0.019230769230769232, 'transforme': 0.019230769230769232, 'run': 0.019230769230769232, 'roulant': 0.019230769230769232, 'fauteu...': 0.019230769230769232, 'déambulateur': 0.019230769230769232, 'performant': 0.019230769230769232, 'pliable': 0.019230769230769232,

**Probability matrices**

The table below is the first matrix we build here: the bigram count matrix, from which the probability matrices are derived. It lets us see which words may be associated in our text. It also shows two kinds of association. The first is words that must be combined with another word to make sense, such as "langage signes". The second is an association that gives a word more impact, or a combination of words that conveys an idea, such as "militer langage", "pouvoir appeler" or "détection danger".

What could be done next is to separate these two kinds, keeping the second but transforming the first. We would build a list of recurring expressions, such as "fauteuil roulant" or "langage signes", each of which could be replaced by a single equivalent word. This would clean up the matrix and would probably let us observe themes or extract ideas from the text.
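
The replacement of recurring expressions by a single token could look like the sketch below. The function name `merge_expressions`, the `EXPRESSIONS` mapping and the underscore-joined replacement tokens are all assumptions for illustration; the list of expressions would have to be built from the actual data.

```python
# Hypothetical mapping from an adjacent pair of tokens to a single replacement token.
EXPRESSIONS = {
    ("fauteuil", "roulant"): "fauteuil_roulant",
    ("langage", "signes"): "langage_signes",
}


def merge_expressions(tokens, expressions):
    """Replace each listed adjacent pair of tokens with its single-token equivalent."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in expressions:
            merged.append(expressions[pair])
            i += 2  # skip both words of the expression
        else:
            merged.append(tokens[i])
            i += 1
    return merged


tokens = ["transforme", "fauteuil", "roulant", "militer", "langage", "signes", "fauteuil"]
print(merge_expressions(tokens, EXPRESSIONS))
```

Running this step before the n-gram counts would turn each expression into one row and one column of the matrices, leaving the bigrams to capture the second kind of association.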



In [None]:
unique_words = [d3_net[i] for i in range(0, 70)]
bigram_counts = count_n_grams(d3_net, 2)

print('bigram counts')
display(make_count_matrix(bigram_counts, unique_words))

bigram counts


Unnamed: 0,existait,truc,faire…,fauteuil,électrique,détection,danger,militer,langage,signes,ad...,ouvrir,porte,placar,telecomm...,faut,décomposer,geste,construit,d...,attacher,fauteuil.1,véhicule,...,transforme,run,fauteuil.2,roulant,fauteu...,déambulateur,performant,pliable,aider,....1,truc.1,alerter,automatiquement,fourni...,système,permettre,lecture,lit,(o...,truc.2,pouvoir,appeler,téléphon...,s'il,existait.1,truc.3,aider.1,....2,contorsioner,attacher.1,fauteu....1,parcs,jeux,adaptés,!!!,plages,....3,domotique,fermer,domicile?,solutions,attacher.2,correctement,mo...,www.changing-places.org,voir,<e>,<oov>
"(aider,)",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"(système,)",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"(construit,)",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"(domotique,)",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"(correctement,)",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
"(d√©ambulateur,)",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
(),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"(faut,)",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"(attacher,)",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


The following matrices are further probability matrices. They show the probability of each word of the text given the other words. The first is a bigram probability matrix, that is, one built on pairs of words. It shows the probability of one word given the word before it. For example, if the word "véhicule" occurs in the text, the estimated probability that "existait" follows it is 0.018868. That value is 1/53, the smoothed probability assigned to a pair that was never observed.

The second matrix uses trigrams: for each pair of words, the matrix gives the probability of the word that follows. For example, the matrix takes a pair of words from the text, (ouvrir, porte). These words appear strongly associated. It then shows a comparatively high probability for the word "placar", namely 0.0377. This can be interpreted as the matrix finding associated words that may belong to a particular theme, here the act of opening a door. It then predicts a higher probability that other similar words, or words belonging to the same theme, appear. The word "placar" can be seen as similar to the word "porte" here, because using it requires the same action: opening.
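
The values quoted above can be checked by hand. The sketch below reproduces the arithmetic for a single row under stated assumptions: 52 columns (50 vocabulary entries plus `<e>` and `<oov>`, as in the displayed headers), one observed continuation in the row, and k = 1. The helper name `smoothed_row` and the placeholder column names are hypothetical.

```python
from fractions import Fraction


def smoothed_row(count_row, k=1):
    """Add k to every cell of one row of counts, then divide by the new row sum."""
    total = sum(count_row.values()) + k * len(count_row)
    return {word: Fraction(count + k, total) for word, count in count_row.items()}


# One row with 52 columns (placeholder names) and a single observed continuation.
columns = [f"w{i}" for i in range(52)]
row = dict.fromkeys(columns, 0)
row["w5"] = 1

probabilities = smoothed_row(row, k=1)
print(probabilities["w5"], round(float(probabilities["w5"]), 6))  # observed pair
print(probabilities["w0"], round(float(probabilities["w0"]), 6))  # never-observed pair
```

With these assumptions the observed pair gets 2/53 and every unobserved pair gets 1/53, which matches the 0.037736 and 0.018868 entries in the tables; a row with no observation at all gets 1/52 in every cell.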

In [None]:
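def make_probability_matrix(n_plus1_gram_counts, vocabulary, k):
    # Counts of each (n-gram, next word) pair, one row per n-gram
    count_matrix = make_count_matrix(n_plus1_gram_counts, vocabulary)
    # Add-k smoothing: every cell gets k before normalising
    count_matrix += k
    # Divide each row by its sum so that every row sums to 1
    prob_matrix = count_matrix.div(count_matrix.sum(axis=1), axis=0)
    return prob_matrix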

In [None]:
unique_words = [d3_net[i] for i in range(0, 50)]
bigram_counts = count_n_grams(d3_net, 2)
print("bigram probabilities")
#print(make_probability_matrix(bigram_counts, unique_words, k=1))
display(make_probability_matrix(bigram_counts, unique_words, k=1))

bigram probabilities
2021-01-06 21:26:30,460 - numexpr.utils - INFO - NumExpr defaulting to 2 threads.


Unnamed: 0,existait,truc,faire‚Ä¶,fauteuil,√©lectrique,d√©tection,danger,militer,langage,signes,ad...,ouvrir,porte,placar,telecomm...,faut,d√©composer,geste,construit,d...,attacher,fauteuil.1,v√©hicule,...,transforme,run,fauteuil.2,roulant,fauteu...,d√©ambulateur,performant,pliable,aider,....1,truc.1,alerter,automatiquement,fourni...,syst√®me,permettre,lecture,lit,(o...,truc.2,pouvoir,appeler,t√©l√©phon...,s'il,existait.1,truc.3,<e>,<oov>
"(aider,)",0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.055556,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519
"(syst√®me,)",0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.037736,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868
"(construit,)",0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.037736,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868
"(domotique,)",0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231
"(correctement,)",0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
"(déambulateur,)",0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.037736,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868
(),0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231
"(faut,)",0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.037736,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868
"(attacher,)",0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.037037,0.018519,0.037037,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519,0.018519


In [None]:
print("trigram probabilities")
trigram_counts = count_n_grams(d3_net, 3)
display(make_probability_matrix(trigram_counts, unique_words, k=1))

trigram probabilities


Unnamed: 0,existait,truc,faire‚Ä¶,fauteuil,√©lectrique,d√©tection,danger,militer,langage,signes,ad...,ouvrir,porte,placar,telecomm...,faut,d√©composer,geste,construit,d...,attacher,fauteuil.1,v√©hicule,...,transforme,run,fauteuil.2,roulant,fauteu...,d√©ambulateur,performant,pliable,aider,....1,truc.1,alerter,automatiquement,fourni...,syst√®me,permettre,lecture,lit,(o...,truc.2,pouvoir,appeler,t√©l√©phon...,s'il,existait.1,truc.3,<e>,<oov>
"(jeux, adapt√©s)",0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231
"(t√©l√©phon..., s'il)",0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.037736,0.018868,0.018868,0.018868
"(..., domotique)",0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231
"(faut, d√©composer)",0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.037736,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868
"(lecture, lit)",0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.037736,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
"(ouvrir, porte)",0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.037736,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868,0.018868
"(fauteu..., parcs)",0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231
"(enlever, spasticit√©)",0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231
"(plages, ...)",0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231,0.019231


The next step would be to keep, in a new table, the words associated with a probability greater than 0.030. Some needs or themes should then emerge. This will help with the upcoming clustering step and, later, with the design of the machine learning algorithm.
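
A minimal sketch of that filtering step, in plain Python. The nested dictionary `prob_table` is a hypothetical stand-in for the table returned by `make_probability_matrix`, filled with a few illustrative values in the style of the bigram table above; `keep_above` is an assumed helper name, and the 0.030 threshold comes from the text.

```python
# Hypothetical excerpt of a bigram probability table:
# {previous n-gram: {next word: smoothed probability}}
prob_table = {
    ("faut",): {"décomposer": 0.037736, "geste": 0.018868},
    ("déambulateur",): {"performant": 0.037736, "pliable": 0.018868},
    ("domotique",): {"fermer": 0.019231, "truc": 0.019231},
}

def keep_above(table, threshold=0.030):
    """Return the (context, word, probability) triples above the threshold."""
    kept = []
    for context, next_words in table.items():
        for word, prob in next_words.items():
            if prob > threshold:
                kept.append((context, word, prob))
    # Most probable continuations first
    return sorted(kept, key=lambda row: row[2], reverse=True)

likely_pairs = keep_above(prob_table)
for context, word, prob in likely_pairs:
    print(context, "->", word, round(prob, 3))
```

With a pandas DataFrame, a boolean mask over the table would presumably play the same role; the dictionary version only keeps the example free of dependencies.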

# Lemmatization and stemming

Here is one of our lemmatization results. The function below performs a very simplified lemmatization and is not the most effective option. We recommend the **spaCy** library instead. We do not use it here, but it would likely be a better fit because it supports several languages: it handles not only English but also French or German texts.

However, the results we obtain are satisfactory. A large share of the words are reduced to their root.

In [None]:
def lem(data): 

  # stemming (note: the Porter algorithm was designed for English)
  porter_stemmer = nlp.PorterStemmer()
  roots = [porter_stemmer.stem(each) for each in data]  # use the argument, not the global d3_net
  print("result of stemming: ",roots)

  # lemmatization (WordNet is an English lexical database)
  lemma = nlp.WordNetLemmatizer()
  lem_roots = [lemma.lemmatize(each) for each in roots]
  print("result of lemmatization: ",lem_roots)

  return lem_roots 
 

In [None]:
'''d1_lem = lem(d1_net)
d2_lem = lem(d2_net)
d3_lem = lem(d3_net)'''


In [None]:
d3_lem = lem(d3_net)

result of stemming:  ['existait', 'truc', 'faire…', 'fauteuil', 'électriqu', 'détection', 'danger', 'milit', 'langag', 'sign', 'ad...', 'ouvrir', 'port', 'placar', 'telecomm...', 'faut', 'décompos', 'gest', 'construit', 'd...', 'attach', 'fauteuil', 'véhicul', '...', 'transform', 'run', 'fauteuil', 'roulant', 'fauteu...', 'déambulateur', 'perform', 'pliabl', 'aider', '...', 'truc', 'alert', 'automatiqu', 'fourni...', 'systèm', 'permettr', 'lectur', 'lit', '(o...', 'truc', 'pouvoir', 'appel', 'téléphon...', "s'il", 'existait', 'truc', 'aider', '...', 'contorsion', 'attach', 'fauteu...', 'parc', 'jeux', 'adapté', '!!!', 'plage', '...', 'domotiqu', 'fermer', 'domicile?', 'solut', 'attach', 'correct', 'mo...', 'www.changing-places.org', 'voir', 'favoris', 'hand', 'sport', 'petit.', 'lew', 'parent', 'pouvait', 'aid...', 'oui', 'enlev', 'spasticité', 'membr', 'inférieu...']
result of lemmatization:  ['existait', 'truc', 'faire…', 'fauteuil', 'électriqu', 'détection', 'danger

# Clustering 

## Finding similar responses

### First method: Nearest Neighbor

In this section, the goal is to find similar responses and group them together. We load a file containing the embeddings, that is, the numerical encoding of words. We use them to map the words of our dataset to a numerical representation, which is in turn used to vectorize those words.

These vectors are then compared and analyzed so that they can be grouped. This lets us start classifying the types of responses and clean our dataset even further.

This is our first clustering step.

We import a file that contains the numerical equivalent of each word in our dataset.

In [None]:
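# Path to the French embeddings file (only the path is stored here; the file itself is not loaded)
fr_embeddings = os.path.join('/content/fr_embeddings.p')
fr_embeddings = str(fr_embeddings)
# Convert the token list to a single string (its length is then counted in characters)
d3_lem = str(d3_lem)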

First, we associate embeddings with the words in our text, then we vectorize those words.

In [None]:
def mot_vecteur(data, fr_embeddings): 

  # map the words to their embeddings 
  mot_embedding = get_document_embedding(data, fr_embeddings)
  #print('Embedded words:', mot_embedding)

  # vectorize the words 
  mot_vect, ind2data = get_document_vecs(data, fr_embeddings)
  #print('Vectorized words:', mot_vect, ind2data)

  return mot_embedding, mot_vect, ind2data


In [None]:
mot_embedding, mot_vect, ind2data = mot_vecteur(d3_lem, fr_embeddings)
print(mot_vect)

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


In [None]:
print(f"length of dictionary {len(ind2data)}")
print(f"shape of document_vecs {mot_vect.shape}")

length of dictionary 859
shape of document_vecs (859, 300)
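
The helpers `get_document_embedding` and `get_document_vecs` are defined elsewhere in the notebook. As a rough sketch of what such a step typically involves, the example below assumes that the embeddings file is a pickled dictionary mapping each word to a vector, and that a document vector is the sum of its word vectors. Both points are assumptions, as are the names `toy_embeddings` and `document_vector`.

```python
import os
import pickle
import tempfile

# Assumption: the embeddings file is a pickled dict {word: vector}.
# A tiny 3-dimensional stand-in is written to a temporary file here.
toy_embeddings = {
    "fauteuil": [0.2, 0.1, 0.0],
    "roulant": [0.1, 0.3, 0.0],
    "lecture": [0.0, 0.0, 0.5],
}
path = os.path.join(tempfile.mkdtemp(), "toy_embeddings.p")
with open(path, "wb") as f:
    pickle.dump(toy_embeddings, f)

# Load the dictionary itself, not just its path
with open(path, "rb") as f:
    embeddings = pickle.load(f)

def document_vector(tokens, embeddings, n_dims=3):
    """Sum the vectors of the known tokens; unknown tokens contribute nothing."""
    total = [0.0] * n_dims
    for token in tokens:
        vector = embeddings.get(token)
        if vector is not None:
            total = [t + v for t, v in zip(total, vector)]
    return total

doc_vec = document_vector(["fauteuil", "roulant", "inconnu"], embeddings)
print(doc_vec)
```

In this sketch, a document whose tokens are all missing from the dictionary yields an all-zero vector, and passing a single string instead of a token list would make the loop iterate over characters. Both cases are worth checking when a vector matrix contains only zeros.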


Now that we have created these vectors, we want to group the most similar responses together.

To do so, we build hash tables and call clustering functions such as K-means and k-nearest neighbors. These functions group the most similar responses based on their vectors.

In [None]:
# Find the entry most similar to the input embedding
# (vectorized implementation)
import numpy as np
np.seterr(divide='ignore', invalid='ignore')

idx = np.argmax(cosine_similarity(mot_vect, mot_embedding))
print(d3_lem[idx])

[
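
For reference, the score maximized in the cell above is presumably the standard cosine similarity between two vectors $u$ and $v$. The notebook's `cosine_similarity` helper is not shown in this section, so the formula below is the textbook definition rather than a description of that helper:

$$
\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
$$

The ratio is undefined when either vector is all zeros, which is presumably why the cell silences NumPy's division warnings with `np.seterr`.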


In [None]:
N_VECS = len(d3_lem)       # This many vectors.
N_DIMS = len(ind2data[1])     # Vector dimensionality.
print(f"Number of vectors is {N_VECS} and each has {N_DIMS} dimensions.")

Number of vectors is 859 and each has 300 dimensions.


In [None]:
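# The number of planes. log2(625) ≈ 9.3 is rounded up to 10, which gives 2**10 = 1024 buckets;
# the ~16 vectors/bucket target corresponds to about 10,000 vectors (625 * 16).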
N_PLANES = 10
# Number of times to repeat the hashing to improve the search.
N_UNIVERSES = 25
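
The rule of thumb behind the number of planes can be made explicit. The sketch below assumes that each bucket should hold about 16 vectors on average; `planes_for` is a hypothetical helper, not a function from the notebook.

```python
import math

def planes_for(n_vectors, vectors_per_bucket=16):
    """Number of hyperplanes so that the 2**n_planes buckets hold about
    `vectors_per_bucket` vectors each on average."""
    n_buckets = max(1.0, n_vectors / vectors_per_bucket)
    return max(1, math.ceil(math.log2(n_buckets)))

print(planes_for(859))      # the 859 vectors reported above
print(planes_for(10_000))   # 10,000 / 16 = 625 buckets
```

Under that assumption, 859 vectors call for about 6 planes, while 10 planes matches a corpus of roughly 10,000 vectors.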

In [None]:
np.random.seed(0)
planes_l = [np.random.normal(size=(N_DIMS, N_PLANES))
            for _ in range(N_UNIVERSES)]

We therefore apply the following functions:

* the hash table 
* the approximate k-NN search 
* the nearest vector 
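
As an illustration of the first item, here is a dependency-free sketch of hashing with random hyperplanes. The names `hash_value` and `build_hash_table` are hypothetical stand-ins for the notebook's `hash_value_of_vector` and `make_hash_table`, whose code is not shown here. The bit convention (a non-negative dot product sets the bit) and the storage of planes as a list of normal vectors are assumptions of this sketch.

```python
import random

def hash_value(vector, planes):
    """Encode on which side of each hyperplane the vector falls as an integer."""
    value = 0
    for i, normal in enumerate(planes):
        dot = sum(v * n for v, n in zip(vector, normal))
        if dot >= 0:          # side of plane i sets bit i
            value += 2 ** i
    return value

def build_hash_table(vectors, planes):
    """Group vector indices by hash value: {hash value: [indices]}."""
    table = {}
    for idx, vector in enumerate(vectors):
        table.setdefault(hash_value(vector, planes), []).append(idx)
    return table

rng = random.Random(0)
n_dims, n_planes = 4, 3
# One normal vector per plane, drawn from a standard normal distribution
planes = [[rng.gauss(0, 1) for _ in range(n_dims)] for _ in range(n_planes)]
vectors = [[rng.uniform(-1, 1) for _ in range(n_dims)] for _ in range(20)]

table = build_hash_table(vectors, planes)
print({key: len(ids) for key, ids in sorted(table.items())})
```

Vectors that point in similar directions tend to fall on the same side of most planes and therefore share a bucket. Repeating the procedure with several independent sets of planes reduces the chance of missing a close neighbor.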

In [None]:
np.random.seed(0)
idx = 0
planes = planes_l[idx]  # get one 'universe' of planes to test the function
vec = np.random.rand(1, 300)

In [None]:
def table_hash(vec,document_vecs, planes) : 

  print(f" The hash value for this vector,",
      f"and the set of planes at index {idx},",
      f"is {hash_value_of_vector(vec, planes)}")
  
  tmp_hash_table, tmp_id_table = make_hash_table(document_vecs, planes)
  print(f"The hash table at key 0 has {len(tmp_hash_table[0])} document vectors")
  print(f"The id table at key 0 has {len(tmp_id_table[0])}")
  print(f"The first 5 document indices stored at key 0 of are {tmp_id_table[0][0:5]}")

  return


In [None]:
table_hash(vec, mot_vect, planes)  # returns None; not assigned, so the built-in hash() is not shadowed

 The hash value for this vector, and the set of planes at index 0, is 768
The hash table at key 0 has 859 document vectors
The id table at key 0 has 859
The first 5 document indices stored at key 0 of are [0, 1, 2, 3, 4]


In [None]:
# Creating the hashtables
hash_tables = []
id_tables = []
for universe_id in range(N_UNIVERSES):  # there are 25 hashes
    print('working on hash universe #:', universe_id)
    planes = planes_l[universe_id]
    hash_table, id_table = make_hash_table(mot_vect, planes)  # hash the document vectors, not the random test vector
    hash_tables.append(hash_table)
    id_tables.append(id_table)

working on hash universe #: 0
working on hash universe #: 1
working on hash universe #: 2
working on hash universe #: 3
working on hash universe #: 4
working on hash universe #: 5
working on hash universe #: 6
working on hash universe #: 7
working on hash universe #: 8
working on hash universe #: 9
working on hash universe #: 10
working on hash universe #: 11
working on hash universe #: 12
working on hash universe #: 13
working on hash universe #: 14
working on hash universe #: 15
working on hash universe #: 16
working on hash universe #: 17
working on hash universe #: 18
working on hash universe #: 19
working on hash universe #: 20
working on hash universe #: 21
working on hash universe #: 22
working on hash universe #: 23
working on hash universe #: 24


In [None]:
# Document to search for, and its vector
doc_id = 0
doc_to_search = d3_lem[doc_id]
#vec_to_search = mot_vect[doc_id]
vec_to_search = ind2data[doc_id]

We now look for the nearest neighbors of each vector. Each vector represents one respondent's answer. We want to find similar answers so that we can group them and handle each group separately. This gives us clusters that we can analyze one by one to extract needs.

In [None]:
# Fast approximate nearest-neighbor search using the hash tables built above
def approximate_knn(doc_id, v, planes_l, k=1, num_universes_to_use=N_UNIVERSES):
    """Search for k-NN using hashes."""
    assert num_universes_to_use <= N_UNIVERSES

    # Vectors that will be checked as possible nearest neighbor
    vecs_to_consider_l = list()

    # list of document IDs
    ids_to_consider_l = list()

    # create a set for ids to consider, for faster checking if a document ID already exists in the set
    ids_to_consider_set = set()

    # loop through the universes of planes
    for universe_id in range(num_universes_to_use):

        # get the set of planes from the planes_l list, for this particular universe_id
        planes = planes_l[universe_id]

        # get the hash value of the vector for this set of planes
        hash_value = hash_value_of_vector(v, planes)

        # get the hash table for this particular universe_id
        hash_table = hash_tables[universe_id]

        # get the list of document vectors for this hash table, where the key is the hash_value
        document_vectors_l = hash_table[hash_value]

        # get the id_table for this particular universe_id
        id_table = id_tables[universe_id]

        # get the subset of documents to consider as nearest neighbors from this id_table dictionary
        new_ids_to_consider = id_table[hash_value]

        # loop through the subset of document vectors to consider
        for i, new_id in enumerate(new_ids_to_consider):

            # skip the document we are searching for; removing its id from the
            # table would shift the ids out of step with document_vectors_l
            if new_id == doc_id:
                continue

            # if the document ID is not yet in the set ids_to_consider...
            if new_id not in ids_to_consider_set:
                # access document_vectors_l list at index i to get the embedding
                # then append it to the list of vectors to consider as possible nearest neighbors
                document_vector_at_i = document_vectors_l[i]
                vecs_to_consider_l.append(document_vector_at_i)

                # append the new_id (the index for the document) to the list of ids to consider
                ids_to_consider_l.append(new_id)

                # also add the new_id to the set of ids to consider
                # (use this to check if new_id is not already in the IDs to consider)
                ids_to_consider_set.add(new_id)

    # Now run k-NN on the smaller set of vecs-to-consider.
    print("Fast considering %d vecs" % len(vecs_to_consider_l))

    # convert the vecs to consider set to a list, then to a numpy array
    vecs_to_consider_arr = np.array(vecs_to_consider_l)

    # call nearest neighbors on the reduced list of candidate vectors
    nearest_neighbor_idx_l = nearest_neighbor(v, vecs_to_consider_arr, k=k)

    # Use the nearest neighbor index list as indices into the ids to consider
    # create a list of nearest neighbors by the document ids
    nearest_neighbor_ids = [ids_to_consider_l[idx]
                            for idx in nearest_neighbor_idx_l]

    return nearest_neighbor_ids
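
`approximate_knn` relies on a `nearest_neighbor` helper that is defined elsewhere in the notebook. A minimal sketch of such an exact search, assuming it ranks candidates by cosine similarity, could look like the following. The names `cosine_sim` and `nearest_neighbor_sketch` are hypothetical.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def nearest_neighbor_sketch(v, candidates, k=1):
    """Return the indices of the k candidates most similar to v, best first."""
    ranked = sorted(range(len(candidates)),
                    key=lambda i: cosine_sim(v, candidates[i]),
                    reverse=True)
    return ranked[:k]

candidates = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
print(nearest_neighbor_sketch([1.0, 0.05], candidates, k=2))
```

The hashing step only narrows the candidate list; the final ranking is still an exact comparison like this one. With an empty candidate list, the sketch simply returns an empty list.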

In [None]:
# Search for the nearest neighbors
nearest_neighbor_ids = approximate_knn(doc_id, vec_to_search, planes_l, k=3, num_universes_to_use=5) 

print(nearest_neighbor_ids)


print(f"Nearest neighbors for document {doc_id}")
print(f"Document contents: {doc_to_search}")
print("")

for neighbor_id in nearest_neighbor_ids:
  
  print(f"Nearest neighbor at document id {neighbor_id}")
  print(f"document contents: {d3_lem[neighbor_id]}")

Fast considering 0 vecs
[]
Nearest neighbors for document 0
Document contents: [



### Second method: spaCy 

**Tokenization**

In [None]:
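import spacy
# Assumption: the French pipeline is installed; it gets its own name so it
# does not clash with the `nlp` alias used for NLTK above
nlp_fr = spacy.load("fr_core_news_sm")

def return_token(sentence):
    # Tokenize the sentence
    doc = nlp_fr(sentence)
    # Return the text of each token
    return [X.text for X in doc]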

In [None]:
return_token(cat3)

TypeError: ignored

**Stopwords**

In [None]:
from nltk.corpus import stopwords
stopWords = set(stopwords.words('french'))

clean_words = []
for token in return_token(cat3):
    if token not in stopWords:
        clean_words.append(token)

clean_words

NameError: ignored

## Finding centroids 

This part explores and experiments with the capabilities of K-means. 
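
The K-means loop explored in this part can be sketched in a few lines of plain Python. This is an illustration on toy 2-D points, not the notebook's implementation: the function names mirror the pseudocode in snake_case, and the initialization (sampling k points) and the stopping rule (unchanged centroids or a maximum number of iterations) are assumptions.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def get_labels(points, centroids):
    """Label each point with the index of its closest centroid."""
    return [min(range(len(centroids)),
                key=lambda c: squared_distance(p, centroids[c]))
            for p in points]

def get_centroids(points, labels, k, rng):
    """Mean of the points carrying each label; empty clusters are re-seeded."""
    centroids = []
    for c in range(k):
        members = [p for p, label in zip(points, labels) if label == c]
        if members:
            centroids.append([sum(dim) / len(members) for dim in zip(*members)])
        else:
            centroids.append(list(rng.choice(points)))
    return centroids

def kmeans(points, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(max_iterations):
        labels = get_labels(points, centroids)
        new_centroids = get_centroids(points, labels, k, rng)
        if new_centroids == centroids:   # centroids stopped changing
            break
        centroids = new_centroids
    return centroids, get_labels(points, centroids)

points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```

On the document vectors, `points` would be the rows of the vector matrix. The choice of `k` remains open and would need to be explored.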

In [None]:
'''
# Function: K Means
# -------------
# K-Means is an algorithm that takes in a dataset and a constant
# k and returns k centroids (which define clusters of data in the
# dataset which are similar to one another).
def kmeans(dataSet, k):
	
    # Initialize centroids randomly
    numFeatures = dataSet.getNumFeatures()
    centroids = getRandomCentroids(numFeatures, k)
    
    # Initialize book keeping vars.
    iterations = 0
    oldCentroids = None
    
    # Run the main k-means algorithm
    while not shouldStop(oldCentroids, centroids, iterations):
        # Save old centroids for convergence test. Book keeping.
        oldCentroids = centroids
        iterations += 1
        
        # Assign labels to each datapoint based on centroids
        labels = getLabels(dataSet, centroids)
        
        # Assign centroids based on datapoint labels
        centroids = getCentroids(dataSet, labels, k)
        
    # We can get the labels too by calling getLabels(dataSet, centroids)
    return centroids
    
    '''


In [None]:
"""# Function: Should Stop
# -------------
# Returns True or False if k-means is done. K-means terminates either
# because it has run a maximum number of iterations OR the centroids
# stop changing.
def shouldStop(oldCentroids, centroids, iterations):
    if iterations > MAX_ITERATIONS: return True
    return oldCentroids == centroids

"""


In [None]:
"""# Function: Get Labels
# -------------
# Returns a label for each piece of data in the dataset. 
def getLabels(dataSet, centroids):
  
    # For each element in the dataset, choose the closest centroid. 
    # Make that centroid the element's label.
"""

In [None]:
"""# Function: Get Centroids
# -------------
# Returns the k updated centroids, each of dimension n.
def getCentroids(dataSet, labels, k):
    # Each centroid is the mean of the points that
    # have that centroid's label. Important: If a centroid is empty (no points have
    # that centroid's label) you should randomly re-initialize it.
"""

# Splitting the data into a train set and a test set 

The final goal is to build an algorithm able to classify every response from all form categories and extract a need from it. We therefore want to create a machine learning or deep learning algorithm capable of understanding the needs expressed by the respondents. 

We are considering several ways to design such an algorithm. 

One interesting option would be to manually identify a few recurring needs and train our algorithm to find them in our text. 

Another option would be to let the algorithm find these needs by itself, building on the clustering experiments carried out earlier. Using the word vectors, it could group the closest or most similar vectors. 



In [None]:
'''#tokenized_data = tokenize_sentences(cat3_roots)
#print(tokenized_data)
random.seed(87)
random.shuffle(d3_lem)

train_size = int(len(d3_lem) * 0.8)
train_data = d3_lem[0:train_size]
test_data = d3_lem[train_size:]'''


In [None]:
'''print("{} data are split into {} train and {} test set".format(
    len(d3_lem), len(train_data), len(test_data)))

print("First training sample:")
print(train_data[0])
print(train_data)
      
print("First test sample")
print(test_data[0])'''

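
The 80/20 split from the commented-out cells can be sketched as a small function that shuffles a copy of the data rather than shuffling in place. The name `split_train_test` and the `responses` list are hypothetical; the seed 87 and the 0.8 fraction are taken from the cells above.

```python
import random

def split_train_test(items, train_fraction=0.8, seed=87):
    """Shuffle a copy of the items and split it into train and test lists."""
    shuffled = list(items)            # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical stand-in for the list of cleaned responses
responses = [f"response {i}" for i in range(10)]
train_data, test_data = split_train_test(responses)
print(len(train_data), len(test_data))
```

The input is expected to be a list of responses: a single string would be split character by character, so any `str()` conversion of the data would need to be undone first.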

# Conclusion