We will build a preprocessing pipeline that shapes the documents for model training.



In [3]:
from tensorflow.keras.layers import TextVectorization

import pandas as pd
import os

from config import RAW_DATA_PATH




In [4]:
df = pd.read_csv(os.path.join(RAW_DATA_PATH, 'CADA-2022-05-31.csv'), sep=',')

In [5]:
documents = df['Avis'].values.tolist()

# Basic tokenization

There are several ways to split documents into tokens.

Let's use a spaCy `Tokenizer` built on the French vocabulary. Constructed this way, without prefix, suffix, or infix rules, it splits on whitespace only.

In [46]:
from spacy.tokenizer import Tokenizer
from spacy.lang.fr import French

nlp = French()

tokenizer = Tokenizer(nlp.vocab)

In [75]:
def document_preprocessing(document, tokenizer):
    """Lowercase a document, tokenize it, and drop stop words and punctuation tokens."""

    # Split the lowercased document into tokens
    tokens = tokenizer(str(document).lower())

    # Keep tokens that are neither stop words nor punctuation, and strip periods
    tokens = [
        tk.text.replace('.', '')
        for tk in tokens
        if not tk.is_stop and not tk.is_punct
    ]

    return tokens
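
One caveat: for a tokenizer that splits on whitespace only, punctuation glued to a word (such as `marine,` in the sample output of this section) is not a separate token, so the `is_punct` filter does not remove it. The following is a minimal sketch, independent of spaCy, of stripping leading and trailing punctuation from whitespace-separated tokens. The helper name `strip_attached_punctuation` is hypothetical and not part of the original pipeline.

```python
import string


def strip_attached_punctuation(text):
    """Split on whitespace and strip punctuation at token boundaries."""
    cleaned = []
    for token in text.split():
        # Strip punctuation only at the edges, so inner apostrophes and
        # hyphens (e.g. "qu'elle", "élève-officier") are preserved.
        token = token.strip(string.punctuation)
        if token:
            cleaned.append(token)
    return cleaned


sample = "interprète chiffre marine, motif qu'elle porterait atteinte secret défense nationale,"
tokens = strip_attached_punctuation(sample)
print(tokens)
```

An alternative within spaCy would be to use the pipeline's own tokenizer (`nlp.tokenizer`), which applies the French punctuation rules; that would change the token stream shown in this section.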

In [78]:
document_tokens = []

for doc in documents:
    token_list = document_preprocessing(document=doc, tokenizer=tokenizer)
    document_tokens.append(' '.join(token_list))


In [79]:
document_tokens[0]

"commission d'accès documents administratifs examiné séance 3 mars 1984 demande l'avez saisie lettre 21 décembre 1983 \n\n commission émis avis défavorable communication dossier d'enquête relatif refus admission qualité d'élève-officier réserve interprète chiffre marine, motif qu'elle porterait atteinte secret défense nationale, exception prévue l'article 6 loi 17 juillet 1978"

# BERT WordPiece tokenizer

# Word vectors

We need to map documents into a feature space.

## Word frequencies

We represent each document as a vector of size `N`, where `N` is the vocabulary size.

Each token of the document contributes to one component of this vector.

The vector values follow the TF-IDF approach:
* The term frequency (TF) is the number of times the token occurs in the document.
* The inverse document frequency (IDF) decreases with the number of documents in the collection that contain the token, so tokens that appear everywhere receive a low weight.
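
As a rough illustration of this weighting, here is a minimal pure-Python sketch. It assumes the smoothed variant that scikit-learn documents for `TfidfVectorizer` with `smooth_idf=True`, namely `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization of each document vector. The function name `tfidf_vectors` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return (vocabulary, vectors) with smoothed IDF and L2-normalized rows."""
    n_docs = len(tokenized_docs)
    vocabulary = sorted({tok for doc in tokenized_docs for tok in doc})

    # Document frequency: number of documents that contain each token
    df = Counter(tok for doc in tokenized_docs for tok in set(doc))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocabulary}

    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)  # term frequency: raw count in the document
        raw = [counts[tok] * idf[tok] for tok in vocabulary]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        vectors.append([v / norm for v in raw])
    return vocabulary, vectors


toy_docs = [
    ["commission", "avis", "favorable"],
    ["commission", "avis", "défavorable"],
    ["commission", "météo", "france"],
]
vocabulary, vectors = tfidf_vectors(toy_docs)
```

In this toy corpus `commission` appears in every document, so it gets the lowest non-zero weight in each vector, while rarer tokens such as `favorable` weigh more.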

In [89]:
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.metrics.pairwise import cosine_similarity

In [212]:
vectorizer = TfidfVectorizer(
    analyzer='word',
    max_df=10000,  # ignore tokens that appear in more than 10000 documents of the collection
    use_idf=True,
    smooth_idf=True,
    max_features=10000,   # capping on vocabulary size
)
vectorizer = vectorizer.fit(document_tokens)

X = vectorizer.transform(document_tokens)

In [100]:
print('Matrix size:', X.shape[0], 'documents,', X.shape[1], 'vocabulary size')

Matrix size: 48746 documents, 10000 vocabulary size


In [104]:
X

<48746x10000 sparse matrix of type '<class 'numpy.float64'>'
	with 3956017 stored elements in Compressed Sparse Row format>

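Now we can build the pairwise similarity matrix between documents. Note that `cosine_similarity` returns similarities (higher means closer), not distances, even though the variable is named `distance_matrix`.

Both computation time and memory grow quadratically with the number of documents, so let's take the first 15k documents for the moment.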

In [129]:
DATASET_TRUNCATION = 15000

In [106]:
distance_matrix = cosine_similarity(X[:DATASET_TRUNCATION,:], dense_output=False)

In [107]:
distance_matrix

<15000x15000 sparse matrix of type '<class 'numpy.float64'>'
	with 201824095 stored elements in Compressed Sparse Row format>
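
Because `cosine_similarity` yields similarities rather than distances, a conversion is needed wherever an actual distance is expected. The sketch below, in plain Python with hypothetical helper names, computes the cosine similarity of two vectors and the corresponding cosine distance `1 - similarity`.

```python
import math


def cosine_similarity_pair(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def cosine_distance_pair(u, v):
    """Cosine distance, defined as one minus the cosine similarity."""
    return 1.0 - cosine_similarity_pair(u, v)


same_direction = cosine_similarity_pair([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])
orthogonal = cosine_similarity_pair([1.0, 0.0], [0.0, 3.0])
```

TF-IDF vectors have non-negative components, so their similarity lies in [0, 1] and the distance does too.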

## Pre-trained embeddings

# Clustering
Let's find documents that look alike.

## Reduce dimension

In [116]:
from sklearn.decomposition import TruncatedSVD, MiniBatchSparsePCA

pca = MiniBatchSparsePCA(n_components=100, batch_size=50)

pca = pca.fit(distance_matrix.toarray())

In [123]:
reduced_distance_matrix = pca.transform(distance_matrix.toarray())

In [124]:
reduced_distance_matrix.shape

(15000, 100)
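
Calling `.toarray()` on the similarity matrix materializes every entry. A quick back-of-the-envelope check, using only the shape and the stored-element count printed earlier, suggests why this step is expensive. The 8 bytes per value assume `float64`, as reported in the matrix summary.

```python
n_docs = 15000
stored_elements = 201824095  # from the sparse matrix summary
bytes_per_value = 8          # float64

total_entries = n_docs * n_docs
dense_bytes = total_entries * bytes_per_value
density = stored_elements / total_entries

print(f"dense size: {dense_bytes / 1e9:.1f} GB")
print(f"share of non-zero entries: {density:.1%}")
```

The dense array takes roughly 1.8 GB, and since close to 90% of the entries are non-zero, the sparse format saves little here. `TruncatedSVD`, already imported in this section, accepts sparse input and could be applied to `X` directly, which would avoid the dense conversion; this is a suggestion, not what the notebook does.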

## Clustering

In [125]:
from sklearn.cluster import DBSCAN

In [151]:
clustering = DBSCAN(min_samples=10)

clustering = clustering.fit(reduced_distance_matrix)

labels = clustering.labels_

In [152]:
set(labels)

{-1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16}

In [139]:
df['cluster_labels'] = list(labels) + [-1] * (len(df) - DATASET_TRUNCATION)

In [140]:
df['cluster_labels'].value_counts()

-1     48168
 8       191
 16       77
 2        48
 4        44
 15       34
 10       32
 9        21
 5        18
 12       16
 13       15
 11       14
 0        13
 3        12
 6        12
 7        11
 1        10
 14       10
Name: cluster_labels, dtype: int64
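
In the counts above, `-1` mixes two different cases: documents that DBSCAN marked as noise, and documents beyond `DATASET_TRUNCATION` that were never clustered. A possible refinement, sketched here with a hypothetical helper `pad_labels` and a hypothetical sentinel value `-2`, keeps the two apart.

```python
from collections import Counter

NOISE = -1          # label DBSCAN assigns to noise points
NOT_CLUSTERED = -2  # hypothetical sentinel for documents left out of clustering


def pad_labels(labels, total, fill=NOT_CLUSTERED):
    """Extend the label list to `total` entries using a distinct fill value."""
    labels = list(labels)
    if len(labels) > total:
        raise ValueError("more labels than documents")
    return labels + [fill] * (total - len(labels))


toy_labels = [0, -1, 1, 1, -1]
padded = pad_labels(toy_labels, total=8)
counts = Counter(padded)
print(counts)
```

Applied to the figures above, 48746 - 15000 = 33746 of the 48168 documents labelled `-1` were simply not processed, which leaves 14422 actual noise points among the 15000 clustered documents.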

Let's analyze the documents within a cluster through their common tokens.

Here we care about the most discriminative tokens, i.e. the ones that are especially frequent in these documents relative to the others.

To do that, we look at the tokens with the highest TF-IDF scores.
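
The cells that follow rank tokens by raw count with `Counter`, which is not the same as ranking by TF-IDF score. A minimal sketch of the TF-IDF ranking is given below; `top_weighted_tokens` is a hypothetical helper that takes one row of weights and the matching list of feature names, and the toy values are made up for illustration.

```python
def top_weighted_tokens(weights, feature_names, k=10):
    """Return the k (token, weight) pairs with the highest non-zero weight."""
    if len(weights) != len(feature_names):
        raise ValueError("weights and feature_names must have the same length")
    scored = [(name, w) for name, w in zip(feature_names, weights) if w > 0]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy values, not taken from the dataset
feature_names = ["commission", "météo", "sécheresse", "france"]
weights = [0.10, 0.62, 0.71, 0.0]
top = top_weighted_tokens(weights, feature_names, k=2)
print(top)
```

With the notebook's objects, the inputs would presumably be `X[i].toarray().ravel()` for document `i` and `vectorizer.get_feature_names_out()` (available in recent scikit-learn releases).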

In [157]:
from collections import Counter

In [214]:
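# `doc` is the token list built in cell In [193]; each token is vectorized as its own document, hence one row per token
vectorizer.transform(doc).toarray().shape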

(400, 10000)

In [193]:
# Token counts of the first document assigned to cluster 1
cluster_docs = [d for d, label in zip(document_tokens[:DATASET_TRUNCATION], labels) if label == 1]

doc = cluster_docs[0].split(" ")

Counter(doc).most_common(30)

[('commission', 13),
 ('\r\n\r\n', 8),
 ('\r\n', 7),
 ('dispositions', 7),
 ('documents', 6),
 ('demande', 6),
 ('titre', 6),
 ('l', 6),
 ('informations', 6),
 ('météo', 5),
 ('france', 5),
 ('code', 5),
 ("l'environnement", 5),
 ('données', 4),
 ("l'ensemble", 4),
 ('124-1', 4),
 ('droit', 4),
 ('relatives', 4),
 ("l'environnement,", 4),
 ('xxx', 3),
 ("d'accès", 3),
 ('courrier', 3),
 ('président-directeur', 3),
 ('général', 3),
 ('rapport', 3),
 ('interministérielle', 3),
 ('sécheresse', 3),
 ('2009', 3),
 ('exploitées', 3),
 ('demandes', 3)]

In [192]:
# Token counts of the second document assigned to cluster 1
cluster_docs = [d for d, label in zip(document_tokens[:DATASET_TRUNCATION], labels) if label == 1]

doc = cluster_docs[1].split(" ")

Counter(doc).most_common(30)

[('commission', 13),
 ('\r\n\r\n', 8),
 ('dispositions', 7),
 ('documents', 6),
 ('demande', 6),
 ('\xa0', 6),
 ('\r\n', 6),
 ('titre', 6),
 ('l', 6),
 ('informations', 6),
 ('météo', 5),
 ('france', 5),
 ('code', 5),
 ("l'environnement", 5),
 ('données', 4),
 ("l'ensemble", 4),
 ('124-1', 4),
 ('droit', 4),
 ('relatives', 4),
 ("l'environnement,", 4),
 ('xxx', 3),
 ("d'accès", 3),
 ('courrier', 3),
 ('président-directeur', 3),
 ('général', 3),
 ('rapport', 3),
 ('interministérielle', 3),
 ('sécheresse', 3),
 ('2009', 3),
 ('exploitées', 3)]


# Topic modelling