In [1]:
import pandas as pd
import numpy as np

In [2]:
articles = pd.read_pickle('./Partie_1/backend/app/core/data/full_articles.pkl')
articles.head()

Unnamed: 0,index,Name,Id,Cid,Text
0,0,L1,LEGIARTI000018764571,LEGIARTI000017961623,Tout projet de réforme envisagé par le Gouvern...
1,1,L2,LEGIARTI000042654546,LEGIARTI000017961625,Le Gouvernement soumet les projets de textes l...
2,2,L3,LEGIARTI000042654542,LEGIARTI000017961627,"Chaque année, les orientations de la politique..."
3,3,L1111-1,LEGIARTI000006900781,LEGIARTI000006900781,Les dispositions du présent livre sont applica...
4,4,L1111-2,LEGIARTI000019353569,LEGIARTI000006900783,Pour la mise en oeuvre des dispositions du pré...


In [3]:
articles = articles.drop(columns=['index'])
print(len(articles))
articles.head()

20835


Unnamed: 0,Name,Id,Cid,Text
0,L1,LEGIARTI000018764571,LEGIARTI000017961623,Tout projet de réforme envisagé par le Gouvern...
1,L2,LEGIARTI000042654546,LEGIARTI000017961625,Le Gouvernement soumet les projets de textes l...
2,L3,LEGIARTI000042654542,LEGIARTI000017961627,"Chaque année, les orientations de la politique..."
3,L1111-1,LEGIARTI000006900781,LEGIARTI000006900781,Les dispositions du présent livre sont applica...
4,L1111-2,LEGIARTI000019353569,LEGIARTI000006900783,Pour la mise en oeuvre des dispositions du pré...


In [4]:
articles = articles.dropna().drop_duplicates()
print(len(articles))


20835


In [90]:
import re
import nltk
import string
import gensim.downloader as api

from nltk import word_tokenize
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import silhouette_samples, silhouette_score

nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\danielalexander.muro\AppData\Roaming\nltk_dat
[nltk_data]     a...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [72]:
wv = api.load('word2vec-google-news-300')

In [24]:
wv.vector_size

300

### Clean and Tokenize Data

In [74]:
def clean_text(text, tokenizer):
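    """Pre-process text and generate tokens.

    Args:
        text: Text to tokenize.
        tokenizer: Callable that splits the cleaned text into tokens;
            it must accept a `language` keyword (e.g. nltk's word_tokenize).

    Returns:
        List of tokens.
    """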
    text = str(text).lower()  # Lowercase words
    text = re.sub(r"\s+", " ", text)  # Remove multiple spaces in content
    text = re.sub(r"(?<=\w)-(?=\w)", " ", text)  # Replace dash between words
    text = re.sub(
        f"[{re.escape(string.punctuation)}]", "", text
    )  # Remove punctuation

    tokens = tokenizer(text, language='french')  # Get tokens from text
    return tokens
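
To make the effect of each cleaning step visible, here is a minimal sketch that applies the same regular expressions to one sentence. The name `clean_text_demo`, the sample sentence, and the use of `str.split` as a stand-in tokenizer are assumptions made for illustration; the notebook itself uses nltk's `word_tokenize`.

```python
import re
import string


def clean_text_demo(text, tokenizer=str.split):
    """Apply the same cleaning steps as clean_text, with a stand-in tokenizer."""
    text = str(text).lower()  # Lowercase words
    text = re.sub(r"\s+", " ", text)  # Collapse runs of whitespace
    text = re.sub(r"(?<=\w)-(?=\w)", " ", text)  # Split hyphenated words
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)  # Drop ASCII punctuation
    return tokenizer(text)


# Hypothetical sample sentence, not taken from the dataset
sample = "L'inspecteur  du travail\nvérifie le  contre-ordre."
tokens = clean_text_demo(sample)
print(tokens)
```

Note that removing the straight apostrophe fuses the elided article with the following word ("l'inspecteur" becomes "linspecteur"), which may matter when the tokens are later looked up in an embedding vocabulary.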

In [144]:
df = articles.copy()
df['Text'] = df['Text'].astype(str)
df['Tokens'] = df['Text'].map(lambda x: clean_text(x, word_tokenize))


In [127]:
len(df['Tokens'])

20835

In [145]:

# _, idx = np.unique(df['Tokens'], return_index=True)
# df = df.iloc[idx, :]

# df = df.loc[df.Tokens.map(lambda x: len(x) > 0), ['Id', 'Text', 'Tokens']]

df = df[['Text', 'Tokens']]


In [150]:
all_texts = df['Text'].values
tokenized_texts = df['Tokens'].values

print(f"Original dataframe: {articles.shape}")
print(f"Pre-processed dataframe: {df.shape}")

Original dataframe: (20835, 4)
Pre-processed dataframe: (20835, 3)


### Create Document Vectors from Word Embedding

This code collects the word vectors of each document and averages them to produce one vector per document.

1. Define the vectorize function, which takes a list of tokenized documents and a gensim model as input and returns one feature vector per document.
2. Apply the function to the documents' tokens in tokenized_texts, using the pre-trained Word2Vec model loaded earlier (wv).
3. Print the number of documents and the size of the generated vectors.
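
The averaging step can be sketched on a toy example before applying it to the real model. The helper `average_vectors` and the three-dimensional `toy_embedding` below are hypothetical stand-ins for the notebook's `vectorize` function and the 300-dimensional gensim vectors; plain Python lists are used so the sketch runs without any dependency.

```python
def average_vectors(tokens, embedding, size):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [embedding[t] for t in tokens if t in embedding]
    if not known:
        return [0.0] * size
    # zip(*known) groups the values of each dimension together
    return [sum(component) / len(known) for component in zip(*known)]


# Hypothetical 3-dimensional embedding with two known words
toy_embedding = {
    "travail": [1.0, 0.0, 2.0],
    "contrat": [3.0, 2.0, 0.0],
}

doc_vector = average_vectors(["le", "contrat", "travail"], toy_embedding, size=3)
empty_vector = average_vectors(["xyz"], toy_embedding, size=3)
print(doc_vector)    # [2.0, 1.0, 1.0]
print(empty_vector)  # [0.0, 0.0, 0.0]
```

Tokens that are missing from the embedding are silently skipped, so a document whose tokens are mostly out of vocabulary is represented by very few words, or by the zero vector.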


In [25]:
def vectorize(list_of_docs, model):
    """Generate vectors for list of documents using a Word Embedding

    Args:
        list_of_docs: List of documents
        model: Gensim's Word Embedding

    Returns:
        List of document vectors
    """
    features = []

    for tokens in list_of_docs:
        # Keep only the tokens that exist in the embedding's vocabulary
        vectors = [model[token] for token in tokens if token in model]
        if vectors:
            # Average the word vectors to get one vector per document
            features.append(np.asarray(vectors).mean(axis=0))
        else:
            # No known token: fall back to a zero vector
            features.append(np.zeros(model.vector_size))
    return features

In [147]:
vectorized_texts = vectorize(tokenized_texts, model=wv)
len(vectorized_texts), len(vectorized_texts[0])

(20835, 300)

Next, cluster the documents using Mini-Batch K-means. This K-means variant updates the centroids from a small random sample of fixed size at each iteration instead of the full dataset, which reduces the computational cost on massive datasets. It optimizes the same objective function as the original algorithm.

The number of clusters has a strong effect on how far the partition obtained by Mini-Batch K-means departs from the one obtained by K-means. The loss of quality ranges from about 2% for a small number of clusters (fewer than 10) to more than 8% for a larger number of clusters (more than 20). The same pattern appears when the partitions of both algorithms are compared to the true partition: as the number of clusters increases, their respective differences to the true partition also increase. In practice, applying Mini-Batch K-means to a very large dataset with a large number of clusters is unlikely to produce partitions equivalent to those of K-means.
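
The idea of updating centroids from small random samples can be sketched in plain Python. This is a simplified illustration under stated assumptions, not scikit-learn's implementation: the function name `minibatch_kmeans`, the synthetic two-blob data, and the choice of initial centers are all hypothetical. Each center keeps a count of the samples it has received and moves toward each new sample with a step of 1 / count.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def minibatch_kmeans(points, init_centers, batch_size, n_iter, seed=0):
    """Simplified mini-batch k-means with a per-center learning rate."""
    rng = random.Random(seed)
    centers = [list(c) for c in init_centers]
    counts = [0] * len(centers)
    for _ in range(n_iter):
        # Draw a random subsample instead of using the full dataset
        batch = rng.sample(points, batch_size)
        # Assign every sampled point to its nearest center
        assignments = [
            min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
            for p in batch
        ]
        # Move each center toward its assigned points with step 1 / count
        for p, j in zip(batch, assignments):
            counts[j] += 1
            eta = 1.0 / counts[j]
            centers[j] = [(1 - eta) * c + eta * x for c, x in zip(centers[j], p)]
    return centers


# Synthetic data (assumption): two well-separated 2-D blobs
rng = random.Random(42)
blob_a = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(200)]
blob_b = [(rng.gauss(10, 0.5), rng.gauss(10, 0.5)) for _ in range(200)]
points = blob_a + blob_b

centers = minibatch_kmeans(
    points, init_centers=[points[0], points[-1]], batch_size=20, n_iter=50
)
print(centers)
```

Each iteration touches only 20 of the 400 points, yet on data this well separated the centers settle close to the two blob means.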

In [29]:
def mbkmeans_clusters(
    X,
    k,
    mb,
    print_silhouette_values,
):
    """Generate clusters and print Silhouette metrics using MBKmeans

    Args:
        X: Matrix of features.
        k: Number of clusters.
        mb: Size of mini-batches.
        print_silhouette_values: Print silhouette values per cluster.

    Returns:
        Trained clustering model and labels based on X.
    """
    km = MiniBatchKMeans(n_clusters=k, batch_size=mb).fit(X)
    print(f"For n_clusters = {k}")
    print(f"Silhouette coefficient: {silhouette_score(X, km.labels_):0.2f}")
    print(f"Inertia:{km.inertia_}")

    if print_silhouette_values:
        sample_silhouette_values = silhouette_samples(X, km.labels_)
        print("Silhouette values:")
        silhouette_values = []
        for i in range(k):
            cluster_silhouette_values = sample_silhouette_values[km.labels_ == i]
            silhouette_values.append(
                (
                    i,
                    cluster_silhouette_values.shape[0],
                    cluster_silhouette_values.mean(),
                    cluster_silhouette_values.min(),
                    cluster_silhouette_values.max(),
                )
            )
        silhouette_values = sorted(
            silhouette_values, key=lambda tup: tup[2], reverse=True
        )
        for s in silhouette_values:
            print(
                f"Cluster {s[0]}: Size:{s[1]} | Avg:{s[2]:.2f} | Min:{s[3]:.2f} | Max: {s[4]:.2f}"
            )
    return km, km.labels_

This function creates the clusters using the Mini-Batch K-means algorithm.

It takes the following arguments:

* X: Matrix of features. In this case, it's your vectorized documents.
* k: Number of clusters you'd like to create.
* mb: Size of mini-batches.
* print_silhouette_values: Defines whether the Silhouette Coefficient is printed for each cluster. This coefficient is described further below.

mbkmeans_clusters takes these arguments and returns the fitted clustering model and the labels for each document.

In [148]:
clustering, cluster_labels = mbkmeans_clusters(
    X=vectorized_texts,
    k=2,
    mb=500,
    print_silhouette_values=True,
)

df_clusters = pd.DataFrame({
    'Text': all_texts,
    'Tokens': [" ".join(text) for text in tokenized_texts],
    'Cluster': cluster_labels
})

For n_clusters = 2
Silhouette coefficient: 0.09
Inertia:3885.1820036206045
Silhouette values:
    Cluster 0: Size:12520 | Avg:0.10 | Min:-0.00 | Max: 0.26
    Cluster 1: Size:8315 | Avg:0.09 | Min:-0.01 | Max: 0.24


This code will fit the clustering model, print the Silhouette Coefficient per cluster, and return the fitted model and the labels per cluster. It'll also create a data frame which can be used to review the results.

There are a few things to consider when setting the input arguments:

* print_silhouette_values is straightforward. In this case, you set it to True to print the evaluation metric per cluster. This will help you review the results.
* mb depends on the size of your dataset. It should be large enough not to degrade the quality of the results, and small enough to keep execution fast. In this case, it is set to 500 observations.
* k is trickier. In general, choosing it involves a mix of qualitative analysis and quantitative metrics. After a few experiments on my side, I found that 50 seemed to work well, but that is more or less arbitrary; the run shown here uses k = 2.

It's possible to use metrics like the Silhouette Coefficient for the quantitative evaluation of the number of clusters. This coefficient is an evaluation metric frequently used in problems where ground truth labels are unknown. For each sample, it is calculated from the mean intra-cluster distance a and the mean nearest-cluster distance b as (b - a) / max(a, b), and it ranges from -1 to 1. Values near 1 indicate well-defined clusters, values near 0 indicate overlapping clusters, and negative values generally indicate samples assigned to the wrong cluster.
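
The per-sample computation can be sketched in plain Python. This is a minimal illustration with Euclidean distance, assuming at least two clusters; the function name `silhouette_values` and the four toy points are hypothetical, and the notebook itself relies on scikit-learn's `silhouette_samples` and `silhouette_score`.

```python
import math


def silhouette_values(points, labels):
    """Per-sample silhouette s = (b - a) / max(a, b), Euclidean distance.

    Assumes at least two distinct labels.
    """
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    values = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            # A sample alone in its cluster gets a silhouette of 0
            values.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        values.append((b - a) / max(a, b))
    return values


# Toy data (assumption): two tight pairs far apart
toy_points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_values(toy_points, [0, 0, 1, 1])  # labels follow the pairs
bad = silhouette_values(toy_points, [0, 1, 0, 1])   # labels split each pair
good_mean = sum(good) / len(good)
bad_mean = sum(bad) / len(bad)
print(f"{good_mean:.2f} {bad_mean:.2f}")
```

Labels that follow the two pairs give a mean close to 1, while labels that split each pair give a negative mean.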

The qualitative part generally requires you to have domain knowledge of the subject matter so you can sense-check your clustering algorithm's results.

For n_clusters = 2

Silhouette coefficient: 0.09

Inertia:3777.5293674021395

Silhouette values:

    Cluster 0: Size:11131 | Avg:0.09 | Min:-0.00 | Max: 0.25
    Cluster 1: Size:9377 | Avg:0.09 | Min:-0.01 | Max: 0.24

This is the output of the clustering algorithm. The sizes and Silhouette Coefficients per cluster are the most relevant metrics. The clusters are listed in descending order of their average Silhouette Coefficient. A higher score means denser, better-separated clusters.

In [142]:
articles.head()

Unnamed: 0,Name,Id,Cid,Text
0,L1,LEGIARTI000018764571,LEGIARTI000017961623,Tout projet de réforme envisagé par le Gouvern...
1,L2,LEGIARTI000042654546,LEGIARTI000017961625,Le Gouvernement soumet les projets de textes l...
2,L3,LEGIARTI000042654542,LEGIARTI000017961627,"Chaque année, les orientations de la politique..."
3,L1111-1,LEGIARTI000006900781,LEGIARTI000006900781,Les dispositions du présent livre sont applica...
4,L1111-2,LEGIARTI000019353569,LEGIARTI000006900783,Pour la mise en oeuvre des dispositions du pré...


In [149]:
df_clusters.head()

Unnamed: 0,text,tokens,cluster
0,Tout projet de réforme envisagé par le Gouvern...,tout projet de réforme envisagé par le gouvern...,0
1,Le Gouvernement soumet les projets de textes l...,le gouvernement soumet les projets de textes l...,0
2,"Chaque année, les orientations de la politique...",chaque année les orientations de la politique ...,1
3,Les dispositions du présent livre sont applica...,les dispositions du présent livre sont applica...,1
4,Pour la mise en oeuvre des dispositions du pré...,pour la mise en oeuvre des dispositions du pré...,1


### Qualitative Review of Clusters

There are a few ways to qualitatively analyze the results. The earlier sections produced vector representations of tokens and documents, as well as the vectors of the clusters' centroids. To find the most representative tokens and documents of a cluster, look for the vectors closest to its centroid.
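
The same nearest-to-centroid lookup applies to tokens as well as documents. The sketch below ranks a handful of toy token vectors by Euclidean distance to a toy centroid; the helper `closest_to_centroid`, the two-dimensional vectors, and the centroid are hypothetical values chosen for illustration.

```python
import math


def closest_to_centroid(vectors, centroid, top_n):
    """Return the keys of `vectors` closest to `centroid` (Euclidean distance)."""
    ranked = sorted(vectors, key=lambda key: math.dist(vectors[key], centroid))
    return ranked[:top_n]


# Hypothetical 2-D token vectors and cluster centroid
toy_token_vectors = {
    "travail": [1.0, 0.0],
    "salarié": [0.9, 0.2],
    "navire": [-1.0, 0.5],
    "contrat": [0.7, -0.1],
}
toy_centroid = [0.9, 0.0]

top_tokens = closest_to_centroid(toy_token_vectors, toy_centroid, top_n=2)
print(top_tokens)  # ['travail', 'salarié']
```

With the gensim model loaded in this notebook, a comparable lookup is available through `wv.similar_by_vector(centroid, topn=5)`, which ranks the vocabulary by cosine similarity rather than Euclidean distance.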

In [117]:
test_cluster = 1
most_representative_docs = np.argsort(
    np.linalg.norm(vectorized_texts - clustering.cluster_centers_[test_cluster], axis=1)
)
for d in most_representative_docs[:3]:
    print(all_texts[d])
    print("-------------")

Des arrêtés du ministre chargé du travail ou du ministre chargé de l'agriculture déterminent les équipements de protection individuelle et catégories d'équipement de protection individuelle pour lesquels le chef d'établissement ou le travailleur indépendant doit procéder ou faire procéder à des vérifications générales périodiques afin que soit décelé en temps utile toute défectuosité susceptible d'être à l'origine de situations dangereuses ou tout défaut d'accessibilité contraire aux conditions déterminées conformément à l'article R. 233-42-1. Ces arrêtés précisent la périodicité des vérifications et, en tant que de besoin, leur nature et leur contenu. L'intervalle entre lesdites vérifications peut être réduit sur mise en demeure de l'inspecteur du travail ou du contrôleur du travail lorsque, en raison notamment des conditions de stockage ou d'environnement, du mode de fonctionnement ou de la conception de certains organes, les équipements de protection individuelle sont soumis à des c