# Hugginface transformers
This notebook aims to classify the CLA documents based on a pre-trained NLP model from HuggingFace. 

We have extracted the most important keywords from the metadata in `make_targets_from_metadata.ipynb` and will use this set of keywords as target for a pre-trained model.

The results are very dissapointing, so another method is used in the `clustering.ipynb` notebook.

In [1]:
import pandas as pd
import numpy as np
import re
import os

# Tensorflow
import tensorflow as tf
import tensorflow_addons as tfa
from tensorflow.python.client import device_lib

# Huggingface
from transformers import TFDistilBertForSequenceClassification  
from transformers import DistilBertTokenizer

# NLTK
import nltk
from nltk import word_tokenize
from rake_nltk import Metric, Rake
from nltk.corpus import stopwords

# FastText
import fasttext as ft

# Scikit-learn
from sklearn.model_selection import train_test_split

import tqdm as notebook_tqdm
from ipywidgets import FloatProgress

import warnings
warnings.filterwarnings("ignore")

## Language detection

We use the handy and very effective language detection library fasttext. This will give us and idea in what language the document is written. It is used in the short_text function.

In [2]:
# FASTTEXT LANGUAGE detection 
# Load the pretrained model
ft_model = ft.load_model("../model/pretrained/lid.176.ftz")

def fasttext_language_predict(text, model = ft_model):

  text = text.replace('\n', " ")
  prediction = model.predict([text])

  return prediction




### 2. Shortening the text

1. Get ranked phrases with Rake_NLTK
2. Do the language detection
3. Replace the original text with the shortened phrases 


In [3]:
def shorten_text(full_text):

    # Add custom stopwords
    stopwords_list = stopwords.words('dutch')

    custom_stopwords=['per','waar','waarvoor','wegens','wanneer','gevolg','gevolge','voorbehoud','erratum','correctie','sommige','betreffende','maatregel','procedure','stelsel','sector','organisatie','excl','aanv','adv','art','artikel','hoofdstuk','XII','XI','IX','VII','VI','V', '2020','2019','2018','2021','2022','dag','dagen','uur','uren','jaar','jaarlijks','maand','maanden','januari','februari','maart','april','mei','juni','juli','augustus','september','oktober','november','december''volgt','voordat','behoudt','beschouwd','bepaald','gedaan','leiden','zullen','gaan']
    stopwords_list.extend(custom_stopwords)
    
    rake_nltk_var = Rake(language='dutch',stopwords=stopwords_list,include_repeated_phrases=False,ranking_metric=Metric.DEGREE_TO_FREQUENCY_RATIO,min_length=1,max_length=3)

    rake_nltk_var.extract_keywords_from_text(full_text)    
    phrases_extracted = rake_nltk_var.get_ranked_phrases()

    # Keyword_extracted is a list, so split them into seperate words for the columns 
    phrases=set(phrases_extracted)

    # First column will be document_id, so put it as first element
    doc_keywords=[]

    # Exclude keywords with lenght < 4
    for phrase in phrases:
        if len(phrase)>3:
            detected_language=fasttext_language_predict(phrase, model = ft_model)[0][0][0][-2:]  
            if detected_language.upper()=='NL':
                doc_keywords.append(phrase)

    short_text=doc_keywords
    string_text=str()

    for keywords in short_text:
        string_text+=keywords
        string_text+=' '

    return string_text

In [4]:
# Make the text in dataframe shorter 
# And replace df_model['text'] with this shorter condensed version

def modify_df(df):

    for i in range(len(df)):
        full_text=df["text"][i]
        
        string_text=shorten_text(full_text)

        df.loc[i, 'text'] = string_text
        return df


## X and y

For X we will use the shortened text, instead of the full document text.

For y we load the labels as found by our `make_targets_from_metadata.ipynb` notebook in preprocessing folder.

For this notebook we will only use the first column, which hosts the most important label from the nltk extraction.

Then we split our training data into train, test and validation

In [5]:
df=pd.read_csv('../csv/CLA_targets_NL.csv',sep=';')
df_model=modify_df(df)

df_model=df_model[:200]
X = df_model["text"].tolist()

y_1 = pd.get_dummies(df_model['1'])
y_2 = pd.get_dummies(df_model['2'])
y_3 = pd.get_dummies(df_model['3'])
y_4 = pd.get_dummies(df_model['4'])
y_5 = pd.get_dummies(df_model['5'])
y_6 = pd.get_dummies(df_model['6'])
y_7 = pd.get_dummies(df_model['7'])
y_8 = pd.get_dummies(df_model['8'])
y_9 = pd.get_dummies(df_model['9'])
y_10 = pd.get_dummies(df_model['10'])
y=pd.concat([y_1,y_2,y_3,y_4,y_5,y_6,y_7,y_8,y_9,y_10],axis=1,join='inner')
y=pd.concat([y_1],axis=1,join='inner')
 
y = y.groupby(level=0,axis=1).sum()
y_unique = y.loc[:,~y.columns.duplicated()].copy() 

y=y_unique

# Split Train and Validation data

#print (len(X))
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.2, random_state=43)
#print ('X_train_val',len(X_train_val))
#print ('X_test',len(X_test))

print (len(X))
# Keep some data for inference (testing)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=43)
print ('X_train',len(X_train))
print ('X_test',len(X_test))
print ('X_val',len(X_val))

200
X_train 140
X_test 40
X_val 60


## The labels 

Now we have the deduced labels for our model to train on.

In [6]:
y.to_csv('../csv/columns.csv',sep=';')
y.head(-5)

Unnamed: 0,aanvullende pensioenen,actieve werknemer,bedrijfstoeslag,bonuskostenvergoedingen,ecocheques,eindeloopbaandagenoudere werknemers,feestdagen,fondsen,functieclassificatie,landingsbanen,...,overlijdensvergoeding,overuren,risicogroepen,swtopzegging,swtstelsel,syndicale afvaardiging,syndicale premie,syndicale vormingmaatregel,uitzendarbeidveiligheid,vergoedingen
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
190,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
191,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
192,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
193,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Pre-trained model selection

We choose distilbert for sequence classification

In [7]:
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=len(set(y))) 

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['vocab_transform', 'vocab_layer_norm', 'vocab_projector', 'activation_13']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier', 'dropout_19', 'classifier']
You should probably TRAIN this model on a down-stream task to be able to use i

## Tokenization

Here we tokenize the texts and create the  train, test and validation sets. 

We try various max_lengths. For the purpose of demonstration we keep it at 12 tokens.

In [8]:
train_encodings = tokenizer(X_train, max_length=12, truncation=True, padding=True)
val_encodings = tokenizer(X_val, max_length=12, truncation=True, padding=True)
test_encodings = tokenizer(X_test, max_length=12, truncation=True, padding=True)

train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    y_train
))

val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    y_val
))

test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    y_test
))

## Compile model

### Metrics, Loss function, optimizer

We try a few metrics, accuracy f1_score and binaryaccuracy lead to serious overfitting, AUC aswell. The best results in terms of overfitting is not using any metrics at all.

In [9]:
OPTIMIZER =  tf.keras.optimizers.Adam(learning_rate= 1e-5)
LOSS = tf.keras.losses.BinaryCrossentropy(from_logits=True)
METRICS=['Accuracy']
METRICS = [tf.keras.metrics.BinaryAccuracy()]
METRICS=[tf.keras.metrics.AUC(
    num_thresholds=200,
    curve='ROC',
    summation_method='interpolation',
    name=None,
    dtype=None,
    thresholds=None,
    multi_label=True,
    num_labels=len(set(y)),
    label_weights=None,
    from_logits=True
)]
 
METRICS=[tfa.metrics.F1Score(
    average=None, threshold=None, name='f1_score', dtype=None, num_classes=len(set(y))
)]
model.compile(optimizer=OPTIMIZER, loss=LOSS, metrics=METRICS)
model.summary()

Model: "tf_distil_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 distilbert (TFDistilBertMai  multiple                 66362880  
 nLayer)                                                         
                                                                 
 pre_classifier (Dense)      multiple                  590592    
                                                                 
 classifier (Dense)          multiple                  19994     
                                                                 
 dropout_19 (Dropout)        multiple                  0         
                                                                 
Total params: 66,973,466
Trainable params: 66,973,466
Non-trainable params: 0
_________________________________________________________________


In [10]:
tf.config.run_functions_eagerly(False)

BATCH_SIZE = 2
EPOCHS =8
history=model.fit(
    train_dataset.batch(BATCH_SIZE) ,
    epochs=EPOCHS,
    validation_data=val_dataset.batch(BATCH_SIZE)
)

Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


In [11]:
def clean_text(text):
  text = text.lower()
  text = re.sub("[^a-zA-Z\'\-éòóôëè]", " ", text) 
  return " ".join(word_tokenize(text)[:])

## Verify

Let's see what predictions are, we print out the 1st,2nd and 3th highest label scores.

In [12]:
count=0

for root, dirs, files in os.walk('../data/processed/NL/'):
 
    for file in files:

        with open(os.path.join(root, file), 'r',encoding='utf-8') as f: 
            text=f.read() 
            text = clean_text(text)
            text= shorten_text(text)
            encodings = tokenizer([text], truncation=True, padding=True)
            ds = tf.data.Dataset.from_tensor_slices(dict(encodings))
            predictions = model.predict(ds)
            
            class_indices = np.argsort(predictions[0])[::-1]  # get the indices of the probabilities in descending order
            highest_index = class_indices[0][0]  # select the index of the class with the highest probability
            second_index = class_indices[0][3]  # select the index of the class with the 2nd highest probability
            third_index = class_indices[0][9]  # select the index of the class with the 3th highest probability
            mapping = {i: name for i, name in enumerate(y.columns)}
                
            print(file, len(text), mapping[highest_index], mapping[second_index], mapping[third_index])
            
            count+=1
            if count >10: 
                break

10202-2018-012766.txt 6250 uitzendarbeidveiligheid opleiding syndicale afvaardiging
10202-2020-013175.txt 2735 uitzendarbeidveiligheid opleiding swtstelsel
10205-2018-004963.txt 4326 uitzendarbeidveiligheid opleiding syndicale afvaardiging
10206-2019-003872.txt 32313 uitzendarbeidveiligheid opleiding syndicale afvaardiging
10206-2020-000814.txt 4597 uitzendarbeidveiligheid opleiding syndicale afvaardiging
10206-2021-015270.txt 4752 uitzendarbeidveiligheid opleiding syndicale afvaardiging
10207-2018-013171.txt 10922 uitzendarbeidveiligheid opleiding syndicale afvaardiging
10207-2018-013172.txt 4119 uitzendarbeidveiligheid opleiding syndicale afvaardiging
10207-2018-013174.txt 7428 uitzendarbeidveiligheid opleiding syndicale afvaardiging
10209-2021-015509.txt 998 uitzendarbeidveiligheid opleiding swtstelsel
104-2019-010269.txt 11774 uitzendarbeidveiligheid opleiding swtstelsel


## Conclusion

As we can see this model classification is not capable of assigning the right labels to the text. 