# NLP data augmentation project

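This notebook tries several translation services (Yandex, Azure, Goslate) for back translation, a way of augmenting data for NLP tasks: each request is translated into a pivot language and then back into French to obtain a paraphrase.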

__Warning:__ some paths point to a private Google Drive account; change them to reproduce the work. API keys are not provided here.
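The round trip used throughout the notebook can be sketched independently of any provider. This is a minimal sketch, not the code run below: `back_translate`, the `Translate` signature and `toy_translate` are hypothetical names introduced for illustration, and the toy translator only tags the text so the flow is visible.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical signature: (text, source_lang, target_lang) -> translated text.
Translate = Callable[[str, str, str], str]


def back_translate(texts: Iterable[str], pivots: Iterable[str],
                   translate: Translate, source: str = "fr") -> List[Tuple[int, str]]:
    """Return (original_index, paraphrase) pairs, one per text and pivot language."""
    pivots = list(pivots)
    pairs = []
    for index, text in enumerate(texts):
        for pivot in pivots:
            intermediate = translate(text, source, pivot)        # source -> pivot
            paraphrase = translate(intermediate, pivot, source)  # pivot -> source
            pairs.append((index, paraphrase))
    return pairs


def toy_translate(text: str, src: str, dst: str) -> str:
    """Stand-in for a real translation service; it only tags the text."""
    return f"{text} [{src}>{dst}]"


pairs = back_translate(["je viens pour un PACS"], ["en", "de"], toy_translate)
for index, paraphrase in pairs:
    print(index, paraphrase)
```

Keeping the index of the original example next to each paraphrase is what later allows the generated sentences to be joined back to their labels.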

## Importing initial data

In [5]:
import pandas as pd

data = pd.read_csv('/content/drive/My Drive/project_codebase/project/data/requests.csv', sep=';')
demandes = data['demande']
data

Unnamed: 0,demande,motif,groupe_motif
0,je viens enregistrer mon bac,Enregistrement de PACS,01c - Etat Civil PACS Enregistrement
1,je viens de pardon je viens pour déposer un do...,Enregistrement de PACS,01c - Etat Civil PACS Enregistrement
2,je souhaite enregistrer une convention de PACS...,Enregistrement de PACS,01c - Etat Civil PACS Enregistrement
3,bonjour je viens pour un enregistrement de PACS,Enregistrement de PACS,01c - Etat Civil PACS Enregistrement
4,je souhaite enregistrer une convention de pact...,Enregistrement de PACS,01c - Etat Civil PACS Enregistrement
...,...,...,...
1183,nous souhaitons annuler notre pacte civil de s...,"PACS (Dépôt de dossier, modification ou dissol...","01d - Etat Civil PACS Modification, dissolution"
1184,je viens de déposer mon dossier de PACS,"PACS (Dépôt de dossier, modification ou dissol...","01d - Etat Civil PACS Modification, dissolution"
1185,je me PACS,"PACS (Dépôt de dossier, modification ou dissol...","01d - Etat Civil PACS Modification, dissolution"
1186,je souhaite modifier mon contrat PACS,"PACS (Dépôt de dossier, modification ou dissol...","01d - Etat Civil PACS Modification, dissolution"


## Back translation with Yandex

In [0]:
import requests
import json

api = 'https://translate.yandex.net/api/v1.5/tr.json/translate'
key = '<Your API key>'

languages = ['en', 'es', 'ru', 'de', 'ar', 'it'] # , 'ja', 'ca', 'zh'

In [0]:
result = []

In [0]:
for i, request in demandes.loc[1168:].items():  # Series.iteritems() was removed in pandas 2.0
    for lang in languages:
        forward, back = f'fr-{lang}', f'{lang}-fr'

        # French -> pivot language
        r = requests.post(api, data={'key': key, 'text': request, 'lang': forward})
        answer = r.json()
        if answer['code'] != 200:
            print(answer)
            continue  # no 'text' field to read on error, skip this language
        translated_text = answer['text'][0]

        # Pivot language -> French
        r = requests.post(api, data={'key': key, 'text': translated_text, 'lang': back})
        answer = r.json()
        if answer['code'] != 200:
            print(answer)
            continue
        final = answer['text'][0]
        result.append([i, final])
    print(i)


1168
1169
1170
1171
1172
1173
1174
1175
1176
1177
1178
1179
1180
1181
1182
1183
1184
1185
1186
1187
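The loop above reads two fields from each response, `code` and `text`. A small helper can make that check explicit and fail loudly instead of moving on with a missing translation. This is a sketch only: `extract_translation` is a hypothetical helper, and the sample payloads are illustrative rather than captured from the service.

```python
def extract_translation(answer: dict) -> str:
    """Return the first translated string, or raise if the service reported an error."""
    if answer.get("code") != 200:
        raise RuntimeError(f"translation failed: {answer}")
    texts = answer.get("text") or []
    if not texts:
        raise RuntimeError(f"no translated text in response: {answer}")
    return texts[0]


# Illustrative payloads (assumed shape, based on the fields the loop reads).
ok = {"code": 200, "text": ["I come to register my PACS"]}
failed = {"code": 401, "message": "API key is invalid"}

print(extract_translation(ok))
try:
    extract_translation(failed)
except RuntimeError as error:
    print("skipped:", error)
```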


In [0]:
augmented = pd.DataFrame(result, columns=['corresponding_example_id', 'generated'])
# augmented.to_csv('/content/drive/My Drive/project_codebase/project/generated_yandex.csv', index=False)

## Back translation with Azure

In [0]:
import requests, uuid, json

subscription_key = '<Your API key>'
endpoint = '<Your endpoint>'

path = '/translate?api-version=3.0'
params = '&to=de&to=it'
constructed_url = endpoint + path + params

headers = {
    'Ocp-Apim-Subscription-Key': subscription_key,
    'Content-type': 'application/json',
    'X-ClientTraceId': str(uuid.uuid4())
}

# You can pass more than one object in body.
body = [{
    'text' : 'Hello World!'
}]
request = requests.post(constructed_url, headers=headers, json=body)
# response = request.json()

# print(json.dumps(response, sort_keys=True, indent=4, separators=(',', ': ')))
print(request.text)

{"error":{"code":"404","message": "Resource not found"}}
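When a request comes back with 404, one thing worth checking is the final URL that was actually called. The sketch below builds it with `urllib.parse` instead of string concatenation, so a trailing slash in the endpoint or a malformed query cannot slip in. `build_translate_url` is a hypothetical helper and `https://example.invalid` is a placeholder; whether the 404 above came from the endpoint value is an assumption, not something the output confirms.

```python
from urllib.parse import urlencode


def build_translate_url(endpoint: str, targets, api_version: str = "3.0") -> str:
    """Join endpoint, path and query; repeated 'to' keys select several target languages."""
    query = urlencode({"api-version": api_version, "to": list(targets)}, doseq=True)
    return endpoint.rstrip("/") + "/translate?" + query


# Placeholder endpoint: replace it with the one shown for your own resource.
url = build_translate_url("https://example.invalid/", ["de", "it"])
print(url)
```

Printing `constructed_url` before sending the request gives the same information for the cell above.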


## Back translation with Goslate

In [0]:
pip install goslate

Collecting goslate
  Downloading https://files.pythonhosted.org/packages/39/0b/50af938a1c3d4f4c595b6a22d37af11ebe666246b05a1a97573e8c8944e5/goslate-1.5.1.tar.gz
Collecting futures
  Downloading https://files.pythonhosted.org/packages/05/80/f41cca0ea1ff69bce7e7a7d76182b47bb4e1a494380a532af3e8ee70b9ec/futures-3.1.1-py3-none-any.whl
Building wheels for collected packages: goslate
  Building wheel for goslate (setup.py) ... done
  Created wheel for goslate: filename=goslate-1.5.1-cp36-none-any.whl size=11550 sha256=ffa17e20011a127799035cca55e239a353593a047eb978461ce7242d346a7853
  Stored in directory: /root/.cache/pip/wheels/4f/7f/28/6f52271012a7649b54b1a7adaae329b4246bbbf9d1e4f6e51a
Successfully built goslate
Installing collected packages: futures, goslate
Successfully installed futures-3.1.1 goslate-1.5.1


In [0]:
import goslate
import pandas as pd
import time

translator = goslate.Goslate()
d = pd.read_csv("/content/drive/My Drive/project_codebase/project/data/requests.csv", sep=';')
classes = d['motif'].unique()
langs = ['en', 'es', 'de', 'ru', 'ar', 'it', 'ja', 'ca', 'zh']

# print(..., file=...) needs an open file object, not a file name
with open('results.csv', 'w', encoding='utf-8') as out:
    for cl in classes:
        di = d.loc[d['motif'] == cl, 'demande'].values.tolist()
        for l in langs:
            # French -> pivot language -> French, one batch per class
            inter = translator.translate(di, l)
            res = translator.translate(inter, 'fr')
            for x in res:
                print("{};{}".format(x, cl), file=out)
            time.sleep(5)  # pause between batches


HTTPError: ignored
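The cell above writes `text;label` lines by hand, which breaks as soon as a translated sentence contains a semicolon. A possible alternative, sketched here under that assumption, is to let the standard `csv` module do the quoting. `write_rows` is a hypothetical helper and the sample row is made up.

```python
import csv
import io


def write_rows(rows, handle) -> None:
    """Write (text, label) pairs as ';'-separated lines, quoting fields when needed."""
    writer = csv.writer(handle, delimiter=";", lineterminator="\n")
    for text, label in rows:
        writer.writerow([text, label])


# In-memory buffer for the demo; a file opened with newline="" works the same way.
buffer = io.StringIO()
write_rows([("je viens; pour un PACS", "Enregistrement de PACS")], buffer)
print(buffer.getvalue())

# Reading the buffer back recovers the original fields.
buffer.seek(0)
rows_back = list(csv.reader(buffer, delimiter=";"))
```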

## Augmented data

Putting the dataset back together.

In [18]:
generated = pd.read_csv('/content/drive/My Drive/project_codebase/project/generated_yandex.csv')
initial = pd.read_csv('/content/drive/My Drive/project_codebase/project/data/requests.csv', sep=';').reset_index()
initial.rename(columns={'index': 'corresponding_example_id'}, inplace=True)
generated = generated.join(initial, on='corresponding_example_id', lsuffix='0')
generated.drop(['corresponding_example_id0', 'corresponding_example_id'], axis=1, inplace=True)
generated.rename(columns={'demande':'demande_originale', 'generated':'demande'}, inplace=True)
generated

Unnamed: 0,demande,demande_originale,motif,groupe_motif
0,Je viens de sauver mon réservoir,je viens enregistrer mon bac,Enregistrement de PACS,01c - Etat Civil PACS Enregistrement
1,Je viens juste de garder mon dépôt,je viens enregistrer mon bac,Enregistrement de PACS,01c - Etat Civil PACS Enregistrement
2,je viens d'enregistrer mon bac,je viens enregistrer mon bac,Enregistrement de PACS,01c - Etat Civil PACS Enregistrement
3,je viens d'enregistrer mon aquarium,je viens enregistrer mon bac,Enregistrement de PACS,01c - Etat Civil PACS Enregistrement
4,Je ne sauvegarde que le réservoir,je viens enregistrer mon bac,Enregistrement de PACS,01c - Etat Civil PACS Enregistrement
...,...,...,...,...
7123,Je veux dissoudre mon pacte civil de solidarité,je souhaite faire dissoudre mon pacte civil de...,"PACS (Dépôt de dossier, modification ou dissol...","01d - Etat Civil PACS Modification, dissolution"
7124,"je voudrais, pour dissoudre mon pacte civil de...",je souhaite faire dissoudre mon pacte civil de...,"PACS (Dépôt de dossier, modification ou dissol...","01d - Etat Civil PACS Modification, dissolution"
7125,je veux résoudre mon pacte civil de solidarité,je souhaite faire dissoudre mon pacte civil de...,"PACS (Dépôt de dossier, modification ou dissol...","01d - Etat Civil PACS Modification, dissolution"
7126,Je veux que ma solution de la Charte de civil ...,je souhaite faire dissoudre mon pacte civil de...,"PACS (Dépôt de dossier, modification ou dissol...","01d - Etat Civil PACS Modification, dissolution"
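Some round trips return the original sentence unchanged, or the same paraphrase through several pivot languages. Such rows add no variety, so an optional clean-up step could drop them. This step is not part of the pipeline above: `normalize` and `filter_paraphrases` are hypothetical helpers, and the sample pairs are illustrative.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def filter_paraphrases(pairs):
    """Keep (original, generated) pairs that differ from the original and are not repeated."""
    seen = set()
    kept = []
    for original, generated in pairs:
        key = (normalize(original), normalize(generated))
        if key[0] == key[1] or key in seen:
            continue
        seen.add(key)
        kept.append((original, generated))
    return kept


sample = [
    ("je viens enregistrer mon bac", "Je viens enregistrer mon bac"),   # unchanged
    ("je viens enregistrer mon bac", "je viens d'enregistrer mon bac"),
    ("je viens enregistrer mon bac", "Je viens  d'enregistrer mon bac"),  # repeat
]
kept = filter_paraphrases(sample)
print(kept)
```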


## Evaluation

The evaluation is done by commenting and uncommenting lines in the cell below, so that `dev.txt` and `test.txt` always stay the same while `train.txt` changes with the data provided (original or augmented).
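The same goal, fixed `dev.txt` and `test.txt` with a varying `train.txt`, can also be reached without editing the cell by hand. The sketch below is one possible way, not the procedure used in this notebook: it splits the ids of the original examples once with a fixed seed and adds only the paraphrases of training examples to the training set. `split_ids` and `build_train` are hypothetical helpers, and the toy data stands in for the real requests.

```python
import random


def split_ids(ids, seed=0, train_frac=0.7, test_frac=0.2):
    """Split original example ids into train/test/dev sets, reproducibly."""
    ids = sorted(ids)
    random.Random(seed).shuffle(ids)  # fixed seed: same dev/test on every run
    n_train = int(train_frac * len(ids))
    n_test = int(test_frac * len(ids))
    train = set(ids[:n_train])
    test = set(ids[n_train:n_train + n_test])
    dev = set(ids[n_train + n_test:])
    return train, test, dev


def build_train(originals, augmented, train_ids, use_augmented):
    """originals: {id: (label, text)}; augmented: [(original_id, label, text)]."""
    rows = [originals[i] for i in sorted(train_ids)]
    if use_augmented:
        # Only paraphrases of training examples, so dev/test stay unseen.
        rows += [(label, text) for i, label, text in augmented if i in train_ids]
    return rows


# Toy data standing in for the original and generated requests.
originals = {i: ("label", f"request {i}") for i in range(10)}
augmented = [(i, "label", f"paraphrase of request {i}") for i in range(10)]

train_ids, test_ids, dev_ids = split_ids(originals)
baseline = build_train(originals, augmented, train_ids, use_augmented=False)
enlarged = build_train(originals, augmented, train_ids, use_augmented=True)
print(len(baseline), len(enlarged), len(test_ids), len(dev_ids))
```

Splitting by original id also keeps a paraphrase and its source sentence on the same side of the split.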

In [22]:
from os import path

import os
import numpy
import pandas
corpus_path = "corpus_splits/"
if not path.exists(corpus_path):
    os.mkdir(corpus_path)

#Loading dataset
# df = pandas.read_csv("/content/drive/My Drive/project_codebase/project/data/requests.csv", sep=";")
df = generated
df = df[["motif", "demande"]]
df["motif"] = "__label__" + df["motif"].astype("str")
df["motif"] = df["motif"].str.replace(" ","_",regex=False)

# Number of splits
num_splits = 10


for split in range(num_splits):
    base_path = corpus_path + "split_" + str(split)
    if not path.exists(base_path):
        os.mkdir(base_path)

    # Shuffle, then cut at 70% and 90%: 70% train, 20% test, 10% dev
    train, test, dev = numpy.split(df.sample(frac=1), [int(.7 * len(df)), int(.9 * len(df))])

    train.to_csv(base_path + "/train.txt", index=False, sep="\t", header=False)
    # test.to_csv(base_path + "/test.txt", index=False, sep="\t", header=False)
    # dev.to_csv(base_path + "/dev.txt", index=False, sep="\t", header=False)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
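Each line written to `train.txt` follows the FastText-style layout that the cell above produces: a `__label__` prefix, the label with spaces replaced by underscores, a tab, then the text. A minimal sketch of that transformation on a single row, with `to_fasttext_line` as a hypothetical helper name:

```python
def to_fasttext_line(label: str, text: str) -> str:
    """Format one example as '__label__<label_with_underscores><TAB><text>'."""
    return "__label__" + label.replace(" ", "_") + "\t" + text


line = to_fasttext_line("Enregistrement de PACS", "je viens enregistrer mon bac")
print(line)
```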


In [0]:
!pip install flair

In [24]:
%tensorflow_version 2.x

from flair.data import Corpus
from flair.datasets import ClassificationCorpus
from flair.embeddings import CamembertEmbeddings, DocumentRNNEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from torch.optim import Adam
import numpy as np

TensorFlow 2.x selected.


In [25]:
data_folder = 'corpus_splits/'

# column format indicating which columns hold the text and label(s)
column_name_map = {1: "text", 2: "label_topic", }

# Camembert
camembert = CamembertEmbeddings(layers="-1,-2,-3,-4")

embedding_list = [camembert]

# Document embedding model (created once, so the same RNN instance is reused by every split below)
document_embeddings = DocumentRNNEmbeddings(embedding_list, hidden_size=750, bidirectional=True,
                                            rnn_layers=2,
                                            rnn_type='GRU',
                                            dropout=0.4,
                                            word_dropout=0.1)
results = []

# Evaluate on the 10 random splits written above (repeated random splits rather than disjoint folds)
for root, dirs, files in os.walk(data_folder):
    for dir in dirs:
        if "split" in dir:
            print("Processing " + dir + " ...")
            corpus: Corpus = ClassificationCorpus(data_folder + "/" + dir,
                                                  test_file='test.txt',
                                                  dev_file='dev.txt',
                                                  train_file='train.txt', in_memory=True)

            classifier = TextClassifier(document_embeddings, label_dictionary=corpus.make_label_dictionary(),
                                        multi_label=False)
            trainer = ModelTrainer(classifier, corpus)
            model_path = data_folder + "/" + dir + "/model/"
            scores = trainer.train(model_path, max_epochs=10,
                                   embeddings_storage_mode="cpu",
                                   learning_rate=0.3,
                                   mini_batch_size=32,
                                   anneal_factor=0.5,
                                   shuffle=False,
                                   patience=5, save_final_model=False, anneal_with_restarts=False)
            expected = [sentence.labels[0].value for sentence in corpus.test.sentences]
            predictions = [sentence.labels[0].value for sentence in classifier.predict(corpus.test.sentences)]
            scores['test_f1'] = f1_score(expected, predictions, average='micro')
            results.append(scores)


Processing split_4 ...
2020-02-16 18:43:15,643 Reading data from corpus_splits/split_4
2020-02-16 18:43:15,644 Train: corpus_splits/split_4/train.txt
2020-02-16 18:43:15,644 Dev: corpus_splits/split_4/dev.txt
2020-02-16 18:43:15,646 Test: corpus_splits/split_4/test.txt
2020-02-16 18:43:16,285 Computing label dictionary. Progress:


100%|██████████| 4965/4965 [00:00<00:00, 276858.19it/s]

2020-02-16 18:43:16,313 [b'contentieux_locataire_parti_:_r\xc3\xa9f\xc3\xa9rence_courrier_re\xc3\xa7u_FC', b'Recensement_des_jeunes', b'Certificats,_l\xc3\xa9galisation_de_signature', b"Premi\xc3\xa8re_Inscription_scolaire-changement_d'\xc3\xa9cole", b'D\xc3\xa9claration_de_d\xc3\xa9c\xc3\xa8s', b'Livret_de_famille', b'Mariage', b'Renseignements,_modification_de_dossier', b'explication_avis_\xc3\xa9ch\xc3\xa9ance_loyer', b'Inscription_P\xc3\xa9riscolaire_(Cantine_et_Accueil)', b'd\xc3\xa9compte_de_sortie_(locataire_quittant_OPHEOR)', b'Inscription_sur_liste_\xc3\xa9lectorale', b'R\xc3\xa8glement_cantine_en_esp\xc3\xa8ces', b'explication_r\xc3\xa9gularisation_des_charges', b'relance_amiable_:_r\xc3\xa9f\xc3\xa9rence_courrier_re\xc3\xa7u_RC', b'Commande_ou_d\xc3\xa9commande_de_repas', b'Changement_de_pr\xc3\xa9noms,_rectification_d\xe2\x80\x99actes', b"demandes_d'attestations_diverses", b'D\xc3\xa9claration_de_naissance,_Reconnaissance', b'Actes_de_naissance,_mariage,_d\xc3\xa9c\xc3\xa8s




2020-02-16 18:43:24,081 epoch 1 - iter 15/156 - loss 3.96433117 - samples/sec: 62.15
2020-02-16 18:43:32,155 epoch 1 - iter 30/156 - loss 3.68901490 - samples/sec: 60.40
2020-02-16 18:43:40,058 epoch 1 - iter 45/156 - loss 3.58013620 - samples/sec: 62.06
2020-02-16 18:43:48,066 epoch 1 - iter 60/156 - loss 3.45231233 - samples/sec: 60.83
2020-02-16 18:43:55,916 epoch 1 - iter 75/156 - loss 3.35290138 - samples/sec: 62.19
2020-02-16 18:44:03,714 epoch 1 - iter 90/156 - loss 3.18601061 - samples/sec: 62.56
2020-02-16 18:44:11,479 epoch 1 - iter 105/156 - loss 3.05470355 - samples/sec: 62.66
2020-02-16 18:44:19,222 epoch 1 - iter 120/156 - loss 2.97526700 - samples/sec: 62.85
2020-02-16 18:44:26,844 epoch 1 - iter 135/156 - loss 2.89168438 - samples/sec: 64.11
2020-02-16 18:44:34,723 epoch 1 - iter 150/156 - loss 2.79264000 - samples/sec: 61.94
2020-02-16 18:44:37,590 ----------------------------------------------------------------------------------------------------
2020-02-16 18:44:37,5

100%|██████████| 4965/4965 [00:00<00:00, 231558.15it/s]

2020-02-16 18:47:05,817 [b'contentieux_locataire_pr\xc3\xa9sent_:_r\xc3\xa9f\xc3\xa9rence_courrier_re\xc3\xa7u_FC', b'Enregistrement_de_PACS', b'PACS_(D\xc3\xa9p\xc3\xb4t_de_dossier,_modification_ou_dissolution_)', b'Certificats,_l\xc3\xa9galisation_de_signature', b"demandes_d'attestations_diverses", b'd\xc3\xa9compte_de_sortie_(locataire_quittant_OPHEOR)', b'relance_amiable_:_r\xc3\xa9f\xc3\xa9rence_courrier_re\xc3\xa7u_RC', b'Livret_de_famille', b'Commande_ou_d\xc3\xa9commande_de_repas', b'Recensement_des_jeunes', b'explication_avis_\xc3\xa9ch\xc3\xa9ance_loyer', b"Premi\xc3\xa8re_Inscription_scolaire-changement_d'\xc3\xa9cole", b'Actes_de_naissance,_mariage,_d\xc3\xa9c\xc3\xa8s', b'D\xc3\xa9claration_de_d\xc3\xa9c\xc3\xa8s', b'D\xc3\xa9claration_de_naissance,_Reconnaissance', b'Inscription_sur_liste_\xc3\xa9lectorale', b'Inscription_P\xc3\xa9riscolaire_(Cantine_et_Accueil)', b'Renseignements,_modification_de_dossier', b'explication_r\xc3\xa9gularisation_des_charges', b'mise_en_place




2020-02-16 18:47:13,735 epoch 1 - iter 15/156 - loss 1.05282265 - samples/sec: 60.92
2020-02-16 18:47:21,427 epoch 1 - iter 30/156 - loss 0.73885827 - samples/sec: 63.87
2020-02-16 18:47:29,186 epoch 1 - iter 45/156 - loss 0.58744628 - samples/sec: 62.78
2020-02-16 18:47:37,048 epoch 1 - iter 60/156 - loss 0.50671520 - samples/sec: 61.91
2020-02-16 18:47:44,868 epoch 1 - iter 75/156 - loss 0.47177144 - samples/sec: 62.36
2020-02-16 18:47:53,090 epoch 1 - iter 90/156 - loss 0.46099007 - samples/sec: 59.34
2020-02-16 18:48:01,076 epoch 1 - iter 105/156 - loss 0.43257743 - samples/sec: 60.93
2020-02-16 18:48:08,965 epoch 1 - iter 120/156 - loss 0.41261490 - samples/sec: 61.76
2020-02-16 18:48:16,756 epoch 1 - iter 135/156 - loss 0.39294633 - samples/sec: 62.61
2020-02-16 18:48:24,577 epoch 1 - iter 150/156 - loss 0.37496976 - samples/sec: 62.48
2020-02-16 18:48:27,372 ----------------------------------------------------------------------------------------------------
2020-02-16 18:48:27,3

100%|██████████| 4967/4967 [00:00<00:00, 208362.33it/s]

2020-02-16 18:50:56,797 [b'Actes_de_naissance,_mariage,_d\xc3\xa9c\xc3\xa8s', b'Certificats,_l\xc3\xa9galisation_de_signature', b'Enregistrement_de_PACS', b'Changement_de_pr\xc3\xa9noms,_rectification_d\xe2\x80\x99actes', b'D\xc3\xa9claration_de_naissance,_Reconnaissance', b'relance_amiable_:_r\xc3\xa9f\xc3\xa9rence_courrier_re\xc3\xa7u_RC', b'Inscription_P\xc3\xa9riscolaire_(Cantine_et_Accueil)', b'Mariage', b'Recensement_des_jeunes', b"Premi\xc3\xa8re_Inscription_scolaire-changement_d'\xc3\xa9cole", b'Livret_de_famille', b'contentieux_locataire_parti_:_r\xc3\xa9f\xc3\xa9rence_courrier_re\xc3\xa7u_FC', b'D\xc3\xa9claration_de_d\xc3\xa9c\xc3\xa8s', b'Inscription_sur_liste_\xc3\xa9lectorale', b'explication_r\xc3\xa9gularisation_des_charges', b'contentieux_locataire_pr\xc3\xa9sent_:_r\xc3\xa9f\xc3\xa9rence_courrier_re\xc3\xa7u_FC', b'R\xc3\xa8glement_cantine_en_esp\xc3\xa8ces', b"demandes_d'attestations_diverses", b'Commande_ou_d\xc3\xa9commande_de_repas', b'mise_en_place_contrat_pr\xc3\




2020-02-16 18:51:04,763 epoch 1 - iter 15/156 - loss 0.84563943 - samples/sec: 60.58
2020-02-16 18:51:13,037 epoch 1 - iter 30/156 - loss 0.52850599 - samples/sec: 59.21
2020-02-16 18:51:20,927 epoch 1 - iter 45/156 - loss 0.41564273 - samples/sec: 61.78
2020-02-16 18:51:28,773 epoch 1 - iter 60/156 - loss 0.35693765 - samples/sec: 62.03
2020-02-16 18:51:36,622 epoch 1 - iter 75/156 - loss 0.31267708 - samples/sec: 62.10
2020-02-16 18:51:44,340 epoch 1 - iter 90/156 - loss 0.28308592 - samples/sec: 63.10
2020-02-16 18:51:52,287 epoch 1 - iter 105/156 - loss 0.25601876 - samples/sec: 61.22
2020-02-16 18:52:07,992 epoch 1 - iter 135/156 - loss 0.22129925 - samples/sec: 60.47
2020-02-16 18:52:15,812 epoch 1 - iter 150/156 - loss 0.20626879 - samples/sec: 62.51
2020-02-16 18:52:18,502 ----------------------------------------------------------------------------------------------------
2020-02-16 18:52:18,503 EPOCH 1 done: loss 0.2014 - lr 0.3000
2020-02-16 18:52:20,169 DEV : loss 0.00395294

100%|██████████| 4971/4971 [00:00<00:00, 266224.26it/s]

2020-02-16 18:54:47,493 [b'Changement_de_pr\xc3\xa9noms,_rectification_d\xe2\x80\x99actes', b"demandes_d'attestations_diverses", b'explication_r\xc3\xa9gularisation_des_charges', b'contentieux_locataire_pr\xc3\xa9sent_:_r\xc3\xa9f\xc3\xa9rence_courrier_re\xc3\xa7u_FC', b'mise_en_place_contrat_pr\xc3\xa9l\xc3\xa8vement', b'Commande_ou_d\xc3\xa9commande_de_repas', b"Premi\xc3\xa8re_Inscription_scolaire-changement_d'\xc3\xa9cole", b'Actes_de_naissance,_mariage,_d\xc3\xa9c\xc3\xa8s', b'contentieux_locataire_parti_:_r\xc3\xa9f\xc3\xa9rence_courrier_re\xc3\xa7u_FC', b'd\xc3\xa9compte_de_sortie_(locataire_quittant_OPHEOR)', b'Certificats,_l\xc3\xa9galisation_de_signature', b'explication_avis_\xc3\xa9ch\xc3\xa9ance_loyer', b'R\xc3\xa8glement_cantine_en_esp\xc3\xa8ces', b'Mariage', b'Inscription_P\xc3\xa9riscolaire_(Cantine_et_Accueil)', b'PACS_(D\xc3\xa9p\xc3\xb4t_de_dossier,_modification_ou_dissolution_)', b'Livret_de_famille', b'Recensement_des_jeunes', b'Inscription_sur_liste_\xc3\xa9lector




2020-02-16 18:54:55,777 epoch 1 - iter 15/156 - loss 0.75614556 - samples/sec: 58.28
2020-02-16 18:55:03,454 epoch 1 - iter 30/156 - loss 0.44931348 - samples/sec: 63.70
2020-02-16 18:55:11,448 epoch 1 - iter 45/156 - loss 0.34281796 - samples/sec: 61.06
2020-02-16 18:55:19,243 epoch 1 - iter 60/156 - loss 0.28148861 - samples/sec: 62.64
2020-02-16 18:55:27,280 epoch 1 - iter 75/156 - loss 0.23750490 - samples/sec: 60.63
2020-02-16 18:55:35,038 epoch 1 - iter 90/156 - loss 0.21383898 - samples/sec: 62.73
2020-02-16 18:55:42,740 epoch 1 - iter 105/156 - loss 0.20559717 - samples/sec: 63.45
2020-02-16 18:55:50,798 epoch 1 - iter 120/156 - loss 0.19405362 - samples/sec: 60.56
2020-02-16 18:55:58,672 epoch 1 - iter 135/156 - loss 0.17855375 - samples/sec: 62.20
2020-02-16 18:56:06,342 epoch 1 - iter 150/156 - loss 0.17023895 - samples/sec: 63.49
2020-02-16 18:56:09,344 ----------------------------------------------------------------------------------------------------
2020-02-16 18:56:09,3

100%|██████████| 4962/4962 [00:00<00:00, 206923.28it/s]

2020-02-16 18:58:39,165 [b'Livret_de_famille', b'Certificats,_l\xc3\xa9galisation_de_signature', b'Commande_ou_d\xc3\xa9commande_de_repas', b'D\xc3\xa9claration_de_naissance,_Reconnaissance', b'explication_r\xc3\xa9gularisation_des_charges', b'Changement_de_pr\xc3\xa9noms,_rectification_d\xe2\x80\x99actes', b"demandes_d'attestations_diverses", b'explication_avis_\xc3\xa9ch\xc3\xa9ance_loyer', b'D\xc3\xa9claration_de_d\xc3\xa9c\xc3\xa8s', b'Renseignements,_modification_de_dossier', b'Recensement_des_jeunes', b'Actes_de_naissance,_mariage,_d\xc3\xa9c\xc3\xa8s', b'relance_amiable_:_r\xc3\xa9f\xc3\xa9rence_courrier_re\xc3\xa7u_RC', b'Inscription_sur_liste_\xc3\xa9lectorale', b'Inscription_P\xc3\xa9riscolaire_(Cantine_et_Accueil)', b'PACS_(D\xc3\xa9p\xc3\xb4t_de_dossier,_modification_ou_dissolution_)', b'contentieux_locataire_parti_:_r\xc3\xa9f\xc3\xa9rence_courrier_re\xc3\xa7u_FC', b'contentieux_locataire_pr\xc3\xa9sent_:_r\xc3\xa9f\xc3\xa9rence_courrier_re\xc3\xa7u_FC', b'mise_en_place_co




2020-02-16 18:58:46,912 epoch 1 - iter 15/156 - loss 0.62548945 - samples/sec: 62.33
2020-02-16 18:58:55,297 epoch 1 - iter 30/156 - loss 0.36460742 - samples/sec: 58.15
2020-02-16 18:59:03,478 epoch 1 - iter 45/156 - loss 0.27682351 - samples/sec: 59.47
2020-02-16 18:59:11,324 epoch 1 - iter 60/156 - loss 0.22702121 - samples/sec: 62.03
2020-02-16 18:59:19,527 epoch 1 - iter 75/156 - loss 0.18691550 - samples/sec: 59.49
2020-02-16 18:59:27,466 epoch 1 - iter 90/156 - loss 0.16231866 - samples/sec: 61.29
2020-02-16 18:59:35,318 epoch 1 - iter 105/156 - loss 0.14864532 - samples/sec: 62.18
2020-02-16 18:59:43,179 epoch 1 - iter 120/156 - loss 0.13444437 - samples/sec: 62.01
2020-02-16 18:59:51,508 epoch 1 - iter 135/156 - loss 0.12274236 - samples/sec: 58.38
2020-02-16 18:59:59,406 epoch 1 - iter 150/156 - loss 0.11508873 - samples/sec: 61.91
2020-02-16 19:00:02,287 ----------------------------------------------------------------------------------------------------
2020-02-16 19:00:02,2

100%|██████████| 4967/4967 [00:00<00:00, 243673.48it/s]

2020-02-16 19:02:25,277 [b'Livret_de_famille', b'Changement_de_pr\xc3\xa9noms,_rectification_d\xe2\x80\x99actes', b'Certificats,_l\xc3\xa9galisation_de_signature', b'PACS_(D\xc3\xa9p\xc3\xb4t_de_dossier,_modification_ou_dissolution_)', b'D\xc3\xa9claration_de_naissance,_Reconnaissance', b'Commande_ou_d\xc3\xa9commande_de_repas', b'mise_en_place_contrat_pr\xc3\xa9l\xc3\xa8vement', b'contentieux_locataire_pr\xc3\xa9sent_:_r\xc3\xa9f\xc3\xa9rence_courrier_re\xc3\xa7u_FC', b'Recensement_des_jeunes', b'Renseignements,_modification_de_dossier', b'relance_amiable_:_r\xc3\xa9f\xc3\xa9rence_courrier_re\xc3\xa7u_RC', b'Mariage', b'd\xc3\xa9compte_de_sortie_(locataire_quittant_OPHEOR)', b'explication_avis_\xc3\xa9ch\xc3\xa9ance_loyer', b'explication_r\xc3\xa9gularisation_des_charges', b'D\xc3\xa9claration_de_d\xc3\xa9c\xc3\xa8s', b"demandes_d'attestations_diverses", b'Inscription_sur_liste_\xc3\xa9lectorale', b'Actes_de_naissance,_mariage,_d\xc3\xa9c\xc3\xa8s', b'contentieux_locataire_parti_:_r\x




2020-02-16 19:02:33,794 epoch 1 - iter 15/156 - loss 0.66981849 - samples/sec: 56.74
2020-02-16 19:02:41,453 epoch 1 - iter 30/156 - loss 0.38610579 - samples/sec: 63.71
2020-02-16 19:02:49,101 epoch 1 - iter 45/156 - loss 0.27506845 - samples/sec: 63.71
2020-02-16 19:02:57,250 epoch 1 - iter 60/156 - loss 0.21826743 - samples/sec: 59.88
2020-02-16 19:03:05,213 epoch 1 - iter 75/156 - loss 0.18674200 - samples/sec: 61.10
2020-02-16 19:03:12,946 epoch 1 - iter 90/156 - loss 0.16235987 - samples/sec: 62.95
2020-02-16 19:03:20,868 epoch 1 - iter 105/156 - loss 0.14265246 - samples/sec: 61.93
2020-02-16 19:03:28,625 epoch 1 - iter 120/156 - loss 0.13335114 - samples/sec: 63.06
2020-02-16 19:03:36,472 epoch 1 - iter 135/156 - loss 0.12324204 - samples/sec: 62.30
2020-02-16 19:03:44,242 epoch 1 - iter 150/156 - loss 0.11544076 - samples/sec: 62.68
2020-02-16 19:03:46,977 ----------------------------------------------------------------------------------------------------
2020-02-16 19:03:46,9

100%|██████████| 4968/4968 [00:00<00:00, 242850.51it/s]

2020-02-16 19:06:12,706 [b'd\xc3\xa9compte_de_sortie_(locataire_quittant_OPHEOR)', b'contentieux_locataire_parti_:_r\xc3\xa9f\xc3\xa9rence_courrier_re\xc3\xa7u_FC', b'explication_r\xc3\xa9gularisation_des_charges', b'Inscription_P\xc3\xa9riscolaire_(Cantine_et_Accueil)', b"demandes_d'attestations_diverses", b'mise_en_place_contrat_pr\xc3\xa9l\xc3\xa8vement', b'Mariage', b'Enregistrement_de_PACS', b"Premi\xc3\xa8re_Inscription_scolaire-changement_d'\xc3\xa9cole", b'relance_amiable_:_r\xc3\xa9f\xc3\xa9rence_courrier_re\xc3\xa7u_RC', b'R\xc3\xa8glement_cantine_en_esp\xc3\xa8ces', b'D\xc3\xa9claration_de_naissance,_Reconnaissance', b'Commande_ou_d\xc3\xa9commande_de_repas', b'Certificats,_l\xc3\xa9galisation_de_signature', b'Changement_de_pr\xc3\xa9noms,_rectification_d\xe2\x80\x99actes', b'Livret_de_famille', b'Actes_de_naissance,_mariage,_d\xc3\xa9c\xc3\xa8s', b'explication_avis_\xc3\xa9ch\xc3\xa9ance_loyer', b'PACS_(D\xc3\xa9p\xc3\xb4t_de_dossier,_modification_ou_dissolution_)', b'Recen




2020-02-16 19:06:20,402 epoch 1 - iter 15/156 - loss 0.64334982 - samples/sec: 63.08
2020-02-16 19:06:28,145 epoch 1 - iter 30/156 - loss 0.34279539 - samples/sec: 63.42
2020-02-16 19:06:36,049 epoch 1 - iter 45/156 - loss 0.24296944 - samples/sec: 61.85
2020-02-16 19:06:43,972 epoch 1 - iter 60/156 - loss 0.19965013 - samples/sec: 61.43
2020-02-16 19:06:51,856 epoch 1 - iter 75/156 - loss 0.16803787 - samples/sec: 61.89
2020-02-16 19:06:59,373 epoch 1 - iter 90/156 - loss 0.15060153 - samples/sec: 64.99
2020-02-16 19:07:07,183 epoch 1 - iter 105/156 - loss 0.13249747 - samples/sec: 62.54
2020-02-16 19:07:15,119 epoch 1 - iter 120/156 - loss 0.11993155 - samples/sec: 61.36
2020-02-16 19:07:23,084 epoch 1 - iter 135/156 - loss 0.11204876 - samples/sec: 61.35
2020-02-16 19:07:30,984 epoch 1 - iter 150/156 - loss 0.10319167 - samples/sec: 61.60
2020-02-16 19:07:33,754 ----------------------------------------------------------------------------------------------------
2020-02-16 19:07:33,7

100%|██████████| 4966/4966 [00:00<00:00, 212323.28it/s]

2020-02-16 19:10:02,702 [b'Certificats,_l\xc3\xa9galisation_de_signature', b'Actes_de_naissance,_mariage,_d\xc3\xa9c\xc3\xa8s', b'Inscription_P\xc3\xa9riscolaire_(Cantine_et_Accueil)', b'Renseignements,_modification_de_dossier', b"demandes_d'attestations_diverses", b'Inscription_sur_liste_\xc3\xa9lectorale', b'D\xc3\xa9claration_de_naissance,_Reconnaissance', b'Recensement_des_jeunes', b'relance_amiable_:_r\xc3\xa9f\xc3\xa9rence_courrier_re\xc3\xa7u_RC', b'R\xc3\xa8glement_cantine_en_esp\xc3\xa8ces', b'Changement_de_pr\xc3\xa9noms,_rectification_d\xe2\x80\x99actes', b'explication_r\xc3\xa9gularisation_des_charges', b"Premi\xc3\xa8re_Inscription_scolaire-changement_d'\xc3\xa9cole", b'Enregistrement_de_PACS', b'Mariage', b'Commande_ou_d\xc3\xa9commande_de_repas', b'mise_en_place_contrat_pr\xc3\xa9l\xc3\xa8vement', b'PACS_(D\xc3\xa9p\xc3\xb4t_de_dossier,_modification_ou_dissolution_)', b'Livret_de_famille', b'contentieux_locataire_parti_:_r\xc3\xa9f\xc3\xa9rence_courrier_re\xc3\xa7u_FC', 




2020-02-16 19:10:10,443 epoch 1 - iter 15/156 - loss 0.65915275 - samples/sec: 62.44
2020-02-16 19:10:19,004 epoch 1 - iter 30/156 - loss 0.37332314 - samples/sec: 57.25
2020-02-16 19:10:27,103 epoch 1 - iter 45/156 - loss 0.27575239 - samples/sec: 60.18
2020-02-16 19:10:34,895 epoch 1 - iter 60/156 - loss 0.21951715 - samples/sec: 62.46
2020-02-16 19:10:42,663 epoch 1 - iter 75/156 - loss 0.18796014 - samples/sec: 62.75
2020-02-16 19:10:50,723 epoch 1 - iter 90/156 - loss 0.16661746 - samples/sec: 60.58
2020-02-16 19:10:58,428 epoch 1 - iter 105/156 - loss 0.14692962 - samples/sec: 63.24
2020-02-16 19:11:06,481 epoch 1 - iter 120/156 - loss 0.13174993 - samples/sec: 60.53
2020-02-16 19:11:14,114 epoch 1 - iter 135/156 - loss 0.12225621 - samples/sec: 64.02
2020-02-16 19:11:21,837 epoch 1 - iter 150/156 - loss 0.11249273 - samples/sec: 63.04
2020-02-16 19:11:24,712 ----------------------------------------------------------------------------------------------------
2020-02-16 19:11:24,7

100%|██████████| 4965/4965 [00:00<00:00, 204647.45it/s]

2020-02-16 19:13:51,697 [b'Livret_de_famille', b'Mariage', b'D\xc3\xa9claration_de_naissance,_Reconnaissance', b'Actes_de_naissance,_mariage,_d\xc3\xa9c\xc3\xa8s', b'contentieux_locataire_pr\xc3\xa9sent_:_r\xc3\xa9f\xc3\xa9rence_courrier_re\xc3\xa7u_FC', b'explication_r\xc3\xa9gularisation_des_charges', b'contentieux_locataire_parti_:_r\xc3\xa9f\xc3\xa9rence_courrier_re\xc3\xa7u_FC', b'Changement_de_pr\xc3\xa9noms,_rectification_d\xe2\x80\x99actes', b'mise_en_place_contrat_pr\xc3\xa9l\xc3\xa8vement', b'Certificats,_l\xc3\xa9galisation_de_signature', b'PACS_(D\xc3\xa9p\xc3\xb4t_de_dossier,_modification_ou_dissolution_)', b"Premi\xc3\xa8re_Inscription_scolaire-changement_d'\xc3\xa9cole", b'Inscription_P\xc3\xa9riscolaire_(Cantine_et_Accueil)', b'd\xc3\xa9compte_de_sortie_(locataire_quittant_OPHEOR)', b'Recensement_des_jeunes', b"demandes_d'attestations_diverses", b'D\xc3\xa9claration_de_d\xc3\xa9c\xc3\xa8s', b'Commande_ou_d\xc3\xa9commande_de_repas', b'Inscription_sur_liste_\xc3\xa9lecto




2020-02-16 19:13:59,447 epoch 1 - iter 15/156 - loss 0.58097178 - samples/sec: 62.28
2020-02-16 19:14:07,159 epoch 1 - iter 30/156 - loss 0.34241506 - samples/sec: 63.66
2020-02-16 19:14:15,167 epoch 1 - iter 45/156 - loss 0.24537440 - samples/sec: 60.79
2020-02-16 19:14:23,092 epoch 1 - iter 60/156 - loss 0.19720873 - samples/sec: 61.56
2020-02-16 19:14:31,037 epoch 1 - iter 75/156 - loss 0.16584913 - samples/sec: 61.46
2020-02-16 19:14:38,798 epoch 1 - iter 90/156 - loss 0.14680903 - samples/sec: 62.68
2020-02-16 19:14:46,801 epoch 1 - iter 105/156 - loss 0.13320354 - samples/sec: 60.95
2020-02-16 19:14:54,906 epoch 1 - iter 120/156 - loss 0.12452953 - samples/sec: 60.13
2020-02-16 19:15:02,951 epoch 1 - iter 135/156 - loss 0.11296552 - samples/sec: 60.51
2020-02-16 19:15:11,092 epoch 1 - iter 150/156 - loss 0.10313566 - samples/sec: 59.81
2020-02-16 19:15:14,006 ----------------------------------------------------------------------------------------------------
2020-02-16 19:15:14,0

100%|██████████| 4970/4970 [00:00<00:00, 251984.76it/s]

2020-02-16 19:17:44,146 [b'Certificats,_l\xc3\xa9galisation_de_signature', b'Inscription_P\xc3\xa9riscolaire_(Cantine_et_Accueil)', b'contentieux_locataire_parti_:_r\xc3\xa9f\xc3\xa9rence_courrier_re\xc3\xa7u_FC', b'Inscription_sur_liste_\xc3\xa9lectorale', b'D\xc3\xa9claration_de_d\xc3\xa9c\xc3\xa8s', b'R\xc3\xa8glement_cantine_en_esp\xc3\xa8ces', b'Enregistrement_de_PACS', b'd\xc3\xa9compte_de_sortie_(locataire_quittant_OPHEOR)', b'PACS_(D\xc3\xa9p\xc3\xb4t_de_dossier,_modification_ou_dissolution_)', b'relance_amiable_:_r\xc3\xa9f\xc3\xa9rence_courrier_re\xc3\xa7u_RC', b'Recensement_des_jeunes', b'Mariage', b'Commande_ou_d\xc3\xa9commande_de_repas', b'Changement_de_pr\xc3\xa9noms,_rectification_d\xe2\x80\x99actes', b'mise_en_place_contrat_pr\xc3\xa9l\xc3\xa8vement', b'explication_r\xc3\xa9gularisation_des_charges', b'Renseignements,_modification_de_dossier', b"Premi\xc3\xa8re_Inscription_scolaire-changement_d'\xc3\xa9cole", b'explication_avis_\xc3\xa9ch\xc3\xa9ance_loyer', b"demandes




2020-02-16 19:17:52,152 epoch 1 - iter 15/156 - loss 0.50575357 - samples/sec: 60.36
2020-02-16 19:18:00,975 epoch 1 - iter 30/156 - loss 0.27881936 - samples/sec: 55.49
2020-02-16 19:18:09,005 epoch 1 - iter 45/156 - loss 0.20843795 - samples/sec: 60.64
2020-02-16 19:18:16,870 epoch 1 - iter 60/156 - loss 0.17055582 - samples/sec: 62.01
2020-02-16 19:18:24,637 epoch 1 - iter 75/156 - loss 0.14397726 - samples/sec: 62.67
2020-02-16 19:18:32,281 epoch 1 - iter 90/156 - loss 0.12824199 - samples/sec: 63.86
2020-02-16 19:18:40,015 epoch 1 - iter 105/156 - loss 0.11617981 - samples/sec: 63.41
2020-02-16 19:18:47,942 epoch 1 - iter 120/156 - loss 0.10323548 - samples/sec: 61.48
2020-02-16 19:18:55,917 epoch 1 - iter 135/156 - loss 0.09271934 - samples/sec: 61.33
2020-02-16 19:19:03,959 epoch 1 - iter 150/156 - loss 0.08664010 - samples/sec: 60.50
2020-02-16 19:19:06,633 ----------------------------------------------------------------------------------------------------
2020-02-16 19:19:06,6

In [27]:
accuracies = []
f1s = []
for split in results:
    accuracies.append(split['test_score'])
    f1s.append(split['test_f1'])

print("| {:.3f} +- {:.3f} | {:.3f} +- {:.3f} |".format(float(np.mean(accuracies)), float(np.std(accuracies)), float(np.mean(f1s)), float(np.std(f1s))))


| 0.988 +- 0.015 | 0.988 +- 0.015 |
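The two columns are the mean and standard deviation over the splits of `test_score` and of the micro-averaged F1. For single-label classification, micro-averaged F1 equals accuracy, which is consistent with the two columns being identical if `test_score` is an accuracy-like score (an assumption about what the trainer returns). The sketch below checks that identity on made-up predictions using only the standard library; `accuracy`, `micro_f1` and `summarize` are hypothetical helpers.

```python
from statistics import mean, pstdev


def accuracy(expected, predicted):
    """Fraction of examples whose predicted label matches the expected one."""
    return sum(e == p for e, p in zip(expected, predicted)) / len(expected)


def micro_f1(expected, predicted):
    """Micro-averaged F1: pool TP, FP and FN over all labels, then compute F1."""
    tp = fp = fn = 0
    for label in set(expected) | set(predicted):
        for e, p in zip(expected, predicted):
            tp += (p == label and e == label)
            fp += (p == label and e != label)
            fn += (p != label and e == label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def summarize(values):
    """Mean and population standard deviation, which is also numpy's default for np.std."""
    return "{:.3f} +- {:.3f}".format(mean(values), pstdev(values))


# Made-up predictions, one label per example.
expected = ["Mariage", "Livret_de_famille", "Mariage", "Recensement_des_jeunes"]
predicted = ["Mariage", "Mariage", "Mariage", "Recensement_des_jeunes"]

print(accuracy(expected, predicted), micro_f1(expected, predicted))
print(summarize([0.9, 1.0]))
```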
