### Notebook for training a segmenter

Importing the libraries

In [17]:
#import
import sys
from transformers import BertTokenizer, Trainer, TrainingArguments, AutoModelForTokenClassification, set_seed
import aquilign.preproc.tok_trainer_functions as trainer_functions
import aquilign.preproc.eval as evaluation
import aquilign.preproc.utils as utils
import re
import os
import json
import glob
import argparse
# shutil is useful for deleting non-empty directories
import shutil

Running the script trains a model for automatic text segmentation. Three files must be provided, all in the specified format (each token that should be identified as segmenting the text must be preceded by a '£'): one file containing the training data, one containing the dev data, and one containing the test data. The files must be located in a folder named after the ISO code of the language they are written in (this code is retrieved when the models are evaluated).
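
To make the expected input format concrete, here is a minimal word-level sketch of how a marked line and its folder could be read. In the notebook itself the conversion is handled at subword level by `utils.convertToSubWordsSentencesAndLabels`; the helper names `words_and_labels` and `language_from_path` are hypothetical and only serve as an illustration.

```python
from pathlib import PurePosixPath

DELIMITER = "£"  # marker placed in front of a token that opens a new segment


def words_and_labels(line, delimiter=DELIMITER):
    """Split a marked line into words and word-level labels (1 = segment start)."""
    words, labels = [], []
    for token in line.split():
        if token.startswith(delimiter):
            # Strip the marker and flag the word as a segment boundary.
            words.append(token[len(delimiter):])
            labels.append(1)
        else:
            words.append(token)
            labels.append(0)
    return words, labels


def language_from_path(path):
    """Return the name of the directory that directly contains the file."""
    return PurePosixPath(path).parent.name


words, labels = words_and_labels(
    "uoulentiers £mais il nen est pas encor temps. £Certes fait elle"
)
print(list(zip(words, labels)))
print(language_from_path("data_to_segmenter/fr/randomSentencesComplete-gf.txt"))
```

With the example line from the usage comments, "mais" and "Certes" receive label 1 and the language code resolves to `fr`.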

The best model is saved at the end of training. Evaluation relies both on the loss and on more standard evaluation metrics. In our script, precision carries the most weight.
Evaluation also includes a comparison with a regex-based segmentation. The delimiters.json file must therefore be filled in with example regexes.
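
As an illustration of what a regex-based baseline might look like, the sketch below marks every word that matches one of a list of delimiter patterns. The JSON layout (language code mapped to a list of regexes), the example patterns, and the function `regex_segment` are all assumptions made for this example; the schema actually expected by aquilign for delimiters.json may differ.

```python
import json
import re

# Hypothetical content for delimiters.json: language code -> list of regexes.
# The real schema expected by aquilign may differ.
delimiters_json = '{"fr": ["mais", "et", "car", "[Cc]ertes"]}'
delimiters = json.loads(delimiters_json)


def regex_segment(text, patterns, marker="£"):
    """Prefix every word that fully matches one of the patterns with the marker."""
    compiled = re.compile("|".join(f"(?:{p})" for p in patterns))
    return " ".join(
        marker + word if compiled.fullmatch(word) else word
        for word in text.split()
    )


segmented = regex_segment(
    "uoulentiers mais il nen est pas encor temps. Certes fait elle",
    delimiters["fr"],
)
print(segmented)
```

Because `fullmatch` is used, "est" is not caught by the pattern "et", which keeps this toy baseline from over-segmenting on substrings.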



In [4]:
## command line usage : python tok_trainer.py model_name tok_name train_file.txt dev_file.txt num_train_epochs batch_size logging_steps
## where :
# model_name is the full name of the model (same name for model and tokenizer or not)
# tok_name is the full name of the tokenizer (can be the same)
# train_file.txt is the file with the sentences in which the words of interest are identified (words are marked with £ in the line)
# and which will be used for training
## ex. : uoulentiers £mais il nen est pas encor temps. £Certes fait elle
# dev_file.txt is the file with the sentences and words of interest which will be used for eval
# num_train_epochs : the number of epochs we want to train (ex : 10)
# batch_size : the batch size (ex : 8)
# logging_steps : the number of logging steps (ex : 50)

## was changed : to fine-tune a model, model_name and tok_name are now given separately (they can also be the same)

The training_trainer function takes several arguments:
- model_name: the name of the AutoModelForTokenClassification model
- tok_name: the name of the BertTokenizer model
(these two names may be the same if no specific model is being fine-tuned)
- train_dataset: the path to the training data file
- dev_dataset: the path to the dev data file
- eval_dataset: the path to the test data file
- num_train_epochs: the number of training epochs (min. 2)
- batch_size
- logging_steps
  
It also takes an argument specifying whether or not punctuation should be kept as an aid to segmentation.
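
The arguments listed above could be exposed on the command line with `argparse`, which the notebook imports but does not use. This is only a sketch: the positional order mirrors the function signature, and the `--no-punct` flag is a hypothetical name for switching `keep_punct` off.

```python
import argparse


def build_parser():
    """Build a parser whose fields mirror the training_trainer arguments."""
    parser = argparse.ArgumentParser(description="Train a text segmenter.")
    parser.add_argument("model_name")
    parser.add_argument("tok_name")
    parser.add_argument("train_dataset")
    parser.add_argument("dev_dataset")
    parser.add_argument("eval_dataset")
    parser.add_argument("num_train_epochs", type=int)
    parser.add_argument("batch_size", type=int)
    parser.add_argument("logging_steps", type=int)
    # Hypothetical flag: punctuation is kept unless --no-punct is given.
    parser.add_argument("--no-punct", dest="keep_punct", action="store_false")
    return parser


args = build_parser().parse_args([
    "dbmdz/bert-base-french-europeana-cased",
    "dbmdz/bert-base-french-europeana-cased",
    "data_to_segmenter/fr/randomSentencesComplete-gf.txt",
    "data_to_segmenter/fr/randomSentencesEvalComplete-gf.txt",
    "data_to_segmenter/fr/randomSentencesfrancais-pourtest-complete-gf.txt",
    "2", "8", "50",
])
print(vars(args))
```

The parsed namespace can then be passed on with `training_trainer(**vars(args))`, since the field names match the function's parameters.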

In [18]:
def training_trainer(model_name, tok_name, train_dataset, dev_dataset, eval_dataset, num_train_epochs, batch_size, logging_steps, keep_punct=True):
    model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=3)
    tokenizer = BertTokenizer.from_pretrained(tok_name, max_length=10)
    
    with open(train_dataset, "r") as train_file:
        train_lines = [item.replace("\n", "") for item in train_file.readlines()]
        if keep_punct is False:
            train_lines = [utils.remove_punctuation(line) for line in train_lines]
        
    with open(dev_dataset, "r") as dev_file:
        dev_lines = [item.replace("\n", "") for item in dev_file.readlines()]
        if keep_punct is False:
            dev_lines = [utils.remove_punctuation(line) for line in dev_lines]
        
    with open(eval_dataset, "r") as eval_files:
        eval_lines = [item.replace("\n", "") for item in eval_files.readlines()]
    eval_data_lang = eval_dataset.split("/")[-2]
    
    # Train corpus
    train_texts_and_labels = utils.convertToSubWordsSentencesAndLabels(train_lines, tokenizer=tokenizer, delimiter="£")
    train_dataset = trainer_functions.SentenceBoundaryDataset(train_texts_and_labels, tokenizer)
    
    # Dev corpus
    dev_texts_and_labels = utils.convertToSubWordsSentencesAndLabels(dev_lines, tokenizer=tokenizer, delimiter="£")
    dev_dataset = trainer_functions.SentenceBoundaryDataset(dev_texts_and_labels, tokenizer)

    if '/' in model_name:
        name_of_model = re.split('/', model_name)[1]
    else:
        name_of_model = model_name

    # training arguments
    # num train epochs, logging_steps and batch_size should be provided
    # evaluation is done by epoch and the best model of each one is stored in a folder "results_+name"
    training_args = TrainingArguments(
        output_dir=f"results_{name_of_model}/epoch{num_train_epochs}_bs{batch_size}",
        num_train_epochs=num_train_epochs,
        logging_steps=logging_steps,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        evaluation_strategy="epoch",
        logging_strategy="epoch",
        dataloader_num_workers=8,
        dataloader_prefetch_factor=4,
        # added to work around an issue: save_safetensors=False and bf16=False
        bf16=False,    
        save_safetensors=False,
        # modification: run on CPU
        use_cpu=True,
        save_strategy="epoch",
        load_best_model_at_end=True
        # best model is evaluated on loss
    )

    # define the trainer : model, training args, datasets and the specific compute_metrics defined in functions file
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=dev_dataset,
        compute_metrics=trainer_functions.compute_metrics
    )

    # fine-tune the model
    print("Starting training")
    trainer.train()
    print("End of training")

    # get the best model path
    best_model_path = trainer.state.best_model_checkpoint
    print(best_model_path)
    print("Evaluation.")
    
    
    # select the best step according to precision from the log history (which holds the computed metrics)
    best_precision_step, best_step_metrics = utils.get_best_step(trainer.state.log_history)
    best_model_path = f"results_{name_of_model}/epoch{num_train_epochs}_bs{batch_size}/checkpoint-{best_precision_step}"
    print(f"Best model path according to precision: {best_model_path}")
    print(f"Full metrics: {best_step_metrics}")
    
    eval_results = evaluation.run_eval(data=eval_lines, 
                        model_path=best_model_path, 
                        tokenizer_name=tokenizer.name_or_path, 
                        verbose=False, 
                        lang=eval_data_lang)
    

    # Rename the best checkpoint directory to "best"
    new_best_path = f"results_{name_of_model}/epoch{num_train_epochs}_bs{batch_size}/best"
    try:
        #os.rmdir(new_best_path)
        shutil.rmtree(new_best_path)
    except FileNotFoundError:
        pass
    os.rename(best_model_path, new_best_path)
    
    #with open(f"{new_best_path}/model_name", "w") as model_name:
    #    model_name.write(modelName)

    with open(f"{new_best_path}/eval.txt", "w") as evaluation_results:
        evaluation_results.write(eval_results)

    with open(f"{new_best_path}/metrics.json", "w") as metrics:
        json.dump(best_step_metrics, metrics)
    
    print(f"\n\nBest model can be found at : {new_best_path} ")
    print(f"You should remove the following directories by using `rm -r results_{name_of_model}/epoch{num_train_epochs}_bs{batch_size}/checkpoint-*`")

    # the function returns the path to the best model
    return new_best_path
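
The function above relies on `utils.get_best_step` to pick a checkpoint by precision. Its implementation is not shown in this notebook, so the following is only a sketch of one plausible selection rule: take the evaluation entry of `trainer.state.log_history` with the highest precision on the delimiter class. The function name `best_step_by_precision` and the choice of class index 1 are assumptions, and the step value 125 for the first epoch is inferred from step 250 being reached at epoch 2.

```python
def best_step_by_precision(log_history, class_index=1):
    """Return (step, entry) for the evaluation entry with the highest precision
    on the chosen class (index 1 = delimiter class, an assumption)."""
    eval_entries = [entry for entry in log_history if "eval_precision" in entry]
    best = max(eval_entries, key=lambda entry: entry["eval_precision"][class_index])
    return best["step"], best


# Metric values copied from the training log of this notebook;
# step 125 for epoch 1 is inferred, not logged.
log_history = [
    {"step": 125, "epoch": 1.0, "eval_loss": 0.041685,
     "eval_precision": [0.9865196078431373, 0.9130434782608695, 0.9979674796747967]},
    {"step": 250, "epoch": 2.0, "eval_loss": 0.034185,
     "eval_precision": [0.9914110429447853, 0.946236559139785, 0.9979674796747967]},
]

step, metrics = best_step_by_precision(log_history)
print(step, metrics["eval_precision"])
```

Selecting on precision rather than on the loss used by `load_best_model_at_end` explains why the checkpoint path is rebuilt by hand before evaluation.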

In [19]:
training_trainer('dbmdz/bert-base-french-europeana-cased', 'dbmdz/bert-base-french-europeana-cased', 'data_to_segmenter/fr/randomSentencesComplete-gf.txt', 'data_to_segmenter/fr/randomSentencesEvalComplete-gf.txt', 'data_to_segmenter/fr/randomSentencesfrancais-pourtest-complete-gf.txt', 2, 8, 50, keep_punct=True)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at dbmdz/bert-base-french-europeana-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Starting training




Epoch,Training Loss,Validation Loss,Accurracy,Recall,Precision,F1
1,0.0981,0.041685,{'accuracy': 0.9857142857142858},"[0.9889434889434889, 0.8842105263157894, 1.0]","[0.9865196078431373, 0.9130434782608695, 0.9979674796747967]","[0.9877300613496932, 0.8983957219251337, 0.9989827060020345]"
2,0.0231,0.034185,{'accuracy': 0.9907142857142858},"[0.9926289926289926, 0.9263157894736842, 1.0]","[0.9914110429447853, 0.946236559139785, 0.9979674796747967]","[0.9920196439533456, 0.9361702127659575, 0.9989827060020345]"


Starting eval
Eval finished
Starting eval
Eval finished
End of training
results_bert-base-french-europeana-cased/epoch2_bs8/checkpoint-250
Evaluation.
Best step according to precision: 250
Best model path according to precision: results_bert-base-french-europeana-cased/epoch2_bs8/checkpoint-250
Full metrics: {'loss': 0.0231, 'grad_norm': 0.5186708569526672, 'learning_rate': 0.0, 'epoch': 2.0, 'step': 250, 'eval_loss': 0.0341854952275753, 'eval_accurracy': {'accuracy': 0.9907142857142858}, 'eval_recall': [0.9926289926289926, 0.9263157894736842, 1.0], 'eval_precision': [0.9914110429447853, 0.946236559139785, 0.9979674796747967], 'eval_f1': [0.9920196439533456, 0.9361702127659575, 0.9989827060020345], 'eval_runtime': 4.42, 'eval_samples_per_second': 11.312, 'eval_steps_per_second': 1.584, 'train_runtime': 169.2471, 'train_samples_per_second': 11.817, 'train_steps_per_second': 1.477, 'total_flos': 30620990520000.0, 'train_loss': 0.06056657981872558}




Performing syntactic tokenization evaluation
(0.9547619047619048, [0.9417637271214643, 0.8188405797101449, 1.0], [0.9783923941227312, 0.6174863387978142, 1.0], [0.9597286986011022, 0.7040498442367601, 1.0])
Performing bert-based tokenization evaluation
|           | Synt (None, Delim.)                           | Bert (None, Delim., Pad.)                     |
|-----------+-----------------------------------------------+-----------------------------------------------|
| Accuracy  | 0.9547619047619048                            | 0.9869565217391304                            |
| Precision | [0.9417637271214643, 0.8188405797101449, 1.0] | [0.9895742832319722, 0.9047619047619048, 1.0] |
| Recall    | [0.9783923941227312, 0.6174863387978142, 1.0] | [0.9844425237683665, 0.9344262295081968, 1.0] |
| F1-score  | [0.9597286986011022, 0.7040498442367601, 1.0] | [0.987001733102253, 0.9193548387096774, 1.0]  |
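
Each metric in the table is a list with one value per class, in the order given in the header (None, Delim., Pad.). The sketch below shows how such per-class precision, recall and F1 values can be derived from gold and predicted label sequences. It is an illustration on toy labels, not the code of `trainer_functions.compute_metrics` or `evaluation.run_eval`, and the function name `per_class_scores` is hypothetical.

```python
def per_class_scores(gold, predicted, num_classes=3):
    """Compute precision, recall and F1 separately for each class id."""
    precision, recall, f1 = [], [], []
    for cls in range(num_classes):
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == cls and p == cls)
        fp = sum(1 for g, p in pairs if g != cls and p == cls)
        fn = sum(1 for g, p in pairs if g == cls and p != cls)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        precision.append(prec)
        recall.append(rec)
        f1.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return precision, recall, f1


# Toy labels: 0 = no boundary, 1 = delimiter, 2 = padding.
gold = [0, 1, 0, 0, 1, 0, 2, 2]
predicted = [0, 1, 0, 1, 0, 0, 2, 2]

precision, recall, f1 = per_class_scores(gold, predicted)
print(precision, recall, f1)
```

In the table above, the second value of each list (the delimiter class) is the most informative one, since it measures how well segment boundaries themselves are found.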


Best model can be found at : results_bert-base-french-europeana-cased/epoch2_bs8/best

'results_bert-base-french-europeana-cased/epoch2_bs8/best'