<a href="https://colab.research.google.com/github/H04K/Seq2Seq_paraphrase_fr/blob/main/Fr_Transfo_para_BART.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Paraphrase Model based on BART and SimpleTransformers

French Version

---

Trained on Colab Pro with the following specs:

| GPU | Power | VRAM | RAM |
|---|---|---|---|
| Tesla P100-PCIE | 250 W | 16280 MiB | 27 GB |


## Dependencies

In [None]:
!pip install transformers[sentencepiece]
!pip install datasets
!pip install simpletransformers
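
After the installs it can help to record which versions were actually picked up, since Colab images change over time. The following is a minimal sketch using only the standard library; `installed_versions` is a hypothetical helper name, not part of any of the packages above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Package names taken from the pip commands above.
for name, ver in installed_versions(
    ["transformers", "sentencepiece", "datasets", "simpletransformers"]
).items():
    print(f"{name}: {ver or 'not installed'}")
```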

In [None]:
from datasets import load_dataset
import numpy as np
import pandas as pd
import warnings
import os 
from datetime import datetime
import logging
from sklearn.model_selection import train_test_split
from simpletransformers.seq2seq import Seq2SeqModel, Seq2SeqArgs

logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.ERROR)

## Preprocessing

In [None]:
def load_data(
    file_path, input_text_column, target_text_column, label_column, keep_label=1
):
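    # on_bad_lines="skip" replaces error_bad_lines=False, which was deprecated
    # in pandas 1.3 and removed in pandas 2.0.
    df = pd.read_csv(file_path, sep="\t", on_bad_lines="skip")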
    df = df.loc[df[label_column] == keep_label]
    df = df.rename(
        columns={input_text_column: "input_text", target_text_column: "target_text"}
    )
    df = df[["input_text", "target_text"]]
    df["prefix"] = "paraphrase"

    return df


def clean_unnecessary_spaces(out_string):
    if not isinstance(out_string, str):
        warnings.warn(f">>> {out_string} <<< is not a string.")
        out_string = str(out_string)
    # Remove the space that tokenised text leaves before punctuation
    # and around apostrophes.
    out_string = (
        out_string.replace(" .", ".")
        .replace(" ?", "?")
        .replace(" !", "!")
        .replace(" ,", ",")
        .replace(" ' ", "'")
    )
    return out_string
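
To make the effect of the cleaning step concrete, here is a small sketch that applies the same replacement rules to one sentence. The rules are repeated so the snippet runs on its own; `RULES`, `apply_rules`, and the sample sentence are illustrative names and data, not part of the notebook.

```python
# Same (old, new) pairs as in clean_unnecessary_spaces, in the same order.
RULES = [(" .", "."), (" ?", "?"), (" !", "!"), (" ,", ","), (" ' ", "'")]


def apply_rules(text):
    """Apply each replacement in turn, as the chained .replace() calls do."""
    for old, new in RULES:
        text = text.replace(old, new)
    return text


sample = "Qu ' est ce que tu fais demain , Paul ?"
cleaned = apply_rules(sample)
print(cleaned)  # Qu'est ce que tu fais demain, Paul?
```

Note that spacing around other characters, such as the hyphen in `1975 - 76`, is left untouched by these rules.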


## Data
PAWS-X: Cross-lingual Adversarial Dataset for Paraphrase Identification
(French version; train, test, and eval splits)

In [None]:
dataset = load_dataset("paws-x", "fr")

# Convert each split to a DataFrame (columns: id, sentence1, sentence2, label).
# Everything is cast to str, so labels are compared against "1" below.
# The "test" split is used as evaluation data.
train_df = dataset["train"].to_pandas().astype(str)
eval_df = dataset["test"].to_pandas().astype(str)

train_df = train_df.loc[train_df["label"] == "1"]
eval_df = eval_df.loc[eval_df["label"] == "1"]

train_df = train_df.rename(
    columns={"sentence1": "input_text", "sentence2": "target_text"}
)
eval_df = eval_df.rename(
    columns={"sentence1": "input_text", "sentence2": "target_text"}
)

train_df = train_df[["input_text", "target_text"]]
eval_df = eval_df[["input_text", "target_text"]]

train_df["prefix"] = "paraphrase"
eval_df["prefix"] = "paraphrase"

train_df["input_text"] = train_df["input_text"].apply(clean_unnecessary_spaces)
train_df["target_text"] = train_df["target_text"].apply(clean_unnecessary_spaces)

eval_df["input_text"] = eval_df["input_text"].apply(clean_unnecessary_spaces)
eval_df["target_text"] = eval_df["target_text"].apply(clean_unnecessary_spaces)

print(train_df)

Downloading and preparing dataset pawsx/fr (download: 28.88 MiB, generated: 13.70 MiB, post-processed: Unknown size, total: 42.58 MiB) to /root/.cache/huggingface/datasets/pawsx/fr/1.1.0/a5033b43902a02a4ba2ee469c1dd22af3e6a4a247ac47fa1af9835d0e734e2af...


Dataset pawsx downloaded and prepared to /root/.cache/huggingface/datasets/pawsx/fr/1.1.0/a5033b43902a02a4ba2ee469c1dd22af3e6a4a247ac47fa1af9835d0e734e2af. Subsequent calls will reuse this data.
                                              input_text  ...      prefix
1      La saison NBA 1975 - 76 était la 30e saison de...  ...  paraphrase
3      Lorsque des débits comparables peuvent être ma...  ...  paraphrase
4      C'est le siège du district de Zerendi dans la ...  ...  paraphrase
5      William Henry Henry Harman est né le 17 févrie...  ...  paraphrase
7      Avec un nombre discret de probabilités Formule...  ...  paraphrase
...                                                  ...  ...         ...
49384  La langue romane, le galicien (Galego), couram...  ...  paraphrase
49390  Notez que k est un vecteur composé de trois no...  ...  paraphrase
49393  Tim Henman a remporté la finale 6 - 2, 7 - 6 c...  ...  paraphrase
49395  Il était considéré comme un membre actif du co...  ...  pa
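
The row labels in the printed frame skip values (1, 3, 4, 5, 7, ...) because filtering on the label keeps the original index. The sketch below reproduces the filter, rename, and prefix steps on three made-up sentence pairs (hypothetical data, not taken from PAWS-X) so the effect is easy to see.

```python
import pandas as pd

# Hypothetical toy rows; labels are strings, as in the notebook after astype(str).
toy = pd.DataFrame(
    {
        "id": ["1", "2", "3"],
        "sentence1": ["Il pleut à Paris.", "Le chat dort.", "Elle arrive demain."],
        "sentence2": ["À Paris, il pleut.", "Le chien dort.", "Demain, elle arrive."],
        "label": ["1", "0", "1"],
    }
)

pairs = (
    toy.loc[toy["label"] == "1"]  # keep only true paraphrase pairs
    .rename(columns={"sentence1": "input_text", "sentence2": "target_text"})
    [["input_text", "target_text"]]
    .assign(prefix="paraphrase")
)

print(pairs.index.tolist())                         # original row labels survive
print(pairs.reset_index(drop=True).index.tolist())  # optional renumbering
```

Resetting the index is optional here; it only changes how rows are labeled, not which rows are kept.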

## Args & training


In [None]:
model_args = Seq2SeqArgs()
model_args.do_sample = True
model_args.eval_batch_size = 32
model_args.evaluate_during_training = True
model_args.evaluate_during_training_steps = 2500
model_args.evaluate_during_training_verbose = True
model_args.fp16 = False
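# Value kept from the original run. It is far below common fine-tuning
# rates (roughly 1e-5 to 1e-4), so weights will barely move at this setting.
model_args.learning_rate = 1e-10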
model_args.max_length = 128
model_args.max_seq_length = 128
model_args.num_beams = None
model_args.num_return_sequences = 3
model_args.num_train_epochs = 2
model_args.overwrite_output_dir = True
model_args.reprocess_input_data = True
model_args.save_eval_checkpoints = False
model_args.save_steps = -1
model_args.top_k = 50
model_args.top_p = 0.95
model_args.train_batch_size = 4
model_args.use_multiprocessing = False
model_args.wandb_project = "paraphrase_fr_TF"


model = Seq2SeqModel(
    encoder_decoder_type="bart",
    encoder_decoder_name="facebook/bart-large",
    args=model_args,
)

model.train_model(train_df, eval_data=eval_df)
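
To see how `train_batch_size`, `num_train_epochs`, and `evaluate_during_training_steps` interact, the arithmetic can be sketched as follows. This assumes no gradient accumulation and counts only step-based evaluations; `training_schedule` is a hypothetical helper and the example count of 20,000 pairs is a placeholder, not the real size of the filtered training set.

```python
import math


def training_schedule(n_examples, batch_size, epochs, eval_every):
    """Return (steps_per_epoch, total_steps, steps at which evaluation is triggered)."""
    steps_per_epoch = math.ceil(n_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    eval_steps = list(range(eval_every, total_steps + 1, eval_every))
    return steps_per_epoch, total_steps, eval_steps


# n_examples is a hypothetical figure; use len(train_df) in the notebook.
per_epoch, total, evals = training_schedule(
    n_examples=20_000, batch_size=4, epochs=2, eval_every=2500
)
print(per_epoch, total, evals)  # 5000 10000 [2500, 5000, 7500, 10000]
```

With a much smaller training set, an interval of 2500 steps could exceed the total number of steps, in which case no step-based evaluation would run at all.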




## Simple predictions with the trained model

In [None]:
model = Seq2SeqModel(
    encoder_decoder_type="bart",
    encoder_decoder_name="pathToModel/outputs/best_model",
)

# French for "What are you doing tomorrow?"
to_predict = ["Qu'est ce que tu fais demain ? "]

predictions = model.predict(to_predict)
print(predictions)
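
The training arguments set `do_sample = True` with `top_k = 50` and `top_p = 0.95`, which is why repeated predictions can differ. The sketch below illustrates the idea of top-k followed by top-p (nucleus) filtering on a toy next-token distribution. It is a simplified illustration in plain Python, not the implementation used by `transformers`; the function name, the toy probabilities, and the small `top_k` and `top_p` values are all assumptions chosen so the effect is visible.

```python
import random


def top_k_top_p_filter(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix whose
    renormalised cumulative probability reaches top_p; renormalise the result."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    mass = sum(p for _, p in ranked)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p / mass
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Hypothetical next-token distribution.
toy_probs = {"demain": 0.5, "ce": 0.3, "soir": 0.15, "hier": 0.05}
filtered = top_k_top_p_filter(toy_probs, top_k=3, top_p=0.8)
print(filtered)  # only "demain" and "ce" remain

# Draw three tokens from the filtered distribution (seeded for repeatability).
rng = random.Random(0)
tokens, weights = zip(*filtered.items())
samples = rng.choices(tokens, weights=weights, k=3)
print(samples)
```

Lower `top_k` or `top_p` values narrow the candidate set and make outputs more conservative; higher values allow more varied paraphrases.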