<a href="https://colab.research.google.com/github/Sambura/NLP-Text-detoxification/blob/main/notebooks/2.0-demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Demo

This demo notebook is meant to be opened in Google Colab, where you can fine-tune a detoxification model and try it out interactively.

In [None]:
!git clone https://github.com/Sambura/NLP-Text-detoxification.git
%pip install datasets transformers[sentencepiece,torch]

In [None]:
from transformers import Seq2SeqTrainingArguments
import torch

import sys
sys.path.append('./NLP-Text-detoxification/')

In [None]:
from src.models.train_model import DetoxifierTrainer, seed_everything
from src.models.predict_model import DetoxifierPredictor

## Model training:

Create the `DetoxifierTrainer` object:

In [None]:
detoxification_trainer = DetoxifierTrainer()
detoxification_trainer.load_pretrained('t5-small')

Seed all the randomness:

In [None]:
seed_everything(seed=1984)
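The implementation of `seed_everything` lives in the repository and is not shown in this notebook. As a rough illustration of what a helper like this usually does, here is a minimal sketch; the name `seed_everything_sketch` is hypothetical, and the repository's version may seed a different set of libraries.

```python
import os
import random


def seed_everything_sketch(seed: int) -> None:
    """Hypothetical stand-in, NOT the repository's seed_everything."""
    # Python's built-in random number generator
    random.seed(seed)
    # Only affects Python subprocesses started after this point
    os.environ["PYTHONHASHSEED"] = str(seed)
    # NumPy and PyTorch are seeded only if they are installed
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass


# Re-seeding with the same value reproduces the same random sequence
seed_everything_sketch(1984)
first = [random.random() for _ in range(3)]
seed_everything_sketch(1984)
second = [random.random() for _ in range(3)]
print(first == second)  # True
```

Seeding before the dataset is loaded matters if the subsampling and the train/validation split are drawn at random, since the seed then determines which rows end up in each part.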

Specify the portion of the dataset to use and the validation/training split:

In [None]:
val_ratio = 0.2
dataset_portion = 0.1
detoxification_trainer.load_dataset(val_ratio=val_ratio, dataset_portion=dataset_portion, verbose=True)

This code downloads, preprocesses, and loads the default dataset (located [here](https://github.com/skoltech-nlp/detox/releases/download/emnlp2021/filtered_paranmt.zip)).
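To make the two parameters concrete, the sketch below shows one plausible way `dataset_portion` and `val_ratio` could interact: first keep a random fraction of the rows, then carve the validation set out of that fraction. The function `subsample_and_split` and the row count are hypothetical, and the order of operations is an assumption rather than a description of what `load_dataset` does internally.

```python
import random


def subsample_and_split(n_rows, dataset_portion, val_ratio, seed=1984):
    """Hypothetical illustration: return (train_indices, val_indices)."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    # Keep only a fraction of the full dataset
    n_used = round(n_rows * dataset_portion)
    used = indices[:n_used]
    # Reserve a fraction of the kept rows for validation
    n_val = round(n_used * val_ratio)
    return used[n_val:], used[:n_val]


# 10,000 rows is a made-up size chosen for easy arithmetic
train_idx, val_idx = subsample_and_split(
    n_rows=10_000, dataset_portion=0.1, val_ratio=0.2
)
print(len(train_idx), len(val_idx))  # 800 200
```

Under this assumption, `dataset_portion=0.1` with `val_ratio=0.2` trains on 8% of the full dataset and validates on 2%.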

Now let's specify training arguments:

In [None]:
batch_size = 32
args = Seq2SeqTrainingArguments(
    './models/t5-small-detoxifier',
    evaluation_strategy="epoch",
    learning_rate=4e-4,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=5,
    predict_with_generate=True,
    fp16=torch.cuda.is_available(),
    report_to='tensorboard',
    logging_steps=500,
    save_steps=1000,
    generation_config=detoxification_trainer.get_default_generation_config()
)
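The step-based settings above (`logging_steps`, `save_steps`) are counted in optimizer steps, not epochs, so their effect depends on the training set size. The arithmetic below is a sketch that assumes a single device and no gradient accumulation; `train_examples` is a hypothetical figure, not the real size of the subsampled dataset.

```python
import math

train_examples = 46_000  # hypothetical training set size
batch_size = 32
num_train_epochs = 5
logging_steps = 500
save_steps = 1000
save_total_limit = 3

# One optimizer step per batch; the last batch of an epoch may be partial
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_train_epochs

log_events = total_steps // logging_steps
checkpoints_written = total_steps // save_steps
checkpoints_kept = min(checkpoints_written, save_total_limit)

print(steps_per_epoch)      # 1438
print(total_steps)          # 7190
print(log_events)           # 14
print(checkpoints_written)  # 7
print(checkpoints_kept)     # 3
```

With these example numbers, evaluation still runs once per epoch (five times in total), while checkpoints are written every 1000 steps and only the three most recent are retained on disk.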

Construct Seq2SeqTrainer:

In [None]:
trainer = detoxification_trainer.make_trainer(args)

Start training:

In [None]:
trainer.train()

Save the final model:

In [None]:
model_path = './models/t5-small-detoxifier-best'
trainer.save_model(model_path)

## Prediction with the fine-tuned model:

Make a predictor object:

In [None]:
predictor = DetoxifierPredictor(
    path=model_path,
    model=detoxification_trainer.model,
    tokenizer=detoxification_trainer.tokenizer
)

Translate a given text:

In [None]:
prompt = "So he's the Top dog. he's the tallest son of a bitch."
predictor.translate_text(prompt)