# 1. Information about the submission

## 1.1 Name and number of the assignment 

**Assignment №2 RUSSE 2022 Russian Text Detoxification Based on Parallel Corpora**

## 1.2 Student name

**Insaf Ashrapov**

## 1.3 Codalab user ID

**Insafq, team name Sber2**

# 2. Technical Report

## 2.1 Methodology 

**Task description** 

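The task is to detoxify Russian text: rewrite a toxic sentence so that it keeps its meaning but loses the offensive wording. This makes it a natural text-to-text (sequence-to-sequence) task. Alternative approaches are to simply delete toxic words, or to train a model such as condBERT that predicts non-toxic replacements for masked toxic words.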

Transformers have revolutionized the field of text-to-text tasks due to their powerful ability to capture complex relationships between words and sentences in natural language.


***Examined approaches***

The following did not work or gave no noticeable effect:
1. Changing the learning rate.
2. Early stopping, because validation loss is a poor indicator of model quality.
3. Longer training. Training for longer than in the final solution gave worse results, and training for only a few epochs gave much worse results.
4. Changing generation hyperparameters.
5. Few-shot learning.
6. Fine-tuning GPT-2, which T5 outperformed.


## 2.2 Discussion of results

Fine-tuning significantly outperformed the few-shot approach. Getting a decent result requires fairly long fine-tuning.

Final solution: after fine-tuning, the T5 output may still contain some rude words, so deleting them in a post-processing step further improves the score.

Method | dev | test
--- | --- | ---
few-shot learning | 0.12 | -
gpt2-finetune (half-train) | 0.36 | -
ru-t5-finetuned | 0.457 | 0.526
+delete bad word | **0.463** | **0.530**
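
The last row of the table corresponds to a simple vocabulary filter applied to the model output. A minimal sketch of that idea follows. It assumes a hypothetical two-word vocabulary and a regex tokenizer in place of `toxic_vocab_extended.txt` and the spaCy tokenizer used in the code section, and it lowercases tokens before the lookup, which the notebook does not do.

```python
import re

# Hypothetical stand-in for toxic_vocab_extended.txt (an assumption for illustration).
TOXIC_VOCAB = {"badword", "rude"}

def delete_toxic_tokens(sentence, vocab=TOXIC_VOCAB):
    """Drop every token whose lowercase form is in the vocabulary."""
    # Split into words and single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", sentence)
    kept = [tok for tok in tokens if tok.lower() not in vocab]
    return " ".join(kept)

cleaned = delete_toxic_tokens("this badword sentence is fine")
print(cleaned)
```

Joining with spaces leaves punctuation detached from the preceding word, the same trade-off the notebook's `' '.join(...)` makes.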


# 3. Code

## 3.1 Requirements + imports

In [None]:
!pip install spacy
!pip install transformers
!pip install sentencepiece
!pip install sacrebleu
!pip install evaluate
# add any other dependencies you need


In [None]:
from transformers import T5ForConditionalGeneration, AutoTokenizer
import torch
import pandas as pd
from sklearn.utils import shuffle
import os

from torch.utils.data import Dataset, DataLoader
from transformers import Trainer, TrainingArguments
from transformers.file_utils import cached_property
from typing import Tuple
from sklearn.model_selection import train_test_split
import gc
from tqdm.auto import tqdm, trange
import numpy as np

#to suppress warnings 
os.environ['TRANSFORMERS_NO_ADVISORY_WARNINGS'] = 'true'

def cleanup():
    gc.collect()
    torch.cuda.empty_cache()

## 3.2 Download the data

In [None]:
!wget --no-cache --backups=1 "https://raw.githubusercontent.com/s-nlp/russe_detox_2022/main/data/input/train.tsv"
!wget --no-cache --backups=1 "https://raw.githubusercontent.com/s-nlp/russe_detox_2022/main/data/input/dev.tsv"
!wget --no-cache --backups=1 "https://raw.githubusercontent.com/s-nlp/russe_detox_2022/main/data/input/test.tsv"
!wget --no-cache --backups=1 "https://raw.githubusercontent.com/s-nlp/russe_detox_2022/6efda7cb6256a6693415ee7d9897306cbec3cc58/baselines/delete/toxic_vocab_extended.txt"
# if some needed file is not in the public domain use google drive or other free hosting to make them available


In [None]:
!head -2 train.tsv

index	toxic_comment	neutral_comment1	neutral_comment2	neutral_comment3
0	и,чё,блядь где этот херой был до этого со своими доказательствами?	Ну и где этот герой был,со своими доказательствами?	Где этот герой был до этого со своими доказательствами?	и,где этот герой был до этого со своими доказательствами?


## 3.3 Preprocessing 

In [None]:
def read_preprocess_dataset(path):
    df = pd.read_csv(path, sep='\t')
    df = df.fillna('')
    df_train_toxic = []
    df_train_neutral = []

    for index, row in df.iterrows():
        references = row[['neutral_comment1', 'neutral_comment2', 'neutral_comment3']].tolist()
        
        for reference in references:
            if len(reference) > 0:
                df_train_toxic.append(row['toxic_comment'])
                df_train_neutral.append(reference)
            else:
                break

    df = pd.DataFrame({
        'toxic_comment': df_train_toxic,
        'neutral_comment': df_train_neutral
    })

    return shuffle(df)

train = read_preprocess_dataset('train.tsv')
dev = read_preprocess_dataset('dev.tsv')
test = pd.read_csv('test.tsv', sep='\t')
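
To make the expansion step concrete: each row of the TSV carries one toxic comment and up to three neutral references, and `read_preprocess_dataset` turns it into one training pair per non-empty reference. The sketch below shows the same logic on a plain dictionary, with made-up placeholder texts; `expand_references` is a hypothetical helper name, and the early `break` assumes (as the notebook code does) that references are filled from left to right.

```python
def expand_references(row):
    """Turn one row with up to three references into (toxic, neutral) pairs."""
    pairs = []
    for key in ("neutral_comment1", "neutral_comment2", "neutral_comment3"):
        reference = row.get(key, "")
        if not reference:
            break  # assumed: no filled reference follows an empty one
        pairs.append((row["toxic_comment"], reference))
    return pairs

# Placeholder row, not taken from the dataset.
row = {
    "toxic_comment": "toxic text",
    "neutral_comment1": "neutral A",
    "neutral_comment2": "neutral B",
    "neutral_comment3": "",
}
pairs = expand_references(row)
print(pairs)
```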

## 3.4 Training baseline

In [None]:
from typing import List, Dict, Union

class DataCollatorWithPadding:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        batch = self.tokenizer.pad(
            features,
            padding=True,
        )
        ybatch = self.tokenizer.pad(
            {'input_ids': batch['labels'], 'attention_mask': batch['decoder_attention_mask']},
            padding=True,
        ) 
        batch['labels'] = ybatch['input_ids']
        batch['decoder_attention_mask'] = ybatch['attention_mask']
        
        return {k: torch.tensor(v) for k, v in batch.items()}

cleanup()  # defined in section 3.1

class PairsDataset(torch.utils.data.Dataset):
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __getitem__(self, idx):
        assert idx < len(self.x['input_ids'])
        item = {key: val[idx] for key, val in self.x.items()}
        item['decoder_attention_mask'] = self.y['attention_mask'][idx]
        item['labels'] = self.y['input_ids'][idx]
        return item
    
    @property
    def n(self):
        return len(self.x['input_ids'])

    def __len__(self):
        return self.n # * 2

In [None]:
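def evaluate_model(model, test_dataloader):
    """Return the validation loss averaged over samples."""
    num = 0
    den = 0

    for batch in test_dataloader:
        # len(batch) would count the dict keys, so read the batch size from the tensor
        batch_size = batch['input_ids'].shape[0]
        # mask padding in the labels, as the training loop does
        batch['labels'][batch['labels'] == 0] = -100
        with torch.no_grad():
            loss = model(**{k: v.to(model.device) for k, v in batch.items()}).loss
        num += batch_size * loss.item()
        den += batch_size
    val_loss = num / den
    return val_loss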


class EarlyStopper:
    def __init__(self, patience=1, min_delta=0):
        self.patience = patience
        self.min_delta = min_delta
        self.counter = 0
        self.min_validation_loss = np.inf

    def early_stop(self, validation_loss):
        if validation_loss < self.min_validation_loss:
            self.min_validation_loss = validation_loss
            self.counter = 0
        elif validation_loss > (self.min_validation_loss + self.min_delta):
            self.counter += 1
            if self.counter >= self.patience:
                print("EarlyStopping at {}".format(self.counter))
                return True
        return False
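
The patience rule in `EarlyStopper` can be traced on a short list of validation losses. The sketch below re-implements the same rule as a standalone function (`first_stop_index` is a hypothetical name) and applies it to values rounded from the first four validation losses in the training log further down. With numbers like these the rule fires within the first few evaluations, which is one way to see why validation loss was a poor stopping signal here.

```python
import math

def first_stop_index(val_losses, patience=1, min_delta=0.0):
    """Return the index at which the patience rule fires, or None if it never does."""
    best = math.inf
    counter = 0
    for idx, loss in enumerate(val_losses):
        if loss < best:
            # New best value: remember it and reset the counter.
            best = loss
            counter = 0
        elif loss > best + min_delta:
            # Worse than the best value by more than min_delta.
            counter += 1
            if counter >= patience:
                return idx
    return None

val_losses = [6.25, 6.69, 7.03, 7.18]
stop_at = first_stop_index(val_losses, patience=2)
print(stop_at)
```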

In [None]:
def train_loop(
    model, train_dataloader, val_dataloader, 
    max_epochs=30, 
    max_steps=1_000, 
    lr=3e-5,
    gradient_accumulation_steps=1, 
    cleanup_step=100,
    report_step=300,
    window=100,
    patience=10
):
    cleanup()
    optimizer = torch.optim.Adam(params = [p for p in model.parameters() if p.requires_grad], lr=lr)

    ewm_loss = 0
    step = 0
    model.train()
    early_stopper = EarlyStopper(patience=patience)

    for epoch in trange(max_epochs):
        print(step, max_steps)
        if step >= max_steps:
            break
        tq = tqdm(train_dataloader)
        for i, batch in enumerate(tq):
            try:
                batch['labels'][batch['labels']==0] = -100
                loss = model(**{k: v.to(model.device) for k, v in batch.items()}).loss
                loss.backward()
            except Exception as e:
                print('error on step', i, e)
                loss = None
                cleanup()
                continue
            if i and i % gradient_accumulation_steps == 0:
                optimizer.step()
                optimizer.zero_grad()
                step += 1
                if step >= max_steps:
                    break

            if i % cleanup_step == 0:
                cleanup()

            w = 1 / min(i+1, window)
            ewm_loss = ewm_loss * (1-w) + loss.item() * w
            tq.set_description(f'loss: {ewm_loss:4.4f}')

            if (i and i % report_step == 0 or i == len(train_dataloader)-1)  and val_dataloader is not None:
                model.eval()
                eval_loss = evaluate_model(model, val_dataloader)

                model.train()
                print(f'epoch {epoch}, step {i}/{step}: train loss: {ewm_loss:4.4f}  val loss: {eval_loss:4.4f}')
                
            # NOTE: dname and steps are globals set by the training cell below
            if step % 100 == 0:
                model.save_pretrained(f't5_base_{dname}_{steps}')
        
    cleanup()
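
The progress bar in `train_loop` shows a smoothed loss rather than the raw per-batch value. The update rule is a running mean for the first `window` batches and an exponential moving average afterwards; because `i` is the batch index within an epoch, the smoothing restarts at each epoch. A standalone sketch of the same rule, with made-up loss values and a hypothetical function name:

```python
def smoothed_losses(losses, window=100):
    """Apply the same smoothing rule as train_loop to a list of losses."""
    ewm = 0.0
    history = []
    for i, loss in enumerate(losses):
        # Weight 1/(i+1) gives a plain running mean until `window` batches are seen,
        # after that the weight stays at 1/window.
        w = 1 / min(i + 1, window)
        ewm = ewm * (1 - w) + loss * w
        history.append(ewm)
    return history

values = smoothed_losses([4.0, 2.0, 3.0, 1.0], window=2)
print(values)
```

A larger `window` gives a smoother but slower-reacting curve.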

In [None]:
def train_model(x, y, dev_x, dev_y, model_name, batch_size=32, lr=3e-5, **kwargs):
    model = T5ForConditionalGeneration.from_pretrained(model_name).cuda()
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    train_dataset = PairsDataset(tokenizer(x), tokenizer(y))
    test_dataset = PairsDataset(tokenizer(dev_x), tokenizer(dev_y))
    
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
    train_dataloader = DataLoader(train_dataset, batch_size=batch_size, drop_last=False, shuffle=True, collate_fn=data_collator)
    val_dataloader = DataLoader(test_dataset, batch_size=batch_size, drop_last=False, shuffle=True, collate_fn=data_collator)

    train_loop(model, train_dataloader, val_dataloader, **kwargs)
    return model

In [None]:
model_name = 'sberbank-ai/ruT5-base'
cleanup()
datasets = {
    'train': train
}

In [None]:
for steps in [5000]:
    for dname, d in datasets.items():
        print(f'\n\n\n  {dname}  {steps} \n=====================\n\n')
        model = train_model(train['toxic_comment'].tolist(), train['neutral_comment'].tolist(),
                            dev['toxic_comment'].tolist(), dev['neutral_comment'].tolist(),
                            model_name=model_name, batch_size=20, report_step=100, max_epochs=1000, 
                            max_steps=steps, gradient_accumulation_steps=1, patience=10)
        model.save_pretrained(f't5_base_{dname}_{steps}')




  train  2000

0 2000
epoch 0, step 100/100: train loss: 4.5321  val loss: 6.2501
epoch 0, step 200/200: train loss: 3.3290  val loss: 6.6882
epoch 0, step 300/300: train loss: 2.6933  val loss: 7.0314
epoch 0, step 400/400: train loss: 2.2996  val loss: 7.1801
epoch 0, step 500/500: train loss: 2.1391  val loss: 7.2469
epoch 0, step 600/600: train loss: 1.9805  val loss: 7.0289
epoch 0, step 693/693: train loss: 1.8655  val loss: 7.0533
693 2000
epoch 1, step 100/793: train loss: 1.8144  val loss: 7.0152
epoch 1, step 200/893: train loss: 1.7090  val loss: 7.0549
epoch 1, step 300/993: train loss: 1.6750  val loss: 6.9955
epoch 1, step 400/1093: train loss: 1.6641  val loss: 6.9674
epoch 1, step 500/1193: train loss: 1.6466  val loss: 6.9506
epoch 1, step 600/1293: train loss: 1.6204  val loss: 6.8568
epoch 1, step 693/1386: train loss: 1.5902  val loss: 6.7687
1386 2000
epoch 2, step 100/1486: train loss: 1.5202  val loss: 6.9956
epoch 2, step 200/1586: train loss: 1.4985  val loss: 6.7739
epoch 2, step 300/1686: train loss: 1.4854  val loss: 7.0097
epoch 2, step 400/1786: train loss: 1.4664  val loss: 6.8590
epoch 2, step 500/1886: train loss: 1.4374  val loss: 7.0920
epoch 2, step 600/1986: train loss: 1.4294  val loss: 6.9416
2000 2000


## Inference

In [None]:
test = pd.read_csv('test.tsv', sep='\t')  # the test set, not dev
toxic_inputs = test['toxic_comment'].tolist()

In [None]:
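# pytorch_model.bin holds only a state dict; reload the saved directory instead
model = T5ForConditionalGeneration.from_pretrained('t5_base_train_2000').cuda().eval()
tokenizer = AutoTokenizer.from_pretrained(model_name)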

In [None]:
def paraphrase(text, model, n=None, max_length='auto', temperature=0.1, beams=3):
    texts = [text] if isinstance(text, str) else text
    # keep the attention mask so padded positions are ignored in batched generation
    inputs = tokenizer(texts, return_tensors='pt', padding=True).to(model.device)
    if max_length == 'auto':
        max_length = int(inputs['input_ids'].shape[1] * 1.2) + 10
    result = model.generate(
        **inputs,
        num_return_sequences=n or 1,
        do_sample=False,  # beam search; temperature has no effect without sampling
        temperature=temperature,
        repetition_penalty=3.0,
        max_length=max_length,
        bad_words_ids=[[2]],  # unk
        num_beams=beams,
    )
    texts = [tokenizer.decode(r, skip_special_tokens=True) for r in result]
    if not n and isinstance(text, str):
        return texts[0]
    return texts

In [None]:
para_results = []
problematic_batch = []  # start indices of batches that failed, for later inspection
batch_size = 8

for i in tqdm(range(0, len(toxic_inputs), batch_size)):
    batch = toxic_inputs[i:i + batch_size]
    try:
        para_results.extend(paraphrase(batch, model, temperature=0.1))
    except Exception as e:
        print(i, e)
        problematic_batch.append(i)
        # fall back to the unchanged inputs so the output stays aligned with the input
        para_results.extend(batch)

with open('output.txt', 'w') as file:
    file.writelines([sentence + '\n' for sentence in para_results])

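
The batching pattern of the inference cell can be checked without a model. The sketch below uses a hypothetical stand-in for `paraphrase` that fails on one batch, and shows that falling back to the unchanged inputs keeps the output list aligned with the input list; `run_in_batches` and `flaky_upper` are illustrative names, not part of the notebook.

```python
def run_in_batches(inputs, transform, batch_size=8):
    """Apply `transform` batch by batch, keeping failed batches unchanged."""
    results, failed_starts = [], []
    for start in range(0, len(inputs), batch_size):
        batch = inputs[start:start + batch_size]
        try:
            results.extend(transform(batch))
        except Exception:
            failed_starts.append(start)
            results.extend(batch)  # fall back to the unchanged inputs
    return results, failed_starts

def flaky_upper(batch):
    """Hypothetical stand-in for paraphrase(): fails on one specific input."""
    if "boom" in batch:
        raise RuntimeError("simulated failure")
    return [s.upper() for s in batch]

outputs, failed = run_in_batches(["a", "b", "boom", "c", "d"], flaky_upper, batch_size=2)
print(outputs, failed)
```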

## Post-processing: deleting bad words

In [None]:
import pandas as pd
import numpy as np
from tqdm.auto import tqdm

with open('toxic_vocab_extended.txt', 'r') as file:
    # a set makes the per-token membership test fast
    toxic_words = {line.strip() for line in file}

In [None]:
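# output.txt holds one model output per line (the name toxic_inputs is kept for the cell below)
with open('output.txt', 'r') as file:
    toxic_inputs = [line.rstrip('\n') for line in file]

from spacy.lang.ru import Russian
nlp = Russian()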


In [None]:
results = []

for sample in tqdm(toxic_inputs):
    doc = nlp(sample)
    tokens = [token.text for token in doc]
    # keep only the tokens that are not in the toxic vocabulary
    cleaned_sentence = [word for word in tokens if word not in toxic_words]

    results.append(' '.join(cleaned_sentence))


In [None]:
with open('delete_dev.txt', 'w') as file:
    file.writelines([sentence+'\n' for sentence in results])

### Few-shot learning

In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
from transformers import AutoModelForCausalLM, AutoTokenizer
base_model = 'sberbank-ai/rugpt3medium_based_on_gpt2'

model = AutoModelForCausalLM.from_pretrained(base_model).cuda()
tokenizer = AutoTokenizer.from_pretrained(base_model)


In [None]:
examples = train.sample(3, random_state=20)
END_OF_LINE = tokenizer('\n').input_ids[0]
template = 'Токсичный текст: {}\n Дружелюбное перефразирование:'
template2 = '\n\n'.join([template.format(' ' + row.toxic_comment) + ' ' + row.neutral_comment + '\n ---' for i, row in examples.iterrows()] + [template])
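
The shape of the resulting prompt is easier to see on a small example. The sketch below builds a prompt in the same layout from two made-up example pairs and a made-up query; `build_few_shot_prompt` is a hypothetical helper. Unlike the notebook, which leaves the last slot as `{}` and fills it later with `template2.format(text)`, the sketch fills the query in directly, which avoids a second `str.format` pass over the example texts.

```python
TEMPLATE = 'Токсичный текст: {}\n Дружелюбное перефразирование:'

def build_few_shot_prompt(examples, query):
    """Join solved examples and one open query into a single prompt."""
    shots = [
        TEMPLATE.format(toxic) + ' ' + neutral + '\n ---'
        for toxic, neutral in examples
    ]
    # The final block has no answer: the model is expected to continue it.
    return '\n\n'.join(shots + [TEMPLATE.format(query)])

# Placeholder pairs, not taken from the dataset.
examples = [
    ('пример грубой фразы 1', 'пример вежливой фразы 1'),
    ('пример грубой фразы 2', 'пример вежливой фразы 2'),
]
prompt = build_few_shot_prompt(examples, 'новая фраза')
print(prompt)
```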

In [None]:
def generate(prompt):
    inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
    length = inputs.input_ids.shape[1]
    model.eval()
    beam_outputs = model.generate(
        **inputs, 
        max_length=length+32, 
        min_length=length+3,  # the new text should be at least 3 tokens long
        num_beams=3, 
        num_return_sequences=1, 
        early_stopping=True,
        eos_token_id=END_OF_LINE, pad_token_id=tokenizer.eos_token_id,
    )
    return tokenizer.decode(beam_outputs[0][length:], skip_special_tokens=True).strip()
dev = pd.read_csv('dev.tsv', sep='\t')
outputs_3shot = [generate(template2.format(text)) for text in tqdm(dev['toxic_comment'].values)]



In [None]:
with open('outputs_3shot_dev.txt', 'w') as file:
    file.writelines([sentence.replace('\n', '')+'\n' for sentence in outputs_3shot])

### Fine-tuning GPT



In [None]:
row = train.sample(1, random_state=20)

bad_text = row.toxic_comment.iloc[0]
END_OF_LINE = tokenizer('\n').input_ids[0]

inputs = tokenizer(template2.format(bad_text), return_tensors='pt').to(model.device)
length = inputs.input_ids.shape[1]

beam_outputs = model.generate(
    **inputs, 
    max_length=length+100, 
    min_length=length+3,  # the new text should be at least 3 tokens long
    num_beams=5, 
    num_return_sequences=1, 
    early_stopping=True,
    eos_token_id=END_OF_LINE,
)

for i, beam_output in enumerate(beam_outputs):
  print(tokenizer.decode(beam_output, skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:203 for open-end generation.


Токсичный текст:  нет, я ленивый пидр, который нихера не понял из того что читал и надеется на шпоры :|это провал
 Дружелюбное перефразирование: Нет,я ленивый,который ничего не понял из того,что прочитал и надеется на шпаргалки :|это провал
 ---

Токсичный текст:  Охуенно блять,у меня 3 по истории за аттестацию я расстроенна;(
 Дружелюбное перефразирование: У меня три по истории за аттестацию Я расстроена
 ---

Токсичный текст:  админ ты по ходу ебанутый фантазёр:-d
 Дружелюбное перефразирование: Админ ты походу фантазёр
 ---

Токсичный текст: нет, я ленивый пидр, который нихера не понял из того что читал и надеется на шпоры :|это провал
 Дружелюбное перефразирование: Нет,я ленивый,который ничего не понял из того,что прочитал и надеется на шпоры :|это провал



In [None]:
def compute_loss(x_text, y_text):
    x_ids = tokenizer(template.format(x_text), return_tensors='pt').to(model.device).input_ids
    y_ids = tokenizer(y_text + '\n', return_tensors='pt').to(model.device).input_ids
    input_ids = torch.cat([x_ids, y_ids], 1)
    # prompt tokens get label -100 so that only the paraphrase contributes to the loss
    labels = torch.cat([torch.tensor([[-100]]).to(model.device).repeat(1, x_ids.shape[1]), y_ids], 1)
    out = model(
        input_ids=input_ids,
        labels=labels
    )
    return out.loss


from torch.optim import Adam
optimizer = Adam(model.parameters(), lr=1e-4)

model.train()
for epoch in trange(2):
    sum_loss = 0
    samples = train.sample(frac=.2).values  # a fresh 20% subset of the pairs each epoch
    tq = tqdm(samples)
    for i, (x_text, y_text) in enumerate(tq):
        loss = compute_loss(x_text, y_text)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        sum_loss += loss.item()
        tq.set_description(str(loss.item()))
    # average over the sampled subset; the recorded output divided by len(train) instead
    print('epoch', epoch, 'loss', sum_loss / len(samples))
model.eval();

epoch 0 loss 0.37628272752083564
epoch 1 loss 0.3255100813834543
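
The key step in `compute_loss` is the label layout: the prompt positions are set to `-100`, the value that PyTorch's cross-entropy loss ignores by default, so only the paraphrase tokens are scored. A minimal sketch with plain lists and made-up token ids (`build_inputs_and_labels` is a hypothetical helper):

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def build_inputs_and_labels(prompt_ids, target_ids):
    """Concatenate prompt and target, masking the prompt part of the labels."""
    input_ids = prompt_ids + target_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + target_ids
    return input_ids, labels

# Made-up token ids for illustration.
input_ids, labels = build_inputs_and_labels([11, 12, 13], [21, 22])
print(input_ids)
print(labels)
```

The two lists always have the same length, which is what a causal language model expects for `input_ids` and `labels`.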


In [None]:
outputs_full = [generate(template.format(text)) for text in tqdm(dev['toxic_comment'].values)]


In [None]:
with open('gpt_finetune.txt', 'w') as file:
    file.writelines([sentence+'\n' for sentence in outputs_full])

## Hugging Face model pipeline

In [None]:
import numpy as np
import evaluate
from datasets import Dataset

# evaluate.load replaces the deprecated datasets.load_metric
metric = evaluate.load("sacrebleu")

def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]

    return preds, labels


def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result


In [None]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer
from transformers import AutoTokenizer
model_name = 'sberbank-ai/ruT5-base' #"Helsinki-NLP/opus-mt-ru-en"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)


In [None]:
train = read_preprocess_dataset('train.tsv')
dev = read_preprocess_dataset('dev.tsv')
test = pd.read_csv('test.tsv', sep='\t')

source_lang = "toxic_comment"
target_lang = "neutral_comment"
prefix = "Сделай без мата: "


def preprocess_function(examples):
    inputs = [prefix + sample for sample in examples[source_lang]]
    targets = [sample for sample in examples[target_lang]]
    model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True)
    return model_inputs

X_train_df = Dataset.from_pandas(train)
X_test_df = Dataset.from_pandas(dev)

X_train_token = X_train_df.map(preprocess_function, batched=True)
X_test_token = X_test_df.map(preprocess_function, batched=True)


In [None]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

In [None]:
training_args = Seq2SeqTrainingArguments(
    output_dir="my_awesome_opus_books_model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=5,
    predict_with_generate=True,
    #fp16=True,
    push_to_hub=False,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=X_train_token,
    eval_dataset=X_test_token,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    
)

trainer.train()

The following columns in the training set don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: toxic_comment, neutral_comment, __index_level_0__. If toxic_comment, neutral_comment, __index_level_0__ are not expected by `T5ForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 11090
  Num Epochs = 5
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 3470
  Number of trainable parameters = 222903552


Epoch | Training Loss | Validation Loss | Bleu | Gen Len
--- | --- | --- | --- | ---
1 | 2.5414 | 1.33204 | 45.1065 | 13.4615
2 | 1.5813 | 1.264972 | 46.2975 | 13.4068
3 | 1.372 | 1.239894 | 46.5959 | 13.4677
4 | 1.3174 | 1.229145 | 47.042 | 13.4588
5 | 1.2895 | 1.226962 | 47.0043 | 13.4203


The following columns in the evaluation set don't have a corresponding argument in `T5ForConditionalGeneration.forward` and have been ignored: toxic_comment, neutral_comment, __index_level_0__. If toxic_comment, neutral_comment, __index_level_0__ are not expected by `T5ForConditionalGeneration.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1116
  Batch size = 16

TrainOutput(global_step=3470, training_loss=1.547151047764319, metrics={'train_runtime': 1540.1254, 'train_samples_per_second': 36.004, 'train_steps_per_second': 2.253, 'total_flos': 3435319381248000.0, 'train_loss': 1.547151047764319, 'epoch': 5.0})

In [None]:
from transformers import pipeline

translator = pipeline("translation", model=model.to('cpu'), tokenizer=tokenizer)
output = translator(prefix+"нет, я ленивый пидр, который нихера не понял из того что читал и надеется на шпоры :|это провал")
output[0]['translation_text']



'Нет, я ленивый, который ничего не понял из того что читал и надеется на шпоры :это провал'

In [None]:
dev = pd.read_csv('dev.tsv', sep='\t')

outputs_dev = [translator(prefix+text)[0]['translation_text'] for text in tqdm(dev['toxic_comment'].values)]
outputs_test = [translator(prefix+text)[0]['translation_text'] for text in tqdm(test['toxic_comment'].values)]


In [None]:
with open('t5_finetune_hug_dev.txt', 'w') as file:
    file.writelines([sentence+'\n' for sentence in outputs_dev])

with open('t5_finetune_hug_test.txt', 'w') as file:
    file.writelines([sentence+'\n' for sentence in outputs_test])

!zip t5_finetune_hug_dev.zip t5_finetune_hug_dev.txt
!zip t5_finetune_hug_test.zip t5_finetune_hug_test.txt    

updating: t5_finetune_hug_dev.txt (deflated 68%)
  adding: t5_finetune_hug_test.txt (deflated 67%)
