# 1. Information about the submission

## 1.1 Name and number of the assignment 

**Text Detoxification. Assignment 3.**

## 1.2 Student name

**Nikolay Kalmykov**

## 1.3 Codalab user ID

**Nick**


# 2. Technical Report

## 2.1 Methodology 

I use the `sberbank-ai/ruT5-base` model (222M parameters), which was pretrained on a large corpus of Russian text with a denoising autoencoder objective.

In addition, I machine-translated the English parallel dataset ParaDetox (https://huggingface.co/datasets/s-nlp/paradetox) into Russian and used the translation as training data.

The exact steps:

1) The code uses `MarianMTModel` and `MarianTokenizer` (`Helsinki-NLP/opus-mt-en-ru`) to translate each sentence in the `en_toxic_comment` and `en_neutral_comment` columns of the DataFrame from English to Russian.

2) Then, the code defines a collator class, `DataCollatorWithPadding`, that pads the tokenized pairs of comments and assembles them into batches of tensors.

3) The training loop uses the Adam optimizer and supports gradient accumulation to reduce GPU memory use. It also tracks an exponential moving average of the training loss (smoothed over `window` steps) and reports the training and validation loss every `report_step` steps.
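
The loss smoothing in step 3 can be sketched in isolation. This is a minimal pure-Python illustration of the update rule used in the training loop (`w = 1 / min(i + 1, window)`); the function name `smoothed_losses` and the sample loss values are made up for the example.

```python
def smoothed_losses(losses, window=100):
    """Plain running mean for the first `window` values, then an exponential moving average."""
    ewm = 0.0
    history = []
    for i, loss in enumerate(losses):
        # The weight of the newest value shrinks until it reaches 1 / window
        w = 1 / min(i + 1, window)
        ewm = ewm * (1 - w) + loss * w
        history.append(ewm)
    return history


print(smoothed_losses([4.0, 2.0, 3.0], window=2))  # [4.0, 3.0, 3.0]
```

A larger `window` gives a smoother curve that reacts more slowly to changes in the loss.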


I used the following hyperparameters for training:

* batch_size = 18
* num_epoch = 15
* learning_rate = 3e-5
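
As a rough sanity check on these settings, the number of batches per epoch can be worked out from the dataset size. The sketch below assumes the extended dataset of 30834 pairs (the shape printed in section 3.3), the 10% validation split used in `train_model`, one optimizer step per batch, and the `max_steps=10000` cap passed in the training cell. The helper name `batches_per_epoch` is hypothetical.

```python
import math


def batches_per_epoch(n_pairs, batch_size, test_size=0.1):
    """Number of training batches per epoch after holding out a validation split."""
    n_val = math.ceil(n_pairs * test_size)  # scikit-learn rounds the test split up
    n_train = n_pairs - n_val
    return math.ceil(n_train / batch_size)  # the last, smaller batch is kept (drop_last=False)


per_epoch = batches_per_epoch(30834, batch_size=18)
print(per_epoch)          # batches in one epoch
print(per_epoch * 15)     # batches in 15 epochs
print(10000 / per_epoch)  # epochs covered before the step cap is reached
```

Under these assumptions one epoch is 1542 batches, so the 10000-step cap is reached after roughly 6.5 epochs, before the 15 configured epochs finish.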

## 2.2 Discussion of results


| Method | Style transfer accuracy | Meaning preservation | Fluency | Joint score | ChrF1 |
| --- | --- | --- | --- | --- | --- |
| Baseline | 0.56 | 0.89 | 0.85 | 0.41 | 0.53 |
| T5 (Machine Translation of English Parallel dataset) | 0.66 | 0.73 | 0.85 | 0.40 | 0.46 |
| T5 (Extended dataset) | 0.78 | 0.82 | 0.82 | 0.53 | 0.56 |

For the dataset obtained with machine translation alone, the results were not good: meaning preservation drops from 0.89 to 0.73, and the joint score (0.40) stays at the baseline level (0.41). This probably happens because machine translation systems are imperfect and make errors while translating. These errors can remove important information that would help to detect toxicity. Machine translation systems also sometimes lose the context of the original text, which can make it difficult to detect toxic language or to understand the meaning of the text. Finally, certain words or phrases in the original text may have no counterpart in the translated text.

So I also trained on a mixed (extended) dataset that combines the translated sentences with the original Russian parallel data. The score was higher, but still not higher than the score of the model trained only on the Russian parallel data.

To improve the score of the machine translation approach, one could review and annotate the translated dataset to ensure data quality. However, this is routine manual work, and it is probably better to try other models, such as conditional BERT.
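
A cheap automatic filter could precede such a manual review. The sketch below is an assumption about what a first check might look like, not part of the submitted pipeline: it drops translated pairs whose two sides are empty or identical after normalization, since such pairs carry no detoxification signal. The function name `filter_translated_pairs` and the second and third sample rows are invented for illustration.

```python
def filter_translated_pairs(pairs):
    """Keep (toxic, neutral) pairs where both sides are non-empty and differ."""
    kept = []
    for toxic, neutral in pairs:
        # Normalize case and whitespace before comparing the two sides
        toxic_norm = " ".join(toxic.lower().split())
        neutral_norm = " ".join(neutral.lower().split())
        if toxic_norm and neutral_norm and toxic_norm != neutral_norm:
            kept.append((toxic, neutral))
    return kept


sample = [
    ("Она такая лживая сволочь.", "Она лжет."),
    ("Это тот же текст.", "это  тот же текст."),  # translation erased the difference
    ("", "Пустой перевод."),                      # translation failed
]
print(filter_translated_pairs(sample))
```

Only the first pair survives here. Such a filter catches degenerate pairs only; subtle meaning shifts would still need human annotation.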

# 3. Code

## 3.1 Requirements

In [2]:
# !pip install transformers -q
# !wget https://raw.githubusercontent.com/s-nlp/russe_detox_2022/main/data/input/train.tsv -q
# !wget https://raw.githubusercontent.com/s-nlp/russe_detox_2022/main/data/input/dev.tsv -q
# !wget https://raw.githubusercontent.com/s-nlp/russe_detox_2022/main/data/input/test.tsv -q

## 3.2 Libraries

In [1]:

import pandas as pd
import torch
import gc

from sklearn.utils import shuffle
from transformers import T5ForConditionalGeneration, AutoTokenizer, MarianMTModel, MarianTokenizer
from torch.utils.data import DataLoader, Dataset 

from typing import Tuple, List, Dict, Union
from sklearn.model_selection import train_test_split
from tqdm.auto import tqdm, trange

device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)

cuda


## 3.3 Machine Translation for English train dataset

In [None]:

from datasets import load_dataset

dataset = load_dataset("s-nlp/paradetox")


In [5]:
train_df = pd.DataFrame(dataset['train'])
train_df.head()
toxic_comment_en = train_df['en_toxic_comment'].tolist()
neutral_comment_en = train_df['en_neutral_comment'].tolist()

In [10]:

tokenizer_transl = MarianTokenizer.from_pretrained('Helsinki-NLP/opus-mt-en-ru')
model_transl = MarianMTModel.from_pretrained('Helsinki-NLP/opus-mt-en-ru')

In [11]:

russian_toxic_sentences = []
russian_n_sentences = []

english_toxic_sentences = toxic_comment_en
english_neutral_sentences = neutral_comment_en


def translate(sentence):
    # Translate a single English sentence to Russian with the Marian model
    inputs = tokenizer_transl(sentence, return_tensors='pt', padding=True)
    outputs = model_transl.generate(**inputs)
    return tokenizer_transl.decode(outputs[0], skip_special_tokens=True)


# Translate the toxic and the neutral side of every pair, keeping them aligned
for sentence_tox, sentence_n in tqdm(zip(english_toxic_sentences, english_neutral_sentences),
                                     total=len(english_toxic_sentences)):
    russian_toxic_sentences.append(translate(sentence_tox))
    russian_n_sentences.append(translate(sentence_n))
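
Translating one sentence per `generate` call is slow on a dataset of this size, and one common alternative is to translate in batches. The sketch below shows only the batching logic. `translate_batch` is a hypothetical callable (for example, a wrapper that tokenizes a list of sentences with `padding=True`, calls `generate`, and decodes every output), and the upper-casing stand-in exists only so the example runs without a model.

```python
from typing import Callable, List


def translate_in_batches(sentences: List[str],
                         translate_batch: Callable[[List[str]], List[str]],
                         batch_size: int = 16) -> List[str]:
    """Apply `translate_batch` to consecutive chunks and keep the original order."""
    translated: List[str] = []
    for start in range(0, len(sentences), batch_size):
        chunk = sentences[start:start + batch_size]
        outputs = translate_batch(chunk)
        if len(outputs) != len(chunk):
            raise ValueError("translate_batch must return one output per input")
        translated.extend(outputs)
    return translated


def fake_translate(batch: List[str]) -> List[str]:
    # Stand-in translator used only for demonstration
    return [sentence.upper() for sentence in batch]


print(translate_in_batches(["a", "b", "c"], fake_translate, batch_size=2))  # ['A', 'B', 'C']
```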

In [12]:

ls_1 = russian_toxic_sentences
ls_2 = russian_n_sentences

df = pd.DataFrame(list(zip(ls_1, ls_2)), columns=['toxic_comment', 'neutral_comment'])
df.to_csv('train_translated.tsv', sep="\t", index=False)

### Reading the input dataset

In [2]:

df = pd.read_csv('train.tsv', sep='\t', index_col=0)
df = df.fillna('')


df_train_toxic = []
df_train_neutral = []

for index, row in df.iterrows():
    references = row[['neutral_comment1', 'neutral_comment2', 'neutral_comment3']].tolist()
    
    for reference in references:

        if len(reference) > 0:
            df_train_toxic.append(row['toxic_comment'])
            df_train_neutral.append(reference)
            
        else:
            break
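
The loop above turns each row with up to three neutral references into one training pair per non-empty reference. A toy illustration with plain dictionaries follows; the rows are invented, the column names follow `train.tsv`, and `expand_references` is a hypothetical helper that mirrors the loop, including the `break` at the first empty reference.

```python
rows = [
    {"toxic_comment": "t1", "neutral_comment1": "n1a", "neutral_comment2": "n1b", "neutral_comment3": ""},
    {"toxic_comment": "t2", "neutral_comment1": "n2a", "neutral_comment2": "", "neutral_comment3": ""},
]


def expand_references(rows):
    """Return (toxic, neutral) pairs, one per non-empty reference."""
    pairs = []
    for row in rows:
        for key in ("neutral_comment1", "neutral_comment2", "neutral_comment3"):
            reference = row[key]
            if not reference:
                break  # stop at the first empty reference, as the original loop does
            pairs.append((row["toxic_comment"], reference))
    return pairs


print(expand_references(rows))  # [('t1', 'n1a'), ('t1', 'n1b'), ('t2', 'n2a')]
```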

In [3]:

df_1 = pd.DataFrame({
    'toxic_comment': df_train_toxic,
    'neutral_comment': df_train_neutral
})
df_1.head(5)

| | toxic_comment | neutral_comment |
| --- | --- | --- |
| 0 | и,чё,блядь где этот херой был до этого со свои... | Ну и где этот герой был,со своими доказательст... |
| 1 | и,чё,блядь где этот херой был до этого со свои... | Где этот герой был до этого со своими доказате... |
| 2 | и,чё,блядь где этот херой был до этого со свои... | и,где этот герой был до этого со своими доказа... |
| 3 | О, а есть деанон этого петуха? | О, а есть деанон |
| 4 | херну всякую пишут,из-за этого лайка.долбоебизм. | Чушь всякую пишут, из- за этого лайка. |


In [4]:

df_2 = pd.read_csv('train_translated.tsv', sep="\t")
df_2.head()

| | toxic_comment | neutral_comment |
| --- | --- | --- |
| 0 | У него тоже были стальные яйца! | Он тоже был храбрым! |
| 1 | Чувак должен был быть в Апи, он был бы прямо д... | Было бы неплохо, если бы он пошёл в Апи. |
| 2 | Я не собираюсь продавать эту чертову фотографи... | Я не собираюсь продавать фотографию, я просто ... |
| 3 | Отбросы, которые создают КНН и другие агентств... | Новости, которые создают КНН и другие новостны... |
| 4 | Причина, по которой их не существует, в том, ч... | Причина, по которой их не существует, в том, ч... |


In [5]:

df = pd.concat([df_1, df_2], axis=0)
df = shuffle(df)
print(df.shape)
df.head()

(30834, 2)


| | toxic_comment | neutral_comment |
| --- | --- | --- |
| 15752 | Только что переехал в Ванкувер Ва, оставив все... | Просто переехал в Ванкувер Ва, оставив весь эт... |
| 5076 | завтрака хэппи милом , пиздец поправилась ;( | завтрака хэппи милом и поправилась |
| 730 | Она такая лживая сволочь. | Она лжет. |
| 4997 | Лол-на-я обнаружила, что девчонки из моего дом... | Ну, нет, я обнаружила, что когда мои домашние ... |
| 8716 | бля, сука.... почему нет такого человека, кото... | почему нет такого человека, которому бы я смог... |


### Data Structure for training

In [6]:

class PairsDataset(torch.utils.data.Dataset):
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __getitem__(self, idx):
        assert idx < len(self.x['input_ids'])
        item = {key: val[idx] for key, val in self.x.items()}
        item['decoder_attention_mask'] = self.y['attention_mask'][idx]
        item['labels'] = self.y['input_ids'][idx]
        return item
    
    @property
    def n(self):
        return len(self.x['input_ids'])

    def __len__(self):
        return self.n

In [7]:

class DataCollatorWithPadding:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        batch = self.tokenizer.pad(
            features,
            padding=True,
        )
        ybatch = self.tokenizer.pad(
            {'input_ids': batch['labels'], 'attention_mask': batch['decoder_attention_mask']},
            padding=True,
        ) 
        batch['labels'] = ybatch['input_ids']
        batch['decoder_attention_mask'] = ybatch['attention_mask']
        
        return {k: torch.tensor(v) for k, v in batch.items()}
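
The collator pads every sequence in a batch to the length of the longest one and builds a matching attention mask. A minimal pure-Python sketch of that idea, independent of the tokenizer, is shown below. The helper `pad_batch` is hypothetical, and the pad id 0 is chosen to match the value that the training loop later masks in the labels.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token id lists to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[5, 6, 7], [8]])
print(batch["input_ids"])       # [[5, 6, 7], [8, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 0, 0]]
```

Padding per batch rather than to a global maximum keeps short batches small, which saves memory and compute.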

### Training and Evaluation loop

In [8]:
def cleanup():
    gc.collect()
    torch.cuda.empty_cache()
    
cleanup()

In [9]:

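def evaluate_model(model, test_dataloader):
    # Average the validation loss over examples, weighting each batch by its size
    num = 0
    den = 0

    for batch in test_dataloader:
        with torch.no_grad():
            # len(batch) would count the dict keys, so read the batch size from the tensor
            batch_size = batch['input_ids'].shape[0]
            # Ignore padding in the loss, as train_loop does
            batch['labels'][batch['labels']==0] = -100
            loss = model(**{k: v.to(model.device) for k, v in batch.items()}).loss
            num += batch_size * loss.item()
            den += batch_size
    val_loss = num / den
    return val_loss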


def train_loop(
    model, train_dataloader, val_dataloader, 
    max_epochs=30, 
    max_steps=1_000, 
    lr=3e-5,
    gradient_accumulation_steps=1, 
    cleanup_step=100,
    report_step=300,
    window=100,
):
    cleanup()
    optimizer = torch.optim.Adam(params = [p for p in model.parameters() if p.requires_grad], lr=lr)

    ewm_loss = 0
    step = 0
    model.train()

    for epoch in trange(max_epochs):
        print(step, max_steps)
        if step >= max_steps:
            break
        tq = tqdm(train_dataloader)
        for i, batch in enumerate(tq):
            try:
                batch['labels'][batch['labels']==0] = -100
                loss = model(**{k: v.to(model.device) for k, v in batch.items()}).loss
                loss.backward()
            except Exception as e:
                print('error on step', i, e)
                loss = None
                cleanup()
                continue
            # Step the optimizer once every `gradient_accumulation_steps` batches
            if (i + 1) % gradient_accumulation_steps == 0:
                optimizer.step()
                optimizer.zero_grad()
                step += 1
                if step >= max_steps:
                    break

            if i % cleanup_step == 0:
                cleanup()

            w = 1 / min(i+1, window)
            ewm_loss = ewm_loss * (1-w) + loss.item() * w
            tq.set_description(f'loss: {ewm_loss:4.4f}')

            if (i and i % report_step == 0 or i == len(train_dataloader)-1)  and val_dataloader is not None:
                model.eval()
                eval_loss = evaluate_model(model, val_dataloader)
                model.train()
                print(f'epoch {epoch}, step {i}/{step}: train loss: {ewm_loss:4.4f}  val loss: {eval_loss:4.4f}')
                
            # Save a checkpoint every 1000 optimizer steps (skip step 0)
            if step and step % 1000 == 0:
                model.save_pretrained('t5_base_train_10000')
        
    cleanup()
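
With gradient accumulation, the optimizer steps once every `gradient_accumulation_steps` batches. The sketch below lists the batch indices at which a step would fire under the usual `(i + 1) % k == 0` convention. It illustrates the schedule only, uses a hypothetical helper name, and does not touch any model.

```python
def optimizer_step_indices(num_batches, accumulation_steps):
    """Batch indices after which the optimizer would step."""
    return [i for i in range(num_batches) if (i + 1) % accumulation_steps == 0]


print(optimizer_step_indices(8, accumulation_steps=4))  # [3, 7]
print(optimizer_step_indices(3, accumulation_steps=1))  # [0, 1, 2]
```

Accumulating over k batches of size b gives an effective batch size of k * b while only one batch is held in memory at a time.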

In [10]:

def train_model(x, y, model_name, test_size=0.1, batch_size=32, **kwargs):
    model = T5ForConditionalGeneration.from_pretrained(model_name).to(device)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    x1, x2, y1, y2 = train_test_split(x, y, test_size=test_size, random_state=42)
    train_dataset = PairsDataset(tokenizer(x1), tokenizer(y1))
    test_dataset = PairsDataset(tokenizer(x2), tokenizer(y2))
    
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
    train_dataloader = DataLoader(train_dataset, batch_size=batch_size, drop_last=False, shuffle=True, collate_fn=data_collator)
    val_dataloader = DataLoader(test_dataset, batch_size=batch_size, drop_last=False, shuffle=True, collate_fn=data_collator)

    train_loop(model, train_dataloader, val_dataloader, **kwargs)
    return model

### Type of the model

In [11]:

model_name = 'sberbank-ai/ruT5-base'
model = T5ForConditionalGeneration.from_pretrained(model_name).to(device)
model.gradient_checkpointing_enable()
model.config.use_cache = False
tokenizer = AutoTokenizer.from_pretrained(model_name)

### Training the model

In [None]:

steps = 10000
print(f'\n\n\n  train  {steps} \n=====================\n\n')
model = train_model(df['toxic_comment'].tolist(), df['neutral_comment'].tolist(), model_name=model_name, batch_size=18, max_epochs=15, max_steps=steps)
model.save_pretrained(f't5_base_train_10000')

### Inference

In [12]:
df = pd.read_csv('test.tsv', sep='\t')
toxic_inputs = df['toxic_comment'].tolist()

In [13]:
model.eval()
state_dict = torch.load('/home/nikolay_kalm/EDA_OCR/NLP/Task_3/t5_base_train_10000/pytorch_model.bin')
model.load_state_dict(state_dict)

<All keys matched successfully>

In [14]:
def paraphrase(text, model, n=None, max_length='auto', temperature=0.0, beams=3):
    # Accept a single string or a list of strings
    texts = [text] if isinstance(text, str) else text
    inputs = tokenizer(texts, return_tensors='pt', padding=True)['input_ids'].to(model.device)
    if max_length == 'auto':
        # Allow the output to be about 20% longer than the padded input, plus a small margin
        max_length = int(inputs.shape[1] * 1.2) + 10
    result = model.generate(
        inputs,
        num_return_sequences=n or 1,
        do_sample=False,  # deterministic beam search; temperature only matters when sampling
        temperature=temperature,
        repetition_penalty=3.0,
        max_length=max_length,
        bad_words_ids=[[2]],  # unk
        num_beams=beams,
    )
    texts = [tokenizer.decode(r, skip_special_tokens=True) for r in result]
    # Return a plain string for a single input with the default number of outputs
    if not n and isinstance(text, str):
        return texts[0]
    return texts
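
The `'auto'` length rule caps generation at 1.2 times the padded input length plus 10 tokens. A quick illustration of the arithmetic follows; the helper `auto_max_length` exists only for this example.

```python
def auto_max_length(input_length):
    """Generation length cap applied when max_length='auto'."""
    return int(input_length * 1.2) + 10


for n in (5, 20, 100):
    print(n, auto_max_length(n))
```

The constant margin matters mostly for short inputs, where a 20% allowance alone would leave almost no room for a longer paraphrase.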

In [41]:
print(paraphrase(['Пошел нахуй'], model, temperature=50.0, beams=10))

['Уходи.']


In [30]:
print(paraphrase(['Я сейчас пью кофе и мне так ахуено, по всему телу энергия проходит, и такое чувство что даже простой разговор с тобой становится в 10 раз лучше'], model, temperature=50.0, beams=10))

['Я сейчас пью кофе и мне так хорошо, по всему телу энергия проходит, и такое чувство что даже простой разговор с тобой становится в 10 раз лучше']


In [45]:
print(paraphrase(['Блять я заебался жить в зиме'], model, temperature=50.0, beams=10))

['Я устал жить в зиме']


In [None]:
para_results = []
batch_size = 8

for i in tqdm(range(0, len(toxic_inputs), batch_size)):
    batch = toxic_inputs[i:i + batch_size]
    try:
        para_results.extend(paraphrase(batch, model, temperature=0.0))
    except Exception as e:
        # Fall back to the unchanged inputs so the outputs stay aligned with the test set
        print(i, e)
        para_results.extend(batch)
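
To keep the outputs aligned with the inputs, a failed batch should contribute one entry per input sentence rather than a single nested list. A model-free sketch of that pattern follows, where `rewrite_batch` is a hypothetical stand-in for the `paraphrase` call and `flaky` is an invented rewriter that fails on one input.

```python
def rewrite_all(inputs, rewrite_batch, batch_size=8):
    """Rewrite inputs batch by batch; on failure keep the original sentences."""
    results = []
    for start in range(0, len(inputs), batch_size):
        batch = inputs[start:start + batch_size]
        try:
            results.extend(rewrite_batch(batch))
        except Exception as error:
            print(f"batch starting at {start} failed: {error}")
            results.extend(batch)  # one entry per input, so alignment is preserved
    return results


def flaky(batch):
    # Hypothetical rewriter that fails whenever the batch contains "bad"
    if "bad" in batch:
        raise RuntimeError("generation failed")
    return [sentence + "!" for sentence in batch]


print(rewrite_all(["a", "b", "bad", "c"], flaky, batch_size=2))  # ['a!', 'b!', 'bad', 'c']
```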

In [34]:
with open('t5_test.txt', 'w', encoding='utf-8') as file:
    file.writelines([sentence+'\n' for sentence in para_results])