## Data exploration
Let's examine the dataset provided with the assignment.

In [56]:
import pandas as pd

data_path = "../data/raw/filtered.tsv"

data = pd.read_csv(data_path, delimiter="\t")
data.head(5)

Unnamed: 0.1,Unnamed: 0,reference,translation,similarity,lenght_diff,ref_tox,trn_tox
0,0,"If Alkar is flooding her with psychic waste, t...","if Alkar floods her with her mental waste, it ...",0.785171,0.010309,0.014195,0.981983
1,1,Now you're getting nasty.,you're becoming disgusting.,0.749687,0.071429,0.065473,0.999039
2,2,"Well, we could spare your life, for one.","well, we can spare your life.",0.919051,0.268293,0.213313,0.985068
3,3,"Ah! Monkey, you've got to snap out of it.","monkey, you have to wake up.",0.664333,0.309524,0.053362,0.994215
4,4,I've got orders to put her down.,I have orders to kill her.,0.726639,0.181818,0.009402,0.999348


As we can see, the dataset is parallel: every reference sentence is paired with its paraphrase, so the normal and toxic corpora would be very similar.
Given the preprocessing described below, I was concerned that this would hurt model performance,
since the majority of words are copied from the reference to the translation.
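
To make the copying concern concrete, here is a minimal sketch that measures what share of the translation tokens also appears in the reference. The helpers `tokens` and `copied_share` and the regex tokenization are illustrative assumptions, not part of the assignment code; the two sentence pairs are rows 2 and 4 of the table above.

```python
import re

def tokens(text):
    # Lowercase and keep runs of letters, digits and apostrophes (an illustrative choice).
    return re.findall(r"[a-z0-9']+", text.lower())

def copied_share(reference, translation):
    """Fraction of translation tokens that also occur in the reference."""
    ref = set(tokens(reference))
    trn = tokens(translation)
    if not trn:
        return 0.0
    return sum(tok in ref for tok in trn) / len(trn)

pairs = [
    ("Well, we could spare your life, for one.", "well, we can spare your life."),
    ("I've got orders to put her down.", "I have orders to kill her."),
]
for reference, translation in pairs:
    print(f"{copied_share(reference, translation):.2f}", "|", translation)
```

Running the same function over the whole `data` frame would show how widespread the overlap is.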

That's why I decided to use [another dataset](https://github.com/s-nlp/detox/tree/0ebaeab817957bb5463819bec7fa4ed3de9a26ee/emnlp2021/data/train).
This dataset consists of 4 subsets:

In [48]:
with open('../data/external/train/train_toxic.txt', 'r') as f:
    train_toxic = f.read().split('\n')
with open('../data/external/train/train_normal.txt', 'r') as f:
    train_normal = f.read().split('\n')
with open('../data/external/test/test_toxic.txt', 'r') as f:
    test_toxic = f.read().split('\n')
with open('../data/external/test/test_normal.txt', 'r') as f:
    test_normal = f.read().split('\n')

print(f"train_toxic: {len(train_toxic) - 1} lines; example: '{train_toxic[0]}'")
print(f"train_normal: {len(train_normal) - 1} lines; example: '{train_normal[0]}'")
print(f"test_toxic: {len(test_toxic) - 1} lines; example: '{test_toxic[0]}'")
print(f"test_normal: {len(test_normal) - 1} lines; example: '{test_normal[0]}'")

train_toxic: 135390 lines; example: 'fuck you , you nixonian twit .'
train_normal: 135390 lines; example: 'just a comment regarding family trusts , they were a topical point in australia a few years ago .'
test_toxic: 10000 lines; example: 'you are a fucking idiot .'
test_normal: 10000 lines; example: 'zero chance of winning , either in the cpc , or in a general election .'


As observed, this dataset is simply one sentence per line.
The sentences are also already tokenized (e.g., punctuation marks are separated by spaces on both sides), which makes them easy to preprocess.
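
Because punctuation is already space-separated, a plain `str.split()` is enough to tokenize a line. A small sketch on the `test_normal` example printed above (the punctuation filter is an illustrative extra step, not something the notebook does):

```python
line = "zero chance of winning , either in the cpc , or in a general election ."

# Whitespace splitting already isolates punctuation as separate tokens.
tokens = line.split()
print(tokens)

# Dropping punctuation is then a simple filter over the tokens.
words = [tok for tok in tokens if any(ch.isalnum() for ch in tok)]
print(words)
```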

This dataset is smaller than the assignment dataset, but it is still large enough for the proposed model.

In [54]:
print("Sentences in assignment dataset:\t", len(data) * 2)
print("Sentences in that dataset:\t\t", len(train_toxic) + len(train_normal) + len(test_toxic) + len(test_normal) - 4)

Sentences in assignment dataset:	 1155554
Sentences in that dataset:		 290780


## Data preprocessing
The main ideas of the data preprocessing were inspired by [s-nlp condBERT](https://github.com/s-nlp/detox/blob/main/emnlp2021/style_transfer/condBERT/condbert_compile_vocab.ipynb).
### Word-to-coefficient
The first method trains a simple logistic regression to classify sentences as toxic or normal and uses its per-word coefficients as toxicity scores.

Read the data and build the vocabulary

In [20]:
tox_corpus_path = '../data/external/train/train_toxic.txt'
norm_corpus_path = '../data/external/train/train_normal.txt'

from collections import Counter
c = Counter()

for fn in [tox_corpus_path, norm_corpus_path]:
    with open(fn, 'r') as corpus:
        for line in corpus.readlines():
            for tok in line.strip().split():
                c[tok] += 1

print('Unique words in a dataset:', len(c))
vocab = set(c.keys())

Unique words in a dataset: 88645


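Replace any out-of-vocabulary word with `<unk>` (since the vocabulary was built from these same files, no word is actually replaced here)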

In [21]:
with open(tox_corpus_path, 'r') as tox_corpus, open(norm_corpus_path, 'r') as norm_corpus:
    corpus_tox = [' '.join([w if w in vocab else '<unk>' for w in line.strip().split()]) for line in tox_corpus.readlines()]
    corpus_norm = [' '.join([w if w in vocab else '<unk>' for w in line.strip().split()]) for line in norm_corpus.readlines()]

Let's train a logistic regression

In [22]:
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer

pipe = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
X_train = corpus_tox + corpus_norm
y_train = [1] * len(corpus_tox) + [0] * len(corpus_norm)
pipe.fit(X_train, y_train);

Here, each coefficient represents how strongly a word pushes the prediction towards the toxic class

In [23]:
coefs = pipe[1].coef_[0]
coefs.shape

(88529,)

Thus, it's reasonable to assume that a higher coefficient means higher toxicity.
Based on that, we get the following word-to-toxicity mapping:

In [26]:
word2coef = {w: coefs[idx] for w, idx in pipe[0].vocabulary_.items()}
word2coef['ass'], word2coef['as']

(6.485422836551968, -0.029894512222748748)
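
Note that `coefs` has 88529 entries, not the 88645 words counted earlier. A likely reason is that `CountVectorizer` tokenizes the text itself: by default it lowercases the input and keeps only tokens of two or more word characters (`token_pattern=r"(?u)\b\w\w+\b"`), so punctuation marks and one-letter words get no coefficient. The sketch below applies that default pattern with `re` to a sentence from the training data; it approximates the vectorizer's behaviour rather than calling it, and the variable names are illustrative.

```python
import re

# Default token pattern of scikit-learn's CountVectorizer.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

line = "just a comment regarding family trusts , they were a topical point in australia a few years ago ."

whitespace_tokens = line.split()
vectorizer_like_tokens = re.findall(TOKEN_PATTERN, line.lower())

# Tokens that whitespace splitting keeps but the default pattern drops.
dropped = [tok for tok in whitespace_tokens if tok not in vectorizer_like_tokens]
print(sorted(set(dropped)))
```

As a consequence, looking up such tokens in `word2coef` raises a `KeyError`, so a lookup with a default (e.g. `word2coef.get(tok, 0.0)`) may be safer.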

### Token toxicities by count
In the second method, the **number of occurrences** of each token in the toxic and normal datasets is counted.
Then, the token toxicity is computed as **token_toxicity = toxic_count / total_count**
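
As a quick illustration of the formula, here is a toy sketch on two made-up mini-corpora (the sentences and the helper `token_toxicity` are hypothetical). Like the counting code in this section, it starts every count at 1, so a token missing from one corpus does not lead to a division by zero or a hard 0/1 score.

```python
from collections import defaultdict

# Made-up examples, whitespace-tokenized for simplicity.
toxic_corpus = ["you are an idiot", "what an idiot"]
normal_corpus = ["you are right", "what a day"]

# Every count starts at 1 (add-one smoothing).
toxic_counter = defaultdict(lambda: 1)
normal_counter = defaultdict(lambda: 1)

for text in toxic_corpus:
    for token in text.split():
        toxic_counter[token] += 1
for text in normal_corpus:
    for token in text.split():
        normal_counter[token] += 1

def token_toxicity(token):
    # toxic_count / total_count, using the smoothed counts
    toxic = toxic_counter[token]
    normal = normal_counter[token]
    return toxic / (toxic + normal)

for token in ["idiot", "you", "day"]:
    print(token, round(token_toxicity(token), 2))
```

A token seen only in toxic sentences scores above 0.5, one seen equally often in both scores exactly 0.5, and one seen only in normal sentences scores below 0.5.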

Let's count the occurrences.

In [29]:
from collections import defaultdict
from tqdm import tqdm
from transformers import BertTokenizer

model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)

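# Counts start at 1 (add-one smoothing), so tokens unseen in a corpus never give a zero denominator
toxic_counter = defaultdict(lambda: 1)
nontoxic_counter = defaultdict(lambda: 1)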

for text in tqdm(corpus_tox):
    for token in tokenizer.encode(text):
        toxic_counter[token] += 1
for text in tqdm(corpus_norm):
    for token in tokenizer.encode(text):
        nontoxic_counter[token] += 1

100%|██████████| 135390/135390 [00:38<00:00, 3543.04it/s]
100%|██████████| 135390/135390 [00:41<00:00, 3280.15it/s]


Now, let's calculate toxicity for each token

In [36]:
token_toxicities = [toxic_counter[i] / (nontoxic_counter[i] + toxic_counter[i]) for i in range(len(tokenizer.vocab))]

id1 = tokenizer.encode('ass', add_special_tokens=False)[0]
id2 = tokenizer.encode('as', add_special_tokens=False)[0]
token_toxicities[id1], token_toxicities[id2]

(0.9459783913565426, 0.4957811528554934)

Also, let's apply the *log odds ratio* to make the toxicities more representative

In [38]:
import numpy as np

token_toxicities = np.array(token_toxicities)
token_toxicities = np.maximum(0, np.log(1/(1/token_toxicities-1)))

token_toxicities[id1], token_toxicities[id2]

(2.862835600087558, 0.0)
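
For reference, the transform applied above is `log(p / (1 - p))` clipped at zero, so every token that is more frequent in the normal corpus (`p < 0.5`) gets a score of 0. A scalar sketch using only the standard library (the function name `clipped_log_odds` is an illustrative assumption) can be used to check the two values shown:

```python
import math

def clipped_log_odds(p):
    # log(p / (1 - p)) is the same as log(1 / (1/p - 1)); negative values are clipped to 0.
    return max(0.0, math.log(p / (1.0 - p)))

print(clipped_log_odds(0.9459783913565426))  # toxicity of 'ass' before the transform
print(clipped_log_odds(0.4957811528554934))  # toxicity of 'as' before the transform
```

The clipping discards how strongly "normal" a token is, which is acceptable here because only the toxic side of the scale is of interest.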