In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import re

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

As we have seen from the EDA, the main problem with our data is that the classes are unbalanced: toxic comments make up only about 10% of the whole dataset.
This hurts training a lot. Let me explain how.

If the classes are balanced, then every training mini-batch is very likely to contain approximately equal numbers of positive and negative examples, so the model regularly gets an equally strong signal about what to change for both classes.

But if the class ratio is 1 to 9, as in our case, the model gets much less signal from the positive class than from the negative one. This can lead to a situation where the model treats all or most samples as negative, because misclassified positive examples contribute too little to the loss: there are only a few of them in a mini-batch compared to the negative class.
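To make the argument above concrete, here is a minimal sketch that counts positives per mini-batch for a 1-to-9 class ratio. The toy label array and the batch size of 32 are assumptions chosen for illustration; they are not the competition data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy labels: 10% toxic (1), 90% non-toxic (0).
labels = np.array([1] * 2_000 + [0] * 18_000)
rng.shuffle(labels)

batch_size = 32  # assumed value, only for this illustration
n_batches = len(labels) // batch_size
batches = labels[: n_batches * batch_size].reshape(n_batches, batch_size)

# How many positive examples does each mini-batch contain?
positives_per_batch = batches.sum(axis=1)
print("mean positives per batch:", positives_per_batch.mean())
print("batches with at most one positive:", int((positives_per_batch <= 1).sum()))
```

On average only about 3 of the 32 samples in a batch are toxic, and some batches contain almost none, which is the weak and irregular signal described above.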

Here are some links where this problem is briefly discussed:

https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
https://towardsdatascience.com/handling-imbalanced-datasets-in-machine-learning-7a0e84220f28

I will summarize what helped me the most: my aim is to make classes balanced for training

The first and most intuitive approach is to change the loss function so that errors on toxic examples are penalized more heavily than errors on non-toxic ones.

Let's recall what binary cross-entropy looks like for a single observation: loss = -(y_i * ln(p_i) + (1 - y_i) * ln(1 - p_i)), where y_i is the true class (either 1 or 0) and
p_i is the estimated probability that the class is 1. We can change this loss to: loss = -(alpha_1 * y_i * ln(p_i) + alpha_2 * (1 - y_i) * ln(1 - p_i)),
where the alphas are multipliers that send a larger signal to the model for errors on the positive class and a smaller one for errors on the negative class.
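As a small sketch of the weighted loss, assuming the conventional leading minus sign of cross-entropy, the per-sample value can be computed with NumPy as follows. The function name `weighted_bce`, the `eps` clipping constant and the example weights are assumptions made for this illustration.

```python
import numpy as np

def weighted_bce(y_true, p_pred, alpha_1=1.0, alpha_2=1.0, eps=1e-7):
    """Per-sample binary cross-entropy with separate class multipliers.

    alpha_1 scales the term for the positive class (y = 1),
    alpha_2 scales the term for the negative class (y = 0).
    """
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities so that log(0) never occurs.
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)
    return -(alpha_1 * y_true * np.log(p)
             + alpha_2 * (1.0 - y_true) * np.log(1.0 - p))

# One toxic and one non-toxic comment, both predicted with p = 0.2.
y = [1, 0]
p = [0.2, 0.2]
print(weighted_bce(y, p))                            # unweighted loss
print(weighted_bce(y, p, alpha_1=5.0, alpha_2=0.5))  # errors on toxic cost more
```

With alpha_1 > alpha_2, the missed toxic comment contributes far more to the loss than the correctly handled non-toxic one.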

You can check this guide to learn more about it: https://www.tensorflow.org/tutorials/structured_data/imbalanced_data#calculate_class_weights

We can write such a loss function manually, or we can use the class_weight option of the model.fit method.

It should look like this:

initialize the class weights as a dictionary mapping class index to weight: class_weights = {0: alpha_2, 1: alpha_1}

and then pass it to model.fit(..., class_weight=class_weights), which automatically applies the class weighting to the loss.
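One common way to pick the weights, in the spirit of the TensorFlow guide linked above, is to make them inversely proportional to class frequency. The sketch below assumes illustrative counts of 200k non-toxic and 20k toxic comments; the helper name `balanced_class_weights` is hypothetical.

```python
def balanced_class_weights(n_neg, n_pos):
    """Weights inversely proportional to class frequency.

    Dividing the total by 2 keeps the overall loss magnitude
    roughly the same as without weighting.
    """
    total = n_neg + n_pos
    return {0: total / (2.0 * n_neg), 1: total / (2.0 * n_pos)}

# Illustrative counts, not the exact competition numbers.
class_weight = balanced_class_weights(n_neg=200_000, n_pos=20_000)
print(class_weight)

# The dictionary can then be passed to Keras:
# model.fit(..., class_weight=class_weight)
```

For these counts the toxic class gets a weight of 5.5 and the non-toxic class 0.55, so an error on a toxic comment counts ten times as much.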

However, this approach has its drawbacks. Although the signal is strong on every positive prediction error, it is not regular.
The model makes big weight updates when there are some positive examples in the batch and almost no updates when there are
no or only a few positive samples in the batch. This leads to the next two intuitive approaches:
1. Oversampling
2. Downsampling

Oversampling means copying n random samples of the smaller class to make the dataset balanced.

Downsampling means keeping only a random subsample of the examples of the larger class.
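Both ideas can be sketched on index arrays with NumPy. The toy label array below (100 positives, 900 negatives) is an assumption for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy labels with a 1-to-9 class ratio.
labels = np.array([1] * 100 + [0] * 900)
pos_idx = np.where(labels == 1)[0]
neg_idx = np.where(labels == 0)[0]

# Oversampling: draw positives WITH replacement until they match the negatives.
over_pos = rng.choice(pos_idx, size=len(neg_idx), replace=True)
oversampled = np.concatenate([over_pos, neg_idx])

# Downsampling: keep a random subset of negatives, WITHOUT replacement.
down_neg = rng.choice(neg_idx, size=len(pos_idx), replace=False)
downsampled = np.concatenate([pos_idx, down_neg])

print(len(oversampled), labels[oversampled].mean())
print(len(downsampled), labels[downsampled].mean())
```

Both index sets are balanced, but the oversampled one has 1800 rows with many repeated positives, while the downsampled one keeps only 200 rows.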

Both techniques provide a regular signal on both classes during training. However, the following problems arise.
Downsampling hurts training because it reduces the amount of data: in our case we would need to throw away about 180k samples in the small dataset
and about 1.5M in the large dataset to balance the data. That's a lot!
Oversampling produces more data, which is good, but it can also lead to overfitting and poor generalization, as the model sees
only a few examples repeated many times. It is doubtful that those examples are representative of all possible variants, but
the more the model sees only such examples, the more confident it becomes that these are the only possible ones.

My decision for tackling this problem is to apply something in between. Let me show how.

Assume we have train_ids (the encoded comment text) and train_labels (1 for toxic and 0 for non-toxic).

We first split the data into positive and negative classes

    pos = train_ids[np.where(train_labels == 1)[0]]
    neg = train_ids[np.where(train_labels == 0)[0]]

    pos_labels = train_labels[np.where(train_labels==1)[0]]
    neg_labels = train_labels[np.where(train_labels==0)[0]]


    def make_ds(features, labels):

        ds = tf.data.Dataset.from_tensor_slices((features, labels))#.cache()
    
        ds = ds.shuffle(1500000).repeat()
    
        return ds
    
Then create a tf.data.Dataset for the positive and the negative data
    
    pos_ds = make_ds(pos, pos_labels)
    neg_ds = make_ds(neg, neg_labels)

And then we use the sample_from_datasets method, which draws from pos_ds and neg_ds in the proportions passed to the weights option

    resampled_ds = tf.data.experimental.sample_from_datasets([pos_ds, neg_ds], weights=[0.5, 0.5])
    resampled_ds = resampled_ds.batch(BATCH_SIZE).prefetch(AUTO)

And finally resampled_ds is passed into model.fit method 
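To illustrate the idea behind this kind of sampling without TensorFlow, here is a pure-Python imitation: for every item, one of several endlessly repeating streams is picked at random according to the weights. This is only a sketch of the concept, not TensorFlow's implementation, and the function name `sample_from_streams` and the toy comments are hypothetical.

```python
import itertools
import random

def sample_from_streams(streams, weights, seed=0):
    """Yield items by picking a source stream at random for every item."""
    rng = random.Random(seed)
    iterators = [iter(stream) for stream in streams]
    indices = range(len(iterators))
    while True:
        chosen = rng.choices(indices, weights=weights)[0]
        yield next(iterators[chosen])

# cycle() plays the role of .repeat(): the small positive set never runs out.
pos_stream = itertools.cycle([("toxic comment %d" % i, 1) for i in range(3)])
neg_stream = itertools.cycle([("clean comment %d" % i, 0) for i in range(30)])

mixed = sample_from_streams([pos_stream, neg_stream], weights=[0.5, 0.5])
sample = list(itertools.islice(mixed, 1000))
toxic_share = sum(label for _, label in sample) / len(sample)
print("share of toxic items:", toxic_share)
```

Although the toy positive set is ten times smaller than the negative one, roughly half of the drawn items are toxic, because the positive stream is simply repeated more often.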

This method may work as downsampling as well as upsampling, depending on batch_size and steps_per_epoch during training

Let's consider an example.

We have 20k toxic samples and 200k non-toxic samples.

batch_size = 256

Suppose steps_per_epoch == len(train_ids) // batch_size and every batch has about 0.5 * batch_size = 128 toxic and 128 non-toxic samples.

Then all the toxic data will have been passed to the model by step 157 of training, while we defined 859 steps per epoch. That means already used toxic samples will be passed to the model repeatedly, which is equivalent to plain upsampling.

If we define steps_per_epoch == len(pos) // (batch_size // 2), so that the epoch ends when the unique toxic samples run out, then it is equivalent to downsampling.

So my decision is to set steps_per_epoch somewhere in the middle, so that as much non-toxic data as possible is included and toxic comments are not copied 10 times.
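The arithmetic of this example can be checked with a few lines of Python. The midpoint used for `steps_middle` is only an assumption for illustration; the text above does not say which exact value was used.

```python
n_pos, n_neg = 20_000, 200_000
batch_size = 256
pos_per_batch = batch_size // 2  # about 128 toxic samples per balanced batch

# Upsampling-like setting: one "epoch" covers len(train_ids) samples.
steps_full = (n_pos + n_neg) // batch_size
# Downsampling-like setting: stop when the unique toxic samples run out.
steps_unique_pos = n_pos // pos_per_batch
# Hypothetical compromise: the midpoint between the two.
steps_middle = (steps_full + steps_unique_pos) // 2

def pos_repeats(steps):
    """How many times, on average, each toxic sample is seen per epoch."""
    return steps * pos_per_batch / n_pos

def neg_coverage(steps):
    """Share of the non-toxic data that is drawn per epoch."""
    return steps * pos_per_batch / n_neg

for name, steps in [("full", steps_full),
                    ("unique positives", steps_unique_pos),
                    ("middle", steps_middle)]:
    print(name, steps, round(pos_repeats(steps), 2), round(neg_coverage(steps), 2))
```

With the full setting each toxic comment is seen about 5.5 times per epoch, with the short setting only about 10% of the non-toxic data is used, and the midpoint lands between the two.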

# Data augmentation

Although we have found a way to make training balanced, more unique data always makes things better. When we don't have more unique data, we can create it by augmenting the given samples.

Here is a quick overview of the methods I want to propose: https://towardsdatascience.com/these-are-the-easiest-data-augmentation-techniques-in-natural-language-processing-you-can-think-of-88e393fd610

These are:
1. Synonym Replacement: Randomly choose n words from the sentence that are not stop words. Replace each of these words with one of its synonyms chosen at random.
2. Random Insertion: Find a random synonym of a random word in the sentence that is not a stop word. Insert that synonym into a random position in the sentence. Do this n times
3. Random Swap: Randomly choose two words in the sentence and swap their positions. Do this n times
4. Random Deletion: Randomly remove each word in the sentence with probability p

I've taken these functions from the authors' GitHub: https://github.com/jasonwei20/eda_nlp
Nice job done there!

In [None]:
len(ds) * 5

In [None]:
import random
from random import shuffle
random.seed(1)

stop_words = ['i', 'me', 'my', 'myself', 'we', 'our', 
            'ours', 'ourselves', 'you', 'your', 'yours', 
            'yourself', 'yourselves', 'he', 'him', 'his', 
            'himself', 'she', 'her', 'hers', 'herself', 
            'it', 'its', 'itself', 'they', 'them', 'their', 
            'theirs', 'themselves', 'what', 'which', 'who', 
            'whom', 'this', 'that', 'these', 'those', 'am', 
            'is', 'are', 'was', 'were', 'be', 'been', 'being', 
            'have', 'has', 'had', 'having', 'do', 'does', 'did',
            'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or',
            'because', 'as', 'until', 'while', 'of', 'at', 
            'by', 'for', 'with', 'about', 'against', 'between',
            'into', 'through', 'during', 'before', 'after', 
            'above', 'below', 'to', 'from', 'up', 'down', 'in',
            'out', 'on', 'off', 'over', 'under', 'again', 
            'further', 'then', 'once', 'here', 'there', 'when', 
            'where', 'why', 'how', 'all', 'any', 'both', 'each', 
            'few', 'more', 'most', 'other', 'some', 'such', 'no', 
            'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 
            'very', 's', 't', 'can', 'will', 'just', 'don', 
            'should', 'now', '']

import re
def get_only_chars(line):

    clean_line = ""

    line = line.replace("’", "")
    line = line.replace("'", "")
    line = line.replace("-", " ") #replace hyphens with spaces
    line = line.replace("\t", " ")
    line = line.replace("\n", " ")
    line = line.lower()

    for char in line:
        if char in 'qwertyuiopasdfghjklzxcvbnm ':
            clean_line += char
        else:
            clean_line += ' '

    clean_line = re.sub(' +',' ',clean_line) #delete extra spaces
    #if clean_line[0] == ' ':
    #    clean_line = clean_line[1:]
    return clean_line


from nltk.corpus import wordnet

def synonym_replacement(words, n):
    new_words = words.copy()
    random_word_list = list(set([word for word in words if word not in stop_words]))
    random.shuffle(random_word_list)
    num_replaced = 0
    for random_word in random_word_list:
        synonyms = get_synonyms(random_word)
        if len(synonyms) >= 1:
            synonym = random.choice(list(synonyms))
            new_words = [synonym if word == random_word else word for word in new_words]
            #print("replaced", random_word, "with", synonym)
            num_replaced += 1
        if num_replaced >= n: #only replace up to n words
            break

    #re-split on spaces, since a synonym may consist of several words
    sentence = ' '.join(new_words)
    new_words = sentence.split(' ')

    return new_words

def get_synonyms(word):
    synonyms = set()
    for syn in wordnet.synsets(word): 
        for l in syn.lemmas(): 
            synonym = l.name().replace("_", " ").replace("-", " ").lower()
            synonym = "".join([char for char in synonym if char in ' qwertyuiopasdfghjklzxcvbnm'])
            synonyms.add(synonym) 
    if word in synonyms:
        synonyms.remove(word)
    return list(synonyms)

def random_deletion(words, p):

    #obviously, if there's only one word, don't delete it
    if len(words) == 1:
        return words

    #randomly delete words with probability p
    new_words = []
    for word in words:
        r = random.uniform(0, 1)
        if r > p:
            new_words.append(word)

    #if you end up deleting all words, just return a random word
    if len(new_words) == 0:
        rand_int = random.randint(0, len(words)-1)
        return [words[rand_int]]

    return new_words

def random_swap(words, n):
    new_words = words.copy()
    for _ in range(n):
        new_words = swap_word(new_words)
    return new_words

def swap_word(new_words):
    random_idx_1 = random.randint(0, len(new_words)-1)
    random_idx_2 = random_idx_1
    counter = 0
    while random_idx_2 == random_idx_1:
        random_idx_2 = random.randint(0, len(new_words)-1)
        counter += 1
        if counter > 3:
            return new_words
    new_words[random_idx_1], new_words[random_idx_2] = new_words[random_idx_2], new_words[random_idx_1] 
    return new_words

def random_insertion(words, n):
    new_words = words.copy()
    for _ in range(n):
        add_word(new_words)
    return new_words

def add_word(new_words):
    synonyms = []
    counter = 0
    while len(synonyms) < 1:
        random_word = new_words[random.randint(0, len(new_words)-1)]
        synonyms = get_synonyms(random_word)
        counter += 1
        if counter >= 10:
            return
    random_synonym = synonyms[0]
    random_idx = random.randint(0, len(new_words)-1)
    new_words.insert(random_idx, random_synonym)

def eda(sentence, alpha_sr=0.1, alpha_ri=0.1, alpha_rs=0.1, p_rd=0.1, num_aug=9):

    sentence = get_only_chars(sentence)
    words = sentence.split(' ')
    words = [word for word in words if word != '']
    num_words = len(words)

    if num_words == 0:
        return [""] * num_aug
    augmented_sentences = []
    num_new_per_technique = int(num_aug/4)+1
    n_sr = max(1, int(alpha_sr*num_words))
    n_ri = max(1, int(alpha_ri*num_words))
    n_rs = max(1, int(alpha_rs*num_words))

    #sr
    for _ in range(num_new_per_technique):
        a_words = synonym_replacement(words, n_sr)
        augmented_sentences.append(' '.join(a_words))

    #ri
    for _ in range(num_new_per_technique):
        a_words = random_insertion(words, n_ri)
        augmented_sentences.append(' '.join(a_words))

    #rs
    for _ in range(num_new_per_technique):
        a_words = random_swap(words, n_rs)
        augmented_sentences.append(' '.join(a_words))

    #rd
    for _ in range(num_new_per_technique):
        a_words = random_deletion(words, p_rd)
        augmented_sentences.append(' '.join(a_words))

    augmented_sentences = [get_only_chars(sentence) for sentence in augmented_sentences]
    shuffle(augmented_sentences)

    #trim so that we have the desired number of augmented sentences
    if num_aug >= 1:
        augmented_sentences = np.random.choice(augmented_sentences, num_aug, replace=False)
    else:
        keep_prob = num_aug / len(augmented_sentences)
        augmented_sentences = [s for s in augmented_sentences if random.uniform(0, 1) < keep_prob]


    return augmented_sentences

In [None]:
#let's look at what the function does
sentence = "Hello! This function creates new examples from data"
augs = eda(sentence)
for element in augs:
    print(element)
    print()

Seems like only slight changes have been made. Let's raise the parameters

In [None]:
sentence = " Hello! This function creates new examples from data"
augs = eda(sentence, alpha_sr=0.5, alpha_ri=0.1, alpha_rs=0.1, p_rd=0.1)
for element in augs:
    print(element)
    print()

In [None]:
#try with toxic
sentence = "What a fuck are you doing man. It's a fucking bullshit"
augs = eda(sentence, alpha_sr=0.3, alpha_ri=0.3, alpha_rs=0.3, p_rd=0.3)
for element in augs:
    print(element)
    print()

In [None]:
#try with toxic
sentence = "What a fuck are you doing man. It's a fucking bullshit"
augs = eda(sentence, alpha_sr=0.3, alpha_ri=0.2, alpha_rs=0.2, p_rd=0.2)
for element in augs:
    print(element)
    print()

And also I've written my own function to augment the dataset

In [None]:
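def augment_data(toxic_df, alpha_sr=0.3, alpha_ri=0.2, alpha_rs=0.2, p_rd=0.2, num_aug=4):
    pos = 0
    sentences = []
    while pos < len(toxic_df):
        #process the comments in chunks of 10000 to report progress
        for sent in toxic_df["comment_text"].iloc[pos:pos+10000]:
            #pass the augmentation parameters through to eda
            sentences.extend(eda(sent, alpha_sr=alpha_sr, alpha_ri=alpha_ri,
                                 alpha_rs=alpha_rs, p_rd=p_rd, num_aug=num_aug))
        pos += 10000
        print("Processed", min(pos, len(toxic_df)), "sentences")
    #every augmented sentence comes from a toxic comment, so its label is 1
    labels = len(sentences) * [1]
    return sentences, labels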

In [None]:
large_ds = pd.read_csv("../input/jigsaw-multilingual-toxic-comment-classification/jigsaw-unintended-bias-train.csv", usecols=["comment_text","toxic"]).query("toxic > 0.5")
small_ds = pd.read_csv("../input/jigsaw-multilingual-toxic-comment-classification/jigsaw-toxic-comment-train.csv", usecols=["comment_text","toxic"]).query("toxic==1")

ds = pd.concat((large_ds,small_ds))
ds["length"] = ds.comment_text.str.split().apply(len)
del large_ds
del small_ds
ds = ds[ds["length"] > 0]

In [None]:
aug_sents, aug_labels = augment_data(ds)

In [None]:
aug_df = pd.DataFrame({"comment_text":aug_sents, "toxic":aug_labels})
aug_df.to_csv("aug.csv")

# Backtranslating

Another way to get more toxic data is back-translation, for example via the Google API. This works because the result of back-translation (from English to Russian and back to English) differs slightly from the original sentence, while the meaning is preserved.

You can use the following code with your own Google API credentials to do it (you'll need an active account)

    from google.oauth2 import service_account
    from google.cloud import translate_v2 as translate
    
    credentials = service_account.Credentials.from_service_account_file("PATH_TO_YOUR_JSON.json")
    translate_client = translate.Client(credentials=credentials)
    text = "Hello you motherfucking son of a bitch? How are you doing?"
    print(text)
    result = translate_client.translate(
        text, target_language="ru")["translatedText"]
    print(result)
    back_res = translate_client.translate(result, target_language="en")["translatedText"]
    print(back_res)
    
    def back_translate_sent(text, lang="ru"):
        # translate to the pivot language and then back to English ("en")
        res = translate_client.translate(text, target_language=lang)["translatedText"]
        back_res = translate_client.translate(res, target_language="en")["translatedText"]
        return back_res

    def back_translate_n(text, lang_list=("ru", "tr", "it")):
        # one back-translated variant per pivot language
        sentences = []
        for l in lang_list:
            sentences.append(back_translate_sent(text, lang=l))
        return sentences
    
    def back_translate_data(ds):
        pos = 0
        sentences = []
        while pos < len(ds):
            for sent in ds["comment_text"].iloc[pos:pos+10000]:
                # extend keeps sentences a flat list, so labels match it in length
                sentences.extend(back_translate_n(sent))
            pos += 10000
            print("Processed", min(pos, len(ds)), "sentences")
        labels = len(sentences) * [1]
        return sentences, labels



One more idea for balancing classes and training is to use the comments we have, but translated into the languages of the test data. The models we use can handle multilingual inputs, and maybe it will improve the training results.

Here is the link to the public dataset with translated comments for this competition: https://www.kaggle.com/miklgr500/jigsaw-train-multilingual-coments-google-api