# Data Preprocessing and Generation

This notebook walks through preprocessing the raw data and generating the derived datasets.

Important note: the notebook uses a pseudo-absolute path (it changes the working directory with `os.chdir('..')`), so it should be run only once per kernel session. To run it a second time, restart the kernel first.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import torch
import numpy as np

In [2]:
import os

# First, move up one level so that paths such as data/... and the src package resolve
os.chdir('..')
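
The run-once restriction comes from the unconditional `os.chdir('..')`: every re-run climbs one more level. A minimal sketch of a guard that makes the cell safe to re-run is shown below. The helper name `ensure_project_root` and the choice of the `src` folder as a marker are assumptions made for illustration, not part of the project.

```python
import os
from pathlib import Path


def ensure_project_root(marker: str = "src") -> Path:
    """Move up one directory only if `marker` is not already visible.

    `marker` is an assumed name of a folder that lives in the project root.
    """
    cwd = Path.cwd()
    # Change directory only when the marker is missing here but present one level up.
    if not (cwd / marker).is_dir() and (cwd.parent / marker).is_dir():
        os.chdir(cwd.parent)
    return Path.cwd()
```

Calling `ensure_project_root()` a second time leaves the working directory unchanged, because the marker folder is then already visible.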

In [None]:
def manual_seed(seed):
    """
    Function to set the seed value for reproducibility
    :param seed: seed value
    :return: None
    """
    # PyTorch manual seed
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)

    # NumPy manual seed
    np.random.seed(seed)

# Set the seed value
seed = 42

# Call the manual seeding function
manual_seed(seed)
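
As a quick sanity check of what seeding buys, the sketch below re-seeds NumPy and confirms that the same random numbers come back. Only the NumPy part is shown so that it runs without PyTorch or a GPU; the variable names are illustrative.

```python
import numpy as np

# Draw three numbers, re-seed with the same value, and draw again.
np.random.seed(42)
first = np.random.rand(3)

np.random.seed(42)
second = np.random.rand(3)

# Identical seeds replay the identical sequence.
print(np.array_equal(first, second))
```

Note that Python's built-in `random` module keeps its own generator state, which `manual_seed` above does not touch.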

# Loading the raw data

In [3]:
data_path = "data/raw/filtered.tsv"

# Load the data
df = pd.read_csv(data_path, sep='\t', index_col=0)
df

Unnamed: 0,reference,translation,similarity,lenght_diff,ref_tox,trn_tox
0,"If Alkar is flooding her with psychic waste, t...","if Alkar floods her with her mental waste, it ...",0.785171,0.010309,0.014195,0.981983
1,Now you're getting nasty.,you're becoming disgusting.,0.749687,0.071429,0.065473,0.999039
2,"Well, we could spare your life, for one.","well, we can spare your life.",0.919051,0.268293,0.213313,0.985068
3,"Ah! Monkey, you've got to snap out of it.","monkey, you have to wake up.",0.664333,0.309524,0.053362,0.994215
4,I've got orders to put her down.,I have orders to kill her.,0.726639,0.181818,0.009402,0.999348
...,...,...,...,...,...,...
577772,You didn't know that Estelle had stolen some f...,you didn't know that Estelle stole your fish f...,0.870322,0.030769,0.000121,0.949143
577773,It'il suck the life out of you!,you'd be sucked out of your life!,0.722897,0.058824,0.996124,0.215794
577774,"I can't fuckin' take that, bruv.",I really can't take this.,0.617511,0.212121,0.984538,0.000049
577775,They called me a fucking hero. The truth is I ...,"they said I was a hero, but I didn't care.",0.679613,0.358209,0.991945,0.000124
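
To illustrate what `sep='\t'` and `index_col=0` do, here is a small sketch that parses an in-memory stand-in for the TSV file. The sample rows are invented, and the assumption that the real file starts with an unnamed index column is inferred from the call above.

```python
import io

import pandas as pd

# Toy stand-in for data/raw/filtered.tsv: the first, unnamed column is the row index.
raw = (
    "\treference\ttranslation\tref_tox\ttrn_tox\n"
    "0\tfirst text\tfirst paraphrase\t0.01\t0.98\n"
    "1\tsecond text\tsecond paraphrase\t0.97\t0.02\n"
)

# sep="\t" splits on tabs; index_col=0 turns the first column into the index
# instead of keeping it as a data column.
toy = pd.read_csv(io.StringIO(raw), sep="\t", index_col=0)

print(toy.columns.tolist())
print(toy.index.tolist())
```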


# Clean up

In [4]:
df['more_toxic'] = df['ref_tox'] >= df['trn_tox']


# Rows where the translation is the more toxic side: swap trn_tox with ref_tox
# and reference with translation, so that reference always holds the more toxic text
swap = ~df['more_toxic']
df.loc[swap, ['ref_tox', 'trn_tox']] = df.loc[swap, ['trn_tox', 'ref_tox']].values
df.loc[swap, ['reference', 'translation']] = df.loc[swap, ['translation', 'reference']].values
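
The conditional swap is easier to follow on a two-row toy frame. The sketch below uses invented values and the same column names as the raw data; it is an illustration of the technique, not a replacement for the cell above.

```python
import pandas as pd

toy = pd.DataFrame({
    "reference": ["polite text", "rude text"],
    "translation": ["rude paraphrase", "polite paraphrase"],
    "ref_tox": [0.1, 0.9],
    "trn_tox": [0.8, 0.2],
})

# Rows where the translation is more toxic than the reference.
swap = toy["ref_tox"] < toy["trn_tox"]

# .values strips the column labels, so the assignment is positional.
# Without it, pandas would align on column names and nothing would move.
toy.loc[swap, ["ref_tox", "trn_tox"]] = toy.loc[swap, ["trn_tox", "ref_tox"]].values
toy.loc[swap, ["reference", "translation"]] = toy.loc[swap, ["translation", "reference"]].values

print(toy)
```

After the swap, row 0 has its two texts and its two scores exchanged, while row 1 is left as it was.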


In [5]:
# Recompute the flag after the swap (it should now be True for every row)
df['more_toxic'] = df['ref_tox'] >= df['trn_tox']

# get the toxicity difference
df['detox_amount'] = df['ref_tox'] - df['trn_tox']

# rename
df = df.rename(columns={'reference': 'toxic_text', 'translation': 'de-toxic_text', 'ref_tox': 'init_toxicity', 'trn_tox': 'detox_toxicity'})

# drop the columns we don't need
df = df.drop(columns=['more_toxic', 'lenght_diff', 'similarity'])

# Capitalize the first character of each text (note: str.capitalize() also lowercases the rest)
df['toxic_text'] = df['toxic_text'].str.capitalize()
df['de-toxic_text'] = df['de-toxic_text'].str.capitalize()

df

Unnamed: 0,toxic_text,de-toxic_text,init_toxicity,detox_toxicity,detox_amount
0,"If alkar floods her with her mental waste, it ...","If alkar is flooding her with psychic waste, t...",0.981983,0.014195,0.967788
1,You're becoming disgusting.,Now you're getting nasty.,0.999039,0.065473,0.933567
2,"Well, we can spare your life.","Well, we could spare your life, for one.",0.985068,0.213313,0.771755
3,"Monkey, you have to wake up.","Ah! monkey, you've got to snap out of it.",0.994215,0.053362,0.940853
4,I have orders to kill her.,I've got orders to put her down.,0.999348,0.009402,0.989946
...,...,...,...,...,...
577772,You didn't know that estelle stole your fish f...,You didn't know that estelle had stolen some f...,0.949143,0.000121,0.949022
577773,It'il suck the life out of you!,You'd be sucked out of your life!,0.996124,0.215794,0.780331
577774,"I can't fuckin' take that, bruv.",I really can't take this.,0.984538,0.000049,0.984489
577775,They called me a fucking hero. the truth is i ...,"They said i was a hero, but i didn't care.",0.991945,0.000124,0.991822
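
The output above shows a side effect of `str.capitalize()`: everything after the first character is lowercased ("alkar", "i"). The sketch below contrasts it with one possible alternative that only uppercases the first character. The alternative is a suggestion under the assumption that the original casing should be preserved; it is not what the notebook does.

```python
import pandas as pd

texts = pd.Series([
    "Ah! Monkey, you've got to snap out of it.",
    "they said I was a hero.",
])

# What the notebook uses: first character upper, everything else lower.
capitalized = texts.str.capitalize()

# Alternative: uppercase only the first character and keep the rest untouched.
first_only = texts.str[:1].str.upper() + texts.str[1:]

print(capitalized.tolist())
print(first_only.tolist())
```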


We don't need to drop any rows: as the summary below shows, every pair has a toxicity reduction (`detox_amount`) of at least 0.5.

In [6]:
print(df['detox_amount'].describe())

count    577777.000000
mean          0.904659
std           0.126501
min           0.500002
25%           0.870397
50%           0.963144
75%           0.992266
max           0.999681
Name: detox_amount, dtype: float64
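
Had the minimum been lower, weakly detoxified pairs could be filtered out with a boolean mask. The sketch below shows the idea on invented values; the 0.5 cut-off is an assumption chosen to match the minimum observed above.

```python
import pandas as pd

toy = pd.DataFrame({
    "init_toxicity": [0.98, 0.75, 0.99],
    "detox_toxicity": [0.01, 0.40, 0.30],
})
toy["detox_amount"] = toy["init_toxicity"] - toy["detox_toxicity"]

threshold = 0.5  # assumed cut-off, not taken from the project
keep = toy["detox_amount"] >= threshold
filtered = toy[keep]

print(f"kept {keep.sum()} of {len(toy)} rows")
```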


# Save and generate the data
Now we save the data to a new file. Building the GPT-2 corpus takes about 2 hours on a GTX 1660.
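
The runtime figure can be cross-checked from the numbers this notebook prints: the row count from `describe()` and the throughput shown by the progress bar further down. A rough sketch, assuming the throughput stays constant over the whole run:

```python
from datetime import timedelta

rows = 577_777  # row count reported by describe()
rate = 85.13    # iterations per second reported by the progress bar

# Estimated wall-clock time at constant throughput.
estimate = timedelta(seconds=round(rows / rate))
print(estimate)
```

On different hardware, only `rate` needs to be replaced with a value measured on a short trial run.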

In [8]:
# Save to data/interm
df.to_csv('data/interm/filtered_preprocessed.csv')

In [9]:
# Get the device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [10]:
from src.data.data_preproc import generate_gpt2_corpus, generate_estimator_dataset

generate_gpt2_corpus('data/interm/filtered_preprocessed.csv', 'data/interm/gpt2_corpus.txt', estimator_token=False, device=device)

generate_estimator_dataset('data/interm/filtered_preprocessed.csv', 'data/interm/estimator_dataset.csv')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Generating corpus: 100%|██████████| 577777/577777 [1:53:06<00:00, 85.13it/s]


Generating estimator dataset ...
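
Since corpus generation takes close to two hours, it may be worth skipping it when the output file already exists. The helper below is a minimal sketch of that idea; `build_if_missing` is a hypothetical name, and it knows nothing about the internals of `generate_gpt2_corpus`.

```python
from pathlib import Path
from typing import Callable


def build_if_missing(output_path: str, build: Callable[[], None]) -> bool:
    """Run `build` only when `output_path` does not exist yet.

    Returns True if `build` was called, False if the file was already there.
    """
    if Path(output_path).exists():
        return False
    build()
    return True


# Possible use with the call from the cell above (left commented out here):
# build_if_missing(
#     'data/interm/gpt2_corpus.txt',
#     lambda: generate_gpt2_corpus('data/interm/filtered_preprocessed.csv',
#                                  'data/interm/gpt2_corpus.txt',
#                                  estimator_token=False, device=device),
# )
```

An existence check cannot tell a complete file from one left behind by an interrupted run, so a partial output should be deleted by hand before re-running.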
