In [1]:
import pandas as pd
import numpy as np

## Data exploration

In [2]:
df = pd.read_csv("../data/raw/filtered.tsv", sep="\t", index_col=0)

In [3]:
df.head(10)

Unnamed: 0,reference,translation,similarity,lenght_diff,ref_tox,trn_tox
0,"If Alkar is flooding her with psychic waste, t...","if Alkar floods her with her mental waste, it ...",0.785171,0.010309,0.014195,0.981983
1,Now you're getting nasty.,you're becoming disgusting.,0.749687,0.071429,0.065473,0.999039
2,"Well, we could spare your life, for one.","well, we can spare your life.",0.919051,0.268293,0.213313,0.985068
3,"Ah! Monkey, you've got to snap out of it.","monkey, you have to wake up.",0.664333,0.309524,0.053362,0.994215
4,I've got orders to put her down.,I have orders to kill her.,0.726639,0.181818,0.009402,0.999348
5,I'm not gonna have a child... ...with the same...,I'm not going to breed kids with a genetic dis...,0.703185,0.206522,0.950956,0.035846
6,"They're all laughing at us, so we'll kick your...",they're laughing at us. We'll show you.,0.618866,0.230769,0.999492,0.000131
7,Maine was very short on black people back then.,there wasn't much black in Maine then.,0.720482,0.1875,0.96368,0.14871
8,"Briggs, what the hell's happening?","Briggs, what the hell is going on?",0.920373,0.0,0.159096,0.841071
9,"Another one simply had no clue what to do, so ...","another simply didn't know what to do, so when...",0.87754,0.101695,0.055371,0.930472


In [4]:
len(df)

577777

It seems there is a typo in the ```lenght_diff``` column name, so let's fix it.

In [9]:
df = df.rename(columns={'lenght_diff': 'length_diff'})

For our task it doesn't matter whether the toxic sentence is the reference or the translation. We also want to work only with clearly toxic and clearly non-toxic sentences, so I suggest the following:
- Take all entries where ```ref_tox``` > ```toxicity_threshold``` and ```trn_tox``` < ```non_toxicity_threshold```, then rename the ```reference``` column to ```toxic``` and the ```translation``` column to ```non-toxic```.
- Take all entries where ```trn_tox``` > ```toxicity_threshold``` and ```ref_tox``` < ```non_toxicity_threshold```, then rename the ```reference``` column to ```non-toxic``` and the ```translation``` column to ```toxic```.
- Concatenate the two resulting tables.
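The three steps above can also be sketched as one helper, shown here on a tiny made-up table. The function name ```split_by_toxicity``` and the toy sentences are assumptions for illustration only; the column names and default thresholds are the ones used in this notebook.

```python
import pandas as pd

def split_by_toxicity(df, toxicity_threshold=0.95, non_toxicity_threshold=0.05):
    """Hypothetical helper: orient every kept pair as (toxic, non-toxic)."""
    columns = ['toxic', 'non-toxic', 'similarity', 'length_diff',
               'toxic_metric', 'non-toxic_metric']

    # Case 1: the reference is toxic and the translation is not.
    ref_toxic = df[(df['ref_tox'] > toxicity_threshold)
                   & (df['trn_tox'] < non_toxicity_threshold)]
    ref_toxic = ref_toxic.rename(columns={
        'reference': 'toxic', 'translation': 'non-toxic',
        'ref_tox': 'toxic_metric', 'trn_tox': 'non-toxic_metric'})

    # Case 2: the translation is toxic and the reference is not.
    trn_toxic = df[(df['trn_tox'] > toxicity_threshold)
                   & (df['ref_tox'] < non_toxicity_threshold)]
    trn_toxic = trn_toxic.rename(columns={
        'reference': 'non-toxic', 'translation': 'toxic',
        'ref_tox': 'non-toxic_metric', 'trn_tox': 'toxic_metric'})

    # Selecting the same column list puts both parts in the same order.
    return pd.concat([ref_toxic[columns], trn_toxic[columns]])

# Toy data (made up): one pair per direction plus one ambiguous pair.
toy = pd.DataFrame({
    'reference': ['a rude line', 'a polite line', 'a borderline line'],
    'translation': ['a calm line', 'an offensive line', 'another borderline line'],
    'similarity': [0.7, 0.8, 0.9],
    'length_diff': [0.1, 0.2, 0.0],
    'ref_tox': [0.99, 0.01, 0.50],
    'trn_tox': [0.01, 0.98, 0.60],
})
pairs = split_by_toxicity(toy)
print(pairs[['toxic', 'non-toxic']])
```

The ambiguous third pair is dropped because neither side crosses both thresholds.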

In [10]:
toxicity_threshold, non_toxicity_threshold = 0.95, 0.05
df_ref_tox = df[(df['ref_tox'] > toxicity_threshold) & (df['trn_tox'] < non_toxicity_threshold)]
df_ref_tox = df_ref_tox.rename(columns={'reference': 'toxic', 'translation': 'non-toxic', 'ref_tox':'toxic_metric', 'trn_tox':'non-toxic_metric'})
print(len(df_ref_tox))
df_ref_tox.head()

207661


Unnamed: 0,toxic,non-toxic,similarity,length_diff,toxic_metric,non-toxic_metric
5,I'm not gonna have a child... ...with the same...,I'm not going to breed kids with a genetic dis...,0.703185,0.206522,0.950956,0.035846
6,"They're all laughing at us, so we'll kick your...",they're laughing at us. We'll show you.,0.618866,0.230769,0.999492,0.000131
13,"Come on, Cal, leave that shit alone.","come on, Cal, put it down.",0.660481,0.27027,0.999637,0.000279
22,"Real life starts the first time you fuck, kid.","boy, real life starts up first.",0.866697,0.319149,0.998222,0.000114
25,"Shit, this one I can't even pronounce.","gosh, I can't even pronounce this.",0.777253,0.102564,0.997452,0.00012


In [11]:
toxicity_threshold, non_toxicity_threshold = 0.95, 0.05
df_trn_tox = df[(df['ref_tox'] < non_toxicity_threshold) & (df['trn_tox'] > toxicity_threshold)]
df_trn_tox = df_trn_tox.rename(columns={'reference': 'non-toxic', 'translation': 'toxic', 'ref_tox':'non-toxic_metric', 'trn_tox':'toxic_metric'})
print(len(df_trn_tox))
df_trn_tox.head()

142044


Unnamed: 0,non-toxic,toxic,similarity,length_diff,non-toxic_metric,toxic_metric
0,"If Alkar is flooding her with psychic waste, t...","if Alkar floods her with her mental waste, it ...",0.785171,0.010309,0.014195,0.981983
4,I've got orders to put her down.,I have orders to kill her.,0.726639,0.181818,0.009402,0.999348
10,I suppose you want me to buy you flowers and c...,you'd probably want me to buy you some chocola...,0.800661,0.16,7.8e-05,0.980341
14,So he's the Top dog.,he's the tallest son of a bitch.,0.611092,0.363636,0.00092,0.999639
15,I swore when I went out with Xander Harris... ...,"when I was dating Alex Harris, I swore I'd rat...",0.790565,0.148936,0.011613,0.996266


Now let's rearrange the columns of ```df_trn_tox``` so that they follow the same order as in ```df_ref_tox```.

In [12]:
df_trn_tox = df_trn_tox[['toxic', 'non-toxic', 'similarity', 'length_diff', 'toxic_metric', 'non-toxic_metric']]
print(len(df_trn_tox))
df_trn_tox.head()

142044


Unnamed: 0,toxic,non-toxic,similarity,length_diff,toxic_metric,non-toxic_metric
0,"if Alkar floods her with her mental waste, it ...","If Alkar is flooding her with psychic waste, t...",0.785171,0.010309,0.981983,0.014195
4,I have orders to kill her.,I've got orders to put her down.,0.726639,0.181818,0.999348,0.009402
10,you'd probably want me to buy you some chocola...,I suppose you want me to buy you flowers and c...,0.800661,0.16,0.980341,7.8e-05
14,he's the tallest son of a bitch.,So he's the Top dog.,0.611092,0.363636,0.999639,0.00092
15,"when I was dating Alex Harris, I swore I'd rat...",I swore when I went out with Xander Harris... ...,0.790565,0.148936,0.996266,0.011613


In [9]:
df_united = pd.concat([df_ref_tox, df_trn_tox])
print(len(df_united))
df_united.head()

349705


Unnamed: 0,toxic,non-toxic,similarity,length_diff,toxic_metric,non-toxic_metric
5,I'm not gonna have a child... ...with the same...,I'm not going to breed kids with a genetic dis...,0.703185,0.206522,0.950956,0.035846
6,"They're all laughing at us, so we'll kick your...",they're laughing at us. We'll show you.,0.618866,0.230769,0.999492,0.000131
13,"Come on, Cal, leave that shit alone.","come on, Cal, put it down.",0.660481,0.27027,0.999637,0.000279
22,"Real life starts the first time you fuck, kid.","boy, real life starts up first.",0.866697,0.319149,0.998222,0.000114
25,"Shit, this one I can't even pronounce.","gosh, I can't even pronounce this.",0.777253,0.102564,0.997452,0.00012


Let's check that everything was concatenated correctly: in every row, ```toxic_metric``` should be greater than ```non-toxic_metric```.
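A few additional structural checks could complement the metric comparison. The sketch below runs them on two made-up one-row tables; the names ```part_a```, ```part_b```, ```united``` and ```checks``` are hypothetical and only stand in for the real frames.

```python
import pandas as pd

cols = ['toxic', 'non-toxic', 'toxic_metric', 'non-toxic_metric']

# Toy stand-ins (made up) for the two filtered tables.
part_a = pd.DataFrame([['rude a', 'calm a', 0.99, 0.01]], columns=cols, index=[5])
part_b = pd.DataFrame([['rude b', 'calm b', 0.97, 0.02]], columns=cols, index=[0])
united = pd.concat([part_a, part_b])

checks = {
    # No rows were lost or duplicated by the concatenation.
    'row_count_is_sum': len(united) == len(part_a) + len(part_b),
    # No original row ended up in both parts.
    'index_is_unique': united.index.is_unique,
    # Mismatched column names would show up as NaN after pd.concat.
    'no_missing_values': not united.isna().any().any(),
    'same_columns': list(united.columns) == cols,
}
print(checks)
```

The missing-value check is the one most likely to catch a renaming mistake, since ```pd.concat``` aligns columns by name and fills the gaps with NaN.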

In [10]:
print(len(df_united[df_united['toxic_metric'] > df_united['non-toxic_metric']]))

349705


We also don't want to work with nearly identical sentences, so let's consider excluding every pair that has ```similarity``` >= ```similarity_threshold``` and ```length_diff``` <= ```length_diff_threshold```.

However, we first need to check which data these conditions would exclude, in order to choose suitable thresholds.
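One way to compare several threshold pairs at once is to count how many rows each combination would exclude. This is only a sketch on made-up numbers; ```count_excluded``` is a hypothetical helper, and the filter condition mirrors the one used in the cells of this notebook.

```python
import pandas as pd

def count_excluded(df, similarity_thresholds, length_diff_thresholds):
    """Hypothetical helper: rows excluded for each threshold combination."""
    rows = []
    for sim_thr in similarity_thresholds:
        for len_thr in length_diff_thresholds:
            mask = (df['similarity'] >= sim_thr) & (df['length_diff'] <= len_thr)
            rows.append({
                'similarity_threshold': sim_thr,
                'length_diff_threshold': len_thr,
                'excluded': int(mask.sum()),
            })
    return pd.DataFrame(rows)

# Toy data (made up), not taken from the real dataset.
toy = pd.DataFrame({
    'similarity': [0.95, 0.91, 0.80, 0.94],
    'length_diff': [0.05, 0.30, 0.10, 0.15],
})
summary = count_excluded(toy, [0.9, 0.93], [0.5, 0.2])
print(summary)
```

Counts alone do not say whether the excluded pairs are useful, so it still makes sense to inspect a few of them by eye.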

In [11]:
similarity_threshold, length_diff_threshold = 0.9, 0.5

df_similar = df_united[(df_united['similarity'] >= similarity_threshold) & (df_united['length_diff'] <= length_diff_threshold)]
print(len(df_similar))
df_similar.head()

20943


Unnamed: 0,toxic,non-toxic,similarity,length_diff,toxic_metric,non-toxic_metric
41,It told you this was a waste of my fucking time.,I told you this was a waste of my time.,0.904062,0.183673,0.995877,0.000479
43,"I swear to God, the best thing I ever did in m...","I swear to God, the best thing I've ever done ...",0.932305,0.022472,0.999071,0.0009
60,Your girlfriends are dead.,your friends are dead.,0.915111,0.148148,0.993116,0.012461
152,I can't believe we haven't fucked for two year...,I can't believe we haven't had sex in two year...,0.936891,0.032967,0.988648,0.017638
185,"""Crap, I don't have time for this.","""hell, I don't have time for this.",0.921785,0.0,0.99189,0.00462


As we can see, with these thresholds we would lose some suitable entries. Let's try to tighten them.

In [12]:
similarity_threshold, length_diff_threshold = 0.93, 0.2

df_similar = df_united[(df_united['similarity'] >= similarity_threshold) & (df_united['length_diff'] <= length_diff_threshold)]
print(len(df_similar))
df_similar.head()

4232


Unnamed: 0,toxic,non-toxic,similarity,length_diff,toxic_metric,non-toxic_metric
43,"I swear to God, the best thing I ever did in m...","I swear to God, the best thing I've ever done ...",0.932305,0.022472,0.999071,0.0009
152,I can't believe we haven't fucked for two year...,I can't believe we haven't had sex in two year...,0.936891,0.032967,0.988648,0.017638
741,I got to get the fuck out of here before I hur...,I have to get out of here before I hurt anyone.,0.932508,0.172414,0.981877,0.001174
774,A favor to cover up your fuck up or a favor fo...,"a favor to cover up for screwing up, or a favo...",0.936666,0.067797,0.997602,0.000676
801,"And do me a favor, would you sign the damn loa...",would you do me a favor and sign the loan forms?,0.945823,0.109091,0.999448,4.5e-05


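Even with the tighter thresholds, the matching pairs are still useful examples: they differ mainly in the toxic word. It seems that we shouldn't exclude anything.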

Then let's save our pairs of toxic and non-toxic sentences.
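Since the sentences contain commas and quotes, it may be worth confirming that they survive a CSV round trip. The sketch below uses an in-memory buffer and made-up sentences instead of the real file; the names ```pairs```, ```buffer``` and ```restored``` are hypothetical.

```python
import io

import pandas as pd

# Toy pairs (made up) with a comma and embedded quotes.
pairs = pd.DataFrame(
    {'toxic': ['rude, with a comma', 'rude b'],
     'non-toxic': ['calm a', 'calm "quoted" b']},
    index=[5, 6],
)

# Write to an in-memory buffer the same way as to a file, index included.
buffer = io.StringIO()
pairs.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the original row labels when reading back.
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```

Reading back with ```index_col=0``` keeps the original row labels, which makes it possible to trace a pair back to the source table.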

In [13]:
df_final = df_united[['toxic', 'non-toxic']]
df_final.head()

Unnamed: 0,toxic,non-toxic
5,I'm not gonna have a child... ...with the same...,I'm not going to breed kids with a genetic dis...
6,"They're all laughing at us, so we'll kick your...",they're laughing at us. We'll show you.
13,"Come on, Cal, leave that shit alone.","come on, Cal, put it down."
22,"Real life starts the first time you fuck, kid.","boy, real life starts up first."
25,"Shit, this one I can't even pronounce.","gosh, I can't even pronounce this."


In [14]:
df_final.to_csv("../data/raw/suitable.csv")