# Text De-Toxification, part I: Data Exploration
### Robert Chen, B20-AI

## Step 0: Imports

In [1]:
import pandas as pd
import numpy as np

## Step 1: Exploring the initial `ParaNMT-detox` corpus

Let us unpack and import the data first:

In [28]:
# DATA_DIR is a Python variable; IPython substitutes $DATA_DIR into the `!` shell command
DATA_DIR = "../data"
!unzip -q $DATA_DIR/raw/filtered_paranmt.zip -d .

Now we can parse the unpacked `.tsv` file with `pandas`:

In [36]:
data = pd.read_csv('filtered.tsv', delimiter='\t', index_col=0)
data.head()

Unnamed: 0,reference,translation,similarity,lenght_diff,ref_tox,trn_tox
0,"If Alkar is flooding her with psychic waste, t...","if Alkar floods her with her mental waste, it ...",0.785171,0.010309,0.014195,0.981983
1,Now you're getting nasty.,you're becoming disgusting.,0.749687,0.071429,0.065473,0.999039
2,"Well, we could spare your life, for one.","well, we can spare your life.",0.919051,0.268293,0.213313,0.985068
3,"Ah! Monkey, you've got to snap out of it.","monkey, you have to wake up.",0.664333,0.309524,0.053362,0.994215
4,I've got orders to put her down.,I have orders to kill her.,0.726639,0.181818,0.009402,0.999348
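
As a side note, the unpacking step can probably be skipped, since `pandas.read_csv` can read a zip archive directly. The sketch below assumes the archive contains exactly one file (`filtered.tsv`), which is what `compression="zip"` requires; the helper name `read_tsv_from_zip` is hypothetical and not part of the original notebook.

```python
import pandas as pd


def read_tsv_from_zip(zip_path: str) -> pd.DataFrame:
    """Read a tab-separated file straight from a single-file zip archive."""
    # compression="zip" only works when the archive holds exactly one data file
    # (assumed here, not verified against the real archive).
    return pd.read_csv(zip_path, sep="\t", index_col=0, compression="zip")


# Possible usage with the path from the cell above (assumption: one file inside):
# data = read_tsv_from_zip("../data/raw/filtered_paranmt.zip")
```

If the archive turns out to hold several files, unpacking it first, as done above, remains the safer route.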


Let us first check for missing values:

In [43]:
data.isna().sum()

reference      0
translation    0
similarity     0
lenght_diff    0
ref_tox        0
trn_tox        0
dtype: int64

Since there are no missing values in the dataset, we can skip the data imputation step. Let us look at the summary statistics of the numeric columns:

In [44]:
data.describe()

Unnamed: 0,similarity,lenght_diff,ref_tox,trn_tox
count,577777.0,577777.0,577777.0,577777.0
mean,0.758469,0.157652,0.541372,0.43449
std,0.092695,0.108057,0.457571,0.458904
min,0.600001,0.0,3.3e-05,3.3e-05
25%,0.681105,0.066667,0.012171,0.000707
50%,0.754439,0.141791,0.806795,0.085133
75%,0.831244,0.238095,0.990469,0.973739
max,0.95,0.4,0.999724,0.99973


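Both `ref_tox` (toxicity level of the reference text) and `trn_tox` (toxicity level of the translated sentence) span almost the whole range from 0 to 1, and the 75th percentile of `trn_tox` is about 0.97, so at least a quarter of the translations are scored as highly toxic. The first five rows shown earlier confirm this: in each of them `trn_tox` is far higher than `ref_tox`. In other words, the translation is sometimes more toxic than the reference, so the toxic sentence of a pair is not always in the `reference` column. We need to account for this during training.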
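
One possible way to account for this is to reorder each pair so that the more toxic sentence always sits in `reference`. The sketch below is only an illustration of that idea, not necessarily the preprocessing used later in the project. The function name `orient_toxic_first` is hypothetical, and the small `sample` frame mixes one row from the output above with one made-up placeholder row.

```python
import pandas as pd


def orient_toxic_first(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which `reference` holds the more toxic sentence of each pair."""
    out = df.copy()
    # Rows where the translation is scored as more toxic than the reference
    swap = out["trn_tox"] > out["ref_tox"]
    # Swap texts and scores together; .to_numpy() avoids column-label alignment
    out.loc[swap, ["reference", "translation"]] = (
        df.loc[swap, ["translation", "reference"]].to_numpy()
    )
    out.loc[swap, ["ref_tox", "trn_tox"]] = (
        df.loc[swap, ["trn_tox", "ref_tox"]].to_numpy()
    )
    return out


sample = pd.DataFrame(
    {
        # First row comes from data.head(); second row is a made-up placeholder
        "reference": ["Now you're getting nasty.", "placeholder toxic sentence"],
        "translation": ["you're becoming disgusting.", "placeholder neutral sentence"],
        "ref_tox": [0.065473, 0.9],
        "trn_tox": [0.999039, 0.1],
    }
)

# Share of pairs where the translation is the more toxic side
swap_share = (sample["trn_tox"] > sample["ref_tox"]).mean()
oriented = orient_toxic_first(sample)
```

Running `(data["trn_tox"] > data["ref_tox"]).mean()` on the full dataset would show how many pairs are affected before deciding whether swapping is worth it.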