### Objectives of this notebook:  
1. Unzip the ParaNMT-detox dataset to get around GitHub's limit on upload file size.
2. Perform a basic exploration of the dataset.

In [1]:
# List the contents of the zipped dataset in the raw directory:
import zipfile

with zipfile.ZipFile("../data/raw/filtered_paranmt.zip", mode="r") as archive:
    archive.printdir()

File Name                                             Modified             Size
filtered.tsv                                   2021-04-16 22:34:42    108290032


In [2]:
with zipfile.ZipFile("../data/raw/filtered_paranmt.zip", mode="r") as archive:
    dataset = archive.read("filtered.tsv").decode(encoding="utf-8")
    with open("../data/interim/filtered_paranmt.tsv", "w", encoding="utf-8") as f:
        f.write(dataset)
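
The cell above reads the whole 108 MB member into memory before writing it out. As a minimal sketch of an alternative, the member can be streamed to disk in chunks. The helper name `extract_member` is hypothetical and not part of the original notebook; it copies the raw bytes unchanged, so unlike the text-mode write above it performs no decoding or newline translation.

```python
import shutil
import zipfile
from pathlib import Path


def extract_member(zip_path, member, target_path):
    """Stream one archive member to target_path without loading it fully into memory.

    Hypothetical helper for illustration; bytes are copied verbatim.
    """
    target_path = Path(target_path)
    # Create the destination directory if it does not exist yet.
    target_path.parent.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path, mode="r") as archive:
        # archive.open returns a file-like object that decompresses on the fly.
        with archive.open(member) as src, open(target_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
    return target_path
```

With the paths used in this notebook, the call would be `extract_member("../data/raw/filtered_paranmt.zip", "filtered.tsv", "../data/interim/filtered_paranmt.tsv")`.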


Now that I have unzipped the dataset, I can get acquainted with it using the pandas library:

In [3]:
import pandas as pd

# Read the tab-separated file and use its first (unnamed) column as the index.
dataset = pd.read_csv("../data/interim/filtered_paranmt.tsv", sep="\t", index_col=0)
dataset.index.name = "Index"

In [4]:
dataset.head()

Index,reference,translation,similarity,lenght_diff,ref_tox,trn_tox
0,"If Alkar is flooding her with psychic waste, t...","if Alkar floods her with her mental waste, it ...",0.785171,0.010309,0.014195,0.981983
1,Now you're getting nasty.,you're becoming disgusting.,0.749687,0.071429,0.065473,0.999039
2,"Well, we could spare your life, for one.","well, we can spare your life.",0.919051,0.268293,0.213313,0.985068
3,"Ah! Monkey, you've got to snap out of it.","monkey, you have to wake up.",0.664333,0.309524,0.053362,0.994215
4,I've got orders to put her down.,I have orders to kill her.,0.726639,0.181818,0.009402,0.999348


In [5]:
dataset = dataset.rename(columns={'lenght_diff': 'length_diff'})
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Index: 577777 entries, 0 to 577776
Data columns (total 6 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   reference    577777 non-null  object 
 1   translation  577777 non-null  object 
 2   similarity   577777 non-null  float64
 3   length_diff  577777 non-null  float64
 4   ref_tox      577777 non-null  float64
 5   trn_tox      577777 non-null  float64
dtypes: float64(4), object(2)
memory usage: 30.9+ MB


All the values in the dataset are already non-null.
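
The non-null counts reported by `info()` can also be checked directly. The following is a minimal sketch on a small made-up frame (the `sample` data and the helper name `null_counts` are illustrative assumptions, not taken from the dataset). Note that `isna()` does not treat empty strings as missing, so this check says nothing about blank sentences.

```python
import pandas as pd


def null_counts(frame: pd.DataFrame) -> pd.Series:
    """Return the number of missing values in each column."""
    return frame.isna().sum()


# Illustrative toy frame with one missing text value (not real dataset rows).
sample = pd.DataFrame({"reference": ["a", None], "ref_tox": [0.1, 0.9]})
counts = null_counts(sample)
print(counts)
```

On the real data, `null_counts(dataset)` should report zero for every column if the `info()` output above holds.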

In [6]:
dataset.dtypes

reference       object
translation     object
similarity     float64
length_diff    float64
ref_tox        float64
trn_tox        float64
dtype: object

From the `info()` output above (577777 entries, 6 columns), the dataset shape is (577777, 6).

In [7]:
dataset.describe()

statistic,similarity,length_diff,ref_tox,trn_tox
count,577777.0,577777.0,577777.0,577777.0
mean,0.758469,0.157652,0.541372,0.43449
std,0.092695,0.108057,0.457571,0.458904
min,0.600001,0.0,3.3e-05,3.3e-05
25%,0.681105,0.066667,0.012171,0.000707
50%,0.754439,0.141791,0.806795,0.085133
75%,0.831244,0.238095,0.990469,0.973739
max,0.95,0.4,0.999724,0.99973


### Conclusion:  

The dataset contains two text columns: the sentence to be detoxified and its paraphrased version.  
It contains four numerical columns: the cosine similarity of the texts, the relative length difference between the texts, the toxicity level of the original text, and the toxicity level of its paraphrased version.  
The dataset contains no null values.  
Observing the toxicity values, I notice that the standard deviations of ref_tox and trn_tox are similar (0.457571 and 0.458904), while the higher mean of ref_tox (0.541372 versus 0.43449) suggests that, on average, messages are considered more toxic before paraphrasing. However, in the first five rows shown by `head()` the translation is the more toxic side of each pair, so the direction should be checked per row.  
Looking at the similarity values, I see that the mean is 0.758469 (within a range of 0.600001 to 0.95), which suggests that many paraphrases do not fully preserve the meaning of the original text; this may contribute to the disparity in toxicity values.
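
The two observations about toxicity and similarity can be probed row by row rather than through column means. The sketch below is one possible check, not part of the original analysis: the helper name `toxicity_summary` is hypothetical, and the Pearson correlation between similarity and the absolute toxicity gap is my own choice of measure. It is run here only on the five rows printed by `dataset.head()`, so its output says nothing about the full dataset.

```python
import pandas as pd


def toxicity_summary(frame: pd.DataFrame) -> dict:
    """Summarise toxicity direction and its relation to similarity (illustrative helper)."""
    # Share of pairs in which the reference is rated more toxic than the translation.
    ref_more_toxic = frame["ref_tox"] > frame["trn_tox"]
    # Absolute toxicity gap between the two sides of each pair.
    tox_gap = (frame["ref_tox"] - frame["trn_tox"]).abs()
    return {
        "ref_more_toxic_share": float(ref_more_toxic.mean()),
        # Pearson correlation (the pandas default) between similarity and the gap.
        "similarity_vs_tox_gap_corr": float(frame["similarity"].corr(tox_gap)),
    }


# The five rows shown by dataset.head() earlier in the notebook.
head_rows = pd.DataFrame({
    "similarity": [0.785171, 0.749687, 0.919051, 0.664333, 0.726639],
    "ref_tox": [0.014195, 0.065473, 0.213313, 0.053362, 0.009402],
    "trn_tox": [0.981983, 0.999039, 0.985068, 0.994215, 0.999348],
})
summary = toxicity_summary(head_rows)
print(summary)
```

Calling `toxicity_summary(dataset)` on the full frame would show how often the reference is actually the more toxic side, and whether lower similarity tends to go with a larger toxicity gap.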