# Download the ParaNMT dataset and inspect its contents

In [1]:
!wget https://github.com/skoltech-nlp/detox/releases/download/emnlp2021/filtered_paranmt.zip

--2023-10-05 18:58:29--  https://github.com/skoltech-nlp/detox/releases/download/emnlp2021/filtered_paranmt.zip
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://github.com/s-nlp/detox/releases/download/emnlp2021/filtered_paranmt.zip [following]
--2023-10-05 18:58:29--  https://github.com/s-nlp/detox/releases/download/emnlp2021/filtered_paranmt.zip
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/402743074/ea18dc6d-ab2d-49da-9cd3-2903867da5d3?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20231005%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20231005T185829Z&X-Amz-Expires=300&X-Amz-Signature=bad02ad0324f2f3bc5b6fdcdb17702b73b71861b35861cb87d42331c3544c78b&X-Amz-SignedHeaders=host&ac

In [2]:
!unzip filtered_paranmt.zip

Archive:  filtered_paranmt.zip
  inflating: filtered.tsv            


In [3]:
import pandas as pd

filtered_df = pd.read_csv("filtered.tsv", sep='\t')
filtered_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 577777 entries, 0 to 577776
Data columns (total 7 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   Unnamed: 0   577777 non-null  int64  
 1   reference    577777 non-null  object 
 2   translation  577777 non-null  object 
 3   similarity   577777 non-null  float64
 4   lenght_diff  577777 non-null  float64
 5   ref_tox      577777 non-null  float64
 6   trn_tox      577777 non-null  float64
dtypes: float64(4), int64(1), object(2)
memory usage: 30.9+ MB


### Note

There are no null values in the dataset, which is great: we do not have to clean up missing data.
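The same check can be made explicitly instead of being read off the `info()` output. A minimal sketch, using a small made-up frame in place of `filtered_df` (the rows, including the deliberately missing value, are invented for illustration):

```python
import pandas as pd

# Toy stand-in for filtered_df; the values are invented, and one
# reference is deliberately missing to show what a "dirty" result looks like.
toy_df = pd.DataFrame({
    "reference": ["a b c", "d e", None],
    "translation": ["x y", "z", "w"],
    "ref_tox": [0.1, 0.9, 0.5],
    "trn_tox": [0.8, 0.2, 0.4],
})

# Number of missing values per column; all zeros means nothing to clean.
null_counts = toy_df.isna().sum()
print(null_counts)

# One boolean answer for the whole frame.
has_nulls = bool(toy_df.isna().any().any())
print(has_nulls)
```

On the real dataset, `has_nulls` would be expected to be `False`, given the non-null counts reported above.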

In [4]:
filtered_df.head(10)

Unnamed: 0.1,Unnamed: 0,reference,translation,similarity,lenght_diff,ref_tox,trn_tox
0,0,"If Alkar is flooding her with psychic waste, t...","if Alkar floods her with her mental waste, it ...",0.785171,0.010309,0.014195,0.981983
1,1,Now you're getting nasty.,you're becoming disgusting.,0.749687,0.071429,0.065473,0.999039
2,2,"Well, we could spare your life, for one.","well, we can spare your life.",0.919051,0.268293,0.213313,0.985068
3,3,"Ah! Monkey, you've got to snap out of it.","monkey, you have to wake up.",0.664333,0.309524,0.053362,0.994215
4,4,I've got orders to put her down.,I have orders to kill her.,0.726639,0.181818,0.009402,0.999348
5,5,I'm not gonna have a child... ...with the same...,I'm not going to breed kids with a genetic dis...,0.703185,0.206522,0.950956,0.035846
6,6,"They're all laughing at us, so we'll kick your...",they're laughing at us. We'll show you.,0.618866,0.230769,0.999492,0.000131
7,7,Maine was very short on black people back then.,there wasn't much black in Maine then.,0.720482,0.1875,0.96368,0.14871
8,8,"Briggs, what the hell's happening?","Briggs, what the hell is going on?",0.920373,0.0,0.159096,0.841071
9,9,"Another one simply had no clue what to do, so ...","another simply didn't know what to do, so when...",0.87754,0.101695,0.055371,0.930472


### Note
**Please note** that the translated text may be more toxic than the reference one! We need to take this into account (the 'ref_tox' and 'trn_tox' columns can help us).
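To get a feeling for how often this happens, the two toxicity columns can be compared directly. A small sketch, assuming only the column names `ref_tox` and `trn_tox`; the four score pairs are copied from rows 0, 1, 5 and 6 of the table above, and `toy_scores` is a name made up for this example:

```python
import pandas as pd

# Score pairs taken from rows 0, 1, 5 and 6 of the head() output.
toy_scores = pd.DataFrame({
    "ref_tox": [0.014195, 0.065473, 0.950956, 0.999492],
    "trn_tox": [0.981983, 0.999039, 0.035846, 0.000131],
})

# True where the translation is the more toxic side of the pair.
trn_more_toxic = toy_scores["trn_tox"] > toy_scores["ref_tox"]

# The mean of a boolean Series is the share of True values.
share = trn_more_toxic.mean()
print(f"translation is more toxic in {share:.0%} of these rows")
```

Running the same comparison on `filtered_df` would show how the direction of toxicity is distributed over the full dataset.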

## First idea
As we can see, we have "reference" and "translation" columns, which are perfect for our detoxification task.<br/>
We'll ignore the 'similarity' and 'lenght_diff' columns for now because I don't understand how to use them yet.<br/>
But we still need to sort the texts into toxic and non-toxic columns using the 'ref_tox' and 'trn_tox' features.

In [5]:
filtered_df.drop(columns=['similarity', 'lenght_diff'], inplace=True)
filtered_df.head()

Unnamed: 0.1,Unnamed: 0,reference,translation,ref_tox,trn_tox
0,0,"If Alkar is flooding her with psychic waste, t...","if Alkar floods her with her mental waste, it ...",0.014195,0.981983
1,1,Now you're getting nasty.,you're becoming disgusting.,0.065473,0.999039
2,2,"Well, we could spare your life, for one.","well, we can spare your life.",0.213313,0.985068
3,3,"Ah! Monkey, you've got to snap out of it.","monkey, you have to wake up.",0.053362,0.994215
4,4,I've got orders to put her down.,I have orders to kill her.,0.009402,0.999348


Let's divide our reference and translation texts into toxic and detoxified texts

In [6]:
# True where the reference is the more toxic side of the pair
ref_is_toxic = filtered_df['ref_tox'] > filtered_df['trn_tox']

parsed_dataframe = pd.DataFrame()
# The more toxic text goes to "toxic", the other one to "detoxified"
parsed_dataframe['toxic'] = filtered_df['reference'].where(ref_is_toxic, filtered_df['translation'])
parsed_dataframe['detoxified'] = filtered_df['translation'].where(ref_is_toxic, filtered_df['reference'])
# Keep the lower of the two scores to check how toxic the detoxified text can still be
parsed_dataframe['detoxified_tox'] = filtered_df[['ref_tox', 'trn_tox']].min(axis=1)
parsed_dataframe

Unnamed: 0,toxic,detoxified,detoxified_tox
0,"if Alkar floods her with her mental waste, it ...","If Alkar is flooding her with psychic waste, t...",0.014195
1,you're becoming disgusting.,Now you're getting nasty.,0.065473
2,"well, we can spare your life.","Well, we could spare your life, for one.",0.213313
3,"monkey, you have to wake up.","Ah! Monkey, you've got to snap out of it.",0.053362
4,I have orders to kill her.,I've got orders to put her down.,0.009402
...,...,...,...
577772,you didn't know that Estelle stole your fish f...,You didn't know that Estelle had stolen some f...,0.000121
577773,It'il suck the life out of you!,you'd be sucked out of your life!,0.215794
577774,"I can't fuckin' take that, bruv.",I really can't take this.,0.000049
577775,They called me a fucking hero. The truth is I ...,"they said I was a hero, but I didn't care.",0.000124


In [7]:
# Just out of curiosity, let's see how toxic detoxified text can be.
parsed_dataframe.sort_values(by=['detoxified_tox'], ascending=False)

Unnamed: 0,toxic,detoxified,detoxified_tox
128782,Can any of you boys tell me why that stupid Li...,can one of you tell me why this little punk ki...,0.499494
2813,this is no more stupid competition.,This ain't a bloody competition anymore.,0.499419
108021,you make me feel like a fool.,You're making me feel crazy.,0.499037
167781,"you've got an old mouth, like you're sucking l...",Your face Is all pursed up like you just sucke...,0.498766
146045,kill the blind man.,Slay that blind man.,0.498739
...,...,...,...
398469,Why every idiot writer and director in Hollywo...,I don't know why he doesn't want one director ...,0.000034
358271,"and the best part is, even though it wasn't al...","And the best thing is, even if they weren't wa...",0.000033
489393,"son of a bitch, I used to go from place to pla...","' ""I went from place to place telling friends ...",0.000033
291782,Just what sort of severance package is managem...,and what parting gift am I prepared to offer? ...,0.000033


### Note
As we can see, the maximum value of detoxified_tox is less than 0.5, and the detoxified texts don't seem to be too toxic, so we can use this approach! Maybe in the future we can add a threshold for the toxicity level.
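A toxicity threshold of that kind could look roughly like the sketch below. The three pairs and their scores are copied from the outputs above; the constant `DETOX_TOX_THRESHOLD` and its value of 0.25 are assumptions chosen only for illustration, not a recommended setting:

```python
import pandas as pd

# Three pairs copied from the outputs above (toxic, detoxified, detoxified_tox).
toy_pairs = pd.DataFrame({
    "toxic": [
        "I have orders to kill her.",
        "well, we can spare your life.",
        "kill the blind man.",
    ],
    "detoxified": [
        "I've got orders to put her down.",
        "Well, we could spare your life, for one.",
        "Slay that blind man.",
    ],
    "detoxified_tox": [0.009402, 0.213313, 0.498739],
})

# Hypothetical cut-off: keep a pair only if its "clean" side scores below this.
DETOX_TOX_THRESHOLD = 0.25

clean_pairs = toy_pairs[toy_pairs["detoxified_tox"] < DETOX_TOX_THRESHOLD]
print(clean_pairs)
```

With this cut-off the "Slay that blind man." pair is dropped, which matches the intuition that its detoxified side is still fairly toxic.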

### Note

Now we can add this procedure to the data processing source files (but without the detoxified_tox column)!
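As a starting point for that source file, the procedure could be wrapped in a function. This is only a sketch: the name `split_toxic_detoxified` is hypothetical, and it assumes the input frame has the `reference`, `translation`, `ref_tox` and `trn_tox` columns seen above. The example rows are rows 1 and 7 of the `head()` output.

```python
import pandas as pd


def split_toxic_detoxified(df: pd.DataFrame) -> pd.DataFrame:
    """Return a frame with 'toxic' and 'detoxified' columns.

    Hypothetical helper: for each pair, the text with the higher
    toxicity score goes to 'toxic' and the other one to 'detoxified'.
    """
    ref_is_toxic = df["ref_tox"] > df["trn_tox"]
    return pd.DataFrame({
        "toxic": df["reference"].where(ref_is_toxic, df["translation"]),
        "detoxified": df["translation"].where(ref_is_toxic, df["reference"]),
    })


# Rows 1 and 7 from the head() output above.
example = pd.DataFrame({
    "reference": [
        "Now you're getting nasty.",
        "Maine was very short on black people back then.",
    ],
    "translation": [
        "you're becoming disgusting.",
        "there wasn't much black in Maine then.",
    ],
    "ref_tox": [0.065473, 0.96368],
    "trn_tox": [0.999039, 0.14871],
})

pairs = split_toxic_detoxified(example)
print(pairs)
```

The function returns a new frame and leaves its input untouched, so it can be applied to the raw file right after loading.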

### Note
Lastly, let's check the maximum length of the texts. We will need it to train the model.

In [9]:
print(parsed_dataframe.apply(lambda row: len(row["toxic"].split(" ")), axis=1).max())
print(parsed_dataframe.apply(lambda row: len(row["detoxified"].split(" ")), axis=1).max())

253
179


The maximum text length in the dataset is 253 space-separated words (toxic side); the longest detoxified text has 179 words.
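The same word count can be computed with pandas string methods, which also makes it easy to look at more than just the maximum. A minimal sketch on three sentences copied from the outputs above; `toy_texts` is a made-up name, and the median is shown only as an example of an extra statistic:

```python
import pandas as pd

# Three sentences copied from the outputs above.
toy_texts = pd.Series([
    "I have orders to kill her.",
    "you're becoming disgusting.",
    "Briggs, what the hell is going on?",
])

# Same rule as in the cell above: split on single spaces and count the pieces.
word_counts = toy_texts.str.split(" ").str.len()

print(word_counts.max())          # longest text, in words
print(word_counts.quantile(0.5))  # median length, in words
```

Note that this counts space-separated words; a model's tokenizer may split the text differently, so the number of tokens is likely to differ from the word count.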