# Data Exploration and Preprocessing

This notebook covers the initial exploration and preprocessing of the dataset. The goal is to understand the data's structure and content, and to spot any patterns or anomalies before modelling.


##### Read the uploaded TSV file to understand its structure and contents

In [6]:
import pandas as pd

In [9]:
# Load the TSV file into a DataFrame
file_path = '../data/raw/filtered.tsv'
data = pd.read_csv(file_path, sep='\t')

data.head()

| Unnamed: 0.1 | Unnamed: 0 | reference | translation | similarity | lenght_diff | ref_tox | trn_tox |
|---|---|---|---|---|---|---|---|
| 0 | 0 | If Alkar is flooding her with psychic waste, t... | if Alkar floods her with her mental waste, it ... | 0.785171 | 0.010309 | 0.014195 | 0.981983 |
| 1 | 1 | Now you're getting nasty. | you're becoming disgusting. | 0.749687 | 0.071429 | 0.065473 | 0.999039 |
| 2 | 2 | Well, we could spare your life, for one. | well, we can spare your life. | 0.919051 | 0.268293 | 0.213313 | 0.985068 |
| 3 | 3 | Ah! Monkey, you've got to snap out of it. | monkey, you have to wake up. | 0.664333 | 0.309524 | 0.053362 | 0.994215 |
| 4 | 4 | I've got orders to put her down. | I have orders to kill her. | 0.726639 | 0.181818 | 0.009402 | 0.999348 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 95 | But if somebody rob me and end up dead... Well... | but if someone robbed me and died... well, you... | 0.846546 | 0.048780 | 0.998804 | 0.026040 |
| 96 | 96 | Wouldn't happen to be Latino? | you couldn't be Latino, would you? | 0.680294 | 0.142857 | 0.000896 | 0.748527 |
| 97 | 97 | Really fucking annoying. | it's pretty annoying. | 0.804867 | 0.120000 | 0.986967 | 0.001366 |
| 98 | 98 | 'Don't you even turn around, you . . . you tur... | "don't even look back, you... you pussy." | 0.616171 | 0.363636 | 0.311901 | 0.999517 |
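The `Unnamed: ...` columns in the preview look like index columns left over from earlier saves, so they probably carry no information of their own. A minimal sketch of one way to drop them follows. The `sample` frame is a hypothetical stand-in built from two of the rows above, not the real file, and whether these columns are truly redundant should be confirmed on the actual data first.

```python
import pandas as pd

# Hypothetical miniature frame mimicking the leftover index columns (assumption).
sample = pd.DataFrame({
    "Unnamed: 0.1": [0, 1],
    "Unnamed: 0": [0, 1],
    "reference": ["Now you're getting nasty.", "I've got orders to put her down."],
    "translation": ["you're becoming disgusting.", "I have orders to kill her."],
})

# Keep only the columns whose name does not start with "Unnamed".
cleaned = sample.loc[:, ~sample.columns.str.startswith("Unnamed")]

print(list(cleaned.columns))
```

The same selection could be applied to `data` once the columns are confirmed to be plain row counters.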


##### Data Exploration

In [10]:
# Descriptive statistics for numeric features
data_description = data.describe()

# Check for missing values
missing_values = data.isnull().sum()

# Distribution of Toxicity Levels
toxicity_distribution_ref = data['ref_tox'].describe()
toxicity_distribution_trn = data['trn_tox'].describe()

# Prepare data for histogram plots of toxicity levels
toxicity_levels_ref = data['ref_tox']
toxicity_levels_trn = data['trn_tox']

data_description, missing_values, toxicity_distribution_ref, toxicity_distribution_trn, toxicity_levels_ref, toxicity_levels_trn

Unnamed: 0     0
reference      0
translation    0
similarity     0
lenght_diff    0
ref_tox        0
trn_tox        0
dtype: int64
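Only the missing-value counts are shown above, although the cell builds several summaries. As a rough sketch, the two toxicity columns can be described side by side in a single call, which makes them easier to compare. The `toy` frame below uses made-up scores purely for illustration; on the real data the same calls would be applied to `data`.

```python
import pandas as pd

# Made-up toxicity scores for illustration only (assumption, not real data).
toy = pd.DataFrame({
    "ref_tox": [0.01, 0.99, 0.05, 0.98],
    "trn_tox": [0.98, 0.02, 0.99, 0.001],
})

# One describe() call yields both columns in a single comparable table.
summary = toy[["ref_tox", "trn_tox"]].describe()

# Per-column count of missing values, as in the cell above.
missing = toy.isnull().sum()

print(summary)
print(missing)
```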

In [None]:
import matplotlib.pyplot as plt

In [None]:
# Plotting histograms for toxicity distribution
plt.figure(figsize=(14, 6))

# Histogram for reference toxicity levels
plt.subplot(1, 2, 1)
plt.hist(toxicity_levels_ref, bins=50, color='blue', alpha=0.7)
plt.title('Distribution of Reference Toxicity Levels')
plt.xlabel('Toxicity Level')
plt.ylabel('Frequency')

# Histogram for translation toxicity levels
plt.subplot(1, 2, 2)
plt.hist(toxicity_levels_trn, bins=50, color='green', alpha=0.7)
plt.title('Distribution of Translation Toxicity Levels')
plt.xlabel('Toxicity Level')
plt.ylabel('Frequency')

# Show the plot
plt.tight_layout()
plt.show()

The histograms show the distribution of toxicity levels for both reference and translation sentences:

- The blue histogram represents the toxicity levels of the reference sentences. It shows a bimodal distribution with peaks at the lower and higher ends of the toxicity scale, indicating that the dataset contains a mix of sentences with low and high levels of toxicity.
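- The green histogram represents the toxicity levels of the translation sentences. This distribution is skewed towards lower toxicity levels, which is consistent with many translations being detoxified versions of their reference counterparts. The direction is not uniform, however: in the first five preview rows of the table earlier in this notebook, `trn_tox` is well above `ref_tox`, so some pairs have the more toxic sentence in the `translation` column.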
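To quantify how often the reference is the more toxic side of a pair, the two scores can be compared row by row. The sketch below is one possible approach rather than the notebook's established method. The `toy` frame, the placeholder sentences, and the `toxic`/`neutral` names are hypothetical and introduced only for illustration.

```python
import pandas as pd

# Hypothetical pairs with placeholder sentences and made-up scores (assumption).
toy = pd.DataFrame({
    "reference": ["a", "b", "c"],
    "translation": ["x", "y", "z"],
    "ref_tox": [0.014, 0.987, 0.065],
    "trn_tox": [0.982, 0.001, 0.999],
})

# True where the reference sentence scores as more toxic than the translation.
ref_more_toxic = toy["ref_tox"] > toy["trn_tox"]

# Fraction of pairs in which the reference is the more toxic side.
share = ref_more_toxic.mean()

# Reorder each pair so the more toxic sentence always lands in `toxic`.
toxic = toy["reference"].where(ref_more_toxic, toy["translation"])
neutral = toy["translation"].where(ref_more_toxic, toy["reference"])

print(share)
print(pd.DataFrame({"toxic": toxic, "neutral": neutral}))
```

If the share on the real data is far from 1, reordering the pairs in this way before training may be worth considering.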