# Dataset Preprocessing
This notebook shows the data engineering operations applied to the raw training set ("jigsaw_train_set.csv") to obtain the training set used to train the models.

The steps shown here explain how the "clean_data" function in "dataset_preprocessing.py" works.

In [1]:
import pandas as pd
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from sklearn.utils import resample

In [2]:
train_data = pd.read_csv("./jigsaw_train_set.csv")
test_data = pd.read_csv("./jigsaw_test_set.csv")

# String Cleaning Guide
This section shows, one by one, the solutions adopted for cleaning the strings and the transformation each of them performs.

## Dataset Cleaning
Cleaning takes place in two phases:
1. Removal of specific characters and/or character sequences.
2. "De-fusion" of tokens that were merged as a side effect of the first cleaning step.
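
The two phases can be sketched on a single string as follows. This is a minimal illustration, not the code in "dataset_preprocessing.py": the helper name `clean_text` and the `REMOVALS` and `SPLITS` tables are assumptions introduced here, while the patterns are the ones used in the cells of this section.

```python
import re

# Phase 1: patterns whose matches are deleted outright.
REMOVALS = [
    r'[\r\n]+',                      # carriage returns and newlines
    r'::+',                          # runs of ":"
    r'==+',                          # runs of "="
    r'\*\*+',                        # runs of "*"
    r'\b(?:\d{1,3}\.){3}\d{1,3}\b',  # IPv4-like numeric sequences
    r'\[[^\[\]]+\]',                 # content in square brackets
    r"['\"]",                        # single and double quotes
]

# Phase 2: patterns that put a space back between fused tokens.
SPLITS = [
    (r'([?!\.])([A-Z]\w*)', r'\1 \2'),  # "end.Start" -> "end. Start"
    (r'([a-z])([A-Z])', r'\1 \2'),      # "wordWord"  -> "word Word"
]


def clean_text(text):
    """Apply both cleaning phases to one string (hypothetical helper)."""
    for pattern in REMOVALS:
        text = re.sub(pattern, '', text)
    for pattern, replacement in SPLITS:
        text = re.sub(pattern, replacement, text)
    return text


raw = '==Change name==\r\nWe must change it.'
print(clean_text(raw))  # Change name We must change it.
```

Removing "==" and "\r\n" first fuses "name" and "We" into "nameWe"; the second phase is what separates them again.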


In [3]:
# Remove "\r" and "\n"
phrase = train_data["comment_text"][108140]
print("Raw phrase:")
display(phrase)

cleaned_phrase = re.sub(r'[\r\n]+', '', phrase)
print("Cleaned phrase:")
display(cleaned_phrase)

Raw phrase:


'"\r\n\r\n Do iPods come with AM or FM radios? \r\n\r\nSince the article does not say, I could assume the answer is ""no"" but you know what they say about the word assume.  It might be worthwhile to say, somewhere, that other Music Devices also include radios but iPods do not.  (Or do..whichever the case might be.)  -   "'

Cleaned phrase:


'" Do iPods come with AM or FM radios? Since the article does not say, I could assume the answer is ""no"" but you know what they say about the word assume.  It might be worthwhile to say, somewhere, that other Music Devices also include radios but iPods do not.  (Or do..whichever the case might be.)  -   "'

In [4]:
# Remove runs of ":" (e.g., "::::")
phrase = train_data["comment_text"][159566]
print("Raw phrase:")
display(phrase)

cleaned_phrase = re.sub(r'::+', '', phrase)
print("Cleaned phrase:")
display(cleaned_phrase)

Raw phrase:


'":::::And for the second time of asking, when your view completely contradicts the coverage in reliable sources, why should anyone care what you feel? You can\'t even give a consistent argument - is the opening only supposed to mention significant aspects, or the ""most significant"" ones?   \r\n\r\n"'

Cleaned phrase:


'"And for the second time of asking, when your view completely contradicts the coverage in reliable sources, why should anyone care what you feel? You can\'t even give a consistent argument - is the opening only supposed to mention significant aspects, or the ""most significant"" ones?   \r\n\r\n"'

In [5]:
# Remove runs of "=" (e.g., "====")
phrase = train_data["comment_text"][62989]
print("Raw phrase:")
display(phrase)

cleaned_phrase = re.sub(r'==+', '', phrase)
print("Cleaned phrase:")
display(cleaned_phrase)

Raw phrase:


'"==Change name of section As Non-DST Time==\r\nWe must change the name of that section to ""As Standard Time"" since it would be more confusing if one would read it. If you don\'t want to remove it you may do this: As Standard (Non-DST) Time"". -  "'

Cleaned phrase:


'"Change name of section As Non-DST Time\r\nWe must change the name of that section to ""As Standard Time"" since it would be more confusing if one would read it. If you don\'t want to remove it you may do this: As Standard (Non-DST) Time"". -  "'

In [6]:
# Remove runs of "*" (e.g., "**")
phrase = train_data["comment_text"][146827]
print("Raw phrase:")
display(phrase)

cleaned_phrase = re.sub(r'\*\*+', '', phrase)
print("Cleaned phrase:")
display(cleaned_phrase)

Raw phrase:


'"\r\n\r\n ******* Double Standard Against Bosniaks *********** \r\n\r\nChrisO doesn\'t want me to use copy of the original investigative article that was published in 1993 by David Bernstein (Pacific News Service). The reason is because he thinks this is a personal website http://www.geocities.com/famous_bosniaks/english/general_lewis_mackenzie.html . What difference does it make? It\'s still original article published 13 years ago by Pacific News Service with full copyright notice? http://www.geocities.com/famous_bosniaks/english/general_lewis_mackenzie.html\r\n\r\nOn the other hand - he allows use of personal ""lists"" or ""groups"", such as ""mail-archive"" and Serb-run ""balkanpeace"" from Toronto when reading articles republished from Canada\'s Globe and Mail, example http://www.mail-archive.com/serbian_way@antic.org/msg00008.html\r\n\r\nAnyways, balkanpeace.org is Serb-run website in which Bosniaks, Croats and other ethnic groups are portrayed as the worst of the worst, while Se

Cleaned phrase:


'"\r\n\r\n  Double Standard Against Bosniaks  \r\n\r\nChrisO doesn\'t want me to use copy of the original investigative article that was published in 1993 by David Bernstein (Pacific News Service). The reason is because he thinks this is a personal website http://www.geocities.com/famous_bosniaks/english/general_lewis_mackenzie.html . What difference does it make? It\'s still original article published 13 years ago by Pacific News Service with full copyright notice? http://www.geocities.com/famous_bosniaks/english/general_lewis_mackenzie.html\r\n\r\nOn the other hand - he allows use of personal ""lists"" or ""groups"", such as ""mail-archive"" and Serb-run ""balkanpeace"" from Toronto when reading articles republished from Canada\'s Globe and Mail, example http://www.mail-archive.com/serbian_way@antic.org/msg00008.html\r\n\r\nAnyways, balkanpeace.org is Serb-run website in which Bosniaks, Croats and other ethnic groups are portrayed as the worst of the worst, while Serb crimes are excu

In [7]:
# Remove numeric sequences formatted like IP addresses (e.g., "192.168.1.1")
phrase = train_data["comment_text"][0]
print("Raw phrase:")
display(phrase)

cleaned_phrase = re.sub(r'\b(?:\d{1,3}\.){3}\d{1,3}\b', '', phrase)
print("Cleaned phrase:")
display(cleaned_phrase)

Raw phrase:


"Explanation\r\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27"

Cleaned phrase:


"Explanation\r\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now."

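The pattern above checks only the shape of the text (four dot-separated groups of one to three digits), so it also removes sequences such as "999.300.1.2" that are not valid addresses. If that matters, one possible refinement is to validate each match with the standard-library `ipaddress` module before deleting it. The sketch below is an assumption about how this could be done, not part of "dataset_preprocessing.py"; `strip_ipv4` is a hypothetical helper name.

```python
import ipaddress
import re

IPV4_LIKE = re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}\b')


def strip_ipv4(text):
    """Remove only the dotted quads that parse as IPv4 addresses."""
    def _drop_if_valid(match):
        try:
            ipaddress.IPv4Address(match.group(0))
        except ValueError:
            return match.group(0)  # not a valid address: keep the text
        return ''                  # valid address: remove it
    return IPV4_LIKE.sub(_drop_if_valid, text)


print(strip_ipv4("I'm retired now.89.205.38.27"))  # I'm retired now.
print(strip_ipv4("build 999.300.1.2 failed"))      # build 999.300.1.2 failed
```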
In [8]:
# Remove content enclosed in square brackets (e.g., "[contentContent]")
phrase = "Test sentence [content in brackets] end of test"
print("Raw phrase:")
display(phrase)

cleaned_phrase = re.sub(r'\[[^\[\]]+\]', '', phrase)
print("Cleaned phrase:")
display(cleaned_phrase)

Raw phrase:


'Test sentence [content in brackets] end of test'

Cleaned phrase:


'Test sentence  end of test'

In [9]:
# Remove quotes, both single and double
phrase = "\"Sentence with double quotes\", 'token'"
print("Raw phrase:")
display(phrase)

cleaned_phrase = re.sub(r"['\"]", '', phrase)
print("Cleaned phrase:")
display(cleaned_phrase)

Raw phrase:


'"Sentence with double quotes", \'token\''

Cleaned phrase:


'Sentence with double quotes, token'

In [10]:
# Split tokens in which a sentence-ending punctuation mark ("?", "!" or ".") is followed by an uppercase letter
phrase = "Does this sentence need cleaning?Yes"
print("Raw phrase:")
display(phrase)

cleaned_phrase = re.sub(r'([?!\.])([A-Z]\w*)', r'\1 \2', phrase)
print("Cleaned phrase:")
display(cleaned_phrase)

Raw phrase:


'Does this sentence need cleaning?Yes'

Cleaned phrase:


'Does this sentence need cleaning? Yes'
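
The character class `[A-Z]` covers ASCII uppercase letters only, so a sentence that starts with an accented capital is left fused. For an English corpus this is rarely an issue. Should it be, a Unicode-aware variant can test the following character with `str.isupper()`. This is a sketch under that assumption; `split_after_punctuation` is a hypothetical name.

```python
import re


def split_after_punctuation(text):
    """Insert a space after "?", "!" or "." when the next character is uppercase."""
    def _split(match):
        mark, following = match.group(1), match.group(2)
        # str.isupper() is Unicode-aware, unlike the ASCII range [A-Z].
        return f"{mark} {following}" if following.isupper() else match.group(0)
    return re.sub(r'([?!.])(\w)', _split, text)


ascii_only = re.sub(r'([?!\.])([A-Z]\w*)', r'\1 \2', "Davvero?È vero")
print(ascii_only)                                 # Davvero?È vero (unchanged)
print(split_after_punctuation("Davvero?È vero"))  # Davvero? È vero
```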

In [11]:
# Split fused words (e.g., "wordWord" becomes "word Word")
phrase = "thisToken needs splitting"
print("Raw phrase:")
display(phrase)

cleaned_phrase = re.sub(r'([a-z])([A-Z])', r'\1 \2', phrase)
print("Cleaned phrase:")
display(cleaned_phrase)

Raw phrase:


'thisToken needs splitting'

Cleaned phrase:


'this Token needs splitting'
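
A lowercase-to-uppercase boundary is only a heuristic for fused tokens: it also splits legitimate mixed-case names. The short check below applies the same pattern to a fused token and to two names that appear in the comments shown earlier ("iPods" and "ChrisO"). `split_fused` is a hypothetical wrapper used only for this illustration.

```python
import re


def split_fused(text):
    """Insert a space at every lowercase-to-uppercase boundary."""
    return re.sub(r'([a-z])([A-Z])', r'\1 \2', text)


for sample in ["nameWe must", "iPods", "ChrisO"]:
    print(repr(sample), "->", repr(split_fused(sample)))
# 'nameWe must' -> 'name We must'
# 'iPods' -> 'i Pods'
# 'ChrisO' -> 'Chris O'
```

Whether this trade-off is acceptable depends on the downstream model; in this pipeline every token is lowercased afterwards, so "iPods" ends up as the two tokens "i" and "pods".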

## Dataset Refining
- Uppercase letters inside a token are converted to lowercase.
- Tokens that are not purely alphabetic are removed from the phrase. This covers punctuation marks and, because the filter is `token.isalpha()`, also numbers and mixed alphanumeric tokens.

This section shows how the "transform_data" function in "dataset_preprocessing.py" works.

In [12]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to C:\Users\Riccardo De
[nltk_data]     Cesaris\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [13]:
phrase = "HELLO EVERYONE! This Notebook shows how to clean a Dataset for an NLP task."

In [14]:
tokens = word_tokenize(phrase)
print("Tokenized phrase: " + str(tokens))

Tokenized phrase: ['HELLO', 'EVERYONE', '!', 'This', 'Notebook', 'shows', 'how', 'to', 'clean', 'a', 'Dataset', 'for', 'an', 'NLP', 'task', '.']


In [15]:
lowercase_tokens = [token.lower() for token in tokens if token.isalpha()]
print("Processed phrase: " + str(lowercase_tokens))

Processed phrase: ['hello', 'everyone', 'this', 'notebook', 'shows', 'how', 'to', 'clean', 'a', 'dataset', 'for', 'an', 'nlp', 'task']


In [16]:
processed_phrase = ' '.join(lowercase_tokens)

print("Raw phrase:")
display(phrase)

print("Processed phrase:")
display(processed_phrase)

Raw phrase:


'HELLO EVERYONE! This Notebook shows how to clean a Dataset for an NLP task.'

Processed phrase:


'hello everyone this notebook shows how to clean a dataset for an nlp task'
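
The `isalpha()` filter drops more than punctuation. The sketch below makes that visible without requiring NLTK: `simple_tokenize` is a stand-in tokenizer written for this example (runs of word characters, or single punctuation marks) and is an assumption, since NLTK's `word_tokenize` follows different and more detailed rules.

```python
import re


def simple_tokenize(text):
    """Stand-in tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def refine(text):
    """Lowercase every token and keep only the purely alphabetic ones."""
    tokens = simple_tokenize(text)
    return ' '.join(token.lower() for token in tokens if token.isalpha())


print(refine("Published 13 years ago, in 1993!"))  # published years ago in
print(refine("user42 wrote THIS"))                 # wrote this
```

Numbers ("13", "1993") and mixed tokens ("user42") disappear together with the punctuation.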

## Dataset Lemmatization
To make the subsequent vectorization of the strings easier, the datasets are lemmatized. Lemmatization essentially reduces each word to its canonical form (its lemma), for example "painted" to "paint".

In [17]:
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # default to noun if POS tag not found

def lemmatize_text(text):
    lemmatizer = WordNetLemmatizer()
    tokens = word_tokenize(text)
    tagged_tokens = nltk.pos_tag(tokens)
    lemmatized_text = []
    for token, tag in tagged_tokens:
        pos = get_wordnet_pos(tag)
        lemmatized_token = lemmatizer.lemmatize(token, pos=pos)
        lemmatized_text.append(lemmatized_token)
    return ' '.join(lemmatized_text)

In [18]:
text_to_lemmatize = "The serene beauty of the sunset painted the sky with hues of orange and pink, captivating all who beheld it."

lemmatized_text = lemmatize_text(text_to_lemmatize)
print(lemmatized_text)

The serene beauty of the sunset paint the sky with hue of orange and pink , captivate all who behold it .
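
The part-of-speech mapping is what lets the lemmatizer treat "painted" as a verb rather than as a noun. The same mapping can be written as a lookup table. To keep the sketch runnable without NLTK, the WordNet constants are written as their literal values ('a', 'v', 'n', 'r' for `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN`, `wordnet.ADV`); `penn_to_wordnet` is a hypothetical name for this variant.

```python
# WordNet POS constants as literals, so the sketch runs without NLTK.
# In NLTK these are wordnet.ADJ, wordnet.VERB, wordnet.NOUN and wordnet.ADV.
ADJ, VERB, NOUN, ADV = 'a', 'v', 'n', 'r'

# The first letter of a Penn Treebank tag selects the WordNet category.
PENN_PREFIX_TO_WORDNET = {'J': ADJ, 'V': VERB, 'N': NOUN, 'R': ADV}


def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS, defaulting to noun."""
    return PENN_PREFIX_TO_WORDNET.get(tag[:1], NOUN)


for tag in ["JJ", "VBD", "NNS", "RB", "DT"]:
    print(tag, "->", penn_to_wordnet(tag))
```

Tags outside the four categories, such as "DT" for determiners, fall back to noun, which matches the `else` branch of `get_wordnet_pos`.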


# Applying the Described Techniques to the Datasets
1. Cleaning with regular expressions
2. Standardization, by converting uppercase letters to lowercase
3. Lemmatization

In [19]:
def clean_data(dataset):
    ## Several cleaning steps for textual data are applied in sequence

    # Remove "\r" and "\n"
    dataset["comment_text"] = dataset["comment_text"].apply(lambda x: re.sub(r'[\r\n]+', '', x))
    # Remove runs of ":" (e.g., "::::")
    dataset["comment_text"] = dataset["comment_text"].apply(lambda x: re.sub(r'::+', '', x))
    # Remove runs of "=" (e.g., "====")
    dataset["comment_text"] = dataset["comment_text"].apply(lambda x: re.sub(r'==+', '', x))
    # Remove runs of "*" (e.g., "**")
    dataset["comment_text"] = dataset["comment_text"].apply(lambda x: re.sub(r'\*\*+', '', x))
    # Remove numeric sequences formatted like IP addresses (e.g., "192.168.1.1")
    dataset["comment_text"] = dataset["comment_text"].apply(lambda x: re.sub(r'\b(?:\d{1,3}\.){3}\d{1,3}\b', '', x))
    # Remove content enclosed in square brackets (e.g., "[contentContent]")
    dataset["comment_text"] = dataset["comment_text"].apply(lambda x: re.sub(r'\[[^\[\]]+\]', '', x))
    # Remove quotes, both single and double
    dataset["comment_text"] = dataset["comment_text"].apply(lambda x: re.sub(r"['\"]", "", x))

    ## Removing specific characters or character sequences can fuse two different tokens together

    # Split tokens in which a sentence-ending punctuation mark ("?", "!" or ".") is followed by an uppercase letter
    dataset["comment_text"] = dataset["comment_text"].apply(lambda x: re.sub(r'([?!\.])([A-Z]\w*)', r'\1 \2', x))
    # Split fused words (e.g., "wordWord" becomes "word Word")
    dataset["comment_text"] = dataset["comment_text"].apply(lambda x: re.sub(r'([a-z])([A-Z])', r'\1 \2', x))

    return dataset

In [20]:
def transform_data(dataset):
    ## Convert all uppercase letters to lowercase and drop every non-alphabetic token (punctuation included)

    phrases = dataset["comment_text"].to_list()
    phrases_cleaned = list()

    for phrase in phrases:
        tokens = word_tokenize(phrase)
        lowercase_tokens = [token.lower() for token in tokens if token.isalpha()]
        phrases_cleaned.append(' '.join(lowercase_tokens))

    # Assign the list positionally; a new pd.Series would be aligned on index labels instead
    dataset["comment_text"] = phrases_cleaned
    
    return dataset
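
When processed phrases are written back into a column, it matters whether they are assigned as a plain list or as a new `pd.Series` once the frame's index is no longer 0..n-1. Both frames in this notebook come straight from `pd.read_csv` and therefore have the default index, so the toy frame below is an assumption built only to show the difference.

```python
import pandas as pd

# A frame whose index is not 0..n-1, as happens after filtering rows.
df = pd.DataFrame({"comment_text": ["A", "B", "C"]}, index=[10, 11, 12])
processed = [text.lower() for text in df["comment_text"]]

# A new Series is aligned on index labels: none of 0, 1, 2 matches 10, 11, 12.
aligned = df.copy()
aligned["comment_text"] = pd.Series(processed)

# A plain list is assigned by position and keeps every value.
positional = df.copy()
positional["comment_text"] = processed

print(aligned["comment_text"].isna().all())  # True: every value was lost
print(positional["comment_text"].tolist())   # ['a', 'b', 'c']
```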

In [21]:
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

def lemmatize_text(text):
    lemmatizer = WordNetLemmatizer()
    tokens = word_tokenize(text)
    tagged_tokens = nltk.pos_tag(tokens)
    lemmatized_text = []
    for token, tag in tagged_tokens:
        pos = get_wordnet_pos(tag)
        lemmatized_token = lemmatizer.lemmatize(token, pos=pos)
        lemmatized_text.append(lemmatized_token)
    return ' '.join(lemmatized_text)

## Training Set

In [22]:
train_data = clean_data(train_data)
train_data = transform_data(train_data)

In [23]:
def build_training_set(dataset):
    toxic_entries = dataset[dataset['toxic'] == 1]
    non_toxic_entries = dataset[dataset['toxic'] == 0]
    print("Number of 'toxic' phrases: " + str(len(toxic_entries)) + ", Number of 'non-toxic' phrases: " + str(len(non_toxic_entries)))

    # Downsample the majority class to the size of the minority class.
    # Note: resample() defaults to replace=True, so a non-toxic row may be drawn more than once.
    non_toxic_downsampled = resample(non_toxic_entries, n_samples=len(toxic_entries), random_state=42)

    final_training_set = pd.concat([toxic_entries, non_toxic_downsampled])
    final_training_set.reset_index(drop=True, inplace=True)

    # Drop rows whose text is empty after cleaning (value comparison, not an identity check)
    final_training_set = final_training_set[final_training_set['comment_text'] != '']
    final_training_set = final_training_set[['comment_text', 'toxic']]

    return final_training_set

In [24]:
training_set = build_training_set(train_data)
print("training_set.shape: " + str(training_set.shape))

Number of 'toxic' phrases: 15294, Number of 'non-toxic' phrases: 144277
training_set.shape: (30577, 2)
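
In the run above, balancing yields 2 × 15294 = 30588 rows, and the reported shape has 30577, which is consistent with 11 rows being dropped because their text was empty after cleaning. The balancing logic itself can be sketched without pandas or scikit-learn. `balance_by_downsampling` is a hypothetical helper, and unlike the default behavior of `sklearn.utils.resample` it draws without replacement; it assumes the non-toxic class is at least as large as the toxic one.

```python
import random


def balance_by_downsampling(rows, label_key='toxic', seed=42):
    """Downsample label-0 rows to the number of label-1 rows, then drop empty texts."""
    positives = [row for row in rows if row[label_key] == 1]
    negatives = [row for row in rows if row[label_key] == 0]
    rng = random.Random(seed)
    # random.sample draws without replacement, so no row is repeated.
    sampled = rng.sample(negatives, k=len(positives))
    balanced = positives + sampled
    # Rows whose text became empty during cleaning carry no signal.
    return [row for row in balanced if row['comment_text'] != '']


rows = (
    [{'comment_text': f'bad {i}', 'toxic': 1} for i in range(3)]
    + [{'comment_text': f'ok {i}', 'toxic': 0} for i in range(10)]
)
balanced = balance_by_downsampling(rows)
print(len(balanced), sum(row['toxic'] for row in balanced))  # 6 3
```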


In [25]:
training_set.to_csv("./training_set.csv", index=False)

In [26]:
phrases = training_set['comment_text'].to_list()
lemmatized_phrases = list()

for phrase in phrases:
    lemmatized_phrases.append(lemmatize_text(phrase))

training_set['comment_text'] = lemmatized_phrases

In [27]:
training_set.to_csv("./training_set_lemmatized.csv", index=False)

## Test Set

In [28]:
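# Take an explicit copy so the in-place column assignments below do not act on a slice of the original frame
test_data = test_data[['comment_text', 'toxic']].copy()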

In [29]:
test_data = clean_data(test_data)
test_data = transform_data(test_data)

In [30]:
test_data.to_csv("./test_set.csv", index=False)

In [31]:
phrases = test_data['comment_text'].to_list()
lemmatized_phrases = list()

for phrase in phrases:
    lemmatized_phrases.append(lemmatize_text(phrase))

test_data['comment_text'] = lemmatized_phrases

KeyboardInterrupt: 

In [None]:
test_data.to_csv("./test_set_lemmatized.csv", index=False)