# Data Cleaning and Text Standardization

a. Make text formats uniform (e.g., case normalization; hint: convert all letters to lowercase).
If necessary, clean the comment text (e.g., URLs, subreddit references, …).

b. Stop words such as "the" and "a" carry very little information and contribute little to our
ML tasks. Handle these words appropriately.

c. Reduce words to their base or root form using stemming or lemmatization, so that inflected
forms of a word map to a common base form. (Hint: consider using libraries like NLTK
or spaCy for tokenization.)


In [2]:
# Import the required Python libraries

%matplotlib inline
from tqdm import tqdm
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

import html
import spacy
from langdetect import detect

# Load the small English spaCy model with the listed components disabled
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner", "textcat"])

In [3]:
df_supervised   = pd.read_csv("data_supervised.csv")
df_unsupervised = pd.read_csv("data_unsupervised.csv")
df_target       = pd.read_csv("target_supervised.csv")

print(df_supervised.shape, df_unsupervised.shape, df_target.shape)

(296042, 4) (1107946, 4) (5000, 2)


a. Make text formats uniform (e.g., case normalization; hint: convert all letters to lowercase). If necessary, clean the comment text (e.g., URLs, subreddit references, …).



In [4]:
# Matches URLs, subreddit references (r/...) and user references (u/...).
# The \b anchors keep the r/ and u/ patterns from firing inside words such as "or/and".
remove_pattern = r'https?://\S+|www\.\S+|\br/\w+|\bu/\w+'

def normalize_body(series):
    """Apply the same normalization steps to a column of raw comment text."""
    return (
        series
        .fillna('')                                    # Handle NaN values
        .astype(str)                                   # Ensure string type
        .apply(html.unescape)                          # Decode HTML entities (e.g. &amp; -> &); entity names are case-sensitive, so do this before lowercasing
        .str.lower()                                   # Case normalization (point a.)
        .str.replace(remove_pattern, ' ', regex=True)  # Remove URLs, r/ and u/ references
        .str.replace(r'\s+', ' ', regex=True)          # Collapse runs of whitespace
        .str.strip()                                   # Trim leading/trailing whitespace
    )

df_supervised['body_normalized'] = normalize_body(df_supervised['body'])
df_unsupervised['body_normalized'] = normalize_body(df_unsupervised['body'])



In [5]:
# Sanity check: compare the raw and normalized text side by side
df_supervised[["body", 'body_normalized']].head()

|   | body | body_normalized |
|---|------|-----------------|
| 0 | I don't think we'd get nearly as much fanficti... | i don't think we'd get nearly as much fanficti... |
| 1 | Thanks. I made it up, that's how I got over my... | thanks. i made it up, that's how i got over my... |
| 2 | Are you sure you aren't confusing Cyclops (the... | are you sure you aren't confusing cyclops (the... |
| 3 | dont do this to me bro | dont do this to me bro |
| 4 | That's what we do when we can't find a mate | that's what we do when we can't find a mate |
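
The normalization steps can also be spot-checked on hand-written strings without pandas. The sketch below mirrors the chain in plain Python using only `re` and `html`. The function name `normalize_comment` and the sample strings are illustrative assumptions, not part of the notebook. The sketch also makes two choices of its own: it unescapes HTML entities before lowercasing, and it anchors the `r/` and `u/` patterns at a word boundary so that text like "either/or/and" is not partially removed.

```python
import html
import re

# Same targets as the notebook's pattern: URLs, subreddit refs (r/...), user refs (u/...).
# The \b anchors are an assumption of this sketch, so that "or/and" is left alone.
REMOVE_PATTERN = re.compile(r"https?://\S+|www\.\S+|\br/\w+|\bu/\w+")
WHITESPACE = re.compile(r"\s+")


def normalize_comment(text):
    """Unescape HTML, lowercase, drop URLs and r/ u/ references, squeeze whitespace."""
    if text is None:
        return ""
    text = html.unescape(str(text)).lower()
    text = REMOVE_PATTERN.sub(" ", text)
    return WHITESPACE.sub(" ", text).strip()


samples = [
    "Check https://example.com and r/AskReddit &amp; u/some_user!",
    "Either/or/and   stays intact",
    None,
]
for s in samples:
    print(repr(normalize_comment(s)))
```

Without the word-boundary anchor, the second sample would lose the `r/or` fragment from "either/or", which is an easy failure to miss when only looking at the first few rows of the DataFrame.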


b. Stop words such as "the" and "a" carry very little information and contribute little to our ML tasks. Handle these words appropriately.

c. Reduce words to their base or root form using stemming or lemmatization, so that inflected forms of a word map to a common base form. (Hint: consider using libraries like NLTK or spaCy for tokenization.)
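
To make points b. and c. concrete independently of any library, here is a toy sketch of stop-word removal followed by stemming. Everything in it is an illustrative assumption: the stop-word set is a tiny hand-picked list, and `crude_stem` is a deliberately naive suffix stripper, not the Porter algorithm and not spaCy's lemmatizer. Real pipelines should rely on NLTK or spaCy as the hint suggests.

```python
import re

# Tiny hand-picked stop-word list, for illustration only;
# NLTK and spaCy ship far larger ones.
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "on"}

# Suffixes tried in order; a real stemmer applies many more rules.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean_tokens(text):
    """Lowercase, tokenize on letters/apostrophes, drop stop words, stem the rest."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


print(clean_tokens("The cats are jumping on the walls"))
# -> ['cat', 'jump', 'wall']
```

A stemmer of this kind returns truncated strings rather than dictionary words (here "running" would become "runn"), which is one reason lemmatization is often preferred when the cleaned text needs to stay readable.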

In [6]:
def process_text_full(text_series, batch_size=2000):
    """Remove stop words and punctuation, then lemmatize each text with spaCy."""
    clean_texts = []

    total_docs = len(text_series)

    # tqdm shows a progress bar while nlp.pipe streams the texts in batches
    for doc in tqdm(nlp.pipe(text_series, batch_size=batch_size), total=total_docs, desc="Processing"):

        tokens = []
        for token in doc:
            # 1. Filter out stop words, punctuation, and whitespace tokens (b)
            if not token.is_stop and not token.is_punct and not token.is_space:
                # 2. Keep the lemma produced by spaCy (c)
                tokens.append(token.lemma_)

        clean_texts.append(" ".join(tokens))

    return clean_texts

print("Processing the SUPERVISED dataset (smaller)...")
df_supervised['body_clean'] = process_text_full(df_supervised['body_normalized'].astype(str))

print("Processing the UNSUPERVISED dataset (larger)...")
df_unsupervised['body_clean'] = process_text_full(df_unsupervised['body_normalized'].astype(str))

Processing the SUPERVISED dataset (smaller)...


Processing:  21%|██        | 62000/296042 [02:30<09:28, 411.62it/s] 


KeyboardInterrupt: 
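
The progress bar reports 62,000 of 296,042 documents after two and a half minutes, at roughly 411.62 documents per second, before the run was interrupted. Assuming that rate stays constant (an assumption, since throughput depends on comment length and hardware), the full run time for each dataset can be estimated with a little arithmetic. The helper `estimate_minutes` is hypothetical and only illustrates the calculation.

```python
def estimate_minutes(n_docs, docs_per_second):
    """Estimated wall-clock minutes to process n_docs at a constant rate."""
    return n_docs / docs_per_second / 60


RATE = 411.62  # docs/s, read off the tqdm bar; assumed constant for this estimate

# Row counts come from the shapes printed when the CSV files were loaded.
for name, n_docs in [("supervised", 296_042), ("unsupervised", 1_107_946)]:
    print(f"{name}: ~{estimate_minutes(n_docs, RATE):.0f} min")
```

Under that assumption the supervised set needs about 12 minutes and the unsupervised set about 45 minutes, so it may be worth saving `body_clean` to disk once computed rather than recomputing it in every session.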