# Load and Examine the Dataset
Import pandas, load the DataFrame, and examine the 'Text' variable through basic descriptive statistics and a random sample.

Keep only two columns (['Text', 'IsToxic']) so that the model stays simple and computationally efficient, with less noise, lower risk of overfitting, and better interpretability.


In [None]:
# Install and import the library used to load the dataset and select columns
%pip install -q pandas

import pandas as pd

# Load the dataset from the Data folder
data = pd.read_csv('Data/youtoxic_english_1000.csv')
# Keep only the two columns of interest; .copy() avoids a SettingWithCopyWarning
# when new columns are added to df in later cells
df = data[['Text', 'IsToxic']].copy()

print(df.head())
print(df.dtypes)
print("="*50)

print(f"Descriptive analysis of the Text variable: \n{df['Text'].describe()}")
print("="*50)

print(f"Random sample of the Text variable: \n{df['Text'].sample(10)}")

Note: you may need to restart the kernel to use updated packages.
                                                 Text  IsToxic
0  If only people would just take a step back and...    False
1  Law enforcement is not trained to shoot to app...     True
2  \nDont you reckon them 'black lives matter' ba...     True
3  There are a very large number of people who do...    False
4  The Arab dude is absolutely right, he should h...    False
Text       object
IsToxic      bool
dtype: object
Descriptive analysis of the Text variable: 
count              1000
unique              997
top       run them over
freq                  3
Name: Text, dtype: object
Random sample of the Text variable: 
428    THANK YOU! right on.black people are going to ...
512    When did coming towards the officer is the sam...
107    This is a genuine failure of leadership by the...
47     This moron is talking about being peaceful?  T...
625    They just need to arrest all those protesters ...
942    i like how blac

In [None]:
# Find repeated values in the Text column
duplicates = df[df.duplicated(['Text'], keep=False)]
print(f"Duplicated values in the Text column: \n{duplicates}")

Duplicated values in the Text column: 
              Text  IsToxic
592  RUN THEM OVER     True
642  run them over     True
657  run them over     True
677  run them over     True
699  RUN THEM OVER     True
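
The output above shows that `duplicated` compares strings exactly, so 'RUN THEM OVER' and 'run them over' are reported as two separate groups. A minimal sketch of a case-insensitive check follows, using a small made-up DataFrame (`toy` and its rows are illustrative assumptions, not the real dataset):

```python
import pandas as pd

# Toy data standing in for df[['Text', 'IsToxic']] (hypothetical rows)
toy = pd.DataFrame({
    'Text': ['RUN THEM OVER', 'run them over', 'Good point', 'run them over'],
    'IsToxic': [True, True, False, True],
})

# Exact-match duplicates: only the two lowercase rows match each other
exact = toy[toy.duplicated(['Text'], keep=False)]

# Case-insensitive duplicates: compare on a lowercased, stripped key
key = toy['Text'].str.lower().str.strip()
case_insensitive = toy[key.duplicated(keep=False)]

# Optionally keep only the first occurrence of each case-insensitive duplicate
deduplicated = toy[~key.duplicated(keep='first')]

print(len(exact), len(case_insensitive), len(deduplicated))
```

Whether such rows should actually be dropped is a modelling decision; the notebook itself only inspects them.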


# Clean Special Characters from 'Text'
Use regular expressions and string operations to remove or replace special characters, URLs, and symbols.

In [30]:
import re

def clean_text(text):
    # Remove URLs (http\S+ already covers https links)
    text = re.sub(r'http\S+|www\S+', '', text)
    # Remove Twitter-style @handles, the '#' symbol (the hashtag word itself is kept), and digits
    text = re.sub(r'\@\w+|\#|\d+', '', text)
    # Remove punctuation (anything that is not a word character or whitespace)
    text = re.sub(r'[^\w\s]', '', text)
    return text

# Apply clean_text to the Text column
df['Cleaned_Text'] = df['Text'].apply(clean_text)

# Random sample of the Cleaned_Text column
df['Cleaned_Text'].sample(10)

767    As a black person all I have to say is THANK Y...
677                                        run them over
419    I cant tell if this is a racist cop or a dumb ...
119            The only good CNN anchor I have ever seen
734    finally a black person with a unbiased opinion...
410    Whether youre right or wrong about the Michael...
63     This Muslim man does not have anything in mind...
831    Well said Peggy your message should be shared ...
595                                       alllivesmatter
836    Yea We should beat up all white people rob bla...
Name: Cleaned_Text, dtype: object
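
To make the effect of each substitution concrete, the sketch below runs three equivalent patterns on a made-up comment (the sample string is an assumption for illustration). Only the '#' character is removed, so the hashtag word survives, which matches 'alllivesmatter' in the sample output above.

```python
import re

def clean_text(text):
    # Equivalent to the three substitutions in clean_text above
    text = re.sub(r'http\S+|www\S+', '', text)   # URLs
    text = re.sub(r'\@\w+|\#|\d+', '', text)     # @handles, '#', digits
    text = re.sub(r'[^\w\s]', '', text)          # punctuation
    return text

sample = "Hey @user123, see https://example.com #AllLivesMatter 2020!!"
cleaned = clean_text(sample)

# Splitting on whitespace hides the extra spaces left behind by the removals
print(cleaned.split())  # ['Hey', 'see', 'AllLivesMatter']
```

The removals leave runs of spaces behind; these are collapsed later by the whitespace normalization step.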

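# Normalize 'Text'
Convert the text to lowercase, strip accents, and normalize whitespace. The function below does not expand abbreviations or contractions; the cleaning step has already removed apostrophes, so a contraction such as "can't" reaches this step as "cant".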

In [31]:
import unicodedata

def normalize_text(text):
    # Convert the text to lowercase
    text = text.lower()
    # Strip accents: decompose with NFD, then drop the combining marks (category 'Mn')
    text = ''.join(c for c in unicodedata.normalize('NFD', text) if unicodedata.category(c) != 'Mn')
    # Collapse runs of whitespace into a single space and trim the ends
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Apply normalize_text to the Cleaned_Text column
df['Normalized_Text'] = df['Cleaned_Text'].apply(normalize_text)

# Random sample of the Normalized_Text column
df['Normalized_Text'].sample(10)

992    there are a lot of disgusting people in the co...
241                            confederation ill be back
558    the cop had a fractured eye socket and the fis...
810    this older white conservative says peggy for v...
955    they cant help it but these motherfuckers are ...
159    why the hell are they even interviewing him th...
698    its hilarious seeing these morons get hit by cars
367    your gentle giant seems to have robbed a store...
879    the cops need to look at the video and start a...
935                                 the lady at hahahaha
Name: Normalized_Text, dtype: object
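
The accent-stripping line relies on Unicode decomposition: NFD splits an accented letter into its base letter plus a combining mark (category 'Mn'), and the generator expression drops the marks. A small sketch on a made-up string (the input is an assumption, chosen to exercise all three steps):

```python
import re
import unicodedata

def normalize_text(text):
    text = text.lower()
    # NFD turns 'é' into 'e' + U+0301; dropping category 'Mn' removes the accent
    text = ''.join(c for c in unicodedata.normalize('NFD', text)
                   if unicodedata.category(c) != 'Mn')
    # Collapse runs of whitespace (including newlines) into single spaces
    return re.sub(r'\s+', ' ', text).strip()

print(normalize_text("  Caf\u00e9   NA\u00cfVE\n"))  # cafe naive
```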

# Remove Stopwords
Import NLTK and download the stopword list for English.
Stopword lists are commonly used in text mining and natural language processing to filter out words that are so frequent that they carry very little useful information.

In [32]:
%pip install -q nltk

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    # Split the text into words
    words = text.split()
    # Drop the stopwords
    filtered_words = [word for word in words if word.lower() not in stop_words]
    # Join the remaining words back into a single string
    return ' '.join(filtered_words)

# Apply remove_stopwords to the Normalized_Text column
df['Processed_Text'] = df['Normalized_Text'].apply(remove_stopwords)

# Random sample of the Processed_Text column
df['Processed_Text'].sample(10)

Note: you may need to restart the kernel to use updated packages.


[nltk_data] Downloading package stopwords to /Users/aitor/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


853                            bunch complete dumb asses
167               guy idiot thats best lost already bozo
572                         would eating coal could roll
499                                      good one stefan
143    wow think im going buy bullet proof vestshit b...
22                                      word provocateur
4      arab dude absolutely right shot extra time sho...
932    put civilians perspective two people get fight...
725                  peggy im native american love honey
193            people fuck america american fucking hate
Name: Processed_Text, dtype: object
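
Contractions reach this step without their apostrophes (the cleaning step removed them), so they are filtered only if the stopword set contains the apostrophe-free spelling; tokens such as 'im' and 'thats' are still present in the sample output above. One possible mitigation, assuming the stopword list stores contractions with apostrophes, is to add apostrophe-free variants before filtering. The sketch uses a tiny hand-written set (`base_stop_words` is a hypothetical stand-in, not NLTK's actual list):

```python
# Hypothetical miniature stopword set; in the notebook this role is played by
# set(stopwords.words('english'))
base_stop_words = {"i", "the", "is", "i'm", "that's", "don't"}

# Add a variant of each stopword with apostrophes stripped, e.g. "i'm" -> "im"
augmented_stop_words = base_stop_words | {w.replace("'", "") for w in base_stop_words}

def remove_stopwords(text, stop_words):
    # Keep only the tokens that are not in the given stopword set
    return ' '.join(w for w in text.split() if w not in stop_words)

text = "im sure thats the best guy"
print(remove_stopwords(text, base_stop_words))       # im sure thats best guy
print(remove_stopwords(text, augmented_stop_words))  # sure best guy
```

An alternative would be to expand or filter contractions before punctuation is removed; which ordering works better for this dataset is not something the notebook establishes.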

In [34]:
def clean_emoji(text):
    # Remove emojis by dropping every run of non-ASCII characters
    # (this also removes any other non-ASCII symbol still present)
    text = re.sub(r'[^\u0000-\u007F]+', '', text)
    return text

# Apply clean_emoji to the Processed_Text column
df['NoEmoji_Processed_Text'] = df['Processed_Text'].apply(clean_emoji)

# Random sample of the NoEmoji_Processed_Text column
df['NoEmoji_Processed_Text'].sample(10)

313                  informative really appreciate video
649                  someone taken one team plowed right
233    im point think cop points gun know im told som...
338    take medical cannabis guy liked couple face bo...
544    need able call people guns ca citizens complet...
43      smirconnish evenhanded guy world guy masri idiot
752                                           peggy hero
804    amen peggy hubbard color makes difference pers...
776    need people like peggy hubbard support mother ...
497    picture cops injuries wasnt ambulance injured ...
Name: NoEmoji_Processed_Text, dtype: object
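
The pattern `[^\u0000-\u007F]+` removes every non-ASCII character, not only emojis. At this point in the pipeline accented Latin letters have already been reduced to ASCII, but on raw text the same pattern would also delete letters such as 'é'. The sketch below contrasts it with a narrower pattern limited to a few common emoji code-point ranges (the ranges are an illustrative, non-exhaustive assumption, and `strip_emoji_only` is a hypothetical helper):

```python
import re

def clean_emoji(text):
    # Notebook approach: drop every run of non-ASCII characters
    return re.sub(r'[^\u0000-\u007F]+', '', text)

# Narrower alternative: only some common emoji blocks (not a complete list)
EMOJI_PATTERN = re.compile('[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F]+')

def strip_emoji_only(text):
    # Remove characters in the emoji ranges, leaving other non-ASCII text intact
    return EMOJI_PATTERN.sub('', text)

raw = "nice caf\u00e9 \U0001F600"
print(repr(clean_emoji(raw)))       # 'nice caf '
print(repr(strip_emoji_only(raw)))  # 'nice café '
```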