# Load and Examine the Dataset
Import pandas, load the dataframe, and examine the 'Text' variable using basic descriptive statistics and a random sample.

Keep only two columns, 'Text' and 'IsToxic', to keep the model simple and computationally efficient, reduce noise, limit overfitting, and make the results easier to interpret.


In [50]:
# Import pandas to load the dataset and select columns
%pip install -q pandas

import pandas as pd

# Load the dataset from the Data folder
data = pd.read_csv('Data/youtoxic_english_1000.csv')
# Take an explicit copy so that columns added later do not trigger SettingWithCopyWarning
df = data[['Text', 'IsToxic']].copy()

print(df.head())
print(df.dtypes)
print("="*50)

print(f"Descriptive analysis of the Text variable: \n{df['Text'].describe()}")
print("="*50)

print(f"Random sample of the Text variable: \n{df['Text'].sample(10)}")

print("="*50)
print(f"Descriptive analysis of the IsToxic variable: \n{df['IsToxic'].describe()}")

Note: you may need to restart the kernel to use updated packages.
                                                Text  IsToxic
0  If only people would just take a step back and...    False
1  Law enforcement is not trained to shoot to app...     True
2  \nDont you reckon them 'black lives matter' ba...     True
3  There are a very large number of people who do...    False
4  The Arab dude is absolutely right, he should h...    False
Text       object
IsToxic      bool
dtype: object
Descriptive analysis of the Text variable: 
count              1000
unique              997
top       run them over
freq                  3
Name: Text, dtype: object
Random sample of the Text variable: 
164    Bassem Masri is an uneducated idiot. Just list...
901                          7:24 7:30 XDDDDDDDDDDDDDDDD
133    I made a song addressing Ferguson and the issu...
236             to bad those weapons were not discharged
522    "Almost floor shaving quality cigars..."\n\n\n...
394    moly looooooooo

In [38]:
# Find repeated values in the Text column
duplicates = df[df.duplicated(['Text'], keep=False)]
print(f"Duplicate values in the Text variable: \n{duplicates}")

Duplicate values in the Text variable: 
              Text  IsToxic
592  RUN THEM OVER     True
642  run them over     True
657  run them over     True
677  run them over     True
699  RUN THEM OVER     True
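
`duplicated` compares strings exactly, which is why "RUN THEM OVER" and "run them over" count as different values in the descriptive statistics. A minimal sketch of how case-insensitive duplicates could be flagged instead; the `toy` frame and its rows are made up for illustration and are not taken from the CSV.

```python
import pandas as pd

# Toy frame standing in for df; the rows are illustrative, not real dataset rows
toy = pd.DataFrame({
    "Text": ["RUN THEM OVER", "run them over", "Run them over ", "something else"],
    "IsToxic": [True, True, True, False],
})

# Build a comparison key: lowercase and strip surrounding whitespace
key = toy["Text"].str.lower().str.strip()

# keep=False marks every member of a duplicate group, as in the cell above
case_insensitive_dupes = toy[key.duplicated(keep=False)]
print(case_insensitive_dupes)
```

Comparing on a derived key leaves the original 'Text' column untouched, so the decision to drop or keep the repeated comments can be made separately.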


# Clean Special Characters from 'Text'
Use regular expressions and string operations to remove URLs, Twitter handles, the '#' symbol, digits, and punctuation.

In [39]:
import re

def clean_text(text):
    # Remove URLs
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
    # Remove Twitter handles, '#' symbols (the hashtag word itself is kept), and digits
    text = re.sub(r'\@\w+|\#|\d+', '', text)
    # Remove punctuation: anything that is not a word character or whitespace, including apostrophes
    text = re.sub(r'[^\w\s]', '', text)
    return text

# Apply clean_text to the Text column
df['Cleaned_Text'] = df['Text'].apply(clean_text)

# Random sample of the Cleaned_Text column
df['Cleaned_Text'].sample(10)

884        In the point they almost set a person on fire
759    So proud of a woman  mother that stands on pri...
259    You know its crazy I actually came across your...
74                                  plain n simple truth
772                        PEGGY HUBBARD for PRESIDENT  
401                As always Mr Molyneux excellent video
465     Kajieme Powell Shooting Video Disturbs For So...
238    Wait so who was rioting again The people stand...
315    Great video Levelheaded cogent and fair Unfort...
621                                 All liberal nonsense
Name: Cleaned_Text, dtype: object
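
To make the effect of each regular expression visible, the sketch below applies the same three substitutions to a made-up comment. The sample string is an assumption for illustration, not a row from the dataset.

```python
import re

def clean_text(text):
    # Same three substitutions as in the cell above
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
    text = re.sub(r'\@\w+|\#|\d+', '', text)
    text = re.sub(r'[^\w\s]', '', text)
    return text

# Hypothetical comment containing a handle, a URL, a hashtag, digits and punctuation
sample = "@user1 don't watch https://example.com #shame 100 times!"
cleaned = clean_text(sample)

# Split on whitespace to ignore the gaps left behind by the removed pieces
print(cleaned.split())  # ['dont', 'watch', 'shame', 'times']
```

Two details stand out: the word after '#' survives while the symbol is dropped, and "don't" becomes "dont" because the apostrophe counts as punctuation.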

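# Normalize 'Text'
Convert the text to lowercase, normalize whitespace, and expand common English contractions. Accent removal is included in the function but left commented out. Because the previous cleaning step already removed apostrophes, the contraction rules only take effect on text that still contains them.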

In [44]:
import unicodedata

def normalize_text(text):
    # Convert the text to lowercase
    text = text.lower()
    # Expand common English contractions (these rules need the apostrophes to still be present)
    text = text.replace("what's", "what is ")
    text = text.replace("'s", " ")
    text = text.replace("'ve", " have ")
    text = text.replace("can't", "cannot ")
    text = text.replace("n't", " not ")
    text = text.replace("i'm", "i am ")
    text = text.replace("'re", " are ")
    text = text.replace("'d", " would ")
    text = text.replace("'ll", " will ")
    # Remove accents (disabled)
    # text = ''.join(c for c in unicodedata.normalize('NFD', text) if unicodedata.category(c) != 'Mn')
    # Normalize whitespace last, since the replacements above can introduce extra spaces
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Apply normalize_text to the Cleaned_Text column
df['Normalized_Text'] = df['Cleaned_Text'].apply(normalize_text)

# Random sample of the Normalized_Text column
df['Normalized_Text'].sample(10)

828    god bless this lady i wish there were more peo...
362    my god look at this raw footage shocking fergu...
415    as a white guy if you are really concerned wit...
421    apparently stefan never read whats the matter ...
343    mr stefan molyneux all mr wilson had to do was...
767    as a black person all i have to say is thank y...
138    if that cop shot a white guy there would not b...
463    i am not against most of what is being present...
412    link between poor little tryvon martin and lit...
160    oh my god i fucking love that arab guy such wi...
Name: Normalized_Text, dtype: object
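
The sample above still contains forms such as "whats", which suggests the contraction rules found nothing to match once punctuation had been stripped. The sketch below compares the two orderings on a made-up sentence and also shows what the commented-out accent removal would do. The helper names (`expand_contractions`, `strip_punctuation`, `strip_accents`), the reduced rule list, and the sample sentence are assumptions of this sketch, not part of the notebook.

```python
import re
import unicodedata

# Illustrative subset of the contraction rules used in normalize_text
CONTRACTIONS = [
    ("can't", "cannot"),
    ("n't", " not"),
    ("i'm", "i am"),
    ("'re", " are"),
    ("'ll", " will"),
]

def expand_contractions(text):
    # Lowercase first so the rules match regardless of the original casing
    text = text.lower()
    for pattern, replacement in CONTRACTIONS:
        text = text.replace(pattern, replacement)
    return text

def strip_punctuation(text):
    # Same punctuation rule as clean_text
    return re.sub(r'[^\w\s]', '', text)

def strip_accents(text):
    # NFD splits accented letters into a base letter plus combining marks,
    # then every combining mark (Unicode category 'Mn') is dropped
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')

# Hypothetical sentence; \u00ef is the precomposed letter i with diaeresis
sample = "I can't believe they're na\u00efve!"

# Order used in the notebook: punctuation is removed first, so no rule matches
punct_first = expand_contractions(strip_punctuation(sample))

# Alternative order: expand contractions, then strip punctuation and accents
expand_first = strip_accents(strip_punctuation(expand_contractions(sample)))

print(punct_first)
print(expand_first)
```

With the alternative order the sentence becomes "i cannot believe they are naive", whereas the notebook's order leaves "cant" and "theyre" unexpanded.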

# Remove Stopwords
Import NLTK and download the stopwords for English.
Stopwords are commonly removed in text mining and natural language processing because they occur so frequently that they carry very little useful information.

In [45]:
%pip install -q nltk

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    # Split the text into words
    words = text.split()
    # Drop the stopwords
    filtered_words = [word for word in words if word.lower() not in stop_words]
    # Join the remaining words back into a single string
    return ' '.join(filtered_words)

# Apply remove_stopwords to the Normalized_Text column
df['NoStopWords_Text'] = df['Normalized_Text'].apply(remove_stopwords)

# Random sample of the NoStopWords_Text column
df['NoStopWords_Text'].sample(10)

Note: you may need to restart the kernel to use updated packages.


[nltk_data] Downloading package stopwords to /Users/aitor/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


126    trayvon martin michael brown common real reaso...
888    truly tragedynot death kid way city reacted wa...
388    well know truth today yet still hear hands don...
175    people say blacks play victim according commen...
846    real criminals wall street guys steal money in...
766                                          agree sista
390    everyone knows white people die cops blacks ev...
975    americas days coming end emigration rate multi...
797                                       right dynamite
378    well ya go young white man killed nonwhite say...
Name: NoStopWords_Text, dtype: object
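
Stopword lists usually include negations such as "not", and removing them can change the meaning of a short comment. The sketch below shows the filtering logic on a tiny hand-written word set so that it runs without the NLTK download. The miniature `stop_words` set and the `keep_words` parameter are assumptions of this sketch, not part of the notebook.

```python
# Hypothetical miniature stop-word set, standing in for stopwords.words('english')
stop_words = {"i", "am", "not", "a", "the", "is", "this", "of"}

def remove_stopwords(text, stop_words, keep_words=frozenset()):
    # Words listed in keep_words are protected from removal
    effective = set(stop_words) - set(keep_words)
    return ' '.join(w for w in text.split() if w.lower() not in effective)

sample = "i am not a fan of the video"

# Plain filtering drops the negation together with the other stopwords
print(remove_stopwords(sample, stop_words))                       # fan video
# Protecting "not" keeps the negation in the output
print(remove_stopwords(sample, stop_words, keep_words={"not"}))   # not fan video
```

Whether keeping negations helps the toxicity classifier is an empirical question; the sketch only shows how the choice could be exposed as a parameter.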

# Remove Emojis 😵‍💫

In [46]:
def clean_emoji(text):
    # Remove emojis by dropping every non-ASCII character (this also removes accented letters)
    text = re.sub(r'[^\u0000-\u007F]+', '', text)
    return text

# Apply clean_emoji to the NoStopWords_Text column
df['NoEmoji_Processed_Text'] = df['NoStopWords_Text'].apply(clean_emoji)

# Random sample of the NoEmoji_Processed_Text column
df['NoEmoji_Processed_Text'].sample(10)

311    michael brown positively identified assailant ...
484                                shoot someone run lol
453    music references sex drugs stealing murder etc...
376    youtube comwatchvyfbbtdvhg research didnt come...
544    need able call people guns ca citizens complet...
119                            good cnn anchor ever seen
257    funny thing government trusted judge fucking d...
732     love woman voice matters fuck black lives matter
487     advertising ignorance creates much crime opinion
854    ignorance law excuse usc deprivation rights co...
Name: NoEmoji_Processed_Text, dtype: object
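
The pattern used in `clean_emoji` keeps only ASCII characters, so it removes accented letters and other non-English text along with the emojis. The sketch below contrasts it with a narrower pattern. The code point ranges in `EMOJI` are an approximation chosen for this example and do not cover every emoji.

```python
import re

# Pattern used in the notebook: drops every non-ASCII character
ASCII_ONLY = re.compile(r'[^\u0000-\u007F]+')

# Narrower alternative (an assumption of this sketch): a few common emoji blocks,
# plus the variation selector and zero-width joiner used inside emoji sequences
EMOJI = re.compile(r'[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F\u200D]+')

# "café", then the emoji from the heading written as escapes, then "ok"
sample = "caf\u00e9 \U0001F635\u200D\U0001F4AB ok"

ascii_result = ASCII_ONLY.sub('', sample)
emoji_result = EMOJI.sub('', sample)

print(repr(ascii_result))  # the accented letter is lost together with the emoji
print(repr(emoji_result))  # only the emoji sequence is removed
```

For an English-only dataset the ASCII filter is a reasonable shortcut; the narrower pattern matters mainly when names or loanwords with accents should survive.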

# Transform the Text with TF-IDF
Transform the processed text into a numerical representation using TF-IDF (Term Frequency-Inverse Document Frequency), with unigrams and bigrams from 'NoEmoji_Processed_Text' and a vocabulary capped at 5,000 features. The result is a dataframe with one TF-IDF vector per comment.

In [48]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=5000)

# Fit and transform
tfidf_matrix = tfidf_vectorizer.fit_transform(df['NoEmoji_Processed_Text'])

# Convert the matrix into a DataFrame
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

# Show the shape of the DataFrame
print(tfidf_df.shape)
print("="*50)
# Display a random sample
tfidf_df.sample(10)

(1000, 5000)


Unnamed: 0,aaannnyything,ability,able,absolutely,absolutely nothing,abuse,according,accountable,accountable actions,accounts,...,young white,youre,youth,youtube,youve,zimmerman,zimmerman case,zimmerman michael,zimmermans,zionist
688,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
491,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
854,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
138,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
673,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
648,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
957,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
483,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
153,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
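
To show where the TF-IDF weights come from, the sketch below computes them by hand for three made-up documents, using only the standard library. It follows what I understand to be scikit-learn's documented defaults (raw term counts, smoothed idf of ln((1 + n) / (1 + df)) + 1, then L2 normalization per row) and uses unigrams only, so it is an illustration of the formula rather than a replacement for `TfidfVectorizer`. The toy documents and the names `df_counts` and `tfidf_row` are assumptions of this sketch.

```python
import math
from collections import Counter

# Toy corpus; the documents are illustrative, not dataset rows
docs = ["run them over", "run away", "them again"]

tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(docs)

# Document frequency: in how many documents each term appears
df_counts = Counter(t for doc in tokenized for t in set(doc))

# Smoothed inverse document frequency: rarer terms get larger values
idf = {t: math.log((1 + n_docs) / (1 + df_counts[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    # Term count times idf, then scale the row to unit Euclidean length
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]

matrix = [tfidf_row(doc) for doc in tokenized]

print(vocab)
for doc, row in zip(docs, matrix):
    print(f"{doc!r}: {[round(v, 3) for v in row]}")
```

Each row has a non-zero entry only for the terms that occur in that document, which is why a sample of ten rows from a 5,000-column matrix shows mostly zeros.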


# Combine and Save the Processed Data as a Dataset

In [57]:
# Combine the 'tfidf' dataframe with the 'IsToxic' labels
features = tfidf_df
labels = df['IsToxic']

processed_df = pd.concat([features, labels], axis=1)

# Save the processed dataframe to a CSV file in the Data folder
processed_df.to_csv('Data/youtoxic_english_1000_processed.csv', index=False)
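
`pd.concat` with `axis=1` lines rows up by index label, not by position. In this notebook both frames carry the default 0 to 999 index, so the features and labels match. The sketch below shows what would happen if rows had been dropped earlier (for example, duplicates) without resetting the index. The toy frames are made up for illustration.

```python
import pandas as pd

# Features rebuilt from a matrix always get a fresh 0..n-1 index
features = pd.DataFrame({"f1": [0.1, 0.2, 0.3]})

# Labels that kept their original index after some rows were dropped
labels = pd.Series([True, False, True], index=[0, 2, 5], name="IsToxic")

# Index-based alignment produces extra rows filled with NaN
misaligned = pd.concat([features, labels], axis=1)

# Resetting the index restores purely positional alignment
aligned = pd.concat([features, labels.reset_index(drop=True)], axis=1)

print(misaligned)
print(aligned)
```

A quick check such as `processed_df.isna().any().any()` before saving would reveal this kind of mismatch.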