# Load and Examine the Dataset
Import pandas, load the dataframe, and examine the 'Text' variable through basic descriptive statistics and a random sample.

Keep only two columns, 'Text' and 'IsToxic', to keep the model simple and computationally efficient, reduce noise, limit the risk of overfitting, and improve interpretability.


In [1]:
# Import pandas to load the dataset and select columns
%pip install -q pandas

import pandas as pd

# Load the dataset from the Data folder
data = pd.read_csv('Data/youtoxic_english_1000.csv')
# Copy the selection so that columns added later do not trigger SettingWithCopyWarning
df = data[['Text', 'IsToxic']].copy()

print(df.head())
print(df.dtypes)
print("="*50)

print(f"Descriptive analysis of the Text variable: \n{df['Text'].describe()}")
print("="*50)

print(f"Random sample of the Text variable: \n{df['Text'].sample(10)}")

print("="*50)
print(f"Descriptive analysis of the IsToxic variable: \n{df['IsToxic'].describe()}")


                                                Text  IsToxic
0  If only people would just take a step back and...    False
1  Law enforcement is not trained to shoot to app...     True
2  \nDont you reckon them 'black lives matter' ba...     True
3  There are a very large number of people who do...    False
4  The Arab dude is absolutely right, he should h...    False
Text       object
IsToxic      bool
dtype: object
Descriptive analysis of the Text variable: 
count              1000
unique              997
top       run them over
freq                  3
Name: Text, dtype: object
Random sample of the Text variable: 
238    Wait, so who was rioting again? The people sta...

In [2]:
# Find repeated values in the Text column
duplicates = df[df.duplicated(['Text'], keep=False)]
print(f"Duplicated values in the Text variable: \n{duplicates}")

Duplicated values in the Text variable: 
              Text  IsToxic
592  RUN THEM OVER     True
642  run them over     True
657  run them over     True
677  run them over     True
699  RUN THEM OVER     True
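
The exact-match check above treats `RUN THEM OVER` and `run them over` as different strings. A minimal sketch, assuming a small made-up frame (`toy` and its rows are illustrative, not taken from the dataset), of how duplicates could be flagged regardless of letter case and surrounding whitespace:

```python
import pandas as pd

# Hypothetical toy data; the real notebook would use df['Text'] instead.
toy = pd.DataFrame({
    'Text': ['RUN THEM OVER', 'run them over', ' Run them over ', 'stay calm'],
    'IsToxic': [True, True, True, False],
})

# Build a comparison key that ignores case and leading/trailing whitespace.
key = toy['Text'].str.strip().str.lower()

# keep=False marks every member of a duplicate group, not only the repeats.
case_insensitive_dupes = toy[key.duplicated(keep=False)]
print(case_insensitive_dupes)

# Keep the first row of each group and rebuild a clean 0..n-1 index.
deduped = toy[~key.duplicated(keep='first')].reset_index(drop=True)
print(deduped)
```

Whether such rows should actually be dropped is a modeling decision; the sketch only shows how they could be found.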


# Clean Special Characters from 'Text'
Use regular expressions and string operations to remove URLs, Twitter handles, '#' symbols, digits, and punctuation.

In [3]:
import re

def clean_text(text):
    # Remove URLs ('http\S+' already covers https links)
    text = re.sub(r'http\S+|www\S+', '', text)
    # Remove Twitter handles, '#' symbols (the hashtag word itself is kept), and digits
    text = re.sub(r'\@\w+|\#|\d+', '', text)
    # Remove punctuation (anything that is not a word character or whitespace)
    text = re.sub(r'[^\w\s]', '', text)
    return text

# Apply the clean_text function to the Text column
df['Cleaned_Text'] = df['Text'].apply(clean_text)

# Random sample of the Cleaned_Text column
df['Cleaned_Text'].sample(10)


258    What is the relevance of marijuana here\n\nThe...
837                                 blacklivesdontmatter
895    Some people have seen Mississippi Burning way ...
355     agree people are up in arms but they need to ...
51     The world is changing im mad at the police too...
808                  No Leave it alone Dont take it down
978             I like how they say hurr instead of here
97                                   This guy is a moron
256    What does this video have to do with what actu...
781    Dear Peggy can you please be all of our Mother...
Name: Cleaned_Text, dtype: object
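
To make the effect of each cleaning step easier to see, here is a small sketch that applies the same three substitutions one at a time. The `sample` string is made up for illustration and the URL is a placeholder.

```python
import re

# Hypothetical comment, not taken from the dataset.
sample = "Check https://example.com @user123 #blm 2 cops!!! Don't go."

# Step 1: drop URLs (anything starting with http or www up to the next whitespace).
no_urls = re.sub(r'http\S+|www\S+', '', sample)

# Step 2: drop @handles, '#' symbols, and runs of digits.
no_handles = re.sub(r'\@\w+|\#|\d+', '', no_urls)

# Step 3: drop every character that is neither a word character nor whitespace.
no_punct = re.sub(r'[^\w\s]', '', no_handles)

print(repr(no_urls))
print(repr(no_handles))
print(repr(no_punct))
```

Two details stand out in the result: the hashtag word `blm` survives because only the `#` symbol is removed, and `Don't` becomes `Dont` because the apostrophe counts as punctuation. The extra spaces left behind are collapsed later by the whitespace normalization step.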

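# Normalizing 'Text'
Convert the text to lowercase, normalize whitespace, and expand common English contractions. The accent-removal step is present in the code but commented out. Note that the contraction replacements look for apostrophes, which the punctuation removal in the previous step has already stripped, so on 'Cleaned_Text' they have no effect (forms such as "dont" remain in the output below).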

In [4]:
import unicodedata

def normalize_text(text):
    # Convert the text to lowercase
    text = text.lower()
    # Normalize whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    # Expand common English contractions
    text = text.replace("what's", "what is ")
    text = text.replace("'s", " ")
    text = text.replace("'ve", " have ")
    text = text.replace("can't", "cannot ")
    text = text.replace("n't", " not ")
    text = text.replace("i'm", "i am ")
    text = text.replace("'re", " are ")
    text = text.replace("'d", " would ")
    text = text.replace("'ll", " will ")
    # Remove accents (currently disabled)
    # text = ''.join(c for c in unicodedata.normalize('NFD', text) if unicodedata.category(c) != 'Mn')
    return text

# Apply the normalize_text function to the Cleaned_Text column
df['Normalized_Text'] = df['Cleaned_Text'].apply(normalize_text)

# Random sample of the Normalized_Text column
df['Normalized_Text'].sample(10)


414    michale brown got what he was asking for the b...
58     can anyone tell me why it is racist if a white...
391    can someone please turn that into an autotune ...
45     he had to take a break because bassem was maki...
532    ana kasparian is the most beautiful woman in t...
834             really black people are the most violent
193    the more those people fuck up america the more...
378    well here ya go young white man killed by a no...
11     moral of the story dont reach for a cops gun b...
950     why is the cop mooing at people in the beginning
Name: Normalized_Text, dtype: object
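
The order of the cleaning and normalization steps matters for contractions. The sketch below compares the two orders on a made-up sentence; `expand_contractions` and `strip_punctuation` are illustrative helper names, not functions from the notebook, although the replacement pairs are the ones used in `normalize_text`.

```python
import re

def expand_contractions(text):
    # Same replacement pairs and order as normalize_text:
    # specific forms first, generic suffixes afterwards.
    text = text.lower()
    for pattern, replacement in [
        ("what's", "what is "), ("'s", " "), ("'ve", " have "),
        ("can't", "cannot "), ("n't", " not "), ("i'm", "i am "),
        ("'re", " are "), ("'d", " would "), ("'ll", " will "),
    ]:
        text = text.replace(pattern, replacement)
    return text

def strip_punctuation(text):
    # Remove punctuation, then collapse repeated whitespace.
    text = re.sub(r'[^\w\s]', '', text)
    return re.sub(r'\s+', ' ', text).strip()

raw = "They can't say what's wrong, and I'm sure you've noticed."

# Punctuation first: no apostrophes are left, so nothing is expanded.
punct_first = expand_contractions(strip_punctuation(raw))

# Contractions first: the apostrophes still exist when the replacements run.
expand_first = strip_punctuation(expand_contractions(raw))

print(punct_first)
print(expand_first)
```

With punctuation removed first, tokens such as `cant` and `youve` end up as separate vocabulary entries; expanding first maps them to `cannot` and `you have`. Which behavior is preferable depends on the downstream model.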

# Remove Stopwords
Import NLTK and download the stopword list for English.
Stopwords are commonly removed in text mining and natural language processing because they are so widely used that they carry very little useful information.

In [5]:
%pip install -q nltk

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    # Split the text into words
    words = text.split()
    # Remove stopwords
    filtered_words = [word for word in words if word.lower() not in stop_words]
    # Join the words back into a single string
    return ' '.join(filtered_words)

# Apply the remove_stopwords function to the Normalized_Text column
df['NoStopWords_Text'] = df['Normalized_Text'].apply(remove_stopwords)

# Random sample of the NoStopWords_Text column
df['NoStopWords_Text'].sample(10)


[nltk_data] Downloading package stopwords to /Users/aitor/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

549                                      joe rogan idiot
836    yea beat white people rob black owned stores c...
169    muslims like spit white people brought muslims...
888    truly tragedynot death kid way city reacted wa...
993    whites moveout ferguson lets see nice becomes ...
949    saw article many cases went pretty much one pa...
560           hey shout thai writing shirt lol thaipride
915    dont want justice see mob mentality action hes...
787    women get nobel peace prize instead antichrist...
315    great video levelheaded cogent fair unfortunat...
Name: NoStopWords_Text, dtype: object
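
One caveat for toxicity detection: NLTK's English stopword list includes negations such as `not` and `no`, so removing every stopword can change what a comment means. A minimal sketch of keeping negations, using a tiny hand-written set as a stand-in for `stopwords.words('english')` (both sets below are assumptions for illustration):

```python
# Hypothetical mini stopword list standing in for stopwords.words('english').
stop_words = {'this', 'is', 'a', 'not', 'no', 'the', 'are', 'you'}

# Negations we may want to keep because they can flip the meaning of a comment.
negations = {'not', 'no', 'nor', 'never'}

def remove_stopwords(text, stop_words):
    # Keep only the tokens that are not in the given stopword set.
    return ' '.join(w for w in text.split() if w not in stop_words)

comment = 'this is not a threat'

without_negation = remove_stopwords(comment, stop_words)
with_negation = remove_stopwords(comment, stop_words - negations)

print(without_negation)
print(with_negation)
```

Subtracting the negation set before filtering is only one option; whether it helps would need to be checked against model performance.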

# Remove Emojis 😵‍💫

In [6]:
def clean_emoji(text):
    # Remove emojis by dropping every non-ASCII character
    # (this also removes accented letters and other non-ASCII symbols)
    text = re.sub(r'[^\u0000-\u007F]+', '', text)
    return text

# Apply the clean_emoji function to the NoStopWords_Text column
df['NoEmoji_Processed_Text'] = df['NoStopWords_Text'].apply(clean_emoji)

# Random sample of the NoEmoji_Processed_Text column
df['NoEmoji_Processed_Text'].sample(10)


471    video pushing clerk around really made lose sy...
867    police shot black guy blacks riot black shoot ...
472    easy point negative black man point good point...
58     anyone tell racist white man shoots black man ...
180    say black piece came bit stiff due repeated us...
442    dude someone please tell dude listen violent m...
338    take medical cannabis guy liked couple face bo...
204    majority people ferguson african american elec...
310    major difference zimmerman case one shooter zi...
554                   look mike brown world instrumental
Name: NoEmoji_Processed_Text, dtype: object
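
The regular expression in `clean_emoji` keeps only ASCII characters, so it removes accented letters along with emojis. As a rough alternative, the sketch below filters by Unicode category instead. `strip_symbols` is a hypothetical helper, and the category choice is an approximation: it will also drop non-ASCII symbols such as ©.

```python
import re
import unicodedata

def strip_non_ascii(text):
    # Same regex as clean_emoji: removes every character outside the ASCII range.
    return re.sub(r'[^\u0000-\u007F]+', '', text)

def strip_symbols(text):
    # Assumption: most emoji code points are in category 'So' (Symbol, other),
    # skin-tone modifiers are 'Sk', and zero-width joiners are 'Cf' (format).
    kept = []
    for c in unicodedata.normalize('NFC', text):
        if ord(c) > 127 and unicodedata.category(c) in {'So', 'Sk', 'Cf'}:
            continue
        if '\ufe00' <= c <= '\ufe0f':  # variation selectors used in emoji sequences
            continue
        kept.append(c)
    return ''.join(kept)

# 'café is great ' followed by an emoji built from two symbols and a joiner.
sample = 'caf\u00e9 is great \U0001F635\u200D\U0001F4AB'

print(repr(strip_non_ascii(sample)))
print(repr(strip_symbols(sample)))
```

For an English-only dataset the ASCII filter is likely sufficient; the category-based version matters mainly when accented or non-Latin text should be preserved.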

# Transform the Text with TF-IDF
Transform the processed text in 'NoEmoji_Processed_Text' into a numerical representation using TF-IDF (Term Frequency-Inverse Document Frequency), with unigrams and bigrams and a vocabulary capped at 5,000 features. The result is a dataframe with one TF-IDF vector per comment.

In [7]:
%pip install -q scikit-learn

from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=5000)

# Fit the vocabulary and transform the text
tfidf_matrix = tfidf_vectorizer.fit_transform(df['NoEmoji_Processed_Text'])

# Convert the matrix into a DataFrame
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

# Show the shape of the DataFrame
print(tfidf_df.shape)
print("="*50)
# Display a random sample
tfidf_df.sample(10)

(1000, 5000)


Unnamed: 0,aaannnyything,ability,able,absolutely,absolutely nothing,abuse,according,accountable,accountable actions,accounts,...,young white,youre,youth,youtube,youve,zimmerman,zimmerman case,zimmerman michael,zimmermans,zionist
256,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.089258,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
454,0.0,0.0,0.0,0.0,0.0,0.181738,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
335,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
408,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
393,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
65,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
234,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
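
To illustrate where the weights in the table come from, here is a small pure-Python sketch of the computation, assuming the documented defaults of `TfidfVectorizer` (`smooth_idf=True`, `norm='l2'`, raw term counts). It uses a plain whitespace split rather than scikit-learn's tokenizer and does not implement `max_features`, so it is an approximation for intuition only. The three documents are made up.

```python
import math
from collections import Counter

def ngrams(tokens, n_min=1, n_max=2):
    # Yield every contiguous n-gram of length n_min..n_max as a space-joined string.
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield ' '.join(tokens[i:i + n])

def tfidf(docs):
    # Term counts per document (unigrams and bigrams).
    counts = [Counter(ngrams(doc.split())) for doc in docs]
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for c in counts for term in c)
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in doc_freq.items()}
    rows = []
    for c in counts:
        weights = {t: tf * idf[t] for t, tf in c.items()}
        # L2-normalize each document vector.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows

docs = ['police shot man', 'police riot', 'nice video']
rows = tfidf(docs)
for row in rows:
    print({t: round(w, 3) for t, w in row.items()})
```

In the second document, `police` receives a lower weight than `riot` because it also appears in the first document, which is the behavior the IDF term is designed to produce.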


# Combine and Save the Processed Data as a Dataset

In [8]:
# Combine the TF-IDF dataframe with the 'IsToxic' labels
features = tfidf_df
labels = df['IsToxic']

processed_df = pd.concat([features, labels], axis=1)

# Save the processed dataframe to a CSV file in the Data folder
processed_df.to_csv('Data/youtoxic_english_1000_processed.csv', index=False)
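
`pd.concat(..., axis=1)` aligns rows by index label, not by position. In this notebook both frames appear to share the default 0..999 index, so the rows line up. The sketch below, with made-up data, shows what could happen if rows were dropped upstream without resetting the index, and one way to guard against it.

```python
import pandas as pd

# Hypothetical features with a fresh 0..n-1 index, as produced by pd.DataFrame(matrix.toarray()).
features = pd.DataFrame({'f1': [0.1, 0.2, 0.3]})

# Hypothetical labels whose index has a gap, e.g. after rows were removed earlier.
labels = pd.Series([True, False, True], index=[0, 2, 3], name='IsToxic')

# Index labels differ, so concat takes their union and fills the gaps with NaN.
misaligned = pd.concat([features, labels], axis=1)
print(misaligned)

# Resetting the index restores positional alignment before combining.
aligned = pd.concat([features, labels.reset_index(drop=True)], axis=1)
print(aligned)
```

A quick check such as comparing `len(processed_df)` with `len(df)` before saving would catch this kind of mismatch.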