## Text preprocessing
Apply the following natural language processing techniques:

1. Tokenization
2. Stopword removal
3. Lemmatization or stemming
4. Vectorization with techniques such as TF-IDF or word embeddings
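
The first three steps can be illustrated without any NLP library. The sketch below is a toy, standard-library-only version: the `STOPWORDS` set and the `tokenize`, `remove_stopwords`, and `naive_stem` helpers are hypothetical stand-ins introduced here for illustration, not what this notebook uses (it relies on spaCy for all three steps).

```python
import re

# Tiny illustrative stopword list (assumption); real pipelines use a library-provided list.
STOPWORDS = {"the", "is", "a", "to", "and", "of"}

def tokenize(text):
    """Split lowercased text into runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]

def naive_stem(token):
    """Crude suffix stripping; a stand-in for a real stemmer or lemmatizer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

tokens = tokenize("The officers trained to shoot")
result = [naive_stem(t) for t in remove_stopwords(tokens)]
print(result)
```

A stemmer like this only chops suffixes, so it can produce non-words; a lemmatizer maps each token to a dictionary form instead, which is why the notebook uses spaCy's `lemma_`.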


In [1]:
import pandas as pd
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
import re
import unicodedata

In [2]:
# Load the spaCy model
nlp = spacy.load("en_core_web_sm")  # English model; swap in e.g. "es_core_news_sm" for Spanish text

In [3]:
def clean_text(text):
    """
    Clean the text by removing URLs and every character that is not an
    ASCII letter, digit, or whitespace (this also drops emojis,
    punctuation, and accented letters).

    Parameters:
    text (str): Text to clean

    Returns:
    str: Cleaned text
    """
    # Remove URLs (the dot after "www" is escaped so it matches a literal ".")
    text = re.sub(r'http\S+|www\.\S+', '', text)

    # Keep ASCII letters, digits, and whitespace; drop every other character
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)

    # Convert to lowercase
    text = text.lower()

    # Collapse repeated whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    return text
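
Because `clean_text` deletes punctuation outright, words separated only by punctuation get glued together, which may be where vocabulary entries such as `aboutdemocrat` in the TF-IDF output further down come from. It also drops accented letters entirely. The sketch below is a hypothetical variant, not part of the original notebook: `clean_text_spaced` is a name introduced here, and it uses the already-imported `unicodedata` module to fold accents to ASCII and replaces punctuation with a space instead of removing it.

```python
import re
import unicodedata

def clean_text_spaced(text):
    """Hypothetical variant of clean_text: folds accents to ASCII and
    replaces punctuation with a space instead of deleting it."""
    # Remove URLs
    text = re.sub(r'http\S+|www\.\S+', '', text)
    # Decompose accented characters (e.g. "é" -> "e" + combining accent) and drop the accents
    text = unicodedata.normalize('NFKD', text)
    text = ''.join(ch for ch in text if not unicodedata.combining(ch))
    # Replace anything that is not an ASCII letter, digit, or whitespace with a space
    text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)
    # Lowercase and collapse repeated whitespace
    return re.sub(r'\s+', ' ', text.lower()).strip()

# "\U0001F600" is an emoji (grinning face)
sample = "Café owners about...democrats \U0001F600 see http://example.com now"
print(clean_text_spaced(sample))
```

For this sample the variant yields `cafe owners about democrats see now`, whereas the original `clean_text` yields `caf owners aboutdemocrats see now`. The trade-off is that apostrophes also become spaces, so `don't` turns into `don t` rather than `dont`.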

In [4]:
# Preprocess the text: tokenization, stopword removal, and lemmatization
def preprocess_text(text):
    # Clean the text
    text = clean_text(text)

    # Process the text with spaCy
    doc = nlp(text)

    # Keep the lemma of each alphabetic token that is not a stopword
    tokens = [token.lemma_ for token in doc if not token.is_stop and token.is_alpha]

    return ' '.join(tokens)

In [5]:
# Load the dataset
df = pd.read_csv('../data/youtube.csv')  
df.head(2)

Unnamed: 0,CommentId,VideoId,Text,IsToxic,IsAbusive,IsThreat,IsProvocative,IsObscene,IsHatespeech,IsRacist,IsNationalist,IsSexist,IsHomophobic,IsReligiousHate,IsRadicalism
0,Ugg2KwwX0V8-aXgCoAEC,04kJtp6pVXI,If only people would just take a step back and...,False,False,False,False,False,False,False,False,False,False,False,False
1,Ugg2s5AzSPioEXgCoAEC,04kJtp6pVXI,Law enforcement is not trained to shoot to app...,True,True,False,False,False,False,False,False,False,False,False,False


In [6]:
# Apply preprocessing to the text column
df['processed_text'] = df['Text'].apply(preprocess_text)
df.head(2)

Unnamed: 0,CommentId,VideoId,Text,IsToxic,IsAbusive,IsThreat,IsProvocative,IsObscene,IsHatespeech,IsRacist,IsNationalist,IsSexist,IsHomophobic,IsReligiousHate,IsRadicalism,processed_text
0,Ugg2KwwX0V8-aXgCoAEC,04kJtp6pVXI,If only people would just take a step back and...,False,False,False,False,False,False,False,False,False,False,False,False,people step case not people situation lump mes...
1,Ugg2s5AzSPioEXgCoAEC,04kJtp6pVXI,Law enforcement is not trained to shoot to app...,True,True,False,False,False,False,False,False,False,False,False,False,law enforcement train shoot apprehend train sh...


In [9]:
# Select only the "IsToxic" and "processed_text" columns
df_filtered = df[['IsToxic', 'processed_text']]

# Show the filtered DataFrame
print(df_filtered.head())

   IsToxic                                     processed_text
0    False  people step case not people situation lump mes...
1     True  law enforcement train shoot apprehend train sh...
2     True  not reckon black life matter banner hold white...
3    False  large number people like police officer call c...
4    False  arab dude absolutely right shoot extra time sh...


In [10]:
# Save the filtered DataFrame (label plus processed text) to CSV
df_filtered.to_csv('../data/youtube_procesado.csv', index=False)
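
Before the vectorization step, it may help to see what a TF-IDF weight is made of. The sketch below computes the weights by hand with the standard library only. It follows the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and the L2 row normalization that scikit-learn documents as the `TfidfVectorizer` defaults. The `tfidf` helper and the two sample documents are assumptions made for illustration, and the whitespace tokenization here is simpler than scikit-learn's default token pattern (which, for instance, ignores single-character tokens).

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) of smoothed-IDF, L2-normalized TF-IDF weights."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        # Raw weight: term count times IDF
        raw = [counts[t] * idf[t] for t in vocab]
        # Scale each row to unit Euclidean length (guard against empty documents)
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows

docs = ["law enforcement train shoot", "people step case people"]
vocab, rows = tfidf(docs)
for term, weight in zip(vocab, rows[1]):
    print(f"{term:12s} {weight:.3f}")
```

In the second document `people` appears twice and every term occurs in exactly one document, so `people` ends up with twice the weight of `step` and `case`, and terms absent from the document get zero.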

In [8]:
# Vectorize with TF-IDF
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(df['processed_text'])

# Convert the TF-IDF matrix to a DataFrame for inspection
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())

# Show a few rows of the original DataFrame and of the TF-IDF matrix
print(df[['Text', 'processed_text']].head())
print("\nTF-IDF matrix (first 5 rows, first 10 columns):")
print(tfidf_df.iloc[:5, :10])

                                                Text  \
0  If only people would just take a step back and...   
1  Law enforcement is not trained to shoot to app...   
2  \nDont you reckon them 'black lives matter' ba...   
3  There are a very large number of people who do...   
4  The Arab dude is absolutely right, he should h...   

                                      processed_text  
0  people step case not people situation lump mes...  
1  law enforcement train shoot apprehend train sh...  
2  not reckon black life matter banner hold white...  
3  large number people like police officer call c...  
4  arab dude absolutely right shoot extra time sh...  

TF-IDF matrix (first 5 rows, first 10 columns):
   aaannnyythe  ability  able  aboutdemocrat  absolute  absolutely  absurd  \
0          0.0      0.0   0.0            0.0       0.0    0.000000     0.0   
1          0.0      0.0   0.0            0.0       0.0    0.000000     0.0   
2          0.0      0.0   0.0            0