# **Exercise 3:** Preprocessing

 **Name:** Aaròn Yumancela
 
 **Objectives:**
* Understand and apply normalization, tokenization, stopword removal, stemming, and n-grams.
* Measure the impact of each step on the vocabulary size and the token count.

In [1]:
# Exercise 3: Preprocessing

import kagglehub
import os
import pandas as pd
import re

# 0. Load the corpus (IMDB)


print("Downloading dataset...")
try:
    path = kagglehub.dataset_download("lakshmi25npathi/imdb-dataset-of-50k-movie-reviews")
    print(f"Files downloaded to: {path}")
except Exception:
    # Typical fallback path when running on Kaggle
    path = '/kaggle/input/imdb-dataset-of-50k-movie-reviews'
    print(f"Using fallback path: {path}")

# Path to the CSV file
ruta_archivo_csv = os.path.join(path, 'IMDB Dataset.csv')

# Load the DataFrame
df = pd.read_csv(ruta_archivo_csv, encoding='utf-8')
print(f"Dataset loaded successfully. Total reviews: {len(df)}")

# Preview the DataFrame (a notebook renders only the last expression of a cell,
# so this line produces no visible output here)
df

# Take the first document of the corpus
doc = df.iloc[0]['review']
doc


Downloading dataset...
Files downloaded to: /kaggle/input/imdb-dataset-of-50k-movie-reviews
Dataset loaded successfully. Total reviews: 50000


"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fa

In [2]:
# Quick replacement (first attempt), kept only for comparison with clean_text below
quick_clean = doc.replace('<br />', '').replace('>', '').replace('/', '')

def clean_text(doc):
    # Strip HTML tags (such as <br />, <i>, etc.)
    doc = re.sub(pattern=r'<.*?>', repl='', string=doc)
    # Remove a few specific characters. Periods become spaces so that adjacent
    # sentences do not fuse; apostrophes are deleted, so "you'll" becomes "youll".
    doc = doc.replace('.', ' ') \
             .replace(',', '') \
             .replace('(', '') \
             .replace(')', '') \
             .replace('"', '') \
             .replace("'", '') \
             .replace("\x08", '')  # backspace control character
    return doc

# Apply to the whole corpus and STORE the result in a new column
df['review_clean'] = df['review'].apply(clean_text)

# Look at the first document again, now cleaned
doc = df.iloc[0]['review_clean']
doc

# Quick whitespace tokenization (just to check the result)
doc.split()


['One',
 'of',
 'the',
 'other',
 'reviewers',
 'has',
 'mentioned',
 'that',
 'after',
 'watching',
 'just',
 '1',
 'Oz',
 'episode',
 'youll',
 'be',
 'hooked',
 'They',
 'are',
 'right',
 'as',
 'this',
 'is',
 'exactly',
 'what',
 'happened',
 'with',
 'me',
 'The',
 'first',
 'thing',
 'that',
 'struck',
 'me',
 'about',
 'Oz',
 'was',
 'its',
 'brutality',
 'and',
 'unflinching',
 'scenes',
 'of',
 'violence',
 'which',
 'set',
 'in',
 'right',
 'from',
 'the',
 'word',
 'GO',
 'Trust',
 'me',
 'this',
 'is',
 'not',
 'a',
 'show',
 'for',
 'the',
 'faint',
 'hearted',
 'or',
 'timid',
 'This',
 'show',
 'pulls',
 'no',
 'punches',
 'with',
 'regards',
 'to',
 'drugs',
 'sex',
 'or',
 'violence',
 'Its',
 'is',
 'hardcore',
 'in',
 'the',
 'classic',
 'use',
 'of',
 'the',
 'word',
 'It',
 'is',
 'called',
 'OZ',
 'as',
 'that',
 'is',
 'the',
 'nickname',
 'given',
 'to',
 'the',
 'Oswald',
 'Maximum',
 'Security',
 'State',
 'Penitentary',
 'It',
 'focuses',
 'mainly',
 'on',
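The cleaning step can be tried on a short string without loading the dataset. The following is a minimal sketch that mirrors `clean_text`; the `sample` sentence is made up for illustration and is not taken from the corpus. It also makes one side effect visible: deleting apostrophes turns contractions into new tokens such as `Youll`.

```python
import re

def clean_text(doc):
    # Strip HTML tags with a non-greedy pattern, as in the notebook cell.
    doc = re.sub(r'<.*?>', '', doc)
    # Periods become spaces so that sentences around a removed tag stay apart.
    doc = doc.replace('.', ' ')
    # The remaining characters are simply deleted.
    for mark in (',', '(', ')', '"', "'", '\x08'):
        doc = doc.replace(mark, '')
    return doc

# Hypothetical sample text (an assumption, not part of the IMDB data).
sample = "You'll be hooked.<br /><br />It is <i>hardcore</i>, really."
tokens = clean_text(sample).split()
print(tokens)
```

Because the apostrophe is deleted rather than replaced by a space, `You'll` survives as the single token `Youll` instead of being split.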
 

In [3]:
# 2. Normalization: lowercase and letters only (a-z and spaces)

def normalize_text(doc):
    # convert to lowercase
    doc = doc.lower()
    # replace everything that is NOT an ASCII letter or whitespace with a space
    # (this also drops digits and accented characters)
    doc = re.sub(r'[^a-z\s]', ' ', doc)
    # collapse repeated whitespace and strip both ends
    doc = re.sub(r'\s+', ' ', doc).strip()
    return doc

# Apply to the already-cleaned column
df['review_norm'] = df['review_clean'].apply(normalize_text)

# Inspect the first normalized document
df['review_norm'].iloc[0]


'one of the other reviewers has mentioned that after watching just oz episode youll be hooked they are right as this is exactly what happened with me the first thing that struck me about oz was its brutality and unflinching scenes of violence which set in right from the word go trust me this is not a show for the faint hearted or timid this show pulls no punches with regards to drugs sex or violence its is hardcore in the classic use of the word it is called oz as that is the nickname given to the oswald maximum security state penitentary it focuses mainly on emerald city an experimental section of the prison where all the cells have glass fronts and face inwards so privacy is not high on the agenda em city is home to many aryans muslims gangstas latinos christians italians irish and more so scuffles death stares dodgy dealings and shady agreements are never far away i would say the main appeal of the show is due to the fact that it goes where other shows wouldnt dare forget pretty pic
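The output above no longer contains the digit from "just 1 Oz episode", because the pattern `[^a-z\s]` replaces every character outside `a-z` and whitespace. The sketch below reproduces that on a fragment of the first review and on a second, made-up string (`accented` is a hypothetical example) to show that accented letters are treated the same way.

```python
import re

def normalize_text(doc):
    # Lowercase, keep only a-z and whitespace, then collapse repeated spaces.
    doc = doc.lower()
    doc = re.sub(r'[^a-z\s]', ' ', doc)
    return re.sub(r'\s+', ' ', doc).strip()

# Fragment of the first review after clean_text.
fragment = "after watching just 1 Oz episode youll be hooked"
print(normalize_text(fragment))

# Hypothetical string with an accented letter, a hyphen, and a year.
accented = "A cliché-ridden film from 1999!"
print(normalize_text(accented))
```

The accented `é` is replaced by a space, so `cliché` is cut down to `clich`. If digits or non-ASCII letters matter for a task, the character class would need to be widened.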

In [7]:
# 3. Stopwords
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# On Kaggle these resources are usually already available; elsewhere, uncomment:
# nltk.download('stopwords')
# nltk.download('punkt')   # recent NLTK releases may ask for 'punkt_tab' instead

stop_words = set(stopwords.words('english'))

def remove_stopwords(doc):
    tokens = word_tokenize(doc)
    tokens = [t for t in tokens if t not in stop_words]
    return ' '.join(tokens)

df['review_no_stop'] = df['review_norm'].apply(remove_stopwords)
df['review_no_stop'].iloc[0]


'one reviewers mentioned watching oz episode youll hooked right exactly happened first thing struck oz brutality unflinching scenes violence set right word go trust show faint hearted timid show pulls punches regards drugs sex violence hardcore classic use word called oz nickname given oswald maximum security state penitentary focuses mainly emerald city experimental section prison cells glass fronts face inwards privacy high agenda em city home many aryans muslims gangstas latinos christians italians irish scuffles death stares dodgy dealings shady agreements never far away would say main appeal show due fact goes shows wouldnt dare forget pretty pictures painted mainstream audiences forget charm forget romance oz doesnt mess around first episode ever saw struck nasty surreal couldnt say ready watched developed taste oz got accustomed high levels graphic violence violence injustice crooked guards wholl sold nickel inmates wholl kill order get away well mannered middle class inmates tu
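Tokens such as `youll`, `wouldnt` and `doesnt` are still present in the output above. A likely cause is that `clean_text` deleted the apostrophes earlier, so these tokens no longer match stopword entries written with an apostrophe (for example `you'll`). The sketch below illustrates the mismatch and one possible workaround. `STOP_WORDS` is a small hand-written stand-in for the NLTK list, and `with_apostrophe_free_variants` is a hypothetical helper, not part of NLTK.

```python
# Hypothetical miniature stopword list; the real NLTK English list is much longer.
STOP_WORDS = {"you'll", "wouldn't", "doesn't", "be", "the", "it"}

def remove_stopwords(tokens, stop_words):
    # Keep only the tokens that are not in the stopword set.
    return [t for t in tokens if t not in stop_words]

def with_apostrophe_free_variants(stop_words):
    # Add a copy of every stopword with apostrophes removed ("you'll" -> "youll").
    return stop_words | {w.replace("'", "") for w in stop_words}

tokens = "youll be hooked it doesnt mess around".split()
before = remove_stopwords(tokens, STOP_WORDS)
after = remove_stopwords(tokens, with_apostrophe_free_variants(STOP_WORDS))
print(before)
print(after)
```

An alternative is to remove stopwords before deleting apostrophes. Either choice is a trade-off: an apostrophe-free variant can coincide with an ordinary word (stripping the apostrophe from `won't` gives `wont`), so the extended set should be checked before it is applied to the full corpus.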

In [5]:
# 4. Stemming
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_text(doc):
    # doc has already had its stopwords removed
    tokens = doc.split()           # the text is already clean, so split() is enough
    stems = [stemmer.stem(t) for t in tokens]
    return ' '.join(stems)

# Apply to the whole corpus
df['review_stem'] = df['review_no_stop'].apply(stem_text)

# Try it on a single document
doc_ = df['review_no_stop'].iloc[0]
print("Without stemming:\n", doc_[:200])
print("\nWith stemming:\n", stem_text(doc_)[:200])

# Counter-example: stemming the whole string in a single call.
# NOTE: the stemmer treats the entire string as ONE word, so at most its final suffix changes.
x = stemmer.stem(' '.join(doc_.split()))
print("\nStemming the whole string at once (not useful, kept as a counter-example):")
print(x)

doc_


Without stemming:
 one reviewers mentioned watching oz episode youll hooked right exactly happened first thing struck oz brutality unflinching scenes violence set right word go trust show faint hearted timid show pulls 

With stemming:
 one review mention watch oz episod youll hook right exactli happen first thing struck oz brutal unflinch scene violenc set right word go trust show faint heart timid show pull punch regard drug sex vi

Stemming the whole string at once (not useful, kept as a counter-example):
one reviewers mentioned watching oz episode youll hooked right exactly happened first thing struck oz brutality unflinching scenes violence set right word go trust show faint hearted timid show pulls punches regards drugs sex violence hardcore classic use word called oz nickname given oswald maximum security state penitentary focuses mainly emerald city experimental section prison cells glass fronts face inwards privacy high agenda em city home many aryans muslims gangstas latinos

'one reviewers mentioned watching oz episode youll hooked right exactly happened first thing struck oz brutality unflinching scenes violence set right word go trust show faint hearted timid show pulls punches regards drugs sex violence hardcore classic use word called oz nickname given oswald maximum security state penitentary focuses mainly emerald city experimental section prison cells glass fronts face inwards privacy high agenda em city home many aryans muslims gangstas latinos christians italians irish scuffles death stares dodgy dealings shady agreements never far away would say main appeal show due fact goes shows wouldnt dare forget pretty pictures painted mainstream audiences forget charm forget romance oz doesnt mess around first episode ever saw struck nasty surreal couldnt say ready watched developed taste oz got accustomed high levels graphic violence violence injustice crooked guards wholl sold nickel inmates wholl kill order get away well mannered middle class inmates tu
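The objectives list n-grams, but none of the cells in this notebook build them. The following is a minimal sketch of how they could be generated from the stemmed tokens. The `ngrams` helper is written here for illustration (NLTK offers a comparable `nltk.util.ngrams`), and the input is the beginning of the stemmed output shown above.

```python
def ngrams(tokens, n):
    # Slide a window of size n over the token list; each n-gram is a tuple.
    if n < 1:
        raise ValueError("n must be at least 1")
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# First stems of the first review, taken from the stemming output.
stems = "one review mention watch oz episod".split()

bigrams = ngrams(stems, 2)
trigrams = ngrams(stems, 3)
print(bigrams[:3])
print(len(bigrams), len(trigrams))
```

A list of `k` tokens yields `k - n + 1` n-grams, so the bigram and trigram vocabularies are usually far larger and sparser than the unigram vocabulary.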

In [6]:
# 5. Compare vocabulary

# a) ORIGINAL vocabulary (raw text: case-sensitive, with punctuation and HTML remnants)
from nltk.tokenize import word_tokenize

tokens_original = []
for texto in df['review']:
    tokens_original.extend(word_tokenize(texto))

vocab_original = set(tokens_original)

print("ORIGINAL vocabulary size:", len(vocab_original))
print("Total ORIGINAL tokens:", len(tokens_original))

# b) Vocabulary after the full preprocessing pipeline (normalization + stopwords + stemming)
tokens_final = []
for texto in df['review_stem']:
    tokens_final.extend(texto.split())

vocab_final = set(tokens_final)

print("\nPROCESSED vocabulary size:", len(vocab_final))
print("Total PROCESSED tokens:", len(tokens_final))


ORIGINAL vocabulary size: 194756
Total ORIGINAL tokens: 13974186

PROCESSED vocabulary size: 82107
Total PROCESSED tokens: 6027079
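The four counts above can be turned into relative reductions, and the same measurement can be repeated for each intermediate column to see what every step contributes. This is a rough sketch: `corpus_stats` and `reduction` are hypothetical helper names, the toy `stages` dictionary stands in for the DataFrame columns, and only the four totals are taken from the notebook output.

```python
def corpus_stats(docs):
    # Return (total tokens, vocabulary size) for whitespace-tokenized documents.
    total, vocab = 0, set()
    for doc in docs:
        tokens = doc.split()
        total += len(tokens)
        vocab.update(tokens)
    return total, len(vocab)

def reduction(before, after):
    # Percentage by which a count shrank.
    return 100.0 * (1 - after / before)

# Totals reported by the comparison cell.
vocab_drop = reduction(194756, 82107)
token_drop = reduction(13974186, 6027079)
print(f"Vocabulary reduction: {vocab_drop:.1f}%")
print(f"Token reduction: {token_drop:.1f}%")

# Toy stand-in for the per-step columns (assumed data, not from the corpus).
stages = {
    "raw": ["The show the Show"],
    "normalized": ["the show the show"],
}
for name, docs in stages.items():
    print(name, corpus_stats(docs))
```

In the notebook, the dictionary could instead map names to `df['review_clean']`, `df['review_norm']`, `df['review_no_stop']` and `df['review_stem']`. Note that the original counts come from `word_tokenize` on raw text, where punctuation and HTML fragments count as tokens, so part of the measured reduction reflects the tokenizer rather than the preprocessing steps.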
