# Exercise 3: Preprocessing
## Objective of the exercise
1. Understand and apply normalization, tokenization, stopword removal, stemming, and n-grams.
2. Measure the impact of each step on the vocabulary and the token count.

In [34]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


# Load the corpus
df = pd.read_csv("/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv", encoding='utf-8')
df

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


In [35]:
# Take the first review in the dataset and store it in doc
doc = df.iloc[0]['review']
doc

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fa

In [36]:
# Split the text into simple tokens separated by whitespace
# doc.split()
# This only splits the text on whitespace; punctuation stays attached to the words
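
As a small illustration of why a plain whitespace split is not enough, the sketch below applies `split()` to a fragment of the first review. The variable name `fragment` is only for this example.

```python
# Fragment copied from the first review in the dataset
fragment = "you'll be hooked. They are right,"

# str.split() with no arguments splits on runs of whitespace only,
# so punctuation remains glued to the neighbouring word
naive_tokens = fragment.split()
print(naive_tokens)
```

Tokens such as `hooked.` and `right,` would be counted as different terms from `hooked` and `right`, which is what the cleaning and tokenization steps address.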

In [37]:
import re
# Remove the HTML tags that are very common in this dataset with a regex, strip specific punctuation, and split into tokens again
#doc_=re.sub(pattern='<.*?>', repl='', string=doc).replace('.', '').replace(',', '').replace('(', '').replace(')', '').replace('""', '')
#doc_.split()


# Reusable function to clean any text; it does not tokenize
def clean_text(doc):
    return re.sub(pattern='<.*?>', repl='', string=doc).replace('.', '').replace(',', '').replace('(', '').replace(')', '').replace('""', '')

In [38]:
# Apply clean_text to the data
# This prepares the full corpus for tokenization, stopword removal, stemming/lemmatization, inverted-index construction, and TF-IDF
df['cleaned_review'] = df['review'].apply(clean_text)
print(df['cleaned_review'].head(10))

0    One of the other reviewers has mentioned that ...
1    A wonderful little production The filming tech...
2    I thought this was a wonderful way to spend ti...
3    Basically there's a family where a little boy ...
4    Petter Mattei's "Love in the Time of Money" is...
5    Probably my all-time favorite movie a story of...
6    I sure would like to see a resurrection of a u...
7    This show was an amazing fresh & innovative id...
8    Encouraged by the positive comments about this...
9    If you like original gut wrenching laughter yo...
Name: cleaned_review, dtype: object
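
One side effect worth checking: `clean_text` deletes tags and punctuation outright, so a span such as `me.<br /><br />The` collapses into the single string `meThe`. The sketch below is a hypothetical variant (the name `clean_text_spaced` is not part of the notebook) that substitutes a space instead and then collapses repeated whitespace. It leaves out the `'""'` replacement, and it is not the function used to produce the results in this notebook.

```python
import re

def clean_text_spaced(doc):
    # Hypothetical variant of clean_text: replace each HTML tag with a space
    no_tags = re.sub(r'<.*?>', ' ', doc)
    # Replace the same punctuation marks (. , ( )) with a space instead of deleting them
    no_punct = re.sub(r'[.,()]', ' ', no_tags)
    # Collapse the runs of whitespace created above
    return re.sub(r'\s+', ' ', no_punct).strip()

sample = "what happened with me.<br /><br />The first thing"
print(clean_text_spaced(sample))
```

With this variant, `me` and `The` stay separate tokens. Switching to it would change the vocabulary and token counts reported later, so the numbers would need to be recomputed.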


In [39]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [40]:
from nltk.stem import porter
stemmer = porter.PorterStemmer()

## Normalization, tokenization, and stopword removal




In [41]:
# Build the stopword set once instead of on every call
stop_words = set(stopwords.words('english'))

def tokenize_and_remove_stopwords(text):
    # Simple normalization 1: lowercase, then tokenize
    tokens = word_tokenize(text.lower())

    # Simple normalization 2: drop stopwords and keep only alphabetic tokens
    filtered_tokens = [token for token in tokens if token not in stop_words and token.isalpha()]

    return filtered_tokens

# Apply tokenization and stopword removal
df['tokens'] = df['cleaned_review'].apply(tokenize_and_remove_stopwords)

print("Tokens after stopword removal - first 2 reviews:")
print(df['tokens'].head(2))

Tokens after stopword removal - first 2 reviews:
0    [one, reviewers, mentioned, watching, oz, epis...
1    [wonderful, little, production, filming, techn...
Name: tokens, dtype: object


## Stemming

In [42]:
def apply_stemming(tokens):
    return [stemmer.stem(token) for token in tokens]

# Apply stemming
df['stemmed_tokens'] = df['tokens'].apply(apply_stemming)

print("Tokens after stemming - first 2 reviews:")
print(df['stemmed_tokens'].head(2))

Tokens after stemming - first 2 reviews:
0    [one, review, mention, watch, oz, episod, hook...
1    [wonder, littl, product, film, techniqu, fashi...
Name: stemmed_tokens, dtype: object
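
The objective mentions n-grams, which the cells in this notebook do not build. A minimal sketch of one way to derive them from a stemmed token list follows; `make_ngrams` is a hypothetical helper, and the sample list is the start of the first review's stems shown above.

```python
def make_ngrams(tokens, n=2):
    # Slide a window of size n over the token list and emit each window as a tuple
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# First stems of review 0, copied from the output above
stems = ['one', 'review', 'mention', 'watch', 'oz', 'episod']

bigrams = make_ngrams(stems, 2)
trigrams = make_ngrams(stems, 3)
print(bigrams)
print(trigrams)
```

A list shorter than `n` yields an empty result. On the full corpus this could be applied with `df['stemmed_tokens'].apply(make_ngrams)`, assuming the column exists as built above.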


## Inverted index

In [43]:
def build_inverted_index(df):
    inverted_index = {}
    
    for doc_id, tokens in enumerate(df['stemmed_tokens']):
        for token in set(tokens):  # Use a set so each document is listed at most once per term
            if token not in inverted_index:
                inverted_index[token] = []
            inverted_index[token].append(doc_id)
    
    return inverted_index

# Build the inverted index
inverted_index = build_inverted_index(df)

print("Inverted index - first 5 terms:")
for term, doc_ids in list(inverted_index.items())[:5]:
    print(f"{term}: appears in {len(doc_ids)} documents")

Inverted index - first 5 terms:
well: appears in 13595 documents
emerald: appears in 13 documents
classic: appears in 3526 documents
one: appears in 28148 documents
home: appears in 3023 documents
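
To show what the index is for, here is a minimal sketch of a boolean AND query over a structure of the same shape (term mapped to a list of document ids). The helper `search_and` and the contents of `toy_index` are made up for illustration and are not taken from the dataset.

```python
def search_and(index, terms):
    # Look up the posting list of each query term; unknown terms give an empty set
    postings = [set(index.get(term, [])) for term in terms]
    if not postings:
        return []
    # Documents that contain every query term
    return sorted(set.intersection(*postings))

# Toy index with invented document ids, same shape as inverted_index
toy_index = {'oz': [0, 7], 'prison': [0, 3, 7], 'classic': [3]}

print(search_and(toy_index, ['oz', 'prison']))
print(search_and(toy_index, ['oz', 'classic']))
```

Against the real `inverted_index`, query terms would need the same preprocessing as the documents (lowercasing and Porter stemming), since the keys are stems such as `episod` rather than surface words.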


In [44]:
# Show the text before and after the full pipeline
print("=== Text comparison ===")
print("\n1. Original text:")
print(df.iloc[0]['review'])

# print("\n2. AFTER clean_text():")
# print(df.iloc[0]['cleaned_review'])

# print("\n3. AFTER tokenize_and_remove_stopwords():")
# print(df.iloc[0]['tokens'])

print("\n4. Final text after preprocessing:")
print(df.iloc[0]['stemmed_tokens'])

=== Text comparison ===

1. Original text:
One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would 

In [45]:
def contar_vocabulario_original(df):
    # Unique terms before preprocessing
    original_terms = set()
    for review in df['review']:
        tokens = word_tokenize(review.lower())
        original_terms.update(tokens)
    return len(original_terms)

def contar_vocabulario_final(df):
    # Unique terms after preprocessing
    stemmed_terms = set()
    for tokens_list in df['stemmed_tokens']:
        stemmed_terms.update(tokens_list)
    return len(stemmed_terms)

In [46]:
# Apply the functions
vocab_original = contar_vocabulario_original(df)
vocab_final = contar_vocabulario_final(df)

print("=== VOCABULARY COMPARISON ===")
print(f"Unique terms before preprocessing: {vocab_original:,}")
print(f"Unique terms after all preprocessing: {vocab_final:,}")

=== VOCABULARY COMPARISON ===
Unique terms before preprocessing: 164,024
Unique terms after all preprocessing: 134,212


In [47]:
def contar_todos_terminos_original(df):
    # Count every token in the original text, repetitions included
    total_tokens = 0
    for review in df['review']:
        tokens = word_tokenize(review.lower())
        total_tokens += len(tokens)
    return total_tokens

def contar_todos_terminos_final(df):
    # Count every token after preprocessing, repetitions included
    total_tokens = 0
    for tokens_list in df['stemmed_tokens']:
        total_tokens += len(tokens_list)
    return total_tokens

In [48]:
# Apply the functions
todos_original = contar_todos_terminos_original(df)
todos_final = contar_todos_terminos_final(df)

print("=== TOTAL TOKEN COUNT ===")
print(f"Total tokens before preprocessing: {todos_original:,}")
print(f"Total tokens after preprocessing: {todos_final:,}")

=== TOTAL TOKEN COUNT ===
Total tokens before preprocessing: 13,970,596
Total tokens after preprocessing: 5,715,633
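
To make the impact easier to compare, the counts reported above can be turned into relative reductions. This is a small sketch; `pct_reduction` is a hypothetical helper, and the four numbers are copied from the two outputs above.

```python
def pct_reduction(before, after):
    # Relative reduction, as a percentage of the starting value
    return 100 * (before - after) / before

# Values copied from the notebook outputs
vocab_reduction = pct_reduction(164_024, 134_212)
token_reduction = pct_reduction(13_970_596, 5_715_633)

print(f"Vocabulary reduction: {vocab_reduction:.1f}%")
print(f"Token reduction: {token_reduction:.1f}%")
```

The vocabulary shrinks by about 18%, while the total token count drops by about 59%. The larger drop in tokens is consistent with stopwords and punctuation being few in distinct types but very frequent in running text.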
