# Exercise 3: Preprocessing

## Objective of the exercise

1. Understand and apply normalization, tokenization, stopword removal, stemming, and n-grams.
2. Measure the impact of each step on the vocabulary and the tokens.

#### 0. Load the Corpus

We will work with the IMDB Movie Reviews corpus.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv', encoding='utf8')
df

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


#### 1. Cleaning

Remove characters that do not belong in the documents, such as HTML tags.

In [3]:
doc = df.iloc[0]['review']
doc

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fa

In [4]:
doc.replace('<br />', '').replace('>', '').replace('/', '')

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.I would say the main appeal of the show is due to the fact that it goes where other shows wo

In [5]:
import re

In [6]:
def clean_text(doc):
    doc = re.sub(pattern=r'<.*?>', repl='', string=doc)  # strip HTML tags
    doc = doc.replace('.', ' ')  # periods become spaces
    return doc.translate(str.maketrans('', '', ',()"\'\x08'))  # delete the remaining symbols

In [7]:
df['review'].apply(clean_text)

0        One of the other reviewers has mentioned that ...
1        A wonderful little production  The filming tec...
2        I thought this was a wonderful way to spend ti...
3        Basically theres a family where a little boy J...
4        Petter Matteis Love in the Time of Money is a ...
                               ...                        
49995    I thought this movie did a down right good job...
49996    Bad plot bad dialogue bad acting idiotic direc...
49997    I am a Catholic taught in parochial elementary...
49998    Im going to have to disagree with the previous...
49999    No one expects the Star Trek movies to be high...
Name: review, Length: 50000, dtype: object
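
Deleting tags outright, as both `doc.replace('<br />', '')` and the `<.*?>` pattern do, joins the text on either side of the tag: `me.<br /><br />The` becomes `me.The`. A minimal sketch of one alternative, assuming it is acceptable to replace each tag with a space and collapse the leftover whitespace afterwards (the helper name `strip_tags` is hypothetical, not part of the notebook):

```python
import re

def strip_tags(text):
    """Replace HTML tags with a space so neighbouring words stay separated."""
    no_tags = re.sub(r'<[^>]+>', ' ', text)
    # Collapse the runs of spaces left behind by consecutive tags
    return re.sub(r'\s+', ' ', no_tags).strip()

sample = "happened with me.<br /><br />The first thing"
print(strip_tags(sample))
```

With whitespace tokenization this keeps `me.` and `The` as separate tokens instead of the single token `me.The`.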

In [8]:
doc.split()

['One',
 'of',
 'the',
 'other',
 'reviewers',
 'has',
 'mentioned',
 'that',
 'after',
 'watching',
 'just',
 '1',
 'Oz',
 'episode',
 "you'll",
 'be',
 'hooked.',
 'They',
 'are',
 'right,',
 'as',
 'this',
 'is',
 'exactly',
 'what',
 'happened',
 'with',
 'me.<br',
 '/><br',
 '/>The',
 'first',
 'thing',
 'that',
 'struck',
 'me',
 'about',
 'Oz',
 'was',
 'its',
 'brutality',
 'and',
 'unflinching',
 'scenes',
 'of',
 'violence,',
 'which',
 'set',
 'in',
 'right',
 'from',
 'the',
 'word',
 'GO.',
 'Trust',
 'me,',
 'this',
 'is',
 'not',
 'a',
 'show',
 'for',
 'the',
 'faint',
 'hearted',
 'or',
 'timid.',
 'This',
 'show',
 'pulls',
 'no',
 'punches',
 'with',
 'regards',
 'to',
 'drugs,',
 'sex',
 'or',
 'violence.',
 'Its',
 'is',
 'hardcore,',
 'in',
 'the',
 'classic',
 'use',
 'of',
 'the',
 'word.<br',
 '/><br',
 '/>It',
 'is',
 'called',
 'OZ',
 'as',
 'that',
 'is',
 'the',
 'nickname',
 'given',
 'to',
 'the',
 'Oswald',
 'Maximum',
 'Security',
 'State',
 'Penitentar

#### 2. Normalization

Convert all tokens to lowercase.

Remove punctuation and non-alphabetic symbols.

In [9]:
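import re

def clean_and_normalize(doc):
    # Remove HTML tags such as <br />
    doc = re.sub(pattern=r'<.*?>', repl='', string=doc)

    # Lowercase, then replace every non-alphabetic character with a space
    doc = doc.lower()
    doc = re.sub(r'[^a-zA-Z]', ' ', doc)
    # Collapse runs of whitespace into a single space
    doc = re.sub(r'\s+', ' ', doc).strip()

    return doc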

In [10]:
df['clean_review'] = df['review'].apply(clean_and_normalize)

print("Original review example")
print(df.iloc[1]['review'])
print("\n Example after cleaning and normalization")
print(df.iloc[1]['clean_review'])

Original review example
A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master's of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional 'dream' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell's murals decorating every surface
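
Because every non-letter is replaced with a space, possessives and contractions in the review above are split into two tokens: `master's` becomes `master s`. A minimal sketch, assuming one prefers to delete apostrophes before the general substitution and that HTML tags have already been removed (the function name `normalize_drop_apostrophes` is hypothetical):

```python
import re

def normalize_drop_apostrophes(text):
    """Lowercase, drop apostrophes, then replace other non-letters with spaces."""
    text = text.lower()
    text = text.replace("'", "")          # "master's" -> "masters"
    text = re.sub(r'[^a-z]', ' ', text)   # remaining non-letters become spaces
    return re.sub(r'\s+', ' ', text).strip()

print(normalize_drop_apostrophes("One of the great master's of comedy!"))
```

Whether `masters` or `master s` is preferable depends on the later steps; the stray `s` is a one-letter token that some vectorizers discard anyway.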

#### 3. Stopword Removal

Remove stopwords using a standard list from the _nltk_ library.

In [11]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [13]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize


stop_words = set(stopwords.words('english'))

def remove_stopwords(doc):
    tokens = word_tokenize(doc)
    filtered_tokens = [word for word in tokens if word not in stop_words]
    return ' '.join(filtered_tokens)

In [14]:
df['stopwords_removed'] = df['clean_review'].apply(remove_stopwords)

print("Clean review")
print(df.iloc[1]['clean_review'])
print("\n Without stopwords")
print(df.iloc[1]['stopwords_removed'])

Clean review
a wonderful little production the filming technique is very unassuming very old time bbc fashion and gives a comforting and sometimes discomforting sense of realism to the entire piece the actors are extremely well chosen michael sheen not only has got all the polari but he has all the voices down pat too you can truly see the seamless editing guided by the references to williams diary entries not only is it well worth the watching but it is a terrificly written and performed piece a masterful production about one of the great master s of comedy and his life the realism really comes home with the little things the fantasy of the guard which rather than use the traditional dream techniques remains solid then disappears it plays on our knowledge and our senses particularly with the scenes concerning orton and halliwell and the sets particularly of their flat with halliwell s murals decorating every surface are terribly well done

 Without stopwords
wonderful little production f
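
NLTK's English stopword list includes negations such as `not` and `no`, so a phrase like `not good` is reduced to `good`, which can matter for sentiment data like this corpus. A minimal sketch of keeping negations, assuming a small hand-written stopword set in place of `stopwords.words('english')` and whitespace splitting in place of `word_tokenize` so that the example runs without downloads (`TOY_STOPWORDS`, `NEGATIONS`, and `remove_stopwords_keep_negations` are hypothetical names):

```python
# Toy stand-in for stopwords.words('english'); the real list is much longer
TOY_STOPWORDS = {"the", "is", "a", "this", "was", "not", "no", "and", "it"}
NEGATIONS = {"not", "no"}

def remove_stopwords_keep_negations(doc, stop_words=TOY_STOPWORDS, keep=NEGATIONS):
    """Drop stopwords from a normalized document, except the words in `keep`."""
    effective = stop_words - keep
    return ' '.join(tok for tok in doc.split() if tok not in effective)

# Negation preserved
print(remove_stopwords_keep_negations("this movie was not good"))
# Same call with nothing protected, mirroring plain stopword removal
print(remove_stopwords_keep_negations("this movie was not good", keep=set()))
```

On the real data the same idea would apply to the notebook's `stop_words` set, for example by subtracting the negation words from it before filtering.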

#### 4. Stemming 

Reduce the vocabulary by mapping inflected word forms to a common stem.

In [26]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()

def apply_stemming(doc):
    tokens = word_tokenize(doc)
    
    stemmed_tokens = [stemmer.stem(word) for word in tokens]
    
    return ' '.join(stemmed_tokens)

In [27]:
df['processed_review'] = df['stopwords_removed'].apply(apply_stemming)

print("No stopwords")
print(df.iloc[1]['stopwords_removed'])
print("\n Stemming")
print(df.iloc[1]['processed_review'])

No stopwords
wonderful little production filming technique unassuming old time bbc fashion gives comforting sometimes discomforting sense realism entire piece actors extremely well chosen michael sheen got polari voices pat truly see seamless editing guided references williams diary entries well worth watching terrificly written performed piece masterful production one great master comedy life realism really comes home little things fantasy guard rather use traditional dream techniques remains solid disappears plays knowledge senses particularly scenes concerning orton halliwell sets particularly flat halliwell murals decorating every surface terribly well done

 Stemming
wonder littl product film techniqu unassum old time bbc fashion give comfort sometim discomfort sens realism entir piec actor extrem well chosen michael sheen got polari voic pat truli see seamless edit guid refer william diari entri well worth watch terrificli written perform piec master product one great master come
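
The two outputs above can be compared token by token to see which distinct words collapse onto the same stem. A minimal sketch using a nine-token excerpt copied from those outputs (the variable names are illustrative, not part of the notebook):

```python
from collections import defaultdict

# Excerpt copied from the "No stopwords" and "Stemming" outputs above
tokens = "terrificly written performed piece masterful production one great master".split()
stems = "terrificli written perform piec master product one great master".split()

# Group the surface forms that share a stem
forms_by_stem = defaultdict(set)
for token, stem in zip(tokens, stems):
    forms_by_stem[stem].add(token)

# Keep only the stems that absorbed more than one distinct word
merged = {stem: sorted(forms) for stem, forms in forms_by_stem.items() if len(forms) > 1}
print(len(set(tokens)), "distinct tokens ->", len(set(stems)), "distinct stems")
print(merged)
```

In this excerpt only `masterful` and `master` merge; across 50,000 reviews such merges accumulate into the vocabulary reduction measured in the next section.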

#### 5. Verify the Difference

Compare the size of the corpus term dictionary before and after preprocessing.
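
The same comparison can be made after every step rather than only at the end, counting both total tokens and distinct terms. A minimal sketch in plain Python, assuming whitespace tokenization and a toy list of documents per stage (`corpus_stats` and the sample data are hypothetical; on the real data each list would be one DataFrame column):

```python
def corpus_stats(docs):
    """Return (total tokens, vocabulary size) for whitespace-tokenized documents."""
    total_tokens = 0
    vocabulary = set()
    for doc in docs:
        tokens = doc.split()
        total_tokens += len(tokens)
        vocabulary.update(tokens)
    return total_tokens, len(vocabulary)

# Toy stand-ins for the columns produced by each step
stages = {
    "clean_review": ["the actors are well chosen", "the acting is terrible"],
    "stopwords_removed": ["actors well chosen", "acting terrible"],
    "processed_review": ["actor well chosen", "act terribl"],
}

for name, docs in stages.items():
    n_tokens, n_terms = corpus_stats(docs)
    print(f"{name}: {n_tokens} tokens, {n_terms} distinct terms")
```

Counts from this helper will not match `CountVectorizer` exactly: with its default settings the vectorizer lowercases the text and ignores single-character tokens.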

In [28]:
from sklearn.feature_extraction.text import CountVectorizer

In [29]:
vectorizer_original = CountVectorizer()
vectorizer_processed = CountVectorizer()

vectorizer_original.fit_transform(df['review'])
original_vocab = vectorizer_original.vocabulary_
original_vocab_size = len(original_vocab)

In [30]:
print(f"Vocabulary size before preprocessing: {original_vocab_size} terms")

Vocabulary size before preprocessing: 101895 terms


In [31]:
vectorizer_processed.fit_transform(df['processed_review'])
processed_vocab = vectorizer_processed.vocabulary_
processed_vocab_size = len(processed_vocab)

In [32]:
print(f"Vocabulary size after preprocessing: {processed_vocab_size} terms")

Vocabulary size after preprocessing: 70938 terms


In [39]:
reduction = original_vocab_size - processed_vocab_size
reduction_percent = (reduction / original_vocab_size) * 100

In [38]:
print(f"\nReduction: {reduction} terms ({reduction_percent:.2f}% fewer)")


Reduction: 30957 terms (30.38% fewer)
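
N-grams, listed in the objectives, can be built from the processed tokens by sliding a window of length n over each document. A minimal sketch, assuming whitespace-tokenized input (the helper name `ngrams` is hypothetical):

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# First five stems of the processed review shown earlier
tokens = "wonder littl product film techniqu".split()
print(ngrams(tokens, 2))
```

With scikit-learn, `CountVectorizer(ngram_range=(1, 2))` counts unigrams and bigrams together, which usually makes the vocabulary considerably larger than the unigram-only figure.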
