# Stiven Saldaña

# Exercise 3: Preprocessing
## Objective
1. Understand and apply normalization, tokenization, stopword removal, stemming, and n-grams.
2. Measure the impact of each step on the vocabulary and the tokens.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv


# 0. Load the Corpus
We will work with the IMDB Movie Reviews corpus.

In [105]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from nltk.stem import PorterStemmer
import re

In [96]:
df = pd.read_csv('/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv', encoding='utf-8')
df

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


### Preprocessing cleans the data and improves search results

### Step 1: Clean the documents

In [38]:
# Remove all HTML tags (such as <br />), replacing each one with a space
def clean_text(doc):
    return re.sub(pattern=r'<.*?>', repl=' ', string=doc)
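
The pattern `<.*?>` relies on a non-greedy quantifier, so each tag is matched on its own. A minimal sketch of the difference from a greedy pattern, using made-up sample strings (not taken from the corpus):

```python
import re

# Hypothetical sample strings, for illustration only.
sample = "Great film.<br /><br />Loved it."

# Non-greedy: each tag is matched separately and replaced by one space.
non_greedy = re.sub(r'<.*?>', ' ', sample)

# Greedy: a single match runs from the first '<' to the last '>',
# so the text between two tags is swallowed as well.
greedy = re.sub(r'<.*>', ' ', "<b>bold</b> text")

print(repr(non_greedy))  # 'Great film.  Loved it.'
print(repr(greedy))      # '  text'
```

Each tag becomes a space rather than an empty string, so the words on either side of a `<br />` do not fuse into one token.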

In [39]:
# Take the first document in the corpus
doc = df.iloc[0]['review']
doc

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fa

In [40]:
# Apply the clean_text function to every review
# (the result is only displayed here, not stored)
df['review'].apply(clean_text)

0        One of the other reviewers has mentioned that ...
1        A wonderful little production.   The filming t...
2        I thought this was a wonderful way to spend ti...
3        Basically there's a family where a little boy ...
4        Petter Mattei's "Love in the Time of Money" is...
                               ...                        
49995    I thought this movie did a down right good job...
49996    Bad plot, bad dialogue, bad acting, idiotic di...
49997    I am a Catholic taught in parochial elementary...
49998    I'm going to have to disagree with the previou...
49999    No one expects the Star Trek movies to be high...
Name: review, Length: 50000, dtype: object

In [61]:
# Apply clean_text, then split on whitespace
doc_limpio = clean_text(doc)
doc_limpio.split()

['One',
 'of',
 'the',
 'other',
 'reviewers',
 'has',
 'mentioned',
 'that',
 'after',
 'watching',
 'just',
 '1',
 'Oz',
 'episode',
 "you'll",
 'be',
 'hooked.',
 'They',
 'are',
 'right,',
 'as',
 'this',
 'is',
 'exactly',
 'what',
 'happened',
 'with',
 'me.',
 'The',
 'first',
 'thing',
 'that',
 'struck',
 'me',
 'about',
 'Oz',
 'was',
 'its',
 'brutality',
 'and',
 'unflinching',
 'scenes',
 'of',
 'violence,',
 'which',
 'set',
 'in',
 'right',
 'from',
 'the',
 'word',
 'GO.',
 'Trust',
 'me,',
 'this',
 'is',
 'not',
 'a',
 'show',
 'for',
 'the',
 'faint',
 'hearted',
 'or',
 'timid.',
 'This',
 'show',
 'pulls',
 'no',
 'punches',
 'with',
 'regards',
 'to',
 'drugs,',
 'sex',
 'or',
 'violence.',
 'Its',
 'is',
 'hardcore,',
 'in',
 'the',
 'classic',
 'use',
 'of',
 'the',
 'word.',
 'It',
 'is',
 'called',
 'OZ',
 'as',
 'that',
 'is',
 'the',
 'nickname',
 'given',
 'to',
 'the',
 'Oswald',
 'Maximum',
 'Security',
 'State',
 'Penitentary.',
 'It',
 'focuses',
 'main
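
Splitting on whitespace leaves punctuation and case attached to each token, as with `'hooked.'` and `'GO.'` in the output above, so the same word can be counted as several vocabulary entries. A small sketch of the effect on a made-up sentence, comparing `str.split` with a letters-only regular expression (an illustrative alternative, not the tokenizer used later in this notebook):

```python
import re

# Hypothetical sentence, for illustration only.
text = "Oz was hooked. Hooked, hooked!"

# Whitespace tokenization keeps punctuation and case.
whitespace_tokens = text.split()

# Letters-only tokenization on the lowercased text.
regex_tokens = re.findall(r"[a-z]+", text.lower())

print(whitespace_tokens, len(set(whitespace_tokens)))
print(regex_tokens, len(set(regex_tokens)))
```

The whitespace version has five distinct tokens and the letters-only version has three, which is why the vocabulary shrinks once the text is normalized.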

# 2. Normalization
## Convert all tokens to lowercase.

Remove punctuation and non-alphabetic symbols.

In [42]:
# Normalization: replace digits, punctuation, and symbols with spaces, lowercase, and collapse whitespace
def normalizar_texto(doc_limpio): 
    doc_limpio = re.sub(pattern=r'[^a-zA-Z\s]', repl=' ', string=doc_limpio)
    doc_minusculas = doc_limpio.lower()
    doc_final = re.sub(r'\s+', ' ', doc_minusculas).strip()
    return doc_final

In [43]:
print(normalizar_texto(doc_limpio))

one of the other reviewers has mentioned that after watching just oz episode you ll be hooked they are right as this is exactly what happened with me the first thing that struck me about oz was its brutality and unflinching scenes of violence which set in right from the word go trust me this is not a show for the faint hearted or timid this show pulls no punches with regards to drugs sex or violence its is hardcore in the classic use of the word it is called oz as that is the nickname given to the oswald maximum security state penitentary it focuses mainly on emerald city an experimental section of the prison where all the cells have glass fronts and face inwards so privacy is not high on the agenda em city is home to many aryans muslims gangstas latinos christians italians irish and more so scuffles death stares dodgy dealings and shady agreements are never far away i would say the main appeal of the show is due to the fact that it goes where other shows wouldn t dare forget pretty pi
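
Because every non-letter is replaced by a space, contractions are split: `you'll` becomes `you ll` and `wouldn't` becomes `wouldn t`, as the output above shows. The sketch below reproduces that behavior and contrasts it with a hypothetical variant, `normalize_keep_apostrophes`, that keeps apostrophes. The variant is an assumption added for comparison and is not part of the original notebook.

```python
import re

def normalize(text):
    # Same steps as normalizar_texto: letters only, lowercase, single spaces.
    text = re.sub(r"[^a-zA-Z\s]", " ", text)
    return re.sub(r"\s+", " ", text.lower()).strip()

def normalize_keep_apostrophes(text):
    # Hypothetical variant: apostrophes survive, so contractions stay whole.
    text = re.sub(r"[^a-zA-Z'\s]", " ", text)
    return re.sub(r"\s+", " ", text.lower()).strip()

sample = "After 1 Oz episode you'll be hooked."
print(normalize(sample))                   # after oz episode you ll be hooked
print(normalize_keep_apostrophes(sample))  # after oz episode you'll be hooked
```

Which variant is preferable depends on the stopword list used afterwards: fragments such as `ll` and `t` are only removed if the list contains them.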

In [44]:
doc_normalizado = normalizar_texto(doc_limpio)

# 3. Stopword Removal
Remove stopwords using the standard English list from scikit-learn (`ENGLISH_STOP_WORDS`); the text is tokenized with NLTK's `word_tokenize`.

In [45]:
# Remove the words that carry little meaning in the text (stopwords)
def remove_stopwords(doc_normalizado):
    tokens = word_tokenize(doc_normalizado)
    filtered = [w for w in tokens if w not in ENGLISH_STOP_WORDS]
    return " ".join(filtered)

doc_removed = remove_stopwords(doc_normalizado)

In [46]:
doc_removed

'reviewers mentioned watching just oz episode ll hooked right exactly happened thing struck oz brutality unflinching scenes violence set right word trust faint hearted timid pulls punches regards drugs sex violence hardcore classic use word called oz nickname given oswald maximum security state penitentary focuses mainly emerald city experimental section prison cells glass fronts face inwards privacy high agenda em city home aryans muslims gangstas latinos christians italians irish scuffles death stares dodgy dealings shady agreements far away say main appeal fact goes shows wouldn t dare forget pretty pictures painted mainstream audiences forget charm forget romance oz doesn t mess episode saw struck nasty surreal couldn t say ready watched developed taste oz got accustomed high levels graphic violence just violence injustice crooked guards ll sold nickel inmates ll kill order away mannered middle class inmates turned prison bitches lack street skills prison experience watching oz com
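
The output above still contains fragments such as `ll` and `t`, left over from split contractions, and the negation `not` has been removed, which may matter for sentiment tasks. A minimal standard-library sketch of set-based filtering with an optional extension list follows. The tiny `STOPWORDS` set and the `extra` parameter are illustrative assumptions, not the scikit-learn list or its API.

```python
# Toy stopword set, for illustration only (not the scikit-learn list).
STOPWORDS = frozenset({"the", "of", "is", "a", "you", "be", "this", "not"})

def remove_stopwords_sketch(text, stopwords=STOPWORDS, extra=()):
    # Set membership is O(1) per token; `extra` adds corpus-specific entries.
    blocked = stopwords | frozenset(extra)
    return " ".join(w for w in text.split() if w not in blocked)

text = "you ll be hooked this is not a show"
default_result = remove_stopwords_sketch(text)
extended_result = remove_stopwords_sketch(text, extra={"ll", "t"})

print(default_result)   # ll hooked show
print(extended_result)  # hooked show
```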

# 4. Stemming

Reduce the vocabulary space
# Stemming reduces a word to its root

* scenes --> scene
* cells --> cell
* guards --> guard


In [47]:
from nltk.stem import porter
stemmer = porter.PorterStemmer()

In [48]:
# Apply stemming, reducing each word to its root
def apply_stemming(doc_removed):
    tokens = word_tokenize(doc_removed)
    stemmed_tokens = [stemmer.stem(palabra) for palabra in tokens]
    doc_final = " ".join(stemmed_tokens)
    return doc_final

In [62]:
# Print document 1 after full processing
a = apply_stemming(doc_removed)
print(a)

review mention watch just oz episod ll hook right exactli happen thing struck oz brutal unflinch scene violenc set right word trust faint heart timid pull punch regard drug sex violenc hardcor classic use word call oz nicknam given oswald maximum secur state penitentari focus mainli emerald citi experiment section prison cell glass front face inward privaci high agenda em citi home aryan muslim gangsta latino christian italian irish scuffl death stare dodgi deal shadi agreement far away say main appeal fact goe show wouldn t dare forget pretti pictur paint mainstream audienc forget charm forget romanc oz doesn t mess episod saw struck nasti surreal couldn t say readi watch develop tast oz got accustom high level graphic violenc just violenc injustic crook guard ll sold nickel inmat ll kill order away manner middl class inmat turn prison bitch lack street skill prison experi watch oz comfort uncomfort view that touch darker


In [63]:
# Show document 1 before any processing
b = doc
b

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fa
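
The objective also lists n-grams. A minimal sketch of how they could be built from the stemmed tokens, using only the standard library, is shown here. The `ngrams` helper is hypothetical and not part of the original notebook; the sample tokens are the first stems printed above.

```python
def ngrams(tokens, n):
    """Return the n-grams of a token list as tuples, in order."""
    # zip stops at the shortest shifted copy, so only full n-grams are kept.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "review mention watch just oz episod".split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams[:3])
print([" ".join(g) for g in trigrams[:2]])
```

A sequence of `k` tokens yields `k - n + 1` n-grams, so adding bigrams or trigrams as features usually enlarges the vocabulary rather than shrinking it.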

# 5. Verify the difference
Compare the size of the corpus term dictionary before and after preprocessing.

In [59]:
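df['limpio'] = df['review'].apply(clean_text)
df['step2_norm'] = df['limpio'].apply(normalizar_texto)  # assumed column name, not in the original
df['step3_no_stop'] = df['step2_norm'].apply(remove_stopwords)  # referenced below but never created in the cells shown
df['stemming'] = df['step3_no_stop'].apply(apply_stemming)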

In [68]:
# "Before" counts whitespace-separated tokens, so case and attached punctuation inflate it
antes = len(set(" ".join(df['limpio']).split()))
despues = len(set(" ".join(df['stemming']).split()))

print(f"Vocabulary size before: {antes}")
print(f"Vocabulary size after: {despues}")
print(f"Reduction: {antes - despues} terms")

Vocabulary size before: 412542
Vocabulary size after: 68952
Reduction: 343590 terms
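
To see which step contributes to the reduction, the token count and vocabulary size can be measured after each stage rather than only at the two ends. The sketch below does this on a two-document toy corpus with the standard library only. The `vocab_stats` helper and the sample texts are illustrative assumptions; the stopword and stemming stages are left out to keep it self-contained.

```python
import re

def vocab_stats(docs):
    """Return (number of tokens, number of distinct tokens) for a list of texts."""
    tokens = [tok for doc in docs for tok in doc.split()]
    return len(tokens), len(set(tokens))

# Hypothetical mini-corpus, for illustration only.
corpus = ["The show is GOOD.<br />Good show!", "Bad plot, bad acting."]

cleaned = [re.sub(r"<.*?>", " ", d) for d in corpus]
normalized = [
    re.sub(r"\s+", " ", re.sub(r"[^a-zA-Z\s]", " ", d).lower()).strip()
    for d in cleaned
]

for name, docs in [("raw", corpus), ("cleaned", cleaned), ("normalized", normalized)]:
    n_tokens, n_types = vocab_stats(docs)
    print(f"{name:<11} tokens={n_tokens:>3} vocabulary={n_types:>3}")
```

On this toy corpus the token count stays at 10 in every stage, while the vocabulary drops from 10 to 7 once case and punctuation are normalized.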


# On the sample taken from the first document

This gives a result that shows how preprocessing changes a single document, or a whole corpus, before and after it is applied.

In [67]:
# len() on a string counts characters, so these figures are character lengths, not vocabulary sizes
print(f"Document length before (characters): {len(b)}")
print(f"Document length after (characters): {len(a)}")
print(f"Reduction: {len(b) - len(a)} characters")

Document length before (characters): 1761
Document length after (characters): 936
Reduction: 825 characters
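
Calling `len` on a string returns its number of characters, so the figures above measure text length rather than vocabulary. A minimal sketch of how characters, tokens, and distinct terms could be reported side by side for one document is given here. The `doc_stats` helper is hypothetical; the two sample strings are the opening of document 1 before and after processing, as printed earlier.

```python
def doc_stats(text):
    """Return character count, token count, and distinct-token count of a text."""
    tokens = text.split()
    return {"chars": len(text), "tokens": len(tokens), "vocab": len(set(tokens))}

before = "One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked."
after = "review mention watch just oz episod ll hook"

stats_before = doc_stats(before)
stats_after = doc_stats(after)

print("before:", stats_before)
print("after: ", stats_after)
print("vocabulary reduction:", stats_before["vocab"] - stats_after["vocab"], "terms")
```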
