# Exercise 5: Probabilistic Model

## Objective of the Exercise
- Apply preprocessing techniques step by step, evaluating the impact of each stage on the number of tokens and on the final vocabulary.
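To evaluate each stage as the objective describes, one option is a small helper that reports the total token count and the vocabulary size of a tokenized corpus. The sketch below is not part of the original notebook; `corpus_stats` and the toy `sample` corpus are hypothetical names used only for illustration.

```python
def corpus_stats(docs):
    """Return (total tokens, vocabulary size) for a list of token lists."""
    total_tokens = sum(len(doc) for doc in docs)
    vocabulary = {token for doc in docs for token in doc}
    return total_tokens, len(vocabulary)

# Toy corpus standing in for the output of any pipeline stage
sample = [["the", "pens", "rule"], ["the", "devils", "lose"]]
tokens, vocab = corpus_stats(sample)
print(f"tokens={tokens}, vocabulary={vocab}")
```

Calling the same helper on the list produced by each stage would give one pair of numbers per stage, which makes the reduction at each step directly comparable.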

## Part 0: Loading the Corpus

In [1]:
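from sklearn.datasets import fetch_20newsgroups

# Load all 20 Newsgroups posts, stripping headers, footers, and quoted replies
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
newsgroupsdocs = newsgroups.data  # list of raw document strings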

## Part 1: Tokenization

### Activity
1. Tokenize the documents.

In [2]:
# Tokenize each document by splitting on whitespace
def tokenize(doc):
    return doc.split()
tokenized_docs = [tokenize(doc) for doc in newsgroupsdocs]
print("Untokenized documents:", newsgroupsdocs[:5])  # Print the first 5 raw documents
print("Tokenized documents:", tokenized_docs[:5])  # Print the first 5 tokenized documents

Untokenized documents: ["\n\nI am sure some bashers of Pens fans are pretty confused about the lack\nof any kind of posts about the recent Pens massacre of the Devils. Actually,\nI am  bit puzzled too and a bit relieved. However, I am going to put an end\nto non-PIttsburghers' relief with a bit of praise for the Pens. Man, they\nare killing those Devils worse than I thought. Jagr just showed you why\nhe is much better than his regular season stats. He is also a lot\nfo fun to watch in the playoffs. Bowman should let JAgr have a lot of\nfun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final\nregular season game.          PENS RULE!!!\n\n", 'My brother is in the market for a high-performance video card that supports\nVESA local bus with 1-2MB RAM.  Does anyone have suggestions/ideas on:\n\n  - Diamond Stealth Pro Local Bus\n\n  - Orchid Farenheit 1280\n\n  - ATI Graphics Ultra Pro\n
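Whitespace splitting keeps punctuation attached to the neighbouring word, so a token like `Devils.` still carries its period. As a point of comparison only, a regex-based tokenizer can separate word characters from punctuation. This is an alternative sketch, not the notebook's method, and `regex_tokenize` is a hypothetical name.

```python
import re

def regex_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "Man, they are killing those Devils."
whitespace_tokens = sentence.split()
regex_tokens = regex_tokenize(sentence)
print(whitespace_tokens)
print(regex_tokens)
```

With whitespace splitting, punctuation has to be handled in a later normalization step; with the regex variant it already arrives as separate tokens that can be filtered out.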

## Part 2: Normalization

### Activity
1. Convert all tokens to lowercase.
2. Remove punctuation and non-alphabetic symbols.

In [3]:
# Convert all tokens to lowercase
def to_lowercase(doc):
    return [word.lower() for word in doc]
lowercase_docs = [to_lowercase(doc) for doc in tokenized_docs]
print("Lowercase documents:", lowercase_docs[:5])  # Print first 5 lowercase documents

Lowercase documents: [['i', 'am', 'sure', 'some', 'bashers', 'of', 'pens', 'fans', 'are', 'pretty', 'confused', 'about', 'the', 'lack', 'of', 'any', 'kind', 'of', 'posts', 'about', 'the', 'recent', 'pens', 'massacre', 'of', 'the', 'devils.', 'actually,', 'i', 'am', 'bit', 'puzzled', 'too', 'and', 'a', 'bit', 'relieved.', 'however,', 'i', 'am', 'going', 'to', 'put', 'an', 'end', 'to', "non-pittsburghers'", 'relief', 'with', 'a', 'bit', 'of', 'praise', 'for', 'the', 'pens.', 'man,', 'they', 'are', 'killing', 'those', 'devils', 'worse', 'than', 'i', 'thought.', 'jagr', 'just', 'showed', 'you', 'why', 'he', 'is', 'much', 'better', 'than', 'his', 'regular', 'season', 'stats.', 'he', 'is', 'also', 'a', 'lot', 'fo', 'fun', 'to', 'watch', 'in', 'the', 'playoffs.', 'bowman', 'should', 'let', 'jagr', 'have', 'a', 'lot', 'of', 'fun', 'in', 'the', 'next', 'couple', 'of', 'games', 'since', 'the', 'pens', 'are', 'going', 'to', 'beat', 'the', 'pulp', 'out', 'of', 'jersey', 'anyway.', 'i', 'was', 'ver

In [4]:
# Remove punctuation and non-alphabetic symbols
import re
NON_ALPHA = re.compile(r'[^a-zA-Z]')
def remove_punctuation(doc):
    # Strip non-letters from each token once, then drop tokens that end up empty
    stripped = (NON_ALPHA.sub('', word) for word in doc)
    return [word for word in stripped if word]
cleaned_docs = [remove_punctuation(doc) for doc in lowercase_docs]
print("Cleaned documents:", cleaned_docs[:5])  # Print first 5 cleaned documents

Cleaned documents: [['i', 'am', 'sure', 'some', 'bashers', 'of', 'pens', 'fans', 'are', 'pretty', 'confused', 'about', 'the', 'lack', 'of', 'any', 'kind', 'of', 'posts', 'about', 'the', 'recent', 'pens', 'massacre', 'of', 'the', 'devils', 'actually', 'i', 'am', 'bit', 'puzzled', 'too', 'and', 'a', 'bit', 'relieved', 'however', 'i', 'am', 'going', 'to', 'put', 'an', 'end', 'to', 'nonpittsburghers', 'relief', 'with', 'a', 'bit', 'of', 'praise', 'for', 'the', 'pens', 'man', 'they', 'are', 'killing', 'those', 'devils', 'worse', 'than', 'i', 'thought', 'jagr', 'just', 'showed', 'you', 'why', 'he', 'is', 'much', 'better', 'than', 'his', 'regular', 'season', 'stats', 'he', 'is', 'also', 'a', 'lot', 'fo', 'fun', 'to', 'watch', 'in', 'the', 'playoffs', 'bowman', 'should', 'let', 'jagr', 'have', 'a', 'lot', 'of', 'fun', 'in', 'the', 'next', 'couple', 'of', 'games', 'since', 'the', 'pens', 'are', 'going', 'to', 'beat', 'the', 'pulp', 'out', 'of', 'jersey', 'anyway', 'i', 'was', 'very', 'disappoin
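Deleting non-alphabetic characters inside a token fuses hyphenated or slash-separated words: `non-pittsburghers'` becomes `nonpittsburghers` above. A possible variant treats every run of non-letters as a token boundary instead. The sketch below contrasts the two behaviours on a few tokens; `strip_non_alpha` mirrors the notebook's approach, while `split_on_non_alpha` is a hypothetical alternative, not something the notebook uses.

```python
import re

def strip_non_alpha(tokens):
    """Notebook-style cleaning: delete non-letters inside each token."""
    cleaned = (re.sub(r"[^a-zA-Z]", "", t) for t in tokens)
    return [t for t in cleaned if t]

def split_on_non_alpha(tokens):
    """Alternative: treat every run of non-letters as a token boundary."""
    return [part for t in tokens for part in re.split(r"[^a-zA-Z]+", t) if part]

tokens = ["high-performance", "suggestions/ideas", "1-2mb"]
fused = strip_non_alpha(tokens)
separated = split_on_non_alpha(tokens)
print(fused)
print(separated)
```

Which behaviour is preferable depends on the task: fusing creates rare vocabulary entries such as `suggestionsideas`, while splitting increases the token count.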

## Part 3: Stopword Removal

### Activity
1. Remove stopwords using a standard list.

In [5]:
# Remove stopwords using NLTK's English list
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))  # a set gives fast membership checks
def remove_stopwords(doc):
    return [word for word in doc if word not in stop_words]
filtered_docs = [remove_stopwords(doc) for doc in cleaned_docs]
print("Filtered documents:", filtered_docs[:5])  # Print first 5 filtered documents

Filtered documents: [['sure', 'bashers', 'pens', 'fans', 'pretty', 'confused', 'lack', 'kind', 'posts', 'recent', 'pens', 'massacre', 'devils', 'actually', 'bit', 'puzzled', 'bit', 'relieved', 'however', 'going', 'put', 'end', 'nonpittsburghers', 'relief', 'bit', 'praise', 'pens', 'man', 'killing', 'devils', 'worse', 'thought', 'jagr', 'showed', 'much', 'better', 'regular', 'season', 'stats', 'also', 'lot', 'fo', 'fun', 'watch', 'playoffs', 'bowman', 'let', 'jagr', 'lot', 'fun', 'next', 'couple', 'games', 'since', 'pens', 'going', 'beat', 'pulp', 'jersey', 'anyway', 'disappointed', 'see', 'islanders', 'lose', 'final', 'regular', 'season', 'game', 'pens', 'rule'], ['brother', 'market', 'highperformance', 'video', 'card', 'supports', 'vesa', 'local', 'bus', 'mb', 'ram', 'anyone', 'suggestionsideas', 'diamond', 'stealth', 'pro', 'local', 'bus', 'orchid', 'farenheit', 'ati', 'graphics', 'ultra', 'pro', 'highperformance', 'vlb', 'card', 'please', 'post', 'email', 'thank', 'matt'], ['finally
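To quantify what stopword removal does to a document, the fraction of tokens dropped can be computed directly. The sketch below reuses the opening tokens of the first cleaned document; `SAMPLE_STOP_WORDS` is a tiny illustrative set and `removal_ratio` is a hypothetical helper, whereas the notebook itself relies on NLTK's full English list.

```python
# Tiny illustrative stop list; the notebook itself uses NLTK's English list
SAMPLE_STOP_WORDS = {"i", "am", "some", "of", "are", "about", "the"}

def removal_ratio(doc, stop_words):
    """Return the kept tokens and the fraction of tokens removed."""
    kept = [w for w in doc if w not in stop_words]
    ratio = 1 - len(kept) / len(doc) if doc else 0.0
    return kept, ratio

doc = ["i", "am", "sure", "some", "bashers", "of", "pens", "fans",
       "are", "pretty", "confused", "about", "the", "lack"]
kept, ratio = removal_ratio(doc, SAMPLE_STOP_WORDS)
print(kept)
print(f"removed {ratio:.0%} of tokens")
```

In this fragment half of the tokens are stopwords, which suggests why this stage tends to cut the token count sharply while barely changing the vocabulary size.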

## Part 4: Stemming or Lemmatization

### Activity
1. Apply stemming.
2. Apply lemmatization.
3. Compare both techniques.

In [6]:
# Apply stemming with the Porter algorithm
# (PorterStemmer is rule-based and needs no downloaded NLTK data)
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
def stem_words(doc):
    return [stemmer.stem(word) for word in doc]
stemmed_docs = [stem_words(doc) for doc in filtered_docs]
print("Stemmed documents:", stemmed_docs[:5])  # Print first 5 stemmed documents

Stemmed documents: [['sure', 'basher', 'pen', 'fan', 'pretti', 'confus', 'lack', 'kind', 'post', 'recent', 'pen', 'massacr', 'devil', 'actual', 'bit', 'puzzl', 'bit', 'reliev', 'howev', 'go', 'put', 'end', 'nonpittsburgh', 'relief', 'bit', 'prais', 'pen', 'man', 'kill', 'devil', 'wors', 'thought', 'jagr', 'show', 'much', 'better', 'regular', 'season', 'stat', 'also', 'lot', 'fo', 'fun', 'watch', 'playoff', 'bowman', 'let', 'jagr', 'lot', 'fun', 'next', 'coupl', 'game', 'sinc', 'pen', 'go', 'beat', 'pulp', 'jersey', 'anyway', 'disappoint', 'see', 'island', 'lose', 'final', 'regular', 'season', 'game', 'pen', 'rule'], ['brother', 'market', 'highperform', 'video', 'card', 'support', 'vesa', 'local', 'bu', 'mb', 'ram', 'anyon', 'suggestionsidea', 'diamond', 'stealth', 'pro', 'local', 'bu', 'orchid', 'farenheit', 'ati', 'graphic', 'ultra', 'pro', 'highperform', 'vlb', 'card', 'pleas', 'post', 'email', 'thank', 'matt'], ['final', 'said', 'dream', 'mediterranean', 'new', 'area', 'greater', 'y

In [7]:
# Apply lemmatization with WordNet
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
def lemmatize_words(doc):
    # lemmatize() treats every token as a noun by default (pos='n'),
    # so verb forms such as 'going' or 'showed' are left unchanged
    return [lemmatizer.lemmatize(word) for word in doc]
lemmatized_docs = [lemmatize_words(doc) for doc in filtered_docs]
print("Lemmatized documents:", lemmatized_docs[:5])  # Print first 5 lemmatized documents

Lemmatized documents: [['sure', 'bashers', 'pen', 'fan', 'pretty', 'confused', 'lack', 'kind', 'post', 'recent', 'pen', 'massacre', 'devil', 'actually', 'bit', 'puzzled', 'bit', 'relieved', 'however', 'going', 'put', 'end', 'nonpittsburghers', 'relief', 'bit', 'praise', 'pen', 'man', 'killing', 'devil', 'worse', 'thought', 'jagr', 'showed', 'much', 'better', 'regular', 'season', 'stats', 'also', 'lot', 'fo', 'fun', 'watch', 'playoff', 'bowman', 'let', 'jagr', 'lot', 'fun', 'next', 'couple', 'game', 'since', 'pen', 'going', 'beat', 'pulp', 'jersey', 'anyway', 'disappointed', 'see', 'islander', 'lose', 'final', 'regular', 'season', 'game', 'pen', 'rule'], ['brother', 'market', 'highperformance', 'video', 'card', 'support', 'vesa', 'local', 'bus', 'mb', 'ram', 'anyone', 'suggestionsideas', 'diamond', 'stealth', 'pro', 'local', 'bus', 'orchid', 'farenheit', 'ati', 'graphic', 'ultra', 'pro', 'highperformance', 'vlb', 'card', 'please', 'post', 'email', 'thank', 'matt'], ['finally', 'said', '
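For the comparison requested in the activity, one simple approach is to line up a few tokens with their stem and lemma and count how often each technique changes the token. The triples below are copied from the outputs printed above; `summarize` is a hypothetical helper, and the sample is too small to support general conclusions.

```python
# (token, stem, lemma) triples copied from the outputs printed above
samples = [
    ("pens", "pen", "pen"),
    ("bashers", "basher", "bashers"),
    ("confused", "confus", "confused"),
    ("going", "go", "going"),
    ("islanders", "island", "islander"),
    ("bus", "bu", "bus"),
    ("finally", "final", "finally"),
]

def summarize(triples):
    """Count how often each technique changed the token and how often they agree."""
    return {
        "changed_by_stemming": sum(stem != tok for tok, stem, _ in triples),
        "changed_by_lemmatizing": sum(lemma != tok for tok, _, lemma in triples),
        "same_result": sum(stem == lemma for _, stem, lemma in triples),
    }

summary = summarize(samples)
print(summary)
```

On these examples the Porter stemmer changes every token and sometimes produces non-words (`confus`, `bu`), while the noun-only lemmatizer changes just the plural nouns it recognizes. Stemming therefore shrinks the vocabulary more, and lemmatization keeps tokens readable.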