# Exercise 4: Probabilistic Model
## Lab Objectives
* Understand the components of the vector space model through manual calculations and direct observation.
* Apply the vector space model with TF-IDF to retrieve relevant documents.
* Compare retrieval with BM25 against TF-IDF.
* Visually analyze the differences between the models.
* Evaluate whether the generated rankings are consistent with the documents you would consider relevant.

## Part 0: Loading the Corpus

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import re
import nltk
nltk.download('punkt_tab')
nltk.download('stopwords')
from nltk.corpus import stopwords
from collections import defaultdict


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
newsgroupsdocs = newsgroups.data

## Part 1: Computing TF, DF, IDF, and TF-IDF
Activity
* Use the loaded corpus.
* Build the term frequency (TF) matrix and compute the document frequency (DF) of each term.
* Compute TF-IDF using sklearn.
* Display the values in a DataFrame to analyze the differences between terms.
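
Before relying on sklearn, the TF and DF definitions used in this part can be checked by hand on a toy corpus. The sketch below is only an illustration: the three short documents and the names `toy_docs`, `term_frequencies`, and `document_frequencies` are assumptions made for the example and are not part of the exercise corpus.

```python
from collections import Counter

# Hypothetical toy corpus (already tokenized); not taken from 20 Newsgroups.
toy_docs = [
    ["disk", "drive", "scsi", "drive"],
    ["hockey", "fans", "hockey"],
    ["scsi", "card", "disk"],
]

def term_frequencies(docs):
    """TF: raw count of each term inside each document."""
    return [Counter(tokens) for tokens in docs]

def document_frequencies(docs):
    """DF: number of documents that contain each term at least once."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # set() so repeats count once per document
    return doc_freq

term_freq = term_frequencies(toy_docs)
doc_freq = document_frequencies(toy_docs)

print(term_freq[0]["drive"])  # "drive" occurs twice in the first document
print(doc_freq["scsi"])       # "scsi" appears in two of the three documents
```

TF is counted per document, whereas DF is counted once per document regardless of how often the term repeats inside it.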

In [3]:
# Load the corpus into a DataFrame
df = pd.DataFrame(newsgroupsdocs)
df

Unnamed: 0,0
0,\n\nI am sure some bashers of Pens fans are pr...
1,My brother is in the market for a high-perform...
2,\n\n\n\n\tFinally you said what you dream abou...
3,\nThink!\n\nIt's the SCSI card doing the DMA t...
4,1) I have an old Jasmine drive which I cann...
...,...
18841,DN> From: nyeda@cnsvax.uwec.edu (David Nye)\nD...
18842,\nNot in isolated ground recepticles (usually ...
18843,I just installed a DX2-66 CPU in a clone mothe...
18844,\nWouldn't this require a hyper-sphere. In 3-...


In [4]:
# Preprocess the DataFrame
# Spanish stopword list, built once rather than on every call
# (the corpus is in English, so English function words pass through this filter)
stop_es = set(stopwords.words("spanish"))

def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Replace non-alphabetic characters with spaces
    text = re.sub(r"[^a-záéíóúñü\s]", " ", text)
    # Tokenize
    tokens = nltk.word_tokenize(text, language="spanish")
    # Drop very short tokens and stopwords
    tokens = [t for t in tokens if len(t) > 2 and t not in stop_es]

    return tokens


In [5]:
# Apply preprocessing
df['prep'] = df[0].apply(preprocess_text)
display(df.head())

Unnamed: 0,0,prep
0,\n\nI am sure some bashers of Pens fans are pr...,"[sure, some, bashers, pens, fans, are, pretty,..."
1,My brother is in the market for a high-perform...,"[brother, the, market, for, high, performance,..."
2,\n\n\n\n\tFinally you said what you dream abou...,"[finally, you, said, what, you, dream, about, ..."
3,\nThink!\n\nIt's the SCSI card doing the DMA t...,"[think, the, scsi, card, doing, the, dma, tran..."
4,1) I have an old Jasmine drive which I cann...,"[have, old, jasmine, drive, which, can, not, u..."
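
The `prep` column above still contains English function words such as "some", "the", and "are": the 20 Newsgroups posts are in English, while `preprocess_text` filters with the Spanish stopword list. One possible variant, sketched here under assumptions, takes the stopword set as a parameter. The function name `preprocess_text_en`, the regex tokenizer, and the tiny `demo_stopwords` set are hypothetical; in the notebook one would more likely pass `set(stopwords.words("english"))`.

```python
import re

def preprocess_text_en(text, stop_words, min_len=3):
    """Lowercase, keep alphabetic tokens, drop short tokens and stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if len(t) >= min_len and t not in stop_words]

# Tiny illustrative stopword set; a real run would use a full English list.
demo_stopwords = {"the", "for", "and", "are", "some", "you", "what"}

# Made-up sample sentence, not a post from the corpus.
sample = "The fans are looking for a fast SCSI drive and some memory"
print(preprocess_text_en(sample, demo_stopwords))
```

Applying a variant like this would change the vocabulary, so the TF, DF, and TF-IDF outputs shown in the rest of the notebook would no longer match exactly.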


## TF Matrix

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

corpus_ready = df['prep'].apply(lambda tokens: " ".join(tokens)).tolist()

vectorizer = CountVectorizer()
tf_matrix = vectorizer.fit_transform(corpus_ready)

tf_df = pd.DataFrame(
    tf_matrix.toarray(),
    columns=vectorizer.get_feature_names_out()
)

display(tf_df.head())


Unnamed: 0,aaa,aaaaa,aaaaaaaaaaaa,aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaauuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuugggggggggggggggg,aaaaarrrrgh,aaaall,aaack,aaaggghhh,aaah,aaahh,...,zzq,zzrk,zzs,zzum,zzvsi,zzy,zzz,zzzoh,zzzzzz,zzzzzzt
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Document Frequency (DF)



In [7]:
# DF = number of documents in which each term has frequency > 0
# Computed on the sparse matrix to avoid building a dense copy
df_vector = np.asarray((tf_matrix > 0).sum(axis=0)).ravel()

df_series = pd.Series(df_vector, index=vectorizer.get_feature_names_out())

df_series.head()

Unnamed: 0,0
aaa,30
aaaaa,2
aaaaaaaaaaaa,1
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaauuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuugggggggggggggggg,1
aaaaarrrrgh,1
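
To connect these document frequencies with the next step, the sketch below converts a DF value into an IDF weight. It assumes the smoothed formula that scikit-learn's `TfidfVectorizer` applies by default, idf(t) = ln((1 + N) / (1 + df(t))) + 1, and it assumes N = 18846 documents (the size of the full 20 Newsgroups collection). The helper name `smoothed_idf` and the constant `N_DOCS` are introduced only for illustration.

```python
import math

N_DOCS = 18846  # assumed corpus size; check len(newsgroupsdocs) in the notebook

def smoothed_idf(df_t, n_docs=N_DOCS):
    """Smoothed IDF: ln((1 + N) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + df_t)) + 1

# DF values taken from the output above
for term, df_t in [("aaa", 30), ("aaaaa", 2), ("aaaaarrrrgh", 1)]:
    print(f"{term}: df={df_t}, idf={smoothed_idf(df_t):.3f}")
```

The rarer a term is across documents, the larger its IDF. A term that appears in every document gets the minimum weight of 1 under this formula.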


## TF-IDF Matrix

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus_ready)

tfidf_df = pd.DataFrame(
    tfidf_matrix.toarray(),
    columns=tfidf_vectorizer.get_feature_names_out()
)

display(tfidf_df.head())

Unnamed: 0,aaa,aaaaa,aaaaaaaaaaaa,aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaauuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuugggggggggggggggg,aaaaarrrrgh,aaaall,aaack,aaaggghhh,aaah,aaahh,...,zzq,zzrk,zzs,zzum,zzvsi,zzy,zzz,zzzoh,zzzzzz,zzzzzzt
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
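
Every entry displayed above is 0.0 because each post uses only a small fraction of the vocabulary. For the nonzero entries, `TfidfVectorizer` with default settings multiplies the raw count by the IDF and then scales each document vector to unit Euclidean length. The sketch below reproduces that per-document step in plain Python. The function `tfidf_row` and the sample counts and IDF values are made up for illustration.

```python
import math

def tfidf_row(term_counts, idf):
    """Weight raw counts by IDF, then L2-normalize the document vector."""
    weights = {t: c * idf[t] for t, c in term_counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return weights  # empty or all-zero document: nothing to normalize
    return {t: w / norm for t, w in weights.items()}

# Hypothetical counts and IDF values for a single document
counts = {"scsi": 1, "card": 1}
idf = {"scsi": 3.0, "card": 4.0}

row = tfidf_row(counts, idf)
print(row)  # {'scsi': 0.6, 'card': 0.8}
```

Because of the normalization, the squared weights of a document sum to 1, which makes documents of different lengths comparable.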


## Visualizing the Results

In [13]:
n_cols = tfidf_df.shape[1]
block = 200
mid = n_cols // 2

print("DataFrame TF-IDF:")
display(tfidf_df.iloc[:, mid - block : mid + block])

DataFrame TF-IDF:


Unnamed: 0,lsujpv,lsvh,lsz,ltb,ltbh,ltc,ltcs,ltd,ltdjd,lte,...,lymph,lymphocyte,lymphocytes,lymphoma,lyn,lynch,lynchings,lynda,lyndon,lynette
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18841,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
18842,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
18843,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
18844,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
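
A column slice like the one above tends to show only zeros, since an arbitrary block of 400 vocabulary columns rarely overlaps with the terms of the displayed posts. An alternative view, sketched here, lists the highest-weighted terms of a single document. The helper `top_terms` and the sample weights are hypothetical. In the notebook, the input could be built from one row of `tfidf_df`, for example from `tfidf_df.iloc[0]` restricted to its nonzero entries.

```python
import heapq

def top_terms(weights, k=3):
    """Return the k (term, weight) pairs with the largest weights."""
    return heapq.nlargest(k, weights.items(), key=lambda item: item[1])

# Made-up TF-IDF weights for one document
doc_weights = {"pens": 0.41, "fans": 0.22, "sure": 0.05, "bashers": 0.37}

for term, weight in top_terms(doc_weights, k=3):
    print(f"{term}: {weight:.2f}")
```

Looking at the top terms per document makes it easier to judge whether the weighting highlights words that actually characterize the post.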
