# Exercise 4: Probabilistic Model

## Objectives
- Understand the components of the vector space model through manual calculation and direct observation.
- Apply the vector space model with TF-IDF to retrieve relevant documents.
- Compare retrieval with BM25 against retrieval with TF-IDF.
- Analyze the differences between the models visually.
- Evaluate whether the generated rankings are consistent with the documents you would consider relevant.

## Part 0: Loading the Corpus

In [None]:
from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
newsgroupsdocs = newsgroups.data

In [None]:
newsgroupsdocs

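### Text cleaning

First, we replace line breaks, tabs, periods, and commas with spaces, drop every remaining non-alphanumeric character, and collapse repeated whitespace.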

In [None]:
import re

def clean_text(text):
    text = text.replace('\n', ' ').replace('\t', ' ').replace(".", " ").replace(",", " ")
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

corpus = [clean_text(text) for text in newsgroupsdocs]
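
As a quick check of what this cleaning does to a single string, the sketch below repeats `clean_text` on a made-up sentence (the sample text is an assumption, not taken from the corpus). Note that apostrophes are removed and decimal points become spaces, which is why forms such as "Wouldnt" show up in the cleaned documents.

```python
import re

def clean_text(text):
    # Same steps as the notebook cell: separators become spaces,
    # other punctuation is dropped, whitespace is collapsed.
    text = text.replace('\n', ' ').replace('\t', ' ').replace(".", " ").replace(",", " ")
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

sample = "It's 3.5GHz,\tisn't it?\n"  # hypothetical input
print(clean_text(sample))  # prints: Its 3 5GHz isnt it
```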

In [None]:
corpus

In [None]:
len(corpus)

18846

We remove the documents that are empty after cleaning.

In [None]:
corpus = list(filter(lambda doc: doc and doc.strip(), corpus))

In [None]:
len(corpus)

18313

We collect every distinct lowercase word in the corpus into a set.

In [None]:
words = set(" ".join(corpus).lower().split())

In [None]:
len(words)

149879

Next, we apply stemming to shrink the vocabulary. Stemming strips suffixes so that inflected and derived forms of a word collapse to a common stem (for example, "running" becomes "run"); the stem is not always a dictionary word.

In [None]:
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
words = set([stemmer.stem(word) for word in words])

In [None]:
len(words)

127061

Finally, we filter out tokens that add no value.

In [None]:
!pip install wordfreq



In [None]:
from wordfreq import zipf_frequency

def is_valid_word(token):
  # Reject tokens that are too short or too long.
  if len(token) < 2 or len(token) > 25:
    return False

  # Reject tokens in which more than half of the characters are digits.
  if sum(c.isdigit() for c in token) > len(token) / 2:
    return False

  # Reject tokens with a digit sandwiched between two letters.
  if re.search(r'[a-zA-Z]\d[a-zA-Z]', token):
    return False

  # Reject tokens that are very rare in English (Zipf frequency below 1.0);
  # purely four-digit tokens are exempt from this particular check.
  freq = zipf_frequency(token, 'en')
  if freq < 1.0 and not re.match(r'^\d{4}$', token):
    return False

  return True

words = [word for word in words if is_valid_word(word)]

In [None]:
len(words)

30730

We build a function that normalizes a list of words following the steps above.

In [None]:
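def clean_words(words):
  # Stem each word and keep the distinct stems that pass the validity filter,
  # so the result matches the stemmed, filtered vocabulary built above.
  stems = (stemmer.stem(word) for word in words)
  return {stem for stem in stems if is_valid_word(stem)}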

## Part 1: Computing TF, DF, IDF, and TF-IDF

### Activity
1. Use the loaded corpus.
2. Build the term-frequency (TF) matrix and compute the document frequency (DF).
3. Compute TF-IDF using sklearn.
4. Display the values in a DataFrame to analyze the differences between terms.

### Computing TF-IDF manually

First we obtain the TF of each document.

In [None]:
from collections import Counter
import numpy as np

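def get_tf(doc, vocab_index):
  # Stem every token and keep duplicates so that repeated terms are counted.
  stems = [stemmer.stem(word) for word in doc.lower().split()]
  tokens = [stem for stem in stems if is_valid_word(stem)]
  total = len(tokens)
  tf = np.zeros(len(vocab_index))

  if total == 0:
    return tf

  # Relative frequency of each vocabulary term within the document.
  for word, count in Counter(tokens).items():
    if word in vocab_index:
      tf[vocab_index[word]] = count / total
  return tf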

Then the IDF.

In [None]:
def get_idf(corpus, vocab_index):
  n_docs = len(corpus)
  df = np.zeros(len(vocab_index))

  for doc in corpus:
    words = clean_words(doc.lower().split())
    for w in words:
      if w in vocab_index:
        df[vocab_index[w]] += 1
  idf = np.log((n_docs + 1) / (df + 1)) + 1
  return idf
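
Written as a formula, `get_idf` computes a smoothed IDF, where $N$ is `n_docs` and $\mathrm{df}(t)$ is the number of documents that contain term $t$. This matches the smoothing that sklearn's `TfidfVectorizer` applies by default (`smooth_idf=True`). The numbers in the worked line are hypothetical and serve only to show the arithmetic.

```latex
\mathrm{idf}(t) = \ln\!\left(\frac{N + 1}{\mathrm{df}(t) + 1}\right) + 1,
\qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d)\cdot\mathrm{idf}(t)

% Worked example with N = 3 and df(t) = 1:
\mathrm{idf}(t) = \ln\!\left(\frac{3 + 1}{1 + 1}\right) + 1 = \ln 2 + 1 \approx 1.693
```

The manual TF divides each count by the document length, whereas `TfidfVectorizer` by default uses raw counts and then L2-normalizes each row, so the manual matrix and the sklearn matrix are not expected to match value for value.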

Finally, the TF-IDF.

In [None]:
def get_tf_idf(tf, idf):
  return tf * idf

Now we compute the matrices for our corpus. We use a word-to-column index so that each term lookup is a constant-time dictionary access.

In [None]:
vocab_index = {word: index for index, word in enumerate(words)}

In [None]:
tf = np.array([get_tf(doc, vocab_index) for doc in corpus])

In [None]:
idf = get_idf(corpus, vocab_index)

In [None]:
tf_idf = get_tf_idf(tf, idf)

In [None]:
tf_idf

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

We build a DataFrame with the pandas library to inspect the values.

In [None]:
import pandas as pd

df = pd.DataFrame(tf_idf, columns=words)

In [None]:
df.value_counts()

### Computing TF-IDF with sklearn

In [None]:
import re
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
from wordfreq import zipf_frequency

def limpiar_token(token):
  if not re.match(r'^[a-zA-Z0-9]+$', token):
    return False
  if len(token) < 2 or len(token) > 25:
    return False
  if sum(c.isdigit() for c in token) > len(token) / 2:
    return False
  if re.search(r'[a-zA-Z]\d[a-zA-Z]', token):
    return False

  if zipf_frequency(token, 'en') < 1.0:
    return False
  return True

def mi_tokenizador(texto):
    tokens = re.findall(r'\b\w+\b', texto.lower())
    return [t for t in tokens if limpiar_token(t)]

vectorizador = TfidfVectorizer(tokenizer=mi_tokenizador)

tf_idf = vectorizador.fit_transform(corpus)
terminos = vectorizador.get_feature_names_out()

df = pd.DataFrame(tf_idf.todense(), columns=terminos)
df


Unnamed: 0,000miles,000plus,000usd,00mhz,00us,0a,0b,0c,0d,0e,...,zwarte,zwingli,zwischen,zx,zy,zygon,zyxel,zz,zzzs,zzzzzz
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18308,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
18309,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
18310,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
18311,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Part 2: Ranking documents with TF-IDF

### Activity

1. Given a query, build the query vector.
2. Compute the cosine similarity between the query and each document using the TF-IDF vectors.
3. Generate a ranking of the documents ordered by relevance.
4. Show the results in a table.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def rank_documents(query, tfidf_matrix, vectorizer, corpus):
  # Project the query into the TF-IDF space of the fitted vectorizer.
  query_vector = vectorizer.transform([query])

  # Cosine similarity between the query and every document.
  cosine_sim = cosine_similarity(query_vector, tfidf_matrix).flatten()

  # Document indices sorted from most to least similar.
  ranking = np.argsort(-cosine_sim)

  results = pd.DataFrame({
    'Documento': [corpus[i] for i in ranking],
    'Similaridad': cosine_sim[ranking]
  })

  return results


In [None]:
query = "vegetarian onion diet"
results_df = rank_documents(query, tf_idf, vectorizador, corpus)

print("TF-IDF ranking for query:", query)
print(results_df)

TF-IDF ranking for query: vegetarian onion diet
                                               Documento  Similaridad
0      Not sure of this but I think some millipedes c...     0.178475
1      Millipedes I understand are vegetarian and the...     0.163694
2      Need Diet for Diverticular Disease and ideas f...     0.151149
3      If one is a vegan a vegetarian taht eats no an...     0.150996
4      I remember hearing a few years back about a ne...     0.149680
...                                                  ...          ...
18308  Not in isolated ground recepticles usually an ...     0.000000
18309  I just installed a DX266 CPU in a clone mother...     0.000000
18310  Wouldnt this require a hypersphere In 3space 4...     0.000000
18311  After a tip from Gary Crum crumfcom cc utah ed...     0.000000
18312  I am sure some bashers of Pens fans are pretty...     0.000000

[18313 rows x 2 columns]
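
To make the similarity measure concrete, here is a minimal sketch of cosine similarity on three-dimensional toy vectors, written in plain Python. The helper name `cosine` and the vectors are assumptions for illustration only; the notebook itself relies on sklearn's `cosine_similarity` over sparse TF-IDF rows.

```python
import math

def cosine(u, v):
    # Dot product divided by the product of the Euclidean norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a vector with no terms is treated as dissimilar to everything
    return dot / (norm_u * norm_v)

# Hypothetical query and document vectors over a three-term vocabulary.
query = [1.0, 1.0, 0.0]
docs = {
    "d1": [2.0, 2.0, 0.0],  # same direction as the query
    "d2": [0.0, 0.0, 3.0],  # shares no terms with the query
    "d3": [1.0, 0.0, 1.0],  # shares one term with the query
}

ranking = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)
print(ranking)
```

Because the measure depends only on direction, `d1` ranks first even though its values are twice those of the query, and documents that share no terms with the query score zero, as in the tail of the table above.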


## Part 3: Ranking with BM25

### Activity

1. Implement a retrieval system using the BM25 model.
2. Use the same query as in the previous exercise.
3. Compute the BM25 score for each document and generate a ranking.
4. Manually compare it with the TF-IDF ranking.
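
One possible starting point for this activity is the sketch below, which scores documents with the Okapi BM25 formula in plain Python. Everything in it is an assumption rather than part of the notebook: the function name `bm25_scores`, the parameter values `k1=1.5` and `b=0.75`, the IDF variant with `+ 1` inside the logarithm (chosen so that scores never become negative), and the three toy documents.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query tokens."""
    n_docs = len(tokenized_docs)
    if n_docs == 0:
        return []
    doc_lens = [len(doc) for doc in tokenized_docs]
    avgdl = sum(doc_lens) / n_docs

    # Document frequency: number of documents that contain each term.
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))

    scores = []
    for doc, dl in zip(tokenized_docs, doc_lens):
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation (k1) and length normalization (b).
            denom = tf[term] + k1 * (1 - b + b * dl / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

# Hypothetical toy corpus, tokenized by whitespace.
toy_docs = [
    "vegetarian diet with onion soup",
    "onion rings recipe",
    "graphics card driver update",
]
tokenized = [doc.split() for doc in toy_docs]
scores = bm25_scores("vegetarian onion diet".split(), tokenized)
ranking = sorted(range(len(toy_docs)), key=lambda i: scores[i], reverse=True)
print(ranking, scores)
```

In the notebook, the tokenized documents could come from `[mi_tokenizador(doc) for doc in corpus]` and the query tokens from `mi_tokenizador(query)`, so that both models see the same terms.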

## Part 4: Visual comparison of TF-IDF and BM25

### Activity

1. Use a bar chart to visualize the scores each document obtains under TF-IDF and BM25.
2. Compare the rankings visually.
3. Identify: which documents score higher under one model than under the other?
4. Suggest: what could explain this difference?
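
Cosine similarities fall between 0 and 1 while BM25 scores are unbounded, so raw bars from the two models are hard to compare. The sketch below rescales each score list to the range 0 to 1 before drawing grouped bars. The helper names `min_max` and `plot_comparison`, the use of matplotlib, and the score values are all assumptions for illustration.

```python
def min_max(scores):
    # Rescale scores to [0, 1]; a constant list maps to all zeros.
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def plot_comparison(doc_ids, tfidf_scores, bm25_scores):
    # Imported here so that min_max works even without matplotlib installed.
    import matplotlib.pyplot as plt

    positions = list(range(len(doc_ids)))
    width = 0.4
    fig, ax = plt.subplots()
    ax.bar([p - width / 2 for p in positions], min_max(tfidf_scores), width, label="TF-IDF")
    ax.bar([p + width / 2 for p in positions], min_max(bm25_scores), width, label="BM25")
    ax.set_xticks(positions)
    ax.set_xticklabels([str(d) for d in doc_ids])
    ax.set_xlabel("Document")
    ax.set_ylabel("Normalized score")
    ax.legend()
    return fig

# Hypothetical scores for three documents under each model.
doc_ids = [0, 1, 2]
tfidf_scores = [0.2, 0.1, 0.4]
bm25_scores = [6.0, 9.0, 3.0]
print(min_max(tfidf_scores))
print(min_max(bm25_scores))
```

Calling `plot_comparison(doc_ids, tfidf_scores, bm25_scores)` in a notebook cell would draw the chart. Restricting `doc_ids` to the union of the top few documents from each ranking keeps the plot readable.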

## Part 5: Evaluation with a relevant query

### Activity

1. Choose a query and define which documents in the corpus should be considered relevant.
2. Evaluate Precision@3 or MAP for the rankings produced with TF-IDF and BM25.
3. Answer: which model gives better results with respect to your relevance criterion?
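
The two metrics named above can be sketched in a few lines of plain Python. The function names, the ranking, and the set of relevant document ids are hypothetical; in the notebook the ranking would be the list of document indices sorted by score and the relevant set would come from your own judgment.

```python
def precision_at_k(ranking, relevant, k):
    # Fraction of the top-k ranked documents that are relevant.
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for doc_id in ranking[:k] if doc_id in relevant) / k

def average_precision(ranking, relevant):
    # Mean of the precision values measured at the rank of each relevant hit,
    # divided by the total number of relevant documents.
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

# Hypothetical ranking (document ids, best first) and relevance judgments.
ranking = [7, 2, 9, 4, 1]
relevant = {2, 4, 5}

print(precision_at_k(ranking, relevant, 3))
print(average_precision(ranking, relevant))
```

MAP is the mean of `average_precision` over several queries; with a single query the two values coincide. Running both functions on the TF-IDF ranking and on the BM25 ranking for the same relevant set gives directly comparable numbers.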