# Exercise 4: Probabilistic Model

## Objectives
- Understand the components of the vector space model through manual calculations and direct observation.
- Apply the vector space model with TF-IDF to retrieve relevant documents.
- Compare retrieval with BM25 against TF-IDF.
- Visually analyze the differences between the models.
- Evaluate whether the generated rankings are consistent with the documents you would consider relevant.

## Part 0: Loading the corpus

In [2]:
from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
newsgroupsdocs = newsgroups.data

## Part 1: Computing TF, DF, IDF, and TF-IDF

### Activity
1. Use the loaded corpus.
2. Build the term-count matrix (TF) and compute the document frequency (DF).
3. Compute TF-IDF using sklearn.
4. Display the values in a DataFrame to analyze the differences between terms.

In [3]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
import pandas as pd

In [4]:
# Build the term-frequency (TF) matrix
vectorizer = CountVectorizer(stop_words='english')
tf_matrix = vectorizer.fit_transform(newsgroupsdocs)

print("TF Matrix Shape: ", tf_matrix.shape)
# Vector size = vocabulary size; read it from the shape instead of densifying the matrix
print("Vector size: ", tf_matrix.shape[1])

TF Matrix Shape:  (18846, 134101)
Vector size:  134101


In [5]:
# Compute the document frequency (DF): number of documents that contain each term
df = (tf_matrix > 0).sum(axis=0)
df = pd.DataFrame(df.tolist()[0], index=vectorizer.get_feature_names_out(), columns=['DF'])
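
To make the DF computation concrete, here is a minimal hand count on a toy corpus. The three sentences and the whitespace tokenizer are assumptions for illustration only; `CountVectorizer` applies its own token pattern and stop-word list, so counts on real text can differ.

```python
from collections import Counter

# Toy corpus (hypothetical, for illustration only)
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

# Simplified tokenizer: lowercase + whitespace split (an assumption;
# CountVectorizer uses its own regex and stop-word filtering).
tokenized = [doc.lower().split() for doc in docs]

# TF: raw count of each term inside each document
tf = [Counter(tokens) for tokens in tokenized]

# DF: number of documents that contain the term at least once
doc_freq = Counter()
for tokens in tokenized:
    doc_freq.update(set(tokens))

print(tf[0]["the"])     # occurrences of "the" in the first document
print(doc_freq["the"])  # number of documents containing "the"
print(doc_freq["cat"])  # number of documents containing "cat"
```

Note that TF counts repetitions inside one document, while DF counts each document at most once per term.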

In [6]:
# Compute TF-IDF
tfidf_transformer = TfidfTransformer()
tfidf_matrix = tfidf_transformer.fit_transform(tf_matrix)
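
As a sketch of what `TfidfTransformer` computes with its default settings (`smooth_idf=True`, `norm='l2'`), the following hand calculation uses the smoothed IDF `ln((1 + N) / (1 + df)) + 1` and then scales each document vector to unit L2 norm. The toy counts and the helper name `tfidf_row` are assumptions made for this illustration, not part of the exercise.

```python
import math

# Toy term-count matrix (hypothetical): rows = documents, columns = terms
terms = ["cat", "dog", "mat"]
counts = [
    [1, 0, 1],  # doc 0
    [1, 1, 0],  # doc 1
    [0, 2, 0],  # doc 2
]
n_docs = len(counts)

# DF: number of documents in which each term appears
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(terms))]

# Smoothed IDF, following the formula documented for smooth_idf=True
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in doc_freq]

def tfidf_row(row):
    """Weight raw counts by IDF, then scale the row to unit L2 norm."""
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm for v in weighted] if norm > 0 else weighted

tfidf = [tfidf_row(row) for row in counts]

for term, d, w in zip(terms, doc_freq, idf):
    print(f"{term}: df={d}, idf={w:.4f}")
print([round(v, 4) for v in tfidf[0]])
```

In document 0, "cat" and "mat" both occur once, but "mat" appears in fewer documents and therefore receives the larger TF-IDF weight.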

In [7]:
# Build DataFrames with the TF and TF-IDF values.
# The sparse constructor avoids allocating two dense 18846 x 134101 arrays.
feature_names = vectorizer.get_feature_names_out()
tf_df = pd.DataFrame.sparse.from_spmatrix(tf_matrix, columns=feature_names)
tfidf_df = pd.DataFrame.sparse.from_spmatrix(tfidf_matrix, columns=feature_names)

In [8]:
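print("TF matrix (term frequencies):")
display(tf_df.iloc[:10, :10])

print("TF-IDF matrix:")
display(tfidf_df.iloc[:10, :10])

print("Document frequency (DF):")
display(df.head())

TF matrix (term frequencies):

|   | 00 | 000 | 0000 | 00000 | 000000 | 00000000 | 0000000004 | 00000000b | 00000001 | 00000001b |
|---|----|-----|------|-------|--------|----------|------------|-----------|----------|-----------|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

TF-IDF matrix:

|   | 00 | 000 | 0000 | 00000 | 000000 | 00000000 | 0000000004 | 00000000b | 00000001 | 00000001b |
|---|----|-----|------|-------|--------|----------|------------|-----------|----------|-----------|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | 0.241059 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 8 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 9 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |

Document frequency (DF):

| Term | DF |
|------|----|
| 00 | 402 |
| 000 | 455 |
| 0000 | 10 |
| 00000 | 7 |
| 000000 | 1 |
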

## Part 2: Ranking documents with TF-IDF

### Activity

1. Given a query, build the query vector.
2. Compute the cosine similarity between the query and each document using the TF-IDF vectors.
3. Generate a ranking of the documents ordered by relevance.
4. Show the results in a table.

In [9]:
from sklearn.metrics.pairwise import cosine_similarity

# Define the query
consulta = "chicken"

# Vectorize the query with the vectorizer and transformer fitted above
consulta_tf = vectorizer.transform([consulta])
consulta_tfidf = tfidf_transformer.transform(consulta_tf)

In [14]:
# Compute the cosine similarity between the query and every document
similitudes = cosine_similarity(consulta_tfidf, tfidf_matrix)
# Print the similarities
print("Similarities with the query '{}':".format(consulta))
print(similitudes)

Similarities with the query 'chicken':
[[0. 0. 0. ... 0. 0. 0.]]


In [11]:
# Sort documents by similarity (highest first)
ranking = similitudes[0].argsort()[::-1]
ranking_scores = similitudes[0][ranking]
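
The similarity and ranking steps can be sketched by hand on a few small vectors. The vectors and the helper `cosine` below are hypothetical and are not taken from the corpus; they only illustrate how the score is formed and how sorting by descending score yields the ranking.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical TF-IDF-like vectors over a 3-term vocabulary
doc_vectors = [
    [0.0, 0.0, 1.0],  # doc 0: shares no terms with the query
    [0.6, 0.8, 0.0],  # doc 1: contains the query term among others
    [1.0, 0.0, 0.0],  # doc 2: contains only the query term
]
query = [1.0, 0.0, 0.0]

scores = [cosine(query, d) for d in doc_vectors]

# Sort document indices by descending score, similar to argsort()[::-1]
order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
for i in order:
    print(i, round(scores[i], 3))
```

A document that shares no vocabulary with the query scores exactly 0, which is why most entries of `similitudes` are zero for a single-word query.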

In [16]:
# Build a table with the results (top 5 documents only, for a compact view)
top_n = 5
resultados = pd.DataFrame({
    # Index of the document in the corpus
    'Document ID': ranking[:top_n],
    # Preview: first 50 characters of the document
    'Document': [newsgroupsdocs[i][:50].replace('\n', ' ') + "..." for i in ranking[:top_n]],
    # Cosine similarity to the query
    'Similarity score': ranking_scores[:top_n]
})

# Show the table
display(resultados)

|   | Document ID | Document | Similarity score |
|---|-------------|----------|------------------|
| 0 | 1315 | But remember that had God extinguished the bl... | 0.291393 |
| 1 | 13222 | You are right in supposing that the problem is... | 0.245888 |
| 2 | 15919 | Wetteland comes off the DL on April 23rd, and... | 0.226 |
| 3 | 9048 | I am 35 and am recovering from a case of Chick... | 0.196238 |
| 4 | 16902 | [stuff deleted...] As I recall, the auth... | 0.18825 |


In [22]:
# Print document 16902 (ranked fifth in the table above)
doc_id = 16902
print("Document {}: ".format(doc_id), newsgroupsdocs[doc_id])

Document 16902:  

[stuff deleted...]

  
As I recall, the author of the _original_ article that started the thread
claimed that he disliked the changing of the names for a variety of reasons. 
Roger, on one front you flamed him rather severely on the grounds that his
was a "jingoistic rant", but you also supported the name-changing on the
grounds that the current names are inappropriate because of the individuals
they represent. FWIW, I do not think the flaming was warranted, nor do I 
think you enhanced what credibility you have with it at all.  Just an 
observation...

However, that aside, the real question is whether you like the idea of
changing the names based on the reasons given for it (making it easier for
the 'casual fan'), or whether you like the idea of unique divisional names
based on individuals who do deserve the honour.  IMO, the latter is a nice
and unique touch that differs from other sports.  In addition, I do not
think that changing divisional names will have an effe

## Part 3: Ranking with BM25

### Activity

1. Implement a retrieval system using the BM25 model.
2. Use the same query as in the previous exercise.
3. Compute the BM25 score for each document and generate a ranking.
4. Manually compare it with the TF-IDF ranking.
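
For reference, the BM25 variant implemented in the cell below can be written as follows. The notation is chosen here for illustration: $f(q, D)$ is the raw count of term $q$ in document $D$, $|D|$ is the document length in tokens, $\mathrm{avgdl}$ is the average document length, $N$ is the number of documents, and $n(q)$ is the number of documents containing $q$. The $+1$ inside the logarithm keeps the IDF positive even for very frequent terms.

```latex
\mathrm{score}(D, Q) = \sum_{q \in Q} \mathrm{IDF}(q) \cdot
  \frac{f(q, D)\,(k_1 + 1)}{f(q, D) + k_1 \left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}

\mathrm{IDF}(q) = \ln\!\left(\frac{N - n(q) + 0.5}{n(q) + 0.5} + 1\right)
```

With $k_1 = 1.5$ and $b = 0.75$, the term-frequency factor saturates as $f(q, D)$ grows, and documents longer than average are penalized through the $|D| / \mathrm{avgdl}$ ratio.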

In [1]:
import numpy as np
import pandas as pd

# This cell reuses tf_matrix, consulta_tf, ranking, ranking_scores, top_n and
# newsgroupsdocs from Parts 0-2, so those cells must be run first.

# BM25 parameters
k1 = 1.5  # term frequency scaling parameter
b = 0.75  # document length normalization parameter

# Use the same vectorizer for consistency
doc_term_matrix = tf_matrix  # Reuse the existing term frequency matrix
doc_lengths = doc_term_matrix.sum(axis=1).A1  # Document lengths
avg_doc_length = np.mean(doc_lengths)  # Average document length
N = doc_term_matrix.shape[0]  # Number of documents

# Calculate IDF component for BM25
df_array = np.squeeze(np.asarray((doc_term_matrix > 0).sum(axis=0)))
idf_bm25 = np.log((N - df_array + 0.5) / (df_array + 0.5) + 1.0)

# Vectorize the query
query_vec = consulta_tf  # Reuse the existing query vector

# Calculate BM25 scores
bm25_scores = np.zeros(N)

# Convert sparse matrices to arrays for easier manipulation
query_array = query_vec.toarray()[0]
non_zero_indices = np.nonzero(query_array)[0]

for idx in non_zero_indices:
    # Get term frequencies for this term across all documents
    tf = doc_term_matrix[:, idx].toarray().flatten()
    
    # Calculate BM25 term relevance
    numerator = tf * (k1 + 1)
    denominator = tf + k1 * (1 - b + b * doc_lengths / avg_doc_length)
    term_relevance = numerator / denominator
    
    # Multiply by IDF and add to the scores
    bm25_scores += idf_bm25[idx] * term_relevance

# Rank documents by BM25 score
bm25_ranking = np.argsort(-bm25_scores)
bm25_top_scores = bm25_scores[bm25_ranking]

# Create table with BM25 results (top 5 documents)
bm25_resultados = pd.DataFrame({
    'Document ID': [i for i in bm25_ranking[:top_n]],
    'Document': [newsgroupsdocs[i][:50].replace('\n', ' ') + "..." for i in bm25_ranking[:top_n]],
    'BM25 score': bm25_top_scores[:top_n]
})

# Display BM25 results
print("BM25 ranking for the query '{}':".format(consulta))
display(bm25_resultados)

# Compare with TF-IDF ranking
print("\nRanking comparison:")
comparacion = pd.DataFrame({
    'TF-IDF Doc ID': [i for i in ranking[:top_n]],
    'TF-IDF Score': ranking_scores[:top_n],
    'BM25 Doc ID': [i for i in bm25_ranking[:top_n]],
    'BM25 Score': bm25_top_scores[:top_n]
})
display(comparacion)

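NameError: name 'tf_matrix' is not defined

Note: the cell above was executed as `In [1]`, that is, as the first cell of a fresh kernel, so `tf_matrix` and the other variables created in Parts 0-2 did not exist yet. Run the cells of Parts 0-2 first and then rerun this cell to obtain the BM25 ranking.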
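
One simple way to quantify how much the two rankings agree is the overlap of their top-k document IDs. The sketch below uses the TF-IDF top-5 IDs reported in Part 2 together with placeholder BM25 IDs, because the BM25 cell produced no output here; the helper `top_k_overlap` is hypothetical.

```python
def top_k_overlap(ranking_a, ranking_b, k):
    """Return the shared IDs and the Jaccard overlap of two top-k lists."""
    top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
    shared = top_a & top_b
    union = top_a | top_b
    jaccard = len(shared) / len(union) if union else 0.0
    return shared, jaccard

# TF-IDF top-5 document IDs reported in Part 2
tfidf_top = [1315, 13222, 15919, 9048, 16902]
# Placeholder BM25 IDs (NOT real results; replace with bm25_ranking[:5])
bm25_top = [9048, 1315, 42, 13222, 7]

shared, jaccard = top_k_overlap(tfidf_top, bm25_top, k=5)
print(sorted(shared), round(jaccard, 3))
```

A Jaccard value of 1.0 means both models return the same top-k set, while 0.0 means they share no documents; the order inside the top-k is ignored by this measure.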