**Exercise 4: Probabilistic Model**

**Lab objectives**

*   Understand the components of the vector space model through manual calculations and direct observation.
*   Apply the vector space model with TF-IDF to retrieve relevant documents.
*   Compare retrieval with BM25 against retrieval with TF-IDF.
*   Visually analyze the differences between the models.
*   Evaluate whether the generated rankings are consistent with the documents you would consider relevant.

**Part 0: Loading the Corpus**

In [1]:
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

In [2]:
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
newsgroupsdocs = newsgroups.data

**Part 1: Computing TF, DF, IDF, and TF-IDF**

**Activity**

1.   Use the loaded corpus.
2.   Build the term matrix (TF) and compute the document frequency (DF).
3.  Compute TF-IDF using sklearn.
4.  Display the values in a DataFrame to analyze the differences between terms.

In [3]:
# TfidfVectorizer settings: apply IDF weighting and lowercase all text
tfidf_vectorizer = TfidfVectorizer(use_idf=True, lowercase=True)

# Learn the vocabulary and IDF weights, then transform every document
tfidf_vectorizer_vectors = tfidf_vectorizer.fit_transform(newsgroupsdocs)

In [4]:
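# Rows are terms and columns are documents; rows stay in vocabulary order (no sorting is applied).
# Note: todense() materializes the full matrix, which needs a lot of memory at this corpus size.
df = pd.DataFrame(tfidf_vectorizer_vectors.T.todense(), index=tfidf_vectorizer.get_feature_names_out())
df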

term,0,1,2,3,4,5,6,7,8,9,...,18836,18837,18838,18839,18840,18841,18842,18843,18844,18845
00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.217494,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
zzzzzz,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
zzzzzzt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
³ation,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ýé,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [5]:
print(tfidf_vectorizer_vectors.shape)


(18846, 134410)


**Part 2: Ranking Documents with TF-IDF**

**Activity**

1.  Given a query, build the query vector.
2.  Compute the cosine similarity between the query and each document using the TF-IDF vectors.
3.  Generate a ranking of the documents ordered by relevance.
4.  Display the results in a table.
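
The four steps above can be sketched end to end on a small corpus. This is an illustrative example, not the notebook's own solution: `toy_docs`, `toy_query`, and the column names are hypothetical, and `cosine_similarity` from `sklearn.metrics.pairwise` is one of several ways to compute the scores.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus and query.
toy_docs = [
    "space shuttle launch delayed",
    "hockey playoffs start tonight",
    "nasa plans a new shuttle launch",
]
toy_query = "shuttle launch"

# Fit on the documents, then project the query into the same vector space.
vectorizer = TfidfVectorizer(use_idf=True, lowercase=True)
doc_vectors = vectorizer.fit_transform(toy_docs)
query_vector = vectorizer.transform([toy_query])

# One cosine score per document; both inputs can stay sparse.
scores = cosine_similarity(query_vector, doc_vectors).ravel()

# Ranking table, highest score first.
ranking = (
    pd.DataFrame({"doc_id": range(len(toy_docs)), "score": scores, "text": toy_docs})
    .sort_values("score", ascending=False)
    .reset_index(drop=True)
)
print(ranking)
```

Documents that share no term with the query receive a score of 0, and shorter documents that contain all query terms tend to rank above longer ones because of the L2 normalization.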

In [33]:
query = "ÿhooked zzzzzz ýé ýé"

In [34]:
vectorized_query = tfidf_vectorizer.transform([query])

In [None]:
df = pd.DataFrame(vectorized_query.T.todense(), index=tfidf_vectorizer.get_feature_names_out())
df

term,0
00,0.000000
000,0.000000
0000,0.000000
00000,0.000000
000000,0.000000
...,...
zzzzzz,0.408248
zzzzzzt,0.000000
³ation,0.000000
ýé,0.816497
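
The two visible weights can be reproduced by hand, which helps confirm how the query vector is built. The check below rests on two assumptions that the output above does not state directly: all three distinct query tokens are in the vocabulary, and they share the same IDF, so the IDF cancels out under L2 normalization.

```python
import numpy as np

# Raw query term counts, assuming all three tokens are in the vocabulary:
# "ÿhooked" -> 1, "zzzzzz" -> 1, "ýé" -> 2 (it appears twice in the query).
counts = np.array([1.0, 1.0, 2.0])

# Assumption: equal IDF for the three terms, so only the counts matter
# once the vector is scaled to unit L2 length.
weights = counts / np.linalg.norm(counts)
print(np.round(weights, 6))
```

Under these assumptions the result is 1/√6 ≈ 0.408248 for a term that occurs once and 2/√6 ≈ 0.816497 for the term that occurs twice, matching the values shown for `zzzzzz` and `ýé`.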
