# Proyecto Sistemas Recomendadores

Este colab toma inspiraciones de:

-  [Kaggle [1]](https://www.kaggle.com/code/minanabil11111212/news-recommendations-system)
- [Kaggle [2]](https://www.kaggle.com/code/mendhiri/mc855d5y1169-news-recommendation-system#CATATAN-PENTING:-JALANKAN-NOTEBOOK-INI-DI-KAGGLE)
- [Práctico LightFM](https://github.com/PUC-RecSys-Class/RecSysPUC-2025-2/blob/master/practicos/06_lightfm.ipynb)

## SetUp del Proyecto

### Importación de Librerías

In [1]:
from IPython.display import clear_output

In [6]:
# Toma aprox. 30 sec.
!pip install git+https://github.com/daviddavo/lightfm
!pip install sentence-transformers
!pip install -q deepctr-torch
clear_output(wait=True)
print("Librería instalada.")

Librería instalada.


In [82]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import kagglehub
import random
import torch
import json
import os
import re

from scipy import sparse
from scipy.sparse import csr_matrix
from tqdm.notebook import tqdm
from collections import defaultdict, Counter
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import NearestNeighbors
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity
from deepctr_torch.inputs import SparseFeat, get_feature_names, DenseFeat
from deepctr_torch.models import DeepFM

import lightfm
from lightfm import LightFM
from lightfm.data import Dataset
from lightfm import cross_validation
from lightfm.evaluation import precision_at_k as lightfm_prec_at_k
from lightfm.evaluation import recall_at_k as lightfm_recall_at_k

clear_output(wait=True)
print("Librerías importadas.")

Librerías importadas.


### Definición de Random Seed

In [8]:
def set_seed(seed=77):
    random.seed(seed)
    np.random.seed(seed)

set_seed(77)

### Importación de Dataset - MIND small

In [9]:
path = kagglehub.dataset_download("arashnic/mind-news-dataset")

behaviors_path = os.path.join(path, "MINDsmall_train", "behaviors.tsv")
entity_embedding_path = os.path.join(path, "MINDsmall_train", "entity_embedding.vec")
news_path = os.path.join(path, "MINDsmall_train", "news.tsv")
relation_embedding_path = os.path.join(path, "MINDsmall_train", "relation_embedding.vec")

clear_output(wait=True)
print("Datos importados correctamente.")

Datos importados correctamente.


### Carga de Datos

In [10]:
# Útil para CBF
entity_embeddings = {}
with open(entity_embedding_path, 'r') as f:
    for line in f:
        parts = line.strip().split('\t')
        entity_embeddings[parts[0]] = np.array(list(map(float, parts[1:])))

# No usable por ahora, requiere enfoque híbrido y más complejo
relation_embeddings = {}
with open(relation_embedding_path, 'r') as f:
    for line in f:
        parts = line.strip().split('\t')
        relation_embeddings[parts[0]] = np.array(list(map(float, parts[1:])))

In [11]:
news_raw = pd.read_csv(news_path, sep="\t", header=None, names=[
    'NewsID', 'Category', 'SubCategory', 'Title',
    'Abstract', 'URL', 'TitleEntities', 'AbstractEntities'])

behaviors_raw = pd.read_csv(behaviors_path, sep="\t", header=None, names=[
    'ImpressionID', 'UserID', 'Time', 'History', 'Impressions'])

## Manejo de Datos

### Limpieza de Datos

In [12]:
news = news_raw.drop_duplicates(subset=['NewsID'])
news = news.drop(columns=['URL'])
news = news.dropna(subset=['Abstract', 'TitleEntities', 'AbstractEntities']).copy()

news['TitleEntities'] = news['TitleEntities'].fillna('[]')
news['AbstractEntities'] = news['AbstractEntities'].fillna('[]')

In [13]:
behaviors = behaviors_raw.drop_duplicates(subset=['ImpressionID'])
behaviors['History'] = behaviors['History'].fillna('')
behaviors['Time'] = pd.to_datetime(behaviors['Time'])

### Interacciones Explícitas

In [14]:
interaction_flat = []

for _, row in behaviors.iterrows():
    user = row.UserID
    impressions = row.Impressions.split() # 'N12345-1 N67890-0...'

    for item in impressions:
        if '-' in item:
            news_id, label = item.rsplit('-', 1)
            interaction_flat.append((user, news_id, int(label)))

interactions = pd.DataFrame(interaction_flat, columns=['UserID', 'NewsID', 'Label'])

# Nos quedamos con las interacciones que tienen IDs de News existentes
valid_news_ids = set(news['NewsID'])
interactions = interactions[interactions['NewsID'].isin(valid_news_ids)].copy()

# Añadimos la columna categoría para Precision@K
news_category = news[['NewsID', 'Category']]
interactions = interactions.merge(news_category, on='NewsID', how='left')

### Datasets

In [15]:
news.head()

Unnamed: 0,NewsID,Category,SubCategory,Title,Abstract,TitleEntities,AbstractEntities
0,N55528,lifestyle,lifestyleroyals,"The Brands Queen Elizabeth, Prince Charles, an...","Shop the notebooks, jackets, and more that the...","[{""Label"": ""Prince Philip, Duke of Edinburgh"",...",[]
1,N19639,health,weightloss,50 Worst Habits For Belly Fat,These seemingly harmless habits are holding yo...,"[{""Label"": ""Adipose tissue"", ""Type"": ""C"", ""Wik...","[{""Label"": ""Adipose tissue"", ""Type"": ""C"", ""Wik..."
2,N61837,news,newsworld,The Cost of Trump's Aid Freeze in the Trenches...,Lt. Ivan Molchanets peeked over a parapet of s...,[],"[{""Label"": ""Ukraine"", ""Type"": ""G"", ""WikidataId..."
3,N53526,health,voices,I Was An NBA Wife. Here's How It Affected My M...,"I felt like I was a fraud, and being an NBA wi...",[],"[{""Label"": ""National Basketball Association"", ..."
4,N38324,health,medical,"How to Get Rid of Skin Tags, According to a De...","They seem harmless, but there's a very good re...","[{""Label"": ""Skin tag"", ""Type"": ""C"", ""WikidataI...","[{""Label"": ""Skin tag"", ""Type"": ""C"", ""WikidataI..."


In [16]:
behaviors.head()

Unnamed: 0,ImpressionID,UserID,Time,History,Impressions
0,1,U13740,2019-11-11 09:05:58,N55189 N42782 N34694 N45794 N18445 N63302 N104...,N55689-1 N35729-0
1,2,U91836,2019-11-12 18:11:30,N31739 N6072 N63045 N23979 N35656 N43353 N8129...,N20678-0 N39317-0 N58114-0 N20495-0 N42977-0 N...
2,3,U73700,2019-11-14 07:01:48,N10732 N25792 N7563 N21087 N41087 N5445 N60384...,N50014-0 N23877-0 N35389-0 N49712-0 N16844-0 N...
3,4,U34670,2019-11-11 05:28:05,N45729 N2203 N871 N53880 N41375 N43142 N33013 ...,N35729-0 N33632-0 N49685-1 N27581-0
4,5,U8125,2019-11-12 16:11:21,N10078 N56514 N14904 N33740,N39985-0 N36050-0 N16096-0 N8400-1 N22407-0 N6...


In [17]:
interactions.head()

Unnamed: 0,UserID,NewsID,Label,Category
0,U13740,N55689,1,sports
1,U13740,N35729,0,news
2,U91836,N20678,0,sports
3,U91836,N39317,0,news
4,U91836,N20495,0,news


### Separación de conjuntos

In [18]:
behaviors = behaviors.sort_values(by='Time')

# Ocupamos una división temporal, no aleatoria | 80%/10%/10%
total_rows = len(behaviors)
train_end  = int(total_rows * 0.8)
val_end    = int(total_rows * 0.9)

train_behaviors = behaviors.iloc[:train_end]
val_behaviors   = behaviors.iloc[train_end:val_end]
test_behaviors  = behaviors.iloc[val_end:]

print(f"Train | {len(train_behaviors)}")
print(f"Val   | {len(val_behaviors)}")
print(f"Test  | {len(test_behaviors)}")

Train | 125572
Val   | 15696
Test  | 15697


### CF - Ajustes para Filtrado Colaborativo

In [19]:
train_interactions = interactions[interactions['UserID'].isin(train_behaviors['UserID'])].copy()

In [20]:
user_enc = LabelEncoder()
news_enc = LabelEncoder()

train_interactions['UserID_enc'] = user_enc.fit_transform(train_interactions['UserID'])
train_interactions['NewsID_enc'] = news_enc.fit_transform(train_interactions['NewsID'])

In [21]:
n_users_train = train_interactions['UserID_enc'].nunique()
n_items_train = train_interactions['NewsID_enc'].nunique()

positive_interactions_train = train_interactions[train_interactions['Label'] == 1].copy()

ratings_matrix = csr_matrix((
    positive_interactions_train['Label'],
    (positive_interactions_train['UserID_enc'], positive_interactions_train['NewsID_enc'])
), shape=(n_users_train, n_items_train)) # Usamos los N° de train

ratings_matrix.shape

(45848, 19037)

### CBF - Ajustes para Filtrado Basado en Contenido

In [22]:
# Mapeos
news2id_cbf = {nid: i for i, nid in enumerate(news['NewsID'])}  # NewsID -> CBF Index
id2news_cbf = {i: nid for nid, i in news2id_cbf.items()}        # CBF Index -> NewsID

#### CBF TF-IDF

In [23]:
# Dejamos todo el texto en una sola columna
news['full_text'] = (news['Title'] + ' ' + news['Abstract']).str.lower()

In [24]:
# TF-IDF
vectorizer = TfidfVectorizer(stop_words='english', max_features=7777)
tfidf_matrix = vectorizer.fit_transform(news['full_text'])
tfidf_matrix.shape

(48612, 7777)

#### CBF BERT

In [25]:
bert_model = SentenceTransformer('all-MiniLM-L6-v2')
bert_matrix = bert_model.encode(news['full_text'].values, show_progress_bar=True)
bert_matrix.shape

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Batches:   0%|          | 0/1520 [00:00<?, ?it/s]

(48612, 384)

#### CBF Entity Embeddings

In [26]:
# Dejamos todas las entities en una sola columna
news['all_entities_str'] = news['TitleEntities'].astype(str) + news['AbstractEntities'].astype(str)

In [27]:
def get_entity_vector(entity_str, entity_embeddings):
    vectors = []

    if not entity_embeddings:
        return np.array([])

    embedding_dim = list(entity_embeddings.values())[0].shape[0]
    wikidata_ids = re.findall(r'"WikidataId": "(\w+)"', entity_str)

    for wikidata_id in wikidata_ids:
        if wikidata_id in entity_embeddings:
            vectors.append(entity_embeddings[wikidata_id])

    if vectors:
        return np.mean(vectors, axis=0)
    else:
        return np.zeros(embedding_dim)

In [28]:
entity_vectors = news['all_entities_str'].apply(
    lambda x: get_entity_vector(x, entity_embeddings)
)

entity_matrix = np.stack(entity_vectors.values)
entity_matrix.shape

(48612, 100)

### LightFM - Ajustes para Modelo Híbrido

In [29]:
all_categories = news['Category'].unique().tolist()
all_subcategories = news['SubCategory'].unique().tolist()

#### LigthFM TF-IDF

In [30]:
item_feature_names_tfidf = vectorizer.get_feature_names_out().tolist()

dataset_tfidf = Dataset()
dataset_tfidf.fit(
    users=interactions['UserID'].unique(),
    items=news['NewsID'].unique(),
    item_features=item_feature_names_tfidf + all_categories + all_subcategories
)

In [31]:
(interactions_train_lfm, weights_train_lfm) = dataset_tfidf.build_interactions(
    (row.UserID, row.NewsID, row.Label)
    for _, row in interactions[interactions['UserID'].isin(train_behaviors['UserID'])].iterrows()
    if row.Label == 1
)

def create_enriched_feature_generator(news_df, vectorizer_model):
    # Pre-calculamos la matriz TF-IDF para no hacerlo fila por fila (lento)
    news_features = vectorizer_model.transform(news_df['full_text'])
    feature_names = vectorizer_model.get_feature_names_out()

    for i in range(news_df.shape[0]):
        # Datos de la fila actual
        row = news_df.iloc[i]
        news_id = row['NewsID']
        category = row['Category']
        subcategory = row['SubCategory']

        # 1. Extraemos las palabras clave (TF-IDF)
        feature_indices = news_features[i].nonzero()[1]
        words = [feature_names[idx] for idx in feature_indices]

        # 2. Combinamos: Palabras + Categoria + Subcategoria
        # LightFM aplanará esto internamente, asociando todo a este ID de noticia
        final_features = words + [category, subcategory]

        yield (news_id, final_features)

item_features_generator_tfidf = create_enriched_feature_generator(news, vectorizer)
item_features_lfm_tfidf = dataset_tfidf.build_item_features(
    item_features_generator_tfidf,
    normalize=False # El TF-IDF ya está normalizado
)

print(f"Matriz de Interacciones LFM: {interactions_train_lfm.shape}")
print(f"Matriz de Features LFM:      {item_features_lfm_tfidf.shape}")

Matriz de Interacciones LFM: (50000, 48612)
Matriz de Features LFM:      (48612, 56607)


In [32]:
model_lfm_tfidf = LightFM(loss='warp',
                          random_state=77,
                          no_components=30)        # Factores latentes. Opciones: (30, 50, 100)

model_lfm_tfidf.fit(interactions_train_lfm,
                    item_features=item_features_lfm_tfidf,
                    epochs=10,                     # 10 > 20 > 30 en resultados
                    num_threads=os.cpu_count(),    # Usar núcelos disponibles
                    verbose=True)

Epoch: 100%|██████████| 10/10 [00:50<00:00,  5.06s/it]


<lightfm.lightfm.LightFM at 0x7f420ca6c440>

#### LightFM BERT

In [33]:
bert_dim_names = [f"bert_dim_{i}" for i in range(bert_matrix.shape[1])]

dataset_bert = Dataset()
dataset_bert.fit(
    users=interactions['UserID'].unique(),
    items=news['NewsID'].unique(),
    # Las features son: Dimensiones BERT + Categorías + Subcategorías
    item_features=bert_dim_names + all_categories + all_subcategories
)

(interactions_train_bert, weights_train_bert) = dataset_bert.build_interactions(
    (row.UserID, row.NewsID, row.Label)
    for _, row in interactions[interactions['UserID'].isin(train_behaviors['UserID'])].iterrows()
    if row.Label == 1
)

def create_bert_feature_generator(news_df, matrix_embeddings):
    for i in range(news_df.shape[0]):
        row = news_df.iloc[i]
        news_id = row['NewsID']

        # Obtenemos el vector denso de BERT para esta noticia
        embedding_vector = matrix_embeddings[i]

        # Creamos un diccionario: {nombre_feature: peso}
        # Las dimensiones de BERT tienen pesos continuos (floats)
        bert_features = {f"bert_dim_{j}": embedding_vector[j] for j in range(len(embedding_vector))}

        # Agregamos Categoría y Subcategoría (peso 1.0 por defecto)
        meta_features = {row['Category']: 1.0, row['SubCategory']: 1.0}

        # Unimos ambos diccionarios
        final_features = {**bert_features, **meta_features}

        yield (news_id, final_features)

item_features_generator_bert = create_bert_feature_generator(news, bert_matrix)
item_features_lfm_bert = dataset_bert.build_item_features(
    item_features_generator_bert,
    normalize=False
)

print(f"Matriz de Interacciones LFM: {interactions_train_lfm.shape}")
print(f"Matriz de Features LFM:      {item_features_lfm_tfidf.shape}")

Matriz de Interacciones LFM: (50000, 48612)
Matriz de Features LFM:      (48612, 56607)


In [34]:
model_lfm_bert = LightFM(loss='warp',
                         random_state=77,
                         no_components=30)

model_lfm_bert.fit(interactions_train_bert,
                   item_features=item_features_lfm_bert,
                   epochs=10,
                   num_threads=os.cpu_count(),
                   verbose=True)

Epoch: 100%|██████████| 10/10 [04:57<00:00, 29.71s/it]


<lightfm.lightfm.LightFM at 0x7f420c4292b0>

### DeepFM - Ajustes para Modelo Híbrido

#### DeepFM BERT

In [None]:
# https://github.com/PUC-RecSys-Class/RecSysPUC-2025-2/blob/master/practicos/07_deepfm.ipynb

In [36]:
# Usamos PCA para comprimir las 384 dimensiones de BERT a 32.
# Esto mantiene la información semántica pero acelera el entrenamiento x10.
pca = PCA(n_components=32, random_state=77)
bert_pca = pca.fit_transform(bert_matrix)

# Nombres para las columnas de BERT
bert_feat_names = [f'bert_{i}' for i in range(bert_pca.shape[1])]

# Agregamos los vectores de BERT al dataframe de noticias
df_bert_pca = pd.DataFrame(bert_pca, columns=bert_feat_names)
df_bert_pca['NewsID'] = news['NewsID'].values # Usamos IDs originales para el join

In [41]:
# Unimos interacciones + Metadatos (Cat/Subcat) + BERT PCA
df_deepfm_train = train_interactions[['UserID', 'NewsID', 'Label']].merge(news, on='NewsID', how='left')
df_deepfm_train = df_deepfm_train.merge(df_bert_pca, on='NewsID', how='left')

# Encoders separados para DeepFM
# => Requiere enteros de 0 a N consecutivos
lbe_user = LabelEncoder()
lbe_news = LabelEncoder()
lbe_cat = LabelEncoder()
lbe_subcat = LabelEncoder()

# Ajustamos los encoders con TODOS los datos disponibles (news) para
# no tener errores con categorías que aparezcan en test pero no en train
lbe_user.fit(behaviors['UserID'])
lbe_news.fit(news['NewsID'])
lbe_cat.fit(news['Category'])
lbe_subcat.fit(news['SubCategory'])

# Transformamos el set de entrenamiento
df_deepfm_train['user_id_code'] = lbe_user.transform(df_deepfm_train['UserID'])
df_deepfm_train['news_id_code'] = lbe_news.transform(df_deepfm_train['NewsID'])
df_deepfm_train['cat_code'] = lbe_cat.transform(df_deepfm_train['Category'])
df_deepfm_train['subcat_code'] = lbe_subcat.transform(df_deepfm_train['SubCategory'])

In [42]:

# Definición de Features (SparseFeat)


# Definimos las columnas que entrarán al modelo.
# embedding_dim=16 es un estándar bueno para empezar

embedding_dim = 16

feature_columns = [
    SparseFeat('user_id_code', vocabulary_size=len(lbe_user.classes_), embedding_dim=embedding_dim),
    SparseFeat('news_id_code', vocabulary_size=len(lbe_news.classes_), embedding_dim=embedding_dim),
    SparseFeat('cat_code', vocabulary_size=len(lbe_cat.classes_), embedding_dim=embedding_dim),
    SparseFeat('subcat_code', vocabulary_size=len(lbe_subcat.classes_), embedding_dim=embedding_dim)
]

# B. Features Numéricas (Dense) - BERT
dense_features = [DenseFeat(name, dimension=1) for name in bert_feat_names]

# DeepFM usa las mismas features para la parte lineal (FM) y la profunda (DNN)
linear_feature_columns = feature_columns + dense_features
dnn_feature_columns = feature_columns + dense_features

feature_names = get_feature_names(linear_feature_columns + dnn_feature_columns)

# Generamos el diccionario de entrada para el modelo
train_model_input = {name: df_deepfm_train[name].values for name in feature_names}

In [43]:
# Entrenamiento

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# dnn_hidden_units=(128, 64): Arquitectura clásica, suficiente para MIND-Small
# dnn_dropout=0.2: Ayuda a evitar el overfitting
model_deepfm = DeepFM(linear_feature_columns, dnn_feature_columns,
                      task='binary',
                      dnn_hidden_units=(128, 64),
                      dnn_dropout=0.2,
                      l2_reg_embedding=1e-5,
                      device=device)

# Usamos binary_crossentropy porque predecimos probabilidad de click (0 o 1)
model_deepfm.compile("adam", "binary_crossentropy", metrics=["auc"])

history = model_deepfm.fit(train_model_input,
                           df_deepfm_train['Label'].values,
                           batch_size=2048, # Batch grande para velocidad
                           epochs=5,        # 5 épocas, mejor punto antes de sobreajustar
                           verbose=2,
                           validation_split=0.1)

cpu
Train on 4860935 samples, validate on 540104 samples, 2374 steps per epoch
Epoch 1/5
228s - loss:  0.1666 - auc:  0.6900 - val_auc:  0.7269
Epoch 2/5
231s - loss:  0.1520 - auc:  0.7499 - val_auc:  0.7259
Epoch 3/5
233s - loss:  0.1485 - auc:  0.7692 - val_auc:  0.7238
Epoch 4/5
231s - loss:  0.1440 - auc:  0.7931 - val_auc:  0.7162
Epoch 5/5
230s - loss:  0.1381 - auc:  0.8197 - val_auc:  0.7154


## Funciones de Utilidad

### Helpers

In [44]:
def fit_knn(matrix):
    knn_model = NearestNeighbors(metric='cosine', algorithm='brute')
    knn_model.fit(matrix)
    return knn_model

In [45]:
def get_newID(id):
    return news['NewsID'].iloc[id]

In [46]:
test_users_list = test_behaviors['UserID'].unique()

def get_userID(id):
    if id < len(test_users_list):
        return test_users_list[id]
    else:
        print(f"ID {id} fuera de rango. Devolviendo el primer usuario de prueba.")
        return test_users_list[0]

### Filtrado Colaborativo - CF

In [47]:
knn_model_cf = fit_knn(ratings_matrix)

In [48]:
def get_cf_recommendations(user_id, top_n=10):

    # 1. COLD START
    if user_id not in user_enc.classes_:
        popular_news_train_ids = positive_interactions_train['NewsID'].value_counts().head(top_n).index
        recs_df = news[news['NewsID'].isin(popular_news_train_ids)]
        recs_df = recs_df.set_index('NewsID').loc[popular_news_train_ids].reset_index()
        return recs_df[['NewsID', 'Title', 'Category', 'SubCategory']]

    # 2. OBTENER VECTORES
    user_idx = user_enc.transform([user_id])[0]
    user_vector = ratings_matrix[user_idx]

    # 3. KNN
    distances, indices = knn_model_cf.kneighbors(user_vector, n_neighbors=min(20, n_users_train))

    # 4. OBTENER SCORES
    neighbor_indices = indices.flatten()[1:]
    news_scores = defaultdict(float)
    for neighbor_idx in neighbor_indices:
        neighbor_news = ratings_matrix[neighbor_idx].nonzero()[1]
        for news_idx in neighbor_news:
            news_scores[news_idx] += 1.0

    # 5. FILTRAR VISTAS
    user_seen_news_train = set(train_interactions[
        (train_interactions['UserID'] == user_id) &
        (train_interactions['Label'] == 1)
    ]['NewsID'])

    # 6. ORDENAR Y FILTRAR
    recommendations_ids = []
    for news_idx, score in sorted(news_scores.items(), key=lambda x: x[1], reverse=True):

        # news_enc está fiteado solo en train, por lo que news_idx (de train) debería ser correcto
        news_id = news_enc.inverse_transform([news_idx])[0]

        if news_id not in user_seen_news_train: # <-- Comparamos vs train
            recommendations_ids.append(news_id)
        if len(recommendations_ids) >= top_n:
            break

    # 7. RELLENAR
    if len(recommendations_ids) < top_n:
        # Usamos las populares del set de TRAIN
        popular_train = positive_interactions_train['NewsID'].value_counts().index
        for news_id in popular_train:
            if news_id not in user_seen_news_train and news_id not in recommendations_ids:
                recommendations_ids.append(news_id)
            if len(recommendations_ids) >= top_n:
                break

    recommendations_df = news[news['NewsID'].isin(recommendations_ids)]
    recommendations_df = recommendations_df.set_index('NewsID').loc[recommendations_ids].reset_index()

    return recommendations_df[['NewsID', 'Title', 'Category', 'SubCategory']]

### Content Based - CBF

In [49]:
knn_model_tfidf = fit_knn(tfidf_matrix)
knn_model_bert = fit_knn(bert_matrix)
knn_model_entity = fit_knn(entity_matrix)

In [50]:
def get_cbf_recommendations(matrix, news_id, top_n=10):

    if news_id not in news2id_cbf:
        print("newsID inválido.")
        return pd.DataFrame(columns=['NewsID', 'Title', 'Category', 'SubCategory'])

    idx = news2id_cbf[news_id]

    # --- Selección de Matriz ---
    if matrix == "tf-idf":
        target_vector = tfidf_matrix[idx]
        model = knn_model_tfidf

    elif matrix == "entity":
        target_vector = entity_matrix[idx].reshape(1, -1)
        model = knn_model_entity

    elif matrix == "bert":
        target_vector = bert_matrix[idx].reshape(1, -1)
        model = knn_model_bert

    else:
        print("Matriz inválida.")
        return pd.DataFrame(columns=['NewsID', 'Title', 'Category', 'SubCategory'])

    distances, indices_knn = model.kneighbors(
        target_vector, n_neighbors=top_n + 1
    )

    recommended_indices = indices_knn.flatten()[1:]
    id2news_map = {i: nid for nid, i in news2id_cbf.items()}
    recommended_news_ids = [id2news_map[i] for i in recommended_indices]

    recommendations_df = news[news['NewsID'].isin(recommended_news_ids)]
    recommendations_df = recommendations_df.set_index('NewsID').loc[recommended_news_ids].reset_index()

    return recommendations_df[['NewsID', 'Title', 'Category', 'SubCategory']]

### LightFM

In [51]:
user_id_map_lfm = dataset_tfidf.mapping()[0]
item_id_map_lfm = dataset_tfidf.mapping()[2]

# Creamos un 'inverso' del mapa de ítems (int -> str)
item_id_map_inv_lfm = {v: k for k, v in item_id_map_lfm.items()}

# Usamos 'positive_interactions_train' para fallback
popular_news_ids_fallback = positive_interactions_train['NewsID'].value_counts().head(10).index

def get_lightfm_tfidf_recommendations(user_id_str, top_n=10):

    # Manejo de Cold Start (Usuario nuevo)
    if user_id_str not in user_id_map_lfm:
        recs = news[news['NewsID'].isin(popular_news_ids_fallback)]
        return recs[['NewsID', 'Title', 'Category', 'SubCategory']].head(top_n)

    # Mapeamos UserID (str) a UserID (int de lightfm)
    user_id_lfm = user_id_map_lfm[user_id_str]

    # IDs de todos los ítems
    n_items_lfm = item_features_lfm_tfidf.shape[0]
    item_ids_lfm = np.arange(n_items_lfm)

    # Predecimos scores
    scores = model_lfm_tfidf.predict(user_id_lfm,
                                     item_ids_lfm,
                                     item_features=item_features_lfm_tfidf,
                                     num_threads=4)

    # Ordenamos y obtenemos Top N
    top_items_lfm = np.argsort(-scores)[:top_n]
    recommended_news_ids = [item_id_map_inv_lfm[lfm_id] for lfm_id in top_items_lfm]
    recommendations = news[news['NewsID'].isin(recommended_news_ids)]
    recommendations = recommendations.set_index('NewsID').loc[recommended_news_ids].reset_index()

    return recommendations[['NewsID', 'Title', 'Category', 'SubCategory']]

### DeepFM

In [73]:
# 1. Pre-cálculo de matriz de candidatos (incluye BERT)
candidate_news = news[['NewsID', 'Category', 'SubCategory']].drop_duplicates('NewsID').copy()
candidate_news = candidate_news.merge(df_bert_pca, on='NewsID', how='left') # Join con BERT PCA

# Codificación una sola vez
candidate_news['news_id_code'] = lbe_news.transform(candidate_news['NewsID'])
candidate_news['cat_code'] = lbe_cat.transform(candidate_news['Category'])
candidate_news['subcat_code'] = lbe_subcat.transform(candidate_news['SubCategory'])

# Arrays estáticos para velocidad
candidate_codes_news = candidate_news['news_id_code'].values
candidate_codes_cat = candidate_news['cat_code'].values
candidate_codes_subcat = candidate_news['subcat_code'].values
# Matriz de features densas (BERT) para candidatos: Forma (N_noticias, 32)
candidate_bert_values = candidate_news[bert_feat_names].values


def get_deepfm_recommendations(user_id_str, top_n=10):
    # Cold Start
    if user_id_str not in lbe_user.classes_:
        return news.head(top_n)[['NewsID', 'Title', 'Category', 'SubCategory']]

    # Código Usuario
    user_code = lbe_user.transform([user_id_str])[0]
    n_items = len(candidate_news)

    # Input Dictionary
    model_input = {
        'user_id_code': np.full(n_items, user_code),
        'news_id_code': candidate_codes_news,
        'cat_code': candidate_codes_cat,
        'subcat_code': candidate_codes_subcat
    }

    # Agregamos las columnas de BERT al input
    # Cada columna bert_0, bert_1... es un array de largo n_items
    for i, col_name in enumerate(bert_feat_names):
        model_input[col_name] = candidate_bert_values[:, i]

    # Predicción
    pred_scores = model_deepfm.predict(model_input, batch_size=4096)

    # Ranking
    top_indices = np.argsort(-pred_scores.flatten())[:top_n]
    recommended_news_ids = candidate_news.iloc[top_indices]['NewsID'].values

    # Formato
    recs = news[news['NewsID'].isin(recommended_news_ids)]
    recs = recs.set_index('NewsID').loc[recommended_news_ids].reset_index()

    return recs[['NewsID', 'Title', 'Category', 'SubCategory']]

### Novedad y Diversdidad

In [53]:
# Calculamos la popularidad p(i) en el set de entrenamiento
# Usamos un mapa para la probabilidad p(i)
total_interactions = len(train_interactions)
item_counts = train_interactions['NewsID'].value_counts()
item_popularity = item_counts / total_interactions

# Pre-cálculo para Novelty: -log2(p(i))
novelty_map = {}
for news_id, p in item_popularity.items():
    if p > 0:
        novelty_map[news_id] = -np.log2(p)
    else:
        novelty_map[news_id] = 0

# Creamos un mapa de NewsID a su índice en la matriz BERT (para Diversity)
news_to_bert_index = {nid: i for i, nid in enumerate(news['NewsID'])}

In [54]:
def calculate_novelty(recommended_news_ids, novelty_scores_map):
    scores = [novelty_scores_map.get(nid, 0) for nid in recommended_news_ids]
    return np.mean(scores) if scores else 0

def calculate_diversity(recommended_news_ids, bert_matrix, news_to_bert_index):
    if len(recommended_news_ids) < 2:
        return np.nan # No se puede calcular la diversidad con menos de 2 ítems

    # Obtenemos los vectores BERT para los ítems recomendados
    bert_indices = [news_to_bert_index.get(nid) for nid in recommended_news_ids if nid in news_to_bert_index]

    # Aseguramos que todos los índices sean válidos
    if len(bert_indices) < 2:
        return np.nan

    vectors = bert_matrix[bert_indices]

    # Calculamos matriz de similitud de coseno
    # Usamos 1.0 - CS para obtener la matriz de distancias
    distance_matrix = 1.0 - cosine_similarity(vectors)

    # Sumamos distancias (quitando la diagonal: distancia i a i = 0)
    # Suma la matriz superior triangular (sin repetir pares)
    num_items = len(vectors)
    upper_triangle_sum = np.sum(np.triu(distance_matrix, k=1))

    # Normalizamos por el número total de pares únicos (n*(n-1)/2)
    num_pairs = num_items * (num_items - 1) / 2

    return upper_triangle_sum / num_pairs if num_pairs > 0 else np.nan


In [55]:

def evaluate_exploratory_metrics(model_functions_dict, test_behaviors, k=10, sample_size=100):
    """Orquesta el cálculo de Novedad y Diversidad para todos los modelos."""

    print("\n" + "=" * 70)
    print(f"✨ Métrica: Exploración (Novedad y Diversidad) (k={k})")
    print("=" * 70)

    test_users = test_behaviors['UserID'].unique()
    # Usamos los mismos usuarios de prueba que en la evaluación de precisión
    sample_users = test_users[:sample_size]

    results = defaultdict(lambda: {'novelty': [], 'diversity': []})

    for user_id in tqdm(sample_users, desc="Progreso"):
        for model_name, recommend_func in model_functions_dict.items():

            # Generamos Recomendaciones
            recommendations = recommend_func(user_id, k)
            if recommendations.empty:
                continue

            recommended_news = recommendations['NewsID'].head(k).tolist()

            # Calculamos Novedad
            novelty = calculate_novelty(recommended_news, novelty_map)
            if novelty > 0:
                results[model_name]['novelty'].append(novelty)

            # Calculamos Diversidad
            # Nota: Necesitamos la matriz BERT y el mapeo.
            diversity = calculate_diversity(recommended_news, bert_matrix, news_to_bert_index)
            if not np.isnan(diversity):
                 results[model_name]['diversity'].append(diversity)

    final_results = []
    for model_name, data in results.items():
        avg_novelty = np.mean(data['novelty']) if data['novelty'] else 0
        avg_diversity = np.mean(data['diversity']) if data['diversity'] else 0

        final_results.append({
            'Model': model_name,
            'Avg Novelty': f"{avg_novelty:.4f}",
            'Avg Diversity': f"{avg_diversity:.4f}",
        })

    print("\n✅ Evaluación exploratoria completada!\n")
    return pd.DataFrame(final_results).sort_values(by='Avg Novelty', ascending=False)

### Métricas

In [56]:
def get_historical_categories(behaviors_df, news_df):
    behavior_interactions = []
    for row in behaviors_df.itertuples(index=False):
        user = row.UserID
        if pd.isna(row.Impressions):
             continue

        for item in row.Impressions.split():
            if item.endswith('-1'):
                news_id = item.split('-')[0]
                behavior_interactions.append({'UserID': user, 'NewsID': news_id})

    temp_interactions = pd.DataFrame(behavior_interactions)
    temp_interactions = temp_interactions.merge(news_df[['NewsID', 'Category']], on='NewsID', how='left')

    user_category_map = defaultdict(set)
    for row in temp_interactions.itertuples(index=False):
        if pd.notna(row.Category):
            user_category_map[row.UserID].add(row.Category)

    return user_category_map

In [57]:
def evaluate_exact_news_prediction(
    model_functions_dict,
    test_set_behaviors,
    k=10,
    sample_size=100
):

    print("=" * 70)
    print(f"🎯 Métrica: Predicción de NOTICIAS EXACTAS (k={k})")
    print("=" * 70)

    test_user_clicks = defaultdict(set)
    for _, row in test_set_behaviors.iterrows():
        user = row.UserID
        if pd.notna(row.Impressions):
            for item in row.Impressions.split():
                if item.endswith('-1'):
                    test_user_clicks[user].add(item.split('-')[0])

    test_users_with_clicks = [u for u in test_user_clicks.keys() if len(test_user_clicks[u]) > 0]

    if len(test_users_with_clicks) > sample_size:
        sample_users = random.sample(test_users_with_clicks, sample_size)
    else:
        sample_users = test_users_with_clicks

    if not sample_users:
        print("¡No se encontraron usuarios con clics en el set de prueba!")
        return pd.DataFrame()

    print(f"📊 Evaluando para {len(sample_users)} usuarios:")

    results = defaultdict(lambda: {
        'hits': 0, 'total_recalled': 0, 'total_relevant': 0,
        'total_recommended': 0, 'total_relevant_recommended': 0
    })

    for user_id in tqdm(sample_users, desc="Progreso"):
        actual_clicked = test_user_clicks[user_id]
        if not actual_clicked:
            continue

        for model_name, recommend_func in model_functions_dict.items():

            # Llamamos a la función (todas tienen la misma firma)
            recommendations = recommend_func(user_id, k)

            if recommendations.empty:
                continue

            recommended_news = set(recommendations['NewsID'].head(k).tolist())

            if len(recommended_news.intersection(actual_clicked)) > 0:
                results[model_name]['hits'] += 1

            relevant_recommended = len(recommended_news.intersection(actual_clicked))
            results[model_name]['total_recalled'] += relevant_recommended
            results[model_name]['total_relevant'] += len(actual_clicked)
            results[model_name]['total_recommended'] += len(recommended_news)
            results[model_name]['total_relevant_recommended'] += relevant_recommended

    final_results = []
    n_users = len(sample_users)

    for model_name, data in results.items():
        hit_rate = (data['hits'] / n_users * 100) if n_users > 0 else 0
        recall = (data['total_recalled'] / data['total_relevant'] * 100) if data['total_relevant'] > 0 else 0
        precision_val = (data['total_relevant_recommended'] / data['total_recommended'] * 100)
        precision = precision_val if data['total_recommended'] > 0 else 0

        final_results.append({
            'Model': model_name,
            f'Hit Rate @{k}': f"{hit_rate:.2f}%",
            f'Recall @{k}': f"{recall:.2f}%",
            f'Precision @{k}': f"{precision:.2f}%"
        })

    print("\n✅ ¡Evaluación completada!\n")
    return pd.DataFrame(final_results).sort_values(by=f'Hit Rate @{k}', ascending=False)

In [58]:
def orchestrate_evaluation_categories(
    model_functions_dict,
    train_set_behaviors,
    test_set_behaviors,
    news_df,
    k=10,
    sample_size=100
):
    print("=" * 70)
    print(f"🎯 Métrica: Predicción por CATEGORÍAS (k={k})")
    print("=" * 70)

    user_historical_categories = get_historical_categories(train_set_behaviors, news_df)

    test_users = test_set_behaviors['UserID'].unique()
    sample_users = [u for u in test_users if u in user_historical_categories][:sample_size]

    if not sample_users:
        print("¡No se encontraron usuarios de prueba con historial de categorías!")
        return pd.DataFrame()

    print(f"📊 Evaluando para {len(sample_users)} usuarios:")

    results = defaultdict(lambda: {
        'hits': 0, 'total_recalled': 0, 'total_relevant': 0,
        'total_recommended': 0, 'total_relevant_recommended': 0
    })

    for user_id in tqdm(sample_users, desc="Progreso"):
        user_categories = user_historical_categories.get(user_id, set())
        if not user_categories:
            continue

        for model_name, recommend_func in model_functions_dict.items():

            recommendations = recommend_func(user_id, k)

            if recommendations.empty:
                continue

            recommended_categories = set(recommendations['Category'].head(k).tolist())

            if len(recommended_categories.intersection(user_categories)) > 0:
                results[model_name]['hits'] += 1

            relevant_recommended = len(recommended_categories.intersection(user_categories))
            results[model_name]['total_recalled'] += relevant_recommended
            results[model_name]['total_relevant'] += len(user_categories)
            results[model_name]['total_recommended'] += len(recommended_categories)
            results[model_name]['total_relevant_recommended'] += relevant_recommended

    final_results = []
    n_users = len(sample_users)

    for model_name, data in results.items():
        hit_rate = (data['hits'] / n_users * 100) if n_users > 0 else 0
        recall = (data['total_recalled'] / data['total_relevant'] * 100) if data['total_relevant'] > 0 else 0
        precision_val = (data['total_relevant_recommended'] / data['total_recommended'] * 100)
        precision = precision_val if data['total_recommended'] > 0 else 0

        final_results.append({
            'Model': model_name,
            f'Hit Rate @{k}': f"{hit_rate:.2f}%",
            f'Recall @{k}': f"{recall:.2f}%",
            f'Precision @{k}': f"{precision:.2f}%"
        })

    print("\n✅ ¡Evaluación completada!\n")
    return pd.DataFrame(final_results).sort_values(by=f'Hit Rate @{k}', ascending=False)

In [59]:
# Creamos un mapa de historial de entrenamiento (lo usaremos para sacar el 'seed_id')
train_history_map = train_behaviors[train_behaviors['Impressions'].notna()].groupby('UserID')['Impressions'].apply(
    lambda x: ' '.join(x).split()
).to_dict()

def get_cbf_tfidf_recommendations_wrapper(user_id, k):
    """Wrapper para el modelo CBF TF-IDF"""
    # Encontramos un 'seed_news_id' (última noticia que vio en el set de train)
    history = train_history_map.get(user_id, [])
    clicked_history = [item.split('-')[0] for item in history if item.endswith('-1')]

    if not clicked_history:
        return pd.DataFrame(columns=['NewsID', 'Title', 'Category', 'SubCategory']) # Usuario sin historial

    seed_news_id = clicked_history[-1]

    # Comprobar si el seed_id es válido para este modelo
    if seed_news_id not in news2id_cbf:
         return pd.DataFrame(columns=['NewsID', 'Title', 'Category', 'SubCategory']) # Seed no está en el mapa

    return get_cbf_recommendations("tf-idf", seed_news_id, top_n=k)

def get_cbf_bert_recommendations_wrapper(user_id, k):
    """Wrapper para el modelo CBF BERT"""
    history = train_history_map.get(user_id, [])
    clicked_history = [item.split('-')[0] for item in history if item.endswith('-1')]

    if not clicked_history:
        return pd.DataFrame(columns=['NewsID', 'Title', 'Category', 'SubCategory'])

    seed_news_id = clicked_history[-1]

    if seed_news_id not in news2id_cbf:
         return pd.DataFrame(columns=['NewsID', 'Title', 'Category', 'SubCategory'])

    # Llamamos a la función genérica con 'bert'
    return get_cbf_recommendations("bert", seed_news_id, top_n=k)

def get_cbf_entity_recommendations_wrapper(user_id, k):
    """Wrapper para el modelo CBF Entity"""
    # Encontramos un 'seed_news_id'
    history = train_history_map.get(user_id, [])
    clicked_history = [item.split('-')[0] for item in history if item.endswith('-1')]

    if not clicked_history:
        return pd.DataFrame(columns=['NewsID', 'Title', 'Category', 'SubCategory'])

    seed_news_id = clicked_history[-1]

    # Comprobar si el seed_id es válido
    if seed_news_id not in news2id_cbf:
         return pd.DataFrame(columns=['NewsID', 'Title', 'Category', 'SubCategory'])

    return get_cbf_recommendations("entity", seed_news_id, top_n=k)

user_id_map_bert = dataset_bert.mapping()[0]
item_id_map_bert = dataset_bert.mapping()[2]
item_id_map_inv_bert = {v: k for k, v in item_id_map_bert.items()}

def get_lightfm_bert_recommendations(user_id_str, top_n=10):
    # Cold Start
    if user_id_str not in user_id_map_bert:
        recs = news[news['NewsID'].isin(popular_news_ids_fallback)]
        return recs[['NewsID', 'Title', 'Category', 'SubCategory']].head(top_n)

    # Predicción
    user_id_lfm = user_id_map_bert[user_id_str]
    n_items_lfm = item_features_lfm_bert.shape[0]
    item_ids_lfm = np.arange(n_items_lfm)

    scores = model_lfm_bert.predict(user_id_lfm,
                                    item_ids_lfm,
                                    item_features=item_features_lfm_bert,
                                    num_threads=4)

    top_items_lfm = np.argsort(-scores)[:top_n]
    recommended_news_ids = [item_id_map_inv_bert[lfm_id] for lfm_id in top_items_lfm]

    recommendations = news[news['NewsID'].isin(recommended_news_ids)]
    recommendations = recommendations.set_index('NewsID').loc[recommended_news_ids].reset_index()

    return recommendations[['NewsID', 'Title', 'Category', 'SubCategory']]

## Recomendaciones

In [60]:
n_recommendations = 5

### CF KNN - Collaborative Filter

In [61]:
user_id = get_userID(78)
cf_recommendation = get_cf_recommendations(user_id, top_n=n_recommendations)
cf_recommendation

Unnamed: 0,NewsID,Title,Category,SubCategory
0,N55689,"Charles Rogers, former Michigan State football...",sports,football_nfl
1,N35729,Porsche launches into second story of New Jers...,news,newsus
2,N33619,College gymnast dies following training accide...,news,newsus
3,N53585,"Rip Taylor's Cause of Death Revealed, Memorial...",tv,tvnews
4,N63970,Dean Foods files for bankruptcy,finance,finance-companies


### CBF - Content Based Filter

#### CBF TF-IDF

In [62]:
matrix = "tf-idf"
news_id = get_newID(0)
cbf_recommendations = get_cbf_recommendations(matrix, news_id, top_n=n_recommendations)
cbf_recommendations

Unnamed: 0,NewsID,Title,Category,SubCategory
0,N9056,This Is What Queen Elizabeth Is Doing About th...,lifestyle,lifestyleroyals
1,N38133,The cutest photos of royal children and their ...,lifestyle,lifestyleroyals
2,N43522,Prince Charles is Getting Into Fashion,lifestyle,lifestylevideo
3,N60671,Prince Charles Teared Up When Prince William T...,lifestyle,lifestyleroyals
4,N63823,Prince Charles Hit by One of the Most Incredib...,lifestyle,lifestyleroyals


#### CBF BERT

In [63]:
matrix = "bert"
news_id = get_newID(0)
cbf_recommendations = get_cbf_recommendations(matrix, news_id, top_n=n_recommendations)
cbf_recommendations

Unnamed: 0,NewsID,Title,Category,SubCategory
0,N55244,A Brief History of Royals Wearing Denim,lifestyle,lifestylebuzz
1,N42227,Queen Elizabeth's Favorite Beauty Products Hav...,lifestyle,lifestyleroyals
2,N34718,This is Who Can Wear and Borrow the Crown Jewels,lifestyle,lifestylevideo
3,N39208,21 Things You Never Knew About America's 'Roya...,lifestyle,lifestylesmartliving
4,N24668,101 Photos of the Youngest Royals Hanging Out ...,lifestyle,lifestyleroyals


#### CBF Entity Embeddings

In [64]:
matrix = "entity"
news_id = get_newID(0)
cbf_recommendations = get_cbf_recommendations(matrix, news_id, top_n=n_recommendations)
cbf_recommendations

Unnamed: 0,NewsID,Title,Category,SubCategory
0,N38133,The cutest photos of royal children and their ...,lifestyle,lifestyleroyals
1,N39683,Prince Charles just turned 71 here are the b...,lifestyle,lifestyleroyals
2,N42777,Prince George's Royal Life in Photos,lifestyle,lifestyleroyals
3,N63495,65 Photos of Prince Charles You've Probably Ne...,lifestyle,lifestyleroyals
4,N64143,Adorable Moments Between Queen Elizabeth and H...,lifestyle,lifestyleroyals


### LightFM - Híbrido

#### LightFM TF-IDF

In [65]:
user_id = get_userID(78)
hybrid_recs = get_lightfm_tfidf_recommendations(user_id, top_n=n_recommendations)
hybrid_recs

Unnamed: 0,NewsID,Title,Category,SubCategory
0,N45921,Instant analysis: Behind Lamar Jackson's big d...,sports,football_nfl
1,N46283,Trump DC hotel sales pitch boasts of millions ...,news,newspolitics
2,N55325,Alicia Keys Returns as GRAMMY Awards Host for ...,music,music-grammys
3,N10121,Disney+ isn't working for some users on launch...,news,newsscienceandtechnology
4,N50675,Utah death-row inmate featured in best-selling...,news,newscrime


#### LightFM BERT

In [66]:
user_id = get_userID(78)
hybrid_recs = get_lightfm_bert_recommendations(user_id, top_n=n_recommendations)
hybrid_recs

Unnamed: 0,NewsID,Title,Category,SubCategory
0,N47020,The News In Cartoons,news,newsopinion
1,N1034,'Baby Trump' balloon slashed at Alabama appear...,news,newsus
2,N25791,Opinions | We thought Trump was the biggest co...,news,newsopinion
3,N35729,Porsche launches into second story of New Jers...,news,newsus
4,N57005,This Is The Moment A Senior Shelter Dog Knew H...,news,newsgoodnews


### DeepFM - Híbrido

In [76]:
user_id = get_userID(78)
hybrid_recs = get_deepfm_recommendations(user_id, top_n=n_recommendations)
hybrid_recs

Unnamed: 0,NewsID,Title,Category,SubCategory
0,N58728,How Patriots had good Sunday despite being on ...,sports,football_nfl
1,N45700,Why the Chiefs are struggling to finish off re...,sports,football_nfl
2,N35680,Bengals cut candidates who could follow Presto...,sports,football_nfl
3,N5038,Ryan Clark makes a lofty comparison for the 20...,sports,football_nfl
4,N25791,Opinions | We thought Trump was the biggest co...,news,newsopinion


## Evaluación de Modelos

In [78]:
model_functions_to_test = {
    "CF KNN": get_cf_recommendations,
    "CBF TF-IDF": get_cbf_tfidf_recommendations_wrapper,
    "CBF BERT": get_cbf_bert_recommendations_wrapper,
    # "CBF Entity": get_cbf_entity_recommendations_wrapper,
    "LightFM TF-IDF": get_lightfm_tfidf_recommendations,
    "LightFM BERT": get_lightfm_bert_recommendations,
    "DeepFM BERT": get_deepfm_recommendations
}

In [69]:
K_VALUE = 10
SAMPLE_SIZE = 100

In [79]:
exact_results = evaluate_exact_news_prediction(
    model_functions_to_test,
    test_behaviors,
    k=K_VALUE,
    sample_size=SAMPLE_SIZE
)

print("\n📊 RESULTADOS POR NOTICIAS:")
exact_results

🎯 Métrica: Predicción de NOTICIAS EXACTAS (k=10)
📊 Evaluando para 100 usuarios:


Progreso:   0%|          | 0/100 [00:00<?, ?it/s]


✅ ¡Evaluación completada!


📊 RESULTADOS POR NOTICIAS:


Unnamed: 0,Model,Hit Rate @10,Recall @10,Precision @10
5,DeepFM BERT,3.00%,1.61%,0.30%
4,LightFM BERT,15.00%,8.60%,1.60%
3,LightFM TF-IDF,12.00%,6.45%,1.20%
2,CBF BERT,1.00%,0.66%,0.13%
0,CF KNN,0.00%,0.00%,0.00%
1,CBF TF-IDF,0.00%,0.00%,0.00%


### Predicción según noticias específicas

### Predicción según categorías relevantes

In [80]:
category_results = orchestrate_evaluation_categories(
    model_functions_to_test,
    train_behaviors,
    test_behaviors,
    news,
    k=K_VALUE,
    sample_size=SAMPLE_SIZE
)

print("\n📊 RESULTADOS POR CATEGORÍAS:")
category_results

🎯 Métrica: Predicción por CATEGORÍAS (k=10)
📊 Evaluando para 100 usuarios:


Progreso:   0%|          | 0/100 [00:00<?, ?it/s]


✅ ¡Evaluación completada!


📊 RESULTADOS POR CATEGORÍAS:


Unnamed: 0,Model,Hit Rate @10,Recall @10,Precision @10
4,LightFM BERT,96.00%,61.68%,48.50%
0,CF KNN,91.00%,57.34%,35.23%
1,CBF TF-IDF,90.00%,43.68%,45.78%
2,CBF BERT,90.00%,35.92%,49.60%
3,LightFM TF-IDF,87.00%,55.71%,38.18%
5,DeepFM BERT,77.00%,28.80%,48.62%


### Novedad y Diversidad

In [83]:
exploratory_results = evaluate_exploratory_metrics(
    model_functions_to_test,
    test_behaviors,
    k=K_VALUE,
    sample_size=SAMPLE_SIZE
)

print("\n📊 RESULTADOS EXPLORATORIOS:")
exploratory_results


✨ Métrica: Exploración (Novedad y Diversidad) (k=10)


Progreso:   0%|          | 0/100 [00:00<?, ?it/s]


✅ Evaluación exploratoria completada!


📊 RESULTADOS EXPLORATORIOS:


Unnamed: 0,Model,Avg Novelty,Avg Diversity
3,LightFM TF-IDF,9.3979,0.8918
0,CF KNN,9.3704,0.8938
1,CBF TF-IDF,9.1231,0.6212
4,LightFM BERT,9.0886,0.8686
2,CBF BERT,8.8031,0.5148
5,DeepFM BERT,18.1198,0.8166
