<i>Copyright (c) Recommenders contributors.</i>

<i>Licensed under the MIT License.</i>

#  MIND Utils Generation

MIND dataset\[1\] is a large-scale English news dataset. It was collected from anonymized behavior logs of Microsoft News website. MIND contains 1,000,000 users, 161,013 news articles and 15,777,377 impression logs. Every news article contains rich textual content including title, abstract, body, category and entities. Each impression log contains the click events, non-clicked events and historical news click behaviors of this user before this impression.

Many news recommendation methods use word embeddings, news vertical embeddings, news subvertical embeddings and user id embedding. Therefore, it is necessary to generate a word dictionary, a vertical dictionary, a subvertical dictionary and a `userid` dictionary to convert words, news verticals, subverticals and user ids from strings to indexes. To use the pretrain word embedding, an embedding matrix is generated as the initial weight of the word embedding layer.

This notebook gives examples about how to generate:
* `word_dict.pkl`: convert the words in news titles into indexes.
* `word_dict_all.pkl`: convert the words in news titles and abstracts into indexes.
* `embedding.npy`: pretrained word embedding matrix of words in word_dict.pkl
* `embedding_all.npy`: pretrained embedding matrix of words in word_dict_all.pkl
* `vert_dict.pkl`: convert news verticals into indexes.
* `subvert_dict.pkl`: convert news subverticals into indexes.
* `uid2index.pkl`: convert user ids into indexes.

In [2]:
import os
import sys
import numpy as np
import pandas as pd
from tqdm import tqdm
import pickle
from collections import Counter
from tempfile import TemporaryDirectory

from recommenders.datasets.mind import (download_mind,
                                     extract_mind,
                                     download_and_extract_glove,
                                     load_glove_matrix,
                                     word_tokenize
                                    )
from recommenders.datasets.download_utils import unzip_file
from recommenders.utils.notebook_utils import store_metadata

print("System version: {}".format(sys.version))


System version: 3.12.4 (tags/v3.12.4:8e8a4ba, Jun  6 2024, 19:30:16) [MSC v.1940 64 bit (AMD64)]


In [3]:
# MIND sizes: "demo", "small" or "large"
mind_type="small" 
# word_embedding_dim should be in [50, 100, 200, 300]
word_embedding_dim = 300

In [4]:
tmpdir = TemporaryDirectory()
data_path = tmpdir.name
train_zip, valid_zip = download_mind(size=mind_type, dest_path=data_path)
unzip_file(train_zip, os.path.join(data_path, 'train'), clean_zip_file=False)
unzip_file(valid_zip, os.path.join(data_path, 'valid'), clean_zip_file=False)
output_path = os.path.join(data_path, 'utils')
os.makedirs(output_path, exist_ok=True)

100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 51.8k/51.8k [00:06<00:00, 8.62kKB/s]
100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 30.2k/30.2k [00:04<00:00, 6.91kKB/s]


## Prepare utils of news

* word dictionary
* vertical dictionary
* subvetical dictionary

In [5]:
news = pd.read_table(os.path.join(data_path, 'train', 'news.tsv'),
                     names=['newid', 'vertical', 'subvertical', 'title',
                            'abstract', 'url', 'entities in title', 'entities in abstract'],
                     usecols = ['newid','vertical', 'subvertical', 'title', 'abstract'])

print(len(news))


51282


In [6]:
news.head()

Unnamed: 0,newid,vertical,subvertical,title,abstract
0,N55528,lifestyle,lifestyleroyals,"The Brands Queen Elizabeth, Prince Charles, an...","Shop the notebooks, jackets, and more that the..."
1,N19639,health,weightloss,50 Worst Habits For Belly Fat,These seemingly harmless habits are holding yo...
2,N61837,news,newsworld,The Cost of Trump's Aid Freeze in the Trenches...,Lt. Ivan Molchanets peeked over a parapet of s...
3,N53526,health,voices,I Was An NBA Wife. Here's How It Affected My M...,"I felt like I was a fraud, and being an NBA wi..."
4,N38324,health,medical,"How to Get Rid of Skin Tags, According to a De...","They seem harmless, but there's a very good re..."


In [7]:
news_vertical = news.vertical.drop_duplicates().reset_index(drop=True)
vert_dict_inv = news_vertical.to_dict()
vert_dict = {v: k+1 for k, v in vert_dict_inv.items()}

news_subvertical = news.subvertical.drop_duplicates().reset_index(drop=True)
subvert_dict_inv = news_subvertical.to_dict()
subvert_dict = {v: k+1 for k, v in vert_dict_inv.items()}

In [8]:
news.title = news.title.apply(word_tokenize)
news.abstract = news.abstract.apply(word_tokenize)

In [9]:
word_cnt = Counter()
word_cnt_all = Counter()

for i in tqdm(range(len(news))):
    word_cnt.update(news.loc[i]['title'])
    word_cnt_all.update(news.loc[i]['title'])
    word_cnt_all.update(news.loc[i]['abstract'])

100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 51282/51282 [00:01<00:00, 26401.93it/s]


In [10]:
word_dict = {k: v+1 for k, v in zip(word_cnt, range(len(word_cnt)))}
word_dict_all = {k: v+1 for k, v in zip(word_cnt_all, range(len(word_cnt_all)))}

In [11]:
with open(os.path.join(output_path, 'vert_dict.pkl'), 'wb') as f:
    pickle.dump(vert_dict, f)
    
with open(os.path.join(output_path, 'subvert_dict.pkl'), 'wb') as f:
    pickle.dump(subvert_dict, f)

with open(os.path.join(output_path, 'word_dict.pkl'), 'wb') as f:
    pickle.dump(word_dict, f)
    
with open(os.path.join(output_path, 'word_dict_all.pkl'), 'wb') as f:
    pickle.dump(word_dict_all, f)

## Prepare embedding matrixs
* embedding.npy
* embedding_all.npy

In [12]:
glove_path = download_and_extract_glove(data_path)

100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 842k/842k [00:09<00:00, 90.1kKB/s] 


In [13]:
embedding_matrix, exist_word = load_glove_matrix(glove_path, word_dict, word_embedding_dim)
embedding_all_matrix, exist_all_word = load_glove_matrix(glove_path, word_dict_all, word_embedding_dim)

400001it [00:04, 91479.56it/s] 
400001it [00:05, 79081.33it/s] 


In [14]:
np.save(os.path.join(output_path, 'embedding.npy'), embedding_matrix)
np.save(os.path.join(output_path, 'embedding_all.npy'), embedding_all_matrix)

## Prepare uid2index.pkl

In [15]:
uid2index = {}

with open(os.path.join(data_path, 'train', 'behaviors.tsv'), 'r') as f:
    for l in tqdm(f):
        uid = l.strip('\n').split('\t')[1]
        if uid not in uid2index:
            uid2index[uid] = len(uid2index) + 1

156965it [00:00, 664988.87it/s]


In [16]:
with open(os.path.join(output_path, 'uid2index.pkl'), 'wb') as f:
    pickle.dump(uid2index, f)

In [17]:
utils_state = {
    'vert_num': len(vert_dict),
    'subvert_num': len(subvert_dict),
    'word_num': len(word_dict),
    'word_num_all': len(word_dict_all),
    'embedding_exist_num': len(exist_word),
    'embedding_exist_num_all': len(exist_all_word),
    'uid2index': len(uid2index)
}
utils_state

{'vert_num': 17,
 'subvert_num': 17,
 'word_num': 31029,
 'word_num_all': 55028,
 'embedding_exist_num': 29081,
 'embedding_exist_num_all': 48422,
 'uid2index': 50000}

In [18]:
# Record results for tests - ignore this cell
store_metadata("vert_num", len(vert_dict))
store_metadata("subvert_num", len(subvert_dict))
store_metadata("word_num", len(word_dict))
store_metadata("word_num_all", len(word_dict_all))
store_metadata("embedding_exist_num", len(exist_word))
store_metadata("embedding_exist_num_all", len(exist_all_word))
store_metadata("uid2index", len(uid2index))

## Content based filtering      

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import csr_matrix
from sparse_dot_topn import awesome_cossim_topn
import pandas as pd
import numpy as np

print(len(news))

#  Combine text features for TF-IDF
news['combined_text'] = news['vertical'] + ' ' + news['subvertical'] + ' ' + \
                        news['title'].apply(lambda x: ' '.join(x) if isinstance(x, list) else x) + ' ' + \
                        news['abstract'].apply(lambda x: ' '.join(x) if isinstance(x, list) else x)

#  Initialize TF-IDF with stopword removal and feature limit
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_features=3000)
tfidf_matrix = tfidf_vectorizer.fit_transform(news['combined_text'])

#  Convert to sparse matrix for efficient computation
tfidf_matrix_sparse = csr_matrix(tfidf_matrix)

#  Compute sparse cosine similarity using `sparse_dot_topn`
top_n = 10  # Keep only top-10 most similar items per article
threshold = 0.01  # Ignore weak similarities
cosine_sim = awesome_cossim_topn(tfidf_matrix_sparse, tfidf_matrix_sparse.T, top_n, threshold)

#  Function to get top-N recommended articles
def get_recommendations(article_index, top_n=5):
    """Returns top-N most similar news articles based on content similarity."""
    
    # Extract similarity scores for the given article
    sim_scores = cosine_sim[article_index].toarray().flatten()  # Convert sparse row to dense array

    # Sort and retrieve top-N similar articles (excluding itself)
    sim_scores = list(enumerate(sim_scores))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[1:top_n+1]  

    # Extract recommended news details
    recommended_articles = []
    for i in sim_scores:
        news_id = news.iloc[i[0]]['newid']  
        title = news.iloc[i[0]]['title']
        genre = news.iloc[i[0]]['vertical']  
        subgenre = news.iloc[i[0]]['subvertical']

        # If title is a list, convert to a readable string
        if isinstance(title, list):
            title = ' '.join(title)

        recommended_articles.append(f"{news_id}: {title} (Genre: {genre}, Subgenre: {subgenre})")

    return recommended_articles

#  Function to get top-N recommended news IDs (for evaluation)
def get_recommendations_news_ids(article_index, top_n=10):
    """Returns top-N most similar news article IDs."""
    
    # Extract similarity scores
    sim_scores = cosine_sim[article_index].toarray().flatten()  

    # Sort and retrieve top-N articles (excluding itself)
    sim_scores = list(enumerate(sim_scores))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[1:top_n+1]  

    # Return only the news IDs
    recommended_news_ids = [news.iloc[i[0]]['newid'] for i in sim_scores]
    return recommended_news_ids

#  Example: Get recommendations for the first article
recommended_articles = get_recommendations(0, top_n=15)

#  Pretty print the results
print("Recommended Articles:\n")
for idx, article in enumerate(recommended_articles, start=1):
    print(f"{idx}. {article}")


51282


  cosine_sim = awesome_cossim_topn(tfidf_matrix_sparse, tfidf_matrix_sparse.T, top_n, threshold)


Recommended Articles:

1. N9056: this is what queen elizabeth is doing about the prince william prince harry feud (Genre: lifestyle, Subgenre: lifestyleroyals)
2. N60671: prince charles teared up when prince william talked about succeeding him (Genre: lifestyle, Subgenre: lifestyleroyals)
3. N38133: the cutest photos of royal children and their beloved nannies from prince george to the queen (Genre: lifestyle, Subgenre: lifestyleroyals)
4. N43522: prince charles is getting into fashion (Genre: lifestyle, Subgenre: lifestylevideo)
5. N63174: prince albert on twins jacques and gabriella they re starting to say , are we there yet ? (Genre: lifestyle, Subgenre: lifestyleroyals)
6. N51725: prince charles looks in awe of master archie at christening (Genre: video, Subgenre: lifestyle)
7. N18530: all the photos of prince charles s trip to japan for emperor naruhito s enthronement ceremony (Genre: lifestyle, Subgenre: lifestyleroyals)
8. N43301: see all the best photos of prince charles s trip

## Content based filtering result check

In [20]:
valid_news = pd.read_table(
    os.path.join(data_path, 'valid', 'news.tsv'),
    names=['newid', 'vertical', 'subvertical', 'title', 'abstract', 'url', 'entities in title', 'entities in abstract'],
    usecols=['newid', 'vertical', 'subvertical', 'title', 'abstract']
)



article_index = 5  # Choose a random validation article

print("\nTesting on Validation Data:")
print(f"ID: {valid_news.iloc[article_index]['newid']}")
print(f"Title: {valid_news.iloc[article_index]['title']}")
print(f"Genre: {valid_news.iloc[article_index]['vertical']}")
print(f"Subgenre: {valid_news.iloc[article_index]['subvertical']}\n")

# Get recommendations based on the validation article
recommended_articles = get_recommendations(article_index, top_n=15)

print("Recommended Articles:\n")
for idx, article in enumerate(recommended_articles, start=1):
    print(f"{idx}. {article}")



Testing on Validation Data:
ID: N2073
Title: Should NFL be able to fine players for criticizing officiating?
Genre: sports
Subgenre: football_nfl

Recommended Articles:

1. N37948: prescott bad on the nfl if it does not protect mic d up players (Genre: sports, Subgenre: football_nfl)
2. N21089: dak prescott bad on the brand if nfl does not protect mic d up players (Genre: sports, Subgenre: football_nfl)
3. N3314: 5 nfl breakout players of 2019 (Genre: sports, Subgenre: football_nfl)
4. N46662: nfl cheerleaders (Genre: sports, Subgenre: football_nfl)
5. N846: nfl week 7 awards is this the best photo ever taken of a nfl player ? (Genre: sports, Subgenre: football_nfl)
6. N12200: teams with most and fewest in state players (Genre: sports, Subgenre: football_ncaa)
7. N33164: 100 famous nfl players who played for teams you forgot about (Genre: sports, Subgenre: football_nfl)
8. N8921: nfl week 6 awards this baffling call in cleveland was the worst of the week (Genre: sports, Subgenre: foot

In [21]:
# Load validation impressions 
valid_behaviors = pd.read_table(
    os.path.join(data_path, 'valid', 'behaviors.tsv'),
    names=['impression_id', 'user_id', 'time', 'history', 'impressions']
)

# Extract a sample user's history
sample_user = valid_behaviors.iloc[0]

print(f"User {sample_user['user_id']} previously read:")
print(sample_user['history'])

print("\nRecommended articles:")
recommended_articles = get_recommendations(0, top_n=5)
for idx, article in enumerate(recommended_articles, start=1):
    print(f"{idx}. {article}")


User U80234 previously read:
N55189 N46039 N51741 N53234 N11276 N264 N40716 N28088 N43955 N6616 N47686 N63573 N38895 N30924 N35671

Recommended articles:
1. N9056: this is what queen elizabeth is doing about the prince william prince harry feud (Genre: lifestyle, Subgenre: lifestyleroyals)
2. N60671: prince charles teared up when prince william talked about succeeding him (Genre: lifestyle, Subgenre: lifestyleroyals)
3. N38133: the cutest photos of royal children and their beloved nannies from prince george to the queen (Genre: lifestyle, Subgenre: lifestyleroyals)
4. N43522: prince charles is getting into fashion (Genre: lifestyle, Subgenre: lifestylevideo)
5. N63174: prince albert on twins jacques and gabriella they re starting to say , are we there yet ? (Genre: lifestyle, Subgenre: lifestyleroyals)


In [43]:


def load_data(split="train"):
    """
    Loads news and behaviors data for a given dataset split ('train' or 'valid').
    Ensures we only evaluate on articles that exist in the dataset.
    """
    news_df = pd.read_table(
        os.path.join(data_path, split, 'news.tsv'),
        names=['newid', 'vertical', 'subvertical', 'title', 'abstract']
    )
    
    behaviors_df = pd.read_table(
        os.path.join(data_path, split, 'behaviors.tsv'),
        names=['impression_id', 'user_id', 'time', 'history', 'impressions']
    )

    return news_df, behaviors_df

# ‚úÖ Load training dataset
train_news, train_behaviors = load_data(split="train")

# ‚úÖ Load validation dataset
valid_news, valid_behaviors = load_data(split="valid")


  similarity_matrix = awesome_cossim_topn(tfidf_matrix, tfidf_matrix.T, top_n, 0.01)  # Keep top-N scores


In [47]:
import os
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sparse_dot_topn import awesome_cossim_topn
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# ‚úÖ Define cache paths
PROJECT_DIR = os.getcwd()
CACHE_DIR = os.path.join(PROJECT_DIR, "cache")
os.makedirs(CACHE_DIR, exist_ok=True)

SIMILARITY_CACHE = os.path.join(CACHE_DIR, "similarity_cache.pkl")
CLUSTER_CACHE = os.path.join(CACHE_DIR, "clusters.pkl")

# ‚úÖ Load news and behavior data
def load_data(split="train"):
    """
    Loads news and behaviors data for a given dataset split ('train' or 'valid').
    Ensures we only evaluate on articles that exist in the dataset.
    """
    news_df = pd.read_table(
        os.path.join(data_path, split, 'news.tsv'),
        names=['newid', 'vertical', 'subvertical', 'title', 'abstract']
    )
    
    behaviors_df = pd.read_table(
        os.path.join(data_path, split, 'behaviors.tsv'),
        names=['impression_id', 'user_id', 'time', 'history', 'impressions']
    )

    return news_df, behaviors_df

# ‚úÖ Load train & validation datasets separately
train_news, train_behaviors = load_data(split="train")
valid_news, valid_behaviors = load_data(split="valid")

# ‚úÖ Prepare text features
def prepare_text_features(news_df):
    """Prepares TF-IDF text representation for news articles."""
    news_df['combined_text'] = news_df[['vertical', 'subvertical', 'title', 'abstract']].fillna('').agg(' '.join, axis=1)
    return news_df

train_news = prepare_text_features(train_news)
valid_news = prepare_text_features(valid_news)

# ‚úÖ Compute TF-IDF
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
tfidf_matrix_train = tfidf_vectorizer.fit_transform(train_news['combined_text'])
tfidf_matrix_valid = tfidf_vectorizer.transform(valid_news['combined_text'])

# ‚úÖ Compute sparse cosine similarity using `sparse_dot_topn`
def compute_sparse_similarity(tfidf_matrix, top_n=100):
    """
    Computes sparse cosine similarity, keeping only the top `top_n` similarities per row.
    """
    tfidf_matrix = csr_matrix(tfidf_matrix)  # Convert to sparse matrix
    similarity_matrix = awesome_cossim_topn(tfidf_matrix, tfidf_matrix.T, top_n, 0.01)  # Keep top-N scores
    return similarity_matrix

# ‚úÖ Compute sparse similarity matrices
train_similarity_matrix = compute_sparse_similarity(tfidf_matrix_train, top_n=100)
valid_similarity_matrix = compute_sparse_similarity(tfidf_matrix_valid, top_n=100)

# ‚úÖ Clustering for efficiency
NUM_CLUSTERS = 5
kmeans = KMeans(n_clusters=NUM_CLUSTERS, random_state=42, n_init=10)
train_news['cluster'] = kmeans.fit_predict(tfidf_matrix_train)
clusters = train_news[['newid', 'cluster']].set_index('newid').to_dict()['cluster']

# ‚úÖ Define Content-Based Recommender
class ContentBasedRecommender:
    def __init__(self, news_df, behaviors_df, similarity_matrix):
        self.news_df = news_df
        self.behaviors_df = behaviors_df
        self.similarity_matrix = similarity_matrix

    def recommend(self, user_id, N=5):
        """Recommend top-N articles for a user within the same dataset."""
        
        user_row = self.behaviors_df[self.behaviors_df["user_id"] == user_id]
        if user_row.empty or pd.isna(user_row.iloc[0]["history"]):
            return []  # No history

        clicked_articles = user_row.iloc[0]["history"].split()
        valid_clicked = [a for a in clicked_articles if a in self.news_df["newid"].values]

        if not valid_clicked:
            return []  # No matching history in dataset

        recommended_news_ids = set()
        for article in valid_clicked:
            article_index = self.news_df[self.news_df["newid"] == article].index[0]
            sim_scores = list(zip(self.similarity_matrix[article_index].indices, self.similarity_matrix[article_index].data))
            sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
            recommended_news_ids.update([self.news_df.iloc[i[0]]['newid'] for i in sim_scores[:N]])

        return list(recommended_news_ids)[:N]

# ‚úÖ Custom Evaluation Function
def evaluate_model(recommender, behaviors_df, K=5, min_interactions=5):
    """
    Evaluates a recommender model while filtering users with fewer than `min_interactions`.
    Ensures correct user counts and prints debug information.
    """
    precision_scores = []
    recall_scores = []
    ndcg_scores = []
    
    skipped_users = set()
    evaluated_users = set()

    for _, row in behaviors_df.iterrows():
        user_id = row["user_id"]
        if pd.isna(row["impressions"]):
            continue

        # Extract clicked articles
        actual_clicked = {item.split("-")[0] for item in row["impressions"].split() if item.endswith("-1")}

        # Skip users with fewer than `min_interactions`
        if len(actual_clicked) < min_interactions:
            skipped_users.add(user_id)
            continue  

        evaluated_users.add(user_id)

        # Get recommendations
        recommended = recommender.recommend(user_id, N=K) or []
        recommended = recommended[:K]

        # Compute scores
        precision_scores.append(len(set(recommended) & actual_clicked) / K if recommended else 0)
        recall_scores.append(len(set(recommended) & actual_clicked) / len(actual_clicked) if recommended else 0)
        dcg = sum(1 / np.log2(i + 2) for i, item in enumerate(recommended[:K]) if item in actual_clicked)
        idcg = sum(1 / np.log2(i + 2) for i in range(min(len(actual_clicked), K))) or 1
        ndcg_scores.append(dcg / idcg)

    avg_precision = np.mean(precision_scores) if precision_scores else 0.0
    avg_recall = np.mean(recall_scores) if recall_scores else 0.0
    avg_ndcg = np.mean(ndcg_scores) if ndcg_scores else 0.0

    print(f"\nüìä **User Statistics:**")
    print(f"üîπ Evaluated Users: {len(evaluated_users)}")
    print(f"üîπ Skipped Users (less than {min_interactions} interactions): {len(skipped_users)}")

    return avg_precision, avg_recall, avg_ndcg

# ‚úÖ Train & Evaluate on Training Dataset
train_recommender = ContentBasedRecommender(train_news, train_behaviors, train_similarity_matrix)
precision_train, recall_train, ndcg_train = evaluate_model(train_recommender, train_behaviors, 5, min_interactions=5)

print("\nüîπ **Training Set Evaluation:**")
print(f"üìå Precision@5: {precision_train:.4f}")
print(f"üìå Recall@5: {recall_train:.4f}")
print(f"üìå NDCG@5: {ndcg_train:.4f}")

# ‚úÖ Train & Evaluate on Validation Dataset
valid_recommender = ContentBasedRecommender(valid_news, valid_behaviors, valid_similarity_matrix)
precision_valid, recall_valid, ndcg_valid = evaluate_model(valid_recommender, valid_behaviors, 5, min_interactions=5)

print("\nüîπ **Validation Set Evaluation:**")
print(f"üìå Precision@5: {precision_valid:.4f}")
print(f"üìå Recall@5: {recall_valid:.4f}")
print(f"üìå NDCG@5: {ndcg_valid:.4f}")


  similarity_matrix = awesome_cossim_topn(tfidf_matrix, tfidf_matrix.T, top_n, 0.01)  # Keep top-N scores



üìä **User Statistics:**
üîπ Evaluated Users: 3135
üîπ Skipped Users (less than 5 interactions): 49664

üîπ **Training Set Evaluation:**
üìå Precision@5: 0.0000
üìå Recall@5: 0.0000
üìå NDCG@5: 0.0000

üìä **User Statistics:**
üîπ Evaluated Users: 1905
üîπ Skipped Users (less than 5 interactions): 49073

üîπ **Validation Set Evaluation:**
üìå Precision@5: 0.0000
üìå Recall@5: 0.0000
üìå NDCG@5: 0.0000


In [49]:
sample_article_index = 1
sample_article_id = valid_news.iloc[sample_article_index]["newid"]
sample_article_title = valid_news.iloc[sample_article_index]["title"]
sample_article_abstract = valid_news.iloc[sample_article_index]["abstract"]

# Get top 5 recommendations for this article
sim_scores = list(zip(valid_similarity_matrix[sample_article_index].indices, valid_similarity_matrix[sample_article_index].data))
sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[:5]

# Print the selected article
print(f"\nüîç **Selected Article:**")
print(f"üìå ID: {sample_article_id}")
print(f"üì∞ Title: {sample_article_title}")
print(f"üìÑ Abstract: {sample_article_abstract}")
print(f"\nüîç **Top 5 Recommended Articles:**")

# Print recommended articles
for idx, (rec_index, score) in enumerate(sim_scores):
    rec_article = valid_news.iloc[rec_index]
    print(f"\n‚≠ê **Recommendation {idx+1} (Similarity: {score:.4f})**")
    print(f"üìå ID: {rec_article['newid']}")
    print(f"üì∞ Title: {rec_article['title']}")
    print(f"üìÑ Abstract: {rec_article['abstract']}")



üîç **Selected Article:**
üìå ID: Dispose of unwanted prescription drugs during the DEA's Take Back Day
üì∞ Title: [{"Label": "Drug Enforcement Administration", "Type": "O", "WikidataId": "Q622899", "Confidence": 0.992, "OccurrenceOffsets": [50], "SurfaceForms": ["DEA"]}]
üìÑ Abstract: []

üîç **Top 5 Recommended Articles:**

‚≠ê **Recommendation 1 (Similarity: 1.0000)**
üìå ID: Dispose of unwanted prescription drugs during the DEA's Take Back Day
üì∞ Title: [{"Label": "Drug Enforcement Administration", "Type": "O", "WikidataId": "Q622899", "Confidence": 0.992, "OccurrenceOffsets": [50], "SurfaceForms": ["DEA"]}]
üìÑ Abstract: []

‚≠ê **Recommendation 2 (Similarity: 0.8527)**
üìå ID: The Drug Enforcement Administration warns of lethal fentanyl-laced pills
üì∞ Title: [{"Label": "Drug Enforcement Administration", "Type": "O", "WikidataId": "Q622899", "Confidence": 1.0, "OccurrenceOffsets": [4], "SurfaceForms": ["Drug Enforcement Administration"]}]
üìÑ Abstract: []

‚≠ê **Reco

In [39]:
# ‚úÖ Extract all unique clicked articles from validation
clicked_articles_in_valid = set()

for _, row in valid_behaviors_df.iterrows():
    if pd.isna(row["impressions"]):
        continue
    clicked_articles = {item.split("-")[0] for item in row["impressions"].split() if item.endswith("-1")}
    clicked_articles_in_valid.update(clicked_articles)

# ‚úÖ Extract all unique articles in training
train_articles = set(train_news["newid"].unique())

# ‚úÖ Find missing articles
missing_articles = clicked_articles_in_valid - train_articles

print(f"üîπ **Total Clicked Articles in Validation:** {len(clicked_articles_in_valid)}")
print(f"‚úÖ **Clicked Articles Found in Training Data:** {len(clicked_articles_in_valid & train_articles)}")
print(f"‚ùå **Clicked Articles Missing from Training:** {len(missing_articles)}")


üîπ **Total Clicked Articles in Validation:** 2212
‚úÖ **Clicked Articles Found in Training Data:** 0
‚ùå **Clicked Articles Missing from Training:** 2212


In [28]:
import os
import pickle
import numpy as np
import pandas as pd
from utils.evaluation import evaluate_model
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

# ‚úÖ Define cache file paths
PROJECT_DIR = os.getcwd()
CACHE_DIR = os.path.join(PROJECT_DIR, "cache")
os.makedirs(CACHE_DIR, exist_ok=True)

SIMILARITY_CACHE = os.path.join(CACHE_DIR, "similarity_cache.pkl")
CLUSTER_CACHE = os.path.join(CACHE_DIR, "clusters.pkl")

# ‚úÖ Load train & validation news
train_news = pd.read_table(
    os.path.join(data_path, 'train', 'news.tsv'),
    names=['newid', 'vertical', 'subvertical', 'title', 'abstract']
)

valid_news = pd.read_table(
    os.path.join(data_path, 'valid', 'news.tsv'),
    names=['newid', 'vertical', 'subvertical', 'title', 'abstract']
)

# ‚úÖ Prepare text features
# ‚úÖ Ensure there are no NaN values in combined_text
train_news['combined_text'] = train_news[['vertical', 'subvertical', 'title', 'abstract']].fillna('').agg(' '.join, axis=1)
valid_news['combined_text'] = valid_news[['vertical', 'subvertical', 'title', 'abstract']].fillna('').agg(' '.join, axis=1)

# ‚úÖ Compute TF-IDF
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
tfidf_matrix_train = tfidf_vectorizer.fit_transform(train_news['combined_text'])
tfidf_matrix_valid = tfidf_vectorizer.transform(valid_news['combined_text'])  # ‚úÖ Only transform for valid

# ‚úÖ Compute similarity between validation & training articles
if os.path.exists(SIMILARITY_CACHE):
    with open(SIMILARITY_CACHE, "rb") as f:
        similarity_matrix = pickle.load(f)
else:
    print("Computing similarity between validation and training articles...")
    similarity_matrix = cosine_similarity(tfidf_matrix_valid, tfidf_matrix_train)
    with open(SIMILARITY_CACHE, "wb") as f:
        pickle.dump(similarity_matrix, f)

# ‚úÖ Clustering for efficient search
NUM_CLUSTERS = 5  # ‚úÖ Reduce clusters for better coverage
if os.path.exists(CLUSTER_CACHE):
    with open(CLUSTER_CACHE, "rb") as f:
        clusters = pickle.load(f)
else:
    kmeans = KMeans(n_clusters=NUM_CLUSTERS, random_state=42, n_init=10)
    train_news['cluster'] = kmeans.fit_predict(tfidf_matrix_train)
    clusters = train_news[['newid', 'cluster']].set_index('newid').to_dict()['cluster']
    with open(CLUSTER_CACHE, "wb") as f:
        pickle.dump(clusters, f)

# ‚úÖ Define Content-Based Recommender
class ContentBasedRecommender:
    def __init__(self, train_news, valid_news, similarity_matrix, clusters, behaviors_df):
        self.train_news = train_news
        self.valid_news = valid_news
        self.similarity_matrix = similarity_matrix
        self.clusters = clusters
        self.behaviors_df = behaviors_df

    def recommend(self, user_id, N=10):
        """Recommend top-N articles for a user."""
        
        user_row = self.behaviors_df[self.behaviors_df["user_id"] == user_id]
        if user_row.empty or pd.isna(user_row.iloc[0]["history"]):
            return []  # No history

        clicked_articles = user_row.iloc[0]["history"].split()
        valid_clicked = [a for a in clicked_articles if a in self.train_news["newid"].values]

        if not valid_clicked:
            print(f"‚ùå No matching clicked articles found for User {user_id}.")
            return []

        recommended_news_ids = set()
        for article in valid_clicked:
            article_index = self.train_news[self.train_news["newid"] == article].index[0]
            sim_scores = sorted(enumerate(self.similarity_matrix[:, article_index]), key=lambda x: x[1], reverse=True)
            recommended_news_ids.update([self.valid_news.iloc[i[0]]['newid'] for i in sim_scores[:N]])

        return list(recommended_news_ids)[:N]

# ‚úÖ Load validation behaviors
valid_behaviors_df = pd.read_table(
    os.path.join(data_path, 'valid', 'behaviors.tsv'),
    names=['impression_id', 'user_id', 'time', 'history', 'impressions']
)

# ‚úÖ Initialize Recommender
recommender = ContentBasedRecommender(train_news, valid_news, similarity_matrix, clusters, valid_behaviors_df)

# ‚úÖ Evaluate
precision, recall, ndcg = evaluate_model(recommender, valid_behaviors_df, 10)

# ‚úÖ Print Results
print(f"\nüîπ **Final Evaluation Results:**")
print(f"üìå Precision@10: {precision:.4f}")
print(f"üìå Recall@10: {recall:.4f}")
print(f"üìå NDCG@10: {ndcg:.4f}")



 User U86141
    Clicked: {'N44621'}
    Recommended: []
   üéØ Matches: set()
‚ùå No matches found for this user!
‚ùå No matching clicked articles found for User U87658.

KeyboardInterrupt: 

In [None]:
class DummyRecommender:
    def recommend(self, user_id, N=5):
        """Always returns the actual clicked articles (perfect recommender)."""
        user_row = valid_behaviors_df[valid_behaviors_df["user_id"] == user_id]
        if user_row.empty or pd.isna(user_row.iloc[0]["impressions"]):
            return []
        actual_clicked = {item.split("-")[0] for item in user_row.iloc[0]["impressions"].split() if item.endswith("-1")}
        return list(actual_clicked)[:N]

# ‚úÖ Evaluate Dummy Model
dummy_recommender = DummyRecommender()
precision, recall, ndcg = evaluate_model(dummy_recommender, valid_behaviors_df, 5)

print(f"\n **Dummy Model (Perfect Recommendations) Results:**")
print(f" Precision@10: {precision:.4f}")
print(f" Recall@10: {recall:.4f}")
print(f" NDCG@10: {ndcg:.4f}")


In [None]:
tmpdir.cleanup()

## References

\[1\] Wu, Fangzhao, et al. "MIND: A Large-scale Dataset for News Recommendation" Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. https://msnews.github.io/competition.html <br>