<i>Copyright (c) Recommenders contributors.</i>

<i>Licensed under the MIT License.</i>

#  MIND Utils Generation

MIND dataset\[1\] is a large-scale English news dataset. It was collected from anonymized behavior logs of Microsoft News website. MIND contains 1,000,000 users, 161,013 news articles and 15,777,377 impression logs. Every news article contains rich textual content including title, abstract, body, category and entities. Each impression log contains the click events, non-clicked events and historical news click behaviors of this user before this impression.

Many news recommendation methods use word embeddings, news vertical embeddings, news subvertical embeddings and user id embedding. Therefore, it is necessary to generate a word dictionary, a vertical dictionary, a subvertical dictionary and a `userid` dictionary to convert words, news verticals, subverticals and user ids from strings to indexes. To use the pretrain word embedding, an embedding matrix is generated as the initial weight of the word embedding layer.

This notebook gives examples about how to generate:
* `word_dict.pkl`: convert the words in news titles into indexes.
* `word_dict_all.pkl`: convert the words in news titles and abstracts into indexes.
* `embedding.npy`: pretrained word embedding matrix of words in word_dict.pkl
* `embedding_all.npy`: pretrained embedding matrix of words in word_dict_all.pkl
* `vert_dict.pkl`: convert news verticals into indexes.
* `subvert_dict.pkl`: convert news subverticals into indexes.
* `uid2index.pkl`: convert user ids into indexes.

In [1]:
import os
import sys
import numpy as np
import pandas as pd
from tqdm import tqdm
import pickle
from collections import Counter
from tempfile import TemporaryDirectory

from recommenders.datasets.mind import (download_mind,
                                     extract_mind,
                                     download_and_extract_glove,
                                     load_glove_matrix,
                                     word_tokenize
                                    )
from recommenders.datasets.download_utils import unzip_file
from recommenders.utils.notebook_utils import store_metadata

print("System version: {}".format(sys.version))


System version: 3.12.4 (tags/v3.12.4:8e8a4ba, Jun  6 2024, 19:30:16) [MSC v.1940 64 bit (AMD64)]


In [2]:
# MIND sizes: "demo", "small" or "large"
mind_type="small" 
# word_embedding_dim should be in [50, 100, 200, 300]
word_embedding_dim = 300

In [3]:
tmpdir = TemporaryDirectory()
data_path = tmpdir.name
train_zip, valid_zip = download_mind(size=mind_type, dest_path=data_path)
unzip_file(train_zip, os.path.join(data_path, 'train'), clean_zip_file=False)
unzip_file(valid_zip, os.path.join(data_path, 'valid'), clean_zip_file=False)
output_path = os.path.join(data_path, 'utils')
os.makedirs(output_path, exist_ok=True)

100%|██████████| 51.8k/51.8k [00:05<00:00, 9.87kKB/s]
100%|██████████| 30.2k/30.2k [00:03<00:00, 9.02kKB/s]


## Prepare utils of news

* word dictionary
* vertical dictionary
* subvetical dictionary

In [4]:
news = pd.read_table(os.path.join(data_path, 'train', 'news.tsv'),
                     names=['newid', 'vertical', 'subvertical', 'title',
                            'abstract', 'url', 'entities in title', 'entities in abstract'],
                     usecols = ['newid','vertical', 'subvertical', 'title', 'abstract'])

print(len(news))


51282


In [5]:
news.head()

Unnamed: 0,newid,vertical,subvertical,title,abstract
0,N55528,lifestyle,lifestyleroyals,"The Brands Queen Elizabeth, Prince Charles, an...","Shop the notebooks, jackets, and more that the..."
1,N19639,health,weightloss,50 Worst Habits For Belly Fat,These seemingly harmless habits are holding yo...
2,N61837,news,newsworld,The Cost of Trump's Aid Freeze in the Trenches...,Lt. Ivan Molchanets peeked over a parapet of s...
3,N53526,health,voices,I Was An NBA Wife. Here's How It Affected My M...,"I felt like I was a fraud, and being an NBA wi..."
4,N38324,health,medical,"How to Get Rid of Skin Tags, According to a De...","They seem harmless, but there's a very good re..."


In [6]:
news_vertical = news.vertical.drop_duplicates().reset_index(drop=True)
vert_dict_inv = news_vertical.to_dict()
vert_dict = {v: k+1 for k, v in vert_dict_inv.items()}

news_subvertical = news.subvertical.drop_duplicates().reset_index(drop=True)
subvert_dict_inv = news_subvertical.to_dict()
subvert_dict = {v: k+1 for k, v in vert_dict_inv.items()}

In [7]:
news.title = news.title.apply(word_tokenize)
news.abstract = news.abstract.apply(word_tokenize)

In [8]:
word_cnt = Counter()
word_cnt_all = Counter()

for i in tqdm(range(len(news))):
    word_cnt.update(news.loc[i]['title'])
    word_cnt_all.update(news.loc[i]['title'])
    word_cnt_all.update(news.loc[i]['abstract'])

100%|██████████| 51282/51282 [00:01<00:00, 25974.08it/s]


In [9]:
word_dict = {k: v+1 for k, v in zip(word_cnt, range(len(word_cnt)))}
word_dict_all = {k: v+1 for k, v in zip(word_cnt_all, range(len(word_cnt_all)))}

In [10]:
with open(os.path.join(output_path, 'vert_dict.pkl'), 'wb') as f:
    pickle.dump(vert_dict, f)
    
with open(os.path.join(output_path, 'subvert_dict.pkl'), 'wb') as f:
    pickle.dump(subvert_dict, f)

with open(os.path.join(output_path, 'word_dict.pkl'), 'wb') as f:
    pickle.dump(word_dict, f)
    
with open(os.path.join(output_path, 'word_dict_all.pkl'), 'wb') as f:
    pickle.dump(word_dict_all, f)

## Prepare embedding matrixs
* embedding.npy
* embedding_all.npy

In [None]:
glove_path = download_and_extract_glove(data_path)

100%|██████████| 842k/842k [01:14<00:00, 11.3kKB/s] 


In [None]:
embedding_matrix, exist_word = load_glove_matrix(glove_path, word_dict, word_embedding_dim)
embedding_all_matrix, exist_all_word = load_glove_matrix(glove_path, word_dict_all, word_embedding_dim)

400001it [00:05, 71162.62it/s]
400001it [00:05, 76498.90it/s] 


In [None]:
np.save(os.path.join(output_path, 'embedding.npy'), embedding_matrix)
np.save(os.path.join(output_path, 'embedding_all.npy'), embedding_all_matrix)

## Prepare uid2index.pkl

In [None]:
uid2index = {}

with open(os.path.join(data_path, 'train', 'behaviors.tsv'), 'r') as f:
    for l in tqdm(f):
        uid = l.strip('\n').split('\t')[1]
        if uid not in uid2index:
            uid2index[uid] = len(uid2index) + 1

156965it [00:00, 658058.85it/s]


In [None]:
with open(os.path.join(output_path, 'uid2index.pkl'), 'wb') as f:
    pickle.dump(uid2index, f)

In [None]:
utils_state = {
    'vert_num': len(vert_dict),
    'subvert_num': len(subvert_dict),
    'word_num': len(word_dict),
    'word_num_all': len(word_dict_all),
    'embedding_exist_num': len(exist_word),
    'embedding_exist_num_all': len(exist_all_word),
    'uid2index': len(uid2index)
}
utils_state

{'vert_num': 17,
 'subvert_num': 17,
 'word_num': 31029,
 'word_num_all': 55028,
 'embedding_exist_num': 29081,
 'embedding_exist_num_all': 48422,
 'uid2index': 50000}

In [None]:
# Record results for tests - ignore this cell
store_metadata("vert_num", len(vert_dict))
store_metadata("subvert_num", len(subvert_dict))
store_metadata("word_num", len(word_dict))
store_metadata("word_num_all", len(word_dict_all))
store_metadata("embedding_exist_num", len(exist_word))
store_metadata("embedding_exist_num_all", len(exist_all_word))
store_metadata("uid2index", len(uid2index))

## Content based filtering      

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
import numpy as np

print(len(news))
# Load news dataset
news['combined_text'] = news['vertical'] + ' ' + news['subvertical'] + ' ' + \
                        news['title'].apply(lambda x: ' '.join(x) if isinstance(x, list) else x) + ' ' + \
                        news['abstract'].apply(lambda x: ' '.join(x) if isinstance(x, list) else x)


# Initialize TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)

# Compute TF-IDF matrix
tfidf_matrix = tfidf_vectorizer.fit_transform(news['combined_text'])

# Compute cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

# Function to get top N recommendations
def get_recommendations(article_index, top_n=5):
    """Returns top-N most similar news articles based on content similarity, formatted properly."""
    sim_scores = list(enumerate(cosine_sim[article_index]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[1:top_n+1]  # Exclude self
    
    recommended_articles = []
    for i in sim_scores:
        news_id = news.iloc[i[0]]['newid']  # Get news ID
        title = news.iloc[i[0]]['title']
        genre = news.iloc[i[0]]['vertical']  # Get genre
        subgenre = news.iloc[i[0]]['subvertical']  # Get subgenre
        
        # If title is a list of words, join into a readable string
        if isinstance(title, list):
            title = ' '.join(title)
        
        recommended_articles.append(f"{news_id}: {title} (Genre: {genre}, Subgenre: {subgenre})")
    
    return recommended_articles

# Example: Get recommendations for first article
recommended_articles = get_recommendations(0, top_n=15)

# Pretty print the results
print("Recommended Articles:\n")
for idx, article in enumerate(recommended_articles, start=1):
    print(f"{idx}. {article}")



51282
Recommended Articles:

1. N9056: this is what queen elizabeth is doing about the prince william prince harry feud (Genre: lifestyle, Subgenre: lifestyleroyals)
2. N60671: prince charles teared up when prince william talked about succeeding him (Genre: lifestyle, Subgenre: lifestyleroyals)
3. N38133: the cutest photos of royal children and their beloved nannies from prince george to the queen (Genre: lifestyle, Subgenre: lifestyleroyals)
4. N18530: all the photos of prince charles s trip to japan for emperor naruhito s enthronement ceremony (Genre: lifestyle, Subgenre: lifestyleroyals)
5. N63823: prince charles hit by one of the most incredible art hoaxes in royal history (Genre: lifestyle, Subgenre: lifestyleroyals)
6. N63174: prince albert on twins jacques and gabriella they re starting to say , are we there yet ? (Genre: lifestyle, Subgenre: lifestyleroyals)
7. N63495: 65 photos of prince charles you ve probably never seen before (Genre: lifestyle, Subgenre: lifestyleroyals)
8.

## Content based filtering validation


In [None]:
valid_news = pd.read_table(
    os.path.join(data_path, 'valid', 'news.tsv'),
    names=['newid', 'vertical', 'subvertical', 'title', 'abstract', 'url', 'entities in title', 'entities in abstract'],
    usecols=['newid', 'vertical', 'subvertical', 'title', 'abstract']
)



article_index = 5  # Choose a random validation article

print("\nTesting on Validation Data:")
print(f"ID: {valid_news.iloc[article_index]['newid']}")
print(f"Title: {valid_news.iloc[article_index]['title']}")
print(f"Genre: {valid_news.iloc[article_index]['vertical']}")
print(f"Subgenre: {valid_news.iloc[article_index]['subvertical']}\n")

# Get recommendations based on the validation article
recommended_articles = get_recommendations(article_index, top_n=15)

print("Recommended Articles:\n")
for idx, article in enumerate(recommended_articles, start=1):
    print(f"{idx}. {article}")



Testing on Validation Data:
ID: N2073
Title: Should NFL be able to fine players for criticizing officiating?
Genre: sports
Subgenre: football_nfl

Recommended Articles:

1. N61576: nfl fines baker mayfield for stating the obvious (Genre: sports, Subgenre: football_nfl)
2. N29891: nfl officiating stinks . here are 10 ways to fix it . (Genre: sports, Subgenre: football_nfl)
3. N46662: nfl cheerleaders (Genre: sports, Subgenre: football_nfl)
4. N3314: 5 nfl breakout players of 2019 (Genre: sports, Subgenre: football_nfl)
5. N51783: retired eagles de chris long calls officiating a mess , says nfl needs to do something (Genre: sports, Subgenre: football_nfl)
6. N36282: nfl sending message with multiple fines for criticizing referees (Genre: sports, Subgenre: football_nfl)
7. N12200: teams with most and fewest in state players (Genre: sports, Subgenre: football_ncaa)
8. N33164: 100 famous nfl players who played for teams you forgot about (Genre: sports, Subgenre: football_nfl)
9. N43525: nf

In [None]:
# Load validation impressions (assuming they exist)
valid_behaviors = pd.read_table(
    os.path.join(data_path, 'valid', 'behaviors.tsv'),
    names=['impression_id', 'user_id', 'time', 'history', 'impressions']
)

# Extract a sample user's history
sample_user = valid_behaviors.iloc[0]

print(f"User {sample_user['user_id']} previously read:")
print(sample_user['history'])

print("\nRecommended articles:")
recommended_articles = get_recommendations(0, top_n=5)
for idx, article in enumerate(recommended_articles, start=1):
    print(f"{idx}. {article}")


User U80234 previously read:
N55189 N46039 N51741 N53234 N11276 N264 N40716 N28088 N43955 N6616 N47686 N63573 N38895 N30924 N35671

Recommended articles:
1. N9056: this is what queen elizabeth is doing about the prince william prince harry feud (Genre: lifestyle, Subgenre: lifestyleroyals)
2. N60671: prince charles teared up when prince william talked about succeeding him (Genre: lifestyle, Subgenre: lifestyleroyals)
3. N38133: the cutest photos of royal children and their beloved nannies from prince george to the queen (Genre: lifestyle, Subgenre: lifestyleroyals)
4. N18530: all the photos of prince charles s trip to japan for emperor naruhito s enthronement ceremony (Genre: lifestyle, Subgenre: lifestyleroyals)
5. N63823: prince charles hit by one of the most incredible art hoaxes in royal history (Genre: lifestyle, Subgenre: lifestyleroyals)


In [None]:

#tmpdir.cleanup()

## References

\[1\] Wu, Fangzhao, et al. "MIND: A Large-scale Dataset for News Recommendation" Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. https://msnews.github.io/competition.html <br>