# MIND: A Large-scale Dataset for News Recommendation

## COMP9727 Project Implementation and Evaluation

Linbo Zhang z5352294, Jinghan Wang z5286124, Junyu Li z5467278


## Problem

The project aims to implement and evaluate a news recommendation system based on the MIND dataset. The MIND dataset is a large-scale dataset for news recommendation, which was released by Microsoft. The dataset contains news articles and user behaviors (clicks). The goal of the project is to build a news recommendation system that can recommend news articles to users based on their historical behaviors.

## Dataset

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df_behaviors = pd.read_csv('MINDlarge_train/behaviors.tsv', sep='\t', header=None)
columns = ["Impression_ID", "User_ID", "Time", "History", "Impressions"]
df_behaviors.columns = columns

In [None]:
print(df_behaviors.shape)
df_behaviors.head()

In [None]:
# the timestamps are in MM/DD/YYYY HH:MM:SS AM/PM format, so convert them to datetime objects
df_behaviors['Time'] = pd.to_datetime(df_behaviors['Time'], format='%m/%d/%Y %I:%M:%S %p')
df_behaviors['Time'].describe()

There are 5 columns in total. We can ignore the first column, `Impression_ID`.

- `User_ID`: the unique identifier of the user.
- `Time`: the timestamp of the impression.
- `History`: the news click history of the user before this impression, stored as a single string of news IDs separated by spaces.
- `Impressions`: the news shown to the user in this impression. Each entry has the form `News_ID-<click status>`, where 1 means the user clicked on the news and 0 means the user did not. Entries are separated by spaces.
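
As a minimal sketch of the `Impressions` format described above, the snippet below splits one impression string into `(news_id, clicked)` pairs. The sample string and the helper name `parse_impressions` are illustrative assumptions, not values taken from the dataset.

```python
def parse_impressions(impressions):
    """Split a space-separated impression string into (news_id, clicked) pairs."""
    pairs = []
    for item in impressions.split():
        # rsplit keeps the news ID intact even if it ever contained a hyphen
        news_id, click = item.rsplit("-", 1)
        pairs.append((news_id, int(click)))
    return pairs


# hypothetical impression string in the `News_ID-<click status>` format
sample = "N1001-1 N1002-0 N1003-0"
pairs = parse_impressions(sample)
clicked = [news_id for news_id, click in pairs if click == 1]
print(pairs)
print(clicked)
```

The same split is applied to the whole column later in the notebook, once the behaviors are exploded into one row per impressed news.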

Check how many NA values each column has.

In [None]:
# check NA
df_behaviors.isna().sum()

Among all the 2232748 rows, there are 46065 with empty history.

In [None]:
# check how many unique user Id
print(df_behaviors['User_ID'].nunique())

There are 2,232,748 records in total, i.e. more than 2.2 million, from 711,222 unique users.

So many users have multiple records.

Looking back at the 46,065 records with an empty history, we could therefore replace the NA values with the closest available history of the same user.
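
One way such a replacement could be done is a per-user forward and backward fill after sorting by time. The sketch below uses a small hypothetical frame with the same column names; whether borrowing the history from a later impression is acceptable is a modeling choice that is left open here.

```python
import numpy as np
import pandas as pd

# hypothetical toy frame with the same columns as behaviors.tsv
toy = pd.DataFrame({
    "User_ID": ["U1", "U1", "U2", "U2", "U3"],
    "Time": pd.to_datetime([
        "2019-11-09 08:00", "2019-11-10 08:00",
        "2019-11-09 09:00", "2019-11-11 09:00",
        "2019-11-12 10:00",
    ]),
    "History": ["N1 N2", np.nan, np.nan, "N3", np.nan],
})

toy = toy.sort_values(["User_ID", "Time"])
# forward fill takes the most recent earlier history of the same user
toy["History_Filled"] = toy.groupby("User_ID")["History"].ffill()
# backward fill then covers users whose first record is the empty one
toy["History_Filled"] = toy.groupby("User_ID")["History_Filled"].bfill()
print(toy)
```

Users with no history in any record (such as `U3` here) stay empty and still need a fallback, for example dropping the row or using an empty string.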

According to the dataset description, they sampled 1 million users who had at least 5 news clicks during 6 weeks from October 12 to November 22, in year 2019.

The click behaviors in the first 4 weeks are used to construct the news click history for user modeling.

The samples from the last day of the fifth week form the validation set.

So we can check the date range using the `Time` column.

In [None]:
# check date range using the Time column
print(df_behaviors['Time'].min())
print(df_behaviors['Time'].max())

In [None]:
# here we also read the val behaviors and test behaviors files to compare the time range
df_behaviors_val = pd.read_csv('MINDlarge_dev/behaviors.tsv', sep='\t', header=None)
df_behaviors_val.columns = columns

df_behaviors_test = pd.read_csv('MINDlarge_test/behaviors.tsv', sep='\t', header=None)
df_behaviors_test.columns = columns

# convert the time for the val and test set
df_behaviors_val['Time'] = pd.to_datetime(df_behaviors_val['Time'], format='%m/%d/%Y %I:%M:%S %p')
df_behaviors_test['Time'] = pd.to_datetime(df_behaviors_test['Time'], format='%m/%d/%Y %I:%M:%S %p')

print("Validation set")
print(df_behaviors_val['Time'].min())
print(df_behaviors_val['Time'].max())

print("Test set")
print(df_behaviors_test['Time'].min())
print(df_behaviors_test['Time'].max())

Let's summarize the date range for the 3 sets

| Set | Min Date | Max Date |
| --- | --- | --- |
| Train | 2019-11-09 00:00:00 | 2019-11-14 23:59:59 |
| Val | 2019-11-15 00:00:00 | 2019-11-15 23:59:43 |
| Test | 2019-11-16 00:00:00 | 2019-11-22 23:59:58 |

Hence the training data spans 6 days, and the data from the 7th day is used as the validation set.

The following week's data is used as the test set.

Note that this is time-series data: the three sets are split chronologically.
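
Because the split is chronological, it is worth confirming that no set overlaps the next one in time. The sketch below is a small check that uses only the dates from the summary table; the variable names are illustrative.

```python
from datetime import datetime

# date ranges copied from the summary table above
ranges = {
    "train": ("2019-11-09 00:00:00", "2019-11-14 23:59:59"),
    "val": ("2019-11-15 00:00:00", "2019-11-15 23:59:43"),
    "test": ("2019-11-16 00:00:00", "2019-11-22 23:59:58"),
}
parsed = {
    name: tuple(datetime.strptime(ts, "%Y-%m-%d %H:%M:%S") for ts in span)
    for name, span in ranges.items()
}

# each split must end before the next one starts
no_overlap = (parsed["train"][1] < parsed["val"][0]
              and parsed["val"][1] < parsed["test"][0])
print(no_overlap)

# number of calendar days covered by each split
days = {name: (end.date() - start.date()).days + 1
        for name, (start, end) in parsed.items()}
print(days)
```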

Now we look further into the history.

In [None]:
# print the first row history, and first row impression
print(df_behaviors['History'][0])
print()
print(df_behaviors['Impressions'][0])

The history is a string of news IDs.

And the `Impressions` list is also a list of news IDs. Each news in the impression column also comes with the click status.

In [None]:
# split the history, check how many items in the history, (the average number)
# and split the impressions, check how many items in the impressions, (the average number)
df_behaviors['History_Length'] = df_behaviors['History'].apply(lambda x: len(x.split()) if type(x) == str else np.nan)
df_behaviors['Impressions_Length'] = df_behaviors['Impressions'].apply(lambda x: len(x.split()) if type(x) == str else np.nan)

print(df_behaviors['History_Length'].mean())
print(df_behaviors['Impressions_Length'].mean())

On average, the user is exposed to 33.67 news in the history, then the impression contains on average 37.40 news.

This is close to the experience of opening a news website or news app.

Now we look at the news data. It is a separate tsv file.

In [None]:
df_news = pd.read_csv('MINDlarge_train/news.tsv', sep='\t', header=None)
columns = ["News_ID", "Category", "Subcategory", "Title", "Abstract", "URL", "Title_Entities", "Abstract_Entities"]
df_news.columns = columns

print(df_news.shape)
df_news.head()

There are 101,527 news articles in total (over 100 thousand).

There are 8 columns in the news data.

- News_ID: the unique identifier for each news
- Category
- Subcategory
- Title
- Abstract
- URL: the news URL. Since the dataset is from 2019, most URLs can no longer be opened. Many earlier models were trained on the full news content, but we will not do that because the content is no longer available.
- Title Entities: entities contained in the title, it includes label, type, wikidataId, confidence, occurrenceOffsets, SurfaceForms. These are gathered by the dataset provider.
- Abstract Entities: similar to the title entities.

Now we look at how many cells are NA in the news data.

In [None]:
# check NA
df_news.isna().sum()

Among the 101,527 news articles, 5,415 do not have an abstract, and only a few are missing the title entities or abstract entities.

We check if the news ID is all unique.

In [None]:
# a non-zero count means some news IDs appear more than once
print(df_news['News_ID'].duplicated().sum())

So there are 18 duplicated news IDs. That will be removed later.

Now we do some analysis on the length of the title and abstract, as the user only sees the title and abstract, before deciding to click on the news.

In [None]:
# for the Title, count the number of words
# skip NaN values
df_news['Title_Length'] = df_news['Title'].apply(lambda x: len(x.split()) if type(x) == str else 0)

# for the abstract length, skip NaN values
df_news['Abstract_Length'] = df_news['Abstract'].apply(lambda x: len(x.split()) if type(x) == str else 0)

In [None]:
title_length_count = df_news['Title_Length'].value_counts().sort_index()
total_titles = title_length_count.sum()
title_length_percentage = title_length_count / total_titles * 100

# print the average title length and mode
print("mean value: ", df_news['Title_Length'].mean())
print("mode value: ", df_news['Title_Length'].mode()[0])

# plot
plt.figure(figsize=(12, 6))
title_length_percentage.plot(kind='bar')
plt.xlabel('Title Length (Number of Words)')
plt.ylabel('Percentage (%)')
plt.title('Frequency Percentage Plot of Title Lengths on Training Set')

# rotate the xtick labels by 45 degrees
plt.xticks(rotation=45)
plt.show()

The mean length is 10.7 words and the mode is 10 words. This suggests that news titles are kept short, at around 10 words.

In [None]:
# same for the abstract
abstract_length_count = df_news['Abstract_Length'].value_counts().sort_index()
total_abstracts = abstract_length_count.sum()
abstract_length_percentage = abstract_length_count / total_abstracts * 100

# print the mean
print("mean value: ", df_news['Abstract_Length'].mean())

# plot
plt.figure(figsize=(15, 6))
abstract_length_percentage.plot(kind='bar')
plt.xlabel('Abstract Length (Number of Words)')
plt.ylabel('Percentage (%)')
plt.title('Frequency Plot of Abstract Lengths on Training Set')

# reduce the number of xticks
step = 20  # Choose an appropriate step size
plt.xticks(ticks=range(0, abstract_length_percentage.index.max() + 1, step))

# rotate the xtick labels by 45 degrees
plt.xticks(rotation=45)

# set the xlim between 0 and 140
plt.xlim(0, 140)


plt.show()

The situation is different for the abstract.

The abstract is longer, with a mean of 36.24 words.

And we see there are two modes, one is around 20, and the other one is around 70. That means the abstract can be short or long.

Now we check one interesting concept in the news data, the survival time. It is defined as the time interval between the first appearance and the last appearance of a news article. The value calculated from the dataset can be inaccurate since we only have the impression times, but it is a reasonable estimate.

In [None]:
# check the survival time for each news
# defined using the time interval between its first and last appearance time in the dataset

# take the time and impressions columns
df_survival = df_behaviors[['Time', 'Impressions']]

news_with_time = []

# for each row in the df_survival, split the impressions. and remove -0 or -1 at the end
# then insert a tuple of the (news id, time) into the news_with_time list
for idx, row in df_survival.iterrows():
    dt = row['Time']
    for impression in row['Impressions'].split():
        news_id, click = impression.split('-')
        news_with_time.append((news_id, dt))

# convert the news_with_time list into a dataframe
df_survival = pd.DataFrame(news_with_time, columns=['News_ID', 'Time'])

# group by the News_ID and get the earliest and latest time
df_survival_grouped = df_survival.groupby('News_ID')['Time'].agg(["min", "max"]).reset_index()

# get the difference between the two
df_survival_grouped['Survival_Time'] = df_survival_grouped["max"] - df_survival_grouped["min"]

In [None]:
# plot a histogram of the survival time
plt.figure(figsize=(12, 6))

# convert to days, but in float
plt.hist(df_survival_grouped['Survival_Time'].dt.total_seconds() / 3600 / 24, bins=50)
plt.xlabel('Survival Time (Days)')
plt.ylabel('Frequency')
plt.title('Histogram of Survival Time of News Articles in Training Set')
plt.show()

So most news have a survival time of less than 1 day.

That means most news are short-lived.

The longest survival time is 6 days. This is an underestimate for such news, because the training set only covers 6 days and the true survival time can be longer.
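
The row-by-row loop above can be slow on 2.2 million records. A vectorized alternative, sketched here on a small hypothetical frame with the same two columns, explodes the impressions first and then groups by news ID. It is meant as an illustration of the same definition, not as a measured speed-up.

```python
import pandas as pd

# hypothetical toy behaviors frame with the same two columns used above
toy = pd.DataFrame({
    "Time": pd.to_datetime(["2019-11-09 08:00", "2019-11-09 20:00", "2019-11-11 08:00"]),
    "Impressions": ["N1-0 N2-1", "N1-1 N3-0", "N2-0"],
})

# one row per (time, impression item), then strip the click suffix
exploded = toy.assign(Impressions=toy["Impressions"].str.split()).explode("Impressions")
exploded["News_ID"] = exploded["Impressions"].str.rsplit("-", n=1).str[0]

# survival time = last appearance minus first appearance
span = exploded.groupby("News_ID")["Time"].agg(["min", "max"])
span["Survival_Time"] = span["max"] - span["min"]
print(span)
```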

Now we plot to see the distribution of category.

In [None]:
# use the df_news["Category"] column to plot a pie chart
category_count = df_news['Category'].value_counts()
total_news = category_count.sum()
category_percentage = category_count / total_news * 100

# plot
plt.figure(figsize=(15, 6))
patches, texts = plt.pie(category_count, startangle=90, counterclock=False)

# do the labels separately
labels_with_percentage = [f'{label} ({percentage:.2f}%)' for label, percentage in zip(category_count.index, category_percentage)]
plt.legend(patches, labels_with_percentage, loc='center left', bbox_to_anchor=(1, 0.5))

plt.title('Pie Chart of News Categories on Training Set')
plt.ylabel('')
plt.show()


So the top 2 popular categories are sports and news.

Finally, we look at the validation set and test set, and check for NA values.

In [None]:
behavior_columns = ["Impression_ID", "User_ID", "Time", "History", "Impressions"]
news_columns = ["News_ID", "Category", "Subcategory", "Title", "Abstract", "URL", "Title_Entities", "Abstract_Entities"]

In [None]:
# check the shape of the validation set and the test set
df_behaviors_val = pd.read_csv('MINDlarge_dev/behaviors.tsv', sep='\t', header=None)
df_behaviors_val.columns = behavior_columns

print("Validation Set behaviors.tsv")
print(df_behaviors_val.shape)
print(df_behaviors_val.isna().sum())
print()

df_news_val = pd.read_csv('MINDlarge_dev/news.tsv', sep='\t', header=None)
df_news_val.columns = news_columns

print("Validation Set news.tsv")
print(df_news_val.shape)
print(df_news_val.isna().sum())

In [None]:
# for the test set
df_behaviors_test = pd.read_csv('MINDlarge_test/behaviors.tsv', sep='\t', header=None)
df_behaviors_test.columns = behavior_columns

print("Test Set behaviors.tsv")
print(df_behaviors_test.shape)
print(df_behaviors_test.isna().sum())
print()

df_news_test = pd.read_csv('MINDlarge_test/news.tsv', sep='\t', header=None)
df_news_test.columns = news_columns

print("Test Set news.tsv")
print(df_news_test.shape)
print(df_news_test.isna().sum())

The NA pattern is similar. Some `History` cells in behaviors.tsv are empty, and some `Abstract` cells in news.tsv are empty.

A small number of URL, title_entities, and abstract_entities cells are also empty in news.tsv.

## Methods: Use the MINDsmall to train and test the model

The original dataset is too large. And our computer resources are very limited.

So we will use the MINDsmall dataset.

Unfortunately, we only have the train version of the small dataset. The small validation set cannot be downloaded.

So we will use the user_id inside the train dataset to extract the relevant data from the original dataset.

In [None]:
import pandas as pd
import numpy as np

df_behaviors_small = pd.read_csv('MINDsmall_train/behaviors.tsv', sep='\t', header=None)
df_behaviors_large = pd.read_csv('MINDlarge_train/behaviors.tsv', sep='\t', header=None)

df_behaviors_small.columns = ["Impression_ID", "User_ID", "Time", "History", "Impressions"]
df_behaviors_large.columns = ["Impression_ID", "User_ID", "Time", "History", "Impressions"]

# check the shape of the two dataframes
print(df_behaviors_small.shape)
print(df_behaviors_large.shape)

In [None]:
# check the number of unique people inside the two dataframes
print(df_behaviors_small['User_ID'].nunique())
print(df_behaviors_large['User_ID'].nunique())

The small training set is a subset of the large set. It contains only 50,000 users.

156965/2232748 = 7%

In [None]:
# open the large dev and test set behaviors.tsv file
# and check the number of unique people inside the two dataframes
df_behaviors_val = pd.read_csv('MINDlarge_dev/behaviors.tsv', sep='\t', header=None)
df_behaviors_test = pd.read_csv('MINDlarge_test/behaviors.tsv', sep='\t', header=None)

df_behaviors_val.columns = ["Impression_ID", "User_ID", "Time", "History", "Impressions"]
df_behaviors_test.columns = ["Impression_ID", "User_ID", "Time", "History", "Impressions"]

print(df_behaviors_val['User_ID'].nunique())
print(df_behaviors_test['User_ID'].nunique())

In [None]:
# keep only the same 50000 users in the large dataset
keep_user_ids = df_behaviors_small['User_ID'].unique()
df_behaviors_val = df_behaviors_val[df_behaviors_val['User_ID'].isin(keep_user_ids)]
df_behaviors_test = df_behaviors_test[df_behaviors_test['User_ID'].isin(keep_user_ids)]

# check the shape of the two dataframes
print(df_behaviors_val.shape)
print(df_behaviors_test.shape)

If we keep the same 50,000 users in both the validation and test sets:

- validation: 22,880 of 255,990 records remain, which is 8.9%
- test: 140,593 of 702,005 records remain, which is 20%

Judging by these percentages, the selected 50,000 users account for a notably larger share of the test set than of the validation set.

Now we save the filtered dataframe into the folder. And we manually move all other files into the folder as well.

So we have `MINDsmall_train`, `MINDsmall_dev`, `MINDsmall_test`.

In [None]:
# output the two files into MINDsmall_dev and MINDsmall_test
import os
if not os.path.exists('MINDsmall_dev'):
    os.makedirs('MINDsmall_dev')

if not os.path.exists('MINDsmall_test'):
    os.makedirs('MINDsmall_test')

df_behaviors_val.to_csv('MINDsmall_dev/behaviors.tsv', sep='\t', index=False, header=False)
df_behaviors_test.to_csv('MINDsmall_test/behaviors.tsv', sep='\t', index=False, header=False)

# and also copy all the remaining files into the folder.

## Method: Prepare the two datasets

1. Load the news.tsv file, replace NA with empty strings, and remove the duplicated news IDs.
2. Concatenate the title, abstract, category, and subcategory into a single string. Lowercase the new string column, and apply TF-IDF to convert the text into vectors.
3. Create a dictionary of the {news_id: vector} for the news data
4. Load the behaviors.tsv file, and replace NA with empty strings.
5. Split the impressions column via space, and explode the column into multiple rows.
6. Further split the impressions into news_id and click status. Convert the click status into integers.
7. For the history column, based on the dictionary in step 3, convert the history into a vector by taking the mean of the vectors of its news.
8. Convert the news_id into vector as well.
9. Now we have the user_id, history vector, news vector, and the click status.
10. Use label encoder to fit and transform on the user_id. And another label encoder to fit and transform on the news_id.
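
Steps 2, 3, and 7 can be sketched on a toy example before running them on the real files. The three news texts, the `max_features=20` setting, and the helper name `history_vector` below are illustrative assumptions; the notebook itself uses the real news.tsv and a much longer vector.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# hypothetical toy news texts (title + abstract + category), already lowercased
news_text = {
    "N1": "team wins championship final sports",
    "N2": "stock market falls sharply finance",
    "N3": "coach praises team after final sports",
}

# step 2: TF-IDF vectors for the news texts
vectorizer = TfidfVectorizer(stop_words="english", max_features=20)
matrix = vectorizer.fit_transform(list(news_text.values())).toarray()

# step 3: dictionary {news_id: vector}
news_vectors = dict(zip(news_text.keys(), matrix))
dim = matrix.shape[1]


def history_vector(history):
    """Step 7: mean TF-IDF vector of the news IDs in a space-separated history."""
    vectors = [news_vectors[n] for n in history.split() if n in news_vectors]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)


user_vec = history_vector("N1 N3")
print(user_vec.shape)
# unknown IDs and an empty history fall back to the zero vector
print(history_vector("N999").sum())
```

Averaging keeps the history vector the same length as a single news vector, at the cost of discarding the order in which the news were read.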

The following code goes through these steps one by one. At the end, we wrap them in a function.

In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

In [None]:
TF_VECTOR_LENGTH = 5000

behavior_columns = ["Impression_ID", "User_ID", "Time", "History", "Impressions"]
news_columns = ["News_ID", "Category", "Subcategory", "Title", "Abstract", "URL", "Title_Entities", "Abstract_Entities"]

In [None]:
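# read the news.tsv first, replace NA with empty strings, and drop duplicated news IDs
df_news = pd.read_csv('MINDsmall_train/news.tsv', sep='\t', header=None)
df_news.columns = news_columns
df_news.fillna("", inplace=True)
df_news = df_news.drop_duplicates(subset='News_ID').reset_index(drop=True)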

# concat the title, abstract, category, subcategory into a single string
df_news['Text'] = df_news['Title'] + ' ' + df_news['Abstract'] + ' ' + df_news['Category'] + ' ' + df_news['Subcategory']
df_news['Text'] = df_news['Text'].str.lower()

# initialize the TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english', max_features=TF_VECTOR_LENGTH)

# fit the TfidfVectorizer on the text, and save the result into a new column
df_news['tfidf'] = list(tfidf.fit_transform(df_news['Text']).toarray())

# check the shape
print(df_news.shape)
df_news.head()

In [None]:
# obtain a tfidf vector
v = df_news['tfidf'][0]
print(np.shape(v))
print(v)

# check unique values from the v
print(np.unique(v))

So each vector has a fixed number of features and is saved in the tfidf column. From the news file we therefore only need to keep the News_ID and tfidf columns.

In [None]:
df_news = df_news[['News_ID', 'tfidf']]
print(df_news.shape)
df_news.head()

In [None]:
# save the two columns into a dictionary
news_dict = dict(zip(df_news['News_ID'], df_news['tfidf']))

In [None]:
# read the behaviors.tsv file
# drop rows with NA values
# convert time
df_behaviors = pd.read_csv('MINDsmall_train/behaviors.tsv', sep='\t', header=None)
df_behaviors.columns = behavior_columns

df_behaviors.dropna(inplace=True)

df_behaviors['Time'] = pd.to_datetime(df_behaviors['Time'], format='%m/%d/%Y %I:%M:%S %p')

print(df_behaviors.shape)
df_behaviors.head()

In [None]:
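# split the Impressions column via space, and explode it into one row per impressed news
# (explode the whole frame so the other columns are repeated alongside each item)
df_behaviors["Impressions"] = df_behaviors["Impressions"].str.split()
df_behaviors = df_behaviors.explode("Impressions").reset_index(drop=True)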

# now further split the Impressions into the News_ID and Click
df_behaviors[['News_ID', 'Click']] = df_behaviors['Impressions'].str.split('-', expand=True)

# convert the click into int
df_behaviors['Click'] = df_behaviors['Click'].astype(int)

# drop the Impressions column
df_behaviors.drop(columns=["Impressions", "Impression_ID", "Time"], inplace=True)

print(df_behaviors.shape)
df_behaviors.head()

In [None]:
# so for each history cell, need to split them and get each tfidf vector, then take the average
def get_avg_tfidf_history(history):
    news_ids = history.split()
    tfidf_vectors = [news_dict[news_id] for news_id in news_ids if news_id in news_dict]
    if len(tfidf_vectors) > 0:
        return np.mean(tfidf_vectors, axis=0)
    else:
        # no known news in the history: fall back to a zero vector of the same length
        return np.zeros(TF_VECTOR_LENGTH)

# apply the function to the History column
df_behaviors["history_tfidf"] = df_behaviors["History"].apply(get_avg_tfidf_history)

# for the News_ID also need to get the tfidf
df_behaviors["news_tfidf"] = df_behaviors["News_ID"].apply(lambda x: news_dict[x] if x in news_dict else np.zeros(TF_VECTOR_LENGTH))

# drop the History
df_behaviors.drop(columns=["History"], inplace=True)

print(df_behaviors.shape)
df_behaviors.head()

For the LightFM model, we need to concat the history_tfidf and news_tfidf into a single vector.

For the deep learning model, we don't need to concat them.

Here we provide a function to do the above steps

In [None]:
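def process_tsv_files(behaviors_tsv_path, news_tsv_path, tf_length=TF_VECTOR_LENGTH,
                      vectorizer=None, user_encoder=None, news_encoder=None):
    # Pass the fitted vectorizer and encoders returned by the training call
    # to reuse them on the validation or test files.
    df_news = pd.read_csv(news_tsv_path, sep='\t', header=None)
    df_news.columns = news_columns
    df_news.fillna("", inplace=True)
    df_news = df_news.drop_duplicates(subset='News_ID').reset_index(drop=True)

    df_news['Text'] = df_news['Title'] + ' ' + df_news['Abstract'] + ' ' + df_news['Category'] + ' ' + df_news['Subcategory']
    df_news['Text'] = df_news['Text'].str.lower()

    # fit a new vectorizer on the training call, reuse the fitted one otherwise
    if vectorizer is None:
        vectorizer = TfidfVectorizer(stop_words='english', max_features=tf_length)
        tfidf_matrix = vectorizer.fit_transform(df_news['Text'])
    else:
        tfidf_matrix = vectorizer.transform(df_news['Text'])
    df_news['tfidf'] = list(tfidf_matrix.toarray())
    # the actual vector width (can be smaller than max_features on a small corpus)
    tf_length = tfidf_matrix.shape[1]

    # dictionary
    news_dict = dict(zip(df_news['News_ID'], df_news['tfidf']))

    df_behaviors = pd.read_csv(behaviors_tsv_path, sep='\t', header=None)
    df_behaviors.columns = behavior_columns
    df_behaviors.dropna(inplace=True)

    df_behaviors['Time'] = pd.to_datetime(df_behaviors['Time'], format='%m/%d/%Y %I:%M:%S %p')

    def get_avg_tfidf_history(history):
        news_ids = history.split()
        tfidf_vectors = [news_dict[news_id] for news_id in news_ids if news_id in news_dict]
        if len(tfidf_vectors) > 0:
            return np.mean(tfidf_vectors, axis=0)
        else:
            return np.zeros(tf_length)

    # compute the history vector once per impression, before exploding,
    # so that the exploded rows share the same array
    df_behaviors["history_tfidf"] = df_behaviors["History"].apply(get_avg_tfidf_history)

    # one row per (impression, candidate news) pair
    df_behaviors["Impressions"] = df_behaviors["Impressions"].str.split()
    df_behaviors = df_behaviors.explode("Impressions").reset_index(drop=True)
    df_behaviors[['News_ID', 'Click']] = df_behaviors['Impressions'].str.split('-', expand=True)
    df_behaviors['Click'] = df_behaviors['Click'].astype(int)

    df_behaviors["news_tfidf"] = df_behaviors["News_ID"].apply(lambda x: news_dict[x] if x in news_dict else np.zeros(tf_length))

    # only keep the necessary columns
    df_behaviors = df_behaviors[['User_ID', 'history_tfidf', 'news_tfidf', 'Click', 'News_ID']].copy()

    # label encoders: fit on the training call, reuse the fitted ones otherwise
    if user_encoder is None:
        user_encoder = LabelEncoder().fit(df_behaviors['User_ID'])
    if news_encoder is None:
        news_encoder = LabelEncoder().fit(df_news['News_ID'])

    # LabelEncoder.transform raises on unseen labels,
    # so rows with IDs the encoders have not seen are dropped
    known = (df_behaviors['User_ID'].isin(user_encoder.classes_)
             & df_behaviors['News_ID'].isin(news_encoder.classes_))
    df_behaviors = df_behaviors[known].copy()
    df_behaviors['User_ID'] = user_encoder.transform(df_behaviors['User_ID'])
    df_behaviors['News_ID'] = news_encoder.transform(df_behaviors['News_ID'])

    # return the two dataframes, and the vectorizer, and the two encoder
    return df_behaviors, df_news, vectorizer, user_encoder, news_encoder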

## Model: LightFM

In [None]:
from lightfm import LightFM
from lightfm.data import Dataset

* Model Type: Hybrid Recommender System combining Collaborative Filtering and Content-Based Filtering.
* Data Input: Uses an interaction matrix for user-item interactions and feature matrices for item and user attributes.
* Training Optimization: Utilizes WARP, BPR, and log loss functions for improved ranking and personalization.
* Embeddings: Generates dense vector representations for users and items to calculate scores.
* Evaluation Metric: Assesses performance using the AUC score to measure the model's ability to distinguish relevant items.

The LightFM model is a hybrid recommender system that integrates collaborative filtering and content-based filtering. It uses an interaction matrix for user-item interactions and feature matrices for additional item and user attributes. Training the model involves optimizing advanced loss functions like WARP, BPR, and log loss, which improve ranking and personalization. The model creates dense vector embeddings for users and items, using the dot product of these embeddings to score and rank items for recommendations. Performance is evaluated using metrics like the AUC score, which measures the ability to distinguish between relevant and non-relevant items.
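
The scoring rule described above can be sketched in plain NumPy. This is a conceptual illustration rather than LightFM's implementation: the random parameters stand in for learned embeddings and biases, and all array names and sizes are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_user_features, n_item_features, dim = 3, 5, 4

# hypothetical learned parameters: one embedding and one bias per feature
user_emb = rng.normal(size=(n_user_features, dim))
item_emb = rng.normal(size=(n_item_features, dim))
user_bias = rng.normal(size=n_user_features)
item_bias = rng.normal(size=n_item_features)

# feature matrices: rows are users/items, columns are feature weights
user_features = np.eye(n_user_features)           # identity = pure ID features
item_features = rng.random((6, n_item_features))  # 6 items described by 5 features

# a user/item representation is the weighted sum of its feature embeddings
user_repr = user_features @ user_emb
item_repr = item_features @ item_emb

# score = dot product of the representations plus the summed feature biases
scores = (user_repr @ item_repr.T
          + (user_features @ user_bias)[:, None]
          + (item_features @ item_bias)[None, :])

ranking = np.argsort(-scores, axis=1)  # best item first for each user
print(scores.shape)
print(ranking[0])
```

With identity feature matrices on both sides this reduces to plain matrix factorization, which is why item features such as TF-IDF weights are what make the model hybrid.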



LightFM stands in here for LibFM, the baseline that the MIND paper describes as a classical recommendation method based on factorization machines. It is a relatively early model, closer to classical machine learning than to deep learning.

In this model, we input the user ID, news ID, and the TF-IDF vector of the news. This vector consists of two parts: the first part represents the historical news vector, and the second part represents the vector of the news being recommended to the user. The user then decides whether to click on the news, which we refer to as a “click.”

Our prediction is focused on whether the user will click on the news.
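
Since the target is binary, the AUC mentioned above has a simple reading: it is the fraction of (clicked, non-clicked) pairs in which the clicked news receives the higher score. The sketch below computes it directly from that definition; the function name `auc` and the toy labels and scores are illustrative.

```python
def auc(labels, scores):
    """Fraction of (clicked, non-clicked) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined without both classes
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# toy impression: two clicked and two non-clicked news with predicted scores
labels = [1, 0, 0, 1]
scores = [0.9, 0.2, 0.7, 0.6]
print(auc(labels, scores))  # 3 of the 4 pairs are ordered correctly -> 0.75
```

A value of 0.5 corresponds to random ordering and 1.0 to every clicked news being ranked above every non-clicked one.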

In [None]:
# concat the two vectors into a single one
df_behaviors_train, df_news_train, vectorizer, user_encoder, news_encoder = process_tsv_files("MINDsmall_train/behaviors.tsv", "MINDsmall_train/news.tsv")
df_behaviors_train["combined_tfidf"] = df_behaviors_train.apply(lambda x: np.concatenate([x["history_tfidf"], x["news_tfidf"]]), axis=1)

# initialize the dataset
dataset = Dataset()
dataset.fit(
    users=df_behaviors_train["User_ID"].unique(),
    items=df_behaviors_train["News_ID"].unique(),
    item_features=[f"{i}" for i in range(len(df_behaviors_train['combined_tfidf'][0]))]
)

# check the number of users and number of items
num_users, num_items = dataset.interactions_shape()
print(num_users, num_items)

# check the item features
num_item_features = dataset.item_features_shape()
print(num_item_features)

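# build the interactions from the clicked rows only:
# build_interactions marks every (user, item) pair it receives as an interaction
# and stores the optional third value separately as a weight,
# so passing the non-clicked rows would treat them as positives too
# it returns a tuple of the interactions and the weights
clicked = df_behaviors_train[df_behaviors_train['Click'] == 1]
interactions, weights = dataset.build_interactions(
    (row['User_ID'], row['News_ID']) for index, row in clicked.iterrows()
)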

item_features = dataset.build_item_features(
    (
        row["News_ID"],
        {f"{i}": v for i, v in enumerate(row["combined_tfidf"]) if v != 0}
    ) for index, row in df_behaviors_train.iterrows()
)


model = LightFM(loss='warp')
model.fit(interactions, item_features=item_features, epochs=10, num_threads=1, verbose=True)

from lightfm.evaluation import auc_score
train_auc = auc_score(model, interactions, item_features=item_features, num_threads=16).mean()
print(f'Train AUC: {train_auc}')

In [None]:
train_auc = auc_score(model, interactions, item_features=item_features, num_threads=1).mean()
print(f'Train AUC: {train_auc}')

# Predict scores for all user-item pairs
def predict_scores(model, interactions, item_features):
    num_users, num_items = interactions.shape
    scores = np.empty((num_users, num_items))

    for user_id in range(num_users):
        scores[user_id, :] = model.predict(user_id, np.arange(num_items), item_features=item_features, num_threads=16)

    return scores

# Compute the predicted scores
scores = predict_scores(model, interactions, item_features)
print(scores)

# Get true relevance scores
true_relevance = interactions.toarray()

from sklearn.metrics import ndcg_score

ndcg_5 = ndcg_score(true_relevance, scores, k=5)
ndcg_10 = ndcg_score(true_relevance, scores, k=10)

print(f'nDCG@5: {ndcg_5}')
print(f'nDCG@10: {ndcg_10}')
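
For reference, nDCG@k can be computed by hand for a single ranked list as follows. This is a minimal sketch with binary relevance, a log2 discount, and no special handling of tied scores; the helper names and the toy data are illustrative.

```python
import math


def dcg_at_k(relevance, k):
    """Discounted cumulative gain of the first k items of a ranked relevance list."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevance[:k]))


def ndcg_at_k(labels, scores, k):
    """DCG of the score-ordered list divided by the DCG of the ideal ordering."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [labels[i] for i in order]
    ideal_dcg = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(ranked, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# toy impression: the model ranks one clicked news first and the other third
labels = [0, 1, 0, 1]
scores = [0.1, 0.9, 0.8, 0.3]
print(ndcg_at_k(labels, scores, k=2))
```

The discount means that a clicked news at rank 1 contributes more than one at rank 5, so nDCG rewards putting relevant news near the top.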


## Model: Neural Collaborative Filtering

This model is a Neural Collaborative Filtering (NCF) model designed for recommendation systems. It combines embeddings of users and items, processing them through fully connected layers to predict user preferences for items.

#### Main Components

1. **User Embedding Layer**:
   - Uses `nn.Embedding` to create an embedding vector for each user, capturing user features and preferences.
   - `self.user_embedding = nn.Embedding(num_users, embedding_dim)`

2. **Item Embedding Layer**:
   - Uses `nn.Embedding` to create an embedding vector for each item (news), capturing item features.
   - `self.item_embedding = nn.Embedding(num_items, embedding_dim)`

3. **History TF-IDF Processing Layer**:
   - Uses `nn.Linear` to map the TF-IDF vector of historical news to the embedding dimension.
   - `self.history_dense = nn.Linear(tf_vector_length, embedding_dim)`

4. **News TF-IDF Processing Layer**:
   - Uses `nn.Linear` to map the TF-IDF vector of current news to the embedding dimension.
   - `self.news_dense = nn.Linear(tf_vector_length, embedding_dim)`

5. **Fully Connected Layers**:
   - Combines user embeddings, item embeddings, historical TF-IDF embeddings, and news TF-IDF embeddings, processing them through a series of fully connected layers to predict the click probability.
   - `self.fc1 = nn.Linear(embedding_dim * 4, 128)`
   - `self.fc2 = nn.Linear(128, 64)`
   - `self.fc3 = nn.Linear(64, 1)`

#### Forward Pass

During the forward pass, the model processes the input user ID, news ID, historical news TF-IDF vector, and current news TF-IDF vector as follows:

1. **Embedding Layer Processing**:
   - Obtains the embedding vectors for users and news.
   - `user_embeds = self.user_embedding(user_ids)`
   - `item_embeds = self.item_embedding(item_ids)`

2. **TF-IDF Processing**:
   - Uses linear layers to map the historical and current news TF-IDF vectors to the embedding dimension, followed by ReLU activation.
   - `history_embeds = F.relu(self.history_dense(history_tfidf))`
   - `news_embeds = F.relu(self.news_dense(news_tfidf))`

3. **Concatenation and Fully Connected Layer Processing**:
   - Concatenates user embeddings, item embeddings, historical TF-IDF embeddings, and news TF-IDF embeddings, processes them through fully connected layers with ReLU activation.
   - `x = torch.cat([user_embeds, item_embeds, history_embeds, news_embeds], dim=1)`
   - `x = F.relu(self.fc1(x))`
   - `x = F.relu(self.fc2(x))`
   - Outputs the click probability using the sigmoid function.
   - `x = torch.sigmoid(self.fc3(x))`

#### Model Prediction
- Finally, the model outputs a value between 0 and 1, indicating the probability of the user clicking on the news.

By combining user and item embeddings with TF-IDF features, this model can capture the complex relationships between users and items, enhancing the accuracy and effectiveness of recommendations.

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from torch.utils.data import Dataset, DataLoader
import torch

# by default, the vector length is TF_VECTOR_LENGTH (5000)
df_behaviors_train, df_news_train, vectorizer, user_encoder, news_encoder = process_tsv_files(
                                    "MINDsmall_train/behaviors.tsv", "MINDsmall_train/news.tsv")

df_behaviors_val, df_news_val, vectorizer, user_encoder, news_encoder = process_tsv_files(
                                    "MINDsmall_dev/behaviors.tsv", "MINDsmall_dev/news.tsv",
                                    user_encoder=user_encoder, news_encoder=news_encoder, vectorizer=vectorizer)

# df_behaviors_test, df_news_test, vectorizer, user_encoder, news_encoder = process_tsv_files(
#                                     "MINDsmall_test/behaviors.tsv", "MINDsmall_test/news.tsv",
#                                     user_encoder=user_encoder, news_encoder=news_encoder, vectorizer=vectorizer)

# Prepare the dataset
class RecommendationDataset(Dataset):
    def __init__(self, df):
        # build the tensors from the frame passed in, so each split gets its own data
        self.user_ids = torch.tensor(df['User_ID'].values, dtype=torch.long)
        self.news_ids = torch.tensor(df['News_ID'].values, dtype=torch.long)
        # float32 matches the default dtype of the nn.Linear weights
        self.history_tfidf = torch.tensor(np.stack(df['history_tfidf'].values), dtype=torch.float32)
        self.news_tfidf = torch.tensor(np.stack(df['news_tfidf'].values), dtype=torch.float32)
        # binary cross entropy expects float targets
        self.labels = torch.tensor(df['Click'].values, dtype=torch.float32)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return (self.user_ids[idx], self.news_ids[idx], self.history_tfidf[idx], self.news_tfidf[idx], self.labels[idx])

train_dataset = RecommendationDataset(df_behaviors_train)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)

# validation loader
val_dataset = RecommendationDataset(df_behaviors_val)
val_loader = DataLoader(val_dataset, batch_size=64, shuffle=False)

# and test
# test_dataset = RecommendationDataset(df_behaviors_test)
# test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)


#### Model Definition

In [None]:
import torch.nn as nn
import torch.nn.functional as F

class NeuralCollaborativeFiltering(nn.Module):
    def __init__(self, num_users, num_items, tf_vector_length, embedding_dim=32):
        super(NeuralCollaborativeFiltering, self).__init__()
        self.user_embedding = nn.Embedding(num_users, embedding_dim)
        self.item_embedding = nn.Embedding(num_items, embedding_dim)
        self.history_dense = nn.Linear(tf_vector_length, embedding_dim)
        self.news_dense = nn.Linear(tf_vector_length, embedding_dim)
        self.fc1 = nn.Linear(embedding_dim * 4, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 1)

    def forward(self, user_ids, item_ids, history_tfidf, news_tfidf):
        user_embeds = self.user_embedding(user_ids)
        item_embeds = self.item_embedding(item_ids)
        history_embeds = F.relu(self.history_dense(history_tfidf))
        news_embeds = F.relu(self.news_dense(news_tfidf))
        x = torch.cat([user_embeds, item_embeds, history_embeds, news_embeds], dim=1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = torch.sigmoid(self.fc3(x))
        return x

For training, the Adam optimizer is used, and the loss is binary cross-entropy, because the target is binary (1 for a click, 0 for no click).
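
As a hedged illustration of this loss, the sketch below computes the mean binary cross-entropy by hand for a few made-up predictions. The helper name `binary_cross_entropy`, the `eps` clipping, and the sample values are assumptions for illustration only; `torch.nn.BCELoss` with its default mean reduction computes the same quantity on tensors.

```python
import math

def binary_cross_entropy(probs, labels, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        # Clip the probability so that log(0) can never occur
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)

# Hypothetical predictions for three impressions (one click, two non-clicks)
probs = [0.9, 0.2, 0.6]
labels = [1, 0, 0]
loss = binary_cross_entropy(probs, labels)
print(f"mean BCE: {loss:.4f}")
```

Confident, correct predictions (0.9 for a click, 0.2 for a non-click) contribute little to the loss, while the 0.6 assigned to a non-click dominates it.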

In [None]:
import lightning as L
from sklearn.metrics import roc_auc_score
from sklearn.metrics import ndcg_score


class TrainModelClass(L.LightningModule):
    def __init__(self, model, criterion, lr=1e-3):
        super(TrainModelClass, self).__init__()
        self.model = model
        self.criterion = criterion
        self.lr = lr

    def forward(self, user_ids, item_ids, history_tfidf, news_tfidf):
        return self.model(user_ids, item_ids, history_tfidf, news_tfidf)

    def training_step(self, batch, batch_idx):
        user_ids, item_ids, history_tfidf, news_tfidf, labels = batch

        user_ids = user_ids.long()
        item_ids = item_ids.long()
        history_tfidf = history_tfidf.float()
        news_tfidf = news_tfidf.float()
        labels = labels.float()

        outputs = self(user_ids, item_ids, history_tfidf, news_tfidf).squeeze()
        loss = self.criterion(outputs, labels.float())

        self.log("train_loss", loss, on_step=False, on_epoch=True, prog_bar=True, logger=True)
        return loss

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.model.parameters(), self.lr)
        return optimizer

    def val_test_step(self, batch, batch_idx):
        user_ids, item_ids, history_tfidf, news_tfidf, labels = batch
        user_ids = user_ids.long()
        item_ids = item_ids.long()
        history_tfidf = history_tfidf.float()
        news_tfidf = news_tfidf.float()
        labels = labels.float()

        outputs = self(user_ids, item_ids, history_tfidf, news_tfidf).squeeze()
        self.prob_preds.extend(outputs.detach().cpu().numpy())

        # # each user is exposed to user_ids, history_tfidf, then the news_tfidf
        # # so need to save the mapping with key = (user_id, history_tfidf) and value = output
        # for user_id, history, output, label in zip(user_ids, history_tfidf, outputs, labels):
        #     # if the user_id and history is not in the dictionary, then add it
        #     if (user_id.item(), tuple(history.tolist())) not in self.user_history_session:
        #         self.user_history_session[(user_id.item(), tuple(history.tolist()))] = []
        #     self.user_history_session[(user_id.item(), tuple(history.tolist()))].append((output.item(), label.item()))

        outputs = torch.round(outputs)

        # save the predictions and labels
        self.preds.extend(outputs.detach().cpu().numpy())
        self.labels.extend(labels.detach().cpu().numpy())

    def on_validation_epoch_start(self):
        self.preds = []
        self.labels = []
        self.prob_preds = []
        self.user_history_session = {}

    # for the validation step
    def validation_step(self, batch, batch_idx):
        self.val_test_step(batch, batch_idx)

    # for the validation epoch end
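    def on_validation_epoch_end(self):
        # calculate the AUC from the predicted probabilities: rounding them to
        # 0/1 first would discard the ranking information that AUC measures
        val_auc = roc_auc_score(self.labels, self.prob_preds)
        self.log("val_auc", val_auc, on_step=False, on_epoch=True, prog_bar=True, logger=True)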

        # is it possible to calculate the nDCG@5 and nDCG@10?
        # calculate the nDCG@5 and nDCG@10
        # val_ndcg_5 = ndcg_score(self.val_labels, self.val_prob_preds, k=5)
        #

    # for the test step
    def test_step(self, batch, batch_idx):
        self.val_test_step(batch, batch_idx)

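    def on_test_epoch_start(self):
        self.preds = []
        self.labels = []
        # reset the probabilities too, otherwise they accumulate across runs
        self.prob_preds = []

    def on_test_epoch_end(self):
        # score AUC on predicted probabilities rather than rounded 0/1 predictions
        test_auc = roc_auc_score(self.labels, self.prob_preds)
        self.log("test_auc", test_auc, on_step=False, on_epoch=True, prog_bar=True, logger=True)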
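
On the open question in the comments above about nDCG@5 and nDCG@10: these metrics are usually computed per impression and then averaged, so each prediction needs to carry an impression identifier. The sketch below shows one way to do that in plain Python. The `impression_ids` list is hypothetical (the dataset class above does not track it), and the helper names and sample values are illustrative assumptions.

```python
import math
from collections import defaultdict

def dcg_at_k(labels_in_rank_order, k):
    """Discounted cumulative gain of the first k labels (already sorted by score)."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(labels_in_rank_order[:k]))

def mean_ndcg_at_k(impression_ids, scores, labels, k):
    """Average nDCG@k over impressions that contain at least one click."""
    groups = defaultdict(list)
    for imp, score, label in zip(impression_ids, scores, labels):
        groups[imp].append((score, label))

    ndcgs = []
    for pairs in groups.values():
        # Rank the candidates of this impression by predicted score
        ranked = [label for _, label in sorted(pairs, key=lambda t: t[0], reverse=True)]
        ideal_dcg = dcg_at_k(sorted(ranked, reverse=True), k)
        if ideal_dcg > 0:  # impressions without clicks are undefined, so skip them
            ndcgs.append(dcg_at_k(ranked, k) / ideal_dcg)
    return sum(ndcgs) / len(ndcgs) if ndcgs else 0.0

# Two hypothetical impressions: in the first, the clicked item is ranked last
impression_ids = [1, 1, 1, 2, 2]
scores = [0.2, 0.9, 0.5, 0.7, 0.1]
labels = [1, 0, 0, 1, 0]
result = mean_ndcg_at_k(impression_ids, scores, labels, k=5)
print(f"nDCG@5: {result:.3f}")
```

Computing nDCG over the whole validation set as one ranked list would mix candidates from different users and impressions, which is why the grouping step matters.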

In [None]:
# Initialize the model
num_users = len(user_encoder.classes_)
num_items = len(news_encoder.classes_)
model = NeuralCollaborativeFiltering(num_users, num_items, 2000)

# wrap the model with the TrainModelClass
loss_fn = torch.nn.BCELoss()
model = TrainModelClass(model, loss_fn, lr=1e-3)

# set up the trainer
csv_logger = L.pytorch.loggers.CSVLogger("lightning_logs", name="ncf")
trainer = L.Trainer(max_epochs=10, logger=csv_logger)

trainer.fit(model, train_loader, val_loader)

In [None]:
# plot the line with train_loss, and another graph with val_auc
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("lightning_logs/ncf/version_0/metrics.csv")
df.head(n=10)

In [None]:
# for the train_loss column, some values may be NaN
train_loss_values = df['train_loss'].dropna().values
val_auc_values = df['val_auc'].dropna().values

fig, axs = plt.subplots(1, 2, figsize=(14, 6))
axs[0].plot(train_loss_values)
axs[1].plot(val_auc_values)

axs[0].set_title("Neural Collaborative Filtering: Train Loss")
axs[1].set_title("Neural Collaborative Filtering: Validation AUC")

axs[0].set_xlabel("Epoch")
axs[1].set_xlabel("Epoch")

axs[0].set_ylabel("Train Loss")
axs[1].set_ylabel("Validation AUC")

plt.show()

## Model: Matrix Factorization

### Libraries and load data

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
!pip install pytorch_lightning

In [None]:
import numpy as np
import pandas as pd
import torch.nn as nn
import pytorch_lightning as pl
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import torch
from collections import Counter

### view the data

In [None]:
raw_behaviour = pd.read_csv("/content/drive/My Drive/ColabFiles_9727data/behaviors.tsv",sep="\t",names=["impressionId","userId","timestamp","click_history","impressions"])

print(f"The dataset originally consist of {len(raw_behaviour)} number of interactions.")
raw_behaviour.head()

In [None]:
news = pd.read_csv("/content/drive/My Drive/ColabFiles_9727data/news.tsv",sep="\t",names=["itemId","category","subcategory","title","abstract","url","title_entities","abstract_entities"])
print(f"The article data consist in total of {len(news)} number of articles.")
news.head()

In [None]:
# Print the number of interactions
print(f"The dataset originally consists of {len(raw_behaviour)} interactions.")

# Display the first few rows of the dataset
print(raw_behaviour.head())

# Check how many unique values are in a specific column, for example, "userId"
unique_user_ids = raw_behaviour['userId'].nunique()
print(f"The number of unique userId values is: {unique_user_ids}")


In [None]:
# Check whether the 'click_history' column contains any missing or empty values
missing_history = raw_behaviour['click_history'].isnull().sum()
empty_history = (raw_behaviour['click_history'] == '').sum()

print(f"The 'click_history' column contains {missing_history} missing values.")
print(f"The 'click_history' column contains {empty_history} empty values.")


In [None]:
# Function to split an impression string into two separate lists: clicked and non-clicked items
def process_impression(impression_list):
    list_of_strings = impression_list.split()
    click = [x.split('-')[0] for x in list_of_strings if x.split('-')[1] == '1']
    non_click = [x.split('-')[0] for x in list_of_strings if x.split('-')[1] == '0']
    return click,non_click

# Store the two lists as new columns:
raw_behaviour['click'], raw_behaviour['noclicks'] = zip(*raw_behaviour['impressions'].map(process_impression))

In [None]:
# Convert timestamp value to hours since epoch
raw_behaviour['epochhrs'] = pd.to_datetime(raw_behaviour['timestamp']).values.astype(np.int64)/(1e6)/1000/3600
raw_behaviour['epochhrs'] = raw_behaviour['epochhrs'].round()
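
To make the unit conversion above concrete, here is a minimal standard-library sketch of the same idea: parse a timestamp and express it as whole hours since the Unix epoch. It assumes the `MM/DD/YYYY hh:mm:ss AM/PM` format used earlier in this notebook and treats timestamps as UTC; the helper name `to_epoch_hours` is hypothetical.

```python
from datetime import datetime, timezone

def to_epoch_hours(timestamp, fmt="%m/%d/%Y %I:%M:%S %p"):
    """Parse a timestamp string and return rounded hours since 1970-01-01 (UTC)."""
    dt = datetime.strptime(timestamp, fmt).replace(tzinfo=timezone.utc)
    # seconds since epoch -> hours since epoch
    return round(dt.timestamp() / 3600)

print(to_epoch_hours("01/01/1970 1:00:00 AM"))
print(to_epoch_hours("01/02/1970 12:00:00 AM"))
```

Rounding to whole hours keeps the later time-based split coarse, which is enough for choosing quantile thresholds.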

In [None]:
# If a session contains several clicks, expand each click into its own row
raw_behaviour = raw_behaviour.explode("click").reset_index(drop=True)

# Extract the clicks recorded in each user's click history
click_history = raw_behaviour[["userId","click_history"]].drop_duplicates().dropna()
click_history["click_history"] = click_history.click_history.map(lambda x: x.split())
click_history = click_history.explode("click_history").rename(columns={"click_history":"click"})
# Dummy time set to the earliest epochhrs in raw_behaviour, as we don't know when these events took place.
click_history["epochhrs"] = raw_behaviour.epochhrs.min()
# Reuse click_history's own index: after explode it has repeated labels, so a
# Series with a fresh 0..n-1 index could be misaligned and leave NaN values
click_history["noclicks"] = pd.Series([[] for _ in range(len(click_history.index))], index=click_history.index)

# concatenate historical clicks with the raw_behaviour
raw_behaviour = pd.concat([raw_behaviour,click_history],axis=0).reset_index(drop=True)
print(f"The dataset after pre-processing consist of {len(raw_behaviour)} number of interactions.")

### visualize the data distribution

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

article_clicks = raw_behaviour['click'].explode().value_counts()

filtered_clicks = article_clicks[article_clicks <= 100]

sns.set(style="whitegrid")

plt.figure(figsize=(8,4))
sns.histplot(filtered_clicks, bins=50, kde=False)
plt.title('Distribution of Article Clicks (<= 100 clicks)')
plt.xlabel('Number of Clicks')
plt.ylabel('Number of Articles')
plt.show()

filtered_clicks = article_clicks[article_clicks <= 50]
plt.figure(figsize=(8,4))
sns.histplot(filtered_clicks, bins=50, kde=False)
plt.title('Distribution of Article Clicks (<= 50 clicks)')
plt.xlabel('Number of Clicks')
plt.ylabel('Number of Articles')
plt.show()

#### Because nearly 70% of the articles have fewer than 10 clicks, we set the cutoff to 5 to keep as many users as possible.
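
One way to compare candidate cutoffs is to count, for each one, the share of articles that would fall below it. The sketch below does this on a tiny made-up click log. The helper name `share_below_cutoff` and the data are illustrative assumptions, not values from the MIND dataset.

```python
from collections import Counter

def share_below_cutoff(clicked_items, cutoff):
    """Fraction of distinct articles that received fewer than `cutoff` clicks."""
    counts = Counter(clicked_items)
    below = sum(1 for c in counts.values() if c < cutoff)
    return below / len(counts)

# Hypothetical click log: one entry per click
clicks = ["N1"] * 12 + ["N2"] * 3 + ["N3"] * 1 + ["N4"] * 7
for cutoff in (5, 10):
    print(cutoff, share_below_cutoff(clicks, cutoff))

share = share_below_cutoff(clicks, cutoff=5)
```

A higher cutoff removes more of the long tail of rarely clicked articles, and with them the users who clicked only those articles.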

In [None]:
min_click_cutoff = 5
print(f'Number of items that have less than {min_click_cutoff} clicks make up',np.round(np.mean(raw_behaviour.groupby("click").size() < min_click_cutoff)*100,3),'% of the total, and these will be removed.')

In [None]:
# remove items with less clicks than min_click_cutoff
raw_behaviour = raw_behaviour[raw_behaviour.groupby("click")["userId"].transform('size') >= min_click_cutoff].reset_index(drop=True)
# Get a set with all the unique items
click_set = set(raw_behaviour['click'].unique())

# remove non-clicked items that are not available in the click set (the items that we will be training on)
raw_behaviour['noclicks'] = raw_behaviour['noclicks'].apply(lambda impressions: [impression for impression in impressions if impression in click_set])

In [None]:
## Select the columns that we now want to use for further analysis
behaviour = raw_behaviour[['epochhrs','userId','click','noclicks']].copy()

print('Number of interactions in the behaviour dataset:', behaviour.shape[0])
print('Number of users in the behaviour dataset:', behaviour.userId.nunique())
print('Number of articles in the behaviour dataset:', behaviour.click.nunique())

behaviour.head()

### Split the data into 80% training and 20% validation

In [None]:
# Use the last 20% of the data (by time) as our validation data:
test_time_th = behaviour['epochhrs'].quantile(0.8)
train = behaviour[behaviour['epochhrs']< test_time_th].copy()

## Indexize items
# Allocate a unique index for each item, but let the zeroth index be a UNK index:
ind2item = {idx +1: itemid for idx, itemid in enumerate(train.click.unique())}
item2ind = {itemid : idx for idx, itemid in ind2item.items()}

train['noclicks'] = train['noclicks'].map(lambda list_of_items: [item2ind.get(l, 0) for l in list_of_items])
train['click'] = train['click'].map(lambda item: item2ind.get(item, 0))

## Indexize users
# Allocate a unique index for each user, but let the zeroth index be a UNK index:
ind2user = {idx +1: userid for idx, userid in enumerate(train['userId'].unique())}
user2ind = {userid : idx for idx, userid in ind2user.items()}

# Create a new column with userIdx:
train['userIdx'] = train['userId'].map(lambda x: user2ind.get(x,0))

# Repeat for validation
valid =  behaviour[behaviour['epochhrs']>= test_time_th].copy()
valid["click"] = valid["click"].map(lambda item: item2ind.get(item, 0))
valid["noclicks"] = valid["noclicks"].map(lambda list_of_items: [item2ind.get(l, 0) for l in list_of_items])
valid["userIdx"] = valid["userId"].map(lambda x: user2ind.get(x,0))

print(train.head(5))
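
The indexing scheme above reserves index 0 for unknown (UNK) items and users, so anything that appears only after the training cutoff maps to 0. A minimal sketch of that idea, with a hypothetical helper `build_index` and made-up article IDs:

```python
def build_index(values):
    """Assign indices 1..n to the distinct values, keeping 0 free for unknowns."""
    distinct = dict.fromkeys(values)  # preserves first-seen order
    ind2val = {idx + 1: v for idx, v in enumerate(distinct)}
    val2ind = {v: idx for idx, v in ind2val.items()}
    return ind2val, val2ind

# Items seen during training
train_items = ["N10", "N20", "N10", "N30"]
ind2item_demo, item2ind_demo = build_index(train_items)

# "N99" never occurred in training, so it falls back to the UNK index 0
valid_items = ["N20", "N99"]
valid_idx = [item2ind_demo.get(item, 0) for item in valid_items]
print(valid_idx)
```

Because index 0 is also a valid row, the embedding tables need one more row than the number of known items or users, which is why the models below are created with `len(ind2item) + 1` and `len(ind2user) + 1`.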

### New split: 70% training, 15% validation, 15% testing

In [None]:
# Split into 70% training set, 15% validation set, 15% test set
test_time_th = behaviour['epochhrs'].quantile(0.85)
valid_time_th = behaviour['epochhrs'].quantile(0.7)
train = behaviour[behaviour['epochhrs']< valid_time_th].copy()


## Indexize items
# Allocate a unique index for each item, but let the zeroth index be a UNK index:
ind2item = {idx +1: itemid for idx, itemid in enumerate(train.click.unique())}
item2ind = {itemid : idx for idx, itemid in ind2item.items()}

train['noclicks'] = train['noclicks'].map(lambda list_of_items: [item2ind.get(l, 0) for l in list_of_items])
train['click'] = train['click'].map(lambda item: item2ind.get(item, 0))

## Indexize users
# Allocate a unique index for each user, but let the zeroth index be a UNK index:
ind2user = {idx +1: userid for idx, userid in enumerate(train['userId'].unique())}
user2ind = {userid : idx for idx, userid in ind2user.items()}

# Create a new column with userIdx:
train['userIdx'] = train['userId'].map(lambda x: user2ind.get(x,0))

# Repeat for validation
valid = behaviour[(behaviour['epochhrs'] >= valid_time_th) & (behaviour['epochhrs'] < test_time_th)].copy()
valid["click"] = valid["click"].map(lambda item: item2ind.get(item, 0))
valid["noclicks"] = valid["noclicks"].map(lambda list_of_items: [item2ind.get(l, 0) for l in list_of_items])
valid["userIdx"] = valid["userId"].map(lambda x: user2ind.get(x,0))

# for test
test = behaviour[behaviour['epochhrs'] >= test_time_th].copy()
test['click'] = test['click'].map(lambda item: item2ind.get(item, 0))
test['noclicks'] = test['noclicks'].map(lambda list_of_items: [item2ind.get(l, 0) for l in list_of_items])
test['userIdx'] = test['userId'].map(lambda x: user2ind.get(x, 0))
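
The split above is chronological rather than random: everything before the first time threshold is training data, the next slice is validation, and the most recent slice is the test set. The sketch below illustrates this on toy data. The helper names are hypothetical, and it uses a simple nearest-rank threshold, whereas pandas' `quantile` interpolates linearly by default.

```python
def quantile_threshold(sorted_times, q):
    """Nearest-rank approximation of the q-quantile of a sorted list."""
    return sorted_times[int(q * (len(sorted_times) - 1))]

def chronological_split(events, valid_q=0.7, test_q=0.85):
    """Split (time, payload) events into train/valid/test by time thresholds."""
    times = sorted(t for t, _ in events)
    valid_th = quantile_threshold(times, valid_q)
    test_th = quantile_threshold(times, test_q)
    train = [e for e in events if e[0] < valid_th]
    valid = [e for e in events if valid_th <= e[0] < test_th]
    test = [e for e in events if e[0] >= test_th]
    return train, valid, test

# Twenty hypothetical events, one per hour
events = [(hour, f"click_{hour}") for hour in range(20)]
train_demo, valid_demo, test_demo = chronological_split(events)
print(len(train_demo), len(valid_demo), len(test_demo))
```

Splitting by time avoids training on interactions that happened after the ones being evaluated. With many events sharing the same hour, the actual proportions can deviate from 70/15/15.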

In [None]:
class MindDataset(Dataset):
    # A fairly simple torch dataset module that can take a pandas dataframe (as above),
    # and convert the relevant fields into a dictionary of arrays that can be used in a dataloader
    def __init__(self, df):
        # Create a dictionary of tensors out of the dataframe
        self.data = {
            'userIdx' : torch.tensor(df.userIdx.values.astype(np.int64)),
            'click' : torch.tensor(df.click.values.astype(np.int64))
        }
    def __len__(self):
        return len(self.data['userIdx'])
    def __getitem__(self, idx):
        return {key: val[idx] for key, val in self.data.items()}


In [None]:
# Build datasets and dataloaders of train and validation dataframes:
bs = 1024
ds_train = MindDataset(train)
train_loader = DataLoader(ds_train, batch_size=bs, shuffle=True)
ds_valid = MindDataset(valid)
valid_loader = DataLoader(ds_valid, batch_size=bs, shuffle=False)
ds_test = MindDataset(test)
test_loader = DataLoader(ds_test, batch_size=bs, shuffle=False)

batch = next(iter(train_loader))

In [None]:
# View the first batch in the test loader
batch = next(iter(test_loader))
print(batch)


In [None]:
# Alternatively, iterate through the test loader to view all batches
for batch_idx, batch in enumerate(test_loader):
    print(f"Batch {batch_idx + 1}")
    print(batch)
    # Break after the first batch for brevity
    if batch_idx == 0:
        break

In [None]:
# Build a matrix factorization model
class NewsMF(pl.LightningModule):
    def __init__(self, num_users, num_items, dim = 100, dropout_prob=0.2, reg=0.01): # add regularization
        super().__init__()
        self.dim=dim
        self.num_users = num_users
        self.num_items = num_items
        self.reg = reg
        self.useremb = nn.Embedding(num_embeddings=num_users, embedding_dim=dim)
        self.itememb = nn.Embedding(num_embeddings=num_items, embedding_dim=dim)

        self.dropout = nn.Dropout(p=dropout_prob) # the dropout probability defaults to 0.2


    def step(self, batch, batch_idx, phase="train"):
        batch_size = batch['userIdx'].size(0)
        uservec = self.useremb(batch['userIdx'])
        itemvec_click = self.itememb(batch['click'])

        # Apply dropout to embeddings
        uservec = self.dropout(uservec)                # added drop out
        itemvec_click = self.dropout(itemvec_click)

        # For each positive interaction,sample a random negative
        neg_sample = torch.randint_like(batch["click"],1,self.num_items)
        itemvec_noclick = self.itememb(neg_sample)
        itemvec_noclick = self.dropout(itemvec_noclick)  # Apply dropout to negative samples

        score_click = torch.sigmoid((uservec*itemvec_click).sum(-1).unsqueeze(-1))
        score_noclick =  torch.sigmoid((uservec*itemvec_noclick).sum(-1).unsqueeze(-1))

        # Compute loss as binary cross entropy (categorical distribution between the clicked and the no clicked item)
        scores_all = torch.concat((score_click, score_noclick), dim=1)
        target_all = torch.concat((torch.ones_like(score_click), torch.zeros_like(score_noclick)),dim=1)
        # loss = F.binary_cross_entropy(scores_all, target_all)
        # return loss
        loss = F.binary_cross_entropy(scores_all, target_all)
        reg_loss = self.reg * (self.useremb.weight.norm(2) + self.itememb.weight.norm(2)) # add regularization
        return loss + reg_loss


    def training_step(self, batch, batch_idx):
        return self.step(batch, batch_idx, "train")

    def validation_step(self, batch, batch_idx):
        # for now, just do the same computation as during training
        return self.step(batch, batch_idx, "val")

    #def configure_optimizers(self):
        #optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        #return optimizer

    def configure_optimizers(self):
        optimizer = torch.optim.AdamW(self.parameters(), lr=1e-3)
        return optimizer


In [None]:
from pytorch_lightning import seed_everything

In [None]:
seed_everything(42, workers=True)
# Define and train model
mf_model = NewsMF(num_users=len(ind2user) + 1, num_items = len(ind2item) + 1, dim = 100) # increase the embedding dimension to 100
trainer = pl.Trainer(max_epochs=2, accelerator="gpu",deterministic = True)
trainer.fit(model=mf_model, train_dataloaders=train_loader)


### New training code

In [None]:
seed_everything(42, workers=True)
# Train the model
mf_model = NewsMF(num_users=len(ind2user) + 1, num_items=len(ind2item) + 1, dim=100, dropout_prob=0.2, reg=0.01)  # increase the embedding dimension to 100
trainer = pl.Trainer(max_epochs=2, accelerator="gpu", deterministic=True)
trainer.fit(model=mf_model, train_dataloaders=train_loader, val_dataloaders=valid_loader)

In [None]:
## Add more information to the article data
# The item index
news["ind"] = news["itemId"].map(item2ind)
news = news.sort_values("ind").reset_index(drop=True)
# Number of clicks in training data per article, investigate the cold start issue
news["n_click_training"] = news["ind"].map(dict(Counter(train.click)))
# 5 most clicked articles
news.sort_values("n_click_training",ascending=False).head()

### Test for the most 5 similar news article for a news

In [None]:
# store the learned item embedding in a separate tensor
itememb = mf_model.itememb.weight.detach()
print(itememb.shape)

In [None]:
# Inspect different rows of the item embedding (article embeddings) to see whether the model works
## some examples: N13259, N16636, N10272
## Can you find some examples that do not work well? Why?

ind = item2ind.get("N3259")
# This calculates the cosine similarity and outputs the 5 most similar articles w.r.t. ind in descending order
similarity = torch.nn.functional.cosine_similarity(itememb[ind], itememb, dim=1)
most_sim = news[~news.ind.isna()].iloc[(similarity.argsort(descending=True).numpy()-1)]
most_sim.head(5)
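
The nearest-neighbour lookup above relies on cosine similarity between embedding rows. The following sketch shows the same computation in plain Python on a tiny made-up embedding table. The helper names and vectors are illustrative assumptions, and row 0 is skipped because it is the UNK index.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def most_similar(query_idx, embeddings, top_k=5):
    """Indices of the top_k rows most similar to the query row, skipping UNK (row 0)."""
    sims = [(idx, cosine_similarity(embeddings[query_idx], vec))
            for idx, vec in enumerate(embeddings) if idx != 0]
    sims.sort(key=lambda t: t[1], reverse=True)
    return [idx for idx, _ in sims[:top_k]]

embeddings = [
    [0.0, 0.0],   # index 0: UNK
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
]
neighbours = most_similar(1, embeddings, top_k=3)
print(neighbours)
```

As in the notebook cell, the query article is always its own nearest neighbour, so the first result is the article itself.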

### Calculate the new prediction

### New recommendation

In [None]:
# Function to recommend news articles to a user
def recommend_news(user_idx, model, top_k=5):
    with torch.no_grad():
        user_vector = model.useremb(torch.tensor(user_idx))
        scores = torch.sigmoid((user_vector * model.itememb.weight).sum(-1))
    # index 0 is the UNK item and has no entry in ind2item, so never recommend it
    scores[0] = float("-inf")
    top_k_items = scores.argsort(descending=True)[:top_k]
    return [ind2item[idx.item()] for idx in top_k_items]

# Example: Recommend top 5 news articles for a user
user_id = test['userIdx'].iloc[3616]  # Replace with the desired user index
recommended_news = recommend_news(user_id, mf_model, top_k=5)
print(f"Recommended news articles for user {user_id}: {recommended_news}")

# Evaluate on the test set
test['recommended'] = test['userIdx'].apply(lambda x: recommend_news(x, mf_model, top_k=5))

print(test.head(5))




In [None]:
# Evaluate the model and calculate predictions
def calculate_prediction(user_idx, item_idx):
    user_tensor = torch.tensor(user_idx, dtype=torch.long)
    item_tensor = torch.tensor(item_idx, dtype=torch.long)
    user_vector = mf_model.useremb(user_tensor)
    item_vector = mf_model.itememb(item_tensor)
    score = torch.sigmoid((user_vector * item_vector).sum(-1))
    return score.item()

test['pred'] = test.apply(lambda x: calculate_prediction(x['userIdx'], x['click']), axis=1)
print(test.head(5))

### check

In [None]:
unique_values = test['click'].unique()
print(f"Unique values in 'click': {unique_values}")


### MF2 (with precision, recall, f1)

In [None]:
import os
import numpy as np
import pandas as pd
import torch.nn as nn
import pytorch_lightning as pl
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import torch
from collections import Counter
from torchmetrics.functional import auroc
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
data_path = './data'
raw_behaviour = pd.read_csv(
    os.path.join(data_path, "behaviors.tsv"),
    sep="\t",
    names=["impressionId","userId","timestamp","click_history","impressions"])

print(f"The dataset originally consist of {len(raw_behaviour)} number of interactions.")
raw_behaviour.head()

In [None]:
news = pd.read_csv(
    os.path.join(data_path,"news.tsv"),
    sep="\t",
    names=["itemId","category","subcategory","title","abstract","url","title_entities","abstract_entities"])
print(f"The article data consist in total of {len(news)} number of articles.")
news.head()

In [None]:
def process_impression(impression_list):
    list_of_strings = impression_list.split()
    click = [x.split('-')[0] for x in list_of_strings if x.split('-')[1] == '1']
    non_click = [x.split('-')[0] for x in list_of_strings if x.split('-')[1] == '0']
    return click,non_click

# Store the two lists as new columns:
raw_behaviour['click'], raw_behaviour['noclicks'] = zip(*raw_behaviour['impressions'].map(process_impression))

In [None]:
raw_behaviour['epochhrs'] = pd.to_datetime(raw_behaviour['timestamp']).values.astype(np.int64)/(1e6)/1000/3600
raw_behaviour['epochhrs'] = raw_behaviour['epochhrs'].round()

In [None]:
raw_behaviour = raw_behaviour.explode("click").reset_index(drop=True)

# Extract the clicks recorded in each user's click history
click_history = raw_behaviour[["userId","click_history"]].drop_duplicates().dropna()
click_history["click_history"] = click_history.click_history.map(lambda x: x.split())
click_history = click_history.explode("click_history").rename(columns={"click_history":"click"})
# Dummy time set to the earliest epochhrs in raw_behaviour, as we don't know when these events took place.
click_history["epochhrs"] = raw_behaviour.epochhrs.min()
# Reuse click_history's own index so the empty lists line up with its rows
click_history["noclicks"] = pd.Series([[] for _ in range(len(click_history.index))], index=click_history.index)

# concatenate historical clicks with the raw_behaviour
raw_behaviour = pd.concat([raw_behaviour,click_history],axis=0).reset_index(drop=True)
print(f"The dataset after pre-processing consist of {len(raw_behaviour)} number of interactions.")

In [None]:
min_click_cutoff = 100
print(f'Number of items that have less than {min_click_cutoff} clicks make up',np.round(np.mean(raw_behaviour.groupby("click").size() < min_click_cutoff)*100,3),'% of the total, and these will be removed.')


In [None]:
raw_behaviour = raw_behaviour[raw_behaviour.groupby("click")["userId"].transform('size') >= min_click_cutoff].reset_index(drop=True)
# Get a set with all the unique items
click_set = set(raw_behaviour['click'].unique())

# remove non-clicked items that are not available in the click set (the items that we will be training on)
raw_behaviour['noclicks'] = raw_behaviour['noclicks'].apply(lambda impressions: [impression for impression in impressions if impression in click_set])

In [None]:
behaviour = raw_behaviour[['epochhrs','userId','click','noclicks']].copy()

print('Number of interactions in the behaviour dataset:', behaviour.shape[0])
print('Number of users in the behaviour dataset:', behaviour.userId.nunique())
print('Number of articles in the behaviour dataset:', behaviour.click.nunique())

behaviour.head()

In [None]:
test_time_th = behaviour['epochhrs'].quantile(0.9)
train = behaviour[behaviour['epochhrs']< test_time_th].copy()

## Indexize items
# Allocate a unique index for each item, but let the zeroth index be a UNK index:
ind2item = {idx +1: itemid for idx, itemid in enumerate(train.click.unique())}
item2ind = {itemid : idx for idx, itemid in ind2item.items()}

train['noclicks'] = train['noclicks'].map(lambda list_of_items: [item2ind.get(l, 0) for l in list_of_items])
train['click'] = train['click'].map(lambda item: item2ind.get(item, 0))

## Indexize users
# Allocate a unique index for each user, but let the zeroth index be a UNK index:
ind2user = {idx +1: userid for idx, userid in enumerate(train['userId'].unique())}
user2ind = {userid : idx for idx, userid in ind2user.items()}

# Create a new column with userIdx:
train['userIdx'] = train['userId'].map(lambda x: user2ind.get(x,0))

# Repeat for validation
valid =  behaviour[behaviour['epochhrs']>= test_time_th].copy()
valid["click"] = valid["click"].map(lambda item: item2ind.get(item, 0))
valid["noclicks"] = valid["noclicks"].map(lambda list_of_items: [item2ind.get(l, 0) for l in list_of_items])
valid["userIdx"] = valid["userId"].map(lambda x: user2ind.get(x,0))

In [None]:
class MindDataset(Dataset):
    # A fairly simple torch dataset module that can take a pandas dataframe (as above),
    # and convert the relevant fields into a dictionary of arrays that can be used in a dataloader
    def __init__(self, df):
        # Create a dictionary of tensors out of the dataframe
        self.data = {
            'userIdx' : torch.tensor(df.userIdx.values.astype(np.int64)),
            'click' : torch.tensor(df.click.values.astype(np.int64))
        }
    def __len__(self):
        return len(self.data['userIdx'])
    def __getitem__(self, idx):
        return {key: val[idx] for key, val in self.data.items()}

In [None]:
bs = 1024
ds_train = MindDataset(train)
train_loader = DataLoader(ds_train, batch_size=bs, shuffle=True)
ds_valid = MindDataset(valid)
valid_loader = DataLoader(ds_valid, batch_size=bs, shuffle=False)

batch = next(iter(train_loader))

In [None]:
# Build a matrix factorization model
class NewsMF(pl.LightningModule):
    def __init__(self, num_users, num_items, dim = 10):
        super().__init__()
        self.dim=dim
        self.num_users = num_users
        self.num_items = num_items

        self.useremb = nn.Embedding(num_embeddings=num_users, embedding_dim=dim)
        self.itememb = nn.Embedding(num_embeddings=num_items, embedding_dim=dim)

    def forward(self, user_idx, item_idx):
        user_vec = self.useremb(user_idx)
        item_vec = self.itememb(item_idx)
        dot_product = (user_vec * item_vec).sum(-1).unsqueeze(-1)
        score = torch.sigmoid(dot_product)
        return score

    def step(self, batch, batch_idx, phase="train"):
        batch_size = batch['userIdx'].size(0)
        uservec = self.useremb(batch['userIdx'])
        itemvec_click = self.itememb(batch['click'])

        # For each positive interaction,sample a random negative
        neg_sample = torch.randint_like(batch["click"],1,self.num_items)
        itemvec_noclick = self.itememb(neg_sample)

        score_click = torch.sigmoid((uservec*itemvec_click).sum(-1).unsqueeze(-1))
        score_noclick =  torch.sigmoid((uservec*itemvec_noclick).sum(-1).unsqueeze(-1))

        # Compute loss as binary cross entropy (categorical distribution between the clicked and the no clicked item)
        scores_all = torch.concat((score_click, score_noclick), dim=1)
        target_all = torch.concat((torch.ones_like(score_click), torch.zeros_like(score_noclick)),dim=1)
        loss = F.binary_cross_entropy(scores_all, target_all)
        return loss


    def training_step(self, batch, batch_idx):
        return self.step(batch, batch_idx, "train")

    def validation_step(self, batch, batch_idx):
        # for now, just do the same computation as during training
        return self.step(batch, batch_idx, "val")

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer

In [None]:
from pytorch_lightning import seed_everything

In [None]:
seed_everything(42, workers=True)
# Define and train model
mf_model = NewsMF(num_users=len(ind2user) + 1, num_items = len(ind2item) + 1, dim = 50)
trainer = pl.Trainer(max_epochs=2, accelerator="gpu",deterministic = True)
trainer.fit(model=mf_model, train_dataloaders=train_loader)

In [None]:
all_scores = []
all_labels = []
with torch.no_grad():
    for batch in valid_loader:
        userIdx = batch['userIdx']
        click = batch['click']
        scores_click = mf_model(userIdx, click)
        # sample negatives from 1..num_items-1, as in training (index 0 is the UNK item)
        neg_sample = torch.randint(1, mf_model.num_items, click.size(), device = click.device)
        scores_noclick = mf_model(userIdx, neg_sample)

        all_scores.append(scores_click)
        all_labels.append(torch.ones_like(scores_click))

        all_scores.append(scores_noclick)
        all_labels.append(torch.zeros_like(scores_noclick))
all_scores = torch.cat(all_scores).view(-1)
all_labels = torch.cat(all_labels).view(-1)

In [None]:
auc_r = auroc(all_scores, all_labels.int(), task='binary')

In [None]:
auc_r
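
As a reading aid for the AUC value, the sketch below computes the same quantity by its pairwise definition: the fraction of (clicked, non-clicked) pairs in which the clicked item receives the higher score, with ties counted as half. The function name `pairwise_auc` and the toy scores are illustrative assumptions; the quadratic loop is only meant for small inputs and is not how `auroc` is implemented.

```python
def pairwise_auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy example: two clicked items and three non-clicked items
scores = [0.9, 0.8, 0.4, 0.35, 0.1]
labels = [1, 0, 1, 0, 0]
auc_value = pairwise_auc(scores, labels)  # 5 of the 6 pairs are ordered correctly
print(auc_value)
```

A value of 0.5 corresponds to random ordering and 1.0 to every clicked item outranking every non-clicked one.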

In [None]:
fpr, tpr, thresholds = roc_curve(all_labels.cpu().numpy(), all_scores.cpu().numpy())
roc_auc = auc(list(fpr), list(tpr))
plt.figure()
plt.plot(list(fpr), list(tpr), color='darkorange', lw=2, label=f'ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()


In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score
predict_results = (all_scores > 0.5)
# Move to CPU before converting, as in the ROC cell
predict_results = predict_results.cpu().numpy()
all_labels = all_labels.cpu().numpy()

In [None]:
# scikit-learn metrics take (y_true, y_pred) in that order
precision_score(all_labels, predict_results, average='macro')

In [None]:
recall_score(all_labels, predict_results, average='macro')

In [None]:
f1_score(all_labels, predict_results, average='macro')
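
The scikit-learn metric functions take the true labels first and the predictions second, and the order matters. The plain-Python sketch below, using made-up labels and a hypothetical helper `precision_recall`, shows that swapping the two arguments exchanges precision and recall for the positive class.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall of the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [1, 1, 1, 0, 0, 0]  # made-up ground truth
y_pred = [1, 0, 0, 1, 0, 0]  # made-up thresholded predictions

correct = precision_recall(y_true, y_pred)
swapped = precision_recall(y_pred, y_true)
print(correct, swapped)
```

Here one of two predicted positives is correct (precision 1/2) and one of three true positives is found (recall 1/3); with the arguments reversed the two numbers trade places.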

## Model: GNN (LightGCN) attempt

### Problem
The recommendation problem addressed here involves recommending news articles to users. The system will be deployed within a mobile news application. Competitor analysis indicates that current systems, such as those used by popular news apps, often rely on collaborative filtering and suffer from the cold start problem. Our proposed system uses Graph Neural Networks (GNN) to leverage both user-article interaction data and content features for more accurate recommendations.

- User Inputs: User interaction data (clicks, likes, shares) and user profile information.
- Recommendations: News articles personalized to user preferences.
- User Feedback: Clicks on recommended articles, which will be used to update the model.
- Problem Definition: Ranking problem where the goal is to rank news articles according to user preferences.

First, we import the packages that we will use.

In [None]:
import os
import sys
import numpy as np
import pandas as pd
from tqdm import tqdm
import pickle
from collections import Counter
import tensorflow as tf
from recommenders.utils.timer import Timer
from recommenders.models.deeprec.DataModel.ImplicitCF import ImplicitCF
from recommenders.models.deeprec.models.graphrec.lightgcn import LightGCN
from recommenders.models.deeprec.deeprec_utils import prepare_hparams
from recommenders.models.deeprec.deeprec_utils import cal_metric
tf.get_logger().setLevel('ERROR')

### Dataset
Dataset: MIND (Microsoft News Dataset), a large-scale dataset for news recommendation.

Characteristics:

- Contains news articles and user interaction data (click histories and impression logs) keyed by user ID.
- Articles have associated metadata like title, abstract, and category.

Exploratory Data Analysis:

- Distribution of interactions per user.
- Most popular categories of news articles.

Strengths and weaknesses:

- Strengths: Rich user interaction data and comprehensive metadata for news articles.
- Weaknesses: May have sparse interactions for some users (cold start problem).

In [None]:
data_path = './data'
raw_behaviour = pd.read_csv(
    os.path.join(data_path, "behaviors.tsv"),
    sep="\t",
    names=["impressionId","userId","timestamp","click_history","impressions"])

print(f"The dataset originally consists of {len(raw_behaviour)} impression rows.")
raw_behaviour.head()

In [None]:
# user-item clicks: expand every impression into (user, item, click) rows
clicks = []
for _, row in raw_behaviour.iterrows():
    user_id = row["userId"]
    impressions = row["impressions"].split()
    for impression in impressions:
        item_id, click = impression.split('-')
        clicks.append((user_id, item_id, int(click)))
df_clicks = pd.DataFrame(clicks, columns=['userID', 'itemID', 'rating'])
#save data [click==1]

In [None]:
df_clicks.head()
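
For reference, the `impressions` field is a space-separated list of `News_ID-<click status>` tokens. The following sketch parses one such string in isolation; the sample string and the helper name `parse_impressions` are made up for illustration.

```python
def parse_impressions(impressions):
    """Split 'N1-1 N2-0' into [('N1', 1), ('N2', 0)]."""
    pairs = []
    for token in impressions.split():
        # rsplit on the last hyphen keeps the ID intact even if it contained one
        item_id, click = token.rsplit("-", 1)
        pairs.append((item_id, int(click)))
    return pairs

sample = "N55689-1 N35729-0 N33632-0"  # made-up impression string
parsed = parse_impressions(sample)
clicked_only = [item for item, click in parsed if click == 1]
print(parsed)
print(clicked_only)
```

Filtering on `click == 1` is how the positive user-item pairs would be isolated if only clicked items are needed.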

### Method

#### Overall Approach:
- Use a Graph Neural Network (GNN) for recommendation.
- Construct a graph where nodes represent users and articles, and edges represent interactions.
- Method: We run an ID-based collaborative filtering baseline with LightGCN. LightGCN is a simple and neat Graph Convolution Network (GCN) model for recommender systems. It uses a GCN to learn the embeddings of users and items, so that low-order and high-order user-item interactions are explicitly built into the embedding function.


![jupyter](https://camo.githubusercontent.com/d01f9da6d6cf4e07e35b0d77ccbf0195851ef1b1a035c30efc810e37b0da624a/68747470733a2f2f7265636f64617461736574732e7a32302e7765622e636f72652e77696e646f77732e6e65742f6b6464323032302f696d616765732532464c6967687447434e2d67726170686578616d706c652e4a5047)

The model structure is shown below:

![jupyter](https://camo.githubusercontent.com/f390dcd24e48a86a2c6eeac9344eac60f9b45fc817bc24cf850e7d28d8a955bd/68747470733a2f2f7265636f64617461736574732e7a32302e7765622e636f72652e77696e646f77732e6e65742f696d616765732f6c6967687447434e2d6d6f64656c2e6a7067)

LightGCN only takes positive user-item interactions for model training. Pairs with rating < 1 will be ignored by the model.
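
The propagation rule described in the LightGCN paper can be sketched in a few lines of plain Python. Each layer replaces a node's embedding with the sum of its neighbours' embeddings, scaled by 1/sqrt(|N_u| |N_i|), and the final embedding is the uniform average over all layers including layer 0. This is an illustration of the rule on a toy graph, not the `recommenders` implementation; the function name `lightgcn_propagate` and the example embeddings are assumptions.

```python
import math

def lightgcn_propagate(user_emb, item_emb, edges, num_layers):
    """Return layer-averaged user and item embeddings.

    user_emb, item_emb: lists of equal-length float lists (layer-0 embeddings).
    edges: list of (user_index, item_index) positive interactions.
    """
    dim = len(user_emb[0])
    user_deg = [0] * len(user_emb)
    item_deg = [0] * len(item_emb)
    for u, i in edges:
        user_deg[u] += 1
        item_deg[i] += 1

    users, items = user_emb, item_emb
    user_sum = [list(v) for v in user_emb]  # running sum over layers
    item_sum = [list(v) for v in item_emb]
    for _ in range(num_layers):
        new_users = [[0.0] * dim for _ in user_emb]
        new_items = [[0.0] * dim for _ in item_emb]
        for u, i in edges:
            # Symmetric normalisation by the degrees of both endpoints
            norm = 1.0 / math.sqrt(user_deg[u] * item_deg[i])
            for d in range(dim):
                new_users[u][d] += norm * items[i][d]
                new_items[i][d] += norm * users[u][d]
        users, items = new_users, new_items
        for acc, layer in ((user_sum, users), (item_sum, items)):
            for row_acc, row in zip(acc, layer):
                for d in range(dim):
                    row_acc[d] += row[d]

    scale = 1.0 / (num_layers + 1)  # uniform layer weights
    final_users = [[x * scale for x in row] for row in user_sum]
    final_items = [[x * scale for x in row] for row in item_sum]
    return final_users, final_items

# Toy graph: user 0 clicked items 0 and 1, user 1 clicked item 1
user_emb = [[1.0, 0.0], [0.0, 1.0]]
item_emb = [[1.0, 1.0], [2.0, 0.0]]
edges = [(0, 0), (0, 1), (1, 1)]
final_users, final_items = lightgcn_propagate(user_emb, item_emb, edges, num_layers=1)
print(final_users)
```

There are no weight matrices or nonlinearities in the propagation; the only trainable parameters are the layer-0 embeddings, and a user-item score is the dot product of the two final embeddings.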

In [None]:
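# Note: the same frame is passed as train and test, so any metric
# reported during fitting is measured on the training interactions
data = ImplicitCF(
    train=df_clicks, test=df_clicks, seed=0,
    col_user='userID',
    col_item='itemID',
    col_rating='rating'
)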

In [None]:
yaml_file = './lightgcn.yaml'
lightgcn_dir = './lightgcn_model'
def create_dir(path):
    if not os.path.exists(path):
        os.makedirs(path)
create_dir(lightgcn_dir)
hparams = prepare_hparams(yaml_file,
                          learning_rate=0.005,
                          eval_epoch=20,
                          top_k=20,
                          save_model=False,
                          epochs=300,
                          save_epoch=20
                         )
hparams.MODEL_DIR = os.path.join(lightgcn_dir, 'saved_models')

In [None]:
model = LightGCN(hparams, data, seed=0)

In [None]:
with Timer() as train_time:
    model.fit()

print("Took {} seconds for training.".format(train_time.interval))

### Reflection


The GNN did not produce reasonable results in this attempt. The cause may lie in how the graph nodes and the user features were constructed. Because the dataset contains little user information beyond click behavior, we may need to gather additional data and modify the GCN model.

##### Challenges:

- Computationally intensive training.
- Addressing the cold start problem for new users and articles.

##### Future Work:

- Integrate context-aware recommendation to consider user’s current context.
- Explore sequential recommendation to account for temporal dynamics in user interactions.
- Incorporate social network data for enhanced recommendations.

##### Commercial Viability:

The proposed system shows promise but requires further optimization for real-time recommendations and handling new user/article scenarios.

##### Reference:

LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation https://arxiv.org/abs/2002.02126


## Model: Deep Learning Model

#### Global settings and imports

In [None]:
import os
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import OneHotEncoder
import matplotlib.pyplot as plt
from tqdm import tqdm
from google.colab import drive

#### Load datasets

In [None]:
def load_data(behaviors_path, news_path):
    behaviors = pd.read_csv(behaviors_path, sep='\t', names=['impression_id', 'user_id', 'time', 'history', 'impressions'])
    news = pd.read_csv(news_path, sep='\t', names=['news_id', 'category', 'subcategory', 'title', 'abstract', 'url', 'title_entities', 'abstract_entities'])
    return behaviors, news

#### Load embeddings

In [None]:
def load_embeddings(entity_embedding_path, relation_embedding_path):
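    # MIND ships the .vec files tab-separated (ID, then vector values);
    # drop any all-empty trailing column left by a tab at the end of each line
    entity_embeddings = pd.read_csv(entity_embedding_path, sep='\t', header=None, index_col=0).dropna(axis=1, how='all')
    relation_embeddings = pd.read_csv(relation_embedding_path, sep='\t', header=None, index_col=0).dropna(axis=1, how='all')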
    return entity_embeddings, relation_embeddings

#### Data preprocessing

In [None]:
def preprocess_data(behaviors, news):
    # Split user click history and impressions
    behaviors['history'] = behaviors['history'].fillna('').apply(lambda x: x.split())  # split() yields [] for an empty history
    behaviors['impressions'] = behaviors['impressions'].apply(lambda x: [i.split('-') for i in x.split(' ')])

    # Fill missing values in news data
    news['title'] = news['title'].fillna('')
    news['abstract'] = news['abstract'].fillna('')
    news['category'] = news['category'].fillna('unknown')
    news['subcategory'] = news['subcategory'].fillna('unknown')
    return behaviors, news

#### Text representation using TF-IDF

We adopted the TF-IDF method, setting the maximum number of features to 5000, and combined news titles and abstracts for vectorization. This method effectively captures the importance of words, particularly suitable for text-intensive content like news. Due to computational power limitations, we didn't use the BERT model, as the subsequent cosine similarity calculations were too computationally intensive.

In [None]:
def text_representation(news):
    # Combine title and abstract for TF-IDF vectorization
    vectorizer = TfidfVectorizer(max_features=5000)
    news['title_abstract'] = news['title'] + ' ' + news['abstract']
    tfidf_matrix = vectorizer.fit_transform(news['title_abstract'])
    return tfidf_matrix, vectorizer
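
To show what the TF-IDF weights express, here is a small plain-Python sketch on two made-up headlines. It uses the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which are the defaults of `TfidfVectorizer`, but with simple whitespace tokenisation, so the numbers are not guaranteed to match scikit-learn's output on real text. The function name `tfidf` is an assumption.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, L2-normalised TF-IDF rows) for a list of strings."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each token
    df = Counter(tok for doc in tokenized for tok in set(doc))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1.0 for tok in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        row = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        rows.append([x / norm for x in row])
    return vocab, rows

docs = ["stocks rally as markets open", "markets slide as oil falls"]  # made-up headlines
vocab, rows = tfidf(docs)
for tok, weight in zip(vocab, rows[0]):
    print(f"{tok:8s} {weight:.3f}")
```

Words that occur in only one headline ("rally") receive a larger weight than words shared by both ("markets"), which is the property the recommender relies on to tell articles apart.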

#### Entity representation using embeddings

We extracted the entities mentioned in each news title and used pre-trained entity embeddings to represent them. By averaging the embeddings of all title entities for each news article, we obtained the entity representation for that article. This lets the system use semantic information about the news, not just the surface text.

In [None]:
import json

def entity_representation(news, entity_embeddings):
    embedding_dim = entity_embeddings.shape[1]
    news_entity_embeddings = np.zeros((len(news), embedding_dim))

    for i, entities in enumerate(news['title_entities']):
        if pd.isna(entities) or entities == "[]":
            continue
        # The entity column is a JSON list; json.loads avoids running eval on file contents
        entity_ids = [entity['WikidataId'] for entity in json.loads(entities) if 'WikidataId' in entity]
        # Keep only entities that have a pre-trained embedding
        known_ids = [e for e in entity_ids if e in entity_embeddings.index]
        if known_ids:
            # Average the embeddings; rows without any known entity stay all-zero
            news_entity_embeddings[i] = entity_embeddings.loc[known_ids].values.mean(axis=0)

    return news_entity_embeddings
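
The averaging step can be tried on a toy input without pandas. In the sketch below the two-dimensional embedding table, the entity string, and the helper name `mean_entity_vector` are all made up; the point is that entities without an embedding are skipped and that an article with no known entity falls back to a zero vector.

```python
import json

def mean_entity_vector(entities_json, embedding_table, dim):
    """Average the embeddings of the entities listed in a JSON string."""
    if not entities_json or entities_json == "[]":
        return [0.0] * dim
    ids = [e["WikidataId"] for e in json.loads(entities_json) if "WikidataId" in e]
    vectors = [embedding_table[i] for i in ids if i in embedding_table]
    if not vectors:
        return [0.0] * dim  # no known entity: fall back to the zero vector
    return [sum(column) / len(vectors) for column in zip(*vectors)]

table = {"Q1": [1.0, 0.0], "Q2": [0.0, 1.0]}  # made-up 2-d embeddings
title_entities = (
    '[{"Label": "A", "WikidataId": "Q1"},'
    ' {"Label": "B", "WikidataId": "Q2"},'
    ' {"Label": "C", "WikidataId": "Q9"}]'  # Q9 has no embedding and is ignored
)
vec = mean_entity_vector(title_entities, table, dim=2)
print(vec)
```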

#### Combine TF-IDF and entity embeddings

We concatenated the TF-IDF vectors, the entity embedding vectors, and one-hot encodings of the category and subcategory. The advantage of this method is that it considers both the statistical characteristics of the text and semantic information, providing a rich feature representation for each news article.

In [None]:
def combine_features(tfidf_matrix, news_entity_embeddings, news):
    category_encoder = OneHotEncoder()
    categories = category_encoder.fit_transform(news[['category', 'subcategory']]).toarray()
    return np.hstack((tfidf_matrix.toarray(), news_entity_embeddings, categories))

#### User profile modeling

We adopted a simple but effective method. We traversed each user's click history, extracted the feature vectors of all clicked news, weighted them so that more recent clicks count more, and then took the average as the user's interest representation. This method captures the overall interest distribution of users and is computationally efficient, which suits large-scale online recommendation systems.

In [None]:
def user_profile_modeling(behaviors, news, combined_features, aggregate_size=100):
    news_index = {news_id: idx for idx, news_id in enumerate(news['news_id'])}
    user_profiles = {}

    def aggregate_profiles(profiles):
        return np.mean(profiles, axis=0) if len(profiles) > 0 else np.zeros(combined_features.shape[1])

    for user_id, hist in zip(behaviors['user_id'], behaviors['history']):
        if user_id not in user_profiles:
            user_profiles[user_id] = []
        weights = np.arange(1, len(hist) + 1) / len(hist)  # increasing weights for more recent items
        for news_id, weight in zip(hist, weights):
            if news_id in news_index:
                news_idx = news_index[news_id]
                user_profiles[user_id].append(weight * combined_features[news_idx])
            if len(user_profiles[user_id]) >= aggregate_size:
                user_profiles[user_id] = [aggregate_profiles(user_profiles[user_id])]

    for user_id in user_profiles:
        if len(user_profiles[user_id]) > 0:
            user_profiles[user_id] = aggregate_profiles(user_profiles[user_id])
        else:
            user_profiles[user_id] = np.zeros(combined_features.shape[1])

    return user_profiles
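
The recency weighting can be illustrated on its own. The sketch below is a variant that divides by the sum of the weights, so the result is a true weighted mean; this differs from the function above, which averages the already-weighted vectors, and is offered only as an illustration. The helper name `recency_weighted_profile` and the toy history are assumptions.

```python
def recency_weighted_profile(history_vectors):
    """Weighted mean of item vectors; the k-th of n items gets weight k/n."""
    n = len(history_vectors)
    if n == 0:
        return []
    weights = [(k + 1) / n for k in range(n)]  # oldest item has the smallest weight
    total = sum(weights)
    dim = len(history_vectors[0])
    return [
        sum(w * vec[d] for w, vec in zip(weights, history_vectors)) / total
        for d in range(dim)
    ]

# Toy history, oldest first: one click on topic A, then two on topic B
history = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
profile = recency_weighted_profile(history)
print(profile)
```

With weights 1/3, 2/3 and 1, the profile puts 1/6 of its mass on the older topic and 5/6 on the recent one, compared with 1/3 and 2/3 for an unweighted mean.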



#### Recommendation generation

We use cosine similarity to match user profiles and candidate news articles. Specifically, we calculate the cosine similarity between the user profile vector and all candidate news vectors, then select the top N articles with the highest similarity as the recommendation results. This method is simple, intuitive, and fast to compute, very suitable for real-time recommendation scenarios.

In [None]:
def generate_recommendations(user_profiles, news, combined_features):
    news_index = {news_id: idx for idx, news_id in enumerate(news['news_id'])}
    recommendations = {}

    for user_id in tqdm(user_profiles):
        user_profile = user_profiles[user_id]
        cosine_similarities = cosine_similarity(user_profile.reshape(1, -1), combined_features)
        # argsort is ascending, so reverse the last 10 to list the most similar article first
        similar_indices = cosine_similarities.argsort().flatten()[-10:][::-1]
        recommendations[user_id] = [news['news_id'].iloc[i] for i in similar_indices]

    return recommendations
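
The matching step reduces to ranking candidates by cosine similarity with the user profile. A minimal plain-Python sketch with made-up two-dimensional vectors follows; the helper names `cosine` and `top_n` are assumptions, and zero vectors are given similarity 0 by convention.

```python
import math

def cosine(a, b):
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for zero vectors, e.g. users with empty history
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

def top_n(user_profile, candidates, n):
    """Return the n candidate IDs most similar to the profile, best first."""
    scored = sorted(
        candidates.items(),
        key=lambda kv: cosine(user_profile, kv[1]),
        reverse=True,
    )
    return [news_id for news_id, _ in scored[:n]]

candidates = {"N1": [1.0, 0.0], "N2": [0.7, 0.7], "N3": [0.0, 1.0]}  # made-up vectors
ranking = top_n([0.9, 0.1], candidates, n=2)
print(ranking)
```

Sorting in descending order keeps the best match at position 0, which matters whenever the position in the list is later used as a rank.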

#### Define PyTorch dataset

In [None]:
class NewsDataset(Dataset):
    def __init__(self, news, labels):
        self.news = news
        self.labels = labels

    def __len__(self):
        return len(self.news)

    def __getitem__(self, idx):
        news = self.news[idx]
        label = self.labels[idx]
        return news, label

#### Define the neural network model

We introduced a deep learning model using PyTorch. This feedforward neural network includes an input layer, two hidden layers with Batch Normalization and ReLU activation, and Dropout layers to prevent overfitting. This model aims to learn complex relationships between user interests and news content, enhancing recommendation accuracy.

In [None]:
class NewsRecommendationModel(nn.Module):
    def __init__(self, input_dim, hidden_dim1, hidden_dim2, output_dim, dropout_prob=0.5):
        super(NewsRecommendationModel, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim1)
        self.bn1 = nn.BatchNorm1d(hidden_dim1)
        self.fc2 = nn.Linear(hidden_dim1, hidden_dim2)
        self.bn2 = nn.BatchNorm1d(hidden_dim2)
        self.fc3 = nn.Linear(hidden_dim2, output_dim)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(dropout_prob)

    def forward(self, x):
        out = self.fc1(x)
        out = self.bn1(out)
        out = self.relu(out)
        out = self.dropout(out)
        out = self.fc2(out)
        out = self.bn2(out)
        out = self.relu(out)
        out = self.dropout(out)
        out = self.fc3(out)
        return out

#### Model training function

In [None]:
def train_model(model, criterion, optimizer, dataloader, num_epochs=5):
    model.train()
    for epoch in range(num_epochs):
        running_loss = 0.0
        for inputs, labels in dataloader:
            inputs, labels = inputs.float(), labels.float()
            optimizer.zero_grad()
            outputs = model(inputs).squeeze(-1)  # (batch, 1) -> (batch,) so the logits match the label shape
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item() * inputs.size(0)

        epoch_loss = running_loss / len(dataloader.dataset)
        print(f'Epoch {epoch+1}/{num_epochs}, Loss: {epoch_loss:.4f}')

#### Evaluation function

In [None]:
def evaluate_model(recommendations, behaviors):
    all_true_labels = []
    all_pred_scores = []

    for user_id, impressions in zip(behaviors['user_id'], behaviors['impressions']):
        if user_id in recommendations:
            user_recommendations = recommendations[user_id]
            for impression in impressions:
                news_id, label = impression
                label = int(label)
                score = 1 if news_id in user_recommendations else 0
                all_true_labels.append(label)
                all_pred_scores.append(score)

    auc = roc_auc_score(all_true_labels, all_pred_scores)
    map_score = average_precision_score(all_true_labels, all_pred_scores)

    print(f"AUC: {auc}")
    print(f"MAP: {map_score}")

    return auc, map_score

#### Output predictions

In [None]:
def output_predictions(recommendations, behaviors, output_path='predictions.txt'):
    with open(output_path, 'w') as f:
        for user_id, impressions in zip(behaviors['user_id'], behaviors['impressions']):
            if user_id in recommendations:
                user_recommendations = recommendations[user_id]
                pred_rank = (np.argsort(np.argsort(user_recommendations)[::-1]) + 1).tolist()
                pred_rank = '[' + ','.join([str(i) for i in pred_rank]) + ']'
                f.write(' '.join([str(user_id), pred_rank]) + '\n')

#### Load data and embeddings, Processing the data

In [None]:
behaviors, news = load_data('/content/drive/MyDrive/MIND/behaviors.tsv', '/content/drive/MyDrive/MIND/news.tsv')
entity_embeddings, relation_embeddings = load_embeddings('/content/drive/MyDrive/MIND/entity_embedding.vec', '/content/drive/MyDrive/MIND/relation_embedding.vec')
behaviors, news = preprocess_data(behaviors, news)

#### Create TF-IDF and entity embeddings

In [None]:
tfidf_matrix, vectorizer = text_representation(news)
news_entity_embeddings = entity_representation(news, entity_embeddings)

#### Combine features

In [None]:
combined_features = combine_features(tfidf_matrix, news_entity_embeddings, news)

#### Create user profiles

In [None]:
user_profiles = user_profile_modeling(behaviors, news, combined_features)

#### Generate recommendations

In [None]:
recommendations = generate_recommendations(user_profiles, news, combined_features)

#### Initialize the model

In [None]:
input_dim = combined_features.shape[1]
hidden_dim = 128
output_dim = 1

# The constructor takes two hidden sizes; the same width is used for both layers
model = NewsRecommendationModel(input_dim, hidden_dim, hidden_dim, output_dim)
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

#### Create dataloader

In [None]:
all_news = torch.tensor(combined_features)
labels = torch.tensor([0] * len(all_news))  # Dummy labels for all news items
dataset = NewsDataset(all_news, labels)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

#### Train and evaluate the model

In [None]:
train_model(model, criterion, optimizer, dataloader, num_epochs=5)
evaluate_model(recommendations, behaviors)

## Discussion

1. **TF-IDF and Entity Embedding**:
   - **Manual Feature Design and Extraction**: Requires significant manual effort for feature design and extraction.
   - **Complex Relationships**: May struggle to capture complex relationships due to the manual nature of feature engineering.

2. **Matrix Factorization (MF)**:
   - **Handling Sparse and Incomplete Data**: Effective in managing sparse and incomplete datasets.
   - **Dimensionality Reduction**: Reduces the dimensionality and complexity of the data.
   - **Modeling Latent Factors**: Efficient in modeling latent factors in user-item interactions.
   - **Overfitting and Underfitting**: Susceptible to overfitting and underfitting, which can impact accuracy.

3. **Neural Collaborative Filtering (NCF)**:
   - **Understanding High-Dimensional Sparse Data**: Capable of comprehending high-dimensional sparse data.
   - **Embeddings for Interactions**: Uses embeddings to capture interactions and extract features.
   - **Deep Learning Approach**: Employs deep learning techniques to enhance collaborative filtering.

4. **LibFM (Factorization Machines)**:
   - **High-Dimensional Sparse Data**: Performs well with high-dimensional sparse datasets.
   - **Feature Interactions**: Captures feature interactions, offering more expressiveness than linear models.

5. **Graph Neural Networks (GNNs)**:
   - **Capturing Complex High-Order Relationships**: Excellent at modeling complex high-order relationships between users and news articles.
   - **Deeper Interactions**: Utilizes message-passing mechanisms to model deeper interactions.


### Summary

- **TF-IDF and Entity Embedding**: Labor-intensive with limitations in capturing complex relationships.
- **Matrix Factorization (MF)**: Effective for sparse data and latent factor modeling but prone to overfitting and underfitting.
- **Neural Collaborative Filtering (NCF)**: Handles high-dimensional sparse data using deep learning and embeddings.
- **LibFM (Factorization Machines)**: Captures feature interactions effectively in high-dimensional sparse data.
- **Graph Neural Networks (GNNs)**: Excels in capturing complex relationships and deeper interactions through advanced mechanisms.

### Future Work

1. Enhanced Feature Engineering: Explore automated methods for feature design and extraction to reduce manual effort and improve the ability to capture complex relationships.
2. Hybrid Models: Develop hybrid approaches that combine matrix factorization with other techniques to mitigate overfitting and underfitting while improving predictive accuracy.
3. Contextual Information Integration: Incorporate additional contextual information such as user demographics, temporal dynamics, and item attributes to enhance model performance.
4. Scalability Improvements: Investigate scalable algorithms and distributed computing techniques to handle large-scale datasets more efficiently.
5. Advanced Neural Network Techniques: Apply advanced neural network techniques, such as dynamic and hierarchical graph neural networks, to better capture complex and evolving relationships.

## Conclusion

This project successfully implemented and evaluated a news recommendation system based on the MIND dataset. By utilizing a variety of methodologies, we were able to analyze their effectiveness in recommending news articles to users based on their historical behaviors.

Through this exploration, we identified the strengths and weaknesses of each approach, providing valuable insights into their application in news recommendation systems. The diversity of methods, from traditional techniques like TF-IDF and entity embedding to more advanced approaches like GNNs, highlighted the complexity and nuance required in creating effective recommendation systems.

Our findings suggest that while each method has its advantages, there is significant potential for improvement, particularly in areas such as feature engineering, hybrid model development, and scalability. These improvements could enhance the system's ability to provide personalized and accurate recommendations.