# 2024 Recsys Challenge

## About

This year's challenge focuses on online news recommendation, addressing both the technical and normative challenges inherent in designing effective and responsible recommender systems for news publishing. The challenge will delve into the unique aspects of news recommendation, including modeling user preferences based on implicit behavior, accounting for the influence of the news agenda on user interests, and managing the rapid decay of news items. Furthermore, our challenge embraces the normative complexities, involving investigating the effects of recommender systems on the news flow and whether they resonate with editorial values. [1]

## Challenge Task

The Ekstra Bladet RecSys Challenge aims to predict which article a user will click on from a list of articles that were seen during a specific impression. Utilizing the user's click history, session details (like time and device used), and personal metadata (including gender and age), along with a list of candidate news articles listed in an impression log, the challenge's objective is to rank the candidate articles based on the user's personal preferences. This involves developing models that encapsulate both the users and the articles through their content and the users' interests. The models are to estimate the likelihood of a user clicking on each article by evaluating the compatibility between the article's content and the user's preferences. The articles are ranked based on these likelihood scores, and the precision of these rankings is measured against the actual selections made by users. [1]

## Dataset Information

The Ekstra Bladet News Recommendation Dataset (EB-NeRD) was created to support advancements in news recommendation research. It was collected from user behavior logs at Ekstra Bladet. We collected behavior logs from active users during the 6 weeks from April 27 to June 8, 2023. This timeframe was selected to avoid major events, e.g., holidays or elections, that could trigger atypical behavior at Ekstra Bladet. The active users were defined as users who had at least 5 and at most 1,000 news click records in a three-week period from May 18 to June 8, 2023. To protect user privacy, every user was delinked from the production system when securely hashed into an anonymized ID using one-time salt mapping. Alongside, we provide Danish news articles published by Ekstra Bladet. Each article is enriched with textual context features such as title, abstract, body, and categories, among others. Furthermore, we provide features that have been generated by proprietary models, including topics, named entity recognition (NER), and article embeddings. [2]

For more information, see the [dataset page](https://recsys.eb.dk/dataset/).

## References
[1] [RecSys Challenge 2024 Logistics](https://recsys.eb.dk/)

[2] [Ekstra Bladet News Recommendation Dataset](https://recsys.eb.dk/dataset/)

------------------------------------------------------------------------------

### Notebook Organization
### The purpose of this notebook is modeling (see the EDA notebook for exploratory analysis)

- Logistics
- Data Preprocessing
- Modeling

Recommendation systems benefit users by improving the discovery of relevant information, products, and services, thus saving users time and effort. They create personalized experiences by tailoring content to individual preferences and support various fields, including education, healthcare, and e-commerce. By enhancing efficiency and accessibility, these systems contribute to more informed decision-making and an improved quality of life.

------------------------------------------------------------------------------------

# Data Preprocessing

Let's import our packages used for this notebook.

In [3]:
# Packages
from datetime import datetime
from plotly.subplots import make_subplots
import numpy as np
import pandas as pd
import plotly
import plotly.express as px
import plotly.graph_objects as go

Load in the three separate data sources of the dataset:

**Articles**: Detailed information of news articles.[*](https://recsys.eb.dk/dataset/#articles)

**Behaviors**: Impression Logs. [*](https://recsys.eb.dk/dataset/#behaviors)

**History**: Click histories of users. [*](https://recsys.eb.dk/dataset/#history)

In [4]:
# Load in various dataframes
# Articles
df_art = pd.read_parquet("Data/Small/articles.parquet")

# Behaviors
df_bev = pd.read_parquet("Data/Small/train/behaviors.parquet")

# History
df_his = pd.read_parquet("Data/Small/train/history.parquet")

In [5]:
# Load the validation split
# Articles are shared across splits and were already loaded above as df_art

# Behaviors
df_bev_val = pd.read_parquet("Data/Small/validation/behaviors.parquet")

# History
df_his_val = pd.read_parquet("Data/Small/validation/history.parquet")

What feature can we join the data sources on?

- Articles & Behavior: Article ID

- History & Behavior: User ID

Before we can join, we need to convert the behavior['article_id'] column to a consistent type.

In [6]:
# Convert datatype of column first
df_bev['article_id'] = df_bev['article_id'].apply(lambda x: x if isinstance(x, str) else int(x) if not np.isnan(x) else x)

# Join behaviors to articles
df = df_bev.join(df_art.set_index("article_id"), on="article_id")

# Join behaviors to history
df = df.join(df_his.set_index("user_id"), on="user_id")


More preprocessing needed before we can begin further analysis.

In [7]:
def device_(x):
    """ 
    Changes the device input from a int to a str
    Keyword arguments:
        x -- int
    Output:
        str
    """
    if x == 1:
        return 'Desktop'
    elif x == 2:
        return 'Mobile'
    else:
        return 'Tablet'

def gender_(x):
    """ 
    Changes the gender input from a float to a str
    Keyword arguments:
        x -- float
    Output:
        str
    """
    if x == 0.0:
        return 'Male'
    elif x == 1.0:
        return 'Female'
    else:
        return None


def postcodes_(x):
    """ 
    Changes the postcodes input from a float to a str
    Keyword arguments:
        x -- float
    Output:
        str
    """
    if x == 0.0:
        return 'Metropolitan'
    elif x == 1.0:
        return 'Rural District'

    elif x == 2.0:
        return 'Municipality'

    elif x == 3.0:
        return 'Provincial'

    elif x == 4.0:
        return 'Big City'

    else:
        return None

In [8]:
# Preprocessing
df.dropna(subset=['article_id'], inplace=True)

# Change article IDs into int
df['article_id'] = df['article_id'].apply(lambda x: int(x))
df['article_id'] = df['article_id'].astype(np.int64)

# Change device type from int to string
df['device_type'] = df['device_type'].apply(lambda x: device_(x))

# Change genders from float to string
df['gender'] = df['gender'].apply(lambda x: gender_(x))

# Change age to a string range (e.g. 40 becomes '40 - 49')
df['age'] = df['age'].astype('Int64')
df['age'] = df['age'].astype(str)
df['age'] = df['age'].apply(
    lambda x: x if x == '<NA>' else x + ' - ' + x[0] + '9')


# Change postcodes from int to str
df['postcode'] = df['postcode'].apply(lambda x: postcodes_(x))

Next section will be on all the helper functions used in this notebook!

-------------------------------------------------------------------------------------

# MODELING

We will explore news recommendation systems, starting from first principles and progressing to state-of-the-art approaches. This will include exploring content-based approaches and hybrid-based approaches.

## Content-Based Approach


Our first approach will be utilizing TF-IDF vectorization because of its simplicity.

In conjunction with TF-IDF, we need a similarity method to compare documents, such as cosine similarity or a FAISS nearest-neighbor search.
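
To make the weighting concrete, here is a minimal sketch of the textbook TF-IDF formula, tf × log(N / df), on a toy corpus. The corpus and the `tfidf` helper are illustrative assumptions, not part of the notebook; scikit-learn's `TfidfVectorizer` applies a smoothed IDF and normalization by default, so its numbers will differ.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy corpus
docs = ["fodbold kamp i aften", "kamp om magten", "vejret i aften"]
weights = tfidf(docs)
```

In this sketch a term that occurs in a single document ("fodbold") receives a higher weight than one shared by several documents ("kamp"), and a term present in every document receives a weight of zero.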

### Cosine Similarity

Cosine similarity is a measure used to determine the similarity between two non-zero vectors in a multi-dimensional space. It calculates the cosine of the angle between the vectors, with values ranging from -1 to 1. In the context of NLP and recommender systems, it is commonly used to compare TF-IDF or word embeddings to evaluate the similarity between documents or items. Setting an appropriate threshold value is crucial for determining the significance of similarity. 
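
As a small illustration of the definition above, the sketch below computes the cosine directly from the dot product and the vector norms. The `cosine` helper and the example vectors are hypothetical; the notebook itself relies on scikit-learn's `cosine_similarity`.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])  # parallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])                # no shared terms
```

Because TF-IDF weights are non-negative, the cosine between two TF-IDF vectors falls between 0 and 1, which is worth keeping in mind when choosing a threshold.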

In [17]:
# Packages
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import pandas as pd
import sklearn

# Modeling
# Rand Seed
np.random.seed(42)

# Merge fields into strings for processing
df_art['topics_str'] = df_art['topics'].apply(' '.join)
df_art['entity_groups_str'] = df_art['entity_groups'].apply(' '.join)
df_art['ner_clusters_str'] = df_art['ner_clusters'].apply(' '.join)

# Create a dictionary for quick lookups
article_content_dict = {
    row['article_id']: f"{row['title']} {row['body']} {row['category_str']} {row['article_type']} "
                       f"{row['ner_clusters_str']} {row['entity_groups_str']} {row['topics_str']}"
    for _, row in df_art.iterrows()
}

# Fit TF-IDF vectorizer
vectorizer = TfidfVectorizer(norm='l1')
tfidf_matrix_all = vectorizer.fit_transform(article_content_dict.values())
article_ids = list(article_content_dict.keys())

# Map article IDs to indices in the TF-IDF matrix
article_id_to_idx = {article_id: idx for idx, article_id in enumerate(article_ids)}

# Initialize predicted impressions list
predicted_impressions = []
# Store similarity scores for auc_score
similarity_scores = []

# Process the first 1,000 impressions in the behavior log
for i in df_bev.index[0:1000]:  
    # Candidate articles shown in this impression (not the user's click history)
    user_article_history = df_bev.loc[i, 'article_ids_inview']
    
    # Map article IDs to indices if they exist in the article ID-to-index dictionary
    indices = [article_id_to_idx[x] for x in user_article_history if x in article_id_to_idx]
    
    # If no valid indices are found, record a null prediction with score 0
    # (keeps both lists aligned for the metrics below) and skip further processing
    if not indices:
        predicted_impressions.append(None)
        similarity_scores.append(0)
        continue
    
    # Average the TF-IDF vectors of the in-view articles to form the profile vector
    user_profile_vector = np.asarray(tfidf_matrix_all[indices].mean(axis=0))
    
    highest_similarity = 0
    best_imp = None

    # Compare each in-view article against the profile vector
    for imp in user_article_history:
        if imp in article_id_to_idx:
            # Retrieve the dense TF-IDF vector for the article
            imp_idx = article_id_to_idx[imp]
            imp_tfidf_vector = tfidf_matrix_all[imp_idx].toarray()
            
            # Calculate the cosine similarity between the user's profile and the article
            similarity = cosine_similarity(user_profile_vector, imp_tfidf_vector.reshape(1, -1))[0, 0]
            
            # Update the best match if the current similarity is higher
            if similarity > highest_similarity:
                highest_similarity = similarity
                best_imp = imp

    # If similarity is low, pick a random article; otherwise, use the best match
    if highest_similarity <= 0.5:
        impression = np.random.choice(user_article_history)
        predicted_impressions.append(impression)
        # Assign low similarity value for random choice
        similarity_scores.append(0)  
        print(f"User {df_bev.loc[i, 'user_id']}: Low similarity, random impression chosen")
    else:
        predicted_impressions.append(best_imp)
        similarity_scores.append(highest_similarity)
        print(f"User {df_bev.loc[i, 'user_id']}: Best impression {best_imp} with similarity {highest_similarity}")

# Prepare binary labels for AUC
actual_impressions = [x[0] for x in df_bev['article_ids_clicked'].values][0:1000]
binary_labels = [1 if pred == actual else 0 for pred, actual in zip(predicted_impressions, actual_impressions)]
auc_score = sklearn.metrics.roc_auc_score(binary_labels, similarity_scores)

# Calculate accuracy
y_pred = predicted_impressions
y_true = actual_impressions
acc = sklearn.metrics.accuracy_score(y_true, y_pred)

print("-------------------------")
print("Accuracy:", acc)
print("AUC Score:", auc_score)


User 139836: Best impression 9778669 with similarity 0.6107836259921116
User 143471: Best impression 9778669 with similarity 0.5256599132656282
User 151570: Best impression 9778669 with similarity 0.5569654912324206
User 151570: Best impression 7213923 with similarity 0.5416005334566849
User 151570: Best impression 9774568 with similarity 0.5922306886631256
User 151570: Low similarity, random impression chosen
User 151570: Best impression 9778627 with similarity 0.5507244747867066
User 161621: Low similarity, random impression chosen
User 161621: Low similarity, random impression chosen
User 161621: Best impression 9778769 with similarity 0.6293684333116504
User 161621: Low similarity, random impression chosen
User 161621: Low similarity, random impression chosen
User 161621: Best impression 9778669 with similarity 0.5728854668466661
User 163208: Best impression 9778669 with similarity 0.6398413332018615
User 164957: Low similarity, random impression chosen
User 164957: Low similarity,

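The model's accuracy is 0.12, below the 1/6 (≈0.1667) expected from a random pick. Its ROC-AUC of 0.59 is above the 0.5 of random guessing, but note what it measures here: the labels mark whether the top pick matched the click, so the score reflects how well the similarity value separates correct picks from incorrect ones rather than how well the candidates are ranked. One likely cause of the low accuracy is that the profile vector is averaged from the in-view articles themselves instead of the user's click history.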
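
Since the task ranks every candidate in an impression, a complementary check is an AUC computed per impression over the candidate scores. The sketch below uses the pairwise definition of AUC; the `impression_auc` helper and the example scores are hypothetical and are not taken from the notebook.

```python
def impression_auc(scores, labels):
    """AUC for one impression: fraction of (clicked, not-clicked) pairs
    in which the clicked article received the higher score (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined without both classes
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical scores for five candidates; the second one was clicked
scores = [0.10, 0.70, 0.30, 0.90, 0.20]
labels = [0, 1, 0, 0, 0]
auc = impression_auc(scores, labels)
```

Here the clicked article outranks three of the four non-clicked candidates, so the impression-level AUC is 0.75. Averaging this value over impressions gives one possible summary of ranking quality.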

### FAISS

FAISS (Facebook AI Similarity Search) is a library developed by Facebook AI that provides efficient tools for similarity search and clustering of dense vectors. Its central task is finding the vectors in a database that are closest to a query vector, typically using Euclidean (L2) distance or inner product; cosine similarity can be obtained by normalizing the vectors first. Setting an appropriate threshold value is crucial for determining the significance of similarity. Because FAISS works on dense vectors, a dimensionality reduction technique such as SVD may be needed to reduce the memory required for computation.
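
A flat L2 index reports squared Euclidean distances, where smaller means closer, so its output is not directly comparable to a cosine similarity. For unit-length vectors the two are related by ||a − b||² = 2 − 2·cos(a, b). The sketch below checks that identity in plain Python; the helper names and vectors are hypothetical, and no FAISS call is involved.

```python
import math

def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_l2(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

a, b = [3.0, 1.0, 0.0], [1.0, 2.0, 2.0]
dist = squared_l2(normalize(a), normalize(b))
cos_from_dist = 1.0 - dist / 2.0  # recover cosine from the squared distance
```

Under this assumption of normalized inputs, a distance can be converted into a similarity before a threshold is applied.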

In [15]:
# Packages
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
import pandas as pd
import sklearn
import faiss  

# Modeling
# Rand Seed
np.random.seed(42)

# Merge fields into strings for processing
df_art['topics_str'] = df_art['topics'].apply(' '.join)
df_art['entity_groups_str'] = df_art['entity_groups'].apply(' '.join)
df_art['ner_clusters_str'] = df_art['ner_clusters'].apply(' '.join)

# Create a dictionary for quick lookups
article_content_dict = {
    row['article_id']: f"{row['title']} {row['body']} {row['category_str']} {row['article_type']} "
                       f"{row['ner_clusters_str']} {row['entity_groups_str']} {row['topics_str']}"
    for _, row in df_art.iterrows()
}

# Fit TF-IDF vectorizer
vectorizer = TfidfVectorizer(norm='l1')
tfidf_matrix_all = vectorizer.fit_transform(article_content_dict.values())
article_ids = list(article_content_dict.keys())

# Dimensionality reduction for the TF-IDF matrix
# Number of components to keep
n_components = 100  
svd = TruncatedSVD(n_components=n_components)
tfidf_matrix_reduced = svd.fit_transform(tfidf_matrix_all)

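# Create a FAISS index for the reduced TF-IDF matrix
# IndexFlatL2 performs exact search by L2 distance and expects float32 input
tfidf_matrix_reduced = tfidf_matrix_reduced.astype(np.float32)
index = faiss.IndexFlatL2(tfidf_matrix_reduced.shape[1]) 
index.add(tfidf_matrix_reduced)  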

# Map article IDs to indices in the TF-IDF matrix
article_id_to_idx = {article_id: idx for idx, article_id in enumerate(article_ids)}

# Initialize predicted impressions list
predicted_impressions = []
# Store similarity scores for auc_score
similarity_scores = []

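# Process the first 1,000 impressions in the behavior log
for i in df_bev.index[0:1000]:
    # Candidate articles shown in this impression (not the user's click history)
    user_article_history = df_bev.loc[i, 'article_ids_inview']
    indices = [article_id_to_idx[x] for x in user_article_history if x in article_id_to_idx]

    if not indices:
        predicted_impressions.append(None)
        # Keep similarity_scores aligned with predicted_impressions
        similarity_scores.append(0)
        continue

    # Average the reduced vectors of the in-view articles into a single query vector
    user_profile_vector = tfidf_matrix_reduced[indices].mean(axis=0).reshape(1, -1).astype(np.float32)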

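    # Use FAISS to find the 10 nearest articles to the query vector.
    # The search runs over the whole article corpus, not only the in-view candidates
    D, I = index.search(user_profile_vector, k=10)

    # IndexFlatL2 returns squared L2 distances in ascending order, so D[0][0]
    # is the smallest distance (closest article) rather than a similarity
    highest_similarity = D[0][0]
    best_imp_idx = I[0][0]

    # Fall back to a random in-view article when the distance is below the threshold
    # (note: a small distance means a close match, so this test is inverted)
    if highest_similarity < 0.00005: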
        impression = np.random.choice(user_article_history)
        predicted_impressions.append(impression)
        similarity_scores.append(0)
        print(f"User {df_bev.loc[i, 'user_id']}: Low similarity, random impression chosen")
    else:
        # Retrieve the article ID of the best match
        best_imp = article_ids[best_imp_idx]
        predicted_impressions.append(best_imp)
        # Store the similarity score
        similarity_scores.append(highest_similarity)  
        print(f"User {df_bev.loc[i, 'user_id']}: Best impression {best_imp} with similarity {highest_similarity}")

# Prepare binary labels for AUC
actual_impressions = [x[0] for x in df_bev['article_ids_clicked'].values][0:1000]
binary_labels = [1 if pred == actual else 0 for pred, actual in zip(predicted_impressions, actual_impressions)]
auc_score = sklearn.metrics.roc_auc_score(binary_labels, similarity_scores)

# Calculate accuracy
y_pred = predicted_impressions
y_true = actual_impressions
acc = sklearn.metrics.accuracy_score(y_true, y_pred)

print("-------------------------")
print("Accuracy:", acc)
print("AUC Score:", auc_score)


User 139836: Best impression 9506939 with similarity 8.73170793056488e-05
User 143471: Low similarity, random impression chosen
User 151570: Best impression 9127331 with similarity 9.042416786542162e-05
User 151570: Low similarity, random impression chosen
User 151570: Low similarity, random impression chosen
User 151570: Low similarity, random impression chosen
User 151570: Low similarity, random impression chosen
User 161621: Low similarity, random impression chosen
User 161621: Low similarity, random impression chosen
User 161621: Low similarity, random impression chosen
User 161621: Low similarity, random impression chosen
User 161621: Low similarity, random impression chosen
User 161621: Best impression 9321747 with similarity 6.255736661842093e-05
User 163208: Low similarity, random impression chosen
User 164957: Low similarity, random impression chosen
User 164957: Low similarity, random impression chosen
User 164957: Low similarity, random impression chosen
User 164957: Low sim

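The model's accuracy is 0.096, below the 1/6 (≈0.1667) expected from a random pick, and its ROC-AUC of 0.3324 is worse than random guessing. Two properties of the code likely contribute: `IndexFlatL2` returns distances, where smaller means more similar, yet the values are used as similarity scores; and the nearest neighbor is searched across the entire article corpus, so the predicted article may not be among the candidates in view.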

## Hybrid-based Approach 

Content-based approaches performed poorly, in part because TF-IDF is a bag-of-words representation that does not capture relationships between words. Next, we turn to more recent neural models.

We'll be examining the Deep Interest Network (DIN) model. 

### Deep Interest Network

#### Overview

1) Data Preparation
2) Model training
3) Model prediction

##### Data Preparation
The data preparation script is displayed [here](https://github.com/SulmanK/2024-Recsys-Challenge/blob/main/fuxcitr_dir/data/prepare_data_v1.py).

The main goal is to preprocess news and user interaction data across the training, validation, and testing sets.

- Inputs: article, behavior, history .parquet files
- Output: training, validation, testing .csv files

**Structure**
1) Load news articles from the parquet files.
2) Tokenize and map categorical features.
3) Process user interaction history.
4) Create feature mappings for categories, sentiments, and article types.
5) Generate CSV files with processed features.

##### Model training
Train a Deep Interest Network (DIN) model. The model configuration is [here](https://github.com/SulmanK/2024-Recsys-Challenge/blob/main/fuxcitr_dir/config/base_config/model_config.yaml). 

**Important parameters**

*Model Architecture*

- embedding_dim: Dimensionality of feature embeddings (default: 40)
- dnn_hidden_units: Neural network architecture (default: [500, 500, 500])
- dnn_activations: Activation function for dense layers (default: relu)

*Attention Mechanism*


- din_target_field: The target item being predicted
- din_sequence_field: User's historical interaction sequence
- attention_hidden_units: Structure of attention network
- attention_hidden_activations: Activation in attention layers (default: "Dice")
- din_use_softmax: Whether to use softmax in attention mechanism
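
The attention parameters above control how the click history is pooled with respect to the candidate article. The following is a simplified sketch of that idea, not the FuxiCTR implementation: relevance is computed here as a dot product, whereas DIN learns it with a small feed-forward network (the role of `attention_hidden_units`). The function name, embeddings, and `use_softmax` flag are hypothetical stand-ins for `din_sequence_field`, `din_target_field`, and `din_use_softmax`.

```python
import math

def attention_pool(history, target, use_softmax=False):
    """Weighted sum of history embeddings, weighted by relevance to the target.

    Simplification: relevance is a dot product; DIN learns it with a small MLP.
    """
    scores = [sum(h_i * t_i for h_i, t_i in zip(h, target)) for h in history]
    if use_softmax:
        # Normalize the relevance scores so that the weights sum to 1
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        scores = [e / total for e in exps]
    dim = len(target)
    return [sum(w * h[d] for w, h in zip(scores, history)) for d in range(dim)]

# Hypothetical 2-dimensional embeddings for three clicked articles
history = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
target = [1.0, 0.0]
pooled = attention_pool(history, target)
pooled_softmax = attention_pool(history, target, use_softmax=True)
```

History items unrelated to the candidate (the second one here) receive zero or small weight, so the pooled user vector changes with each candidate article.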

*Training Dynamics*


- learning_rate: Controls model convergence (default: 1e-3)
- batch_size: Number of samples per training iteration
- epochs: Total training iterations
- optimizer: Optimization algorithm (default: adam)

In [None]:
DIN_test:
    model: DIN
    dataset_id: tiny_seq
    loss: 'binary_crossentropy'
    metrics: ['logloss', 'AUC']
    task: binary_classification
    optimizer: adam
    learning_rate: 1.0e-3
    embedding_regularizer: 0
    net_regularizer: 0
    batch_size: 128
    embedding_dim: 4
    dnn_hidden_units: [64, 32]
    dnn_activations: relu
    attention_hidden_units: [64]
    attention_hidden_activations: "Dice"
    attention_output_activation: null
    attention_dropout: 0
    din_target_field: adgroup_id
    din_sequence_field: click_sequence
    net_dropout: 0
    batch_norm: False
    epochs: 1
    shuffle: True
    seed: 2019
    monitor: 'AUC'
    monitor_mode: 'max'

##### Prediction
The submission script is displayed [here](https://github.com/SulmanK/2024-Recsys-Challenge/blob/main/fuxcitr_dir/submit.py).

The main goal is to make predictions on the testing set.

**Structure**
1) Create a data loader for test data
2) Load test data from CSV
3) Predict scores for each sample
4) Rank predictions for each impression
5) Write results to a predictions.txt file
6) Zip the predictions
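
The ranking and writing steps can be sketched as follows. This is a minimal illustration rather than the linked `submit.py`: the `scores_to_ranks` helper, the impression IDs, and the scores are hypothetical, and the line layout (`impression_id [rank,rank,...]`) is an assumption that should be checked against the challenge's submission guidelines.

```python
def scores_to_ranks(scores):
    """Rank 1 goes to the highest score; ties keep their original order."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks

# Hypothetical predicted click probabilities keyed by impression ID
predictions = {101: [0.2, 0.9, 0.5], 102: [0.7, 0.1]}
lines = [
    f"{imp_id} [{','.join(str(r) for r in scores_to_ranks(s))}]"
    for imp_id, s in predictions.items()
]
```

Each entry in `lines` would then be written as one row of the predictions file before zipping.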

#### Procedure


1) Prepare the data by preprocessing news and user interaction data. (go to fuxcitr_dir/data directory)

``` python prepare_data_v1.py```

2) Run the following script in the fuxcitr_dir directory to train the model on train and validation sets.

```python run_param_tuner.py --config config/DIN_ebnerd_large_x1_tuner_config_01.yaml --gpu 0```

3) Run the following script in the fuxcitr_dir directory to make predictions on the test set.

```python submit.py --config config/DIN_ebnerd_large_x1_tuner_config_01 --expid DIN_ebnerd_large_x1_001_1860e41e --gpu 1```

#### Results

The model's ROC-AUC score is ~0.676, a clear improvement over the content-based baselines above (0.59 with cosine similarity and 0.33 with FAISS).

### Deep Cross Network

#### Overview

1) Data Preparation
2) Model training
3) Model prediction

##### Data Preparation
The data preparation script is displayed [here](https://github.com/SulmanK/2024-Recsys-Challenge/blob/main/fuxcitr_dir/data/prepare_data_v1.py).

The main goal is to preprocess news and user interaction data across the training, validation, and testing sets.

- Inputs: article, behavior, history .parquet files
- Output: training, validation, testing .csv files

**Structure**
1) Load news articles from the parquet files.
2) Tokenize and map categorical features.
3) Process user interaction history.
4) Create feature mappings for categories, sentiments, and article types.
5) Generate CSV files with processed features.

##### Model training
Train a Deep Cross Network (DCN) model. The model configuration is [here](https://github.com/SulmanK/2024-Recsys-Challenge/blob/main/fuxcitr_dir/config/base_config/model_config.yaml). 
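
For orientation, a Deep Cross Network builds explicit feature crosses by stacking cross layers of the form x_{l+1} = x_0 · (x_l · w_l) + b_l + x_l, alongside a regular deep network. The sketch below implements one such layer in plain Python under that formula; the function name, weights, and inputs are hypothetical, and in practice the weights are learned during training.

```python
def cross_layer(x0, xl, w, b):
    """One DCN cross layer: x_{l+1} = x0 * (xl . w) + b + xl."""
    scalar = sum(x * wi for x, wi in zip(xl, w))  # xl . w is a single number
    return [x0_i * scalar + b_i + xl_i for x0_i, b_i, xl_i in zip(x0, b, xl)]

x0 = [1.0, 2.0]  # hypothetical input feature vector
w = [0.5, 0.5]   # hypothetical weights
b = [0.0, 0.0]   # hypothetical biases
x1 = cross_layer(x0, x0, w, b)  # first cross layer
x2 = cross_layer(x0, x1, w, b)  # second cross layer reuses the original input x0
```

Each additional layer raises the degree of the feature interactions by one while keeping the residual term `xl`, so the output has the same dimension as the input.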

**Important parameters**

*Model Architecture*

- embedding_dim: Dimensionality of feature embeddings (default: 40)
- dnn_hidden_units: Neural network architecture (default: [500, 500, 500])
- dnn_activations: Activation function for dense layers (default: relu)

*Attention Mechanism*


- din_target_field: The target item being predicted
- din_sequence_field: User's historical interaction sequence
- attention_hidden_units: Structure of attention network
- attention_hidden_activations: Activation in attention layers (default: "Dice")
- din_use_softmax: Whether to use softmax in attention mechanism

*Training Dynamics*


- learning_rate: Controls model convergence (default: 1e-3)
- batch_size: Number of samples per training iteration
- epochs: Total training iterations
- optimizer: Optimization algorithm (default: adam)

In [None]:
DIN_test:
    model: DIN
    dataset_id: tiny_seq
    loss: 'binary_crossentropy'
    metrics: ['logloss', 'AUC']
    task: binary_classification
    optimizer: adam
    learning_rate: 1.0e-3
    embedding_regularizer: 0
    net_regularizer: 0
    batch_size: 128
    embedding_dim: 4
    dnn_hidden_units: [64, 32]
    dnn_activations: relu
    attention_hidden_units: [64]
    attention_hidden_activations: "Dice"
    attention_output_activation: null
    attention_dropout: 0
    din_target_field: adgroup_id
    din_sequence_field: click_sequence
    net_dropout: 0
    batch_norm: False
    epochs: 1
    shuffle: True
    seed: 2019
    monitor: 'AUC'
    monitor_mode: 'max'

##### Prediction
The submission script is displayed [here](https://github.com/SulmanK/2024-Recsys-Challenge/blob/main/fuxcitr_dir/submit.py).

The main goal is to make predictions on the testing set.

**Structure**
1) Create a data loader for test data
2) Load test data from CSV
3) Predict scores for each sample
4) Rank predictions for each impression
5) Write results to a predictions.txt file
6) Zip the predictions

#### Procedure


1) Prepare the data by preprocessing news and user interaction data. (go to fuxcitr_dir/data directory)

``` python prepare_data_v1.py```

2) Run the following script in the fuxcitr_dir directory to train the model on train and validation sets.

```python run_param_tuner.py --config config/DIN_ebnerd_large_x1_tuner_config_01.yaml --gpu 0```

3) Run the following script in the fuxcitr_dir directory to make predictions on the test set.

```python submit.py --config config/DIN_ebnerd_large_x1_tuner_config_01 --expid DIN_ebnerd_large_x1_001_1860e41e --gpu 1```

#### Results

The model's ROC-AUC score is ~0.676, which is much better than our previous model scores.