# 2024 RecSys Challenge

## About

This year's challenge focuses on online news recommendation, addressing both the technical and normative challenges inherent in designing effective and responsible recommender systems for news publishing. The challenge will delve into the unique aspects of news recommendation, including modeling user preferences based on implicit behavior, accounting for the influence of the news agenda on user interests, and managing the rapid decay of news items. Furthermore, our challenge embraces the normative complexities, involving investigating the effects of recommender systems on the news flow and whether they resonate with editorial values. [1]

## Challenge Task

The Ekstra Bladet RecSys Challenge aims to predict which article a user will click on from a list of articles that were seen during a specific impression. Utilizing the user's click history, session details (like time and device used), and personal metadata (including gender and age), along with a list of candidate news articles listed in an impression log, the challenge's objective is to rank the candidate articles based on the user's personal preferences. This involves developing models that encapsulate both the users and the articles through their content and the users' interests. The models are to estimate the likelihood of a user clicking on each article by evaluating the compatibility between the article's content and the user's preferences. The articles are ranked based on these likelihood scores, and the precision of these rankings is measured against the actual selections made by users. [1]
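To make the ranking objective concrete, here is a minimal sketch of how one impression could be scored. It assumes per-impression AUC and reciprocal rank as the measures, which the description above does not name; the article IDs, scores, and helper names are made up for illustration.

```python
def rank_candidates(candidate_ids, scores):
    """Return candidate IDs ordered from highest to lowest predicted score."""
    pairs = sorted(zip(candidate_ids, scores), key=lambda p: p[1], reverse=True)
    return [art for art, _ in pairs]


def impression_auc(scores, labels):
    """Share of (clicked, not clicked) pairs that the scores order correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined without both classes
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def reciprocal_rank(ranked_ids, clicked_ids):
    """1 / position of the first clicked article in the ranking (0 if none)."""
    for position, art in enumerate(ranked_ids, start=1):
        if art in clicked_ids:
            return 1.0 / position
    return 0.0


# Hypothetical impression: four candidates, one click
candidates = [101, 102, 103, 104]
scores = [0.10, 0.70, 0.40, 0.20]
clicked = {103}

ranked = rank_candidates(candidates, scores)
labels = [1 if c in clicked else 0 for c in candidates]
auc = impression_auc(scores, labels)
rr = reciprocal_rank(ranked, clicked)
```

Averaging such per-impression values over all impressions would give a single number to compare models on.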

## Dataset Information

The Ekstra Bladet News Recommendation Dataset (EB-NeRD) was created to support advancements in news recommendation research. It was collected from user behavior logs at Ekstra Bladet. We collected behavior logs from active users during the 6 weeks from April 27 to June 8, 2023. This timeframe was selected to avoid major events, e.g., holidays or elections, that could trigger atypical behavior at Ekstra Bladet. The active users were defined as users who had at least 5 and at most 1,000 news click records in a three-week period from May 18 to June 8, 2023. To protect user privacy, every user was delinked from the production system when securely hashed into an anonymized ID using one-time salt mapping. Alongside the logs, we provide Danish news articles published by Ekstra Bladet. Each article is enriched with textual context features such as title, abstract, body, and categories, among others. Furthermore, we provide features that have been generated by proprietary models, including topics, named entity recognition (NER), and article embeddings. [2]

For more information, see the [dataset page](https://recsys.eb.dk/dataset/).

## References
[1] [RecSys Challenge 2024 Logistics](https://recsys.eb.dk/)

[2] [Ekstra Bladet News Recommendation Dataset](https://recsys.eb.dk/dataset/)

------------------------------------------------------------------------------

### Notebook Organization
### The purpose of this notebook is EDA only.

- Logistics
- EDA 
    - Data Preprocessing
    - Functions
        - Plot Functions
        - Feature Functions
            - Article
            - User
            - Topic
            - Activity
    - Feature Analysis
        - Overall Feature Analysis
        - Article
        - User
        - Session
        - Topic
        - Devices
        - If subscriber
        - Gender
        - Age
        - Postcodes

We need to establish specific metrics and analyze how different features impact those metrics. Our platform generates revenue through both subscriptions and advertisements. User engagement is crucial because the more time users spend reading new articles, the greater our advertisement revenue. With this in focus, let's start with exploratory data analysis (EDA).

------------------------------------------------------------------------------------

# EDA

## Data Preprocessing

Let's import the packages used in this notebook.

In [1]:
# Packages
from datetime import datetime
from plotly.subplots import make_subplots
import numpy as np
import pandas as pd
import plotly
import plotly.express as px
import plotly.graph_objects as go

Load in the three separate data sources of the dataset:

**Articles**: Detailed information about news articles. [*](https://recsys.eb.dk/dataset/#articles)

**Behaviors**: Impression Logs. [*](https://recsys.eb.dk/dataset/#behaviors)

**History**: Click histories of users. [*](https://recsys.eb.dk/dataset/#history)

In [None]:
# Load in various dataframes
# Articles
df_art = pd.read_parquet("Data/Small/articles.parquet")

# Behaviors
df_bev = pd.read_parquet("Data/Small/train/behaviors.parquet")

# History
df_his = pd.read_parquet("Data/Small/train/history.parquet")

In [89]:
# Load in various dataframes
# Articles
df_art = pd.read_parquet("Data/Small/articles.parquet")

# Behaviors
df_bev = pd.read_parquet("Data/Small/validation/behaviors.parquet")

# History
df_his = pd.read_parquet("Data/Small/validation/history.parquet")

What feature can we join the data sources on?

- Articles & Behavior: Article ID

- History & Behavior: User ID

Before we can join, we need to normalize the data type of the `article_id` column in the behaviors table.

In [90]:
# Convert datatype of column first
df_bev['article_id'] = df_bev['article_id'].apply(lambda x: x if isinstance(x, str) else int(x) if not np.isnan(x) else x)

# Join behaviors to articles
df = df_bev.join(df_art.set_index("article_id"), on="article_id")

# Join behaviors to history
df = df.join(df_his.set_index("user_id"), on="user_id")

# Drop the other dataframes from memory
df_bev = []
df_his = []
df_art = []
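As a small illustration of the two join keys, the sketch below merges toy stand-ins for the three sources. The values are made up, and `merge` with `validate` is used here instead of the notebook's `join` only to make the expected key cardinality explicit; it is one possible way to write this, not the notebook's method.

```python
import pandas as pd

# Toy stand-ins for the three sources (all values are hypothetical)
articles = pd.DataFrame({"article_id": [1, 2], "title": ["A", "B"]})
behaviors = pd.DataFrame({
    "impression_id": [10, 11, 12],
    "article_id": [1.0, None, 2.0],  # floats because one ID is missing
    "user_id": [7, 7, 8],
})
history = pd.DataFrame({"user_id": [7, 8], "article_id_fixed": [[1, 2], [2]]})

# Drop rows without an article ID, then cast the key so both sides share a dtype
behaviors = behaviors.dropna(subset=["article_id"]).astype({"article_id": "int64"})

merged = (
    behaviors
    # Articles & Behaviors: many impressions can point at one article
    .merge(articles, on="article_id", how="left", validate="many_to_one")
    # History & Behaviors: one history row per user
    .merge(history, on="user_id", how="left", validate="many_to_one")
)
```

With `validate="many_to_one"`, pandas raises an error if a key is duplicated on the right-hand side, which catches accidental row multiplication early.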

More preprocessing is needed before we can begin further analysis.

In [91]:
def device_(x):
    """ 
    Changes the device input from an int to a str
    Keyword arguments:
        x -- int
    Output:
        str
    """
    if x == 1:
        return 'Desktop'
    elif x == 2:
        return 'Mobile'
    else:
        return 'Tablet'

def gender_(x):
    """ 
    Changes the gender input from a float to a str
    Keyword arguments:
        x -- float
    Output:
        str
    """
    if x == 0.0:
        return 'Male'
    elif x == 1.0:
        return 'Female'
    else:
        return None


def postcodes_(x):
    """ 
    Changes the postcodes input from a float to a str
    Keyword arguments:
        x -- float
    Output:
        str
    """
    if x == 0.0:
        return 'Metropolitan'
    elif x == 1.0:
        return 'Rural District'

    elif x == 2.0:
        return 'Municipality'

    elif x == 3.0:
        return 'Provincial'

    elif x == 4.0:
        return 'Big City'

    else:
        return None

In [92]:
# Preprocessing
df.dropna(subset=['article_id'], inplace=True)

# Change article IDs into int
df['article_id'] = df['article_id'].apply(lambda x: int(x))
df['article_id'] = df['article_id'].astype(np.int64)

# Change device type from int to string
df['device_type'] = df['device_type'].apply(lambda x: device_(x))

# Change genders from float to string
df['gender'] = df['gender'].apply(lambda x: gender_(x))

# Change age to a str because it represents a range (e.g. 40 becomes '40 - 49')
df['age'] = df['age'].astype('Int64')
df['age'] = df['age'].astype(str)
df['age'] = df['age'].apply(
    lambda x: x if x == '<NA>' else x + ' - ' + x[0] + '9')


# Change postcodes from int to str
df['postcode'] = df['postcode'].apply(lambda x: postcodes_(x))
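The same recoding can also be written with lookup dictionaries and `Series.map`, which avoids a Python-level function call per row. This is a sketch on a toy frame; the dictionary names are hypothetical, and the codes are copied from the helper functions above.

```python
import numpy as np
import pandas as pd

# Codes copied from device_, gender_ and postcodes_ above; names are hypothetical
DEVICE_LABELS = {1: "Desktop", 2: "Mobile"}
GENDER_LABELS = {0.0: "Male", 1.0: "Female"}
POSTCODE_LABELS = {
    0.0: "Metropolitan",
    1.0: "Rural District",
    2.0: "Municipality",
    3.0: "Provincial",
    4.0: "Big City",
}

toy = pd.DataFrame({
    "device_type": [1, 2, 3],
    "gender": [0.0, np.nan, 1.0],
    "postcode": [4.0, 0.0, np.nan],
})

# device_ treats every other code as 'Tablet', so fill the unmapped values
toy["device_type"] = toy["device_type"].map(DEVICE_LABELS).fillna("Tablet")
# Unmapped or missing codes become NaN (the helper functions return None instead)
toy["gender"] = toy["gender"].map(GENDER_LABELS)
toy["postcode"] = toy["postcode"].map(POSTCODE_LABELS)
```

One difference to keep in mind: `map` leaves unknown codes as `NaN`, whereas `gender_` and `postcodes_` return `None`; both are treated as missing by pandas.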

The next section covers the helper functions used in this notebook.

-------------------------------------------------------------------------------------

# MODELING

## Try a content-based approach


In [93]:
# Create a new column that combines the text fields we can compare:
## title, body, category, article type, NER clusters, entity groups, topics

## For each user, look up the articles they have read, join that content together, and then compute the cosine similarity against the articles in each impression
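The plan in the comments above can be sketched on toy data as follows. The function name `rank_by_history`, the article texts, and the choice of mean cosine similarity as the score are all assumptions made for illustration; nothing here is taken from the dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def rank_by_history(history_ids, candidate_ids, content):
    """Rank candidates by mean cosine similarity to the articles a user has read."""
    history_docs = [content[a] for a in history_ids if a in content]
    known = [a for a in candidate_ids if a in content]
    if not history_docs or not known:
        return []
    vec = TfidfVectorizer()
    hist = vec.fit_transform(history_docs)  # vocabulary comes from the history only
    cand = vec.transform([content[a] for a in known])
    # One row per candidate, one column per history article
    scores = cosine_similarity(cand, hist).mean(axis=1)
    order = np.argsort(-scores)
    return [(known[i], float(scores[i])) for i in order]


# Hypothetical article texts keyed by article ID
content = {
    1: "football match goal league",
    2: "football transfer striker league",
    3: "election parliament vote",
    4: "league cup football final",
    5: "budget parliament tax",
}
ranking = rank_by_history(history_ids=[1, 2], candidate_ids=[3, 4, 5], content=content)
```

Here article 4 ends up first because it is the only candidate that shares vocabulary with the two history articles; candidates with no overlap get a score of 0.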

In [94]:
df['topics_str'] = df['topics'].apply(lambda x: ' '.join(x))

In [95]:
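# ner_clusters and entity_groups are list columns, so join them into strings before concatenating
df['full_content'] = df['title'] + " " + df['body'] + " " + df["category_str"] + " " + df['article_type'] + " " + df['ner_clusters'].apply(lambda x: ' '.join(x)) + " " + df['entity_groups'].apply(lambda x: ' '.join(x)) + " " + df['topics_str']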

In [2]:
df_art = pd.read_parquet("Data/Small/articles.parquet")

# Behaviors
df_bev = pd.read_parquet("Data/Small/validation/behaviors.parquet")

# History
df_his = pd.read_parquet("Data/Small/validation/history.parquet")

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import linear_kernel

# Pre-compute string fields once
df_art['topics_str'] = df_art['topics'].apply(lambda x: ' '.join(x))
df_art['entity_groups_str'] = df_art['entity_groups'].apply(lambda x: ' '.join(x))
df_art['ner_clusters_str'] = df_art['ner_clusters'].apply(lambda x: ' '.join(x))

# Create a dictionary for quick lookups once
article_content_dict = {}
for _, row in df_art.iterrows():
    full_content = f"{row['title']} {row['body']} {row['category_str']} {row['article_type']} {row['ner_clusters_str']} {row['entity_groups_str']} {row['topics_str']}"
    article_content_dict[row['article_id']] = full_content

# Iterate over user behaviors and generate the corpus
for i in df_bev.index:
    # Get user ID (if needed)
    user_id = df_bev.loc[i, 'user_id']

    # Get previous profile information
    user_article_history = df_bev.loc[i, 'article_id']  # Assuming 'article_id' is the column containing the article IDs

    # Ensure user_article_history is a list or handle NaNs
    if pd.isna(user_article_history):
        user_article_history = []
    elif isinstance(user_article_history, (int, float)):
        user_article_history = [int(user_article_history)]  # Convert to list if it's a single article ID
    elif isinstance(user_article_history, str):
        user_article_history = [user_article_history]  # Convert to list if it's a single article ID in string form

    # Generate the corpus based on user_article_history
    corpus = ' '.join([article_content_dict[x] for x in user_article_history if x in article_content_dict])

    if corpus == '':
        continue
    else:
        # Take the corpus and use tf_idf on it
        # NOTE: corpus is a single joined string here, but fit_transform expects an iterable of documents
        v = TfidfVectorizer()
        tfidf_matrix = v.fit_transform(corpus)
        cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)

        # Now for each impression get the article information
        for imp in df_bev['article_ids_inview'][i]:
            content_imp = article_content_dict[imp]


        



ValueError: Iterable over raw text documents expected, string object received.
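The error above comes from passing one joined string to `fit_transform`, which expects an iterable of documents. A minimal reproduction with two made-up documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two made-up documents standing in for article texts
docs = ["første artikel om mobildata", "anden artikel om ferie"]

# Joining the documents into one string triggers the ValueError
joined = " ".join(docs)
try:
    TfidfVectorizer().fit_transform(joined)
    raised = False
except ValueError:
    raised = True

# Passing the list itself works: one row per document
matrix = TfidfVectorizer().fit_transform(docs)
n_documents = matrix.shape[0]
```

Keeping the history as a list of strings, one per article, also preserves one TF-IDF row per article, which is what a per-article similarity needs.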

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Pre-compute string fields once
df_art['topics_str'] = df_art['topics'].apply(lambda x: ' '.join(x))
df_art['entity_groups_str'] = df_art['entity_groups'].apply(lambda x: ' '.join(x))
df_art['ner_clusters_str'] = df_art['ner_clusters'].apply(lambda x: ' '.join(x))

# Create a dictionary for quick lookups once
article_content_dict = {}
for _, row in df_art.iterrows():
    full_content = f"{row['title']} {row['body']} {row['category_str']} {row['article_type']} {row['ner_clusters_str']} {row['entity_groups_str']} {row['topics_str']}"
    article_content_dict[row['article_id']] = full_content

# Iterate over user behaviors and generate the corpus
for i in df_bev.index:
    # Get user ID (if needed)
    user_id = df_bev.loc[i, 'user_id']

    # Get previous profile information
    user_article_history = df_bev.loc[i, 'article_id']  # Note: 'article_id' in behaviors holds a single ID per row; the history lists are in df_his['article_id_fixed']

    # Ensure user_article_history is a list or handle NaNs
    if pd.isna(user_article_history):
        user_article_history = []
    elif isinstance(user_article_history, (int, float)):
        user_article_history = [int(user_article_history)]  # Convert to list if it's a single article ID
    elif isinstance(user_article_history, str):
        user_article_history = [user_article_history]  # Convert to list if it's a single article ID in string form

    # Generate the corpus based on user_article_history
    corpus = [article_content_dict[x] for x in user_article_history if x in article_content_dict]

    if not corpus:
        continue
    else:
        # Take the corpus and use tf_idf on it
        v = TfidfVectorizer()
        tfidf_matrix = v.fit_transform(corpus)
        cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)

        # Now for each impression get the article information
        for imp in df_bev.loc[i, 'article_ids_inview']:  # Assuming 'article_ids_inview' is the correct column name
            if imp in article_content_dict:
                content_imp = article_content_dict[imp]
                # Do something with content_imp
                # For example: print(content_imp)


In [17]:
df_art[df_art['article_id'] == 9782656]

| | article_id | title | subtitle | last_modified_time | premium | body | published_time | image_ids | article_type | url | ... | subcategory | category_str | total_inviews | total_pageviews | total_read_time | sentiment_score | sentiment_label | topics_str | entity_groups_str | ner_clusters_str |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 19270 | 9782656 | Få masser af data i udlandet til latterligt la... | Bruger du meget mobildata, selv når du er på f... | 2023-06-29 06:49:06 | False | Mange danskere ynder at rejse til varmere himm... | 2023-05-27 18:53:26 | [9782638, 8428800, 9782641] | article_default | https://ekstrabladet.dk/forbrug/Teknologi/faa-... | ... | [2865] | forbrug | 430776.0 | 48740.0 | 2973951.0 | 0.7073 | Neutral | Økonomi Mikro Teknologi | LOC MISC MISC ORG | Danmark dansker danskere EU |


In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Pre-compute string fields once
df_art['topics_str'] = df_art['topics'].apply(lambda x: ' '.join(x))
df_art['entity_groups_str'] = df_art['entity_groups'].apply(lambda x: ' '.join(x))
df_art['ner_clusters_str'] = df_art['ner_clusters'].apply(lambda x: ' '.join(x))

# Create a dictionary for quick lookups once
article_content_dict = {}
for _, row in df_art.iterrows():
    full_content = f"{row['title']} {row['body']} {row['category_str']} {row['article_type']} {row['ner_clusters_str']} {row['entity_groups_str']} {row['topics_str']}"
    article_content_dict[row['article_id']] = full_content

# Iterate over user behaviors and generate the corpus
for i in df_bev.index:
    # Get user ID (if needed)
    user_id = df_bev.loc[i, 'user_id']

    # Get previous profile information
    user_article_history = df_bev.loc[i, 'article_id']  # Assuming 'article_id' is the column containing lists of article IDs

    # Ensure user_article_history is a list or handle NaNs
    if pd.isna(user_article_history):
        user_article_history = []
    elif isinstance(user_article_history, (int, float)):
        user_article_history = [int(user_article_history)]  # Convert to list if it's a single article ID
    elif isinstance(user_article_history, str):
        user_article_history = [user_article_history]  # Convert to list if it's a single article ID in string form
    elif isinstance(user_article_history, list):
        user_article_history = [int(x) if isinstance(x, (int, float)) else x for x in user_article_history]

    # Generate the corpus based on user_article_history
    corpus = [article_content_dict[x] for x in user_article_history if x in article_content_dict]

    # Skip processing if the corpus is empty
    if not corpus:
        continue

    # Take the corpus and use tf-idf on it
    v = TfidfVectorizer()
    tfidf_matrix = v.fit_transform(corpus)
    cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)

    # Now for each impression get the article information
    for imp in df_bev.loc[i, 'article_ids_inview']:  # Assuming 'article_ids_inview' is the correct column name
        if imp in article_content_dict:
            content_imp = article_content_dict[imp]
            # Do something with content_imp
            # For example: print(content_imp)


In [22]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

# Pre-compute string fields once
df_art['topics_str'] = df_art['topics'].apply(lambda x: ' '.join(x))
df_art['entity_groups_str'] = df_art['entity_groups'].apply(lambda x: ' '.join(x))
df_art['ner_clusters_str'] = df_art['ner_clusters'].apply(lambda x: ' '.join(x))

# Create a dictionary for quick lookups once
article_content_dict = {}
for _, row in df_art.iterrows():
    full_content = f"{row['title']} {row['body']} {row['category_str']} {row['article_type']} {row['ner_clusters_str']} {row['entity_groups_str']} {row['topics_str']}"
    article_content_dict[row['article_id']] = full_content

# Iterate over user behaviors and generate the corpus
for i in df_bev.index:
    # Get previous profile information
    user_article_history = df_bev.loc[i, 'article_id']  # Assuming 'article_id' is the column containing lists of article IDs

    # Ensure user_article_history is a list or handle NaNs
    if pd.isna(user_article_history):
        user_article_history = []
    elif isinstance(user_article_history, (int, float)):
        user_article_history = [int(user_article_history)]  # Convert to list if it's a single article ID
    elif isinstance(user_article_history, str):
        user_article_history = [user_article_history]  # Convert to list if it's a single article ID in string form
    elif isinstance(user_article_history, list):
        user_article_history = [int(x) if isinstance(x, (int, float)) else x for x in user_article_history]

    # Generate the corpus based on user_article_history
    corpus = [article_content_dict[x] for x in user_article_history if x in article_content_dict]

    # Skip processing if the corpus is empty
    if not corpus:
        continue

    # Take the corpus and use TF-IDF on it
    v = TfidfVectorizer()
    tfidf_matrix = v.fit_transform(corpus)

    # Now for each impression get the article information
    top_impression = []
    for imp in df_bev.loc[i, 'article_ids_inview']:  # Assuming 'article_ids_inview' is the correct column name
        if imp in article_content_dict:
            content_imp = article_content_dict[imp]

            # Compute TF-IDF for the content_imp
            tfidf_imp = v.transform([content_imp])

            # Calculate cosine similarity between the corpus and the content_imp
            similarity = cosine_similarity(tfidf_matrix, tfidf_imp)

            # Do something with the similarity scores
            # For example, print the top 5 most similar articles
            similar_indices = similarity.flatten().argsort()[:-6:-1]  # Get indices of top 5 similar articles
            similar_articles = [(user_article_history[idx], similarity[idx][0]) for idx in similar_indices]
            print(f"Top 5 similar articles to '{imp}': {similar_articles}")


Top 5 similar articles to '9784804': [(9782884, 0.6693779632649604)]
Top 5 similar articles to '9784803': [(9782884, 0.5643824705640781)]
Top 5 similar articles to '9782884': [(9782884, 0.9999999999999996)]
Top 5 similar articles to '9784702': [(9782884, 0.5565559293175961)]
Top 5 similar articles to '9784805': [(9782884, 0.5888146674852569)]
Top 5 similar articles to '9783042': [(9788362, 0.6449453362709829)]
Top 5 similar articles to '9780702': [(9788362, 0.5749188154595823)]
Top 5 similar articles to '9787499': [(9788362, 0.5553857234758968)]
Top 5 similar articles to '9788310': [(9788362, 0.5680521230695487)]
Top 5 similar articles to '9788462': [(9788362, 0.5588502365854231)]
Top 5 similar articles to '9788188': [(9788362, 0.4996308989742155)]
Top 5 similar articles to '9788497': [(9788362, 0.4836374787149648)]
Top 5 similar articles to '9788666': [(9788362, 0.46833931142208707)]
Top 5 similar articles to '9788310': [(9788362, 0.5680521230695487)]
Top 5 similar articles to '978846

KeyboardInterrupt: 

In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd



# Pre-compute string fields once
df_art['topics_str'] = df_art['topics'].apply(lambda x: ' '.join(x))
df_art['entity_groups_str'] = df_art['entity_groups'].apply(lambda x: ' '.join(x))
df_art['ner_clusters_str'] = df_art['ner_clusters'].apply(lambda x: ' '.join(x))

# Create a dictionary for quick lookups once
article_content_dict = {}
for _, row in df_art.iterrows():
    full_content = f"{row['title']} {row['body']} {row['category_str']} {row['article_type']} {row['ner_clusters_str']} {row['entity_groups_str']} {row['topics_str']}"
    article_content_dict[row['article_id']] = full_content

predicted_impression = []
# Iterate over user behaviors and generate the corpus
for i in df_bev.index:
    # Get previous profile information
    user_article_history = df_bev.loc[i, 'article_id']  # Assuming 'article_id' is the column containing lists of article IDs

    # Ensure user_article_history is a list or handle NaNs
    if pd.isna(user_article_history):
        user_article_history = []
    elif isinstance(user_article_history, (int, float)):
        user_article_history = [int(user_article_history)]  # Convert to list if it's a single article ID
    elif isinstance(user_article_history, str):
        user_article_history = [user_article_history]  # Convert to list if it's a single article ID in string form
    elif isinstance(user_article_history, list):
        user_article_history = [int(x) if isinstance(x, (int, float)) else x for x in user_article_history]

    # Generate the corpus based on user_article_history
    corpus = [article_content_dict[x] for x in user_article_history if x in article_content_dict]

    # Skip processing if the corpus is empty
    if not corpus:
        continue

    # Take the corpus and use TF-IDF on it
    v = TfidfVectorizer()
    tfidf_matrix = v.fit_transform(corpus)

    highest_similarity = 0
    best_imp = None

    # Now for each impression get the article information
    for imp in df_bev.loc[i, 'article_ids_inview']:  # Assuming 'article_ids_inview' is the correct column name
        if imp in article_content_dict:
            content_imp = article_content_dict[imp]

            # Compute TF-IDF for the content_imp
            tfidf_imp = v.transform([content_imp])

            # Calculate cosine similarity between the corpus and the content_imp
            similarity = cosine_similarity(tfidf_matrix, tfidf_imp).mean()

            # Update highest similarity and best impression
            if similarity > highest_similarity:
                highest_similarity = similarity
                best_imp = imp
    
    predicted_impression.append(best_imp)
    #if best_imp is not None:
        
       # print(f"Best impression for user {df_bev.loc[i, 'user_id']}: {best_imp} with similarity {highest_similarity}")


KeyboardInterrupt: 
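The loop above was interrupted, most likely because it fits a new `TfidfVectorizer` for every impression row. One possible speed-up, sketched here on toy data, is to vectorize all articles once and reuse the rows. The helper names `build_index` and `best_candidate` are hypothetical, and fitting on the full article set changes the IDF weights compared with fitting per user, so scores will not match the loop above exactly.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def build_index(content):
    """Vectorize every article once; TfidfVectorizer L2-normalizes rows by default."""
    ids = list(content)
    matrix = TfidfVectorizer().fit_transform([content[a] for a in ids])
    row_of = {a: r for r, a in enumerate(ids)}
    return matrix, row_of


def best_candidate(history_ids, candidate_ids, matrix, row_of):
    """Return the candidate with the highest mean similarity to the history."""
    hist_rows = [row_of[a] for a in history_ids if a in row_of]
    cands = [a for a in candidate_ids if a in row_of]
    if not hist_rows or not cands:
        return None
    cand_rows = [row_of[a] for a in cands]
    # Dot products of unit-length rows are cosine similarities
    sims = (matrix[cand_rows] @ matrix[hist_rows].T).toarray()
    return cands[int(np.argmax(sims.mean(axis=1)))]


# Hypothetical article texts keyed by article ID
content = {
    1: "football match goal league",
    2: "football transfer striker league",
    3: "election parliament vote",
    4: "league cup football final",
    5: "budget parliament tax",
}
matrix, row_of = build_index(content)
pick = best_candidate([1, 2], [3, 4, 5], matrix, row_of)
```

With the index built once, each impression only costs a sparse row lookup and a small matrix product instead of a full fit.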

In [21]:
article_content_dict[9782656]

'Få masser af data i udlandet til latterligt lav pris Mange danskere ynder at rejse til varmere himmelstrøg i løbet af sommerferien, og for de flestes vedkommende går turen til et EU-land.\nEr du en af mange, der skal rejse til inden for EU’s grænser i løbet af sommeren, er der rigtig gode muligheder for at spare penge på mobilregningen.\nSe også:\nSammenlign priser på mobilabonnementer\n59 kroner per måned henover sommeren\nLavprisselskabet duka frister lige nu med et tilbud\n, der er svært at modstå. For bare 59 kroner per måned frem til den 30. september 2023 kan du få fri tale i og til EU, 35 GB data til brug i Danmark og yderligere 35 GB til brug i EU. Derefter fortsætter abonnementet til normalprisen på 119 kroner per måned.\n35 GB data er mere end rigeligt til at dække en gennemsnitglig danskernes dataforbrug, da\nen dansker i gennemsnit bruger 20,0 GB per måned\n. Og med 35 GB ekstra til brug i EU-lande er der altså ingen grund til at holde igen med mobilforbruget, bare fordi d

In [13]:
cosine_similarities

array([[1.]])

In [6]:
df_bev['article_ids_inview'][0]

array([9783865, 9784591, 9784679, 9784696, 9784710])

In [9]:
article_content_dict

NameError: name 'article_content_dict' is not defined

In [115]:
df_art = pd.read_parquet("Data/Small/articles.parquet")

# Behaviors
df_bev = pd.read_parquet("Data/Small/validation/behaviors.parquet")

# History
df_his = pd.read_parquet("Data/Small/validation/history.parquet")

In [108]:
df_bev.columns

Index(['impression_id', 'article_id', 'impression_time', 'read_time',
       'scroll_percentage', 'device_type', 'article_ids_inview',
       'article_ids_clicked', 'user_id', 'is_sso_user', 'gender', 'postcode',
       'age', 'is_subscriber', 'session_id', 'next_read_time',
       'next_scroll_percentage'],
      dtype='object')

In [109]:
df_his.columns

Index(['user_id', 'impression_time_fixed', 'scroll_percentage_fixed',
       'article_id_fixed', 'read_time_fixed'],
      dtype='object')

In [154]:
df_art.columns

Index(['article_id', 'title', 'subtitle', 'last_modified_time', 'premium',
       'body', 'published_time', 'image_ids', 'article_type', 'url',
       'ner_clusters', 'entity_groups', 'topics', 'category', 'subcategory',
       'category_str', 'total_inviews', 'total_pageviews', 'total_read_time',
       'sentiment_score', 'sentiment_label', 'topics_str', 'entity_groups_str',
       'ner_clusters_str'],
      dtype='object')