# 2024 RecSys Challenge

## About

This year's challenge focuses on online news recommendation, addressing both the technical and normative challenges inherent in designing effective and responsible recommender systems for news publishing. The challenge will delve into the unique aspects of news recommendation, including modeling user preferences based on implicit behavior, accounting for the influence of the news agenda on user interests, and managing the rapid decay of news items. Furthermore, our challenge embraces the normative complexities, involving investigating the effects of recommender systems on the news flow and whether they resonate with editorial values. [1]

## Challenge Task

The Ekstra Bladet RecSys Challenge aims to predict which article a user will click on from a list of articles that were seen during a specific impression. Utilizing the user's click history, session details (like time and device used), and personal metadata (including gender and age), along with a list of candidate news articles listed in an impression log, the challenge's objective is to rank the candidate articles based on the user's personal preferences. This involves developing models that encapsulate both the users and the articles through their content and the users' interests. The models are to estimate the likelihood of a user clicking on each article by evaluating the compatibility between the article's content and the user's preferences. The articles are ranked based on these likelihood scores, and the precision of these rankings is measured against the actual selections made by users. [1]
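
As a minimal sketch of the ranking step described above, the snippet below sorts the candidate articles of one impression by a predicted click score and computes the reciprocal rank of the clicked article. The scores, the clicked set, and the helper names `rank_candidates` and `reciprocal_rank` are hypothetical, and reciprocal rank is only one common ranking measure, not necessarily the challenge's official metric.

```python
def rank_candidates(scores):
    """Return article IDs sorted from highest to lowest predicted click score."""
    return sorted(scores, key=scores.get, reverse=True)


def reciprocal_rank(ranking, clicked):
    """Return 1 / position of the first clicked article, or 0.0 if none was clicked."""
    for position, article_id in enumerate(ranking, start=1):
        if article_id in clicked:
            return 1.0 / position
    return 0.0


# Hypothetical model scores for the candidates shown in one impression
scores = {9749275: 0.12, 9749392: 0.55, 9749948: 0.31}
# Hypothetical set of articles the user actually clicked
clicked = {9749948}

ranking = rank_candidates(scores)
rr = reciprocal_rank(ranking, clicked)
print(ranking, rr)
```

Averaging this value over many impressions gives a mean reciprocal rank, which rewards models that place the clicked article near the top of the list.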

## Dataset Information

The Ekstra Bladet News Recommendation Dataset (EB-NeRD) was created to support advancements in news recommendation research. It was collected from user behavior logs at Ekstra Bladet. We collected behavior logs from active users during the 6 weeks from April 27 to June 8, 2023. This timeframe was selected to avoid major events, e.g., holidays or elections, that could trigger atypical behavior at Ekstra Bladet. The active users were defined as users who had at least 5 and at most 1,000 news click records in a three-week period from May 18 to June 8, 2023. To protect user privacy, every user was delinked from the production system when securely hashed into an anonymized ID using one-time salt mapping. Alongside, we provide Danish news articles published by Ekstra Bladet. Each article is enriched with textual context features such as title, abstract, body, categories, among others. Furthermore, we provide features that have been generated by proprietary models, including topics, named entity recognition (NER), and article embeddings. [2]

For more information, see the [dataset page](https://recsys.eb.dk/dataset/).

## References
[1] [RecSys Challenge 2024 Logistics](https://recsys.eb.dk/)

[2] [Ekstra Bladet News Recommendation Dataset](https://recsys.eb.dk/dataset/)

------------------------------------------------------------------------------

### Notebook Organization
### The purpose of this notebook is EDA only.

- Logistics
- EDA 
    - Data Preprocessing
    - Functions
        - Plot Functions
        - Feature Functions
            - Article
            - User
            - Topic
            - Activity
    - Feature Analysis
        - Overall Feature Analysis
        - Article
        - User
        - Session
        - Topic
        - Devices
        - If subscriber
        - Gender
        - Age
        - Postcodes

We need to establish specific metrics and analyze how different features impact those metrics. Our platform generates revenue through both subscriptions and advertisements. User engagement is crucial because the more time users spend reading news articles, the greater our advertisement revenue. With this in focus, let's start with exploratory data analysis (EDA).
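
One possible way to frame such metrics is to aggregate engagement signals per feature value. The sketch below assumes read time and scroll depth are reasonable engagement proxies, which is an assumption rather than an established choice. The toy rows are hypothetical; the column names follow the behaviors data.

```python
import pandas as pd

# Hypothetical impressions; column names follow the behaviors data
toy = pd.DataFrame({
    "device_type": ["Desktop", "Mobile", "Mobile", "Tablet"],
    "read_time": [30.0, 10.0, 50.0, 20.0],
    "scroll_percentage": [100.0, 40.0, 80.0, None],
})

# Average read time and scroll depth per device, plus the number of impressions
engagement = toy.groupby("device_type").agg(
    mean_read_time=("read_time", "mean"),
    mean_scroll=("scroll_percentage", "mean"),
    impressions=("read_time", "size"),
)
print(engagement)
```

The same pattern applies to any other grouping column, such as gender, age, or subscriber status.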

------------------------------------------------------------------------------------

# EDA

## Data Preprocessing

Let's import the packages used in this notebook.

In [88]:
# Packages
from datetime import datetime
from plotly.subplots import make_subplots
import numpy as np
import pandas as pd
import plotly
import plotly.express as px
import plotly.graph_objects as go

Load in the three separate data sources of the dataset:

**Articles**: Detailed information on news articles. [*](https://recsys.eb.dk/dataset/#articles)

**Behaviors**: Impression Logs. [*](https://recsys.eb.dk/dataset/#behaviors)

**History**: Click histories of users. [*](https://recsys.eb.dk/dataset/#history)

In [None]:
# Load in various dataframes
# Articles
df_art = pd.read_parquet("Data/Small/articles.parquet")

# Behaviors
df_bev = pd.read_parquet("Data/Small/train/behaviors.parquet")

# History
df_his = pd.read_parquet("Data/Small/train/history.parquet")

In [89]:
# Load in various dataframes
# Articles
df_art = pd.read_parquet("Data/Small/articles.parquet")

# Behaviors
df_bev = pd.read_parquet("Data/Small/validation/behaviors.parquet")

# History
df_his = pd.read_parquet("Data/Small/validation/history.parquet")

What feature can we join the data sources on?

- Articles & Behavior: Article ID

- History & Behavior: User ID

Before we can join, we need to convert the behaviors `article_id` column to a consistent type.

In [90]:
# Convert article_id to int where present (missing values stay NaN)
df_bev['article_id'] = df_bev['article_id'].apply(lambda x: x if isinstance(x, str) else int(x) if not np.isnan(x) else x)

# Join behaviors to articles
df = df_bev.join(df_art.set_index("article_id"), on="article_id")

# Join behaviors to history
df = df.join(df_his.set_index("user_id"), on="user_id")

# Drop all other dataframes from memory
df_bev = []
df_his = []
df_art = []
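
The join above attaches metadata only for the article in `article_id`. For the ranking task, each candidate in `article_ids_inview` also needs its own row. The following is a minimal sketch of one way to do that with `explode` and `merge`; the toy tables and the `clicked` label column are hypothetical.

```python
import pandas as pd

# Hypothetical miniature versions of the behaviors and articles tables
toy_bev = pd.DataFrame({
    "impression_id": [1, 2],
    "user_id": [10, 11],
    "article_ids_inview": [[100, 101], [101, 102]],
    "article_ids_clicked": [[101], [102]],
})
toy_art = pd.DataFrame({
    "article_id": [100, 101, 102],
    "title": ["A", "B", "C"],
})

# One row per (impression, candidate article)
candidates = toy_bev.explode("article_ids_inview").rename(
    columns={"article_ids_inview": "article_id"}
)
candidates["article_id"] = candidates["article_id"].astype("int64")

# Label each candidate: 1 if it was clicked in that impression, else 0
candidates["clicked"] = [
    int(a in c)
    for a, c in zip(candidates["article_id"], candidates["article_ids_clicked"])
]

# Attach article metadata; a left merge keeps candidates without metadata
candidates = candidates.merge(toy_art, on="article_id", how="left")
print(candidates[["impression_id", "article_id", "clicked", "title"]])
```

A left merge is used so that candidates missing from the articles table are kept with empty metadata instead of being dropped silently.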

More preprocessing needed before we can begin further analysis.

In [91]:
def device_(x):
    """ 
    Changes the device input from an int to a str
    Keyword arguments:
        x -- int
    Output:
        str
    """
    if x == 1:
        return 'Desktop'
    elif x == 2:
        return 'Mobile'
    else:
        return 'Tablet'

def gender_(x):
    """ 
    Changes the gender input from a float to a str
    Keyword arguments:
        x -- float
    Output:
        str
    """
    if x == 0.0:
        return 'Male'
    elif x == 1.0:
        return 'Female'
    else:
        return None


def postcodes_(x):
    """ 
    Changes the postcodes input from a float to a str
    Keyword arguments:
        x -- float
    Output:
        str
    """
    if x == 0.0:
        return 'Metropolitan'
    elif x == 1.0:
        return 'Rural District'
    elif x == 2.0:
        return 'Municipality'
    elif x == 3.0:
        return 'Provincial'
    elif x == 4.0:
        return 'Big City'
    else:
        return None

In [92]:
# Preprocessing
df.dropna(subset=['article_id'], inplace=True)

# Change article IDs into int
df['article_id'] = df['article_id'].apply(lambda x: int(x))
df['article_id'] = df['article_id'].astype(np.int64)

# Change device type from int to str
df['device_type'] = df['device_type'].apply(lambda x: device_(x))

# Change genders from float to string
df['gender'] = df['gender'].apply(lambda x: gender_(x))

# Change age to a str range, e.g. 40 becomes '40 - 49'
df['age'] = df['age'].astype('Int64')
df['age'] = df['age'].astype(str)
df['age'] = df['age'].apply(
    lambda x: x if x == '<NA>' else x + ' - ' + x[0] + '9')


# Change postcodes from int to str
df['postcode'] = df['postcode'].apply(lambda x: postcodes_(x))
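
The same recoding can also be expressed with lookup tables and `Series.map`, which avoids calling a Python function per row. This is a sketch on hypothetical toy rows, not a replacement for the cell above. The dictionary names are made up for illustration, and the labels simply mirror `device_`, `gender_`, and `postcodes_`.

```python
import pandas as pd

# Code-to-label tables mirroring device_, gender_, and postcodes_
DEVICE_LABELS = {1: "Desktop", 2: "Mobile"}
GENDER_LABELS = {0.0: "Male", 1.0: "Female"}
POSTCODE_LABELS = {0.0: "Metropolitan", 1.0: "Rural District", 2.0: "Municipality",
                   3.0: "Provincial", 4.0: "Big City"}

# Hypothetical rows, including missing values
toy = pd.DataFrame({
    "device_type": [1, 2, 3],
    "gender": [0.0, None, 1.0],
    "postcode": [4.0, 2.0, None],
    "age": [40, None, 60],
})

# device_ labels every other code 'Tablet', so fill unmapped codes the same way
toy["device_type"] = toy["device_type"].map(DEVICE_LABELS).fillna("Tablet")
# Unmapped or missing codes become NaN (the functions above return None)
toy["gender"] = toy["gender"].map(GENDER_LABELS)
toy["postcode"] = toy["postcode"].map(POSTCODE_LABELS)

# Age is the lower bound of a 10-year bucket, e.g. 40 -> '40 - 49'
age = toy["age"].astype("Int64")
toy["age"] = (age.astype(str) + " - " + (age + 9).astype(str)).where(age.notna(), "<NA>")
print(toy)
```

Keeping the labels in dictionaries also makes it easy to review the code-to-label assignments in one place.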

The next section covers the helper functions used in this notebook.

-------------------------------------------------------------------------------------

# MODELING

## Try a content-based approach


In [93]:
# Create a new column with the content we can compare:
## Title, Body, category, article type, NER, entities, topics

## For each user ID, find the articles the user has looked at, join their content together, and then compute the cosine similarity against the articles in the impression
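
The plan in the comments above can be sketched on toy data as follows. The profile text, candidate texts, and article IDs are hypothetical. The snippet assumes scikit-learn is installed and uses its `TfidfVectorizer` and `cosine_similarity`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical texts: one user profile (history joined together) and the
# candidate articles shown in a single impression
user_profile = "football match goal league football transfer"
candidates = {
    101: "league football goal scored in late match",
    102: "government budget vote in parliament",
    103: "new smartphone launch and review",
}

# Fit TF-IDF on the profile and the candidates so they share one vocabulary
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform([user_profile] + list(candidates.values()))

# Row 0 is the profile; the remaining rows are the candidates
similarities = cosine_similarity(matrix[0], matrix[1:]).ravel()

# Rank candidate article IDs from most to least similar to the profile
ranking = sorted(zip(candidates, similarities), key=lambda pair: pair[1], reverse=True)
print(ranking)
```

On the real data it is probably better to fit the vectorizer once on the full article corpus and reuse it, so that term weights do not change from one impression to the next.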

In [94]:
df['topics_str'] = df['topics'].apply(lambda x: ' '.join(x))

In [95]:
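# ner_clusters and entity_groups hold lists of strings, so join each list before concatenating
df['full_content'] = (
    df['title'] + " " + df['body'] + " " + df['category_str'] + " " + df['article_type'] + " "
    + df['ner_clusters'].apply(' '.join) + " "
    + df['entity_groups'].apply(' '.join) + " "
    + df['topics_str']
)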

In [96]:
# Ensure the column is a Series of strings
df['full_content'] = df['full_content'].astype(str)

# Convert the Series to a list of strings
content_list = df['full_content'].tolist()

In [97]:
df.columns

Index(['impression_id', 'article_id', 'impression_time', 'read_time',
       'scroll_percentage', 'device_type', 'article_ids_inview',
       'article_ids_clicked', 'user_id', 'is_sso_user', 'gender', 'postcode',
       'age', 'is_subscriber', 'session_id', 'next_read_time',
       'next_scroll_percentage', 'title', 'subtitle', 'last_modified_time',
       'premium', 'body', 'published_time', 'image_ids', 'article_type', 'url',
       'ner_clusters', 'entity_groups', 'topics', 'category', 'subcategory',
       'category_str', 'total_inviews', 'total_pageviews', 'total_read_time',
       'sentiment_score', 'sentiment_label', 'impression_time_fixed',
       'scroll_percentage_fixed', 'article_id_fixed', 'read_time_fixed',
       'topics_str', 'full_content'],
      dtype='object')

In [None]:
df_art[df_art["article_id"] == 9749275]

In [138]:
df_art[df_art["article_id"] == 9749275]['ner_clusters'] 

16071    [Ekstra Bladet, Emilie Meng, Emilie Meng, Kari...
Name: ner_clusters, dtype: object

In [139]:
df_art['topics_str'] = df_art['topics'].apply(lambda x: ' '.join(x))
df_art['entity_groups_str'] = df_art['entity_groups'].apply(lambda x: ' '.join(x))
df_art['ner_clusters_str'] = df_art['ner_clusters'].apply(lambda x: ' '.join(x))

# Select the article once instead of repeating the filter for every column
art = df_art[df_art["article_id"] == 9749275]
full_content = (
    art['title'] + " " + art['body'] + " " + art['category_str'] + " " + art['article_type'] + " "
    + art['ner_clusters_str'] + " " + art['entity_groups_str'] + " " + art['topics_str']
)

In [144]:
full_content.values

array(['Fængslingsfrist: Dommer skal ikke vurdere Emilie-sag Anklagemyndigheden vil endnu ikke bede en dommer vurdere, om mistankegrundlaget mod en 32-årig mand er velunderbygget nok til at fængsle ham i sagen om drab på Emilie Meng. Det oplyser specialanklager Susanne Bluhm til Ritzau.\nI midten af næste uge udløber fængslingsfristen for den 32-årige. Det betyder, at retten på ny skal tage stilling til om, han fortsat skal sidde bag tremmer som arrestant.\nAnklageren uddyber over for Ekstra Bladet, at hun har en forventning om, at retten til det retsmøde alene skal vurdere grundlaget i den sag, han allerede er fængslet på.\nOm hvornår retten så eventuelt skal tage stilling til, om fængslingsgrundlaget skal udvides, har hun over for Ekstra Bladet ingen kommentarer til på nuværende tidspunkt.\nVen til sigtet 32-årig: Har jeg siddet over for en morder?\nDen 32-årige blev 17. april varetægtsfængslet i en sag om bortførelse og voldtægt af en 13-årig pige. Han er fængslet frem til 11. maj.\

In [126]:
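# full_content keeps the article's original index label (16071), so the
# label-based lookup full_content[0] raises KeyError: 0; use positional access
full_content.iloc[0]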

In [117]:
# Get user ID
## Build a profile from the articles the user has previously looked at
### Compare cosine similarity to the articles in the user's view


# for loop to iterate through each index
for i in df.index:
    # get user id
    user_id = df['user_id'][i]

    # previous profile information
    user_article_history = df['article_id_fixed'][i]

    # get the contents of each article in that list
    corpus = ' '
    for x in user_article_history:
        # go to that row in articles (a full scan of df_art per lookup,
        # which is likely why this cell had to be interrupted)
        article_row = df_art[df_art['article_id'] == x]


KeyboardInterrupt: 
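
A faster alternative to the nested loop is to build an article ID to text lookup once and reuse it for every user. The sketch below uses hypothetical toy tables and a made-up helper `build_profile`. It uses only the title as article text, whereas the real profile would presumably use the combined content column.

```python
import pandas as pd

# Hypothetical miniature tables; column names follow the notebook
toy_art = pd.DataFrame({
    "article_id": [100, 101, 102],
    "title": ["Title A", "Title B", "Title C"],
})
toy_his = pd.DataFrame({
    "user_id": [10, 11],
    "article_id_fixed": [[100, 102, 100], [101, 999]],
})

# Build the ID -> text lookup once instead of filtering the articles per ID
text_by_id = dict(zip(toy_art["article_id"], toy_art["title"]))


def build_profile(article_ids):
    """Join the text of every known article in a user's history."""
    return " ".join(text_by_id[a] for a in article_ids if a in text_by_id)


# One profile string per user, computed from the history table
toy_his["profile"] = toy_his["article_id_fixed"].apply(build_profile)
print(toy_his[["user_id", "profile"]])
```

Working from the history table also means each user's profile is built once, not once per impression.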

In [114]:
article_history

array([9749275, 9749392, 9749392, 9749948, 9749965, 9749948, 9749076,
       9749857, 9749947, 9749751, 9750076, 9727216, 9749756, 9750389,
       9750002, 9750397, 9749240, 9749668, 9740330, 9750533, 9749637,
       9750478, 9747411, 9747411, 9750793, 9661024, 9750726, 9748321,
       9750078, 9747859, 9748041, 9749034, 9751385, 9751139, 9751411,
       9751367, 9751367, 9748750, 9749184, 9751508, 9749143, 9751517,
       9751517, 9751030, 9751349, 9751349, 9751786, 9752124, 9751252,
       9631275, 9752155, 9752299, 9750873, 9750227, 9750873, 9752146,
       9750873, 9751895, 9751786, 9750873, 9750873, 9752299, 9752288,
       9752312, 9753168, 9753097, 9753295, 9752998, 9752882, 9752882,
       9753503, 9752882, 9753479, 9752882, 9753415, 9753442, 9752463,
       9753473, 9753543, 9754271, 9747961, 9754294, 9754159, 9754814,
       9754798, 9754798, 9754929, 9752552, 9754365, 9754350, 9754133,
       9755224, 9753995, 9755119, 9755298, 9755328, 9755327, 9755285,
       9753949, 9753

In [None]:
# TFIDF
from sklearn.feature_extraction.text import TfidfVectorizer
v = TfidfVectorizer()
x = v.fit_transform(content_list)

print(x)

In [115]:
df_art = pd.read_parquet("Data/Small/articles.parquet")

# Behaviors
df_bev = pd.read_parquet("Data/Small/validation/behaviors.parquet")

# History
df_his = pd.read_parquet("Data/Small/validation/history.parquet")

In [108]:
df_bev.columns

Index(['impression_id', 'article_id', 'impression_time', 'read_time',
       'scroll_percentage', 'device_type', 'article_ids_inview',
       'article_ids_clicked', 'user_id', 'is_sso_user', 'gender', 'postcode',
       'age', 'is_subscriber', 'session_id', 'next_read_time',
       'next_scroll_percentage'],
      dtype='object')

In [109]:
df_his.columns

Index(['user_id', 'impression_time_fixed', 'scroll_percentage_fixed',
       'article_id_fixed', 'read_time_fixed'],
      dtype='object')