## Analyzing and understanding the data

To design a sensible recommendation system, and in particular our proposed image extension, we first analyze the article data, the user behavior, and their underlying distributions.

### Load the Articles, History and Behavior

For a more in-depth explanation of each column, please refer to https://recsys.eb.dk/dataset/

In [1]:
import pandas as pd

In [2]:
original_df_path ='data/ebnerd_small/articles.parquet'
original_df = pd.read_parquet(original_df_path)
original_df.columns

Index(['article_id', 'title', 'subtitle', 'last_modified_time', 'premium',
       'body', 'published_time', 'image_ids', 'article_type', 'url',
       'ner_clusters', 'entity_groups', 'topics', 'category', 'subcategory',
       'category_str', 'total_inviews', 'total_pageviews', 'total_read_time',
       'sentiment_score', 'sentiment_label'],
      dtype='object')

In [3]:
behaviours_path ='data/ebnerd_small/train/behaviors.parquet'
behaviours = pd.read_parquet(behaviours_path)
behaviours.columns

Index(['impression_id', 'article_id', 'impression_time', 'read_time',
       'scroll_percentage', 'device_type', 'article_ids_inview',
       'article_ids_clicked', 'user_id', 'is_sso_user', 'gender', 'postcode',
       'age', 'is_subscriber', 'session_id', 'next_read_time',
       'next_scroll_percentage'],
      dtype='object')

In [4]:
history_path ='data/ebnerd_small/train/history.parquet'
history = pd.read_parquet(history_path)
history.columns

Index(['user_id', 'impression_time_fixed', 'scroll_percentage_fixed',
       'article_id_fixed', 'read_time_fixed'],
      dtype='object')

## Distribution of articles with images
First, we aim to understand how articles are displayed to users on the news website. Some articles have one or more images, while others have no image associated with them.
Browsing the website shows that users are presented with some articles as an image plus a title and with others as a title only. Looking at the distribution of articles with images therefore gives a better idea of the user experience.

In [5]:
all_articles = original_df['article_id'].nunique()

articles_with_images = original_df[original_df['image_ids'].notna()]['article_id'].nunique()

print(f'Total Articles count : {all_articles}')
print(f'Number of articles with images : {articles_with_images}')
print(f'Percentage of articles that have images : {(articles_with_images / all_articles) * 100}')

Total Articles count : 20738
Number of articles with images : 18860
Percentage of articles that have images : 90.94416047834892


We can see that a large share of articles **(90.94%)** have images associated with them. The next aim is therefore to understand user behavior in the context of this distribution.
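One caveat on the count above: `notna()` treats an empty list in `image_ids` as "has an image". Whether the dataset actually contains such empty lists is not established here, so the following is only a minimal sketch on hypothetical toy data. `has_image` and `toy_articles` are illustrative names, not part of the notebook.

```python
import pandas as pd

def has_image(image_ids) -> bool:
    """Return True only when image_ids holds at least one image id."""
    if image_ids is None:
        return False
    # A scalar NaN (a float) also marks a missing value.
    if isinstance(image_ids, float):
        return False
    return len(image_ids) > 0

# Toy stand-in for the articles table (hypothetical values).
toy_articles = pd.DataFrame({
    "article_id": [1, 2, 3, 4],
    "image_ids": [[101, 102], None, [], [103]],
})

with_image = toy_articles["image_ids"].apply(has_image)
share_with_image = with_image.mean() * 100

print(f"notna() count            : {toy_articles['image_ids'].notna().sum()}")
print(f"At least one image count : {with_image.sum()} ({share_with_image:.2f}%)")
```

If the two counts agree on the real `original_df`, the simpler `notna()` check is sufficient.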

## User Behavior
The Behavior dataset contains information relevant to the user's decision-making process. To better understand the click probability, we calculate the following:

**Total clicks on Articles with Images:**

$\text{Total clicks on Image Articles} = \sum_{\text{all users}} (\text{Number of clicks on image articles by the user})$

**Total Articles with Images presented to the user:**

$\text{Total Image Articles} = \sum_{\text{all users}} (\text{Number of times image articles were shown to the user})$

**Total clicks on Articles without Images:**

$\text{Total clicks on no Image Articles} = \sum_{\text{all users}} (\text{Number of clicks on articles without image by the user})$

**Total Articles without Images presented to the user:**

$\text{Total no Image Articles} = \sum_{\text{all users}} (\text{Number of times articles without images were shown to the user})$


Using these totals, we can compute:

**Probability of clicking on an article with an Image**

$P(\text{click} | \text{image}) = \frac{\text{Total clicks on Image Articles}}{\text{Total Image Articles}}$

**Probability of clicking on an article without an Image**

$P(\text{click} | \text{no image}) = \frac{\text{Total clicks on no Image Articles}}{\text{Total no Image Articles}}$
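As a small worked example of the two ratios, the sketch below computes them on hypothetical toy data that reuses the column names of the real tables. It uses `explode` instead of a row loop. `toy_articles`, `toy_behaviours` and `count_by_image` are illustrative names, and articles missing from the lookup are assumed to have no image.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the real tables.
toy_articles = pd.DataFrame({
    "article_id": [1, 2, 3],
    "image_ids": [[10], None, [11, 12]],
})
toy_behaviours = pd.DataFrame({
    "article_ids_inview": [[1, 2, 3], [1, 2], [2, 3]],
    "article_ids_clicked": [[1], [2], [3]],
})

# Ids of all articles that have a non-missing image_ids entry.
has_image = toy_articles.set_index("article_id")["image_ids"].notna()
image_article_ids = has_image[has_image].index

def count_by_image(column: str) -> tuple[int, int]:
    """Return (occurrences with image, occurrences without image) for a list column."""
    ids = toy_behaviours[column].explode().dropna().astype("int64")
    # Assumption: ids that are absent from the articles table count as no image.
    flags = ids.isin(image_article_ids)
    n_image = int(flags.sum())
    return n_image, int(len(flags) - n_image)

shown_image, shown_no_image = count_by_image("article_ids_inview")
clicked_image, clicked_no_image = count_by_image("article_ids_clicked")

p_click_image = clicked_image / shown_image
p_click_no_image = clicked_no_image / shown_no_image

print(f"P(click | image)    = {clicked_image}/{shown_image} = {p_click_image:.2%}")
print(f"P(click | no image) = {clicked_no_image}/{shown_no_image} = {p_click_no_image:.2%}")
```

In the toy data, articles 1 and 3 have images, so 4 of the 7 in-view entries and 2 of the 3 clicks fall on image articles.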

In [7]:
articles_dict = original_df.set_index('article_id')['image_ids'].to_dict()


total_clicks_on_image_articles = 0
total_image_articles_presented = 0
total_clicks_on_no_image_articles = 0
total_no_image_articles_presented = 0

for idx, row in behaviours.iterrows():
    articles_inview = row['article_ids_inview']
    articles_clicked = row['article_ids_clicked']
    
    for article_id in articles_inview:
        if articles_dict.get(article_id) is not None:
            total_image_articles_presented += 1
        else:
            total_no_image_articles_presented += 1
    
    for article_id in articles_clicked:
        if articles_dict.get(article_id) is not None:
            total_clicks_on_image_articles += 1
        else:
            total_clicks_on_no_image_articles += 1

prob_click_given_image = total_clicks_on_image_articles / total_image_articles_presented if total_image_articles_presented > 0 else 0
prob_click_given_no_image = total_clicks_on_no_image_articles / total_no_image_articles_presented if total_no_image_articles_presented > 0 else 0

print(f'Total Clicks on Image Articles : {total_clicks_on_image_articles}')
print(f'Total Image Articles Presented : {total_image_articles_presented}')
print(f'Total Clicks on No Image Articles : {total_clicks_on_no_image_articles}')
print(f'Total No Image Articles Presented : {total_no_image_articles_presented}')
print(f'Probability of Clicking on Image Article : {prob_click_given_image}')
print(f'Probability of Clicking on No Image Article : {prob_click_given_no_image}')


Total Clicks on Image Articles : 212877
Total Image Articles Presented : 2358623
Total Clicks on No Image Articles : 21400
Total No Image Articles Presented : 227124
Probability of Clicking on Image Article : 0.09025478001359268
Probability of Clicking on No Image Article : 0.0942216586534228


In [18]:
articles_dict = original_df.set_index('article_id')['image_ids'].to_dict()

total_clicks_on_image_articles = 0
total_image_articles_presented = 0
total_clicks_on_no_image_articles = 0
total_no_image_articles_presented = 0
total_articles_presented = 0
total_articles_clicked = 0

for idx, row in behaviours.iterrows():
    articles_inview = row['article_ids_inview']
    articles_clicked = row['article_ids_clicked']
    
    for article_id in articles_inview:
        total_articles_presented += 1
        if articles_dict.get(article_id) is not None:
            total_image_articles_presented += 1
        else:
            total_no_image_articles_presented += 1
    
    for article_id in articles_clicked:
        total_articles_clicked += 1
        if articles_dict.get(article_id) is not None:
            total_clicks_on_image_articles += 1
        else:
            total_clicks_on_no_image_articles += 1

prob_click_given_image = total_clicks_on_image_articles / total_image_articles_presented if total_image_articles_presented > 0 else 0
prob_click_given_no_image = total_clicks_on_no_image_articles / total_no_image_articles_presented if total_no_image_articles_presented > 0 else 0

overall_ctr = total_articles_clicked / total_articles_presented if total_articles_presented > 0 else 0

print(f'Total Clicks on Image Articles : {total_clicks_on_image_articles}')
print(f'Total Image Articles Presented : {total_image_articles_presented}')
print(f'Total Clicks on No Image Articles : {total_clicks_on_no_image_articles}')
print(f'Total No Image Articles Presented : {total_no_image_articles_presented}')
# The values are fractions, so the % format specifier is used (it multiplies by 100).
print(f'Probability of Clicking on Image Article : {prob_click_given_image:.2%}')
print(f'Probability of Clicking on No Image Article : {prob_click_given_no_image:.2%}')
print(f'Overall Click-Through Rate : {overall_ctr:.2%}')

Total Clicks on Image Articles : 212877
Total Image Articles Presented : 2358623
Total Clicks on No Image Articles : 21400
Total No Image Articles Presented : 227124
Probability of Clicking on Image Article : 9.03%
Probability of Clicking on No Image Article : 9.42%
Overall Click-Through Rate : 9.06%


We observe that the probability of clicking on a no-image article (**9.42%**) is slightly larger than that of clicking on an image article (**9.03%**).

Far fewer no-image articles were shown (**227124**) than image articles (**2358623**), so within any series of articles shown to a user, only a small subset lacks an image. This imbalance may contribute to the larger click probability, so we investigate the distribution further.
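To get a rough sense of whether a gap of this size could be noise, the counts reported above can be plugged into a two-proportion z-test. This is only an indicative sketch: the test assumes independent exposures, which does not hold here because many articles share one impression. The variable names are illustrative.

```python
import math

# Counts taken from the output above.
clicks_image, shown_image = 212877, 2358623
clicks_no_image, shown_no_image = 21400, 227124

p_image = clicks_image / shown_image
p_no_image = clicks_no_image / shown_no_image

# Pooled click probability under the null hypothesis of equal rates.
pooled = (clicks_image + clicks_no_image) / (shown_image + shown_no_image)
std_err = math.sqrt(pooled * (1 - pooled) * (1 / shown_image + 1 / shown_no_image))

z = (p_no_image - p_image) / std_err
# Two-sided p-value from the standard normal distribution.
p_value = math.erfc(abs(z) / math.sqrt(2))

print(f"Difference in click probability: {p_no_image - p_image:.4%}")
print(f"z statistic: {z:.2f}, two-sided p-value: {p_value:.2e}")
```

With samples this large, even a small difference tends to register as statistically significant, so the size of the gap matters more than the p-value.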

In [16]:
user_ids = []
image_articles_presented = []
no_image_articles_presented = []
users_shown_only_image_articles = 0
total_no_image_articles = 0
total_articles_presented = 0

for idx, row in behaviours.iterrows():
    user_id = row['user_id']
    articles_inview = row['article_ids_inview']
    
    count_image_articles = 0
    count_no_image_articles = 0
    
    for article_id in articles_inview:
        total_articles_presented += 1
        if articles_dict.get(article_id) is not None:
            count_image_articles += 1
        else:
            count_no_image_articles += 1
            total_no_image_articles += 1
    
    user_ids.append(user_id)
    image_articles_presented.append(count_image_articles)
    no_image_articles_presented.append(count_no_image_articles)
    
    if count_no_image_articles == 0:
        users_shown_only_image_articles += 1

percentage_no_image_articles = (total_no_image_articles / total_articles_presented) * 100 if total_articles_presented > 0 else 0

# Each row of behaviours is one impression, so these counts are per impression,
# not per distinct user.
users_shown_at_least_one_no_image_article = len(behaviours) - users_shown_only_image_articles

percentage_users_shown_at_least_one_no_image_article = users_shown_at_least_one_no_image_article / len(behaviours) * 100

print(f'Percentage of no image articles out of all articles shown to users : {percentage_no_image_articles:.2f}%')
print(f'Percentage of users that are shown at least one article with no image: {percentage_users_shown_at_least_one_no_image_article:.2f}%')


Percentage of no image articles out of all articles shown to users : 8.78%
Percentage of users that are shown at least one article with no image: 55.03%


In [20]:
articles_dict = original_df.set_index('article_id')['image_ids'].to_dict()

user_ids = []
image_articles_presented = []
no_image_articles_presented = []
image_articles_clicked = []
no_image_articles_clicked = []

for idx, row in behaviours.iterrows():
    user_id = row['user_id']
    articles_inview = row['article_ids_inview']
    articles_clicked = row['article_ids_clicked']
    
    count_image_articles_presented = 0
    count_no_image_articles_presented = 0
    count_image_articles_clicked = 0
    count_no_image_articles_clicked = 0
    
    for article_id in articles_inview:
        if articles_dict.get(article_id) is not None:
            count_image_articles_presented += 1
        else:
            count_no_image_articles_presented += 1
    
    for article_id in articles_clicked:
        if articles_dict.get(article_id) is not None:
            count_image_articles_clicked += 1
        else:
            count_no_image_articles_clicked += 1
    
    user_ids.append(user_id)
    image_articles_presented.append(count_image_articles_presented)
    no_image_articles_presented.append(count_no_image_articles_presented)
    image_articles_clicked.append(count_image_articles_clicked)
    no_image_articles_clicked.append(count_no_image_articles_clicked)

user_stats = pd.DataFrame({
    'user_id': user_ids,
    'image_articles_presented': image_articles_presented,
    'no_image_articles_presented': no_image_articles_presented,
    'image_articles_clicked': image_articles_clicked,
    'no_image_articles_clicked': no_image_articles_clicked
})

only_image_users = user_stats[user_stats['no_image_articles_presented'] == 0]
both_type_users = user_stats[user_stats['no_image_articles_presented'] > 0]

only_image_ctr = only_image_users['image_articles_clicked'].sum() / only_image_users['image_articles_presented'].sum() if only_image_users['image_articles_presented'].sum() > 0 else 0
both_type_ctr = both_type_users[['image_articles_clicked', 'no_image_articles_clicked']].sum().sum() / both_type_users[['image_articles_presented', 'no_image_articles_presented']].sum().sum() if both_type_users[['image_articles_presented', 'no_image_articles_presented']].sum().sum() > 0 else 0

# The CTR values are fractions, so the % format specifier is used (it multiplies by 100).
print(f"CTR for users shown only image articles: {only_image_ctr:.2%}")
print(f"CTR for users shown both image and no-image articles: {both_type_ctr:.2%}")


CTR for users shown only image articles: 12.20%
CTR for users shown both image and no-image articles: 7.49%


Of all articles shown to users, only **8.78%** are no-image articles. Furthermore, only **55.03%** of impressions (the rows of `behaviours` counted by the loop above) include at least one no-image article.
This shows the importance of image articles in the recommender pipeline.
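The percentages above are computed per row of `behaviours`, that is per impression, so a user with many impressions is counted many times. A minimal sketch of how the per-user share could be obtained instead is given below on hypothetical toy data. The frame mimics two columns of `user_stats`, and `toy_impressions` is an illustrative name.

```python
import pandas as pd

# Toy stand-in for user_stats: one row per impression (hypothetical values).
toy_impressions = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 3],
    "no_image_articles_presented": [0, 2, 0, 1, 0, 0],
})

# Share of impressions that contain at least one no-image article.
impression_share = (toy_impressions["no_image_articles_presented"] > 0).mean() * 100

# Share of distinct users who saw at least one no-image article in any impression.
per_user = toy_impressions.groupby("user_id")["no_image_articles_presented"].sum()
user_share = (per_user > 0).mean() * 100

print(f"Impressions with a no-image article : {impression_share:.2f}%")
print(f"Users with a no-image article       : {user_share:.2f}%")
```

In the toy data the two shares differ (2 of 6 impressions versus 2 of 3 users), which illustrates why the unit of aggregation should be stated alongside the number.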

Furthermore, comparing the click-through rate of impressions showing only image articles (**12.20%**) and of impressions mixing both types (**7.49%**) against the overall click-through rate (**9.06%**), we can see that image-only content is associated with increased engagement.

## History
Further information can be drawn from the complete click history of the users.

In [21]:
clicked_articles = history['article_id_fixed'].explode().unique()

clicked_articles_df = original_df[original_df['article_id'].isin(clicked_articles)]
articles_with_images = clicked_articles_df[clicked_articles_df['image_ids'].notna()]


In [22]:
total_clicked = len(clicked_articles)

clicked_with_images = articles_with_images['article_id'].nunique()

clicked_without_images = total_clicked - clicked_with_images


print(f"Total Articles Clicked: {total_clicked}")
print(f"Articles with Images Clicked: {clicked_with_images}")
print(f"Articles without Images Clicked: {clicked_without_images}")

Total Articles Clicked: 8786
Articles with Images Clicked: 8051
Articles without Images Clicked: 735
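Note that `clicked_without_images` is computed as a difference, so it also includes any clicked article ids that do not appear in the articles table at all. Whether such ids exist in the real data is not checked here. The sketch below separates the three cases on hypothetical toy data; `toy_articles` and `toy_history` are illustrative names.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the real tables.
toy_articles = pd.DataFrame({
    "article_id": [1, 2, 3],
    "image_ids": [[10], None, [11]],
})
toy_history = pd.DataFrame({
    "user_id": [100, 200],
    "article_id_fixed": [[1, 2, 9], [1, 3]],
})

# Unique article ids across all user histories.
clicked_ids = set(toy_history["article_id_fixed"].explode().dropna().astype("int64"))

# Clicked articles that are present in the articles table.
known = toy_articles[toy_articles["article_id"].isin(clicked_ids)]

clicked_with_images = int(known["image_ids"].notna().sum())
clicked_without_images = int(known["image_ids"].isna().sum())
clicked_not_in_articles = len(clicked_ids - set(toy_articles["article_id"]))

print(f"Total Articles Clicked: {len(clicked_ids)}")
print(f"Articles with Images Clicked: {clicked_with_images}")
print(f"Articles without Images Clicked: {clicked_without_images}")
print(f"Clicked Articles missing from the articles table: {clicked_not_in_articles}")
```

Here article 9 appears in a history but not in the articles table, so it is reported separately instead of being counted as an article without images.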
