# Demonstration of the Data Processing file

This file walks through the reasoning behind the preprocessing done in utils/data_preprocessing.

## Setup

We import the DataProcesser and configure polars to display all columns for the demonstration.

In [1]:
import os
import sys
import polars as pl
import numpy as np

parent_dir = os.path.abspath(os.path.join(os.getcwd(), "../utils"))
sys.path.append(parent_dir)

from data_preprocessing import DataProcesser
dataProcesser = DataProcesser()

pl.Config.set_tbl_cols(-1)

polars.config.Config

The DataProcesser stores the imported DataFrames as instance attributes, so that we don't have to re-import them for every processing step.

In [2]:
# Print the shape of each DataFrame held by the DataProcesser
print("Articles shape:        ", dataProcesser.articles_df.shape)
print("Document vectors shape:", dataProcesser.document_vectors_df.shape)
print("Train Behaviours shape:", dataProcesser.train_behaviors_df.shape)
print("Test Behaviours shape: ", dataProcesser.test_behaviors_df.shape)


I will now go through the processing applied to each of the DataFrames (with the train and test Behaviours being combined into one).

## Preprocessing

### Articles DataFrame

#### Exploration

This DataFrame contains information about the different articles in the dataset.

In [3]:
dataProcesser.articles_df.head(1)

article_id,title,subtitle,last_modified_time,premium,body,published_time,image_ids,article_type,url,ner_clusters,entity_groups,topics,category,subcategory,category_str,total_inviews,total_pageviews,total_read_time,sentiment_score,sentiment_label
i32,str,str,datetime[μs],bool,str,datetime[μs],list[i64],str,str,list[str],list[str],list[str],i16,list[i16],str,i32,i32,f32,f32,str
3001353,"""Natascha var ikke den første""","""Politiet frygter nu, at Natasc…",2023-06-29 06:20:33,False,"""Sagen om den østriske Natascha…",2006-08-31 08:06:45,[3150850],"""article_default""","""https://ekstrabladet.dk/krimi/…",[],[],"[""Kriminalitet"", ""Personfarlig kriminalitet""]",140,[],"""krimi""",,,,0.9955,"""Negative"""


By inspecting null values, we see that of the 20738 articles, around half are missing total_inviews, total_pageviews and total_read_time (10770 to 10882 nulls each). Furthermore, 1878 image_ids (~9%) are missing.

In [4]:
print("Dataframe Shape = ", dataProcesser.articles_df.shape)
dataProcesser.articles_df.null_count()

Dataframe Shape =  (20738, 21)


article_id,title,subtitle,last_modified_time,premium,body,published_time,image_ids,article_type,url,ner_clusters,entity_groups,topics,category,subcategory,category_str,total_inviews,total_pageviews,total_read_time,sentiment_score,sentiment_label
u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
0,0,0,0,0,0,0,1878,0,0,0,0,0,0,0,0,10770,10882,10882,0,0


#### Summary

**Filtering:** We can remove the user interaction columns from the DataFrame, as these values can be derived from the Behaviours table. Furthermore, the image_ids shouldn't be important, so these can be removed too.

**Sorting:** We can sort by published_time if we want to, but this isn't required.

**Summary:** Filter out ["total_inviews", "total_pageviews", "total_read_time", "image_ids"], sort by "published_time"

In [5]:
droppable_columns = ["total_inviews", "total_pageviews", "total_read_time", "image_ids"]
dataProcesser.process_dataframe(df = dataProcesser.articles_df, remove_columns = droppable_columns, sort_by = "published_time")

article_id,title,subtitle,last_modified_time,premium,body,published_time,article_type,url,ner_clusters,entity_groups,topics,category,subcategory,category_str,sentiment_score,sentiment_label
i32,str,str,datetime[μs],bool,str,datetime[μs],str,str,list[str],list[str],list[str],i16,list[i16],str,f32,str
9800521,"""Panser: - Jeg gik i stykker""","""Han blev kaldt 'Knogleknuseren…",2023-07-12 08:28:14,true,"""Jo farligere opgaverne var, jo…",2023-07-11 05:05:36,"""article_default""","""https://ekstrabladet.dk/nyhede…","[""Bageriet"", ""Ballerup"", … ""Thomas""]","[""LOC"", ""LOC"", … ""PER""]","[""Konflikt og krig"", ""Terror""]",118,[133],"""nyheder""",0.9934,"""Negative"""
9803607,"""Aktion mod svindlere: Seks per…","""Flere kvinder er ifølge politi…",2023-06-29 06:49:26,false,"""Mindst otte personer er blevet…",2023-06-08 06:54:53,"""article_default""","""https://ekstrabladet.dk/krimi/…","[""Adam"", ""Bente"", … ""Twitter""]","[""PER"", ""PER"", … ""ORG""]","[""Kriminalitet"", ""Bedrageri""]",140,[],"""krimi""",0.9948,"""Negative"""
9803525,"""Dansk skuespiller: - Jeg nægte…","""Julie R. Ølgaard fik akut kejs…",2023-06-29 06:49:26,false,"""Mens hun lå søvnløs, lød kakof…",2023-06-08 06:45:46,"""article_default""","""https://ekstrabladet.dk/underh…","[""Cooper"", ""Englemageren"", … ""Svangerskabsforgiftning""]","[""PER"", ""PROD"", … ""MISC""]","[""Kendt"", ""Livsstil"", … ""Sygdom og behandling""]",414,[425],"""underholdning""",0.7737,"""Negative"""
9803178,"""Mia-sagen: 'Jeg blev smigret'""","""Torsdag fortsætter 37-årigs fo…",2023-06-29 06:49:26,false,"""""",2023-06-08 06:38:24,"""article_scribblelive""","""https://ekstrabladet.dk/krimi/…","[""Skadhauge""]","[""PER""]","[""Kriminalitet""]",140,[],"""krimi""",0.9346,"""Negative"""
9803560,"""Så slemt er det: 14.000 huse e…","""Tusindvis af huse står under v…",2023-06-29 06:49:26,false,"""Et område på omkring 600 kvadr…",2023-06-08 06:25:42,"""article_default""","""https://ekstrabladet.dk/nyhede…","[""Dnepr"", ""Kherson"", … ""Ukraine""]","[""LOC"", ""LOC"", … ""LOC""]","[""International politik"", ""Katastrofe"", … ""Politik""]",118,[],"""nyheder""",0.9927,"""Negative"""
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
4790718,"""FCK-fan fik hovedet knust""","""Stak hovedet op af vinduet i d…",2023-06-29 07:10:50,false,"""For 41 fodboldfans fra FC Købe…",2000-08-14 04:05:00,"""article_default""","""https://ekstrabladet.dk/sport/…","[""Greve""]","[""LOC""]","[""Transportmiddel"", ""Katastrofe"", … ""Større transportmiddel""]",142,"[196, 199]","""sport""",0.9955,"""Negative"""
4389859,"""Lær at tale sort""","""Imponer vennerne eller undgå m…",2023-06-29 06:55:30,false,"""SORT SNAK Mens mange efterhånd…",2000-07-18 09:30:00,"""article_default""","""https://ekstrabladet.dk/nyhede…",[],[],"[""Erhverv"", ""Privat virksomhed"", … ""Forbrugerelektronik""]",118,[133],"""nyheder""",0.9206,"""Neutral"""
4794746,"""Træneren: Jeg skjuler ikke nog…","""""",2023-06-29 07:11:07,false,"""Galoptræner Brian Wilson, Ordr…",2000-06-22 12:45:00,"""article_default""","""https://ekstrabladet.dk/nyhede…",[],[],"[""Sport"", ""Dyr""]",118,[133],"""nyheder""",0.6719,"""Neutral"""
4812866,"""Suzuki Waggon R""","""Suzuki Waggon R - testet af Ek…",2023-06-29 07:12:22,false,"""Suzukis lille Wagon R på vej m…",1999-11-09 23:35:00,"""article_default""","""https://ekstrabladet.dk/biler/…",[],[],"[""Transportmiddel"", ""Bil""]",529,[530],"""biler""",0.6152,"""Neutral"""


### Behaviours DataFrame

#### Exploration

This DataFrame contains all information about interactions between users and articles over 7 days.

In [6]:
dataProcesser.train_behaviors_df.head(1)

impression_id,article_id,impression_time,read_time,scroll_percentage,device_type,article_ids_inview,article_ids_clicked,user_id,is_sso_user,gender,postcode,age,is_subscriber,session_id,next_read_time,next_scroll_percentage
u32,i32,datetime[μs],f32,f32,i8,list[i32],list[i32],u32,bool,i8,i8,i8,bool,u32,f32,f32
149474,,2023-05-24 07:47:53,13.0,,2,"[9778623, 9778682, … 9778728]",[9778657],139836,False,,,,False,759,7.0,22.0


By inspecting null values, we see that of 232887 interactions, around 70% of the article_id values are null, and almost all gender, postcode and age values (93% to 98%) are missing. Around 70% of the scroll_percentage values are also missing, which may correspond to the rows where article_id is null.

In [7]:
print("Dataframe Shape = ", dataProcesser.train_behaviors_df.shape)
dataProcesser.train_behaviors_df.null_count()

Dataframe Shape =  (232887, 17)


impression_id,article_id,impression_time,read_time,scroll_percentage,device_type,article_ids_inview,article_ids_clicked,user_id,is_sso_user,gender,postcode,age,is_subscriber,session_id,next_read_time,next_scroll_percentage
u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
0,162466,0,0,163789,0,0,0,0,0,216668,228214,226546,0,0,6218,26270
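
As a quick sanity check, the null shares can be computed by hand from the counts printed above. This is plain arithmetic on the reported numbers; the variable names are only illustrative.

```python
# Row count and null counts copied from the output above.
total_rows = 232_887
null_counts = {
    "article_id": 162_466,
    "scroll_percentage": 163_789,
    "gender": 216_668,
    "postcode": 228_214,
    "age": 226_546,
}

# Share of null values per column, in percent.
null_share = {
    column: round(100 * count / total_rows, 1)
    for column, count in null_counts.items()
}

for column, share in null_share.items():
    print(f"{column:<18} {share:>5.1f}% null")
```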


After removing the rows with a null article_id (the interactions from the main page), the DataFrame shrinks by about 70%, from 232887 to 70421 rows. Of the remaining ~70K interactions, only about 2.5% (1780) lack scroll data, so these can be imputed with the mean. More than 92% of the gender, postcode and age values are still missing, so we can't realistically predict these.

In [8]:
df = dataProcesser.process_dataframe(dataProcesser.train_behaviors_df, filter_null_columns=["article_id"])
print("Dataframe Shape = ", df.shape)
df.null_count()

Dataframe Shape =  (70421, 17)


impression_id,article_id,impression_time,read_time,scroll_percentage,device_type,article_ids_inview,article_ids_clicked,user_id,is_sso_user,gender,postcode,age,is_subscriber,session_id,next_read_time,next_scroll_percentage
u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
0,0,0,0,1780,0,0,0,0,0,65481,68785,67972,0,0,3173,9129


#### Summary

**Filtering:** We can remove gender, postcode and age due to lacking data. We can also remove next_read_time, next_scroll_percentage, article_ids_inview and article_ids_clicked, as future data won't be used in our predictive algorithms.

**Sorting:** We can sort by impression_time if we want to, but this isn't required.

**Non-Null:** We can filter out all rows where article_id is null.

**Predict:** We can impute the scroll_percentage values that are still missing after the null article_id rows have been filtered out.

In [10]:
remove_col = ["gender", "postcode", "age", "next_read_time", "next_scroll_percentage", "article_ids_inview", "article_ids_clicked"]
non_null = ["article_id"]
predict = ["scroll_percentage"]  # The standard method is mean, which will work well

dataProcesser.process_train_test_df(
    train_df=dataProcesser.train_behaviors_df,
    test_df=dataProcesser.test_behaviors_df,
    remove_columns=remove_col,
    filter_null_columns=non_null,
    predict_columns=predict,
).head()

impression_id,article_id,impression_time,read_time,scroll_percentage,device_type,user_id,is_sso_user,is_subscriber,session_id
u32,i32,datetime[μs],f32,f32,i8,u32,bool,bool,u32
153068,9778682,2023-05-24 07:09:04,78.0,100.0,1,151570,False,False,1976
153070,9777492,2023-05-24 07:13:14,26.0,100.0,1,151570,False,False,1976
153071,9778623,2023-05-24 07:11:08,125.0,100.0,1,151570,False,False,1976
153075,9777492,2023-05-24 07:13:58,26.0,100.0,1,151570,False,False,1976
153078,9777492,2023-05-24 07:13:46,7.0,100.0,1,151570,False,False,1976


### Document Vectors

#### Exploration

We see that the document vectors DataFrame contains no null values, so there isn't really much to do with it.

In [12]:
dataProcesser.document_vectors_df.null_count()

article_id,document_vector
u32,u32
0,0


#### Summary

This DataFrame can be used just the way it is. It could, however, be combined with the articles DataFrame (joined on article_id) if needed.