## Demonstration of the Item-Based Collaborative Recommender

This system applies collaborative filtering by analyzing how user-item interactions link items to one another. It therefore focuses on the user-item relation rather than on article content.

It recommends articles that were read by the same users as the articles a given user has already engaged with, aiming to provide personalized suggestions. The model's performance is evaluated using the MAP@K and NDCG@K metrics.

In [None]:
import sys
import os

parent_dir = os.path.abspath(os.path.join(os.getcwd(), ".."))
sys.path.append(parent_dir)

import polars as pl
import numpy as np

from parquet_data_reader import ParquetDataReader
from utils.data_preprocessing import DataProcesser
from models.collaborative.item_based_CF import ItemBasedCollaborativeRecommender

pl.Config.set_tbl_cols(-1)

polars.config.Config

## Data import & Preprocessing

In [2]:
print("articles_df has the size:         ", articles_df.shape)
print("train_behaviors_df has the size:  ", train_behaviors_df.shape)
print("train_history_df has the size:    ", train_history_df.shape)
print("document_vectors_df has the size: ", document_vectors_df.shape)

articles_df has the size:          (20738, 21)
train_behaviors_df has the size:   (232887, 17)
train_history_df has the size:     (15143, 5)
document_vectors_df has the size:  (125541, 2)


### Validation set

In [4]:
test_behaviours_df = data_reader.read_data('../../data/validation/behaviors.parquet')
test_behaviours_df.head()

impression_id,article_id,impression_time,read_time,scroll_percentage,device_type,article_ids_inview,article_ids_clicked,user_id,is_sso_user,gender,postcode,age,is_subscriber,session_id,next_read_time,next_scroll_percentage
u32,i32,datetime[μs],f32,f32,i8,list[i32],list[i32],u32,bool,i8,i8,i8,bool,u32,f32,f32
96791,,2023-05-28 04:21:24,9.0,,2,"[9783865, 9784591, … 9784710]",[9784696],22548,False,,,,False,142,72.0,100.0
96798,,2023-05-28 04:31:48,46.0,,2,"[9782884, 9783865, … 9784648]",[9784281],22548,False,,,,False,143,16.0,28.0
96801,,2023-05-28 04:30:17,14.0,,2,"[9784648, 7184889, … 9781983]",[9784444],22548,False,,,,False,143,12.0,24.0
96808,,2023-05-28 04:27:19,22.0,,2,"[9784607, 9695098, … 9781983]",[9781983],22548,False,,,,False,142,125.0,80.0
96810,,2023-05-28 04:29:47,23.0,,2,"[9781983, 7184889, … 9781520]",[9784642],22548,False,,,,False,142,,


In [5]:
# Combine train and test behaviors
combined_df = pl.concat([train_behaviors_df, test_behaviours_df])

# Generate a random mask for splitting
n = combined_df.height  # Total number of rows
test_mask = np.random.rand(n) < 0.30  # 30% test, 70% train

# Apply the mask
test_behaviors_df = combined_df.filter(test_mask)
train_behaviors_df = combined_df.filter(~test_mask)

# Verify the split
print(f"Train size: {train_behaviors_df.shape[0]}, Test size: {test_behaviors_df.shape[0]}")

Train size: 333688, Test size: 143846
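
The mask in the cell above is drawn from NumPy's global random state, so the exact train and test sizes change from run to run. A minimal sketch of a reproducible variant follows. It assumes a fixed seed (42 is an arbitrary choice), and the helper name `make_test_mask` is hypothetical.

```python
import numpy as np


def make_test_mask(n_rows: int, test_ratio: float = 0.30, seed: int = 42) -> np.ndarray:
    """Return a boolean mask marking roughly `test_ratio` of the rows as test rows."""
    rng = np.random.default_rng(seed)  # local generator; the global state is untouched
    return rng.random(n_rows) < test_ratio


mask = make_test_mask(10)
# Every row lands in exactly one of the two splits.
print(int(mask.sum()), int((~mask).sum()))
```

The resulting mask could be passed to `combined_df.filter(...)` in the same way as `test_mask` in the cell above.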


### Table contents

This table holds the information on news articles. As we are going to perform collaborative filtering, which relies only on user-item interactions, this table is not necessary.

In [6]:
articles_df.head()

article_id,title,subtitle,last_modified_time,premium,body,published_time,image_ids,article_type,url,ner_clusters,entity_groups,topics,category,subcategory,category_str,total_inviews,total_pageviews,total_read_time,sentiment_score,sentiment_label
i32,str,str,datetime[μs],bool,str,datetime[μs],list[i64],str,str,list[str],list[str],list[str],i16,list[i16],str,i32,i32,f32,f32,str
3001353,"""Natascha var ikke den første""","""Politiet frygter nu, at Natasc…",2023-06-29 06:20:33,False,"""Sagen om den østriske Natascha…",2006-08-31 08:06:45,[3150850],"""article_default""","""https://ekstrabladet.dk/krimi/…",[],[],"[""Kriminalitet"", ""Personfarlig kriminalitet""]",140,[],"""krimi""",,,,0.9955,"""Negative"""
3003065,"""Kun Star Wars tjente mere""","""Biografgængerne strømmer ind f…",2023-06-29 06:20:35,False,"""Vatikanet har opfordret til at…",2006-05-21 16:57:00,[3006712],"""article_default""","""https://ekstrabladet.dk/underh…",[],[],"[""Underholdning"", ""Film og tv"", ""Økonomi""]",414,"[433, 434]","""underholdning""",,,,0.846,"""Positive"""
3012771,"""Morten Bruun fyret i Sønderjys…","""FODBOLD: Morten Bruun fyret me…",2023-06-29 06:20:39,False,"""Kemien mellem spillerne i Supe…",2006-05-01 14:28:40,[3177953],"""article_default""","""https://ekstrabladet.dk/sport/…",[],[],"[""Erhverv"", ""Kendt"", … ""Ansættelsesforhold""]",142,"[196, 199]","""sport""",,,,0.8241,"""Negative"""
3023463,"""Luderne flytter på landet""","""I landets tyndest befolkede om…",2023-06-29 06:20:43,False,"""Det frække erhverv rykker på l…",2007-03-24 08:27:59,[3184029],"""article_default""","""https://ekstrabladet.dk/nyhede…",[],[],"[""Livsstil"", ""Erotik""]",118,[133],"""nyheder""",,,,0.7053,"""Neutral"""
3032577,"""Cybersex: Hvornår er man utro?""","""En flirtende sms til den flott…",2023-06-29 06:20:46,False,"""De fleste af os mener, at et t…",2007-01-18 10:30:37,[3030463],"""article_default""","""https://ekstrabladet.dk/sex_og…",[],[],"[""Livsstil"", ""Partnerskab""]",565,[],"""sex_og_samliv""",,,,0.9307,"""Neutral"""


Each file consists of seven days of impression logs. The `train_behaviors_df` table contains all information about interactions between users and items and can be used as the basis for collaborative filtering. **Therefore, this is the only table we need.**

In [7]:
train_behaviors_df.head()

impression_id,article_id,impression_time,read_time,scroll_percentage,device_type,article_ids_inview,article_ids_clicked,user_id,is_sso_user,gender,postcode,age,is_subscriber,session_id,next_read_time,next_scroll_percentage
u32,i32,datetime[μs],f32,f32,i8,list[i32],list[i32],u32,bool,i8,i8,i8,bool,u32,f32,f32
153068,9778682,2023-05-24 07:09:04,78.0,100.0,1,"[9778657, 9778669, … 9778682]",[9778669],151570,False,,,,False,1976,45.0,100.0
153071,9778623,2023-05-24 07:11:08,125.0,100.0,1,"[9777492, 9774568, … 9775990]",[9777492],151570,False,,,,False,1976,26.0,100.0
153075,9777492,2023-05-24 07:13:58,26.0,100.0,1,"[9778500, 9776420, … 9020783]",[9777034],151570,False,,,,False,1976,7.0,16.0
153078,9777492,2023-05-24 07:13:46,7.0,100.0,1,"[9778021, 9778627, … 7213923]",[9778226],151570,False,,,,False,1976,4.0,21.0
155587,9778627,2023-05-24 07:56:30,50.0,100.0,1,"[9777397, 9759955, … 9778369]",[9778375],161621,False,,,,False,3625,119.0,100.0


Each file consists of users' click histories collected over a 21-day period. This table contains the same values as `train_behaviors_df`, but since that table is easier to work with, we use `train_behaviors_df` instead.

In [8]:
train_history_df.head()

user_id,impression_time_fixed,scroll_percentage_fixed,article_id_fixed,read_time_fixed
u32,list[datetime[μs]],list[f32],list[i32],list[f32]
13538,"[2023-04-27 10:17:43, 2023-04-27 10:18:01, … 2023-05-17 20:36:34]","[100.0, 35.0, … 100.0]","[9738663, 9738569, … 9769366]","[17.0, 12.0, … 16.0]"
14241,"[2023-04-27 09:40:18, 2023-04-27 09:40:33, … 2023-05-17 17:08:41]","[100.0, 46.0, … 100.0]","[9738557, 9738528, … 9767852]","[8.0, 9.0, … 12.0]"
20396,"[2023-04-27 12:30:44, 2023-04-27 12:31:34, … 2023-05-17 10:59:44]","[100.0, 59.0, … 13.0]","[9738760, 9738355, … 9769679]","[49.0, 34.0, … 4.0]"
34912,"[2023-04-29 07:12:49, 2023-04-29 13:01:18, … 2023-05-18 05:06:40]","[100.0, 35.0, … 27.0]","[9741802, 9741804, … 9770882]","[153.0, 7.0, … 5.0]"
37953,"[2023-04-27 19:17:10, 2023-04-27 19:17:27, … 2023-05-17 21:29:22]","[14.0, 28.0, … 18.0]","[9739205, 9739202, … 9769306]","[4.0, 16.0, … 6.0]"


A document vector for each article, describing the item's content. It could be used to compare items by content, but the recommender in this notebook compares items through user interactions only. **This table will therefore not be used.**

In [9]:
document_vectors_df.head()

article_id,document_vector
i32,list[f32]
3000022,"[0.065424, -0.047425, … 0.035706]"
3000063,"[0.028815, -0.000166, … 0.027167]"
3000613,"[0.037971, 0.033923, … 0.063961]"
3000700,"[0.046524, 0.002913, … 0.023423]"
3000840,"[0.014737, 0.024068, … 0.045991]"


From the analysis we see that we only need `train_behaviors_df` to perform collaborative filtering.

## Preprocessing

### Remove non-needed values

We see that the table has several columns that are not required for collaborative filtering.

In [10]:
train_behaviors_df.head()

impression_id,article_id,impression_time,read_time,scroll_percentage,device_type,article_ids_inview,article_ids_clicked,user_id,is_sso_user,gender,postcode,age,is_subscriber,session_id,next_read_time,next_scroll_percentage
u32,i32,datetime[μs],f32,f32,i8,list[i32],list[i32],u32,bool,i8,i8,i8,bool,u32,f32,f32
153068,9778682,2023-05-24 07:09:04,78.0,100.0,1,"[9778657, 9778669, … 9778682]",[9778669],151570,False,,,,False,1976,45.0,100.0
153071,9778623,2023-05-24 07:11:08,125.0,100.0,1,"[9777492, 9774568, … 9775990]",[9777492],151570,False,,,,False,1976,26.0,100.0
153075,9777492,2023-05-24 07:13:58,26.0,100.0,1,"[9778500, 9776420, … 9020783]",[9777034],151570,False,,,,False,1976,7.0,16.0
153078,9777492,2023-05-24 07:13:46,7.0,100.0,1,"[9778021, 9778627, … 7213923]",[9778226],151570,False,,,,False,1976,4.0,21.0
155587,9778627,2023-05-24 07:56:30,50.0,100.0,1,"[9777397, 9759955, … 9778369]",[9778375],161621,False,,,,False,3625,119.0,100.0


All information that does not describe a user or a user-item interaction can therefore be removed.

In [11]:
# Columns that describe neither the user nor the user-item interaction
irrelevant_columns = ["impression_time", "device_type", "article_ids_inview", "article_ids_clicked", "session_id", "next_read_time", "next_scroll_percentage"]
train_behaviors_df = train_behaviors_df.drop(irrelevant_columns)
train_behaviors_df.head()

impression_id,article_id,read_time,scroll_percentage,user_id,is_sso_user,gender,postcode,age,is_subscriber
u32,i32,f32,f32,u32,bool,i8,i8,i8,bool
153068,9778682,78.0,100.0,151570,False,,,,False
153071,9778623,125.0,100.0,151570,False,,,,False
153075,9777492,26.0,100.0,151570,False,,,,False
153078,9777492,7.0,100.0,151570,False,,,,False
155587,9778627,50.0,100.0,161621,False,,,,False


The remaining columns are the ones that can be used. Already here, however, we see that several features have missing information, which we need to handle.

### Account for missing values

We see here that a lot of the behaviors contain missing values. We therefore have to either remove or replace these values.

In [12]:
print(train_behaviors_df.shape)
train_behaviors_df.null_count()

(333688, 10)


impression_id,article_id,read_time,scroll_percentage,user_id,is_sso_user,gender,postcode,age,is_subscriber
u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
0,234868,0,236338,0,0,310941,327337,324985,0


In [13]:
train_behaviors_df = train_behaviors_df.filter(train_behaviors_df["article_id"].is_not_null())
print(train_behaviors_df.shape)
train_behaviors_df.null_count()

(98820, 10)


impression_id,article_id,read_time,scroll_percentage,user_id,is_sso_user,gender,postcode,age,is_subscriber
u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
0,0,0,2538,0,0,92070,96625,95502,0


We see that of the 98820 remaining rows, between 92070 and 96625 have missing values for gender, postcode and age. We therefore drop these columns, as there is no use in substituting values for them.

In [14]:
train_behaviors_df = train_behaviors_df.drop(["gender", "postcode", "age"])
print(train_behaviors_df.shape)
train_behaviors_df.null_count()

(98820, 7)


impression_id,article_id,read_time,scroll_percentage,user_id,is_sso_user,is_subscriber
u32,u32,u32,u32,u32,u32,u32
0,0,0,2538,0,0,0


We still see that 2538 of the 98820 rows are missing a scroll percentage. As this share is very low (<3%), we can easily replace these values. Initially we just set the scroll percentage to 0.

In [15]:
train_behaviors_df = train_behaviors_df.fill_null(strategy="zero")

### Account for multiple instances of the same article and user

By checking rows where the `user_id` and `article_id` are the same, we see that we have 12252 instances where a user has read the same article multiple times.

In [16]:
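# Count rows per (article_id, user_id) pair and keep the pairs that occur more than once
duplicates = train_behaviors_df.group_by(["article_id", "user_id"]).agg(pl.len().alias("count")).filter(pl.col("count") > 1)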

print(duplicates)

shape: (12_252, 3)
┌────────────┬─────────┬───────┐
│ article_id ┆ user_id ┆ count │
│ ---        ┆ ---     ┆ ---   │
│ i32        ┆ u32     ┆ u32   │
╞════════════╪═════════╪═══════╡
│ 9789922    ┆ 381484  ┆ 5     │
│ 9783334    ┆ 1602863 ┆ 2     │
│ 9777406    ┆ 1742007 ┆ 3     │
│ 9776234    ┆ 324118  ┆ 2     │
│ 9769531    ┆ 2120677 ┆ 2     │
│ …          ┆ …       ┆ …     │
│ 9784856    ┆ 1893367 ┆ 3     │
│ 9776234    ┆ 664421  ┆ 2     │
│ 9780648    ┆ 1282038 ┆ 3     │
│ 9785310    ┆ 2454480 ┆ 2     │
│ 9772099    ┆ 1956681 ┆ 4     │
└────────────┴─────────┴───────┘



We see that we need to combine these duplicate rows. For multiple instances of the same article and user, we therefore sum the read times and keep the largest scroll percentage. This way we preserve the data without keeping duplicates.
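
The merging rule can be illustrated on a few toy rows. This is a plain-Python sketch of the rule as described, not the code inside `DataProcesser`. The helper name `merge_duplicates` and the toy values are made up for illustration.

```python
from collections import defaultdict


def merge_duplicates(rows):
    """Collapse repeated (article_id, user_id) pairs into one row each.

    `rows` is an iterable of (article_id, user_id, read_time, scroll_percentage).
    Read times are summed and the largest scroll percentage is kept.
    """
    totals = defaultdict(lambda: [0.0, 0.0])  # key -> [total_readtime, max_scroll]
    for article_id, user_id, read_time, scroll in rows:
        entry = totals[(article_id, user_id)]
        entry[0] += read_time
        entry[1] = max(entry[1], scroll)
    return [
        (article_id, user_id, total, max_scroll)
        for (article_id, user_id), (total, max_scroll) in totals.items()
    ]


# Toy rows: article 1 was read twice by user 10.
toy_rows = [
    (1, 10, 26.0, 100.0),
    (1, 10, 7.0, 40.0),
    (2, 11, 50.0, 100.0),
]
print(merge_duplicates(toy_rows))
# → [(1, 10, 33.0, 100.0), (2, 11, 50.0, 100.0)]
```

In polars, the same aggregation could presumably be expressed with `group_by(["article_id", "user_id"]).agg(...)` using a `sum` expression for the read time and a `max` expression for the scroll percentage.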

In [17]:
dataProcesser = DataProcesser()
behaviors_df = dataProcesser.collaborative_filtering_preprocess()
train_df, test_df = dataProcesser.random_split(behaviors_df, test_ratio=0.2)
print(train_df.head())

shape: (5, 4)
┌────────────┬─────────┬────────────────┬────────────┐
│ article_id ┆ user_id ┆ total_readtime ┆ max_scroll │
│ ---        ┆ ---     ┆ ---            ┆ ---        │
│ i32        ┆ u32     ┆ f32            ┆ f32        │
╞════════════╪═════════╪════════════════╪════════════╡
│ 9781987    ┆ 1567944 ┆ 273.0          ┆ 100.0      │
│ 9772095    ┆ 864372  ┆ 7.0            ┆ 100.0      │
│ 9771916    ┆ 979336  ┆ 261.0          ┆ 100.0      │
│ 9775855    ┆ 1660448 ┆ 25.0           ┆ 100.0      │
│ 9772168    ┆ 1301442 ┆ 34.0           ┆ 100.0      │
└────────────┴─────────┴────────────────┴────────────┘


## Model Fit

This first model uses the read time and scroll percentage of each interaction when comparing user interactions.

In [19]:
recommender = ItemBasedCollaborativeRecommender(train_df)
recommender.fit()

{9781987: [(9775793, np.float64(0.9998934110637303)),
  (9772957, np.float64(0.9993676445467325)),
  (9773726, np.float64(0.18082506404746557)),
  (9773364, np.float64(0.10782977891939127)),
  (9775076, np.float64(0.038724353183660964)),
  (9777292, np.float64(0.02203470673476604)),
  (9779242, np.float64(0.0021768509398193414)),
  (9779423, np.float64(0.0016495383602512792)),
  (9788462, np.float64(0.0012473571153392982)),
  (9789417, np.float64(0.0010144435684085185))],
 9772095: [(9768566, np.float64(0.5440265049477685)),
  (9776449, np.float64(0.5437357441109495)),
  (9771842, np.float64(0.21933478153412045)),
  (9775905, np.float64(0.11069389292749798)),
  (9779891, np.float64(0.11056052381642101)),
  (9778628, np.float64(0.1062262001405071)),
  (9783803, np.float64(0.10190301846128114)),
  (9780962, np.float64(0.06032382413642556)),
  (9784334, np.float64(0.04148847874148942)),
  (9660631, np.float64(0.04113542308359375))],
 9771916: [(9716607, np.float64(0.3555007027046668)),
  

This second model is binary: it only considers which articles each user has read, ignoring read time and scroll percentage.
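
The difference between the weighted and the binary variant can be sketched with a small cosine-similarity example. This is an illustration under assumptions, not the implementation of `ItemBasedCollaborativeRecommender`. The function name `item_similarities`, the choice of cosine similarity, and the way a single weight per interaction is supplied are all assumptions.

```python
import math
from collections import defaultdict


def item_similarities(interactions, binary=False):
    """Cosine similarity between items, based on the users they share.

    `interactions` is an iterable of (user_id, article_id, weight) triples.
    With `binary=True` every weight is replaced by 1.0.
    Returns {article_id: [(other_article_id, similarity), ...]}, best match first.
    """
    vectors = defaultdict(dict)  # article_id -> {user_id: weight}
    for user_id, article_id, weight in interactions:
        vectors[article_id][user_id] = 1.0 if binary else float(weight)

    norms = {a: math.sqrt(sum(w * w for w in vec.values())) for a, vec in vectors.items()}

    similar = {}
    for a, vec_a in vectors.items():
        scores = []
        for b, vec_b in vectors.items():
            if a == b:
                continue
            dot = sum(w * vec_b[u] for u, w in vec_a.items() if u in vec_b)
            if dot > 0:  # only keep items that share at least one user
                scores.append((b, dot / (norms[a] * norms[b])))
        similar[a] = sorted(scores, key=lambda pair: pair[1], reverse=True)
    return similar


# Toy interactions: (user_id, article_id, weight)
toy = [(1, "A", 10), (1, "B", 10), (2, "A", 5), (2, "C", 1)]
print(item_similarities(toy))
print(item_similarities(toy, binary=True))
```

The pairwise loop is quadratic in the number of articles, so it is only meant to show the idea on small inputs.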

In [20]:
binary_recommender = ItemBasedCollaborativeRecommender(train_df, binary_model=True)
binary_recommender.fit()

{9781987: [(9782418, np.float64(0.16069099615140692)),
  (9782499, np.float64(0.12373054103598746)),
  (9749668, np.float64(0.11867816581938528)),
  (9336443, np.float64(0.11867816581938528)),
  (9405451, np.float64(0.11867816581938528)),
  (9615540, np.float64(0.11867816581938528)),
  (9728036, np.float64(0.11867816581938528)),
  (7377747, np.float64(0.11867816581938528)),
  (9625041, np.float64(0.11867816581938528)),
  (9733679, np.float64(0.11867816581938528))],
 9772095: [(9768566, np.float64(0.1740776559556978)),
  (9625180, np.float64(0.1740776559556978)),
  (9736092, np.float64(0.1740776559556978)),
  (9769559, np.float64(0.1269591790297564)),
  (9736745, np.float64(0.1230914909793327)),
  (9777016, np.float64(0.1230914909793327)),
  (9765943, np.float64(0.1230914909793327)),
  (9665648, np.float64(0.1230914909793327)),
  (9702111, np.float64(0.1230914909793327)),
  (9767231, np.float64(0.1230914909793327))],
 9771916: [(9771948, np.float64(0.17792218241704139)),
  (9771919, np.

Of the original 15143 users, only 9194 can be accounted for with the current solution. This should be addressed in future work.

## Model presentation

### Article Recommendation

In [6]:
for user in [630220, 620796, 1067393, 1726258, 17205]:
    print("recommended for user ", user, ": ", recommender.recommend_n_articles(user_id=user, n=5, allow_read_articles=True))

recommended for user  630220 :  [9785596, 9777846, 9780193, 9775985, 9774764]
recommended for user  620796 :  [9786618, 9498042, 9790811, 9339920, 9773943]
recommended for user  1067393 :  [9780039, 9771197, 9782836, 9774125, 9781855]
recommended for user  1726258 :  [9786209, 9777324, 9778842, 9779520, 9783197]
recommended for user  17205 :  [9640315, 8518755, 9627627, 9776688, 9659139]


In [22]:
for user in [630220, 620796, 1067393, 1726258, 17205]:
    print("recommended for user ", user, ": ", binary_recommender.recommend_n_articles(user_id=user, n=5, allow_read_articles=True))

recommended for user  630220 :  [9603946, 9699524, 9660886, 9676767, 9717962]
recommended for user  620796 :  [9765326, 9766178, 9749469, 9462935, 9740047]
recommended for user  1067393 :  [9787659, 9771627, 9084355, 9769471, 9768377]
recommended for user  1726258 :  [9775804, 9718262, 9566544, 9306867, 9709965]
recommended for user  17205 :  [9768638, 9705425, 9768860, 9640315, 9627627]
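
How `recommend_n_articles` turns the fitted similarity dictionary into a ranked list is not shown in this notebook. One common approach, sketched here purely as an assumption, scores each candidate by the summed similarity to the articles the user has read. The function name `recommend_n` and the toy similarity values are hypothetical.

```python
def recommend_n(similar_items, read_articles, n=5, allow_read_articles=True):
    """Rank candidate articles by summed similarity to the articles a user has read.

    `similar_items` maps article_id -> [(neighbor_id, similarity), ...].
    `read_articles` is the collection of article ids the user has interacted with.
    """
    read = set(read_articles)
    scores = {}
    for article_id in read:
        for neighbor_id, similarity in similar_items.get(article_id, []):
            if not allow_read_articles and neighbor_id in read:
                continue  # skip articles the user already knows
            scores[neighbor_id] = scores.get(neighbor_id, 0.0) + similarity
    # Highest score first; ties are broken by article id to keep the order stable.
    ranked = sorted(scores, key=lambda article: (-scores[article], article))
    return ranked[:n]


# Toy similarity dictionary in the same shape as the output of `fit()` above.
toy_similar = {1: [(2, 0.9), (3, 0.5)], 2: [(1, 0.9), (3, 0.6)]}
print(recommend_n(toy_similar, [1, 2], n=3))
print(recommend_n(toy_similar, [1, 2], n=3, allow_read_articles=False))
```

With `allow_read_articles=False` only article 3 remains in the toy example, because articles 1 and 2 have already been read.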


### Evaluation Scores
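
The evaluations in this section report MAP@K and NDCG@K. As a reference for how these two metrics are commonly computed for a single user, here is a minimal sketch with binary relevance. It is not the code behind `evaluate_recommender`. The normalisation by `min(len(relevant), k)` is one of several conventions in use, and the function names are hypothetical.

```python
import math


def average_precision_at_k(recommended, relevant, k=10):
    """Average precision of the first k recommendations for one user."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / min(len(relevant), k)


def ndcg_at_k(recommended, relevant, k=10):
    """Binary-relevance NDCG of the first k recommendations for one user."""
    relevant = set(relevant)
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(recommended[:k], start=1)
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


print(average_precision_at_k([1, 2, 3], {1, 3}, k=3))
print(ndcg_at_k([1, 2, 3], {1, 3}, k=3))
```

MAP@K is then the mean of the per-user average precision over all evaluated users, and NDCG@K is averaged in the same way. The sketch assumes that each recommendation list contains no repeated articles.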

#### Without the ability to recommend read articles

The complex model, recommending only articles the user has not yet read:

In [23]:
results = recommender.evaluate_recommender(test_df, k=10, n_jobs=4, user_sample=200, allow_read_articles=False)
results

{'MAP@K': np.float64(0.004301075268817204),
 'NDCG@K': np.float64(0.009723807850803602)}

The binary recommender model, recommending only articles the user has not yet read:

In [24]:
results = binary_recommender.evaluate_recommender(test_df, k=10, n_jobs=4, user_sample=200, allow_read_articles=False)
results

{'MAP@K': np.float64(0.0008474576271186442),
 'NDCG@K': np.float64(0.0011495929951902354)}

#### With the ability to recommend previously read articles

The complex model, recommending articles to the user even if they have read them before:

In [25]:
results = recommender.evaluate_recommender(test_df, k=10, n_jobs=4, user_sample=200, allow_read_articles=True)
results

{'MAP@K': np.float64(0.004), 'NDCG@K': np.float64(0.006687768782857766)}

The binary recommender model, recommending articles to the user even if they have read them before:

In [26]:
results = binary_recommender.evaluate_recommender(test_df, k=10, n_jobs=4, user_sample=200, allow_read_articles=True)
results

{'MAP@K': np.float64(0.0020408163265306124),
 'NDCG@K': np.float64(0.009206245092687207)}

## Model Experimentation

In [27]:
test_user_id = 630220

predictions = recommender.recommend_n_articles(user_id=test_user_id, n=100, allow_read_articles=True)
results = set(test_behaviours_df.filter(pl.col("user_id") == test_user_id)["article_id"])

print(results)
print(predictions)

for prediction in predictions:
    if prediction in results:
        print("Yes")

{9786243, 9787524, 9781902, 9784591, 9783824, 9786111, 9776916, 9779615, 9788705, 9789473, 9428643, 9783334, 9782315, 9756075, 9787441, 9782722, 9786821, 9782726, 9786566, 9789896, 9787465, 9788362, 9791049, 9782092, 9780815, None, 9783509, 9772508, 9786718, 9786719, 9787487, 9790942, 9783655, 9786351, 9780849, 9781875, 9788661, 9781878, 9787510, 9786618, 9673979, 9780348, 9781887}
[9785596, 9777846, 9780193, 9775985, 9774764, 9778413, 9778155, 9787261, 9782180, 9786932, 9779724, 9771355, 9776694, 9772221, 9778845, 8534547, 9789502, 9781859, 9788048, 9494299, 9239697, 9749857, 9779289, 9735085, 9725042, 9760042, 9767481, 9772297, 9771749, 9780374, 9783585, 9785076, 9781389, 9667585, 9625415, 9764444, 9783276, 9772032, 9776553, 9780697, 9772601, 9772751, 9539706, 9785030, 9788797, 9771166, 9773292, 9777397, 9788950, 9767233, 9768062, 9776691, 9781870, 9786351, 9775283, 7373406, 9775567, 9774079, 9432542, 9780962, 9787230, 9657822, 9786860, 9735224, 9782633, 9771352, 9769650, 9776715, 97

In [28]:
test_user_id = 630220

predictions = recommender.recommend_n_articles(user_id=test_user_id, n=100, allow_read_articles=True)
results = set(test_df.filter(pl.col("user_id") == test_user_id)["article_id"])

print(results)
print(predictions)

for prediction in predictions:
    if prediction in results:
        print("Yes")

{9787524, 9784591, 9774352, 9774864, 9776916, 9774120, 9756075, 9780020, 9777856, 9773248, 9782722, 9782726, 9776071, 9788362, 9780815, 9776862, 9771627, 9776497, 9776246, 9771127, 9778939, 9781887}
[9785596, 9777846, 9780193, 9775985, 9774764, 9778413, 9778155, 9787261, 9782180, 9786932, 9779724, 9771355, 9776694, 9772221, 9778845, 8534547, 9789502, 9781859, 9788048, 9494299, 9239697, 9749857, 9779289, 9735085, 9725042, 9760042, 9767481, 9772297, 9771749, 9780374, 9783585, 9785076, 9781389, 9667585, 9625415, 9764444, 9783276, 9772032, 9776553, 9780697, 9772601, 9772751, 9539706, 9785030, 9788797, 9771166, 9773292, 9777397, 9788950, 9767233, 9768062, 9776691, 9781870, 9786351, 9775283, 7373406, 9775567, 9774079, 9432542, 9780962, 9787230, 9657822, 9786860, 9735224, 9782633, 9771352, 9769650, 9776715, 9780096, 9789831, 9775939, 9779629, 9779417, 9770997, 9779538, 9663762, 9780280, 9772256, 9080070, 9683742, 9716607, 9766011, 9772367, 9674243, 9776087, 9789379, 9773617, 9772291, 9777292,

#### Binary Item Based Recommender Metrics


In [29]:
test_user_id = 630220

predictions = recommender.recommend_n_articles(user_id=test_user_id, n=100, allow_read_articles=True)
results = set(test_df.filter(pl.col("user_id") == test_user_id)["article_id"])

print(results)
print(predictions)

for prediction in predictions:
    if prediction in results:
        print("Yes")

{9788705, 9789473, 9772355, 9778500, 9786566, 9774120, 9778413, 9781902, 9778351, 9774352, 9783824, 9779860, 9783509, 9776406, 9778939, 9776862, 9781887}
[9776917, 9787261, 9782180, 4265340, 9771355, 9772710, 9787586, 9775985, 9759717, 9759345, 9776691, 9783655, 9772221, 9756546, 9769367, 9790019, 9774864, 9767639, 9776694, 9767852, 9771568, 9767233, 9764444, 9778845, 9789404, 8534547, 9766803, 9239697, 9494299, 9777200, 8860119, 9777397, 9789427, 9785475, 9782915, 9789502, 9777464, 9773873, 9783585, 9714168, 9625415, 9783276, 9789494, 9777492, 9765759, 9775568, 9578459, 9514605, 9790755, 9773316, 9539706, 9722202, 9080070, 9774079, 9789745, 9778021, 9776566, 9783405, 9710762, 9788524, 9768062, 9775990, 9738729, 9569934, 9766635, 9673979, 9787230, 9654458, 9779242, 9782361, 9770450, 9655559, 9775573, 9769679, 9686860, 9769917, 9786860, 9672256, 9774287, 9440043, 9787243, 9538375, 9778745, 9732481, 9780096, 9766225, 9778682, 9780815, 9781991, 9772256, 8315213, 9781389, 9778219, 9782315,

In [None]:
from utils.evaluation import perform_model_evaluation
metrics = perform_model_evaluation(binary_recommender, test_data=test_df, k=5)
metrics

{'precision@k': np.float64(0.003818891942716621),
 'recall@k': np.float64(0.010799706264772084),
 'fpr@k': np.float64(0.0021827124788052843)}

In [30]:
from utils.evaluation import append_model_metrics
append_model_metrics(metrics, "item based binary")

#### Complex Item Based Recommender Metrics

In [31]:
metrics_complex = perform_model_evaluation(recommender, test_data=test_df, k=5)
print(metrics_complex)

append_model_metrics(metrics_complex, "item based complex")

{'precision@k': np.float64(0.002256617966150731), 'recall@k': np.float64(0.0034282432411652055), 'fpr@k': np.float64(0.0021865176445951336)}


### Diversity Evaluation
Calculates the aggregate diversity of the recommender model's recommendations and appends the result to the `/evaluation_summary/model_overview_diversity.csv` file.
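
Aggregate diversity is often reported as the share of the catalog that shows up in at least one user's recommendation list. The sketch below follows that definition. It is an assumption about the metric, not the code in `utils.evaluation`, which may for instance report a raw count instead. The function name `aggregate_diversity_sketch` is hypothetical.

```python
def aggregate_diversity_sketch(recommendations_per_user, catalog_size):
    """Share of the catalog that appears in at least one recommendation list."""
    recommended = set()
    for items in recommendations_per_user:
        recommended.update(items)
    return len(recommended) / catalog_size if catalog_size else 0.0


# Toy example: three users, a catalog of ten articles, four distinct articles recommended.
toy_recommendations = [[1, 2], [2, 3], [1, 4]]
print(aggregate_diversity_sketch(toy_recommendations, catalog_size=10))
# → 0.4
```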

In [None]:
users_df = train_history_df["user_id"].unique()
users_df

user_id
u32
10068
10200
10201
10623
10701
…
2590015
2590054
2590471
2590571


In [33]:
articles_ids_df = articles_df["article_id"].unique()
articles_ids_df

article_id
i32
3001353
3003065
3012771
3023463
3032577
…
9803492
9803505
9803525
9803560


In [34]:
from utils.evaluation import aggregate_diversity
from utils.evaluation import append_aggregate_diversity

print(recommender.item_similarity_matrix)

diversity = aggregate_diversity(recommender, articles_df, users_df=users_df, user_sample=1000)

print("Diversity")
print(diversity)

append_aggregate_diversity(diversity, "item based CF")

{9781987: [(9775793, np.float64(0.9998934110637303)), (9772957, np.float64(0.9993676445467325)), (9773726, np.float64(0.18082506404746557)), (9773364, np.float64(0.10782977891939127)), (9775076, np.float64(0.038724353183660964)), (9777292, np.float64(0.02203470673476604)), (9779242, np.float64(0.0021768509398193414)), (9779423, np.float64(0.0016495383602512792)), (9788462, np.float64(0.0012473571153392982)), (9789417, np.float64(0.0010144435684085185))], 9772095: [(9768566, np.float64(0.5440265049477685)), (9776449, np.float64(0.5437357441109495)), (9771842, np.float64(0.21933478153412045)), (9775905, np.float64(0.11069389292749798)), (9779891, np.float64(0.11056052381642101)), (9778628, np.float64(0.1062262001405071)), (9783803, np.float64(0.10190301846128114)), (9780962, np.float64(0.06032382413642556)), (9784334, np.float64(0.04148847874148942)), (9660631, np.float64(0.04113542308359375))], 9771916: [(9716607, np.float64(0.3555007027046668)), (9771576, np.float64(0.02975904010394559

### Gini coefficient 
Calculates the Gini coefficient for the recommender model and appends it to the `/output/model_overview_gini.csv` file.
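
For recommenders, the Gini coefficient is usually computed over how often each catalog article is recommended: 0 means every article is recommended equally often, and values close to 1 mean the recommendations concentrate on a few articles. The following is a minimal sketch of that calculation. It is an assumption about the metric rather than the code in `utils.evaluation`, and the function name `gini_sketch` is hypothetical.

```python
from collections import Counter


def gini_sketch(recommendations_per_user, catalog):
    """Gini coefficient of how often each catalog item is recommended."""
    counts = Counter(item for items in recommendations_per_user for item in items)
    # Recommendation count per catalog item, sorted ascending; unrecommended items count as 0.
    values = sorted(counts.get(item, 0) for item in catalog)
    n, total = len(values), sum(values)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((2 * i - n - 1) * v for i, v in enumerate(values, start=1))
    return weighted / (n * total)


catalog = [1, 2, 3, 4]
print(gini_sketch([[1, 2], [3, 4]], catalog))  # evenly spread recommendations
print(gini_sketch([[1], [1], [1]], catalog))   # everything concentrated on one article
```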

In [35]:
from utils.evaluation import gini_coefficient
from utils.evaluation import append_gini_coefficient

gini = gini_coefficient(recommender, users_df, articles_ids_df=articles_ids_df, user_sample=1000)
append_gini_coefficient(gini, "item based CF")

Sampling users


Computing Gini coefficient
[9781932, 9781987, 9775793, 9772957, 9776053, 9788921, 9791165, 9783667, 9779813, 9778277, 9780302, 9787332, 9788677, 9688780, 9782746, 9769155, 9773673, 9786139, 9780329, 9766434, 9785062, 9553271, 9784583, 9417521, 9789692, 9779653, 9787243, 9570877, 9782499, 9789647, 9771992, 9782219, 9775688, 9708367, 9790559, 9774972, 9785437, 9782746, 9775079, 9775183, 9774096, 9777750, 9788947, 9786563, 9784662, 9783643, 9776710, 9779694, 9780039, 9758825, 9672256, 9766225, 9786860, 9782770, 9774287, 9786204, 9774032, 9778669, 9788921, 9791165, 9770867, 9785500, 9770051, 9774789, 9783790, 9738729, 9772297, 9781262, 9770867, 9785500, 9775752, 9786293, 9787469, 9531745, 9781932, 9788766, 9784138, 9682026, 9784160, 9782057, 9627627, 9640315, 8518755, 9781998, 9780302, 9774944, 9782604, 9770729, 9767909, 9776544, 9749014, 9673564, 9775985, 9785596, 9774764, 9785596, 9775985, 9774764, 9777846, 9789537, 9783667, 9775582, 9776322, 9772660, 9774032, 9768687, 9777565, 9776046, 

### Carbon Footprint
This section creates an `emissions.csv` file in the `output` folder.
It uses CodeCarbon (the `EmissionsTracker` from `codecarbon`) to record the carbon footprint of the model's `fit` and `recommend` methods.

In [36]:
from utils.evaluation import track_model_energy

print("\nCarbon footprint of the recommender:")
footprint = track_model_energy(recommender, "item_based", user_id=test_user_id, n=5)
footprint

[codecarbon INFO @ 16:52:41] [setup] RAM Tracking...
[codecarbon INFO @ 16:52:41] [setup] CPU Tracking...
 Windows OS detected: Please install Intel Power Gadget to measure CPU




Carbon footprint of the recommender:


[codecarbon INFO @ 16:52:43] CPU Model on constant consumption mode: 13th Gen Intel(R) Core(TM) i7-13700H
[codecarbon INFO @ 16:52:43] [setup] GPU Tracking...
[codecarbon INFO @ 16:52:43] No GPU found.
[codecarbon INFO @ 16:52:43] >>> Tracker's metadata:
[codecarbon INFO @ 16:52:43]   Platform system: Windows-10-10.0.26100-SP0
[codecarbon INFO @ 16:52:43]   Python version: 3.11.9
[codecarbon INFO @ 16:52:43]   CodeCarbon version: 2.8.3
[codecarbon INFO @ 16:52:43]   Available RAM : 15.731 GB
[codecarbon INFO @ 16:52:43]   CPU count: 20
[codecarbon INFO @ 16:52:43]   CPU model: 13th Gen Intel(R) Core(TM) i7-13700H
[codecarbon INFO @ 16:52:43]   GPU count: None
[codecarbon INFO @ 16:52:43]   GPU model: None
[codecarbon INFO @ 16:52:46] Saving emissions data to file c:\Users\magnu\NewDesk\An.sys\TDT4215\recommender_system\demostrations\output\item_based_fit_emission.csv
[codecarbon INFO @ 16:53:01] Energy consumed for RAM : 0.000025 kWh. RAM Power : 5.899243354797363 W
[codecarbon INFO @ 

{'fit': ({9781987: [(9775793, np.float64(0.9998934110637303)),
    (9772957, np.float64(0.9993676445467325)),
    (9773726, np.float64(0.18082506404746557)),
    (9773364, np.float64(0.10782977891939127)),
    (9775076, np.float64(0.038724353183660964)),
    (9777292, np.float64(0.02203470673476604)),
    (9779242, np.float64(0.0021768509398193414)),
    (9779423, np.float64(0.0016495383602512792)),
    (9788462, np.float64(0.0012473571153392982)),
    (9789417, np.float64(0.0010144435684085185))],
   9772095: [(9768566, np.float64(0.5440265049477685)),
    (9776449, np.float64(0.5437357441109495)),
    (9771842, np.float64(0.21933478153412045)),
    (9775905, np.float64(0.11069389292749798)),
    (9779891, np.float64(0.11056052381642101)),
    (9778628, np.float64(0.1062262001405071)),
    (9783803, np.float64(0.10190301846128114)),
    (9780962, np.float64(0.06032382413642556)),
    (9784334, np.float64(0.04148847874148942)),
    (9660631, np.float64(0.04113542308359375))],
   977191