## Demonstration of the User-Based Collaborative Recommender

This system applies user-based collaborative filtering: it analyzes user interactions, such as scroll percentage and read time, to identify users with similar behavior. It therefore focuses on the user-item relation.

It then recommends articles that these similar users have engaged with, aiming to provide personalized suggestions. The model's performance is evaluated using the MAP@K and NDCG@K metrics.
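
For readers unfamiliar with the two metrics, here is a minimal sketch of how MAP@K and NDCG@K are commonly computed with binary relevance. The function names are hypothetical, and the normalization choices (dividing by `min(len(relevant), k)`) are assumptions; the project's `evaluate_recommender` may define them differently.

```python
import math


def average_precision_at_k(recommended, relevant, k):
    """AP@K: average of the precision values at each rank holding a relevant item."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    # Assumption: normalize by the best achievable number of hits within k.
    return precision_sum / min(len(relevant), k)


def map_at_k(recommended_per_user, relevant_per_user, k):
    """MAP@K: mean of AP@K over all users present in recommended_per_user."""
    if not recommended_per_user:
        return 0.0
    scores = [
        average_precision_at_k(recs, relevant_per_user.get(user, ()), k)
        for user, recs in recommended_per_user.items()
    ]
    return sum(scores) / len(scores)


def ndcg_at_k(recommended, relevant, k):
    """NDCG@K with binary relevance: DCG of the list divided by the ideal DCG."""
    relevant = set(relevant)
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(recommended[:k], start=1)
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0
```

Both metrics reward placing relevant articles near the top of the list; NDCG@K discounts each hit logarithmically by rank, while MAP@K averages the precision at each hit.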



In [1]:
import sys
import os

parent_dir = os.path.abspath(os.path.join(os.getcwd(), ".."))
sys.path.append(parent_dir)

import polars as pl
import numpy as np

from parquet_data_reader import ParquetDataReader
from utils.data_preprocessing import DataProcesser
from models.hybrid.most_popular_user_CF import MostPopularCollaborativeRecommender
parquet_reader = ParquetDataReader()

pl.Config.set_tbl_cols(-1)

polars.config.Config

## Data Import and EDA

In [2]:
dataProcesser = DataProcesser()
train_history_df = parquet_reader.read_data("../../data/train/history.parquet")
articles_df = parquet_reader.read_data("../../data/articles.parquet")
behaviors_df = dataProcesser.collaborative_filtering_preprocess()
train_df, test_df = dataProcesser.random_split(behaviors_df, test_ratio=0.3)
print(train_df.head())

shape: (5, 4)
┌────────────┬─────────┬────────────────┬────────────┐
│ article_id ┆ user_id ┆ total_readtime ┆ max_scroll │
│ ---        ┆ ---     ┆ ---            ┆ ---        │
│ i32        ┆ u32     ┆ f32            ┆ f32        │
╞════════════╪═════════╪════════════════╪════════════╡
│ 9774404    ┆ 2268968 ┆ 32.0           ┆ 100.0      │
│ 9790822    ┆ 969888  ┆ 85.0           ┆ 100.0      │
│ 9772601    ┆ 1334444 ┆ 98.0           ┆ 100.0      │
│ 9778351    ┆ 547339  ┆ 220.0          ┆ 100.0      │
│ 9771127    ┆ 1955572 ┆ 35840.0        ┆ 100.0      │
└────────────┴─────────┴────────────────┴────────────┘


## Model Fit

The first model weights read time and scroll percentage to score each user-article interaction, and uses these scores to compare users.
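
As an illustration of the idea, the sketch below combines the two signals into one score per user-article pair and compares two users by cosine similarity. Everything here is an assumption made for illustration: the `log1p` damping, the rescaling of scroll to [0, 1], the use of cosine similarity, and the function names. The actual `fit` method may work differently.

```python
import math


def interaction_score(total_readtime, max_scroll,
                      readtime_weight=1.0, scroll_weight=0.1):
    """Combine read time and scroll percentage into a single interaction score."""
    # Assumption: log1p dampens extreme read times so they do not dominate.
    readtime_part = math.log1p(total_readtime)
    # Assumption: max_scroll is a percentage in [0, 100], rescaled to [0, 1].
    scroll_part = max_scroll / 100.0
    return readtime_weight * readtime_part + scroll_weight * scroll_part


def cosine_similarity(scores_a, scores_b):
    """Cosine similarity between two users given {article_id: score} dicts."""
    shared = scores_a.keys() & scores_b.keys()
    dot = sum(scores_a[a] * scores_b[a] for a in shared)
    norm_a = math.sqrt(sum(v * v for v in scores_a.values()))
    norm_b = math.sqrt(sum(v * v for v in scores_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy data: two users who share article 1.
user_a = {1: interaction_score(32.0, 100.0), 2: interaction_score(85.0, 100.0)}
user_b = {1: interaction_score(40.0, 80.0)}
similarity = cosine_similarity(user_a, user_b)
```

With `readtime_weight=1.0` and `scroll_weight=0.1`, as in the call to `fit`, read time contributes far more to the score than scroll depth.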

In [3]:
recommender = MostPopularCollaborativeRecommender(train_df)
recommender.fit(scroll_weight=0.1, readtime_weight=1.0)

{2268968: [(1223167, np.float64(0.9351154618349603)),
  (299807, np.float64(0.9351154618349603)),
  (1043725, np.float64(0.9351154618349603)),
  (807921, np.float64(0.9351154618349603)),
  (108781, np.float64(0.9351154618349603)),
  (413546, np.float64(0.9351154618349603)),
  (1320327, np.float64(0.9351154618349603)),
  (1057167, np.float64(0.9351154618349603)),
  (2587200, np.float64(0.9351154618349602)),
  (771814, np.float64(0.9351154618349602))],
 969888: [(2250791, np.float64(0.19583202014600198)),
  (552794, np.float64(0.19583202014600198)),
  (1614829, np.float64(0.19583202014600198)),
  (1108153, np.float64(0.19583202014600198)),
  (1892102, np.float64(0.19583202014600198)),
  (1501895, np.float64(0.19583202014600198)),
  (1702733, np.float64(0.19583202014600198)),
  (1196547, np.float64(0.19583202014600198)),
  (1075421, np.float64(0.19583202014600198)),
  (716187, np.float64(0.19583202014600198))],
 1334444: [(952062, np.float64(0.9958158804648102)),
  (2114301, np.float64(0.

The second model is binary: it compares users based only on which articles they have read.
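
A minimal sketch of a binary comparison, assuming cosine similarity over sets of read articles (the source does not state which similarity measure is used, and the function name is hypothetical). For binary vectors, cosine similarity reduces to the overlap size divided by the geometric mean of the two set sizes. Several scores in the output of the following cell, such as 0.4082 ≈ 1/√6 and 0.3015 ≈ 1/√11, are consistent with this formula.

```python
import math


def binary_cosine_similarity(articles_a, articles_b):
    """Cosine similarity between two users represented as sets of read articles."""
    a, b = set(articles_a), set(articles_b)
    if not a or not b:
        return 0.0
    # |A ∩ B| / sqrt(|A| * |B|)
    return len(a & b) / math.sqrt(len(a) * len(b))


# Hypothetical example: two users with three articles each, one in common.
example = binary_cosine_similarity({1, 2, 3}, {3, 4, 5})
```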

In [4]:
binary_recommender = MostPopularCollaborativeRecommender(train_df, binary_model=True)
binary_recommender.fit()

{2268968: [(1522656, np.float64(0.30151134457776363)),
  (2295802, np.float64(0.30151134457776363)),
  (899311, np.float64(0.30151134457776363)),
  (1973813, np.float64(0.30151134457776363)),
  (90756, np.float64(0.30151134457776363)),
  (2587200, np.float64(0.30151134457776363)),
  (85149, np.float64(0.30151134457776363)),
  (1320327, np.float64(0.30151134457776363)),
  (1922063, np.float64(0.30151134457776363)),
  (541663, np.float64(0.30151134457776363))],
 969888: [(1090824, np.float64(0.40824829046386313)),
  (716187, np.float64(0.40824829046386313)),
  (1108153, np.float64(0.40824829046386313)),
  (552794, np.float64(0.40824829046386313)),
  (2090196, np.float64(0.40824829046386313)),
  (1501895, np.float64(0.40824829046386313)),
  (1064716, np.float64(0.40824829046386313)),
  (53577, np.float64(0.40824829046386313)),
  (239437, np.float64(0.40824829046386313)),
  (1196547, np.float64(0.40824829046386313))],
 1334444: [(1404104, np.float64(0.33333333333333326)),
  (1310910, np.fl

Of the original 15143 users, only 9194 (about 61%) are covered by the current solution. This should be improved in the future.

## Model Presentation

### Article Recommendation

In [5]:
for user in [630220, 620796, 1067393, 1726258, 17205]:
    print("recommended for user ", user, ": ", recommender.recommend_n_articles(user_id=user, n=5, allow_read_articles=True))

recommended for user  630220 :  [9782722, 9783137, 9774297, 9786359, 9773574]
recommended for user  620796 :  [9771224, 9789997, 9759891, 9791050, 9781878]
recommended for user  1067393 :  [9773282, 9776234, 9785475, 9790548, 9781476]
recommended for user  1726258 :  [9771224, 9789997, 9759891, 9791050, 9781878]
recommended for user  17205 :  [9780325, 9765941, 9774142, 9773282, 9776234]


In [6]:
for user in [630220, 620796, 1067393, 1726258, 17205]:
    print("recommended for user ", user, ": ", binary_recommender.recommend_n_articles(user_id=user, n=5, allow_read_articles=True))

recommended for user  630220 :  [9783334, 9759955, 9787465, 9778657, 9776497]
recommended for user  620796 :  [9771686, 9769432, 9783278, 9783334, 9771576]
recommended for user  1067393 :  [9773282, 9776234, 9785475, 9790548, 9781476]
recommended for user  1726258 :  [9771224, 9775905, 9778942, 9785475, 9774789]
recommended for user  17205 :  [9775562, 9780325, 9779269, 9771846, 9777804]


### Evaluation Scores

#### Without the Ability to Recommend Read Articles

The complex model, recommending only articles the user has not yet read:

In [7]:
results = recommender.evaluate_recommender(test_df, k=100, n_jobs=4, user_sample=20, allow_read_articles=False)
results

{'MAP@K': np.float64(0.0069230769230769216),
 'NDCG@K': np.float64(0.0770049263688442)}

The binary recommender model, recommending only articles the user has not yet read:

In [8]:
results = binary_recommender.evaluate_recommender(test_df, k=100, n_jobs=4, user_sample=20, allow_read_articles=False)
results

{'MAP@K': np.float64(0.006875), 'NDCG@K': np.float64(0.07523710827369631)}

#### With the Ability to Recommend Previously Read Articles

The complex model, allowed to recommend articles even if the user has read them before:

In [9]:
results = recommender.evaluate_recommender(test_df, k=100, n_jobs=4, user_sample=20, allow_read_articles=True)
results

{'MAP@K': np.float64(0.007333333333333333),
 'NDCG@K': np.float64(0.0763631926287255)}

The binary recommender model, allowed to recommend articles even if the user has read them before:

In [10]:
results = binary_recommender.evaluate_recommender(test_df, k=100, n_jobs=4, user_sample=20, allow_read_articles=True)
results

{'MAP@K': np.float64(0.010833333333333334),
 'NDCG@K': np.float64(0.0657668914112209)}

## Model Experimentation

In [11]:
test_user_id = 630220

predictions = recommender.recommend_n_articles(user_id=test_user_id, n=1000, allow_read_articles=True)
results = set(test_df.filter(pl.col("user_id") == test_user_id)["article_id"])

print(results)
print(predictions)

for prediction in predictions:
    if prediction in results:
        print("Yes")

{9786243, 9781902, 9778448, 9783824, 9774864, 9779860, 9779615, 9789473, 9428643, 9774120, 9773868, 9771948, 9778351, 9776691, 9780020, 9777856, 9773248, 9772355, 9788362, 9771473, 9780181, 9758424, 9772508, 9780193, 9738729, 9781875, 9771127}
[9782722, 9783137, 9774297, 9786359, 9773574, 9778375, 9784662, 9786378, 9781814, 9779184, 9770288, 9790052, 9773364, 9789703, 9782993, 9775489, 9783051, 9782695, 9779737, 9774032, 9788149, 9780325, 9772706, 9773877, 9783334, 9784702, 9771612, 9780467, 9784870, 9772045, 9786247, 9771686, 9773486, 9783349, 9788760, 9778168, 9780280, 9774352, 9779045, 9775331, 9773282, 9776234, 9785475, 9790548, 9781476, 9775776, 9789810, 9779748, 9775562, 9771919, 9782423, 9787465, 9780195, 9781086, 9787524, 9789065, 9785668, 9782884, 9783213, 9782092, 9770082, 9784856, 9786932, 9775484, 9783057, 9773846, 9781998, 9776508, 9782517, 9779289, 9774142, 9773744, 9771916, 9774079, 9778804, 9774074, 9784591, 9783278, 9776041, 9776497, 9784947, 9785835, 9785310, 9782407,

In [12]:
test_user_id = 630220

predictions = recommender.recommend_n_articles(user_id=test_user_id, n=1000, allow_read_articles=True)
results = set(test_df.filter(pl.col("user_id") == test_user_id)["article_id"])

print(results)
print(predictions)

for prediction in predictions:
    if prediction in results:
        print("Yes")

{9786243, 9781902, 9778448, 9783824, 9774864, 9779860, 9779615, 9789473, 9428643, 9774120, 9773868, 9771948, 9778351, 9776691, 9780020, 9777856, 9773248, 9772355, 9788362, 9771473, 9780181, 9758424, 9772508, 9780193, 9738729, 9781875, 9771127}
[9782722, 9783137, 9774297, 9786359, 9773574, 9778375, 9784662, 9786378, 9781814, 9779184, 9770288, 9790052, 9773364, 9789703, 9782993, 9775489, 9783051, 9782695, 9774032, 9779737, 9788149, 9780325, 9772706, 9773877, 9784702, 9783334, 9771612, 9780467, 9784870, 9772045, 9786247, 9771686, 9773486, 9783349, 9778168, 9788760, 9780280, 9774352, 9779045, 9775331, 9773282, 9776234, 9785475, 9790548, 9781476, 9775776, 9789810, 9779748, 9775562, 9771919, 9782423, 9787465, 9780195, 9781086, 9787524, 9789065, 9785668, 9782884, 9783213, 9782092, 9770082, 9784856, 9786932, 9775484, 9783057, 9773846, 9781998, 9776508, 9782517, 9779289, 9774142, 9773744, 9771916, 9774079, 9778804, 9774074, 9784591, 9783278, 9776041, 9776497, 9784947, 9785835, 9785310, 9782407,

In [13]:
from utils.evaluation import perform_model_evaluation, append_model_metrics

# Evaluate the weighted recommender on the test split at k=5
metrics = perform_model_evaluation(recommender, test_df, k=5)
print(metrics)

# Record the metrics under this model's name
append_model_metrics(metrics, "most_popular_user_CF_hybrid")


{'precision@k': np.float64(0.006359491240700744), 'recall@k': np.float64(0.010640919008793163), 'fpr@k': np.float64(0.0022489079165506584)}
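
The three reported values can be understood through a per-user confusion matrix over the article catalog. The sketch below is one plausible way to compute them, not the implementation in `utils.evaluation`; the function name and the `catalog_size` parameter are hypothetical.

```python
def precision_recall_fpr_at_k(recommended, relevant, catalog_size, k):
    """Precision@k, recall@k and false-positive rate@k for a single user."""
    # Deduplicate while preserving rank order, then keep the top k.
    top_k = list(dict.fromkeys(recommended))[:k]
    relevant = set(relevant)

    tp = sum(1 for item in top_k if item in relevant)  # recommended and relevant
    fp = len(top_k) - tp                               # recommended, not relevant
    fn = len(relevant) - tp                            # relevant, not recommended
    tn = catalog_size - tp - fp - fn                   # everything else

    precision = tp / len(top_k) if top_k else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    fpr = fp / (fp + tn) if (fp + tn) > 0 else 0.0
    return precision, recall, fpr


# Hypothetical example: 5 recommendations, 2 relevant articles, catalog of 10.
example = precision_recall_fpr_at_k([1, 2, 3, 4, 5], {1, 9}, catalog_size=10, k=5)
```

Because the catalog is much larger than k, the false-positive rate stays small even when almost every recommendation misses, so it should be read alongside precision and recall rather than on its own.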


In [14]:
# Finds the unique user ids in the history data
users_ids = train_history_df['user_id'].unique()

### Diversity

In [15]:
from utils.evaluation import aggregate_diversity
from utils.evaluation import append_aggregate_diversity

# For the random split model
diversity = aggregate_diversity(recommender, item_df=articles_df, users_df=users_ids, user_sample=1000)

print("Diversity Random Split")
print(diversity)
append_aggregate_diversity(diversity, "most_popular_user_CF_hybrid")


Diversity Random Split
0.046870479313337834
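
Aggregate diversity is commonly defined as the share of the catalog that appears in at least one user's recommendation list. The sketch below follows that common definition as an assumption; the project's `aggregate_diversity` helper may differ, and the function name `catalog_coverage` is hypothetical.

```python
def catalog_coverage(recommendations_per_user, catalog_size):
    """Fraction of catalog items recommended to at least one user."""
    if catalog_size <= 0:
        return 0.0
    distinct_items = set()
    for recommended in recommendations_per_user.values():
        distinct_items.update(recommended)
    return len(distinct_items) / catalog_size


# Hypothetical example: two users, three distinct articles, catalog of 10.
example = catalog_coverage({"u1": [1, 2], "u2": [2, 3]}, catalog_size=10)
```

Under this definition, a value near 0.05 would mean that roughly 5% of all articles are ever recommended to the sampled users.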


### Gini

In [16]:
from utils.evaluation import gini_coefficient
from utils.evaluation import append_gini_coefficient

# For the random split model
gini_random = gini_coefficient(recommender, articles_ids_df=articles_df, users_ids_df=users_ids, user_sample=1000)
print("Gini Coefficient Random Split")
print(gini_random)
append_gini_coefficient(gini_random, "most_popular_user_CF_hybrid")

Sampling users
Computing Gini coefficient
[9776148, 9771948, 9774352, 9776152, 9780096, 9775800, 9780447, 9787332, 9775484, 9783803, 9787487, 9769504, 9773846, 9773282, 9771473, 9778657, 9771938, 9788841, 9784044, 9786139, 9781906, 9790752, 9773350, 9790572, 9781947, 9773282, 9776234, 9785475, 9790548, 9781476, 9789911, 9767507, 9776238, 9773464, 9781476, 9766592, 9775983, 9773282, 9776234, 9785475, 9773282, 9776234, 9785475, 9790548, 9781476, 9773282, 9776234, 9785475, 9790548, 9781476, 9773282, 9776234, 9785475, 9790548, 9781476, 9780472, 9789977, 9773846, 9780986, 9788352, 9790822, 9775785, 9785017, 9779777, 9783278, 9773282, 9776234, 9785475, 9790548, 9781476, 9783790, 9788404, 9778813, 9788352, 9781998, 9773282, 9776234, 9785475, 9790548, 9781476, 9773282, 9776234, 9785475, 9790548, 9781476, 9773282, 9776234, 9785475, 9790548, 9781476, 9773282, 9776234, 9785475, 9790548, 9781476, 9773282, 9776234, 9785475, 9790548, 9781476, 9776394, 9779777, 9767490, 9776259, 9777804, 9785112, 977
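
The Gini coefficient measures how unevenly recommendations are spread across articles: 0 means every article is recommended equally often, and values near 1 mean a few articles receive almost all recommendations. The sketch below uses the standard sorted-values formula and counts never-recommended catalog items as zeros. Both the function name and the zero-count choice are assumptions, not the code in `utils.evaluation`.

```python
from collections import Counter


def gini_from_recommendations(recommended_lists, catalog_ids):
    """Gini coefficient of recommendation counts over the whole catalog."""
    counts = Counter(item for recs in recommended_lists for item in recs)
    # Include catalog items that were never recommended as zero counts.
    values = sorted(counts.get(item, 0) for item in catalog_ids)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # G = sum_i (2i - n - 1) * x_i / (n * sum_i x_i), with x sorted ascending.
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(values, start=1))
    return weighted / (n * total)


# Hypothetical example: only article 1 out of a catalog of 4 is ever recommended.
example = gini_from_recommendations([[1], [1]], catalog_ids=[1, 2, 3, 4])
```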

### Carbon Footprint
This section writes an emissions CSV file to the `output` folder.
It uses the `EmissionsTracker` from CodeCarbon (`codecarbon`) to record the carbon footprint of the model's `fit` and `recommend` methods.
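
A minimal sketch of how a single call can be wrapped with CodeCarbon's `EmissionsTracker`. The helper name `track_emissions` and its parameters are hypothetical, and this is not the project's `track_model_energy` function. The `tracker` argument exists only so that the helper can be exercised without CodeCarbon installed.

```python
def track_emissions(func, *args, tracker=None, output_dir=".", **kwargs):
    """Run func(*args, **kwargs) and return (result, emissions in kg CO2eq)."""
    if tracker is None:
        # Imported lazily so the helper can be tested with a stand-in tracker.
        from codecarbon import EmissionsTracker

        # Assumption: output_dir already exists; the CSV is written there.
        tracker = EmissionsTracker(output_dir=output_dir)

    tracker.start()
    try:
        result = func(*args, **kwargs)
    finally:
        # stop() ends the measurement and returns the estimated emissions.
        emissions_kg = tracker.stop()
    return result, emissions_kg
```

Calling the helper once around `fit` and once around a recommendation call would give a separate emissions estimate for training and for inference.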

In [17]:
from utils.evaluation import track_model_energy

print("\nCarbon footprint of the recommender:")
footprint = track_model_energy(recommender, "most_popular_user_CF_hybrid", user_id=test_user_id, n=5)
footprint

[codecarbon INFO @ 14:50:10] [setup] RAM Tracking...
[codecarbon INFO @ 14:50:10] [setup] CPU Tracking...
 Windows OS detected: Please install Intel Power Gadget to measure CPU




Carbon footprint of the recommender:


[codecarbon INFO @ 14:50:12] CPU Model on constant consumption mode: 13th Gen Intel(R) Core(TM) i7-13700H
[codecarbon INFO @ 14:50:12] [setup] GPU Tracking...
[codecarbon INFO @ 14:50:12] No GPU found.
[codecarbon INFO @ 14:50:12] >>> Tracker's metadata:
[codecarbon INFO @ 14:50:12]   Platform system: Windows-10-10.0.26100-SP0
[codecarbon INFO @ 14:50:12]   Python version: 3.11.9
[codecarbon INFO @ 14:50:12]   CodeCarbon version: 2.8.3
[codecarbon INFO @ 14:50:12]   Available RAM : 15.731 GB
[codecarbon INFO @ 14:50:12]   CPU count: 20
[codecarbon INFO @ 14:50:12]   CPU model: 13th Gen Intel(R) Core(TM) i7-13700H
[codecarbon INFO @ 14:50:12]   GPU count: None
[codecarbon INFO @ 14:50:12]   GPU model: None
[codecarbon INFO @ 14:50:15] Saving emissions data to file c:\Users\magnu\NewDesk\An.sys\TDT4215\recommender_system\demostrations\output\most_popular_user_CF_hybrid_fit_emission.csv
[codecarbon INFO @ 14:50:30] Energy consumed for RAM : 0.000025 kWh. RAM Power : 5.899243354797363 W
[c

{'fit': ({2268968: [(1057167, np.float64(0.7738445170981876)),
    (771814, np.float64(0.7738445170981876)),
    (413546, np.float64(0.7738445170981875)),
    (90756, np.float64(0.7738445170981875)),
    (1320327, np.float64(0.7738445170981875)),
    (299807, np.float64(0.7738445170981875)),
    (1223167, np.float64(0.7738445170981875)),
    (807921, np.float64(0.7738445170981875)),
    (1043725, np.float64(0.7738445170981875)),
    (2072705, np.float64(0.7738445170981875))],
   969888: [(2250791, np.float64(0.26411226771541263)),
    (1702733, np.float64(0.26411226771541263)),
    (1075421, np.float64(0.26411226771541263)),
    (2134755, np.float64(0.26411226771541263)),
    (716187, np.float64(0.26411226771541263)),
    (1614829, np.float64(0.26411226771541263)),
    (552794, np.float64(0.26411226771541263)),
    (1892102, np.float64(0.26411226771541263)),
    (2451504, np.float64(0.26411226771541263)),
    (1501895, np.float64(0.26411226771541263))],
   1334444: [(2281712, np.float6