# ClayRS experiment on existing news representation

**News representation:** LDA embeddings with 128 dimensions (FairUMAP paper 2022)

**Algorithms:**
* Centroid Vector (cosine similarity)
* Classifiers:
    * GaussianProcess
    * KNN
    * SVC
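
To make the Centroid Vector approach concrete, here is a minimal standard-library sketch of the idea: average the vectors of the items a user liked into a profile, then rank candidate items by cosine similarity to that profile. The function names, the toy 3-dimensional vectors, and the assumption that only positively rated items enter the profile are illustrative; this is not the ClayRS implementation.

```python
import math

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def rank_by_centroid(liked_vectors, candidates):
    """Return (item_id, score) pairs sorted by similarity to the user profile."""
    profile = centroid(liked_vectors)
    scored = [(item_id, cosine(profile, vec)) for item_id, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy data (hypothetical): two liked items and three candidates.
liked = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
candidates = {"a": [1.0, 1.0, 0.0], "b": [0.0, 0.0, 1.0], "c": [1.0, 0.0, 0.0]}
ranking = rank_by_centroid(liked, candidates)
```

In the experiment itself the vectors would be the 128-dimensional LDA embeddings rather than these toy values.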

In [1]:
import pandas as pd
import json 

from clayrs import content_analyzer as ca
from clayrs import recsys as rs
from clayrs import evaluation as eva

In [2]:
import ast
import numpy as np

# Recommender System

## Get train/test (temporal)

In [25]:
ratings = ca.Ratings(ca.CSVFile('../data/ratings_10k.csv'))

Importing ratings:  100%|██████████| 2052051/2052051 [00:08<00:00]


In [26]:
print(ratings)

        user_id item_id  score
0        504290  106909    0.0
1        504290  101469    0.0
2        504290   95605    0.0
3        504290   96061    0.0
4        504290  130031    0.0
...         ...     ...    ...
2052046  339186    1767    0.0
2052047  339186  118908    0.0
2052048  339186   14612    0.0
2052049  339186    9471    0.0
2052050  366874   65373    1.0

[2052051 rows x 3 columns]
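
The split used later is a hold-out with `train_set_size=0.75` and `shuffle=False`. A rough sketch of what an unshuffled per-user hold-out does is shown below: each user's first 75% of interactions go to train and the rest to test. It is only "temporal" under the assumption that the rows of the ratings file are already in chronological order. The helper name, the toy rows, and the truncating cut-off are assumptions; ClayRS may round the cut-off differently.

```python
from collections import defaultdict

def holdout_per_user(ratings, train_size=0.75):
    """Split (user, item, score) rows per user, preserving the original row order."""
    by_user = defaultdict(list)
    for row in ratings:
        by_user[row[0]].append(row)

    train, test = [], []
    for rows in by_user.values():
        cut = int(len(rows) * train_size)  # assumption: truncate, do not round
        train.extend(rows[:cut])
        test.extend(rows[cut:])
    return train, test

# Toy rows (hypothetical), listed in interaction order.
rows = [
    ("u1", "i1", 0.0), ("u1", "i2", 1.0), ("u1", "i3", 0.0), ("u1", "i4", 1.0),
    ("u2", "i1", 1.0), ("u2", "i5", 0.0),
]
train, test = holdout_per_user(rows)
```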


## Launch experiment

In [27]:
catalog = set(ratings.item_id_column)

In [28]:
len(catalog)

18186

In [34]:
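# Define the recommender algorithm: centroid of the user's item vectors, scored by cosine similarity.
# The dict maps the item field 'lda_128' to its stored representation id 'text_lda'.
cos_algo = rs.CentroidVector(
    {'lda_128': 'text_lda'},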
    similarity=rs.CosineSimilarity()
)

In [33]:
rs.ContentBasedExperiment(
    ratings,
    items_directory='news_codified_lda_128',
    partitioning_technique=rs.HoldOutPartitioning(train_set_size=0.75, shuffle=False),
    # algorithm_list=[knn_algo],
    algorithm_list=[cos_algo],
    metric_list=[
        eva.PrecisionAtK(k=10, sys_average='macro'),
        # eva.RecallAtK(k=10, sys_average='macro'),
        eva.FMeasureAtK(k=10, sys_average='macro'),
        eva.NDCGAtK(k=10),
        # eva.CatalogCoverage(catalog),
        # eva.GiniIndex()
    ],
    report=True,
    output_folder='report_baseline_10k_cv_complete',
    overwrite_if_exists=True
).rank(n_recs=len(catalog), methodology=rs.TestRatingsMethodology(), num_cpus=1)

Performing HoldOutPartitioning:  100%|██████████| 10000/10000 [00:00<00:00]

INFO - ******* Processing alg CentroidVector *******
INFO - Don't worry if it looks stuck at first
INFO - First iterations will stabilize the estimated remaining time
Computing fit_rank for user 9999:  100%|██████████| 10000/10000 [02:48<00:00]
INFO - Performing evaluation on metrics chosen
  return actual / ideal
Performing NDCG@10:  100%|██████████| 3/3 [01:25<00:00]

INFO - Results saved in 'report_baseline_10k_cv_complete_v2/CentroidVector_1'

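For reference, the metrics requested in the experiment can be sketched in plain Python as follows. This is a simplified illustration, not ClayRS code: the function names are hypothetical, precision divides by `k`, and NDCG uses linear gains with a log2 discount, which may differ from the library's exact formula. The guard on a zero ideal DCG covers users with no relevant test items, which may be what the truncated `return actual / ideal` warning line in the log refers to.

```python
import math

def precision_at_k(ranked_items, relevant, k=10):
    """Fraction of the top-k ranked items that are relevant."""
    hits = sum(1 for item in ranked_items[:k] if item in relevant)
    return hits / k

def macro_average(per_user_values):
    """Macro average: compute the metric per user, then take the plain mean."""
    return sum(per_user_values) / len(per_user_values)

def ndcg_at_k(ranked_items, gains, k=10):
    """NDCG@k with linear gains; returns 0.0 when the ideal DCG is zero."""
    def dcg(values):
        return sum(g / math.log2(pos + 2) for pos, g in enumerate(values))

    actual = dcg([gains.get(item, 0.0) for item in ranked_items[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0

# Toy ranking (hypothetical): the first recommended item was not clicked.
ranked = ["a", "b", "c"]
gains = {"a": 0.0, "b": 1.0, "c": 1.0}
p_at_2 = precision_at_k(ranked, {"b", "c"}, k=2)
ndcg_at_2 = ndcg_at_k(ranked, gains, k=2)
no_relevant = ndcg_at_k(["a"], {"a": 0.0}, k=2)
```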