# ClayRS experiment on existing news representation

**News representation:** LDA embeddings with 128 dimensions (FairUMAP paper 2022)

**Algorithms:**
* Centroid Vector (Cosine similarity)
* Classifiers:
    * GaussianProcess
    * KNN
    * SVC

In [1]:
import pandas as pd
import json 

from clayrs import content_analyzer as ca
from clayrs import recsys as rs
from clayrs import evaluation as eva

In [2]:
import ast
import numpy as np

# Import Data

In [3]:
news = pd.read_csv('../data_mind_large_news/news_large_filtered.csv', index_col=0)

In [4]:
news = news[~news['Embedding'].isna()]
news = news.reset_index(drop=True)
news['Embedding'] = news['Embedding'].apply(ast.literal_eval)
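
The `Embedding` column is parsed with `ast.literal_eval` because a CSV file stores a list of floats as its text form. A minimal sketch of that parsing step, using a made-up cell value (the string below is illustrative and not taken from the dataset):

```python
import ast

# Hypothetical cell value: a CSV stores a list of floats as plain text
raw_cell = "[0.5, 0.25, 0.125]"

# literal_eval parses Python literals (lists, numbers, strings) without executing code
embedding = ast.literal_eval(raw_cell)

print(type(embedding).__name__, len(embedding))
```

Unlike `eval`, `ast.literal_eval` only accepts literals, so a malformed or malicious cell raises an error instead of running code.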

In [5]:
#Cast NewsID to int to be consistent with the other dataframe (the leading "N" of the ID is assumed to be already stripped in this file)
news['NewsID'] = news['NewsID'].astype(int)

In [6]:
news.head()

Unnamed: 0,NewsID,Category,SubCategory,Title,Abstract,URL,Title_entities,Abstract_entities,Source,Text,Embedding,Date
0,93187,news,newsworld,The Cost of Trump's Aid Freeze in the Trenches...,Lt. Ivan Molchanets peeked over a parapet of s...,https://assets.msn.com/labs/mind/AAJgNsz.html,[],"[{""Label"": ""Ukraine"", ""Type"": ""G"", ""WikidataId...",nytimes.com,zolote ukraine lt ivan molchanets peek parapet...,"[5.403303475759458e-06, 0.23899737000465393, 5...",2019-10-25
1,40259,news,newsworld,Chile: Three die in supermarket fire amid prot...,Three people have died in a supermarket fire a...,https://assets.msn.com/labs/mind/AAJ43pw.html,"[{""Label"": ""Chile"", ""Type"": ""G"", ""WikidataId"":...","[{""Label"": ""Santiago"", ""Type"": ""G"", ""WikidataI...",cnn.com,people die supermarket fire angry protest chil...,"[3.7435991544043645e-05, 3.7435991544043645e-0...",2019-10-20
2,13152,news,newsscienceandtechnology,"How to report weather-related closings, delays","When there are active closings, view them here...",https://assets.msn.com/labs/mind/AAlErhA.html,[],"[{""Label"": ""WXII-TV"", ""Type"": ""M"", ""WikidataId...",,active closing view \n wxii news receive numbe...,"[2.5583496608305722e-05, 2.5583496608305722e-0...",2019-11-12
3,21935,news,newspolitics,Elijah Cummings to lie in state at US Capitol ...,"Cummings, a Democrat whose district included s...",https://assets.msn.com/labs/mind/AAJgNxm.html,"[{""Label"": ""Elijah Cummings"", ""Type"": ""P"", ""Wi...","[{""Label"": ""Elijah Cummings"", ""Type"": ""P"", ""Wi...",usatoday.com,washington rep elijah cummings lie state thurs...,"[0.06840043514966965, 1.1936484042962547e-05, ...",2019-10-24
4,35373,news,newsscienceandtechnology,"New iPad Pro 2019 release date, price, news an...",Apple is likely gearing up to release a new se...,https://assets.msn.com/labs/mind/AAGZjoR.html,"[{""Label"": ""IPad Pro"", ""Type"": ""U"", ""WikidataI...","[{""Label"": ""Apple Inc."", ""Type"": ""O"", ""Wikidat...",,new ipad pro tablet overdue unclear launch sli...,"[4.335052381065907e-06, 4.335052381065907e-06,...",2019-11-14


In [7]:
items_infos = pd.read_json('../data_mind_large_news/items_info.json')

In [8]:
items_infos

Unnamed: 0,itemID,Category,SubCategory,Title,Abstract,Text
0,45436,news,newsscienceandtechnology,Walmart Slashes Prices on Last-Generation iPads,Apple's new iPad releases bring big deals on l...,year walmart wait black friday offer steep dea...
1,93187,news,newsworld,The Cost of Trump's Aid Freeze in the Trenches...,Lt. Ivan Molchanets peeked over a parapet of s...,zolote ukraine lt ivan molchanets peek parapet...
2,51947,news,newsscienceandtechnology,"How to record your screen on Windows, macOS, i...",The easiest way to record what's happening on ...,lot reason record screen window pc mac phone t...
3,40259,news,newsworld,Chile: Three die in supermarket fire amid prot...,Three people have died in a supermarket fire a...,people die supermarket fire angry protest chil...
4,13152,news,newsscienceandtechnology,"How to report weather-related closings, delays","When there are active closings, view them here...",active closing view \n wxii news receive numbe...
...,...,...,...,...,...,...
38602,90549,news,newsworld,'Community heroes' may get housing help in N C...,"CHARLOTTE, N.C. (AP) Police officers, first ...",charlotte n.c ap police officer responder publ...
38603,48578,news,newsscienceandtechnology,Boeing Shifts 777 Work Back To Humans Followin...,In a world full of fears of robots taking jobs...,world fear robot take job boeing co nyse ba co...
38604,27920,news,newsus,Seven area football teams gear up for Sac-Joaq...,STOCKTON - Of the seven area football teams pl...,stockton seven area football team play tonight...
38605,1192,news,newsus,3 new businesses to check out in Paradise Valley,Looking to discover the newest restaurants and...,interested discover new restaurant retail addi...


In [9]:
#Only keep news with no missing values
list_news = news['NewsID'].tolist()
items_infos = items_infos[items_infos['itemID'].isin(list_news)]

In [10]:
news = news.sort_values(by='NewsID')
items_infos = items_infos.sort_values(by='itemID')

In [11]:
news.head()

Unnamed: 0,NewsID,Category,SubCategory,Title,Abstract,URL,Title_entities,Abstract_entities,Source,Text,Embedding,Date
1176,2,news,newsopinion,Mormons to the Rescue?,The one religious faith that is the most heavi...,https://assets.msn.com/labs/mind/AAICO6z.html,"[{""Label"": ""Mormons"", ""Type"": ""O"", ""WikidataId...","[{""Label"": ""Donald Trump"", ""Type"": ""P"", ""Wikid...",chatsports.com,editor note opinion article author publish con...,"[0.13700850307941437, 9.641810720495414e-06, 9...",2019-10-12
3715,9,news,newsopinion,Time for the White House and Congressional Rep...,The president and his allies must mount a defe...,https://assets.msn.com/labs/mind/AAJy0Nq.html,"[{""Label"": ""White House"", ""Type"": ""F"", ""Wikida...",[],cafes.news,editor note opinion article author publish con...,"[0.2655191123485565, 9.194827725877985e-06, 9....",2019-10-30
9647,12,news,newspolitics,Jacksonville City Council rejects resolution t...,The Jacksonville City Council deadlocked on Tu...,https://assets.msn.com/labs/mind/AAJcJG0.html,"[{""Label"": ""Jacksonville City Council"", ""Type""...","[{""Label"": ""Duval County Public Schools"", ""Typ...",jacksonville.com,jacksonville city council deadlocke tuesday re...,"[1.8368036762694828e-05, 1.8368036762694828e-0...",2019-10-24
21819,13,news,newsscienceandtechnology,T-Mobile confirms customers' personal data acc...,It's been a rough month for customers who care...,https://assets.msn.com/labs/mind/BBXc5Ii.html,"[{""Label"": ""T-Mobile"", ""Type"": ""O"", ""WikidataI...","[{""Label"": ""T-Mobile"", ""Type"": ""O"", ""WikidataI...",onenewspage.com,rough month customer care privacy data breach ...,"[3.589615516830236e-05, 3.589615516830236e-05,...",2019-11-23
9477,14,news,newspolitics,Anxiety rises among Democrats worried about pa...,As they look at a weakened Republican presiden...,https://assets.msn.com/labs/mind/AAJcH91.html,"[{""Label"": ""Democratic Party (United States)"",...","[{""Label"": ""Democratic Party (United States)"",...",washingtonpost.com,democratic presidential contest kick early yea...,"[0.17524464428424835, 0.011845671571791172, 5....",2019-10-23


In [12]:
items_infos.head()

Unnamed: 0,itemID,Category,SubCategory,Title,Abstract,Text
1637,2,news,newsopinion,Mormons to the Rescue?,The one religious faith that is the most heavi...,editor note opinion article author publish con...
5283,9,news,newsopinion,Time for the White House and Congressional Rep...,The president and his allies must mount a defe...,editor note opinion article author publish con...
13706,12,news,newspolitics,Jacksonville City Council rejects resolution t...,The Jacksonville City Council deadlocked on Tu...,jacksonville city council deadlocke tuesday re...
30698,13,news,newsscienceandtechnology,T-Mobile confirms customers' personal data acc...,It's been a rough month for customers who care...,rough month customer care privacy data breach ...
13464,14,news,newspolitics,Anxiety rises among Democrats worried about pa...,As they look at a weakened Republican presiden...,democratic presidential contest kick early yea...


# Content Analyzer

To use news representations that were not produced by the library, we can use the `FromNPY()` class.

This requires:
* A numpy file containing a matrix with one row per document, each row being the vector of that document.
* A field in the source (items_json) whose values are the row indexes of each document in the numpy matrix.
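
The two requirements above amount to a positional lookup: the index field of each item points at one row of the matrix. The sketch below illustrates that idea on toy data. The field name `row_index` and all values are hypothetical, and this mirrors the concept only, not how `FromNPY` is implemented internally.

```python
import numpy as np

# Toy stand-in for the saved matrix: 3 documents, 4 dimensions each
matrix = np.array([
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.1, 0.1, 0.3],
    [0.0, 0.9, 0.0, 0.1],
])

# Toy stand-in for the item source: 'row_index' (hypothetical name) stores,
# as a string, the matrix row that belongs to each item
items = [
    {"itemID": "2", "row_index": "0"},
    {"itemID": "9", "row_index": "1"},
    {"itemID": "12", "row_index": "2"},
]

# One matrix row per document
assert matrix.shape[0] == len(items)

# Every index must point at an existing row, and no row may be used twice
indexes = [int(item["row_index"]) for item in items]
assert sorted(indexes) == list(range(matrix.shape[0]))

# Positional lookup: item -> its vector
vectors = {item["itemID"]: matrix[int(item["row_index"])] for item in items}
print(vectors["9"])
```

Because the link between an item and its vector is purely positional, the item source and the matrix must be built from rows in the same order.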


## Data preparation

In [13]:
#First, convert lists to np.array
news['Embedding'] = news['Embedding'].apply(np.array)

In [14]:
#From the items_info dataframe, we create a new column corresponding to the indexes of each news in the numpy matrix
items_infos = items_infos.reset_index(drop=True)
items_infos = items_infos.reset_index(names='lda_128')

In [15]:
#This column must be set as string to comply with the library's rules
items_infos['lda_128'] = items_infos['lda_128'].astype(str)

In [16]:
items_infos

Unnamed: 0,lda_128,itemID,Category,SubCategory,Title,Abstract,Text
0,0,2,news,newsopinion,Mormons to the Rescue?,The one religious faith that is the most heavi...,editor note opinion article author publish con...
1,1,9,news,newsopinion,Time for the White House and Congressional Rep...,The president and his allies must mount a defe...,editor note opinion article author publish con...
2,2,12,news,newspolitics,Jacksonville City Council rejects resolution t...,The Jacksonville City Council deadlocked on Tu...,jacksonville city council deadlocke tuesday re...
3,3,13,news,newsscienceandtechnology,T-Mobile confirms customers' personal data acc...,It's been a rough month for customers who care...,rough month customer care privacy data breach ...
4,4,14,news,newspolitics,Anxiety rises among Democrats worried about pa...,As they look at a weakened Republican presiden...,democratic presidential contest kick early yea...
...,...,...,...,...,...,...,...
22502,22502,130355,news,newspolitics,"In letter to Gov. Bill Lee, U.S. Rep. Steve Co...","""At best, this has resulted in Tennessee's gro...",u.s rep steve cohen give gov bill lee office w...
22503,22503,130359,news,newsus,"Strong overnight storm leaves 218,000 without ...","At around 10:30 a.m., about 218,000 were witho...",utility crew work restore electricity massachu...
22504,22504,130367,news,newsus,Microsoft says it will follow California's dig...,Microsoft says it will follow California's dig...,diane bartz nandita bose \n washington reuters...
22505,22505,130377,news,newsus,Strong winds could fuel more fires in Northern...,Hurricane-force winds created blowtorch-like c...,"img class=""image spinner_image alt= src=""https..."


In [17]:
#We save this news source 
items_infos.to_csv('../data_mind_large_news/items_infos_lda_128.csv')

Next, we build the numpy matrix from the embeddings and save it.

In [18]:
np_matrix = np.array(news['Embedding'])

In [19]:
#Save the matrix as npy file
np.save('../data_mind_large_news/lda_128_matrix', np_matrix, allow_pickle=True)
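
Calling `np.array` on a column whose cells are themselves arrays may yield a one-dimensional object array rather than a two-dimensional numeric matrix, which would be one reason `allow_pickle=True` is needed when saving. As an alternative sketch (not the notebook's code, and using made-up vectors), `np.stack` builds an explicit documents-by-dimensions matrix:

```python
import numpy as np

# Mimic a column of per-document vectors: a 1-D object array whose cells are arrays
column = np.empty(3, dtype=object)
for i in range(3):
    column[i] = np.arange(4, dtype=float) + i

# The container itself is 1-D and holds Python objects
print(column.shape, column.dtype)

# np.stack joins equally sized vectors into one 2-D numeric matrix
matrix = np.stack(list(column))
print(matrix.shape, matrix.dtype)
```

A numeric 2-D matrix can be saved and loaded without pickling, and its shape makes the "one row per document" requirement easy to check.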

## Apply the analyzer to serialize the content based on provided news representations

In [20]:
news_ca_config_lda_128 = ca.ItemAnalyzerConfig(
    source=ca.CSVFile('../data_mind_large_news/items_infos_lda_128.csv'), #file we just created
    id='itemID', #column containing the ids of the news
    output_directory='news_codified_lda_128', #folder where the serialized content is saved
    export_json=True
)

In [21]:
news_ca_config_lda_128.add_single_config(
    'lda_128',
    ca.FieldConfig(
        ca.FromNPY('../data_mind_large_news/lda_128_matrix.npy'),
        id='text_lda'
    ),
)

In [22]:
ca.ContentAnalyzer(config=news_ca_config_lda_128).fit()

INFO - *********   Processing field: lda_128   **********
 100%|██████████| 22507/22507 [00:00<00:00]
Serializing contents:  100%|██████████| 22507/22507 [03:09<00:00]


# Recommender System

## Get train/test (temporal)

In [20]:
ratings_complet = pd.read_csv('../data_mind_large_news/ratings_large_filtered.csv')

In [21]:
users_10k = pd.read_csv('../10k_users.csv')['UserID'].tolist()

In [22]:
ratings_10k = ratings_complet[ratings_complet['UserID'].isin(users_10k)].reset_index(drop=True)

In [23]:
#ratings_10k.to_csv('../data_mind_large_news/ratings_10k.csv', index=False)

In [24]:
ratings_complet['Time'] = pd.to_datetime(ratings_complet['Time'])

In [25]:
ratings = ca.Ratings(ca.CSVFile('../data_mind_large_news/ratings_10k.csv'))

Importing ratings:  100%|██████████| 2052051/2052051 [00:08<00:00]


In [26]:
print(ratings)

        user_id item_id  score
0        504290  106909    0.0
1        504290  101469    0.0
2        504290   95605    0.0
3        504290   96061    0.0
4        504290  130031    0.0
...         ...     ...    ...
2052046  339186    1767    0.0
2052047  339186  118908    0.0
2052048  339186   14612    0.0
2052049  339186    9471    0.0
2052050  366874   65373    1.0

[2052051 rows x 3 columns]
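
A temporal split, as named in this section's heading, keeps the oldest interactions of each user for training and the most recent ones for testing. The sketch below shows the idea in plain Python on made-up tuples. The function name, the tuple layout, and the rounding rule are assumptions for illustration and do not describe ClayRS's implementation.

```python
from collections import defaultdict

def temporal_holdout(interactions, train_size=0.75):
    """Split each user's interactions into train/test, oldest first.

    `interactions` is a list of (user_id, item_id, score, timestamp) tuples.
    """
    per_user = defaultdict(list)
    for row in interactions:
        per_user[row[0]].append(row)

    train, test = [], []
    for rows in per_user.values():
        rows.sort(key=lambda r: r[3])        # chronological order
        cut = int(len(rows) * train_size)    # first share of the history -> train
        train.extend(rows[:cut])
        test.extend(rows[cut:])
    return train, test

# Hypothetical interactions: (user, item, score, timestamp)
toy = [
    ("u1", "a", 1.0, 4), ("u1", "b", 0.0, 1), ("u1", "c", 1.0, 3), ("u1", "d", 0.0, 2),
    ("u2", "a", 0.0, 2), ("u2", "e", 1.0, 1),
]
train, test = temporal_holdout(toy)
print([r[1] for r in test])
```

A split that does not shuffle is only temporal if the ratings are already in chronological order per user, so it is worth sorting by time before partitioning.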


## Launch experiment

In [27]:
catalog = set(ratings.item_id_column)

In [28]:
len(catalog)

18186

In [34]:
#Definition of the recommender algorithms
cos_algo = rs.CentroidVector(
    {'lda_128': 'text_lda'},  
    similarity=rs.CosineSimilarity()
)

# knn_algo = rs.ClassifierRecommender(
#     {'lda_128':'text_lda'},
#     classifier=rs.SkKNN()
# )

# gp_algo = rs.ClassifierRecommender(
#     {'lda_128':'text_lda'},
#     classifier=rs.SkGaussianProcess()
# )

# svc_algo = rs.ClassifierRecommender(
#     {'lda_128':'text_lda'},
#     classifier=rs.SkSVC()
# )
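
For intuition, the centroid-vector approach with cosine similarity can be sketched in a few lines: average the vectors of the items a user liked, then rank candidate items by their cosine similarity to that average. The code below is a plain-Python illustration on made-up 3-dimensional vectors. The helper names are hypothetical, and details such as how ClayRS selects positive items are not covered.

```python
import math

def centroid(vectors):
    """Component-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical item vectors (the notebook uses 128 dimensions)
liked = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]
candidates = {"n1": [0.9, 0.1, 0.0], "n2": [0.0, 0.0, 1.0]}

# User profile = centroid of the liked items
profile = centroid(liked)

# Rank candidates from most to least similar to the profile
ranking = sorted(candidates, key=lambda i: cosine(profile, candidates[i]), reverse=True)
print(ranking)
```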

In [33]:
rs.ContentBasedExperiment(
    ratings,
    items_directory='news_codified_lda_128',
    partitioning_technique=rs.HoldOutPartitioning(train_set_size=0.75, shuffle=False),
    # algorithm_list=[knn_algo],
    algorithm_list=[cos_algo],
    metric_list=[
        eva.PrecisionAtK(k=10, sys_average='macro'),
        # eva.RecallAtK(k=10, sys_average='macro'),
        eva.FMeasureAtK(k=10, sys_average='macro'),
        eva.NDCGAtK(k=10),
        # eva.CatalogCoverage(catalog),
        # eva.GiniIndex()
    ],
    report=True,
    output_folder='report_baseline_10k_cv_complete_v2',
    overwrite_if_exists=True
).rank(n_recs=len(catalog), methodology=rs.TestRatingsMethodology(), num_cpus=1)
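
As a reference for reading the report, the sketch below gives textbook definitions of two of the metrics requested above, Precision@K and NDCG@K, for a single user. It is an illustration on made-up data. ClayRS may treat relevance thresholds, gains, and edge cases differently, and the zero-division guard is a choice of this sketch.

```python
import math

def precision_at_k(ranked_items, relevant, k=10):
    """Fraction of the top-k recommended items that are relevant."""
    top_k = ranked_items[:k]
    return sum(item in relevant for item in top_k) / k

def ndcg_at_k(ranked_items, gains, k=10):
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    def dcg(values):
        return sum(g / math.log2(pos + 2) for pos, g in enumerate(values))

    actual = dcg([gains.get(item, 0.0) for item in ranked_items[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    # Guard against users with no relevant item, where 0 / 0 is undefined
    return actual / ideal if ideal > 0 else 0.0

# Hypothetical ranking and ground truth for one user
ranked = ["a", "b", "c", "d"]
gains = {"a": 1.0, "c": 1.0}

p_at_2 = precision_at_k(ranked, set(gains), k=2)
ndcg_at_4 = ndcg_at_k(ranked, gains, k=4)
print(p_at_2, round(ndcg_at_4, 4))
```

With `sys_average='macro'`, per-user values like these are averaged across users to produce the system-level score.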

Performing HoldOutPartitioning:  100%|██████████| 10000/10000 [00:00<00:00]

INFO - ******* Processing alg CentroidVector *******
INFO - Don't worry if it looks stuck at first
INFO - First iterations will stabilize the estimated remaining time
Computing fit_rank for user 9999:  100%|██████████| 10000/10000 [02:48<00:00]
INFO - Performing evaluation on metrics chosen
  return actual / ideal
Performing NDCG@10:  100%|██████████| 3/3 [01:25<00:00]

INFO - Results saved in 'report_baseline_10k_cv_complete_v2/CentroidVector_1'
