## Demonstration of a Content-Based Recommender System

This notebook demonstrates a content-based filtering approach that uses article embeddings to represent the features of each article.

An online logistic regression model (SGDClassifier) is trained in mini-batches, with PCA applied for dimensionality reduction, to predict the probability that a user clicks on an article.

Recommendations are generated by ranking articles based on their predicted click probabilities, and the model's performance is evaluated using MAP@K and NDCG@K metrics.


In [1]:
import sys
import os

# Make the parent directory importable from the notebook's working directory
parent_dir = os.path.abspath(os.path.join(os.getcwd(), ".."))
sys.path.append(parent_dir)

from parquet_data_reader import ParquetDataReader
from utils.process_data import user_item_binary_interaction
from models.content_based.interaction_content_based import SGDContentBased

parquet_reader = ParquetDataReader()

### Data Extraction and Processing

In [2]:
train_behavior_df = parquet_reader.read_data("../../data/train/behaviors.parquet")
user_df = parquet_reader.read_data("../../data/train/history.parquet")
item_df = parquet_reader.read_data("../../data/articles.parquet")
test_behavior_df = parquet_reader.read_data("../../data/validation/behaviors.parquet")
embedding_df = parquet_reader.read_data("../../data/document_vector.parquet")
binary_interaction = user_item_binary_interaction(train_behavior_df, user_df, item_df)
test_binary_interaction = user_item_binary_interaction(test_behavior_df, user_df, item_df)
binary_interaction


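| user_id (u32) | article_id (i32) | clicked (i32) |
|---|---|---|
| 13538 | 3001353 | 0 |
| 13538 | 3003065 | 0 |
| 13538 | 3012771 | 0 |
| 13538 | 3023463 | 0 |
| 13538 | 3032577 | 0 |
| … | … | … |
| 1710834 | 9803492 | 0 |
| 1710834 | 9803505 | 0 |
| 1710834 | 9803525 | 0 |
| 1710834 | 9803560 | 0 |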


### Data Analysis

In [3]:
# Count the rows in binary_interaction where clicked == 1
clicks = binary_interaction.filter(binary_interaction["clicked"] == 1).count()
clicks["clicked"]

clicked
u32
70421


In [4]:
model = SGDContentBased(
    binary_interaction=binary_interaction,
    articles_embedding=embedding_df,
    test_data=test_behavior_df,
)

model.fit()

# Top 5 articles for one user, ranked by predicted click probability
recommendations = model.recommend(user_id=13538, n_recommendations=5)
recommendations

Training complete!


| user_id (u32) | article_id (i32) | clicked (i32) | document_vector (list[f32]) | prediction (f64) |
|---|---|---|---|---|
| 13538 | 9776087 | 0 | [0.013578, 0.017813, … 0.005463] | 0.000226 |
| 13538 | 9688372 | 0 | [0.00813, 0.005233, … -0.032117] | 0.000226 |
| 13538 | 9783858 | 0 | [0.036281, 0.059355, … 0.009815] | 0.000225 |
| 13538 | 9189678 | 0 | [0.014872, 0.03328, … -0.007712] | 0.000225 |
| 13538 | 9731676 | 0 | [0.038218, 0.039659, … 0.015651] | 0.000225 |


### Evaluation Metrics

In [7]:
evaluation = model.evaluate_recommender(user_sample=100)
print("Accuracy metric: ", evaluation)



Accuracy metric:  {'MAP@K': np.float64(0.0), 'NDCG@K': np.float64(0.0)}
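
For reference, here is a minimal sketch of how MAP@K and NDCG@K are commonly computed for a single user with binary relevance (clicked or not). The implementation behind `evaluate_recommender` is not shown in this notebook and may differ, for example in how average precision is normalized. The function names below are hypothetical, and the sketch assumes each recommendation list contains no duplicate items.

```python
import math

def average_precision_at_k(recommended, relevant, k):
    """AP@K for one user: mean of precision values at each hit in the top K."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits, score = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank  # precision at this rank
    # Normalize by the best achievable number of hits within K (one common choice).
    return score / min(len(relevant), k)

def ndcg_at_k(recommended, relevant, k):
    """NDCG@K for one user with binary relevance."""
    relevant = set(relevant)
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(recommended[:k], start=1)
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# Example: items 1 and 3 were clicked; the model ranked them first and third.
ap = average_precision_at_k([1, 2, 3, 4, 5], [1, 3], k=5)
ndcg = ndcg_at_k([1, 2, 3, 4, 5], [1, 3], k=5)
```

MAP@K is then the mean of AP@K over the sampled users. Under these definitions, both metrics are exactly 0.0 when none of the top K recommended articles for any sampled user is a clicked article.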


In [6]:
diversity = model.aggregate_diversity(item_df=item_df, user_sample=1000)
print("Diversity metric: ", diversity)


AttributeError: 'SGDContentBased' object has no attribute 'user_ids'

In [None]:
popularity_bias = model.gini_coefficient(item_df=item_df, user_sample=1000)
print("Popularity bias metric: ", popularity_bias)
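
The two cells above call `aggregate_diversity` and `gini_coefficient` on the model, but their implementations are not shown here. The following is a minimal sketch of how these metrics are often defined over a set of per-user recommendation lists. The standalone helper functions and their signatures are hypothetical and are not the methods of `SGDContentBased`.

```python
from collections import Counter

def aggregate_diversity(recommendation_lists, catalog):
    """Fraction of catalog items that appear in at least one recommendation list."""
    catalog = set(catalog)
    if not catalog:
        return 0.0
    recommended = {item for recs in recommendation_lists for item in recs}
    return len(recommended & catalog) / len(catalog)

def gini_coefficient(recommendation_lists, catalog):
    """Gini coefficient of how often each catalog item is recommended.

    0.0 means every item is recommended equally often; values near 1.0
    mean recommendations concentrate on a few items.
    """
    counts = Counter(item for recs in recommendation_lists for item in recs)
    values = sorted(counts.get(item, 0) for item in set(catalog))
    n, total = len(values), sum(values)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(values, start=1))
    return weighted / (n * total)

# Example: three users, a catalog of four articles, article 4 never recommended.
lists = [[1, 2], [1, 3], [1, 2]]
diversity = aggregate_diversity(lists, catalog=[1, 2, 3, 4])
gini = gini_coefficient(lists, catalog=[1, 2, 3, 4])
```

In this example three of the four catalog items are recommended at least once, and the counts are uneven because article 1 appears in every list.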