## Demonstration of Content-Based Recommender System

This notebook implements content-based filtering, using article embeddings to represent the features of each article's content.

An online logistic regression model (SGDClassifier) is trained in mini-batches, with PCA applied for dimensionality reduction, to predict the likelihood that a user clicks on an article.
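To make the mini-batch training idea concrete, the following dependency-free sketch fits a logistic regression with averaged mini-batch gradient steps on toy two-dimensional features. It is an illustration only: `OnlineLogisticRegression`, the toy data, and the learning rate are hypothetical, the PCA step is omitted, and scikit-learn's `SGDClassifier` follows its own update schedule rather than this exact one.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class OnlineLogisticRegression:
    """Hypothetical stand-in for an online click classifier (not the notebook's model)."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        """Predicted click probability for one feature vector."""
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return sigmoid(z)

    def partial_fit(self, batch_x, batch_y):
        """Apply one gradient step using the mean log-loss gradient of the batch."""
        grad_w = [0.0] * len(self.weights)
        grad_b = 0.0
        for x, y in zip(batch_x, batch_y):
            error = self.predict_proba(x) - y
            for j, xj in enumerate(x):
                grad_w[j] += error * xj
            grad_b += error
        n = len(batch_x)
        for j in range(len(self.weights)):
            self.weights[j] -= self.learning_rate * grad_w[j] / n
        self.bias -= self.learning_rate * grad_b / n


# Toy data (assumed): the first feature signals a click, the second a non-click.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [1, 1, 0, 0]

model = OnlineLogisticRegression(n_features=2)
batch_size = 2
for _ in range(200):  # several passes over the data, one mini-batch at a time
    for start in range(0, len(features), batch_size):
        model.partial_fit(features[start:start + batch_size],
                          labels[start:start + batch_size])
```

Because the model only ever sees one batch at a time, the same loop works when the interaction data is too large to hold in memory at once.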

Recommendations are generated by ranking articles based on their predicted click probabilities, and the model's performance is evaluated using MAP@K and NDCG@K metrics.
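As a rough illustration of the ranking and evaluation steps, the sketch below sorts articles by predicted click probability and scores the resulting list with binary-relevance versions of AP@K and NDCG@K. The function names, the toy scores, and the choice to normalise average precision by `min(k, number of relevant articles)` are assumptions made for this example; the metrics computed inside `SGDContentBased` may be defined differently.

```python
import math


def rank_by_score(scores, k):
    """Return the k article ids with the highest predicted click probability."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def average_precision_at_k(recommended, relevant, k):
    """AP@K with binary relevance; averaging this over users gives MAP@K."""
    hits = 0
    precision_sum = 0.0
    for position, article_id in enumerate(recommended[:k], start=1):
        if article_id in relevant:
            hits += 1
            precision_sum += hits / position
    denominator = min(k, len(relevant))
    return precision_sum / denominator if denominator else 0.0


def ndcg_at_k(recommended, relevant, k):
    """NDCG@K with binary relevance and a log2 position discount."""
    dcg = sum(
        1.0 / math.log2(position + 1)
        for position, article_id in enumerate(recommended[:k], start=1)
        if article_id in relevant
    )
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(position + 1) for position in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Toy example (assumed values): predicted click probabilities for one user.
scores = {"a": 0.9, "b": 0.2, "c": 0.7, "d": 0.4}
clicked = {"a", "d"}

top_3 = rank_by_score(scores, k=3)
ap = average_precision_at_k(top_3, clicked, k=3)
ndcg = ndcg_at_k(top_3, clicked, k=3)
```

Both metrics reward placing clicked articles near the top of the list; NDCG@K discounts hits smoothly by position, while AP@K averages the precision at each hit.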


In [None]:
import sys
import os

# Make the parent directory importable from this notebook's working directory.
parent_dir = os.path.abspath(os.path.join(os.getcwd(), ".."))
sys.path.append(parent_dir)

from parquet_data_reader import ParquetDataReader
from utils.process_data import user_item_binary_interaction
from models.content_based.interaction_content_based import SGDContentBased

parquet_reader = ParquetDataReader()

### Data Extraction and Processing

In [4]:
train_behavior_df = parquet_reader.read_data("../../data/train/behaviors.parquet")
user_df = parquet_reader.read_data("../../data/train/history.parquet")
item_df = parquet_reader.read_data("../../data/articles.parquet")
embedding_df = parquet_reader.read_data("../../data/document_vector.parquet")
binary_interaction = user_item_binary_interaction(train_behavior_df, user_df, item_df)
binary_interaction

user_id,article_id,clicked
u32,i32,i32
13538,3001353,0
13538,3003065,0
13538,3012771,0
13538,3023463,0
13538,3032577,0
…,…,…
1710834,9803492,0
1710834,9803505,0
1710834,9803525,0
1710834,9803560,0


### Data Analysis

In [5]:
# Count the rows in binary_interaction where clicked is 1
clicks = binary_interaction.filter(binary_interaction["clicked"] == 1).count()
clicks["clicked"]

clicked
u32
70421


In [6]:
model = SGDContentBased(binary_interaction=binary_interaction, articles_embedding=embedding_df)
model.fit()

recommendations = model.recommend(user_id=13538, n_recommendations=5)
recommendations

KeyboardInterrupt: 

### Evaluation Metrics

In [None]:
evaluation = model.evaluate_recommender(user_sample=1000)
print("Accuracy metric: ", evaluation)

In [None]:
diversity = model.aggregate_diversity(item_df=item_df, user_sample=1000)
print("Diversity metric: ", diversity)
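The implementation of `model.aggregate_diversity` is not shown in this notebook. One common definition, assumed here, is catalog coverage: the fraction of all articles that appear in at least one user's recommendation list. The function name and toy inputs below are hypothetical.

```python
def catalog_coverage(recommendations, catalog_size):
    """Fraction of the catalog recommended to at least one user.

    recommendations: dict mapping user id -> list of recommended article ids.
    catalog_size: total number of articles that could be recommended.
    """
    recommended_items = set()
    for items in recommendations.values():
        recommended_items.update(items)
    return len(recommended_items) / catalog_size if catalog_size else 0.0


# Toy example (assumed values): two users, a catalog of six articles.
toy_recommendations = {1: ["a", "b"], 2: ["b", "c"]}
coverage = catalog_coverage(toy_recommendations, catalog_size=6)
```

A value close to 1 means most of the catalog is surfaced to someone; a value close to 0 means the recommender keeps returning the same few articles.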


In [None]:
popularity_bias = model.gini_coefficient(item_df=item_df, user_sample=1000)
print("Popularity bias metric: ", popularity_bias)
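The body of `model.gini_coefficient` is likewise not shown. A minimal sketch, assuming the metric is the Gini coefficient of how often each catalog article is recommended (0 for perfectly even exposure, approaching 1 when a few articles receive almost all recommendations), could look like the following. The function name and toy inputs are hypothetical.

```python
from collections import Counter


def exposure_gini(recommendations, catalog):
    """Gini coefficient of recommendation counts over the whole catalog.

    Articles that are never recommended count as zero exposure.
    """
    counts = Counter(item for items in recommendations.values() for item in items)
    exposure = sorted(counts.get(item, 0) for item in catalog)
    n = len(exposure)
    total = sum(exposure)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula on ascending values: sum((2i - n - 1) * x_i) / (n * sum(x)).
    weighted = sum((2 * rank - n - 1) * value
                   for rank, value in enumerate(exposure, start=1))
    return weighted / (n * total)


# Toy examples (assumed values).
concentrated = exposure_gini({1: ["x"], 2: ["x"], 3: ["x"], 4: ["x"]},
                             catalog=["x", "y", "z", "w"])
even = exposure_gini({1: ["x", "y"], 2: ["x", "y"]}, catalog=["x", "y"])
```

In the concentrated case every user receives the same single article, so the coefficient reaches its maximum for a four-item catalog, (n - 1) / n.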