# Recommender system implementation 

In [1]:
# Import libraries
import json

import numpy as np
import pandas as pd
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer, util
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Path to the file
json_path = "/Users/dionnespaltman/Desktop/Luiss /Data Science in Action/Project/openalex_results_clean.json"

# Open and load the JSON data
with open(json_path, 'r', encoding='utf-8') as f:
    data = json.load(f)

# Convert to a DataFrame and keep only the papers that have an abstract
df = pd.DataFrame(data)
df_clean = df[df['abstract'].notna()].copy()
docs = df_clean['abstract'].tolist()

# Topic modeling
topic_model = BERTopic.load("MaartenGr/BERTopic_Wikipedia")
topics, probs = topic_model.transform(docs)

df_clean['topic_id'] = topics
# topic_labels_ maps topic id -> label; outliers (-1) and unmapped ids become "Unknown"
df_clean['topic_label'] = df_clean['topic_id'].apply(
    lambda x: topic_model.topic_labels_.get(x, "Unknown") if x != -1 else "Unknown"
)

# Embedding
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = embedding_model.encode(docs, show_progress_bar=True)
df_clean['embedding'] = list(embeddings)

# Finalize main paper DataFrame
papers_df = df_clean.copy()
paper_embeddings = np.vstack(papers_df['embedding'].values)



2025-04-02 17:59:42,718 - BERTopic - Predicting topic assignments through cosine similarity of topic and document embeddings.



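Because `paper_embeddings` is fixed once the corpus has been encoded, one option is to L2-normalize it a single time, so that each query can be scored with a plain matrix-vector product. The sketch below illustrates the idea on a small made-up matrix. `normalize_rows` and `cosine_scores_for` are hypothetical helpers that do not appear in the notebook, and the toy vectors only stand in for real MiniLM embeddings.

```python
import numpy as np


def normalize_rows(matrix, eps=1e-12):
    """Scale each row to unit L2 norm (eps guards against all-zero rows)."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.maximum(norms, eps)


def cosine_scores_for(query_vec, normalized_papers, eps=1e-12):
    """Cosine similarity of one query vector against pre-normalized rows."""
    query_unit = query_vec / max(np.linalg.norm(query_vec), eps)
    return normalized_papers @ query_unit


# Toy stand-ins for paper embeddings (3 papers, 2 dimensions); values are made up
toy_papers = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
toy_normalized = normalize_rows(toy_papers)

# Score a toy query against every paper with a single matrix-vector product
toy_scores = cosine_scores_for(np.array([2.0, 0.0]), toy_normalized)
print(np.round(toy_scores, 4))
```

For unit-length vectors the dot product equals the cosine similarity, so this should give the same ranking as a pairwise cosine routine while skipping the repeated normalization of the paper matrix.
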
In [2]:
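def recommend_similar_papers_from_query(query_text, top_k=5, boost_topic=True):
    """Print the top_k papers most similar to query_text.

    Papers are ranked by cosine similarity between embeddings; when boost_topic
    is True, papers assigned the same topic as the query get a +0.05 bonus.
    """
    # Encode the query as a NumPy array so both cos_sim inputs end up on the CPU
    query_embedding = embedding_model.encode(query_text, convert_to_numpy=True)
    cosine_scores = util.cos_sim(query_embedding, paper_embeddings)[0].cpu().numpy()

    # Optional: boost papers that share the query's predicted topic
    if boost_topic:
        query_topic, _ = topic_model.transform([query_text])
        # Skip the boost for outlier queries (-1), which do not belong to a real topic
        if query_topic[0] != -1:
            topic_boost_mask = (papers_df['topic_id'] == query_topic[0]).values.astype(cosine_scores.dtype)
            cosine_scores += 0.05 * topic_boost_mask  # Boost same-topic papers slightly

    # Indices of the top_k highest scores, best first
    top_results = np.argsort(-cosine_scores)[:top_k]

    for idx in top_results:
        paper = papers_df.iloc[idx]
        print(f"Title: {paper['title']}")
        print(f"Score: {cosine_scores[idx]:.4f}")
        print(f"Topic: {paper['topic_label']}")
        print(f"Abstract: {paper['abstract']}\n")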
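
To make the scoring rule concrete, here is a small self-contained sketch of the same ranking logic on made-up numbers: cosine scores plus a fixed bonus for papers whose topic matches the query's. `rank_with_topic_boost` and all the values below are illustrative assumptions. Only the 0.05 boost comes from the notebook, and `-1` is BERTopic's outlier topic.

```python
import numpy as np


def rank_with_topic_boost(scores, paper_topics, query_topic, top_k=5, boost=0.05):
    """Return (indices, boosted scores) of the top_k papers, best first."""
    boosted = np.asarray(scores, dtype=float).copy()
    # Outlier queries (-1) get no boost in this sketch: -1 is not a real topic
    if query_topic != -1:
        boosted += boost * (np.asarray(paper_topics) == query_topic)
    # A stable sort keeps the original order among tied scores
    order = np.argsort(-boosted, kind="stable")[:top_k]
    return order, boosted[order]


# Hypothetical cosine scores and topic ids for four papers
toy_scores = [0.60, 0.62, 0.40, 0.61]
toy_topics = [7, 3, 7, -1]

top_idx, top_scores = rank_with_topic_boost(toy_scores, toy_topics, query_topic=7, top_k=3)
print(top_idx, top_scores)
```

With topic 7 as the query topic, paper 0 overtakes paper 1 even though its raw cosine score is lower, which is the intended effect of the boost.
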


In [3]:
# Define your query
query = "Reinforcement learning for automated pricing in e-commerce"

# Get recommendations
recommend_similar_papers_from_query(query_text=query, top_k=5)


2025-04-02 18:03:16,633 - BERTopic - Predicting topic assignments through cosine similarity of topic and document embeddings.


Title: Dynamic Pricing Model of E-Commerce Platforms Based on Deep Reinforcement Learning
Score: 0.7632
Topic: 1821_commerce_retailers_shopping_retailing
Abstract: With the continuous development of artificial intelligence technology, its application field has gradually expanded. To further apply the deep reinforcement learning technology to the field of dynamic pricing, we build an intelligent dynamic pricing system, introduce the reinforcement learning technology related to dynamic pricing, and introduce existing research on the number of suppliers (single supplier and multiple suppliers), environmental models, and selection algorithms. A two-period dynamic pricing game model is designed to assess the optimal pricing strategy for e-commerce platforms under two market conditions and two consumer participation conditions. The first step is to analyze the pricing strategies of e-commerce platforms in mature markets, analyze the optimal pricing and profits of various enterprises under di

In [4]:
query = "Effectiveness of personalized promotion strategies using customer data"
recommend_similar_papers_from_query(query_text=query, top_k=5)


2025-04-02 18:03:19,812 - BERTopic - Predicting topic assignments through cosine similarity of topic and document embeddings.


Title: Hyper-Personalization
Score: 0.6511
Topic: 1250_marketing_advertising_market_consumers
Abstract: Personalization is widely used to attract and retain customers in online business addressing one size fits all issues, but little is addressed to contextualise users' real-time needs. E-commerce website owners use these strategies for customer-centric marketing through enhanced experience but fail in designing effective personalization due to the dynamic nature of users' needs and pace of information exposure. To address this, this chapter explores hyper-personalization strategies to overcome users' implicit need to be served better. The research presents a hyper-personalization process with learning (ML) and artificial intelligence (AI) techniques for marketing functions like segmentation, targeting, and positioning based on real-time analytics throughout the customer journey and key factors driving effective customer-centric marketing. This chapter facilitates marketers to use AI-e

In [5]:
query = "Natural language processing for sentiment-driven pricing strategies"
recommend_similar_papers_from_query(query_text=query, top_k=5)


2025-04-02 18:03:21,936 - BERTopic - Predicting topic assignments through cosine similarity of topic and document embeddings.


Title: Research on Chinese Consumers’ Attitudes Analysis of Big-Data Driven Price Discrimination Based on Machine Learning
Score: 0.6251
Topic: 1250_marketing_advertising_market_consumers
Abstract: From the end of 2018 in China, the Big-data Driven Price Discrimination (BDPD) of online consumption raised public debate on social media. To study the consumers' attitude about the BDPD, this study constructed a semantic recognition frame to deconstruct the Affection-Behavior-Cognition (ABC) consumer attitude theory using machine learning models inclusive of the Labeled Latent Dirichlet Allocation (LDA), Long Short-Term Memory (LSTM), and Snow Natural Language Processing (NLP), based on social media comments text dataset. Similar to the questionnaires published results, this article verified that 61% of consumers expressed negative sentiment toward BDPD in general. Differently, on a finer scale, this study further measured the negative sentiments that differ significantly among different to