# Topic Modeling Summary

When you have abstracts from papers (e.g., 100 papers), you can use **topic modeling** to automatically extract the main themes (topics) from the text. These topics can then be used to **match user interests** with relevant papers.


#### 1. **Basic (Vintage) Approach**
- Use a **Bag of Words** model to represent text as vectors.
- Apply **LDA (Latent Dirichlet Allocation)** to identify topics. *(Note: This is not PCA!)*
- Simple and quick to implement — can be done in just a few lines of Python.

#### 2. **Advanced Approach**
- Use **pretrained transformer models** from [Hugging Face](https://huggingface.co/models?pipeline_tag=sentence-similarity&sort=trending).
- These models, like **BERT**, are based on the **transformer architecture** and can capture the semantic meaning of text.
- Ideal for:
  - **Sentence similarity**
  - **Text classification**
  - **Topic modeling** using embeddings

In [1]:
# Import libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import json

# Path to the file
json_path = "/Users/dionnespaltman/Desktop/Luiss /Data Science in Action/Project/openalex_results_clean.json"

# Open and load the JSON data
with open(json_path, 'r') as f:
    data = json.load(f)

# Convert to DataFrame (if it's a list of dicts)
df = pd.DataFrame(data)

# Display the first few rows of the DataFrame
display(df.head())

Unnamed: 0,id,doi,title,display_name,relevance_score,publication_year,publication_date,ids,language,primary_location,...,best_oa_location,sustainable_development_goals,referenced_works_count,referenced_works,related_works,cited_by_api_url,counts_by_year,updated_date,created_date,abstract
0,https://openalex.org/W3047327247,https://doi.org/10.1080/13675567.2020.1803246,Machine learning demand forecasting and supply...,Machine learning demand forecasting and supply...,851.8783,2020,2020-08-04,{'openalex': 'https://openalex.org/W3047327247...,en,"{'is_oa': True, 'landing_page_url': 'https://d...",...,"{'is_oa': True, 'landing_page_url': 'https://d...",[],102,"[https://openalex.org/W122374401, https://open...","[https://openalex.org/W4313057686, https://ope...",https://api.openalex.org/works?filter=cites:W3...,"[{'year': 2025, 'cited_by_count': 13}, {'year'...",2025-03-19T09:40:40.806765,2020-08-10,"In many supply chains, firms staged in upstrea..."
1,https://openalex.org/W3024362711,https://doi.org/10.3390/su12104035,Influences of the Industry 4.0 Revolution on t...,Influences of the Industry 4.0 Revolution on t...,754.7955,2020,2020-05-14,{'openalex': 'https://openalex.org/W3024362711...,en,"{'is_oa': True, 'landing_page_url': 'https://d...",...,"{'is_oa': True, 'landing_page_url': 'https://d...","[{'id': 'https://metadata.un.org/sdg/10', 'sco...",147,"[https://openalex.org/W1560042744, https://ope...","[https://openalex.org/W648638662, https://open...",https://api.openalex.org/works?filter=cites:W3...,"[{'year': 2025, 'cited_by_count': 13}, {'year'...",2025-03-27T13:03:33.223437,2020-05-21,"Automation and digitalization, as long-term ev..."
2,https://openalex.org/W2923129012,https://doi.org/10.1155/2019/9067367,An Improved Demand Forecasting Model Using Dee...,An Improved Demand Forecasting Model Using Dee...,726.105,2019,2019-01-01,{'openalex': 'https://openalex.org/W2923129012...,en,"{'is_oa': True, 'landing_page_url': 'https://d...",...,"{'is_oa': True, 'landing_page_url': 'https://d...","[{'id': 'https://metadata.un.org/sdg/16', 'sco...",41,"[https://openalex.org/W124243540, https://open...","[https://openalex.org/W4375867731, https://ope...",https://api.openalex.org/works?filter=cites:W2...,"[{'year': 2025, 'cited_by_count': 4}, {'year':...",2025-03-23T10:02:26.057098,2019-04-01,Demand forecasting is one of the main issues o...
3,https://openalex.org/W2980994438,https://doi.org/10.1016/j.ijforecast.2019.07.001,DeepAR: Probabilistic forecasting with autoreg...,DeepAR: Probabilistic forecasting with autoreg...,693.075,2019,2019-10-19,{'openalex': 'https://openalex.org/W2980994438...,en,"{'is_oa': True, 'landing_page_url': 'https://d...",...,"{'is_oa': True, 'landing_page_url': 'https://d...","[{'id': 'https://metadata.un.org/sdg/9', 'scor...",47,"[https://openalex.org/W129305155, https://open...","[https://openalex.org/W3188413760, https://ope...",https://api.openalex.org/works?filter=cites:W2...,"[{'year': 2025, 'cited_by_count': 57}, {'year'...",2025-03-27T11:10:07.481385,2019-10-25,"Probabilistic forecasting, i.e., estimating a ..."
4,https://openalex.org/W4387379065,https://doi.org/10.1080/08874417.2023.2261010,The Potential of Generative Artificial Intelli...,The Potential of Generative Artificial Intelli...,688.9533,2023,2023-10-05,{'openalex': 'https://openalex.org/W4387379065...,en,"{'is_oa': False, 'landing_page_url': 'https://...",...,,"[{'id': 'https://metadata.un.org/sdg/9', 'scor...",91,"[https://openalex.org/W2772633599, https://ope...","[https://openalex.org/W4380551139, https://ope...",https://api.openalex.org/works?filter=cites:W4...,"[{'year': 2025, 'cited_by_count': 46}, {'year'...",2025-03-23T10:47:46.228229,2023-10-06,ABSTRACTIn a short span of time since its intr...


In [2]:
# Get the 'abstract' column as a Pandas Series
abstracts = df['abstract']
abstracts_list = df['abstract'].tolist()

# BERTopic wikipedia
Wikipedia BERTopic:  https://huggingface.co/MaartenGr/BERTopic_Wikipedia

How to use it: 

 `from bertopic import BERTopic `
 
 `topic_model = BERTopic.load("MaartenGr/BERTopic_Wikipedia”) `

In [None]:
# Make sure you've installed these in your terminal before running the code
# It can be a trial and error to see what you need 

# pip install -U bertopic
# pip install -U safetensors
# !pip install bertopic
# !pip install safetensors
# conda update numba numpy
# !pip uninstall -y bertopic
# !pip install bertopic
# pip install tf-keras
# !pip install sentence-transformers


In [3]:
import pandas as pd
from bertopic import BERTopic

2025-04-09 10:40:25.061927: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [4]:
# Make a clean df 
df_clean = df[df['abstract'].notna()].copy()

# DataFrame is called df and it has a column 'abstract'
docs = df_clean['abstract'].tolist()

# Load the pre-trained BERTopic model from Hugging Face
topic_model = BERTopic.load("MaartenGr/BERTopic_Wikipedia")

# Apply the model to your documents
topics, probs = topic_model.transform(docs)

# Add results back to your dataframe
df_clean['topic_id'] = topics
df_clean['topic_label'] = df_clean['topic_id'].apply(
    lambda x: topic_model.topic_labels_[x] if x != -1 and x < len(topic_model.topic_labels_) else "Unknown"
)

# Add topic_id and topic_label to the original DataFrame, defaulting to NaN
df['topic_id'] = pd.NA
df['topic_label'] = pd.NA

# Update only the rows that had non-null abstracts
df.loc[df['abstract'].notna(), 'topic_id'] = df_clean['topic_id'].values
df.loc[df['abstract'].notna(), 'topic_label'] = df_clean['topic_label'].values

# preview the topics 
display(topic_model.get_topic_info().head())  # Summary of topics

Batches:   0%|          | 0/30 [00:00<?, ?it/s]

2025-04-09 10:42:08,627 - BERTopic - Predicting topic assignments through cosine similarity of topic and document embeddings.


Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,633881,-1_cast_films_film_movie,"[cast, films, film, movie, 2020, comedy, relea...",
1,0,18441,0_goalscorer_scored_goals_goal,"[goalscorer, scored, goals, goal, goalkeeper, ...",
2,1,8518,1_khan_actor_raj_shah,"[khan, actor, raj, shah, crore, hai, actress, ...",
3,2,7521,2_married_divorced_couple_remarried,"[married, divorced, couple, remarried, engaged...",
4,3,6765,3_cast_actress_starred_actor,"[cast, actress, starred, actor, actors, starri...",


The topics seem to be unrelated to our topic. 

In [None]:
# Check if my input is correct
for i, text in enumerate(docs[:3]):
    print(f"\nAbstract {i+1}:\n{text}")



Abstract 1:
In many supply chains, firms staged in upstream of the chain suffer from variance amplification emanating from demand information distortion in a multi-stage supply chain and, consequently, their operation inefficiency. Prior research suggest that employing advanced demand forecasting, such as machine learning, could mitigate the effect and improve the performance; however, it is less known what is the extent and magnitude of savings as tangible supply chain performance outcomes. In this research, hybrid demand forecasting methods grounded on machine learning i.e. ARIMAX and Neural Network is developed. Both time series and explanatory factors are feed into the developed method. The method was applied and evaluated in the context of functional product and a steel manufacturer. The statistically significant supply chain performance improvement differences were found across traditional and ML-based demand forecasting methods. The implications for the theory and practice are 

In the following code we're doing topic modeling with BERTopic:
- Uses sentence embeddings to cluster similar abstracts.
- Assigns a topic number and optionally a topic label to each document.
- Helps you identify groups/themes in your datase

In [8]:
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

# Load embedding model for better semantic understanding
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# Train a new BERTopic model from scratch
topic_model = BERTopic(embedding_model=embedding_model, verbose=True)
topics, _ = topic_model.fit_transform(docs)

# Add to your cleaned DataFrame (clone to avoid overwrite)
df_custom = df_clean.copy()
df_custom['topic_id_custom'] = topics
df_custom['topic_label_custom'] = df_custom['topic_id_custom'].apply(
    lambda x: topic_model.get_topic(x)[0] if x != -1 else "Unknown"
)

# Optional: Preview custom topic summary
display(topic_model.get_topic_info().head())


2025-04-09 11:00:38,207 - BERTopic - Embedding - Transforming documents to embeddings.


Batches:   0%|          | 0/30 [00:00<?, ?it/s]

2025-04-09 11:01:15,709 - BERTopic - Embedding - Completed ✓
2025-04-09 11:01:15,709 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-04-09 11:01:17,200 - BERTopic - Dimensionality - Completed ✓
2025-04-09 11:01:17,201 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-04-09 11:01:17,233 - BERTopic - Cluster - Completed ✓
2025-04-09 11:01:17,236 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-04-09 11:01:17,459 - BERTopic - Representation - Completed ✓


Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,183,-1_the_and_of_in,"[the, and, of, in, to, data, for, is, on, this]",[Information technologies in general and artif...
1,0,207,0_the_of_demand_forecasting,"[the, of, demand, forecasting, to, and, in, sa...","[Compared to other industries, fashion apparel..."
2,1,194,1_the_and_ai_of,"[the, and, ai, of, to, in, for, this, on, cust...",[The thought-provoking paper by Cooper (2021) ...
3,2,64,2_recommendation_the_system_of,"[recommendation, the, system, of, commerce, pe...",[Under the background of leap-forward developm...
4,3,48,3_supply_chain_and_the,"[supply, chain, and, the, ai, of, in, to, mana...",[The integration of Artificial Intelligence (A...


# Semantic similarity analysis

Here we're manually generating embeddings using the same model. This is useful when you want to:
- Compare documents pairwise using cosine similarity.
- Build a recommendation system (e.g., “show me similar papers”).
- Visualize distances, do nearest neighbor search, etc.

In [9]:
from sentence_transformers import SentenceTransformer
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Work with non-null abstracts
df_clean = df[df['abstract'].notna()].copy()

# Convert to list
docs = df_clean['abstract'].tolist()

# Create sentence embeddings
embeddings = embedding_model.encode(docs, show_progress_bar=True)

# Add embeddings to the cleaned DataFrame
df_clean['embedding'] = list(embeddings)

# Add empty column to the original DataFrame
df['embedding'] = pd.NA

# Merge back into the original DataFrame
df.loc[df['abstract'].notna(), 'embedding'] = df_clean['embedding'].values


Batches:   0%|          | 0/30 [00:00<?, ?it/s]

In [11]:
from sklearn.metrics.pairwise import cosine_similarity

# Compute cosine similarity matrix (this compares each doc with every other doc)
similarity_matrix = cosine_similarity(embeddings)

print(similarity_matrix)


[[1.         0.16464816 0.74838334 ... 0.67583954 0.5825738  0.15442173]
 [0.16464816 1.         0.11452664 ... 0.15400772 0.14109099 0.19103307]
 [0.74838334 0.11452664 0.99999994 ... 0.80420303 0.5861476  0.24352759]
 ...
 [0.67583954 0.15400772 0.80420303 ... 0.99999976 0.5578631  0.16358986]
 [0.5825738  0.14109099 0.5861476  ... 0.5578631  0.9999999  0.15703568]
 [0.15442173 0.19103307 0.24352759 ... 0.16358986 0.15703568 0.99999994]]


# Simple recommendation function (preview of next task)

In [12]:
def recommend_similar_papers(index, top_n=5):
    sim_scores = similarity_matrix[index]
    top_indices = np.argsort(sim_scores)[::-1][1:top_n+1]  # skip the paper itself
    return df.iloc[top_indices][['title', 'abstract', 'topic_label']]


In [13]:
recommend_similar_papers(10)  # Recommend similar to paper at index 10


Unnamed: 0,title,abstract,topic_label
718,Analyzing the Role of Artificial Intelligence ...,Purpose: The aim of the study was to analyze t...,1818_logistics_freight_warehousing_procurement
761,Framework for collaborative intelligence in fo...,Electricity price forecasting in wholesale mar...,181_neural_neuron_neurons_convolutions
321,Evaluation of deep learning with long short-te...,Performance analysis and forecasting the evolu...,181_neural_neuron_neurons_convolutions
394,Event-driven forecasting of wholesale electric...,,
950,AI Meets the Shopper: Psychosocial Factors in ...,The evolution of e-retail and the contribution...,1821_commerce_retailers_shopping_retailing
