# Topic modelling 

Suppose you have the abstracts of 100 papers. You can run topic modelling on the abstracts to extract the topics of each paper, and then do a matching step: compute a measure of similarity between what a user wants and each paper, and retrieve the closest papers. (Topic modelling means: given a collection of texts, extract their topics.)

- Vintage approach: the bag-of-words model. You can start with that. You embed the text with the bag-of-words model; there are many tutorials that show how to do this.
- Run a PCA-like decomposition on the bag-of-words representation. In this setting the method is called LDA (latent … allocation), not PCA. In Python this can be done in about three lines of code. This is the simplest approach.
- More advanced techniques: Hugging Face is a platform that hosts thousands of pretrained models. At the heart of large language models (which we use on a daily basis) is a particular deep learning architecture called the transformer. In short, transformers are the standard for generative natural language models. On Hugging Face, go to Models > Natural Language Processing (many tasks) and pick a task such as Sentence Similarity or Text Classification.

https://huggingface.co/models?pipeline_tag=sentence-similarity&sort=trending 

Interesting article: 
- Leveraging BERTopic for the Analysis of Scientific Papers on Seaweed https://ieeexplore.ieee.org/document/10285737
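
Before moving to pretrained models, the matching step described at the top of this section can be tried with plain TF-IDF vectors and cosine similarity. This is a minimal sketch under stated assumptions: the toy abstracts and the `user_query` string are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy abstracts and a hypothetical user query (illustrative only)
toy_abstracts = [
    "Seaweed cultivation methods for sustainable aquaculture",
    "Transformer models for natural language processing",
    "Topic modelling of scientific abstracts with LDA",
]
user_query = "topic modelling for scientific papers"

# Fit TF-IDF on the abstracts, then project the query into the same space
tfidf = TfidfVectorizer(stop_words="english")
abstract_vectors = tfidf.fit_transform(toy_abstracts)
query_vector = tfidf.transform([user_query])

# Cosine similarity between the query and every abstract
scores = cosine_similarity(query_vector, abstract_vectors).ravel()
best_match = int(scores.argmax())
print(toy_abstracts[best_match], round(float(scores[best_match]), 3))
```

TF-IDF only matches on shared words, so a query phrased with synonyms would score zero; that limitation is what the embedding-based approach is meant to address.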



In [None]:
# Import libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
import json

# Path to the file
json_path = "/Users/dionnespaltman/Desktop/Luiss /Data Science in Action/Project/openalex_results_clean.json"

# Open and load the JSON data
with open(json_path, 'r', encoding='utf-8') as f:
    data = json.load(f)

# Convert to DataFrame (if it's a list of dicts)
df = pd.DataFrame(data)


In [None]:
display(df.head())

In [None]:
# Get the 'abstract' column as a Pandas Series
abstracts = df['abstract']
display(abstracts)


In [None]:
abstracts_list = df['abstract'].tolist()
# print(abstracts_list)

# BERTopic wikipedia
Wikipedia BERTopic:  https://huggingface.co/MaartenGr/BERTopic_Wikipedia

Open questions:
- What format should the input data be in?
- This is a relatively small model, so another one might be a better choice.
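
On the format question: `BERTopic.transform` accepts a list of strings, one per document. A minimal sketch of preparing such a list from a DataFrame column follows. The toy DataFrame and the helper name `abstracts_to_docs` are hypothetical, introduced only for illustration.

```python
import pandas as pd

def abstracts_to_docs(frame, column="abstract"):
    """Return (clean_frame, docs): rows with usable text and the list of strings."""
    text = frame[column]
    # Keep rows whose abstract is present and not just whitespace
    mask = text.notna() & text.astype(str).str.strip().ne("")
    clean = frame[mask].copy()
    docs = clean[column].astype(str).str.strip().tolist()
    return clean, docs

# Toy data standing in for the OpenAlex results (illustrative only)
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "abstract": ["Seaweed farming study.", None, "   "],
})
clean, docs = abstracts_to_docs(toy)
print(docs)
```

Returning the filtered frame alongside the list keeps the two aligned, so per-document results can be written back row by row.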

In [None]:
# Make sure you've installed these in your terminal before running the code
# pip install -U bertopic
# pip install -U safetensors

In [None]:
!pip install bertopic


In [None]:
!pip install safetensors

In [None]:
conda update numba numpy


In [None]:
!pip uninstall -y bertopic
!pip install bertopic


In [None]:
pip install tf-keras

In [None]:
import pandas as pd
from bertopic import BERTopic

In [None]:
# Make a clean df 
df_clean = df[df['abstract'].notna()].copy()

# DataFrame is called df and it has a column 'abstract'
docs = df_clean['abstract'].tolist()

# Load the pre-trained BERTopic model from Hugging Face
topic_model = BERTopic.load("MaartenGr/BERTopic_Wikipedia")

# Apply the model to your documents
topics, probs = topic_model.transform(docs)

# Add results back to your dataframe
df_clean['topic_id'] = topics
df_clean['topic_label'] = df_clean['topic_id'].apply(
    lambda x: topic_model.topic_labels_[x] if x != -1 and x < len(topic_model.topic_labels_) else "Unknown"
)


In [None]:
# Add topic_id and topic_label to the original DataFrame, defaulting to missing (pd.NA)
df['topic_id'] = pd.NA
df['topic_label'] = pd.NA

# Update only the rows that had non-null abstracts
df.loc[df['abstract'].notna(), 'topic_id'] = df_clean['topic_id'].values
df.loc[df['abstract'].notna(), 'topic_label'] = df_clean['topic_label'].values


In [None]:
# preview the topics 
display(topic_model.get_topic_info().head())  # Summary of topics


In [None]:
# visualize topics 
topic_model.visualize_topics()

# Semantic similarity analysis

In [None]:
!pip install sentence-transformers


In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np

# Work with non-null abstracts
df_clean = df[df['abstract'].notna()].copy()

# Convert to list
docs = df_clean['abstract'].tolist()

# Load embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Create sentence embeddings
embeddings = model.encode(docs, show_progress_bar=True)

# Add embeddings to the cleaned DataFrame
df_clean['embedding'] = list(embeddings)

# Add empty column to the original DataFrame
df['embedding'] = pd.NA

# Merge back into the original DataFrame
df.loc[df['abstract'].notna(), 'embedding'] = df_clean['embedding'].values


In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Compute cosine similarity matrix (this compares each doc with every other doc)
similarity_matrix = cosine_similarity(embeddings)
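
To make the similarity matrix concrete, here is a small sketch on hand-made vectors. The toy vectors are assumptions for illustration; the real embeddings from `all-MiniLM-L6-v2` have 384 dimensions.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Three toy "embeddings": the first two point in similar directions
toy_embeddings = np.array([
    [1.0, 0.0, 1.0],
    [2.0, 0.0, 2.1],
    [0.0, 1.0, 0.0],
])

toy_similarity = cosine_similarity(toy_embeddings)

# Entry [i, j] is the cosine of the angle between vectors i and j:
# the matrix is symmetric, and each vector has similarity 1 with itself.
manual_01 = toy_embeddings[0] @ toy_embeddings[1] / (
    np.linalg.norm(toy_embeddings[0]) * np.linalg.norm(toy_embeddings[1])
)
print(toy_similarity.round(3))
```

Because cosine similarity ignores vector length, the first two rows score close to 1 even though one is roughly twice as long as the other.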


# Simple recommendation function 

In [None]:
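def recommend_similar_papers(index, top_n=5):
    """Return the top_n papers most similar to the paper at position `index` in df_clean."""
    sim_scores = similarity_matrix[index]
    # Rank by descending similarity and drop the query paper itself
    ranked = np.argsort(sim_scores)[::-1]
    top_indices = ranked[ranked != index][:top_n]
    # similarity_matrix rows follow df_clean (non-null abstracts), not df
    return df_clean.iloc[top_indices][['title', 'abstract', 'topic_label']]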


In [None]:
recommend_similar_papers(10)  # Recommend papers similar to the paper at index 10
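
The function above recommends papers similar to an existing paper. To match against what a user wants, as described at the start of this notebook, the same idea can be applied to a free-text query. The following is a minimal sketch: the helper name `rank_by_query` and the toy vectors are hypothetical, and plain NumPy is used so the example runs without downloading a model.

```python
import numpy as np

def rank_by_query(query_embedding, doc_embeddings, top_n=5):
    """Return positions and scores of the top_n documents most similar to the query."""
    query = np.asarray(query_embedding, dtype=float)
    docs_matrix = np.asarray(doc_embeddings, dtype=float)
    # Cosine similarity equals the dot product of L2-normalised vectors
    query = query / np.linalg.norm(query)
    docs_matrix = docs_matrix / np.linalg.norm(docs_matrix, axis=1, keepdims=True)
    scores = docs_matrix @ query
    order = np.argsort(scores)[::-1][:top_n]
    return order, scores[order]

# Toy vectors standing in for sentence embeddings (illustrative only)
toy_docs = np.array([[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]])
toy_query = np.array([0.9, 0.1])
positions, scores = rank_by_query(toy_query, toy_docs, top_n=2)
print(positions, scores.round(3))
```

In this notebook the query embedding would come from the same sentence-transformers model, for example `model.encode([query_text])[0]`, with `embeddings` passed as `doc_embeddings`. The returned positions then index the DataFrame of non-null abstracts.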
