---
title: "BERTopic on Philosophy of Biology (from Philpapers)"
date: 2025-04-28
author:
  - name: Jacob Hamel-Mottiez
    id: jc
    orcid: 0009-0007-3666-908X
    email: jacob.hamel-mottiez.1@ulaval.ca
    affiliation: 
      - name: Laval University

execute: 
  enabled: true # This is so that Plotly is rendered. 
format: 
  html:
    code-fold: false
keywords:
  - Philosophy of Biology
  - Biology
  - Bibliometrics
  - Topic modeling
  - BERTopic

theme:
  #light: flatly
  dark : darkly


license: "CC BY"
copyright: 
  holder: Jacob Hamel-Mottiez
  year: 2024
funding: "The author received funding from the Social Sciences and Humanities Canadian Reseach Council (SSHCRC) as well as from the Fonds de recherche du Québec - Société et culture."
---

In [7]:
#Directory
PATH_TO_DATA = r"C:\Users\jacob\OneDrive - Université Laval\DATA\\"
PATH_TO_VIZ = r"C:\Users\jacob\OneDrive - Université Laval\biophilo\Visualisation\\"
# Packages to import. 
import pandas as pd 
import numpy as np
import datamapplot


from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
from umap import UMAP

## 📄 Data
This data is from *Biology and Philosophy* journal from 1986 to 2023. I removed all the entries with no abstract. 


In [85]:
#df = pd.read_csv("C:\\Users\\jacob\\OneDrive - Université Laval\\biophilo\\Data\\PHILOSOPHY_OF_BIOLOGY_ALL.csv")
#df = df.fillna("NULL")

#test
df_0 = pd.read_csv(r"C:\Users\jacob\OneDrive - Université Laval\biophilo\Data\2025-03-25_archive\PHILPAPERS_PHILO_DATA\philpapers_data_pyblio.csv")
df_0 = df_0.fillna("NULL")

df_01 = pd.read_csv(r"C:\Users\jacob\OneDrive - Université Laval\biophilo\Data\2025-03-25_archive\PHILPAPERS_PHILO_DATA\unclassified_philpapers_data_pyblio.csv")
df_01 = df_01.fillna("NULL")

df = pd.concat([df_0, df_01])




In [86]:
df_cleaned  = df[df['description'] != 'NULL']
df_cleaned

Unnamed: 0.1,Unnamed: 0,eid,doi,pii,pubmed_id,title,subtype,subtypeDescription,creator,afid,...,pageRange,description,authkeywords,citedby_count,openaccess,freetoread,freetoreadLabel,fund_acr,fund_no,fund_sponsor
0,0,2-s2.0-85209763600,10.1007/s11229-024-04815-5,,,Driftability and niche construction,ar,Article,Fábregas-Tejeda A.,60025063,...,,Niche construction is the process of organisms...,Drift | Driftability | Evolutionary causation ...,0,1,publisherhybridgold,Hybrid Gold,FWO,G070122N,Fonds Wetenschappelijk Onderzoek
1,1,2-s2.0-85202899987,10.1007/s10539-024-09957-x,,,Explanatory gaps in evolutionary theory,ar,Article,Aaby B.H.,60025063;60010348,...,,Proponents of the extended evolutionary synthe...,Evolutionary theory | Explanatory gap | Natura...,0,1,publisherhybridgold,Hybrid Gold,,3H210777,KU Leuven
2,2,2-s2.0-85175425123,10.1016/j.shpsa.2023.10.006,S0039368123001449,37907020.0,A Wolf in Sheep's Clothing: Idealisations and ...,ar,Article,Serpico D.,60021361;60015986,...,72-83,Research in pharmacogenomics and precision med...,Genetic prediction | GWAS | Idealisations | Po...,0,1,repositoryvor,Green,ERC,805498,European Research Council
3,3,2-s2.0-85152302150,10.1016/j.shpsa.2023.03.002,S0039368123000705,37068423.0,Joint representation: Modeling a phenomenon wi...,ar,Article,Yoshida Y.,60029445,...,67-76,Biologists often study particular biological s...,Collective cell migration | Generalization | M...,0,0,repositoryvor,Green,UMN,undefined,University of Minnesota
4,4,2-s2.0-85148606416,10.3390/philosophies8010008,,,"Turing’s Biological Philosophy: Morphogenesis,...",ar,Article,Greif H.,60003675,...,,Alan M. Turing’s last published work and some ...,Alan M. Turing | Darwinian evolution | D’Arcy ...,1,1,publisherfullgold,Gold,NCN,2020/37/B/HS1/01809,Narodowe Centrum Nauki
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3159,3159,2-s2.0-84898444008,10.1080/21550085.2014.885490,,,A Constructivist Approach Toward a General Def...,ar,Article,Meinard Y.,60068952;60012614;114099298,...,88-104,Biodiversity sciences witness a double dynamic...,economics | ecosystem management | philosophy ...,7,0,repositoryam,Green,FRJ,undefined,Fondation pour la recherche juridique
3160,3160,2-s2.0-27844440845,10.3197/096327105774462683,,,"The commons, game theory and aspects of human ...",ar,Article,Dodds W.K.,60000689,...,411-425,Fundamental aspects of human use of the enviro...,Commons | Game theory | Global environment | H...,15,0,,,,undefined,
3161,3161,2-s2.0-0036297383,10.3197/096327102129341073,,,The role of customary institutions in the cons...,ar,Article,Virtanen P.,60011170,...,227-241,Recently the role of customary local instituti...,Africa | Biodiversity | Conservation areas | L...,66,0,,,,undefined,
3162,3162,2-s2.0-0034962075,10.3197/096327101129340804,,,"Exotic species, naturalisation, and biological...",ar,Article,Hettinger N.,60013340,...,193-224,"Contrary to frequent characterisations, exotic...",Exotics | Native | Nativism | Naturalisation |...,55,0,,,,undefined,


In [88]:
from itables import show
unique_values = df_cleaned['subtypeDescription'].unique()

# Create a new DataFrame with unique values of 'citing_year'
unique_values_df = pd.DataFrame(unique_values, columns=['subtypeDescription'])

# If you want to count how many times each unique value appears in the original DataFrame
value_counts = df_cleaned['subtypeDescription'].value_counts()

# Optionally, you can merge the unique values with the count if needed
unique_values_df['Count'] = unique_values_df['subtypeDescription'].map(value_counts)

unique_values_df.rename(columns={'subtypeDescription': 'Document Type'}, inplace=True)
unique_values_df = unique_values_df.sort_values(by='Count', ascending=False)

# Display the new table (DataFrame)
show(unique_values_df)

Unnamed: 0,Document Type,Count
Loading ITables v2.2.4 from the internet... (need help?),,


In [None]:
df_cleaned  = df_cleaned[df_cleaned['subtypeDescription'] == 'Article']
df_cleaned

Unnamed: 0.1,Unnamed: 0,eid,doi,pii,pubmed_id,title,subtype,subtypeDescription,creator,afid,...,pageRange,description,authkeywords,citedby_count,openaccess,freetoread,freetoreadLabel,fund_acr,fund_no,fund_sponsor
0,0,2-s2.0-85209763600,10.1007/s11229-024-04815-5,,,Driftability and niche construction,ar,Article,Fábregas-Tejeda A.,60025063,...,,Niche construction is the process of organisms...,Drift | Driftability | Evolutionary causation ...,0,1,publisherhybridgold,Hybrid Gold,FWO,G070122N,Fonds Wetenschappelijk Onderzoek
1,1,2-s2.0-85202899987,10.1007/s10539-024-09957-x,,,Explanatory gaps in evolutionary theory,ar,Article,Aaby B.H.,60025063;60010348,...,,Proponents of the extended evolutionary synthe...,Evolutionary theory | Explanatory gap | Natura...,0,1,publisherhybridgold,Hybrid Gold,,3H210777,KU Leuven
2,2,2-s2.0-85175425123,10.1016/j.shpsa.2023.10.006,S0039368123001449,37907020.0,A Wolf in Sheep's Clothing: Idealisations and ...,ar,Article,Serpico D.,60021361;60015986,...,72-83,Research in pharmacogenomics and precision med...,Genetic prediction | GWAS | Idealisations | Po...,0,1,repositoryvor,Green,ERC,805498,European Research Council
3,3,2-s2.0-85152302150,10.1016/j.shpsa.2023.03.002,S0039368123000705,37068423.0,Joint representation: Modeling a phenomenon wi...,ar,Article,Yoshida Y.,60029445,...,67-76,Biologists often study particular biological s...,Collective cell migration | Generalization | M...,0,0,repositoryvor,Green,UMN,undefined,University of Minnesota
4,4,2-s2.0-85148606416,10.3390/philosophies8010008,,,"Turing’s Biological Philosophy: Morphogenesis,...",ar,Article,Greif H.,60003675,...,,Alan M. Turing’s last published work and some ...,Alan M. Turing | Darwinian evolution | D’Arcy ...,1,1,publisherfullgold,Gold,NCN,2020/37/B/HS1/01809,Narodowe Centrum Nauki
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3159,3159,2-s2.0-84898444008,10.1080/21550085.2014.885490,,,A Constructivist Approach Toward a General Def...,ar,Article,Meinard Y.,60068952;60012614;114099298,...,88-104,Biodiversity sciences witness a double dynamic...,economics | ecosystem management | philosophy ...,7,0,repositoryam,Green,FRJ,undefined,Fondation pour la recherche juridique
3160,3160,2-s2.0-27844440845,10.3197/096327105774462683,,,"The commons, game theory and aspects of human ...",ar,Article,Dodds W.K.,60000689,...,411-425,Fundamental aspects of human use of the enviro...,Commons | Game theory | Global environment | H...,15,0,,,,undefined,
3161,3161,2-s2.0-0036297383,10.3197/096327102129341073,,,The role of customary institutions in the cons...,ar,Article,Virtanen P.,60011170,...,227-241,Recently the role of customary local instituti...,Africa | Biodiversity | Conservation areas | L...,66,0,,,,undefined,
3162,3162,2-s2.0-0034962075,10.3197/096327101129340804,,,"Exotic species, naturalisation, and biological...",ar,Article,Hettinger N.,60013340,...,193-224,"Contrary to frequent characterisations, exotic...",Exotics | Native | Nativism | Naturalisation |...,55,0,,,,undefined,


In [115]:
df_cleaned.publicationName.unique()

array(['Not in top 10', 'Biology and Philosophy', 'Acta Biotheoretica',
       'Biological Theory', 'BioEssays',
       'Studies in History and Philosophy of Science Part C :Studies in History and Philosophy of Biological and Biomedical Sciences',
       'Journal of the History of Biology', 'Philosophy of Science',
       'Behavioral and Brain Sciences',
       'History and Philosophy of the Life Sciences', 'Biosemiotics'],
      dtype=object)

In [117]:
#df_cleaned[df_cleaned['publicationName'] == "Biology and Philosophy"]
#df_cleaned[df_cleaned['publicationName'] == "Biological Theory"]
df_cleaned[df_cleaned['publicationName'] =='History and Philosophy of the Life Sciences']


Unnamed: 0.1,Unnamed: 0,eid,doi,pii,pubmed_id,title,subtype,subtypeDescription,creator,afid,...,pageRange,description,authkeywords,citedby_count,openaccess,freetoread,freetoreadLabel,fund_acr,fund_no,fund_sponsor
263,263,2-s2.0-85013158199,10.1007/s40656-017-0129-2,,28205138.0,Kant’s epigenesis: specificity and development...,ar,Article,Demarest B.,60002483,...,,"In this paper, I argue that Kant adopted, thro...",Epigenesis | Generative force | Kant | Preform...,7,1,repositoryvor,Green,,undefined,
369,369,2-s2.0-85088262863,10.1007/s40656-020-00329-8,,32691291.0,What’s all the fuss about? The inheritance of ...,ar,Article,Camacho M.P.,60015457,...,,"The Central Dogma of molecular biology, which ...",Difference-making | Epigenetics | Gene-centris...,4,0,,,,undefined,
390,390,2-s2.0-84995794096,10.1007/s40656-016-0121-2,,27854053.0,Epigenetics: ambiguities and implications,ar,Article,Stotz K.,60019544;60010710,...,,"Everyone has heard of ‘epigenetics’, but the t...",Epigenesis | Epigenetic inheritance | Epigenet...,29,0,,,TWCF,TWCF0063/AB37,Templeton World Charity Foundation
622,622,2-s2.0-85180664555,10.1007/s40656-023-00600-8,,38153583.0,Resilience and the shift of paradigm in ecolog...,ar,Article,Barbara L.,60003668,...,,In the shift from the balance of nature to the...,Balance of nature | Ecological resilience | Fl...,1,1,publisherfree2read,Bronze,,undefined,
682,682,2-s2.0-85063972115,10.1007/s40656-019-0252-3,,30937631.0,Taxonomy and conservation science: interdepend...,ar,Article,Conix S.,60025063,...,,The relation between conservation science and ...,Conservation science | Species classification ...,14,0,,,,3H160214,KU Leuven
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2707,2707,2-s2.0-85063972115,10.1007/s40656-019-0252-3,,30937631.0,Taxonomy and conservation science: interdepend...,ar,Article,Conix S.,60025063,...,,The relation between conservation science and ...,Conservation science | Species classification ...,14,0,,,,3H160214,KU Leuven
2711,2711,2-s2.0-85058017987,10.1007/s40656-018-0236-8,,30523424.0,At the intersection of medical geography and d...,ar,Article,Arrizabalaga J.,60028181,...,,Environmental historians are not sufficiently ...,Disease ecology | Fernand Braudel | Historical...,10,1,publisherfree2read,Bronze,,undefined,
2716,2716,2-s2.0-85046898458,10.1007/s40656-018-0194-1,,29761370.0,Multispecies individuals,ar,Article,Bourrat P.,60025709;60019544;60010710,...,,We assess the arguments for recognising functi...,Evolutionary transitions in individuality | Ho...,30,0,,,CEED,DP0878650,Centre of Excellence for Environmental Decisio...
2733,2733,2-s2.0-84988353676,10.1007/s40656-016-0113-2,,27645228.0,From ecological records to big data: the inven...,ar,Article,Devictor V.,60108488;60021260,...,,This paper is a critical assessment of the epi...,Big data | Biodiversity | Ecology | Foucault |...,50,0,,,,undefined,


In [10]:
docs = df_cleaned.description.to_list()

df_cleaned['coverDate'] = pd.to_datetime(df_cleaned['coverDate'])
df_cleaned['year'] = df_cleaned['coverDate'].dt.year

columns_to_combine = ['creator', 'year', 'title']
df_cleaned['combined'] = df_cleaned[columns_to_combine].apply(lambda row: ', '.join(map(str, row)), axis=1)
node_text = df_cleaned.combined.to_list()

## 🌌 BERTopic model

In [12]:
# Pre-calculate embeddings
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(docs, show_progress_bar=True)

# Pre-reduce embeddings for visualization purposes
reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine', random_state=42).fit_transform(embeddings)

Batches:   0%|          | 0/196 [00:00<?, ?it/s]

In [13]:
# Define sub-models
from hdbscan import HDBSCAN
from umap import UMAP
umap_model = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine', random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=20, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

## Visualisation with datamapplot

In [16]:
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer



# Create a CountVectorizer with the custom stopwords
vectorizer_model = CountVectorizer(stop_words="english")



topic_model = BERTopic(
  # Sub-models
  embedding_model=embedding_model,
  umap_model=umap_model,
  hdbscan_model=hdbscan_model,
  #representation_model=representation_model,
  vectorizer_model=vectorizer_model,
  
  # Hyperparameters
  top_n_words=10,
  verbose=True
)

# Train model
topics, probs = topic_model.fit_transform(docs, embeddings)


2025-04-28 17:51:35,885 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-04-28 17:51:43,952 - BERTopic - Dimensionality - Completed ✓
2025-04-28 17:51:43,953 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-04-28 17:51:44,074 - BERTopic - Cluster - Completed ✓
2025-04-28 17:51:44,078 - BERTopic - Representation - Extracting topics from clusters using representation models.
2025-04-28 17:51:44,746 - BERTopic - Representation - Completed ✓


In [90]:
topic_info = topic_model.get_topic_info()
outlier_count = topic_info[topic_info['Topic'] == -1]['Count'].values[0]
print(f"Number of outliers: {outlier_count}")

Number of outliers: 2294


In [142]:
topic_model.visualize_heatmap()

In [143]:
x = topic_model.get_document_info(docs)
y = x.Name
labels =y.values
labels.shape
labels

excluded_topic = str(topic_info.Name[0]) 
clean_labels= [item.replace(excluded_topic, "Unlabelled") for item in labels] # To exclude -1 topic which is typically noise. 

In [144]:
excluded_topic = str(topic_info.Name[0]) 
clean_labels= [item.replace(excluded_topic, "Unlabelled") for item in labels] # To exclude -1 topic which is typically noise. 
clean_labels
clean_labels = pd.DataFrame(clean_labels, columns = ['Name'])


In [145]:
final_labels = pd.merge(clean_labels, claude_labels, on='Name', how='left')
final_labels['topics']
final_labels = final_labels.fillna("Unlabelled")
final_labels = pd.array(final_labels['topics'])

NameError: name 'claude_labels' is not defined

The code below is to add additional information like the citation count of each papers and also their keywords.

For more informations about how to integrate and style this information, consult https://datamapplot.readthedocs.io/en/latest/ and the according *Github*.

In [155]:
add_info = pd.DataFrame(
    {"citedby_count":df_cleaned.citedby_count, "keywords":df_cleaned.authkeywords.fillna('No_value')}
)

marker_size_array = np.log(1 + df_cleaned.citedby_count.values) # log for visibility
add_info

Unnamed: 0,citedby_count,keywords
0,0,Drift | Driftability | Evolutionary causation ...
1,0,Evolutionary theory | Explanatory gap | Natura...
2,0,Genetic prediction | GWAS | Idealisations | Po...
3,0,Collective cell migration | Generalization | M...
4,1,Alan M. Turing | Darwinian evolution | D’Arcy ...
...,...,...
3159,7,economics | ecosystem management | philosophy ...
3160,15,Commons | Game theory | Global environment | H...
3161,66,Africa | Biodiversity | Conservation areas | L...
3162,55,Exotics | Native | Nativism | Naturalisation |...


In [None]:
hover_text_template = """
<div>
    <p> <strong> Title </strong>: {hover_text}</p>
    <p> <strong> Citation Count </strong>: {citedby_count}</p>
    <p> <strong> Keywords </strong>: {keywords}</p>
</div>
</div>
"""
badge_css = """
    border-radius:6px; 
    width:fit-content; 
    max-width:75%; 
    margin:2px; 
    padding: 2px 10px 2px 10px; 
    font-size: 10pt;
"""
hover_text_template = f"""
<div>
    <div style="font-size:12pt;padding:2px;">{{hover_text}}</div>
    <div style="background-color:#525356;color:#fff;{badge_css}">{{keywords}}</div>
    <div style="background-color:#eeeeeeff;{badge_css}">Citation count: {{citedby_count}}</div>
</div>
""" 

In [170]:
import datamapplot.selection_handlers
#df_cleaned['year'] = pd.to_datetime(df_cleaned['citing_year'].astype(str), format='%Y')
import glasbey
palette = glasbey.create_palette(20, chroma_bounds=(20,75), lightness_bounds=(20,60))

plot = datamapplot.create_interactive_plot(
    reduced_embeddings,
    clean_labels,
    #hover_text= node_text,
    font_family="Cinzel",
    enable_search=True,
    #inline_data=False,
    initial_zoom_fraction=0.9,
    #offline_data_prefix="cord-large-cmaps-1",
    #extra_point_data= add_info,
    #hover_text_html_template = hover_text_template,
    #marker_size_array=marker_size_array,
    #selection_handler=datamapplot.selection_handlers.DisplaySample(n_samples=25),
    #colormaps={"Type": df_cleaned.subtypeDescription, "Journal": df_cleaned.publicationName},
    #colormap_rawdata=[df_cleaned.subtypeDescription, df_cleaned.publicationName], #final_labels_legend.Name_Claude],
    #colormap_metadata=[
        #{"field": "Type", "description": "Type", "cmap": "Accent", "kind": "datetime"},
        #{"field": "Journal", "description": "Journal", "cmap": "Dark2_r", "kind": "continuous"},
    #],
    offline_mode=True,
    )
plot

ValueError: 2

In [None]:
# Use the countries list from the input document
countries = np.array(df_cleaned.affiliation_country)

def get_continent(country):
    continents = {
        'North America': ['United States', 'Canada', 'Mexico'],
        'South America': ['Brazil', 'Argentina', 'Chile', 'Colombia', 'Uruguay', 'Venezuela'],
        'Europe': ['United Kingdom', 'Germany', 'France', 'Spain', 'Italy', 'Netherlands', 'Poland', 
                   'Croatia', 'Czech Republic', 'Denmark', 'Romania', 'Austria', 'Belgium', 'Greece', 
                   'Ireland', 'Norway', 'Portugal', 'Russian Federation', 'Slovakia', 'Sweden', 'Switzerland', 
                   'Hungary', 'Finland', 'Luxembourg', 'Estonia', 'Lithuania', 'Slovenia', 'Iceland', 'Serbia', 'Cyprus'],
        'Asia': ['China', 'Israel', 'United Arab Emirates', 'Japan', 'Turkey', 'Kazakhstan', 
                 'Bangladesh', 'Malaysia', 'India', 'Pakistan', 'Taiwan', 'Singapore', 'South Korea', 
                 'Hong Kong', 'Indonesia', 'Philippines'],
        'Africa': ['Sierra Leone', 'South Africa', 'Kenya', 'Egypt'],
        'Oceania': ['Australia', 'New Zealand']
    }
    
    # If multiple countries are listed, check for unique countries
    unique_countries = set(country.split(';'))
    
    # Handle NULL or empty cases
    if 'NULL' in unique_countries or len(unique_countries) == 0:
        return 'Unknown'
    
    # If multiple unique countries
    if len(unique_countries) > 1:
        return 'Collaboration'
    
    # If only one unique country
    country_name = list(unique_countries)[0]
    for continent, countries in continents.items():
        if country_name in countries:
            return continent
    return 'Unknown'

# Create original DataFrame
df_countries = pd.DataFrame({'Original Countries': countries})

# Add Continent column
df_countries['Continent'] = df_countries['Original Countries'].apply(get_continent)

print(df_countries)
print(f"\nTotal entries: {len(df_countries)}")
print(f"Breakdown by continent:\n{df_countries['Continent'].value_counts()}")

             Original Countries      Continent
0                       Belgium         Europe
1                Belgium;Norway  Collaboration
2                  Poland;Italy  Collaboration
3                 United States  North America
4                        Poland         Europe
...                         ...            ...
6267  France;Switzerland;France  Collaboration
6268              United States  North America
6269                    Finland         Europe
6270              United States  North America
6271                     Canada  North America

[6272 rows x 2 columns]

Total entries: 6272
Breakdown by continent:
Continent
North America    2352
Europe           2132
Collaboration     872
Asia              284
Unknown           265
Oceania           265
South America      81
Africa             21
Name: count, dtype: int64


In [None]:
df_cleaned["coverDate"] = pd.to_datetime(df_cleaned["coverDate"])
date = pd.array(df_cleaned.coverDate)


journal_top = df_cleaned
top_10_journals = journal_top["publicationName"].value_counts().nlargest(10).index

In [None]:
# Replace journal names not in the top 10 with "Not in top 10"
journal_top["publicationName"] = journal_top["publicationName"].where(journal_top["publicationName"].isin(top_10_journals), "Not in top 10")

In [162]:
plot = datamapplot.create_interactive_plot(
    reduced_embeddings,
    clean_labels,
    #hover_text= node_text,
    font_family="Cinzel",
    enable_search=True,
    #inline_data=False,
    initial_zoom_fraction=0.9,
    #offline_data_prefix="cord-large-cmaps-1",
    extra_point_data= add_info,
    hover_text_html_template = hover_text_template,
    marker_size_array=marker_size_array,
    selection_handler=datamapplot.selection_handlers.DisplaySample(n_samples=25),
    #colormaps={"Collaboration": df_countries.Continent, "Type": df_cleaned.subtypeDescription, "Journal": journal_top.publicationName},
    histogram_data = df_cleaned.coverDate,
    title = f"Philosophy of Biology",  
    sub_title= f"From <i>PhilPapers</i> (n = 6,272)",
    histogram_n_bins = 35,
    histogram_settings={
    "histogram_log_scale":False,
        "histogram_title":"Publication Year",
        "histogram_bin_fill_color":"#282a36",
        "histogram_bin_unselected_fill_color":"#b5b5b9",
        "histogram_bin_selected_fill_color":"#f68571",
        "histogram_width":400,
        "histogram_height":75,
    },
    
)
plot


ValueError: 2

In [None]:
timestamps = df_cleaned.coverDate.to_list()
topics_over_time = topic_model.topics_over_time(docs, timestamps, nr_bins= 100)
topics_over_time


topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=5)
topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=5)

73it [00:04, 14.78it/s]


## Save the map

In [None]:
plot.save(PATH_TO_VIZ + "PHILPAPERS_all_philo_of_biology_articles_BERTopic.html")


In [None]:
from llama_cpp import Llama

# Use llama.cpp to load in a Quantized LLM
llm = Llama(model_path="C:/Users/jacob/OneDrive/Bureau/openhermes-2.5-mistral-7b.Q4_K_M.gguf", n_gpu_layers=-1, n_ctx=4096, stop=["Q:", "\n"])

llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from C:/Users/jacob/OneDrive/Bureau/openhermes-2.5-mistral-7b.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = teknium_openhermes-2.5-mistral-7b
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_load

In [None]:
from bertopic.representation import KeyBERTInspired, LlamaCPP

prompt = """ Q:
I have a topic that contains the following keywords: '[KEYWORDS]'.

Based on the above information, can you give a short label of the topic of at most 5 words?
A:
"""

representation_model = {
    "KeyBERT": KeyBERTInspired(),
    "LLM": LlamaCPP(llm, prompt=prompt),
}

NameError: name 'llm' is not defined

In [None]:
# Pre-calculate embeddings
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer('BAAI/bge-small-en')
embeddings = embedding_model.encode(docs, show_progress_bar=True)

# Pre-reduce embeddings for visualization purposes
reduced_embeddings = UMAP(n_neighbors=150, n_components=2, min_dist=0.0, metric='cosine', random_state=42).fit_transform(embeddings)

NameError: name 'docs' is not defined

In [None]:
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

# Define a custom list of stopwords (or use an extended one)
custom_stopwords = ["the", "and", "or", "in", "on", "at", "of", "is", "to", "be", "with", "are", "that", "this", "by", "for"]

# Create a CountVectorizer with the custom stopwords
vectorizer = CountVectorizer(stop_words=custom_stopwords)



topic_model = BERTopic(
  # Sub-models
  embedding_model=embedding_model,
  umap_model=umap_model,
  hdbscan_model=hdbscan_model,
  representation_model=representation_model,
  vectorizer_model=vectorizer,
  
  # Hyperparameters
  top_n_words=10,
  verbose=True
)

# Train model
topics, probs = topic_model.fit_transform(docs, embeddings)

NameError: name 'embedding_model' is not defined

# Biology from BioArxiv

In [None]:
#Directory
PATH_TO_DATA = r"C:\Users\jacob\OneDrive - Université Laval\DATA\\"
PATH_TO_VIZ = r"C:\Users\jacob\OneDrive - Université Laval\biophilo\Visualisation\\"
# Packages to import. 
import pandas as pd 
import numpy as np
import datamapplot


from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
from umap import UMAP

In [None]:
#df = pd.read_csv("C:\\Users\\jacob\\OneDrive - Université Laval\\biophilo\\Data\\PHILOSOPHY_OF_BIOLOGY_ALL.csv")


#test
df = pd.read_csv("C:/Users/jacob/OneDrive - Université Laval/biophilo/Data/BIO_ARXIV_DATA/all_bio_arxiv.csv")
df = df.fillna("NULL")

In [None]:
df_cleaned  = df[df['preprint_abstract'] != 'NULL']
#df_cleaned  = df_cleaned[df_cleaned['subtypeDescription'] == 'Article']
df_cleaned = df_cleaned.reset_index(drop=True)
df_cleaned

Unnamed: 0,preprint_doi,published_doi,published_journal,preprint_platform,preprint_title,preprint_authors,preprint_category,preprint_date,published_date,preprint_abstract,preprint_author_corresponding,preprint_author_corresponding_institution
0,10.1101/001081,10.1093/bioinformatics/btu121,Bioinformatics,bioRxiv,PyRAD: assembly of de novo RADseq loci for phy...,Deren A. R. Eaton;,Bioinformatics,2013-12-03,2014-03-05,Restriction-site associated genomic markers ar...,Deren A. R. Eaton,University of Chicago
1,10.1101/001297,10.1371/journal.pone.0085203,PLOS ONE,bioRxiv,Aerodynamic characteristics of a feathered din...,Dennis Evangelista;Griselda Cardona;Eric Guent...,Biophysics,2013-12-10,2014-01-15,We report the effects of posture and morpholog...,Dennis Evangelista,UC Berkeley
2,10.1101/000422,10.3389/fgene.2014.00013,Frontiers in Genetics,bioRxiv,On the optimal trimming of high-throughput mRN...,Matthew D MacManes;,Bioinformatics,2013-11-14,2014-01-31,The widespread and rapid adoption of high-thro...,Matthew D MacManes,University of New Hampshire
3,10.1101/001396,10.1162/NECO_a_00568,Neural Computation,bioRxiv,Parametric inference in the large data limit u...,Justin B. Kinney;Gurinder S. Atwal;,Biophysics,2013-12-13,2014-03-10,Motivated by data-rich experiments in transcri...,Justin B. Kinney,Cold Spring Harbor Laboratory
4,10.1101/002980,10.1016/j.bpj.2014.01.012,Biophysical Journal,bioRxiv,Genetic drift suppresses bacterial conjugation...,Peter D. Freese;Kirill S. Korolev;Jose I Jimen...,Biophysics,2014-02-24,2014-02-18,Conjugation is the primary mechanism of horizo...,Irene A. Chen,Univ. of California - Santa Barbara
...,...,...,...,...,...,...,...,...,...,...,...,...
256833,10.1101/2023.05.11.540369,10.1073/pnas.2309306120,Proceedings of the National Academy of Sciences,bioRxiv,RAD51-mediated R-loop formation acts to repair...,"Girasol, M. J.; Krasilnikova, M.; Marques, C. ...",microbiology,2023-05-11,2023-11-21,RNA-DNA hybrids are epigenetic features of all...,Richard McCulloch,University of Glasgow
256834,10.1101/2023.06.09.544423,10.1002/jez.b.23236,Journal of Experimental Zoology Part B: Molecu...,bioRxiv,3D spheroid culturing of Astyanax mexicanus li...,"Biswas, T.; Rajendran, N.; Hassan, H.; Zhao, C...",evolutionary biology,2023-06-10,2024-01-08,In vitro assays are crucial tools for gaining ...,Nicolas Rohner,"Stowers Institute for Medical Research, Univer..."
256835,10.1101/2024.01.23.576905,10.1016/j.vaccine.2024.04.073,Vaccine,bioRxiv,Evaluation of Precision of the Plasmodium know...,"Mertens, J. E.; Rigby, C. A.; Bardelli, M.; Qu...",immunology,2024-01-24,2024-05-03,Recent data indicate increasing disease burden...,Kazutoyo Miura,NIAID/NIH
256836,10.1101/2023.09.27.557722,10.1371/journal.pbio.3002711,PLOS Biology,bioRxiv,Working together to control mutation: how coll...,"Green, R.; Wang, H.; Botchey, C.; Zhang, N.; W...",evolutionary biology,2023-09-27,2024-07-15,Mutagenesis is responsive to many environmenta...,Christopher G. Knight,University of Manchester


In [None]:
docs = df_cleaned.preprint_abstract.to_list()

df_cleaned['preprint_date'] = pd.to_datetime(df_cleaned['preprint_date'])
df_cleaned['year'] = df_cleaned['preprint_date'].dt.year

columns_to_combine = ['preprint_authors', 'year', 'preprint_title']
df_cleaned['combined'] = df_cleaned[columns_to_combine].apply(lambda row: ', '.join(map(str, row)), axis=1)
node_text = df_cleaned.combined.to_list()

In [None]:
# Pre-calculate embeddings
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer('BAAI/bge-small-en')
embeddings = embedding_model.encode(docs, show_progress_bar=True)

# Pre-reduce embeddings for visualization purposes
#reduced_embeddings = UMAP(n_neighbors=150, n_components=2, min_dist=0.0, metric='cosine', random_state=42).fit_transform(embeddings)

Batches:   0%|          | 0/8027 [00:00<?, ?it/s]

KeyboardInterrupt: 

In [None]:
# Define sub-models
from hdbscan import HDBSCAN
from umap import UMAP
umap_model = UMAP(n_neighbors=100, n_components=2, min_dist=0.0, metric='cosine', random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=100, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

In [None]:
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

# Define a custom list of stopwords (or use an extended one)
custom_stopwords = ["the", "and", "or", "in", "on", "at", "of", "is", "to", "be", "with", "are", "that", "this", "by", "for"]

# Create a CountVectorizer with the custom stopwords
vectorizer = CountVectorizer(stop_words=custom_stopwords)



topic_model = BERTopic(
  # Sub-models
  embedding_model=embedding_model,
  umap_model=umap_model,
  hdbscan_model=hdbscan_model,
  #representation_model=representation_model,
  vectorizer_model=vectorizer,
  
  # Hyperparameters
  top_n_words=10,
  verbose=True
)

# Train model
topics, probs = topic_model.fit_transform(docs, embeddings)

NameError: name 'embeddings' is not defined