# One-step topic modeling

We perform topic analysis on a dataset of arXiv preprints, based on their titles and abstracts.

## The data set
Our dataset contains the metadata from a uniform sample of 20,000 papers among those with subject tags in the following list:

- math.DS : Dynamical systems
- math.AP : Analysis of PDEs
- math.MP : Mathematical Physics
- math.DG : Differential Geometry
- math.PR : Probability

We use the BERTopic model. BERTopic performs topic analysis in 3 steps:
- step 1) Use a vector embedding model to convert the titles and abstracts of the arXiv papers into vectors. In this notebook, we use the sentence transformer (SBERT) 'all-MiniLM-L6-v2'.
- step 2) Reduce the dimension of the vectors using UMAP. This step is stochastic. To make our results reproducible, we set random_state = 623.
- step 3) Use HDBSCAN to obtain topic clusters. HDBSCAN is a hierarchical, density-based clustering algorithm.

## Strategy
To get the best BERTopic model, we should tune the hyperparameters of UMAP and HDBSCAN. However, there are too many hyperparameters to tune them all directly. Thus, we first tune the UMAP hyperparameters n_neighbors and n_components, and then indirectly tune the hyperparameters of HDBSCAN using the built-in reduce_topics and reduce_outliers methods. To assess how well our models perform, we have them classify arXiv papers that they have not seen during training and
- compute the ratio of papers that the model classifies as outliers. We want this value to be low.
- check whether the classification is valid for the papers that have not been classified as outliers.
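
The first criterion can be sketched in plain Python. This is a minimal illustration, assuming the topic assignments are given as a sequence of integers in which -1 marks an outlier (the convention BERTopic uses). The function name `outlier_ratio` is hypothetical and is not part of the notebook.

```python
def outlier_ratio(topics):
    """Return the fraction of documents labelled as outliers (topic -1)."""
    topics = list(topics)  # accept lists, tuples, or array-like sequences
    if not topics:
        raise ValueError("topics must contain at least one assignment")
    num_outliers = sum(1 for t in topics if t == -1)
    return num_outliers / len(topics)


# toy example (made-up labels): two of five documents are outliers
example_topics = [-1, 3, 0, -1, 7]
print(outlier_ratio(example_topics))
```

Counting with a generator rather than `list.count` keeps the helper usable whether the assignments arrive as a list or as an array.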

## Table of contents

- Section 1. Tuning UMAP hyperparameters
- Section 2. Reducing the number of topics and outliers

First we install and load libraries.

In [None]:
!pip install bertopic

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
!pip install arxiv

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
import os
from google.colab import drive

drive.mount('/content/drive',force_remount = True)
os.chdir('/content/drive/MyDrive/Arxiv_Recommender')
!pwd

Mounted at /content/drive
/content/drive/MyDrive/Arxiv_Recommender


In [None]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

import pandas as pd
import numpy as np
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP
from hdbscan import HDBSCAN

In [None]:
# load dataframe containing metadata of 20k arxiv papers
df_lib = pd.read_parquet('data/raw_data/filter_20k.parquet')
# make a list whose entries are single strings concatenating the title (str) and abstract (str) of an arxiv paper
lib_abs = (df_lib.title + ' ' + df_lib.abstract).to_list()
len(lib_abs)

20000

# Section 1. Tuning UMAP hyperparameters
----------------

In this section we tune the UMAP hyperparameters of our BERTopic model by training it with varying values of n_neighbors and n_components.

## - Train the models varying hyperparameters -

In [None]:
lib_vecs = pd.read_parquet('data/vector_embeddings/df_lib_vecs_20k_sbert.parquet').values

In [None]:
# candidate values for the UMAP hyperparameters
n_neighbors_candidates = [5,15,50]
n_components_candidates = [2,5,10]
# CountVectorizer will be used for finding keywords of the topics our BERTopic model found.
vectorizer_model = CountVectorizer(ngram_range=(2, 3), 
                                   stop_words="english")
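
The two candidate lists define a grid of 3 × 3 = 9 hyperparameter combinations. As a minimal sketch, the grid can be enumerated with `itertools.product`, together with the model names that follow the naming scheme used when the models are saved. The helper `model_name` is hypothetical and only illustrates that scheme.

```python
from itertools import product

n_neighbors_candidates = [5, 15, 50]
n_components_candidates = [2, 5, 10]


def model_name(n_neighbors, n_components):
    """Build a model name following the notebook's naming scheme (hypothetical helper)."""
    return ("bertopic_20k_sbert_umap_hdbscan"
            f"-n_neighbors{n_neighbors}-n_components{n_components}")


# every (n_neighbors, n_components) pair, in the same order as two nested loops
grid = list(product(n_neighbors_candidates, n_components_candidates))
names = [model_name(n, m) for n, m in grid]

print(len(grid))
print(names[0])
```

Keeping the name in a single helper avoids retyping the same f-string wherever a model, topic file, or dev-set file is saved or loaded.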

In [None]:
# we vary the UMAP hyperparameters n_neighbors and n_components using for loop
for n_neighbors in n_neighbors_candidates :
  for n_components in n_components_candidates :
    # initializing UMAP model that will be used in BERTopic model
    umap_model = UMAP(n_neighbors=n_neighbors, 
                      n_components=n_components,
                      min_dist = 0, 
                      metric='cosine', 
                      random_state = 623,
                      low_memory=False)
    # initializing BERTopic model
    bertopic_model = BERTopic(embedding_model = 'all-MiniLM-L6-v2',
                              umap_model = umap_model,
                              vectorizer_model=vectorizer_model, 
                              calculate_probabilities=False,
                              verbose = True)
    # lib_topics is the list of topic numbers that the BERTopic model assigned to the arxiv papers
    lib_topics, _ = bertopic_model.fit_transform(lib_abs, lib_vecs)
    # save the trained BERTopic model
    bertopic_model.save(f'models/bertopic_20k_sbert_umap_hdbscan-n_neighbors{n_neighbors}-n_components{n_components}')	
    # save the lib_topics by converting it into dataframe
    # to save the file in parquet format, all column names should be string datatype 
    df_lib_topics = pd.DataFrame(lib_topics)
    df_lib_topics.columns = df_lib_topics.columns.astype(str)
    df_lib_topics.to_parquet(f'data/topics/df_lib_topics_20k_sbert_umap_hdbscan-n_neighbors{n_neighbors}-n_components{n_components}.parquet')
    # save the sub-dataframe of df_lib (the dataframe of arxiv paper metadata)
    # consisting of the papers that the BERTopic model classified as outliers
    mask = (df_lib_topics == -1).values.flatten()
    df_outliers = df_lib[mask]
    df_outliers.to_parquet(f'data/outliers/df_outliers_20k_all-MiniLM-L6-v2_umap_hdbscan-n_neighbors{n_neighbors}-n_components{n_components}.parquet')

2023-06-02 14:02:57,882 - BERTopic - Reduced dimensionality
2023-06-02 14:02:59,368 - BERTopic - Clustered reduced embeddings
2023-06-02 14:04:02,422 - BERTopic - Reduced dimensionality
2023-06-02 14:04:03,523 - BERTopic - Clustered reduced embeddings
2023-06-02 14:05:01,196 - BERTopic - Reduced dimensionality
2023-06-02 14:05:02,302 - BERTopic - Clustered reduced embeddings
2023-06-02 14:05:58,793 - BERTopic - Reduced dimensionality
2023-06-02 14:05:59,289 - BERTopic - Clustered reduced embeddings
2023-06-02 14:06:55,229 - BERTopic - Reduced dimensionality
2023-06-02 14:06:55,995 - BERTopic - Clustered reduced embeddings
2023-06-02 14:07:57,118 - BERTopic - Reduced dimensionality
2023-06-02 14:07:58,099 - BERTopic - Clustered reduced embeddings
2023-06-02 14:09:03,875 - BERTopic - Reduced dimensionality
2023-06-02 14:09:04,370 - BERTopic - Clustered reduced embeddings
2023-06-02 14:10:13,584 - BERTopic - Reduced dimensionality
2023-06-02 14:10:14,434 - BERTopic - Clustered reduced emb

## - Data visualization -

In [None]:
# make a dictionary whose keys are BERTopic model names and values are the corresponding BERTopic models
import glob
model_files = glob.glob(os.path.join('models',"*"))
bertopic_models = {f.split('/')[-1] : BERTopic.load(f) for f in model_files}
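
The dictionary keys encode the hyperparameters in the file name. The sketch below shows one way to recover them from a saved-model path, under the assumption that every file in `models/` follows the naming scheme used above. It uses `os.path.basename` instead of `f.split('/')`, which does not depend on the path separator. The function `parse_hyperparameters` is hypothetical.

```python
import os
import re

# matches the suffix of the naming scheme used when the models were saved
_NAME_PATTERN = re.compile(r"n_neighbors(\d+)-n_components(\d+)$")


def parse_hyperparameters(path):
    """Extract (n_neighbors, n_components) from a saved-model path."""
    name = os.path.basename(os.path.normpath(path))
    match = _NAME_PATTERN.search(name)
    if match is None:
        raise ValueError(f"unrecognised model name: {name!r}")
    return int(match.group(1)), int(match.group(2))


path = "models/bertopic_20k_sbert_umap_hdbscan-n_neighbors15-n_components10"
print(parse_hyperparameters(path))
```

Raising on an unrecognised name makes stray files in the folder visible instead of silently loading them as models.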

In [None]:
bertopic_models

{'bertopic_20k_sbert_umap_hdbscan-n_neighbors5-n_components2': <bertopic._bertopic.BERTopic at 0x7f3369a3acb0>,
 'bertopic_20k_sbert_umap_hdbscan-n_neighbors5-n_components5': <bertopic._bertopic.BERTopic at 0x7f30e254c850>,
 'bertopic_20k_sbert_umap_hdbscan-n_neighbors5-n_components10': <bertopic._bertopic.BERTopic at 0x7f303cb77fd0>,
 'bertopic_20k_sbert_umap_hdbscan-n_neighbors15-n_components2': <bertopic._bertopic.BERTopic at 0x7f3317f3d9c0>,
 'bertopic_20k_sbert_umap_hdbscan-n_neighbors15-n_components5': <bertopic._bertopic.BERTopic at 0x7f3369a3af50>,
 'bertopic_20k_sbert_umap_hdbscan-n_neighbors15-n_components10': <bertopic._bertopic.BERTopic at 0x7f30e1161660>,
 'bertopic_20k_sbert_umap_hdbscan-n_neighbors50-n_components2': <bertopic._bertopic.BERTopic at 0x7f2fdcbc56c0>,
 'bertopic_20k_sbert_umap_hdbscan-n_neighbors50-n_components5': <bertopic._bertopic.BERTopic at 0x7f2fdcddcd90>,
 'bertopic_20k_sbert_umap_hdbscan-n_neighbors50-n_components10': <bertopic._bertopic.BERTopic at 

In [None]:
# for each model, check how many documents have been classified as 
# outliers and the number of total topic clusters
for n in n_neighbors_candidates :
  for m in n_components_candidates :
    bertopic_model = bertopic_models[f'bertopic_20k_sbert_umap_hdbscan-n_neighbors{n}-n_components{m}']
    df_lib_topic_freqs = bertopic_model.get_topic_freq()
    num_lib_outliers = df_lib_topic_freqs['Count'][df_lib_topic_freqs['Topic']==-1].iloc[0]
    print(f'For n_neighbors={n} and n_components = {m},')
    print(f"{num_lib_outliers} documents have not been classified")
    print(f"The other {df_lib_topic_freqs['Count'].sum() - num_lib_outliers} documents are {df_lib_topic_freqs['Topic'].shape[0]-1} topics\n\n")

For n_neighbors=5 and n_components = 2,
6534 documents have not been classified
The other 13466 documents are 335 topics


For n_neighbors=5 and n_components = 5,
7456 documents have not been classified
The other 12544 documents are 347 topics


For n_neighbors=5 and n_components = 10,
7483 documents have not been classified
The other 12517 documents are 351 topics


For n_neighbors=15 and n_components = 2,
7915 documents have not been classified
The other 12085 documents are 222 topics


For n_neighbors=15 and n_components = 5,
9178 documents have not been classified
The other 10822 documents are 227 topics


For n_neighbors=15 and n_components = 10,
9491 documents have not been classified
The other 10509 documents are 219 topics


For n_neighbors=50 and n_components = 2,
8744 documents have not been classified
The other 11256 documents are 187 topics


For n_neighbors=50 and n_components = 5,
11104 documents have not been classified
The other 8896 documents are 167 topics


For n_nei

Here we notice that increasing either n_neighbors or n_components increases the number of unclassified documents, which we want to avoid.
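
To make the comparison easier to read, the counts printed above can be converted to fractions of the 20,000 training documents. The sketch below copies the eight counts visible in the output. The ninth configuration is left out because its output is truncated. The variable names are illustrative only.

```python
# outlier counts copied from the output above, keyed by (n_neighbors, n_components);
# the last configuration is omitted because its output is truncated
outlier_counts = {
    (5, 2): 6534, (5, 5): 7456, (5, 10): 7483,
    (15, 2): 7915, (15, 5): 9178, (15, 10): 9491,
    (50, 2): 8744, (50, 5): 11104,
}
total_documents = 20000

outlier_fractions = {k: v / total_documents for k, v in outlier_counts.items()}
for (n, m), frac in outlier_fractions.items():
    print(f"n_neighbors={n:>2}, n_components={m:>2}: {frac:.1%} outliers")

# configuration with the smallest share of outliers among those listed
best = min(outlier_fractions, key=outlier_fractions.get)
print("lowest outlier fraction:", best)
```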

In [None]:
# for each model, we visualize topics
# the size of the circle tells you how many documents are in that cluster
for n in n_neighbors_candidates :
  for m in n_components_candidates :
    bertopic_model = bertopic_models[f'bertopic_20k_sbert_umap_hdbscan-n_neighbors{n}-n_components{m}']
    print(f'For n_neighbors={n} and n_components = {m},')
    bertopic_model.visualize_topics().show()
    print('\n\n')


n_neighbors = 5, n_components = 2
<img src="images/visualize_topics-5-2.png" alt="n_neighbors = 5, n_components = 2" width="800"/>

n_neighbors = 5, n_components = 5
<img src="images/visualize_topics-5-5.png" alt="n_neighbors = 5, n_components = 5" width="800"/>

n_neighbors = 5, n_components = 10
<img src="images/visualize_topics-5-10.png" alt="n_neighbors = 5, n_components = 10" width="800"/>


n_neighbors = 15, n_components = 2
<img src="images/visualize_topics-15-2.png" alt="n_neighbors = 15, n_components = 2" width="800"/>


n_neighbors = 15, n_components = 5
<img src="images/visualize_topics-15-5.png" alt="n_neighbors = 15, n_components = 5" width="800"/>


n_neighbors = 50, n_components = 2
<img src="images/visualize_topics-50-2.png" alt="n_neighbors = 50, n_components = 2" width="800"/>


n_neighbors = 50, n_components = 5
<img src="images/visualize_topics-50-5.png" alt="n_neighbors = 50, n_components = 5" width="800"/>

n_neighbors = 50, n_components = 10
<img src="images/visualize_topics-50-10.png" alt="n_neighbors = 50, n_components = 10" width="800"/>

In [None]:
# visualizing documents
# each point is a document
# topic clusters are labeled by colors
for n in n_neighbors_candidates :
  for m in n_components_candidates :
    bertopic_model = bertopic_models[f'bertopic_20k_sbert_umap_hdbscan-n_neighbors{n}-n_components{m}']
    print(f'For n_neighbors={n} and n_components = {m},')
    bertopic_model.visualize_documents(lib_abs, 
                                       hide_document_hover=True, 
                                       hide_annotations=True).show()
    print('\n\n')



n_neighbors = 5, n_components = 2
<img src="images/visualize_documents-5-2.png" alt="n_neighbors = 5, n_components = 2" width="800"/>


n_neighbors = 5, n_components = 5
<img src="images/visualize_documents-5-5.png" alt="n_neighbors = 5, n_components = 5" width="800"/>


n_neighbors = 5, n_components = 10
<img src="images/visualize_documents-5-10.png" alt="n_neighbors = 5, n_components = 10" width="800"/>


n_neighbors = 15, n_components = 2
<img src="images/visualize_documents-15-2.png" alt="n_neighbors = 15, n_components = 2" width="800"/>


n_neighbors = 15, n_components = 5
<img src="images/visualize_documents-15-5.png" alt="n_neighbors = 15, n_components = 5" width="800"/>


n_neighbors = 15, n_components = 10
<img src="images/visualize_documents-15-10.png" alt="n_neighbors = 15, n_components = 10" width="800"/>


n_neighbors = 50, n_components = 2
<img src="images/visualize_documents-50-2.png" alt="n_neighbors = 50, n_components = 2" width="800"/>


n_neighbors = 50, n_components = 5
<img src="images/visualize_documents-50-5.png" alt="n_neighbors = 50, n_components = 5" width="800"/>


n_neighbors = 50, n_components = 10
<img src="images/visualize_documents-50-10.png" alt="n_neighbors = 50, n_components = 10" width="800"/>

Consistent with the above result, increasing n_neighbors increases the number of grey dots, which are the documents left unclassified by the models.

## - Evaluate the performances of the models -

In this section, we check how our models classify papers that they have not seen during training.

In [None]:
# add a column with the topic keywords (topic name) for each topic number in the first column of df
def get_keyword_col(df, model) :
  topic_info = model.get_topic_info()
  df['keywords'] = df.iloc[:,0].apply(lambda x : topic_info.loc[topic_info.Topic == x].Name.iloc[0])
  return df
# add a column with the documents that are most representative of each assigned topic
def get_rep_doc_col(df, model) :
  df['representative_doc'] = df.iloc[:,0].apply(model.get_representative_docs)
  return df

In [None]:
import sys
sys.path.append('')
import arxiv
import base_model
from data_utils import clean_data

In [None]:
# this is the dev set against which we will test our models
df_dev = pd.read_csv('data/dev_sets/full_dev_set.csv')

In [None]:
df_dev.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,entry_id,updated,published,title,summary,comment,journal_ref,doi,primary_category,categories,pdf_url,authors,clean_title_abstract
0,0,0.0,http://arxiv.org/abs/2305.19190v1,2023-05-30 16:34:28+00:00,2023-05-30 16:34:28+00:00,Inverse Approximation Theory for Nonlinear Rec...,We prove an inverse approximation theorem for ...,,,,cs.LG,"['cs.LG', 'cs.AI', 'math.DS']",http://arxiv.org/pdf/2305.19190v1,"Shida Wang,Zhong Li,Qianxiao Li",inverse approximation theory for nonlinear rec...
1,1,1.0,http://arxiv.org/abs/2302.02004v3,2023-05-30 12:58:48+00:00,2023-02-03 21:19:56+00:00,Sharp Spectral Rates for Koopman Operator Lear...,Non-linear dynamical systems can be handily de...,"10 pages, 3 figures, 6 appendices",,,cs.LG,"['cs.LG', 'math.DS']",http://arxiv.org/pdf/2302.02004v3,"Vladimir Kostic,Karim Lounici,Pietro Novelli,M...",sharp spectral rates for koopman operator lear...
2,2,2.0,http://arxiv.org/abs/2305.18986v1,2023-05-30 12:39:58+00:00,2023-05-30 12:39:58+00:00,Clustering and Arnoux-Rauzy words,We characterize the clustering of a word under...,,,,math.DS,"['math.DS', '68R15']",http://arxiv.org/pdf/2305.18986v1,"Sébastien Ferenczi,Luca Q. Zamboni",clustering and arnoux rauzy wordswe characteri...
3,3,3.0,http://arxiv.org/abs/2305.18965v1,2023-05-30 11:53:40+00:00,2023-05-30 11:53:40+00:00,Node Embedding from Neural Hamiltonian Orbits ...,"In the graph node embedding problem, embedding...",,"International Conference on Machine Learning, ...",,cs.LG,"['cs.LG', 'math.DS', 'physics.class-ph']",http://arxiv.org/pdf/2305.18965v1,"Qiyu Kang,Kai Zhao,Yang Song,Sijie Wang,Wee Pe...",node embedding from neural hamiltonian orbits ...
4,4,4.0,http://arxiv.org/abs/2305.13959v2,2023-05-30 11:17:42+00:00,2023-05-23 11:37:19+00:00,Equidistribution of iterations of holomorphic ...,In this paper we analyze a certain family of h...,"Fixed typos, minor change in exposition of the...",,,math.DS,['math.DS'],http://arxiv.org/pdf/2305.13959v2,Nils Hemmingsson,equidistribution of iterations of holomorphic ...


In [None]:
# make a list of strings obtained by concatenating 
# titles and abstracts of the papers in df_dev
dev_abs = (df_dev.clean_title_abstract).to_list()
len(dev_abs)

50

In [None]:
# using the trained BERTopic models, see how they classify papers
# from df_dev that they have not seen during training
for n in n_neighbors_candidates :
  for m in n_components_candidates :
    bertopic_model = BERTopic.load(f'models/bertopic_20k_sbert_umap_hdbscan-n_neighbors{n}-n_components{m}')
    df_dev_topics = pd.DataFrame({ 'topic_number': bertopic_model.transform(dev_abs)[0]})
    df_dev_topics = get_keyword_col(df_dev_topics,bertopic_model)
    df_dev_topics = get_rep_doc_col(df_dev_topics,bertopic_model)
    df_dev_topics = pd.concat([df_dev,df_dev_topics],axis =1)
    df_dev_topics.columns = df_dev_topics.columns.astype(str)
    df_dev_topics.to_parquet(f'data/dev_sets/df_dev_topics_20k_sbert_umap_hdbscan-n_neighbors{n}-n_components{m}.parquet')

Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:18:12,604 - BERTopic - Reduced dimensionality
2023-06-02 14:18:12,610 - BERTopic - Predicted clusters


Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:18:21,142 - BERTopic - Reduced dimensionality
2023-06-02 14:18:21,150 - BERTopic - Predicted clusters


Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:18:29,222 - BERTopic - Reduced dimensionality
2023-06-02 14:18:29,229 - BERTopic - Predicted clusters


Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:18:38,135 - BERTopic - Reduced dimensionality
2023-06-02 14:18:38,141 - BERTopic - Predicted clusters


Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:18:46,405 - BERTopic - Reduced dimensionality
2023-06-02 14:18:46,412 - BERTopic - Predicted clusters


Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:18:54,808 - BERTopic - Reduced dimensionality
2023-06-02 14:18:54,816 - BERTopic - Predicted clusters


Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:19:03,861 - BERTopic - Reduced dimensionality
2023-06-02 14:19:03,867 - BERTopic - Predicted clusters


Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:19:12,252 - BERTopic - Reduced dimensionality
2023-06-02 14:19:12,258 - BERTopic - Predicted clusters


Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:19:20,948 - BERTopic - Reduced dimensionality
2023-06-02 14:19:20,956 - BERTopic - Predicted clusters


In [None]:
# check whether the topic_number, keywords and representative_doc columns have been added correctly
df = pd.read_parquet('data/dev_sets/df_dev_topics_20k_sbert_umap_hdbscan-n_neighbors5-n_components5.parquet')

In [None]:
df.columns

Index(['Unnamed: 0.1', 'Unnamed: 0', 'entry_id', 'updated', 'published',
       'title', 'summary', 'comment', 'journal_ref', 'doi', 'primary_category',
       'categories', 'pdf_url', 'authors', 'clean_title_abstract',
       'topic_number', 'keywords', 'representative_doc'],
      dtype='object')

In [None]:
df = pd.read_parquet(f'data/dev_sets/df_dev_topics_20k_sbert_umap_hdbscan-n_neighbors5-n_components5.parquet')
df.head()

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,entry_id,updated,published,title,summary,comment,journal_ref,doi,primary_category,categories,pdf_url,authors,clean_title_abstract,topic_number,keywords,representative_doc
0,0,0.0,http://arxiv.org/abs/2305.19190v1,2023-05-30 16:34:28+00:00,2023-05-30 16:34:28+00:00,Inverse Approximation Theory for Nonlinear Rec...,We prove an inverse approximation theorem for ...,,,,cs.LG,"['cs.LG', 'cs.AI', 'math.DS']",http://arxiv.org/pdf/2305.19190v1,"Shida Wang,Zhong Li,Qianxiao Li",inverse approximation theory for nonlinear rec...,335,335_gradient descent_stochastic gradient_stoch...,[Weak error analysis for stochastic gradient d...
1,1,1.0,http://arxiv.org/abs/2302.02004v3,2023-05-30 12:58:48+00:00,2023-02-03 21:19:56+00:00,Sharp Spectral Rates for Koopman Operator Lear...,Non-linear dynamical systems can be handily de...,"10 pages, 3 figures, 6 appendices",,,cs.LG,"['cs.LG', 'math.DS']",http://arxiv.org/pdf/2302.02004v3,"Vladimir Kostic,Karim Lounici,Pietro Novelli,M...",sharp spectral rates for koopman operator lear...,108,108_koopman operator_model reduction_mode deco...,[Centering Data Improves the Dynamic Mode Deco...
2,2,2.0,http://arxiv.org/abs/2305.18986v1,2023-05-30 12:39:58+00:00,2023-05-30 12:39:58+00:00,Clustering and Arnoux-Rauzy words,We characterize the clustering of a word under...,,,,math.DS,"['math.DS', '68R15']",http://arxiv.org/pdf/2305.18986v1,"Sébastien Ferenczi,Luca Q. Zamboni",clustering and arnoux rauzy wordswe characteri...,24,24_random permutations_random variables_partit...,[New results for the Coupon Collector's proble...
3,3,3.0,http://arxiv.org/abs/2305.18965v1,2023-05-30 11:53:40+00:00,2023-05-30 11:53:40+00:00,Node Embedding from Neural Hamiltonian Orbits ...,"In the graph node embedding problem, embedding...",,"International Conference on Machine Learning, ...",,cs.LG,"['cs.LG', 'math.DS', 'physics.class-ph']",http://arxiv.org/pdf/2305.18965v1,"Qiyu Kang,Kai Zhao,Yang Song,Sijie Wang,Wee Pe...",node embedding from neural hamiltonian orbits ...,9,9_random graph_random graphs_preferential atta...,[Limit laws for selfloops and multiple edges i...
4,4,4.0,http://arxiv.org/abs/2305.13959v2,2023-05-30 11:17:42+00:00,2023-05-23 11:37:19+00:00,Equidistribution of iterations of holomorphic ...,In this paper we analyze a certain family of h...,"Fixed typos, minor change in exposition of the...",,,math.DS,['math.DS'],http://arxiv.org/pdf/2305.13959v2,Nils Hemmingsson,equidistribution of iterations of holomorphic ...,8,8_topological entropy_dynamical systems_interv...,[Statistical properties of interval maps with ...
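
The entries of the keywords column have the form `<topic number>_<keyword>_<keyword>_...`. A minimal sketch of splitting such a name into its parts is given below, assuming that the keywords themselves contain no underscores. The function `split_topic_name` is hypothetical, and the example string is the outlier topic name that appears in the output further down.

```python
def split_topic_name(name):
    """Split '<number>_<kw>_<kw>...' into (topic number, list of keywords)."""
    number, _, rest = name.partition("_")
    keywords = rest.split("_") if rest else []
    return int(number), keywords


name = "-1_boundary conditions_differential equations_quantum mechanics_vector fields"
topic_number, keywords = split_topic_name(name)
print(topic_number)
print(keywords)
```

Splitting on the first underscore only for the number means that the leading minus sign of the outlier topic is handled like any other topic number.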


In [None]:
import textwrap

In [None]:
# check how the models classified a paper from df_dev,
# which they have not seen during training
i = -10
print('For the paper with the following title + abstract\n\n ',
      textwrap.fill(df_dev.loc[:,'clean_title_abstract'].iloc[i], width=50),
      '\n\n\n')
for n in n_neighbors_candidates :
  for m in n_components_candidates :
    df = pd.read_parquet(f'data/dev_sets/df_dev_topics_20k_sbert_umap_hdbscan-n_neighbors{n}-n_components{m}.parquet')
    df = df.loc[:,['clean_title_abstract','keywords', 'representative_doc']]
    print(f'BERTopic model with hyperparameters n_neighbors = {n}, n_components = {m}\n\n,')
    print(' - assigned the following topic keywords to the paper -\n\n ',df.iloc[i,1],'\n\n')
    print(' - The most representative abstract with the above topic keywords has the following title + abstract - \n\n',
          textwrap.fill(df.iloc[i,2][0], width=50))
    print('--------------------------------------------------\n\n\n')

For the paper with the following title + abstract

  the algebraic and analytic compactifications of
the hitchin moduli spacefollowing the work of
mazzeo swoboda weiss witt and mochizuki there is a
map between the algebraic compactification of the
dolbeault moduli space of higgs bundles on a
smooth projective curve coming from the action and
the analytic compactification of hitchins moduli
space of solutions to the self duality equations
on a riemann surface obtained by adding solutions
to the decoupled equations known as limiting
configurations this map extends the classical
kobayashi hitchin correspondence the main result
of this paper is that fails to be continuous at
the boundary over a certain subset of the
discriminant locus of the hitchin fibration this
suggests the possibility of a third refined
compactification which dominates both 



BERTopic model with hyperparameters n_neighbors = 5, n_components = 2

,
 - assigned the following topic keywords to the paper -

  -1_boundary

In the above cell, notice that the model with n_components = 2 classified the paper as an outlier. This is interesting because setting n_components = 2 reduced the overall number of outliers.
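
The inspection cell above is reused below for other values of i. As a sketch, its printing logic could be factored into a small helper that builds the report for one model. The helper `format_report` and the placeholder strings in the example are hypothetical. In the notebook the keywords and representative document would come from the saved dev-set dataframes.

```python
import textwrap


def format_report(n_neighbors, n_components, keywords, representative_doc, width=50):
    """Format one model's classification of a paper as a printable string."""
    lines = [
        f"BERTopic model with hyperparameters n_neighbors = {n_neighbors}, "
        f"n_components = {n_components}",
        "",
        " - assigned the following topic keywords to the paper -",
        "",
        f"  {keywords}",
        "",
        " - The most representative abstract with the above topic keywords "
        "has the following title + abstract -",
        "",
        textwrap.fill(representative_doc, width=width),
        "-" * 50,
    ]
    return "\n".join(lines)


# placeholder inputs, made up for illustration only
report = format_report(5, 2, "0_example keyword", "A placeholder abstract used only for illustration.")
print(report)
```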

In [None]:
# check how the models classified a paper from df_dev,
# which they have not seen during training
i = -9
print('For the paper with the following title + abstract\n\n ',
      textwrap.fill(df_dev.loc[:,'clean_title_abstract'].iloc[i], width=50),
      '\n\n\n')
for n in n_neighbors_candidates :
  for m in n_components_candidates :
    df = pd.read_parquet(f'data/dev_sets/df_dev_topics_20k_sbert_umap_hdbscan-n_neighbors{n}-n_components{m}.parquet')
    df=df.loc[:,['clean_title_abstract','keywords','representative_doc']]
    print(f'BERTopic model with hyperparameters n_neighbors = {n}, n_components = {m}\n\n,')
    print(' - assigned the following topic keywords to the paper -\n\n ',df.iloc[i,1],'\n\n')
    print(' - The most representative abstract with the above topic keywords has the following title + abstract - \n\n',
          textwrap.fill(df.iloc[i,2][0], width=50))
    print('--------------------------------------------------\n\n\n')

For the paper with the following title + abstract

  strange random topology of the circlewe
characterise high dimensional topology that arises
from a random cech complex constructed on the
circle expected euler characteristic curve is
computed where we observe limiting spikes the
spikes correspond to expected betti numbers
growing arbitrarily large over shrinking intervals
of filtration radii using the fact that the
homotopy type of the random cech complex is either
an odd dimensional sphere or a bouquet of even
dimensional spheres we give probabilistic bounds
of the homotopy types by departing from the
conventional practice of scaling down filtration
radii as the sample size grows large our findings
indicate that the full breadth of filtration radii
leads to interesting systematic behaviour that
cannot be regarded as topological noise 



BERTopic model with hyperparameters n_neighbors = 5, n_components = 2

,
 - assigned the following topic keywords to the paper -

  13_random graph

In the above cell, the model with n_components = 2 assigned the paper to a topic about random graphs; it would otherwise have been classified as an outlier. Although the paper's primary category is probability, it deals extensively with homotopy computations. With a larger collection of algebraic topology papers in our corpus, this paper would likely cluster with them. However, since our corpus predominantly contains papers on dynamical systems, PDEs, mathematical physics, differential geometry, and probability, it would have been more appropriate to classify it as an outlier.

In [None]:
# check how the models classified a paper from df_dev,
# which they have not seen during training
i = -8
print('For the paper with the following title + abstract\n\n ',
      textwrap.fill(df_dev.loc[:,'clean_title_abstract'].iloc[i], width=50),
      '\n\n\n')
for n in n_neighbors_candidates :
  for m in n_components_candidates :
    df = pd.read_parquet(f'data/dev_sets/df_dev_topics_20k_sbert_umap_hdbscan-n_neighbors{n}-n_components{m}.parquet')
    df=df.loc[:,['clean_title_abstract','keywords','representative_doc']]
    print(f'BERTopic model with hyperparameters n_neighbors = {n}, n_components = {m}\n\n,')
    print(' - assigned the following topic keywords to the paper -\n\n ',df.iloc[i,1],'\n\n')
    print(' - The most representative abstract with the above topic keywords has the following title + abstract - \n\n',
          textwrap.fill(df.iloc[i,2][0], width=50))
    print('--------------------------------------------------\n\n\n')

For the paper with the following title + abstract

  systolic inequalities for k surfaces via stability
conditionswe introduce the notions of categorical
systoles and categorical volumes of bridgeland
stability conditions on triangulated categories we
prove that for any projective k surface there
exists a constant c depending only on the rank and
discriminant of its picard group such that holds
for any stability condition on the derived
category of coherent sheaves on the k surface this
is an algebro geometric generalization of a
classical systolic inequality on two tori we also
discuss applications of this inequality in
symplectic geometry 



BERTopic model with hyperparameters n_neighbors = 5, n_components = 2

,
 - assigned the following topic keywords to the paper -

  -1_boundary conditions_differential equations_quantum mechanics_vector fields 


 - The most representative abstract with the above topic keywords has the following title + abstract - 

 Convergence rate of solution

In [None]:
# check how the models classified a paper from df_dev,
# which they have not seen during training
i = -7
print('For the paper with the following title + abstract\n\n ',
      textwrap.fill(df_dev.loc[:,'clean_title_abstract'].iloc[i], width=50),
      '\n\n\n')
for n in n_neighbors_candidates :
  for m in n_components_candidates :
    df = pd.read_parquet(f'data/dev_sets/df_dev_topics_20k_sbert_umap_hdbscan-n_neighbors{n}-n_components{m}.parquet')
    df=df.loc[:,['clean_title_abstract','keywords','representative_doc']]
    print(f'BERTopic model with hyperparameters n_neighbors = {n}, n_components = {m}\n\n,')
    print(' - assigned the following topic keywords to the paper -\n\n ',df.iloc[i,1],'\n\n')
    print(' - The most representative abstract with the above topic keywords has the following title + abstract - \n\n',
          textwrap.fill(df.iloc[i,2][0], width=50))
    print('--------------------------------------------------\n\n\n')

For the paper with the following title + abstract

  family floer syz conjecture for singularitywe
resolve a mathematically precise syz conjecture
for singularity by building a quantum corrected t
duality between two singular torus fibrations
related to the kahler geometry of the smoothing
and the berkovich geometry of the resolution
respectively our approach involves heavy
computations that embody a non archimedean version
of the partition of unity and it confirms the
strategy that patching verified local singularity
models brings global syz conjecture solutions like
k surfaces within reach there is also remarkably
explicit extra evidence concerning the collision
of singular fibers and braid group actions on one
hand we address the central challenge of matching
syz singular loci identified by joyce in reality
we construct not merely an isolated syz mirror
fibration partner but a parameter dependent one
that always keeps the matching singular loci plus
integral affine structure even wh

In [None]:
# check how the models classified a paper from df_dev,
# which they have not seen during training
i = -6
print('For the paper with the following title + abstract\n\n ',
      textwrap.fill(df_dev.loc[:,'clean_title_abstract'].iloc[i], width=50),
      '\n\n\n')
for n in n_neighbors_candidates :
  for m in n_components_candidates :
    df = pd.read_parquet(f'data/dev_sets/df_dev_topics_20k_sbert_umap_hdbscan-n_neighbors{n}-n_components{m}.parquet')
    df=df.loc[:,['clean_title_abstract','keywords','representative_doc']]
    print(f'BERTopic model with hyperparameters n_neighbors = {n}, n_components = {m}\n\n,')
    print(' - assigned the following topic keywords to the paper -\n\n ',df.iloc[i,1],'\n\n')
    print(' - The most representative abstract with the above topic keywords has the following title + abstract - \n\n',
          textwrap.fill(df.iloc[i,2][0], width=50))
    print('--------------------------------------------------\n\n\n')

For the paper with the following title + abstract

  enumerative geometry via the moduli space of super
riemann surfacesin this paper we relate volumes of
moduli spaces of super riemann surfaces to
integrals over the moduli space of stable riemann
surfaces  this allows us to prove via algebraic
geometry a recursion between the volumes of moduli
spaces of super hyperbolic surfaces previously
proven via super geometry techniques by stanford
and witten the recursion between the volumes of
moduli spaces of super hyperbolic surfaces is
proven to be equivalent to the fact that a
generating function for the intersection numbers
of a natural collection of cohomology classes with
tautological classes on is a kdv tau function this
is analogous to mirzakhanis proof of the
kontsevich witten theorem regarding a generating
function for the intersection numbers of
tautological classes on using volumes of moduli
spaces of hyperbolic surfaces 



BERTopic model with hyperparameters n_neighbors = 5, n_c

Here we calculate the ratio of outliers among the predictions for the unseen papers.
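
The quantity computed in the following cells is the share of documents that receive BERTopic's outlier label `-1`. As a minimal sketch, the same computation can be wrapped in a small helper. The function name `outlier_ratio` and the toy topic list are hypothetical and only illustrate the idea; they are not part of the notebook's pipeline.

```python
from typing import Optional, Sequence


def outlier_ratio(topics: Sequence[int],
                  mask: Optional[Sequence[bool]] = None) -> float:
    """Share of documents labelled -1 (BERTopic's outlier topic).

    If a boolean mask is given, only documents whose mask entry is True
    are counted.
    """
    if mask is not None:
        topics = [t for t, keep in zip(topics, mask) if keep]
    if len(topics) == 0:
        raise ValueError("no documents selected")
    return sum(1 for t in topics if t == -1) / len(topics)


# toy stand-in for the first output of bertopic_model.transform(...)
toy_topics = [-1, 3, 0, -1, 7, 2, -1, 0]
# toy stand-in for a per-category mask such as the math.DS mask
toy_mask = [True, True, False, False, True, False, True, False]

overall = outlier_ratio(toy_topics)
masked = outlier_ratio(toy_topics, toy_mask)
print(overall, masked)
```

With such a helper, each per-category cell would reduce to one call with the corresponding mask, and the model would only need to be applied to the dev set once per hyperparameter pair.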

In [None]:
# calculate the ratio of the papers that the models classified as outliers
for n in n_neighbors_candidates :
  for m in n_components_candidates :
    bertopic_model = bertopic_models[f'bertopic_20k_sbert_umap_hdbscan-n_neighbors{n}-n_components{m}']
    dev_topics,_ = bertopic_model.transform(dev_abs)
    num_dev_outliers = dev_topics.count(-1)
    print(f'\n\nBERTopic model with hyperparameters n_neighbors = {n}, n_components = {m} :,')
    print(f"In full_dev_set, the ratio of the papers that have been classified as outliers is {num_dev_outliers/len(dev_topics)}\n\n") 

Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:19:22,489 - BERTopic - Reduced dimensionality
2023-06-02 14:19:22,496 - BERTopic - Predicted clusters




BERTopic model with hyperparameters n_neighbors = 5, n_components = 2 :,
In full_dev_set, the ratio of the papers that have been classified as outliers is 0.46




Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:19:23,344 - BERTopic - Reduced dimensionality
2023-06-02 14:19:23,350 - BERTopic - Predicted clusters




BERTopic model with hyperparameters n_neighbors = 5, n_components = 5 :,
In full_dev_set, the ratio of the papers that have been classified as outliers is 0.38




Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:19:24,982 - BERTopic - Reduced dimensionality
2023-06-02 14:19:24,990 - BERTopic - Predicted clusters




BERTopic model with hyperparameters n_neighbors = 5, n_components = 10 :,
In full_dev_set, the ratio of the papers that have been classified as outliers is 0.42




Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:19:25,889 - BERTopic - Reduced dimensionality
2023-06-02 14:19:25,895 - BERTopic - Predicted clusters




BERTopic model with hyperparameters n_neighbors = 15, n_components = 2 :,
In full_dev_set, the ratio of the papers that have been classified as outliers is 0.54




Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:19:26,821 - BERTopic - Reduced dimensionality
2023-06-02 14:19:26,828 - BERTopic - Predicted clusters




BERTopic model with hyperparameters n_neighbors = 15, n_components = 5 :,
In full_dev_set, the ratio of the papers that have been classified as outliers is 0.6




Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:19:27,840 - BERTopic - Reduced dimensionality
2023-06-02 14:19:27,849 - BERTopic - Predicted clusters




BERTopic model with hyperparameters n_neighbors = 15, n_components = 10 :,
In full_dev_set, the ratio of the papers that have been classified as outliers is 0.66




Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:19:28,862 - BERTopic - Reduced dimensionality
2023-06-02 14:19:28,868 - BERTopic - Predicted clusters




BERTopic model with hyperparameters n_neighbors = 50, n_components = 2 :,
In full_dev_set, the ratio of the papers that have been classified as outliers is 0.7




Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:19:30,051 - BERTopic - Reduced dimensionality
2023-06-02 14:19:30,058 - BERTopic - Predicted clusters




BERTopic model with hyperparameters n_neighbors = 50, n_components = 5 :,
In full_dev_set, the ratio of the papers that have been classified as outliers is 0.72




Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:19:31,481 - BERTopic - Reduced dimensionality
2023-06-02 14:19:31,488 - BERTopic - Predicted clusters




BERTopic model with hyperparameters n_neighbors = 50, n_components = 10 :,
In full_dev_set, the ratio of the papers that have been classified as outliers is 0.7




In [None]:
# calculate the ratio of the papers that the models classified as
# outliers among the papers with subject tag math.DS (dynamical systems)
for n in n_neighbors_candidates :
  for m in n_components_candidates :
    bertopic_model = bertopic_models[f'bertopic_20k_sbert_umap_hdbscan-n_neighbors{n}-n_components{m}']
    dev_topics,_ = bertopic_model.transform(dev_abs)
    mask = ['math.DS' in categories for categories in df_dev.categories]
    dev_topics = [ dev_topics[i]for i in range(len(mask)) if mask[i]]
    num_dev_outliers = dev_topics.count(-1)
    print(f'\n\nBERTopic model with hyperparameters n_neighbors = {n}, n_components = {m} :,')
    print(f"Among papers with subject tag math.DS, the ratio of the papers that have been classified as outliers is {num_dev_outliers/len(dev_topics)}\n\n") 

Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:19:32,314 - BERTopic - Reduced dimensionality
2023-06-02 14:19:32,320 - BERTopic - Predicted clusters




BERTopic model with hyperparameters n_neighbors = 5, n_components = 2 :,
Among papers with subject tag math.DS, the ratio of the papers that have been classified as outliers is 0.25




Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:19:33,161 - BERTopic - Reduced dimensionality
2023-06-02 14:19:33,168 - BERTopic - Predicted clusters




BERTopic model with hyperparameters n_neighbors = 5, n_components = 5 :,
Among papers with subject tag math.DS, the ratio of the papers that have been classified as outliers is 0.25




Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:19:34,000 - BERTopic - Reduced dimensionality
2023-06-02 14:19:34,007 - BERTopic - Predicted clusters




BERTopic model with hyperparameters n_neighbors = 5, n_components = 10 :,
Among papers with subject tag math.DS, the ratio of the papers that have been classified as outliers is 0.25




Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:19:35,676 - BERTopic - Reduced dimensionality
2023-06-02 14:19:35,682 - BERTopic - Predicted clusters




BERTopic model with hyperparameters n_neighbors = 15, n_components = 2 :,
Among papers with subject tag math.DS, the ratio of the papers that have been classified as outliers is 0.375




Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:19:36,593 - BERTopic - Reduced dimensionality
2023-06-02 14:19:36,600 - BERTopic - Predicted clusters




BERTopic model with hyperparameters n_neighbors = 15, n_components = 5 :,
Among papers with subject tag math.DS, the ratio of the papers that have been classified as outliers is 0.625




Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:19:37,578 - BERTopic - Reduced dimensionality
2023-06-02 14:19:37,586 - BERTopic - Predicted clusters




BERTopic model with hyperparameters n_neighbors = 15, n_components = 10 :,
Among papers with subject tag math.DS, the ratio of the papers that have been classified as outliers is 0.5




Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:19:38,577 - BERTopic - Reduced dimensionality
2023-06-02 14:19:38,582 - BERTopic - Predicted clusters




BERTopic model with hyperparameters n_neighbors = 50, n_components = 2 :,
Among papers with subject tag math.DS, the ratio of the papers that have been classified as outliers is 0.375




Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:19:39,893 - BERTopic - Reduced dimensionality
2023-06-02 14:19:39,901 - BERTopic - Predicted clusters




BERTopic model with hyperparameters n_neighbors = 50, n_components = 5 :,
Among papers with subject tag math.DS, the ratio of the papers that have been classified as outliers is 0.75




Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:19:41,477 - BERTopic - Reduced dimensionality
2023-06-02 14:19:41,486 - BERTopic - Predicted clusters




BERTopic model with hyperparameters n_neighbors = 50, n_components = 10 :,
Among papers with subject tag math.DS, the ratio of the papers that have been classified as outliers is 0.75




In [None]:
# calculate the ratio of the papers that the models classified as
# outliers among the papers with subject tag math.AP (PDEs)
for n in n_neighbors_candidates :
  for m in n_components_candidates :
    bertopic_model = bertopic_models[f'bertopic_20k_sbert_umap_hdbscan-n_neighbors{n}-n_components{m}']
    dev_topics,_ = bertopic_model.transform(dev_abs)
    mask = ['math.AP' in categories for categories in df_dev.categories]
    dev_topics = [ dev_topics[i]for i in range(len(mask)) if mask[i]]
    num_dev_outliers = dev_topics.count(-1)
    print(f'\n\nBERTopic model with hyperparameters n_neighbors = {n}, n_components = {m} :,')
    print(f"Among papers with subject tag math.AP, the ratio of the papers that have been classified as outliers is {num_dev_outliers/len(dev_topics)}\n\n") 

Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:19:42,367 - BERTopic - Reduced dimensionality
2023-06-02 14:19:42,373 - BERTopic - Predicted clusters




BERTopic model with hyperparameters n_neighbors = 5, n_components = 2 :,
Among papers with subject tag math.AP, the ratio of the papers that have been classified as outliers is 0.47058823529411764




Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:19:43,219 - BERTopic - Reduced dimensionality
2023-06-02 14:19:43,226 - BERTopic - Predicted clusters




BERTopic model with hyperparameters n_neighbors = 5, n_components = 5 :,
Among papers with subject tag math.AP, the ratio of the papers that have been classified as outliers is 0.35294117647058826




Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:19:44,097 - BERTopic - Reduced dimensionality
2023-06-02 14:19:44,105 - BERTopic - Predicted clusters




BERTopic model with hyperparameters n_neighbors = 5, n_components = 10 :,
Among papers with subject tag math.AP, the ratio of the papers that have been classified as outliers is 0.29411764705882354




Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:19:45,796 - BERTopic - Reduced dimensionality
2023-06-02 14:19:45,802 - BERTopic - Predicted clusters




BERTopic model with hyperparameters n_neighbors = 15, n_components = 2 :,
Among papers with subject tag math.AP, the ratio of the papers that have been classified as outliers is 0.47058823529411764




Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:19:46,726 - BERTopic - Reduced dimensionality
2023-06-02 14:19:46,734 - BERTopic - Predicted clusters




BERTopic model with hyperparameters n_neighbors = 15, n_components = 5 :,
Among papers with subject tag math.AP, the ratio of the papers that have been classified as outliers is 0.5882352941176471




Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:19:47,717 - BERTopic - Reduced dimensionality
2023-06-02 14:19:47,724 - BERTopic - Predicted clusters




BERTopic model with hyperparameters n_neighbors = 15, n_components = 10 :,
Among papers with subject tag math.AP, the ratio of the papers that have been classified as outliers is 0.6470588235294118




Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:19:48,763 - BERTopic - Reduced dimensionality
2023-06-02 14:19:48,769 - BERTopic - Predicted clusters




BERTopic model with hyperparameters n_neighbors = 50, n_components = 2 :,
Among papers with subject tag math.AP, the ratio of the papers that have been classified as outliers is 0.7647058823529411




Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:19:50,001 - BERTopic - Reduced dimensionality
2023-06-02 14:19:50,009 - BERTopic - Predicted clusters




BERTopic model with hyperparameters n_neighbors = 50, n_components = 5 :,
Among papers with subject tag math.AP, the ratio of the papers that have been classified as outliers is 0.7058823529411765




Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:19:51,431 - BERTopic - Reduced dimensionality
2023-06-02 14:19:51,439 - BERTopic - Predicted clusters




BERTopic model with hyperparameters n_neighbors = 50, n_components = 10 :,
Among papers with subject tag math.AP, the ratio of the papers that have been classified as outliers is 0.6470588235294118




In [None]:
# calculate the ratio of the papers that the models classified as
# outliers among the papers with subject tag math.MP (mathematical physics)
for n in n_neighbors_candidates :
  for m in n_components_candidates :
    bertopic_model = bertopic_models[f'bertopic_20k_sbert_umap_hdbscan-n_neighbors{n}-n_components{m}']
    # predict topics for the unseen dev-set papers
    dev_topics,_ = bertopic_model.transform(dev_abs)
    mask = ['math.MP' in categories for categories in df_dev.categories]
    dev_topics = [ dev_topics[i]for i in range(len(mask)) if mask[i]]
    num_dev_outliers = dev_topics.count(-1)
    print(f'\n\nBERTopic model with hyperparameters n_neighbors = {n}, n_components = {m} :,')
    print(f"Among papers with subject tag math.MP, the ratio of the papers that have been classified as outliers is {num_dev_outliers/len(dev_topics)}\n\n") 

Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:19:52,290 - BERTopic - Reduced dimensionality
2023-06-02 14:19:52,296 - BERTopic - Predicted clusters




BERTopic model with hyperparameters n_neighbors = 5, n_components = 2 :,
Among papers with subject tag math.MP, the ratio of the papers that have been classified as outliers is 0.4




Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:19:53,169 - BERTopic - Reduced dimensionality
2023-06-02 14:19:53,176 - BERTopic - Predicted clusters




BERTopic model with hyperparameters n_neighbors = 5, n_components = 5 :,
Among papers with subject tag math.MP, the ratio of the papers that have been classified as outliers is 0.4666666666666667




Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:19:54,071 - BERTopic - Reduced dimensionality
2023-06-02 14:19:54,078 - BERTopic - Predicted clusters




BERTopic model with hyperparameters n_neighbors = 5, n_components = 10 :,
Among papers with subject tag math.MP, the ratio of the papers that have been classified as outliers is 0.6666666666666666




Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:19:54,957 - BERTopic - Reduced dimensionality
2023-06-02 14:19:54,964 - BERTopic - Predicted clusters




BERTopic model with hyperparameters n_neighbors = 15, n_components = 2 :,
Among papers with subject tag math.MP, the ratio of the papers that have been classified as outliers is 0.6




Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:19:56,741 - BERTopic - Reduced dimensionality
2023-06-02 14:19:56,749 - BERTopic - Predicted clusters




BERTopic model with hyperparameters n_neighbors = 15, n_components = 5 :,
Among papers with subject tag math.MP, the ratio of the papers that have been classified as outliers is 0.6




Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:19:57,778 - BERTopic - Reduced dimensionality
2023-06-02 14:19:57,787 - BERTopic - Predicted clusters




BERTopic model with hyperparameters n_neighbors = 15, n_components = 10 :,
Among papers with subject tag math.MP, the ratio of the papers that have been classified as outliers is 0.7333333333333333




Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:19:58,809 - BERTopic - Reduced dimensionality
2023-06-02 14:19:58,815 - BERTopic - Predicted clusters




BERTopic model with hyperparameters n_neighbors = 50, n_components = 2 :,
Among papers with subject tag math.MP, the ratio of the papers that have been classified as outliers is 0.7333333333333333




Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:19:59,999 - BERTopic - Reduced dimensionality
2023-06-02 14:20:00,006 - BERTopic - Predicted clusters




BERTopic model with hyperparameters n_neighbors = 50, n_components = 5 :,
Among papers with subject tag math.MP, the ratio of the papers that have been classified as outliers is 0.8




Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:20:01,481 - BERTopic - Reduced dimensionality
2023-06-02 14:20:01,490 - BERTopic - Predicted clusters




BERTopic model with hyperparameters n_neighbors = 50, n_components = 10 :,
Among papers with subject tag math.MP, the ratio of the papers that have been classified as outliers is 0.8




In [None]:
# calculate the ratio of the papers that the models classified as
# outliers among the papers with subject tag math.DG (differential geometry)
for n in n_neighbors_candidates :
  for m in n_components_candidates :
    bertopic_model = bertopic_models[f'bertopic_20k_sbert_umap_hdbscan-n_neighbors{n}-n_components{m}']
    dev_topics,_ = bertopic_model.transform(dev_abs)
    mask = ['math.DG' in categories for categories in df_dev.categories]
    dev_topics = [ dev_topics[i]for i in range(len(mask)) if mask[i]]
    num_dev_outliers = dev_topics.count(-1)
    print(f'\n\nBERTopic model with hyperparameters n_neighbors = {n}, n_components = {m} :,')
    print(f"Among papers with subject tag math.DG, the ratio of the papers that have been classified as outliers is {num_dev_outliers/len(dev_topics)}\n\n") 

Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:20:02,315 - BERTopic - Reduced dimensionality
2023-06-02 14:20:02,320 - BERTopic - Predicted clusters




BERTopic model with hyperparameters n_neighbors = 5, n_components = 2 :,
Among papers with subject tag math.DG, the ratio of the papers that have been classified as outliers is 0.5714285714285714




Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:20:03,143 - BERTopic - Reduced dimensionality
2023-06-02 14:20:03,150 - BERTopic - Predicted clusters




BERTopic model with hyperparameters n_neighbors = 5, n_components = 5 :,
Among papers with subject tag math.DG, the ratio of the papers that have been classified as outliers is 0.5




Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:20:03,988 - BERTopic - Reduced dimensionality
2023-06-02 14:20:03,995 - BERTopic - Predicted clusters




BERTopic model with hyperparameters n_neighbors = 5, n_components = 10 :,
Among papers with subject tag math.DG, the ratio of the papers that have been classified as outliers is 0.42857142857142855




Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:20:04,841 - BERTopic - Reduced dimensionality
2023-06-02 14:20:04,847 - BERTopic - Predicted clusters




BERTopic model with hyperparameters n_neighbors = 15, n_components = 2 :,
Among papers with subject tag math.DG, the ratio of the papers that have been classified as outliers is 0.5




Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:20:06,577 - BERTopic - Reduced dimensionality
2023-06-02 14:20:06,584 - BERTopic - Predicted clusters




BERTopic model with hyperparameters n_neighbors = 15, n_components = 5 :,
Among papers with subject tag math.DG, the ratio of the papers that have been classified as outliers is 0.5714285714285714




Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:20:07,576 - BERTopic - Reduced dimensionality
2023-06-02 14:20:07,583 - BERTopic - Predicted clusters




BERTopic model with hyperparameters n_neighbors = 15, n_components = 10 :,
Among papers with subject tag math.DG, the ratio of the papers that have been classified as outliers is 0.6428571428571429




Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:20:08,620 - BERTopic - Reduced dimensionality
2023-06-02 14:20:08,625 - BERTopic - Predicted clusters




BERTopic model with hyperparameters n_neighbors = 50, n_components = 2 :,
Among papers with subject tag math.DG, the ratio of the papers that have been classified as outliers is 0.6428571428571429




Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:20:09,817 - BERTopic - Reduced dimensionality
2023-06-02 14:20:09,825 - BERTopic - Predicted clusters




BERTopic model with hyperparameters n_neighbors = 50, n_components = 5 :,
Among papers with subject tag math.DG, the ratio of the papers that have been classified as outliers is 0.6428571428571429




Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:20:11,331 - BERTopic - Reduced dimensionality
2023-06-02 14:20:11,339 - BERTopic - Predicted clusters




BERTopic model with hyperparameters n_neighbors = 50, n_components = 10 :,
Among papers with subject tag math.DG, the ratio of the papers that have been classified as outliers is 0.6428571428571429




In [None]:
# calculate the ratio of the papers that the models classified as
# outliers among the papers with subject tag math.PR (probability)
for n in n_neighbors_candidates :
  for m in n_components_candidates :
    bertopic_model = bertopic_models[f'bertopic_20k_sbert_umap_hdbscan-n_neighbors{n}-n_components{m}']
    dev_topics,_ = bertopic_model.transform(dev_abs)
    mask = ['math.PR' in categories for categories in df_dev.categories]
    dev_topics = [ dev_topics[i]for i in range(len(mask)) if mask[i]]
    num_dev_outliers = dev_topics.count(-1)
    print(f'\n\nBERTopic model with hyperparameters n_neighbors = {n}, n_components = {m} :,')
    print(f"Among papers with subject tag math.PR, the ratio of the papers that have been classified as outliers is {num_dev_outliers/len(dev_topics)}\n\n") 

Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:20:12,158 - BERTopic - Reduced dimensionality
2023-06-02 14:20:12,164 - BERTopic - Predicted clusters




BERTopic model with hyperparameters n_neighbors = 5, n_components = 2 :,
Among papers with subject tag math.PR, the ratio of the papers that have been classified as outliers is 0.3076923076923077




Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:20:12,983 - BERTopic - Reduced dimensionality
2023-06-02 14:20:12,990 - BERTopic - Predicted clusters




BERTopic model with hyperparameters n_neighbors = 5, n_components = 5 :,
Among papers with subject tag math.PR, the ratio of the papers that have been classified as outliers is 0.23076923076923078




Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:20:13,837 - BERTopic - Reduced dimensionality
2023-06-02 14:20:13,844 - BERTopic - Predicted clusters




BERTopic model with hyperparameters n_neighbors = 5, n_components = 10 :,
Among papers with subject tag math.PR, the ratio of the papers that have been classified as outliers is 0.46153846153846156




Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:20:14,701 - BERTopic - Reduced dimensionality
2023-06-02 14:20:14,707 - BERTopic - Predicted clusters




BERTopic model with hyperparameters n_neighbors = 15, n_components = 2 :,
Among papers with subject tag math.PR, the ratio of the papers that have been classified as outliers is 0.5384615384615384




Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:20:15,584 - BERTopic - Reduced dimensionality
2023-06-02 14:20:15,591 - BERTopic - Predicted clusters




BERTopic model with hyperparameters n_neighbors = 15, n_components = 5 :,
Among papers with subject tag math.PR, the ratio of the papers that have been classified as outliers is 0.6153846153846154




Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:20:17,406 - BERTopic - Reduced dimensionality
2023-06-02 14:20:17,414 - BERTopic - Predicted clusters




BERTopic model with hyperparameters n_neighbors = 15, n_components = 10 :,
Among papers with subject tag math.PR, the ratio of the papers that have been classified as outliers is 0.6923076923076923




Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:20:18,423 - BERTopic - Reduced dimensionality
2023-06-02 14:20:18,429 - BERTopic - Predicted clusters




BERTopic model with hyperparameters n_neighbors = 50, n_components = 2 :,
Among papers with subject tag math.PR, the ratio of the papers that have been classified as outliers is 0.9230769230769231




Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:20:19,553 - BERTopic - Reduced dimensionality
2023-06-02 14:20:19,561 - BERTopic - Predicted clusters




BERTopic model with hyperparameters n_neighbors = 50, n_components = 5 :,
Among papers with subject tag math.PR, the ratio of the papers that have been classified as outliers is 0.8461538461538461




Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:20:21,009 - BERTopic - Reduced dimensionality
2023-06-02 14:20:21,017 - BERTopic - Predicted clusters




BERTopic model with hyperparameters n_neighbors = 50, n_components = 10 :,
Among papers with subject tag math.PR, the ratio of the papers that have been classified as outliers is 0.7692307692307693




On the unseen data, the model with n_neighbors=5 and n_components=5 had the lowest overall outlier ratio (0.38 on the full dev set).

To summarize, models with lower values of n_neighbors and n_components have a lower outlier ratio on the training data. On unseen data, however, overly low values can sometimes lead to misclassifications. After weighing the results on both the training and the unseen data, we choose n_neighbors=5 and n_components=5 as the best hyperparameters among the candidates.
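
To make the comparison easier to read, the nine full dev set ratios printed above can be collected into one small grid. The snippet below is only a sketch: the values are copied by hand from the cell outputs in this section, and the names `full_dev_ratios` and `best_setting` are introduced here for illustration.

```python
# outlier ratios on the full dev set, copied from the outputs above
# key: (n_neighbors, n_components)
full_dev_ratios = {
    (5, 2): 0.46, (5, 5): 0.38, (5, 10): 0.42,
    (15, 2): 0.54, (15, 5): 0.60, (15, 10): 0.66,
    (50, 2): 0.70, (50, 5): 0.72, (50, 10): 0.70,
}

# the hyperparameter pair with the lowest outlier ratio
best_setting = min(full_dev_ratios, key=full_dev_ratios.get)

n_neighbors_values = sorted({n for n, _ in full_dev_ratios})
n_components_values = sorted({m for _, m in full_dev_ratios})

# print a grid with rows = n_neighbors and columns = n_components
print("n_neighbors \\ n_components", *n_components_values, sep="\t")
for n in n_neighbors_values:
    row = (f"{full_dev_ratios[(n, m)]:.2f}" for m in n_components_values)
    print(n, *row, sep="\t")

print("lowest outlier ratio at (n_neighbors, n_components) =", best_setting)
```

The grid only covers the overall ratio. The per-category ratios and the manual check of the assigned keywords still have to be considered separately, as done in the cells above.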

# Section 2. Reducing the number of topics and outliers
-------------------------------

### - reduce the number of topics -

In [None]:
# recall that we have identified n_neighbors=5 and n_components=5 as the optimal hyperparameters
bertopic_model = bertopic_models['bertopic_20k_sbert_umap_hdbscan-n_neighbors5-n_components5']

In [None]:
# merge topics using reduce_topics method
bertopic_model.reduce_topics(lib_abs, nr_topics='auto')

2023-06-02 14:20:37,509 - BERTopic - Reduced number of topics from 348 to 216


<bertopic._bertopic.BERTopic at 0x7f30e254c850>

In [None]:
# count the number of outliers and topics
df_lib_topic_freqs = bertopic_model.get_topic_freq()
num_lib_outliers = df_lib_topic_freqs['Count'][df_lib_topic_freqs['Topic']==-1].iloc[0]
print(f"{num_lib_outliers} documents have not been classified")
print(f"The other {df_lib_topic_freqs['Count'].sum() - num_lib_outliers} documents are grouped into {df_lib_topic_freqs['Topic'].shape[0]-1} topics")

7456 documents have not been classified
The other 12544 documents are grouped into 215 topics


In [None]:
# visualize topics
# each circle corresponds to a topic cluster
# the size of a circle represents the size of the corresponding topic cluster
bertopic_model.visualize_topics().show()

In [None]:
# visualize documents
# each point is a document
# topic clusters are labeled by colors
bertopic_model.visualize_documents(lib_abs, 
                                   hide_document_hover=True, 
                                   hide_annotations=True).show()

<img src="images/visualize_topics-5-5-reduce_topics.png" alt="n_neighbors = 5, n_components = 5,reduce_topics" width="800"/>

<img src="images/visualize_documents-5-5-reduce_topics.png" alt="n_neighbors = 5, n_components = 5, documents after reduce_topics" width="800"/>


Observe that a sizable orange cluster has formed in the bottom left corner. This cluster results from merging several loosely related topics, which is not the outcome we want for our analysis.

In [None]:
# check how the model classifies papers from df_dev,
# which it has not seen during training
# (the saved predictions are loaded once, outside the loop)
df = pd.read_parquet('data/dev_sets/df_dev_topics_20k_sbert_umap_hdbscan-n_neighbors5-n_components5.parquet')
df = df.loc[:,['clean_title_abstract','keywords','representative_doc']]
for i in range(-6,-11,-1) :
  print('For the paper with the following title + abstract\n\n ',
        textwrap.fill(df_dev.loc[:,'clean_title_abstract'].iloc[i], width=50),
        '\n\n\n')
  print(' - assigned the following topic keywords to the paper -\n\n ',df.iloc[i,1],'\n\n')
  print(' - The most representative abstract with the above topic keywords has the following title + abstract - \n\n',
        textwrap.fill(df.iloc[i,2][0], width=50))
  print('--------------------------------------------------\n\n\n')

For the paper with the following title + abstract

  enumerative geometry via the moduli space of super
riemann surfacesin this paper we relate volumes of
moduli spaces of super riemann surfaces to
integrals over the moduli space of stable riemann
surfaces  this allows us to prove via algebraic
geometry a recursion between the volumes of moduli
spaces of super hyperbolic surfaces previously
proven via super geometry techniques by stanford
and witten the recursion between the volumes of
moduli spaces of super hyperbolic surfaces is
proven to be equivalent to the fact that a
generating function for the intersection numbers
of a natural collection of cohomology classes with
tautological classes on is a kdv tau function this
is analogous to mirzakhanis proof of the
kontsevich witten theorem regarding a generating
function for the intersection numbers of
tautological classes on using volumes of moduli
spaces of hyperbolic surfaces 



 - assigned the following topic keywords to the paper -


In [None]:
# calculate the ratio of the papers that the model classified as outliers
dev_topics,_ = bertopic_model.transform(dev_abs)
num_dev_outliers = dev_topics.count(-1)
print(f"In full_dev_set, the ratio of the papers that have been classified as outliers is {num_dev_outliers/len(dev_topics)}\n\n") 

Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:21:06,483 - BERTopic - Reduced dimensionality
2023-06-02 14:21:06,490 - BERTopic - Predicted clusters


In full_dev_set, the ratio of the papers that have been classified as outliers is 0.38




In [None]:
# calculate the ratio of the papers that the model classified as
# outliers among the papers with subject tag math.DS (dynamical systems)
dev_topics,_ = bertopic_model.transform(dev_abs)
mask = ['math.DS' in categories for categories in df_dev.categories]
dev_topics = [ dev_topics[i]for i in range(len(mask)) if mask[i]]
num_dev_outliers = dev_topics.count(-1)
print(f"Among papers with subject tag math.DS, the ratio of the papers that have been classified as outliers is {num_dev_outliers/len(dev_topics)}\n\n") 

Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:21:08,262 - BERTopic - Reduced dimensionality
2023-06-02 14:21:08,268 - BERTopic - Predicted clusters


Among papers with subject tag math.DS, the ratio of the papers that have been classified as outliers is 0.25




In [None]:
# calculate the ratio of the papers that the model classified as
# outliers among the papers with subject tag math.AP (PDEs)
dev_topics,_ = bertopic_model.transform(dev_abs)
mask = ['math.AP' in categories for categories in df_dev.categories]
dev_topics = [ dev_topics[i]for i in range(len(mask)) if mask[i]]
num_dev_outliers = dev_topics.count(-1)
print(f"Among papers with subject tag math.AP, the ratio of the papers that have been classified as outliers is {num_dev_outliers/len(dev_topics)}\n\n") 

Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:21:09,104 - BERTopic - Reduced dimensionality
2023-06-02 14:21:09,110 - BERTopic - Predicted clusters


Among papers with subject tag math.AP, the ratio of the papers that have been classified as outliers is 0.35294117647058826




In [None]:
# calculate the ratio of the papers that the model classified as
# outliers among the papers with subject tag math.MP (mathematical physics)
dev_topics,_ = bertopic_model.transform(dev_abs)
mask = ['math.MP' in categories for categories in df_dev.categories]
dev_topics = [ dev_topics[i] for i in range(len(mask)) if mask[i]]
num_dev_outliers = dev_topics.count(-1)
print(f"Among papers with subject tag math.MP, the ratio of the papers that have been classified as outliers is {num_dev_outliers/len(dev_topics)}\n\n") 

Batches:   0%|          | 0/2 [00:00<?, ?it/s]

2023-06-02 14:21:09,944 - BERTopic - Reduced dimensionality
2023-06-02 14:21:09,950 - BERTopic - Predicted clusters


Among papers with subject tag math.MP, the ratio of the papers that have been classified as outliers is 0.4666666666666667




In [None]:
# calculate the ratio of the papers that the model classified as
# outliers among the papers with subject tag math.DG (differential geometry)
dev_topics,_ = bertopic_model.transform(dev_abs)
mask = ['math.DG' in categories for categories in df_dev.categories]
dev_topics = [ dev_topics[i]for i in range(len(mask)) if mask[i]]
num_dev_outliers = dev_topics.count(-1)
print(f"Among papers with subject tag math.DG, the ratio of the papers that have been classified as outliers is {num_dev_outliers/len(dev_topics)}\n\n") 

2023-06-02 14:21:10,794 - BERTopic - Reduced dimensionality
2023-06-02 14:21:10,801 - BERTopic - Predicted clusters


Among papers with subject tag math.DG, the ratio of the papers that have been classified as outliers is 0.5




In [None]:
# calculate the ratio of papers that the model classified as
# outliers among the papers with subject tag math.PR (probability)
dev_topics, _ = bertopic_model.transform(dev_abs)
mask = ['math.PR' in categories for categories in df_dev.categories]
dev_topics = [topic for topic, keep in zip(dev_topics, mask) if keep]
num_dev_outliers = dev_topics.count(-1)
print(f"Among papers with subject tag math.PR, the ratio of the papers that have been classified as outliers is {num_dev_outliers/len(dev_topics)}\n\n") 

2023-06-02 14:21:11,663 - BERTopic - Reduced dimensionality
2023-06-02 14:21:11,670 - BERTopic - Predicted clusters


Among papers with subject tag math.PR, the ratio of the papers that have been classified as outliers is 0.23076923076923078
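
The per-tag cells above differ only in the subject tag, and each one calls `transform` again. As a minimal sketch of the same computation done in one pass, the helper below takes the predicted topics once and returns the outlier ratio for every tag. The function name `outlier_ratio_by_tag` and the toy inputs are hypothetical; in the notebook the inputs would be `dev_topics` and `df_dev.categories`.

```python
def outlier_ratio_by_tag(topics, categories, tags):
    """Return {tag: share of papers carrying `tag` whose topic is -1}."""
    ratios = {}
    for tag in tags:
        # keep the topics of papers whose category field contains the tag
        tagged = [t for t, cats in zip(topics, categories) if tag in cats]
        if tagged:  # skip tags with no papers to avoid dividing by zero
            ratios[tag] = tagged.count(-1) / len(tagged)
    return ratios


# toy stand-ins for dev_topics and df_dev.categories (hypothetical values)
toy_topics = [-1, 0, 3, -1]
toy_categories = ["math.DS", "math.DS math.AP", "math.AP", "math.PR"]
toy_ratios = outlier_ratio_by_tag(
    toy_topics, toy_categories, ["math.DS", "math.AP", "math.PR", "math.DG"]
)
print(toy_ratios)
```

Because a paper can carry several tags, the per-tag ratios are computed over overlapping subsets and do not need to average to the overall ratio.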




### - reduce the number of outliers -

In [None]:
lib_topics = pd.read_parquet('data/topics/df_lib_topics_20k_sbert_umap_hdbscan-n_neighbors5-n_components5.parquet').values

In [None]:
# reduce_outliers with strategy="probabilities" needs the probability
# of each document belonging to each topic.
# recall that we previously set calculate_probabilities = False,
# so we retrain the model with calculate_probabilities = True
umap_model = UMAP(n_neighbors=5, 
                  n_components=5,
                  min_dist = 0, 
                  metric='cosine', 
                  random_state = 623,
                  low_memory=False)
bertopic_model = BERTopic(embedding_model = 'all-MiniLM-L6-v2',
                          umap_model = umap_model,
                          vectorizer_model=vectorizer_model, 
                          calculate_probabilities=True,
                          verbose = True) 
lib_topics, lib_probs = bertopic_model.fit_transform(lib_abs, lib_vecs)

2023-06-02 14:21:29,750 - BERTopic - Reduced dimensionality
2023-06-02 14:23:55,888 - BERTopic - Clustered reduced embeddings


In [None]:
# reduce outliers
new_topics = bertopic_model.reduce_outliers(lib_abs, lib_topics, probabilities=lib_probs, 
                             threshold=0.05, strategy="probabilities")
bertopic_model.update_topics(lib_abs, topics=new_topics)
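
To make the effect of `threshold=0.05` concrete, here is a plain-Python sketch of the idea behind the "probabilities" strategy as we understand it: a document labelled -1 is moved to its most probable topic when that probability reaches the threshold, and every other document keeps its label. This is an illustration under that assumption, not BERTopic's own code; the function `reassign_outliers` and the toy values are hypothetical.

```python
def reassign_outliers(topics, probs, threshold=0.05):
    """Move outliers (-1) to their most probable topic if it clears the threshold."""
    new_topics = []
    for topic, row in zip(topics, probs):
        # index of the most probable topic for this document
        best = max(range(len(row)), key=row.__getitem__)
        if topic == -1 and row[best] >= threshold:
            new_topics.append(best)
        else:
            new_topics.append(topic)  # non-outliers and weak outliers are unchanged
    return new_topics


# toy stand-ins for lib_topics and lib_probs (hypothetical values)
toy_topics = [-1, 1, -1]
toy_probs = [[0.02, 0.90], [0.10, 0.80], [0.01, 0.03]]
toy_new = reassign_outliers(toy_topics, toy_probs)
print(toy_new)
```

In the toy data the first outlier is reassigned to topic 1, while the last one stays an outlier because none of its probabilities reaches 0.05.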

In [None]:
# count the number of topics and outliers
df_lib_topic_freqs = bertopic_model.get_topic_freq()
num_lib_outliers = df_lib_topic_freqs['Count'][df_lib_topic_freqs['Topic']==-1].iloc[0]
print(f"{num_lib_outliers} documents have not been classified")
print(f"The other {df_lib_topic_freqs['Count'].sum() - num_lib_outliers} documents are grouped into {df_lib_topic_freqs['Topic'].shape[0]-1} topics")

6786 documents have not been classified
The other 13214 documents are grouped into 347 topics


In [None]:
# visualize topics
# each circle corresponds to a topic cluster
# the size of a circle represents the size of the corresponding topic cluster
bertopic_model.visualize_topics().show()

In [None]:
# visualize documents
# each point is a document
# topic clusters are labeled by colors
bertopic_model.visualize_documents(lib_abs, 
                                   hide_document_hover=True, 
                                   hide_annotations=True).show()

<img src="images/visualize_topics-5-5-reduce_outliers.png" alt="n_neighbors = 5, n_components = 5,reduce_outliers" width="800"/>

<img src="images/visualize_documents-5-5-reduce_outliers.png" alt="n_neighbors = 5, n_components = 5,reduce_outliers" width="800"/>

In [None]:
# check how the model classifies papers from df_dev,
# i.e. papers it has not seen during training
# the saved dev-set predictions are loaded once, outside the loop
df = pd.read_parquet('data/dev_sets/df_dev_topics_20k_sbert_umap_hdbscan-n_neighbors5-n_components5.parquet')
df = df.loc[:, ['clean_title_abstract', 'keywords', 'representative_doc']]
for i in range(-6, -11, -1):
  print('For the paper with the following title + abstract\n\n ',
        textwrap.fill(df_dev.loc[:,'clean_title_abstract'].iloc[i], width=50),
        '\n\n\n')
  print(' - assigned the following topic keywords to the paper -\n\n ',df.iloc[i,1],'\n\n')
  print(' - The most representative abstract with the above topic keywords has the following title + abstract - \n\n',
          textwrap.fill(df.iloc[i,2][0], width=50))
  print('--------------------------------------------------\n\n\n')

For the paper with the following title + abstract

  enumerative geometry via the moduli space of super
riemann surfacesin this paper we relate volumes of
moduli spaces of super riemann surfaces to
integrals over the moduli space of stable riemann
surfaces  this allows us to prove via algebraic
geometry a recursion between the volumes of moduli
spaces of super hyperbolic surfaces previously
proven via super geometry techniques by stanford
and witten the recursion between the volumes of
moduli spaces of super hyperbolic surfaces is
proven to be equivalent to the fact that a
generating function for the intersection numbers
of a natural collection of cohomology classes with
tautological classes on is a kdv tau function this
is analogous to mirzakhanis proof of the
kontsevich witten theorem regarding a generating
function for the intersection numbers of
tautological classes on using volumes of moduli
spaces of hyperbolic surfaces 



 - assigned the following topic keywords to the paper -


In [None]:
# calculate the ratio of papers that the model classified as outliers
dev_topics, _ = bertopic_model.transform(dev_abs)
num_dev_outliers = dev_topics.count(-1)
print(f"In full_dev_set, the ratio of the papers that have been classified as outliers is {num_dev_outliers/len(dev_topics)}\n\n") 

2023-06-02 14:24:52,997 - BERTopic - Reduced dimensionality
2023-06-02 14:24:53,419 - BERTopic - Calculated probabilities with HDBSCAN
2023-06-02 14:24:53,420 - BERTopic - Predicted clusters


In full_dev_set, the ratio of the papers that have been classified as outliers is 0.38




In [None]:
# calculate the ratio of papers that the model classified as
# outliers among the papers with subject tag math.DS (dynamical systems)
dev_topics, _ = bertopic_model.transform(dev_abs)
mask = ['math.DS' in categories for categories in df_dev.categories]
dev_topics = [topic for topic, keep in zip(dev_topics, mask) if keep]
num_dev_outliers = dev_topics.count(-1)
print(f"Among papers with subject tag math.DS, the ratio of the papers that have been classified as outliers is {num_dev_outliers/len(dev_topics)}\n\n") 

2023-06-02 14:24:54,257 - BERTopic - Reduced dimensionality
2023-06-02 14:24:54,678 - BERTopic - Calculated probabilities with HDBSCAN
2023-06-02 14:24:54,680 - BERTopic - Predicted clusters


Among papers with subject tag math.DS, the ratio of the papers that have been classified as outliers is 0.25




In [None]:
# calculate the ratio of papers that the model classified as
# outliers among the papers with subject tag math.AP (PDEs)
dev_topics, _ = bertopic_model.transform(dev_abs)
mask = ['math.AP' in categories for categories in df_dev.categories]
dev_topics = [topic for topic, keep in zip(dev_topics, mask) if keep]
num_dev_outliers = dev_topics.count(-1)
print(f"Among papers with subject tag math.AP, the ratio of the papers that have been classified as outliers is {num_dev_outliers/len(dev_topics)}\n\n") 

2023-06-02 14:24:55,516 - BERTopic - Reduced dimensionality
2023-06-02 14:24:55,946 - BERTopic - Calculated probabilities with HDBSCAN
2023-06-02 14:24:55,948 - BERTopic - Predicted clusters


Among papers with subject tag math.AP, the ratio of the papers that have been classified as outliers is 0.35294117647058826




In [None]:
# calculate the ratio of papers that the model classified as
# outliers among the papers with subject tag math.MP (mathematical physics)
dev_topics, _ = bertopic_model.transform(dev_abs)
mask = ['math.MP' in categories for categories in df_dev.categories]
dev_topics = [topic for topic, keep in zip(dev_topics, mask) if keep]
num_dev_outliers = dev_topics.count(-1)
print(f"Among papers with subject tag math.MP, the ratio of the papers that have been classified as outliers is {num_dev_outliers/len(dev_topics)}\n\n") 

2023-06-02 14:24:56,806 - BERTopic - Reduced dimensionality
2023-06-02 14:24:57,228 - BERTopic - Calculated probabilities with HDBSCAN
2023-06-02 14:24:57,230 - BERTopic - Predicted clusters


Among papers with subject tag math.MP, the ratio of the papers that have been classified as outliers is 0.4666666666666667




In [None]:
# calculate the ratio of papers that the model classified as
# outliers among the papers with subject tag math.DG (differential geometry)
dev_topics, _ = bertopic_model.transform(dev_abs)
mask = ['math.DG' in categories for categories in df_dev.categories]
dev_topics = [topic for topic, keep in zip(dev_topics, mask) if keep]
num_dev_outliers = dev_topics.count(-1)
print(f"Among papers with subject tag math.DG, the ratio of the papers that have been classified as outliers is {num_dev_outliers/len(dev_topics)}\n\n") 

2023-06-02 14:24:58,097 - BERTopic - Reduced dimensionality
2023-06-02 14:24:58,521 - BERTopic - Calculated probabilities with HDBSCAN
2023-06-02 14:24:58,522 - BERTopic - Predicted clusters


Among papers with subject tag math.DG, the ratio of the papers that have been classified as outliers is 0.5




In [None]:
# calculate the ratio of papers that the model classified as
# outliers among the papers with subject tag math.PR (probability)
dev_topics, _ = bertopic_model.transform(dev_abs)
mask = ['math.PR' in categories for categories in df_dev.categories]
dev_topics = [topic for topic, keep in zip(dev_topics, mask) if keep]
num_dev_outliers = dev_topics.count(-1)
print(f"Among papers with subject tag math.PR, the ratio of the papers that have been classified as outliers is {num_dev_outliers/len(dev_topics)}\n\n") 

2023-06-02 14:25:00,230 - BERTopic - Reduced dimensionality
2023-06-02 14:25:00,663 - BERTopic - Calculated probabilities with HDBSCAN
2023-06-02 14:25:00,665 - BERTopic - Predicted clusters


Among papers with subject tag math.PR, the ratio of the papers that have been classified as outliers is 0.23076923076923078




### - reduce both the number of topics and outliers -

In [None]:
bertopic_model = BERTopic.load('models/bertopic_20k_sbert_umap_hdbscan-n_neighbors5-n_components5')

In [None]:
# reduce_outliers with strategy="probabilities" needs the probability
# of each document belonging to each topic.
# recall that we previously set calculate_probabilities = False,
# so we retrain the model with calculate_probabilities = True
umap_model = UMAP(n_neighbors=5, 
                  n_components=5,
                  min_dist = 0, 
                  metric='cosine', 
                  random_state = 623,
                  low_memory=False)
bertopic_model = BERTopic(embedding_model = 'all-MiniLM-L6-v2',
                          umap_model = umap_model,
                          vectorizer_model=vectorizer_model, 
                          calculate_probabilities=True,
                          verbose = True) 
lib_topics, lib_probs = bertopic_model.fit_transform(lib_abs, lib_vecs)

2023-06-02 14:25:19,141 - BERTopic - Reduced dimensionality
2023-06-02 14:27:43,810 - BERTopic - Clustered reduced embeddings


In [None]:
# merge similar topics using the reduce_topics method
bertopic_model.reduce_topics(lib_abs, nr_topics='auto')

2023-06-02 14:28:22,697 - BERTopic - Reduced number of topics from 348 to 216


<bertopic._bertopic.BERTopic at 0x7f2fc1b68310>

In [None]:
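# reduce the number of outliers
# reduce_topics re-labelled the documents, so we pass the model's updated
# topics_ and probabilities_ instead of the stale lib_topics and lib_probs
new_topics = bertopic_model.reduce_outliers(lib_abs, bertopic_model.topics_,
                                            probabilities=bertopic_model.probabilities_,
                                            threshold=0.05, strategy="probabilities")
bertopic_model.update_topics(lib_abs, topics=new_topics)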

In [None]:
# count the number of topics and outliers
# (the frequency table must be reassigned here so the counts reflect the current model)
df_lib_topic_freqs = bertopic_model.get_topic_freq()
num_lib_outliers = df_lib_topic_freqs['Count'][df_lib_topic_freqs['Topic']==-1].iloc[0]
print(f"{num_lib_outliers} documents have not been classified")
print(f"The other {df_lib_topic_freqs['Count'].sum() - num_lib_outliers} documents are grouped into {df_lib_topic_freqs['Topic'].shape[0]-1} topics")

6786 documents have not been classified
The other 13214 documents are grouped into 347 topics


In [None]:
# visualize topics
# each circle corresponds to a topic cluster
# the size of a circle represents the size of the corresponding topic cluster
bertopic_model.visualize_topics().show()

In [None]:
# visualize documents
# each point is a document
# topic clusters are labeled by colors
bertopic_model.visualize_documents(lib_abs, 
                                   hide_document_hover=True, 
                                   hide_annotations=True).show()

<img src="images/visualize_topics-5-5-reduce_topics_outliers.png" alt="n_neighbors = 5, n_components = 5,reduce_topics_outliers" width="800"/>

<img src="images/visualize_documents-5-5-reduce_topics_outliers.png" alt="n_neighbors = 5, n_components = 5,reduce_topics_outliers" width="800"/>


Again, we have a large orange cluster in the bottom-left corner.
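
The plot suggests one dominant cluster. As a rough way to put a number on that impression, the sketch below summarises how unevenly documents are spread over the non-outlier topics. The helper `cluster_size_summary` and the toy counts are hypothetical; the (topic, count) pairs are assumed to come from the `Topic` and `Count` columns returned by `get_topic_freq()`.

```python
from statistics import median


def cluster_size_summary(topic_counts):
    """Summarise topic sizes from (topic, count) pairs, ignoring the outlier topic -1."""
    sizes = [count for topic, count in topic_counts if topic != -1]
    total = sum(sizes)
    return {
        "num_topics": len(sizes),
        # share of classified documents that fall in the largest topic
        "largest_share": max(sizes) / total,
        "median_size": median(sizes),
    }


# toy (topic, count) pairs, not taken from the model above
toy_counts = [(-1, 40), (0, 30), (1, 20), (2, 10)]
toy_summary = cluster_size_summary(toy_counts)
print(toy_summary)
```

A largest-topic share that is many times the median topic's share would be consistent with the single large cluster seen in the figure.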

In [None]:
# check how the model classifies papers from df_dev,
# i.e. papers it has not seen during training
# the saved dev-set predictions are loaded once, outside the loop
df = pd.read_parquet('data/dev_sets/df_dev_topics_20k_sbert_umap_hdbscan-n_neighbors5-n_components5.parquet')
df = df.loc[:, ['clean_title_abstract', 'keywords', 'representative_doc']]
for i in range(-6, -11, -1):
  print('For the paper with the following title + abstract\n\n ',
        textwrap.fill(df_dev.loc[:,'clean_title_abstract'].iloc[i], width=50),
        '\n\n\n')
  print(' - assigned the following topic keywords to the paper -\n\n ',df.iloc[i,1],'\n\n')
  print(' - The most representative abstract with the above topic keywords has the following title + abstract - \n\n',
          textwrap.fill(df.iloc[i,2][0], width=50))
  print('--------------------------------------------------\n\n\n')

For the paper with the following title + abstract

  enumerative geometry via the moduli space of super
riemann surfacesin this paper we relate volumes of
moduli spaces of super riemann surfaces to
integrals over the moduli space of stable riemann
surfaces  this allows us to prove via algebraic
geometry a recursion between the volumes of moduli
spaces of super hyperbolic surfaces previously
proven via super geometry techniques by stanford
and witten the recursion between the volumes of
moduli spaces of super hyperbolic surfaces is
proven to be equivalent to the fact that a
generating function for the intersection numbers
of a natural collection of cohomology classes with
tautological classes on is a kdv tau function this
is analogous to mirzakhanis proof of the
kontsevich witten theorem regarding a generating
function for the intersection numbers of
tautological classes on using volumes of moduli
spaces of hyperbolic surfaces 



 - assigned the following topic keywords to the paper -


In [None]:
# calculate the ratio of papers that the model classified as outliers
dev_topics, _ = bertopic_model.transform(dev_abs)
num_dev_outliers = dev_topics.count(-1)
print(f"In full_dev_set, the ratio of the papers that have been classified as outliers is {num_dev_outliers/len(dev_topics)}\n\n") 

2023-06-02 14:29:00,801 - BERTopic - Reduced dimensionality
2023-06-02 14:29:01,229 - BERTopic - Calculated probabilities with HDBSCAN
2023-06-02 14:29:01,230 - BERTopic - Predicted clusters


In full_dev_set, the ratio of the papers that have been classified as outliers is 0.38




In [None]:
# calculate the ratio of papers that the model classified as
# outliers among the papers with subject tag math.DS (dynamical systems)
dev_topics, _ = bertopic_model.transform(dev_abs)
mask = ['math.DS' in categories for categories in df_dev.categories]
dev_topics = [topic for topic, keep in zip(dev_topics, mask) if keep]
num_dev_outliers = dev_topics.count(-1)
print(f"Among papers with subject tag math.DS, the ratio of the papers that have been classified as outliers is {num_dev_outliers/len(dev_topics)}\n\n") 

2023-06-02 14:29:02,056 - BERTopic - Reduced dimensionality
2023-06-02 14:29:02,478 - BERTopic - Calculated probabilities with HDBSCAN
2023-06-02 14:29:02,480 - BERTopic - Predicted clusters


Among papers with subject tag math.DS, the ratio of the papers that have been classified as outliers is 0.25




In [None]:
# calculate the ratio of papers that the model classified as
# outliers among the papers with subject tag math.AP (PDEs)
dev_topics, _ = bertopic_model.transform(dev_abs)
mask = ['math.AP' in categories for categories in df_dev.categories]
dev_topics = [topic for topic, keep in zip(dev_topics, mask) if keep]
num_dev_outliers = dev_topics.count(-1)
print(f"Among papers with subject tag math.AP, the ratio of the papers that have been classified as outliers is {num_dev_outliers/len(dev_topics)}\n\n") 

2023-06-02 14:29:03,406 - BERTopic - Reduced dimensionality
2023-06-02 14:29:03,836 - BERTopic - Calculated probabilities with HDBSCAN
2023-06-02 14:29:03,838 - BERTopic - Predicted clusters


Among papers with subject tag math.AP, the ratio of the papers that have been classified as outliers is 0.35294117647058826




In [None]:
# calculate the ratio of papers that the model classified as
# outliers among the papers with subject tag math.MP (mathematical physics)
dev_topics, _ = bertopic_model.transform(dev_abs)
mask = ['math.MP' in categories for categories in df_dev.categories]
dev_topics = [topic for topic, keep in zip(dev_topics, mask) if keep]
num_dev_outliers = dev_topics.count(-1)
print(f"Among papers with subject tag math.MP, the ratio of the papers that have been classified as outliers is {num_dev_outliers/len(dev_topics)}\n\n") 

2023-06-02 14:29:04,735 - BERTopic - Reduced dimensionality
2023-06-02 14:29:05,163 - BERTopic - Calculated probabilities with HDBSCAN
2023-06-02 14:29:05,164 - BERTopic - Predicted clusters


Among papers with subject tag math.MP, the ratio of the papers that have been classified as outliers is 0.4666666666666667




In [None]:
# calculate the ratio of papers that the model classified as
# outliers among the papers with subject tag math.DG (differential geometry)
dev_topics, _ = bertopic_model.transform(dev_abs)
mask = ['math.DG' in categories for categories in df_dev.categories]
dev_topics = [topic for topic, keep in zip(dev_topics, mask) if keep]
num_dev_outliers = dev_topics.count(-1)
print(f"Among papers with subject tag math.DG, the ratio of the papers that have been classified as outliers is {num_dev_outliers/len(dev_topics)}\n\n") 

2023-06-02 14:29:06,052 - BERTopic - Reduced dimensionality
2023-06-02 14:29:06,479 - BERTopic - Calculated probabilities with HDBSCAN
2023-06-02 14:29:06,480 - BERTopic - Predicted clusters


Among papers with subject tag math.DG, the ratio of the papers that have been classified as outliers is 0.5




In [None]:
# calculate the ratio of papers that the model classified as
# outliers among the papers with subject tag math.PR (probability)
dev_topics, _ = bertopic_model.transform(dev_abs)
mask = ['math.PR' in categories for categories in df_dev.categories]
dev_topics = [topic for topic, keep in zip(dev_topics, mask) if keep]
num_dev_outliers = dev_topics.count(-1)
print(f"Among papers with subject tag math.PR, the ratio of the papers that have been classified as outliers is {num_dev_outliers/len(dev_topics)}\n\n") 

2023-06-02 14:29:07,331 - BERTopic - Reduced dimensionality
2023-06-02 14:29:07,757 - BERTopic - Calculated probabilities with HDBSCAN
2023-06-02 14:29:07,759 - BERTopic - Predicted clusters


Among papers with subject tag math.PR, the ratio of the papers that have been classified as outliers is 0.23076923076923078





# Conclusion
The model sustains its performance on the dev set after we reduce the number of topics and outliers: the overall outlier ratio is 0.38 in both cases. However, reducing the number of topics results in clusters of unequal sizes. We therefore select the BERTopic model with n_neighbors=5 and n_components=5, combined with outlier reduction only.