# Heirarchal topic modeling analysis

## Goal


 We perform a topic analysis on a dataset consisting of arxiv pre-prints based on their titles and abstracts.

## The dataset

Our dataset contains the metadata from a uniform sample of 20,000 papers among those with subject tags in the following list:

Dynamical systems, PDEs, Mathematical Physics, Probability, and Differential Geometry.

## Layout of this notebook

1. Preliminary analysis of the data
1. Creating the basic topic model structure
1. Creating the evaluation metrics
1. Tuning hyper-parameters
1. Evaluating performance of the model on a test set


In [8]:
! git clone https://github.com/Anirban-7/Arxiv_Recommender

Cloning into 'Arxiv_Recommender'...
remote: Enumerating objects: 304, done.[K
remote: Counting objects: 100% (40/40), done.[K
remote: Compressing objects: 100% (11/11), done.[K
remote: Total 304 (delta 31), reused 35 (delta 29), pack-reused 264[K
Receiving objects: 100% (304/304), 598.54 MiB | 26.21 MiB/s, done.
Resolving deltas: 100% (143/143), done.
Updating files: 100% (53/53), done.


In [9]:
cd /content/Arxiv_Recommender/

/content/Arxiv_Recommender


# 1. Create the basic topic model structure

## Create the basic UMAP, KMeans, and HDBSCAN objects we will modify when tuning hyper-parameters

Below we create two instances of BERTopic models. One will be responsible for the initial K-means clustering and the second will be the template for the 5 topic models fit on each cluster.

In [15]:
## Install necessary packages
!pip install arxiv
!pip install bertopic
!pip install sentence_transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting arxiv
  Downloading arxiv-1.4.7-py3-none-any.whl (12 kB)
Collecting feedparser (from arxiv)
  Downloading feedparser-6.0.10-py3-none-any.whl (81 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m81.1/81.1 kB[0m [31m6.5 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting sgmllib3k (from feedparser->arxiv)
  Downloading sgmllib3k-1.0.0.tar.gz (5.8 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: sgmllib3k
  Building wheel for sgmllib3k (setup.py) ... [?25l[?25hdone
  Created wheel for sgmllib3k: filename=sgmllib3k-1.0.0-py3-none-any.whl size=6046 sha256=3a746aac291492e79765a6bb22ae8d591db394b45d09b849cb9bb9999b199dff
  Stored in directory: /root/.cache/pip/wheels/f0/69/93/a47e9d621be168e9e33c7ce60524393c0b92ae83cf6c6e89c5
Successfully built sgmllib3k
Installing collected packages: sgmllib3k, feedparser, arxiv
Succes

In [23]:
## Imports
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.cluster import KMeans
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.representation import MaximalMarginalRelevance
import pandas as pd 
import numpy as np
import data_utils

In [None]:
## Create the umap objects

# UMAP for K-means step
kmeans_proj = UMAP(n_neighbors=15,n_components=5,metric='euclidean',min_dist=0.0,random_state=623)

# UMAP for subtopic clustering
cluster_proj = UMAP(n_neighbors=15, n_components=5,metric='euclidean',min_dist=0.0,random_state=623)

# UMAP for visualizing the document clustering in two dimensions during evaluation.
vis_proj = UMAP(n_neighbors=15,n_components=2,metric='euclidean',min_dist=0.0,random_state=623)

## We use a fixed random state to eliminate stochastic effects in tuning hyperparameters and to compare to the global topic model.

In [None]:
## Create clusterers

# K-means
kmeans_clusterer = KMeans(5) #k = 5 reflects the major presence of 5 distinct subjects.

# HDBSCAN for fine clustering
subclusterer = HDBSCAN(min_cluster_size=10,min_samples=10,max_cluster_size=0,metric='euclidean')


In [None]:
## Create the two kinds of topic model architecture



# K-means
base_topic_model = BERTopic(umap_model=kmeans_proj,
                            hdbscan_model=kmeans_clusterer,
                            vectorizer_model=vectorizer,
                            representation_model=rep_model,
                            verbose=True)

# Fine clustering
cluster_topic_model = BERTopic(umap_model=cluster_proj,
                            hdbscan_model=subclusterer,
                            vectorizer_model=vectorizer,
                            representation_model=rep_model,
                            verbose=True) 


## Create the two-step model and define fitting and predicting methods.

We first decide the hyper-parameters we will tune. Note that we use the same bertopic model for each cluster model in order to simplify the procedure. Therefore we have 

We need to choose parameters of **two** bertopic models. We won't modify the respresentation of topics but rather the UMAP and clustering parameters.

Model 1: UMAP and K-means clustering parameters.
Model 2: UMAP and HDBSCAN clustering parameters.

We write a function which takes two arguments model_1_params and model_2_params.
it returns a tuple (kmeans_model , cluster_model). The second we will run inside every cluster produced by the first.

To input the parameters of the models, we use a dictionary 

kmeans_model_params = { 'umap' : umap_params }
cluster_model_params = {'umap': umap_params , 'hdbscan': hdbscan_params}

Note that we don't change the kmeans clusterer itself because there are essentially no parameters to tune.

Each of the umap and hdbscan parameters will be packaged as a kwarg and unpacked with **.

umap_params = {'n_neighbors':15 , 'n_components':5, 'metric':'euclidean','min_dist':0.0, 'random_state':623}

hdbscan_params = {'min_cluster_size':10, 'min_samples' : 10, 'max_cluster_size' : 0, 'metric' : 'euclidean'}


In [58]:
## Fix the parameters we will vary and construct the full set of model parameters from these

def get_model_params(umap_n_neighbors=15, umap_n_components=5,hdbscan_min_cluster_size=10, hdbscan_min_samples=10):

  umap_params = {'n_neighbors':15 , 'n_components':5, 'metric':'euclidean','min_dist':0.0, 'random_state':623}
  hdbscan_params = {'min_cluster_size':10, 'min_samples' : 10, 'max_cluster_size' : 0, 'metric' : 'euclidean', 'prediction_data':'True'}

  kmeans_model_params = {'umap' : umap_params} 
  cluster_model_params = {'umap' : umap_params, 'hdbscan': hdbscan_params}

  return kmeans_model_params , cluster_model_params

In [59]:
def construct_models(kmeans_model_params , cluster_model_params):
  # Construct umap objects 

  kmeans_proj = UMAP(**kmeans_model_params['umap'])
  cluster_proj = UMAP(**cluster_model_params['umap'])

  # Construct clusterers
  kmeans_clusterer = KMeans(n_clusters=5)
  hdbscan_clusterer = HDBSCAN(**cluster_model_params['hdbscan'])

  # Construct topic representation
  vectorizer = CountVectorizer(stop_words='english',ngram_range=(1,2))
  rep_model = MaximalMarginalRelevance(diversity=0.5)

  # K-means
  base_topic_model = BERTopic(umap_model=kmeans_proj,
                              hdbscan_model=kmeans_clusterer,
                              vectorizer_model=vectorizer,
                              representation_model=rep_model,
                              verbose=True)

  # Fine clustering
  cluster_topic_model = BERTopic(umap_model=cluster_proj,
                              hdbscan_model=hdbscan_clusterer,
                              vectorizer_model=vectorizer,
                              representation_model=rep_model,
                              verbose=True) 

  return base_topic_model , cluster_topic_model

#### Create the fit_model function

In [None]:
## Define the function which trains the models.
## More precisely, we are using the dataframe 'library.parquet'
## which contains the columns -- 'doc_strings' and
## 'doc_strings_reduced'. This is the corpus of papers 
## on which we do topic analysis. These columns are the
## exact text strings that are fed into the specter
## sentence embedding model to generate the vector embeddings
## we cluster. The 'reduced' argument tells us whether to use
## the titles + abstracts which have had rare words removed
## as well as latex (reduced = True) vs just the latex removed.

def fit_models(base_topic_model,cluster_topic_model,reduced=False):
  """
  Arguments:

  reduced: Boolean determining whether we use the
    reduced title + abstract or the minimally cleaned title + abstract
  base_topic_model: a bertopic model which does the
    first step k-means clustering
  cluster_topic_model: a bertopic model which does the
    second step of topic identification within each cluster.

  Returns:

  A 3 tuple consisting of the
    (a) trained kmeans model,
    (b) a dictionary of trained cluster models,
    (c) the dataframe returned with two additional columns.
        The new columns are
        1. 'kmeans_labels' : the numerical label 0-4 which
          corresponds to the k-means cluster the document belongs to
        2. 'fine_topic_labels' : -1 if the document is an outlier
          within its cluster. Otherwise, it is a list of the keywords
          generated that best describes the topic assigned to the document. 
  """
  df = pd.read_parquet('./final_data/library.parquet')

  if reduced:
    embeddings = pd.read_parquet('./final_data/library_vec_reduced_specter.parquet').values
    docs = 'doc_string_reduced'
  else:
    embeddings = pd.read_parquet('./final_data/library_vec_specter.parquet').values
    docs = 'doc_string'

  # First train the K-means model.
  print('Finding the K-means clusters...')
  base_topic_model.fit(documents=df[docs].to_list(), embeddings=embeddings)

  # Create a new column in the dataframe
  # called 'kmeans_labels' which records
  # the topic label for each paper
  kmeans_labels = pd.Series(base_topic_model.topics_, index=df.index)
  df['kmeans_labels'] = kmeans_labels

  # Construct dictionary of cluster models
  cluster_models = {i : cluster_topic_model for i in range(5)}

  # Add a placeholder column for the fine topic labels
  df['fine_topic_labels'] = 0

  for i in range(5):
    print(f'Getting topics for cluster {i}...')
    
    # Get the papers in kmeans topic i
    indices = df.loc[df['kmeans_labels'] == i].index

    # Get the documents in this topic
    cluster_docs = df[docs].iloc[indices].to_list()

    # Get the embeddings for these documents
    cluster_embeddings = embeddings[indices,:]

    # Train the ith model
    cluster_models[i].fit(documents=cluster_docs,embeddings=cluster_embeddings)

    # Create the topic labels dataframe
    topics = cluster_models[i].topics_
    labels = cluster_models[i].generate_topic_labels(nr_words=10,separator=' | ')

    
    def get_keywords(i):
      return labels[i+1]

    fine_topic_info = pd.DataFrame({'topic_number': topics}, index=indices)
    fine_topic_info['topic_keywords'] = fine_topic_info['topic_number'].apply(func=get_keywords)

    # Replace the keywords by -1 if the row is an outlier
    fine_topic_info['topic_keywords'].loc[fine_topic_info['topic_number'] == -1] = -1

    df['fine_topic_labels'].iloc[indices] = fine_topic_info['topic_keywords']

  return base_topic_model , cluster_models , df


#### Create the predict_topics function.

In [61]:
## This will work very similarly to the fit function.

## We assume we are given a dataframe consisting
## of test documents. The column 'doc_strings'
## contains the text that was used to generate
## the embedding of each document (cleaned title
## and abstract). The strip_cat column contains
## the arxiv math subject tags in the form of a list
## where each is represented by its two-letter code.
## e.g. Dynamical Systems is 'DS'.
## The goal is to return the test dataframe with the
## same two additional columns that the fit method constructs. 

def predict_topics(test_path,test_embeddings_path,trained_base_model,trained_cluster_models):
  """
  Args:



  Returns: The dataframe that was passed with two additional
  columns. The new columns are
    1. 'kmeans_labels' : the numerical label 0-4 which corresponds
      to the k-means cluster the document belongs to
    2. 'fine_topic_labels' : -1 if the document is an outlier within
      its cluster. Otherwise, it is a list of the keywords generated
      that best describes the topic assigned to the document. 
  
  """

  test = pd.read_csv(test_path)
  test_embeddings = pd.read_parquet(test_embeddings_path).values  


  ## Trained cluster models are encoded as a dictionary with
  # keys 0-4 representing the name of K-means cluster it was trained on.

  # Grab the documents the embeddings were trained on
  docs = test['doc_string'].to_list()
  
  # Predict the K-means topic of each paper and store these as a series
  print('Predicting K-means clusters for each document...')
  kmeans_label_list , _ = trained_base_model.transform(documents=docs, embeddings=test_embeddings)
  kmeans_labels = pd.Series(kmeans_label_list,index=test.index)
  
  # Add K-means labels to the dataframe
  test['kmeans_labels'] = kmeans_labels

  # Add a placeholder column for the fine topic labels
  test['fine_topic_labels'] = 0


  for i in range(5):
    print(f'Predicting topic labels for cluster {i}...')

    # Get the papers in kmeans topic i
    indices = test.loc[test['kmeans_labels'] == i].index

    # Get the documents in this topic
    cluster_docs = test['doc_string'].iloc[indices].to_list()

    # Get the embeddings for these documents
    cluster_embeddings = test_embeddings[indices,:]

    # Get the predicted topics for this cluster
    topics , _ = trained_cluster_models[i].transform(documents=cluster_docs,embeddings=cluster_embeddings)
    labels = trained_cluster_models[i].generate_topic_labels(nr_words=10,separator=' | ')
    
    def get_keywords(i):
      return labels[i]

    fine_topic_info = pd.DataFrame({'topic_number': topics}, index=indices)
    fine_topic_info['topic_keywords'] = fine_topic_info['topic_number'].apply(func=get_keywords)

    # Replace the keywords by -1 if the row is an outlier
    fine_topic_info['topic_keywords'].loc[fine_topic_info['topic_number'] == -1] = -1

    test['fine_topic_labels'].iloc[indices] = fine_topic_info['topic_keywords']

  return test





## Testing the 'fit' function on our dataset with default model parameters.

In [21]:
## Test

## Define default parameters

default_umap_params = {'n_neighbors':15 , 'n_components':5, 'metric':'euclidean','min_dist':0.0, 'random_state':623}
default_hdbscan_params = {'min_cluster_size':10, 'min_samples' : 10, 'max_cluster_size' : 0, 'metric' : 'euclidean'}

kmeans_model_params = {'umap' : default_umap_params}
cluster_model_params = {'umap' : default_umap_params , 'hdbscan': default_hdbscan_params}

## Construct models

base_topic_model , cluster_topic_model = construct_models(kmeans_model_params=kmeans_model_params,cluster_model_params=cluster_model_params)



NameError: ignored

In [None]:
## Load the library

library = pd.read_parquet('./data/library.parquet')
library.head()

Unnamed: 0,id,title_raw,abstract_raw,update_date,strip_cat,authors_parsed
182244,1412.3275,Limit cycles bifurcating from a degenerate center,We study the maximum number of limit cycles ...,2014-12-11,[DS],"[['Llibre', 'J.', ''], ['Pantazi', 'C.', '']]"
196425,809.351,Shrinking Point Bifurcations of Resonance Tong...,Resonance tongues are mode-locking regions o...,2015-05-13,[DS],"[['Simpson', 'D. J. W.', ''], ['Meiss', 'J. D...."
479424,2201.04222,Classification of Codimension-1 Singular Bifur...,The study of bifurcations of differential-al...,2022-01-13,[DS],"[['Ovsyannikov', 'Ivan', ''], ['Ruan', 'Haibo'..."
176385,1408.5812,Partial sums of excursions along random geodes...,"For a non-uniform lattice in SL(2,R), we con...",2014-10-09,"[GT, DS]","[['Gadre', 'Vaibhav', '']]"
291058,1707.03102,Uniform dimension results for a family of Mark...,In this paper we prove uniform Hausdorff and...,2017-10-03,[PR],"[['Sun', 'Xiaobin', ''], ['Xiao', 'Yimin', '']..."


In [3]:
## Load the library embeddings:

# library_vec_reduced_specter = pd.read_parquet('./final_data/library_vec_reduced_specter.parquet')
# library_vec_specter = pd.read_parquet('./final_data/library_vec_specter.parquet')

## Load the dev set embeddings:

# dev_vec_specter = pd.read_parquet('./final_data/dev_vec_specter.parquet')

## Load the test set embeddings:
pass

In [None]:
## Create the fitted model and get topics for the library

trained_base , trained_clusters , results = fit_models(df=library,
                                                       reduced=False,
                                                       base_topic_model=base_topic_model,
                                                       cluster_topic_model=cluster_topic_model)

results.head()

## 3. Creating the evaluation metrics

Next we will evaluate the topic model on a dev set of 50 brand new articles that are not present in the dataset. We will measure

1. The fraction of outliers per subject tag on the entire dataset
2. The fraction of outlier predictions in the dev set
3. The (subjective) accuracy of the predicted key-words. 

To the third point, the last 1/5 of the dev set consists of papers that Jee uhn and I will be confident in categorizing. The others will be a rough eye-test by non-experts.

In [86]:
from sklearn.preprocessing import MultiLabelBinarizer

def OHE_cats(df):
    """Return a DataFrame of one-hot-encoded categories of the library with
    the same index as the library
    """
    mlb = MultiLabelBinarizer()
    categories = data_utils.category_map()

    def convert_to_eng(cat_array):
      out = []
      for tag in cat_array:
        if ('math.' + tag) not in categories.keys():
          continue
        else:
          out.append(categories['math.' + tag])
        return out

    def func_to_apply(cat_array):
      if convert_to_eng(cat_array):
        return convert_to_eng(cat_array)
      else:
        return 'Unknown'

    eng_cats = df['strip_cat'].apply(func_to_apply)
    OHE_array = mlb.fit_transform(eng_cats)
    
    return pd.DataFrame(OHE_array,columns=mlb.classes_,index=df.index)


In [75]:
## Define a function to get outlier information.
## This will take in the results of predicting
## topics and return a dataframe showing the
## breakdown of total # of outliers per subject tag,
## as well as the ratio of outliers per subject tag.

def get_outlier_stats(results):

  total_subject_count = OHE_cats(results).sum(axis=0)
  outliers = results.loc[results['fine_topic_labels'] == -1] 
  outlier_subject_count = OHE_cats(outliers).sum(axis=0)
  outlier_subject_ratio = outlier_subject_count / total_subject_count

  return pd.concat({'outlier_subject_count' : outlier_subject_count, 'outlier_subject_ratio': outlier_subject_ratio})



In [74]:
## Define a function taking in the results of topic model prediction and returning the predicted topics as well as the outlier stats

def eval_predictions(results):
  
  predicted_topics = results[['title_raw','abstract_raw','fine_topic_labels']].loc[results['fine_topic_labels'] != -1]

  return predicted_topics , get_outlier_stats(results)

## 4. Define the hyper-parameter tuning pipeline

In [62]:
## 1. Specify the hyper-parameters needed to build the UMAP and HDBSCAN objects

kmeans_model_params , cluster_model_params = get_model_params(umap_n_neighbors=15,
                                                              umap_n_components=5,
                                                              hdbscan_min_cluster_size=10,
                                                              hdbscan_min_samples=10)

In [63]:
## 2. Construct the cluster models

base_topic_model , cluster_topic_model = construct_models(kmeans_model_params , cluster_model_params)

In [64]:
## Fit the models

trained_base_model , trained_cluster_models , fit_library = fit_models(base_topic_model,cluster_topic_model,reduced=False)


Finding the K-means clusters...


2023-06-02 05:31:06,954 - BERTopic - Reduced dimensionality
2023-06-02 05:31:09,632 - BERTopic - Clustered reduced embeddings


Getting topics for cluster 0...


2023-06-02 05:31:43,642 - BERTopic - Reduced dimensionality
2023-06-02 05:31:43,870 - BERTopic - Clustered reduced embeddings


Getting topics for cluster 1...


2023-06-02 05:32:01,569 - BERTopic - Reduced dimensionality
2023-06-02 05:32:01,771 - BERTopic - Clustered reduced embeddings


Getting topics for cluster 2...


2023-06-02 05:32:30,103 - BERTopic - Reduced dimensionality
2023-06-02 05:32:30,275 - BERTopic - Clustered reduced embeddings


Getting topics for cluster 3...


2023-06-02 05:32:57,093 - BERTopic - Reduced dimensionality
2023-06-02 05:32:57,269 - BERTopic - Clustered reduced embeddings


Getting topics for cluster 4...


2023-06-02 05:33:13,020 - BERTopic - Reduced dimensionality
2023-06-02 05:33:13,122 - BERTopic - Clustered reduced embeddings


In [82]:
## Predict topics on the dev set

dev_predictions = predict_topics(test_path='./final_data/clean_dev_set.csv',
               test_embeddings_path='./final_data/dev_vec_specter.parquet',
               trained_base_model=trained_base_model,
               trained_cluster_models=trained_cluster_models)


Predicting K-means clusters for each document...


2023-06-02 05:45:09,795 - BERTopic - Reduced dimensionality
2023-06-02 05:45:09,801 - BERTopic - Predicted clusters


Predicting topic labels for cluster 0...


2023-06-02 05:45:10,774 - BERTopic - Reduced dimensionality
2023-06-02 05:45:10,780 - BERTopic - Predicted clusters


Predicting topic labels for cluster 1...


2023-06-02 05:45:11,687 - BERTopic - Reduced dimensionality
2023-06-02 05:45:11,692 - BERTopic - Predicted clusters


Predicting topic labels for cluster 2...


2023-06-02 05:45:12,621 - BERTopic - Reduced dimensionality
2023-06-02 05:45:12,626 - BERTopic - Predicted clusters


Predicting topic labels for cluster 3...


2023-06-02 05:45:13,528 - BERTopic - Reduced dimensionality
2023-06-02 05:45:13,533 - BERTopic - Predicted clusters


Predicting topic labels for cluster 4...


2023-06-02 05:45:14,415 - BERTopic - Reduced dimensionality
2023-06-02 05:45:14,417 - BERTopic - Predicted clusters


In [87]:
## Get the evaluation metrics

eye_test , outlier_stats = eval_predictions(dev_predictions)


In [90]:
## Look at the fitted library itself
library_outlier_stats = get_outlier_stats(fit_library)


print(library_outlier_stats)
fit_library.head()

outlier_subject_count  Algebraic Geometry              150.000000
                       Algebraic Topology               36.000000
                       Analysis of PDEs               1634.000000
                       Category Theory                   8.000000
                       Classical Analysis and ODEs     110.000000
                                                         ...     
outlier_subject_ratio  Representation Theory             0.457831
                       Rings and Algebras                0.517241
                       Spectral Theory                   0.415094
                       Statistics Theory                 0.422018
                       Symplectic Geometry               0.412162
Length: 64, dtype: float64


Unnamed: 0,index,id,title_raw,abstract_raw,update_date,strip_cat,authors_parsed,title_clean,abstract_clean,authors_clean,abstract_tokenized,abstract_reduced_tokens,abstract_rejoin,doc_string,doc_string_reduced,kaggle_index,kmeans_labels,fine_topic_labels
0,182244,1412.3275,Limit cycles bifurcating from a degenerate center,We study the maximum number of limit cycles ...,2014-12-11,[DS],"[['Llibre', 'J.', ''], ['Pantazi', 'C.', '']]",limit cycles bifurcating from a degenerate center,we study the maximum number of limit cycles th...,"[['llibre', 'j', ''], ['pantazi', 'c', '']]","[we, study, the, maximum, number, of, limit, c...","[we, study, the, maximum, number, of, limit, c...",we study the maximum number of limit cycles th...,limit cycles bifurcating from a degenerate cen...,limit cycles bifurcating from a degenerate cen...,182244,1,3 | bifurcation | bifurcations | hopf | hopf b...
1,196425,809.351,Shrinking Point Bifurcations of Resonance Tong...,Resonance tongues are mode-locking regions o...,2015-05-13,[DS],"[['Simpson', 'D. J. W.', ''], ['Meiss', 'J. D....",shrinking point bifurcations of resonance tong...,resonance tongues are mode locking regions of ...,"[['simpson', 'd j w', ''], ['meiss', 'j d', '']]","[resonance, tongues, are, mode, locking, regio...","[resonance, tongues, are, mode, locking, regio...",resonance tongues are mode locking regions of ...,shrinking point bifurcations of resonance tong...,shrinking point bifurcations of resonance tong...,196425,1,3 | bifurcation | bifurcations | hopf | hopf b...
2,479424,2201.04222,Classification of Codimension-1 Singular Bifur...,The study of bifurcations of differential-al...,2022-01-13,[DS],"[['Ovsyannikov', 'Ivan', ''], ['Ruan', 'Haibo'...",classification of codimension singular bifurca...,the study of bifurcations of differential alge...,"[['ovsyannikov', 'ivan', ''], ['ruan', 'haibo'...","[the, study, of, bifurcations, of, differentia...","[the, study, of, bifurcations, of, differentia...",the study of bifurcations of differential alge...,classification of codimension singular bifurca...,classification of codimension singular bifurca...,479424,1,3 | bifurcation | bifurcations | hopf | hopf b...
3,176385,1408.5812,Partial sums of excursions along random geodes...,"For a non-uniform lattice in SL(2,R), we con...",2014-10-09,"[GT, DS]","[['Gadre', 'Vaibhav', '']]",partial sums of excursions along random geodes...,for a non uniform lattice in slr we consider e...,"[['gadre', 'vaibhav', '']]","[for, a, non, uniform, lattice, in, slr, we, c...","[for, a, non, uniform, lattice, in, slr, we, c...",for a non uniform lattice in slr we consider e...,partial sums of excursions along random geodes...,partial sums of excursions along random geodes...,176385,3,-1
4,291058,1707.03102,Uniform dimension results for a family of Mark...,In this paper we prove uniform Hausdorff and...,2017-10-03,[PR],"[['Sun', 'Xiaobin', ''], ['Xiao', 'Yimin', '']...",uniform dimension results for a family of mark...,in this paper we prove uniform hausdorff and p...,"[['sun', 'xiaobin', ''], ['xiao', 'yimin', '']...","[in, this, paper, we, prove, uniform, hausdorf...","[in, this, paper, we, prove, uniform, hausdorf...",in this paper we prove uniform hausdorff and p...,uniform dimension results for a family of mark...,uniform dimension results for a family of mark...,291058,0,8 | markov | chains | markov chains | chain | ...


In [91]:
## Look at the outlier stats by subject
outlier_stats


outlier_subject_count  U    42.00
                       k    42.00
                       n    42.00
                       o    42.00
                       w    42.00
outlier_subject_ratio  U     0.84
                       k     0.84
                       n     0.84
                       o     0.84
                       w     0.84
dtype: float64

In [92]:
## Evaluate the labels by eye-test
eye_test

Unnamed: 0,title_raw,abstract_raw,fine_topic_labels
2,Clustering and Arnoux-Rauzy words,We characterize the clustering of a word under...,22 | grassmannians | grassmann | grassmannian ...
6,Endemic Oscillations for SARS-CoV-2 Omicron --...,The SIRS model with constant vaccination and i...,2 | lie | lie algebras | algebras | lie groups...
12,A sub-Riemannian maximum modulus theorem,In this note we prove a sub-Riemannian maximum...,-1 | theory | quantum | field | algebra | spac...
19,One-parameter discrete-time Calogero-Moser system,We present a new type of integrable one-dimens...,2 | lie | lie algebras | algebras | lie groups...
24,"Boundary rigidity, and non-rigidity, of projec...",We investigate the property of boundary rigidi...,14 | symplectic | poisson | manifolds | manifo...
28,On the reduced space of multiplicative multive...,A strict Lie $2$-algebra $\Gamma(\wedge^\bulle...,6 | branes | brane | calabi yau | yau | calabi...
31,Sharp bounds on the height of K-semistable tor...,Inspired by Fujita's algebro-geometric result ...,10 | string | boundary | entanglement | bulk |...
44,Enumerative geometry via the moduli space of s...,In this paper we relate volumes of moduli spac...,10 | string | boundary | entanglement | bulk |...


The outlier stats are obviously fucked up, fix tomorrow morning