# Hierarchical topic modeling analysis

## Goal

We perform a topic analysis on a dataset of arXiv pre-prints, using only their titles and abstracts.

## The dataset

Our dataset contains the metadata from a uniform sample of 20,000 papers among those with subject tags in the following list:

Dynamical systems, PDEs, Mathematical Physics, Probability, and Differential Geometry.

## Layout of this notebook

1. Preliminary analysis of the data
1. Creating the basic topic model structure
1. Creating the evaluation metrics
1. Tuning hyper-parameters
1. Evaluating performance of the model on a test set


# Create the basic topic model structure

### We use a hierarchical clustering method: we first cluster at large scale using K-means, then run an HDBSCAN-based clustering inside each K-means cluster to extract fine topic information.
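
The two-stage idea can be sketched on synthetic data. This is only a minimal illustration, not the notebook's pipeline: it uses scikit-learn's `DBSCAN` as a stand-in for HDBSCAN, skips UMAP and BERTopic entirely, and the blob data, the choice of 3 coarse clusters, and the `eps` and `min_samples` values are all assumptions made for the toy example.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D points standing in for document embeddings (assumption).
X, _ = make_blobs(n_samples=600, centers=6, cluster_std=0.4, random_state=623)

# Stage 1: coarse partition of the whole dataset with K-means.
coarse = KMeans(n_clusters=3, n_init=10, random_state=623).fit_predict(X)

# Stage 2: density-based clustering run separately inside each coarse cluster.
# DBSCAN is a stand-in for HDBSCAN here; both use the label -1 for outliers.
fine = np.full(len(X), -1)
for k in np.unique(coarse):
    mask = coarse == k
    fine[mask] = DBSCAN(eps=0.5, min_samples=10).fit_predict(X[mask])

# A point's topic is the pair (coarse label, fine label).
pairs = list(zip(coarse.tolist(), fine.tolist()))
print(pairs[:5])
```

Fine labels are only meaningful relative to their coarse cluster: fine label 0 in coarse cluster 1 and fine label 0 in coarse cluster 2 are different topics.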

In [1]:
## Install necessary packages
!pip install arxiv
!pip install bertopic
!pip install sentence_transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting arxiv
  Downloading arxiv-1.4.7-py3-none-any.whl (12 kB)
Collecting feedparser (from arxiv)
  Downloading feedparser-6.0.10-py3-none-any.whl (81 kB)
Collecting sgmllib3k (from feedparser->arxiv)
  Downloading sgmllib3k-1.0.0.tar.gz (5.8 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: sgmllib3k
  Building wheel for sgmllib3k (setup.py) ... done
  Created wheel for sgmllib3k: filename=sgmllib3k-1.0.0-py3-none-any.whl size=6046 sha256=e6b8aad32ef4bb39c1336ae690d08abfaf7ce349860dcef073770d4ac7dcf565
  Stored in directory: /root/.cache/pip/wheels/f0/69/93/a47e9d621be168e9e33c7ce60524393c0b92ae83cf6c6e89c5
Successfully built sgmllib3k
Installing collected packages: sgmllib3k, feedparser, arxiv
Succes

In [11]:
## Imports
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.cluster import KMeans
from bertopic import BERTopic
import pandas as pd 
import numpy as np

In [3]:
## Create the umap objects

# UMAP for K-means step
kmeans_proj = UMAP(n_neighbors=15,n_components=5,metric='euclidean',min_dist=0.0,random_state=623)

# UMAP for subtopic clustering
cluster_proj = UMAP(n_neighbors=15, n_components=5,metric='euclidean',min_dist=0.0,random_state=623)

# UMAP for visualizing the document clustering in two dimensions during evaluation.
vis_proj = UMAP(n_neighbors=15,n_components=2,metric='euclidean',min_dist=0.0,random_state=623)

## We use a fixed random state to eliminate stochastic effects in tuning hyperparameters and to compare to the global topic model.

In [4]:
## Create clusterers

# K-means
kmeans_clusterer = KMeans(5) #k = 5 reflects the major presence of 5 distinct subjects.

# HDBSCAN for fine clustering
subclusterer = HDBSCAN(min_cluster_size=10,min_samples=10,max_cluster_size=0,metric='euclidean')


In [5]:
## Create the data needed for optimal topic representation
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.representation import MaximalMarginalRelevance

vectorizer = CountVectorizer(stop_words='english',ngram_range=(1,2))
rep_model = MaximalMarginalRelevance(diversity=0.5)

In [6]:
## Create the two kinds of topic model architecture



# K-means
base_topic_model = BERTopic(umap_model=kmeans_proj,
                            hdbscan_model=kmeans_clusterer,
                            vectorizer_model=vectorizer,
                            representation_model=rep_model,
                            verbose=True)

# Fine clustering
cluster_topic_model = BERTopic(umap_model=cluster_proj,
                            hdbscan_model=subclusterer,
                            vectorizer_model=vectorizer,
                            representation_model=rep_model,
                            verbose=True) 


## Keeping track of hyper-parameter modifications

We first decide which hyper-parameters to tune. To simplify the procedure, we use the same BERTopic configuration for the cluster model inside every K-means cluster. Therefore we only need to choose the parameters of **two** BERTopic models. We won't modify the representation of topics, only the UMAP and clustering parameters.

Model 1: UMAP and K-means clustering parameters.
Model 2: UMAP and HDBSCAN clustering parameters.

We write a function which takes two arguments, kmeans_model_params and cluster_model_params, and returns a tuple (kmeans_model , cluster_model). The second model will be run inside every cluster produced by the first.

To input the parameters of the models, we use a dictionary 

kmeans_model_params = { 'umap' : umap_params }
cluster_model_params = {'umap': umap_params , 'hdbscan': hdbscan_params}

Note that we don't change the K-means clusterer itself: with the number of clusters fixed at 5, there are essentially no parameters to tune.

Each set of UMAP and HDBSCAN parameters is stored as a dictionary of keyword arguments and unpacked with ** when the corresponding object is constructed.

umap_params = {'n_neighbors':15 , 'n_components':5, 'metric':'euclidean','min_dist':0.0, 'random_state':623}

hdbscan_params = {'min_cluster_size':10, 'min_samples' : 10, 'max_cluster_size' : 0, 'metric' : 'euclidean'}
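
As a rough sketch of how these dictionaries could be varied during tuning, the snippet below builds several parameter pairs without mutating the defaults. The helper `with_overrides`, the stand-in function `show`, and the candidate values in the grid are hypothetical and chosen only for illustration; they are not part of the notebook.

```python
from copy import deepcopy
from itertools import product

default_umap_params = {'n_neighbors': 15, 'n_components': 5, 'metric': 'euclidean',
                       'min_dist': 0.0, 'random_state': 623}
default_hdbscan_params = {'min_cluster_size': 10, 'min_samples': 10,
                          'max_cluster_size': 0, 'metric': 'euclidean'}

def with_overrides(defaults, **overrides):
    """Return a copy of `defaults` with some entries replaced (hypothetical helper)."""
    params = deepcopy(defaults)
    params.update(overrides)
    return params

def show(**kwargs):
    """Stand-in for a constructor call such as UMAP(**params): echoes its keyword arguments."""
    return kwargs

# Unpacking with ** turns each dictionary entry into a keyword argument.
assert show(**default_umap_params) == default_umap_params

# Hypothetical search grid; the candidate values are illustrative, not tuned.
candidates = []
for n_neighbors, min_cluster_size in product([10, 15, 30], [10, 20]):
    kmeans_model_params = {'umap': with_overrides(default_umap_params, n_neighbors=n_neighbors)}
    cluster_model_params = {
        'umap': with_overrides(default_umap_params, n_neighbors=n_neighbors),
        'hdbscan': with_overrides(default_hdbscan_params, min_cluster_size=min_cluster_size),
    }
    candidates.append((kmeans_model_params, cluster_model_params))

print(len(candidates))  # 6 pairs, one per grid point
```

Each pair in `candidates` has the shape expected by the two arguments of the model-construction function described above.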


In [10]:
def construct_models(kmeans_model_params , cluster_model_params):
  # Construct umap objects 

  kmeans_proj = UMAP(**kmeans_model_params['umap'])
  cluster_proj = UMAP(**cluster_model_params['umap'])

  # Construct clusterers
  kmeans_clusterer = KMeans(n_clusters=5)
  hdbscan_clusterer = HDBSCAN(**cluster_model_params['hdbscan'])

  # K-means
  base_topic_model = BERTopic(umap_model=kmeans_proj,
                              hdbscan_model=kmeans_clusterer,
                              vectorizer_model=vectorizer,
                              representation_model=rep_model,
                              verbose=True)

  # Fine clustering
  cluster_topic_model = BERTopic(umap_model=cluster_proj,
                              hdbscan_model=hdbscan_clusterer,
                              vectorizer_model=vectorizer,
                              representation_model=rep_model,
                              verbose=True) 

  return base_topic_model , cluster_topic_model

In [56]:
## Define the function which trains the models. We are given an input dataframe df with a column called 'docs' containing the strings
## we want topic information for, and a np array of their sentence embeddings called doc_embeddings (row j of the array belongs to row j of df).
from copy import deepcopy

def fit_models(df,doc_embeddings,base_topic_model,cluster_topic_model):
  ## This function returns a triple: the fit K-means model, a dictionary of fit cluster models, and the dataframe
  ## with new columns 'kmeans_labels' and 'fine_topic_labels'. The latter contains:
  ## -1 if the row corresponds to an outlier inside its cluster.
  ## The keyword label the cluster topic model assigned it otherwise.

  # First train the K-means model.
  base_topic_model.fit(documents=df.docs.to_list(), embeddings=doc_embeddings)

  # Create a new column in the dataframe called 'kmeans_labels' which records the topic label for each paper
  df['kmeans_labels'] = pd.Series(base_topic_model.topics_, index=df.index)

  # Construct dictionary of cluster models. Each cluster gets its own copy of the unfit model;
  # reusing one object would leave every entry pointing at the model fit on the last cluster.
  cluster_models = {i : deepcopy(cluster_topic_model) for i in range(5)}

  # Add a placeholder column for the fine topic labels (object dtype, since it mixes -1 and strings)
  df['fine_topic_labels'] = pd.Series(-1, index=df.index, dtype=object)

  for i in range(5):

    # Get the papers in kmeans topic i, as a boolean mask (positions) and as index labels
    mask = (df['kmeans_labels'] == i).to_numpy()
    indices = df.index[mask]

    # Get the documents in this topic
    docs = df.loc[indices, 'docs'].to_list()

    # Get the embeddings for these documents (positional selection, independent of the index labels)
    embeddings = doc_embeddings[mask,:]

    # Train the ith model
    cluster_models[i].fit(documents=docs,embeddings=embeddings)

    # Map each topic number to its keyword label. The labels come back one per topic in sorted
    # topic order, so zipping avoids an off-by-one when a cluster has no outlier topic -1.
    topics = cluster_models[i].topics_
    labels = cluster_models[i].generate_topic_labels()
    label_of = dict(zip(sorted(set(topics)), labels))

    fine_topic_info = pd.DataFrame({'topic_number': topics}, index=indices)
    fine_topic_info['topic_keywords'] = fine_topic_info['topic_number'].map(label_of)

    # Replace the keywords by -1 if the row is an outlier
    fine_topic_info.loc[fine_topic_info['topic_number'] == -1, 'topic_keywords'] = -1

    # Write the labels back with a single .loc call (avoids chained assignment)
    df.loc[indices, 'fine_topic_labels'] = fine_topic_info['topic_keywords']

  return base_topic_model , cluster_models , df


To do: test the functions above to make sure they do what we want, then write the evaluation function and run it on the dev set.

In [24]:
!git clone https://github.com/Anirban-7/Arxiv_Recommender

Cloning into 'Arxiv_Recommender'...
remote: Enumerating objects: 283, done.
remote: Counting objects: 100% (75/75), done.
remote: Compressing objects: 100% (41/41), done.
remote: Total 283 (delta 34), reused 75 (delta 34), pack-reused 208
Receiving objects: 100% (283/283), 361.33 MiB | 23.40 MiB/s, done.
Resolving deltas: 100% (133/133), done.
Updating files: 100% (45/45), done.


In [25]:
cd /content/Arxiv_Recommender/

/content/Arxiv_Recommender


In [26]:
## Test

## Define default parameters

default_umap_params = {'n_neighbors':15 , 'n_components':5, 'metric':'euclidean','min_dist':0.0, 'random_state':623}
default_hdbscan_params = {'min_cluster_size':10, 'min_samples' : 10, 'max_cluster_size' : 0, 'metric' : 'euclidean'}

kmeans_model_params = {'umap' : default_umap_params}
cluster_model_params = {'umap' : default_umap_params , 'hdbscan': default_hdbscan_params}

## Construct models

base_topic_model , cluster_topic_model = construct_models(kmeans_model_params=kmeans_model_params,cluster_model_params=cluster_model_params)



In [32]:
## Load the dataset

df = pd.read_parquet('./data/filter_20k.parquet')
df.head()

Unnamed: 0,id,title,abstract,update_date,authors_parsed,strip_cat
182244,1412.3275,Limit cycles bifurcating from a degenerate center,We study the maximum number of limit cycles ...,2014-12-11,"[['Llibre', 'J.', ''], ['Pantazi', 'C.', '']]",[DS]
196425,809.351,Shrinking Point Bifurcations of Resonance Tong...,Resonance tongues are modelocking regions of...,2015-05-13,"[['Simpson', 'D. J. W.', ''], ['Meiss', 'J. D....",[DS]
479424,2201.04222,Classification of Codimension1 Singular Bifurc...,The study of bifurcations of differentialalg...,2022-01-13,"[['Ovsyannikov', 'Ivan', ''], ['Ruan', 'Haibo'...",[DS]
176385,1408.5812,Partial sums of excursions along random geodes...,"For a nonuniform lattice in SL(2,R), we cons...",2014-10-09,"[['Gadre', 'Vaibhav', '']]","[GT, DS]"
291058,1707.03102,Uniform dimension results for a family of Mark...,In this paper we prove uniform Hausdorff and...,2017-10-03,"[['Sun', 'Xiaobin', ''], ['Xiao', 'Yimin', '']...",[PR]


In [33]:
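## Create the 'docs' column (title and abstract joined by a space so the last title word and first abstract word don't fuse)
df['docs'] = df.title + ' ' + df.abstract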

## Reset index
df = df.reset_index()
df.head()



Unnamed: 0,index,id,title,abstract,update_date,authors_parsed,strip_cat,docs
0,182244,1412.3275,Limit cycles bifurcating from a degenerate center,We study the maximum number of limit cycles ...,2014-12-11,"[['Llibre', 'J.', ''], ['Pantazi', 'C.', '']]",[DS],Limit cycles bifurcating from a degenerate cen...
1,196425,809.351,Shrinking Point Bifurcations of Resonance Tong...,Resonance tongues are modelocking regions of...,2015-05-13,"[['Simpson', 'D. J. W.', ''], ['Meiss', 'J. D....",[DS],Shrinking Point Bifurcations of Resonance Tong...
2,479424,2201.04222,Classification of Codimension1 Singular Bifurc...,The study of bifurcations of differentialalg...,2022-01-13,"[['Ovsyannikov', 'Ivan', ''], ['Ruan', 'Haibo'...",[DS],Classification of Codimension1 Singular Bifurc...
3,176385,1408.5812,Partial sums of excursions along random geodes...,"For a nonuniform lattice in SL(2,R), we cons...",2014-10-09,"[['Gadre', 'Vaibhav', '']]","[GT, DS]",Partial sums of excursions along random geodes...
4,291058,1707.03102,Uniform dimension results for a family of Mark...,In this paper we prove uniform Hausdorff and...,2017-10-03,"[['Sun', 'Xiaobin', ''], ['Xiao', 'Yimin', '']...",[PR],Uniform dimension results for a family of Mark...


In [29]:
## Load the embeddings

doc_embeddings = pd.read_parquet('./data/df_lib_vecs_20k_all-MiniLM-L6-v2.parquet').values

In [57]:
trained_base , trained_clusters , results = fit_models(df=df,
                                                       doc_embeddings=doc_embeddings,
                                                       base_topic_model=base_topic_model,
                                                       cluster_topic_model=cluster_topic_model)

results.head()

2023-06-01 05:28:16,043 - BERTopic - Reduced dimensionality
2023-06-01 05:28:18,548 - BERTopic - Clustered reduced embeddings
2023-06-01 05:28:51,446 - BERTopic - Reduced dimensionality
2023-06-01 05:28:51,627 - BERTopic - Clustered reduced embeddings
2023-06-01 05:29:08,872 - BERTopic - Reduced dimensionality
2023-06-01 05:29:09,013 - BERTopic - Clustered reduced embeddings
2023-06-01 05:29:23,882 - BERTopic - Reduced dimensionality
2023-06-01 05:29:24,021 - BERTopic - Clustered reduced embeddings
2023-06-01 05:29:40,171 - BERTopic - Reduced dimensionality
2023-06-01 05:29:40,311 - BERTopic - Clustered reduced embeddings
2023-06-01 05:29:52,158 - BERTopic - Reduced dimensionality
2023-06-01 05:29:52,208 - BERTopic - Clustered reduced embeddings


Unnamed: 0,index,id,title,abstract,update_date,authors_parsed,strip_cat,docs,kmeans_labels,fine_topic_labels
0,182244,1412.3275,Limit cycles bifurcating from a degenerate center,We study the maximum number of limit cycles ...,2014-12-11,"[['Llibre', 'J.', ''], ['Pantazi', 'C.', '']]",[DS],Limit cycles bifurcating from a degenerate cen...,4,6_bifurcation_bifurcations_hopf
1,196425,809.351,Shrinking Point Bifurcations of Resonance Tong...,Resonance tongues are modelocking regions of...,2015-05-13,"[['Simpson', 'D. J. W.', ''], ['Meiss', 'J. D....",[DS],Shrinking Point Bifurcations of Resonance Tong...,4,6_bifurcation_bifurcations_hopf
2,479424,2201.04222,Classification of Codimension1 Singular Bifurc...,The study of bifurcations of differentialalg...,2022-01-13,"[['Ovsyannikov', 'Ivan', ''], ['Ruan', 'Haibo'...",[DS],Classification of Codimension1 Singular Bifurc...,4,6_bifurcation_bifurcations_hopf
3,176385,1408.5812,Partial sums of excursions along random geodes...,"For a nonuniform lattice in SL(2,R), we cons...",2014-10-09,"[['Gadre', 'Vaibhav', '']]","[GT, DS]",Partial sums of excursions along random geodes...,3,1_geodesic_geodesics_hyperbolic
4,291058,1707.03102,Uniform dimension results for a family of Mark...,In this paper we prove uniform Hausdorff and...,2017-10-03,"[['Sun', 'Xiaobin', ''], ['Xiao', 'Yimin', '']...",[PR],Uniform dimension results for a family of Mark...,0,-1


## Creating the evaluation metrics

Next we will evaluate the topic model on a dev set of 50 brand new articles that are not present in the dataset. We will measure

1. The fraction of outliers per subject tag on the entire dataset
2. The fraction of outlier predictions in the dev set
3. The (subjective) accuracy of the predicted keywords.

For the third point, the last 1/5 of the dev set consists of papers that Jee uhn and I are confident in categorizing. The others will get a rough eye-test by non-experts.
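
The first two metrics could be computed along the following lines. This is a sketch under assumptions: the function names are hypothetical, each paper's tags are assumed to be available as a list (as the `strip_cat` column appears to be), and outliers are assumed to be marked with the label -1 as in the `fine_topic_labels` column. The toy rows are made up for illustration.

```python
from collections import Counter

def outlier_fraction_by_tag(tag_lists, fine_labels):
    """Fraction of papers labelled -1, per subject tag.

    A paper with several tags is counted once under each of its tags.
    """
    totals, outliers = Counter(), Counter()
    for tags, label in zip(tag_lists, fine_labels):
        for tag in tags:
            totals[tag] += 1
            if label == -1:
                outliers[tag] += 1
    return {tag: outliers[tag] / totals[tag] for tag in totals}

def outlier_fraction(fine_labels):
    """Overall fraction of -1 labels, e.g. among the dev set predictions."""
    labels = list(fine_labels)
    return sum(label == -1 for label in labels) / len(labels)

# Toy rows (made up): tags per paper and the fine topic label assigned to it.
tags = [['DS'], ['DS'], ['GT', 'DS'], ['PR']]
labels = ['6_bifurcation_bifurcations_hopf', -1, '1_geodesic_geodesics_hyperbolic', -1]

print(outlier_fraction_by_tag(tags, labels))
print(outlier_fraction(labels))
```

With the dataframe returned by `fit_models`, the per-tag metric would be called as `outlier_fraction_by_tag(results['strip_cat'], results['fine_topic_labels'])`, provided `strip_cat` really holds one list of tags per row.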