# Hierarchical topic modeling analysis

## Goal


We perform a topic analysis of a dataset of arXiv preprints, based on their titles and abstracts.

## The dataset

Our dataset contains the metadata from a uniform sample of 20,000 papers among those with subject tags in the following list:

Dynamical systems, PDEs, Mathematical Physics, Probability, and Differential Geometry.

## Layout of this notebook

1. Preliminary analysis of the data
1. Creating the basic topic model structure
1. Creating the evaluation metrics
1. Tuning hyper-parameters
1. Evaluating performance of the model on a test set


In [3]:
! git clone https://github.com/Anirban-7/Arxiv_Recommender

Cloning into 'Arxiv_Recommender'...
remote: Enumerating objects: 383, done.
remote: Counting objects: 100% (119/119), done.
remote: Compressing objects: 100% (76/76), done.
remote: Total 383 (delta 57), reused 98 (delta 43), pack-reused 264
Receiving objects: 100% (383/383), 611.09 MiB | 19.75 MiB/s, done.
Resolving deltas: 100% (169/169), done.
Updating files: 100% (84/84), done.


In [4]:
cd /content/Arxiv_Recommender/

/content/Arxiv_Recommender


# 2. Create the basic topic model structure

## Create the basic UMAP, KMeans, and HDBSCAN objects we will modify when tuning hyper-parameters

In the sections below we create two instances of BERTopic models. One will be responsible for the initial K-means clustering and the second will be the template for the 5 topic models fit on each cluster.

In [5]:
## Install necessary packages
!pip install arxiv
!pip install bertopic
!pip install sentence_transformers


In [6]:
## Imports
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.cluster import KMeans
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from bertopic.representation import MaximalMarginalRelevance
import pandas as pd 
import numpy as np
import data_utils

## Create the two-step model and define fitting and predicting methods.

We first decide which hyper-parameters to tune. To simplify the procedure, every cluster model is built from the same BERTopic configuration, so we only need to choose the parameters of **two** BERTopic models. We do not modify the representation of topics, only the UMAP and clustering parameters.

Model 1: UMAP and K-means clustering parameters.
Model 2: UMAP and HDBSCAN clustering parameters.

We write a function that takes two arguments, kmeans_model_params and cluster_model_params, and returns a tuple (kmeans_model , cluster_model). The second model is run inside every cluster produced by the first.

To input the parameters of the models, we use a dictionary 

kmeans_model_params = { 'umap' : umap_params }
cluster_model_params = {'umap': umap_params , 'hdbscan': hdbscan_params}

Note that we don't change the kmeans clusterer itself because there are essentially no parameters to tune.

The UMAP and HDBSCAN parameters are each packaged as a dictionary of keyword arguments and unpacked with **.

umap_params = {'n_neighbors':15 , 'n_components':5, 'metric':'euclidean','min_dist':0.0, 'random_state':623}

hdbscan_params = {'min_cluster_size':10, 'min_samples' : 10, 'max_cluster_size' : 0, 'metric' : 'euclidean'}
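
The packing and unpacking described above can be illustrated without any clustering library. The following is a minimal sketch: `make_projector` is a hypothetical stand-in for a constructor such as `UMAP(...)`, and it only records the keyword arguments it receives.

```python
# Hypothetical stand-in for a constructor such as UMAP(...).
# It simply returns the keyword arguments it was called with.
def make_projector(n_neighbors=15, n_components=2, metric="euclidean",
                   min_dist=0.1, random_state=None):
    return {
        "n_neighbors": n_neighbors,
        "n_components": n_components,
        "metric": metric,
        "min_dist": min_dist,
        "random_state": random_state,
    }

umap_params = {"n_neighbors": 15, "n_components": 5, "metric": "euclidean",
               "min_dist": 0.0, "random_state": 623}

# ** unpacks the dictionary into keyword arguments.
projector = make_projector(**umap_params)

# Override one setting without mutating the shared defaults.
tuned_params = {**umap_params, "n_neighbors": 30}
tuned = make_projector(**tuned_params)

print(projector["n_components"], tuned["n_neighbors"], umap_params["n_neighbors"])
```

Building a new dictionary for each variation keeps the default parameters intact, which matters when the same dictionary is reused for several models.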


### Create the function which assembles the model parameters

In [7]:
## Fix the parameters we will vary and construct the full set of model parameters from these

def get_model_params(umap_n_neighbors=15, umap_n_components=5, hdbscan_min_cluster_size=10, hdbscan_min_samples=10):

  # Build the parameter dictionaries from the function arguments so that tuning actually changes the models
  umap_params = {'n_neighbors':umap_n_neighbors , 'n_components':umap_n_components, 'metric':'euclidean','min_dist':0.0, 'random_state':623}
  # prediction_data is a boolean flag; pass True rather than the string 'True'
  hdbscan_params = {'min_cluster_size':hdbscan_min_cluster_size, 'min_samples' : hdbscan_min_samples, 'max_cluster_size' : 0, 'metric' : 'euclidean', 'prediction_data':True}

  kmeans_model_params = {'umap' : umap_params} 
  cluster_model_params = {'umap' : umap_params, 'hdbscan': hdbscan_params}

  return kmeans_model_params , cluster_model_params

### Create the construct models function

In [8]:
def construct_models(kmeans_model_params , cluster_model_params):
  # Construct umap objects 

  kmeans_proj = UMAP(**kmeans_model_params['umap'])
  cluster_proj = UMAP(**cluster_model_params['umap'])

  # Construct clusterers
  kmeans_clusterer = KMeans(n_clusters=5)
  hdbscan_clusterer = HDBSCAN(**cluster_model_params['hdbscan'])

  # Construct topic representation
  vectorizer = CountVectorizer(stop_words='english',ngram_range=(1,2))
  rep_model = MaximalMarginalRelevance(diversity=0.5)

  # K-means
  base_topic_model = BERTopic(umap_model=kmeans_proj,
                              hdbscan_model=kmeans_clusterer,
                              vectorizer_model=vectorizer,
                              representation_model=rep_model,
                              verbose=True)

  # Fine clustering
  cluster_topic_model = BERTopic(umap_model=cluster_proj,
                              hdbscan_model=hdbscan_clusterer,
                              vectorizer_model=vectorizer,
                              representation_model=rep_model,
                              verbose=True) 

  return base_topic_model , cluster_topic_model

#### Create the fit_model function

In [9]:
## Define the function which trains the models. We use the dataframe 'library.parquet', which contains the columns 'doc_string' and
## 'doc_string_reduced'. This is the corpus of papers on which we do topic analysis. These columns are the exact text strings that are fed into the specter
## sentence embedding model to generate the vector embeddings we cluster. The 'reduced' argument selects which text to use: titles + abstracts with both
## LaTeX and rare words removed (reduced=True), or with only the LaTeX removed (reduced=False).

def fit_models(base_topic_model,cluster_topic_model,reduced=False):
  """
  Arguments:

  reduced: Boolean determining whether we use the reduced title + abstract or the minimally cleaned title + abstract
  base_topic_model: a bertopic model which does the first step k-means clustering
  cluster_topic_model: a bertopic model which does the second step of topic identification within each cluster.

  Returns:

  A 3 tuple consisting of the trained kmeans model, a dictionary of trained cluster models, the dataframe returned with two additional
  columns. The new columns are
    1. 'kmeans_labels' : the numerical label 0-4 which corresponds to the k-means cluster the document belongs to
    2. 'fine_topic_labels' : -1 if the document is an outlier within its cluster. Otherwise, a string of the generated keywords
    (separated by ' | ') that best describe the topic assigned to the document.
  """
  df = pd.read_parquet('./final_data/library.parquet')

  if reduced:
    embeddings = pd.read_parquet('./final_data/library_vec_reduced_specter.parquet').values
    docs = 'doc_string_reduced'
  else:
    embeddings = pd.read_parquet('./final_data/library_vec_specter.parquet').values
    docs = 'doc_string'

  # First train the K-means model.
  print('Finding the K-means clusters...')
  base_topic_model.fit(documents=df[docs].to_list(), embeddings=embeddings)

  # Create a new column in the dataframe called 'kmeans_labels' which records the topic label for each paper
  kmeans_labels = pd.Series(base_topic_model.topics_, index=df.index)
  df['kmeans_labels'] = kmeans_labels

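  # Construct dictionary of cluster models. Each cluster gets its own deep copy of the template;
  # reusing one object for every key would let each fit overwrite the previous one.
  from copy import deepcopy
  cluster_models = {i : deepcopy(cluster_topic_model) for i in range(5)}

  # Add a placeholder column for the fine topic labels (object dtype, since it will hold keyword strings and -1)
  df['fine_topic_labels'] = pd.Series(0, index=df.index, dtype=object)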

  for i in range(5):
    print(f'Getting topics for cluster {i}...')
    
    # Get the papers in kmeans topic i
    indices = df.loc[df['kmeans_labels'] == i].index

    # Get the documents in this topic
    cluster_docs = df[docs].iloc[indices].to_list()

    # Get the embeddings for these documents
    cluster_embeddings = embeddings[indices,:]

    # Train the ith model
    cluster_models[i].fit(documents=cluster_docs,embeddings=cluster_embeddings)

    # Create the topic labels dataframe
    topics = cluster_models[i].topics_
    labels = cluster_models[i].generate_topic_labels(nr_words=10,separator=' | ')

    
    def get_keywords(i):
      return labels[i+1]

    fine_topic_info = pd.DataFrame({'topic_number': topics}, index=indices)
    fine_topic_info['topic_keywords'] = fine_topic_info['topic_number'].apply(func=get_keywords)

    # Replace the keywords by -1 if the row is an outlier (single .loc call avoids chained assignment)
    fine_topic_info.loc[fine_topic_info['topic_number'] == -1, 'topic_keywords'] = -1

    df.loc[indices, 'fine_topic_labels'] = fine_topic_info['topic_keywords']

  return base_topic_model , cluster_models , df
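
One detail worth checking in a per-cluster dictionary of models is whether each key holds its own model object. The sketch below uses a hypothetical `ToyModel` class (not BERTopic) to show the difference between storing the same object under every key and storing independent deep copies.

```python
from copy import deepcopy

class ToyModel:
    """Hypothetical stand-in for a topic model; it only remembers its last training data."""
    def __init__(self):
        self.fitted_on = None

    def fit(self, docs):
        self.fitted_on = list(docs)
        return self

clusters = {0: ["a", "b"], 1: ["c"], 2: ["d", "e", "f"]}

# Same object under every key: each fit overwrites the previous one.
template = ToyModel()
shared = {i: template for i in clusters}
for i, docs in clusters.items():
    shared[i].fit(docs)

# Independent copies: each key keeps the state from its own cluster.
template = ToyModel()
independent = {i: deepcopy(template) for i in clusters}
for i, docs in clusters.items():
    independent[i].fit(docs)

print(shared[0].fitted_on)       # state left by the last fit
print(independent[0].fitted_on)  # state from cluster 0
```

With shared references, every entry ends up reflecting the cluster that was fit last, so predictions for the other clusters would silently use the wrong model.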


#### Create the predict_topics function.

In [10]:
## This will work very similarly to the fit function.

## We assume we are given a dataframe consisting of test documents. The column 'doc_string' contains the text that was used to generate the embedding of each document
## (cleaned title and abstract). The 'strip_cat' column contains the arXiv math subject tags as a list in which each tag is represented by its two-letter code, e.g.
## Dynamical Systems is 'DS'.
## The goal is to return the test dataframe with the same two additional columns that the fit method constructs. 

def predict_topics(test_path,test_embeddings_path,trained_base_model,trained_cluster_models):
  """
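  Args:

  test_path: path to the parquet file holding the test documents (must contain the column 'doc_string')
  test_embeddings_path: path to the parquet file holding the embeddings of the test documents, in the same row order
  trained_base_model: the fitted bertopic model which does the first step k-means clustering
  trained_cluster_models: a dictionary with keys 0-4 mapping each k-means cluster to its fitted cluster model

  Returns: The test dataframe with two additional
  columns. The new columns are
    1. 'kmeans_labels' : the numerical label 0-4 which corresponds to the k-means cluster the document belongs to
    2. 'fine_topic_labels' : -1 if the document is an outlier within its cluster. Otherwise, a string of the generated keywords
    (separated by ' | ') that best describe the topic assigned to the document.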
  
  """

  test = pd.read_parquet(test_path)
  test_embeddings = pd.read_parquet(test_embeddings_path).values  


  ## Trained cluster models are encoded as a dictionary with keys 0-4 representing the K-means cluster each was trained on.

  # Grab the documents the embeddings were trained on
  docs = test['doc_string'].to_list()
  
  # Predict the K-means topic of each paper and store these as a series
  print('Predicting K-means clusters for each document...')
  kmeans_label_list , _ = trained_base_model.transform(documents=docs, embeddings=test_embeddings)
  kmeans_labels = pd.Series(kmeans_label_list,index=test.index)
  
  # Add K-means labels to the dataframe
  test['kmeans_labels'] = kmeans_labels

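  # Add a placeholder column for the fine topic labels (object dtype, since it will hold keyword strings and -1)
  test['fine_topic_labels'] = pd.Series(0, index=test.index, dtype=object)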


  for i in range(5):
    print(f'Predicting topic labels for cluster {i}...')

    # Get the papers in kmeans topic i
    indices = test.loc[test['kmeans_labels'] == i].index

    # Get the documents in this topic
    cluster_docs = test['doc_string'].iloc[indices].to_list()

    # Get the embeddings for these documents
    cluster_embeddings = test_embeddings[indices,:]

    # Get the predicted topics for this cluster
    topics , _ = trained_cluster_models[i].transform(documents=cluster_docs,embeddings=cluster_embeddings)
    labels = trained_cluster_models[i].generate_topic_labels(nr_words=10,separator=' | ')
    
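    # As in fit_models, the label list starts with the outlier topic -1, so topic t sits at position t+1.
    # This assumes the fitted cluster model has an outlier topic.
    def get_keywords(topic):
      return labels[topic+1]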

    fine_topic_info = pd.DataFrame({'topic_number': topics}, index=indices)
    fine_topic_info['topic_keywords'] = fine_topic_info['topic_number'].apply(func=get_keywords)

    # Replace the keywords by -1 if the row is an outlier (single .loc call avoids chained assignment)
    fine_topic_info.loc[fine_topic_info['topic_number'] == -1, 'topic_keywords'] = -1

    test.loc[indices, 'fine_topic_labels'] = fine_topic_info['topic_keywords']

  return test
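
The keyword lookup depends on how the label list is ordered relative to the topic numbers. The sketch below builds an explicit lookup instead of relying on a positional offset. It assumes the label list is ordered by ascending topic id over the topics seen at fit time, with the outlier topic -1 first when it exists. The helper names and the toy labels are hypothetical.

```python
def build_label_lookup(fitted_topic_ids, labels):
    """Map each topic id seen at fit time to its label (assumes ascending-id label order)."""
    unique_ids = sorted(set(fitted_topic_ids))
    if len(unique_ids) != len(labels):
        raise ValueError("expected exactly one label per fitted topic id")
    return dict(zip(unique_ids, labels))

def keywords_for(topic, lookup):
    """Return -1 for outliers (or unknown topics), otherwise the keyword string."""
    if topic == -1:
        return -1
    return lookup.get(topic, -1)

# Toy data: topic assignments from fitting, and one invented label per topic.
fitted_topic_ids = [-1, 0, 1, 0, -1, 2]
labels = ["-1 | misc", "0 | heat | equation", "1 | markov | chain", "2 | ricci | flow"]

lookup = build_label_lookup(fitted_topic_ids, labels)
predicted = [0, -1, 2]
print([keywords_for(t, lookup) for t in predicted])
```

An explicit dictionary also behaves sensibly when a cluster has no outlier topic, a case in which a fixed offset of one would point at the wrong label.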





# 3. Creating the evaluation metrics

### What we want to measure

Next we will evaluate the topic model on a dev set of 50 brand new articles that are not present in the dataset. We will measure

1. The fraction of outliers per subject tag on the entire dataset
2. The fraction of outlier predictions in the dev set
3. The (subjective) accuracy of the predicted keywords. 

To the third point, the last 1/5 of the dev set consists of papers that Jee uhn and I will be confident in categorizing. The others will be a rough eye-test by non-experts.
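
The first two measurements reduce to counting, for each subject tag, how many documents carry the tag and how many of those were labeled as outliers. Here is a minimal plain-Python sketch of that computation on invented toy records. The function name `outlier_ratio_by_tag` is hypothetical and is not part of the notebook.

```python
from collections import Counter

def outlier_ratio_by_tag(records):
    """records: iterable of (subject_tags, fine_topic_label); label -1 marks an outlier."""
    total = Counter()
    outliers = Counter()
    for tags, label in records:
        unique_tags = set(tags)  # count each tag once per document
        total.update(unique_tags)
        if label == -1:
            outliers.update(unique_tags)
    # A Counter returns 0 for missing keys, so tags without outliers get ratio 0.0.
    return {tag: outliers[tag] / total[tag] for tag in total}

# Toy records: (subject tags, predicted fine topic label)
records = [
    (["MP", "AP"], -1),
    (["MP"], "0 | heat | equation"),
    (["PR"], "1 | markov | chain"),
    (["AP", "PR"], -1),
    (["DG"], "2 | ricci | flow"),
]

ratios = outlier_ratio_by_tag(records)
print(ratios)
```

A document with several tags contributes to every one of them, so the per-tag totals can add up to more than the number of documents.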

### Getting outlier statistics

In [11]:
def OHE_cats(df):
    """Return a DataFrame of one-hot-encoded categories of the library with
    the same index as the library
    """

    mlb = MultiLabelBinarizer()
    OHE_array = mlb.fit_transform(df.strip_cat)
    
    return pd.DataFrame(OHE_array,columns=mlb.classes_,index=df.index)
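
To make the encoding concrete, the sketch below reproduces the idea of a multi-label one-hot encoding in plain Python on invented tag lists. The function `one_hot_tags` is a hypothetical illustration. It sorts the class names alphabetically, as `MultiLabelBinarizer` does for its `classes_` attribute.

```python
def one_hot_tags(tag_lists):
    """Return (classes, rows): sorted tag names and one 0/1 row per document."""
    classes = sorted({tag for tags in tag_lists for tag in tags})
    rows = [[1 if c in tags else 0 for c in classes] for tags in tag_lists]
    return classes, rows

# Toy subject tags for three documents
tag_lists = [["MP", "AP"], ["DS"], ["AP", "PR", "DS"]]

classes, rows = one_hot_tags(tag_lists)
# Summing each column gives the number of documents per subject tag.
column_sums = [sum(col) for col in zip(*rows)]

print(classes)
print(rows)
print(column_sums)
```

Summing the columns of this matrix is what yields the per-subject counts used for the outlier statistics.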


In [12]:
## Define a function to get outlier information. This will take in the results of predicting topics and return a dataframe
## showing the breakdown of total # of outliers per subject tag, as well as the ratio of outliers per subject tag.

def get_outlier_stats(results):

  total_subject_count = OHE_cats(results).sum(axis=0)
  outliers = results.loc[results['fine_topic_labels'] == -1] 
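  # Reindex so that subject tags with no outliers get a count of 0 instead of NaN
  outlier_subject_count = OHE_cats(outliers).sum(axis=0).reindex(total_subject_count.index, fill_value=0)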
  outlier_subject_ratio = outlier_subject_count / total_subject_count

  return pd.DataFrame({'total_subject_count': total_subject_count,
                       'outlier_subject_count' : outlier_subject_count,
                       'outlier_subject_ratio': outlier_subject_ratio}).sort_values(by=['total_subject_count'],
                                                                                    ascending=False)



### Getting the final evaluation: Predicted topics for non-outliers plus outlier statistics per subject.

In [13]:
## Define a function taking in the results of topic model prediction and returning the predicted topics
## as well as the outlier stats

def eval_predictions(results):
  
  predicted_topics = results[['title_raw','abstract_raw','fine_topic_labels','strip_cat']].loc[results['fine_topic_labels'] != -1]

  return predicted_topics , get_outlier_stats(results)

# 4. Define the hyper-parameter tuning pipeline

### Working pipeline for parameter-tuning

In [14]:
## Create a single function that builds the topic models, fits the library, and outputs the evaluation of the predictions made
## on a given test set

def eval_model(test_or_dev,
                      reduced=False,                      
                      umap_n_neighbors=15,
                      umap_n_components=5,
                      hdbscan_min_cluster_size=10,
                      hdbscan_min_samples=10):
  
  print('Constructing cluster models...')
  print()
  
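  ## Construct model params, passing the function arguments through so each call can use different settings
  kmeans_model_params , cluster_model_params = get_model_params(umap_n_neighbors=umap_n_neighbors,
                                                              umap_n_components=umap_n_components,
                                                              hdbscan_min_cluster_size=hdbscan_min_cluster_size,
                                                              hdbscan_min_samples=hdbscan_min_samples)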
  ## Construct the cluster models

  base_topic_model , cluster_topic_model = construct_models(kmeans_model_params , cluster_model_params)

  ## Fit the models
  print('Fitting cluster models...')
  print()

  trained_base_model , trained_cluster_models , fit_library = fit_models(base_topic_model,cluster_topic_model,reduced=reduced)

  ## Make predictions
  if test_or_dev == 'dev':
    test_path = './final_data/clean_dev_set.parquet'
    test_embeddings_path = './final_data/dev_vec_specter.parquet'
  elif test_or_dev == 'test':
    test_path = './final_data/clean_test_set.parquet'
    test_embeddings_path = './final_data/test_vec_specter.parquet'
  else:
    raise ValueError("test_or_dev must be 'dev' or 'test'")


  test_predictions = predict_topics(test_path=test_path,
               test_embeddings_path=test_embeddings_path,
               trained_base_model=trained_base_model,
               trained_cluster_models=trained_cluster_models)

  ## Evaluate predictions
  print('Getting library outlier data...')
  print()

  library_outliers = get_outlier_stats(fit_library)

  print('Getting test set topic predictions & outlier data...')
  test_topic_predictions , test_outlier_stats = eval_predictions(test_predictions)

  print('Library outlier data:')
  print()
  print(library_outliers)
  print()

  print(f'{test_or_dev} set outlier data:')
  print()
  print(test_outlier_stats)
  print()

  print(f'{test_or_dev} set predictions:')
  print()
  print(test_topic_predictions)




# 5. Evaluating different choices for parameters:

### 1. Control: The default parameters


--------------------

UMAP:

n_neighbors = 15

n_components = 5

--------------------

HDBSCAN:

min_cluster_size = 10

min_samples = 10


In [15]:
## First, run the default model as an example

eval_model('dev')



Constructing cluster models...

Fitting cluster models...

Finding the K-means clusters...


2023-06-02 21:18:16,434 - BERTopic - Reduced dimensionality
2023-06-02 21:18:16,763 - BERTopic - Clustered reduced embeddings


Getting topics for cluster 0...


2023-06-02 21:18:43,909 - BERTopic - Reduced dimensionality
2023-06-02 21:18:44,137 - BERTopic - Clustered reduced embeddings


Getting topics for cluster 1...


2023-06-02 21:19:02,847 - BERTopic - Reduced dimensionality
2023-06-02 21:19:03,063 - BERTopic - Clustered reduced embeddings


Getting topics for cluster 2...


2023-06-02 21:19:34,241 - BERTopic - Reduced dimensionality
2023-06-02 21:19:34,416 - BERTopic - Clustered reduced embeddings


Getting topics for cluster 3...


2023-06-02 21:19:59,681 - BERTopic - Reduced dimensionality
2023-06-02 21:19:59,857 - BERTopic - Clustered reduced embeddings


Getting topics for cluster 4...


2023-06-02 21:20:14,433 - BERTopic - Reduced dimensionality
2023-06-02 21:20:14,542 - BERTopic - Clustered reduced embeddings


Predicting K-means clusters for each document...


2023-06-02 21:20:39,121 - BERTopic - Reduced dimensionality
2023-06-02 21:20:39,125 - BERTopic - Predicted clusters


Predicting topic labels for cluster 0...


2023-06-02 21:20:45,765 - BERTopic - Reduced dimensionality
2023-06-02 21:20:45,772 - BERTopic - Predicted clusters


Predicting topic labels for cluster 1...


2023-06-02 21:20:46,630 - BERTopic - Reduced dimensionality
2023-06-02 21:20:46,638 - BERTopic - Predicted clusters


Predicting topic labels for cluster 2...


2023-06-02 21:20:47,516 - BERTopic - Reduced dimensionality
2023-06-02 21:20:47,522 - BERTopic - Predicted clusters


Predicting topic labels for cluster 3...


2023-06-02 21:20:49,588 - BERTopic - Reduced dimensionality
2023-06-02 21:20:49,597 - BERTopic - Predicted clusters


Predicting topic labels for cluster 4...


2023-06-02 21:20:51,034 - BERTopic - Reduced dimensionality
2023-06-02 21:20:51,041 - BERTopic - Predicted clusters


Getting library outlier data...

Getting test set topic predictions & outlier data...
Library outlier data:

    total_subject_count  outlier_subject_count  outlier_subject_ratio
MP                 6568                   2851               0.434074
AP                 4961                   1988               0.400726
PR                 4663                   1746               0.374437
DG                 3765                   1579               0.419389
DS                 2910                   1083               0.372165
FA                  627                    286               0.456140
CO                  544                    218               0.400735
OC                  534                    180               0.337079
CA                  518                    234               0.451737
SP                  506                    221               0.436759
AG                  484                    226               0.466942
CV                  473                    230     