### Introduction

BERTopic is a topic modeling technique, used to discover latent topics in collections of documents, that leverages pre-trained transformer-based language models to embed documents, clusters these embeddings, and generates topic representations with a class-based TF-IDF procedure. By reducing the embeddings to a lower dimension before clustering, it produces dense clusters, which yields easily interpretable topics while keeping important words in the topic descriptions. It supports many modeling variants: guided, supervised and semi-supervised, manual, multi-topic distributions, hierarchical, and others.

BERTopic addresses a limitation of the LDA model: its bag-of-words (BoW) representation ignores word order and therefore loses semantic relationships among words. In recent years, with the introduction of word embeddings and transformer models, BERT and its variants have shown strong results in generating contextual word and sentence vector representations, in which semantically similar words or sentences lie closer together.

The model works in three steps. First, document embeddings are created with a pre-trained language model (Sentence-BERT, which converts sentences or paragraphs into dense vectors) to capture document-level information. Second, the dimensionality of the embeddings is reduced and the reduced embeddings are clustered into groups of semantically similar documents; each cluster represents one topic. Finally, a class-based TF-IDF extracts the topic representation from each cluster, which avoids a centroid-based perspective.

(Grootendorst, 2022)
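
The class-based TF-IDF step can be illustrated on toy data. This is a minimal sketch, not BERTopic's implementation: it follows the weighting W(t, c) = tf(t, c) * log(1 + A / f(t)) from Grootendorst (2022), where A is the average number of words per class and f(t) is the frequency of term t across all classes. The function name `c_tf_idf`, the toy clusters, and the normalization of term frequency by class length are assumptions made for illustration.

```python
import math
from collections import Counter


def c_tf_idf(class_docs):
    """Score terms per class; class_docs maps a class label to a list of token lists."""
    # Treat all documents of a class as one long document and count its terms
    class_counts = {
        label: Counter(token for doc in docs for token in doc)
        for label, docs in class_docs.items()
    }

    # Frequency of each term across all classes
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)

    # A: average number of words per class
    avg_words = sum(total_counts.values()) / len(class_counts)

    scores = {}
    for label, counts in class_counts.items():
        n_words = sum(counts.values())
        scores[label] = {
            # Term frequency (normalized by class length, an assumption of this sketch)
            # times the class-based inverse frequency log(1 + A / f(t))
            term: (freq / n_words) * math.log(1 + avg_words / total_counts[term])
            for term, freq in counts.items()
        }
    return scores


# Hypothetical toy clusters: two topics with two tokenized documents each
toy_clusters = {
    0: [["drought", "migration", "household"], ["drought", "assets"]],
    1: [["temperature", "migration", "country"], ["temperature", "income"]],
}

scores = c_tf_idf(toy_clusters)
for label, term_scores in scores.items():
    top_terms = sorted(term_scores, key=term_scores.get, reverse=True)[:2]
    print(label, top_terms)
```

In this toy example, "migration" occurs in both clusters and is therefore weighted below the terms that are specific to one cluster, which is the behaviour the class-based weighting is meant to produce.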

### Libraries

In [9]:
import numpy as np
import pandas as pd
import os
import re

from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

### Data and Preprocessing

In [2]:
data = np.load('literature.npy', allow_pickle=True).item()

In [3]:
# Extract papers from dictionary and save in a list
texts = []

for _, sections in data.items():
    full_text = " ".join(sections.values())
    texts.append(full_text)

In [11]:
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

stop_words_a = set(stopwords.words("english"))
stop_words_b = stopwords.words("english")
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package punkt to C:\Users\Songlin
[nltk_data]     Wang\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to C:\Users\Songlin
[nltk_data]     Wang\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to C:\Users\Songlin
[nltk_data]     Wang\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [12]:
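def preprocess_text(text):
    """Lowercase, strip non-letters, drop stop words, and lemmatize a document."""
    text = text.lower()
    # Replace everything except letters and whitespace with a space
    text = re.sub(r"[^a-zA-Z\s]", " ", text)

    tokens = nltk.word_tokenize(text)
    # Remove stop words and single-character tokens
    tokens = [t for t in tokens if t not in stop_words_a and len(t) > 1]
    tokens = [lemmatizer.lemmatize(t) for t in tokens]

    return " ".join(tokens)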

In [13]:
texts_cleaned = [preprocess_text(text) for text in texts]

### BERTopic

In [14]:
vectorizer_model = CountVectorizer(stop_words=stop_words_b)
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)
topic_model = BERTopic(
    hdbscan_model=HDBSCAN(min_cluster_size=2, min_samples=1),  # allow clusters of two documents
    vectorizer_model=vectorizer_model,
    ctfidf_model=ctfidf_model,
    top_n_words=200,
)
# Note: the model is fit on the raw `texts`, not on `texts_cleaned`.
topic, probs = topic_model.fit_transform(texts)



### Results

In [15]:
topic_model.get_topic_info()

| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 2 | -1_niger_spatial_subvillage_scale | [niger, spatial, subvillage, scale, lake, geog... | [Migration-environment models tend to be aspat... |
| 1 | 0 | 6 | 0_drought_assets_mobility_households | [drought, assets, mobility, households, floodi... | [Research on the demographic consequences of e... |
| 2 | 1 | 4 | 1_conditions_burkina_faso_land | [conditions, burkina, faso, land, migrations, ... | [It is widely accepted that environmental chan... |
| 3 | 2 | 4 | 2_benin_dominase_ponkrum_nigeria | [benin, dominase, ponkrum, nigeria, bialaba, m... | [Internal migration in rural Benin is not dire... |
| 4 | 3 | 4 | 3_temperature_countries_poor_middleincome | [temperature, countries, poor, middleincome, e... | [This paper describes the conceptual and pract... |
| 5 | 4 | 3 | 4_events_intentions_climaterelated_transition | [events, intentions, climaterelated, transitio... | [Research has demonstrated that, in a variety ... |
| 6 | 5 | 3 | 5_conflict_natural_canada_temporary | [conflict, natural, canada, temporary, interna... | [There is limited empirical evidence of how en... |
| 7 | 6 | 2 | 6_health_maasai_mental_refugees | [health, maasai, mental, refugees, camp, arush... | [Objectives By 2050, over 250 million people w... |
| 8 | 7 | 2 | 7_hawaweer_palm_oasis_dra | [hawaweer, palm, oasis, dra, um, jawasir, vall... | [The Hawaweer, a nomadic, pastoralist group in... |


In [16]:
topic[0]

4

### Evaluation

BERTopic does not work well with small datasets. Unlike LDA, it is a clustering-based model: it does not produce a distribution over topics for each document but assigns each document to a single cluster, even though the papers within a cluster may be very different from each other. With a small sample, HDBSCAN can easily treat all inputs as noise (topic -1). Even with min_cluster_size=2, a setting that can already lead to unstable results, it generates only a few topics.
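
To put a number on how much of a small corpus ends up as noise or in tiny clusters, the assignments can be summarized directly. The sketch below is illustrative only: `summarize_assignments` is a hypothetical helper, and the label list is rebuilt from the Count column of the topic table in the Results section, so its order does not match the original documents.

```python
from collections import Counter


def summarize_assignments(labels):
    """Summarize cluster labels, where -1 marks documents treated as noise."""
    counts = Counter(labels)
    n_docs = len(labels)
    # Sizes of the real topics, excluding the noise label
    sizes = sorted(size for label, size in counts.items() if label != -1)
    return {
        "n_docs": n_docs,
        "n_topics": len(sizes),
        "noise_share": counts.get(-1, 0) / n_docs if n_docs else 0.0,
        "smallest_topic": sizes[0] if sizes else 0,
    }


# Topic sizes taken from the Count column of the topic table above
topic_sizes = {-1: 2, 0: 6, 1: 4, 2: 4, 3: 4, 4: 3, 5: 3, 6: 2, 7: 2}
labels = [label for label, size in topic_sizes.items() for _ in range(size)]

summary = summarize_assignments(labels)
print(summary)
```

With the actual model, the same helper could be applied to the `topic` list returned by `fit_transform`.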

One possible remedy is to split each paper into chunks, but I can't tell whether it works better.
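
A minimal sketch of the chunking idea, under stated assumptions: papers are split into overlapping word windows, the chunks would be passed to the model instead of whole papers, and each paper's topic is then taken as the most frequent non-noise topic among its chunks. The helper names, the window size, and the overlap are hypothetical choices, and nothing here shows that chunking improves the results.

```python
from collections import Counter


def split_into_chunks(text, chunk_size=200, overlap=50):
    """Split a text into overlapping chunks of at most chunk_size words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks


def majority_topic(chunk_topics):
    """Return the most frequent non-noise topic of a paper's chunks (-1 if all are noise)."""
    non_noise = [t for t in chunk_topics if t != -1]
    if not non_noise:
        return -1
    return Counter(non_noise).most_common(1)[0][0]


# Toy example: a ten-word "paper" split into windows of four words with overlap two
toy_text = " ".join(f"w{i}" for i in range(10))
toy_chunks = split_into_chunks(toy_text, chunk_size=4, overlap=2)
print(toy_chunks)

# Hypothetical chunk-level topics for one paper, mapped back to a single label
print(majority_topic([-1, 3, 3, 1]))
```

Chunking increases the number of inputs HDBSCAN sees, but chunks from the same paper are not independent, so clusters may end up reflecting individual papers rather than shared topics.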

Similar to LDA, BERTopic can automate part of the text classification, but it cannot assign externally defined categorization codes, especially when the pre-defined class names do not appear in the sample. We therefore have to annotate the papers manually, based on the topics we obtain, the top words of those topics, and the topic matched to each paper. Even so, BERTopic lowers the workload compared with fully manual annotation. Here we first take paper 0 for evaluation. Its assigned topic is 4 (see `topic[0]` above).

In [18]:
topic_model.get_topic(4)

[('events', np.float64(0.3759535205167446)),
 ('intentions', np.float64(0.3624419646784941)),
 ('climaterelated', np.float64(0.36226602688115533)),
 ('transition', np.float64(0.3537240861015784)),
 ('et', np.float64(0.3485533485088044)),
 ('zone', np.float64(0.3479549382825513)),
 ('al', np.float64(0.34699678687978236)),
 ('stressors', np.float64(0.3335409466428436)),
 ('environmental', np.float64(0.33105585407618143)),
 ('heads', np.float64(0.32888070118014867)),
 ('household', np.float64(0.3240510676965316)),
 ('migration', np.float64(0.31638918267487387)),
 ('individuals', np.float64(0.30370868774295645)),
 ('migrate', np.float64(0.29010217420906914)),
 ('climate', np.float64(0.28670093167535426)),
 ('stressor', np.float64(0.2778614793931438)),
 ('respondents', np.float64(0.2767126363528006)),
 ('forestsavannah', np.float64(0.27195375456334514)),
 ('ghana', np.float64(0.27178068652131204)),
 ('trends', np.float64(0.26578312011811467)),
 ('people', np.float64(0.26519348922815045)),
 

In [19]:
# Codes assigned to paper 0 based on its BERTopic topic (1 = category applies, 0 = it does not)
data = [[1, 1, 1, 1, \
        1, 1, 1, 0, 0, \
        0, 0, 1, 0, \
        1, 1, 0, 1, \
        0, 0, 1, \
        0, 1, 0, 0, 0, 1, \
        0, 0, 1, 0, 0, 0, 0, \
        1, 0, 0]]
result = pd.DataFrame(data, columns=['Qualitative method', 'Quantitative method', 'Socio-demo-economic data', 'Environmental data', \
                       'Individuals', 'Households', 'Subnational groups', 'National groups', 'International groups', \
                       'Urban', 'Rural', 'Time frame considered', 'Foresight', \
                       'Rainfall pattern / Variability', 'Temperature change', 'Food scarcity / Famine / Food security', 'Drought / Aridity / Desertification', \
                       'Floods', 'Erosion / Soil fertility / Land degradation / Deforestation / Salinisation', 'Self assessment / Perceived environment', \
                       'Labour migration', 'Marriage migration', 'Refugees', 'International migration', 'Cross-border migration', 'Internal migration', \
                       'Rural to urban', 'Rural to rural', 'Circular / Seasonal', 'Long distance', 'Short distance', 'Temporal', 'Permanent', \
                       'Age', 'Gender', 'Ethnicity / Religion']).astype(str)

In [20]:
manual_result = pd.read_excel('manual.xlsx').astype(str)

In [21]:
manual_result = manual_result.iloc[[3]].drop(columns=['ID', 'AUTHOR', 'TITLE']).reset_index(drop=True)
manual_result

| | Qualitative method | Quantitative method | Socio-demo-economic data | Environmental data | Individuals | Households | Subnational groups | National groups | International groups | Urban | ... | Rural to urban | Rural to rural | Circular / Seasonal | Long distance | Short distance | Temporal | Permanent | Age | Gender | Ethnicity / Religion |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |


In [22]:
bool_result = (result.iloc[0] == manual_result.iloc[0])
bool_result.mean(axis=0)

np.float64(0.6944444444444444)
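
The overall agreement rate hides which categories disagree. The sketch below lists the mismatched categories alongside the rate. It is an illustration only: `compare_codes` is a hypothetical helper, and the example uses just the first ten categories, whose manual values are visible in the truncated output above, so it does not reproduce the full 36-category score.

```python
def compare_codes(predicted, manual):
    """Return the share of matching codes and the names of the mismatched categories."""
    if predicted.keys() != manual.keys():
        raise ValueError("both annotations must cover the same categories")
    mismatches = [name for name in predicted if predicted[name] != manual[name]]
    agreement = (len(predicted) - len(mismatches)) / len(predicted)
    return agreement, mismatches


# First ten categories only (the remaining manual values are truncated in the output above)
categories = [
    "Qualitative method", "Quantitative method", "Socio-demo-economic data",
    "Environmental data", "Individuals", "Households", "Subnational groups",
    "National groups", "International groups", "Urban",
]
topic_based = dict(zip(categories, [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]))
manual_codes = dict(zip(categories, [0, 1, 1, 1, 0, 1, 0, 0, 0, 0]))

agreement, mismatches = compare_codes(topic_based, manual_codes)
print(f"agreement: {agreement:.2f}")
print("mismatched categories:", mismatches)
```

Because most codes are 0 in both annotations, simple agreement is inflated by shared zeros; inspecting the mismatched categories gives a clearer picture of where the topic-based coding diverges from the manual one.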