


# **Scientific Publication Analysis with BERTopic**

BERTopic is a topic modeling technique that leverages transformers and a custom class-based TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions. (https://maartengr.github.io/BERTopic/index.html)

## Information about the algorithm
website: https://maartengr.github.io/BERTopic/algorithm/algorithm.html

paper: https://arxiv.org/pdf/2203.05794.pdf  

<img src="https://maartengr.github.io/BERTopic/img/algorithm.png" width="50%">
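
To make the class-based TF-IDF step concrete, here is a minimal pure-Python sketch of the weighting described in the paper linked above: the documents of each cluster are concatenated into one class document, and a term t in class c is weighted as tf(t, c) · log(1 + A / f(t)), where A is the average number of words per class and f(t) is the frequency of t across all classes. The toy documents, the whitespace tokenizer, and the function name `c_tf_idf` are assumptions made for illustration; BERTopic's own implementation works on sparse matrices and may differ in details such as normalization.

```python
import math
from collections import Counter

def c_tf_idf(docs, topics):
    """Return {topic: {term: weight}} using the c-TF-IDF weighting sketched above."""
    # Concatenate all documents of a topic into one "class document" (term counts per class)
    class_counts = {}
    for doc, topic in zip(docs, topics):
        class_counts.setdefault(topic, Counter()).update(doc.lower().split())

    # f(t): frequency of each term across all classes
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)

    # A: average number of words per class
    avg_words = sum(total_counts.values()) / len(class_counts)

    return {
        topic: {term: tf * math.log(1 + avg_words / total_counts[term])
                for term, tf in counts.items()}
        for topic, counts in class_counts.items()
    }

# Toy corpus (made up): two clusters that share the word "drone"
docs = ["drone delivery drone", "drone market", "bee colony", "bee queen drone"]
topics = [0, 0, 1, 1]
weights = c_tf_idf(docs, topics)
for topic, terms in weights.items():
    top_terms = sorted(terms, key=terms.get, reverse=True)[:2]
    print(topic, top_terms)
```

A word that is frequent inside one cluster but also common everywhere else is weighted down by the logarithmic factor, which is why distinctive words end up describing each topic.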


# Enabling the GPU

First, you'll need to enable GPUs for the notebook:

*   Navigate to Edit→Notebook Settings
*   Select GPU from the Hardware Accelerator drop-down

[Reference](https://colab.research.google.com/notebooks/gpu.ipynb)

# **Installing BERTopic**

We start by installing BERTopic from PyPI:

In [None]:
%%capture
!pip install bertopic
!pip install joblib==1.1.0

Installing BERTopic updates some packages that were already loaded, so the notebook must be restarted before they can be used correctly. **From the Menu: Runtime → Restart Runtime**

There is an issue in the library update (not solved at the time of the analysis). See https://github.com/scikit-learn-contrib/hdbscan/issues/565
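
After the restart, it can help to confirm which versions are actually installed before going further. The following is a small sketch using only the standard library; the helper name `installed_versions` and the list of packages to check are assumptions, not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Map each package name to its installed version, or None if it is not installed."""
    report = {}
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = None
    return report

# Packages relevant to this notebook; the list is an assumption and can be extended
for name, ver in installed_versions(["bertopic", "joblib", "hdbscan", "umap-learn"]).items():
    print(f"{name}: {ver or 'not installed'}")
```

If `joblib` does not report the pinned version, the install cell above probably did not run or the runtime was reset.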

# **Import Data**
Import the dataset for topic modelling. In this case we analyse scientific publications about attrition. The dataset includes the Title, Abstract, Author Keywords, Year, Number of Citations, and Authors of papers available on Scopus. The papers were checked manually to keep only those in scope for the analysis.

In [None]:
pip install google.colab

Collecting jedi>=0.16 (from ipython==7.34.0->google.colab)
  Downloading jedi-0.19.2-py2.py3-none-any.whl.metadata (22 kB)
Downloading jedi-0.19.2-py2.py3-none-any.whl (1.6 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 44.2 MB/s eta 0:00:00
Installing collected packages: jedi
Successfully installed jedi-0.19.2


In [None]:
# Connect Google Drive (GDrive) with Colab
from google.colab import drive
drive.mount("/content/gdrive")

Mounted at /content/gdrive


In [None]:
# Mount Google Drive at /content/drive, the path used by the file reads below
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd

# Read the CSV file
file_path = '/content/drive/MyDrive/Truong Nguyen/input/scopus.csv'
df = pd.read_csv(file_path)

# Display the data
print(df.head())

                                             Authors  \
0                Ramirez F.; Mari W.; Martingayle D.   
1                                         Ngulube P.   
2  Yiu C.; Liu Y.; Park W.; Li J.; Huang X.; Yao ...   
3                   Goodman N.P.; Lehto O.; Novak M.   
4    Ghysels S.; De Baets B.; Reheul D.; Maenhout S.   

                                   Author full names  \
0  Ramirez, Fanny (57204107984); Mari, William (5...   
1                      Ngulube, Patrick (8884474300)   
2  Yiu, Chunki (57215813633); Liu, Yiming (572107...   
3  Goodman, Nathan P. (57196424399); Lehto, Otto ...   
4  Ghysels, Sarah (59711189100); De Baets, Bernar...   

                                        Author(s) ID  \
0              57204107984; 57190610149; 58302990800   
1                                         8884474300   
2  57215813633; 57210733719; 57339547900; 5733917...   
3              57196424399; 57202013323; 57190026901   
4  59711189100; 55664779600; 6602168060; 22734

In [None]:
# Import the file from Google Drive
import pandas as pd
attrition_paper = pd.read_csv('/content/drive/MyDrive/Truong Nguyen/input/scopus.csv')

Let's take a look at the dataset and at the text fields that we will use for topic modelling.

In [None]:
attrition_paper.head()

Unnamed: 0,Authors,Author full names,Author(s) ID,Title,Year,Source title,Volume,Issue,Art. No.,Page start,...,DOI,Link,Abstract,Author Keywords,Index Keywords,Document Type,Publication Stage,Open Access,Source,EID
0,Ramirez F.; Mari W.; Martingayle D.,"Ramirez, Fanny (57204107984); Mari, William (5...",57204107984; 57190610149; 58302990800,Let Us Fly Our Drones: An Examination of Stude...,2025,Journalism Practice,19,4.0,,923.0,...,10.1080/17512786.2023.2218332,https://www.scopus.com/inward/record.uri?eid=2...,This study uses diffusion of innovations theor...,college newspapers; content analysis; Drone jo...,,Article,Final,,Scopus,2-s2.0-105001833789
1,Ngulube P.,"Ngulube, Patrick (8884474300)",8884474300,Leveraging information and communication techn...,2025,Discover Environment,3,1.0,9,,...,10.1007/s44274-025-00190-1,https://www.scopus.com/inward/record.uri?eid=2...,"Using a qualitative approach, this study exami...",Digital innovations; Environmental preservatio...,,Review,Final,All Open Access; Gold Open Access,Scopus,2-s2.0-85217821172
2,Yiu C.; Liu Y.; Park W.; Li J.; Huang X.; Yao ...,"Yiu, Chunki (57215813633); Liu, Yiming (572107...",57215813633; 57210733719; 57339547900; 5733917...,Skin-interfaced multimodal sensing and tactile...,2025,Science Advances,11,13.0,eadt6041,,...,10.1126/sciadv.adt6041,https://www.scopus.com/inward/record.uri?eid=2...,Unmanned aerial vehicles have undergone substa...,,Aircraft; Equipment Design; Feedback; Humans; ...,Article,Final,,Scopus,2-s2.0-105001336860
3,Goodman N.P.; Lehto O.; Novak M.,"Goodman, Nathan P. (57196424399); Lehto, Otto ...",57196424399; 57202013323; 57190026901,Institutional diversity and innovative recombi...,2025,European Economic Review,174,,104998,,...,10.1016/j.euroecorev.2025.104998,https://www.scopus.com/inward/record.uri?eid=2...,"In Explaining Technology, Koppl et al. (2023) ...",Competition policy; Drones; Governing knowledg...,empirical analysis; innovation; intellectual p...,Article,Final,,Scopus,2-s2.0-85218500779
4,Ghysels S.; De Baets B.; Reheul D.; Maenhout S.,"Ghysels, Sarah (59711189100); De Baets, Bernar...",59711189100; 55664779600; 6602168060; 22734778400,Image-based yield prediction for tall fescue u...,2025,Frontiers in Plant Science,16,,1549099,,...,10.3389/fpls.2025.1549099,https://www.scopus.com/inward/record.uri?eid=2...,"In the early stages of selection, many plant b...",convolutional neural network; dry matter yield...,,Article,Final,All Open Access; Gold Open Access,Scopus,2-s2.0-105001130182


# **Text Preprocessing**

1. Merge text from Title, Author Keywords, and Abstracts.

In [None]:
attrition_paper[['Title', 'Abstract', 'Author Keywords']] = attrition_paper[['Title', 'Abstract', 'Author Keywords']].fillna('')
attrition_paper['Cited by'] = attrition_paper['Cited by'].fillna(0)
attrition_paper['text'] = attrition_paper['Title'] + ' ' + attrition_paper['Abstract'] + ' ' + attrition_paper['Author Keywords']

In [None]:
attrition_paper.head()

Unnamed: 0,Authors,Author full names,Author(s) ID,Title,Year,Source title,Volume,Issue,Art. No.,Page start,...,Link,Abstract,Author Keywords,Index Keywords,Document Type,Publication Stage,Open Access,Source,EID,text
0,Ramirez F.; Mari W.; Martingayle D.,"Ramirez, Fanny (57204107984); Mari, William (5...",57204107984; 57190610149; 58302990800,Let Us Fly Our Drones: An Examination of Stude...,2025,Journalism Practice,19,4.0,,923.0,...,https://www.scopus.com/inward/record.uri?eid=2...,This study uses diffusion of innovations theor...,college newspapers; content analysis; Drone jo...,,Article,Final,,Scopus,2-s2.0-105001833789,Let Us Fly Our Drones: An Examination of Stude...
1,Ngulube P.,"Ngulube, Patrick (8884474300)",8884474300,Leveraging information and communication techn...,2025,Discover Environment,3,1.0,9,,...,https://www.scopus.com/inward/record.uri?eid=2...,"Using a qualitative approach, this study exami...",Digital innovations; Environmental preservatio...,,Review,Final,All Open Access; Gold Open Access,Scopus,2-s2.0-85217821172,Leveraging information and communication techn...
2,Yiu C.; Liu Y.; Park W.; Li J.; Huang X.; Yao ...,"Yiu, Chunki (57215813633); Liu, Yiming (572107...",57215813633; 57210733719; 57339547900; 5733917...,Skin-interfaced multimodal sensing and tactile...,2025,Science Advances,11,13.0,eadt6041,,...,https://www.scopus.com/inward/record.uri?eid=2...,Unmanned aerial vehicles have undergone substa...,,Aircraft; Equipment Design; Feedback; Humans; ...,Article,Final,,Scopus,2-s2.0-105001336860,Skin-interfaced multimodal sensing and tactile...
3,Goodman N.P.; Lehto O.; Novak M.,"Goodman, Nathan P. (57196424399); Lehto, Otto ...",57196424399; 57202013323; 57190026901,Institutional diversity and innovative recombi...,2025,European Economic Review,174,,104998,,...,https://www.scopus.com/inward/record.uri?eid=2...,"In Explaining Technology, Koppl et al. (2023) ...",Competition policy; Drones; Governing knowledg...,empirical analysis; innovation; intellectual p...,Article,Final,,Scopus,2-s2.0-85218500779,Institutional diversity and innovative recombi...
4,Ghysels S.; De Baets B.; Reheul D.; Maenhout S.,"Ghysels, Sarah (59711189100); De Baets, Bernar...",59711189100; 55664779600; 6602168060; 22734778400,Image-based yield prediction for tall fescue u...,2025,Frontiers in Plant Science,16,,1549099,,...,https://www.scopus.com/inward/record.uri?eid=2...,"In the early stages of selection, many plant b...",convolutional neural network; dry matter yield...,,Article,Final,All Open Access; Gold Open Access,Scopus,2-s2.0-105001130182,Image-based yield prediction for tall fescue u...


2. Lemmatize and clean text removing stopwords, scientific literature blacklist, and domain blacklist.

In [None]:
# Import blacklist
# The literature blacklist includes a list of common terms from scientific literature
literature_blacklist = pd.read_csv('/content/drive/MyDrive/Truong Nguyen/dictionaries/literature_blacklist.csv')
# The Domain Blacklist includes the terms used in the keywords and the most frequent and the most rare terms from the abstracts in the dataset
domain_blacklist = pd.read_csv('/content/drive/MyDrive/Truong Nguyen/dictionaries/domain_blacklist.csv')
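
How `domain_blacklist.csv` was actually built is not shown in this notebook. As one possible approach, the sketch below collects the most widespread and the rarest terms by document frequency. The function name `frequency_blacklist`, the thresholds `top_n` and `min_docs`, and the toy abstracts are all assumptions for illustration.

```python
from collections import Counter

def frequency_blacklist(texts, top_n=1, min_docs=2):
    """Return terms that are among the top_n most widespread or occur in fewer than min_docs texts."""
    # Document frequency: in how many texts each term occurs at least once
    doc_freq = Counter()
    for text in texts:
        doc_freq.update(set(text.lower().split()))

    most_frequent = {term for term, _ in doc_freq.most_common(top_n)}
    rare = {term for term, count in doc_freq.items() if count < min_docs}
    return most_frequent | rare

# Toy abstracts (made up for illustration)
abstracts = [
    "drone delivery system",
    "drone sensor system",
    "drone delivery market",
    "drone sensor",
]
print(sorted(frequency_blacklist(abstracts, top_n=1, min_docs=2)))
```

Terms that occur in nearly every paper (such as the search keyword itself) or in only one paper carry little information for separating topics, which is the usual motivation for this kind of list.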

In [None]:
# Merge blacklist
blacklist = pd.concat([literature_blacklist, domain_blacklist])
len(blacklist)

6334

In [None]:
# Remove duplicates (if any)
blacklist.drop_duplicates(subset='value', inplace=True)
len(blacklist)

5123

In [None]:
# Convert the blacklist to a list with .tolist()
blacklist_list = blacklist["value"].tolist()

In [None]:
# Configure cleaning operations
config = {
    'remove_punct' : True,
    'remove_num' : True,
    'remove_stopwords' : True,
    'lemmatize' : True,
    'remove_blacklist' : blacklist_list
}

In [None]:
# Define preprocessing function
import spacy

nlp = spacy.load('en_core_web_sm') # load language model

# A set makes each blacklist lookup constant-time instead of scanning a list of thousands of terms
blacklist_set = set(config['remove_blacklist'])

def preprocess_txt(text):
    text = text.lower() # convert to lower case
    doc = nlp(text) # apply language model
    if config['remove_punct']:
        doc = [token for token in doc if not token.is_punct]
    if config['remove_num']:
        doc = [token for token in doc if not token.is_digit]
    if config['remove_stopwords']:
        # drop stopwords and blacklisted surface forms
        doc = [token for token in doc if not token.is_stop and token.text not in blacklist_set]
    # .lemma_ and .text are strings; fall back to .text so the steps below always receive strings
    doc = [token.lemma_ if config['lemmatize'] else token.text for token in doc]
    if blacklist_set:
        # drop lemmas that are blacklisted
        doc = [word for word in doc if word not in blacklist_set]

    return ' '.join(doc).strip()

In [None]:
# Apply the preprocessing function to the text [NOTE: this step takes a long time]
attrition_paper['text_preprocessed'] = attrition_paper['text'].apply(preprocess_txt)

In [None]:
attrition_paper.head()

Unnamed: 0,Authors,Author full names,Author(s) ID,Title,Year,Source title,Volume,Issue,Art. No.,Page start,...,Abstract,Author Keywords,Index Keywords,Document Type,Publication Stage,Open Access,Source,EID,text,text_preprocessed
0,Ramirez F.; Mari W.; Martingayle D.,"Ramirez, Fanny (57204107984); Mari, William (5...",57204107984; 57190610149; 58302990800,Let Us Fly Our Drones: An Examination of Stude...,2025,Journalism Practice,19,4.0,,923.0,...,This study uses diffusion of innovations theor...,college newspapers; content analysis; Drone jo...,,Article,Final,,Scopus,2-s2.0-105001833789,Let Us Fly Our Drones: An Examination of Stude...,let fly drone examination student newspaper co...
1,Ngulube P.,"Ngulube, Patrick (8884474300)",8884474300,Leveraging information and communication techn...,2025,Discover Environment,3,1.0,9,,...,"Using a qualitative approach, this study exami...",Digital innovations; Environmental preservatio...,,Review,Final,All Open Access; Gold Open Access,Scopus,2-s2.0-85217821172,Leveraging information and communication techn...,leverage information communication technology ...
2,Yiu C.; Liu Y.; Park W.; Li J.; Huang X.; Yao ...,"Yiu, Chunki (57215813633); Liu, Yiming (572107...",57215813633; 57210733719; 57339547900; 5733917...,Skin-interfaced multimodal sensing and tactile...,2025,Science Advances,11,13.0,eadt6041,,...,Unmanned aerial vehicles have undergone substa...,,Aircraft; Equipment Design; Feedback; Humans; ...,Article,Final,,Scopus,2-s2.0-105001336860,Skin-interfaced multimodal sensing and tactile...,interface multimodal sensing tactile feedback ...
3,Goodman N.P.; Lehto O.; Novak M.,"Goodman, Nathan P. (57196424399); Lehto, Otto ...",57196424399; 57202013323; 57190026901,Institutional diversity and innovative recombi...,2025,European Economic Review,174,,104998,,...,"In Explaining Technology, Koppl et al. (2023) ...",Competition policy; Drones; Governing knowledg...,empirical analysis; innovation; intellectual p...,Article,Final,,Scopus,2-s2.0-85218500779,Institutional diversity and innovative recombi...,institutional diversity innovative recombinati...
4,Ghysels S.; De Baets B.; Reheul D.; Maenhout S.,"Ghysels, Sarah (59711189100); De Baets, Bernar...",59711189100; 55664779600; 6602168060; 22734778400,Image-based yield prediction for tall fescue u...,2025,Frontiers in Plant Science,16,,1549099,,...,"In the early stages of selection, many plant b...",convolutional neural network; dry matter yield...,,Article,Final,All Open Access; Gold Open Access,Scopus,2-s2.0-105001130182,Image-based yield prediction for tall fescue u...,image base yield prediction tall fescue random...


In [None]:
# Save results
attrition_paper.to_csv(r'/content/drive/MyDrive/Truong Nguyen/wip/BERTopic_cleaned.csv', index = False, header=True)

OSError: [Errno 107] Transport endpoint is not connected: '/content/drive/MyDrive/Truong Nguyen/wip'

# **Application of the BERTopic model**

Let's apply BERTopic using the techniques for improving topic representation (in particular, eliminating stopwords when naming the clusters). We customize UMAP only to set the random state, which ensures reproducibility.

Then we visualize the topic hierarchy, to see how the clusters are structured, and the UMAP projection, to assess the clustering. The latter shows the distribution of the embeddings in a two-dimensional space, where each paper is a point colored by the cluster it belongs to. Hovering over a point in the graph shows the title of the paper.

We will apply this pipeline both to raw and clean text.
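
One common way to keep stopwords out of the topic names, described in the BERTopic documentation, is to pass a `CountVectorizer` with a stopword list as `vectorizer_model`. This affects only the words chosen to represent each topic, not the embeddings or the clustering. The sketch below is an assumption about how this could look for this notebook, not necessarily the configuration that produced the outputs shown here; the function name `build_topic_model` is hypothetical.

```python
def build_topic_model(random_state=567):
    """Build a BERTopic model whose topic words exclude English stopwords."""
    # Heavy imports are kept inside the function so that defining it stays cheap
    from bertopic import BERTopic
    from sklearn.feature_extraction.text import CountVectorizer
    from umap import UMAP

    # Stopwords are removed only when topic words are extracted, after clustering;
    # ngram_range is set here because the vectorizer now controls the n-grams
    vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 2))
    umap_model = UMAP(random_state=random_state)
    return BERTopic(
        language="english",
        calculate_probabilities=True,
        verbose=True,
        umap_model=umap_model,
        vectorizer_model=vectorizer_model,
    )

# Usage (requires bertopic, scikit-learn and umap-learn to be installed):
# topic_model = build_topic_model()
# topics, probs = topic_model.fit_transform(attrition_paper.text)
```

With a model built this way, the raw-text topics would be expected to be named after content words rather than words such as "the" or "of", while the embeddings still see the full, unmodified sentences.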

In [None]:
# Connect Google Drive (GDrive) with Colab [not needed if it is the same run]
from google.colab import drive
drive.mount("/content/gdrive")

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [None]:
# Import pre-processed data [not needed if it is the same run]
import pandas as pd
attrition_paper = pd.read_csv(r'/content/drive/MyDrive/Truong Nguyen/wip/BERTopic_cleaned.csv')

Try changing the **initial condition** (`random_state` in UMAP) and check how much the results depend on it.
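
To quantify how much two runs differ, one option is to compare their topic assignments pair by pair. Topic numbers are arbitrary, so the comparison should ask whether two papers end up in the same cluster, not whether they receive the same number. The sketch below computes the Rand index for this purpose; the function name and the toy assignments are hypothetical and stand in for the `topics` lists returned by two `fit_transform` calls with different `random_state` values.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of document pairs on which two topic assignments agree (1.0 = identical grouping)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("Both assignments must cover the same documents")
    agree = total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]  # pair shares a topic in run A
        same_b = labels_b[i] == labels_b[j]  # pair shares a topic in run B
        agree += same_a == same_b
        total += 1
    return agree / total if total else 1.0

# Hypothetical topic assignments from runs with different random_state values
run_seed_567 = [0, 0, 1, 1, 2]
run_seed_42 = [1, 1, 0, 0, 2]  # same grouping, different topic numbers
run_seed_7 = [0, 1, 0, 1, 2]   # a different grouping
print(rand_index(run_seed_567, run_seed_42))
print(rand_index(run_seed_567, run_seed_7))
```

The loop visits every pair of documents, which is fine for a few thousand papers. For larger collections, or to correct for chance agreement, `sklearn.metrics.adjusted_rand_score` is a ready-made alternative.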

In [None]:
# Set models
from scipy.cluster import hierarchy as sch
from bertopic import BERTopic
from umap import UMAP

# Set UMAP model
umap_model_new = UMAP(random_state=567)

# Set BERTopic model
topic_model_new = BERTopic(language="english", calculate_probabilities=True, verbose=True, n_gram_range =(1,2), umap_model=umap_model_new)

**Application to raw text**

In [None]:
# Apply model to raw text
topics_new_raw, probs_new_raw = topic_model_new.fit_transform(attrition_paper.text)
len(topic_model_new.get_topic_info())

2025-04-17 13:31:19,961 - BERTopic - Embedding - Transforming documents to embeddings.


2025-04-17 13:31:37,209 - BERTopic - Embedding - Completed ✓
2025-04-17 13:31:37,209 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-04-17 13:31:49,320 - BERTopic - Dimensionality - Completed ✓
2025-04-17 13:31:49,321 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-04-17 13:31:49,432 - BERTopic - Cluster - Completed ✓
2025-04-17 13:31:49,443 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-04-17 13:31:51,648 - BERTopic - Representation - Completed ✓


3

In [None]:
freq_new_raw = topic_model_new.get_topic_info()
freq_new_raw

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,0,1897,0_the_and_of_to,"[the, and, of, to, in, for, is, this, with, on]",[Technology disruption for development and pea...
1,1,39,1_of_the_in_and,"[of, the, in, and, sperm, to, honey, apis, dro...",[Flow cytometry evidence about sperm competiti...
2,2,19,2_space_and_for_of,"[space, and, for, of, the, using, orbital, of ...",[Proceedings of the International Astronautica...


In [None]:
df_new_raw = pd.DataFrame({'Topic': topics_new_raw, 'scopus_id': attrition_paper.EID, 'year':attrition_paper.Year})
df_new_raw.head()

Unnamed: 0,Topic,scopus_id,year
0,0,2-s2.0-105001833789,2025
1,0,2-s2.0-85217821172,2025
2,0,2-s2.0-105001336860,2025
3,0,2-s2.0-85218500779,2025
4,0,2-s2.0-105001130182,2025



In [None]:
# Save results
freq_new_raw.to_csv(r'/content/drive/MyDrive/Truong Nguyen/output/BERTopics_new_raw_topic_freq2.csv', index=False, header=True)
df_new_raw.to_csv(r'/content/drive/MyDrive/Truong Nguyen/output/BERTopics_new_raw_paper2.csv', index=False, header=True)

In [None]:
from google.colab import drive
drive.flush_and_unmount()

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Hierarchical topics
hierarchical_topics_new_raw = topic_model_new.hierarchical_topics(attrition_paper.text)

100%|██████████| 2/2 [00:00<00:00, 34.08it/s]


In [None]:
# Visualize hierarchical topics in a tree
tree_new_raw = topic_model_new.get_topic_tree(hierarchical_topics_new_raw)
print(tree_new_raw)

# (copy and paste in a txt to save the result)

.
├─■──space_and_for_of_the ── Topic: 2
└─the_and_of_to_in
     ├─■──the_and_of_to_in ── Topic: 0
     └─■──of_the_in_and_sperm ── Topic: 1



In [None]:
# Results from UMAP model

from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
from umap import UMAP

# Set UMAP model [note: ONLY random_state ensures replication]
umap_model_new = UMAP(random_state=567)

# Prepare embeddings
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings_raw = sentence_model.encode(attrition_paper.text, show_progress_bar=False)

# Train BERTopic
topic_model_new2_raw = BERTopic(language="english", calculate_probabilities=True, verbose=True, n_gram_range =(1,2), umap_model=umap_model_new).fit(attrition_paper.text, embeddings_raw)

# Run the visualization with the original embeddings
topic_model_new2_raw.visualize_documents(attrition_paper.text, embeddings=embeddings_raw)

# Reduce dimensionality of embeddings, this step is optional but much faster to perform iteratively:
reduced_embeddings_raw = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings_raw)


2025-04-17 13:37:20,084 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-04-17 13:37:25,520 - BERTopic - Dimensionality - Completed ✓
2025-04-17 13:37:25,521 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-04-17 13:37:25,621 - BERTopic - Cluster - Completed ✓
2025-04-17 13:37:25,625 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-04-17 13:37:27,063 - BERTopic - Representation - Completed ✓


In [None]:
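# Set only numbers as labels (for a better visualization).
# Labels are keyed by the fitted model's own topic ids, so they stay aligned
# with the topics whether or not an outlier topic (-1) is present.
topic_labels = {topic: str(topic) for topic in topic_model_new2_raw.get_topics()}

topic_model_new2_raw.set_topic_labels(topic_labels)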

In [None]:
# Visualize plot
fig_UMAP_new_raw = topic_model_new2_raw.visualize_documents(attrition_paper.Title, reduced_embeddings=reduced_embeddings_raw, hide_annotations = False, custom_labels= True, width = 800, height = 500)
fig_UMAP_new_raw

In [None]:
# Save the figure as HTML to keep the interactive version
fig_UMAP_new_raw.write_html("/content/drive/MyDrive/Truong Nguyen/output/BERTopics_new_raw_UMAP2.html", default_width = 1200, default_height = 1200)

**Application to clean text**

In [None]:
# Apply model to clean text
topics_new_clean, probs_new_clean = topic_model_new.fit_transform(attrition_paper.text_preprocessed)
len(topic_model_new.get_topic_info())

2025-04-17 13:39:15,698 - BERTopic - Embedding - Transforming documents to embeddings.


2025-04-17 13:39:22,692 - BERTopic - Embedding - Completed ✓
2025-04-17 13:39:22,696 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-04-17 13:39:33,016 - BERTopic - Dimensionality - Completed ✓
2025-04-17 13:39:33,024 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-04-17 13:39:33,386 - BERTopic - Cluster - Completed ✓
2025-04-17 13:39:33,399 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-04-17 13:39:36,066 - BERTopic - Representation - Completed ✓


3

In [None]:
freq_new_clean = topic_model_new.get_topic_info()

In [None]:
freq_new_clean

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,0,1895,0_drone_system_technology_base,"[drone, system, technology, base, market, uav,...",[application drone crisis mobile application c...
1,1,37,1_sperm_bee_male_colony,"[sperm, bee, male, colony, honey, drone, queen...",[patriline composition change honey bee queen ...
2,2,23,2_space_spacecraft_orbital_electric,"[space, spacecraft, orbital, electric, commerc...",[international astronautical iac contain paper...


In [None]:
df_new_clean = pd.DataFrame({'Topic': topics_new_clean, 'scopus_id': attrition_paper.EID, 'year':attrition_paper.Year})
df_new_clean.head()

Unnamed: 0,Topic,scopus_id,year
0,0,2-s2.0-105001833789,2025
1,0,2-s2.0-85217821172,2025
2,0,2-s2.0-105001336860,2025
3,0,2-s2.0-85218500779,2025
4,0,2-s2.0-105001130182,2025


In [None]:
# Save results
freq_new_clean.to_csv(r'/content/drive/MyDrive/Truong Nguyen/output/BERTopics_new_clean_topic_freq2(2).csv', index=False, header=True)
df_new_clean.to_csv(r'/content/drive/MyDrive/Truong Nguyen/output/BERTopics_new_clean_paper2(2).csv', index=False, header=True)

In [None]:
# Hierarchical topics
hierarchical_topics_new_clean = topic_model_new.hierarchical_topics(attrition_paper.text_preprocessed)

100%|██████████| 2/2 [00:00<00:00, 36.59it/s]


In [None]:
# Visualize hierarchical topics in a tree
tree_new_clean = topic_model_new.get_topic_tree(hierarchical_topics_new_clean)
print(tree_new_clean)

# (copy and paste in a txt to save the result)

.
├─■──space_spacecraft_orbital_electric_commercial ── Topic: 2
└─drone_system_technology_base_market
     ├─■──drone_system_technology_base_market ── Topic: 0
     └─■──sperm_bee_male_colony_honey ── Topic: 1



In [None]:
# Visualize results from UMAP model

from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
from umap import UMAP
from bertopic.representation import MaximalMarginalRelevance


# Set UMAP model [note: ONLY random_state ensures replication]
umap_model_new = UMAP(random_state=567)

# Prepare embeddings
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings_clean = sentence_model.encode(attrition_paper.text_preprocessed, show_progress_bar=False)


representation_model = MaximalMarginalRelevance(diversity=0.2)

# Train BERTopic

topic_model_new2_clean = BERTopic(
    language="english",
    calculate_probabilities=True,
    verbose=True,
    n_gram_range=(1, 2),
    umap_model=umap_model_new,
    representation_model=representation_model
).fit(attrition_paper.text_preprocessed, embeddings_clean)

# Run the visualization with the original embeddings
topic_model_new2_clean.visualize_documents(attrition_paper.text_preprocessed, embeddings=embeddings_clean)

# Reduce dimensionality of embeddings, this step is optional but much faster to perform iteratively:
reduced_embeddings_clean = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings_clean)
fig_UMAP_new_clean = topic_model_new2_clean.visualize_documents(attrition_paper.Title, reduced_embeddings=reduced_embeddings_clean)

2025-04-17 13:41:14,937 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-04-17 13:41:21,674 - BERTopic - Dimensionality - Completed ✓
2025-04-17 13:41:21,675 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-04-17 13:41:21,984 - BERTopic - Cluster - Completed ✓
2025-04-17 13:41:21,998 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-04-17 13:41:24,181 - BERTopic - Representation - Completed ✓


In [None]:
# Set only numbers as labels (for a better visualization).
# Labels are keyed by the fitted model's own topic ids, so they stay aligned
# with the topics whether or not an outlier topic (-1) is present.
topic_labels_clean = {topic: str(topic) for topic in topic_model_new2_clean.get_topics()}

topic_model_new2_clean.set_topic_labels(topic_labels_clean)

In [None]:
# Visualize plot
fig_UMAP_new_clean = topic_model_new2_clean.visualize_documents(attrition_paper.Title, reduced_embeddings=reduced_embeddings_clean, hide_annotations = False, custom_labels= True, width = 800, height = 500)
fig_UMAP_new_clean

In [None]:
# Save the figure as HTML to keep the interactive version
# (the file name says "clean" so it is not confused with the raw-text figure)
fig_UMAP_new_clean.write_html("/content/drive/MyDrive/Truong Nguyen/output/BERTopics_new_clean_UMAP2(2).html", default_width = 1200, default_height = 1200)