<a href="https://colab.research.google.com/github/RJuro/ga22/blob/main/ga22/tutorials/BERTopic_Cordis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BERTopic for Topic Modeling with CORDIS data.

[BERTopic](https://maartengr.github.io/BERTopic/index.html) can be considered the current (2022) state of the art in topic modeling. You'll find the corresponding [paper here](https://arxiv.org/abs/2203.05794).
Its advantage lies in a clever use of [sentence transformers](https://www.sbert.net/) combined with dimensionality reduction and clustering (by default UMAP and HDBSCAN).
Sentence transformers encode natural language efficiently, even in very large amounts. UMAP (dimensionality reduction) and HDBSCAN (clustering) are two high-performance algorithms.
The author, Maarten Grootendorst, released a well-documented and increasingly used package that implements all steps, including useful visualization and representation tools.

In this tutorial we will use the approach to identify topics in CORDIS data (EU FP and H2020 project results). 
This is a basic-application tutorial adjusted to work for "smaller data" (500 summaries) following [this tutorial](https://www.kaggle.com/code/maartengr/topic-modeling-arxiv-abstract-with-bertopic/notebook).

Also: we are going to use a GPU-enabled instance. Google Colab gets you very far (but clear the legal questions first).

In [1]:
# Start by installing the package (in quiet mode)
!pip install bertopic -q

In [2]:
# Colab specific widget handling
from google.colab import output
output.enable_custom_widget_manager()

In [3]:
# Load packages for the analysis
import pandas as pd #handling / opening data
import random #create random years (this table does not have clear years)

from bertopic import BERTopic
from hdbscan import HDBSCAN
from umap import UMAP

from sklearn.feature_extraction.text import CountVectorizer

In [4]:
# Load report-data
reports = pd.read_csv('https://github.com/SDS-AAU/SDS-master/raw/master/M2/data/cordis-h2020reports.gz')

In [16]:
years = pd.to_datetime(reports.lastUpdateDate)

In [17]:
set([y.year for y in years])

{nan, 2017, 2018, 2019}
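
The same check can be written with the pandas `.dt` accessor, which also makes the missing dates explicit. This is a minimal sketch on a hypothetical toy series (the `toy_dates` values are made up for illustration and are not taken from the CORDIS table):

```python
import pandas as pd

# hypothetical toy data standing in for reports.lastUpdateDate
toy_dates = pd.Series(["2017-05-01 10:00:00", None, "2019-01-15 08:30:00"])

parsed = pd.to_datetime(toy_dates)      # missing values become NaT
n_missing = int(parsed.isna().sum())    # how many dates are missing
unique_years = sorted(parsed.dt.year.dropna().astype(int).unique().tolist())

print(n_missing, unique_years)
```

Counting the missing dates separately shows where the `nan` entry in the set above comes from.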

In [18]:
# creating "fake years" for this tutorial...don't do that in a real analysis :-)
reports['year'] = [random.choice(range(2010,2018)) for _ in range(len(reports))]
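
Because `random.choice` draws new values on every run, the time-based plot further down changes between runs. If reproducibility matters, one option is to draw from a seeded generator. A minimal sketch, where the helper name `fake_years` and the seed value 42 are arbitrary choices for illustration:

```python
import random

def fake_years(n, start=2010, stop=2018, seed=42):
    """Draw n pseudo-random years from [start, stop) with a fixed seed."""
    rng = random.Random(seed)  # local generator, leaves the global random state alone
    return [rng.choice(range(start, stop)) for _ in range(n)]

years_a = fake_years(5)
years_b = fake_years(5)
print(years_a == years_b)  # same seed, same draw
```

In the notebook this would become `reports['year'] = fake_years(len(reports))`.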

In [19]:
reports['summary']

0      Polyaniline has historically been one of the m...
1      Problem/issue: Increasing digitalisation enabl...
2      At a time of public budget constraints, major ...
3      Espresso coffee has always been closely associ...
4      The primary industrial objective of the PLEIAD...
                             ...                        
495    HEMAV, a technology-based SME leader in civil ...
496    The ultimate objective of the Celletest projec...
497    Magnesium deficiency is a global issue which a...
498    The main objective of the MEMO project is to c...
499    Hydrophobic surfaces have significant potentia...
Name: summary, Length: 500, dtype: object

We need to specify a few things to make the approach work in our setting.
This involves:

* Using a custom vectorizer that removes stop words (e.g. the, and, to, I)
* Tweaking UMAP and HDBSCAN to produce more, and more specific, clusters (check the BERTopic FAQ and documentation)
* Requesting n-grams from BERTopic for "reporting"
* Using the specialized allenai-specter transformer, which is pretrained on scientific text
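
To make the first and third point concrete, here is a small sketch of what a `CountVectorizer` with stop-word removal and n-grams produces. The toy sentence and the choice of `ngram_range=(1, 2)` are assumptions for illustration. Note that, according to the BERTopic documentation, the `n_gram_range` argument of `BERTopic` is not used once a custom vectorizer is passed, so the n-gram setting belongs on the vectorizer itself.

```python
from sklearn.feature_extraction.text import CountVectorizer

# toy document, made up for illustration
toy_docs = ["the energy market and the energy storage"]

# stop words are dropped first, then uni- and bigrams are built from what is left
toy_vectorizer = CountVectorizer(stop_words="english", ngram_range=(1, 2))
toy_vectorizer.fit(toy_docs)

vocabulary = sorted(toy_vectorizer.vocabulary_)
print(vocabulary)
```

The vocabulary contains bigrams such as "energy market" but no entries for "the" or "and".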



In [20]:
# custom vectorizer to get rid of stopwords
vectorizer_model = CountVectorizer(stop_words="english")

# lower n_neighbors=3 (BERTopic default: 15) and lower n_components=3 (BERTopic default: 5)
umap_model = UMAP(n_neighbors=3, n_components=3, 
                  min_dist=0.0, metric='cosine', random_state=42)

# set min_cluster_size and a low min_samples (fewer documents are treated as outliers)
hdbscan_model = HDBSCAN(min_cluster_size=20, metric='euclidean', 
                        cluster_selection_method='eom', prediction_data=True, min_samples=3)

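# specify all custom models and n-grams
# note: BERTopic ignores n_gram_range when a custom vectorizer_model is passed;
# to get n-grams, set ngram_range on the CountVectorizer instead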
topic_model = BERTopic(verbose=True, 
                       embedding_model="allenai-specter", 
                       n_gram_range=(2, 3), 
                       hdbscan_model=hdbscan_model, 
                       umap_model=umap_model,
                       vectorizer_model=vectorizer_model)

In [21]:
# Run the modeling
topics, _ = topic_model.fit_transform(reports['summary'])
len(topic_model.get_topic_info())  # number of topics, including the outlier topic -1

2022-08-21 09:35:22,778 - BERTopic - Transformed documents to Embeddings
2022-08-21 09:35:30,834 - BERTopic - Reduced dimensionality
2022-08-21 09:35:30,863 - BERTopic - Clustered reduced embeddings


11

The object `topics` is a list with one topic (cluster) number per document, which can be used in other analyses. Documents that HDBSCAN treats as outliers get the number -1.
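
As a small illustration of such a follow-up analysis, the topic assignments can be tallied. The sketch below uses a made-up `toy_topics` list in place of the real `topics` output; by BERTopic's convention, `-1` marks outlier documents that were not assigned to any cluster.

```python
from collections import Counter

# hypothetical topic assignments, one entry per document (-1 = outlier)
toy_topics = [0, 1, -1, 0, 2, 0, 1, -1]

# count documents per topic, ignoring outliers
topic_counts = Counter(t for t in toy_topics if t != -1)
outlier_share = toy_topics.count(-1) / len(toy_topics)

print(topic_counts.most_common())
print(f"share of outliers: {outlier_share:.2f}")
```

With the real data, `reports['topic'] = topics` would store the assignment next to each summary.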

Below are some built-in ways to explore the results.

In [22]:
topic_model.get_topic_info().head(10)

|   | Topic | Count | Name |
|---|-------|-------|------|
| 0 | -1 | 39 | -1_rail_plant_project_plants |
| 1 | 0 | 131 | 0_cells_project_cell_disease |
| 2 | 1 | 96 | 1_energy_market_materials_project |
| 3 | 2 | 46 | 2_innovation_support_sme_smes |
| 4 | 3 | 39 | 3_researchers_science_people_social |
| 5 | 4 | 34 | 4_quantum_project_spin_optical |
| 6 | 5 | 26 | 5_process_sensor_software_data |
| 7 | 6 | 24 | 6_data_research_eo_knowledge |
| 8 | 7 | 24 | 7_water_wastewater_lignin_treatment |
| 9 | 8 | 21 | 8_fish_aircraft_european_aquaculture |


In [23]:
topic_model.visualize_barchart(top_n_topics=9, height=200)

In [24]:
topic_model.visualize_topics(top_n_topics=50)

In [25]:
topic_model.visualize_hierarchy(top_n_topics=50, width=800)

In [27]:
# dynamic analysis with "fake years"
topics_over_time = topic_model.topics_over_time(reports['summary'], topics, reports['year'])
topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=20, width=900, height=500)
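
The frequency part of this analysis can be approximated by hand, which helps when reading the plot. The sketch below counts documents per year and topic on a made-up table (the `toy` DataFrame is hypothetical). It is not BERTopic's implementation, which additionally recomputes the topic words for each time step.

```python
import pandas as pd

# hypothetical year and topic assignments, one row per document
toy = pd.DataFrame({
    "year":  [2010, 2010, 2011, 2011, 2011, 2012],
    "topic": [0, 1, 0, 0, -1, 1],
})

# drop outliers, then count documents per (year, topic) pair
counts = (
    toy[toy["topic"] != -1]
    .groupby(["year", "topic"])
    .size()
    .unstack(fill_value=0)  # years as rows, topics as columns
)

print(counts)
```

Each row of `counts` corresponds to one point per topic on the time axis of the plot.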

You can also use BERTopic to generate embeddings and use them in other analyses, for instance in a supervised task. However, it is probably easier to go directly to SBERT (sentence transformers).

In [53]:
# create embeddings (one vector per summary)
embeddings = topic_model.embedding_model.embed_documents(reports['summary'].tolist())