# Exploring SDoH topics with BERTopic


### Install BERTopic

In [1]:
!pip install bertopic
# Imports
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer, util
from umap import UMAP
import os
import pandas as pd
import requests


### Load SDG Data
The OSDG Community Dataset is a set of document excerpts, each labeled with a Sustainable Development Goal (SDG). Labeling is performed manually by volunteers. The excerpts are derived from publicly available documents, including reports, policy documents, and journal abstracts.

Column descriptions:

* doi - Digital Object Identifier of the original document
* text_id - unique text identifier
* text - text excerpt from the document
* sdg - the SDG the text is validated against
* labels_negative - the number of volunteers who rejected the suggested SDG label
* labels_positive - the number of volunteers who accepted the suggested SDG label
* agreement - agreement score
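
The agreement score can be computed from the two vote counts. The following is a minimal sketch, assuming the score follows the definition given in the dataset's published description, `|labels_positive - labels_negative| / (labels_positive + labels_negative)`. The function name and the handling of zero votes are illustrative assumptions, and the vote counts are made up.

```python
def agreement_score(labels_positive: int, labels_negative: int) -> float:
    """Margin between accepting and rejecting votes as a share of all votes."""
    total = labels_positive + labels_negative
    if total == 0:
        # Assumption: with no votes there is no measurable agreement.
        return 0.0
    return abs(labels_positive - labels_negative) / total


# Hypothetical vote counts, not taken from the dataset.
print(agreement_score(8, 1))  # strong majority accepts the label
print(agreement_score(3, 3))  # volunteers are evenly split
```

A score near 1 means the volunteers were nearly unanimous; a score of 0 means they were evenly split.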

In [2]:
url = 'https://zenodo.org/records/11441197/files/osdg-community-data-v2024-04-01.csv?download=1'
# Read the file directly from the URL; despite the .csv extension it is tab-separated
data = pd.read_csv(url, sep='\t')

### Filter SDGs
We will explore Sustainable Development Goal (SDG) 11, "Sustainable Cities and Communities".

In [3]:
data = data.loc[data['sdg'] == 11]
text_source = data['text'].to_list()

## Embedding Text
Compute embeddings using a Sentence Transformers model (see https://www.sbert.net/). The all-MiniLM-L6-v2 model is a compact general-purpose model, suited to text analysis where performance and speed are critical.

In [4]:
model_embedding = SentenceTransformer('all-MiniLM-L6-v2')
corpus_embeddings = model_embedding.encode(text_source)
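
Embeddings are useful because texts with similar meaning end up close together, which is what the later clustering step relies on. The sketch below illustrates this with cosine similarity on toy 3-dimensional vectors. The vectors and their names are made up for illustration and are not output of all-MiniLM-L6-v2.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for sentence embeddings (hypothetical values).
bus_lanes = [0.9, 0.1, 0.0]
cycle_paths = [0.8, 0.2, 0.1]
rental_housing = [0.0, 0.2, 0.9]

# The two transport-like vectors are closer to each other than to the housing one.
print(cosine_similarity(bus_lanes, cycle_paths))
print(cosine_similarity(bus_lanes, rental_housing))
```

With real embeddings, `util.cos_sim` from `sentence_transformers` (imported in the first cell) computes the same quantity.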


## Instantiate vectorizer

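Embedding models can handle stop words, so the texts do not need cleaning before they are embedded. Removing stop words from the topic representations, however, often gives better and more coherent topics. We can use the CountVectorizer for this: it is applied after the documents have been embedded and clustered, so it only affects the words used to describe each topic. There are no disadvantages to removing stop words this way, as the embeddings are generated from the full texts.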

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_model = CountVectorizer(stop_words="english")
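
To see why stop-word removal matters for topic descriptions, here is a minimal pure-Python sketch of word counting with and without a stop-word filter. The `STOP_WORDS` set and the `count_terms` helper are illustrative assumptions; scikit-learn's built-in English list is much larger, and `CountVectorizer` tokenizes more carefully.

```python
from collections import Counter

# Tiny illustrative stop-word set (assumption, not scikit-learn's list).
STOP_WORDS = frozenset({"the", "of", "in", "and", "a", "to", "is"})


def count_terms(text, stop_words=frozenset()):
    """Count lower-cased whitespace tokens, skipping any stop words."""
    tokens = text.lower().split()
    return Counter(token for token in tokens if token not in stop_words)


doc = "the quality of the air in the city"

# Without filtering, the most frequent term is a stop word.
print(count_terms(doc).most_common(1))
# With filtering, only content words remain.
print(count_terms(doc, STOP_WORDS))
```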

## Instantiate Model and Fit

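**n_gram_range**: This refers to the number of consecutive words grouped together when creating topic representations. For example, in the phrase "air pollution" the two words form a 2-gram (n=2), and a range of (1, 2) keeps both single words and two-word phrases. Note that BERTopic ignores this parameter when a custom vectorizer_model is supplied, as in the cell below; in that case the n-gram range is controlled by the `ngram_range` argument of the CountVectorizer.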

**nr_topics**: This limits the number of topics after the model has been trained. For example, if the model initially identifies 50 topics, but you set nr_topics to 10, the model will reduce the number of topics to around 10. When set to "auto," the model will automatically decide the number of topics using a clustering algorithm called HDBSCAN.

**min_topic_size**: This sets the minimum number of data points (documents) required to form a topic. A lower value results in more topics.
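
As an illustration of what an n-gram range of (1, 2) produces, the sketch below lists all 1-grams and 2-grams of a short token list. The `ngrams` helper is a hypothetical name written for this example and is not part of BERTopic or scikit-learn.

```python
def ngrams(tokens, n_gram_range=(1, 2)):
    """Return all n-grams of the token list for every n in the inclusive range."""
    low, high = n_gram_range
    result = []
    for n in range(low, high + 1):
        # Slide a window of n tokens across the list.
        for i in range(len(tokens) - n + 1):
            result.append(" ".join(tokens[i:i + n]))
    return result


tokens = ["air", "pollution", "levels"]
print(ngrams(tokens))          # single words and two-word phrases
print(ngrams(tokens, (2, 2)))  # two-word phrases only
```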




In [6]:
model = BERTopic(
    n_gram_range=(1, 2),
    vectorizer_model=vectorizer_model,
    # nr_topics='auto',  # alternative: let the model decide the number of topics
    nr_topics=25,
    min_topic_size=10,
    calculate_probabilities=True,
).fit(text_source, embeddings=corpus_embeddings)

## Topic frequency

Get the topic frequencies, sorted so that the largest topics come first. Representative documents are also given for each topic. Note that topic -1 collects the outliers: documents that could not be assigned to any topic.

In [7]:
mti_df = model.get_topic_info()
mti_df = mti_df.sort_values('Count', ascending=False)
mti_df[['Topic', 'Count', 'Name', 'Representative_Docs']]

| | Topic | Count | Name | Representative_Docs |
|---|---|---|---|---|
| 0 | -1 | 729 | -1_urban_development_transport_public | [The Municipal Development Plan and the Calgar... |
| 1 | 0 | 244 | 0_planning_urban_development_land | [This may have reduced Viet Nam’s potential to... |
| 2 | 1 | 198 | 1_road_safety_traffic_speed | [Measures to reduce fatalities and serious inj... |
| 3 | 2 | 195 | 2_transport_public_services_mobility | [At the same time it poses clear risks to the ... |
| 4 | 3 | 110 | 3_housing_rental_social_households | [In addition, 22 reporting countries support t... |
| 5 | 4 | 89 | 4_almaty_kazakhstan_astana_city | [The cooperation is regulated by contracts sig... |
| 6 | 5 | 77 | 5_urban_children_cent_african | [The relative wealth disparity metric is inste... |
| 7 | 6 | 67 | 6_risk_resilience_disaster_disasters | [Therefore, risk should be seen as a normal an... |
| 8 | 7 | 59 | 7_local_government_governments_subnational | [The city government organises workshops and t... |
| 9 | 8 | 57 | 8_air_pollution_quality_pollutants | [Figure 1 illustrates different patterns of ur... |
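
Because topic -1 holds the outliers, its share of the corpus indicates how many documents the model left unassigned. The sketch below computes per-topic shares from a list of topic assignments. The `topic_shares` helper and the assignment list are hypothetical; in this notebook the fitted model's `topics_` attribute would be the natural input.

```python
from collections import Counter


def topic_shares(topics):
    """Map each topic id to the fraction of documents assigned to it."""
    counts = Counter(topics)
    total = len(topics)
    return {topic: count / total for topic, count in counts.items()}


# Hypothetical assignments for eight documents; -1 marks outliers.
assignments = [-1, 0, 0, 1, -1, 2, 0, -1]
shares = topic_shares(assignments)
print(shares[-1])  # fraction of documents left as outliers
```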


## Topic Barchart
The bar chart shows the c-TF-IDF score of the top words in each topic. The score combines how often a word appears in the documents of a given topic with how rare the word is across all topics, so high-scoring words are both frequent within the topic and distinctive for it.

In [8]:
model.visualize_barchart(topics=[0, 1, 2, 3, 4, 5, 6])
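
The c-TF-IDF idea can be sketched in a few lines. The example below assumes the formula given in the BERTopic documentation: a term's frequency within a topic multiplied by `log(1 + A / f)`, where `A` is the average number of words per topic and `f` is the term's frequency across all topics. It is a simplified illustration on made-up tokens, not BERTopic's implementation, which also applies normalization.

```python
import math
from collections import Counter


def c_tf_idf(topic_tokens):
    """Score each term per topic; topic_tokens maps topic id -> list of tokens."""
    term_counts = {topic: Counter(tokens) for topic, tokens in topic_tokens.items()}

    # Frequency of each term across all topics.
    total_counts = Counter()
    for counts in term_counts.values():
        total_counts.update(counts)

    # Average number of words per topic.
    avg_words = sum(total_counts.values()) / len(term_counts)

    return {
        topic: {
            term: tf * math.log(1 + avg_words / total_counts[term])
            for term, tf in counts.items()
        }
        for topic, counts in term_counts.items()
    }


# Hypothetical tokens: all documents of a topic joined into one token list.
topic_tokens = {
    0: ["urban", "planning", "land", "urban", "city"],
    1: ["road", "safety", "traffic", "road", "city"],
}
scores = c_tf_idf(topic_tokens)
print(sorted(scores[0].items(), key=lambda item: item[1], reverse=True))
```

In this toy example "city" appears in both topics, so it scores lower than words that occur in only one topic.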

## Document clustering
The embeddings are reduced to a two-dimensional space (BERTopic uses UMAP for this by default) to visualize the documents by topic and get insight into how the topics relate to each other.

In [9]:
model.visualize_documents(text_source, embeddings=corpus_embeddings)

## Generate the predicted topic and the probabilities
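We can predict the topic for each document, along with the probabilities that the document belongs to each topic (a topic is a cluster of documents). Because the model was created with calculate_probabilities=True, the probabilities form a matrix with one row per document. We use the predicted topics to inspect all the documents in topic 1.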

In [10]:

topics, probabilities = model.transform(text_source, corpus_embeddings)

In [11]:
topic_text_df = pd.DataFrame({'topic': topics, 'document': text_source})
topic_text_df[topic_text_df.topic == 1]

| | topic | document |
|---|---|---|
| 17 | 1 | The issue of low noise vehicles (i.e.: electri... |
| 31 | 1 | The Police Authority of Gyeonggi Province took... |
| 66 | 1 | This unit would be charged with leading develo... |
| 77 | 1 | Towards Zero: Road Safety Strategy endorsed by... |
| 78 | 1 | An MTA that covers the entire commuting area c... |
| ... | ... | ... |
| 2184 | 1 | The Cycling Development Concept was based on p... |
| 2200 | 1 | Between 1990 and 1993 the TAC strategy helped ... |
| 2211 | 1 | Both cases show that strong leadership is need... |
| 2215 | 1 | In road policing, Police target population bas... |


In [12]:
probabilities

array([[4.04381153e-04, 3.06446568e-05, 2.39424221e-04, ...,
        5.43945409e-05, 4.55222557e-05, 5.23918079e-05],
       [1.42515037e-01, 1.41727390e-03, 2.10673058e-02, ...,
        2.35494131e-02, 1.85448630e-02, 1.13448007e-03],
       [2.51086552e-01, 2.15517669e-03, 2.45392654e-02, ...,
        2.27892203e-02, 1.52079097e-02, 2.44668903e-03],
       ...,
       [4.86248127e-32, 3.24392951e-33, 5.04087074e-01, ...,
        6.26403493e-33, 5.19174604e-33, 1.12236626e-33],
       [3.94221760e-13, 6.79361613e-14, 9.37552579e-01, ...,
        5.46022835e-14, 4.51990835e-14, 3.26313785e-14],
       [1.08376533e-10, 1.01860728e-11, 9.28543877e-01, ...,
        1.41524845e-11, 1.16890882e-11, 2.32094854e-12]])
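
One way to read a matrix like this is row by row: each row belongs to one document, and the largest value in the row marks the topic the document most likely belongs to. The sketch below does this on made-up rows. It assumes that column j corresponds to topic j and that the outlier topic -1 has no column; `most_likely_topic` is a hypothetical helper, not a BERTopic function.

```python
def most_likely_topic(row):
    """Return (column index, probability) of the largest entry in a row."""
    best = max(range(len(row)), key=row.__getitem__)
    return best, row[best]


# Hypothetical probability rows for two documents and four topics.
rows = [
    [0.02, 0.01, 0.94, 0.03],  # clearly dominated by column 2
    [0.25, 0.00, 0.02, 0.02],  # weaker evidence, column 0 leads
]
for row in rows:
    print(most_likely_topic(row))
```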