# **BERTopic Analysis**
**Overview**

This analysis uses BERTopic, an advanced topic modeling technique that combines transformer-based embeddings with clustering algorithms to automatically discover topics in text data. We're analyzing a dataset focused on disability and digital inclusion themes.

# Installation and Setup
The required library is installed first with `pip install bertopic` (run only once in Google Colab); the imports below then load it together with its dependencies:


In [None]:
# Import required libraries
import pandas as pd
from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
import random

In [None]:
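# Seed Python's built-in random module for reproducibility
# (UMAP's randomness is controlled separately through its own random_state parameter)
random.seed(42)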

# Data Loading and Theme Extraction

Load the dataset and separate it into different thematic categories for potential individual analysis:


In [None]:
# Step 1 – Load the pre-cleaned text dataset and make a dataframe for each theme
df = pd.read_csv("../1_datasets/processed_data/cleaned_datasets.csv")  # Load the main dataset

# Create separate dataframes for each thematic category
barrier_to_access = df[df['theme'] == "barriers_to_access"]  # Documents about access obstacles
digital_infrastructure = df[df['theme'] == "digital_infrastructure"]  # Infrastructure-related documents
inclusive_digital_technology = df[df['theme'] == "inclusive_digital_technology"]  # Accessibility tech documents


**Purpose:** This step loads the preprocessed dataset and organizes it by predefined themes. While separate dataframes are created, all documents will be analyzed together to discover cross-cutting topics.

# Text Preprocessing and Stopword Configuration
Define domain-specific stopwords to focus on meaningful topic-distinguishing terms:

In [None]:
# Define domain-specific words to ignore
disability_domain_stopwords = [
    "disability", "disabled", "person", "people",
    "africa", "african", "report", "study",
    "research", "data", "information", "nigeria", "india",
    "2024", "non", "percent", "sta", "et", "al", "100",
    "au", "2020", "akinola", "at",
]

In [None]:
# Modify for specific theme or all datasets
docs = df["cleaned_text"].tolist() # Convert pandas Series to list format required by BERTopic

**Purpose:** Preparing the text data by removing domain-specific common words that don't contribute to meaningful topic differentiation.

# Model Configuration
Set up the vectorizer, clustering algorithm, and dimensionality reduction components:




In [None]:
# Create vectorizer that ignores these words
vectorizer_model = CountVectorizer(
    stop_words=disability_domain_stopwords,  # Apply custom stopwords
    min_df=2,                                # Ignore words appearing in <2 documents
    max_df=0.8,                              # Ignore words appearing in >80% of documents
    ngram_range=(1, 3),                      # Include 1-3 word phrases
)


In [None]:
# Configure clustering algorithm for small, cohesive clusters
hdbscan_model = HDBSCAN(min_cluster_size=2, min_samples=1) # Very permissive clustering settings


**Purpose:** Configuring the core components of BERTopic including text vectorization and clustering parameters to optimize topic discovery for our specific dataset.
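
The effect of the `min_df` and `max_df` thresholds can be illustrated without any library. The following is a minimal sketch on a hypothetical, already tokenized toy corpus; the function name `filter_by_document_frequency` and the data are assumptions made for illustration, not part of the analysis. It mimics the filtering rule only, not the full `CountVectorizer` behaviour.

```python
from collections import Counter


def filter_by_document_frequency(tokenized_docs, min_df=2, max_df=0.8):
    """Keep terms that appear in at least min_df documents and in at most
    a max_df fraction of all documents."""
    n_docs = len(tokenized_docs)
    # Count the number of documents containing each term (not total occurrences)
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return {
        term
        for term, df in doc_freq.items()
        if df >= min_df and df <= max_df * n_docs
    }


# Hypothetical toy corpus (illustration only)
toy_docs = [
    ["mobile", "internet", "cost"],
    ["mobile", "internet", "access"],
    ["mobile", "teacher", "access"],
    ["mobile", "teacher", "school"],
    ["mobile", "ai", "governance"],
]

kept_terms = filter_by_document_frequency(toy_docs)
print(sorted(kept_terms))
```

In this toy corpus "mobile" occurs in every document and is dropped by `max_df`, while terms that occur in a single document are dropped by `min_df`.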

# Dimensionality Reduction Configuration



In [None]:
# Configure UMAP for dimensionality reduction of document embeddings
umap_model = UMAP(
    n_neighbors=15,     # Lower values focus on local structure, good for fine-grained topics
    n_components=5,     # Number of dimensions to reduce to (typical range: 5–15)
    min_dist=0.0,       # Allows tight clustering of points
    metric='cosine',    # Better for semantic embeddings like BERT
    random_state=42,    # Fix UMAP's random seed so repeated runs give the same topics
)


**Purpose:** Setting up UMAP (Uniform Manifold Approximation and Projection) to reduce the high-dimensional document embeddings while preserving semantic relationships, enabling better clustering and visualization.


# Model Training and Topic Discovery
Initialize and train the BERTopic model:


In [None]:
# Instantiate BERTopic model with configured components
topic_model = BERTopic(
    language="english", # Specify language for optimized processing
    umap_model=umap_model, # Use configured dimensionality reduction
    hdbscan_model=hdbscan_model, # Use configured clustering algorithm
    nr_topics=15, # Reduce to at most 15 topics (no effect if fewer are found)
    vectorizer_model=vectorizer_model, # Apply custom vectorizer with stopwords
)

# Train the model and assign topics to documents
topics, probs = topic_model.fit_transform(docs) # topics: topic assignments, probs: assignment probabilities

# Create topic hierarchy for understanding relationships
hierarchical_topics = topic_model.hierarchical_topics(docs)

# Check the model
topic_model.get_topic_info()




|  | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 4 | -1_ict_intellectual_national_communication tec... | [ict, intellectual, national, communication te... | [good practice employment person intellectual ... |
| 1 | 0 | 3 | 0_centre_tech_accesstech_mtn | [centre, tech, accesstech, mtn, app, button, t... | [blind tech ceo return inspire student pacelli... |
| 2 | 1 | 6 | 1_digital inclusion_internet access_south_infr... | [digital inclusion, internet access, south, in... | [size distribution digital connectivity gap su... |
| 3 | 2 | 2 | 2_ai_young_entrepreneur_entrepreneurship | [ai, young, entrepreneur, entrepreneurship, ai... | [innovate inclusion youth disability lead futu... |
| 4 | 3 | 2 | 3_quality_teacher_approximately_attend | [quality, teacher, approximately, attend, emer... | [transform education begin inclusion  general... |
| 5 | 4 | 5 | 4_mobile internet_gsma_cost_cent | [mobile internet, gsma, cost, cent, social pro... | [term condition access use find assistive tech... |
| 6 | 5 | 3 | 5_inclusive datum_sightsaver_safaricom_plan | [inclusive datum, sightsaver, safaricom, plan,... | [empower vulnerable community mtn commitment i... |
| 7 | 6 | 4 | 6_telecom_director_outlet_panel | [telecom, director, outlet, panel, google, saf... | [man disability catalyst innovation papi sibom... |
| 8 | 7 | 2 | 7_pwd_digital inclusion_program_project enable | [pwd, digital inclusion, program, project enab... | [strengthen ict accessibility pwd africa post ... |
| 9 | 8 | 2 | 8_ai_dataset_governance_ibid | [ai, dataset, governance, ibid, funding, pwds,... | [ai assistive technology at person disability ... |


**Purpose:** This step brings the previously configured components together: it instantiates the model with the custom vectorizer and the permissive HDBSCAN settings, trains it on all documents, builds the topic hierarchy, and prints a summary table. Topic `-1` in that table collects outlier documents that were not assigned to any cluster.

# Representative Document Analysis
Sample documents for each topic are examined to understand their content:

In [None]:
for topic_num in dict.fromkeys(topics): # Visit each distinct topic once, in order of first appearance
    if topic_num != -1: # Skip outlier documents (topic -1)
        print(f"\n--- TOPIC {topic_num} ---\n")
        representative_docs = topic_model.get_representative_docs(topic_num) # Separate name so the `docs` corpus is not overwritten
        for doc in representative_docs[:3]:  # First 3 sample docs
            print(doc[:300])  # Preview first 300 chars


--- TOPIC 6 ---

man disability catalyst innovation papi sibomana photo credit patrick nzabonimpa deep readquick read patrick nzabonimpa 2014 papi sibomana travel india hope treatment repair vision lose year early arrive optimism hope dash tell impairment incurable instead treatment refer rehabilitation centre month
telecom operators africa fail person disability august 2020 access deny introduction research design scope result availability accessible handset sale outlet promotion awareness accessible mobile telecommunication procurement policy accessible handset physical accessibility sale outlet capacity tele
theme digital accessibility assistive technology africa level knowledge document day 29- 2023 opening remark irene mbari kirika founder executive director inable key takeaway disability long term phenomenon collaboration important change generation come access unity purpose change generation come wo

--- TOPIC 4 ---

term condition access use find assistive technology official 

**Purpose:** By reading actual documents from each topic, the clustering's semantic coherence can be verified and each topic's true representation understood.
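
Because `topics` holds one topic ID per document, the same ID appears once for every document assigned to it. The sketch below shows one way to get each distinct ID once and to count documents per topic. The list `toy_topics` is hypothetical and only shaped like the `topics` list returned by `fit_transform`.

```python
from collections import Counter

# Hypothetical per-document topic assignments (-1 marks outliers)
toy_topics = [6, 4, 6, -1, 1, 4, 1, 1, -1, 6]

# Distinct topic IDs in order of first appearance, without the outlier label
unique_topics = [t for t in dict.fromkeys(toy_topics) if t != -1]
print(unique_topics)

# Number of documents assigned to each topic
topic_sizes = Counter(toy_topics)
for topic_id in unique_topics:
    print(f"Topic {topic_id}: {topic_sizes[topic_id]} document(s)")
```

`dict.fromkeys` preserves insertion order, so the topics are listed in the order in which they first occur in the corpus.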

# Visualization 1: Topic Frequency Bar Chart
A bar chart of the top words in each topic is created:

In [None]:
# To visualize the top words per topic
topic_model.visualize_barchart(n_words=7) # Show top 7 words for each topic with their c-TF-IDF scores

**Purpose:** This visualization shows the most important words for each topic and their relative importance scores, helping understand what characterizes each topic.

# Visualization 2: Topic Similarity Heatmap
A similarity matrix between all topics is generated:

In [None]:
topic_model.visualize_heatmap() # Create similarity matrix showing relationships between topics

**Purpose:** This heatmap shows how similar each topic is to every other topic. Darker colors indicate higher similarity, helping identify potentially redundant topics or topic clusters.
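
To make the idea of a topic similarity matrix concrete, here is a minimal sketch that assumes similarity is measured as cosine similarity between topic vectors. The three vectors and their names are hypothetical; real topic vectors have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical topic vectors (illustration only)
toy_topic_vectors = {
    "topic_a": [1.0, 0.0, 1.0],
    "topic_b": [1.0, 0.0, 0.9],
    "topic_c": [0.0, 1.0, 0.0],
}

names = list(toy_topic_vectors)
similarity = [
    [cosine_similarity(toy_topic_vectors[row], toy_topic_vectors[col]) for col in names]
    for row in names
]

for name, row in zip(names, similarity):
    print(name, [round(value, 2) for value in row])
```

Here `topic_a` and `topic_b` point in almost the same direction and would appear as a high-similarity cell, whereas `topic_c` is orthogonal to both.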

# Visualization 3: 2D Topic Space
A 2D representation of topic relationships is created:

In [None]:
# To visualize topics in 2D
topic_model.visualize_topics() # Project topics into 2D space using dimensionality reduction

**Purpose:** This plot shows topics as circles in 2D space where distance represents similarity. Topics that are close together are more semantically related.

# Visualization 4: Hierarchical Topic Clustering
The hierarchical relationship between topics is displayed:

In [None]:
topic_model.visualize_hierarchy(hierarchical_topics=hierarchical_topics) # Show dendrogram of topic relationships

**Purpose:** This dendrogram shows how topics can be merged at different similarity levels, revealing the hierarchical structure of the topic space and which topics are most closely related.

# Results Interpretation
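Based on the visualizations, the topics can be summarized as follows. The numbering in this list is shifted by one relative to the model's topic IDs in the `get_topic_info()` table: "Topic 0" here matches the keywords and document count of the outlier group (`-1`), "Topic 1" matches model topic 0, and so on.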

**Topic Distribution (11 Topics Discovered):**


**Topic 0 (4 documents)**

**Keywords:** ict, intellectual, national, communication, technology

**Theme:** ICT and intellectual communication technologies

**Topic 1 (3 documents)**

**Keywords:** centre, tech, accesstech, mtn, app, button

**Theme:** Technology centers and accessibility applications

**Topic 2 (6 documents)**

**Keywords:** digital inclusion, internet access, south, infrastructure

**Theme:** Digital inclusion and internet infrastructure in South Africa

**Topic 3 (2 documents)**

**Keywords:** ai, young, entrepreneur, entrepreneurship

**Theme:** AI and young entrepreneurship initiatives

**Topic 4 (2 documents)**

**Keywords:** quality, teacher, approximately, attend

**Theme:** Educational quality and teacher attendance

**Topic 5 (5 documents)**

**Keywords:** mobile internet, gsma, cost, cent, social protection

**Theme:** Mobile internet costs and social protection programs

**Topic 6 (3 documents)**

**Keywords:** inclusive datum, sightsaver, safaricom, plan

**Theme:** Inclusive data initiatives and organizational partnerships

**Topic 7 (4 documents)**

**Keywords:** telecom, director, outlet, panel, google, safaricom

**Theme:** Telecommunications leadership and industry panels

**Topic 8 (2 documents)**

**Keywords:** pwd, digital inclusion, program, project enable

**Theme:** PWD-focused digital inclusion programs and enabling projects

**Topic 9 (2 documents)**

**Keywords:** ai, dataset, governance, ibid, funding, pwds

**Theme:** AI governance, datasets, and PWD funding initiatives

**Topic 10 (3 documents)**

**Keywords:** nancial (a truncated form of "financial" in the source text), survey, figure, household, inaccessible

**Theme:** Financial surveys and household accessibility barriers

**Topic 11 (2 documents)**

**Keywords:** digital platform, livelihood, young, platform

**Theme:** Digital platforms for youth livelihoods

# Key Insights:

1. **Coherent Topics:** Each topic shows semantically related words and themes

2. **Good Coverage:** Topics span infrastructure, policy, technology, and social aspects

3. **Meaningful Clusters:** The hierarchical clustering reveals logical groupings

4. **Balanced Distribution:** Topic sizes range from 2 to 6 documents, so no single topic dominates, suggesting good topic diversity
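
The balance claim can be checked with a few lines of arithmetic. This sketch uses the document counts listed under Topic Distribution above; the variable names are chosen for illustration.

```python
# Document counts per topic, as listed in the Topic Distribution section
topic_counts = {
    0: 4, 1: 3, 2: 6, 3: 2, 4: 2, 5: 5,
    6: 3, 7: 4, 8: 2, 9: 2, 10: 3, 11: 2,
}

total_docs = sum(topic_counts.values())

# Topic with the most documents and its share of the corpus
largest_topic, largest_count = max(topic_counts.items(), key=lambda item: item[1])
largest_share = largest_count / total_docs

print(f"{total_docs} documents in total")
print(f"Largest topic: {largest_topic} with {largest_count} documents ({largest_share:.1%})")
```

With these counts the largest topic holds 6 of 38 documents, roughly 16% of the corpus.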

The analysis successfully identified distinct themes within the disability and digital inclusion domain, ranging from technical infrastructure to policy frameworks.