<a href="https://colab.research.google.com/github/Jankoding/topic-classifier/blob/main/topic_modelling_w_discovery_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
# Install required packages first (run this cell once)
!pip install -q bertopic sentence-transformers plotly

# Import libraries and mount Google Drive
import os
import shutil
from google.colab import drive
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

# Mount Google Drive
drive.mount('/content/drive')

# Paths in your Google Drive (adjust if needed)
# input_path: the large plain-text corpus file to analyse
input_path = "/content/drive/MyDrive/topic_classifier/topic-modelling-w-discovery/combined.txt"
# output_model_dir: location in your Google Drive where the fitted BERTopic model is saved
output_model_dir = "/content/drive/MyDrive/topic_classifier/topic-modelling-w-discovery/bertopic_model_dir"
# output_results: text file that receives the per-chunk topic assignments
output_results = "/content/drive/MyDrive/topic_classifier/topic-modelling-w-discovery/topic_results.txt"

# Read input text
with open(input_path, "r", encoding="utf-8") as f:
    text = f.read()

# Split the text into fixed-size word chunks; each chunk is treated as one document.
# chunk_size is the number of words per chunk used for topic modeling.
def chunk_text(text, chunk_size=500):
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size):
        chunk_words = words[i:i + chunk_size]
        if len(chunk_words) > 20:  # Skip very short chunks (20 words or fewer)
            chunks.append(" ".join(chunk_words))
    return chunks

# Override the default and use 100-word chunks
docs = chunk_text(text, chunk_size=100)
print(f"Loaded {len(docs)} paragraphs from {input_path}")

# Load embedding model
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# Initialize BERTopic
topic_model = BERTopic(embedding_model=embedding_model, verbose=True)

# Fit the model
topics, probs = topic_model.fit_transform(docs)

# Print topic summary
print("\nTOPIC SUMMARY:")
print(topic_model.get_topic_info())

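# Remove any previously saved model to avoid errors.
# With the default (pickle) serialization, save() writes a single file, not a directory,
# so handle both cases.
if os.path.isdir(output_model_dir):
    shutil.rmtree(output_model_dir)
elif os.path.exists(output_model_dir):
    os.remove(output_model_dir)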

# Save the BERTopic model
topic_model.save(output_model_dir)

# Save individual topic assignments to a text file
with open(output_results, "w", encoding="utf-8") as out:
    for i, (topic, prob) in enumerate(zip(topics, probs)):
        out.write(f"\n--- Paragraph {i+1} ---\n")
        out.write(f"Topic: {topic} | Confidence: {prob:.3f}\n")
        out.write(docs[i] + "\n")

print(f"\nResults saved to '{output_results}'")
print(f"BERTopic model saved to '{output_model_dir}'")

# Optional: Visualization (works well in Colab)
try:
    import plotly.io as pio
    pio.renderers.default = "colab"
    fig = topic_model.visualize_topics()
    fig.show()
except Exception as e:
    print(f"Visualization skipped: {e}")


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Loaded 244 paragraphs from /content/drive/MyDrive/topic_classifier/topic-modelling-w-discovery/combined.txt


2025-06-02 11:21:32,913 - BERTopic - Embedding - Transforming documents to embeddings.



2025-06-02 11:21:50,922 - BERTopic - Embedding - Completed ✓
2025-06-02 11:21:50,925 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-06-02 11:21:51,928 - BERTopic - Dimensionality - Completed ✓
2025-06-02 11:21:51,931 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-06-02 11:21:51,953 - BERTopic - Cluster - Completed ✓
2025-06-02 11:21:51,960 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-06-02 11:21:52,029 - BERTopic - Representation - Completed ✓



TOPIC SUMMARY:
   Topic  Count                        Name  \
0     -1      2  -1_in_turkey_refugee_state   
1      0    212             0_the_of_and_in   
2      1     30             1_the_in_and_to   

                                      Representation  \
0  [in, turkey, refugee, state, islamic, an, with...   
1  [the, of, and, in, to, that, iraq, is, islamic...   
2  [the, in, and, to, kurdish, of, pkk, turkish, ...   

                                 Representative_Docs  
0  [to radical Islamists in Syria - In October ...  
1  [section on Syria given the rapid deterioratio...  
2  [on a possible role for Turkey in the assault ...  

Results saved to '/content/drive/MyDrive/topic_classifier/topic-modelling-w-discovery/topic_results.txt'
BERTopic model saved to '/content/drive/MyDrive/topic_classifier/topic-modelling-w-discovery/bertopic_model_dir'
Visualization skipped: zero-size array to reduction operation maximum which has no identity
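
The "Visualization skipped" message above plausibly comes from the run producing only two non-outlier topics, which may be too few for the 2D inter-topic map. The following is a minimal sketch of a guard that falls back to per-topic bar charts in that case. The helper name `pick_topic_figure` and the `min_topics` threshold of 3 are assumptions made for illustration, not documented BERTopic limits; `get_topics()`, `visualize_barchart()` and `visualize_topics()` are the library calls it relies on.

```python
def pick_topic_figure(topic_model, min_topics=3):
    """Return a figure for a fitted topic model, or None if there is nothing to plot.

    min_topics is an assumed threshold (hypothetical), not a documented limit.
    """
    # get_topics() maps topic id -> list of (word, score); -1 is the outlier topic.
    real_topics = [t for t in topic_model.get_topics() if t != -1]
    if not real_topics:
        return None
    if len(real_topics) < min_topics:
        # Too few topics for an inter-topic distance map: show bar charts instead.
        return topic_model.visualize_barchart(top_n_topics=len(real_topics))
    return topic_model.visualize_topics()


# Assumed usage in the notebook cell, after fit_transform():
# fig = pick_topic_figure(topic_model)
# if fig is not None:
#     fig.show()
```

The existing `try`/`except` can stay around the call as a safety net, since the guard only covers the "too few topics" case.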

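
The topic summary above is dominated by function words (`0_the_of_and_in`, `1_the_in_and_to`), which makes the topic labels hard to interpret. One common remedy is to pass a scikit-learn `CountVectorizer` with English stop words to BERTopic through its `vectorizer_model` argument. The sketch below only demonstrates the effect of the vectorizer on two toy sentences (invented for illustration, not taken from the corpus); the commented lines show how it would be wired into the cell above, on the assumption that the rest of the pipeline stays unchanged.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy sentences for illustration only; they are not taken from the corpus.
sample_docs = [
    "the role of the state in the region",
    "the refugee camps in the north of the country",
]

# stop_words="english" drops common English function words before counting.
vectorizer_model = CountVectorizer(stop_words="english")
vectorizer_model.fit(sample_docs)

# The learned vocabulary no longer contains words such as "the", "of" or "in".
vocabulary = sorted(vectorizer_model.vocabulary_)
print(vocabulary)

# Assumed usage in the notebook cell above:
# topic_model = BERTopic(
#     embedding_model=embedding_model,
#     vectorizer_model=vectorizer_model,
#     verbose=True,
# )
```

Because the vectorizer is only used when building the topic representations, the embeddings and clustering are unaffected; only the words that describe each topic change.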