<a href="https://colab.research.google.com/github/07Lakusz/BERTopic_Topic_Modelling/blob/main/BERTopic_Topic_Modelling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BERTopic Pipeline
## Overview of the Methodology
This notebook implements a topic modeling pipeline built on the BERTopic library. It ingests bibliographic data from `.ris` files, extracts the abstracts, removes English stopwords, and splits each abstract into individual sentences. These sentences are the units of analysis: each one is converted into a dense numerical vector (an `embedding`, 384-dimensional for the default `all-MiniLM-L6-v2` model) by a pre-trained `SentenceTransformer` model.
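
As a rough illustration of the sentence-level bookkeeping described above, the sketch below splits abstracts into sentences while remembering which abstract each sentence came from. The helper names `split_into_sentences` and `flatten_abstracts` and the regex splitter are assumptions made for this example only; the notebook itself relies on NLTK's `sent_tokenize`, which handles abbreviations and other edge cases that a simple regex misses.

```python
import re


def split_into_sentences(abstract):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace.

    Hypothetical stand-in for nltk.tokenize.sent_tokenize.
    """
    parts = re.split(r"(?<=[.!?])\s+", abstract.strip())
    return [part for part in parts if part]


def flatten_abstracts(abstracts):
    """Return two parallel lists: every sentence, and the abstract it came from."""
    sentences, parents = [], []
    for abstract in abstracts:
        for sentence in split_into_sentences(abstract):
            sentences.append(sentence)
            parents.append(abstract)
    return sentences, parents


example_abstracts = [
    "Topic models summarise large corpora. They group related text.",
    "Embeddings map sentences to vectors.",
]
sentences, parents = flatten_abstracts(example_abstracts)
for sentence, parent in zip(sentences, parents):
    print(f"{sentence!r} <- abstract starting {parent[:20]!r}")
```

Keeping the two lists the same length is what later allows each topic assignment to be written next to both the sentence and its source abstract.
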
To make these embeddings easier to cluster, their dimensionality is reduced with `UMAP` (Uniform Manifold Approximation and Projection), which aims to preserve the local and global structure of the data. `HDBSCAN` (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is then applied to the reduced embeddings to identify dense clusters of semantically similar sentences. Each cluster represents a potential topic, and sentences that fall outside every dense region are labelled as outliers (topic `-1`).
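
To give an intuition for density-based clustering and its noise label, here is a minimal sketch of plain DBSCAN on toy 2-D points. It is a simplified stand-in, not HDBSCAN itself: HDBSCAN builds a hierarchy over varying density levels and is tuned by a minimum cluster size rather than a fixed radius. The function name `dbscan`, the `eps` radius, and the toy points are assumptions for illustration, and the UMAP step is skipped because the points are already low-dimensional.

```python
import math


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id, or -1 for noise (plain DBSCAN)."""

    def neighbours(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)  # None means "not visited yet"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # noise reachable from a core point joins the cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbours = neighbours(j)
            if len(j_neighbours) >= min_samples:
                queue.extend(j_neighbours)  # j is a core point, so keep expanding
    return labels


# Two tight groups of four points each, plus one isolated point.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
    (10.0, 0.0),
]
labels = dbscan(points, eps=0.3, min_samples=3)
print(labels)
```

The isolated point receives the label `-1`, which is the same convention the pipeline uses for outlier sentences.
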
Finally, class-based `TF-IDF` (c-TF-IDF) extracts the most representative keywords for each topic, and the `KeyBERTInspired` and `MaximalMarginalRelevance` representation models refine them into human-interpretable labels. Two optional steps follow: merging similar topics into a smaller set, and reassigning outlier sentences to existing topics. The topic assignments, summary tables, visualizations, and the saved model are then packaged for download.
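
The c-TF-IDF weighting can be sketched in a few lines. The example follows the formula given in the BERTopic documentation, `W[t, c] = tf[t, c] * log(1 + A / f[t])`, where `tf[t, c]` is the frequency of term `t` in topic `c`, `f[t]` is its frequency across all topics, and `A` is the average number of words per topic. The function names, the whitespace tokenizer, and the toy sentences are assumptions for illustration, and raw counts are used for `tf`; the library's own implementation may differ in details such as normalisation.

```python
import math
from collections import Counter


def c_tf_idf(docs_per_topic):
    """Score terms per topic with W[t, c] = tf[t, c] * log(1 + A / f[t])."""
    # Treat all sentences of a topic as one long document and count its terms.
    term_counts = {
        topic: Counter(" ".join(docs).lower().split())
        for topic, docs in docs_per_topic.items()
    }
    # f[t]: frequency of term t summed over all topics.
    total_counts = Counter()
    for counts in term_counts.values():
        total_counts.update(counts)
    # A: average number of words per topic.
    avg_words = sum(total_counts.values()) / len(term_counts)
    return {
        topic: {
            term: tf * math.log(1 + avg_words / total_counts[term])
            for term, tf in counts.items()
        }
        for topic, counts in term_counts.items()
    }


def top_terms(term_scores, n):
    """Return the n highest-scoring terms of one topic."""
    return sorted(term_scores, key=term_scores.get, reverse=True)[:n]


docs_per_topic = {
    0: ["gene expression in cancer cells", "cancer gene therapy"],
    1: ["solar energy storage", "energy storage in batteries"],
}
scores = c_tf_idf(docs_per_topic)
for topic, term_scores in scores.items():
    print(topic, top_terms(term_scores, 2))
```

Terms that are frequent inside one topic but rare elsewhere score highest, while a word such as "in", which appears in both topics, is pushed down the ranking.
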

In [None]:
# @title 1. Installation and Imports
# @markdown Run this cell to install and import all required libraries. <br> This cell prepares the Colab environment. It begins by installing the necessary Python libraries that are not included by default, such as bertopic and rispy. After installation, it imports all the required modules for data handling, natural language processing, topic modeling, and plotting. Finally, it downloads the essential NLTK resources (stopwords, punkt, and punkt_tab) which are required for text preprocessing in the later cells.

# --- Install necessary packages ---
# Using --quiet to keep the output clean.
!pip install rispy bertopic sentence-transformers umap-learn hdbscan --quiet

# --- Import Core Libraries ---
import os
import glob
import zipfile
from google.colab import files
import re

# --- Data Handling and Scientific Computing ---
import pandas as pd
import numpy as np

# --- Bibliographic Data Parsing ---
import rispy

# --- Natural Language Processing ---
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import CountVectorizer

# --- Topic Modeling and Machine Learning ---
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance
from sentence_transformers import SentenceTransformer
import umap
import hdbscan
import torch

# --- Plotting ---
import matplotlib.pyplot as plt

# --- Download NLTK resources for text processing ---
# These resources are used for tokenizing text into sentences and removing common stopwords.
nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)

print("✅ Cell 1 complete: All libraries installed, imported, and NLTK resources are ready.")

In [None]:
# @title 2. Settings Configuration
# @markdown Adjust the parameters below to customize the topic modeling process. <br> This cell contains a single Python dictionary named SETTINGS. This centralized configuration allows for easy modification of the topic modeling pipeline's parameters without altering the core logic. Each setting is documented to explain its purpose. Key parameters include the choice of embedding model, configurations for dimensionality reduction (UMAP) and clustering (HDBSCAN), and options for post-processing steps.

SETTINGS = {
    # --- General Settings ---
    "EMBEDDING_MODEL_NAME": "all-MiniLM-L6-v2",  # Model from SentenceTransformers to create numerical representations (embeddings) of sentences.
    "DEVICE": None,  # Set to "cuda", "cpu", or None to auto-detect. Using a GPU ("cuda") is highly recommended for speed.
    "EMBEDDING_BATCH_SIZE": 32,  # Number of sentences to process at once during embedding. Higher values may be faster but require more GPU memory.
    "RANDOM_STATE": 42,  # A fixed seed for random processes to ensure that results are reproducible each time the code is run.
    "VERBOSE": True,  # If True, BERTopic will print progress updates during the analysis.
    "CALCULATE_PROBABILITIES": True,  # If True, calculates the probability of each sentence belonging to each topic. Useful for analysis but can increase computation time.

    # --- UMAP (Dimensionality Reduction) Settings ---
    # UMAP reduces the high-dimensional embedding space to a lower dimension, making clustering feasible and effective.
    "UMAP_N_NEIGHBORS": 20,  # Controls how UMAP balances local versus global structure in the data. Higher values focus more on the overall structure.
    "UMAP_N_COMPONENTS": 5,  # The dimension of the data after reduction. 5 is a common default for BERTopic.
    "UMAP_MIN_DIST": 0.0,  # Controls how tightly UMAP packs points together. Lower values create more distinct and dense clusters.
    "UMAP_METRIC": 'cosine',  # The distance metric used for the embedding vectors. Cosine similarity is standard for text data.

    # --- HDBSCAN (Clustering) Settings ---
    # HDBSCAN is a density-based algorithm that identifies clusters of sentences (i.e., topics) in the reduced embedding space.
    "HDBSCAN_MIN_CLUSTER_SIZE": 10,  # The minimum number of sentences required to form a distinct topic. This is a critical parameter to control the number of topics.
    "HDBSCAN_METRIC": 'euclidean',  # The distance metric used on the UMAP-reduced data.
    "HDBSCAN_CLUSTER_SELECTION_METHOD": 'eom',  # 'eom' (Excess of Mass) is a robust algorithm for selecting the most stable clusters.

    # --- Topic Reduction Settings ---
    "AUTO_REDUCE_TOPICS": True,  # If True, automatically merges similar topics to create a more consolidated and interpretable set.
    "TOPIC_REDUCTION_NR_TOPICS": 'auto',  # Can be 'auto' for automatic reduction or a specific integer (e.g., 20) to define the final number of topics.

    # --- Post-processing to Reduce Outliers ---
    "REDUCE_OUTLIERS_POST_PROCESSING": True, # If True, attempts to assign outlier sentences (those not assigned to any topic) to the nearest topic cluster.
    "OUTLIER_REDUCTION_STRATEGY": "c-tf-idf", # The method for reassigning outliers. "c-tf-idf" is a robust choice that considers topic keyword relevance.

    # --- CountVectorizer (Topic Representation) Settings ---
    # This module extracts candidate keywords for each topic based on word frequency.
    "CV_STOP_WORDS": "english",  # Removes common, non-informative English words (e.g., "the", "a", "is").
    "CV_NGRAM_RANGE": (1, 2),  # Considers single words (unigrams) and two-word phrases (bigrams) as potential keywords.
    "CV_MIN_DF": 5,  # Minimum document frequency for a word or phrase to be kept as a candidate keyword. BERTopic fits the vectorizer on one merged document per topic, so the threshold is counted across topics rather than individual sentences; lower it if only a few topics are found.
    "CV_MAX_FEATURES": None,  # Maximum number of keywords to consider across all topics. `None` means no limit.

    # --- Representation Model Settings ---
    # These models refine the keywords selected by CountVectorizer to create better, more coherent topic labels.
    "REPRESENTATION_MODELS": [
        KeyBERTInspired(),  # Uses a BERT-based model to find keywords that are highly relevant to the topic's documents.
        MaximalMarginalRelevance(diversity=0.3),  # Diversifies the selected keywords to avoid redundancy and improve interpretability.
    ],

    # --- Output File Names ---
    "OUTPUT_CSV_TOPICS": "sentence_topics.csv",
    "OUTPUT_CSV_SUMMARY": "topic_summary_table.csv",
    "OUTPUT_MODEL_PATH": "bertopic_model",
    "OUTPUT_PLOT_TOPIC_BARCHART": "topic_frequencies_barchart_interactive.html",
    "OUTPUT_PLOT_TOPICS_2D": "topics_2d_visualization.html",
    "OUTPUT_PLOT_TOPIC_HIERARCHY": "topic_hierarchy_visualization.html",
    "OUTPUT_PLOT_KEYWORDS_PREFIX": "topic_keywords_id",
    "OUTPUT_CSV_OUTLIERS": "outlier_sentences_before_reduction.csv",
    "OUTPUT_CSV_HIERARCHY": "topic_hierarchy_tree.csv",

    # --- Plotting Settings ---
    "PLOT_TOP_N_TOPICS_KEYWORDS": 10, # Generate individual keyword plots for the top N most frequent topics.
    "PLOT_NUM_KEYWORDS_PER_TOPIC": 10, # Number of keywords to display in each individual topic plot.
    "CUSTOM_LABEL_NUM_KEYWORDS": 3, # Number of keywords to use when generating the short custom topic labels (e.g., "Topic 1: word1, word2, word3").
}

print("✅ Cell 2 complete: Settings are configured and ready.")

In [None]:
# @title 3. Upload RIS Files and Run Analysis
# @markdown This cell defines every function in the pipeline and then runs it. <br> It takes the raw `.ris` files through to a trained topic model with its visualizations and data files.
# @markdown 1.  **Run the cell.**
# @markdown 2.  A file upload button will appear. **Select and upload your `.ris` file(s).**
# @markdown 3.  The analysis will start automatically after the upload is complete. The process may take several minutes depending on the data size and whether a GPU is used.

# --- Function Definitions ---

def clean_abstract(text):
    """
    Cleans a text string by removing extra whitespace and stopwords.
    Args:
        text (str): The input string (abstract).
    Returns:
        str: The cleaned text.
    """
    nltk_stopwords = set(stopwords.words('english'))
    text = re.sub(r'\s+', ' ', text)  # Collapse multiple whitespace characters into one.
    tokens = text.split()
    # Remove stopwords
    tokens = [token for token in tokens if token.lower() not in nltk_stopwords]
    cleaned_text = ' '.join(tokens)
    return cleaned_text.strip()

def extract_abstracts_from_ris(file_content, filename):
    """
    Parses the content of a .ris file to extract abstracts.
    Args:
        file_content (str): The decoded content of the .ris file.
        filename (str): The name of the file for logging purposes.
    Returns:
        list: A list of abstract strings.
    """
    try:
        # rispy.loads is used as we are passing the file content directly.
        entries = rispy.loads(file_content)
    except Exception as e:
        print(f"Could not parse file {filename}: {e}")
        return []
    abstracts = [entry.get('abstract', '') for entry in entries if 'abstract' in entry]
    return abstracts

def collect_all_ris_sentences(uploaded_files):
    """
    Processes all uploaded .ris files to extract and prepare sentences for analysis.
    Args:
        uploaded_files (dict): A dictionary from `files.upload()`, where keys are filenames and values are file content.
    Returns:
        tuple: A tuple containing (all_sentences, original_abstracts_map).
               Returns (None, None) if no sentences are extracted.
    """
    all_sentences = []
    original_abstracts_map = []
    abstract_count = 0
    for filename, content in uploaded_files.items():
        if filename.lower().endswith(".ris"):
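            # The content from files.upload() is in bytes, so it must be decoded.
            # 'utf-8-sig' also strips a leading byte-order mark (BOM), which can otherwise prevent the first RIS tag from being recognised.
            file_content_str = content.decode('utf-8-sig', errors='ignore')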
            abstracts = extract_abstracts_from_ris(file_content_str, filename)
            print(f"Processing {len(abstracts)} abstracts from {filename}...")
            for abstract in abstracts:
                if not abstract:
                    continue
                abstract_count += 1
                cleaned_abstract = clean_abstract(abstract)
                # Tokenize the cleaned abstract into individual sentences.
                sentences_from_abstract = sent_tokenize(cleaned_abstract)
                all_sentences.extend(sentences_from_abstract)
                # Map each sentence back to its parent abstract.
                original_abstracts_map.extend([cleaned_abstract] * len(sentences_from_abstract))

    if not all_sentences:
        return None, None

    print(f"\nTotal abstracts processed: {abstract_count}")
    print(f"Total sentences extracted: {len(all_sentences)}")
    return all_sentences, original_abstracts_map

def run_topic_modelling(documents, original_abstracts):
    """
    Executes the entire BERTopic modeling pipeline based on the configured SETTINGS.
    Args:
        documents (list): A list of sentences to model.
        original_abstracts (list): A list mapping each sentence back to its original abstract.
    Returns:
        tuple: A tuple containing the trained (topic_model, topics, probabilities).
    """
    # Step 1: Initialize Embedding Model and select device (GPU/CPU)
    print(f"Loading embedding model ({SETTINGS['EMBEDDING_MODEL_NAME']})...")
    device = SETTINGS['DEVICE']
    if device is None:
        device = "cuda" if torch.cuda.is_available() else "cpu"
        print(f"Auto-detecting device: Using {device.upper()}")
    else:
        print(f"Using specified device: {device.upper()}")
    embedding_model = SentenceTransformer(SETTINGS['EMBEDDING_MODEL_NAME'], device=device)

    # Step 2: Create Embeddings
    print(f"Encoding {len(documents)} sentences... (This may take a while)")
    embeddings = embedding_model.encode(
        documents, show_progress_bar=SETTINGS['VERBOSE'], batch_size=SETTINGS['EMBEDDING_BATCH_SIZE']
    )

    # Step 3: Configure BERTopic Components from SETTINGS
    print("\nConfiguring UMAP for dimensionality reduction...")
    umap_model = umap.UMAP(
        n_neighbors=SETTINGS['UMAP_N_NEIGHBORS'],
        n_components=SETTINGS['UMAP_N_COMPONENTS'],
        min_dist=SETTINGS['UMAP_MIN_DIST'],
        metric=SETTINGS['UMAP_METRIC'],
        random_state=SETTINGS['RANDOM_STATE'],
    )
    print("Configuring HDBSCAN for clustering...")
    hdbscan_model = hdbscan.HDBSCAN(
        min_cluster_size=SETTINGS['HDBSCAN_MIN_CLUSTER_SIZE'],
        metric=SETTINGS['HDBSCAN_METRIC'],
        cluster_selection_method=SETTINGS['HDBSCAN_CLUSTER_SELECTION_METHOD'],
        prediction_data=True,
    )
    print("Configuring CountVectorizer for keyword extraction...")
    vectorizer_model = CountVectorizer(
        stop_words=SETTINGS['CV_STOP_WORDS'],
        ngram_range=SETTINGS['CV_NGRAM_RANGE'],
        min_df=SETTINGS['CV_MIN_DF'],
        max_features=SETTINGS['CV_MAX_FEATURES'],
    )
    print("Configuring Representation Models for topic labeling...")
    representation_models = SETTINGS['REPRESENTATION_MODELS']

    # Step 4: Initialize and Run BERTopic
    print("\nInitializing BERTopic model...")
    topic_model = BERTopic(
        language="english", embedding_model=embedding_model, umap_model=umap_model, hdbscan_model=hdbscan_model,
        vectorizer_model=vectorizer_model, representation_model=representation_models,
        calculate_probabilities=SETTINGS['CALCULATE_PROBABILITIES'], verbose=SETTINGS['VERBOSE']
    )
    print("Running topic modeling... (This is the main computational step)")
    topics, probs = topic_model.fit_transform(documents, embeddings)
    # Count real topics only; the outlier topic (-1) is excluded and may be absent altogether.
    initial_topic_count = int((topic_model.get_topic_info()["Topic"] != -1).sum())
    print(f"Found {initial_topic_count} initial topics (before any reduction).")

    # Step 5: Optional Topic Reduction
    if SETTINGS["AUTO_REDUCE_TOPICS"]:
        print(f"\nReducing number of topics with nr_topics='{SETTINGS['TOPIC_REDUCTION_NR_TOPICS']}'...")
        topic_model.reduce_topics(documents, nr_topics=SETTINGS['TOPIC_REDUCTION_NR_TOPICS'])
        topics = topic_model.topics_
        probs = topic_model.probabilities_
        final_topic_count = int((topic_model.get_topic_info()["Topic"] != -1).sum())
        print(f"Topics successfully reduced from {initial_topic_count} to {final_topic_count}.")

    # Step 6: Save Initial Outliers for review
    initial_df = pd.DataFrame({'Sentence': documents, 'Topic': topics})
    outlier_df_before_reduction = initial_df[initial_df["Topic"] == -1]
    if not outlier_df_before_reduction.empty:
        outlier_df_before_reduction.to_csv(SETTINGS['OUTPUT_CSV_OUTLIERS'], index=False)
        print(f"\nSaved {len(outlier_df_before_reduction)} outlier sentences (before reduction) to '{SETTINGS['OUTPUT_CSV_OUTLIERS']}'")

    # Step 7: Optional Outlier Reduction
    if SETTINGS["REDUCE_OUTLIERS_POST_PROCESSING"] and (np.array(topics) == -1).any():
        print(f"\nReducing outliers using strategy: '{SETTINGS['OUTLIER_REDUCTION_STRATEGY']}'...")
        initial_outlier_count = (np.array(topics) == -1).sum()
        new_topics = topic_model.reduce_outliers(documents=documents, topics=topics, strategy=SETTINGS["OUTLIER_REDUCTION_STRATEGY"])
        topics = new_topics
        # Refresh the model so that topic sizes and keywords reflect the reassigned sentences.
        topic_model.update_topics(documents, topics=topics, vectorizer_model=vectorizer_model, representation_model=representation_models)
        final_outlier_count = (np.array(topics) == -1).sum()
        print(f"Outliers reduced. Reassigned {initial_outlier_count - final_outlier_count} sentences. Final outliers: {final_outlier_count}")

    # Step 8: Save Final Results and Model
    final_df = pd.DataFrame({
        "Sentence": documents, "Abstract": original_abstracts, "Topic": topics
    })
    final_df.to_csv(SETTINGS['OUTPUT_CSV_TOPICS'], index=False)
    print(f"\nSaved final topic assignments to '{SETTINGS['OUTPUT_CSV_TOPICS']}'")

    # Save the entire model to a directory.
    model_path = SETTINGS['OUTPUT_MODEL_PATH']
    topic_model.save(model_path, serialization="safetensors")
    print(f"Saved BERTopic model to directory '{model_path}'")
    return topic_model, topics, probs

def generate_custom_topic_labels(topic_model):
    """Generates human-readable labels for each topic and applies them to the model."""
    topic_info = topic_model.get_topic_info()
    custom_labels = {}
    num_keywords = SETTINGS['CUSTOM_LABEL_NUM_KEYWORDS']
    for _, row in topic_info.iterrows():
        topic_id = row['Topic']
        if topic_id == -1:
            custom_labels[topic_id] = "Topic -1: Outliers"
        else:
            # Get the top keywords for the topic.
            words = [word[0] for word in topic_model.get_topic(topic_id)][:num_keywords]
            custom_labels[topic_id] = f"Topic {topic_id}: {', '.join(words)}"
    topic_model.set_topic_labels(custom_labels)
    print("\nCustom topic labels generated and set for the model.")
    return custom_labels

def generate_summary_and_hierarchy_tables(topic_model, sentences):
    """Generates, prints, and saves the main topic summary and the hierarchical data."""
    # Generate hierarchical data first, if applicable.
    hier_topics_data = None
    if SETTINGS["AUTO_REDUCE_TOPICS"]:
        try:
            print("\nGenerating hierarchical topic structure...")
            hier_topics_data = topic_model.hierarchical_topics(sentences)
            # Save the hierarchy table.
            hier_topics_data.to_csv(SETTINGS['OUTPUT_CSV_HIERARCHY'], index=False)
            print(f"Saved detailed topic hierarchy tree to '{SETTINGS['OUTPUT_CSV_HIERARCHY']}'")
        except Exception as e:
            print(f"Could not generate or save topic hierarchy tree. Error: {e}")

    # Generate the main summary table.
    topic_info = topic_model.get_topic_info()
    df_summary = topic_info[["Topic", "Count", "Name"]]
    df_summary.columns = ["Topic", "Count (Sentences)", "Top Words (Label)"]
    print("\n=== Topic Summary Table ===")
    print(df_summary.to_string(index=False))
    df_summary.to_csv(SETTINGS['OUTPUT_CSV_SUMMARY'], index=False)
    print(f"\nSaved summary table to '{SETTINGS['OUTPUT_CSV_SUMMARY']}'")
    return hier_topics_data # Pass this to the plotting function.

def plot_topic_visualizations(topic_model, hier_topics_data):
    """Generates and saves all specified plot visualizations."""
    # Plot 1: Topic Frequencies Barchart
    if len(topic_model.get_topic_info()) > 1:
        print("\nGenerating topic frequency barchart...")
        try:
            fig = topic_model.visualize_barchart(custom_labels=True, top_n_topics=len(topic_model.get_topic_info()))
            fig.write_html(SETTINGS['OUTPUT_PLOT_TOPIC_BARCHART'])
            print(f"Saved topic barchart to '{SETTINGS['OUTPUT_PLOT_TOPIC_BARCHART']}'")
        except Exception as e:
            print(f"Could not generate barchart: {e}")

    # Plot 2: Individual Topic Keyword Plots
    print("\nGenerating keyword plots for top topics...")
    top_topics_df = topic_model.get_topic_info()
    top_topics_df = top_topics_df[top_topics_df.Topic != -1].head(SETTINGS['PLOT_TOP_N_TOPICS_KEYWORDS'])
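    for topic_id in top_topics_df["Topic"]:
        # visualize_barchart returns a Plotly figure that plt.savefig cannot capture, so the keyword scores are drawn with matplotlib.
        keywords = topic_model.get_topic(topic_id)[:SETTINGS['PLOT_NUM_KEYWORDS_PER_TOPIC']]
        if not keywords:
            continue
        words, scores = zip(*keywords)
        plt.figure(figsize=(8, 5))
        plt.barh(words[::-1], scores[::-1])  # Reversed so the top keyword is drawn at the top.
        plt.title(f"Topic {topic_id}")
        output_filename = f"{SETTINGS['OUTPUT_PLOT_KEYWORDS_PREFIX']}_{topic_id}.png"
        plt.savefig(output_filename, dpi=300, bbox_inches='tight')
        plt.close()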
    print(f"Saved top keyword plots with prefix '{SETTINGS['OUTPUT_PLOT_KEYWORDS_PREFIX']}'")

    # Plot 3: 2D Topic Visualization
    print("\nGenerating 2D topic visualization...")
    try:
        fig_2d = topic_model.visualize_topics()
        fig_2d.write_html(SETTINGS['OUTPUT_PLOT_TOPICS_2D'])
        print(f"Saved 2D topic visualization to '{SETTINGS['OUTPUT_PLOT_TOPICS_2D']}'")
    except Exception as e:
        print(f"Could not generate 2D topic visualization: {e}")

    # Plot 4: Hierarchical Topic Visualization
    if hier_topics_data is not None:
        print("\nGenerating hierarchical topic visualization...")
        try:
            fig_hier = topic_model.visualize_hierarchy(hierarchical_topics=hier_topics_data, custom_labels=True)
            fig_hier.write_html(SETTINGS['OUTPUT_PLOT_TOPIC_HIERARCHY'])
            print(f"Saved topic hierarchy visualization to '{SETTINGS['OUTPUT_PLOT_TOPIC_HIERARCHY']}'")
        except Exception as e:
            print(f"Could not generate topic hierarchy visualization: {e}")


# --- Main Execution Block ---
print("Please upload your .ris files using the button below.")
uploaded = files.upload()

if not uploaded:
    print("\n⚠️ No files were uploaded. Please run the cell again to upload files.")
else:
    print(f"\nSuccessfully uploaded {len(uploaded)} file(s). Starting analysis...")
    sentences, original_abstracts = collect_all_ris_sentences(uploaded)

    if not sentences:
        print("\n❌ Analysis stopped: No sentences could be extracted from the abstracts of the uploaded files.")
    else:
        # Run the core topic modeling
        topic_model, topics, probs = run_topic_modelling(sentences, original_abstracts)

        # Generate labels, tables, and visualizations
        generate_custom_topic_labels(topic_model)
        hier_data = generate_summary_and_hierarchy_tables(topic_model, sentences)
        plot_topic_visualizations(topic_model, hier_data)

        print("\n\n✅ Analysis complete. All output files have been generated.")
        print("➡️ Proceed to the final cell to download all results as a single .zip file.")

In [None]:
# @title 4. Download All Results
# @markdown Run this cell to package all generated files into a single `.zip` archive. <br> It gathers the output files and the saved model directory, compresses them, and then triggers a download to your local machine so the results outlive the Colab session.

# Define the name for the output zip file
zip_filename = "bertopic_results.zip"

# --- Gather the list of files to be zipped ---

# Start with the core output files defined in SETTINGS
files_to_zip = [
    SETTINGS["OUTPUT_CSV_TOPICS"],
    SETTINGS["OUTPUT_CSV_SUMMARY"],
    SETTINGS["OUTPUT_PLOT_TOPIC_BARCHART"],
    SETTINGS["OUTPUT_PLOT_TOPICS_2D"],
]

# Add files that are created conditionally
if SETTINGS["AUTO_REDUCE_TOPICS"]:
    files_to_zip.append(SETTINGS["OUTPUT_PLOT_TOPIC_HIERARCHY"])
    files_to_zip.append(SETTINGS["OUTPUT_CSV_HIERARCHY"])

# The outlier CSV is only created if outliers were present initially
if os.path.exists(SETTINGS["OUTPUT_CSV_OUTLIERS"]):
    files_to_zip.append(SETTINGS["OUTPUT_CSV_OUTLIERS"])

# Use glob to find all generated keyword plots that match the prefix
keyword_plots = glob.glob(f"{SETTINGS['OUTPUT_PLOT_KEYWORDS_PREFIX']}_*.png")
files_to_zip.extend(keyword_plots)

# --- Create the zip archive ---

print(f"Archiving the following files into '{zip_filename}':")
# Use a with statement to ensure the zip file is properly closed
with zipfile.ZipFile(zip_filename, 'w', zipfile.ZIP_DEFLATED) as zipf:
    # Add all individual files
    for file in files_to_zip:
        if os.path.exists(file):
            zipf.write(file)
            print(f"- Added: {file}")
        else:
            print(f"- SKIPPED (not found): {file}")

    # Special handling for the saved model directory
    model_path = SETTINGS['OUTPUT_MODEL_PATH']
    if os.path.isdir(model_path):
        print(f"- Adding model directory: {model_path}")
        # Walk through the model directory and add all its contents
        for root, dirs, filenames in os.walk(model_path):
            for filename in filenames:
                filepath = os.path.join(root, filename)
                # The arcname parameter ensures files are stored in the zip with a relative path
                arcname = os.path.relpath(filepath, start=os.path.dirname(model_path))
                zipf.write(filepath, arcname=arcname)
                print(f"  - Added model file: {arcname}")
    else:
        print(f"- SKIPPED (model directory not found): {model_path}")

print("\n✅ Archiving complete.")

# --- Trigger the download in Google Colab ---
files.download(zip_filename)