# codex__datamap_20_newsgroups.ipynb
<div>
<img src="https://connoiter.com/images/tutte/king_tutte_on_colab.transparent_bg.png" align="left" width="200px"/>

This notebook, [codex__datamap_20_newsgroups.ipynb](https://github.com/Connoiter/king_tutte_datamap_codex/blob/main/by_repo/codex/codex__datamap_20_newsgroups.ipynb), is part of [the King Tutte Datamap Codex repo](https://github.com/Connoiter/king_tutte_datamap_codex). The Codex is a Jupyter Book of notebooks that run King Tutte datamap pipelines on Google Colab.
</div>
<br/>

## Pedigree

This notebook was started from scratch for the Codex project. The Toponymy code was taken from [Getting Started with Toponymy](https://toponymy.readthedocs.io/en/latest/basic_usage.html).

## What this notebook does

This notebook:
1. Sets up: pip installs the required libraries, then a Gemma model is downloaded from Hugging Face to run on Colab's GPU. Gemma is used as the topic labeler for Toponymy.
2. Downloads and reads a Parquet file from Hugging Face, [a subset of the 20 Newsgroups](https://huggingface.co/datasets/lmcinnes/20newsgroups_embedded) dataset (18.2k rows).
3. Runs samples from topic clusters through Toponymy to generate short labels for the clusters/topics.
4. Uses DataMapPlot to generate the datamap webapp (newsgroups.html).



## Set-up

### Start stopwatch

In [None]:
# Start a timer. Record wall clock time now and again
# later to determine total runtime of notebook
import time
from datetime import timedelta

start = time.time()


### Dataset

The dataset this notebook works on is a subset of the Twenty Newsgroups dataset, consisting of roughly 18k documents (18.2k rows).

On free-tier Colab with a T4 GPU, datamapping all ~18k rows takes over an hour. To speed things up for demo purposes, the cell below keeps only the first 1,000 rows. Remove the `head(1000)` line to process the full dataset.
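
Note that `head(1000)` keeps rows in file order, which may under-represent some newsgroups if the file happens to be sorted. A minimal sketch of one alternative, a seeded random sample, using a small synthetic DataFrame (`toy_df` and its contents are made up for illustration and are not the real dataset):

```python
import pandas as pd

# Synthetic stand-in for the real dataset: 20 groups x 900 posts, sorted by group.
groups = [f"group_{i:02d}" for i in range(20)]
toy_df = pd.DataFrame({
    "post": [f"post {n} in {g}" for g in groups for n in range(900)],
    "newsgroup": [g for g in groups for _ in range(900)],
})

# head() keeps file order, so on sorted data it only sees the first group(s).
head_df = toy_df.head(1000)

# sample() draws rows at random; random_state makes the draw repeatable.
# reset_index gives the sample a clean 0..n-1 index.
sample_df = toy_df.sample(n=1000, random_state=42).reset_index(drop=True)

print(head_df["newsgroup"].nunique(), "groups in head(1000)")
print(sample_df["newsgroup"].nunique(), "groups in sample(1000)")
```

Whether this matters for the real Parquet file depends on how its rows are ordered, which this notebook does not check.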

In [None]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

In [None]:
newsgroups_df = pd.read_parquet("hf://datasets/lmcinnes/20newsgroups_embedded/data/train-00000-of-00001.parquet")
newsgroups_df = newsgroups_df.head(1000)


### Various pip installs

In [None]:
!pip install -q --upgrade transformers accelerate bitsandbytes torch huggingface-hub
!pip install -q toponymy
!pip install -q datamapplot

### Auth set-up

Hugging Face auth is required to download Gemma, and the user must first visit the model card on huggingface.co and accept the license (EULA).
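
A minimal sketch of a token lookup that prefers an already-set `HF_TOKEN` environment variable and only then falls back to Colab Secrets. The helper name `resolve_hf_token` is hypothetical; `userdata.get` is the same Colab call used in the cell below. The broad `except` is a deliberate simplification so the sketch also runs outside Colab.

```python
import os


def resolve_hf_token(env=None):
    """Return a Hugging Face token, or None if no source provides one."""
    env = os.environ if env is None else env

    # 1. An already-exported HF_TOKEN wins (e.g. set by an earlier cell).
    token = env.get("HF_TOKEN")
    if token:
        return token

    # 2. Fall back to Colab Secrets; skipped quietly outside Colab.
    try:
        from google.colab import userdata
        token = userdata.get("HF_TOKEN")
    except Exception:
        # Not running on Colab, or the secret is missing / access not granted.
        token = None

    if token:
        # Make the token visible to libraries that read HF_TOKEN.
        env["HF_TOKEN"] = token
    return token


print("token found" if resolve_hf_token() else "no token; run `hf auth login`")
```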

In [None]:
# TODO: why this way and not just secret => os.environ[]
#!hf auth login

In [None]:
import os
from google.colab import userdata

# Retrieve the Hugging Face token from Colab Secrets
# Make sure you have a secret named 'HF_TOKEN' containing your token
hf_token = userdata.get('HF_TOKEN')

if hf_token:
    print("This notebook has found the existing HF_TOKEN secret")
    # Set the token as an environment variable
    os.environ['HF_TOKEN'] = hf_token
    print("Hugging Face token loaded from secrets and set as environment variable, HF_TOKEN.")
else:
    print("HF_TOKEN secret is not set, or access to it has not been granted.")
    !hf auth login
    # hf auth login should set HF_TOKEN environment variable for transformers to find while auth'ing model download


In [None]:
# bitsandbytes provides 8-bit and 4-bit quantization on CUDA; already pip'd above
#!pip install --upgrade --quiet bitsandbytes

TODO:JFT Not sure why this isn't a code cell. The code below recommends 4-bit quantization on Colab's T4, which would be worth doing, but the later cells do not use it.

```python
# Load Model and Tokenizer with Quantization
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Model ID for the efficient 2B-equivalent Gemma 3n Instruct model
# Note: You must have accepted the license on the Hugging Face model card for this to work.
model_id = "google/gemma-3n-E2B-it"

# Configuration for 4-bit quantization (recommended for T4 GPU)
# TODO:JFT, yeah, T4. Do this elsewhere, say, with gemma:1B
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # Normal Floats 4-bit
    bnb_4bit_compute_dtype=torch.bfloat16, # Compute in BFloat16 for better precision
    bnb_4bit_use_double_quant=True, # Optional: double quantization for more memory savings
)

# Load the Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the Model
# device_map="auto" automatically manages memory and places the model on the GPU.
# The model will use the HF_TOKEN stored in Colab Secrets for authentication.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto"
)


print(f"Model {model_id} loaded successfully in 4-bit mode!")
```

### Load Gemma model from Hugging Face

Toponymy needs to be configured with an LLM to summarize document descriptions.

- [x] Take #1: `google/gemma-3n-E2B-it`, ran out of memory after over 2h on a T4
- [ ] Take #2: `google/gemma-3-1b-it`, ~2GB, worked but took almost 2h
- [ ] Take #3: `google/gemma-3n-E2B-it`
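
The ~2GB figure for the 1B model is consistent with a back-of-the-envelope weight estimate: parameters × bits per parameter ÷ 8. A rough sketch that counts weights only (activations, KV cache and framework overhead are ignored, and the parameter count is a round-number assumption, not an exact model size):

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in GB (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Round-number parameter count, for illustration only.
one_billion = 1e9

for label, bits in [("bfloat16", 16), ("int8", 8), ("4-bit (nf4)", 4)]:
    print(f"1B params @ {label}: {weight_memory_gb(one_billion, bits):.1f} GB")
```

By the same arithmetic, 4-bit quantization cuts the weight footprint to about a quarter of bfloat16, which is why it is often suggested for memory-limited GPUs.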



In [None]:
import torch
from toponymy.llm_wrappers import HuggingFaceNamer

# Initialize with a Hugging Face model
# llm will be used by Toponymy as llm_wrapper=llm
llm = HuggingFaceNamer(
    model="google/gemma-3-1b-it", # out of memory: "google/gemma-3n-E2B-it", OutOfMemoryError: CUDA out of memory.
    llm_specific_instructions="Generate clear, descriptive topic names",
    device_map="auto",  # Automatically map to available devices
    dtype=torch.bfloat16  # Use bfloat16 precision for efficiency and better precision
)

## Toponymy basics

Following example code from: [Getting Started with Toponymy](https://toponymy.readthedocs.io/en/latest/basic_usage.html).

In [None]:
# newsgroups_df.head()
# = index,post,newsgroup,embedding,map
# is int,str,str,long-vec,2-vec
display(newsgroups_df)

In [None]:
if True:
    # TODO: next up, make embeddings and run UMAP. Never did that; started with embeds & umaps already computed
    from sentence_transformers import SentenceTransformer
    from umap import UMAP

    embedding_model = SentenceTransformer("all-mpnet-base-v2")
    # Pass a plain list of strings; encode() is documented for str / list[str] input
    embedding_vectors = embedding_model.encode(newsgroups_df["post"].tolist(), show_progress_bar=True)
    clusterable_vectors = UMAP(metric="cosine").fit_transform(embedding_vectors)
else:
    # the embeddings and UMAP (x,y) coord already exist, no need to compute those:
    embedding_vectors = np.stack(newsgroups_df["embedding"].values)
    clusterable_vectors = np.stack(newsgroups_df["map"].values)
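
The `else` branch relies on `np.stack` to turn a column holding one array per row into a single 2-D matrix, which is the shape passed to Toponymy and DataMapPlot in the cells below. A small sketch with made-up values (the 3-dimensional embeddings are toy data; real embeddings have hundreds of dimensions):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the dataset's layout: one array per row in each vector column.
toy_df = pd.DataFrame({
    "post": ["first post", "second post"],
    "embedding": [np.array([0.1, 0.2, 0.3]), np.array([0.4, 0.5, 0.6])],
    "map": [np.array([1.0, 2.0]), np.array([3.0, 4.0])],
})

# .values is a 1-D object array whose elements are themselves arrays.
print(toy_df["embedding"].values.shape)

# np.stack joins them along a new first axis into (n_rows, n_dims).
toy_embedding_vectors = np.stack(toy_df["embedding"].values)
toy_clusterable_vectors = np.stack(toy_df["map"].values)
print(toy_embedding_vectors.shape, toy_clusterable_vectors.shape)
```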

In [None]:
from toponymy import Toponymy, ToponymyClusterer, KeyphraseBuilder
from sentence_transformers import SentenceTransformer

# Use the same model that produced embedding_vectors above (all-mpnet-base-v2),
# on the assumption that Toponymy compares the vectors it embeds itself
# against the document embeddings, so both need the same dimensionality.
embedding_model = SentenceTransformer("all-mpnet-base-v2")


In [None]:
topic_model = Toponymy(
    llm_wrapper=llm,
    text_embedding_model=embedding_model,
    clusterer=ToponymyClusterer(min_clusters=4, verbose=True),
    keyphrase_builder=KeyphraseBuilder(ngram_range=(1,6), max_features=15_000, verbose=True),
    object_description="newsgroup posts",
    corpus_description="20-newsgroups dataset",
    exemplar_delimiters=["<EXAMPLE_POST>\n","\n</EXAMPLE_POST>\n\n"],
)

In [None]:
%%time
topic_model.fit(
    newsgroups_df["post"].str.strip().values,
    embedding_vectors=embedding_vectors,
    # NO show_progress_bar=topic_model.show_progress_bars, # topic_model.clusterer.fit()
    clusterable_vectors=clusterable_vectors
)

In [None]:
elapsed = time.time() - start
print(f"Total runtime: {timedelta(seconds=elapsed)}")

In [None]:
topic_model.topic_names_[-1]

In [None]:
newsgroups_df.newsgroup.unique().tolist()

### Topic treeview

Below is the `topic_tree_`, a treeview of the topics as labeled by Toponymy. The nodes in the tree can expand and collapse to show child topics.
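
For readers viewing a static export where the widget does not render, the idea of a topic hierarchy can be sketched with plain nested data. The structure and topic names below are hypothetical and do not reflect the actual type or contents of `topic_tree_`:

```python
def render_tree(node, depth=0):
    """Return the tree as a list of indented lines, one per topic."""
    lines = ["  " * depth + node["name"]]
    for child in node.get("children", []):
        lines.extend(render_tree(child, depth + 1))
    return lines


# Hypothetical topic names, for illustration only.
toy_tree = {
    "name": "Computing",
    "children": [
        {"name": "Graphics", "children": [{"name": "Image file formats"}]},
        {"name": "Operating systems"},
    ],
}

print("\n".join(render_tree(toy_tree)))
```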

In [None]:
topic_model.topic_tree_

## DataMapPlot

In [None]:
import datamapplot
import datamapplot.selection_handlers

In [None]:
plot = datamapplot.create_interactive_plot(
    clusterable_vectors,
    *topic_model.topic_name_vectors_,
    title="20-Newsgroups",
    sub_title="A data map of 20-newsgroups using all-mpnet-base-v2, Toponymy, Gemma and UMAP",
    hover_text=newsgroups_df["post"].values,
    font_family="Cormorant SC",
    marker_size_array=np.asarray([np.log(len(x)) for x in newsgroups_df["post"].values]),
    colormaps={"newsgroup": pd.Series(newsgroups_df["newsgroup"].values)},
    cluster_layer_colormaps=True,
    enable_search=True,
    selection_handler=datamapplot.selection_handlers.WordCloud(height=300),
)
plot
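
The `marker_size_array` above scales each point by the log of its post length, which compresses the gap between short and very long posts. One caveat worth sketching: `np.log(0)` is `-inf`, so an empty post would get a negative-infinite size. `np.log1p` (log of 1 + x) is one possible guard. This is a suggestion under that assumption, not what the cell above does:

```python
import warnings

import numpy as np

# Toy data, including an empty post.
posts = ["", "short post", "x" * 10_000]
lengths = np.asarray([len(p) for p in posts])

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence NumPy's divide-by-zero warning
    plain = np.log(lengths)          # first entry is -inf

safe = np.log1p(lengths)             # log(1 + n): 0.0 for an empty post

print(plain)
print(safe)
```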

In [None]:
from google.colab import files

plot.save('newsgroups.html')
files.download('newsgroups.html')