# <center> Parametric UMAP </center>

Experiment with the ParametricUMAP model to see whether it can be used to train the reduction heads of our encoder model.

In [1]:
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

#### Load the Sentence Embedding Model

In [2]:
# We will use the initial (untrained) all-mpnet-base-v2-compressed model
from reduced_encoders import MPNetCompressedModel

model_checkpoint = "cayjobla/all-mpnet-base-v2-compressed"
model = MPNetCompressedModel.from_pretrained(model_checkpoint, revision="initial").to(device)



In [3]:
model.base_model

MPNetModel(
  (embeddings): MPNetEmbeddings(
    (word_embeddings): Embedding(30527, 768, padding_idx=1)
    (position_embeddings): Embedding(514, 768, padding_idx=1)
    (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): MPNetEncoder(
    (layer): ModuleList(
      (0-11): 12 x MPNetLayer(
        (attention): MPNetAttention(
          (attn): MPNetSelfAttention(
            (q): Linear(in_features=768, out_features=768, bias=True)
            (k): Linear(in_features=768, out_features=768, bias=True)
            (v): Linear(in_features=768, out_features=768, bias=True)
            (o): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (intermediate): MPNetIntermediate(
          (dense): Linear(in_

#### Load the Toy Dataset

In [4]:
from sklearn.datasets import fetch_20newsgroups
from datasets import Dataset

newsgroups = fetch_20newsgroups()
documents = Dataset.from_dict({"text": newsgroups.data, "target": newsgroups.target})

#### Embed the Data

In [5]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)



In [6]:
# Tokenize each batch of text and compute pooled sentence embeddings
def preprocess_data(batch):
    tokenized = tokenizer(batch["text"], truncation=True, padding="max_length", return_tensors="pt")
    input_ids = tokenized["input_ids"].to(device)
    attention_mask = tokenized["attention_mask"].to(device)
    with torch.no_grad():  # inference only, so skip gradient tracking
        outputs = model.base_model(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = model.pooler(outputs[0], attention_mask)  # outputs[0]: token-level hidden states
    return {"data": pooled_output.cpu()}  # move to CPU so the dataset can store it

embedding_dataset = documents.map(
    preprocess_data,
    batched=True,
    batch_size=250,
    remove_columns=documents.column_names,
).with_format("torch")

Map:   0%|          | 0/11314 [00:00<?, ? examples/s]

#### Define the ParametricUMAP Model

In [7]:
from umap_pytorch import PUMAP

umap = PUMAP()

2024-05-09 13:12:59.495370: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


#### Train the UMAP Model

In [8]:
reduced_embeddings = umap.fit(embedding_dataset["data"].numpy()[:20])

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


Thu May  9 13:13:03 2024 Building RP forest with 5 trees
Thu May  9 13:13:09 2024 NN descent for 5 iterations
	 1  /  5
	 2  /  5
	Stopping threshold met -- exiting after 2 iterations


You are using a CUDA device ('NVIDIA RTX A6000') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2]

  | Name    | Type            | Params
--------------------------------------------
0 | encoder | default_encoder | 234 K 
--------------------------------------------
234 K     Trainable params
0         Non-trainable params
234 K     Total params
0.938     Total estimated model params size (MB)
/home/cayjobla/miniconda3/envs/reduced_encoders/lib/python3.12/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=127` in


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

### Here's where you're at...

Yesterday it looked like the changes allowing ParametricUMAP to run on a PyTorch backend had been merged; however, the issue is still marked open today. The problem goes away when the backend is switched to TensorFlow, so it appears to be an issue with the PyTorch backend implementation. For now, keep using the `umap_pytorch` port, even though it previously failed silently, possibly because of memory errors.
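
If the ParametricUMAP in question is the Keras-based one and it runs on Keras 3 (an assumption; the note above does not name the library or version), the backend is chosen through the `KERAS_BACKEND` environment variable, which has to be set before Keras is first imported. A minimal sketch of pinning it to TensorFlow:

```python
import os

# Keras 3 reads KERAS_BACKEND once, at import time, so set it before any
# import that pulls in Keras. Accepted values include "tensorflow",
# "torch", and "jax".
os.environ["KERAS_BACKEND"] = "tensorflow"

# Assumption: the Keras-based ParametricUMAP would only be imported after
# this point, e.g. in the first cell of the notebook.
```

In a notebook this would need to run in the first cell, since a kernel that has already imported Keras keeps its original backend until it is restarted.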

After some testing, the `umap_pytorch` model trains to completion on a subset of the Wikipedia dataset (1,000,000 embeddings), but the process is killed when I pass the full dataset of 6.47M vectors.
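
One possible workaround, sketched here under assumptions: fit on a random subset that is known to fit in memory, then project the full dataset in fixed-size batches. The helpers `sample_indices` and `transform_in_batches` are hypothetical names, not part of `umap_pytorch`, and `transform_fn` stands in for any callable that maps a batch of embeddings to reduced vectors (for example a fitted model's transform method, assuming it accepts a slice of the data). Whether a model fitted on a subset generalizes well to the remaining vectors has not been tested here.

```python
import random
from typing import Callable, List


def sample_indices(n_rows: int, n_samples: int, seed: int = 0) -> List[int]:
    """Return sorted, unique row indices for a random training subset."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_rows), min(n_samples, n_rows)))


def transform_in_batches(transform_fn: Callable, data, batch_size: int = 100_000) -> list:
    """Apply transform_fn to consecutive slices of data and collect the rows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    reduced = []
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        reduced.extend(transform_fn(batch))  # one output row per input row
    return reduced


# Toy stand-ins so the sketch runs without a trained model
toy_data = [[float(i), float(i) + 1.0] for i in range(10)]
toy_transform = lambda batch: [[sum(row)] for row in batch]

subset_idx = sample_indices(len(toy_data), n_samples=4)
toy_reduced = transform_in_batches(toy_transform, toy_data, batch_size=3)
```

With the real embeddings, `subset_idx` would select the rows passed to the fit step, and only one batch of inputs is handed to the model at a time during the projection step.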