# Latent Assembly with CLIP + Qdrant — Part A (Student Exercise)

### LLM-guided debugging: build and save a reusable embedding index


**Prerequisite:** You finished the **Qdrant indexing & search** notebook (CLIP embeddings + in-memory Qdrant).

---

## What you will build (Part A)

A small **system-level pipeline checkpoint** that:

1. Loads an image dataset from Hugging Face (you may need to try multiple options)
2. Investigates the dataset schema and finds a usable **caption/text field**
3. Extracts:
   - `images` (PIL images)
   - `payloads` (metadata with `filename` + non-empty `caption` for at least some items)
4. Computes **CLIP image embeddings**
5. Creates and populates an in-memory **Qdrant** collection
6. Saves a reusable artifact package to a **private Hugging Face dataset repo**:
   - `img_vecs.npy`
   - `payloads.json`
   - `meta.json` (dataset name + caption source + embedding info)

You will **not** train a new model.  
You will **not** generate images or prose with an LLM.  
The goal is a **reusable index checkpoint** for Part B.

---

## Rules (important)

- Some cells are **broken on purpose**. This is part of the assignment.
- Do not “fix by guessing”. **Inspect first**:
  - `type(x)`
  - `x.keys()` (if available)
  - tensor shapes (e.g., `.shape`)
- Your payloads must include **real caption-like text** for at least some items,
  otherwise Part B cannot do evidence aggregation.
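
The inspect-first rule can be wrapped in a tiny helper. This is a minimal sketch: `describe` and `FakeTensor` are hypothetical names that are not part of the notebook, and the helper only reports facts about an object without modifying it.

```python
from typing import Any


def describe(x: Any, name: str = "x") -> str:
    """Return a one-line summary: type, keys (if any), shape (if any)."""
    parts = [f"{name}: type={type(x).__name__}"]
    # Dict-like objects (including many model output bundles) expose .keys()
    if hasattr(x, "keys"):
        parts.append(f"keys={list(x.keys())}")
    # Tensors and arrays expose .shape
    if hasattr(x, "shape"):
        parts.append(f"shape={tuple(x.shape)}")
    return ", ".join(parts)


class FakeTensor:
    """Stand-in for a tensor so the example runs without torch."""

    shape = (1, 512)


print(describe({"image": None, "answer": ["a caption"]}, "sample"))
print(describe(FakeTensor(), "feats"))
```

Printing one such line per suspicious variable is usually enough to phrase a precise question before asking an LLM.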

---

## What to submit (Part A)

- Your Hugging Face dataset repo id: `username/repo-name`
- Confirm your repo contains:

```
dataset/
├── img_vecs.npy
├── payloads.json
└── meta.json
```

- Short reflection (3–6 lines):
  - What broke first: dataset schema, model outputs, or Qdrant API?
  - What did you print to understand the problem before using an LLM?
  - What is your `CAPTION_SOURCE` path?

---

*Part B will use this saved index to perform multi-intent retrieval and assemble a marketplace listing.*


---

# 0) Setup

Run the next cells. They install/import libraries and define helper functions.

Notes:
- ~~We avoid `pip install` directly and use `ensure_package_installed(...)`.~~
- ~~If something is already installed in your Colab runtime, it will just import it.~~
- I am using VS Code and `uv` for development, so `ensure_package_installed` is not needed.
- I also moved the `import` statements to the top (there were a few duplicates) and used `ruff` to sort them.

## How to use this notebook (quick)

Some cells are **broken on purpose**.

When something breaks:
1. Read the error.
2. Print small facts: `type(x)`, `x.keys()` (if possible), tensor shapes.
3. Ask an LLM for help after you know what to ask.
4. Fix, then add a tiny check (`assert` / print) to prove it works.

Keep changes small and test often.


In [1]:
import importlib
import json
import os
import subprocess
import sys

from io import BytesIO
from pathlib import Path
from typing import Any

import numpy as np
import requests
import torch
from datasets import load_dataset
from huggingface_hub import create_repo, login, upload_folder
from PIL import Image
from qdrant_client import QdrantClient
from qdrant_client.http import models as qm
from tqdm import tqdm
from transformers import CLIPModel, CLIPProcessor

In [2]:

# --- Utility: install + import if missing (Colab-friendly) ---
def ensure_package_installed(package_name, import_name=None):
    """
    Ensures a Python package is installed and imported.

    Args:
        package_name (str): Name used in pip install (e.g., 'torchinfo').
        import_name (str): Module name used in import (e.g., 'torchinfo', 'sklearn').
        Defaults to package_name.

    Returns:
        module: The imported module object.
    """


    import_name = import_name or package_name

    try:
        return importlib.import_module(import_name)
    except ImportError:
        print(f"Installing '{package_name}'...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", package_name])
        return importlib.import_module(import_name)

# Core deps
# np = ensure_package_installed("numpy", "numpy")
# tqdm_mod = ensure_package_installed("tqdm", "tqdm")
# torch = ensure_package_installed("torch", "torch")
# PIL = ensure_package_installed("Pillow", "PIL")
# matplotlib = ensure_package_installed("matplotlib", "matplotlib")

# datasets = ensure_package_installed("datasets", "datasets")

# Qdrant + Transformers
# qdrant_client = ensure_package_installed("qdrant-client", "qdrant_client")
# transformers = ensure_package_installed("transformers", "transformers")

In [3]:

OK_RESPONSE = 200

def load_image(url):
    """Downloads and returns a PIL image from the given URL."""
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers, timeout=30)  # avoid hanging on a stalled download
    if response.status_code != OK_RESPONSE:
        raise ValueError(f"Failed to download image: {url}")
    try:
        return Image.open(BytesIO(response.content)).convert("RGB")
    except Exception as e:
        raise ValueError(f"Could not open image from {url}. Error: {e}") from e

In [4]:

# --- Device setup ---
device = (
    "cuda" if torch.cuda.is_available() else
    "mps" if torch.backends.mps.is_available() else
    "cpu"
)

device

'mps'

In [5]:
# --- Load CLIP (same idea as previous notebook) ---
MODEL_NAME = "openai/clip-vit-base-patch32"

processor = CLIPProcessor.from_pretrained(MODEL_NAME)
model = CLIPModel.from_pretrained(MODEL_NAME).to(device)
model.eval()

# CLIP embedding dimension
EMBED_DIM = model.config.projection_dim
EMBED_DIM

The image processor of type `CLIPImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. 


Loading weights:   0%|          | 0/398 [00:00<?, ?it/s]

CLIPModel LOAD REPORT from: openai/clip-vit-base-patch32
Key                                  | Status     |  | 
-------------------------------------+------------+--+-
text_model.embeddings.position_ids   | UNEXPECTED |  | 
vision_model.embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.


512

---
# 1) Load a small image dataset

In this exercise we need **images + short text metadata** (captions), because later we will
combine evidence across many retrieved results.

To keep this notebook self-contained, the next cell loads **COCO images with captions** from Hugging Face.

What you will get:
- `images`: list of PIL images
- `payloads`: list of dicts with:
  - `"filename"`
  - `"caption"`

If you already have a dataset from the previous notebook and want to reuse it, you can skip the loader cell
and replace it with your own dataset code (but make sure you still produce `images` and `payloads`).

In [6]:

MAX_IMAGES = 300

# ------------------------------------------------------------
# Dataset loading (research step)
# ------------------------------------------------------------
# Your goal:
#   Load a dataset that contains:
#     - images (or image URLs)
#     - some human-readable text describing the image
#
# You are NOT guaranteed that:
#   - the dataset loads
#   - the split exists
#   - the text field is obvious
#
# This is intentional.

# Try loading ONE dataset at a time.
# If something fails, read the error carefully.

# --- YOUR CODE HERE ---
# Examples of dataset names you *might* try:
#   "HuggingFaceM4/COCO"
#   "lmms-lab/COCO-Caption2017"
#   "detection-datasets/coco"
#
# Do NOT copy-paste all of them at once.
# Try, observe, then decide what to do next.


In [7]:
# ============================================================
# Dataset bookkeeping (required for Part B)
# ============================================================
# IMPORTANT:
# Update these TWO strings based on the dataset you actually loaded
# and the caption path you actually extracted.
#
# Original TODO placeholders:
# DATASET_NAME = "TODO: put the dataset name you used (e.g., lmms-lab/COCO-Caption2017)"
# CAPTION_SOURCE = "TODO: describe the caption path you used (e.g., sample['sentences'][0]['raw'])"

# SOLUTION (filled in):
DATASET_NAME = "lmms-lab/COCO-Caption2017"
CAPTION_SOURCE = "sample['answer'][0]"

print("Dataset:", DATASET_NAME)
print("Caption source:", CAPTION_SOURCE)

ds = load_dataset(DATASET_NAME, split="val", streaming=True)
print("Dataset type:", type(ds))
print("Streaming dataset loaded successfully.")

Dataset: lmms-lab/COCO-Caption2017
Caption source: sample['answer'][0]
Dataset type: <class 'datasets.iterable_dataset.IterableDataset'>
Streaming dataset loaded successfully.


In [8]:
# ------------------------------------------------------------
# Dataset investigation (do not skip)
# ------------------------------------------------------------

# Inspect ONE element from the dataset.
# Do not guess the structure — print it.

assert hasattr(ds, "__iter__"), "Dataset is not iterable"


In [9]:
# Teaching trap:
# This works for many "normal" datasets.
# It often FAILS for a certain (very useful) type of dataset.
# This failure is intentional in this exercise.

# EXPECTED CRASH — uncomment to demonstrate:
# print("Trying len(ds)...")
# print("Dataset length:", len(ds))   # TypeError: object of type 'IterableDataset' has no len()

# LESSON: Streaming datasets are generators. They yield one sample at a time
# over the network. There is no way to know the total count without consuming
# the entire stream. Use next(iter(ds)) to peek at the structure instead.
print("Skipped len(ds) — streaming datasets have no __len__.")

Skipped len(ds) — streaming datasets have no __len__.
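
The lesson above can be reproduced without any network access. The sketch below uses a plain Python generator as a stand-in for a streaming dataset (`fake_stream` is a hypothetical name, not a `datasets` API) and peeks at the first few items with `itertools.islice` instead of asking for a length.

```python
from itertools import islice


def fake_stream():
    """Stand-in for a streaming dataset: yields samples lazily, has no length."""
    for i in range(1_000_000):
        yield {"file_name": f"{i:012d}.jpg", "answer": [f"caption {i}"]}


stream = fake_stream()

# len() fails because a generator does not know how many items it will yield
try:
    len(stream)
    has_len = True
except TypeError:
    has_len = False

# Peek at a few samples instead of asking for the total count
first_three = list(islice(fake_stream(), 3))

print("has len:", has_len)
print("keys:", list(first_three[0].keys()))
```

One difference from this toy: a generator is exhausted once consumed, whereas in this notebook each `iter(ds)` on the streaming dataset starts again from the first sample.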


In [10]:
# If the previous cell crashed:
# Your job is to explain WHY it crashed, in one sentence,
# and then replace len(ds) with a different sanity check.

# Bare hints:
# - What is a "streaming" dataset conceptually?
# - What does it mean if something is iterable but has no length?
# - Try getting ONE sample and inspecting its keys.

# Constraint:
# - Do NOT change how the dataset is loaded.
# - Fix the sanity check, not the dataset.

# Original TODO:
# raise NotImplementedError("TODO: write a sanity check that works for streaming datasets")

# SOLUTION (filled in):
# Streaming datasets have no len(). Instead, peek at one sample
# to verify the dataset is iterable and non-empty.

sample = next(iter(ds))
assert sample is not None, "Dataset yielded None — something is wrong"
print("Sanity check passed: got one sample")
print("Sample keys:", list(sample.keys()))

Sanity check passed: got one sample
Sample keys: ['question_id', 'image', 'question', 'answer', 'id', 'license', 'file_name', 'coco_url', 'height', 'width', 'date_captured']


In [11]:
# After passing the sanity tests, check the dataset

sample = next(iter(ds))

print("Type:", type(sample))
print("Keys:", list(sample.keys()))

# TODO:
# 1. Pick ONE key that looks like it might contain text
# 2. Print its type
# 3. Print a short preview (first ~200 chars)

# --- YOUR CODE HERE ---

# SOLUTION (filled in):
print()

# Investigate text-like fields
for key in sample.keys():
    val = sample[key]
    print(f"  {key}: type={type(val).__name__}, preview={str(val)[:200]}")
    print()

# The caption field is "answer" — it's a LIST of strings (one per annotator).
# We'll use answer[0] as our caption.
print("Caption field path: sample['answer'][0]")
print("Example caption:", sample["answer"][0])

Type: <class 'dict'>
Keys: ['question_id', 'image', 'question', 'answer', 'id', 'license', 'file_name', 'coco_url', 'height', 'width', 'date_captured']

  question_id: type=str, preview=000000179765.jpg

  image: type=JpegImageFile, preview=<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=640x480 at 0x13ABFC190>

  question: type=str, preview=Please carefully observe the image and come up with a caption for the image.

  answer: type=list, preview=['A black Honda motorcycle parked in front of a garage.', 'A Honda motorcycle parked in a grass driveway', 'A black Honda motorcycle with a dark burgundy seat.', 'Ma motorcycle parked on the gravel in

  id: type=int, preview=38

  license: type=int, preview=3

  file_name: type=str, preview=000000179765.jpg

  coco_url: type=str, preview=http://images.cocodataset.org/val2017/000000179765.jpg

  height: type=int, preview=480

  width: type=int, preview=640

  date_captured: type=str, preview=2013-11-15 14:02:51

Caption field path: samp

In [12]:
# ============================================================
# TODO: Investigate -> then extract images + payloads
# ============================================================
# Goal:
#   Build two lists:
#     images   : List[PIL.Image.Image]
#     payloads : List[dict] with keys {"filename", "caption"}
#
# IMPORTANT MENTAL MODEL:
#   We do NOT care how many dataset items we *loop over*.
#   We care how many valid items we *successfully collect*.
#
# That is why tqdm should count successful items,
# not raw dataset iterations.
#
# ------------------------------------------------------------
# Step 1: Investigate ONE sample (do not skip)
# ------------------------------------------------------------
# TODO:
# Pick 2–3 keys that *might* contain text.
# For each chosen key, print:
#   - key name
#   - type(value)
#   - short preview (first ~200 chars)
#
# Do NOT guess blindly. Inspect first.
#
# --- YOUR CODE HERE ---
#
# ------------------------------------------------------------
# Step 2: Extract a dataset subset (with tqdm)
# ------------------------------------------------------------
# TODO:
# Loop over ds.
# For each sample:
#   1) Try to extract / build a PIL image
#      - If this fails, skip the sample (continue)
#   2) Try to extract ONE caption string
#      - If you truly can't find one, use "" (empty string)
#   3) Build a filename / id string
#   4) Append to images + payloads
#   5) ONLY THEN:
#        - pbar.update(1)
#   6) Stop when len(images) == MAX_IMAGES
#
# Bare hints:
# - Some samples will be unusable. That is normal.
# - tqdm.update(1) should happen ONLY on success.
# - Do NOT assume captions are flat strings.
#
# --- YOUR CODE HERE ---

# SOLUTION (filled in):
images: list[Image.Image] = []
payloads: list[dict[str, Any]] = []

pbar = tqdm(total=MAX_IMAGES, desc="Collecting usable samples")

for sample in ds:
    try:
        # Extract image — the dataset provides PIL images directly
        img = sample["image"]
        if not isinstance(img, Image.Image):
            continue
        img = img.convert("RGB")

        # Extract caption from nested list (field is "answer", not "sentences_raw")
        captions = sample.get("answer", [])
        caption = captions[0] if captions else ""

        # Build filename from the sample
        filename = sample.get("file_name", f"img_{len(images):05d}.jpg")

        images.append(img)
        payloads.append({"filename": filename, "caption": caption})
        pbar.update(1)

    except Exception:
        continue

    if len(images) >= MAX_IMAGES:
        break

pbar.close()

print(f"Collected {len(images)} images with {sum(1 for p in payloads if p['caption'])} captions")

Collecting usable samples: 100%|██████████| 300/300 [00:01<00:00, 188.10it/s]

Collected 300 images with 300 captions
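
The "count successes, not iterations" idea from the cell above, reduced to plain Python. This is a sketch under assumptions: `collect_valid` and the toy `raw` samples are hypothetical, and the caption is assumed to be a list of strings under `"answer"`, as in the inspected sample.

```python
from typing import Any, Iterable


def collect_valid(samples: Iterable[dict[str, Any]], max_items: int) -> list[dict[str, str]]:
    """Collect up to max_items payloads, skipping samples without an image."""
    payloads: list[dict[str, str]] = []
    for sample in samples:
        if sample.get("image") is None:
            continue  # unusable sample: does not count toward max_items
        captions = sample.get("answer") or []
        caption = captions[0] if captions else ""  # captions are nested in a list
        filename = sample.get("file_name") or f"img_{len(payloads):05d}.jpg"
        payloads.append({"filename": filename, "caption": caption})
        if len(payloads) >= max_items:
            break
    return payloads


raw = [
    {"image": "img-a", "answer": ["a dog on a sofa"], "file_name": "a.jpg"},
    {"image": None, "answer": ["broken sample"], "file_name": "b.jpg"},
    {"image": "img-c", "answer": [], "file_name": "c.jpg"},
    {"image": "img-d", "answer": ["a red chair"]},
]
result = collect_valid(raw, max_items=3)
print(len(result), "collected from", len(raw), "raw samples")
```

The stop condition depends on `len(payloads)`, not on how many raw samples were read, which is the same reason `pbar.update(1)` belongs after the append.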





In [13]:
# ============================================================
# Sanity checks (do not delete)
# ============================================================
assert isinstance(images, list) and len(images) > 0, "images is empty"
assert isinstance(payloads, list) and len(payloads) == len(images), "payloads must match images length"
assert isinstance(payloads[0], dict), "payloads must be a list of dicts"

has_any_caption = any(bool((p.get("caption") or "").strip()) for p in payloads)

print("Loaded images:", len(images))
print("Example payload:", payloads[0])
print("Has any non-empty caption:", has_any_caption)

# Strict: without captions, the evidence aggregation in Part B becomes meaningless
assert has_any_caption, (
    "No captions found in payloads.\n"
    "This usually means: you extracted the wrong field, OR captions are nested.\n"
    "Go back to the investigation prints and find the correct path."
)


Loaded images: 300
Example payload: {'filename': '000000179765.jpg', 'caption': 'A black Honda motorcycle parked in front of a garage.'}
Has any non-empty caption: True


---

# 2) Compute CLIP image embeddings (scaffold provided)

We will embed images in batches, normalize vectors (important for cosine similarity),
and store embeddings as `float32` numpy arrays.

You should **read** the code and make sure you understand it.
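
Why normalization matters: for unit-length vectors the dot product equals the cosine similarity, so vectors that point in the same direction score the same regardless of their length. A small pure-Python check (the helper names here are illustrative and not from the notebook):

```python
import math


def l2_normalize_list(v: list[float], eps: float = 1e-12) -> list[float]:
    """Scale v to unit length; eps guards against division by zero."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / (norm + eps) for x in v]


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


u = [3.0, 4.0]
v = [30.0, 40.0]  # same direction as u, ten times longer
w = [4.0, -3.0]   # orthogonal to u

raw_dot = dot(u, v)
unit_dot = dot(l2_normalize_list(u), l2_normalize_list(v))

print("raw dot:", raw_dot)
print("dot after normalization:", round(unit_dot, 6))
print("cosine(u, w):", cosine(u, w))
```

The raw dot product grows with vector length, while the normalized one depends only on direction.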

## Reminder: model outputs may be bundles

If you expected a tensor but got an object/dict:
- print `type(x)`
- if possible, print `x.keys()`
- then pick the tensor field you need

This is common in Transformers.
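
One defensive pattern for the "tensor or bundle" situation is sketched below. `extract_embedding`, `FakeTensor` and `FakeBundle` are hypothetical names used only so the example runs without torch; the field name `pooler_output` is an assumption that you should confirm by printing the keys in your own environment.

```python
from dataclasses import dataclass
from typing import Any


def extract_embedding(output: Any, field: str = "pooler_output") -> Any:
    """Return `output` itself if it is tensor-like, else the named field."""
    if hasattr(output, "shape"):
        return output  # already a tensor-like object
    if isinstance(output, dict):
        return output[field]  # dict-style bundle
    if hasattr(output, field):
        return getattr(output, field)  # attribute-style bundle
    raise TypeError(f"Cannot find '{field}' on {type(output).__name__}")


@dataclass
class FakeTensor:
    shape: tuple


@dataclass
class FakeBundle:
    pooler_output: FakeTensor


tensor = FakeTensor(shape=(1, 512))
print(extract_embedding(tensor).shape)
print(extract_embedding(FakeBundle(pooler_output=tensor)).shape)
print(extract_embedding({"pooler_output": tensor}).shape)
```

Raising a `TypeError` for unknown shapes of output keeps the failure loud, which fits the inspect-first rule better than silently returning the wrong object.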


In [14]:
# Test on one image:

img = images[0]
inputs = processor(images=[img], return_tensors="pt").to(device)

# ============================================================
# Teaching moment: model outputs are often NOT just tensors
# ============================================================
# Goal: figure out what `model.get_image_features(...)` returns.
# Rule: inspect first. Don't index into it yet.

with torch.no_grad():
    tmp = model.get_image_features(**inputs)

print("Type of output:", type(tmp))
print("Has shape?", hasattr(tmp, "shape"))
print("Has keys?", hasattr(tmp, "keys"))

# TODO: choose ONE next step (not all):
# - if it has shape: print(tmp.shape)
# - if it has keys:  print(list(tmp.keys()))
# - otherwise:       print(tmp) or dir(tmp)

# --- YOUR CODE HERE ---

# SOLUTION (filled in):
# It's a bundle! Inspect the keys.
print("Keys:", list(tmp.keys()))
print("pooler_output shape:", tmp.pooler_output.shape)  # Expected: (1, 512)

Type of output: <class 'transformers.modeling_outputs.BaseModelOutputWithPooling'>
Has shape? False
Has keys? True
Keys: ['last_hidden_state', 'pooler_output']
pooler_output shape: torch.Size([1, 512])


In [15]:
# ============================================================
# TODO: Extract the actual image embedding tensor
# ============================================================
# Based on your investigation:
# - extract the tensor that represents the image embedding
# - call it `img_embed`
#
# Requirements:
# - img_embed must be a torch.Tensor
# - shape should be (B, D)

# --- YOUR CODE HERE ---

# SOLUTION (filled in):
# get_image_features() returns a bundle with pooler_output as the
# projected (B, 512) embedding tensor.

img_embed = tmp.pooler_output

assert isinstance(img_embed, torch.Tensor), "img_embed must be a torch.Tensor"
assert img_embed.ndim == 2, "Expected shape (B, D)"
print("img_embed shape:", img_embed.shape)

img_embed shape: torch.Size([1, 512])


In [16]:
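def get_features_normalized(model, inputs):
    """Return L2-normalized text embeddings from a full CLIP forward pass."""
    outputs = model(**inputs)  # full forward pass: expects text tokens and pixel values
    feats = outputs.text_embeds  # (B, D) projected CLIP text embeddings
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return feats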


In [17]:
def l2_normalize(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

@torch.no_grad()
def embed_images_clip(images_pil: list[Image.Image], batch_size: int = 32) -> np.ndarray:
    """Returns (N, D) float32 embeddings."""
    all_vecs = []
    for i in tqdm(range(0, len(images_pil), batch_size), desc="Embedding images"):
        batch = images_pil[i:i+batch_size]
        inputs = processor(images=batch, return_tensors="pt").to(device)

        tmp = model.get_image_features(**inputs)  # TRAP: might be a tensor OR a bundle
        # TODO: Extract the actual embedding tensor from `tmp`
        # Bare hint: reuse what you learned in the investigation cells above.
        # --- YOUR CODE HERE ---
        # feats = ...

        # SOLUTION (filled in):
        feats = tmp.pooler_output

        feats = feats / feats.norm(dim=-1, keepdim=True)  # normalize in torch
        all_vecs.append(feats.detach().cpu().numpy())

    X = np.vstack(all_vecs).astype(np.float32)
    X = l2_normalize(X).astype(np.float32)  # safety net: rows are already unit length
    return X

# Compute image vectors
img_vecs = embed_images_clip(images, batch_size=32)
print("img_vecs:", img_vecs.shape, img_vecs.dtype)

Embedding images: 100%|██████████| 10/10 [00:01<00:00,  9.38it/s]

img_vecs: (300, 512) float32





---

# 3) Create an in-memory Qdrant collection and upsert (review, TODO)

This should feel familiar.

### Requirements
- Use in-memory Qdrant (Colab-friendly)
- Use cosine distance
- Store payload metadata (filename, caption, etc.)

In the next cell:
1. Create a Qdrant client
2. Create/recreate a collection
3. Upsert all vectors

In [18]:
# TODO: Create Qdrant in-memory collection and then upsert vectors
# Teaching trap: this code is intentionally incomplete.
# Your job: inspect the client and fix API drift.

COLLECTION = "latent_assembly_images"
client = QdrantClient(":memory:")

# ============================================================
# TODO: Create Qdrant collection (intentional trap)
# ============================================================
# Goal:
#   Make sure a collection named COLLECTION exists.
#
# Bare hint:
#   - recreate_collection() is deprecated
#   - newer Qdrant APIs separate:
#       * check existence
#       * delete (optional)
#       * create
#
# Do NOT move on until the collection truly exists.

# --- YOUR CODE HERE ---

# SOLUTION (filled in):
# Modern Qdrant API: explicit delete + create (not deprecated recreate_collection)
if client.collection_exists(COLLECTION):
    client.delete_collection(collection_name=COLLECTION)

client.create_collection(
    collection_name=COLLECTION,
    vectors_config=qm.VectorParams(size=EMBED_DIM, distance=qm.Distance.COSINE),
)

print(f"Collection '{COLLECTION}' created with dim={EMBED_DIM}, distance=COSINE")

Collection 'latent_assembly_images' created with dim=512, distance=COSINE


In [19]:
# ============================================================
# Guardrail: verify collection exists BEFORE upsert
# ============================================================
# This cell is here so you don't blindly run upsert() and get a confusing error.

print("Collection name:", COLLECTION)

exists = client.collection_exists(COLLECTION)
print("collection_exists:", exists)

assert exists, (
    f"Collection '{COLLECTION}' does not exist yet.\n"
    "Fix the collection creation step BEFORE running upsert.\n"
    "Bare hint: newer Qdrant uses create_collection (and optionally delete_collection)."
)


Collection name: latent_assembly_images
collection_exists: True


In [20]:
# ============================================================
# Upsert points into Qdrant
# ============================================================

ids = list(range(len(img_vecs)))

points = [
    qm.PointStruct(
        id=ids[i],
        vector=img_vecs[i].tolist(),
        payload=payloads[i],
    )
    for i in range(len(ids))
]

client.upsert(collection_name=COLLECTION, points=points)

print("Collection populated.")
print("Count:", client.count(collection_name=COLLECTION, exact=True).count)


Collection populated.
Count: 300


In [21]:
# ============================================================
# Sanity check: can we retrieve anything? (1 query only)
# ============================================================
# Goal:
# - run ONE text query
# - get TOP_K hits
# - print payload keys + a caption preview from the first hit
#
# This is NOT the full multi-intent stage (that's Part B).
# This is just a basic "does retrieval work at all?" check.

TOP_K = 3
test_query = "chair"   # keep it boring on purpose

@torch.no_grad()
def embed_text_clip(text: str) -> np.ndarray:
    inputs = processor(text=[text], return_tensors="pt", padding=True).to(device)

    # Note: this mirrors the image side — text features may also come back as a bundle.
    # TRAP: feats may be a bundle (dict/object). Extract the tensor.
    # --- YOUR CODE HERE ---
    # Example (do not assume): feats = feats["pooler_output"]

    # SOLUTION (filled in):
    # Same as image side: get_text_features() returns a bundle.
    # Extract pooler_output for the projected (B, 512) embedding.
    tmp = model.get_text_features(**inputs)
    feats = tmp.pooler_output

    feats = feats / feats.norm(dim=-1, keepdim=True)
    v = feats.detach().cpu().numpy().astype(np.float32)[0]
    return l2_normalize(v[None, :])[0].astype(np.float32)

qv = embed_text_clip(test_query)

# TRAP: Qdrant API may differ by version.
# Your job: make ONE of these work.
# Hint: inspect dir(client)
#
# Option A (some versions):
# hits = client.search(
#     collection_name=COLLECTION,
#     query_vector=qv.tolist(),
#     limit=TOP_K,
#     with_payload=True,
# )
#
# Option B (newer versions):
# TODO: find the correct method name + arguments in your Qdrant version
# --- YOUR CODE HERE ---

# SOLUTION (filled in):
# Qdrant 1.16+: search() is removed, use query_points() instead
result = client.query_points(
    collection_name=COLLECTION,
    query=qv.tolist(),
    limit=TOP_K,
    with_payload=True,
)
hits = result.points

assert hits is not None and len(hits) > 0, "No hits returned. Retrieval not working."

print("Query:", test_query)
print("Hits:", len(hits))

h0 = hits[0]
payload0 = getattr(h0, "payload", None) or {}
print("Payload keys:", list(payload0.keys()))

caption_preview = (payload0.get("caption") or payload0.get("filename") or "")
print("Preview:", str(caption_preview)[:140])

Query: chair
Hits: 3
Payload keys: ['filename', 'caption']
Preview: a close up of a toilet with a pink seat and lid


# 4) Save to Hugging Face (for Notebook B)

By the end of Part A, your Hugging Face dataset repo must contain **exactly these files**:

```
dataset/
├── img_vecs.npy      # image embeddings (shape: N × D)
├── payloads.json     # metadata per image (filename + caption)
└── meta.json         # index metadata (dataset, model, caption source)
```

Use a **private** Hugging Face *dataset* repo under your own account.

Notebook B will download this folder and rebuild Qdrant **without re-embedding**.


## Metadata for reuse (important)

Before uploading your embeddings, you must save a small `meta.json` file.

This file explains:
- what dataset you used
- how many items are indexed
- what model and embedding size were used
- where captions came from in the dataset schema

This is required so **Part B can rebuild Qdrant without re-embedding**.
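
A consumer of these artifacts can check that the files agree with each other before rebuilding anything. The sketch below is an assumption about how such a check might look, not Part B's actual code: `check_artifacts` is a hypothetical helper, the toy files are written to a temporary folder, and the `img_vecs.npy` shape check (`(num_items, embedding_dim)`) is left out to keep the example standard-library only.

```python
import json
import tempfile
from pathlib import Path


def check_artifacts(folder: Path) -> dict:
    """Load meta.json and payloads.json and verify they agree with each other."""
    meta = json.loads((folder / "meta.json").read_text(encoding="utf-8"))
    payloads = json.loads((folder / "payloads.json").read_text(encoding="utf-8"))

    required = {"num_items", "embedding_dim", "model_name", "dataset_name", "caption_source"}
    missing = required - meta.keys()
    assert not missing, f"meta.json is missing keys: {sorted(missing)}"
    assert meta["num_items"] == len(payloads), "num_items does not match payloads.json"
    assert any(p.get("caption") for p in payloads), "no non-empty captions found"
    return meta


# Toy artifacts written to a temporary folder so the example is self-contained
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    toy_meta = {
        "num_items": 2,
        "embedding_dim": 512,
        "model_name": "openai/clip-vit-base-patch32",
        "dataset_name": "lmms-lab/COCO-Caption2017",
        "caption_source": "sample['answer'][0]",
    }
    toy_payloads = [
        {"filename": "a.jpg", "caption": "a red chair"},
        {"filename": "b.jpg", "caption": ""},
    ]
    (folder / "meta.json").write_text(json.dumps(toy_meta), encoding="utf-8")
    (folder / "payloads.json").write_text(json.dumps(toy_payloads), encoding="utf-8")
    checked = check_artifacts(folder)

print("meta OK for", checked["num_items"], "items")
```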


## Colab Secrets token

Colab left sidebar → key icon (Secrets) → add:
- Name: HF_TOKEN
- Value: your HF token (Write permission)
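
Outside Colab, the same token can come from an `HF_TOKEN` environment variable. The lookup below is a sketch that tries the environment first and falls back to Colab Secrets only when `google.colab` is importable; `get_hf_token` is a hypothetical helper name.

```python
import os
from typing import Optional


def get_hf_token() -> Optional[str]:
    """Return the HF token from the environment, or from Colab Secrets if available."""
    token = os.environ.get("HF_TOKEN")
    if token:
        return token
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        return None
    try:
        return userdata.get("HF_TOKEN")
    except Exception:
        # Secret not defined or notebook access not granted
        return None


token = get_hf_token()
print("HF token found:", bool(token))
```

Printing only whether a token was found, rather than the token itself, keeps the secret out of saved notebook outputs.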


In [22]:
# ============================================================
# Save metadata about this embedding index
# ============================================================

HF_USER_NAME = "vector-helix"
HF_REPO_NAME = "latent-assembly-clip-qdrant"

ARTIFACT_DIR = Path(HF_REPO_NAME + "/dataset")
ARTIFACT_DIR.mkdir(parents=True, exist_ok=True)

# TODO (small but important):
# You should already have DATASET_NAME and CAPTION_SOURCE defined earlier.
# This metadata will be used by Part B to rebuild Qdrant without guessing.

meta = {
    "num_items": int(len(payloads)),
    "embedding_dim": int(img_vecs.shape[1]),
    "model_name": MODEL_NAME,
    "dataset_name": DATASET_NAME,
    "caption_source": CAPTION_SOURCE,
    "notes": "Add any observations about the dataset or extraction here"
}

with open(ARTIFACT_DIR / "meta.json", "w", encoding="utf-8") as f:
    json.dump(meta, f, ensure_ascii=False, indent=2)

print("Saved meta.json:")
print(json.dumps(meta, indent=2))

Saved meta.json:
{
  "num_items": 300,
  "embedding_dim": 512,
  "model_name": "openai/clip-vit-base-patch32",
  "dataset_name": "lmms-lab/COCO-Caption2017",
  "caption_source": "sample['answer'][0]",
  "notes": "Add any observations about the dataset or extraction here"
}


In [23]:
# TODO: Upload artifacts to your private HF dataset repo

# huggingface_hub = ensure_package_installed("huggingface_hub", "huggingface_hub")

try:
    # from google.colab import userdata
    # HF_TOKEN = userdata.get("HF_TOKEN")

    HF_TOKEN = os.environ.get("HF_TOKEN")
except Exception:
    HF_TOKEN = None

assert HF_TOKEN, "HF_TOKEN not found. Set it via environment variable or Colab Secrets."
login(token=HF_TOKEN)

# TODO: change to your own repo id, under your username
# HF_REPO_ID = "YOUR_USERNAME/YOUR_DATASET_REPO_NAME"

# SOLUTION (filled in):

HF_REPO_ID = HF_USER_NAME + "/" + HF_REPO_NAME

ARTIFACT_DIR = Path(HF_REPO_NAME + "/dataset")
ARTIFACT_DIR.mkdir(parents=True, exist_ok=True)

np.save(ARTIFACT_DIR / "img_vecs.npy", img_vecs)
with open(ARTIFACT_DIR / "payloads.json", "w", encoding="utf-8") as f:
    json.dump(payloads, f, ensure_ascii=False)

# IMPORTANT: do NOT overwrite meta.json here.
# It must include dataset_name + caption_source from the metadata cell above.
assert (ARTIFACT_DIR / "meta.json").exists(), (
    "meta.json is missing. Run the metadata cell above before uploading."
)

create_repo(repo_id=HF_REPO_ID, repo_type="dataset", private=True, exist_ok=True)

upload_folder(
    repo_id=HF_REPO_ID,
    repo_type="dataset",
    folder_path=HF_REPO_NAME,  # local folder that contains dataset/
    path_in_repo=".",
    commit_message="Upload Latent Assembly artifacts",
)

print("Uploaded to:", HF_REPO_ID)

Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


Processing Files (0 / 0): |          |  0.00B /  0.00B            

New Data Upload: |          |  0.00B /  0.00B            

Uploaded to: vector-helix/latent-assembly-clip-qdrant


## Submit (Part A)

Submit your HF repo id: `username/repo-name`
