# Project Notes (Sanitized for Git)

This repository contains a **sanitized** version of the Gracity Insects YOLOv8 Classification notebooks.
All tenant-specific identifiers (bucket names, namespaces, OCIDs, local absolute paths) have been replaced by placeholders.

**Author:** Cristina Varas Menadas  
**Last updated:** 2026-02-19

> To run these notebooks, set the configuration values in the first "Configuration" section of each notebook.


# Gracity Insects — 02. Dataset QA & Exploratory Analysis (Bucket-first)

This notebook performs lightweight dataset quality checks **directly from OCI Object Storage** (bucket-first):

- Count images per class and per split (**train** / **test**, where **test is used as validation** in this starter)
- Random visual sanity checks (download only a small sample to a local cache)
- Image size distribution (computed on the downloaded sample)

> Why bucket-first?
> - The bucket is the **source of truth**
> - We avoid relying on local paths while the dataset is being staged

## 2.1 Imports

In [None]:
from __future__ import annotations

import math
import random
from collections import defaultdict
from pathlib import Path
from typing import Dict, List, Tuple

import matplotlib.pyplot as plt
import cv2
import pandas as pd

import oci
from oci.object_storage import ObjectStorageClient

## 2.2 Configuration

In [None]:
# Object Storage
BUCKET_NAME: str = "<BUCKET_NAME>"

DATA_PREFIX: str = "<PROJECT_PREFIX>/v1/raw/datasets/insects_kaggle_v1"
TRAIN_PREFIX: str = f"{DATA_PREFIX}/train/"
TEST_PREFIX: str  = f"{DATA_PREFIX}/test/"  # test is used as validation in this starter

# Local cache for small samples (do NOT store the full dataset here)
CACHE_DIR: Path = Path("<LOCAL_PATH> Gracity/gracity-insects-yolo-cls/outputs/cache/samples")

# Sampling settings
SAMPLE_N_IMAGES: int = 12         # for visual checks
SIZE_SAMPLE_MAX: int = 500        # for size distribution

RANDOM_SEED: int = 42
random.seed(RANDOM_SEED)

## 2.3 Connect to Object Storage using Resource Principals

In [None]:
signer = oci.auth.signers.get_resource_principals_signer()
os_client = ObjectStorageClient(config={}, signer=signer)
namespace: str = os_client.get_namespace().data

print("Namespace:", namespace)
print("Bucket:", BUCKET_NAME)
print("Train prefix:", TRAIN_PREFIX)
print("Test prefix:", TEST_PREFIX)

## 2.4 Helpers (list objects, count per class, download objects)

In [None]:
IMAGE_EXTS: Tuple[str, ...] = (".jpg", ".jpeg", ".png", ".webp")

def list_all_objects(prefix: str) -> List[str]:
    """Return the names of all image objects under `prefix`, following pagination."""
    names: List[str] = []
    start: str | None = None
    while True:
        r = os_client.list_objects(
            namespace_name=namespace,
            bucket_name=BUCKET_NAME,
            prefix=prefix,
            start=start,
            limit=1000,
        )
        # skip "folder" placeholder objects whose names end with "/"
        names.extend(o.name for o in r.data.objects if not o.name.endswith("/"))
        # next_start_with is None once the last page has been returned
        start = r.data.next_start_with
        if not start:
            break
    # keep only images
    return [n for n in names if n.lower().endswith(IMAGE_EXTS)]

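def counts_by_class(prefix: str) -> Dict[str, int]:
    """Count image objects per class folder directly under `prefix`."""
    objs = list_all_objects(prefix)
    counts: Dict[str, int] = defaultdict(int)
    for name in objs:
        rest = name[len(prefix):]          # "<ClassName>/<file>"
        cls, sep, _ = rest.partition("/")
        # skip files sitting directly under the prefix (no class folder),
        # otherwise the file name itself would be counted as a class
        if cls and sep:
            counts[cls] += 1
    return dict(counts)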

def download_object(obj_name: str, dest_path: Path) -> None:
    dest_path.parent.mkdir(parents=True, exist_ok=True)
    r = os_client.get_object(namespace, BUCKET_NAME, obj_name)
    with dest_path.open("wb") as f:
        for chunk in r.data.raw.stream(1024 * 1024, decode_content=False):
            f.write(chunk)

## 2.5 Counts per class (train vs test)

In [None]:
train_counts: Dict[str, int] = counts_by_class(TRAIN_PREFIX)
test_counts: Dict[str, int] = counts_by_class(TEST_PREFIX)

df = pd.DataFrame({"train": train_counts, "val(test)": test_counts}).fillna(0).astype(int)
df.sort_index()
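
The table above can also be summarized numerically. The following is a minimal sketch, not part of the original notebook: `balance_report` is a hypothetical helper that takes the two count dictionaries and returns the largest-to-smallest class ratio in the train split, plus the classes that appear in only one split.

```python
from typing import Dict, List, Tuple


def balance_report(train: Dict[str, int], val: Dict[str, int]) -> Tuple[float, List[str]]:
    """Return (max/min train count ratio, classes present in only one split)."""
    # symmetric difference: classes found in train or val, but not in both
    only_one_split = sorted(set(train) ^ set(val))
    nonzero = [c for c in train.values() if c > 0]
    # ratio of the largest to the smallest non-empty class; NaN if there are none
    ratio = max(nonzero) / min(nonzero) if nonzero else float("nan")
    return ratio, only_one_split
```

In the notebook it could be called as `balance_report(train_counts, test_counts)`. What ratio counts as "too imbalanced" depends on the dataset and is left to the reader.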

## 2.6 Random visual sanity check (download-only sample)

We download a small random sample of images from the **train** split into `CACHE_DIR`
and plot them. This avoids downloading the entire dataset.

In [None]:
CACHE_DIR.mkdir(parents=True, exist_ok=True)

train_objs = list_all_objects(TRAIN_PREFIX)
k = min(SAMPLE_N_IMAGES, len(train_objs))
picked = random.sample(train_objs, k=k)

samples: List[Path] = []
for obj in picked:
    rel = obj[len(TRAIN_PREFIX):]  # "<Class>/<file>"
    local_path = CACHE_DIR / "train" / rel
    if not local_path.exists():
        download_object(obj, local_path)
    samples.append(local_path)

len(samples), (samples[0] if samples else None)  # avoid IndexError when the prefix is empty

## 2.7 Plot sample images (cv2)

In [None]:
cols = 4
rows = math.ceil(len(samples) / cols)
plt.figure(figsize=(12, 3 * rows))

for i, p in enumerate(samples, 1):
    img_bgr = cv2.imread(str(p))
    if img_bgr is None:
        continue
    img = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)
    plt.subplot(rows, cols, i)
    plt.imshow(img)
    # parent folder is the class name
    plt.title(p.parent.name)
    plt.axis("off")

plt.tight_layout()
plt.show()

## 2.8 Image size distribution (sample)

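We compute image width/height distributions on a limited sample (at most `SIZE_SAMPLE_MAX` images) to keep it fast. If fewer images are in the sample so far, the cell first adds more from the **train** split, downloading only those not already cached.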

In [None]:
def get_sizes(local_paths: List[Path], max_n: int) -> List[Tuple[int, int]]:
    sizes: List[Tuple[int, int]] = []
    for p in local_paths[:max_n]:
        img = cv2.imread(str(p))
        if img is None:
            continue
        h, w = img.shape[:2]
        sizes.append((w, h))
    return sizes

# Ensure we have enough cached files; if not, download more (bounded by SIZE_SAMPLE_MAX)
needed = max(SIZE_SAMPLE_MAX, len(samples))
if len(samples) < needed:
    already = set(samples)
    # candidates: train objects not yet in the sample (files cached by earlier runs are reused)
    candidates = [o for o in train_objs if (CACHE_DIR / "train" / o[len(TRAIN_PREFIX):]) not in already]
    # cap by the number of candidates so random.sample cannot raise ValueError
    extra_k = min(needed - len(samples), len(candidates))
    for obj in random.sample(candidates, k=extra_k):
        rel = obj[len(TRAIN_PREFIX):]
        local_path = CACHE_DIR / "train" / rel
        if not local_path.exists():
            download_object(obj, local_path)
        samples.append(local_path)

sizes = get_sizes(samples, SIZE_SAMPLE_MAX)
ws = [w for w, h in sizes]
hs = [h for w, h in sizes]

plt.figure(figsize=(6, 4))
plt.hist(ws, bins=30)
plt.xlabel("Width (px)")
plt.ylabel("Count")
plt.title("Width distribution (sample)")
plt.show()

plt.figure(figsize=(6, 4))
plt.hist(hs, bins=30)
plt.xlabel("Height (px)")
plt.ylabel("Count")
plt.title("Height distribution (sample)")
plt.show()
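
The histograms can be complemented with a few summary numbers. This is a rough sketch using only the standard library; `summarize_sizes` is a hypothetical helper, and it expects the same `(width, height)` tuples that `get_sizes` returns.

```python
import statistics
from typing import Dict, List, Tuple


def summarize_sizes(sizes: List[Tuple[int, int]]) -> Dict[str, float]:
    """Summarize a list of (width, height) pairs in pixels."""
    if not sizes:
        raise ValueError("no sizes to summarize")
    ws = [w for w, _ in sizes]
    hs = [h for _, h in sizes]
    return {
        "min_side": min(min(ws), min(hs)),      # smallest edge seen in the sample
        "max_side": max(max(ws), max(hs)),      # largest edge seen in the sample
        "median_w": statistics.median(ws),
        "median_h": statistics.median(hs),
        "median_aspect": statistics.median(w / h for w, h in sizes),  # width / height
    }
```

A `min_side` well below the training `imgsz` would suggest that some images get upscaled during training, which may be worth a visual check.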

## 2.9 Notes on preprocessing (for YOLO classification)

For YOLOv8 classification, heavy preprocessing is usually **not required** for a first iteration:

- The training pipeline will **resize** images to `imgsz` (e.g., 224) automatically.
- What *is* useful:
  - detect corrupt images (`cv2.imread` returns `None` for unreadable files; the cells above skip them silently without reporting them)
  - check class balance
  - sanity-check lighting/blur/noise

If you later want to optimize IO/cost, you can add an optional step:
- pre-resize to 224 and save as JPEG (quality 85–90) to a new prefix (cache dataset).
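
A possible shape for that optional step is sketched below. Several choices here are assumptions rather than part of the notebook: the helper names `scaled_size` and `to_resized_jpeg` are hypothetical, the shorter side (not both sides) is scaled to 224 so the aspect ratio is preserved, images are never upscaled, and the JPEG quality defaults to 88.

```python
from typing import Tuple


def scaled_size(width: int, height: int, short_side: int = 224) -> Tuple[int, int]:
    """Return (w, h) with the shorter side scaled down to `short_side`, keeping aspect ratio."""
    shortest = min(width, height)
    if shortest <= short_side:
        return width, height  # never upscale small images
    scale = short_side / shortest
    return max(1, round(width * scale)), max(1, round(height * scale))


def to_resized_jpeg(img_bgr, short_side: int = 224, quality: int = 88) -> bytes:
    """Resize a BGR image (as returned by cv2.imread) and encode it as JPEG bytes."""
    import cv2  # imported here so scaled_size stays usable without OpenCV installed

    h, w = img_bgr.shape[:2]
    new_w, new_h = scaled_size(w, h, short_side)
    if (new_w, new_h) != (w, h):
        # INTER_AREA is generally a reasonable choice when shrinking images
        img_bgr = cv2.resize(img_bgr, (new_w, new_h), interpolation=cv2.INTER_AREA)
    ok, buf = cv2.imencode(".jpg", img_bgr, [cv2.IMWRITE_JPEG_QUALITY, quality])
    if not ok:
        raise ValueError("JPEG encoding failed")
    return buf.tobytes()
```

The resulting bytes could then be uploaded under a separate prefix with `os_client.put_object(namespace, BUCKET_NAME, new_object_name, data)`, where `new_object_name` is a placeholder for whatever key layout the cache dataset uses.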