# Multilabel Dataset Generator (Food11 + FoodOrNot)

This notebook generates a real multilabel image dataset by combining samples from two sources:

- Food-11 (food categories)
- FoodOrNot (non-food / negative samples)

The goal is a dataset suitable for training a deep learning model for multilabel classification, where a single image may contain one, several, or none of the target categories.

✅ Target Classes (Multilabel)

The dataset is built using the following three multilabel classes:

- lacteos (dairy products)
- arroz (rice)
- frutas/verduras (fruits and vegetables)

Each generated image receives a label vector in the order [lacteos, arroz, frutas/verduras], for example:

- [1, 0, 0] → contains dairy only
- [1, 1, 0] → contains dairy + rice
- [0, 1, 1] → contains rice + fruits/vegetables
- [0, 0, 0] → non-food image (none of the classes)
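
As an illustration of the label format, a vector can be decoded back into class names. This is a minimal sketch: the helper name `decode_labels` is hypothetical and not part of the notebook, and it assumes the class order listed above.

```python
# Class order matches the positions of the label vector described above.
CLASSES = ["lacteos", "arroz", "frutas/verduras"]

def decode_labels(vector):
    """Return the names of the classes flagged with 1 (hypothetical helper)."""
    if len(vector) != len(CLASSES):
        raise ValueError(f"expected {len(CLASSES)} values, got {len(vector)}")
    return [name for name, flag in zip(CLASSES, vector) if flag == 1]

print(decode_labels([1, 1, 0]))  # ['lacteos', 'arroz']
print(decode_labels([0, 0, 0]))  # [] (non-food image)
```

An empty result corresponds to the "none" case, which is why non-food images need no extra class column.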

✅ Dataset Generation Strategy

To simulate real-world scenarios where multiple food items may appear in the same image, this notebook generates collage images by randomly selecting images from the Food-11 dataset and composing them into a single 224×224 RGB image.

The dataset includes:

1) Multilabel Collages (Food Images)

- Collages contain 1 to 3 different food categories
- Labels are assigned dynamically depending on which classes were included
- Each class is included independently with a configurable probability (P_INCLUDE)

2) Non-Food Negative Images (No Category)

- Additional samples are taken from FoodOrNot (negative_non_food)
- These images are labeled [0, 0, 0] to represent “none”
- These negatives are intended to reduce false positives and help the model reject images outside the target classes


# 1) PROJECT ROOT DETECTION

This block ensures that the notebook always runs from the project root folder, even if VS Code starts the kernel from another directory (such as /outputs or /flask_app).

This matters because the notebook uses relative paths such as:
   - raw/food11/training
   - dataset/labels.csv

These paths fail to resolve if the current working directory is wrong.

In [1]:
import os

def find_project_root(start="."):
    """Walk up from `start` until a folder containing raw/ and dataset/ is found."""
    here = os.path.abspath(start)
    while True:
        # A valid root contains both of these folders
        if os.path.isdir(os.path.join(here, "raw")) and os.path.isdir(os.path.join(here, "dataset")):
            return here
        parent = os.path.dirname(here)
        if parent == here:
            # Reached the filesystem root without finding the project
            raise RuntimeError("Project root not found (raw/ and dataset/ folders). Open the correct folder in VS Code.")
        here = parent

PROJECT_ROOT = find_project_root(".")
os.chdir(PROJECT_ROOT)

print("✅ PROJECT_ROOT:", PROJECT_ROOT)
print("✅ CWD:", os.getcwd())
print("✅ raw/food11/training exists?:", os.path.isdir("raw/food11/training"))
print("✅ dataset/labels.csv exists?:", os.path.exists("dataset/labels.csv"))


✅ PROJECT_ROOT: c:\Users\HP OMEN\Documents\GitHub\Food-Multi-Label-Classification-Pipeline-with-TensorFlow-Dataset-Builder
✅ CWD: c:\Users\HP OMEN\Documents\GitHub\Food-Multi-Label-Classification-Pipeline-with-TensorFlow-Dataset-Builder
✅ raw/food11/training exists?: True
✅ dataset/labels.csv exists?: True
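
The same upward search can also be written with `pathlib`. The following is only a sketch of an alternative, not the notebook's own code: the `markers` parameter is an assumption added for illustration, and the temporary folder layout exists only to demonstrate the lookup.

```python
import tempfile
from pathlib import Path

def find_project_root(start=".", markers=("raw", "dataset")):
    """Return the nearest ancestor of `start` that contains every marker folder."""
    here = Path(start).resolve()
    for candidate in (here, *here.parents):
        if all((candidate / name).is_dir() for name in markers):
            return candidate
    raise RuntimeError(f"Project root not found (expected folders: {', '.join(markers)})")

# Demonstration on a throwaway folder tree: root/raw, root/dataset, root/outputs/run1
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "raw").mkdir()
    (root / "dataset").mkdir()
    nested = root / "outputs" / "run1"
    nested.mkdir(parents=True)
    print(find_project_root(nested) == root)  # True
```

Starting the search from a nested folder such as outputs/run1 still resolves to the folder that holds both raw/ and dataset/, which is the behavior the notebook relies on.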


# 2) IMPORTS + GLOBAL CONFIGURATION (Dataset Generator Settings)

 This block imports the required libraries and defines all the global
 configuration variables used to build the multilabel dataset.

 The dataset will be generated by combining images from:

   ✅ Food-11 (food categories)
   
   ✅ FoodOrNot (non-food images)

 Output:
   - dataset/images/   -> generated images (collages + non-food)
   - dataset/labels.csv -> multilabel annotations for each image

In [2]:

import os
import random
import pandas as pd
from PIL import Image

random.seed(42)

# =========================
# CONFIG
# =========================
RAW_FOOD11 = "raw/food11"
RAW_FOODORNOT = "raw/foodornot"

OUT_IMAGES = "dataset/images"
OUT_LABELS = "dataset/labels.csv"

# ✅ 3 classes (real multilabel)
CLASSES = ["lacteos", "arroz", "frutas/verduras"]

FOOD11_FOLDER_MAP = {
    "lacteos": "Dairy product",
    "arroz": "Rice",
    "frutas/verduras": "Vegetable-Fruit",
}

# number of collages per split
N_TRAIN = 3000
N_VAL = 600
N_TEST = 600

# probability that each class appears in a composite image
P_INCLUDE = 0.60

# maximum number of non-food images per split
MAX_NONFOOD_TRAIN = 1200
MAX_NONFOOD_VAL = 200
MAX_NONFOOD_TEST = 200

IMG_SIZE = (224, 224)
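
Before generating anything, the configuration can be checked for consistency and the planned split sizes can be computed. This is a minimal sketch: the dictionaries `N_COLLAGES` and `MAX_NONFOOD` and the function `check_config` are hypothetical groupings of the values above, not names used by the notebook. The non-food numbers are caps, so the totals are upper bounds that are reached only if enough negative images are available.

```python
# Values copied from the configuration cell; the grouping into dicts is illustrative.
CLASSES = ["lacteos", "arroz", "frutas/verduras"]
FOOD11_FOLDER_MAP = {
    "lacteos": "Dairy product",
    "arroz": "Rice",
    "frutas/verduras": "Vegetable-Fruit",
}
P_INCLUDE = 0.60
N_COLLAGES = {"train": 3000, "val": 600, "test": 600}
MAX_NONFOOD = {"train": 1200, "val": 200, "test": 200}

def check_config():
    """Validate the settings and return the maximum image count per split."""
    missing = [c for c in CLASSES if c not in FOOD11_FOLDER_MAP]
    if missing:
        raise ValueError(f"classes without a Food-11 folder: {missing}")
    if not 0.0 <= P_INCLUDE <= 1.0:
        raise ValueError("P_INCLUDE must be a probability in [0, 1]")
    return {split: N_COLLAGES[split] + MAX_NONFOOD[split] for split in N_COLLAGES}

sizes = check_config()
print(sizes)                # {'train': 4200, 'val': 800, 'test': 800}
print(sum(sizes.values()))  # 5800
```

The upper bound of 5,800 images matches the total reported by the notebook's final summary.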


# 3) HELPER FUNCTIONS (Dataset building utilities)

This section contains utility functions used by the dataset generator.
 They handle:
   - creating folders
   - reading image file paths
   - loading images per class and split
   - randomly generating multilabel targets
   - creating collages (multi-image composition)
   - saving outputs and collecting non-food negatives
   - building a full split (train/val/test)

In [3]:
# =========================
# 1) HELPERS
# =========================
def ensure_dir(p):
    os.makedirs(p, exist_ok=True)

def list_images(folder: str):
    exts = (".jpg", ".jpeg", ".png", ".webp")
    if not os.path.isdir(folder):
        return []
    return [
        os.path.join(folder, f)
        for f in os.listdir(folder)
        if os.path.isfile(os.path.join(folder, f)) and f.lower().endswith(exts)
    ]

def load_pool(split_name: str):
    """Return {class name: list of image paths} for one Food-11 split."""
    split_dir = os.path.join(RAW_FOOD11, split_name)
    if not os.path.isdir(split_dir):
        raise FileNotFoundError(f"Directory not found: {split_dir}")

    pools = {}
    for cls in CLASSES:
        class_dir = os.path.join(split_dir, FOOD11_FOLDER_MAP[cls])
        if not os.path.isdir(class_dir):
            raise FileNotFoundError(f"Class folder not found: {class_dir}")

        pools[cls] = list_images(class_dir)
        if len(pools[cls]) == 0:
            raise RuntimeError(f"Empty folder: {class_dir}")

    return pools

def pick_labels():
    # Include each class independently with probability P_INCLUDE
    labs = []
    for _ in CLASSES:
        labs.append(1 if random.random() < P_INCLUDE else 0)

    # Most of the time there should be at least 1 class
    if sum(labs) == 0 and random.random() < 0.80:
        labs[random.randrange(len(CLASSES))] = 1

    return labs

def make_collage(selected_imgs):
    # With a single class, use the image at full size instead of a tile
    if len(selected_imgs) == 1:
        return selected_imgs[0].convert("RGB").resize(IMG_SIZE)

    canvas = Image.new("RGB", IMG_SIZE, (0, 0, 0))

    # Slots (x1, y1, x2, y2) for up to 3 images; any unused area stays black
    slots = [
        (0, 0, 112, 112),
        (112, 0, 224, 112),
        (0, 112, 112, 224),
    ]

    for img, slot in zip(selected_imgs, slots):
        x1, y1, x2, y2 = slot
        w, h = x2 - x1, y2 - y1
        im = img.convert("RGB").resize((w, h))
        canvas.paste(im, (x1, y1))

    return canvas

def save_image(img, filename):
    img.save(os.path.join(OUT_IMAGES, filename), quality=95)

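def collect_nonfood(split_key: str):
    paths = []

    train_non = os.path.join(RAW_FOODORNOT, "train", "negative_non_food")
    test_non = os.path.join(RAW_FOODORNOT, "test", "negative_non_food")

    # The train negatives are used only for the train split
    if split_key == "train" and os.path.isdir(train_non):
        paths.extend(list_images(train_non))

    # Caution: the test negatives are added for every split, so the same
    # non-food image can end up in more than one of train/val/test
    if os.path.isdir(test_non):
        paths.extend(list_images(test_non))

    random.shuffle(paths)
    return paths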

def build_split(food11_split, n_samples, prefix, nonfood_limit, nonfood_split_key):
    pools = load_pool(food11_split)
    rows = []
    idx = 0

    # 1) collages multilabel
    for _ in range(n_samples):
        labs = pick_labels()

        chosen_paths = []
        for cls, flag in zip(CLASSES, labs):
            if flag == 1:
                chosen_paths.append(random.choice(pools[cls]))

        # Safety net: pick_labels() can still return all zeros, so force one class
        if len(chosen_paths) == 0:
            cls = random.choice(CLASSES)
            labs = [1 if c == cls else 0 for c in CLASSES]
            chosen_paths = [random.choice(pools[cls])]

        imgs = [Image.open(p) for p in chosen_paths]
        collage = make_collage(imgs)

        idx += 1
        fn = f"{prefix}_{idx:07d}.jpg"
        save_image(collage, fn)

        rows.append({
            "filename": fn,
            **{c: int(v) for c, v in zip(CLASSES, labs)},
            "source": f"collage/{food11_split}"
        })

    # 2) Non-food images => none (0,0,0)
    nonfoods = collect_nonfood(nonfood_split_key)[:nonfood_limit]

    for p in nonfoods:
        idx += 1
        fn = f"{prefix}_none_{idx:07d}.jpg"
        img = Image.open(p).convert("RGB").resize(IMG_SIZE)
        save_image(img, fn)

        rows.append({
            "filename": fn,
            **{c: 0 for c in CLASSES},
            "source": "foodornot/negative_non_food"
        })

    return rows
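
The collage layout in make_collage can be examined without loading any images. The sketch below reuses the slot coordinates from that function and computes how much of the 224×224 canvas is covered for one, two, or three images. The function name `covered_fraction` is hypothetical and exists only for this illustration.

```python
# Canvas size and slot boxes (x1, y1, x2, y2) as defined in make_collage.
IMG_SIZE = (224, 224)
SLOTS = [
    (0, 0, 112, 112),    # top-left
    (112, 0, 224, 112),  # top-right
    (0, 112, 112, 224),  # bottom-left
]

def covered_fraction(n_images):
    """Fraction of the canvas filled when `n_images` images are composed."""
    if not 1 <= n_images <= len(SLOTS):
        raise ValueError(f"n_images must be between 1 and {len(SLOTS)}")
    if n_images == 1:
        return 1.0  # a single image is resized to the full canvas
    tile_area = sum((x2 - x1) * (y2 - y1) for x1, y1, x2, y2 in SLOTS[:n_images])
    return tile_area / (IMG_SIZE[0] * IMG_SIZE[1])

for n in (1, 2, 3):
    print(n, covered_fraction(n))
# 1 1.0
# 2 0.5
# 3 0.75
```

With two classes the bottom half of the canvas stays black, and with three classes the bottom-right quadrant does, so a model trained on these collages also sees black padding as part of the input.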


# 4) MAIN PIPELINE

This block is the entry point of the dataset generator. It orchestrates the full process:
   1) Create the output folders (dataset/images and the parent folder of dataset/labels.csv)
   2) Remove previously generated images (clean the output folder)
   3) Generate the train/val/test splits:
       - multilabel collages from Food-11
       - non-food negatives from FoodOrNot
   4) Save a CSV file with the multilabel annotations (labels.csv)
   5) Print summary statistics to verify the dataset balance

In [4]:
def main():
    # (1) Ensure output directories exist
    ensure_dir(OUT_IMAGES)
    ensure_dir(os.path.dirname(OUT_LABELS))

    # (2) Clean previously generated images
    if os.path.isdir(OUT_IMAGES):
        for f in os.listdir(OUT_IMAGES):
            if f.lower().endswith((".jpg", ".jpeg", ".png", ".webp")):
                os.remove(os.path.join(OUT_IMAGES, f))

    # (3) Build dataset splits (train / validation / test)
    all_rows = []
    all_rows += build_split("training",   N_TRAIN, "train", MAX_NONFOOD_TRAIN, "train")
    all_rows += build_split("validation", N_VAL,   "val",   MAX_NONFOOD_VAL,   "val")
    all_rows += build_split("evaluation", N_TEST,  "test",  MAX_NONFOOD_TEST,  "test")

    # (4) Save labels CSV (multilabel annotations)
    df = pd.DataFrame(all_rows)
    df.to_csv(OUT_LABELS, index=False, encoding="utf-8")

    # (5) Print summary statistics
    print("\n✅ REAL MULTILABEL dataset ready")
    print("Total:", len(df))
    print("Positives per class:\n", df[CLASSES].sum())
    print("NONE:", int((df[CLASSES].sum(axis=1) == 0).sum()))
    print("CSV:", OUT_LABELS)

main()



✅ REAL MULTILABEL dataset ready
Total: 5800
Positives per class:
 lacteos            2627
arroz              2565
frutas/verduras    2573
dtype: int64
NONE: 1600
CSV: dataset/labels.csv
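
The per-class counts printed above can be sanity-checked against the sampling rule. Each class is drawn with probability P_INCLUDE = 0.60, and an all-zero draw is always replaced by one uniformly chosen class (partly in pick_labels, otherwise in build_split). The sketch below enumerates all label combinations to get the exact expected rate. The function name `expected_positive_rate` is hypothetical and used only for this check.

```python
from itertools import product

P_INCLUDE = 0.60
N_CLASSES = 3
N_COLLAGES = 3000 + 600 + 600  # N_TRAIN + N_VAL + N_TEST

def expected_positive_rate(p, n_classes):
    """Exact probability that one given class is present in a collage."""
    rate = 0.0
    for combo in product((0, 1), repeat=n_classes):
        prob = 1.0
        for flag in combo:
            prob *= p if flag else (1.0 - p)
        if sum(combo) == 0:
            # An empty draw is replaced by a single class chosen uniformly.
            rate += prob / n_classes
        else:
            # Count the combination only if the first class is set; by
            # symmetry the result is the same for every class.
            rate += prob * combo[0]
    return rate

rate = expected_positive_rate(P_INCLUDE, N_CLASSES)
print(round(rate, 4))            # 0.6213
print(round(rate * N_COLLAGES))  # 2610
```

The observed counts (2,627, 2,565, and 2,573) are all close to the expected value of about 2,610 positives per class, which is consistent with the generator behaving as configured.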


# Conclusions
The multilabel dataset was generated with 5,800 images: 4,200 Food-11 collages and 1,600 “none” (non-food) samples. The three target classes are closely balanced, with 2,627 (dairy), 2,565 (rice), and 2,573 (fruits/vegetables) positive labels. The non-food samples are included to reduce false positives and to help the model reject images that do not belong to any category.

The final structure (dataset/images + dataset/labels.csv) can be consumed by TensorFlow training pipelines built on tf.data, which makes it easier to automate both training and evaluation. The split of each image is encoded in its filename prefix (train_, val_, test_).
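
As a starting point for such a pipeline, the rows of labels.csv can be separated by split using the filename prefixes that build_split writes. This is a minimal sketch using only the standard library: the function `read_split` is hypothetical, and the three-row sample stands in for the real CSV with the same columns.

```python
import csv
import io

CLASSES = ["lacteos", "arroz", "frutas/verduras"]

def read_split(csv_file, prefix, image_dir="dataset/images"):
    """Return (paths, labels) for rows whose filename starts with `prefix` + '_'."""
    paths, labels = [], []
    for row in csv.DictReader(csv_file):
        if row["filename"].startswith(prefix + "_"):
            paths.append(f"{image_dir}/{row['filename']}")
            labels.append([float(row[c]) for c in CLASSES])
    return paths, labels

# Tiny in-memory sample with the same columns as dataset/labels.csv.
sample = io.StringIO(
    "filename,lacteos,arroz,frutas/verduras,source\n"
    "train_0000001.jpg,1,0,1,collage/training\n"
    "train_none_0003001.jpg,0,0,0,foodornot/negative_non_food\n"
    "val_0000001.jpg,0,1,0,collage/validation\n"
)
paths, labels = read_split(sample, "train")
print(paths)   # ['dataset/images/train_0000001.jpg', 'dataset/images/train_none_0003001.jpg']
print(labels)  # [[1.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
```

For the real file, the same function could be called with `open("dataset/labels.csv", newline="", encoding="utf-8")`. The resulting lists could then be passed to `tf.data.Dataset.from_tensor_slices((paths, labels))` and mapped through an image-decoding function.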