## understanding the task

new images are uploaded by users on a daily basis

we need to ingest them and prepare them for further use and processing

let's assume this is a dish classification use case

points to consider:
* we expect an existing main image pool, which we plan to extend with the new images

* we ingest images daily
    * schedule or other trigger?

* data preparation
    * rescale to have a maximum side length of 512 pixels
    * add metadata
        * upload date
        * user id

* historization
    * unlikely to be needed

* where to store images and labels

* workflow
    * ingest as raw
    * process as rescaled
    * run through the ML model and store as labeled
    * manually validate?
    * extend the main image pool
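
The ingest and metadata points above could be sketched roughly as follows. This is a minimal illustration, not the notebook's implementation: the helper name `ingest_raw`, the `raw` folder, and the JSON sidecar layout are all assumptions made for the example.

```python
import json
import shutil
import tempfile
from datetime import date
from pathlib import Path


def ingest_raw(src: Path, raw_dir: Path, user_id: str, upload_date: date) -> Path:
    """Copy one uploaded image into the raw stage and write a JSON sidecar.

    Hypothetical helper: the sidecar layout (file name, upload date, user id)
    is an assumption, not something the notes prescribe.
    """
    raw_dir.mkdir(parents=True, exist_ok=True)
    dst = raw_dir / src.name
    shutil.copy2(src, dst)  # copy2 also preserves file timestamps

    meta = {
        "file": src.name,
        "upload_date": upload_date.isoformat(),
        "user_id": user_id,
    }
    sidecar = dst.parent / (dst.name + ".json")
    sidecar.write_text(json.dumps(meta, indent=2))
    return sidecar


# Usage with a throwaway directory and a placeholder file
with tempfile.TemporaryDirectory() as tmp:
    upload = Path(tmp) / "upload" / "pizza.jpg"
    upload.parent.mkdir()
    upload.write_bytes(b"placeholder, not a real JPEG")
    sidecar = ingest_raw(upload, Path(tmp) / "raw", "user-42", date(2024, 1, 31))
    demo_meta = json.loads(sidecar.read_text())
    print(demo_meta)
```

Keeping the metadata in a sidecar file next to the image is only one option; a table keyed by file name would serve the same purpose.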



## further improvements

* proper storage
    * Azure Blob Storage

* other image formats
    * not only jpg

* move functions into a separate module
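
Supporting more than jpg could start with an extension allow-list. A minimal sketch, assuming the hypothetical names `SUPPORTED_EXTENSIONS` and `is_supported_image`; the list of formats is an illustrative guess, and the check looks only at the file name, not at the file content.

```python
from pathlib import Path

# Assumed extension list; extend it to whatever formats the pipeline should accept
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}


def is_supported_image(path) -> bool:
    """Return True if the file extension is in the allow-list (case-insensitive)."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


names = ["download.jpeg", "Pizza-3007395.JPG", "menu.pdf", "photo.png"]
accepted = [n for n in names if is_supported_image(n)]
print(accepted)
```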

### imports and parameters

import shutil
import os
from PIL import Image

### copy files from upload folder

In [14]:
for fname in os.listdir("../data/image/upload"):
    shutil.copy2(f"../data/image/upload/{fname}", f"../data/image/bronze/{fname}")
    print(f"File {fname} copied")

File Homemade-Pizza_EXPS_FT23_376_EC_120123_3.jpg copied
File Salami-pizza-hero.jpg copied
File download.jpeg copied
File Pizza-3007395.jpg copied
File Eq_it-na_pizza-margherita_sep2005_sml.jpg copied
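
Since ingestion runs daily, the copy step will usually see files it has already copied. A possible variant that skips them is sketched below; `copy_new_files` is a hypothetical helper, and treating "same file name already in the target" as "already ingested" is an assumption.

```python
import shutil
import tempfile
from pathlib import Path


def copy_new_files(src_dir: Path, dst_dir: Path) -> list:
    """Copy files from src_dir to dst_dir, skipping names already present."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(src_dir.iterdir()):
        dst = dst_dir / src.name
        # Ignore subfolders and anything that was copied on an earlier run
        if not src.is_file() or dst.exists():
            continue
        shutil.copy2(src, dst)
        copied.append(src.name)
    return copied


# Usage with a throwaway directory: the second run finds nothing new
with tempfile.TemporaryDirectory() as tmp:
    upload, bronze = Path(tmp) / "upload", Path(tmp) / "bronze"
    upload.mkdir()
    (upload / "a.jpg").write_bytes(b"a")
    (upload / "b.jpg").write_bytes(b"b")
    first_run = copy_new_files(upload, bronze)
    second_run = copy_new_files(upload, bronze)
    print(first_run, second_run)
```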


In [None]:
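def resize_max_side(img: Image.Image, max_side: int = 512) -> Image.Image:
    """Downscale img so its longer side is at most max_side, keeping the aspect ratio."""
    w, h = img.size
    # Images that already fit are returned unchanged (never upscaled)
    if max(w, h) <= max_side:
        return img
    scale = max_side / max(w, h)
    # Clamp to 1 px so extreme aspect ratios cannot produce a zero-sized side
    new_size = (max(1, int(w * scale)), max(1, int(h * scale)))
    return img.resize(new_size, Image.LANCZOS)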

In [12]:
# Load an image (replace with your file)
img = Image.open("../data/raw/pizza/11_pizza.jpg")

print("Original size:", img.size)

# Resize
resized = resize_max_side(img, max_side=512)
print("Resized size:", resized.size)

# Save the resized version
resized.save("../data/resized/pizza/11_pizza.jpg")

print("Resized image saved to ../data/resized/pizza/11_pizza.jpg")

Original size: (5472, 3648)
Resized size: (512, 341)
Resized image saved to ../data/resized/pizza/11_pizza.jpg
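
The reported sizes can be checked by hand: the longer side is 5472, so the scale is 512 / 5472 ≈ 0.0936, and the height becomes 3648 × 0.0936 ≈ 341.3, truncated to 341. The sketch below repeats that arithmetic without Pillow; `scaled_size` is a hypothetical helper that only mirrors the size calculation in `resize_max_side`.

```python
def scaled_size(w: int, h: int, max_side: int = 512):
    """Target size using the same truncating arithmetic as resize_max_side."""
    if max(w, h) <= max_side:
        return (w, h)  # already small enough, left unchanged
    scale = max_side / max(w, h)
    return (int(w * scale), int(h * scale))


large = scaled_size(5472, 3648)
small = scaled_size(400, 300)
print(large, small)
```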


In [13]:
from datasets import load_dataset
import os

# --- Settings ---
CLASSES = ["pizza", "sushi"]   # choose your two classes
N = 10                         # how many per class
OUT_DIR = "../data/raw"
# ----------------

# Load dataset
ds = load_dataset("ethz/food101", split="train")

# Label mapping
label_names = ds.features["label"].names

# Ensure output folders exist
for cls in CLASSES:
    os.makedirs(os.path.join(OUT_DIR, cls), exist_ok=True)

# Track how many saved per class
saved_counts = {cls: 0 for cls in CLASSES}

for idx, ex in enumerate(ds):
    label = label_names[ex["label"]]

    if label in CLASSES and saved_counts[label] < N:
        # Use dataset row index as unique, stable ID
        fname = f"{idx:06d}_{label}.jpg"   # e.g. 000123_pizza.jpg
        path = os.path.join(OUT_DIR, label, fname)

        # Save image (overwrites if already exists)
        ex["image"].save(path)

        saved_counts[label] += 1
        print("Saved", path)

    # Stop once all targets reached
    if all(saved_counts[c] >= N for c in CLASSES):
        break

print("Done! Saved:", saved_counts)


Saved ../data/raw/pizza/007500_pizza.jpg
Saved ../data/raw/pizza/007501_pizza.jpg
Saved ../data/raw/pizza/007502_pizza.jpg
Saved ../data/raw/pizza/007503_pizza.jpg
Saved ../data/raw/pizza/007504_pizza.jpg
Saved ../data/raw/pizza/007505_pizza.jpg
Saved ../data/raw/pizza/007506_pizza.jpg
Saved ../data/raw/pizza/007507_pizza.jpg
Saved ../data/raw/pizza/007508_pizza.jpg
Saved ../data/raw/pizza/007509_pizza.jpg


KeyboardInterrupt: 
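
The run above was interrupted after the pizza images were saved and before any sushi image appeared. A likely cause, stated here as an assumption rather than a verified diagnosis, is that the loop reads every row in order and the sushi rows sit much later in the split. One way around this is to pick the row indices from the label column first and load only those rows. The sketch below shows the selection step on a toy label list; `first_n_indices` is a hypothetical helper, and the commented lines on using it with the real dataset are untested.

```python
def first_n_indices(labels, wanted_ids, n):
    """Return {label_id: [row indices]} with the first n rows of each wanted label."""
    picked = {label_id: [] for label_id in wanted_ids}
    for idx, label_id in enumerate(labels):
        if label_id in picked and len(picked[label_id]) < n:
            picked[label_id].append(idx)
        # Stop scanning once every wanted label has n rows
        if all(len(rows) >= n for rows in picked.values()):
            break
    return picked


# Toy label column: 0 and 2 stand in for the ids of the two wanted classes
toy_labels = [0, 0, 0, 1, 1, 2, 2, 2]
picked = first_n_indices(toy_labels, wanted_ids=[0, 2], n=2)
print(picked)

# With the real dataset this might be used roughly as (untested assumption):
#   wanted = [label_names.index(c) for c in CLASSES]
#   picked = first_n_indices(ds["label"], wanted, N)
#   ex = ds[row_index]  # load just that one row, then ex["image"].save(path)
```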