# Food Dataset Analysis (EDA)

### Suggestions / Things to Explore in EDA (both datasets):

Note: for each insight, explain what it tells us about the dataset and why it is significant.

- [ ] **Dataset directory and split integrity:** verify the expected Food-101 structure and examine the `/root/.cache/kagglehub/datasets/rkuo2000/uecfood256/versions/1/UECFOOD256` directory and its contents. Confirm class counts match expectations (1,000 images per class for Food-101)
- [ ] **Image resolutions and aspect ratios:** plot width / height histograms, aspect ratios, resolution scatter, and detect outliers
- [ ] **Brightness / contrast and dynamic range:** inspect pixel intensity histograms and per-image mean/std. Note any overly dark, blown-out, or low-contrast classes (to inform normalization)
- [ ] **Sharpness / blur and quality issues:** use Laplacian variance for blur scores to identify classes with many blurry images
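
The brightness/contrast and blur checks above can be prototyped without any imaging library. The following is a minimal, dependency-free sketch, assuming each image has already been decoded into a 2D grid of grayscale intensities (0-255) at least 3x3 in size. The helper names `brightness_stats` and `laplacian_variance` are illustrative and not part of this notebook. On real images, OpenCV's `cv2.Laplacian(gray, cv2.CV_64F).var()` is the usual and much faster way to get a comparable blur score.

```python
from statistics import mean, pstdev, pvariance


def brightness_stats(gray):
    """Return (mean, std) of pixel intensities for a 2D grayscale grid."""
    pixels = [p for row in gray for p in row]
    return mean(pixels), pstdev(pixels)


def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over interior pixels.

    Low values mean few sharp edges, which usually indicates blur.
    """
    h, w = len(gray), len(gray[0])
    responses = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
        - 4 * gray[y][x]
        for y in range(1, h - 1)
        for x in range(1, w - 1)
    ]
    return pvariance(responses)


# Two synthetic 6x6 "images": a hard checkerboard (sharp edges everywhere)
# and a flat gray patch (no detail at all).
sharp = [[255 if (x + y) % 2 else 0 for x in range(6)] for y in range(6)]
flat = [[128] * 6 for _ in range(6)]

for name, img in [("sharp", sharp), ("flat", flat)]:
    mu, sigma = brightness_stats(img)
    score = laplacian_variance(img)
    print(f"{name}: mean={mu:.1f} std={sigma:.1f} blur_score={score:.1f}")
```

Averaging these per-image values within each class folder gives a quick way to rank classes by darkness, contrast, or blur before deciding on a threshold.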


### UEC-Food256 Dataset
Things to consider while exploring the dataset:

- [ ] **Dataset directory names:** after download, the class folders are named by number (1-256). Rename each folder using the `category.txt` file, which stores each category's id and name
  - After renaming, check whether any folders share a name. If so, decide whether to merge them or keep them separate, and record the reasoning.
- [ ]
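
The duplicate-name check can be run on the id-to-name mapping before any folder is touched. The following is a minimal sketch, assuming a dict that maps each numeric id (as a string) to its cleaned folder name. The function `find_duplicate_names` and the sample entries are made up for illustration and are not taken from `category.txt`.

```python
from collections import defaultdict


def find_duplicate_names(id_to_name):
    """Group ids by folder name and keep only names claimed by 2+ ids."""
    name_to_ids = defaultdict(list)
    for id_str, name in id_to_name.items():
        name_to_ids[name].append(id_str)
    return {
        name: sorted(ids, key=int)
        for name, ids in name_to_ids.items()
        if len(ids) > 1
    }


# Hypothetical mapping; real values would come from category.txt.
sample = {"1": "rice", "2": "fried_rice", "3": "rice", "4": "egg_roll"}
duplicates = find_duplicate_names(sample)
print(duplicates)
```

An empty result means every id maps to a unique folder name, so no merge decision is needed.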

### Resources:
(may be helpful)
*   https://neptune.ai/blog/data-exploration-for-image-segmentation-and-object-detection
*   https://medium.com/@juanabascal78/exploratory-image-analysis-part-1-advanced-density-plots-19b255075dbd
*   https://www.datacamp.com/tutorial/seeing-like-a-machine-a-beginners-guide-to-image-analysis-in-machine-learning

## Import + Download Dataset

In [1]:
%pip install python-dotenv
%pip install roboflow


Collecting roboflow
  Downloading roboflow-1.2.12-py3-none-any.whl.metadata (9.7 kB)
Collecting idna==3.7 (from roboflow)
  Downloading idna-3.7-py3-none-any.whl.metadata (9.9 kB)
Collecting opencv-python-headless==4.10.0.84 (from roboflow)
  Downloading opencv_python_headless-4.10.0.84-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (20 kB)
Collecting pi-heif<2 (from roboflow)
  Downloading pi_heif-1.1.1-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (6.5 kB)
Collecting pillow-avif-plugin<2 (from roboflow)
  Downloading pillow_avif_plugin-1.5.2-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (2.1 kB)
Collecting filetype (from roboflow)
  Downloading filetype-1.2.0-py2.py3-none-any.whl.metadata (6.5 kB)
Downloading roboflow-1.2.12-py3-none-any.whl (91 kB)
Downloading idna-3.7-py3-none-any.whl (66 kB)

In [2]:
%pip install kagglehub



In [3]:
# RUN FOR UEC-FOOD256 DATASET

import kagglehub 
# Download latest version 
path = kagglehub.dataset_download("rkuo2000/uecfood256")
print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/rkuo2000/uecfood256?dataset_version_number=1...


100%|██████████| 3.94G/3.94G [03:04<00:00, 22.9MB/s]  

Extracting files...





Path to dataset files: /root/.cache/kagglehub/datasets/rkuo2000/uecfood256/versions/1


In [None]:
# # RUN FOR YUSUF FOOD DATASET

# from roboflow import Roboflow
# from dotenv import load_dotenv
# import os

# load_dotenv()  # loads variables from .env into the environment

# api_key = os.getenv("YF_API_KEY")

# rf = Roboflow(api_key=api_key) 
# project = rf.workspace("caretech").project("food-dataset-uj20h-w2s4m")
# version = project.version(1)
# dataset = version.download("yolov8")


loading Roboflow workspace...
loading Roboflow project...


In [4]:
import os

for subdir, dirs, files in os.walk(path):
    print(f"{subdir} → {len(files)} files")

/root/.cache/kagglehub/datasets/rkuo2000/uecfood256/versions/1 → 0 files
/root/.cache/kagglehub/datasets/rkuo2000/uecfood256/versions/1/UECFOOD256 → 2 files
/root/.cache/kagglehub/datasets/rkuo2000/uecfood256/versions/1/UECFOOD256/4 → 122 files
/root/.cache/kagglehub/datasets/rkuo2000/uecfood256/versions/1/UECFOOD256/200 → 118 files
/root/.cache/kagglehub/datasets/rkuo2000/uecfood256/versions/1/UECFOOD256/71 → 110 files
/root/.cache/kagglehub/datasets/rkuo2000/uecfood256/versions/1/UECFOOD256/171 → 105 files
/root/.cache/kagglehub/datasets/rkuo2000/uecfood256/versions/1/UECFOOD256/127 → 119 files
/root/.cache/kagglehub/datasets/rkuo2000/uecfood256/versions/1/UECFOOD256/56 → 141 files
/root/.cache/kagglehub/datasets/rkuo2000/uecfood256/versions/1/UECFOOD256/230 → 109 files
/root/.cache/kagglehub/datasets/rkuo2000/uecfood256/versions/1/UECFOOD256/35 → 116 files
/root/.cache/kagglehub/datasets/rkuo2000/uecfood256/versions/1/UECFOOD256/183 → 109 files
/root/.cache/kagglehub/datasets/rkuo20
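
The raw listing above is hard to scan across 256 classes. The following small sketch condenses it into per-class image counts, assuming the layout shown in the listing (one sub-folder per class directly under the dataset root). The extension list is an assumption, `count_images_per_class` is a hypothetical helper, and the demo runs on a throwaway temporary directory rather than the real dataset.

```python
import os
import tempfile
from statistics import mean

IMAGE_EXTS = (".jpg", ".jpeg", ".png")  # assumed image extensions


def count_images_per_class(root, exts=IMAGE_EXTS):
    """Map each immediate sub-folder of root to its number of image files."""
    counts = {}
    for entry in sorted(os.listdir(root)):
        class_dir = os.path.join(root, entry)
        if not os.path.isdir(class_dir):
            continue  # skip loose files such as category.txt
        counts[entry] = sum(
            1 for f in os.listdir(class_dir) if f.lower().endswith(exts)
        )
    return counts


# Demo on a temporary tree so the sketch runs without the dataset.
with tempfile.TemporaryDirectory() as demo_root:
    for class_name, n_images in [("1", 3), ("2", 5)]:
        os.makedirs(os.path.join(demo_root, class_name))
        for i in range(n_images):
            open(os.path.join(demo_root, class_name, f"{i}.jpg"), "w").close()
        # a non-image file that should not be counted
        open(os.path.join(demo_root, class_name, "notes.txt"), "w").close()
    open(os.path.join(demo_root, "category.txt"), "w").close()

    counts = count_images_per_class(demo_root)

print(counts)
print(
    f"min={min(counts.values())} "
    f"max={max(counts.values())} "
    f"mean={mean(counts.values()):.1f}"
)
```

Pointing the same function at the `UECFOOD256` folder would give the numbers needed for the class-count check in the EDA list.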

In [5]:
# rename uec food 256 directories

import os
import re
import shutil

DATA_ROOT = os.path.join(path, "UECFOOD256")
CATEGORY_TXT_PATH = os.path.join(path, "UECFOOD256","category.txt")


def sanitize_name(name: str) -> str:
  """
  make a filesystem-safe folder name
  - trim and lowercase
  - normalize curly apostrophes and long dashes
  - replace path separators, "&", and "+" with safe text
  - replace whitespace with underscores
  - drop every character except a-z, 0-9, "_", "-", "." and "'"
  """
  s = name.strip().lower()
  s = s.replace("’", "'")  # normalize curly apostrophes
  s = s.replace('"', '')
  s = s.replace("/", " ")  # avoid path separators
  s = s.replace("\\", " ")
  s = s.replace("&", " and ")
  s = s.replace("+", " plus ")
  s = s.replace("–", "-").replace("—", "-")  # dashes
  # replace whitespace with underscores
  s = re.sub(r"\s+", "_", s)
  # remove invalid chars (keep a-z0-9 _ - . ')
  s = re.sub(r"[^a-z0-9_\-\.']", "", s)
  # collapse underscores
  s = re.sub(r"_+", "_", s)
  # strip leading/trailing underscores or dots
  s = s.strip("._")
  # fall back if empty
  if not s:
    s = "unnamed"
  return s


def parse_category_txt(path: str) -> dict:
  """
  parse category.txt file
  @return dict mapping numeric id (str) -> sanitized_name
  """
  id_to_name = {}
  if not os.path.isfile(path):
    raise FileNotFoundError(f"category.txt not found at {path}")
  with open(path, "r", encoding="utf-8") as f:
    for line in f:
      line = line.strip()
      if not line or line.startswith("#"):
        continue
      # Lines may be "id  name" with multiple spaces; first token is id, rest is name
      parts = re.split(r"\s+", line, maxsplit=1)
      if len(parts) != 2:
        # skip headers like "id  name"
        continue
      id_str, raw_name = parts
      if not id_str.isdigit():
        continue
      safe = sanitize_name(raw_name)
      id_to_name[id_str] = safe
  return id_to_name


def rename_dirs(root: str, id_to_name: dict, dry_run: bool = False) -> list:
  """
  Rename directories in root from numeric id to category name
  @return list of (old_path, new_path)
  """
  changes = []
  if not os.path.isdir(root):
    raise NotADirectoryError(f"Root path not found: {root}")

  # list only top-level directories
  for entry in os.listdir(root):
    old_path = os.path.join(root, entry)
    if not os.path.isdir(old_path):
      continue
    if not entry.isdigit():
      # already renamed, or not a numeric class folder; skip
      continue
    id_str = entry
    if id_str not in id_to_name:
      print(f"Warning: id {id_str} not found in category.txt. Skipping.")
      continue
    base_name = id_to_name[id_str]
    new_name = base_name
    new_path = os.path.join(root, new_name)

    # resolve collisions: if the target name is taken, append the id,
    # then a numeric suffix as a last resort
    # note: a dry run cannot reveal collisions between ids, because
    # nothing has been renamed on disk yet
    if os.path.exists(new_path):
      alt_name = f"{base_name}_{id_str}"
      alt_path = os.path.join(root, alt_name)
      if os.path.exists(alt_path):
        suffix = 2
        while True:
          candidate = f"{base_name}_{id_str}_{suffix}"
          candidate_path = os.path.join(root, candidate)
          if not os.path.exists(candidate_path):
            new_name = candidate
            new_path = candidate_path
            break
          suffix += 1
      else:
        new_name = alt_name
        new_path = alt_path

    if dry_run:
      print(f"[DRY RUN] Would rename: {old_path} -> {new_path}")
    else:
      os.rename(old_path, new_path)
      changes.append((old_path, new_path))
      print(f"Renamed: {old_path} -> {new_path}")
  print("changes: ", changes)
  return changes



print(f"Reading categories from: {CATEGORY_TXT_PATH}")
id_to_name = parse_category_txt(CATEGORY_TXT_PATH)
print(f"Parsed {len(id_to_name)} categories.")

# preview changes first
print("\nPreview (dry run):")
rename_dirs(DATA_ROOT, id_to_name, dry_run=True)

# if preview looks good, do actual rename
proceed = True
if proceed:
    print("\nApplying renames:")
    changes = rename_dirs(DATA_ROOT, id_to_name, dry_run=False)
    print(f"\nDone. Renamed {len(changes)} folders.")
else:
    print("\nNo changes applied.")

Reading categories from: /root/.cache/kagglehub/datasets/rkuo2000/uecfood256/versions/1/UECFOOD256/category.txt
Parsed 256 categories.

Preview (dry run):
[DRY RUN] Would rename: /root/.cache/kagglehub/datasets/rkuo2000/uecfood256/versions/1/UECFOOD256/4 -> /root/.cache/kagglehub/datasets/rkuo2000/uecfood256/versions/1/UECFOOD256/chicken-'n'-egg_on_rice
[DRY RUN] Would rename: /root/.cache/kagglehub/datasets/rkuo2000/uecfood256/versions/1/UECFOOD256/200 -> /root/.cache/kagglehub/datasets/rkuo2000/uecfood256/versions/1/UECFOOD256/rice_crispy_pork
[DRY RUN] Would rename: /root/.cache/kagglehub/datasets/rkuo2000/uecfood256/versions/1/UECFOOD256/71 -> /root/.cache/kagglehub/datasets/rkuo2000/uecfood256/versions/1/UECFOOD256/egg_roll
[DRY RUN] Would rename: /root/.cache/kagglehub/datasets/rkuo2000/uecfood256/versions/1/UECFOOD256/171 -> /root/.cache/kagglehub/datasets/rkuo2000/uecfood256/versions/1/UECFOOD256/moon_cake
[DRY RUN] Would rename: /root/.cache/kagglehub/datasets/rkuo2000/uecfood