<a href="https://colab.research.google.com/github/Parishri07/Learning-PyTorch/blob/main/Custom_data_creation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PyTorch Custom Data Creation (from Food101)

I'll get the images for 3 classes (pizza, steak, sushi) and store them in their respective folders.


Want:

```
pizza_steak_sushi/
    train/
        pizza/
            image01.jpeg
            image02.jpeg
            ...
        steak/
            image04.jpeg
            image05.jpeg
            ...
        sushi/
            image07.jpeg
            ...
    test/
        pizza/
            image101.jpeg
            image102.jpeg
            ...
        steak/
            image104.jpeg
            image105.jpeg
            ...
        sushi/
            image107.jpeg
            ...
```
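
In this layout the folder name doubles as the class label, which is the convention that folder-based loaders such as `torchvision.datasets.ImageFolder` expect. As a minimal sketch using only the standard library (the helper `list_classes` is hypothetical and not part of the original notebook), class names and integer labels can be read back from one split folder like this:

```python
import tempfile
from pathlib import Path


def list_classes(split_dir):
    """Return sorted class names and a name -> index mapping for one split folder."""
    split_dir = Path(split_dir)
    # Each immediate subdirectory is treated as one class
    class_names = sorted(p.name for p in split_dir.iterdir() if p.is_dir())
    class_to_idx = {name: idx for idx, name in enumerate(class_names)}
    return class_names, class_to_idx


# Build a tiny throwaway copy of the layout (the image files are empty placeholders)
with tempfile.TemporaryDirectory() as tmp:
    train_dir = Path(tmp) / "pizza_steak_sushi" / "train"
    for class_name in ["pizza", "steak", "sushi"]:
        (train_dir / class_name).mkdir(parents=True)
        (train_dir / class_name / "image01.jpeg").touch()

    class_names, class_to_idx = list_classes(train_dir)
    print(class_names)
    print(class_to_idx)
```

Sorting the names keeps the label indices stable between runs, regardless of the order in which the filesystem lists the folders.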

In [1]:
import torch
import torchvision

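# Check PyTorch and torchvision versions and warn if they're too old
# (compare (major, minor) tuples so a 2.x release isn't mistaken for an old 1.x one)
torch_version = tuple(int(part) for part in torch.__version__.split(".")[:2])
if torch_version < (1, 11):
    print("Warning: PyTorch version is less than 1.11.0. Food101 dataset might not work properly.")
else:
    print("PyTorch version is good!")

torchvision_version = tuple(int(part) for part in torchvision.__version__.split(".")[:2])
if torchvision_version < (0, 12):
    print("Warning: torchvision version is less than 0.12.0. Food101 dataset might not work properly.")
else:
    print("torchvision version is good!")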

import torchvision.datasets as datasets
import torchvision.transforms as transforms

# Setup data directory
import pathlib
data_dir = pathlib.Path("../data")

PyTorch version is good!
torchvision version is good!




## Download data

Get the Food101 dataset from PyTorch.
* Food101 in `torchvision.datasets` - https://pytorch.org/vision/stable/generated/torchvision.datasets.Food101.html
* Original Food101 dataset - https://data.vision.ee.ethz.ch/cvl/datasets_extra/food-101/

> **Note:** Downloading the dataset from PyTorch may take ~10-15 minutes depending on your internet speed. It will download ~5GB of data to the specified `root` directory.
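
Since the download is large, it can be worth checking free disk space first. The following is a rough sketch, not part of the original notebook: `has_free_space` is a hypothetical helper, the 5 GB threshold is just the approximate figure from the note above, and the extracted files may need additional room.

```python
import shutil
from pathlib import Path

REQUIRED_GB = 5  # approximate download size quoted in the note; treat as a rough figure


def has_free_space(path, required_gb=REQUIRED_GB):
    """Return True if the filesystem holding `path` has at least `required_gb` GB free."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (all in bytes)
    return usage.free >= required_gb * 1024**3


# The current working directory stands in for the `root` directory here
print(has_free_space(Path.cwd()))
```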

In [None]:
# Get training data
train_data = datasets.Food101(root=data_dir,
                              split="train",
                              # transform=transforms.ToTensor(),
                              download=True)

# Get testing data
test_data = datasets.Food101(root=data_dir,
                             split="test",
                             # transform=transforms.ToTensor(),
                             download=True)

In [None]:
train_data

In [None]:
class_names = train_data.classes
class_names[:10]

In [None]:
# View first sample (PIL Image format)
print(class_names[train_data[0][1]])
train_data[0][0]

## Find subset of appropriate classes

Want: Steak, pizza, sushi.

Current path setup:

```
../data/food-101/images/CLASS_NAME/IMAGES.jpg
```

The plan is to build a list of image filenames for the target classes (`pizza`, `steak`, `sushi`) and then copy those images to separate folders.

I'd like a random 10% of the target-class images from both the training and testing splits.
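
The filter-then-sample idea can be tried without the dataset. This is a minimal sketch on made-up labels: the `class_name/image_id` line format is assumed to match the Food101 label files, and `sample_labels` and the extra `ramen` class are hypothetical.

```python
import random

# Toy stand-in for label lines of the form "class_name/image_id"; the IDs are made up
labels = [f"{cls}/{i}" for cls in ["pizza", "steak", "sushi", "ramen"] for i in range(50)]
target_classes = ["pizza", "steak", "sushi"]


def sample_labels(labels, target_classes, amount=0.1, seed=42):
    """Keep labels from target_classes, then return a reproducible random fraction."""
    kept = [label for label in labels if label.split("/")[0] in target_classes]
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    return rng.sample(kept, k=round(amount * len(kept)))


subset = sample_labels(labels, target_classes, amount=0.1)
print(len(subset))  # 10% of the 150 kept labels
```

Because the sample is drawn from the pooled labels of all target classes, the number of images per class is only approximately equal.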

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
# Get a random subset of the target-class images
import random

# Setup data paths (Food101 images live in ../data/food-101/images/CLASS_NAME/)
data_path = data_dir / "food-101" / "images"
target_classes = ["pizza", "steak", "sushi"]

# Change amount of data to get (e.g. 0.1 = random 10%, 0.2 = random 20%)
amount_to_get = 0.2

# Create function to separate a random amount of data
def get_subset(image_path=data_path,
               data_splits=("train", "test"),
               target_classes=("pizza", "steak", "sushi"),
               amount=0.1,
               seed=42):
    random.seed(seed)
    label_splits = {}

    # Get labels
    for data_split in data_splits:
        print(f"[INFO] Creating image split for: {data_split}...")
        # Food101 lists each split's images as "CLASS_NAME/IMAGE_ID" lines in meta/train.txt and meta/test.txt
        label_path = pathlib.Path(image_path).parent / "meta" / f"{data_split}.txt"
        with open(label_path, "r") as f:
            labels = [line.strip("\n") for line in f.readlines() if line.split("/")[0] in target_classes]

        # Get random subset of target classes image ID's
        number_to_sample = round(amount * len(labels))
        print(f"[INFO] Getting random subset of {number_to_sample} images for {data_split}...")
        sampled_images = random.sample(labels, k=number_to_sample)

        # Apply full paths
        image_paths = [pathlib.Path(str(image_path / sample_image) + ".jpg") for sample_image in sampled_images]
        label_splits[data_split] = image_paths
    return label_splits

label_splits = get_subset(target_classes=target_classes, amount=amount_to_get)
label_splits["train"][:10]

## Move training and testing images to dedicated folders

In [None]:
# Create target directory path
target_dir_name = f"../data/pizza_steak_sushi_{str(int(amount_to_get*100))}_percent"
print(f"Creating directory: '{target_dir_name}'")

# Setup the directories
target_dir = pathlib.Path(target_dir_name)

# Make the directories
target_dir.mkdir(parents=True, exist_ok=True)

In [None]:
import shutil

for image_split in label_splits.keys():
    for image_path in label_splits[str(image_split)]:
        dest_dir = target_dir / image_split / image_path.parent.stem / image_path.name
        if not dest_dir.parent.is_dir():
            dest_dir.parent.mkdir(parents=True, exist_ok=True)
        print(f"[INFO] Copying {image_path} to {dest_dir}...")
        shutil.copy2(image_path, dest_dir)
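
The destination path in the loop above is built from three pieces: the split name, the class name (the source image's parent folder), and the file name. A small sketch of just that mapping, with no copying, is shown below. `build_dest_path` is a hypothetical helper and the image ID is made up for illustration.

```python
from pathlib import Path


def build_dest_path(target_dir, image_split, image_path):
    """Map a source image path to target_dir/<split>/<class>/<file name>."""
    image_path = Path(image_path)
    # The parent folder name of the source image is its class name
    return Path(target_dir) / image_split / image_path.parent.name / image_path.name


# Hypothetical image ID, for illustration only
src = Path("../data/food-101/images/pizza/12345.jpg")
dest = build_dest_path("../data/pizza_steak_sushi_20_percent", "train", src)
print(dest.as_posix())
```

The sketch uses `.parent.name` rather than `.parent.stem`. The two give the same result for these class names, but `.stem` would drop anything after a final dot in a folder name.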

In [None]:
# Check lengths of directories
def walk_through_dir(dir_path):
  """
  Walks through dir_path, printing a summary of its contents.

  Args:
    dir_path (str or pathlib.Path): target directory

  Returns:
    None. Prints, for dir_path and each subdirectory:
      number of subdirectories
      number of images (files)
      name of the directory
  """
  import os
  for dirpath, dirnames, filenames in os.walk(dir_path):
    print(f"There are {len(dirnames)} directories and {len(filenames)} images in '{dirpath}'.")

walk_through_dir(target_dir)

Food101 has 750 training and 250 testing images per class, so a 10% subset gives roughly 75 training and 25 testing images per class. At 20% (the `amount_to_get` value set above), that becomes roughly 150 training and 50 testing images per class.

This should be enough for a starting dataset.

We can always increase the amount if needed.
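
To check per-class counts directly instead of reading them off the directory walk, something like the following could be used. It is a minimal sketch run on a throwaway folder of empty placeholder files, and `count_images_per_class` is a hypothetical helper, not part of the original notebook.

```python
import tempfile
from collections import Counter
from pathlib import Path


def count_images_per_class(split_dir, pattern="*.jpg"):
    """Count files matching `pattern` in each class subfolder of split_dir."""
    return Counter(path.parent.name for path in Path(split_dir).glob(f"*/{pattern}"))


# Demonstrate on a throwaway directory with empty placeholder files
with tempfile.TemporaryDirectory() as tmp:
    train_dir = Path(tmp) / "train"
    for class_name, n_images in [("pizza", 3), ("steak", 2), ("sushi", 4)]:
        (train_dir / class_name).mkdir(parents=True)
        for i in range(n_images):
            (train_dir / class_name / f"{i}.jpg").touch()

    counts = count_images_per_class(train_dir)
    print(dict(counts))
```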

## Zip up images folder to be more easily transported

In [None]:
# Zip pizza_steak_sushi images
zip_file_name = data_dir / f"pizza_steak_sushi_{str(int(amount_to_get*100))}_percent"
shutil.make_archive(zip_file_name,
                    format="zip",
                    root_dir=target_dir)
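
As a portable alternative to shelling out to `unzip`, the archive can be extracted with `shutil.unpack_archive`. The round trip below is a sketch on a throwaway folder; the folder and archive names are placeholders, not the notebook's real paths.

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)

    # Throwaway folder standing in for the image directory
    source_dir = tmp / "pizza_steak_sushi_demo"
    (source_dir / "train" / "pizza").mkdir(parents=True)
    (source_dir / "train" / "pizza" / "image01.jpg").touch()

    # make_archive appends ".zip" to the base name and returns the archive path
    archive_path = shutil.make_archive(str(tmp / "demo_archive"), format="zip", root_dir=source_dir)

    # Extract into a fresh folder and list what came back
    extract_dir = tmp / "extracted"
    shutil.unpack_archive(archive_path, extract_dir)
    extracted_files = sorted(
        p.relative_to(extract_dir).as_posix() for p in extract_dir.rglob("*") if p.is_file()
    )
    print(Path(archive_path).name, extracted_files)
```

Because `root_dir` is the image folder itself, the archive contains `train/...` at its top level rather than a wrapping folder, which is why the notebook unzips into a freshly created `pizza_steak_sushi` directory.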

In [None]:
!ls -la ../data/

In [None]:
!mkdir -p pizza_steak_sushi
!unzip ../data/pizza_steak_sushi_20_percent.zip -d pizza_steak_sushi

In [None]:
!ls ../data

In [None]:
walk_through_dir("pizza_steak_sushi")

In [None]:
# # Remove extra data (shutil.rmtree deletes a whole directory tree; os.remove only works on files)
# shutil.rmtree("pizza_steak_sushi")

In [None]:
!ls