<a href="https://colab.research.google.com/github/DietzscheNostoevsky/Learning_Pytorch/blob/main/05_Pytorch_Going_Modular.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PyTorch Going Modular

Going modular involves turning notebook code (from a Jupyter Notebook or Google Colab notebook) into a series of different Python scripts that offer similar functionality.

For example, we could turn our notebook code from a series of cells into the following Python files:

* `data_setup.py` - a file to prepare and download data if needed.  
* `engine.py` - a file containing various training functions.  
* `model_builder.py` or `model.py` - a file to create a PyTorch model.  
* `train.py` - a file to leverage all other files and train a target PyTorch model.
* `utils.py` - a file dedicated to helpful utility functions.
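
As a minimal sketch of what going modular means in practice, the snippet below writes a small helper function to a standalone file and imports it back, much as `train.py` would import from `utils.py`. The file name `demo_utils.py`, the function `count_lines`, and the temporary directory are hypothetical and exist only for illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical module source: in a notebook this would be the body of a cell.
MODULE_SOURCE = '''
def count_lines(text):
    """Return the number of lines in a string."""
    return len(text.splitlines())
'''

# Write the "cell" out as a Python file in a throwaway directory.
module_dir = Path(tempfile.mkdtemp())
(module_dir / "demo_utils.py").write_text(MODULE_SOURCE)

# Make the directory importable, then import the new module like any other script.
sys.path.insert(0, str(module_dir))
importlib.invalidate_caches()
demo_utils = importlib.import_module("demo_utils")

line_count = demo_utils.count_lines("first\nsecond\nthird")
print(line_count)
```

In Colab or Jupyter, the `%%writefile` cell magic is a common way to save a cell's contents to a file instead of writing the source by hand.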


**Production code** is code that runs to offer a service to someone or something.

# 1. Getting Data

In [4]:
# Helper function to see what is inside the directory 
# using os.walk()
import os


def walk_through_dir(dir_path):
  """
  Walks through dir_path and prints a summary of its contents.

  Args:
    dir_path (str or pathlib.Path): target directory

  Returns:
    None. For dir_path and each directory below it, prints:
      the number of subdirectories it contains
      the number of files (images) it contains
      its path
  """
  for dirpath, dirnames, filenames in os.walk(dir_path):
    print(f"There are {len(dirnames)} directories and {len(filenames)} images in '{dirpath}'.")
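
Because `walk_through_dir` only prints, its result cannot be reused or checked in code. The variant below is a sketch that returns the counts as a dictionary instead; the name `count_files_per_dir` and the throwaway directory tree are assumptions made for this example.

```python
import os
import tempfile
from pathlib import Path


def count_files_per_dir(dir_path):
    """Return {directory path: number of files directly inside it} for dir_path and all subdirectories."""
    counts = {}
    for dirpath, _dirnames, filenames in os.walk(dir_path):
        counts[dirpath] = len(filenames)
    return counts


# Build a tiny throwaway directory tree to demonstrate (hypothetical file names).
root = Path(tempfile.mkdtemp()) / "demo_images"
for class_name, n_files in [("pizza", 2), ("steak", 1)]:
    class_dir = root / "train" / class_name
    class_dir.mkdir(parents=True)
    for i in range(n_files):
        (class_dir / f"{i}.jpg").touch()

counts = count_files_per_dir(root)
for path, n in sorted(counts.items()):
    print(f"{n} files in '{path}'")
```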

### Smaller Dataset

In [5]:
import os
import zipfile

from pathlib import Path

import requests

# Setup path to data folder
data_path = Path("data/")
image_path = data_path / "pizza_steak_sushi"

# Create the image folder if it doesn't exist yet
if image_path.is_dir():
    print(f"{image_path} directory exists.")
else:
    print(f"Did not find {image_path} directory, creating one...")
    image_path.mkdir(parents=True, exist_ok=True)
    
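# Download pizza, steak, sushi data (runs on every execution, even if the folder already exists)
with open(data_path / "pizza_steak_sushi.zip", "wb") as f:
    print("Downloading pizza, steak, sushi data...")
    response = requests.get("https://github.com/mrdbourke/pytorch-deep-learning/raw/main/data/pizza_steak_sushi.zip")
    response.raise_for_status()  # fail loudly instead of saving an error page as the zip file
    f.write(response.content)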

# Unzip pizza, steak, sushi data
with zipfile.ZipFile(data_path / "pizza_steak_sushi.zip", "r") as zip_ref:
    print("Unzipping pizza, steak, sushi data...") 
    zip_ref.extractall(image_path)

# Remove zip file
os.remove(data_path / "pizza_steak_sushi.zip")
walk_through_dir(image_path)

data/pizza_steak_sushi directory exists.
Downloading pizza, steak, sushi data...
Unzipping pizza, steak, sushi data...
There are 2 directories and 0 images in 'data/pizza_steak_sushi'.
There are 3 directories and 0 images in 'data/pizza_steak_sushi/train'.
There are 0 directories and 72 images in 'data/pizza_steak_sushi/train/sushi'.
There are 0 directories and 75 images in 'data/pizza_steak_sushi/train/steak'.
There are 0 directories and 78 images in 'data/pizza_steak_sushi/train/pizza'.
There are 3 directories and 0 images in 'data/pizza_steak_sushi/test'.
There are 0 directories and 31 images in 'data/pizza_steak_sushi/test/sushi'.
There are 0 directories and 19 images in 'data/pizza_steak_sushi/test/steak'.
There are 0 directories and 25 images in 'data/pizza_steak_sushi/test/pizza'.
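
The output above shows the download running even though the directory already exists. One way to skip the work when the data is already present is sketched below. The function `ensure_dataset` and its `fetch_zip_bytes` argument are hypothetical, and `fake_fetch` builds a small archive in memory so the example runs without network access; in the notebook, the callable would wrap the `requests.get(...)` call.

```python
import io
import tempfile
import zipfile
from pathlib import Path


def ensure_dataset(image_path, fetch_zip_bytes):
    """Extract the archive into image_path unless image_path already exists.

    fetch_zip_bytes is a hypothetical zero-argument callable that returns the
    zip archive as bytes (for example, a wrapper around requests.get(...).content).
    Returns True if data was fetched, False if the existing folder was reused.
    """
    image_path = Path(image_path)
    if image_path.is_dir():
        return False  # data already present, skip the download
    image_path.mkdir(parents=True)
    with zipfile.ZipFile(io.BytesIO(fetch_zip_bytes())) as zip_ref:
        zip_ref.extractall(image_path)
    return True


def fake_fetch():
    """Stand-in for a real download: builds a small zip archive in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("train/pizza/0.jpg", b"not a real image")
    return buffer.getvalue()


target = Path(tempfile.mkdtemp()) / "pizza_steak_sushi"
first_call = ensure_dataset(target, fake_fetch)   # folder missing: fetch and extract
second_call = ensure_dataset(target, fake_fetch)  # folder exists: nothing to do
print(first_call, second_call)
```

Note that this sketch treats any existing folder as complete, so a partially extracted dataset would need to be deleted by hand before re-running.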


### Larger Dataset

In [6]:
# Imports
import os
import zipfile
from pathlib import Path
import requests


# Mount the GDrive
from google.colab import drive
drive.mount('/content/drive')

# pathlib drops the trailing slash, so Path("data/") == Path("data").
# The slash is only a visual hint that this path is a directory, not a file.
data_path = Path("data/")

image_path_full = data_path / "pizza_steak_sushi_full"

if image_path_full.is_dir():
    print(f"{image_path_full} directory exists.")
else:
    print(f"Did not find {image_path_full} directory, creating one...")
    image_path_full.mkdir(parents=True, exist_ok=True)

# Zip archive stored on the mounted Google Drive
zip_data = "/content/drive/Othercomputers/My MacBook Air/GitHub/-Machine_Learning/Learning_Pytorch/pizza_steak_sushi_100_percent.zip"

with zipfile.ZipFile(zip_data, "r") as zip_ref:
    print("Unzipping pizza, steak, sushi data...")
    zip_ref.extractall(image_path_full)

# Setup Dirs
train_dir_full = image_path_full / "train"
test_dir_full = image_path_full / "test"

walk_through_dir(image_path_full)


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
data/pizza_steak_sushi_full directory exists.
Unzipping pizza, steak, sushi data...
There are 2 directories and 0 images in 'data/pizza_steak_sushi_full'.
There are 3 directories and 0 images in 'data/pizza_steak_sushi_full/train'.
There are 0 directories and 750 images in 'data/pizza_steak_sushi_full/train/sushi'.
There are 0 directories and 750 images in 'data/pizza_steak_sushi_full/train/steak'.
There are 0 directories and 750 images in 'data/pizza_steak_sushi_full/train/pizza'.
There are 3 directories and 0 images in 'data/pizza_steak_sushi_full/test'.
There are 0 directories and 250 images in 'data/pizza_steak_sushi_full/test/sushi'.
There are 0 directories and 250 images in 'data/pizza_steak_sushi_full/test/steak'.
There are 0 directories and 250 images in 'data/pizza_steak_sushi_full/test/pizza'.
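
The directory variables in the cell above are built with `pathlib`'s `/` operator. The short sketch below shows how those paths compose and how `pathlib` treats the trailing slash in `"data/"`; it touches no files, and `same_path` is a name introduced only for this example.

```python
from pathlib import Path

data_path = Path("data/")
image_path_full = data_path / "pizza_steak_sushi_full"

same_path = data_path == Path("data")       # trailing slash is normalized away
train_dir_full = image_path_full / "train"  # "/" joins path components

print(same_path)
print(train_dir_full.as_posix())
```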


# 2. Create Datasets and DataLoaders

A PyTorch Dataset and a PyTorch DataLoader are two different components used in PyTorch for handling and preparing data for training or inference.

1. PyTorch Dataset:
A PyTorch Dataset is an abstract class that represents a dataset and defines how individual samples are accessed. To create a custom map-style dataset, subclass `torch.utils.data.Dataset` and implement `__len__`, which returns the number of samples, and `__getitem__`, which returns the sample at a given index. A Dataset loads and preprocesses one sample at a time; it does not batch samples or load them in parallel.

2. PyTorch DataLoader:
A PyTorch DataLoader is an iterable that wraps a Dataset and yields mini-batches. It handles batching, shuffling, and parallel loading. With `num_workers` greater than 0 it starts that many worker processes to load samples in parallel, which can speed up data loading considerably; with `num_workers=0` loading happens in the main process. It collates individual samples into batches automatically, and with `shuffle=True` it reshuffles the data at the start of every epoch so the model does not see the samples in the same order each time.

In summary, a PyTorch Dataset is responsible for accessing and preprocessing the data, while a PyTorch DataLoader takes care of parallel data loading, batching, and shuffling. Together, they provide a convenient and efficient way to handle and process large datasets in PyTorch.
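
The split of responsibilities described above can be sketched with a toy example. `SquaresDataset` is a hypothetical dataset invented for illustration: it implements only `__len__` and `__getitem__`, and the `DataLoader` takes care of batching.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class SquaresDataset(Dataset):
    """Hypothetical toy dataset: sample i is the pair (i, i squared)."""

    def __init__(self, size):
        self.size = size

    def __len__(self):
        # Number of samples in the dataset
        return self.size

    def __getitem__(self, index):
        # Return one (feature, label) sample for the given index
        x = torch.tensor([float(index)])
        y = torch.tensor([float(index ** 2)])
        return x, y


dataset = SquaresDataset(size=10)

# num_workers=0 loads in the main process; shuffle=False keeps the order fixed
loader = DataLoader(dataset, batch_size=4, shuffle=False, num_workers=0)

batch_shapes = [tuple(x.shape) for x, _ in loader]
print(len(dataset), len(loader), batch_shapes)
```

With 10 samples and a batch size of 4, the loader yields three batches, and the last one holds the 2 leftover samples.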

In [14]:
# Setting train and test directory 

train_dir = image_path / "train"
test_dir = image_path / "test"

# Change here to use full dataset 

#train_dir = image_path_full / "train"
#test_dir = image_path_full / "test"

In [15]:
from torchvision import datasets, transforms

# Create simple transform
data_transform = transforms.Compose([ 
    transforms.Resize((64, 64)),
    transforms.ToTensor(),
])

# Use ImageFolder to create dataset(s)
train_data = datasets.ImageFolder(root=train_dir, # target folder of images
                                  transform=data_transform, # transforms to perform on data (images)
                                  target_transform=None) # transforms to perform on labels (if necessary)

test_data = datasets.ImageFolder(root=test_dir, 
                                 transform=data_transform)

print(f"Train data:\n{train_data}\nTest data:\n{test_data}")

Train data:
Dataset ImageFolder
    Number of datapoints: 225
    Root location: data/pizza_steak_sushi/train
    StandardTransform
Transform: Compose(
               Resize(size=(64, 64), interpolation=bilinear, max_size=None, antialias=warn)
               ToTensor()
           )
Test data:
Dataset ImageFolder
    Number of datapoints: 75
    Root location: data/pizza_steak_sushi/test
    StandardTransform
Transform: Compose(
               Resize(size=(64, 64), interpolation=bilinear, max_size=None, antialias=warn)
               ToTensor()
           )
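
`ImageFolder` infers labels from the folder layout: each subdirectory of `root` is treated as one class, and the class names are sorted to assign integer indices. The function below is a standard-library sketch of that idea, not torchvision's own code, and `find_classes_sketch` is a hypothetical name.

```python
import os
import tempfile
from pathlib import Path


def find_classes_sketch(directory):
    """Map each subdirectory name (sorted alphabetically) to an integer label."""
    class_names = sorted(entry.name for entry in os.scandir(directory) if entry.is_dir())
    class_to_idx = {name: idx for idx, name in enumerate(class_names)}
    return class_names, class_to_idx


# Throwaway folder laid out like data/pizza_steak_sushi/train
train_root = Path(tempfile.mkdtemp()) / "train"
for name in ["sushi", "steak", "pizza"]:
    (train_root / name).mkdir(parents=True)

class_names, class_to_idx = find_classes_sketch(train_root)
print(class_names)
print(class_to_idx)
```

On the real dataset, the same information is available as `train_data.classes` and `train_data.class_to_idx`.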


In [16]:
# Turn train and test Datasets into DataLoaders
from torch.utils.data import DataLoader

train_dataloader = DataLoader(dataset=train_data, 
                              batch_size=1, # how many samples per batch?
                              num_workers=1, # how many subprocesses to use for data loading? (0 = load in the main process)
                              shuffle=True) # reshuffle the training data every epoch

test_dataloader = DataLoader(dataset=test_data, 
                             batch_size=1, 
                             num_workers=1, 
                             shuffle=False) # don't usually need to shuffle testing data

train_dataloader, test_dataloader

(<torch.utils.data.dataloader.DataLoader at 0x7fab137fa860>,
 <torch.utils.data.dataloader.DataLoader at 0x7fab137f8fd0>)
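
To see what a DataLoader actually yields, it helps to pull one batch and inspect its shape. The sketch below uses zero-filled stand-in tensors with the same per-sample shape that `Resize((64, 64))` followed by `ToTensor()` produces for RGB images, plus a batch size of 32 chosen purely for illustration, so it runs without the image files. `fake_train_data` is a hypothetical substitute for `train_data`.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data shaped like the resized images: 225 samples of [3, 64, 64]
images = torch.zeros(225, 3, 64, 64)
labels = torch.zeros(225, dtype=torch.long)
fake_train_data = TensorDataset(images, labels)

loader = DataLoader(fake_train_data, batch_size=32, shuffle=True, num_workers=0)

# Each iteration yields a (images, labels) pair with a leading batch dimension
image_batch, label_batch = next(iter(loader))
num_batches = len(loader)
print(image_batch.shape, label_batch.shape, num_batches)
```

With 225 samples and a batch size of 32 there are 8 batches per epoch, the last of which contains a single sample. With `batch_size=1`, as in the cell above, there would be 225 batches.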