# 3 Build a custom dataloader

In [None]:
from pathlib import Path
import numpy as np
from typing import Iterator, Tuple, List
import mads_datasets
mads_datasets.__version__

The problem with images is that their memory footprint grows quickly with the number of images.

In [None]:
image_size = (180, 180, 3)

for i in [1, 10, 100]:
    size = (i, ) + image_size
    X = np.zeros(size)
    size_byte = X.nbytes
    print(f"Size for {i} images: {size_byte / (2**20)} MB")
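
The sizes printed above are for `np.zeros`, which defaults to `float64` (8 bytes per value). Image files usually store 8-bit integers per channel, so the dtype alone changes the footprint by a factor of 8. A minimal sketch of that comparison; the `sizes_mb` dictionary is introduced here only for illustration:

```python
import numpy as np

image_size = (180, 180, 3)
n_images = 100

# memory footprint in MB per dtype, for the same batch of images
sizes_mb = {}
for dtype in (np.uint8, np.float32, np.float64):
    X = np.zeros((n_images,) + image_size, dtype=dtype)
    sizes_mb[np.dtype(dtype).name] = X.nbytes / 2**20
    print(f"{np.dtype(dtype).name}: {sizes_mb[np.dtype(dtype).name]:.2f} MB for {n_images} images")
```

A smaller dtype helps, but it only delays the problem: the footprint still grows linearly with the number of images.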

Imagine what would happen if you actually had a million images! And no, the answer to this
is not "just get more RAM in the cloud". You don't need to hold everything in memory at
the same time, so we will use the dataloader pattern to fix this problem.
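
To make the idea concrete, here is a minimal sketch of the pattern with toy data instead of images. The function name `batch_generator` and the integer "items" are hypothetical stand-ins; in a real dataloader each item would be a file path that is read from disk only when its batch is requested.

```python
from typing import Iterator, Sequence

import numpy as np


def batch_generator(items: Sequence[int], batch_size: int) -> Iterator[np.ndarray]:
    """Yield consecutive batches; only the current batch is materialized."""
    for start in range(0, len(items), batch_size):
        # stand-in for "load these files from disk"
        yield np.asarray(items[start : start + batch_size])


batch_lengths = []
for batch in batch_generator(list(range(10)), batch_size=4):
    batch_lengths.append(len(batch))

print(batch_lengths)  # [4, 4, 2]
```

Because the generator is lazy, memory use depends on the batch size rather than on the size of the full dataset.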

TensorFlow has a nice [collection of datasets](https://www.tensorflow.org/datasets) for machine learning tasks. Let's download the 'flower_photos' dataset. We will use that dataset for image classification later on.

In [None]:
from mads_datasets import DatasetFactoryProvider, DatasetType
flowersfactory = DatasetFactoryProvider.create_factory(DatasetType.FLOWERS)
flowersfactory.download_data()


In [None]:
image_folder = flowersfactory.subfolder
print(image_folder)

Let's build a datagenerator from scratch. Even though there are a lot of libraries (tensorflow, pytorch, trax) that provide datagenerators for images, it is useful practice to learn how they work on the inside.

Eventually you will encounter a task where you need to read in data from disk, and it always helps to know how to adapt to a custom case. The first step is to list all files in the directory:

In [None]:
def walk_dir(path: Path) -> Iterator[Path]:
    """loops recursively through a folder

    Args:
        path (Path): folder to loop through. If a directory
            is encountered, loop through that recursively.

    Yields:
        Path: all file paths in a folder and its subdirs.
    """

    for p in Path(path).iterdir():
        if p.is_dir():
            yield from walk_dir(p)
            continue
        # resolve works like .absolute(), but it removes the "../.." parts
        # of the location, so it is cleaner
        yield p.resolve()

Note that the first file is a .txt file, so we will need to filter that.

In [None]:
paths = walk_dir(image_folder)
file1 = next(paths)
file2 = next(paths)
file1, file2

So, we now have a generator of paths in the directory. We can use a path to load an image from disk.
A structure that is often used for storing images is to have subfolders that indicate a label.
This makes it easy for a human to create a dataset (just drag and drop the images into the right folder to label them).

If the photo is inside the `tulips` subfolder, the class label should be `tulips`.
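
With `pathlib`, the label can be read directly from the name of the parent folder. A small sketch; the helper `label_from_path` and the file name `example.jpg` are hypothetical and only serve as illustration:

```python
from pathlib import Path


def label_from_path(path: Path) -> str:
    """Return the name of the folder that directly contains the file."""
    return path.parent.name


# hypothetical file name, the path does not need to exist on disk
example = Path("flower_photos") / "tulips" / "example.jpg"
print(label_from_path(example))  # tulips
```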

In [None]:
from PIL import Image
file = next(paths)
img = Image.open(file)

In [None]:
img.show()

The `iter_valid_paths` function pulls all files, keeps only those with the correct suffixes (we only want images), retrieves the class names by gathering the names of the subfolders, and returns both.

In [None]:
# let's have a look at the available file types
from mads_datasets.settings import FileTypes
for ft in FileTypes:
    print(ft)

In [None]:
def iter_valid_paths(path: Path, formats: List[FileTypes]) -> Tuple[Iterator, List[str]]:
    # gets all files in folder and subfolders
    walk = walk_dir(path)

    # retrieves foldernames as classnames
    class_names = [subdir.name for subdir in path.iterdir() if subdir.is_dir()]

    # keeps only specified formats
    formats_ = [f.value for f in formats]
    paths = (p for p in walk if p.suffix in formats_)
    return paths, class_names

In [None]:
formats = [FileTypes.JPG]
paths, class_names = iter_valid_paths(
    path=image_folder / "flower_photos",
    formats=formats
)

In [None]:
next(paths), class_names

And, last, we need the `load_image` function.

There are multiple libraries available to load images (`pyvips`, `PIL`, `tensorflow`). Here we use `PIL` together with `numpy` for the following sequence of tasks:
- load image from disk
- decode into an array of numbers
- resize the image to a fixed size
- cast to `numpy` array

In [None]:
imgpath = next(paths)
newsize = (150, 150)
img_ = Image.open(imgpath).resize(newsize, Image.LANCZOS)

In [None]:
img = np.asarray(img_)
img.shape

In [None]:
def load_image(
    path: Path, image_size: Tuple[int, int]
) -> np.ndarray:
    # open the file and resize it with Lanczos resampling
    img_ = Image.open(path).resize(image_size, Image.LANCZOS)
    # convert the PIL image to a numpy array
    return np.asarray(img_)

In [None]:
%timeit load_image(file, image_size=(180, 180))

In [None]:
file = next(paths)
img = load_image(file, (180, 180))
type(img), img.shape
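
One detail is easy to miss with square sizes like `(180, 180)`: `PIL` expects the target size as `(width, height)`, while the resulting `numpy` array has shape `(height, width, channels)`. A minimal sketch with a blank in-memory image (created here only for illustration, so no file is needed):

```python
import numpy as np
from PIL import Image

# blank RGB image, 300 pixels wide and 200 pixels high
blank = Image.new("RGB", (300, 200))

# resize takes (width, height)
resized = blank.resize((150, 100))
arr = np.asarray(resized)

print(resized.size)  # (150, 100)
print(arr.shape)     # (100, 150, 3)
```

This matters as soon as the target size is not square.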

Let's have a look at the image we loaded.

In [None]:
Image.fromarray(img.astype(np.uint8))