# Raw (disk) datasets
So far, we have discussed training techniques for images that are stored in and accessed directly from memory. This works for small datasets, such as those with tiny images, or for datasets of larger images that contain fewer than 50,000 images. But once we start training with larger images and larger numbers of them, such as several hundred thousand, the dataset will likely be stored on disk. This subsection covers common conventions for storing images on disk and accessing them for training.

Beyond the curated datasets used for academic/research purposes, the datasets we use in production are likely stored on disk (or in a database, if the data is structured). In the case of image data, we need to do the following:
1. Read images and corresponding labels from disk into memory (assuming the image data fits into memory).
2. Resize the images to match the input shape of the CNN.
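
The two steps above can be sketched for a single image with Pillow. This is a minimal sketch under stated assumptions, not a definitive pipeline: the function name `load_and_resize`, the 128×128 default size, and the conversion to RGB are all choices made for illustration. In practice the target size comes from the input shape of your model.

```python
from PIL import Image


def load_and_resize(path, size=(128, 128)):
    """Read one image from disk and resize it to size, given as (width, height).

    The default size is an assumption for illustration; use the input
    shape that your model expects.
    """
    with Image.open(path) as img:
        # Convert to RGB so that every image has the same number of channels.
        img = img.convert("RGB")
        # resize() returns a new image and does not preserve the aspect ratio.
        return img.resize(size)
```

Note that `resize()` stretches the image to the requested size; if distortion matters for your data, you would crop or pad before resizing.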

## Directory structure
Placing images into a directory (folder) structure on a local disk is one of the most common layouts.

In this layout, shown in figure 4.19:
1. The root (parent) folder is a container for the dataset.
2. Below the root level are one or more subdirectories.
3. Each subdirectory corresponds to a class (label) and contains the images that belong to that class.

<img src="img_18.png">
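
With this layout, the class names and the label of each image can be recovered from the folder names alone. The following is a minimal sketch that uses only the standard library; the function names `list_classes` and `index_dataset` are hypothetical, and assigning integer labels in alphabetical order of the folder names is an assumption, not a requirement of the layout.

```python
import os


def list_classes(root):
    """Return the sorted names of the class subdirectories under root."""
    with os.scandir(root) as entries:
        return sorted(entry.name for entry in entries if entry.is_dir())


def index_dataset(root):
    """Return (classes, samples), where samples is a list of
    (file path, integer label) pairs, one per file in each class folder."""
    classes = list_classes(root)
    samples = []
    for label_index, class_name in enumerate(classes):
        class_dir = os.path.join(root, class_name)
        for name in sorted(os.listdir(class_dir)):
            path = os.path.join(class_dir, name)
            # Ignore anything that is not a regular file.
            if os.path.isfile(path):
                samples.append((path, label_index))
    return classes, samples
```

Indexing the paths first, without opening any image, keeps memory use low and lets you shuffle or split the samples before any pixel data is read.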

Alternatively, if the dataset has been previously split into training and test data, we’d first group the data by train/test, and then group the data by the two classes for cats and dogs, as depicted in figure 4.20.

<img src="img_19.png">

When the dataset is hierarchically labeled, each top-level class (label) subfolder is further partitioned into child subfolders according to the class (label) hierarchy. Using our cats and dogs example, each image is hierarchically labeled by whether it’s a cat or dog (species) and then by breed. See figure 4.21.

<img src="img_20.png"/>
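
For a hierarchical layout, the full label of an image is the chain of folder names between the root and the file. The sketch below shows one way to recover it with `pathlib`; the function name `index_hierarchy` is hypothetical, and it assumes that every file under the root is an image.

```python
from pathlib import Path


def index_hierarchy(root):
    """Return (path, labels) pairs, where labels is the tuple of folder
    names between root and the file, for example (species, breed)."""
    root = Path(root)
    samples = []
    # rglob("*") walks the whole tree; sorting makes the order repeatable.
    for path in sorted(root.rglob("*")):
        if path.is_file():
            labels = path.relative_to(root).parent.parts
            samples.append((path, labels))
    return samples
```

Because the labels come back as a tuple, the same function also handles the flat layout, where each tuple has a single element.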


## CSV files
Another common layout is to use a comma-separated values (CSV) file to identify the location and class (label) of each image.

In this case:
1. Each row in the CSV file describes a separate image.
2. The CSV file contains at least two columns, one for the location of the image and the other for the class (label) of the image.
3. The location might be a local path, a remote location, or the pixel data itself, embedded as the value of the location.


**Local path example:**

    label,location
    'cat',cats_n_dogs/cat/1.jpg
    'dog',cats_n_dogs/dog/2.jpg

**Remote path example:**

    label,location
    'cat','http://mysite.com/cats_n_dogs/cat/1.jpg'
    'dog','http://mysite.com/cats_n_dogs/dog/2.jpg'

**Embedded data example:**

    label,location
    'cat',[[...],[...],[...]]
    'dog',[[...], [...], [...]]
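
A CSV file of the local-path or remote-path kind can be read with Python's built-in `csv` module. This is a minimal sketch: the function name `read_csv_manifest` is hypothetical, and it assumes the header row is `label,location` and that values may be wrapped in single quotes, as in the examples above. It does not handle the embedded-data case, whose values contain commas and would need their own parsing.

```python
import csv


def read_csv_manifest(csv_path):
    """Return a list of (location, label) pairs from a CSV file."""
    samples = []
    with open(csv_path, newline="") as f:
        # DictReader takes the column names from the first row.
        # quotechar="'" strips single quotes around values, and
        # skipinitialspace=True tolerates a space after a comma.
        reader = csv.DictReader(f, quotechar="'", skipinitialspace=True)
        for row in reader:
            samples.append((row["location"], row["label"]))
    return samples
```

The function returns locations only; whether each one is then opened from disk or downloaded is a separate step.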

## JSON file
Another common layout is to use a JavaScript Object Notation (JSON) file to identify the location and class (label) of each image. In this case, the JSON file is an array of objects; each object describes a separate image and has at least two keys, one for the location of the image and the other for the class (label) of the image.

The location might be a local path, a remote location, or pixel data embedded as the value of the location. Here is a local path example:

    [
    {"label": "cat", "location": "cats_n_dogs/cat/1.jpg"},
    {"label": "dog", "location": "cats_n_dogs/dog/2.jpg"}
    …
    ]
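
Reading such a file takes only a few lines with the built-in `json` module. The sketch below uses a hypothetical function name, `read_json_manifest`, and assumes the file is valid JSON (strings in double quotes) whose top level is an array of objects with `label` and `location` keys.

```python
import json


def read_json_manifest(json_path):
    """Return a list of (location, label) pairs from a JSON file."""
    with open(json_path) as f:
        records = json.load(f)
    # Each record is one image; a missing key raises KeyError.
    return [(record["location"], record["label"]) for record in records]
```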

## Reading images
When training on an on-disk dataset, the first step is to read an image from disk into memory. The image on disk will be in an image format such as JPG, PNG, or TIF. These formats define how the image is encoded and compressed for storage. An image can be read into memory by using the `Image.open()` function from the PIL (Pillow) library.
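
As a minimal sketch of reading a single image, the hypothetical helper below opens a file and reports what was read. `Image.open()` identifies the file and reads its header but decodes the pixel data lazily, so the sketch calls `load()` to force the decode.

```python
from PIL import Image


def describe_image(path):
    """Open an image and return its format, size as (width, height), and mode."""
    with Image.open(path) as img:
        # Force the pixel data to be decoded into memory now.
        img.load()
        return img.format, img.size, img.mode
```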

In practice, you will have many images that need to be read in. Let’s assume you want to read in all the images under a subdirectory (for example, cats). In the following code, we scan (get a list of) all the files in the subdirectory, read each one in as an image, and collect the images in a list:

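    from PIL import Image
    import os

    def loadImages(subdir):
        """Read every file in subdir as an image and return the images as a list."""
        images = []
        with os.scandir(subdir) as entries:
            for entry in entries:
                # Skip nested directories and pass the path string to Image.open().
                if entry.is_file():
                    images.append(Image.open(entry.path))
        return images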

Let’s expand on the preceding example and assume that the image dataset is laid out as a directory structure; each subdirectory is a class (label). In this case, we would want to scan each subdirectory separately and keep a record of the subdirectory names for the classes: