# Custom Dataset for image classification
In this tutorial, we describe how to create and use a custom dataset for image classification, following https://www.kaggle.com/basu369victor/pytorch-tutorial-the-classification.

## Custom dataset
First, collect all the images you need for the dataset (preferably ~1000 images per category) and define the categories you want to classify.
At the end, the directory ***dataset*** containing all your images should have a structure like this:

![](images/structure_img.png)

where each image has been placed inside the subdirectory class_i corresponding to the class it belongs to.
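
As a quick sanity check of this layout, the sketch below builds a tiny throwaway directory tree and counts the files found in each class subdirectory. The helper name ***count_images_per_class***, the folder names class_0 and class_1, and the empty placeholder files are illustrative assumptions, not part of the original tutorial.

```python
import os
import tempfile


def count_images_per_class(root):
    """Return {class_name: number_of_files} for every subdirectory of root."""
    counts = {}
    for entry in sorted(os.listdir(root)):
        class_dir = os.path.join(root, entry)
        if os.path.isdir(class_dir):
            counts[entry] = len(os.listdir(class_dir))
    return counts


# Build a toy dataset directory (hypothetical class names, empty placeholder files).
demo_root = tempfile.mkdtemp()
for class_name, n_images in [("class_0", 3), ("class_1", 2)]:
    os.makedirs(os.path.join(demo_root, class_name))
    for i in range(n_images):
        open(os.path.join(demo_root, class_name, f"img_{i}.jpg"), "w").close()

demo_counts = count_images_per_class(demo_root)
print(demo_counts)
```

Running the same helper on your real dataset directory makes it easy to spot empty or badly unbalanced classes before training.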

In [1]:
import os
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.preprocessing import LabelEncoder
import torch
from torch.utils.data.sampler import SubsetRandomSampler
import torchvision.transforms as transforms

#define the device
device = torch.device('cpu')

In [3]:
image = []
labels = []
#path to the directory containing your dataset
data_path = '../dataset_imagerec/'
for file in os.listdir(data_path):
    if os.path.isdir(os.path.join(data_path, file)):
        for img in os.listdir(os.path.join(data_path, file)):
            image.append(img)
            labels.append(file)

# Create a DataFrame from the raw dataset. You can skip this step
# if you already have a csv file that contains the desired
# input and target values.
data = {'Images':image, 'labels':labels} 
data = pd.DataFrame(data) 
data.head()

lb = LabelEncoder()
data['encoded_labels'] = lb.fit_transform(data['labels'])
data.head()

# save the csv file inside the dataset directory
data.to_csv(os.path.join(data_path, 'dataframe.csv'), index=False)
# to reload the file later, run this command
#data = pd.read_csv(os.path.join(data_path, 'dataframe.csv'))
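
To make the meaning of the encoded_labels column concrete, here is a minimal sketch of what ***LabelEncoder*** does: it assigns each label the position of that label in the sorted list of unique labels, and ***inverse_transform*** maps the integers back to names. The label names below are made up for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical class names, standing in for the subdirectory names.
toy_labels = ["dog", "cat", "dog", "bird"]

encoder = LabelEncoder()
toy_encoded = encoder.fit_transform(toy_labels)

# classes_ holds the sorted unique labels; a label's code is its position there.
toy_classes = list(encoder.classes_)
toy_decoded = list(encoder.inverse_transform(toy_encoded))

print(toy_classes)        # ['bird', 'cat', 'dog'] (as NumPy strings)
print(list(toy_encoded))  # codes 2, 1, 2, 0
print(toy_decoded)
```

Keeping the fitted encoder around is useful later, when predicted class indices have to be turned back into readable class names.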

## Splitting of the dataset
The dataset needs to be split into a training set and a test set. Typically, 80% of the images are used for the training phase and the remaining 20% for the testing phase.

There are two ways to do this: from scratch, as in the cell below, or with the ***train_test_split*** function from ***scikit-learn*** (recommended).
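
The scikit-learn route could look like the following sketch. It runs on a small made-up DataFrame (the file names and the 10-row size are assumptions for illustration) and uses the stratify argument so that both parts keep the same class proportions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the DataFrame built earlier (hypothetical file names).
toy_data = pd.DataFrame({
    "Images": [f"img_{i}.jpg" for i in range(10)],
    "encoded_labels": [0] * 5 + [1] * 5,
})

# Hold out 20% of the rows; stratify keeps the class proportions in both parts.
toy_train, toy_test = train_test_split(
    toy_data,
    test_size=0.2,
    random_state=42,
    stratify=toy_data["encoded_labels"],
)

# The row labels of each part can serve as index lists for a sampler.
split_train_indices = toy_train.index.tolist()
split_test_indices = toy_test.index.tolist()
print(len(split_train_indices), len(split_test_indices))  # 8 2
```

Index lists obtained this way can be passed to ***SubsetRandomSampler*** in the same way as the train_indices and val_indices built from scratch.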

In [4]:
batch_size = 128
validation_split = .2
shuffle_dataset = True
random_seed = 42

dataset_size = len(data)
indices = list(range(dataset_size))
split = int(np.floor(validation_split * dataset_size))
if shuffle_dataset:
    np.random.seed(random_seed)
    np.random.shuffle(indices)
train_indices, val_indices = indices[split:], indices[:split]

# Creating PyTorch data samplers (the loaders are created later on):
train_sampler = SubsetRandomSampler(train_indices)
test_sampler = SubsetRandomSampler(val_indices)

## Image preparation
After collecting the images, you need to apply some transformations to them so that they can be used during the training and testing phases.

- ***Transforms*** are common image transformations that can be chained together using ***Compose***.
- You need to convert a PIL Image or numpy.ndarray to a tensor using ***transforms.ToTensor()***. It converts a PIL Image or numpy.ndarray (H x W x C) in the range [0, 255] to a torch.FloatTensor of shape (C x H x W) in the range [0.0, 1.0], provided that the PIL Image belongs to one of the modes (L, LA, P, I, F, RGB, YCbCr, RGBA, CMYK, 1) or the numpy.ndarray has dtype = np.uint8.
- The tensor images should be ***normalized*** with mean and standard deviation. Given mean: (M1,...,Mn) and std: (S1,...,Sn) for n channels, the transformation ***transforms.Normalize*** normalizes each channel of the input tensor, i.e. input[channel] = (input[channel] - mean[channel]) / std[channel].
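
The arithmetic behind these two steps can be checked by hand. The sketch below reproduces it with plain NumPy on a made-up 2 x 2 RGB image; it mimics what ToTensor and Normalize compute rather than calling torchvision itself, and the pixel values are arbitrary.

```python
import numpy as np

# A fake 2 x 2 RGB image in H x W x C layout with uint8 values (made-up data).
hwc_image = np.array(
    [[[0, 128, 255], [255, 0, 0]],
     [[0, 255, 0], [0, 0, 255]]],
    dtype=np.uint8,
)

# What ToTensor computes: move channels first and scale [0, 255] to [0.0, 1.0].
chw_image = hwc_image.transpose(2, 0, 1).astype(np.float32) / 255.0

# What Normalize computes: (input[channel] - mean[channel]) / std[channel].
mean = np.array([0.5, 0.5, 0.5], dtype=np.float32).reshape(3, 1, 1)
std = np.array([0.5, 0.5, 0.5], dtype=np.float32).reshape(3, 1, 1)
normalized = (chw_image - mean) / std

print(chw_image.shape)  # (3, 2, 2)
print(normalized.min(), normalized.max())
```

With a mean and standard deviation of 0.5 per channel, a pixel value of 0 becomes -1.0 and a pixel value of 255 becomes 1.0, so the images end up in the range [-1.0, 1.0].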

Here is an example of a transformation that can be applied to the images of your dataset.

In [5]:
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

## Create custom dataset class
You now need to create a dataset class; an instance of it is passed as the first argument to ***torch.utils.data.DataLoader()***.

The skeleton of your custom dataset class should look like the one in the cell below. It must contain the following functions, which the data loader uses later on.
- ***__init__()*** is where the initial logic happens, such as reading a csv, assigning transforms, filtering data, etc.
- ***__len__()*** returns the number of examples (images) in the dataset.
- ***__getitem__()*** returns the data and the label for a given index. The data loader calls it on the dataset instance like this:

     img, label = dataset.__getitem__(99)  # same as dataset[99]: the item at index 99

In [6]:
from torch.utils.data.dataset import Dataset

class MyCustomDataset(Dataset):
    def __init__(self, args):
        # stuff
        self.args = args
        
    def __getitem__(self, index):
        # stuff
        return (img, label)

    def __len__(self):
        return count # of how many examples(images) you have
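
The skeleton relies on Python's indexing protocol: indexing an object calls its __getitem__ and len() calls its __len__. The sketch below shows this with a plain Python class and made-up in-memory samples, so it runs without PyTorch; the name ***ToyDataset*** and its contents are hypothetical, and a real dataset class would subclass ***Dataset*** and load images from disk.

```python
class ToyDataset:
    """Plain-Python stand-in for a map-style dataset (no PyTorch needed)."""

    def __init__(self, samples, labels):
        # Initial logic: store the data that __getitem__ will serve.
        self.samples = samples
        self.labels = labels

    def __getitem__(self, index):
        # Return one (data, label) pair for the requested position.
        return self.samples[index], self.labels[index]

    def __len__(self):
        # Number of examples available.
        return len(self.samples)


toy_dataset = ToyDataset(samples=["a.jpg", "b.jpg", "c.jpg"], labels=[0, 1, 0])

# Indexing is syntactic sugar for calling __getitem__.
item_by_index = toy_dataset[1]
item_by_call = toy_dataset.__getitem__(1)
print(item_by_index, len(toy_dataset))  # ('b.jpg', 1) 3
```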

An example of how you can create this custom dataset class is the following (see also ***dataset/imagerec_dataset.py***):

In [7]:
from PIL import Image
from torch.utils.data.dataset import Dataset

class Imagerec_Dataset(Dataset):
    def __init__(self, img_data, img_path, transform=None):
        self.img_path = img_path    # root directory of the dataset
        self.img_data = img_data    # DataFrame with 'Images', 'labels' and 'encoded_labels'
        self.transform = transform  # optional torchvision transform

    def __len__(self):
        return len(self.img_data)

    def __getitem__(self, index):
        # images are stored as <img_path>/<label>/<file name>
        img_name = os.path.join(self.img_path, self.img_data.loc[index, 'labels'],
                                self.img_data.loc[index, 'Images'])
        image = Image.open(img_name)
        # force 3 channels, so that the 3-channel Normalize also works
        # on grayscale or RGBA files
        image = image.convert('RGB')
        image = image.resize((300, 300))
        label = torch.tensor(self.img_data.loc[index, 'encoded_labels'])
        if self.transform is not None:
            image = self.transform(image)
        else:
            image = transforms.ToTensor()(image)
        return image, label

After defining the class for your custom dataset, you can instantiate it and pass it to ***torch.utils.data.DataLoader()***, as shown in the following cell.

In [8]:
dataset = Imagerec_Dataset(data, data_path, transform)
train_loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size,
                                           sampler=train_sampler)
test_loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size,
                                          sampler=test_sampler)
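
To illustrate how the two loaders share one dataset while drawing from disjoint index lists, here is a self-contained sketch on a toy ***TensorDataset***. The tensor shapes, the batch size and all toy_ names are assumptions chosen for illustration; the real loaders behave the same way but yield image batches read from disk.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.sampler import SubsetRandomSampler

# Toy dataset: 10 fake 3 x 4 x 4 "images" whose label equals their index.
toy_images = torch.zeros(10, 3, 4, 4)
toy_targets = torch.arange(10)
toy_set = TensorDataset(toy_images, toy_targets)

# Disjoint index lists play the role of train_indices and val_indices.
toy_train_sampler = SubsetRandomSampler(list(range(2, 10)))
toy_test_sampler = SubsetRandomSampler([0, 1])

toy_train_loader = DataLoader(toy_set, batch_size=4, sampler=toy_train_sampler)
toy_test_loader = DataLoader(toy_set, batch_size=4, sampler=toy_test_sampler)

# Each iteration yields a batch of images and the matching batch of labels.
seen_train = []
for batch_images, batch_labels in toy_train_loader:
    seen_train.extend(batch_labels.tolist())

seen_test = []
for batch_images, batch_labels in toy_test_loader:
    seen_test.extend(batch_labels.tolist())

print(sorted(seen_train), sorted(seen_test))
```

Each sampler visits its own indices in random order, so no example appears in both the training and the testing batches.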