

<h1>Data Preparation with PyTorch</h1>


<p>Crack detection is vital for structural health monitoring and inspection. We will train a neural network to detect cracks: images that contain cracks are labeled positive, and images without cracks are labeled negative. In this notebook, we will build a dataset object. There are five questions in this notebook, including some intermediate steps that help us build the dataset object.</p>

<h2>Table of Contents</h2>


1. Imports and Auxiliary Functions
2. Download Data
3. Examine Files
4. Question 1: Find the Number of Files
5. Assign Labels to Images
6. Question 2: Assign Labels to Images
7. Training and Validation Split
8. Question 3: Training and Validation Split
9. Create a Dataset Class
10. Question 4: Display the Training Dataset Object
11. Question 5: Display the Validation Dataset Object
<hr>

<h2 id="auxiliary">Imports and Auxiliary Functions</h2>


In [None]:
# Import libraries

import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

import os
import glob

import torch
from torch.utils.data import Dataset

import skillsnetwork 

In [None]:
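# Define function to plot a (tensor image, label) sample

def show_data(data_sample, shape = (28, 28)):
    plt.imshow(data_sample[0].numpy().reshape(shape), cmap = 'gray')
    # The label may be an int or a 0-dim tensor, so convert it before concatenating
    plt.title('y = ' + str(int(data_sample[1])))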

<h2 id="download_data">Download Data</h2>


In this section, you will download the data from IBM object storage using the **skillsnetwork.prepare** command, which downloads a zip file, unzips it, and stores it in a specified directory. Locally, we store the data in the directory **/resources/data**.


In [None]:
await skillsnetwork.prepare("https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DL0321EN/data/images/concrete_crack_images_for_classification.zip",
                            path = "/resources/data", overwrite = True)

<h2 id="examine_files">Examine Files </h2>


In the previous notebook, we created two lists; one to hold the path to the Negative files and one to hold the path to the Positive files. This process is shown in the following few lines of code.

In [None]:
# Obtain the list that contains the path to the negative files

directory = "/resources/data"
negative = 'Negative'
negative_file_path = os.path.join(directory, negative)
negative_files = [os.path.join(negative_file_path, file)
                  for file in os.listdir(negative_file_path)
                  if file.endswith('.jpg')]
negative_files.sort()

# Show first three path names
negative_files[0:3]

In [None]:
# Obtain the list that contains the path to the positive files

positive = 'Positive'
positive_file_path = os.path.join(directory, positive)
positive_files = [os.path.join(positive_file_path, file)
                  for file in os.listdir(positive_file_path)
                  if file.endswith('.jpg')]
positive_files.sort()

# Show first three path names
positive_files[0:3]
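
The listing pattern above can be tried without the real data. The following is a minimal sketch, assuming a throwaway directory with made-up file names; the helper `list_jpgs` is hypothetical and not part of the notebook. It shows that non-`.jpg` files are skipped and that sorting gives a reproducible order, since `os.listdir` returns entries in arbitrary order.

```python
import os
import tempfile

def list_jpgs(folder):
    """Return the sorted full paths of the .jpg files directly inside `folder`."""
    return sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if name.endswith('.jpg')
    )

# Hypothetical file names in a temporary directory, for illustration only
with tempfile.TemporaryDirectory() as tmp:
    for name in ['00002.jpg', 'notes.txt', '00001.jpg']:
        open(os.path.join(tmp, name), 'w').close()
    found = [os.path.basename(path) for path in list_jpgs(tmp)]

print(found)  # ['00001.jpg', '00002.jpg']
```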

<h2 id="Question_1">Question 1</h2>
<b>Find the combined length of the lists <code>positive_files</code> and <code>negative_files</code> using the function <code>len</code>. Then assign it to the variable <code>number_of_samples</code>.</b>


In [None]:
number_of_samples = len(positive_files) + len(negative_files)
number_of_samples

<h2 id="assign_labels">Assign Labels to Images </h2>


In this section, we assign a label to each image. Positive images, i.e. images with a crack, get the value one, and negative images, i.e. images without a crack, get the value zero. The labels are stored in the tensor <b>Y</b>.

In [None]:
# Create tensor or vector of zeros, each element corresponds to new sample
# Tensor length = number of samples

Y = torch.zeros([number_of_samples])

As we are using the tensor <b>Y</b> for classification, we cast it to a <code>LongTensor</code>.


In [None]:
# Cast Y to LongTensor

Y = Y.type(torch.LongTensor)
Y.type()

We set the elements at even indexes to class one and the elements at odd indexes to class zero.


In [None]:
Y[::2] = 1
Y[1::2] = 0
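
As a small illustration of the labeling steps above, here is a sketch on a toy tensor of six labels. The names `n_toy` and `toy_Y` are made up for this example. It creates the tensor with `dtype=torch.long` directly, which should be equivalent to creating a float tensor and casting it afterwards; integer class labels of this type are what losses such as `torch.nn.CrossEntropyLoss` expect as targets.

```python
import torch

n_toy = 6  # hypothetical sample count, for illustration only

# Create the labels as 64-bit integers in one step
toy_Y = torch.zeros(n_toy, dtype=torch.long)

# Even indexes are class one; odd indexes stay at class zero
toy_Y[::2] = 1

print(toy_Y.tolist())  # [1, 0, 1, 0, 1, 0]
print(toy_Y.dtype)     # torch.int64
```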

<h2 id="Question_2">Question 2</h2>
<b>Create a list <code>all_files</code> such that the even indexes contain the paths to the positive (cracked) images and the odd indexes contain the paths to the negative (uncracked) images.</b>


In [None]:
all_files = [None]*number_of_samples
all_files[::2] = positive_files
all_files[1::2] = negative_files
all_files[:4]
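
The interleaving can be checked on a toy example. This is a minimal sketch with made-up file names; the helper `interleave` is hypothetical. Assigning to an extended slice such as `merged[::2]` requires exactly as many items as the slice has slots, so the sketch assumes both lists are equally long and checks this up front.

```python
def interleave(positive, negative):
    """Place `positive` items at even indexes and `negative` items at odd indexes."""
    if len(positive) != len(negative):
        raise ValueError('both lists must have the same length')
    merged = [None] * (len(positive) + len(negative))
    merged[::2] = positive   # indexes 0, 2, 4, ...
    merged[1::2] = negative  # indexes 1, 3, 5, ...
    return merged

# Hypothetical file names, for illustration only
toy_positive = ['p0.jpg', 'p1.jpg', 'p2.jpg']
toy_negative = ['n0.jpg', 'n1.jpg', 'n2.jpg']
toy_all = interleave(toy_positive, toy_negative)

print(toy_all)
# ['p0.jpg', 'n0.jpg', 'p1.jpg', 'n1.jpg', 'p2.jpg', 'n2.jpg']
```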

In [None]:
# Display the first four samples with their labels

for y, file in zip(Y, all_files[0:4]):
    plt.imshow(Image.open(file))
    plt.title('y = ' + str(y.item()))
    plt.show()

<h2 id="split">Training and Validation Split</h2>
When training the model, we split our data into training and validation data. If the variable <code>train</code> is set to <code>True</code>, the following lines of code segment the tensor <b>Y</b> such that the first 30000 samples are used for training. If <code>train</code> is set to <code>False</code>, the remaining samples are used for validation.


In [None]:
train = False

if train:
    all_files = all_files[0:30000]
    Y = Y[0:30000]

else:
    all_files = all_files[30000:]
    Y = Y[30000:]
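
The slicing logic can be sketched on a toy list of ten samples. The function `split_samples` and the parameter `n_train` are hypothetical names chosen for this example. Note that an even split point, such as 30000 in the notebook or 6 here, means the validation split also starts with a positive sample, so the even/odd convention carries over.

```python
def split_samples(files, labels, n_train, train=True):
    """Return the first `n_train` samples for training, the rest for validation."""
    if train:
        return files[:n_train], labels[:n_train]
    return files[n_train:], labels[n_train:]

# Hypothetical interleaved samples: positives at even indexes, negatives at odd
toy_files = ['p0', 'n0', 'p1', 'n1', 'p2', 'n2', 'p3', 'n3', 'p4', 'n4']
toy_labels = [1, 0] * 5

train_files, train_labels = split_samples(toy_files, toy_labels, n_train=6, train=True)
val_files, val_labels = split_samples(toy_files, toy_labels, n_train=6, train=False)

print(len(train_files), len(val_files))  # 6 4
print(val_files[0], val_labels[0])       # p3 1
```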

<h2 id="Question_3">Question 3</h2>
Modify the above lines of code such that if the variable <code>train</code> is set to <code>True</code>, the first 30000 samples of <code>all_files</code> are used for training. If <code>train</code> is set to <code>False</code>, the remaining samples are used for validation. In both cases, reassign the values to the variable <code>all_files</code>, then use the following lines of code to display the first four validation sample images.


In [None]:
def split(train):

    Y = torch.zeros([number_of_samples])
    Y = Y.type(torch.LongTensor)
    Y[::2] = 1
    Y[1::2] = 0
    
    all_files = [None]*(number_of_samples)
    all_files[::2] = positive_files
    all_files[1::2] = negative_files

    if train:
        all_files = all_files[0:30000]
        Y = Y[0:30000]

    else:
        all_files = all_files[30000:]
        Y = Y[30000:]
        
    return all_files, Y

In [None]:
# Acquire validation dataset
all_files, Y = split(train = False)

# Print first four validation samples
for y, file in zip(Y, all_files[0:4]):
    plt.imshow(Image.open(file))
    plt.title('y = ' + str(y.item()))
    plt.show()

<h2 id="data_class">Create a Dataset Class</h2>


In this section, we will use the previous code to build a dataset class. 



Complete the code to build a Dataset class <code>Dataset</code>. As before, make sure the even samples are positive and the odd samples are negative. If the parameter <code>train</code> is set to <code>True</code>, use the first 30000 samples as training data; otherwise, use the remaining samples as validation data.


In [None]:
# Build a Dataset class dataset

class Dataset(Dataset):

    # Constructor
    def __init__(self, transform = None, train = True):
        directory = '/resources/data'
        positive = 'Positive'
        negative = 'Negative'

        positive_file_path = os.path.join(directory, positive)
        negative_file_path = os.path.join(directory, negative)
        
        positive_files = [os.path.join(positive_file_path, file) for file
                          in os.listdir(positive_file_path) if file.endswith('.jpg')]
        positive_files.sort()
        
        negative_files = [os.path.join(negative_file_path, file) for file
                          in os.listdir(negative_file_path) if file.endswith('.jpg')]
        negative_files.sort()

        # Compute the sample count locally instead of relying on a notebook global
        number_of_samples = len(positive_files) + len(negative_files)

        self.all_files = [None]*number_of_samples
        self.all_files[::2] = positive_files # Even samples are positive
        self.all_files[1::2] = negative_files # Odd samples are negative
        
        # Transform will be used on image
        self.transform = transform
        
        # Labels as a torch.LongTensor: 1 at even indexes, 0 at odd indexes
        self.Y = torch.zeros([number_of_samples]).type(torch.LongTensor)
        self.Y[::2] = 1
        self.Y[1::2] = 0
        
        if train:
            # The first 30000 samples are used for training
            self.all_files = self.all_files[0:30000]
            self.Y = self.Y[0:30000]
        else:
            # The remaining samples are used for validation
            self.all_files = self.all_files[30000:]
            self.Y = self.Y[30000:]
        self.len = len(self.all_files)

    # Get the length
    def __len__(self):
        return self.len

    # Getter
    def __getitem__(self, idx):
        image = Image.open(self.all_files[idx])
        y = self.Y[idx]

        # If there is any transform method, apply it onto the image
        if self.transform:
            image = self.transform(image)

        return image, y
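
The same constructor, `__len__`, and `__getitem__` pattern can be exercised without any image files. The sketch below is an assumption-laden toy: `ToyDataset`, `n`, and `n_train` are hypothetical names, and the "images" are one-element tensors. It also shows how such an object plugs into a `DataLoader`. With the image-based class above, a transform that converts PIL images to tensors would typically be needed first, because the default batching cannot stack PIL images.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """In-memory stand-in for the image dataset (illustration only)."""

    def __init__(self, n=8, n_train=6, train=True, transform=None):
        X = torch.arange(n, dtype=torch.float32).reshape(n, 1)
        Y = torch.zeros(n, dtype=torch.long)
        Y[::2] = 1  # even samples are positive
        # First n_train samples for training, the rest for validation
        self.X = X[:n_train] if train else X[n_train:]
        self.Y = Y[:n_train] if train else Y[n_train:]
        self.transform = transform

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        x = self.X[idx]
        if self.transform:
            x = self.transform(x)
        return x, self.Y[idx]

loader = DataLoader(ToyDataset(train=True), batch_size=4, shuffle=False)
xb, yb = next(iter(loader))

print(tuple(xb.shape))               # (4, 1)
print(yb.tolist())                   # [1, 0, 1, 0]
print(len(ToyDataset(train=False)))  # 2
```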

<h2 id="Question_4">Question 4</h2>
<b>Create a Dataset object <code>training_dset</code> for the training data, then use the following lines of code to display the 10th and 100th samples (remember zero indexing).</b>


In [None]:
training_dset = Dataset(train = True) # Creating Dataset object
samples = [9, 99] # 10th and 100th samples

for sample in samples:
    plt.imshow(training_dset[sample][0])
    plt.xlabel('y = ' + str(training_dset[sample][1].item()))
    plt.title('Training data; sample {}'.format(int(sample)+1))
    plt.show()

<h2 id="Question_5">Question 5</h2>
<b>Create a Dataset object <code>validation_dset</code> for the validation data, then use the following lines of code to display the 16th and 103rd samples (remember zero indexing).</b>


In [None]:
validation_dset = Dataset(train = False) # Creating Dataset object for the validation data
samples = [15, 102] # 16th and 103rd samples

for sample in samples:
    plt.imshow(validation_dset[sample][0])
    plt.xlabel('y = ' + str(validation_dset[sample][1].item()))
    plt.title('Validation data; sample {}'.format(int(sample)+1))
    plt.show()

### Copyright © 2018 Cognitive Class. This notebook and its source code are released under the terms of the MIT License.