# Image Pre-Processing

Use this notebook to download, explore, and prepare the image data that you will work with in subsequent challenges.

## Download and Extract the Images
The images are provided in a zipped archive. Run the following code cell to download and extract this archive.

> Note: You can run OS commands in a notebook by prefixing them with a `!` character. This works well for simple tasks, but for more complex OS operations you should use a terminal session.

In [None]:
# Download and extract image files
! curl -O https://computervisionhack.blob.core.windows.net/challengefiles/gear_images.zip
! unzip -o gear_images.zip
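
If you would rather stay in Python than shell out with `!`, the same two steps can be sketched with the standard library. This is a minimal sketch rather than part of the original notebook: the helper names `download_and_extract` and `extract_archive` are hypothetical, and the download step assumes the archive URL above is reachable from your environment.

```python
import os
import urllib.request
import zipfile


def extract_archive(zip_path, dest_dir='.'):
    """Extract every member of the archive, overwriting existing files (like `unzip -o`)."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()


def download_and_extract(url, dest_dir='.'):
    """Download a zip archive into dest_dir and extract it there (hypothetical helper)."""
    # Save the archive under the last component of the URL, as `curl -O` does
    zip_path = os.path.join(dest_dir, os.path.basename(url))
    urllib.request.urlretrieve(url, zip_path)
    return extract_archive(zip_path, dest_dir)


# Uncomment to run against the archive used in this notebook:
# download_and_extract('https://computervisionhack.blob.core.windows.net/challengefiles/gear_images.zip')
```

Keeping the extraction step in its own function means you can re-extract an archive you have already downloaded without fetching it again.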

## Explore the Images
Run the following cell to iterate through subfolders in the extracted **gear_images** folder, and display the first image in each subfolder. Each subfolder represents a category, or *class*, of image.

In [None]:
import os
import shutil
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

# Required magic to display matplotlib plots in notebooks
%matplotlib inline

# This is where we extracted the images
imgdir = 'gear_images'

# Set up a figure of an appropriate size
fig = plt.figure(figsize=(12, 16))

# loop through the subfolders
dir_num = 0
for root, folders, filenames in os.walk(imgdir):
    for folder in folders:
        # Load the first image file using the PIL library
        file = os.listdir(os.path.join(root,folder))[0]
        imgFile = os.path.join(root,folder, file)
        img = Image.open(imgFile)
        # Add the image to the figure (which will have 4 rows and enough columns to show a file from each folder)
        a=fig.add_subplot(4,int(np.ceil(len(folders)/4)),dir_num + 1)
        imgplot = plt.imshow(img)
        # Add a caption with the folder name
        a.set_title(folder)
        dir_num = dir_num + 1


Note that each folder contains images of a specific product type.
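
To see how many images each class contains, and which file types are present, you could tally the files per subfolder. The following is a minimal sketch using only the standard library; the function name `summarize_classes` is hypothetical, and it assumes the class folders sit directly under **gear_images** as described above.

```python
import os
from collections import Counter


def summarize_classes(imgdir):
    """Return {class_name: Counter of file extensions} for each subfolder of imgdir."""
    summary = {}
    for entry in sorted(os.listdir(imgdir)):
        folder = os.path.join(imgdir, entry)
        if not os.path.isdir(folder):
            continue  # skip stray files that sit next to the class folders
        # Count files by lower-cased extension, e.g. '.jpg' or '.png'
        summary[entry] = Counter(
            os.path.splitext(name)[1].lower()
            for name in os.listdir(folder)
            if os.path.isfile(os.path.join(folder, name))
        )
    return summary


# Only report if the archive has already been extracted
if os.path.isdir('gear_images'):
    for class_name, extensions in summarize_classes('gear_images').items():
        print(class_name, sum(extensions.values()), dict(extensions))
```

Large differences in the per-class totals are worth noting early, because class imbalance can affect how a model trains later on.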

> Note: Examine the code above carefully and make sure you understand what it is doing. You'll be using similar code to process and display images throughout the rest of this hack.
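
One detail in the cell above that is easy to miss is the grid arithmetic: the figure has a fixed number of rows (4), and the number of columns is rounded up so that every folder gets a slot. As an illustration only, the same calculation can be written with hypothetical helpers `grid_shape` and `grid_position`:

```python
import math


def grid_shape(n_items, n_rows=4):
    """Return (rows, cols) for a grid with n_rows rows that can hold n_items plots."""
    # Round up so that a partially filled grid still has a slot for every item
    n_cols = max(1, math.ceil(n_items / n_rows))
    return n_rows, n_cols


def grid_position(index, n_cols):
    """Return the zero-based (row, col) of the index-th plot; subplots fill row by row."""
    return divmod(index, n_cols)


rows, cols = grid_shape(10)
print(rows, cols, grid_position(4, cols))
```

Note that matplotlib numbers subplots from 1, which is why the cell passes `dir_num + 1` as the subplot index, and that it expects whole numbers for the row and column counts.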

## Standardize the Images
The images are a mix of formats (JPGs and PNGs), and vary in size and shape. Some machine learning techniques for computer vision work best when the image data is a consistent format and size, so you must prepare the data accordingly.

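Run the code in the following cell to standardize the images so that they are all 128x128 pixels in size and stored as three-channel RGB data. Each resized copy is saved under its original file name with a `resized_` prefix, so its file format (JPG or PNG) follows the original extension.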

> Note: Images are essentially multidimensional arrays of pixel values. In this case, the images are represented as 128x128x3 arrays that hold the height, width, and red, green, and blue *channel* pixel values.

In [None]:
import os
import shutil
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

# Helper function to resize image
def resize_image(img, size): 
    from PIL import Image, ImageOps 
    
    # resize the image in place so the longest dimension matches our target size
    # (Image.LANCZOS is the same filter as Image.ANTIALIAS, which was removed in Pillow 10)
    img.thumbnail(size, Image.LANCZOS)
    
    # Create a new square white background image
    newimg = Image.new("RGB", size, (255, 255, 255))
    
    # Paste the resized image into the center of the square background
    offset = ((size[0] - img.size[0]) // 2, (size[1] - img.size[1]) // 2)
    if img.mode == "RGBA":
        # If the source is in RGBA format, use the alpha channel as a mask to eliminate the transparency
        newimg.paste(img, offset, mask=img.split()[3])
    else:
        # Checking the mode also handles grayscale images, which have no third array dimension
        newimg.paste(img, offset)
  
    # return the resized image
    return newimg


# Create resized copies of all of the source images
size = (128,128)

indir = 'gear_images'
outdir = 'resized_images'

# Delete the output folder if it already exists, so that each run starts from a clean slate
if os.path.exists(outdir):
    shutil.rmtree(outdir)

# Loop through each subfolder in the input dir
for root, dirs, filenames in os.walk(indir):
    for d in dirs:
        print('processing folder ' + d)
        # Create a matching subfolder in the output dir
        saveFolder = os.path.join(outdir,d)
        if not os.path.exists(saveFolder):
            os.makedirs(saveFolder)
        # Loop through the files in the subfolder
        files = os.listdir(os.path.join(root,d))
        for f in files:
            # Open the file
            imgFile = os.path.join(root,d, f)
            print("reading " + imgFile)
            img = Image.open(imgFile)
            # Create a resized version and save it
            proc_img = resize_image(img, size)
            saveAs = os.path.join(saveFolder, 'resized_' + f)
            print("writing " + saveAs)
            proc_img.save(saveAs)
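
The resize helper above does two pieces of arithmetic: it shrinks the image so that it fits inside the target size while keeping its aspect ratio, and it computes the offset that centers the result on the square background. The following sketch reproduces that arithmetic without any image library so you can check it by hand. The function name `fit_and_center` is hypothetical, and Pillow's own rounding inside `thumbnail` may differ by a pixel in some cases.

```python
def fit_and_center(src_size, target_size):
    """Return (scaled_size, paste_offset) for fitting src_size inside target_size."""
    src_w, src_h = src_size
    target_w, target_h = target_size
    # Use the tighter of the two ratios, and never enlarge (thumbnail only shrinks)
    scale = min(target_w / src_w, target_h / src_h, 1.0)
    scaled = (max(1, round(src_w * scale)), max(1, round(src_h * scale)))
    # Split the leftover space evenly so the scaled image sits in the middle
    offset = ((target_w - scaled[0]) // 2, (target_h - scaled[1]) // 2)
    return scaled, offset


for source in [(400, 200), (300, 600), (100, 50)]:
    print(source, fit_and_center(source, (128, 128)))
```

A wide image ends up with white bands above and below it, a tall image with bands on the left and right, and an image that is already smaller than the target is centered without being scaled up.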
            

## Compare the Original and Resized Images
Run the following code cell to view the original and resized version of the first image in each subfolder. Note that by default, **matplotlib** plots include axes that indicate the pixel dimensions of the image. The resized images should all be 128x128 pixels in size.

In [None]:
# Create a new figure
fig = plt.figure(figsize=(10, 40))

# loop through the subfolders in the input directory
img_num = 1
for root, folders, filenames in os.walk(indir):
    for folder in folders:
        # Get the first image in the subfolder and add it to a plot that has two columns and a row for each folder
        file = os.listdir(os.path.join(root,folder))[0]
        imgFile1 = os.path.join(indir,folder, file)
        img1 = Image.open(imgFile1)
        a=fig.add_subplot(len(folders), 2, img_num)
        imgplot = plt.imshow(img1)
        a.set_title(folder)
        # The next image is the resized counterpart - load and plot it
        img_num = img_num + 1
        imgFile2 = os.path.join(outdir,folder, 'resized_' + file)
        img2 = Image.open(imgFile2)
        b=fig.add_subplot(len(folders), 2, img_num)
        imgplot = plt.imshow(img2)
        b.set_title('resized ' + folder)
        img_num = img_num + 1
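
Reading the axes on a plot only confirms the size of the images you happen to display. To check every resized file, you could load each one as an array and compare its shape with the expected 128x128x3. This is a minimal sketch: the function name `find_bad_shapes` is hypothetical, and it assumes the folder contains only image files that PIL can open.

```python
import os

import numpy as np
from PIL import Image


def find_bad_shapes(folder, expected_shape=(128, 128, 3)):
    """Return [(path, shape)] for images under folder whose array shape differs."""
    mismatches = []
    for root, _, filenames in os.walk(folder):
        for name in sorted(filenames):
            path = os.path.join(root, name)
            with Image.open(path) as img:
                # np.array gives (height, width, channels) for an RGB image
                shape = np.array(img).shape
            if shape != expected_shape:
                mismatches.append((path, shape))
    return mismatches


# Only run the check if the resized copies have already been created
if os.path.isdir('resized_images'):
    print(find_bad_shapes('resized_images') or 'all images are 128x128x3')
```

An empty list means every file has the expected shape; any entries that are returned tell you which file to inspect and what shape it actually has.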

**Note**: Data pre-processing is often the most time-consuming part of a machine learning project. When working with image data, there are many ways in which the data can be enhanced for use in machine learning, for example by scaling the images to be a consistent size and shape, adjusting contrast to correct for over/under-exposure, or cropping to include only the most relevant visual elements. Most modern machine learning frameworks include functions to perform these tasks, but a professional data scientist working in computer vision scenarios will benefit from having knowledge of how to work with images using common Python libraries.
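
As a small illustration of two of the enhancements mentioned above, the following sketch stretches the contrast of a low-contrast image with PIL's `ImageOps.autocontrast` and crops the central region with `Image.crop`. The helper `center_crop` and the synthetic test image are assumptions made for the example; they are not part of the hack's data or code.

```python
from PIL import Image, ImageOps


def center_crop(img, crop_size):
    """Return the central crop_size (width, height) region of img."""
    crop_w, crop_h = crop_size
    left = (img.width - crop_w) // 2
    top = (img.height - crop_h) // 2
    return img.crop((left, top, left + crop_w, top + crop_h))


# Build a small low-contrast grayscale image whose values span only 64 to 192
dull = Image.new('L', (8, 8), 64)
dull.putpixel((0, 0), 192)

# autocontrast remaps the darkest value toward 0 and the brightest toward 255
stretched = ImageOps.autocontrast(dull)
print(dull.getextrema(), '->', stretched.getextrema())

# Keep only the central 4x4 region
cropped = center_crop(stretched, (4, 4))
print(cropped.size)
```

On real photographs you would usually apply steps like these before resizing, so that the crop and contrast adjustments work on the full-resolution data.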