# Part 1: Data Wrangling with Breeds on CPU

# Objective

Understand how to find a data set and how to prepare it for machine learning training.

## Activities 
**In this section of the training you will**
- Transfer a data set from the shared location on the server to your current directory. 
- View your initial data
- Clean and normalize the data set
- Augment the data
- Organize the data into training and testing groups 


# Find a Dataset

### Research Existing Data Sets

Artificial intelligence projects depend upon data. When beginning a project, data scientists look for existing data sets that are similar to or match the given problem. This saves time and money, and leverages the work of others, building upon the body of knowledge for all future projects. 

Typically you begin with a search engine query. For this project, we were looking for a data set with an unencumbered license.  

This project starts with the Oxford-IIIT Pet Dataset (http://www.robots.ox.ac.uk/~vgg/data/pets/), a 37-category pet data set with roughly 200 images for each class. The images have large variations in scale, pose, and lighting. All images have an associated ground truth annotation of breed, head region of interest (ROI), and pixel-level trimap segmentation.


### Background
"The pet images were downloaded from Catster* and Dogster*, two social web sites dedicated to the collection and discussion of images of pets, from Flickr* groups, and from Google Images*. People uploading images to Catster and Dogster provide the breed information as well, and the Flickr groups are specific to each breed, which simplifies tagging. For each of the 37 breeds, about 2,000 – 2,500 images were downloaded from these data sources to form a pool of candidates for inclusion in the dataset. From this candidate list, images were dropped if any of the following conditions applied, as judged by the annotators: (i) the image was gray scale, (ii) another image portraying the same animal existed (which happens frequently in Flickr), (iii) the illumination was poor, (iv) the pet was not centered in the image, or (v) the pet was wearing clothes. The most common problem in all the data sources, however, was found to be errors in the breed labels. Thus labels were reviewed by the human annotators and fixed whenever possible. When fixing was not possible, for instance because the pet was a cross breed, the image was dropped.”

From *Cats and Dogs*, http://www.robots.ox.ac.uk/~vgg/publications/2012/parkhi12a/parkhi12a.pdf

# Fetch Your Data
![Fetch Data](assets/part1_1.JPG)

### Activity

Click the cell below and then click **Run**

In [None]:
!rm -rf breeds/
!mkdir -p breeds
!rsync -r --progress /data/aidata/breeds/original/ breeds/

!echo "Done."

<br>

# View the Baseline Data

Take a look at the images in your data set. This will give you some idea as to how much cleaning and normalizing will be required. 


![View and Understand Your Data](assets/part1_2.JPG)

### Activity

In the cell below, update the display_images function by changing the **numOfImages** parameter to a number from 1 to 5. Click **Save**, and then click **Run**.
 
*Hint: The display_images function draws an NxN grid of pet images. The default for **numOfImages** is left as **?**. Replace the **?** character with a number from 1 to 5; for example, **numOfImages = 4**.*
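
The grid needs `numOfImages * numOfImages` distinct files, and `random.sample` raises `ValueError` when asked for more items than the list holds. The following is a minimal sketch of a guard against that case. The helper name `pick_grid_files` and the sample file names are assumptions made for illustration; they are not part of the notebook.

```python
import random

def pick_grid_files(file_names, grid_size, seed=None):
    """Pick grid_size * grid_size distinct files, shrinking the grid if needed.

    pick_grid_files is a hypothetical helper, not part of the notebook.
    """
    if grid_size < 1:
        raise ValueError("grid_size must be at least 1")
    if not file_names:
        raise ValueError("no files to display")
    # Shrink the grid until the list holds enough files to fill it.
    while grid_size > 1 and grid_size * grid_size > len(file_names):
        grid_size -= 1
    rng = random.Random(seed)
    return grid_size, rng.sample(file_names, grid_size * grid_size)

# Ten made-up file names cannot fill a 5x5 grid, so the grid shrinks to 3x3.
files = ["breeds/Beagle_{}.jpg".format(i) for i in range(1, 11)]
size, chosen = pick_grid_files(files, 5, seed=0)
print(size, len(chosen))
```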

In [None]:
import glob
import re
import random
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
%matplotlib inline

def get_category(file):
    m = re.search(r"\d", file)
    if m:
        return file[:m.start() - 1].lower().split("/")[1]

def display_images(file_names, numOfImages = ?):
    indices = random.sample(range(len(file_names)), numOfImages * numOfImages)
    train_images = [file_names[i] for i in indices]
    
    # squeeze=False keeps axes a 2-D array, so axes.flat also works for a 1x1 grid
    fig, axes = plt.subplots(nrows=numOfImages, ncols=numOfImages, figsize=(15,15), sharex=True, sharey=True, frameon=False, squeeze=False)
    for i,ax in enumerate(axes.flat):
        ax.get_xaxis().set_visible(False)
        ax.get_yaxis().set_visible(False)
        curr_i = train_images[i]
        imgplot = mpimg.imread(curr_i)
        ax.imshow(imgplot)
        ax.text(10,20,get_category(curr_i), fontdict={"backgroundcolor": "black","color": "white" })
        ax.axis('off')
    plt.tight_layout(h_pad=0, w_pad=0)    
    
    
display_images(glob.glob('breeds/*.jpg'))

print("Done.")

<br>

# Clean and Normalize the Data
Existing image recognition data sets often include images of varying dimensions, a mix of color and black-and-white photos, and sometimes line art alongside photos. File names may follow multiple formats, and the subject within an image may be single or multiple, shown in profile, face-on, or from behind, and set against a simple or complex background.

Cleaning and normalizing the data means fixing these inconsistencies so that machine processing can occur with minimal errors. Data cleaning is often tedious and requires a significant time commitment.

Data preprocessing techniques include:
1. Data cleaning – Eliminates noise and resolves inconsistencies in the data.
2. Data integration – Migrates data from various sources into one coherent source, such as a data warehouse.
3. Data transformation – Standardizes or normalizes any form of data.
4. Data reduction – Reduces the size of the data by aggregating it.

Another name for this effort is extract, transform, and load (ETL).
This project required the team to normalize the file dimensions and file names, and to create the data layout expected by the framework.

It is common for the data cleanup tasks to be paired with framework and topology selection because different topologies expect different data layouts and formats. When experimenting with different topologies it might be necessary to keep several copies of the data in various formats. Multiple copies of data sets can take up a lot of space, so make sure you have plenty of storage and processing capability.

![Clean and Normalize the Data](assets/part1_3.JPG)

## Activity
The code in the next cell performs some of the cleanup tasks. Review the code and notice that it is removing corrupt files, files with the wrong format, and files with incorrect metadata.
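
The metadata checks mentioned above can be sketched in isolation. This is a minimal illustration that works on byte strings rather than files, assuming the same three conditions the notebook tests: a start-of-image marker, an end-of-image marker, and a JFIF header. The function name `jpeg_problem` and the hand-built byte strings are hypothetical.

```python
def jpeg_problem(data):
    """Return a short reason if the bytes fail basic JPEG/JFIF checks, else None."""
    if data[:2] != b"\xff\xd8":
        return "missing SOI marker"
    if data[-2:] != b"\xff\xd9":
        return "missing EOI marker"
    if data[2:4] != b"\xff\xe0" or data[6:10] != b"JFIF":
        return "not a JFIF file"
    return None

# Tiny hand-built byte strings stand in for real files (illustrative only).
good = b"\xff\xd8\xff\xe0\x00\x10JFIF\x00" + b"\x00" * 8 + b"\xff\xd9"
truncated = good[:-2]
exif = b"\xff\xd8\xff\xe1\x00\x10Exif\x00\x00" + b"\xff\xd9"

for name, blob in [("good", good), ("truncated", truncated), ("exif", exif)]:
    print(name, jpeg_problem(blob))
```

Note that a check this strict also rejects JPEG files that begin with an Exif (APP1) segment instead of a JFIF (APP0) segment, even though many such files decode without problems.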

Click the cell below and then click **Run**.

In [None]:
import glob
import cv2
import os

for file in glob.glob("breeds/*"):
    if not file.endswith(".jpg"):
        #Not ending in .jpg
        print("Deleting (.mat): " + file)
        os.remove(os.path.join(os.getcwd(), file))
    else: 
        flags = cv2.IMREAD_COLOR
        im = cv2.imread(file, flags)
        
        if im is None:
            #Can't read in image
            print("Deleting (None): " + file)
            os.remove(os.path.join(os.getcwd(), file))
            continue
        elif len(im.shape) != 3:
            #Wrong number of dimensions (expected height x width x channels)
            print("Deleting (len != 3): " + file)
            os.remove(os.path.join(os.getcwd(), file))
            continue
        elif im.shape[2] != 3:
            #Wrong number of channels
            print("Deleting (shape[2] != 3): " + file)
            os.remove(os.path.join(os.getcwd(), file))
            continue
            
        with open(os.path.join(os.getcwd(), file), 'rb') as f:
            check_chars = f.read()
        if check_chars[-2:] != b'\xff\xd9':
            #Wrong ending metadata for jpg standard
            print('Deleting (xd9): ' + file)
            os.remove(os.path.join(os.getcwd(), file))
        elif check_chars[:4] != b'\xff\xd8\xff\xe0':
            #Wrong Start Marker / JFIF Marker metadata for jpg standard
            print('Deleting (xd8/xe0): ' + file)
            os.remove(os.path.join(os.getcwd(), file))
        elif check_chars[6:10] != b'JFIF':
            #Wrong Identifier metadata for jpg standard
            print('Deleting (JFIF): ' + file)
            os.remove(os.path.join(os.getcwd(), file))
        elif "beagle_116.jpg" in file or "chihuahua_121.jpg" in file:
            #Using EXIF Data to determine this
            print('Deleting (corrupt jpeg data): ', file)
            os.remove(os.path.join(os.getcwd(), file))     


print('Done.')

# Augment Your Data

Most of the time you’re cleaning data and removing noise. Since our app needs to work with images of wet, muddy, or injured animals, or perhaps blurry images because the animal is running away in fear, we actually need to ADD noise to the data set.

We decided to add image noise by building a small program to flip, flop, blur, and extract color channels from the images in the data set. These actions expanded our training data set by 6x.

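The cell below first lists, as commented-out shell commands, an augmentation pipeline built on GNU parallel and ImageMagick convert. The Python code that actually runs resizes every image to fit a 227x227 black square, using a process pool to spread the work across all available processors.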
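
The flip, flop, and color-channel operations described above can be sketched without any imaging library. This is a minimal illustration on a tiny nested-list image, not the pipeline used for the project; the function names and the 2x2 sample image are assumptions. In ImageMagick terms, flop mirrors left to right and flip mirrors top to bottom.

```python
def flop(image):
    """Mirror left-to-right (what ImageMagick calls -flop)."""
    return [list(reversed(row)) for row in image]

def flip(image):
    """Mirror top-to-bottom (what ImageMagick calls -flip)."""
    return [list(row) for row in reversed(image)]

def extract_channel(image, index):
    """Keep one channel as a grayscale image (index 0=R, 1=G, 2=B)."""
    return [[pixel[index] for pixel in row] for row in image]

# A 2x2 RGB image as nested lists of (R, G, B) tuples (illustrative data).
image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (10, 20, 30)],
]

augmented = {
    "flop": flop(image),
    "flip": flip(image),
    "red": extract_channel(image, 0),
}
print(augmented["flop"][0])
print(augmented["red"])
```

Each operation yields one extra image per original, which is how a handful of simple transforms multiplies the size of a training set.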

![Augment Your Data](assets/part1_4.JPG)

### Activity
Click the cell below and then click **Run**.

In [None]:
#echo "Start resizing to 227x227"
#parallel -j 200 convert {} -resize 227x227 -filter spline -unsharp 0x6+0.5+0 -background black -gravity center -extent 227x227  {} ::: *.jpg
#echo "Resizing done"

#mkdir flop
#echo "Start augmentation 1"
#parallel -j 200 convert {} -flop flop/{.}-flop.jpg ::: *.jpg
#echo "Finish augmentation 1"

#mkdir flip
#echo "Start augmentation 2"
#parallel -j 200 convert {} -transverse -rotate 90 flip/{.}-flip.jpg ::: *.jpg
#echo "Finish augmentation 2"

#mkdir blur
#echo "Start augmentation 3"
#parallel -j 200 convert {} -blur 0x1 blur/{.}-blur.jpg ::: *.jpg
#echo "Finish augmentation 3"

#mkdir red
#echo "Start augmentation 4"
#parallel -j 200 convert {} -channel R -separate red/{.}-red.jpg ::: *.jpg
#echo "Finish augmentation 4"

#mkdir blue
#echo "Start augmentation 5"
#parallel -j 200 convert {} -channel B -separate blue/{.}-blue.jpg ::: *.jpg
#echo "Finish augmentation 5"

#mkdir green
#echo "Start augmentation 6"
#parallel -j 200 convert {} -channel G -separate green/{.}-green.jpg ::: *.jpg
#echo "Finish augmentation 6"

#echo "Copying augmented data to main folder"
#cp flop/* flip/* blur/* red/* blue/* green/* .

#echo "Augmentation done"

import glob
import sys
from multiprocessing import Pool
from PIL import Image

def resize_image(file, size=227):
    """Shrink the image to fit size x size and center it on a black square, in place."""
    black_background = Image.new('RGB', (size, size), "black")
    img = Image.open(file)
    img.thumbnail((size, size))
    x, y = img.size
    black_background.paste(img, ((size - x) // 2, (size - y) // 2))
    black_background.save(file)
    return file

files = glob.glob("breeds/*")
# imap_unordered yields each result as soon as it is ready, so the counter tracks real progress
with Pool() as pool:
    for i, _ in enumerate(pool.imap_unordered(resize_image, files)):
        if i % 10 == 0:
            sys.stdout.write('\r{0} out of {1} processed'.format(i + 1, len(files)))

sys.stdout.write('\n')
sys.stdout.flush()
print("Done.")

<br>

# View Results 

The code that ran in the previous cell resized every image to fit a 227x227 black square; the flip, flop, blur, and color-channel commands are commented out in that cell. Take a look at the results.

![View Results](assets/part1_5.JPG)

### Activity

Click the cell below and then click **Run**.

In [None]:
display_images(glob.glob('breeds/*'))

print("Done.")

<br>

# Organize Data for Consumption by Caffe*

The framework you choose for your project determines how you need to organize your data. After extensive experimentation, we selected Caffe for this project. This section describes how to organize your data layers.

Our data needs to be organized in a specific manner. Caffe requires separate data sets for training and validation and that the data be stored in two separate folders. Why separate image sets for training and validation? To detect “overfitting,” which occurs when a model memorizes its training images instead of learning features that generalize. If you train and test on the same images, you cannot tell the two apart. You train on one set, then test on a different set to validate that the machine is truly learning to recognize the images.

Our folders are named **train** and **val**. We used the common ratio of 80% training and 20% validation to split the data set.
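
The 80/20 split described above can be sketched as follows. This is a minimal illustration, assuming a sorted file list that is cut at `floor(len * train_ratio)`; the helper name `split_train_val` and the file names are hypothetical.

```python
import math

def split_train_val(files, train_ratio):
    """Split a sorted file list into (train, val), cutting at floor(len * ratio)."""
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must be between 0 and 1")
    ordered = sorted(files)
    cut = math.floor(len(ordered) * train_ratio)
    return ordered[:cut], ordered[cut:]

# Ten made-up file names for one breed.
files = ["beagle_{}.jpg".format(i) for i in range(1, 11)]
train, val = split_train_val(files, 0.8)
print(len(train), len(val))
```

A sorted cut is deterministic but not random: `beagle_10.jpg` sorts before `beagle_2.jpg`, so the validation set always holds the names that sort last. Shuffling with a fixed seed before the cut is a common alternative when file order may correlate with image content.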

Within the train and val folders we create a folder for every breed in the data set, and each folder stores the images of that cat or dog breed. The next cell of code creates the data layout expected by Caffe.
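
Alongside the folders, the project writes text files that pair each image path with a numeric label. The following is a minimal sketch of that pairing, assuming one line per image in the form `<category>/<file> <label>`, with labels assigned by the alphabetical position of the category. The helper name `build_index` and the sample files are hypothetical.

```python
def build_index(split_files):
    """Map {category: [file, ...]} to '<category>/<file> <label>' lines."""
    categories = sorted(split_files)
    lines = []
    for label, category in enumerate(categories):
        for name in sorted(split_files[category]):
            lines.append("{0}/{1} {2}".format(category, name, label))
    return categories, lines

# Illustrative contents of a train folder with two breeds.
train_files = {
    "maine_coon": ["Maine_Coon_100.jpg"],
    "beagle": ["beagle_1.jpg", "beagle_2.jpg"],
}
categories, lines = build_index(train_files)
print("\n".join(lines))
```

Because the label is just an index into the sorted category list, the same list has to be saved and reused at inference time to turn a predicted index back into a breed name.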



![Organize Data for Consumption by Framework](assets/part1_6.JPG)

### Activity

In the cell below, set the **train_ratio** to **0.8** and then click **Run**.

*Hint: We need to set the train_ratio = ? to a value between 0 and 1.*

In [None]:
import os
import re
import sys
import errno
import math

def get_category(file):
    m = re.search(r"\d", file)
    if m:
        return file[:m.start() - 1].lower()

def make_sure_path_exists(path):
    try:
        os.makedirs(path)
    except OSError as exception:
        if exception.errno != errno.EEXIST:
            raise

file_names = os.listdir('breeds')               
category_names = [ get_category(file) for file in file_names]
category_names = [ name for name in category_names if name is not None ]
category_names = sorted(list(set(category_names)))
for category in category_names:
    make_sure_path_exists("breeds/train/" + str(category))
    make_sure_path_exists("breeds/val/" + str(category))

train_ratio = ?
train_txt = {}
val_txt = {}

for idx, category in enumerate(category_names):
    category_list = []
    for file in file_names:
        if get_category(file) == category:
            category_list.append(file)
    
    category_list = sorted(category_list)
    split_ratio = math.floor(len(category_list) * train_ratio)
    train_list = category_list[:split_ratio]
    test_list = category_list[split_ratio:]
    
    for i, file in enumerate(train_list):
        os.rename("breeds/" + file, "breeds/train/" + str(category) + "/" + file)
        if i % 10 == 0:
            sys.stdout.write('\r>> Moving train image %d to category folder %s' % (i+1, category))
            sys.stdout.flush()
        train_txt[str(category) + "/" + file] = idx
        
    sys.stdout.write('\n')
    sys.stdout.flush()        
        
    for i, file in enumerate(test_list):
        os.rename("breeds/" + file, "breeds/val/" + str(category) + "/" + file)
        if i % 10 == 0:
            sys.stdout.write('\r>> Moving validation image %d to category folder %s' % (i+1, category))
            sys.stdout.flush()
        val_txt[str(category) + "/" + file] = idx
        
    sys.stdout.write('\n')
    sys.stdout.flush()        
        
print("Done splitting data")        
        
with open("breeds/train.txt", "w") as train:
    for key, val in train_txt.items():
        train.write("{0} {1}\n".format(key, val))
print("Wrote train.txt")

with open("breeds/val.txt", "w") as validation:
    for key, val in val_txt.items():
        validation.write("{0} {1}\n".format(key, val))
print("Wrote val.txt")

with open("breeds/categories.txt", "w") as categories:
    for val in category_names:
        categories.write("{0}\n".format(val))
print("Wrote categories.txt")

print("")
print("Done.")


# Confirm Folder Structure is Correct

Notice we have folders for each breed category within our Train and Validation folders. The images of each breed have been sorted into their respective folders.
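
Beyond inspecting the tree by eye, the layout can be checked programmatically. This is a minimal sketch, assuming the `train`/`val` layout described above; the helper name `missing_categories` is hypothetical, and the demonstration runs on a throwaway temporary folder rather than the real data set.

```python
import os
import tempfile

def missing_categories(root):
    """Return categories present under train/ but absent or empty under val/."""
    train_dir = os.path.join(root, "train")
    val_dir = os.path.join(root, "val")
    missing = []
    for category in sorted(os.listdir(train_dir)):
        val_category = os.path.join(val_dir, category)
        if not os.path.isdir(val_category) or not os.listdir(val_category):
            missing.append(category)
    return missing

# Build a throwaway layout to demonstrate the check (not the real data set).
with tempfile.TemporaryDirectory() as root:
    for path in ["train/beagle", "train/persian", "val/beagle", "val/persian"]:
        os.makedirs(os.path.join(root, path))
    # Only the beagle validation folder receives an image.
    open(os.path.join(root, "val", "beagle", "beagle_9.jpg"), "w").close()
    result = missing_categories(root)
print(result)
```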

![Confirm Folder Structure is Correct](assets/part1_7.JPG)

### Activity
Click the cell below and then click **Run**.

In [None]:
for root, dirs, files in os.walk("breeds"):
    level = root.replace(os.getcwd(), '').count(os.sep)
    print('{0}{1}/'.format('    ' * level, os.path.basename(root)))
    for file in files[:5]:
        print('{0}{1}'.format('    ' * (level + 1), file))
print("Done.")

# Optimize Data for Ingestion

### Data Layers
Data enters Caffe through data layers. Data can come from efficient databases (LevelDB or LMDB), directly from memory, or, when efficiency is not critical, from files on disk in HDF5 or common image formats.

### Transform data set to an LMDB dataset and determine the mean for all images
So far we have a set of folders and a lot of image files. To optimize data load times, we create an LMDB database out of our many image files. The convert_imageset tool reads each image listed in train.txt or val.txt and stores it in the database together with its label under a numbered key, so that Caffe can read training data from one database instead of opening thousands of small image files.

Subtracting the mean is one of the most common forms of preprocessing. It involves subtracting the mean across every individual feature in the data, and has the geometric interpretation of centering the cloud of data around the origin along every dimension. The compute_image_mean tool averages all images in a database; at inference time we reduce that result to three values, one for each color channel (red, green, and blue). Each mean lies between 0 and 255, the valid range for 8-bit pixel values.
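
The centering step can be illustrated on a handful of pixels. This is a minimal sketch in plain Python, not the code Caffe runs; the function names and pixel values are made up for illustration.

```python
def channel_means(pixels):
    """Average each channel over a list of 3-channel pixels."""
    count = len(pixels)
    return tuple(sum(p[c] for p in pixels) / count for c in range(3))

def subtract_mean(pixels, means):
    """Center the pixels by subtracting the per-channel mean."""
    return [tuple(p[c] - means[c] for c in range(3)) for p in pixels]

# Three illustrative pixels with 8-bit channel values.
pixels = [(0, 50, 100), (100, 150, 200), (200, 250, 255)]
means = channel_means(pixels)
centered = subtract_mean(pixels, means)
print(means)
print([sum(p[c] for p in centered) for c in range(3)])
```

The means stay inside the 0 to 255 range, while the centered values can be negative; after centering, each channel sums to zero.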

![Optimize Data for Ingestion](assets/part1_8.JPG)

### Activity
Click the cell below and then click **Run**.

In [None]:
!echo "Creating train lmdb..."

!rm -rf breeds/train_lmdb

!/glob/intel-python/versions/2018u2/intelpython3/bin/convert_imageset \
    --shuffle \
    breeds/train/ \
    breeds/train.txt \
    breeds/train_lmdb

!echo "Creating val lmdb..."

!rm -rf breeds/val_lmdb

!/glob/intel-python/versions/2018u2/intelpython3/bin/convert_imageset \
    --shuffle \
    breeds/val/ \
    breeds/val.txt \
    breeds/val_lmdb

!echo "Creating train_mean.binaryproto..."

!/glob/intel-python/versions/2018u2/intelpython3/bin/compute_image_mean breeds/train_lmdb \
  breeds/train_mean.binaryproto
  
!echo "Creating val_mean.binaryproto..."

!/glob/intel-python/versions/2018u2/intelpython3/bin/compute_image_mean breeds/val_lmdb \
  breeds/val_mean.binaryproto

!echo "Done."

<br>

### After all of this data wrangling we can actually begin the training process

When we started this project, we always had an edge device in mind as our ultimate deployment platform. To that end we always considered three things when selecting our topology or network: time to train, size, and inference speed. 

**Time to Train:** Depending on the number of layers and computation required, a network can take a significantly shorter or longer time to train. Computation time and programmer time are costly resources, so we wanted short training times.  

**Size:** Since we're targeting an edge device and an Intel® Movidius™ Neural Compute Stick, we must consider how large a network fits in memory as well as which networks are supported.

**Inference speed:** Typically the deeper and larger the network, the slower the inference speed. In our use case we are working with a live video stream; we want at least 10 frames per second on inference.
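
The frame-rate target translates directly into a time budget per frame. The following is a small sketch of that arithmetic; the function names and the two sample inference times are hypothetical, not measurements from this project.

```python
def frame_budget_ms(target_fps):
    """Milliseconds available per frame at the target frame rate."""
    return 1000.0 / target_fps

def meets_target(inference_ms, target_fps=10):
    """True if one inference fits inside the per-frame budget."""
    return inference_ms <= frame_budget_ms(target_fps)

print(frame_budget_ms(10))
# Two made-up inference times, in milliseconds.
print(meets_target(80.0), meets_target(140.0))
```

At 10 frames per second, each frame has 100 ms for capture, preprocessing, and inference combined, so the network itself needs to finish comfortably under that figure.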

At this point we're going to continue with the Caffe framework plus the GoogLeNet topology/network since we're working with a complex data set.

![googlenet](assets/googlenet.png)

# Part 2: Training Breeds with Caffe and GoogLeNet on CPU

# Objective 
Understand the stages of preparing a data set for training using the Caffe framework and GoogLeNet topology. You will initiate training and view a completed graph.

# Activities 
**In this section of the training you will**
- View Solver and Train Prototxt
- Start Training
- Accuracy and Loss for Full Run
- Looking at a Sample Image
- Inference on a Sample Image


### Solver Files
The solver.prototxt file is a configuration file that tells Caffe how you want the network trained. It includes parameters that you can change to adjust the training. For example, inside the solver file you can set how often Caffe saves a checkpoint (snapshot) so that you have a reference point, and how often it tests the network against the validation data. You can also switch between CPU and GPU training by changing a single parameter (solver_mode).

The train.prototxt file is a text description of the network you see in the image above. Caffe reads it to build the network and to know how to pass the image data forward and backward through the layers during training.


[Wiki for Solver parameters](https://github.com/BVLC/caffe/wiki/Solver-Prototxt)
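
To make the shape of a solver file concrete, the sketch below renders a few solver settings as `key: value` lines in protobuf text format. The field names are standard Caffe solver parameters, but every value shown is an assumption chosen for illustration and is not taken from breeds_googlenet_solver.prototxt. The helper names `EnumValue` and `render_prototxt` are hypothetical.

```python
class EnumValue(str):
    """Marks a value that prototxt writes without quotes, such as CPU."""

def render_prototxt(params):
    """Render a flat dict of strings and numbers as 'key: value' lines."""
    lines = []
    for key, value in params.items():
        if isinstance(value, EnumValue):
            text = str(value)              # enum: no quotes
        elif isinstance(value, str):
            text = '"{0}"'.format(value)   # string: double quotes
        else:
            text = repr(value)             # int or float
        lines.append("{0}: {1}".format(key, text))
    return "\n".join(lines) + "\n"

# Illustrative values only; check the real solver file for the project's settings.
solver = {
    "net": "breeds_googlenet_train.prototxt",
    "test_iter": 100,
    "test_interval": 500,
    "base_lr": 0.01,
    "lr_policy": "fixed",
    "max_iter": 10000,
    "snapshot": 1000,
    "snapshot_prefix": "breeds_googlenet/breeds_googlenet",
    "solver_mode": EnumValue("CPU"),
}
text = render_prototxt(solver)
print(text)
```

In this sketch, `snapshot` controls how often a checkpoint is written, `test_interval` controls how often the validation pass runs, and `solver_mode` is the single switch between CPU and GPU.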


# Display Tunable Parameters for Training

![Display Tunable Parameters for Training](assets/part2_1.JPG)

### Activity
Click the cell below and then click **Run**.

In [None]:
!echo "Displaying Solver Prototxt: "
!echo ""
!cat breeds_googlenet_solver.prototxt
!echo ""
!echo "Displaying Train Prototxt: "
!echo ""
!cat breeds_googlenet_train.prototxt
!echo "Done"

# Start training

Let’s start training with Caffe. 

![Start Training](assets/part2_2.JPG)

### Activity
Point Caffe to the correct solver file by changing the **??????.prototxt** argument to **breeds_googlenet_solver.prototxt**, and then click **Run**.


*Hint: When we start training we have to point Caffe at the correct solver file. Replace the **??????** in **??????.prototxt** so that it reads **breeds_googlenet_solver.prototxt**. Then come back and run the next command.*

In [None]:
!mkdir -p breeds_googlenet
!/glob/intel-python/versions/2018u2/intelpython3/bin/caffe train -solver ??????.prototxt

<br>
<br>

## Accuracy and Loss for Breeds using GoogLeNet Topology

This is a graph of the completed training of Breeds using GoogLeNet. The image represents a 6-hour training session. If we had enough time to run through all of the training you would see something similar to the graphs below.

![After 3 hours Graphs will look like below:](assets/part2_3.JPG)

<img src="assets/GoogleNet_Breeds.png" style="width: 700px;"/>

## Look at a Sample Image

We're going to use this image to run through the network and see the results.

![Look at Sample Image](assets/part2_4.JPG)

### Activity
Click the cell below and then click **Run**.

In [None]:
from PIL import Image

Image.open('breeds/train/maine_coon/Maine_Coon_100.jpg')

### Inference on an Image

We can use the newly trained caffemodel to classify a sample image. The cell below loads the most recent snapshot from the breeds_googlenet folder together with deploy.prototxt and categories.txt, runs the image through the network, and prints the probability for each breed, sorted from most to least likely.
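
To keep only the most likely breeds from the network's output, the probabilities can be ranked and truncated. This is a minimal sketch in plain Python; the helper name `top_k`, the labels, and the probabilities are made up for illustration.

```python
def top_k(probabilities, labels, k=5):
    """Return the k (label, probability) pairs with the highest probability."""
    ranked = sorted(zip(labels, probabilities), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Illustrative softmax output for four breeds.
labels = ["abyssinian", "beagle", "maine_coon", "persian"]
probabilities = [0.05, 0.10, 0.70, 0.15]
best = top_k(probabilities, labels, k=3)
for label, prob in best:
    print("{0:<12} {1:.2f}".format(label, prob))
```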

![Inference on an Image](assets/part2_5.JPG)

### Activity
Click the cell below and then click **Run**.

In [None]:
import glob
import os
import caffe
import numpy as np
from PIL import Image

files = glob.glob('breeds_googlenet/breeds_googlenet_iter_*.caffemodel')
latest = max(files, key=os.path.getctime)

caffe.set_mode_cpu()
net = caffe.Net('deploy.prototxt', latest, caffe.TEST)

net.blobs['data'].reshape(1,        # batch size
                          3,         # 3-channel (BGR) images
                          224, 224)  # network input size is 224x224

blob = caffe.proto.caffe_pb2.BlobProto()
with open('breeds/train_mean.binaryproto', 'rb') as f:
    blob.ParseFromString(f.read())
    data = np.array(blob.data).reshape([blob.channels, blob.height, blob.width])
    mu = np.array([np.mean(data[0]), np.mean(data[1]), np.mean(data[2])])

# create transformer for the input called 'data'
transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
transformer.set_transpose('data', (2,0,1))  # move image channels to outermost dimension
transformer.set_mean('data', mu)            # subtract the dataset-mean value in each channel
transformer.set_raw_scale('data', 255)      # rescale from [0, 1] to [0, 255]
transformer.set_channel_swap('data', (2,1,0))  # swap channels from RGB to BGR

# load image as float RGB in [0, 1], the range set_raw_scale expects
image = Image.open('breeds/train/maine_coon/Maine_Coon_100.jpg').convert('RGB')
data = np.asarray(image, dtype=np.float32) / 255.0
transformed_image = transformer.preprocess('data', data)

# copy the image data into the memory allocated for the net
net.blobs['data'].data[...] = transformed_image

# load categories.txt labels
labels_file = 'breeds/categories.txt'
labels = np.loadtxt(labels_file, str, delimiter='\n')

# perform classification
net.forward()

# obtain the output probabilities
output_prob = net.blobs['prob'].data[0]

# sort top predictions from softmax output
top_inds = output_prob.argsort()[::-1]

print('Probabilities and Labels:', list(zip(output_prob[top_inds], labels[top_inds])))
print('Done.')

### Summary

- Getting your dataset
- Sorting your dataset
- Generating the LMDB databases
- Training on your dataset
- Using your caffemodel to test image classification

![Summary](assets/part2_6.JPG)

### Resources

What is Intel® Optimized Caffe, https://software.intel.com/en-us/videos/what-is-intel-optimized-caffe

Caffe | Deep Learning Framework, http://caffe.berkeleyvision.org/

Intel Caffe on GitHub*, https://github.com/intel/caffe


Manufacturing Package Fault Detection Using Deep Learning, https://software.intel.com/en-us/articles/manufacturing-package-fault-detection-using-deep-learning

Automatic Defect Inspection Using Deep Learning for Solar Farm, https://software.intel.com/en-us/articles/automatic-defect-inspection-using-deep-learning-for-solar-farm


**Notices**

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.

Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting www.intel.com/design/literature.htm.

This sample source code is released under the Intel Sample Source Code License Agreement.

Intel, the Intel logo, Intel Xeon Phi, Movidius, and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries. 

*Other names and brands may be claimed as the property of others.

© 2018 Intel Corporation