# The task

We want to improve Dataperformers's computer vision model in order to lower our client's defect detection costs, raise the defect detection rate, and limit damage to both the brand's reputation and its customers.

To achieve this goal, we will use convolutional neural networks (CNNs), which have proven to be excellent at pattern recognition tasks such as computer vision.

This type of model requires a fixed-size input, which we obtain by preprocessing the images.

We also explored the idea of combining 2 models, both based on CNNs :
* One trained on images taken under constant conditions
* One trained on images taken under varying conditions

To make a prediction with that approach, an image would be submitted to both models, each of which outputs a continuous probability (regression) that the image shows a defect instead of a discrete decision (classification). The model whose probability is the most extreme (the furthest from 0.5, and therefore the most decisive) would determine the final decision.
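The decision rule described above can be sketched in a few lines. This is only an illustration: the function name `combine_predictions`, the 0.5 decision threshold, and the tie-breaking rule (the first model wins a tie) are assumptions that do not come from the original design.

```python
def combine_predictions(p_constant, p_varied):
    """Return (probability, is_defective) from the more decisive model.

    p_constant and p_varied are the defect probabilities (between 0 and 1)
    produced by the two hypothetical models.
    """
    # Decisiveness is measured as the distance from 0.5 (assumption).
    if abs(p_constant - 0.5) >= abs(p_varied - 0.5):
        chosen = p_constant  # ties go to the first model (assumption)
    else:
        chosen = p_varied
    return chosen, chosen >= 0.5


print(combine_predictions(0.9, 0.4))
print(combine_predictions(0.6, 0.1))
```

In the first call the constant-conditions model is the more decisive one, in the second call the varied-conditions model is.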

We ended up opting for a single CNN trained on both types of data, because the 2 datasets can benefit from each other and from the larger combined sample size. With some preprocessing, the two datasets can also be brought closer to each other, which makes training on them jointly more reasonable.

# Our tools

In [14]:
# file system lib (os.path is available through os)
import os

# import data processing libs
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import cv2

# import model libs
import tensorflow as tf
from tensorflow.keras import layers, datasets, models

# Data : first look

We start by looking at data distributions and data format in order to find the best way to manipulate it.

In [2]:
# data repo
# raw strings keep the Windows backslashes from being read as escape sequences
constantGoodPath = r'..\data\Part1Raw\Part_1_Dataset\Train\Good'
constantDefectedPath = r'..\data\Part1Raw\Part_1_Dataset\Train\Defected'
variedGoodPath = r'..\data\Part2Raw\Part_2_Dataset\Train\Good'
variedDefectedPath = r'..\data\Part2Raw\Part_2_Dataset\Train\Defected'

# count occurrence of all instances
constantGood = len(os.listdir(constantGoodPath))
constantDefected = len(os.listdir(constantDefectedPath))
variedGood = len(os.listdir(variedGoodPath))
variedDefected = len(os.listdir(variedDefectedPath))

print("constant good images : {}, ({}% of constant images)".format(constantGood, 100*constantGood/(constantGood+constantDefected)))
print("constant defected images : {}, ({}% of constant images)".format(constantDefected, 100*constantDefected/(constantGood+constantDefected)))
print("varied good images : {}, ({}% of varied images)".format(variedGood, 100*variedGood/(variedGood+variedDefected)))
print("varied defected images : {}, ({}% of varied images)".format(variedDefected, 100*variedDefected/(variedGood+variedDefected)))

constant good images : 160, (75.82938388625593% of constant images)
constant defected images : 51, (24.170616113744074% of constant images)
varied good images : 440, (78.43137254901961% of varied images)
varied defected images : 121, (21.568627450980394% of varied images)


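As we could expect from the case description, there are far more images of good products than of defective ones. A model that predicts "good" for every image would reach about 78% accuracy (600 of 772 images) without detecting a single defect, so accuracy alone would overstate its performance.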
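To make the effect of the imbalance concrete, the short calculation below uses the counts printed above to score a trivial classifier that always answers "good". The variable names are introduced here for illustration only.

```python
# Counts taken from the printed output above.
good = 160 + 440      # constant + varied good images
defected = 51 + 121   # constant + varied defected images

# Accuracy of a classifier that labels every image as "good".
baseline_accuracy = good / (good + defected)

# Share of defects that the same classifier catches.
baseline_defect_recall = 0 / defected

print("baseline accuracy : {:.3f}".format(baseline_accuracy))
print("baseline defect recall : {:.3f}".format(baseline_defect_recall))
```

Any trained model has to be compared against this baseline, and a metric that looks at the defected class specifically (such as recall) is needed alongside accuracy.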

# Data processing

Since the provided images are JPG encoded, which is not very convenient to work with, we must first generate a grayscale integer vector from every one of them. This is possible because the defects do not depend on color, but rather on contrast with the background of the coin, which we will be able to see clearly in grayscale.

Grayscaling the images also greatly reduces the dimensionality of our data, making our training time much more reasonable (RGB encoding uses three 8-bit integers per pixel, whereas grayscale uses a single 8-bit integer).
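The threefold reduction can be checked on a small synthetic image. The sketch below is a NumPy stand-in for `cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)`: it applies the usual luma weights (0.299 R + 0.587 G + 0.114 B) to an array in BGR channel order. The helper name `to_grayscale` is hypothetical, and rounding may differ slightly from OpenCV.

```python
import numpy as np


def to_grayscale(bgr):
    """Convert an (H, W, 3) uint8 image in BGR order to (H, W) uint8."""
    weights = np.array([0.114, 0.587, 0.299])  # B, G, R luma weights
    return np.rint(bgr @ weights).astype(np.uint8)


# Synthetic 4x6 white image in BGR order (stands in for a decoded JPG).
bgr = np.full((4, 6, 3), 255, dtype=np.uint8)
gray = to_grayscale(bgr)

print(bgr.shape, bgr.nbytes)    # three bytes per pixel
print(gray.shape, gray.nbytes)  # one byte per pixel
```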

In [11]:
# define path to dump grayscale processed images
# raw strings keep the Windows backslashes from being read as escape sequences
grayscaleConstantGood = r'..\data\Part1Processed\good\grayscale.csv'
grayscaleConstantDefected = r'..\data\Part1Processed\defected\grayscale.csv'
grayscaleVariedGood = r'..\data\Part2Processed\good\grayscale.csv'
grayscaleVariedDefected = r'..\data\Part2Processed\defected\grayscale.csv'

# grayscale every image and append it to its CSV file, one row per image
# (uncomment to regenerate the CSV files)
#conversions = [(constantGoodPath, grayscaleConstantGood),
#               (constantDefectedPath, grayscaleConstantDefected),
#               (variedGoodPath, grayscaleVariedGood),
#               (variedDefectedPath, grayscaleVariedDefected)]
#for sourceDir, targetCsv in conversions :
#    # the with block closes the file once all images are written
#    with open(targetCsv, 'a') as f :
#        for img in os.listdir(sourceDir) :
#            image = cv2.imread(os.path.join(sourceDir, img))
#            # reshape to a single row so that each image occupies exactly one line
#            grayscaleImage = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY).reshape(1, -1)
#            np.savetxt(f, grayscaleImage, delimiter=",", fmt="%d")

# Data processing pt 2

Also, not all images are the same size, which is a problem for a CNN input layer of fixed size.
The defective images of the dataset seem to be much more detailed than the good ones.
We explored 3 different solutions :
* Downscale larger images to the size of the smallest image in the dataset.
* Upscale coarser images to the size of the largest image in the dataset.
* Downscale detailed images and upscale coarse images to a common resolution. We determine that resolution by selecting the size that minimizes the sum of changes (added pixels + removed pixels) to be made across the dataset.

Having said that, a cursory inspection of both datasets reveals that some images contain a lot of empty background space, which could be cropped out to greatly reduce the dimensionality of many images (especially in dataset 2, where the conditions vary a lot). If done successfully, this modification would reduce the amount of information lost when upscaling and downscaling the data.

Here are some good examples of said images :

<table><tr><td><img src="img\20191006_143830.jpg" alt="img1" style="width: 100px;"/></td><td><img src="img\20191006_144125.jpg" alt="img1" style="width: 100px;"/></td><td><img src="img\20191006_144118.jpg" alt="img1" style="width: 100px;"/></td><td><img src="img\20191006_144115.jpg" alt="img1" style="width: 100px;"/></td></tr></table>

Here are some observations :
* The background of these unnecessarily large images is usually homogeneous
* If we look at specific rows and columns, the ones that contain only background (which we want to crop out) have very low grayscale variance because of that homogeneity
* Cropping out the rows and columns that have very low variance (below a certain threshold) would leave us with only the coin, which is the ideal situation
* The only thing making this strategy viable without the help of an edge detection tool is that the center of the coin is empty in every picture. Because of this hole, a column or row that contains the coin always contains some amount of background as well, which increases its variance (we do not remove rows and columns with high variance). That guarantees that no part of the coin is cropped out

All that remains is to determine the variance threshold below which we delete a row or column (which is fairly easy once we have the grayscale version of all images).
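The cropping idea above can be sketched with NumPy as follows. The function name `crop_low_variance` and the example threshold are assumptions made for illustration; the actual threshold still has to be determined from the data.

```python
import numpy as np


def crop_low_variance(gray, threshold):
    """Drop the rows and columns whose grayscale variance is <= threshold."""
    gray = np.asarray(gray)
    values = gray.astype(np.float64)
    keep_rows = values.var(axis=1) > threshold  # variance of each row
    keep_cols = values.var(axis=0) > threshold  # variance of each column
    return gray[np.ix_(keep_rows, keep_cols)]


# Synthetic example: white 20x20 background with a dark square "coin"
# that has a white hole in its center.
image = np.full((20, 20), 255, dtype=np.uint8)
image[5:15, 5:15] = 0      # coin body
image[8:12, 8:12] = 255    # hole in the center of the coin

cropped = crop_low_variance(image, threshold=1.0)
print(image.shape, "->", cropped.shape)
```

In this synthetic case every row and column that crosses the coin also contains white background, so its variance is high and it is kept, while the pure background rows and columns have zero variance and are removed.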

With all the images cropped, it is now time to determine the size to which we will scale the images.

To do so, we compute the mean number of columns and the mean number of rows (rounded down or up, still to be decided) over all images, regardless of which dataset they come from.
Using the mean as the target keeps the total number of removed and added pixels low.

The images that are smaller than the target size will be upscaled, whereas larger images will be downscaled.
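A minimal sketch of the target size computation, assuming the shapes of the cropped images are available as `(rows, columns)` tuples. The function name `target_size` and the sample shapes are made up for illustration, and the rounding function is left as a parameter since that choice is still open.

```python
import math

import numpy as np


def target_size(shapes, rounding=math.floor):
    """Return the (rows, columns) target computed as the rounded mean."""
    mean_rows = np.mean([shape[0] for shape in shapes])
    mean_cols = np.mean([shape[1] for shape in shapes])
    return rounding(mean_rows), rounding(mean_cols)


# Hypothetical shapes of three cropped images.
shapes = [(100, 120), (200, 180), (151, 150)]

print(target_size(shapes))             # rounded down
print(target_size(shapes, math.ceil))  # rounded up
```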

Such image manipulation can be done with libraries such as **scikit-image**, or with machine learning models trained specifically for this task.

We may want to avoid depending on third-party AI models.

Downscaling can be implemented with several resampling algorithms.

For upscaling, some heuristics exist that do not rely on machine learning. See :
https://en.wikipedia.org/wiki/Image_scaling
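As an example of such a heuristic, nearest-neighbor interpolation handles both upscaling and downscaling without any trained model. The sketch below is a NumPy-only illustration, not the method retained for the project; the function name `resize_nearest` is hypothetical.

```python
import numpy as np


def resize_nearest(gray, new_rows, new_cols):
    """Resize a 2D grayscale image with nearest-neighbor sampling."""
    gray = np.asarray(gray)
    rows, cols = gray.shape
    # For every output row/column, pick the index of the source pixel.
    row_idx = np.arange(new_rows) * rows // new_rows
    col_idx = np.arange(new_cols) * cols // new_cols
    return gray[row_idx[:, None], col_idx]


small = np.array([[1, 2],
                  [3, 4]], dtype=np.uint8)

upscaled = resize_nearest(small, 4, 4)       # every pixel becomes a 2x2 block
downscaled = resize_nearest(upscaled, 2, 2)  # back to the original size
print(upscaled)
print(downscaled)
```

Nearest-neighbor sampling is simple but produces blocky results; smoother methods such as bilinear or bicubic interpolation are described on the page linked above.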

# First challenge : Small training set

CNNs sometimes require thousands or even millions of training samples in order to perform at a sufficiently high level.
Our data consists of fewer than 1000 images in total (772), which is quite a small sample.

To address this problem, we are going to use data augmentation on both defective and good images.
The best part about our data is that it is not oriented, meaning that an image does not need a specific orientation to make sense, and a transformed image can be interpreted the same way as the original.

That gives us the possibility to make some changes such as **reflection** and **rotation**.

Rotating every image in steps of 45 degrees (filling the exposed corners with white padding, grayscale value 255, whenever the rotation is not a multiple of 90 degrees) is a valid way to augment the data.
This multiplies our sample size by **8** (360/45).
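A rotation with white padding can be sketched with NumPy alone, using inverse mapping and nearest-neighbor sampling. This is an illustrative stand-in (the function name `rotate_nearest` is hypothetical); a library routine with proper interpolation would normally be preferred.

```python
import numpy as np


def rotate_nearest(gray, angle_deg, fill=255):
    """Rotate a 2D image about its center, padding uncovered pixels."""
    gray = np.asarray(gray)
    rows, cols = gray.shape
    theta = np.deg2rad(angle_deg)
    center_y, center_x = (rows - 1) / 2.0, (cols - 1) / 2.0
    ys, xs = np.indices((rows, cols))
    dy, dx = ys - center_y, xs - center_x
    # Inverse mapping: find the source pixel of every output pixel.
    src_y = np.rint(center_y + dy * np.cos(theta) - dx * np.sin(theta)).astype(int)
    src_x = np.rint(center_x + dy * np.sin(theta) + dx * np.cos(theta)).astype(int)
    inside = (src_y >= 0) & (src_y < rows) & (src_x >= 0) & (src_x < cols)
    rotated = np.full_like(gray, fill)  # start from a white canvas
    rotated[inside] = gray[src_y[inside], src_x[inside]]
    return rotated


# The 8 rotated versions of one image (0, 45, ..., 315 degrees).
image = np.zeros((5, 5), dtype=np.uint8)
rotations = [rotate_nearest(image, angle) for angle in range(0, 360, 45)]
print(len(rotations))
print(rotations[1])  # 45 degrees: the corners are filled with 255
```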

It is also possible to mirror each image across the x or y axis without affecting the input size.
This multiplies our sample size by **2**.

Gaussian noise can also be added to a certain degree to every image we have up to this point.
This multiplies our sample size by **2**.

We can also moderately increase and decrease the contrast of every image, which works well on grayscale. Counting the original, this gives three contrast levels per image.
This multiplies our sample size by **3**.

With every previously mentioned augmentation technique applied, that gives us a multiplication factor of **96** (8 x 2 x 2 x 3).
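The reflection, noise and contrast steps can be sketched with NumPy as shown below; applied to each of the 8 rotated versions of an image, the 12 variants produced here give the factor of 96. The function name `augment`, the noise standard deviation and the contrast factors are assumptions chosen for illustration, not tuned values.

```python
import numpy as np


def augment(gray, rng, noise_std=5.0, contrast_factors=(0.8, 1.0, 1.2)):
    """Return the 2 x 2 x 3 = 12 variants of one grayscale image."""
    gray = np.asarray(gray, dtype=np.float64)
    variants = []
    for flipped in (gray, np.fliplr(gray)):                # reflection : x2
        noise = rng.normal(0.0, noise_std, flipped.shape)
        for noisy in (flipped, flipped + noise):           # Gaussian noise : x2
            mean = noisy.mean()
            for factor in contrast_factors:                # contrast : x3
                # Stretch or compress the values around the image mean.
                adjusted = (noisy - mean) * factor + mean
                variants.append(np.clip(np.rint(adjusted), 0, 255).astype(np.uint8))
    return variants


rng = np.random.default_rng(0)
image = np.tile(np.arange(0, 256, 32, dtype=np.uint8), (8, 1))  # 8x8 gradient
variants = augment(image, rng)
print(len(variants), variants[0].shape)
```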

In [20]:
totalBefore = constantGood+constantDefected+variedGood+variedDefected
goodBefore = constantGood+variedGood
defectedBefore = constantDefected+variedDefected
totalAfter = totalBefore*96
goodAfter = goodBefore*96
defectedAfter = defectedBefore*96

print("Sample size after augmentation : {}".format(totalAfter))
print("Good sample size after augmentation : {}".format(goodAfter))
print("Defected sample size after augmentation : {}".format(defectedAfter))

Sample size after augmentation : 74112
Good sample size after augmentation : 57600
Defected sample size after augmentation : 16512


# Second challenge : Imbalanced data

# Third challenge : Understand the model

# Train the model

# Extract metrics from model predictions

# Conclusion