# Quality Model: Maximizing Utility of a Submitted Image

In [1]:
from datatypes import ImageQuality
from models import Models, CNN_models
from DataProcessing import transformations, ImageGenerator
import numpy as np
from misc import utils
import os
import cv2
from collections import Counter

Using TensorFlow backend.


## Goal

Create a model that screens image submissions on both the client and the server side and reports how confident it is that an image will be useful to the diagnosis models.


## Data

The labels 'good' and 'bad' do not by themselves give clear criteria for telling the two kinds of image apart, so it is important to define the distinction ourselves before throwing labeled images into a model and hoping it will figure things out. My baseline test for a positive label was to ask myself, "Does this image provide sufficient detail and perspective to make an informed analysis of the plant's health?" From there, I only made my criteria more stringent.

When I arrived, I was given a dataset of images separated into two folders, good and bad. Originally there were 3201 images labeled bad and 3153 labeled good. Looking through the good images, I found some from which I thought even a human would struggle to diagnose the plant's disease. I therefore went through every good image and kept only those that met a higher standard, since a quick skim showed that not all of them did.


#### Good, Clear Images: 2340
I define a good image as one with a clear, framed subject that has enough contrast with the background that I could distinguish leaf from non-leaf regions at a quick glance. I've provided an example below.

![Good Photo](data/goodbadcombined/clear_leaves/diseased_image_3266.jpg)


#### Blurry Images: 116

The next class is the blurry image. I define a blurry image as one that has an easily identifiable subject, but where the subject isn't clear enough to provide sufficient detail for thorough analysis. I have yet to decide whether to use these images and, if so, how. I currently use a binary classification architecture, so I feel that these blurry images would only confuse the model. My original reason for separating them from the rest of the 'bad' images was that, at a glance, blurry images share the higher-level features of a good image; labeling them as good, however, would go against my foundational requirement that a photo provide enough detail for an informed inspection. I've provided an example below.

![Blurry Photo](data/goodbadcombined/blurry/diseased_image_466.jpg)


#### Busy Images: 554

The final class is the "busy image". These images don't provide a clear subject and therefore just feel busy or chaotic. In practice, this class holds every image I personally checked that doesn't meet the previous two criteria. These will clearly be negatively labeled.

![Busy Photo](data/goodbadcombined/busy/diseased_image_425.jpg)

In [2]:
# Data Paths
clear_path = 'data/goodbadcombined/clear_leaves'
blurry_path = 'data/goodbadcombined/blurry/'
busy_path = 'data/goodbadcombined/busy/'
bad_path = 'data/goodbadcombined/bad_image'

In [3]:
print 'Number of clear images:', len(os.listdir(clear_path))
print 'Number of blurry images:', len(os.listdir(blurry_path))
print 'Number of busy images:', len(os.listdir(busy_path))
print 'Number of bad images:', len(os.listdir(bad_path))

Number of clear images: 2341
Number of blurry images: 117
Number of busy images: 555
Number of bad images: 2241
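The counts printed above are each one higher than the numbers in the section headings (2341 vs. 2340, 117 vs. 116, 555 vs. 554). `os.listdir` returns every entry in a directory, so one possible cause, and this is only an assumption, is a non-image entry such as a hidden file in each folder. A minimal sketch of counting only image files follows; `count_images` and the extension list are hypothetical names chosen for illustration, not part of the project.

```python
import os

# Assumed set of image extensions; adjust to match the dataset.
IMAGE_EXTENSIONS = ('.jpg', '.jpeg', '.png')


def count_images(path, extensions=IMAGE_EXTENSIONS):
    """Count directory entries whose names end in a known image extension."""
    return sum(1 for name in os.listdir(path)
               if name.lower().endswith(extensions))
```

Calling `count_images(clear_path)` in place of `len(os.listdir(clear_path))` would ignore any entry that is not named like an image.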


##### Early Hypothesis: Edges

Early on, while inspecting the data, I hypothesized that the number of detected edges would differ significantly between good and bad images, especially after separating the images into busy and clear. To test this, I use the [Canny Edge Detector](https://en.wikipedia.org/wiki/Canny_edge_detector) from the Python cv2 package. Let's start with a quick example to see why I thought this.

###### "good" image
![good example Photo](data/example_images/good_edge_image.png)

###### "bad" image
![bad example Photo](data/example_images/bad_edges_image.png)

In [4]:
bad_image_example = 'data/goodbadcombined/busy/diseased_image_525.jpg'
good_image_example = 'data/goodbadcombined/clear_leaves/diseased_image_411.jpg'


print '-------"good" image---------'
transformations.print_avg_edge_pixels(good_image_example)
print ''
print '-------"bad" image---------'
transformations.print_avg_edge_pixels(bad_image_example)


-------"good" image---------
Image shape: 1365 x 768
Average Edge pixels: 13.22 %

-------"bad" image---------
Image shape: 320 x 240
Average Edge pixels: 49.48 %


Obviously, there isn't much to conclude from a single random example, but my intuition was that edges would expose some kind of difference, so let's explore a little more. Another reason this initial result may be misleading is that the two images differ in size: edges occupy only a thin sliver of pixels, so image size alone can cause a large difference in the percentage of edge pixels. To account for this, I resized all the images to 128x128, the size I later use as the input to the CNN when building the actual model.

In [5]:
def avg_edge_pixels(path, name, size=(128,128)):
    images = utils.load_data(path, setting=0)
    images = transformations.resize_all_images(images, size)
    images = np.ravel(transformations.canny_edge_abs_pixels(images))
    avg_pixels = sum(images)/float(len(images))
    print 'average edge pixels for all', name ,'images:', round(avg_pixels, 2)
    return avg_pixels

# TODO: make this into a visual as well  

avg_edge_pixels(blurry_path, 'Blurry')
print '_____________________________'
avg_edge_pixels(busy_path, 'Busy')
print '_____________________________'
avg_edge_pixels(clear_path, 'Clear')
print '_____________________________'
avg_edge_pixels(bad_path, 'Bad')

loading all files from data/goodbadcombined/blurry/ ...
average edge pixels for all Blurry images: 1.75
_____________________________
loading all files from data/goodbadcombined/busy/ ...
average edge pixels for all Busy images: 1.7
_____________________________
loading all files from data/goodbadcombined/clear_leaves ...
average edge pixels for all Clear images: 1.78
_____________________________
loading all files from data/goodbadcombined/bad_image ...
average edge pixels for all Bad images: 1.72


1.7156372342790875

###### correlation??
The average number of raw edge pixels differs only slightly between classes (1.70 to 1.78), so although clear images score marginally higher than busy ones, this single heuristic is not a strong enough signal of image quality to fit a regression to. For thoroughness, I will also include the distribution of image sizes.

In [37]:
# TODO: make this into a plot for ease of use

print 'Blurry size distribution'
print Counter([v.shape for v in utils.load_data(blurry_path, 0)])
print '_____________________________'
print 'Bad size distribution'
print Counter([v.shape for v in utils.load_data(bad_path, 0)])
print '_____________________________'
print 'Busy size distribution'
print Counter([v.shape for v in utils.load_data(busy_path, 0)])
print '_____________________________'
print 'Clear size distribution'
print Counter([v.shape for v in utils.load_data(clear_path, 0)])

Blurry size distribution
loading all files from data/goodbadcombined/blurry/
Counter({(320, 240): 58, (1365, 768): 23, (1024, 768): 11, (1024, 1280): 7, (576, 1024): 5, (2448, 3264): 5, (240, 320): 4, (3264, 2448): 2, (2160, 3840): 1})
_____________________________
Bad size distribution
loading all files from data/goodbadcombined/bad_image
Counter({(320, 240): 1614, (1365, 768): 613, (240, 320): 427, (1024, 768): 218, (576, 1024): 149, (3840, 2160): 69, (2160, 3840): 27, (768, 1024): 20, (3264, 2448): 16, (1024, 1280): 16, (2448, 3264): 16, (1600, 1200): 10, (2560, 1920): 2, (640, 640): 1, (1280, 768): 1, (1280, 720): 1, (299, 299): 1})
_____________________________
Busy size distribution
loading all files from data/goodbadcombined/busy/
Counter({(320, 240): 273, (1365, 768): 106, (240, 320): 68, (1024, 768): 34, (576, 1024): 24, (2448, 3264): 17, (2160, 3840): 13, (3264, 2448): 11, (1024, 1280): 5, (3840, 2160): 2, (768, 1024): 1})
_____________________________
Clear size distribution
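Raw counts are hard to compare across classes of very different sizes, so one option is to convert each `Counter` into shares of its class total. The sketch below does this for the Blurry and Busy outputs shown above; `shape_shares` is a hypothetical helper, and the counts are copied by hand from those outputs.

```python
from collections import Counter


def shape_shares(shape_counts):
    """Convert a Counter of image shapes into fractions of the class total."""
    total = float(sum(shape_counts.values()))
    return {shape: count / total for shape, count in shape_counts.items()}


# Counts copied from the Blurry and Busy outputs above.
blurry = Counter({(320, 240): 58, (1365, 768): 23, (1024, 768): 11,
                  (1024, 1280): 7, (576, 1024): 5, (2448, 3264): 5,
                  (240, 320): 4, (3264, 2448): 2, (2160, 3840): 1})
busy = Counter({(320, 240): 273, (1365, 768): 106, (240, 320): 68,
                (1024, 768): 34, (576, 1024): 24, (2448, 3264): 17,
                (2160, 3840): 13, (3264, 2448): 11, (1024, 1280): 5,
                (3840, 2160): 2, (768, 1024): 1})

# Share of each class made up of (320, 240) images.
print(shape_shares(blurry)[(320, 240)])
print(shape_shares(busy)[(320, 240)])
```

In both classes, roughly half of the images have shape (320, 240), which suggests image size alone does not separate them.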

In [None]:
train_generator, validation_generator = ImageGenerator.get_generator()
model = CNN_models.get_cnn()

model.fit_generator(
        train_generator,
        steps_per_epoch=1000,
        epochs=20,
        validation_data=validation_generator,
        validation_steps=400)

model.save_weights('serialized_objects/first_try.h5') 

Found 3878 images belonging to 2 classes.
Found 1663 images belonging to 2 classes.
Epoch 1/20
  90/1000 [=>............................] - ETA: 948s - loss: 0.2006 - acc: 0.6944
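In Keras, `steps_per_epoch` is the number of batches drawn from the generator per epoch, so how many times each epoch passes over the 3878 training images depends on the batch size. The batch size used by `ImageGenerator.get_generator()` is not shown in this notebook; the sketch below assumes 32 purely for illustration, and both function names are hypothetical.

```python
import math


def steps_for_one_pass(num_samples, batch_size):
    """Number of batches needed to see every sample once."""
    return int(math.ceil(num_samples / float(batch_size)))


def passes_per_epoch(steps_per_epoch, num_samples, batch_size):
    """How many times one epoch cycles through the whole dataset."""
    return steps_per_epoch / float(steps_for_one_pass(num_samples, batch_size))


ASSUMED_BATCH_SIZE = 32  # assumption; not taken from the project code

print(steps_for_one_pass(3878, ASSUMED_BATCH_SIZE))      # training images
print(steps_for_one_pass(1663, ASSUMED_BATCH_SIZE))      # validation images
print(passes_per_epoch(1000, 3878, ASSUMED_BATCH_SIZE))  # steps_per_epoch=1000
```

Under that assumption, one pass over the training set takes far fewer than 1000 steps, so each epoch as configured would revisit every image several times. Whether that is intended depends on the augmentation the generator applies.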