### Data Preparation
One way to feed image data into Keras is to organize the train/val/test datasets in the following directory structure.

```
images
├───train
│   ├───class_1
│   │   │   image001.jpg
│   │   │   image002.jpg
│   │   │   ...
│   └───class_2
├───val
│   └───class_1
│       │   ...
└───test
    └───class_1
        │   ...
```
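We can use the __split_folders__ library to split the original dataset. With the two-value `ratio=(.8, .2)` used below, it produces train and val sets; a third ratio value would also produce a test set.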

In [None]:
import split_folders

split_folders.ratio('../../data/images/train', output="../../data/images/new", seed=1337, ratio=(.8, .2))
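
For readers without the library, the idea behind a ratio split can be sketched with the standard library alone. The helper `split_ratio` below is hypothetical and is not part of `split_folders`. It assumes one sub-folder per class, shuffles each class's files with a fixed seed, and copies the first 80% into `train` and the rest into `val`.

```python
import random
import shutil
from pathlib import Path


def split_ratio(src, dst, ratio=(0.8, 0.2), seed=1337):
    """Copy src/<class>/* into dst/train/<class> and dst/val/<class>.

    Hypothetical stand-in for illustration only; not the split_folders API.
    Returns {class_name: (n_train, n_val)}.
    """
    src, dst = Path(src), Path(dst)
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    counts = {}
    for class_dir in sorted(p for p in src.iterdir() if p.is_dir()):
        files = sorted(f for f in class_dir.iterdir() if f.is_file())
        rng.shuffle(files)
        n_train = int(len(files) * ratio[0])
        splits = {"train": files[:n_train], "val": files[n_train:]}
        for split, members in splits.items():
            out_dir = dst / split / class_dir.name
            out_dir.mkdir(parents=True, exist_ok=True)
            for f in members:
                shutil.copy2(f, out_dir / f.name)  # copy, leave the source intact
        counts[class_dir.name] = (n_train, len(files) - n_train)
    return counts
```

Because the split is done per class, each class keeps roughly the same train/val proportion, which matters for the small classes discussed later.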

#### Corrupted EXIF tags in images
Several warnings were shown during Keras model training. They were caused by corrupted EXIF tags in some of the JPEG images. To make sure the corrupted tags do not affect training, a simple fix is to remove the EXIF tags from all the images.

In [None]:
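import pathlib
import piexif

# folders_list holds one path per class folder (it is also built in the next section)
folders_list = [p for p in pathlib.Path('../../data/images/train').iterdir() if p.is_dir()]

for folder in folders_list:
    folder_path = pathlib.Path(folder).resolve()
    for file in folder_path.glob('*.*'):
        # strip all EXIF metadata from the file in place
        piexif.remove(str(file))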

### Imbalanced classes

In [59]:
import pathlib
from pprint import pprint
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

train_images_path = '../../data/images/train'
path = pathlib.Path(train_images_path).resolve()
folders = path.glob("*/")
folders_list = [folder.resolve() for folder in folders]

files_count = {folder.name: len(list(folder.glob("*.*"))) for folder in folders_list}
files_count

{'Borehole - Mechanized': 52,
 'Borehole - Mechanized with diesel': 4,
 'Bucket': 640,
 'Hand Pump': 640,
 'Hand Pump - Afridev': 640,
 'Hand Pump - India Mark II': 640,
 'Hand Pump - Vergnet': 640,
 'Kiosk': 640,
 'Other': 6,
 'Protected Spring': 207,
 'Tapstand': 640}

Two classes have extremely few training images, so the model is unlikely to learn to classify them reliably:
- Borehole - Mechanized with diesel
- Other

However, I will keep all 11 classes for a preliminary model and evaluate whether it gives undesirable results.

### Class weights
A model trained on imbalanced data tends to be biased toward always predicting the classes with more data. One way to counter this in a classification problem is to force the algorithm to treat every instance of "class 1" as, say, 50 instances of "class 0". This is done with class weights.

Keras takes the class weights as a dictionary that maps each class index to its weight:
```
{0.0: <class_weight0>, 1.0: <class_weight1>, 2.0: <class_weight2>, ...}
```

In [63]:
class_count = {float(i): len(list(folder.glob("*.*"))) for i, folder in enumerate(folders_list)}
# weight each class inversely to its size, relative to the largest class
max_count = max(class_count.values())
class_weight = {key: max_count / count for key, count in class_count.items()}
class_weight

{0.0: 12.307692307692308,
 1.0: 160.0,
 2.0: 1.0,
 3.0: 1.0,
 4.0: 1.0,
 5.0: 1.0,
 6.0: 1.0,
 7.0: 1.0,
 8.0: 106.66666666666667,
 9.0: 3.0917874396135265,
 10.0: 1.0}
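
The weights can be reproduced without the image folders, which makes the rule easier to check by hand. This is a minimal sketch: the counts are copied from the `files_count` output earlier, the function name `inverse_frequency_weights` is made up for illustration, and it assumes the dictionary order matches the class index order. It uses integer keys, which is how the Keras documentation describes `class_weight`.

```python
# Image counts per class, copied from the files_count output earlier in the notebook.
files_count = {
    'Borehole - Mechanized': 52,
    'Borehole - Mechanized with diesel': 4,
    'Bucket': 640,
    'Hand Pump': 640,
    'Hand Pump - Afridev': 640,
    'Hand Pump - India Mark II': 640,
    'Hand Pump - Vergnet': 640,
    'Kiosk': 640,
    'Other': 6,
    'Protected Spring': 207,
    'Tapstand': 640,
}


def inverse_frequency_weights(counts):
    """Map class index -> (largest class size / this class size)."""
    largest = max(counts.values())
    # dicts preserve insertion order, so enumerate() yields the class index
    return {index: largest / n for index, n in enumerate(counts.values())}


class_weight = inverse_frequency_weights(files_count)
```

With these counts, each of the 4 "Borehole - Mechanized with diesel" images contributes to the loss as much as 160 images from a 640-image class.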

In [65]:
import numpy as np
import keras
import matplotlib.pyplot as plt
from keras.layers import Dense, GlobalAveragePooling2D, Dropout
from keras.applications import MobileNet
from keras.applications.mobilenet import preprocess_input
from keras.preprocessing import image
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Model, load_model
from keras.optimizers import Adam
from keras_tqdm import TQDMNotebookCallback

base_model = MobileNet(weights='imagenet',
                       include_top=False)  # imports the mobilenet model and discards the last 1000 neuron layer.

dropout_rate = 0.35

x = base_model.output
x = GlobalAveragePooling2D()(x)
x = Dense(1024, activation='relu')(x)  # we add dense layers so that the model can learn more complex functions and classify for better results.
x = Dropout(dropout_rate)(x)
x = Dense(1024, activation='relu')(x)  # dense layer 2
x = Dropout(dropout_rate)(x)
x = Dense(512, activation='relu')(x)  # dense layer 3
preds = Dense(11, activation='softmax')(x)  # final layer with softmax activation

model = Model(inputs=base_model.input, outputs=preds)

for layer in model.layers[:20]:
    layer.trainable = False
for layer in model.layers[20:]:
    layer.trainable = True

In [3]:
classes = [folder.name for folder in folders_list]
classes

['Borehole - Mechanized',
 'Borehole - Mechanized with diesel',
 'Bucket',
 'Hand Pump',
 'Hand Pump - Afridev',
 'Hand Pump - India Mark II',
 'Hand Pump - Vergnet',
 'Kiosk',
 'Other',
 'Protected Spring',
 'Tapstand']

In [None]:
model.summary()

In [None]:
# Inspect specific layers
# for layer in model.layers[-4:]:
#     pprint(layer.get_config())

In [4]:
# augment the training images with random shear, zoom and horizontal flips
train_datagen = ImageDataGenerator(rescale=1./255,
                                   shear_range=0.2,
                                   zoom_range=0.2,
                                   horizontal_flip=True)

train_generator = train_datagen.flow_from_directory('../../data/images/train',
                                                    target_size=(224, 224),
                                                    color_mode='rgb',
                                                    batch_size=64,
                                                    classes=classes,
                                                    class_mode='categorical',
                                                    shuffle=True)

# note: train_datagen is reused here, so the validation images are randomly augmented too
valid_generator = train_datagen.flow_from_directory('../../data/images/val',
                                                    target_size=(224, 224),
                                                    color_mode='rgb',
                                                    batch_size=64,
                                                    classes=classes,
                                                    class_mode='categorical',
                                                    shuffle=True)

Found 4749 images belonging to 11 classes.
Found 1189 images belonging to 11 classes.


### Learning Rate Scheduler
https://machinelearningmastery.com/using-learning-rate-schedules-deep-learning-models-python-keras/

In this training process, the learning rate decays in steps as a function of the epoch (i.e. step decay). This helps the model's accuracy converge without having to adjust the learning rate manually.

When LearningRateScheduler is used, the learning rate specified by Adam is ignored.

In [50]:
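import math

# scratch check of the step-decay formula for one epoch (an initial rate of 0 always yields 0.0)
initial_lrate = 0.000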
epoch = 10
drop = 0.1
epochs_drop = 5.0
lrate = initial_lrate * math.pow(drop, math.floor((epoch)/epochs_drop))
float(lrate)

0.0

In [51]:
from keras.callbacks import LearningRateScheduler, ModelCheckpoint, EarlyStopping
import math

# learning rate schedule
def step_decay(epoch):
    initial_lrate = 0.0005
    drop = 0.5
    epochs_drop = 5.0
    lrate = initial_lrate * math.pow(drop, math.floor((1+epoch)/epochs_drop))
    return lrate

# learning schedule callback
lrate = LearningRateScheduler(step_decay)
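
To see what this schedule produces, the same formula can be evaluated for a few epochs. The sketch below restates `step_decay` with its constants as default arguments (a small change made only so the example is self-contained) and builds a table of epoch index against learning rate.

```python
import math


def step_decay(epoch, initial_lrate=0.0005, drop=0.5, epochs_drop=5.0):
    """Halve the learning rate every `epochs_drop` epochs (epoch is 0-based)."""
    return initial_lrate * math.pow(drop, math.floor((1 + epoch) / epochs_drop))


# learning rate for the first 15 epochs
schedule = {epoch: step_decay(epoch) for epoch in range(15)}
for epoch, rate in schedule.items():
    print(epoch, rate)
```

Because of the `1 + epoch` term, the first drop happens at epoch index 4 rather than 5: epochs 0 to 3 use 0.0005, epochs 4 to 8 use 0.00025, and epochs 9 to 13 use 0.000125.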

### Early Stopping
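https://machinelearningmastery.com/how-to-stop-training-deep-neural-networks-at-the-right-time-using-early-stopping/

Early stopping halts training once the monitored metric (here `val_loss`) has gone `patience` epochs in a row without improving on its best value, so the full 200 epochs need not be run.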

In [52]:
# simple early stopping
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=20)
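
The `patience` rule can be illustrated in plain Python. This is a simplified model of the behaviour, not the Keras implementation: the function `stopping_epoch` is hypothetical, and it ignores options such as `min_delta` and `baseline`.

```python
def stopping_epoch(val_losses, patience):
    """Return the 0-based epoch at which training would stop, or None.

    Training stops once `patience` epochs in a row fail to beat the best
    validation loss seen so far.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss   # new best: reset the counter
            wait = 0
        else:
            wait += 1     # no improvement this epoch
            if wait >= patience:
                return epoch
    return None


# the loss improves twice, then fails to improve three times in a row
print(stopping_epoch([1.0, 0.9, 0.95, 0.92, 0.91], patience=3))
```

A large value such as the `patience=20` used above tolerates long plateaus, which can be useful when the learning rate is still being stepped down.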

### Model Checkpoint
https://machinelearningmastery.com/how-to-stop-training-deep-neural-networks-at-the-right-time-using-early-stopping/

In [53]:
mc = ModelCheckpoint('best_model.h5', monitor='val_acc', mode='max', verbose=1, save_best_only=True)
callbacks_list = [lrate, es, mc]

In [66]:
model.compile(optimizer=Adam(), loss='categorical_crossentropy', metrics=['accuracy'])

step_size_train = train_generator.n // train_generator.batch_size
step_size_valid = valid_generator.n // valid_generator.batch_size
model.fit_generator(generator=train_generator,
                    validation_data=valid_generator,
                    steps_per_epoch=step_size_train,
                    validation_steps=step_size_valid,
                    epochs=200,
                    class_weight=class_weight,
                    verbose=2,
                    callbacks=callbacks_list)

### Saving the Model and the Metrics

In [67]:
import pickle

# save:
with open('history.pckl', 'wb') as f:
    pickle.dump(model.history, f)

# retrieve:
with open('history.pckl', 'rb') as f:
    history = pickle.load(f)

In [None]:
# save model and architecture to single file
model.save("model_oversample3.h5")
print("Saved model to disk")

### Training Process Evaluation

In [None]:
import matplotlib.pyplot as plt
import pickle

# history = model.fit(x, y, validation_split=0.25, epochs=50, batch_size=16, verbose=1)

# retrieve:
with open('history.pckl', 'rb') as f:
    history = pickle.load(f)

# Plot training & validation accuracy values
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')
plt.show()

### Model Evaluation

In [14]:
from keras.models import load_model
from keras.preprocessing.image import ImageDataGenerator

# model.save('my_model.h5')  # creates a HDF5 file 'my_model.h5'
# del model  # deletes the existing model

# load the best checkpoint written by ModelCheckpoint during training
model = load_model('best_model.h5')

# STEP_SIZE_TEST=test_generator.n//test_generator.batch_size
# test_generator.reset()
# pred=model.predict_generator(test_generator,
# steps=STEP_SIZE_TEST,
# verbose=1)

test_datagen = ImageDataGenerator(rescale=1./255)
test_generator = test_datagen.flow_from_directory('../../data/images/test',
                                                  target_size=(224, 224),
                                                  color_mode='rgb',
                                                  batch_size=64,
                                                  classes=classes,
                                                  class_mode='categorical')
step_size_test = test_generator.n // test_generator.batch_size

loss_and_metrics = model.evaluate_generator(test_generator, steps=step_size_test)
print(loss_and_metrics)

Found 1484 images belonging to 11 classes.
[0.7527404284995535, 0.7778532608695652]
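
One detail of the evaluation above is worth checking: `step_size_test` uses integer division, so with 1484 test images and a batch size of 64 the last partial batch is not evaluated. The sketch below works through the arithmetic. The function name `evaluation_steps` is hypothetical, and rounding up is only a suggestion, on the assumption that the generator yields a smaller final batch.

```python
import math


def evaluation_steps(n_samples, batch_size):
    """Return (floor_steps, ceil_steps) for a given dataset and batch size."""
    floor_steps = n_samples // batch_size            # drops the partial batch
    ceil_steps = math.ceil(n_samples / batch_size)   # includes the partial batch
    return floor_steps, ceil_steps


floor_steps, ceil_steps = evaluation_steps(1484, 64)
skipped = 1484 - floor_steps * 64  # images left out by integer division
print(floor_steps, ceil_steps, skipped)
```

With these numbers, 23 steps cover 1472 images and leave 12 out, so the reported loss and accuracy describe 1472 of the 1484 test images.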
