# Assignment: Amoeba Classification

This is the amoeba classification assignment. Students are encouraged to fill in the blank code in the "Build and train the model" and "Evaluate the model" parts after studying the code in "Example: Clothes classification".

Here, we use images that were collected in our research lab to train our own custom model, which classifies whether an image contains an amoeba or not. Because the dataset is small, we use two techniques to make the experiment reproducible: data augmentation and repeating the experiment multiple times.


## Table of contents

* Load images dataset
* Data preparation
* Build and train the model (blank in here)
* Evaluate the model (blank in here)
* Inference

# Load images dataset

First we load the image dataset that will be used to train our custom model. All of the images were collected in our lab.

In [None]:
%%shell
# download the dataset from GitHub (the %%shell cell magic has to be the first line of the cell)
git clone https://github.com/BaosenZ/amoeba-detection.git


Cloning into 'amoeba-detection'...
remote: Enumerating objects: 2109, done.
remote: Counting objects: 100% (34/34), done.
remote: Compressing objects: 100% (27/27), done.
remote: Total 2109 (delta 13), reused 23 (delta 7), pack-reused 2075
Receiving objects: 100% (2109/2109), 315.32 MiB | 26.81 MiB/s, done.
Resolving deltas: 100% (338/338), done.
Checking out files: 100% (3056/3056), done.




In [None]:
# copy the dataset from github folder to 'content' 
!cp -r '/content/amoeba-detection/dataset-level1/dataset-amoebaClassification' '/content'

In [None]:
# upload zip file of the dataset from local

# from google.colab import files
# uploaded = files.upload()
# for fn in uploaded.keys():
#     print('User uploaded file "{name}" with length {length} bytes'.format(name=fn, length=len(uploaded[fn])))

# !unzip dataset-amoebaClassification.zip

# Data preparation

In this step, we prepare the dataset: we augment the data, split it into training, validation and test datasets, and normalize it. More ways to prepare an image dataset can be found here: https://keras.io/api/data_loading/image/.

## Data augmentation

Data augmentation produces more data for training and validation. When the training dataset is small, we can apply transformations to the existing data and add the transformed copies to the training dataset. For image data specifically, data augmentation can consist of flipping the image horizontally or vertically, rotating it, zooming in or out, cropping, varying the color, and so on.
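
As a rough illustration of what such transformations do to the pixel array, the sketch below flips and rotates a tiny made-up "image" with NumPy only. It is not the `ImageDataGenerator` pipeline used in this assignment, and the array values are hypothetical.

```python
import numpy as np

# A tiny 2x3 "image" with one channel, used only for illustration.
image = np.arange(6, dtype=np.uint8).reshape(2, 3, 1)

flipped_lr = np.flip(image, axis=1)             # horizontal flip (mirror left-right)
flipped_ud = np.flip(image, axis=0)             # vertical flip (mirror top-bottom)
rotated_90 = np.rot90(image, k=1, axes=(0, 1))  # rotate 90 degrees counter-clockwise

print(image[..., 0])
print(flipped_lr[..., 0])
print(rotated_90.shape)
```

Each transformed array shows the same content from a different viewpoint, which is why it can serve as an extra training sample.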

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import os
import random
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from PIL import Image

gen = ImageDataGenerator(rotation_range=10, width_shift_range=0.1, height_shift_range=0.1, shear_range=0.15, zoom_range=0.1, 
    channel_shift_range=10., horizontal_flip=True)
    
# Data augmentation for train amoeba dataset
fileList=os.listdir("./dataset-amoebaClassification/train/amoeba/")

for i in fileList:
    image_path = 'dataset-amoebaClassification/train/amoeba/' + i
    image = np.expand_dims(plt.imread(image_path),0)
    aug_iter = gen.flow(image)
    aug_images = [next(aug_iter)[0].astype(np.uint8) for m in range(2)]

    for j in range(0,2):
        img = Image.fromarray(aug_images[j])
        save_path = './dataset-amoebaClassification/train/amoeba' + "/" + "aug-" + str(j) + str(i)
        img.save(save_path)

# The augmented images are saved into the ./dataset-amoebaClassification/train/amoeba folder.
# Check this folder to find the augmented images, which have the prefix 'aug-'.

In [None]:
# Data augmentation for train noAmoeba dataset
fileList2=os.listdir("./dataset-amoebaClassification/train/noAmoeba/")

for i2 in fileList2:
    image_path = 'dataset-amoebaClassification/train/noAmoeba/' + i2
    image = np.expand_dims(plt.imread(image_path),0)
    aug_iter = gen.flow(image)
    aug_images = [next(aug_iter)[0].astype(np.uint8) for m2 in range(2)]

    for j2 in range(0,2):
        img = Image.fromarray(aug_images[j2])
        save_path = './dataset-amoebaClassification/train/noAmoeba' + "/" + "aug-" + str(j2) + str(i2)
        img.save(save_path)

# The augmented images are saved into the ./dataset-amoebaClassification/train/noAmoeba folder.
# Check this folder to find the augmented images, which have the prefix 'aug-'.

In [None]:
# Data augmentation for test Amoeba dataset
fileList3=os.listdir("./dataset-amoebaClassification/test/amoeba/")

for i3 in fileList3:
    image_path = 'dataset-amoebaClassification/test/amoeba/' + i3
    image = np.expand_dims(plt.imread(image_path),0)
    aug_iter = gen.flow(image)
    aug_images = [next(aug_iter)[0].astype(np.uint8) for m3 in range(2)]

    for j3 in range(0,2):
        img = Image.fromarray(aug_images[j3])
        save_path = './dataset-amoebaClassification/test/amoeba' + "/" + "aug-" + str(j3) + str(i3)
        img.save(save_path)

# The augmented images are saved into the ./dataset-amoebaClassification/test/amoeba folder.
# Check this folder to find the augmented images, which have the prefix 'aug-'.

In [None]:
# Data augmentation for test noAmoeba dataset
fileList4=os.listdir("./dataset-amoebaClassification/test/noAmoeba/")

for i4 in fileList4:
    image_path = 'dataset-amoebaClassification/test/noAmoeba/' + i4
    image = np.expand_dims(plt.imread(image_path),0)
    aug_iter = gen.flow(image)
    aug_images = [next(aug_iter)[0].astype(np.uint8) for m4 in range(2)]

    for j4 in range(0,2):
        img = Image.fromarray(aug_images[j4])
        save_path = './dataset-amoebaClassification/test/noAmoeba' + "/" + "aug-" + str(j4) + str(i4)
        img.save(save_path)

# The augmented images are saved into the ./dataset-amoebaClassification/test/noAmoeba folder.
# Check this folder to find the augmented images, which have the prefix 'aug-'.

## Split and normalize the dataset

In [None]:
import os
import numpy as np
from tqdm import tqdm
from glob import glob
from PIL import Image
import tensorflow as tf
import matplotlib.pyplot as plt
from tensorflow import keras

# training data preparation

# define image size, it can be modified 
img_size = 150
# training images dataset path
train_path = 'dataset-amoebaClassification/train' 
nub_train = len(glob(train_path + '/*/*.jpg'))
# Create empty array, fill out the image array to newly-created array. 
X_train_full = np.zeros((nub_train,img_size,img_size,3),dtype=np.uint8) 
y_train_full = np.zeros((nub_train,),dtype=np.uint8)

i = 0
for img_path in tqdm(glob(train_path + '/*/*.jpg')):
    img = Image.open(img_path)
    # image resize
    img = img.resize((img_size,img_size)) 
    # images are converted to array
    arr = np.asarray(img)
    # assign array
    X_train_full[i, :, :, :] = arr
    
    if img_path.split('/')[-2] == 'amoeba':
        # Set amoeba class as 1
        y_train_full[i] = 1
    else:
        # Set no amoeba class as 0
        y_train_full[i] = 0
        
    i += 1

In [None]:
# # test data preparation

img_size = 150
test_path = 'dataset-amoebaClassification/test'
nub_test = len(glob(test_path + '/*/*.jpg'))

X_test = np.zeros((nub_test,img_size,img_size,3),dtype=np.uint8) 
y_test = np.zeros((nub_test,),dtype=np.uint8)

i = 0
for img_path in tqdm(glob(test_path + '/*/*.jpg')):
    img = Image.open(img_path)
    img = img.resize((img_size,img_size))
    arr = np.asarray(img)
    X_test[i, :, :, :] = arr
          
    if img_path.split('/')[-2] == 'amoeba':
        # Set amoeba class as 1
        y_test[i] = 1
    else:
        # Set no amoeba class as 0
        y_test[i] = 0
        
    i += 1

In [None]:
# # Visualize the training dataset
fig,axes = plt.subplots(3,4,figsize=(20, 20))

j = 0
for i,img in enumerate(X_train_full[:12]):
    axes[i//4,j%4].imshow(img)
    j+=1

In [None]:
# normalize the dataset
X_mean = X_train_full.mean(axis=0, keepdims=True)
X_std = X_train_full.std(axis=0, keepdims=True) + 1e-7
X_train_full_norm = (X_train_full - X_mean) / X_std
X_test_norm = (X_test - X_mean) / X_std

# The images are already RGB arrays of shape (N, 150, 150, 3), which matches the
# model input shape, so no extra channel axis is added here.
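
The normalization cell standardizes every pixel position with the mean and standard deviation of the training images, and reuses those same statistics for the test images. The sketch below repeats the idea on small random arrays; the names `toy_train` and `toy_test` and their sizes are hypothetical stand-ins, not the real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-ins for the image arrays: 8 and 3 tiny 4x4 RGB images.
toy_train = rng.integers(0, 256, size=(8, 4, 4, 3)).astype(np.float64)
toy_test = rng.integers(0, 256, size=(3, 4, 4, 3)).astype(np.float64)

# Statistics come from the training images only, one value per pixel and channel.
toy_mean = toy_train.mean(axis=0, keepdims=True)
toy_std = toy_train.std(axis=0, keepdims=True) + 1e-7  # avoid division by zero

toy_train_norm = (toy_train - toy_mean) / toy_std
toy_test_norm = (toy_test - toy_mean) / toy_std  # reuse the training statistics

print(toy_mean.shape)
print(np.abs(toy_train_norm.mean(axis=0)).max())  # very close to 0
```

Reusing the training statistics keeps information about the test images out of the preprocessing step.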

In [None]:
# Split the X_train_full_norm and y_train_full dataset randomly
def split_dataset(X_train_full_norm, y_train_full):
    # randomize X_train_full_norm and y_train_full array at the same order
    total_images = len(X_train_full_norm)
    idx = np.random.choice(np.arange(total_images), total_images, replace=False)
    X_train_full_norm = X_train_full_norm[idx]
    y_train_full = y_train_full[idx]

    # split the X_train_full into X_train(3/4) and X_valid(1/4)
    X_train_norm, X_validation_norm = X_train_full_norm[0 : int(total_images*3/4)], X_train_full_norm[int(total_images*3/4):total_images]
    y_train, y_validation = y_train_full[0 : int(total_images*3/4)], y_train_full[int(total_images*3/4):total_images]
    return X_train_norm, X_validation_norm, y_train, y_validation
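
Since the goal is a reproducible experiment, it may help to make the random split repeatable as well. The sketch below is one possible variant, not part of the original assignment: the function name `split_dataset_seeded` and its `seed` and `train_fraction` parameters are assumptions, and the toy arrays only demonstrate that images and labels stay paired.

```python
import numpy as np

def split_dataset_seeded(X, y, train_fraction=0.75, seed=0):
    """Shuffle X and y with the same permutation, then split them."""
    rng = np.random.default_rng(seed)       # a fixed seed gives the same split every run
    idx = rng.permutation(len(X))
    n_train = int(len(X) * train_fraction)
    train_idx, valid_idx = idx[:n_train], idx[n_train:]
    return X[train_idx], X[valid_idx], y[train_idx], y[valid_idx]

# Toy data: 8 one-value "images"; the label is the parity of the stored value.
X_toy = np.arange(8).reshape(8, 1)
y_toy = np.arange(8) % 2
X_tr, X_va, y_tr, y_va = split_dataset_seeded(X_toy, y_toy, seed=42)
print(X_tr.shape, X_va.shape)
```

Passing a different `seed` for each repetition would still give a different split per repetition while keeping the whole experiment repeatable.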

# Build and train the model (blank in here)

A simple convolutional neural network (CNN) is used as the model. In Keras, a network is assembled from layers, which are its basic building blocks. The layers used here are documented in the Keras docs: convolutional layers (https://keras.io/api/layers/convolution_layers/convolution2d/), pooling layers (https://keras.io/api/layers/pooling_layers/), input and dense layers (https://keras.io/api/layers/core_layers/), and the flatten layer (https://keras.io/api/layers/reshaping_layers/flatten/). For guidance on how to arrange the layers, see the example in the Keras docs (https://keras.io/examples/vision/mnist_convnet/) or a textbook such as Géron, A., 2019. Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. O'Reilly Media, Inc.

In [None]:
# Build the model (sequential CNN model)
from functools import partial

DefaultConv2D = partial(keras.layers.Conv2D, kernel_size=3, activation='relu', padding="SAME")

model = keras.models.Sequential([                           
    DefaultConv2D(filters=64, kernel_size=7, input_shape=[150, 150, 3]),  # input shape should match the input image size
    keras.layers.MaxPooling2D(pool_size=2),
    DefaultConv2D(filters=128),
    DefaultConv2D(filters=128),
    keras.layers.MaxPooling2D(pool_size=2),
    DefaultConv2D(filters=256),
    DefaultConv2D(filters=256),
    keras.layers.MaxPooling2D(pool_size=2),
    keras.layers.Flatten(),
    keras.layers.Dense(units=128, activation='relu'),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(units=64, activation='relu'),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(units=32, activation='relu'),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(units=16, activation='relu'),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(units=2, activation='softmax'),
])
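
To see why `input_shape` has to match the image size, it can help to trace the spatial size through the layer stack above. The sketch below does the arithmetic in plain Python, assuming stride-1 convolutions with "same" padding and the default `MaxPooling2D` behaviour (stride equal to the pool size, "valid" padding). The helper names are made up for this illustration.

```python
def same_conv(size):
    # With padding="same" and stride 1, a convolution keeps height and width.
    return size

def max_pool(size, pool_size=2):
    # Pooling with stride == pool_size and "valid" padding rounds down.
    return size // pool_size

size = 150                                   # input images are 150 x 150
size = max_pool(same_conv(size))             # conv(64) + pool
size = max_pool(same_conv(same_conv(size)))  # 2 x conv(128) + pool
size = max_pool(same_conv(same_conv(size)))  # 2 x conv(256) + pool

flatten_units = size * size * 256            # 256 feature maps reach the Flatten layer
print(size, flatten_units)
```

Under these assumptions the three pooling layers shrink the image from 150 to 75, 37 and finally 18 pixels per side, so `Flatten` passes 18 * 18 * 256 values to the first dense layer.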

The purpose of a loss function is to compute the quantity that a model should seek to minimize during training. For more available loss functions, refer to the Keras docs: https://keras.io/api/losses/
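
As an illustration of what a loss value means, the sketch below computes a sparse categorical cross-entropy by hand with NumPy: the loss is the average negative log of the probability that the model assigns to the true class. It is a simplified sketch, not the Keras implementation, and the labels and probabilities are made up.

```python
import numpy as np

def sparse_categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-probability of the true class (simplified sketch)."""
    y_prob = np.clip(y_prob, eps, 1.0)               # guard against log(0)
    picked = y_prob[np.arange(len(y_true)), y_true]  # probability given to the true class
    return float(np.mean(-np.log(picked)))

# Made-up labels (1 = amoeba, 0 = no amoeba) and predicted probabilities.
y_true = np.array([1, 0, 1])
y_prob = np.array([[0.1, 0.9],
                   [0.8, 0.2],
                   [0.5, 0.5]])

loss = sparse_categorical_crossentropy(y_true, y_prob)
print(round(loss, 4))
```

Confident correct predictions contribute little to the loss, while the undecided third sample contributes the most.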

An optimizer is an algorithm that updates the attributes of the neural network, such as its weights, during training. It thereby reduces the overall loss and improves the accuracy. For more available optimizers, refer to the Keras docs: https://keras.io/api/optimizers/
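
The basic idea behind an optimizer can be sketched with plain gradient descent on a one-parameter toy loss. This is deliberately much simpler than the Nadam optimizer used in this assignment; the loss function, starting point and step count below are all made up for illustration.

```python
def gradient_descent(start, learning_rate=0.1, steps=50):
    """Minimize the toy loss (w - 3)^2 by repeatedly stepping against the gradient."""
    w = start
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)     # derivative of (w - 3)^2 with respect to w
        w -= learning_rate * grad  # move a small step downhill
    return w

w_final = gradient_descent(start=0.0)
print(w_final)  # approaches the minimum at w = 3
```

The learning rate sets the step size: too small and training is slow, too large and the updates can overshoot the minimum.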

A metric is a function that is used to judge the performance of your model. For more available metrics, refer to the Keras docs: https://keras.io/api/metrics/
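
For a classifier, the accuracy metric is simply the fraction of samples whose most probable class equals the true label. The sketch below computes it with NumPy on made-up labels and probabilities; it illustrates the idea and is not the Keras metric itself.

```python
import numpy as np

def accuracy(y_true, y_prob):
    """Fraction of samples whose highest-probability class matches the label."""
    y_pred = np.argmax(y_prob, axis=-1)
    return float(np.mean(y_pred == y_true))

# Made-up labels and predicted probabilities for four images.
y_true = np.array([1, 0, 1, 0])
y_prob = np.array([[0.1, 0.9],
                   [0.8, 0.2],
                   [0.6, 0.4],
                   [0.3, 0.7]])

acc = accuracy(y_true, y_prob)
print(acc)  # the first two images are classified correctly, the last two are not
```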

In [None]:
# Compile the model
opt = keras.optimizers.Nadam(learning_rate=0.00001)  # learning rate can be changed to increase the performance of the model
model.compile(loss="sparse_categorical_crossentropy", optimizer=opt, metrics=["accuracy"])

Next, we train the model by calling `model.fit()` and save it by calling `model.save()`. More information about these two functions is in the Keras docs: https://keras.io/api/models/.

! fill out the blank here

In [None]:
# Train the model

# Create a dir to save our model
if not os.path.exists("model_dir"):
  os.mkdir("model_dir")

repeat_exp = 11  # Set how many times (repeat_exp - 1) you want to repeat the experiment
for i in range(1, repeat_exp): 
    # Split the dataset randomly
    X_train_norm, X_validation_norm, y_train, y_validation = split_dataset(X_train_full_norm, y_train_full)

    # rebuild the model
    DefaultConv2D = partial(keras.layers.Conv2D, kernel_size=3, activation='relu', padding="SAME")
    model = keras.models.Sequential([                           
    DefaultConv2D(filters=64, kernel_size=7, input_shape=[150, 150, 3]),  # input shape should match the input image size
    keras.layers.MaxPooling2D(pool_size=2),
    DefaultConv2D(filters=128),
    DefaultConv2D(filters=128),
    keras.layers.MaxPooling2D(pool_size=2),
    DefaultConv2D(filters=256),
    DefaultConv2D(filters=256),
    keras.layers.MaxPooling2D(pool_size=2),
    keras.layers.Flatten(),
    keras.layers.Dense(units=128, activation='relu'),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(units=64, activation='relu'),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(units=32, activation='relu'),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(units=16, activation='relu'),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(units=2, activation='softmax'),
    ])

    # Compile the model
    opt = keras.optimizers.Nadam(learning_rate=0.00001)  # learning rate can be changed to increase the performance of the model
    model.compile(loss="sparse_categorical_crossentropy", optimizer=opt, metrics=["accuracy"])

    # Start to train the model
    print("start repetition ", i)
    history = model.fit( , , epochs=10, validation_data=(X_validation_norm, y_validation)) # fill out the blank
    # Save the trained model in model_dir dir
    saved_model = "model_dir/" + "model" + str(i)
    model.save(saved_model)


# Evaluate the model (blank in here)

In [None]:
# visualize the model structure with model.summary(). Feel free to comment out the code below to visualize the model structure

# model.summary()

The test dataset is not used for training or validation, so its images are new to the trained model. We use it to measure the performance of the model. The performance is acceptable when the accuracy on the test dataset is close to the accuracy on the training and validation datasets.

Because the dataset is small, another way to evaluate the model is to repeat the experiment multiple times and calculate the mean test accuracy over all repetitions.
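
When summarizing repeated runs, it can be useful to report the spread together with the mean. The sketch below uses a short list of made-up accuracies (hypothetical values, not results from this assignment) to show the calculation.

```python
import numpy as np

# Hypothetical test accuracies from five repetitions (made-up numbers).
toy_accuracies = [0.90, 0.95, 1.00, 0.95, 0.90]

mean_acc = np.mean(toy_accuracies)
std_acc = np.std(toy_accuracies, ddof=1)  # sample standard deviation across repetitions

print(f"test accuracy: {mean_acc:.3f} +/- {std_acc:.3f}")
```

A small standard deviation suggests that the result does not depend strongly on how the data happened to be split.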

! fill out the blank here

In [None]:
import os   
fileList = os.listdir("./model_dir/")

test_accuracy_list = list()
loss_list = list()
for m in fileList:
    model = keras.models.load_model("./model_dir/"+m) # m is a saved model name such as 'model1'
    results = model.evaluate( ,  , batch_size=128)  # fill out the blank here
    test_accuracy_list.append(results[1])
    loss_list.append(results[0])

mean_test_accuracy = np.mean(test_accuracy_list)
print("mean test accuracy: ", mean_test_accuracy)

mean test accuracy:  1.0


Find out the best model among all trained models.

In [None]:
# find the best model: highest test accuracy, ties broken by the lowest loss
best_idx = max(range(len(fileList)), key=lambda k: (test_accuracy_list[k], -loss_list[k]))
print("best model: ", fileList[best_idx])
model = keras.models.load_model("./model_dir/" + fileList[best_idx])

At the end of training, the accuracy for the training and validation datasets should be close. Comparing the two is an easy way to check whether the model is overfitting. Note that `history` holds the training curves of the last repetition only.
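
The same check can be done numerically. The sketch below compares the final training and validation accuracy in a dictionary shaped like `history.history`; the values are made up, and the 0.10 threshold is an arbitrary assumption rather than a rule.

```python
def final_accuracy_gap(history_dict):
    """Training accuracy minus validation accuracy at the last epoch."""
    return history_dict["accuracy"][-1] - history_dict["val_accuracy"][-1]

# Made-up curves in the same layout as history.history.
toy_history = {
    "accuracy": [0.60, 0.80, 0.95],
    "val_accuracy": [0.58, 0.75, 0.78],
}

gap = final_accuracy_gap(toy_history)
threshold = 0.10  # assumed tolerance; choose one that suits your experiment
print(f"gap = {gap:.2f}, possible overfitting: {gap > threshold}")
```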

In [None]:
# plot accuracy vs epoch
plt.plot(history.history['accuracy'],'r')
plt.plot(history.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left') 
plt.ylim([0, 1.1])
plt.show()

# Inference

We will run inference using the best model we find.

We visualize an image so that you can judge for yourself whether it contains an amoeba, and then compare your judgement with the model prediction.

In [None]:
import matplotlib.pyplot as plt
from tensorflow.keras.preprocessing import image
import numpy as np

# Visualize one image, X_test[x]. Here we choose X_test[1], but you can choose any image in the test dataset
inference_image_number = 1  # choose an index from 0 to 66 (bounded by the number of test images)
img1 = X_test[inference_image_number]
plt.imshow(img1)


We will predict the class of the image above with the `model.predict()` function to see whether the prediction matches your judgement. See the `model.predict()` reference: https://keras.io/api/models/model_training_apis/#predict-method

In [None]:
# class label
class_label = ['no amoeba exists', 'amoeba exists']

# image process
x = np.squeeze(X_test_norm[inference_image_number])
x = image.img_to_array(x)
x = np.expand_dims(x, axis=0)

# predict the image with model.predict()
y_prob = model.predict(x)
print("probability for each of the categories: ", y_prob)
y_class = y_prob.argmax(axis=-1)
print("model predict: ", class_label[y_class[0]])