# 🔷 PART 1: Exploratory Data Analysis 🔷

In this Jupyter notebook, we analyze our given external datasets through a **basic comprehensive** lens: we manipulate, curate, and prepare data in order to ask critical questions and gain an effective understanding of how to perform higher-level prediction-driven data modification.

---

## 🔵 TABLE OF CONTENTS 🔵 <a name="TOC"></a>

Use this **table of contents** to navigate the various sections of the preprocessing notebook.

#### 1. [Section A: Imports and Initializations](#section-A)

    All necessary imports and object instantiations for data preprocessing.

#### 2. [Section B: Manipulating Our Data](#section-B)

    Data manipulation operations, including (but not limited to) 
    null value imputation and data cleaning. 

#### 3. [Section C: Visualizing Trends Across Our Data](#section-C)

    Data visualizations to outline trends and patterns 
    inherent across our data that may mandate further analysis.

#### 4. [Section D: Saving Our Interim Datasets](#section-D)

    Saving preprocessed data states for further access.

#### 5. [Appendix: Supplementary Custom Objects](#appendix)

    Custom object architectures used throughout the data preprocessing.
    
---

## 🔹 Section A: Imports and Initializations <a name="section-A"></a>

General Imports for Data Manipulation and Visualization.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Specialized Imports for Deep Learning Architectures and Supporting Structures.

In [71]:
from keras import regularizers
from keras.utils import Sequence
from keras.preprocessing.image import ImageDataGenerator
from keras.layers import Dense, Conv2D, MaxPool2D, Flatten, Dropout, InputLayer, Reshape, Conv1D, MaxPool1D, SeparableConv2D
from keras.applications import MobileNetV2, VGG19
from keras.callbacks import EarlyStopping
from keras.models import Sequential
from keras.optimizers import Adam

Specialized Imports for Model Evaluation and Selection.

In [3]:
from sklearn.model_selection import cross_validate, train_test_split

Specialized Imports for Globular File/Directory Navigation and Script Timing.

In [4]:
import os, shutil, time

Specialized Imports for Image Modification.

In [5]:
from PIL import Image
from PIL.ExifTags import TAGS

Custom Algorithmic Structures for Processed Data Visualization.

In [6]:
import sys
sys.path.append("../source/structures")

# TODO: Place custom structures from `../source/structures` here.

##### [(back to top)](#TOC)

---

## 🔹 Section B: Manipulating Our Data <a name="section-B"></a>

In [7]:
DIRPATH = "/Volumes/Bianca/DEVELOPER/data-science/Malaria-Imaging"
RAWPATH, INTPATH = "datasets/1-raw/cell_images/", "datasets/2-interim/cell_images/"
POSITIVE, NEGATIVE = "Parasitized", "Uninfected"
TRAIN, VALID, TEST = "Training", "Validation", "Testing"

In [8]:
DIRPATHS_RAW = {
    POSITIVE: os.path.join(DIRPATH, RAWPATH, POSITIVE),
    NEGATIVE: os.path.join(DIRPATH, RAWPATH, NEGATIVE),
}

DIRPATHS_INT = {
    POSITIVE: {
        TRAIN: os.path.join(DIRPATH, INTPATH, POSITIVE, TRAIN),
        VALID: os.path.join(DIRPATH, INTPATH, POSITIVE, VALID),
        TEST:  os.path.join(DIRPATH, INTPATH, POSITIVE, TEST)
    },
    NEGATIVE: {
        TRAIN: os.path.join(DIRPATH, INTPATH, NEGATIVE, TRAIN),
        VALID: os.path.join(DIRPATH, INTPATH, NEGATIVE, VALID),
        TEST:  os.path.join(DIRPATH, INTPATH, NEGATIVE, TEST)
    }
}

Initialization of interim datasets for ingestion.

In [9]:
for position, filename in enumerate(os.listdir(DIRPATHS_RAW[POSITIVE])):
    newpath = "{}_{}.jpg".format(POSITIVE, position)
    source = os.path.join(DIRPATHS_RAW[POSITIVE], filename)
    destination = os.path.join(DIRPATH, INTPATH, POSITIVE, newpath)
    shutil.copy(source, destination)

In [10]:
for position, filename in enumerate(os.listdir(DIRPATHS_RAW[NEGATIVE])):
    newpath = "{}_{}.jpg".format(NEGATIVE, position)
    source = os.path.join(DIRPATHS_RAW[NEGATIVE], filename)
    destination = os.path.join(DIRPATH, INTPATH, NEGATIVE, newpath)
    shutil.copy(source, destination)

Additional Initialization of Interim Datasets for Ingestion: **Parasitized Images**.

In [15]:
images = ["{}_{}.jpg".format(POSITIVE, position) for position in range(3000)]
for image in images:
    source = os.path.join(DIRPATH, INTPATH, POSITIVE, image)
    destination = os.path.join(DIRPATHS_INT[POSITIVE][TRAIN], image)
    shutil.copyfile(source, destination)
    
images = ["{}_{}.jpg".format(POSITIVE, position) for position in range(3000, 4000)]
for image in images:
    source = os.path.join(DIRPATH, INTPATH, POSITIVE, image)
    destination = os.path.join(DIRPATHS_INT[POSITIVE][VALID], image)
    shutil.copyfile(source, destination)
    
images = ["{}_{}.jpg".format(POSITIVE, position) for position in range(4000, 4500)]
for image in images:
    source = os.path.join(DIRPATH, INTPATH, POSITIVE, image)
    destination = os.path.join(DIRPATHS_INT[POSITIVE][TEST], image)
    shutil.copyfile(source, destination)

Additional Initialization of Interim Datasets for Ingestion: **Uninfected Images**.

In [16]:
images = ["{}_{}.jpg".format(NEGATIVE, position) for position in range(3000)]
for image in images:
    source = os.path.join(DIRPATH, INTPATH, NEGATIVE, image)
    destination = os.path.join(DIRPATHS_INT[NEGATIVE][TRAIN], image)
    shutil.copyfile(source, destination)
    
images = ["{}_{}.jpg".format(NEGATIVE, position) for position in range(3000, 4000)]
for image in images:
    source = os.path.join(DIRPATH, INTPATH, NEGATIVE, image)
    destination = os.path.join(DIRPATHS_INT[NEGATIVE][VALID], image)
    shutil.copyfile(source, destination)
    
images = ["{}_{}.jpg".format(NEGATIVE, position) for position in range(4000, 4500)]
for image in images:
    source = os.path.join(DIRPATH, INTPATH, NEGATIVE, image)
    destination = os.path.join(DIRPATHS_INT[NEGATIVE][TEST], image)
    shutil.copyfile(source, destination)

Augmenting Images for Pipeline Ingestion.

In [18]:
data_generator = ImageDataGenerator(rescale=1.0 / 255.0,
                                    validation_split=0.33)

Image Augmentation: **Training Data**. 

In [102]:
BATCH_SIZE = 32
SUB_BATCH_SIZE = BATCH_SIZE / 2

In [107]:
positive_training_generator = data_generator.flow_from_directory(directory=os.path.join(DIRPATH, INTPATH, POSITIVE),
                                                                 classes=[TRAIN], 
                                                                 target_size=(128, 128),
                                                                 class_mode="binary",
                                                                 subset="training",
                                                                 shuffle=True,
                                                                 batch_size=BATCH_SIZE)

negative_training_generator = data_generator.flow_from_directory(directory=os.path.join(DIRPATH, INTPATH, NEGATIVE),
                                                                 classes=[TRAIN],
                                                                 target_size=(128, 128),
                                                                 class_mode="binary",
                                                                 subset="training",
                                                                 shuffle=True,
                                                                 batch_size=BATCH_SIZE)

training_generator = MergedGenerator(BATCH_SIZE, 
                                     generators=[positive_training_generator, negative_training_generator], 
                                     sub_batch_size=[SUB_BATCH_SIZE] * 2)

Found 2010 images belonging to 1 classes.
Found 2010 images belonging to 1 classes.


Image Augmentation: **Validation Data**.

In [108]:
positive_validation_generator = data_generator.flow_from_directory(directory=os.path.join(DIRPATH, INTPATH, POSITIVE),
                                                                   classes=[VALID],
                                                                   target_size=(128, 128),
                                                                   class_mode="binary",
                                                                   subset="validation",
                                                                   shuffle=True,
                                                                   batch_size=32)

negative_validation_generator = data_generator.flow_from_directory(directory=os.path.join(DIRPATH, INTPATH, NEGATIVE),
                                                                   classes=[VALID],
                                                                   target_size=(128, 128),
                                                                   class_mode="binary",
                                                                   subset="validation",
                                                                   shuffle=True,
                                                                   batch_size=32)

training_generator = MergedGenerator(BATCH_SIZE, 
                                     generators=[positive_validation_generator, negative_validation_generator], 
                                     sub_batch_size=[SUB_BATCH_SIZE] * 2)

Found 330 images belonging to 1 classes.
Found 330 images belonging to 1 classes.


Model Instantiation and Testing: **Depth-Wise Separable CNN (DS-CNN)**. 

In [109]:
INPUT_LENGTH = (128, 128, 3)

model_dscnn = Sequential()

model_dscnn.add(Conv2D(16, (3, 3), activation="relu", input_shape=INPUT_LENGTH))
model_dscnn.add(MaxPool2D(2, 2))
model_dscnn.add(Dropout(0.2))

model_dscnn.add(Conv2D(32, (3, 3), activation="relu"))
model_dscnn.add(MaxPool2D(2, 2))
model_dscnn.add(Dropout(0.2))

model_dscnn.add(SeparableConv2D(64, (3, 3), activation="relu"))
model_dscnn.add(MaxPool2D(2, 2))
model_dscnn.add(Dropout(0.3))

model_dscnn.add(SeparableConv2D(128, (3, 3), activation="relu"))
model_dscnn.add(MaxPool2D(2, 2))
model_dscnn.add(Dropout(0.3))

model_dscnn.add(Flatten())
model_dscnn.add(Dense(64, activation="relu"))
model_dscnn.add(Dropout(0.5))

model_dscnn.add(Dense(1, activation="sigmoid"))

optimizer = Adam(lr=0.0005, beta_1=0.9, beta_2=0.999)

model_dscnn.compile(optimizer=optimizer,
                    loss="binary_crossentropy",
                    metrics=["accuracy"])
model_dscnn.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_4 (Conv2D)            (None, 126, 126, 16)      448       
_________________________________________________________________
max_pooling2d_8 (MaxPooling2 (None, 63, 63, 16)        0         
_________________________________________________________________
dropout_10 (Dropout)         (None, 63, 63, 16)        0         
_________________________________________________________________
conv2d_5 (Conv2D)            (None, 61, 61, 32)        4640      
_________________________________________________________________
max_pooling2d_9 (MaxPooling2 (None, 30, 30, 32)        0         
_________________________________________________________________
dropout_11 (Dropout)         (None, 30, 30, 32)        0         
_________________________________________________________________
separable_conv2d_4 (Separabl (None, 28, 28, 64)       

In [110]:
early_stoppage = EarlyStopping(monitor="val_loss",
                               patience=2)

Model Fulfillment.

In [116]:
history = model_dscnn.fit_generator(training_generator,
                                    epochs=20,
                                    steps_per_epoch=len(training_generator),
                                    validation_data=(validation_generator),
                                    callbacks=[early_stoppage],
                                    verbose=1)

Epoch 1/20

ValueError: Failed to find data adapter that can handle input: (<class 'list'> containing values of types {"<class '__main__.MergedGenerator'>"}), <class 'NoneType'>

Visualization of Training and Validation Accuracy Measures.

In [114]:
display_training_scores(history, which="accuracy")
display_training_scores(history, which="loss")

NameError: name 'history' is not defined

Model State Save.

In [41]:
STATEPATH, STATEVERSION = "models", 1

model_state = "malaria_imaging_dscnn_v{:02d}.h5".format(STATEVERSION)

model_dscnn.save_weights(os.path.join(DIRPATH, STATEPATH, model_state))

'/Volumes/Bianca/DEVELOPER/data-science/Malaria-Imaging/models/malaria_imaging_dscnn_v01.h5'

## 🔹 Appendix: Supplementary Custom Objects <a name="appendix"></a>

In [25]:
def combine_generator(generator1, generator2):
    while True:
        yield(next(generator1), next(generator2))

In [33]:
def display_training_scores(history, which="accuracy"):
    plt.figure(figsize=(10, 6))
    if which == "accuracy":
        plt.plot(history.history["accuracy"], label="Training", marker="*", linewidth=3)
        plt.plot(history.history["val_accuracy"], label="Validation", marker="o", linewidth=3)
        plt.title("Accuracy Assessment: Training vs. Validation")
        plt.ylabel("Accuracy")
    elif which == "loss":
        plt.plot(history.history["loss"], label="Training", marker="*", linewidth=3)
        plt.plot(history.history["val_loss"], label="Validation", marker="o", linewidth=3)
        plt.title("Loss Assessment: Training vs. Validation")
        plt.ylabel("Loss")
    plt.xlabel("Epochs")
    plt.legend(fontsize="x-large")
    plt.show()

In [106]:
class MergedGenerator(Sequence):
    def __init__(self, batch_size, generators=[], sub_batch_size=[]):
        self.generators = generators
        self.sub_batch_size = sub_batch_size
        self.batch_size = batch_size
    def __len__(self):
        return int(
            sum([(len(self.generators[idx]) * self.sub_batch_size[idx])
                 for idx in range(len(self.sub_batch_size))]) /
            self.batch_size)
    def __getitem__(self, index):
        """Getting items from the generators and packing them"""
        X_batch = []
        Y_batch = []
        for generator in self.generators:
            if generator.class_mode is None:
                x1 = generator[index % len(generator)]
                X_batch = [*X_batch, *x1]
            else:
                x1, y1 = generator[index % len(generator)]
                X_batch = [*X_batch, *x1]
                Y_batch = [*Y_batch, *y1]
        if self.generators[0].class_mode is None:
            return np.array(X_batch)
        return np.array(X_batch), np.array(Y_batch)

##### [(back to top)](#TOC)

---