# Image Modelling - Pipeline Creation (Python file)
In this notebook we will cover:
- how to prepare images for training a neural network, using shell commands instead of pandas to do so
- how to prepare the data so that it can be loaded into TensorFlow, how to load it, and how to check that everything went fine
- how to define each step as a function that is written directly into a Python file

In the second notebook we will import and use those functions in order to train a neural network that classifies our pictures.

In [None]:
# Remove any file that gets constructed by the notebook.
## TODO: still needs to be adjusted
!rm -f image_modeling.py 

The following cell registers a cell magic that automatically writes the content of a cell into a Python script while still executing the cell. Mode 'a' (set with the `-a` flag) appends to the file, while mode 'w' (the default) overwrites all existing lines.

In [None]:
# Let's make some dark cell magic. Why not!
from IPython.core.magic import register_cell_magic

@register_cell_magic
def write_and_run(line, cell):
    argz = line.split()
    file = argz[-1]
    mode = 'w'
    if len(argz) == 2 and argz[0] == '-a':
        mode = 'a'
        print("Appended to file ", file)
    else:
        print('Written to file:', file)
    with open(file, mode) as f:
        # str.format fills {name} fields from globals(); literal braces in the cell must be doubled
        f.write(cell.format(**globals()))
    get_ipython().run_cell(cell)
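
The flag handling inside `write_and_run` can be looked at in isolation. The following is a minimal sketch, not part of the notebook's pipeline: `parse_magic_line` is a hypothetical helper that mirrors the parsing above, and the second half illustrates how `str.format` treats braces, which matters because the magic passes the cell text through `cell.format(**globals())` before writing it.

```python
def parse_magic_line(line):
    """Mirror the argument handling of write_and_run (hypothetical helper)."""
    args = line.split()
    filename = args[-1]
    # '-a <file>' selects append mode; anything else falls back to overwrite
    mode = 'a' if len(args) == 2 and args[0] == '-a' else 'w'
    return filename, mode

print(parse_magic_line('image_modeling.py'))
print(parse_magic_line('-a image_modeling.py'))

# str.format replaces {name} fields and turns doubled braces into literal ones.
template = "HEIGHT = {height}\nsizes = {{'h': HEIGHT}}\n"
written = template.format(height=224)
print(written)
```

A cell that contains single literal braces, for example a dict literal or an f-string, would therefore need them doubled; otherwise the `format` call typically fails with a `KeyError` or `ValueError` before anything is written.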

Import the required libraries. `%%write_and_run image_modeling.py` calls the cell magic registered above in its default 'w' mode, so the imports are written at the beginning of `image_modeling.py`.

In [None]:
%%write_and_run image_modeling.py
import pathlib
import IPython.display as display
from PIL import Image
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
import os
import csv
import pandas as pd
from keras.preprocessing.image import ImageDataGenerator, array_to_img, img_to_array, load_img

Get the path to the image folder and count all images. The class names are derived from the CSV data in a later cell.

In [None]:
# Get paths as POSIX paths
#home_path = str(pathlib.Path.home())
data_dir = '../images'
data_dir = pathlib.Path(data_dir)
print(f'The total number of images is: {len(os.listdir(data_dir))}')
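
`os.listdir` counts every entry in the folder, including non-image files and subfolders. If the folder may contain anything else, a count filtered by file extension can be more reliable. The sketch below assumes the images use a JPEG extension, as the `.JPG` suffix appended to the CSV file names later in this notebook suggests; `count_images` is a hypothetical helper, not part of the original pipeline.

```python
import pathlib

def count_images(folder, extensions=('.jpg', '.jpeg')):
    """Count files in `folder` whose suffix matches `extensions` (case-insensitive)."""
    folder = pathlib.Path(folder)
    return sum(
        1
        for entry in folder.iterdir()
        if entry.is_file() and entry.suffix.lower() in extensions
    )

# Example call, assuming the same relative folder as above exists:
# print(f'The total number of images is: {count_images("../images")}')
```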

## Define functions to process the data

From now on we will use Python and TensorFlow to define some variables and functions that will be used in the second notebook, where we train our CNN to classify images of turtles.

We set some parameters for the model and call the `write_and_run` cell magic again, this time with the `-a` flag. This makes sure that the content of the cell is appended to `image_modeling.py` and that existing lines are not overwritten.

In [None]:
%%write_and_run -a image_modeling.py

#Load csv-data
image_dir = "../images/"
train_data = pd.read_csv('../data/train.csv')
train_data.image_id = train_data.image_id.apply(lambda x: x.strip()+".JPG")
test_data = pd.read_csv('../data/test.csv')
test_data.image_id = test_data.image_id.apply(lambda x: x.strip()+".JPG")

#Get unique_turtle_ids from train.csv
unique_turtle_ids = list(train_data['turtle_id'].unique())
#Add category for new turtle for test set
unique_turtle_ids.append("new_turtle")
#Get number of images for train/validation split
split = 0.7
lines = round(len(train_data)*split)
length_data = len(train_data)

#We set some parameters for the model
HEIGHT = 224 #image height
WIDTH = 224 #image width
CHANNELS = 3 #image RGB channels
CLASS_NAMES = unique_turtle_ids
NCLASSES = len(CLASS_NAMES)
BATCH_SIZE = 32
SHUFFLE_BUFFER = 10 * BATCH_SIZE
AUTOTUNE = tf.data.experimental.AUTOTUNE

TRAINING_SIZE = lines
VALIDATION_SIZE = length_data - lines                    
VALIDATION_STEPS = VALIDATION_SIZE // BATCH_SIZE
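
To make the split arithmetic above concrete, here is a small worked example. The row count of 1000 is a made-up number for illustration only, not the size of the actual `train.csv`.

```python
# Hypothetical dataset size, chosen only to illustrate the arithmetic.
n_rows = 1000
split = 0.7
batch_size = 32

training_size = round(n_rows * split)              # rows [0:training_size] go to training
validation_size = n_rows - training_size           # the remaining rows go to validation
validation_steps = validation_size // batch_size   # number of full batches per validation pass

print(training_size, validation_size, validation_steps)
# Rows not covered when only full batches are evaluated:
print(validation_size - validation_steps * batch_size)
```

With these numbers the split yields 700 training rows and 300 validation rows. Nine full validation batches cover 288 images, so 12 images are left out per pass when `VALIDATION_STEPS` is used. Note also that the slices take rows in file order, so the split is only random if the rows of the CSV file are in random order.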

### Augmentation

In [None]:
%%write_and_run -a image_modeling.py

# flow_from_dataframe: takes a dataframe with file names and labels plus the path to the image directory and generates batches of (augmented) data.
# rescale=1./255 scales the original pixel values into the range [0, 1].

def preprocess(augment = True):
    if augment:
        train_datagen = ImageDataGenerator(
                rotation_range     = 40,
                width_shift_range  = 0.2,
                height_shift_range = 0.2,
                rescale            = 1./255,
                shear_range        = 0.2,
                zoom_range         = 0.2,
                horizontal_flip    = True,
                fill_mode          = 'nearest')

        test_datagen = ImageDataGenerator(rescale=1./255)
    
    else:
        train_datagen = ImageDataGenerator(rescale=1./255)
        test_datagen  = ImageDataGenerator(rescale=1./255)
        
    return train_datagen, test_datagen

# flow_from_dataframe returns a generator that reads the pictures listed in
# the dataframe from `directory` and indefinitely generates
# batches of (augmented) image data.
# x_col: name of the dataframe column that holds the file names
# y_col: name of the dataframe column that holds the class labels

def use_image_generator(train_datagen, test_datagen, training=True): 
    
    if training:
        # Augment and scale images for training
        train_generator = train_datagen.flow_from_dataframe(dataframe =train_data[0:lines], 
                directory   = image_dir,
                x_col       = "image_id" ,
                y_col       = "turtle_id",
                target_size = (HEIGHT, WIDTH),
                batch_size  = BATCH_SIZE,
                classes     = CLASS_NAMES,
                class_mode  = 'categorical',
                shuffle     = True)
                #save_to_dir="output/",  optional: save the generated images
                #save_prefix="",
                #save_format='png')

        # Scale images for validation
        validation_generator = test_datagen.flow_from_dataframe(dataframe = train_data[lines:], 
                directory    = image_dir,
                x_col        = "image_id",
                y_col        = "turtle_id",
                target_size  = (HEIGHT, WIDTH),
                batch_size   = BATCH_SIZE,
                classes      = CLASS_NAMES,
                class_mode   = 'categorical',
                shuffle      = True)
        
        return train_generator, validation_generator
    
    else:
        # Scale images for testing; no labels are provided or returned
        test_generator = test_datagen.flow_from_dataframe(dataframe = test_data, 
                directory   = image_dir,
                x_col       = "image_id",
                target_size = (HEIGHT, WIDTH),
                batch_size  = BATCH_SIZE,
                class_mode  = None,
                shuffle     = False)
            
        return test_generator
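
To see what the generators yield per image, the two transformations that apply to every sample can be reproduced by hand: `rescale=1./255` multiplies each pixel value by that factor, and `class_mode='categorical'` returns one-hot encoded labels whose positions follow the order of `classes`. The sketch below uses plain Python and made-up turtle IDs; `rescale_pixels` and `one_hot` are hypothetical helpers for illustration, not Keras functions.

```python
def rescale_pixels(pixels, factor=1.0 / 255):
    """Multiply every pixel value by `factor`, as the `rescale` argument does."""
    return [value * factor for value in pixels]

def one_hot(label, class_names):
    """Return a one-hot list with 1.0 at the position of `label` in `class_names`."""
    return [1.0 if name == label else 0.0 for name in class_names]

# Made-up class list; the real one is built from train.csv plus "new_turtle".
example_classes = ["t_id_A", "t_id_B", "new_turtle"]

print(rescale_pixels([0, 128, 255]))
print(one_hot("t_id_B", example_classes))
```

In the second notebook the functions above would presumably be called as `train_datagen, test_datagen = preprocess()` followed by `use_image_generator(train_datagen, test_datagen)`, with the generators then producing batches of such rescaled images and one-hot labels.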