# Creating data for training and testing

This notebook describes the process of generating training data for the FCN. The preprocessing section of the code contains 4 scripts, each with its own set of functions. Each function is documented in the scripts themselves.

`generate_augments.py` is a module that makes augmented images from an input image, such as rotations, flips, and coarse graining.

`image_parse.py` contains functions to read and parse raw (label or STEM) images.

`make_data.py` contains functions to generate variations of the images as well as create data that's readable for the FCN.

`1_preprocess.py` is a script with the same code as this notebook, but it can be run from the terminal (after setting the parameters).


In this example, we will generate training data from WSeTe simulated images. We will go through some of the functions and describe the meaning of the parameters they need.

Start by importing the functions in `make_data.py`.

In [None]:
from make_data import *

## Folder Hierarchy ##
Next, we set the parameters. To understand some of them, we must first describe the folder hierarchy that the code assumes. The hierarchy is as follows:

input_dir:

    data_dir_0:
        input.ftype 
        label_l0.ftype
        label_l1.ftype
        ...
        label_lm.ftype
    data_dir_1:
        (similar to data_dir_0)
    ...
    data_dir_n:
    
    parsed_dir_name:
            test_00000.p
            
            train_pname1.p
            train_pname2.p
            ...
            train_pnamek.p

Each `data_dir` folder contains its own raw input (STEM) image along with its corresponding labels. Each input image corresponds to one set of label images. The `data_dir` directories can be named however we wish, but inside each directory, the files must follow a specific naming structure:

*input.ftype*: There must be a file named "input.tiff" (or .png or any other ftype). Some file types may not work, so converting to one that does (like tiff) may be necessary.

*label_x.ftype*: The label images must start with "label_" followed by a label name, for example "label_Se.tiff". The file type of the label images should be the same as that of the input image.
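
The naming rules above can be checked before running any preprocessing. The following is a minimal sketch, not part of the preprocessing scripts: `check_data_dir` is a hypothetical helper that only tests whether the expected file names exist in one data directory.

```python
from pathlib import Path

def check_data_dir(data_dir, label_list, ftype=".tiff"):
    """Return the expected file names that are missing from data_dir.

    Hypothetical helper: it only checks the naming convention
    (input.ftype and label_<name>.ftype); it does not open the images.
    """
    data_dir = Path(data_dir)
    expected = ["input" + ftype] + ["label_" + name + ftype for name in label_list]
    return [fn for fn in expected if not (data_dir / fn).is_file()]
```

For instance, `check_data_dir("../../data/WSeTe/simulated/0", ["Se"])` would return an empty list if both "input.tiff" and "label_Se.tiff" are present in that directory.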

## Parameters ##
First, we need to provide the location of `input_dir`, the directory where all the data is.
Within this directory, which contains a set of data directories, we need to specify which ones are for training and which are for the validation set. We list these directories in `train_dirs` and `test_dir`, respectively. `test_dir` is a list containing only one directory.

We also need a list of the label names, `label_list`, that are in each data directory: "\[l0, l1, ..., lm\]".
Note that `label_list` need not contain every label present in the data directories. Include only the labels you wish to train an FCN on. It can be a list of one element.

Next, we specify `parsed_dir_name`, the name of the directory where all the train and test data is placed.

Finally, we specify the file type we are working with, `ftype`.

In [None]:
input_dir  = "../../data/WSeTe/simulated/"
train_dirs = ["0", "1", "2"]
test_dir   = ["3"]
label_list = ["Se"]
parsed_dir_name = 'parsed_label_Se'
ftype = '.tiff'

Next, we define the parameters for extracting images. If we imagine the raw image as a sheet of dough, we get our training set by cutting out pieces from that sheet. To do this, we need to describe the shape of our cookie cutter and where on the sheet we cut out the pieces.

`l_shape`: the height and width of the images going into the FCN (the size of the cookie cutter)

`stride`: when going through the raw image extracting images of size `l_shape`, the stride says how many pixels to move horizontally and vertically to get the next image. If the stride is equal to `l_shape`, the extracted images have no overlapping pixels.

In [None]:
l_shape = (256,256)
stride = (64,64)
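
To make the cookie-cutter picture concrete, here is a minimal sliding-window sketch. It is an illustration, not the code in `make_data.py`: `cut_patches` is a hypothetical helper, the 512x512 image size is made up, and both tuples are assumed to be ordered (vertical, horizontal).

```python
import numpy as np

def cut_patches(image, l_shape, stride):
    """Cut windows of size l_shape from a 2D image, stepping by stride.

    Hypothetical helper; assumes both tuples are (vertical, horizontal).
    Windows that would run past the image edge are skipped.
    """
    h, w = l_shape
    sy, sx = stride
    patches = []
    for top in range(0, image.shape[0] - h + 1, sy):
        for left in range(0, image.shape[1] - w + 1, sx):
            patches.append(image[top:top + h, left:left + w])
    return np.array(patches)

image = np.zeros((512, 512), dtype=np.float32)  # made-up raw image size

overlapping = cut_patches(image, (256, 256), (64, 64))
print(overlapping.shape)  # (25, 256, 256): 5 window positions per axis

tiled = cut_patches(image, (256, 256), (256, 256))
print(tiled.shape)        # (4, 256, 256): stride == l_shape, no overlap
```

A smaller stride produces more, heavily overlapping cutouts from the same raw image, which is why the number of training images grows quickly as the stride shrinks.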

Preprocessing is performed by cutting many images from the raw data in various ways and saving them across a number of pickle files. If our training set is small, we can use a single pickle file by setting `one_pickle` to `True`. Otherwise, we can specify how many cut images we want in each training file, `tr_fsize`, and in each test file, `ts_fsize`.

In [None]:
one_pickle=False
tr_fsize = 2000
ts_fsize = 200
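
The idea of splitting the cut images across several pickle files can be sketched as follows. This is an assumption-laden illustration: `save_in_chunks` is a hypothetical helper, the integers stand in for cut images, and the zero-padded file names are modeled on "test_00000.p" rather than taken from `make_data.py`.

```python
import pickle
import tempfile
from pathlib import Path

def save_in_chunks(items, out_dir, prefix, fsize):
    """Pickle items into files that hold at most fsize items each.

    Hypothetical helper; the naming scheme prefix_00000.p is an assumption.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, start in enumerate(range(0, len(items), fsize)):
        path = out_dir / f"{prefix}_{i:05d}.p"
        with open(path, "wb") as f:
            pickle.dump(items[start:start + fsize], f)
        paths.append(path)
    return paths

items = list(range(4500))  # stand-ins for cut images
with tempfile.TemporaryDirectory() as tmp:
    paths = save_in_chunks(items, tmp, prefix="train", fsize=2000)
    sizes = []
    for p in paths:
        with open(p, "rb") as f:
            sizes.append(len(pickle.load(f)))

print([p.name for p in paths])  # three files for 4500 items
print(sizes)                    # [2000, 2000, 500]
```

In this sketch, choosing `fsize` at least as large as the number of items yields a single file, which mirrors the effect described for `one_pickle`.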

To parse a label image, we read in the label file and set pixel values above `tol` to 1 and pixel values below `tol` to 0. The tolerance is set below.

In [None]:
tol = 0.05
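
The thresholding step can be sketched in a few lines of NumPy. This is a minimal illustration rather than the code in `image_parse.py`: `binarize_label` is a hypothetical name, the label values are assumed to be scaled to the range 0 to 1, and a pixel exactly equal to `tol` is mapped to 0 here, which the description above leaves unspecified.

```python
import numpy as np

def binarize_label(label, tol=0.05):
    """Set pixels above tol to 1 and all other pixels to 0.

    Hypothetical helper; assumes label values are scaled to [0, 1].
    """
    return (label > tol).astype(np.uint8)

label = np.array([[0.0, 0.03],
                  [0.05, 0.9]])
binary = binarize_label(label, tol=0.05)
print(binary)  # only the 0.9 pixel exceeds tol and becomes 1
```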

Suppose we made a lot of cutouts with little to no defects. We can filter these images out by setting `ones_percent` to a nonzero value. This value is a threshold on the percentage of pixels in a cut label image that are `1`.

In [None]:
ones_percent = 0.
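
A filter of this kind might look like the sketch below. Several details are assumptions, not facts about `make_data.py`: `keep_patch` is a hypothetical helper, the threshold is treated as a percentage between 0 and 100 (the real code may use a fraction), and the comparison is inclusive so that `ones_percent = 0.` keeps every cutout.

```python
import numpy as np

def keep_patch(label_patch, ones_percent):
    """Return True if at least ones_percent percent of the pixels are 1.

    Hypothetical helper; assumes a binary label patch and a threshold
    expressed on a 0-100 scale.
    """
    percent_ones = 100.0 * np.count_nonzero(label_patch) / label_patch.size
    return bool(percent_ones >= ones_percent)

empty = np.zeros((4, 4), dtype=np.uint8)
one_defect = empty.copy()
one_defect[1, 2] = 1  # 1 of 16 pixels, i.e. 6.25 percent

print(keep_patch(empty, 0.0))        # True: a zero threshold filters nothing
print(keep_patch(empty, 1.0))        # False: no pixels are 1
print(keep_patch(one_defect, 5.0))   # True: 6.25 >= 5
print(keep_patch(one_defect, 10.0))  # False: 6.25 < 10
```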

Now that all the parameters are set, we can run the functions that generate training data. The first of these is `create_augments`. This function takes the input images in a data directory and writes augments of each image to a folder called `augments` inside each `data_dir/`. The types of augments are inversions, rotations, and down/up sampling.
Hence, for a single input image, the augments produce 2\*4\*3 = 24 different input images (2 inversion states, 4 rotations, and 3 sampling scales). This is only run on the train images.

In [None]:
create_augments(input_dir, train_dirs, ftype)
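
The count of 24 can be illustrated with a small NumPy sketch. It is not the implementation in `generate_augments.py`: `sketch_augments` is a hypothetical function, "inversion" is read here as a mirror flip, rotations are multiples of 90 degrees, and the resampling is a crude factor-of-2 slice or pixel repeat. The real module may use different operations and scale factors.

```python
import numpy as np

def sketch_augments(image):
    """Return 2 * 4 * 3 = 24 variants of a 2D image.

    Hypothetical illustration: 2 flip states x 4 rotations x 3 scales.
    """
    variants = []
    for flipped in (image, np.fliplr(image)):  # 2 inversion states
        for k in range(4):                     # 4 rotations by k * 90 degrees
            rotated = np.rot90(flipped, k)
            # 3 sampling scales: original, downsampled, upsampled
            variants.append(rotated)
            variants.append(rotated[::2, ::2])
            variants.append(np.repeat(np.repeat(rotated, 2, axis=0), 2, axis=1))
    return variants

image = np.arange(64, dtype=np.float32).reshape(8, 8)
augments = sketch_augments(image)
print(len(augments))                         # 24
print(sorted({a.shape for a in augments}))   # [(4, 4), (8, 8), (16, 16)]
```

Any augmentation applied to an input image must be applied identically to its label images, otherwise the labels no longer line up with the atoms they mark.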

Finally, we make the data that will be placed in `parsed_dir_name` for the train data...

In [None]:
make_data(input_dir, train_dirs, label_list, l_shape, stride, ftype, parsed_dir_name=parsed_dir_name,
          prefix="train", AUG=True, tol=tol, ones_pcent=ones_percent, one_save=one_pickle, fsize=tr_fsize)

and the test data.

In [None]:
make_data(input_dir, test_dir, label_list, l_shape, stride, ftype, parsed_dir_name=parsed_dir_name,
          prefix="test", AUG=False, tol=tol, ones_pcent=ones_percent, one_save=True, fsize=ts_fsize)

Now if we want to check our data, we use the function `check_data`:

In [None]:
import numpy as np
parsed_fn = input_dir + parsed_dir_name + "/test_00000.p"
check_data(parsed_fn, l_shape=l_shape) 