# Training Data Generation 

In this step we will extract the training/validation data from the even/odd tomograms. You can train the network on either 3D or 2D data; 3D is the standard choice.

In [None]:
from generate_train_data import *

import mrcfile
from os.path import join, isdir
from os import makedirs
from glob import glob

from matplotlib import pyplot as plt
import numpy as np

First we load the two tomograms. The default tomogram name is `half-tomo.rec`, but if you used SIRT the name will be something like `half-tomo_SIRT_iter_<numiterations>.rec`, so for SIRT you will have to edit the filenames below.
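
As a minimal sketch of the naming rule above, the helper below builds the filename from an optional SIRT iteration count. `half_tomo_filename` is a hypothetical helper, not part of `generate_train_data`, and the SIRT pattern is only the approximate one quoted above, so check it against the files actually present in your `tomogram` directories.

```python
def half_tomo_filename(sirt_iterations=None):
    """Return the half-tomogram filename (hypothetical helper).

    Pass the number of SIRT iterations if SIRT was used; leave it as
    None for the default reconstruction name.
    """
    if sirt_iterations is None:
        return 'half-tomo.rec'
    # Assumed pattern, taken from the description above.
    return 'half-tomo_SIRT_iter_{}.rec'.format(sirt_iterations)


# Hypothetical example: a reconstruction that used 10 SIRT iterations.
even_path = 'frames/even/tomogram/' + half_tomo_filename(10)
odd_path = 'frames/odd/tomogram/' + half_tomo_filename(10)
print(even_path)
print(odd_path)
```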

In [None]:
# Load the two tomograms 
even = mrcfile.open(glob('frames/even/tomogram/half-tomo.rec')[0]).data
odd = mrcfile.open(glob('frames/odd/tomogram/half-tomo.rec')[0]).data


In [None]:
mean, std = compute_mean_std(np.stack((even, odd)))

# Create the train_data directory (no error if it already exists)
makedirs('train_data/', exist_ok=True)

## Masking

In some cases you might not want to draw training samples from every part of the tomogram. You can express this with a mask, which specifies the region that training data is sampled from. For example, you can mark a rectangle (as seen in the default view, along the optical axis) and sample only from inside it.
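
A rectangular mask of this kind could be sketched as follows. The sketch assumes that a mask value of 1 marks voxels that may be sampled and 0 marks excluded voxels, and that the array is ordered `(z, y, x)` with the optical axis along `z`, which is how `mrcfile` indexes volume data. The shape and rectangle bounds are made-up values, and the array is called `rect_mask` so it does not replace the default `mask`.

```python
import numpy as np

# Stand-in for even.shape; the volume is assumed to be indexed as (z, y, x).
tomo_shape = (8, 100, 120)

# Hypothetical rectangle in the x/y plane, given as half-open pixel ranges.
y_start, y_end = 20, 80
x_start, x_end = 30, 90

# Start with nothing selected, then switch on the rectangle in every z-slice.
rect_mask = np.zeros(tomo_shape, dtype=np.int8)
rect_mask[:, y_start:y_end, x_start:x_end] = 1

# Number of voxels that remain available for sampling.
print(rect_mask.sum())
```

To use such a mask, it would be passed to `sample_coordinates` in place of the all-ones `mask`.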

Unless you really have a good reason not to sample from the whole tomogram, leave the line below as is, run it, and move on.

In [None]:
mask = np.ones(even.shape, dtype=np.int8)

## 3D or 2D?

You can use a 2D or 3D network for denoising. 3D is generally better for practical purposes and should be your default option. You can use the 2D option if you used SIRT in the reconstruction. 2D can produce less distorted tomograms when viewed in a volume viewer like UCSF Chimera.

Set the type in the cell below to one of these options:
* `type2D`
* `type3D` 

In [None]:
# change line below if you want to use 2D
dimensionality = project.type3D

print('Using a ' + str(dimensionality.value) + ' network for the remainder of this project.')

## Sample Coordinates

With our mask we will now sample coordinates for the train and validation volumes (3D) or slices (2D). 

The function `sample_coordinates` returns two lists with the coordinates of the volumes or slices. The train and validation volumes will not overlap. This step can take a few minutes.

The default `sample_length` is 64 for `type3D` and 128 for `type2D`.

A good number of training samples is 1200 for `type3D` and 40 for `type2D` (in the 2D case the number is multiplied by the z-height of the tomogram). The number of validation samples is best kept at about 1/10 of the number of training samples.
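
The arithmetic behind these recommendations can be sketched as below. Both helper functions and the z-height of 200 are hypothetical and only illustrate the rule of thumb; they are not part of `generate_train_data`.

```python
def effective_2d_samples(num_samples, z_height):
    """Total number of 2D slices when each sample point is taken at every z position."""
    return num_samples * z_height


def validation_count(num_train_samples):
    """About one tenth of the training samples, but never fewer than one."""
    return max(1, num_train_samples // 10)


z_height = 200  # hypothetical tomogram thickness in slices

# 3D: 1200 training volumes with a tenth as many validation volumes.
print(1200, validation_count(1200))

# 2D: 40 sample points, multiplied by the z-height of the tomogram.
print(effective_2d_samples(40, z_height), validation_count(40))
```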

In [None]:
train_coords, val_coords = sample_coordinates(mask,
                                              num_train_samples=1200,
                                              num_val_samples=120,
                                              sample_length=64,
                                              net_dim = dimensionality.value)

## Extract Volumes
Now we use our sampled coordinates to extract volumes that can be used as training data for the network.

In [None]:
# Now we use the sampled coordinates to extract the train- and validation-volumes.
X, Y, X_val, Y_val = extract_samples(even, odd, train_coords, val_coords, mean, std, net_dim = dimensionality.value)

In [None]:
# pick a random sample to show (re-run this cell to see different samples!)
plot_train_data(X, Y, X_val, Y_val, net_dim = dimensionality.value)

## Save Train-/Validation-Data

In [None]:
np.savez('train_data/train_data.npz', X=X, Y=Y, X_val=X_val, Y_val=Y_val)
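
To check that the file was written as expected, it can be read back with `np.load`. The sketch below uses small dummy arrays and a temporary directory so that it runs on its own; the array shapes are illustrative only. In this notebook you would load `'train_data/train_data.npz'` instead.

```python
import tempfile
from os.path import join

import numpy as np

# Small dummy arrays standing in for X, Y, X_val and Y_val.
X = np.zeros((4, 8, 8, 8), dtype=np.float32)
Y = np.zeros((4, 8, 8, 8), dtype=np.float32)
X_val = np.zeros((2, 8, 8, 8), dtype=np.float32)
Y_val = np.zeros((2, 8, 8, 8), dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = join(tmp_dir, 'train_data.npz')
    np.savez(path, X=X, Y=Y, X_val=X_val, Y_val=Y_val)

    # Read the archive back and record the shape of each stored array.
    with np.load(path) as data:
        loaded_shapes = {name: data[name].shape for name in data.files}

print(loaded_shapes)
```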