
# Multitask Data Preparation
This file prepares the data for model training. It creates NumPy arrays from the data; the autoencoder data prepared here is not split into train and test sets (see the note near the end).

## Requirements
| Library | Version | Build |
| :--- | :----: | ---: |
| cudatoolkit | 9.0 | h13b8566_0 |
| cudnn | 7.6.5 | cuda9.0_0 |
| ipykernel | 5.3.4 | py37h5ca1d4c_0 |
| ipython | 7.18.1 | py37h5ca1d4c_0 |
| jupyter_client | 6.1.7 | py_0 |
| jupyter_core | 4.6.3 | py37_0 |
| keras-applications | 1.0.8 | py_1 |
| keras-preprocessing | 1.1.0 | py_1 |
| matplotlib | 3.3.3 | pypi_0 |
| matplotlib-base | 3.3.2 | py37h817c723_0 |
| nibabel | 3.2.1 | pypi_0 |
| numpy | 1.19.2 | py37h54aff64_0 |
| opencv | 3.4.2 | py37h6fd60c2_1 |
| pandas | 1.1.3 | py37he6710b0_0 |
| pillow | 8.0.1 | py37he98fc37_0 |
| py-xgboost | 0.90 | py37he6710b0_1 |
| python | 3.7.9 | h7579374_0 |
| scikit-image | 0.17.2 | pypi_0 |
| scikit-learn | 0.23.2 | py37h0573a6f_0 |
| scipy | 1.5.2 | py37h0b6359f_0 |
| seaborn | 0.11.0 | py_0 |
| tensorboard | 1.14.0 | py37hf484d3e_0 |
| tensorflow | 1.14.0 | gpu_py37hae64822_0 |
| tensorflow-gpu | 1.14.0 | h0d30ee6_0 |



In [None]:
import pandas as pd
import os
import pathlib
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf

from tensorflow import keras

from sklearn.preprocessing import LabelEncoder


from tensorflow.keras import backend as K
import nibabel as nib
import cv2
import time
from skimage.transform import resize
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit, StratifiedKFold
from tensorflow.keras.utils import to_categorical

In [None]:
from prep_data import get_roi, get_all_subjects, get_subject, get_subject_list, pad_to_shape, remove_padding
from PatchGenerator import PatchGenerator

In [None]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.callbacks import ModelCheckpoint, ReduceLROnPlateau, CSVLogger, EarlyStopping, TensorBoard
from tensorflow.keras.applications.resnet50 import preprocess_input
import models

In [None]:
# Importing segmentation masks and original data after preprocessing
basepath = '../data/Step7-Segmentations/' 

In [None]:
paths = pathlib.Path(basepath).iterdir() # basepath is a str here, so wrap it in a Path first

In [None]:
paths1 = os.scandir(basepath) # scan the directory for the subject ID folders present

In [None]:
"""
Make a list of all the IDs available after the segmentation of the data
"""
id1 = []
basepath = pathlib.Path(basepath)
for entry in basepath.iterdir():
    if entry.is_dir():
        print(entry.name)
        id1.append(entry.name)

In [None]:
id1.sort()

In [None]:
len(id1)

In [None]:
"""
Collect the file paths for every subject. The paths are stored in a dataframe
and saved as a CSV file for future use.
"""

step6_dir = '../Nimhans_data/Step6-Mask_Multiplication/'

dfn_T1ce = []
dfn_Flair = []
dfn_T2 = []
dfn_Mask = []
for subject_id in id1:  # renamed from `id` to avoid shadowing the built-in id()
    dfn_T1ce.append(step6_dir + subject_id + '/T1WCE_bet.nii.gz')
    dfn_Flair.append(step6_dir + subject_id + '/' + subject_id + '_mul_FLAIR.nii.gz')
    dfn_T2.append(step6_dir + subject_id + '/' + subject_id + '_mul_T2W.nii.gz')
    # basepath may be a pathlib.Path by now, so join with "/" and convert back to str
    dfn_Mask.append(str(pathlib.Path(basepath) / subject_id / 'Pred_Mask_Segm.nii.gz'))

In [None]:
d = {'Patient ID':id1, 'T1ce':dfn_T1ce, 'FLAIR': dfn_Flair, 'T2': dfn_T2, 'Mask': dfn_Mask}

In [None]:
temp_df = pd.DataFrame(d, columns = ['Patient ID', 'T1ce', 'FLAIR', 'T2', 'Mask'])

In [None]:
temp_df.head()

In [None]:
temp_df.to_csv('TCGA_new_data2_path.csv') # save the final data as a CSV file

In [None]:
df_train = temp_df
Labels_train = np.zeros(len(df_train)) # placeholder labels: the data is unlabeled, so every subject gets class 0

In [None]:
df_train.set_index('Patient ID', inplace = True)

### Image preprocessing

The code cells below process the data and produce NumPy arrays.
The following preprocessing is applied:
1. Each volume is converted to a NumPy array.
2. Slices are taken from the 3D image.
3. Z-score normalization is performed over each slice.
4. A bounding box around the mask region is computed for each slice.
5. Each slice is cropped to that bounding box so that only the tumor region is kept.
6. The cropped slices are stacked, and the labels are stacked in the same order as the slices.
7. The final NumPy arrays are returned and saved for model training.
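
Steps 3 to 5 can be illustrated on a single synthetic slice. This is only a minimal sketch: the helpers `zscore` and `bounding_box` are hypothetical stand-ins written for this example and are not the `get_roi` or `get_all_subjects` functions from `prep_data.py`, whose internals are not shown here. The image and mask values are made up.

```python
import numpy as np

def zscore(slice_2d, eps=1e-8):
    """Z-score normalize one 2D slice (hypothetical helper)."""
    return (slice_2d - slice_2d.mean()) / (slice_2d.std() + eps)

def bounding_box(mask_2d, margin=0):
    """Return (rmin, rmax, cmin, cmax) around the nonzero mask pixels.

    The box is padded by `margin` pixels and clipped to the image;
    rmax and cmax are exclusive so they can be used directly in slices.
    """
    rows = np.where(np.any(mask_2d, axis=1))[0]
    cols = np.where(np.any(mask_2d, axis=0))[0]
    rmin = max(int(rows[0]) - margin, 0)
    cmin = max(int(cols[0]) - margin, 0)
    rmax = min(int(rows[-1]) + margin + 1, mask_2d.shape[0])
    cmax = min(int(cols[-1]) + margin + 1, mask_2d.shape[1])
    return rmin, rmax, cmin, cmax

# Synthetic slice and mask (assumed shapes, not real data)
rng = np.random.default_rng(0)
image = rng.normal(100.0, 20.0, size=(64, 64))
mask = np.zeros((64, 64), dtype=np.uint8)
mask[20:30, 25:40] = 1

normalized = zscore(image)
rmin, rmax, cmin, cmax = bounding_box(mask, margin=2)
cropped = normalized[rmin:rmax, cmin:cmax]
print(cropped.shape)  # (14, 19)
```

In the notebook itself the crop is followed by a resize to 128x128, so slices with different bounding boxes end up with the same shape.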

In [None]:
"""
    The code below uses the function get_all_subjects from our prep_data.py file.
    It returns the NumPy arrays of normalised images for every subject in df_train,
    once per modality (FLAIR, T1ce and T2).
"""

FLAIR_train_data_list, train_mask_list = get_all_subjects(df_train,'FLAIR','Mask',label_values=[1,2,4],transpose_axes=[2,0,1],norm_type='zscore')
T1ce_train_data_list, _ = get_all_subjects(df_train,'T1ce','Mask',label_values=[1,2,4],transpose_axes=[2,0,1],norm_type='zscore')
T2_train_data_list, _ = get_all_subjects(df_train,'T2','Mask',label_values=[1,2,4],transpose_axes=[2,0,1],norm_type='zscore')

In [None]:
start = time.time()

number_of_slices = 15 # number of slices to select per subject

X_train_orig = []
Y_train_orig = []
modality_list = []

for idx, msk in enumerate(train_mask_list):
    FLAIR_img = FLAIR_train_data_list[idx]
    T1ce_img = T1ce_train_data_list[idx]
    T2_img = T2_train_data_list[idx]    
    label = Labels_train[idx]
    
    print('Class: ',label,'\tName : ',df_train.index[idx])
    hMin, hMax, wMin, wMax, dMin, dMax = get_roi(msk,10)    
    
    ## important Axial slices
    roi_areas = [(slc,area) for slc,area in enumerate(np.sum(msk,axis=(1,2)))]
    roi_areas = sorted(roi_areas,key=lambda x: x[1],reverse=True)
    imp_slices = [x for x,_ in roi_areas[0:number_of_slices]]
    
    for slc in imp_slices:
        
        modalities = ['FLAIR','T1ce','T2']
        modality_mapping = {'FLAIR':1,'T1ce':2,'T2':3}
        data_dict = {'FLAIR':FLAIR_img,'T1ce':T1ce_img,'T2':T2_img}
        
        for mod in modalities:
            # Crop each slice to the tumor bounding box, then resize it before stacking
            tmp_slice = data_dict[mod][slc][wMin:wMax,dMin:dMax] # bounding box around the tumor in this slice
            tmp_slice = resize(tmp_slice,[128,128],anti_aliasing=True) # resize to the required dimensions
            
            modality_list.append(modality_mapping[mod])
            Y_train_orig.append(label)
            X_train_orig.append(tmp_slice)        
        
end = time.time()
print(end-start)

X_train_orig = np.stack(X_train_orig, axis=0)
X_train_orig = np.expand_dims(X_train_orig, -1)

Y_train_orig = np.stack(Y_train_orig,axis=0)
print('training : ',np.unique(Y_train_orig,return_counts=True))
Y_train = to_categorical(Y_train_orig,num_classes=None)

print(X_train_orig.shape,Y_train.shape)
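
The slice selection in the cell above ranks the slices of each mask by area and keeps the `number_of_slices` largest ones. The following toy sketch shows the same ranking on a tiny made-up mask, using `np.argsort` instead of `sorted`; the mask shape and values are assumptions chosen only for illustration.

```python
import numpy as np

# Toy mask with shape (slices, height, width); values are made up
mask = np.zeros((5, 4, 4), dtype=np.uint8)
mask[1, :2, :2] = 1  # area 4
mask[2, :3, :3] = 1  # area 9
mask[3, :1, :2] = 1  # area 2

number_of_slices = 2

# Mask area per slice; cast to a signed type before negating
areas = mask.sum(axis=(1, 2)).astype(np.int64)

# Sort by descending area; a stable sort keeps the lower index first on ties
top_slices = np.argsort(-areas, kind="stable")[:number_of_slices]
print(top_slices.tolist())  # [2, 1]
```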

In [None]:
del(FLAIR_train_data_list) # save memory
del(T2_train_data_list) # save memory
del(T1ce_train_data_list) # save memory
del(train_mask_list) # save memory

The code below saves all the data in NumPy format: the single-modality files and the stacked modalities are each saved as separate NumPy arrays.

**Note: this data is preprocessed for autoencoders that were trained on unlabeled data, so no y labels were available and none were saved.
The data is used only for autoencoder training, so no train/test split is performed in the code above.
If a split is needed, it can be done with sklearn's train_test_split. The Midline_imageprep documentation can also be referred to for the steps.**
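
If a split is wanted, a minimal sketch with `train_test_split` could look like the following. The array here is a small placeholder with an assumed shape, not the real data, and the 80/20 ratio and `random_state` are arbitrary choices made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder for the stacked image array: (samples, height, width, channels)
X = np.zeros((20, 8, 8, 3), dtype=np.float32)

# Unlabeled data: split the images only (labels could be passed as a second array)
X_tr, X_te = train_test_split(X, test_size=0.2, random_state=42)
print(X_tr.shape, X_te.shape)  # (16, 8, 8, 3) (4, 8, 8, 3)
```

Because several slices come from the same subject, a random slice-level split like this one can place slices of one subject in both sets; whether that matters depends on how the split is used.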

In [None]:
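# x_train / y_train are assumed to be the arrays built in the slice-extraction cell above
x_train = X_train_orig
y_train = Y_train
np.save('../Data/X_train.npy',x_train)
np.save('../Data/Y_train.npy',y_train)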

In [None]:
# Separate the modalities. Slices are interleaved as FLAIR, T1ce, T2,
# so every third entry belongs to the same modality.

x_train_FLAIR = x_train[0::3,:,:,0]
x_train_T1ce = x_train[1::3,:,:,0]
x_train_T2 = x_train[2::3,:,:,0]

#x_test_FLAIR = x_test[np.arange(0,171,3),:,:,0]
#x_test_T1ce = x_test[np.arange(1,171,3),:,:,0]
#x_test_T2 = x_test[np.arange(2,171,3),:,:,0]

print(x_train_FLAIR.shape,x_train_T1ce.shape, x_train_T2.shape)
#print(x_test_FLAIR.shape,x_test_T1ce.shape,x_test_T2.shape)

In [None]:
# Stack the three modalities into a single 3-channel image

X_train = np.stack([x_train_FLAIR,x_train_T1ce, x_train_T2],axis=-1)
#x_test = np.stack([x_test_FLAIR,x_test_T1ce, x_test_T2],axis=-1)
print(X_train.shape)#,x_test.shape)
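
The separation and stacking done in the two cells above can be checked on a toy array. This sketch uses made-up sizes and values; it assumes the same interleaved order (FLAIR, T1ce, T2 for each slice) that the extraction loop produces. Strided slicing such as `0::3` selects the same rows as `np.arange(0, N, 3)`.

```python
import numpy as np

n_slices, h, w = 4, 2, 2

# Interleaved toy data: rows are FLAIR, T1ce, T2, FLAIR, T1ce, T2, ...
x = np.arange(n_slices * 3 * h * w, dtype=np.float32).reshape(n_slices * 3, h, w, 1)

flair = x[0::3, :, :, 0]
t1ce = x[1::3, :, :, 0]
t2 = x[2::3, :, :, 0]

# One 3-channel image per slice, channels ordered FLAIR, T1ce, T2
stacked = np.stack([flair, t1ce, t2], axis=-1)
print(stacked.shape)  # (4, 2, 2, 3)

# The same result via a single reshape and transpose
alt = x[..., 0].reshape(n_slices, 3, h, w).transpose(0, 2, 3, 1)
print(np.array_equal(stacked, alt))  # True
```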

In [None]:
Y_train = y_train[np.arange(0,len(y_train),3)]
#y_test = y_test[np.arange(0,len(y_test),3)]
print(Y_train.shape)#,y_test.shape)

In [None]:
np.save('../Data/X_train_stacked.npy',X_train)
np.save('../Data/Y_train_stacked.npy',Y_train)