# [U-Net Segmentation Approach to Cancer Diagnosis](https://www.kaggle.com/c/data-science-bowl-2017#tutorial)
*An approach to predicting whether a CT scan belongs to a patient who has cancer or will develop it within the next 12 months*

General Approach:
1. train a network to segment out potentially cancerous nodules
2. use the characteristics of that segmentation to make predictions about the diagnosis of the scanned patient within a 12 month time frame


# Downloading Instructions
1. **pydicom** (dicom): in the Anaconda command prompt, run `pip install pydicom` ([reference](http://pydicom.readthedocs.io/en/latest/getting_started.html))
2. **SimpleITK**: in the Anaconda command prompt, run `conda install -c https://conda.anaconda.org/simpleitk SimpleITK` ([reference](https://itk.org/Wiki/SimpleITK/GettingStarted))
3. **xgboost**: in the Anaconda command prompt, run `pip install xgboost` ([reference](http://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/), [long version reference](https://www.ibm.com/developerworks/community/blogs/jfp/entry/Installing_XGBoost_For_Anaconda_on_Windows?lang=en))
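
After running the commands above, a quick way to confirm the packages are visible to Python is to ask the import system whether it can find them. This is a minimal sketch that assumes Python 3; `missing_modules` is a hypothetical helper name, and older pydicom releases may install under the module name `dicom` rather than `pydicom`.

```python
import importlib.util

def missing_modules(names):
    """Return the module names that the import system cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Module names are assumptions about how each package is imported.
required = ["pydicom", "SimpleITK", "xgboost"]
missing = missing_modules(required)
print("missing:", missing if missing else "none")
```

An empty result means every listed package can be located; it does not check versions or GPU support.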

## Installing Keras, TensorFlow, cuDNN, and the CUDA Toolkit
*How to install Keras, the GPU-enabled version of TensorFlow, and the supporting GPU libraries (CUDA Toolkit and cuDNN)*

**Follow the instructions [here](https://github.com/3-musketeers/kaggle-dsb/blob/master/pipeline/build-simple-model/rough-draft/model_dependency_setup.md)**

## Downloading Data
**Follow the instructions [here](https://github.com/3-musketeers/kaggle-dsb/blob/master/pipeline/build-simple-model/rough-draft/model_data_setup.md)**

# Dependency Descriptions
1. **numpy**: an extension to the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large library of high-level mathematical functions to operate on these arrays
2. **scikit-image** (skimage): collection of algorithms for image processing
3. **scikit-learn**: simple and efficient tools for data mining and data analysis
4. **keras** (tensorflow backend): high-level neural networks library, written in Python (runs on top of TensorFlow)
5. **matplotlib**: a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms
6. **pydicom** (dicom): pydicom is a pure python package for working with DICOM files such as medical images, reports, and radiotherapy objects
7. **SimpleITK**: an open-source, cross-platform system that provides developers with an extensive suite of software tools for image analysis 
8. **pandas**: provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language
9. **glob**: a module that finds all the pathnames matching a specified pattern according to the rules used by the Unix shell (results returned in arbitrary order)
10. **csv**: a module that implements classes to read and write tabular data in CSV format
11. **os**: a module that provides a portable way of using operating system dependent functionality
12. **xgboost**: a library designed and optimized for boosting trees algorithms
13. **pickle**: standard mechanism for object serialization

## Details:
1. U-Net style convolutional network: identifies regions that contain nodules (U-Net was originally designed for segmenting neuronal structures)
2. appearance of nodules within the CT scan: indicates the possibility of cancer
3. Lung Nodule Analysis 2016 (LUNA16):
   1. provides CT images with annotated nodule locations, which serve as training examples for teaching the U-Net to find nodules
   2. use the LUNA data set to generate an appropriate training set for our U-Net
   3. use these examples to train our supervised segmenter

# Construct Training Set From LUNA16
*goal: build a set of CT slices and matching nodule masks from LUNA16 for training the U-Net*

Process:
1. use the nodule locations given in annotations.csv to extract, from each patient scan, the three transverse slices that contain the largest nodule
2. create masks for those slices based on the nodule dimensions given in annotations.csv
3. the output of this step is two files per patient scan: a set of images and a set of corresponding nodule masks
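
The slice and mask steps can be sketched on synthetic data as follows. This is an illustration under stated assumptions, not the notebook's `make_mask`: the names `disc_mask` and `three_slices` are hypothetical, the volume, origin, and spacing values are made up, and the sketch thresholds on the nodule radius, whereas the notebook code compares the distance against `diam`.

```python
import numpy as np

def disc_mask(height, width, center_xy_mm, radius_mm, origin_xy_mm,
              spacing_xy_mm, z_offset_mm=0.0):
    """Binary mask of pixels within radius_mm (3D distance) of the nodule center."""
    # World coordinates (mm) of every pixel column (x) and row (y).
    xs = origin_xy_mm[0] + spacing_xy_mm[0] * np.arange(width)
    ys = origin_xy_mm[1] + spacing_xy_mm[1] * np.arange(height)
    xx, yy = np.meshgrid(xs, ys)  # both have shape (height, width)
    dist = np.sqrt((xx - center_xy_mm[0]) ** 2
                   + (yy - center_xy_mm[1]) ** 2
                   + z_offset_mm ** 2)
    return (dist <= radius_mm).astype(np.uint8)

def three_slices(volume_zyx, z_index):
    """Return the slices at z-1, z, z+1 from a (z, y, x) volume."""
    return volume_zyx[z_index - 1:z_index + 2]

volume = np.zeros((10, 64, 64), dtype=np.int16)  # synthetic stand-in for a CT volume
imgs = three_slices(volume, z_index=5)

slice_spacing_mm = 2.5  # assumed distance between neighbouring slices
masks = np.stack([
    disc_mask(64, 64, center_xy_mm=(32.0, 32.0), radius_mm=5.0,
              origin_xy_mm=(0.0, 0.0), spacing_xy_mm=(1.0, 1.0),
              z_offset_mm=dz)
    for dz in (-slice_spacing_mm, 0.0, slice_spacing_mm)
])
```

Because the distance is measured in 3D, the masks on the two neighbouring slices are slightly smaller than the mask on the center slice.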


* import tools
* find largest nodule in the patient scan
* use df_node (a pandas DataFrame) to keep track of the case numbers and the node information (there may be multiple nodule listings for some patients in annotations.csv)
* node information is an (x,y,z) coordinate in mm, in the coordinate system defined in the .mhd file
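
The two ideas above (picking the largest nodule per scan and converting its mm coordinate to a voxel index) can be tried without any scan data. The sketch below uses only the standard library; the column names follow annotations.csv, but the rows, origin, and spacing are made-up values, and `largest_nodule` and `world_to_voxel` are hypothetical helper names.

```python
import csv
import io

# Made-up rows in the annotations.csv column layout (values are illustrative only).
SAMPLE_CSV = """seriesuid,coordX,coordY,coordZ,diameter_mm
case_a,-100.0,50.0,-200.0,5.0
case_a,-20.0,10.0,-150.0,12.5
case_b,30.0,-40.0,-90.0,8.0
"""

def largest_nodule(rows, seriesuid):
    """Return the row with the largest diameter_mm for one scan, or None."""
    candidates = [r for r in rows if r["seriesuid"] == seriesuid]
    if not candidates:
        return None
    return max(candidates, key=lambda r: float(r["diameter_mm"]))

def world_to_voxel(center_mm, origin_mm, spacing_mm):
    """Convert an (x, y, z) position in mm to the nearest voxel index (x, y, z)."""
    return tuple(int(round((c - o) / s))
                 for c, o, s in zip(center_mm, origin_mm, spacing_mm))

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
node = largest_nodule(rows, "case_a")
center = tuple(float(node[k]) for k in ("coordX", "coordY", "coordZ"))

# Origin and spacing would come from the .mhd header; these values are assumptions.
origin = (-180.0, -170.0, -300.0)
spacing = (0.5, 0.5, 2.5)
voxel = world_to_voxel(center, origin, spacing)
print(node["diameter_mm"], voxel)
```

The result is in (x, y, z) order, while an array read from the image is indexed (z, y, x), so the indices have to be reversed before slicing the volume.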

In [None]:
import SimpleITK as sitk
import numpy as np
import csv
from glob import glob
import pandas as pd

# path constants
LUNA_DATA_PATH = '../../../../data/luna16/'
LUNA_SUBSET_PATH = LUNA_DATA_PATH + 'subset0/'

file_list = glob(LUNA_SUBSET_PATH + "*.mhd") # get all the mhd image files

# Helper function to get rows in data frame associated with each file
def get_filename(case):
    # return the first file whose name contains the seriesuid (None if there is no match)
    for f in file_list:
        if case in f:
            return f

# The locations of the nodes
df_node = pd.read_csv(LUNA_DATA_PATH + "annotations.csv")
df_node["file"] = df_node["seriesuid"].apply(get_filename) # for every row, save the matching file name to the 'file' column
df_node = df_node.dropna() # drop rows whose seriesuid was not found in this subset (their 'file' value is missing)

# Looping over the image files
for fcount, img_file in enumerate(file_list): # fcount numbers the output files so they do not overwrite each other
    print("Getting mask for image file %s" % img_file.replace(LUNA_SUBSET_PATH,"")) # state the image file name (without path)
    mini_df = df_node[df_node["file"]==img_file] # get all nodules associated with file
    if len(mini_df)>0:       # some files may not have a nodule--skipping those
        biggest_node = np.argsort(mini_df["diameter_mm"].values)[-1]   # just using the biggest node
        node_x = mini_df["coordX"].values[biggest_node]
        node_y = mini_df["coordY"].values[biggest_node]
        node_z = mini_df["coordZ"].values[biggest_node]
        diam = mini_df["diameter_mm"].values[biggest_node]
        
        itk_img = sitk.ReadImage(img_file) 
        img_array = sitk.GetArrayFromImage(itk_img) # indexes are z,y,x (notice the ordering)
        center = np.array([node_x,node_y,node_z])   # nodule center
        origin = np.array(itk_img.GetOrigin())      # x,y,z  Origin in world coordinates (mm)
        spacing = np.array(itk_img.GetSpacing())    # spacing of voxels in world coor. (mm)
        v_center =np.rint((center-origin)/spacing)  # nodule center in voxel space (still x,y,z ordering)
        
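        def make_mask(center,diam,z,width,height,spacing,origin):
            ... # elided in the source; the code below relies on it to define mask, v_xrange and v_yrange
            for v_x in v_xrange:
                for v_y in v_yrange:
                    p_x = spacing[0]*v_x + origin[0] # voxel index -> world x (mm)
                    p_y = spacing[1]*v_y + origin[1] # voxel index -> world y (mm)
                    # mark pixels within diam (the full diameter, not the radius) of the nodule center
                    if np.linalg.norm(center-np.array([p_x,p_y,z]))<=diam:
                        mask[int((p_y-origin[1])/spacing[1]),int((p_x-origin[0])/spacing[0])] = 1.0
            return mask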
        
        # take the slice through the nodule center and its two neighbours
        for i, i_z in enumerate(range(int(v_center[2])-1,int(v_center[2])+2)):
            mask = make_mask(center,diam,i_z*spacing[2]+origin[2],width,height,spacing,origin)
            masks[i] = mask
            imgs[i] = matrix2int16(img_array[i_z])
        np.save(output_path+"images_%d.npy" % (fcount) ,imgs)
        np.save(output_path+"masks_%d.npy" % (fcount) ,masks)
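
The saved files can be read back with `np.load`. The sketch below round-trips a synthetic image/mask pair using the same `images_%d.npy` / `masks_%d.npy` naming; `save_pair` is a hypothetical helper, the arrays are stand-ins, and a temporary directory takes the place of the real output path.

```python
import os
import tempfile
import numpy as np

def save_pair(output_dir, index, imgs, masks):
    """Save one scan's slices and masks using the images_%d / masks_%d naming."""
    img_path = os.path.join(output_dir, "images_%d.npy" % index)
    mask_path = os.path.join(output_dir, "masks_%d.npy" % index)
    np.save(img_path, imgs)
    np.save(mask_path, masks)
    return img_path, mask_path

# Synthetic stand-ins for the three slices and their masks.
imgs = np.zeros((3, 8, 8), dtype=np.int16)
masks = np.ones((3, 8, 8), dtype=np.uint8)

with tempfile.TemporaryDirectory() as tmp_dir:  # hypothetical output location
    img_path, mask_path = save_pair(tmp_dir, 0, imgs, masks)
    loaded_imgs = np.load(img_path)
    loaded_masks = np.load(mask_path)
```

Keeping the same index in both file names makes it easy to pair each image stack with its masks when assembling the training set.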

In [1]:
import tensorflow as tf

In [1]:
import keras

Using TensorFlow backend.
