# [U-Net Segmentation Approach to Cancer Diagnosis](https://www.kaggle.com/c/data-science-bowl-2017#tutorial)
*an approach to predicting whether a CT scan belongs to a patient who has cancer or will develop it within the next 12 months*

General Approach:
1. train a network to segment out potentially cancerous nodules
2. use the characteristics of that segmentation to predict the diagnosis of the scanned patient within a 12-month time frame

# Downloading Instructions
1. **pydicom** (dicom): in the Anaconda command prompt, run `pip install pydicom` ([reference](http://pydicom.readthedocs.io/en/latest/getting_started.html))
2. **SimpleITK**: in the Anaconda command prompt, run `conda install -c https://conda.anaconda.org/simpleitk SimpleITK` ([reference](https://itk.org/Wiki/SimpleITK/GettingStarted))
3. **xgboost**: in the Anaconda command prompt, run `pip install xgboost` ([reference](http://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/), [long version reference](https://www.ibm.com/developerworks/community/blogs/jfp/entry/Installing_XGBoost_For_Anaconda_on_Windows?lang=en))
4. **tqdm**: in the Anaconda command prompt, run `pip install tqdm` ([reference](https://pypi.python.org/pypi/tqdm#usage))
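
Once the packages above are installed, it can help to confirm that they are importable from the active environment. The following is a minimal sketch using only the standard library. The helper name `missing_packages` is hypothetical, and the import names in the list are assumed to match the installed versions (older pydicom releases were imported as `dicom`).

```python
from importlib.util import find_spec

def missing_packages(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]

if __name__ == "__main__":
    # Assumed top-level import names for the packages installed above
    required = ["pydicom", "SimpleITK", "xgboost", "tqdm"]
    missing = missing_packages(required)
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

Only top-level package names should be passed in, since `find_spec` raises an error for a dotted name whose parent package is missing.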

## Installing Keras, TensorFlow, cuDNN, CUDA Toolkit
*how to install Keras and the GPU-supported version of TensorFlow, along with the GPU computing libraries (cuDNN and the CUDA Toolkit)*

**Follow the instructions [here](https://github.com/3-musketeers/kaggle-dsb/blob/master/pipeline/build-simple-model/rough-draft/model_dependency_setup.md)**

## Downloading Data
**Follow the instructions [here](https://github.com/3-musketeers/kaggle-dsb/blob/master/pipeline/build-simple-model/rough-draft/model_data_setup.md)**

# Dependency Descriptions
1. **numpy**: an extension to the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large library of high-level mathematical functions to operate on these arrays
2. **scikit-image** (skimage): collection of algorithms for image processing
3. **scikit-learn**: simple and efficient tools for data mining and data analysis
4. **keras** (tensorflow backend): high-level neural networks library, written in Python (runs on top of TensorFlow)
5. **matplotlib**: a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms
6. **pydicom** (dicom): a pure Python package for working with DICOM files such as medical images, reports, and radiotherapy objects
7. **SimpleITK**: an open-source, cross-platform system that provides developers with an extensive suite of software tools for image analysis
8. **pandas**: provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language
9. **glob**: a module that finds all the pathnames matching a specified pattern according to the rules used by the Unix shell (results returned in arbitrary order)
10. **csv**: a module that implements classes to read and write tabular data in CSV format
11. **os**: a module that provides a portable way of using operating system dependent functionality
12. **xgboost**: a library designed and optimized for boosting trees algorithms
13. **pickle**: standard mechanism for object serialization
14. **tqdm**: instantly make your loops show a smart progress meter
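
Because `glob` returns paths in arbitrary order, an index such as `file_list[2]` may point at a different scan on another machine. The sketch below shows one way to make the listing repeatable by sorting it. The helper name `list_mhd_files` is hypothetical and not part of the original notebook.

```python
import os
from glob import glob

def list_mhd_files(folder):
    """Return the .mhd files in `folder`, sorted so that indexing is repeatable."""
    return sorted(glob(os.path.join(folder, "*.mhd")))
```

Calling `list_mhd_files(LUNA_SUBSET_PATH)` in place of the bare `glob(...)` call would be one way to use it, assuming the path constant defined in the notebook.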

## Details:
1. U-Net-style convolutional network: identifies regions with nodules (U-Net was originally designed for segmenting neuronal structures)
2. appearance of nodules within the CT scan: indicates the possibility of cancer
3. Lung Nodule Analysis 2016 (LUNA16):
   1. provides training examples with marked nodules in order to train the U-Net to find these nodules (CT images with annotated nodule locations)
   2. use the LUNA data set to generate an appropriate training set for our U-Net
   3. use these examples to train our supervised segmenter

# Construct Training Set From LUNA16
*goal: build a training set of CT slices and matching nodule masks from the LUNA16 annotations*

Process:
1. use the nodule locations as given in annotations.csv and extract three transverse slices that contain the largest nodule from each patient scan
2. masks will be created for those slices based on the nodule dimensions given in annotations.csv
3. output of this file will be two files for each patient scan: a set of images and a set of corresponding nodule masks
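
Step 2 of the process above can be sketched with NumPy as follows. This is an illustrative helper rather than the notebook's own code: the name `make_circular_mask`, its parameters, and the choice of a filled circle of the annotated diameter are assumptions. It also assumes the nodule centre is already in pixel coordinates and that the in-plane pixel spacing is the same in x and y.

```python
import numpy as np

def make_circular_mask(height, width, center_xy, diameter_mm, spacing_mm):
    """Binary mask (1 inside the nodule, 0 elsewhere) for one transverse slice.

    center_xy   -- nodule centre in pixel coordinates (x, y)
    diameter_mm -- nodule diameter from annotations.csv
    spacing_mm  -- in-plane pixel spacing in mm (assumed equal in x and y)
    """
    radius_px = (diameter_mm / 2.0) / spacing_mm
    yy, xx = np.ogrid[:height, :width]  # row (y) and column (x) index grids
    dist_sq = (xx - center_xy[0]) ** 2 + (yy - center_xy[1]) ** 2
    return (dist_sq <= radius_px ** 2).astype(np.uint8)

# Example values taken from the notebook output for the largest nodule
mask = make_circular_mask(512, 512, center_xy=(426, 286),
                          diameter_mm=13.59647134, spacing_mm=0.54882801)
print(mask.shape, mask.sum())
```

The same circle is used for every slice here, which overstates the nodule cross-section on slices away from its centre. A fuller version could shrink the radius with distance along z.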


* import tools
* find largest nodule in the patient scan
* use df_node (a pandas DataFrame) to keep track of the case numbers and the node information (there may be multiple nodule listings for some patients in annotations.csv)
* node information is an (x,y,z) coordinate in mm using a coordinate system defined in the .mhd file
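
Since a scan can appear in several rows of annotations.csv, the largest nodule has to be chosen per `seriesuid`. Below is a minimal standard-library sketch of that selection, assuming the column names `seriesuid`, `coordX`, `coordY`, `coordZ`, and `diameter_mm`. The function name `largest_nodule_per_scan` and the sample rows are made up for illustration.

```python
import csv
import io

def largest_nodule_per_scan(csv_file):
    """Map each seriesuid to its row with the largest diameter_mm."""
    best = {}
    for row in csv.DictReader(csv_file):
        uid = row["seriesuid"]
        if uid not in best or float(row["diameter_mm"]) > float(best[uid]["diameter_mm"]):
            best[uid] = row
    return best

# Made-up rows in the annotations.csv layout (not real LUNA16 data)
sample = io.StringIO(
    "seriesuid,coordX,coordY,coordZ,diameter_mm\n"
    "scan_a,10.0,20.0,-30.0,4.5\n"
    "scan_a,11.0,21.0,-31.0,12.0\n"
    "scan_b,-5.0,8.0,-100.0,6.2\n"
)
largest = largest_nodule_per_scan(sample)
print({uid: row["diameter_mm"] for uid, row in largest.items()})
```

The notebook itself does this per file with pandas. The sketch only shows the selection logic in isolation.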

In [3]:
from tqdm import tqdm # quick check: does tqdm slow a loop down significantly? if not, use it
for i in tqdm(range(10000000)):
    pass

100%|█████████████████████████████████████████████████████████████████| 10000000/10000000 [00:03<00:00, 2974410.93it/s]


In [7]:
import SimpleITK as sitk
import numpy as np
import csv
from glob import glob
import pandas as pd

# path constants
LUNA_DATA_PATH = '../../../../data/luna16/'
LUNA_SUBSET_PATH = LUNA_DATA_PATH + 'subset0/'

file_list = glob(LUNA_SUBSET_PATH + "*.mhd") # get all the mhd image files

# Helper function: return the path of the .mhd file whose name contains the given seriesuid
def get_filename(case):
    for f in file_list: # file_list is the module-level list of .mhd paths
        if case in f:
            return f
    return None # seriesuid not present in this subset

# The locations of the nodules
df_node = pd.read_csv(LUNA_DATA_PATH + "annotations.csv")
df_node["file"] = df_node["seriesuid"].apply(get_filename) # for every row, save the matching file name in the 'file' column
df_node = df_node.dropna() # drop rows whose seriesuid was not found in this subset (their 'file' value is missing)

In [2]:
img_file = file_list[2]
itk_img = sitk.ReadImage(img_file) # using sitk to read a .mhd image
print(itk_img)

Image (00000214B1B0DA50)
  RTTI typeinfo:   class itk::Image<short,3>
  Reference Count: 1
  Modified Time: 1060
  Debug: Off
  Object Name: 
  Observers: 
    none
  Source: (none)
  Source output name: (none)
  Release Data: Off
  Data Released: False
  Global Release Data: Off
  PipelineMTime: 1035
  UpdateMTime: 1059
  RealTimeStamp: 0 seconds 
  LargestPossibleRegion: 
    Dimension: 3
    Index: [0, 0, 0]
    Size: [512, 512, 161]
  BufferedRegion: 
    Dimension: 3
    Index: [0, 0, 0]
    Size: [512, 512, 161]
  RequestedRegion: 
    Dimension: 3
    Index: [0, 0, 0]
    Size: [512, 512, 161]
  Spacing: [0.548828, 0.548828, 1.25]
  Origin: [-187.7, -108.3, -194]
  Direction: 
1 0 0
0 1 0
0 0 1

  IndexToPointMatrix: 
0.548828 0 0
0 0.548828 0
0 0 1.25

  PointToIndexMatrix: 
1.82206 0 0
0 1.82206 0
0 0 0.8

  Inverse Direction: 
1 0 0
0 1 0
0 0 1

  PixelContainer: 
    ImportImageContainer (00000214B1C224D0)
      RTTI typeinfo:   class itk::ImportImageContainer<unsigned __int

In [3]:
# get the associated 3d pixel array for the .mhd image
img_array = sitk.GetArrayFromImage(itk_img) # indexes are z,y,x (notice the ordering)
print(img_array)

[[[-3024 -3024 -3024 ..., -3024 -3024 -3024]
  [-3024 -3024 -3024 ..., -3024 -3024 -3024]
  [-3024 -3024 -3024 ..., -3024 -3024 -3024]
  ..., 
  [-3024 -3024 -3024 ..., -3024 -3024 -3024]
  [-3024 -3024 -3024 ..., -3024 -3024 -3024]
  [-3024 -3024 -3024 ..., -3024 -3024 -3024]]

 [[-3024 -3024 -3024 ..., -3024 -3024 -3024]
  [-3024 -3024 -3024 ..., -3024 -3024 -3024]
  [-3024 -3024 -3024 ..., -3024 -3024 -3024]
  ..., 
  [-3024 -3024 -3024 ..., -3024 -3024 -3024]
  [-3024 -3024 -3024 ..., -3024 -3024 -3024]
  [-3024 -3024 -3024 ..., -3024 -3024 -3024]]

 [[-3024 -3024 -3024 ..., -3024 -3024 -3024]
  [-3024 -3024 -3024 ..., -3024 -3024 -3024]
  [-3024 -3024 -3024 ..., -3024 -3024 -3024]
  ..., 
  [-3024 -3024 -3024 ..., -3024 -3024 -3024]
  [-3024 -3024 -3024 ..., -3024 -3024 -3024]
  [-3024 -3024 -3024 ..., -3024 -3024 -3024]]

 ..., 
 [[-3024 -3024 -3024 ..., -3024 -3024 -3024]
  [-3024 -3024 -3024 ..., -3024 -3024 -3024]
  [-3024 -3024 -3024 ..., -3024 -3024 -3024]
  ..., 
  [-3024 -

In [11]:
mini_df = df_node[df_node["file"]==file_list[2]] # get all nodules associated with this file
for node_idx, cur_row in mini_df.iterrows():
    print(node_idx)
    print(cur_row)

25
seriesuid      1.3.6.1.4.1.14519.5.2.1.6279.6001.109002525524...
coordX                                                   46.1885
coordY                                                   48.4028
coordZ                                                  -108.579
diameter_mm                                              13.5965
file           ../../../../data/luna16/subset0\1.3.6.1.4.1.14...
Name: 25, dtype: object
26
seriesuid      1.3.6.1.4.1.14519.5.2.1.6279.6001.109002525524...
coordX                                                    36.392
coordY                                                   76.7717
coordZ                                                  -123.322
diameter_mm                                               4.3432
file           ../../../../data/luna16/subset0\1.3.6.1.4.1.14...
Name: 26, dtype: object


In [10]:
biggest_node = np.argmax(mini_df["diameter_mm"].values) # position (within mini_df) of the nodule with the largest diameter
node_x = mini_df["coordX"].values[biggest_node]
node_y = mini_df["coordY"].values[biggest_node]
node_z = mini_df["coordZ"].values[biggest_node]
diam = mini_df["diameter_mm"].values[biggest_node]
print(biggest_node)
print(node_x)
print(node_y)
print(node_z)
print(diam)
print('\n')

center = np.array([node_x,node_y,node_z])   # nodule center
origin = np.array(itk_img.GetOrigin())      # x,y,z  Origin in world coordinates (mm)
spacing = np.array(itk_img.GetSpacing())    # spacing of voxels in world coor. (mm)
v_center = np.rint((center-origin)/spacing) # nodule center in voxel space (still x,y,z ordering)

print(center)
print(origin)
print(spacing)
print(v_center)

0
46.18853869
48.40280596
-108.5786324
13.59647134


[  46.18853869   48.40280596 -108.5786324 ]
[-187.699997 -108.300003 -194.      ]
[ 0.54882801  0.54882801  1.25      ]
[ 426.  286.   68.]
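
With the voxel-space centre available, the three transverse slices around the nodule can be pulled out of the image array. The sketch below is one way to do it under stated assumptions: the helper name `three_slices_around` is hypothetical, the slices taken are the centre slice and its two neighbours, and indices are clipped at the volume boundary. Note that `v_center` is in x, y, z order while the array is indexed z, y, x.

```python
import numpy as np

def three_slices_around(volume_zyx, v_center_xyz):
    """Return (z_indices, slices) for the centre slice and its two neighbours.

    volume_zyx   -- 3-D array indexed [z, y, x], as returned by GetArrayFromImage
    v_center_xyz -- nodule centre in voxel coordinates, in x, y, z order
    """
    z_center = int(v_center_xyz[2])  # only the z component selects the slice
    n_slices = volume_zyx.shape[0]
    # Clip so a nodule near the first or last slice never indexes out of range
    z_indices = np.clip(np.arange(z_center - 1, z_center + 2), 0, n_slices - 1)
    return z_indices, volume_zyx[z_indices]

# Small stand-in volume: 161 slices as in the scan above, tiny in-plane size
volume = np.zeros((161, 8, 8), dtype=np.int16)
z_idx, slices = three_slices_around(volume, np.array([426.0, 286.0, 68.0]))
print(z_idx, slices.shape)
```

Clipping repeats the boundary slice when the nodule sits on the first or last slice, which keeps the output at three slices per scan.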


In [1]:
import tensorflow as tf

In [2]:
import keras

Using TensorFlow backend.
