Skip to content

Doc-Ix/histopathology

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

51 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Histopathology

In this repository you find the capstone project of my Machine Learning Nanodegree at Udacity. Based on the kaggle challenge Histopathologic Cancer Detection - Identify metastatic tissue in histopathologic scans of lymph node sections, the goal of this computer-vision project was an algorithm that is able to identify metastatic cancer in small image patches taken from larger digital pathology scans. Using digital whole slide images (WSI) of lymph nodes sections as input, algorithms were trained to identify tumerous tissue in the center of the WSI provided. The best CNN, based on a Xception network, received a ROC-AUC score of 0.974 on the testing set. Please see section Project Details for further explanation.

Please feel free to use the files provided for further testing/training/processing and experimentation. 😃

Table of Contents

Content of Repository & Downloads

  • Histo-App-Final.ipynb: A jupyter notebook file with the main code of the project
  • utils.py: A python file with supplementary functions
  • environment.yml: Conda environment file

The following files can be downloaded via google drive:

File Description Link
train.zip, train_labels.csv Original dataset and labels from kaggle challenge Link
weights.best.model_Xception_full.h5 Weights of best perfoming model, based on Xception Link
test_data_lables_and_prediction_NASNetmobile_full.pkl Data frame with the true labels of the testing set as well as the prediction results of NASNetmobile, stored as pickle file. Link
test_data_lables_and_prediction_Xception_full.pkl Data frame with the true labels of the testing set as well as the prediction results of Xception, stored as pickle file. Link

Enviroment

Create a new Anaconda environment from file.

NAME: tensorflow_p36
conda env create -f environment.yml

Project Details

Datasets and Inputs

The dataset was provided in the kaggle challenge Histopathologic Cancer Detection - Identify metastatic tissue in histopathologic scans of lymph node sections. Based on the PatchCamelyon (PCam) benchmark dataset, the data for the competition was slightly modified. Due to probabilistic sampling, the original dataset contained duplicates that were removed in the Kaggle dataset. The dataset itself contains a very large number of small histopathologic WSIs for classification, taken from lymph node sections. Every file is labeled with an id, and a csv-file provides the information for all image-ids, weather there is at least one cell with tumor tissue in the inner image region or not. If the label is positive, at least one cell in the 32x32px center of the picture is with tumor tissue. The outer region does not influence the label, it was provided to enable fully-convolutional models that do not use zero-padding. Each file is 96x96px large with three color channels each.

Evaluation Metrics

For evaluation the receiver operating characteristic (ROC) curve will be used, plotting the true positive rate against the false positive rate. It illustrates the performance of the classifier at varying thresholds and gives an overall performance parameter, without incorporating any decisions about the optimal threshold for this particular disease resp. treatment.

Benchmark

In the conference paper, which presents the underlying PCam dataset, the best CNN implemented by the authors (P4M-DenseNet) scores 0.963 (AUC), which can be seen as a fair benchmark for this project.

Data Preparation and Labelling

The original dataset from the kaggle challenge comes as a zip-folder, containing all image data as well as a csv-file with the corresponding labels. In total there are 220.025 single tiff-images, from which a dataframe was created and data was splitted in training/testing (80/20) and then in training/validation (80/20). The data was then stored in labeled folders for further processing. As a result, three folders with labeled subfolders were created, containing 140,816 training, 35,204 validation, and 44,005 testing images.

Data Distribution and Sample Illustration

Total Positive Negative Ratio
Training 140,816 57,035 83,781 approx. 40/60
Validation 35,204 14,259 20,945 approx. 40/60
Test 44,005 17,823 26,182 approx. 40/60

The following two figures illustrate examples from the dataset labeled as non-cancerous in the 30x30 px center and as cancerous in its center. For non-trained ordinary persons, it is hardly possible to identify any differences.

Sample images labeled non-cancerous: Non-cancerous

Sample images labeled cancerous: Cancerous

Data Augmentation and Pipeline

Although the dataset is quite large, an additional data augmentation was implemented. Using the Keras module, horizontal as well as vertical flips were implemented as well as rotations and zooms. Furthermore, shifts in height, width, channel as well as shear modulations were realized. To facilitate CNN training, all images were rescaled to values between 0 and 1. During training and validation, the images should be directly streamed from their corresponding folders, keeping their original size of 96x96 px. With a batch size of 200, data generators for training and validation were initialized.

Model Building

Building on pre-trained models in the Keras library and inspired by different blogs about this topic, the following model for training was chosen and implemented, based on a blog post by Youness Mansar. The core of the network is the NASNetmobile model, since it is credited to be fast and still high performant in image recognition. Three parallel layers follow the core model, a global-max-pooling a global-average-pooling and a flatten layer. After that a dropout layer (rate=0.5) is installed and the final dense layer with a sigmoid activation function builds the end of the model. The resulting CNN has 4,281,333 parameters from which 4,244,595 were trainable in the configuration defined.

CNN 1, building on NASNetmobile:

As a second network, an Xception model was embedded in a model structure with a global-average-pooling layer and a dropout layer (rate=0.5) following the core model. The final layer is also a dense layer with a sigmoid activation function. With 20,809,001 trainable parameters (out of 20,863,529) the resulting model is much bigger then the NasNetmobile, however much closer to the standard library model.

CNN 2, building on Xception:

Training

The whole models were trained on an AWS EC2 p2.xlarge instance. The training of the models was tracked and the configurations with the currently best validation results were automatically stored to dedicated folders as h5-files.

For detailed training logs, please see the jupyter notebook file.

Testing

In order to make predictions with the trained models, data frames with the paths to the single testing images and their corresponding labels were created. The predictions of the models were then added to a new column of the data frame and the data frames were stored as pickle files, after deleting unnecessary columns, in order to reduce file size. The data frames for the two models can be found in the Downloads section. After that the data frames were optimized for further processing and an additional column with a binary prediction value was added, using a threshold of 0.5.

Results

Model Sensitivity Specitivity False Positive Rate ROC-AUC
CNN 1, building on NASNetmobile 81.06 % 90.22 % 9.78 % 0.937
CNN 2, building on Xception 83.23 % 97.78 % 2.22 % 0.974

With an ROC-AUC score of 0.974 the CNN 2 model was able to reach the goal of the project, to build a binary image classifier for histopathologic cancer detecting, reaching an ROC-AUC value above 0.95.

Plot of ROC-AUC for Xception model:

ROC-AUC_Xception

License

This repository is under the MIT License.

Acknowledgments

  • The dataset of this project was provided by kaggle, building on the dataset of Bas Veeling, with additional input from Babak Ehteshami Bejnordi, Geert Litjens, and Jeroen van der Laak.

  • Thanks to kaggle for hosting this challenge

About

Based on Kaggle Challenge: Histopathologic Cancer Detection

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors