# Histology Tissue Classification Project (HTCP)



(C) [K. Mader](https://www.linkedin.com/in/kevinmader/?originalSubdomain=ch) / [U. Michelucci 2018-2019](https://www.linkedin.com/in/umbertomichelucci/?originalSubdomain=ch)

*Teaching Assistant:* [Khaled Mohamad](https://www.linkedin.com/in/khaled-mohamad-45071a24b/).

# Overview

The dataset serves as a much more interesting MNIST or CIFAR10 problem for biologists by focusing on histology tiles from patients with colorectal cancer. In particular, the data has 8 different classes of tissue (but Cancer/Not Cancer can also be an interesting problem).

The dataset has been adapted for the course by K. Mader (kevin.mader@gmail.com) and is available on Kaggle: https://goo.gl/26zj41

# Challenge

* Classify tiles correctly into one of the eight classes
* Which classes are most frequently confused?
* What features can be used (like texture, see scikit-image) to improve classification?
* How can these models be applied to the much larger 5000x5000 images? How can this be done efficiently?
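
For the question about confused classes, one possible starting point is to rank the off-diagonal entries of a confusion matrix. The sketch below assumes a square matrix whose rows are true classes and whose columns are predicted classes (the convention of scikit-learn's `confusion_matrix`). The function name `most_confused_pairs` and the toy counts are illustrative and not part of the dataset.

```python
import numpy as np

def most_confused_pairs(cm, top_k=3):
    """Return the top_k (true_idx, pred_idx, count) off-diagonal entries of cm."""
    off_diag = np.asarray(cm).copy()
    np.fill_diagonal(off_diag, 0)                      # ignore correct predictions
    flat_order = np.argsort(off_diag, axis=None)[::-1][:top_k]
    rows, cols = np.unravel_index(flat_order, off_diag.shape)
    return [(int(r), int(c), int(off_diag[r, c])) for r, c in zip(rows, cols)]

# Toy 3-class matrix with made-up counts (rows = true class, columns = predicted class)
toy_cm = np.array([[50, 8, 2],
                   [12, 40, 1],
                   [0, 3, 60]])
pairs = most_confused_pairs(toy_cm, top_k=2)
print(pairs)
```

Each tuple reads "true class, predicted class, number of tiles", so the first entry is the most frequent mistake.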

# Acknowledgements


The dataset has been copied from Zenodo: https://zenodo.org/record/53169#.W6HwwP4zbOQ

made by: Kather, Jakob Nikolas; Zöllner, Frank Gerrit; Bianconi, Francesco; Melchers, Susanne M; Schad, Lothar R; Gaiser, Timo; Marx, Alexander; Weis, Cleo-Aron

The copy here makes it more accessible to Kaggle users and allows kernels that provide basic analysis of the data.

Content: this data set represents a collection of textures in histological images of human colorectal cancer. It contains two files:

>     "Kather_texture_2016_image_tiles_5000.zip": a zipped folder containing 5000 
>     histological images of 150 * 150 px each (74 * 74 µm). Each image belongs 
>     to exactly one of eight tissue categories (specified by the folder name). 

>     "Kather_texture_2016_larger_images_10.zip": a zipped folder containing 10 
>     larger histological images of 5000 x 5000 px each. These images contain 
>     more than one tissue type.

Image format:


All images are RGB, 0.495 µm per pixel, digitized with an Aperio ScanScope 
(Aperio/Leica biosystems), magnification 20x. Histological samples are fully 
anonymized images of formalin-fixed paraffin-embedded human colorectal 
adenocarcinomas (primary tumors) from our pathology archive (Institute of Pathology, 
University Medical Center Mannheim, Heidelberg University, Mannheim, Germany).
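
To apply a tile classifier to the larger images, one option is to cut them into non-overlapping patches of the same size as the training tiles. The sketch below is a minimal illustration under that assumption; `split_into_tiles` is a hypothetical helper, and the demo array is synthetic so that it stays small.

```python
import numpy as np

def split_into_tiles(image, tile=150):
    """Cut an (H, W, C) image into non-overlapping (tile, tile, C) patches.

    Pixels that do not fill a complete tile at the right or bottom edge are dropped.
    """
    h, w, c = image.shape
    n_rows, n_cols = h // tile, w // tile
    cropped = image[:n_rows * tile, :n_cols * tile]
    # Split both spatial axes into (tile index, offset within tile), then group the tile indices
    tiles = cropped.reshape(n_rows, tile, n_cols, tile, c).swapaxes(1, 2)
    return tiles.reshape(n_rows * n_cols, tile, tile, c)

# Small synthetic stand-in for a large image (values are arbitrary)
demo = np.arange(500 * 500 * 3, dtype=np.uint32).reshape(500, 500, 3)
tiles = split_into_tiles(demo, tile=150)
print(tiles.shape)
```

With a 5000 x 5000 image and 150 px tiles, 33 complete tiles fit along each side, which gives 33 * 33 = 1089 patches that can be passed to the model as one batch.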

Additionally, the files have been prepared to resemble the MNIST dataset, meaning that you will also find the following files:

- HTCP_8_8_L
- HTCP_8_8_RGB
- HTCP_28_28_L
- HTCP_28_28_RGB
- HTCP_64_64_L
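
The file names appear to encode the image height, width and colour mode. Assuming that `L` means single-channel grayscale and `RGB` means three channels (the usual Pillow mode names; the source does not spell this out), the number of pixel values per image follows from the name. `parse_htcp_name` is a hypothetical helper written for this illustration.

```python
def parse_htcp_name(name):
    """Split a name like 'HTCP_28_28_L' into (height, width, channels)."""
    prefix, height, width, mode = name.split("_")
    if prefix != "HTCP":
        raise ValueError(f"unexpected prefix in {name!r}")
    channels = {"L": 1, "RGB": 3}[mode]   # assumed meaning of the mode suffix
    return int(height), int(width), channels

names = ["HTCP_8_8_L", "HTCP_8_8_RGB", "HTCP_28_28_L", "HTCP_28_28_RGB", "HTCP_64_64_L"]
for name in names:
    h, w, c = parse_htcp_name(name)
    print(name, "->", h * w * c, "pixel values per image")
```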

# Ethics statement
All experiments were approved by the institutional ethics board (medical ethics board II, University Medical Center Mannheim, Heidelberg University, Germany; approval 2015-868R-MA). The institutional ethics board waived the need for informed consent for this retrospective analysis of anonymized samples. All experiments were carried out in accordance with the approved guidelines and with the Declaration of Helsinki.

# More information / data usage
For more information, please refer to the following article. Please cite this article when using the data set.

Kather JN, Weis CA, Bianconi F, Melchers SM, Schad LR, Gaiser T, Marx A, Zollner F: Multi-class texture analysis in colorectal cancer histology (2016), Scientific Reports (in press)

# Contact
For questions, please contact: Dr. Jakob Nikolas Kather http://orcid.org/0000-0002-3730-5348 ResearcherID: D-4279-2015


# Download the data

The data comes in two parts:

- The small images that will be used to test the classification models
- The big microscope images (5000x5000)

The first part is quite small and can be found in the same GitHub repository as this file. The second is much bigger (250 MB and 700 MB) and cannot be uploaded to GitHub, so you can get it on Kaggle: https://goo.gl/hkRSke

# Ideas for the project

The project can be tackled in several ways and at several levels. Here are some ideas for you to tackle at different difficulty levels.

A few general hints:

- Accuracy is a nice metric, but in this case the confusion matrix is more useful. Check which metric suits this problem best (you could use others).
- If detecting TUMOR proves too hard, try to detect other tissue types, for example ADIPOSE. Some are much easier to detect than others.
- __REMEMBER__: detecting __ONE__ type of tissue well does not necessarily mean being able to detect __ALL__ types of tissue well ;-)
- __REMEMBER__: getting a high accuracy is __NOT__ the goal of the project. The goal is to put you in a real-life situation where you have to be creative to solve a relevant problem. It is not easy, and there are no easy ways of solving it.
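
As a sketch of metrics other than plain accuracy, per-class precision, recall and F1 can be read directly off a confusion matrix. The code assumes rows are true classes and columns are predicted classes; `per_class_report` and the 2-class toy counts are made up for illustration.

```python
import numpy as np

def per_class_report(cm):
    """Compute precision, recall and F1 per class from a confusion matrix."""
    cm = np.asarray(cm, dtype=float)
    true_pos = np.diag(cm)
    predicted = cm.sum(axis=0)     # column sums: how often each class was predicted
    actual = cm.sum(axis=1)        # row sums: how often each class truly occurs
    precision = np.divide(true_pos, predicted, out=np.zeros_like(predicted), where=predicted > 0)
    recall = np.divide(true_pos, actual, out=np.zeros_like(actual), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(denom), where=denom > 0)
    return precision, recall, f1

toy_cm = np.array([[8, 2],
                   [1, 9]])
precision, recall, f1 = per_class_report(toy_cm)
accuracy = np.trace(toy_cm) / toy_cm.sum()
balanced_accuracy = recall.mean()   # mean of the per-class recalls
print(precision, recall, f1, accuracy, balanced_accuracy)
```

Looking at recall per class shows directly which tissue types the model misses, something a single accuracy number hides.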

__OVER ALL REMEMBER: HAVE FUN!__


## Medium
- Use the gray-level 28x28 images and consider all the classes. Try to build a classifier using a neural network with several layers. After a first test, use hyperparameter tuning to find the best model for the problem. Consider the following hyperparameters:
    - learning rate
    - number of layers / number of neurons in each layer
    - mini-batch size
    - number of epochs
    - activation function (maybe swish helps?)
- After trying what is described in the previous point, see whether you get better results with the 64x64 gray images. Also try the 8x8 images to see whether they give better results (they do not, but try).
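
One way to organise the tuning is to write the candidate values down once and enumerate the combinations. The sketch below only generates configurations; it does not train anything. The candidate values and the name `all_configurations` are illustrative assumptions, and the number of epochs is left out to keep the grid small.

```python
import itertools
import random

# Candidate values are illustrative only; adjust them to your own time budget.
search_space = {
    "learning_rate": [0.1, 0.01, 0.001],
    "hidden_layers": [1, 2, 4],
    "neurons_per_layer": [10, 15, 30],
    "batch_size": [50, 250],
    "activation": ["relu", "swish"],
}

def all_configurations(space):
    """Yield one dict per combination of hyperparameter values (grid search)."""
    keys = list(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

grid = list(all_configurations(search_space))
print(len(grid), "configurations in the full grid")

# Random search: evaluate only a reproducible sample of the grid
rng = random.Random(0)
sample = rng.sample(grid, k=10)
for config in sample[:3]:
    print(config)
```

Because the full grid grows multiplicatively with every added hyperparameter, sampling a handful of configurations is often a more practical first pass than training all of them.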






## __STARTING WITH__:
## 2. Medium!

# Helper Functions (Python)

In [None]:
# A function for plotting images

def plot_image(some_image):
    
    some_digit_image = some_image.values.reshape(28,28)

    plt.imshow(some_digit_image, cmap = matplotlib.cm.binary, interpolation = "nearest")
    plt.axis("off")
    plt.show()
    

In [None]:
# A function to get the label names (there are eight)

LABEL_NAMES = {
    1: '(1) TUMOR',
    2: '(2) STROMA',
    3: '(3) COMPLEX',
    4: '(4) LYMPHO',
    5: '(5) DEBRIS',
    6: '(6) MUCOSA',
    7: '(7) ADIPOSE',
    8: '(8) EMPTY',
}

def get_label_name(idx):
    # Returns None for an index outside 1..8
    return LABEL_NAMES.get(idx)
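
The overview mentions a Cancer/Not Cancer variant of the problem. A minimal sketch of the relabeling follows, assuming that only label 1 (TUMOR in `get_label_name`) counts as cancer. That mapping is an assumption; you may decide that other classes, such as COMPLEX, belong on the positive side too.

```python
import numpy as np

TUMOR_LABEL = 1   # '(1) TUMOR' in get_label_name

def to_binary_labels(labels, positive=TUMOR_LABEL):
    """Map the eight tissue labels to 1 (tumor) or 0 (any other tissue)."""
    return (np.asarray(labels) == positive).astype(int)

example_labels = np.array([1, 2, 7, 1, 8, 3])   # made-up labels
binary = to_binary_labels(example_labels)
print(binary)
```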

# Loading and Importing Libraries

In [None]:
# Load python libraries

%matplotlib inline
from glob import glob
import os
import matplotlib.pyplot as plt
import matplotlib
import numpy as np
import pandas as pd
from random import randint

# Read images from files and plot
from skimage.io import imread 
import seaborn as sns

# Tensorflow & Keras is imported for building and training models
import tensorflow as tf

# Keras 
from tensorflow.keras.models import Sequential # for building the  layers
from tensorflow.keras.optimizers import SGD    # Optimizer 
from tensorflow.keras.layers import Dense      # Connected network

from tensorflow.keras import layers
import tensorflow.keras as keras
from sklearn.metrics import confusion_matrix, accuracy_score   # Measure performance of your classifier and accuracy
import time

# Checking your TensorFlow Version
##### __Uncomment__ the cell below and run it

In [None]:

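# min_version = '2.10.0'

# def version_tuple(version):
#   # Compare numerically: as plain strings, '2.9.0' >= '2.10.0' would be True
#   return tuple(int(part) for part in version.split('.')[:2])

# def tf_version(tf):
#   tf_v = tf.__version__
#   if version_tuple(tf_v) >= version_tuple(min_version):
#     print(f"Your version of TensorFlow is {tf_v}: requirement satisfied!")
#   else:
#     print("Your version of TensorFlow is too old, updating ...")
#     !pip3 install --upgrade tensorflow

# tf_version(tf)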


# Load the Data
#### __Replace__ your current directory path with '/content/data'
#### __Uncomment__ the last two lines of code below USING ( Ctrl + / ) and run your cell

In [None]:
# Infer the image dimensions from the length of a flattened image row
def know_image_dim(in_shape):

    # Grayscale: the length is a perfect square (plus at most one label column)
    side_len = int(np.sqrt(in_shape))
    if np.abs(in_shape - side_len * side_len) < 2:
        return (side_len, side_len)
    else:
        # RGB: three values per pixel
        side_len = int(np.sqrt(in_shape / 3))
        return (side_len, side_len, 3)
        
# csv_dir = os.path.join('.', '/content/data')
# print(f"My current working directory is: {csv_dir} ")
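
A worked example of the dimension guess may help. Each CSV row holds the flattened pixels plus one `label` column, which is why a difference of 1 from a perfect square is tolerated. The sketch below uses a standalone function with the hypothetical name `infer_image_dim` and the row lengths expected from the file names, assuming `L` files have one value per pixel and `RGB` files three.

```python
import math

def infer_image_dim(row_length):
    """Guess (H, W) or (H, W, 3) from the length of a flattened image row.

    A difference of at most 1 from a perfect square is tolerated so that an
    extra label column does not break the guess.
    """
    side = int(math.sqrt(row_length))
    if abs(row_length - side * side) < 2:
        return (side, side)                     # grayscale
    side = int(math.sqrt(row_length / 3))
    return (side, side, 3)                      # RGB

# Pixel columns plus one label column for each assumed file layout
for row_length in (8 * 8 + 1, 8 * 8 * 3 + 1, 28 * 28 + 1, 28 * 28 * 3 + 1, 64 * 64 + 1):
    print(row_length, "->", infer_image_dim(row_length))
```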

# Print your vector size for every dataset
#### __Uncomment__ code below and run your cell

In [None]:
# Return all HTCP file names from the data directory, sorted by file size in ascending order
all_files = sorted(glob(os.path.join(csv_dir, 'HTCP*.csv')), 
                   key=lambda x: os.stat(x).st_size)



# all_df_dict = {os.path.splitext(os.path.basename(x))[0]: pd.read_csv(x) for x in all_files}
# print("VECTOR SIZE FOR EVERY DATASET:\n")
# for c_key in all_df_dict.keys():
#     print(c_key, 'vector length:',  
#           all_df_dict[c_key].shape[1], '->', 
#             know_image_dim(all_df_dict[c_key].shape[1]))

# Print the folder and data paths
#### __Uncomment__ all_files and run

In [None]:
# all_files

In [None]:
# Read one CSV file from the list (all_files); index 2 is the third-smallest file
data = pd.read_csv(all_files[2])

Let's create an array with the labels (not yet one-hot encoded) and one for the images.

# Get the labels from the data
#### __Uncomment__  code below and run your cell

In [None]:

# labels = data['label']
# data = data.drop(['label'], axis = 1)
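
Before training, it can be worth checking how many tiles each class has, since accuracy is harder to interpret when classes are unbalanced. A minimal sketch with made-up labels follows; in the notebook you would pass your own `labels` column instead. `class_counts` is a hypothetical helper.

```python
import numpy as np

def class_counts(labels, num_classes=8):
    """Count how many samples carry each 1-based label from 1 to num_classes."""
    labels = np.asarray(labels)
    return {k: int(np.sum(labels == k)) for k in range(1, num_classes + 1)}

# Made-up labels for illustration only
demo_labels = np.array([1, 1, 2, 3, 3, 3, 8])
counts = class_counts(demo_labels)
print(counts)
```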


# Baseline models with 28x28 gray images


### Import python libraries
##### __Uncomment__ code below and run your cell


In [None]:
# from sklearn.model_selection import train_test_split
# sample_id_count = list(all_df_dict.values())[0].shape[0]
# train_ids, test_ids = train_test_split(range(sample_id_count), 
#                                        test_size=0.25, 
#                                        random_state=2018)
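
The split above works on row indices, so the same tiles end up in the test set for every image size. If you also want each class to be represented in the same proportion in both sets, scikit-learn's `train_test_split` accepts a `stratify` argument. The sketch below shows the idea in plain NumPy; the function name and the synthetic labels are assumptions made for this illustration.

```python
import numpy as np

def stratified_split_ids(labels, test_fraction=0.25, seed=2018):
    """Return (train_ids, test_ids) with roughly equal class proportions in both."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    train_part, test_part = [], []
    for cls in np.unique(labels):
        ids = rng.permutation(np.flatnonzero(labels == cls))   # shuffle within the class
        n_test = int(round(len(ids) * test_fraction))
        test_part.extend(ids[:n_test])
        train_part.extend(ids[n_test:])
    return np.sort(train_part), np.sort(test_part)

# Synthetic labels: 8 classes with 40 samples each (illustrative only)
demo_labels = np.repeat(np.arange(1, 9), 40)
demo_train_ids, demo_test_ids = stratified_split_ids(demo_labels)
print(len(demo_train_ids), len(demo_test_ids))
```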

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score

def evaluate_models(in_model_maker):
    fig, m_axs = plt.subplots(1, 5, figsize = (25, 5))
    for c_ax, c_key in zip(m_axs, all_df_dict.keys()):
        # c_key is for example HTCP_8_8_L (the file/type name)
        c_df = all_df_dict[c_key].copy()
        c_label = c_df.pop('label') # return column and drop from dataframe
        c_model = in_model_maker() # function of the model
        c_model.fit(c_df.iloc[train_ids, :], c_label.iloc[train_ids]) # fit of the model
        c_pred = c_model.predict(c_df.iloc[test_ids, :]) # prediction
        sns.heatmap(confusion_matrix(c_label.iloc[test_ids], c_pred), 
                    annot=True, cbar=False, fmt='d', ax=c_ax)
        c_ax.set_title(f'Accuracy: {accuracy_score(c_label.iloc[test_ids],c_pred)*100:2.2f}%\n{c_key}')

# Example of a network with 4 layers, each with 15 neurons

In [None]:
class NBatchLogger(keras.callbacks.Callback):
    """
    A logger that prints the average of each logged metric every `display` steps.
    """
    def __init__(self, display):
        super().__init__()
        self.step = 0
        self.display = display
        self.metric_cache = {}

    def on_batch_end(self, batch, logs=None):
        logs = logs or {}
        self.step += 1
        # Accumulate every value Keras reports for this batch (e.g. loss, accuracy)
        for k, v in logs.items():
            self.metric_cache[k] = self.metric_cache.get(k, 0.0) + v
        if self.step % self.display == 0:
            metrics_log = ''
            for (k, v) in self.metric_cache.items():
                val = v / self.display
                if abs(val) > 1e-3:
                    metrics_log += ' - %s: %.4f' % (k, val)
                else:
                    metrics_log += ' - %s: %.4e' % (k, val)
            print('step: {}/{} ... {}'.format(self.step,
                                          self.params['steps'],
                                          metrics_log))
            self.metric_cache.clear()

# Make a copy of Data and Labels.
#### __Uncomment__ code below and run your cell

In [None]:

# c_df = all_df_dict['HTCP_28_28_L'].copy()
# c_label = c_df.pop('label')


### Split your data into train and test sets
#### __Uncomment__ code below and run your cell

In [None]:

# X_train = c_df.iloc[train_ids, :]
# y_train = c_label.iloc[train_ids]-1

In [None]:
# X_test = c_df.iloc[test_ids, :]
# y_test = c_label.iloc[test_ids]-1

# Print shape of your dataset
#### __Uncomment__ code below and run your cell

In [None]:
# X_train.shape

In [None]:
# Convert class vectors to binary class matrices (one-hot encoding)
num_of_classes = 8
y_train = keras.utils.to_categorical(y_train, num_of_classes)
# Hold out 10% of the training data for validation
X_train, X_val, Y_train, Y_val = train_test_split(X_train, y_train, 
                                                  test_size = 0.1, random_state=42)

# Scale pixel intensities from [0, 255] to [0, 1]
X_train = X_train / 255.0
X_val = X_val / 255.0
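
To make the two preprocessing steps concrete, here is a small NumPy sketch of what one-hot encoding and scaling do. It assumes 1-based labels as in the CSV files and 8-bit pixel intensities; `one_hot` is a hypothetical stand-in for illustration, not the Keras implementation.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn 0-based integer labels into rows of a (n_samples, num_classes) matrix."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.size, num_classes), dtype=np.float32)
    encoded[np.arange(labels.size), labels] = 1.0   # a single 1 per row, at the class index
    return encoded

raw_labels = np.array([1, 8, 3])        # 1-based, as stored in the label column
encoded = one_hot(raw_labels - 1, 8)    # shift to 0-based before encoding
print(encoded)

pixels = np.array([[0, 128, 255]], dtype=np.uint8)
scaled = pixels / 255.0                 # map 8-bit intensities to the range [0, 1]
print(scaled)
```

Subtracting 1 from the labels matters: with eight classes the valid column indices are 0 to 7, so a label of 8 would otherwise fall outside the matrix.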

# Building Neural Network Phases
#### __Your task__: change the activation function to relu (activation='relu')
###### 1. __Uncomment__ the 3 lines of the code below first!
###### 2. __Change__ your optimizer to Adam: optimizer=tf.optimizers.Adam(0.01)

In [None]:
n = 15
model = tf.keras.Sequential()
model.add(layers.Dense(n, input_dim=X_train.shape[1], activation='relu'))
# model.add(layers.Dense(n, activation='relu'))
model.add(layers.Dropout(0.40))
# model.add(layers.Dense(n, activation='relu'))
model.add(layers.Dropout(0.40))
# model.add(layers.Dense(n, activation='relu'))
model.add(layers.Dropout(0.40))
model.add(layers.Dense(num_of_classes, activation='softmax'))

model.compile(loss=keras.losses.categorical_crossentropy,
              # optimizer=tf.optimizers,
              metrics=['accuracy'])
    


#  Run Model Summary


In [None]:
# model.summary()
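
The parameter counts reported by the summary can be checked by hand: a dense layer has one weight per input-output pair plus one bias per output unit, and dropout layers add no parameters. The sketch below assumes the 28x28 grayscale input (784 columns after the label is removed) and hidden layers of 15 units; `dense_param_count` is a hypothetical helper.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the number of units per layer, starting with the input size.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out     # weight matrix plus one bias per unit
    return total

# 784 inputs, hidden layers of 15 units, 8 output classes
one_hidden = dense_param_count([784, 15, 8])
four_hidden = dense_param_count([784, 15, 15, 15, 15, 8])
print(one_hidden, four_hidden)
```

Almost all parameters sit in the first layer, so adding further 15-unit layers changes the total only slightly.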

# Train the model for 1000 epochs
##### __Uncomment__ the code below and run

In [None]:
out_batch = NBatchLogger(display=1000)
model.fit(X_train, Y_train, epochs=1000, batch_size=250,verbose = 0,
             callbacks=[out_batch])
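
It helps to estimate how often the logger will print before starting a long run. The arithmetic below is a sketch that assumes the CSV holds all 5000 tiles and that scikit-learn's `train_test_split` rounds the held-out part up when given a fraction; both are assumptions to verify against your own data.

```python
import math

n_samples = 5000                                          # tiles in the small dataset
n_train_val = n_samples - math.ceil(n_samples * 0.25)     # after the 25% test split
n_train = n_train_val - math.ceil(n_train_val * 0.1)      # after the 10% validation split

batch_size, epochs, display = 250, 1000, 1000
steps_per_epoch = math.ceil(n_train / batch_size)         # the last batch may be partial
total_steps = steps_per_epoch * epochs
log_lines = total_steps // display                        # one line every `display` steps

print(n_train, steps_per_epoch, total_steps, log_lines)
```

Note that the step counter of the logger runs across epochs, while the total shown after the slash in its output is the number of steps in a single epoch.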



# Measure the final loss and accuracy of your model
##### __Uncomment__ the code below and run the cell

In [None]:
final_loss, final_acc = model.evaluate(X_val, Y_val, verbose=1)
pred = np.argmax(model.predict(X_val), axis=-1)
pred_train = np.argmax(model.predict(X_train), axis=-1)
print("Final loss: {0:.6f}, final accuracy: {1:.6f}".format(final_loss, final_acc))
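
The predictions above are 0-based class indices, because 1 was subtracted from the labels before training. A small sketch of mapping softmax outputs back to tissue names follows; the list `TISSUE_NAMES` mirrors the order in `get_label_name`, and the probability rows are made up.

```python
import numpy as np

TISSUE_NAMES = ["TUMOR", "STROMA", "COMPLEX", "LYMPHO",
                "DEBRIS", "MUCOSA", "ADIPOSE", "EMPTY"]

def decode_predictions(probabilities):
    """Return the most likely tissue name for each row of class probabilities."""
    indices = np.argmax(probabilities, axis=-1)      # 0-based class index per sample
    return [TISSUE_NAMES[i] for i in indices]

# Two made-up probability rows (each sums to 1)
demo_probs = np.array([
    [0.05, 0.05, 0.10, 0.05, 0.05, 0.05, 0.60, 0.05],
    [0.70, 0.10, 0.05, 0.05, 0.02, 0.03, 0.03, 0.02],
])
decoded = decode_predictions(demo_probs)
print(decoded)
```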

# Visualize your confusion matrix for the training set
##### __Uncomment__ the code below and run the cell

In [None]:
# fig = plt.figure(figsize=(7,7))
# ax = fig.add_subplot(1, 1, 1)
# sns.heatmap(confusion_matrix(np.argmax(Y_train, axis = 1), pred_train), 
#                     annot=True, cbar=False, fmt='d', ax=ax)
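
Raw counts in the heatmap are hard to compare when classes differ in size. One option is to divide each row by its total so that every entry becomes the fraction of a true class that received a given prediction. The sketch below builds the matrix with NumPy for illustration; the function names and toy labels are assumptions.

```python
import numpy as np

def confusion_counts(y_true, y_pred, num_classes):
    """Build a confusion matrix with true classes as rows and predictions as columns."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)   # count each (true, pred) pair
    return cm

def row_normalize(cm):
    """Divide each row by its sum so entries become per-class rates."""
    row_sums = cm.sum(axis=1, keepdims=True)
    return np.divide(cm, row_sums, out=np.zeros(cm.shape, dtype=float), where=row_sums > 0)

y_true = np.array([0, 0, 0, 1, 1, 2])   # made-up 0-based labels
y_pred = np.array([0, 0, 1, 1, 1, 0])
cm = confusion_counts(y_true, y_pred, num_classes=3)
rates = row_normalize(cm)
print(cm)
print(rates.round(2))
```

The normalized matrix can be passed to `sns.heatmap` in the same way as the counts, with `fmt='.2f'` instead of `fmt='d'`.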

# Congratulations! This is the end of exercise (2). We hope you enjoyed it!
### __You have learned to__

1. Import the dataset
2. Find your working directory path
3. Split the data into training and test sets (features X and labels y)
4. Build a neural network step by step
5. Fit your model (training data)
6. Print the loss and accuracy of your model
7. Visualize your confusion matrix


