# Model analysis

This notebook lets you analyse your classification model and, more specifically, the importance of its features.

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow import keras
from helpers import *

Using TensorFlow backend.


## Data loading

The loader expects CSV files generated by the Collect Information option of the ImageJ plugin. The class of the nuclei must be indicated (classes mask option).

[TODO] write the paths of your measurement files, create your dataset and list the possible classes

In [2]:
mouse_path = "../../data/measurements/mouse features.csv"
human_path = "../../data/measurements/human features.csv"
human_tumor_path = "../../data/measurements/human tumor features.csv"

mouse = get_data(mouse_path, "Mouse")
human = get_data(human_path, "Human")
human_tumor = get_data(human_tumor_path, "Human")

dataset = pd.concat([mouse, human, human_tumor], ignore_index=True)

classes = ["Mouse", "Human"]

In [3]:
dataset[CLASS_COLUMN].value_counts()

Human    47529
Mouse    41511
Name: Class, dtype: int64

## Data processing
Shuffle, normalize and split the data into inputs and targets.

In [4]:
inputs, targets = process_prediction_data(dataset, classes, 'models/classification/normalization.json', True)
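
The body of `process_prediction_data` is not shown in this notebook, so the following is only a minimal sketch of the three steps named above, under stated assumptions: `shuffle_normalize_split` is a hypothetical stand-in, it computes z-score statistics from the data itself (the real helper receives a normalization file instead), and it one-hot encodes the class column, which is what a `categorical_crossentropy` loss expects.

```python
import numpy as np
import pandas as pd

def shuffle_normalize_split(df, classes, class_column="Class", seed=0):
    """Hypothetical stand-in for process_prediction_data (assumption, not the real helper)."""
    # Shuffle the rows reproducibly.
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)

    # Inputs: every column except the class column, z-score normalized per feature.
    features = shuffled.drop(columns=[class_column]).to_numpy(dtype=float)
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant features
    inputs = (features - mean) / std

    # Targets: one-hot vectors, in the order given by `classes`.
    lookup = {name: i for i, name in enumerate(classes)}
    class_index = shuffled[class_column].map(lookup).to_numpy()
    targets = np.eye(len(classes))[class_index]
    return inputs, targets

# Toy data with invented values, only to exercise the sketch.
toy = pd.DataFrame({
    "Class": ["Mouse", "Human", "Human", "Mouse"],
    "area": [1.0, 2.0, 3.0, 4.0],
    "intensity": [10.0, 10.0, 10.0, 10.0],
})
toy_inputs, toy_targets = shuffle_normalize_split(toy, ["Mouse", "Human"])
```

Each row of `toy_targets` has a single 1 in the column of its class, and each column of `toy_inputs` has zero mean.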

CSBDeep needs 3D images with more than one element in each dimension, so the features are arranged as an image of size (2, 2, feature_size / 4). If the feature size is not a multiple of 4, it is padded with zeros. Because the reshape layer is removed for the analysis, the inputs are flattened back to a vector of the padded feature size (still a multiple of 4).

In [5]:
num_features = inputs.shape[1]
input_size = int(np.ceil(num_features/4) * 4)

inputs = resize_inputs(inputs).reshape((-1, input_size))
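
The padding and reshaping described above can be illustrated on a toy array. This is a sketch only: `pad_features` is a hypothetical function, and the real `resize_inputs` helper may be implemented differently.

```python
import numpy as np

def pad_features(features, multiple=4):
    """Zero-pad the feature axis up to the next multiple (hypothetical helper)."""
    num_features = features.shape[1]
    padded_size = int(np.ceil(num_features / multiple) * multiple)
    # Pad only on the right of the feature axis; samples are left untouched.
    return np.pad(features, ((0, 0), (0, padded_size - num_features)), mode="constant")

features = np.arange(12, dtype=float).reshape(2, 6)   # 2 samples, 6 features
padded = pad_features(features)                        # 6 features padded to 8

# Image layout used for CSBDeep: (samples, 2, 2, padded_size / 4).
as_image = padded.reshape(-1, 2, 2, padded.shape[1] // 4)

# Flattening restores the padded vectors used for the analysis.
flat_again = as_image.reshape(-1, padded.shape[1])
```

The two reshapes are lossless, so `flat_again` holds the same values as `padded`; only the zero-padded columns are extra compared with the original features.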

## Load model

[TODO] write the path of your model file

In [6]:
model = keras.models.load_model('models/classification/model.h5')

# Remove the first layer (reshape): it was created for CSBDeep and is not compatible with the analysis
model_copy = get_model(input_size, len(classes), False)

for i, layer in enumerate(model.layers[1:]):
    model_copy.layers[i].set_weights(layer.get_weights())

optimizer = keras.optimizers.Adam()
loss_function = 'categorical_crossentropy'

model_copy.compile(optimizer=optimizer, loss=loss_function, metrics=['accuracy'])

model_copy.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 128)               62080     
_________________________________________________________________
batch_normalization_v1 (Batc (None, 128)               512       
_________________________________________________________________
re_lu (ReLU)                 (None, 128)               0         
_________________________________________________________________
dropout (Dropout)            (None, 128)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 64)                8256      
_______________________________________

## Analyse feature importance

In [7]:
import eli5
from eli5.permutation_importance import get_score_importances

nb_observation = 500
inputs_obs = inputs[:nb_observation,:]
targets_obs = targets[:nb_observation,:]

def score(x, y):
    loss, accuracy = model_copy.evaluate(x, y, verbose=0)
    return accuracy

base_score, score_decreases = get_score_importances(score, inputs_obs, targets_obs, n_iter=10)
feature_importances = np.mean(score_decreases, axis=0)
feature_std = np.std(score_decreases, axis=0)

headers = dataset.columns.values

sort_index = np.argsort(feature_importances)[::-1]
print("Feature Importances:")
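for i in range(input_size):
    j = sort_index[i]
    # Skip the zero-padded columns added to reach a multiple of 4.
    if j < num_features:
        # Feature names are read from the dataset headers, offset by one column.
        print(headers[j + 1] + ":  " + str(np.round(feature_importances[j], 3)) + " +- " + str(np.round(feature_std[j], 3)))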



Feature Importances:
Mean cytoplasm variance value channel 1 up to 40 connected neighbours:  0.078 +- 0.011
Cytoplasm variance value channel 1:  0.025 +- 0.003
Mean cytoplasm sum of squares: variance up to 40 connected neighbours:  0.025 +- 0.005
Mean cytoplasm variance value channel 2 up to 40 connected neighbours:  0.023 +- 0.004
Cytoplasm mean value channel 1:  0.017 +- 0.005
Mean cytoplasm variance value channel 1 up to 20 connected neighbours:  0.017 +- 0.005
Mean cytoplasm variance value channel 3 up to 20 connected neighbours:  0.015 +- 0.005
Nucleus mean value channel 1:  0.014 +- 0.006
Mean cytoplasm difference variance up to 40 connected neighbours:  0.013 +- 0.003
Mean cytoplasm variance value channel 1 up to 5 connected neighbours:  0.013 +- 0.005
Mean cytoplasm variance value channel 1 up to 10 connected neighbours:  0.012 +- 0.005
Mean cytoplasm correlation up to 40 connected neighbours:  0.012 +- 0.004
Mean nucleus correlation up to 20 connected neighbours:  0.008 +- 0.0
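
The values above are permutation importances: each feature column is shuffled in turn and the resulting drop in accuracy is recorded. The following is a minimal NumPy sketch of that idea, not eli5's actual implementation; `permutation_importance`, `toy_score` and the toy data are hypothetical names introduced for illustration.

```python
import numpy as np

def permutation_importance(score_fn, x, y, n_iter=10, seed=0):
    """Score decrease caused by shuffling each column, repeated n_iter times."""
    rng = np.random.default_rng(seed)
    base = score_fn(x, y)
    decreases = np.zeros((n_iter, x.shape[1]))
    for it in range(n_iter):
        for col in range(x.shape[1]):
            x_perm = x.copy()
            # Shuffling one column breaks its link with the target.
            x_perm[:, col] = rng.permutation(x_perm[:, col])
            decreases[it, col] = base - score_fn(x_perm, y)
    return base, decreases

# Toy problem: the label depends only on feature 0.
toy_rng = np.random.default_rng(1)
x_toy = toy_rng.normal(size=(200, 3))
y_toy = (x_toy[:, 0] > 0).astype(int)

def toy_score(x, y):
    # Accuracy of a fixed rule that looks only at feature 0.
    return float(np.mean((x[:, 0] > 0).astype(int) == y))

base, decreases = permutation_importance(toy_score, x_toy, y_toy)
importances = decreases.mean(axis=0)   # mean decrease per feature
spread = decreases.std(axis=0)         # variation across iterations
```

In this toy case only feature 0 gets a non-zero importance, because the scoring rule ignores the other two columns. The same reading applies to the list above: a larger mean decrease means the model relies more on that feature, and the `+-` value is the standard deviation across the shuffling iterations.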