In [None]:
from camelyonpatch import CamelyOnPatch
import tensorflow as tf
import matplotlib.pyplot as plt
import os
import numpy as np


This image at full resolution is too large to fit in memory, but that doesn't mean we can't use what we've learned to start analyzing it

# Reminder: Our Dataset

Instead of looking at the whole image, we can look at one small piece, say 96x96 pixels.
* Ask: are any of the pixels in the middle 32x32 cancerous?
    * If yes, label it with a 1
    * If no, label it with a 0
    * Why only the middle 32x32 pixels?
        * The surrounding pixels give the network some context for understanding the center pixels
        * Otherwise one cancerous pixel in the top corner would label the whole image cancerous
* This turns one picture into 1000s of training examples
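
The labeling rule above can be sketched in a few lines of NumPy. This is only an illustration: it assumes a hypothetical boolean `mask` of shape 96x96 marking cancerous pixels, and `label_patch` is a made-up helper, not part of the dataset's actual tooling.

```python
import numpy as np

def label_patch(mask, center=32):
    """Return 1 if any pixel in the central `center` x `center` window is cancerous."""
    h, w = mask.shape
    top = (h - center) // 2    # first row of the central window
    left = (w - center) // 2   # first column of the central window
    return int(mask[top:top + center, left:left + center].any())

# A cancerous pixel in the corner does not label the patch...
corner = np.zeros((96, 96), dtype=bool)
corner[0, 0] = True

# ...but one in the middle does.
middle = np.zeros((96, 96), dtype=bool)
middle[48, 48] = True

print(label_patch(corner), label_patch(middle))
```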

* **Important Detail** about the data split
    * Having a lot of training examples is great
        * But what we really want to know is whether this algorithm can identify tumors it hasn't seen before
        * The dataset used above splits the data by randomly assigning each tumor to a dataset, rather than randomly assigning each image to a dataset
            * This makes sure the model generalizes to new tumors and not just those seen in the training set
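
A split by tumor rather than by image could look roughly like the sketch below. The `tumor_ids` array (one tumor ID per patch), the 80/20 fraction, and the `split_by_tumor` helper are all assumptions for illustration; this is not how the dataset authors actually built their split.

```python
import numpy as np

def split_by_tumor(tumor_ids, train_fraction=0.8, seed=0):
    """Assign whole tumors (not individual patches) to train or develop."""
    rng = np.random.default_rng(seed)
    unique_ids = np.unique(tumor_ids)
    rng.shuffle(unique_ids)
    n_train = int(round(train_fraction * len(unique_ids)))
    train_ids = set(unique_ids[:n_train].tolist())
    # True for patches whose tumor landed in the training set
    return np.array([t in train_ids for t in tumor_ids.tolist()])

# Hypothetical example: 10 patches cut from 5 tumors
tumor_ids = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])
is_train = split_by_tumor(tumor_ids)

train_tumors = set(tumor_ids[is_train].tolist())
develop_tumors = set(tumor_ids[~is_train].tolist())
print(train_tumors & develop_tumors)  # empty: no tumor appears in both sets
```

Because every patch from a given tumor goes to the same side, the develop score measures performance on tumors the model has never seen.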





## Using Our Image Dataset

We've placed a lot of small images in folders. How do we feed these to Keras?
   * We'll use a data generator

# Python Generator

Python has a neat concept called a generator that we can use in our ML models
* A function that produces a new piece of data each time it is asked for one
* Often a while loop that loops forever
* Uses the **yield** keyword
* Each new element can be grabbed with the built-in `next()` function


In [None]:
def five_random_numbers():
    while True:
        yield( np.random.uniform(size=5))
my_generator=five_random_numbers()

for i in range(10):
    print(next(my_generator))
    
    

# Give it a try
We'll use a generator to scan our input slide. Try writing one that returns batches of random images with shape batch_size x 96 x 96 x 3

`hint: np.random.normal(size=(batch_size,96,96,3))`

In [None]:
"""def my_generator("Your code from here"""                 

## Keras Image Data Generator

This is a built-in generator like the one we'll use to scan our full slide, but it can change images on the fly, so no image used in training is exactly the same twice.

```
keras.preprocessing.image.ImageDataGenerator(
    featurewise_center=False,
    samplewise_center=False, 
    featurewise_std_normalization=False,
    samplewise_std_normalization=False,
    zca_whitening=False, 
    zca_epsilon=1e-06, 
    rotation_range=0,
    width_shift_range=0.0, 
    height_shift_range=0.0, 
    brightness_range=None, 
    shear_range=0.0,
    zoom_range=0.0, 
    channel_shift_range=0.0, 
    fill_mode='nearest', 
    cval=0.0, 
    horizontal_flip=False,
    vertical_flip=False,
    rescale=None, 
    preprocessing_function=None, 
    data_format=None, 
    validation_split=0.0, 
    dtype=None)
```
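
To make "changing images on the fly" concrete, here is a rough NumPy-only sketch of a generator that randomly flips each image as it is served. It is not how Keras implements `ImageDataGenerator`; `augmenting_generator`, its arguments, and the random `fake_images` array are made up for illustration.

```python
import numpy as np

def augmenting_generator(images, batch_size, seed=0):
    """Yield endless batches, randomly flipping each image as it is served."""
    rng = np.random.default_rng(seed)
    while True:
        idx = rng.integers(0, len(images), size=batch_size)
        batch = images[idx].copy()  # copy so the originals are never modified
        for i in range(batch_size):
            if rng.random() < 0.5:
                batch[i] = batch[i][:, ::-1].copy()  # horizontal flip (reverse columns)
            if rng.random() < 0.5:
                batch[i] = batch[i][::-1, :].copy()  # vertical flip (reverse rows)
        yield batch

fake_images = np.random.uniform(size=(8, 96, 96, 3))  # stand-in for real patches
gen = augmenting_generator(fake_images, batch_size=4)
batch = next(gen)
print(batch.shape)  # (4, 96, 96, 3)
```

The same stored image can come out in a different orientation each time it is drawn, which is the basic idea behind the flip options listed above.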

## Input

There are several ways of using this, but we're going to use raw (96x96x3) images stored in folders.
Classes are identified by sub-folders

In [None]:
data_gen = tf.keras.preprocessing.image.ImageDataGenerator(
            rescale=1./255.,         # scale pixel values to [0, 1]
            width_shift_range=0,     # horizontal shift (0 = disabled)
            height_shift_range=0,    # vertical shift (0 = disabled)
            horizontal_flip=True,    # randomly flip images left-right
            vertical_flip=True,      # randomly flip images up-down
            shear_range=1,           # random shear, in degrees
            zoom_range=.05,          # random zoom of up to 5%
            rotation_range=15        # random rotation of up to 15 degrees
            )
train_generator=data_gen.flow_from_directory('/projects/bgmp/shared/2019_ML_workshop/datasets/pcamv1/images/train',
                                            target_size=(96,96), 
                                            color_mode='rgb', 
                                            classes=['normal','tumor'],
                                            class_mode='binary',
                                            batch_size=32,
                                            shuffle=True)

# No augmentation for the develop and test sets, only rescaling
develop_gen = tf.keras.preprocessing.image.ImageDataGenerator(
            rescale=1./255.,         # scale pixel values to [0, 1]
            width_shift_range=0,     # horizontal shift disabled
            height_shift_range=0,    # vertical shift disabled
            horizontal_flip=False,   # flips disabled
            vertical_flip=False,
            shear_range=0,           # shear disabled
            zoom_range=.00,          # zoom disabled
            rotation_range=0         # rotation disabled
            )

develop_generator=develop_gen.flow_from_directory('/projects/bgmp/shared/2019_ML_workshop/datasets/pcamv1/images/develop',
                                            target_size=(96,96), 
                                            color_mode='rgb', 
                                            classes=['normal','tumor'],
                                            class_mode='binary',
                                            batch_size=32,
                                            shuffle=False)

test_generator=develop_gen.flow_from_directory('/projects/bgmp/shared/2019_ML_workshop/datasets/pcamv1/images/test',
                                            target_size=(96,96), 
                                            color_mode='rgb', 
                                            classes=['normal','tumor'],
                                            class_mode='binary',
                                            batch_size=32,
                                            shuffle=False)
     

In [None]:
cpd_image,cpd_labels=next(train_generator)

for image in cpd_image:
    plt.imshow(image)
    plt.show()



# This is just like the examples from before, but with slightly larger images

## Exercise

Try writing your own model and training it
* Start with an input layer
* Add convolutional layers
    * Remember to add an activation
* Downsample with either pooling or striding
* Make a Keras model, call it **model**
    * Compile the model, including accuracy as a metric
 




In [None]:
"Create your model here"

In [None]:
"Add model training code here"

# Answer these Questions

* How many parameters were in your model?
* How long did an epoch take?
* What accuracy were you able to reach?





# My Simple CNN

* If you want to use your own model written above, go for it! (just skip the cell below)
* Otherwise use the network below

In [None]:
ans=input("Are you sure you want to use my model? Y/N")
if ans.lower() == 'y':
    cnn_input=tf.keras.layers.Input( shape=(96,96,3) ) # Shape here does not include the batch size 
    cnn_layer1=tf.keras.layers.Convolution2D(64, (2,2),strides=2,padding='same')(cnn_input) 
    cnn_activation=tf.keras.layers.LeakyReLU()(cnn_layer1) 
    cnn_activation=tf.keras.layers.Dropout(0.3)(cnn_activation) 

    cnn_layer2=tf.keras.layers.Convolution2D(126, (2,2),strides=2,padding='same')(cnn_activation) 
    cnn_activation=tf.keras.layers.LeakyReLU()(cnn_layer2) 
    cnn_activation=tf.keras.layers.Dropout(0.3)(cnn_activation) 

    cnn_layer3=tf.keras.layers.Convolution2D(256, (2,2),strides=2,padding='same')(cnn_activation) 
    cnn_activation=tf.keras.layers.LeakyReLU()(cnn_layer3) 
    cnn_activation=tf.keras.layers.Dropout(0.3)(cnn_activation) 


    flat=tf.keras.layers.Flatten()(cnn_activation) 

    drop=tf.keras.layers.Dropout(0.3)(flat) 

    dense_layer=tf.keras.layers.Dense(1)(drop) 
    output=tf.keras.layers.Activation('sigmoid')(dense_layer)

    model=tf.keras.models.Model([cnn_input],[output])
    model.summary()


    model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])

    model.load_weights('/projects/bgmp/shared/2019_ML_workshop/datasets/pcamv1/cnn_1.h5')

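    # Training is disabled; the pre-trained weights loaded above are used instead.
    # Note: `cpd` is not created in this notebook, so it must be defined before enabling this block.
    if False:
        history=model.fit(cpd.X_train, cpd.Y_train, 
              batch_size=100, epochs=10, verbose=1,
             validation_data=(cpd.X_develop,cpd.Y_develop),
                      shuffle='batch'
             )
        model.save_weights('/projects/bgmp/shared/2019_ML_workshop/datasets/pcamv1/cnn_1.h5')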



# Make a plot of our predictions

Let's check to see how well our model is working
 * We'll look at the same histogram we did in previous lectures



In [None]:
y_pred=np.squeeze(model.predict_generator(develop_generator))
y_develop=develop_generator.classes[0:len(y_pred)]
print("done")


In [None]:
develop_generator.classes.shape

In [None]:
f=plt.figure(figsize=(10,10))
tumor=[pred for pred,truth in zip(y_pred,y_develop) if truth  ]
fine=[pred for pred,truth in zip(y_pred,y_develop) if not truth  ]

plt.hist(tumor,range=(0,1),bins=200,density=True,histtype='step',label='Tumor')  
plt.hist(fine,range=(0,1),bins=200,density=True,histtype='step',label='Normal')
plt.xlabel('Neural Network Response')
plt.ylabel('Normalized Number of Images')  # density=True normalizes each histogram to unit area
plt.legend()
plt.show()


# Hopefully there is a clear separation between tumor and normal

## Is it good enough?

In order to answer this question it's important to decide how we want to use it

**Goal**: Assist in diagnosis by highlighting potential tumors.

Let's see if it's good enough by giving it a try









# Using our model

We will use OpenSlide to grab 96x96x3 pixel chunks and feed them to our model

Our training data was split into 96x96 pixel images at level 2
 * Let's check to make sure that's not crazy

In [None]:
import openslide
tiff_file='/projects/bgmp/shared/2019_ML_workshop/datasets/pcamv1/tumor_001.tif'
slide_image=openslide.OpenSlide(tiff_file)
# This is the resolution at level 2
initial_dimension=slide_image.level_dimensions[2]

# This is the resolution after we've turned each 32x32 pixel box into a prediction
final_dimension= np.array(initial_dimension)/32.

print("prescan dimension",initial_dimension,"final scan",final_dimension)
n_predictions=np.prod(final_dimension)  # np.product was removed in NumPy 2.0
print(n_predictions/1e6," Million Predictions Required")


# This is a lot of predictions

It's possible to do the whole slide, but it takes a while (30 minutes or so...). Let's try a smaller section

In [None]:


# Size of the region to scan, in level-2 pixels
width=6400
height=6400

# Top-left corner of the region, in level-0 coordinates
x_start=60000
y_start=120000




print("Predictions Required",(width//32)*(height//32))


# Convert these to level 2 for later use
level0_dimension=slide_image.level_dimensions[0]
level2_dimension=slide_image.level_dimensions[2]

sfactor_x=level0_dimension[0]/level2_dimension[0]
sfactor_y=level0_dimension[1]/level2_dimension[1]

x_stop=x_start+width*4
y_stop=y_start+height*4


print('Scaling factor to level 2',sfactor_x,sfactor_y)

# It's just a factor of 4
f=plt.figure(figsize=(50,50))
# read_region takes the level-0 location, the level, and the (width, height) size at that level
plt.imshow(slide_image.read_region( (x_start,y_start),2,(width,height)))
plt.show()


### Not bad - let's scan this image
How can we feed this to our model?


# A Generator to load our data

In [None]:

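def scan_image(image_file,batch_size,x_range,y_range):
    """Yield batches of 96x96x3 patches read at level 2, stepping 32 pixels at a time.

    x_range and y_range are [start, stop] in level-0 coordinates.
    """
    slide_image=openslide.OpenSlide(image_file)
    res_x,res_y=slide_image.level_dimensions[2]
    
    coord_x,coord_y=slide_image.level_dimensions[0] 
    # This is the factor we need to scale pixels at level 2 to coordinates at level 0
    sfactor_x=coord_x/res_x  
    sfactor_y=coord_y/res_y 
    
    batch=[]
    # //4 converts level-0 coordinates to level 2 (assumes a downsample factor of 4)
    # x is the outer loop, so patches are produced column by column
    for x in range(x_range[0]//4,x_range[1]//4,32):
        for y in range(y_range[0]//4,y_range[1]//4,32):
            # read_region expects the location in level-0 coordinates
            image=np.asarray(slide_image.read_region( (int(x*sfactor_x),int(y*sfactor_y)),2,(96,96)  ))/255.
            batch.append(np.expand_dims(image[:,:,0:3],0))  # keep RGB, drop the alpha channel
            if len(batch)==batch_size:
                yield(np.concatenate(batch,0))
                batch=[]
    if batch:
        # Yield any leftover patches as a final, smaller batch
        yield(np.concatenate(batch,0))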

In [None]:
batch_size=10
x_range=[x_start,x_stop]
y_range=[y_start,y_stop]

n_predictions=(width//32)*(height//32)
steps=n_predictions//batch_size

generator=scan_image('/projects/bgmp/shared/2019_ML_workshop/datasets/pcamv1/tumor_001.tif',batch_size,x_range,y_range)

output_scan=model.predict_generator(generator,steps)
output_map=np.zeros((height//32,width//32))
print(output_scan)
# The generator loops over x on the outside and y on the inside,
# so consecutive predictions walk down one column of the map
for index,v in enumerate(np.squeeze(output_scan)):
    y=index%(height//32)
    x=index//(height//32)
    output_map[y,x]=v
    
plt.imshow(output_map)
plt.show()

## These are probabilities
Let's check which are greater than 50%



In [None]:
f=plt.figure(figsize=(10,10))
plt.imshow(output_map > 0.5)
plt.show()

## Practice Manipulating Data

1. Use OpenSlide to find the coordinates of a different region of cells
2. Use the coordinates, the generator code above, and our model to predict on the new region
3. What diagnosis would you make (or would you?)

# Hmmm
The full slide looks like this (as you can guess from the file name, it has a tumor)
<img src='../assets/full_slide_scan_1.png'>

But wait, a slide without a tumor looks like this

<img src='../assets/full_slide_scan_normal.png'>


The results above don't look great: lots of likely tumors, even in places without cells!

* Our training data didn't include things like slide edges and slide areas without tissue
    * The results on these can be fairly random
* A bigger issue is a problem with statistics

* Remember we are approximating    
    $P(y|x)$

* Which, using Bayes’ theorem, is

    $P(y|x)=\frac{P(x|y)P(y)}{P(x)}$

* What is $P(y)$ in our dataset?
    * This is the probability that an image contains at least one tumor pixel in its central 32x32 region
    * Our dataset was artificially built so $P(y=tumor)=1/2$ and $P(y=healthy)=1/2$
        * This makes training easier
        * This is called class balancing
    * However, looking at cells in the wild, it is vastly more likely that they are non-cancerous
* How do we use these predictions?
    * We know $P(y=tumor)$ is quite a bit smaller in the real world
    * We know $P(y|x)$ is proportional to $P(y)$
        * The real-world $P(y=tumor|x)$ is smaller than the one measured in our experiment
    * An easy way to reduce false positives is to increase the threshold we used from 50% to something larger
        * How high is a judgment call
        * Too high and we could miss tumors
        * Too low and there are fake tumors everywhere
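
The effect of the prior can be made concrete with a small calculation. If the model's output $p$ is a well-calibrated $P(y=tumor|x)$ under the balanced 50/50 training prior (an assumption that may not hold in practice), Bayes' theorem gives the probability under a different prior $\pi$ as $p\pi / (p\pi + (1-p)(1-\pi))$. The sketch below applies that formula; `adjust_for_prior` is a made-up helper and the 1% prior is a hypothetical value, not a measured one.

```python
import numpy as np

def adjust_for_prior(p_balanced, real_prior):
    """Re-weight a probability from a 50/50-trained model to a different class prior."""
    p_balanced = np.asarray(p_balanced, dtype=float)
    numerator = p_balanced * real_prior
    denominator = numerator + (1.0 - p_balanced) * (1.0 - real_prior)
    return numerator / denominator

# Hypothetical prior: assume 1% of patches in the wild contain tumor
print(adjust_for_prior([0.5, 0.9, 0.99], real_prior=0.01))
```

Under that assumed prior, a balanced-model score of 0.99 corresponds to roughly even odds, which is one way to see why thresholds well above 50% are reasonable here.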
    

    



In [None]:
for threshold in [0.5,0.7,0.8,0.9,0.95,0.99]:    
    print('Threshold',threshold)
    f=plt.figure(figsize=(5,12))
    plt.imshow(output_map > threshold)
    plt.show()