<a href="https://colab.research.google.com/github/Peter-Herrmann/cv-vgg/blob/main/VGG_Experiment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Classification with BHI Dataset and VGG-style network
In this experiment you will set up a VGG-style network to classify histopathologic scans of breast tissue from the [BHI](https://www.kaggle.com/paultimothymooney/breast-histopathology-images) dataset.

In [None]:
import tensorflow.keras as keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Conv2D, Dense, Flatten, MaxPooling2D
from tensorflow.keras.optimizers import SGD, Adam
from matplotlib import pyplot as plt
import numpy as np

Here we use the Keras utility function `get_file` to download the dataset. I have already organized the data into HDF5 files, which are a good format for storing array data.

In [None]:
from tensorflow.keras.utils import get_file
x_train_path = get_file('idc_train.h5','https://storage.googleapis.com/data401-datasets/idc_train.h5')
x_test_path = get_file('idc_test.h5','https://storage.googleapis.com/data401-datasets/idc_test.h5')

We read the data from the HDF5 files into NumPy arrays.

The slices `1:49` crop the images so they are all 48x48.

In [None]:
import h5py as h5
with h5.File(x_train_path,'r') as f:
  x_train = f['X'][:,1:49,1:49]
  y_train = f['y'][:]
with h5.File(x_test_path,'r') as f:
  x_test = f['X'][:,1:49,1:49]
  y_test = f['y'][:]
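
To make the crop concrete, here is a minimal sketch on a stand-in array. The 50x50 input size and the 3 colour channels are assumptions for illustration only; the real shapes are whatever the HDF5 files contain.

```python
import numpy as np

# Hypothetical stand-in for f['X']: 6 patches, assumed to be 50x50 with 3 channels.
patches = np.zeros((6, 50, 50, 3), dtype=np.uint8)

# Rows and columns 1..48 are kept; the slice end (49) is exclusive.
cropped = patches[:, 1:49, 1:49]
print(cropped.shape)  # (6, 48, 48, 3)
```

The trailing channel axis is untouched because only the first three axes are sliced.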

In [None]:
x_train.shape,y_train.shape,x_test.shape,y_test.shape

Show a few images from the dataset, each titled with its label.

In [None]:
for i in range(5):
  plt.imshow(np.squeeze(x_train[i]))
  plt.title(y_train[i])
  plt.show()

## Data preprocessing

1. Convert the train and test images to floating point and divide by 255.
2. Compute the average value of the entire training image set.
3. Subtract the average value from the training and testing images.
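
The three steps above can be sketched as follows. This is only an illustration on random stand-in arrays: the names `x_train_demo` and `x_test_demo`, their shapes, and the `uint8` input type are assumptions, not the real dataset.

```python
import numpy as np

# Hypothetical stand-ins for x_train / x_test (random uint8 patches).
rng = np.random.default_rng(0)
x_train_demo = rng.integers(0, 256, size=(8, 48, 48, 3), dtype=np.uint8)
x_test_demo = rng.integers(0, 256, size=(4, 48, 48, 3), dtype=np.uint8)

# 1. Convert to floating point and scale to the range [0, 1].
x_train_f = x_train_demo.astype(np.float32) / 255.0
x_test_f = x_test_demo.astype(np.float32) / 255.0

# 2. One scalar average over the entire training set.
train_mean = x_train_f.mean()

# 3. Subtract the training average from both sets.
x_train_c = x_train_f - train_mean
x_test_c = x_test_f - train_mean
```

Note that the test images are centered with the training average rather than their own, so both sets go through exactly the same transformation.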


Build a VGG-style binary classifier model.  For example, your network could contain the following:
1. 32 convolutional filters of size 3x3, zero padding, ReLU activation
2. 2x2 max pooling with stride 2
3. 64 filters
4. max pool
5. 128 filters
6. max pool
7. 256 filters
8. max pool
9. flatten
10. Fully-connected layer with 128 outputs
11. Final binary classification layer

In [None]:
model = Sequential([
    Input(x_train.shape[1:]),
    #...
])
model.summary()
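
One possible way to fill in the layer list is sketched below. Several details are assumptions rather than requirements: the helper name `build_vgg_style` is hypothetical, the later convolutions are assumed to reuse the 3x3 size, zero padding and ReLU of the first one, the 128-unit layer is assumed to use ReLU, and the final layer is a 2-unit softmax so that it pairs with a sparse categorical cross-entropy loss. The 48x48x3 input shape is also assumed.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Conv2D, Dense, Flatten, MaxPooling2D

def build_vgg_style(input_shape, num_classes=2):
    """Hypothetical helper: four conv/pool stages, then two dense layers."""
    return Sequential([
        Input(input_shape),
        Conv2D(32, 3, padding='same', activation='relu'),
        MaxPooling2D(pool_size=2, strides=2),
        Conv2D(64, 3, padding='same', activation='relu'),
        MaxPooling2D(pool_size=2, strides=2),
        Conv2D(128, 3, padding='same', activation='relu'),
        MaxPooling2D(pool_size=2, strides=2),
        Conv2D(256, 3, padding='same', activation='relu'),
        MaxPooling2D(pool_size=2, strides=2),
        Flatten(),
        Dense(128, activation='relu'),
        # One output per class, as integer labels with a sparse loss expect.
        Dense(num_classes, activation='softmax'),
    ])

# Assumed input shape: 48x48 patches with 3 channels.
demo_model = build_vgg_style((48, 48, 3))
demo_model.summary()
```

With a 48x48 input, each pooling stage halves the spatial size (48, 24, 12, 6, 3), so the flatten layer sees 3 x 3 x 256 = 2304 features.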

Set up the model to optimize the sparse categorical cross-entropy loss using the Adam optimizer with a learning rate of $0.0003$. This loss takes integer class labels, so the final layer needs one output per class. Calculate accuracy metrics during training.

Now `fit` the model to the data using a batch size of 32 and a 10% validation split over 10 epochs. Store the returned object in `history`, since the plotting cells below read from it.
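
The compile and fit calls can be sketched as below. To keep the example quick and self-contained it uses a deliberately tiny model, random stand-in data, and only 2 epochs; `tiny_model`, `x_demo`, `y_demo` and `demo_history` are hypothetical names, and the data shapes are assumed. The optimizer, loss, metric, batch size and validation split follow the text above.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Conv2D, Dense, Flatten, MaxPooling2D
from tensorflow.keras.optimizers import Adam

# Random stand-in data (assumption): 64 centered patches with integer labels 0/1.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(64, 48, 48, 3)).astype(np.float32)
y_demo = rng.integers(0, 2, size=64)

# Deliberately small model so the sketch runs in seconds.
tiny_model = Sequential([
    Input(x_demo.shape[1:]),
    Conv2D(8, 3, padding='same', activation='relu'),
    MaxPooling2D(pool_size=2, strides=2),
    Flatten(),
    Dense(2, activation='softmax'),
])

tiny_model.compile(
    optimizer=Adam(learning_rate=0.0003),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy'],
)

# validation_split=0.1 holds out the last 10% of the samples for validation.
demo_history = tiny_model.fit(
    x_demo, y_demo,
    batch_size=32,
    epochs=2,  # the exercise asks for 10; 2 keeps this demo short
    validation_split=0.1,
    verbose=0,
)
print(sorted(demo_history.history.keys()))
```

The returned object's `history` dictionary holds one value per epoch for each of `loss`, `accuracy`, `val_loss` and `val_accuracy`, which are the keys the plotting cells use.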

Plot loss and accuracy over the training run.

In [None]:
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.legend(['Training Loss','Validation Loss'])
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.show()

In [None]:
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.legend(['Training Accuracy','Validation Accuracy'])
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.show()

Compute the accuracy of the model on the training and test sets.
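
For a compiled Keras model, `model.evaluate(x, y)` returns the loss followed by the compiled metrics. To show what the accuracy number means, here is a small sketch that computes it by hand from per-class probabilities such as those returned by `model.predict`. The helper name `accuracy_from_probs` and the example values are made up for illustration.

```python
import numpy as np

def accuracy_from_probs(probs, labels):
    """Fraction of samples whose highest-probability class equals the integer label."""
    predicted = np.argmax(probs, axis=1)
    return float(np.mean(predicted == labels))

# Made-up probabilities for 4 samples and 2 classes.
probs_demo = np.array([[0.9, 0.1],
                       [0.2, 0.8],
                       [0.6, 0.4],
                       [0.3, 0.7]])
labels_demo = np.array([0, 1, 1, 1])

print(accuracy_from_probs(probs_demo, labels_demo))  # 0.75
```

Here the third sample is predicted as class 0 but labeled 1, so three of the four predictions are correct.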

Try a different setting (for example, the learning rate, batch size, or number of filters) to see whether you can improve the test set accuracy. Write about the results here.