# Deep Learning with Python

# Example 5.3 - VGG16

## Feature Extraction
- Each ConvNet-based model will consist of a series of convolution and pooling layers followed by a densely connected classifier.
- The conv and pooling layers form the model's convolutional base, which acts as the preprocessor for the densely connected classifier.
- It is possible to use the convolutional base of a pretrained model as the first part of a new model, and to use the outputs produced by this pretrained convolutional base to act as the input of a new densely connected classifier that is being trained from scratch.
- This is feature extraction: the process of using a pre-trained convolutional base to extract useful features from new samples which can then be used to train a new classifier from scratch.
- This works because, if the convolutional base was trained on a sufficiently large dataset, the spatial hierarchy of features it has learned can act as a generic model of the visual world.

## Why Reuse Conv Base and Not Classifier?
- Features learnt by the convolutional base are more likely to be generic and therefore reusable. 
- Feature maps of convnets are presence maps of generic concepts over a picture - these presence maps are likely to be useful regardless of what the picture is.
- The classifier's representations will be specific to the set of classes on which the model was trained.
- Furthermore, information about an object's position or location in an image is lost in a densely connected classifier. 
- The level of generality depends on the depth of the layer in the model - the deeper the layer, the more specific the feature maps. So if we're using a convnet to predict a completely different set of classes than the ones in the data it was trained on, it is better to use only the first few layers of the model rather than the entire convolutional base.
- These layers will have learnt generic features that will still be applicable to the new problem.

In [1]:
# Importing the VGG16 model
from tensorflow.keras.applications import VGG16

In [None]:
# Instantiate the convolutional base using ImageNet weights
# Also specify the input shape for this base 
conv_base = VGG16(weights='imagenet', 
                 include_top=False,
                 input_shape=(150, 150, 3))

Downloading data from https://github.com/fchollet/deep-learning-models/releases/download/v0.1/vgg16_weights_tf_dim_ordering_tf_kernels_notop.h5
 1433600/58889256 [..............................] - ETA: 4:42

## Method 1: Feature Extraction w/o Data Augmentation
- Create a subset of the original training data that will be used for the model.
- Extract numpy arrays for each image and its labels in the dataset.
- Extract features from these images by calling the `predict` method of the `conv_base` model.
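- Record the extracted features in a `numpy` array (in this notebook the arrays returned by `extract_features` are kept in memory rather than written to disk).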

In [2]:
import os
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

In [4]:
# Path for the folder with our train/test/validation data from previous example
base_dir = '/Users/saads/OneDrive/Desktop/DL-Python/chapter-5/cats_and_dogs_small'

In [5]:
# Paths for the training, validation, and test sets
train_dir = os.path.join(base_dir, 'train')
validation_dir = os.path.join(base_dir, 'validation')
test_dir = os.path.join(base_dir, 'test')

In [6]:
# Setting up a data generator to rescale all pixels in all images
datagen = ImageDataGenerator(rescale=1./255)

# Data generator will iterate over directory and output batches of 20 samples
batch_size = 20

In [8]:
def extract_features(directory, sample_count):
    # For 150 x 150 inputs, the conv base outputs a 4 x 4 feature map with 512 channels per sample
    features = np.zeros(shape=(sample_count, 4, 4, 512))
    
    # One label for each sample
    labels = np.zeros(shape=(sample_count))
    
    # Initializing the generator to output batches of images from the specified directory
    generator = datagen.flow_from_directory(
        directory,                    # path to the directory that the gen will output images from
        target_size=(150, 150),       # size of each image will be 150 x 150px
        batch_size=batch_size,        # Images will be output in batches - 20 samples per batch
        class_mode='binary',          # Labels will belong to one of two classes for binary crossentropy
    )
    
    # Counter variable that will be used to keep track of batches output by data generator
    i = 0
    
    # For each batch of inputs and labels output by the generator
    for inputs_batch, labels_batch in generator:
        # Pass the inputs through the convolution base and record its output
        features_batch = conv_base.predict(inputs_batch)
        
        # Write the output features into the preallocated array
        features[i * batch_size : (i + 1) * batch_size] = features_batch
        
        # Do the same for labels
        labels[i * batch_size : (i + 1) * batch_size] = labels_batch
        
        # Increment batch counter
        i += 1
        
        # The generator loops over the directory indefinitely, so stop
        # as soon as the number of samples processed so far
        # reaches the total number of samples requested
        if i * batch_size >= sample_count:
            break
    
    # Return the arrays of features and labels extracted from the conv base
    return features, labels
    
    

In [None]:
# Get the features and labels output by the conv base for training data
train_features, train_labels = extract_features(train_dir, 2000)

# Get the same for the validation data
validation_features, validation_labels = extract_features(validation_dir, 1000)

# Same for test 
test_features, test_labels = extract_features(test_dir, 1000)

### Reshaping For Input to Classifier
The output of the convolutional base is a batch of three-dimensional tensors of shape `(4, 4, 512)`, i.e. 4 x 4 feature maps containing the activations of 512 different filters.
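
The `(4, 4, 512)` shape can be checked by hand. The sketch below assumes the standard VGG16 layout: five blocks of 3 x 3 convolutions with 'same' padding (which preserve height and width), each followed by a 2 x 2 max-pooling layer with stride 2 that halves the spatial size, rounding down. The helper name `vgg16_feature_map_size` is made up for this illustration.

```python
def vgg16_feature_map_size(input_size, num_pool_layers=5):
    """Spatial size after each of VGG16's pooling layers (each halves, rounding down)."""
    size = input_size
    sizes = [size]
    for _ in range(num_pool_layers):
        size = size // 2   # 2 x 2 max pooling with stride 2, no padding
        sizes.append(size)
    return sizes

sizes = vgg16_feature_map_size(150)
print(sizes)                         # [150, 75, 37, 18, 9, 4]
print(sizes[-1] * sizes[-1] * 512)   # 8192 values per sample
```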

Before this data can be fed to a densely connected classifier, it must be reshaped into a 2D tensor of shape `(samples, 8192)` by flattening each `(4, 4, 512)` output into a vector of 4 * 4 * 512 = 8192 values.
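
A minimal sketch of that reshape with NumPy. The helper `flatten_features` and the zero-filled stand-in array are assumptions for illustration; the stand-in only mimics the layout of the arrays returned by `extract_features`.

```python
import numpy as np

def flatten_features(features):
    """Reshape (samples, 4, 4, 512) feature maps into (samples, 8192) vectors."""
    sample_count = features.shape[0]
    return np.reshape(features, (sample_count, 4 * 4 * 512))

# Stand-in array with the same layout as the output of extract_features
dummy_features = np.zeros(shape=(10, 4, 4, 512))
flat = flatten_features(dummy_features)
print(flat.shape)  # (10, 8192)
```

In this notebook the same reshape would be applied to `train_features`, `validation_features`, and `test_features` before they are passed to `model.fit`.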

### Densely Connected Classifier

In [11]:
from tensorflow.keras import models, layers, optimizers

In [12]:
model = models.Sequential()

# input_dim is the length of each input vector (4 * 4 * 512 = 8192); the sample axis is not specified
model.add(layers.Dense(256, activation='relu', input_dim=4 * 4 * 512))

# During training, half of the outputs of this layer will randomly be set to 0
# This regularizes the classifier so it relies less on random patterns in the activations
model.add(layers.Dropout(0.5))

# Output layer will again have a single unit that will predict probability
# That the output belongs to one of two classes
model.add(layers.Dense(1, activation='sigmoid'))



In [13]:
# Compile the model
model.compile(optimizer=optimizers.RMSprop(lr=2e-5), 
             loss='binary_crossentropy', 
             metrics=['acc'])

In [None]:
# Train the model 
history = model.fit(train_features,
                   train_labels,
                   epochs=30, 
                   batch_size=20,
                   validation_data=(validation_features, validation_labels))

Training this model is very fast because we aren't performing gradient descent on the convolutional base: its outputs were computed once, and its weights do not change. We are only optimizing the weights of the densely connected classifier.

### Visualizing Loss and Accuracy

In [14]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# Extracting data from History object
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(val_loss) + 1)

In [None]:
# Plotting the accuracy
plt.plot(epochs, acc, 'bo', label='Training Accuracy')
plt.plot(epochs, val_acc, 'b-', label='Validation Accuracy')
plt.legend(); plt.grid(True); 
plt.title('Training and Validation Accuracy - VGG16, No Augmentation');
plt.show()

In [None]:
# Plotting the loss
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training Loss')
plt.plot(epochs, val_loss, 'b-', label='Validation Loss')
plt.title('Training and Validation Loss - VGG16, No Augmentation')
plt.legend(); plt.grid(True); 
plt.show()

### Book's Plots
- The plots show that the validation accuracy has increased from ~70% for the baseline model to ~90% for the model with VGG16 base, showing that using the model's convolutional base has indeed improved the model's accuracy.
- This suggests that VGG16 was trained on a large enough dataset for its convolutional base to learn spatial representations (stored in its weights) that are generic enough to yield meaningful features for our specific problem.
- However, the plots also show that we are overfitting almost from the beginning:
    - The training loss decreases rapidly from the beginning, while the validation loss reaches its minimum after about 3 epochs.
    - This is why the validation accuracy peaks at around 90% within the first few epochs and does not exceed this in later epochs.
- This is because we are using very few samples for training and are not using data augmentation.
- We cannot use data augmentation with this precomputed-features approach because the features are extracted only once. Augmentation would require generating augmented images for each epoch, feeding them into the convnet, computing the convnet's output, and then feeding the resulting features into the classifier. This would be a much slower and more memory-intensive process.
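
One way to read the overfitting point off a training history is to find the epoch at which the validation loss is lowest. This is a minimal sketch: the helper `best_epoch` is hypothetical, and the loss values are invented purely for illustration, not results from this notebook. In practice the list would come from `history.history['val_loss']`.

```python
def best_epoch(val_loss):
    """Return the 1-indexed epoch with the lowest validation loss."""
    lowest = min(val_loss)
    return val_loss.index(lowest) + 1

# Invented values for illustration only
example_val_loss = [0.45, 0.31, 0.26, 0.27, 0.29, 0.33]
print(best_epoch(example_val_loss))  # 3
```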

## Method 2: Feature Extraction with Data Augmentation

We add a densely connected classifier on top of the VGG16 convolutional base and train the combined model end to end on augmented images. Every batch passed to the model is a freshly augmented variant of the data, so over 30 epochs the model sees 30 different variants of each image.

The convnet's weights are frozen, so they aren't optimised during gradient descent.
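
To make the idea of per-epoch variants concrete, here is a minimal NumPy sketch of a single random transformation. It is a stand-in for illustration, not how `ImageDataGenerator` is implemented: the helper `random_horizontal_flip` and the tiny 2 x 3 toy image are assumptions.

```python
import numpy as np

def random_horizontal_flip(image, rng):
    """Return the (height, width, channels) image, mirrored left-right about half of the time."""
    if rng.random() < 0.5:
        return image[:, ::-1, :]
    return image

rng = np.random.default_rng(seed=0)
image = np.arange(6).reshape(2, 3, 1)   # toy image: 2 x 3 pixels, 1 channel

# One variant per epoch, as in a 30-epoch training run
variants = [random_horizontal_flip(image, rng) for _ in range(30)]
```

Each call may return a different variant of the same image, which is why no two epochs see exactly the same training data.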

In [15]:
from tensorflow.keras import layers, models

In [None]:
# Instantiate a model
model = models.Sequential()

# Add the convolutional base to the model
model.add(conv_base)

# Flatten the model's output for input into the densely connected classifier
model.add(layers.Flatten())

# Create a densely connected classifier
model.add(layers.Dense(256, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

In [None]:
# Overview of the model's layers, output shapes, and parameters
model.summary()

The convolutional base has 14.7 million parameters, and the densely connected classifier we have added on top has an additional 2M parameters.
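
These totals can be reproduced by hand. The sketch below assumes the standard VGG16 configuration of 13 convolution layers with 3 x 3 kernels; the helper names `conv_params` and `dense_params` are made up for this illustration.

```python
def conv_params(in_channels, out_channels, kernel=3):
    """Weights plus biases of one 2D convolution layer."""
    return kernel * kernel * in_channels * out_channels + out_channels

def dense_params(inputs, units):
    """Weights plus biases of one densely connected layer."""
    return inputs * units + units

# (in_channels, out_channels) for the 13 conv layers of VGG16, block by block
vgg16_convs = [
    (3, 64), (64, 64),
    (64, 128), (128, 128),
    (128, 256), (256, 256), (256, 256),
    (256, 512), (512, 512), (512, 512),
    (512, 512), (512, 512), (512, 512),
]

base_total = sum(conv_params(i, o) for i, o in vgg16_convs)
classifier_total = dense_params(4 * 4 * 512, 256) + dense_params(256, 1)

print(base_total)                     # 14714688
print(classifier_total)               # 2097665
print(base_total + classifier_total)  # 16812353
```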

It would be prohibitively expensive to perform gradient descent on all of the roughly 16.8M parameters. It would also defeat the purpose of this exercise: the whole point is to reuse the features learned by the convolutional base so that we can achieve high classification accuracy without the computational cost of optimising the weights of the convolutional base.

To do this, we `freeze` the convolutional base's parameters. This means they will not be optimised during gradient descent.

In [None]:
print('This is the number of trainable weights '
      'before freezing the conv base:', len(model.trainable_weights))

# Freeze the convolutional base so its weights are not updated during training
conv_base.trainable = False 

print('This is the number of trainable weights '
      'after freezing the conv base:', len(model.trainable_weights))

With the new setup, only the weights from the two densely connected layers that we added will be trained. That is a total of 4 weight tensors (Two per layer: the main weight matrix and the bias vector).
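
The counts before and after freezing can be worked out from the layer counts. A small sketch of that arithmetic, assuming VGG16's 13 convolution layers each contribute a kernel and a bias (the constant names are made up for this illustration):

```python
CONV_LAYERS_IN_VGG16 = 13   # pooling layers have no weights
DENSE_LAYERS_ADDED = 2
TENSORS_PER_LAYER = 2       # main weight matrix (kernel) and bias vector

before_freezing = (CONV_LAYERS_IN_VGG16 + DENSE_LAYERS_ADDED) * TENSORS_PER_LAYER
after_freezing = DENSE_LAYERS_ADDED * TENSORS_PER_LAYER

print(before_freezing)  # 30
print(after_freezing)   # 4
```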

In [16]:
# Must compile the model after weight trainability has been modified

### Preparing the Data

In [17]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

In [19]:
from tensorflow.keras import optimizers

In [20]:
# Setting up a training data generator
# Now training data can be augmented
train_datagen = ImageDataGenerator(
    rescale=1./255,  
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest'
)

In [21]:
# Test data can only be scaled, not augmented
test_datagen = ImageDataGenerator(rescale=1./255)

In [22]:
# Specifying the flow of data for the training data generator
train_generator = train_datagen.flow_from_directory(
    train_dir, 
    target_size=(150, 150), 
    batch_size=20, 
    class_mode='binary',
)

Found 2000 images belonging to 2 classes.


In [23]:
# Same for the validation generator
validation_generator = test_datagen.flow_from_directory(
    validation_dir, 
    target_size=(150, 150), 
    batch_size=20, 
    class_mode='binary',
)

Found 1000 images belonging to 2 classes.


In [25]:
# Compile the model - this is important because we have modified weight trainability
model.compile(loss='binary_crossentropy', 
             optimizer=optimizers.RMSprop(lr=2e-5), 
             metrics=['acc'])

### Training the Model with Data Augmentation

In [None]:
history = model.fit_generator(
    train_generator, 
    steps_per_epoch=100, 
    epochs=30, 
    validation_data=validation_generator,
    validation_steps=50)
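
The values `steps_per_epoch=100` and `validation_steps=50` follow from the dataset sizes reported by the generators and the batch size of 20. A small sketch of that arithmetic; the helper `steps_for` is hypothetical and rounds up so that a final partial batch would still be counted.

```python
import math

def steps_for(sample_count, batch_size):
    """Number of batches needed to see every sample once."""
    return math.ceil(sample_count / batch_size)

print(steps_for(2000, 20))  # 100 training steps per epoch
print(steps_for(1000, 20))  # 50 validation steps
```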

### Plotting the Model's Performance

In [26]:
# Extracting data from History object
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(val_loss) + 1)


In [None]:
# Plotting the accuracy
plt.plot(epochs, acc, 'bo', label='Training Accuracy')
plt.plot(epochs, val_acc, 'b-', label='Validation Accuracy')
plt.legend(); plt.grid(True); 
plt.title('Training and Validation Accuracy - VGG16 with Augmentation');
plt.show()

In [None]:
# Plotting the loss
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training Loss')
plt.plot(epochs, val_loss, 'b-', label='Validation Loss')
plt.title('Training and Validation Loss - VGG16 with Augmentation')
plt.legend(); plt.grid(True); 
plt.show()

### Book Results
- The model's validation accuracy has increased from 90% to 95% or 96%. It also doesn't overfit as quickly as before.
- This shows that we have indeed managed to successfully decrease overfitting - at least to some extent.

In [None]:
model.save('cats_and_dogs_small_vgg16_augment.h5')

# Section 5.4 - Fine Tuning

- Similar (but not identical) to feature extraction.
- Still involves reusing a pre-trained convolutional base.
- The difference is that the top few layers of the frozen convolutional base are unfrozen and trained jointly with the densely connected classifier.
- This adjusts the more abstract representations of the model being reused by re-optimizing the weights of the top layers in the convolutional base.
- The layers near the beginning of the convolutional model learn very generic features such as edges, and successive layers form an increasingly complex spatial hierarchy: they learn increasingly abstract features made up of more generic, lower-level features.

### Steps in Fine-Tuning
1. Add custom network on top of an already-trained base network.
2. Freeze the base network.
3. Train the part of the network that you added.
4. Unfreeze some layers in the base network.
5. Jointly train both these layers and the part you added. 


### Training Sequence
- The densely connected classifier on top of the convolutional base must be trained before the convolutional layers can be fine-tuned.
- This is because if the classifier on top of the base has not already been optimized, the error signal propagating through the network during training will be too large.
- This will cause the representations previously learned by the layers being fine-tuned to be destroyed.
- **We don't want to completely reinitialize the layers in the top** of the convolutional base; we just want to fine-tune them i.e. nudge their weights in the right direction ever so slightly.

### Why not fine-tune more layers?
- Earlier layers in the convolutional base encode more generic, reusable features whereas layers higher up encode more specialized features. It makes more sense to fine-tune the specialized features as they are the ones that need to be repurposed to a new problem. 
- The more layers we fine-tune, the more parameters we will have to train, and the more we will be at risk of overfitting.

In [27]:
# Inspect the layers and output shapes of the convolutional base
conv_base.summary()


In [28]:
# Freeze all layers up to (but not including) block5_conv1

# First make the whole convolutional base trainable so that
# individual layers can be frozen or unfrozen below
conv_base.trainable = True

# Flag that will be changed to True as soon as we enter block 5
set_trainable = False 

# Iterate over all the layers in the base 
for layer in conv_base.layers:
    # As soon as we encounter the first layer in block 5
    if layer.name == 'block5_conv1':
        set_trainable = True                 # flag state changes
    
    # If the flag is set, the current layer is trainable
    if set_trainable:
        layer.trainable = True
        
    # Otherwise, the layer stays frozen
    else:
        layer.trainable = False


We will use the `RMSprop` optimizer with a very low learning rate because we do not want the gradient descent updates to be too large: we want to limit the magnitude of the modifications we make to the representations of the three layers we are fine-tuning. Updates that are too large may harm these representations.
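
The three layers in question are the convolution layers of block 5. The sketch below replays the freezing rule on plain strings, assuming the layer names used by the Keras VGG16 convolutional base (input layer omitted); the helper `trainable_from` is made up for this illustration and does not touch a real model.

```python
# Layer names of the Keras VGG16 convolutional base (input layer omitted)
VGG16_LAYER_NAMES = [
    'block1_conv1', 'block1_conv2', 'block1_pool',
    'block2_conv1', 'block2_conv2', 'block2_pool',
    'block3_conv1', 'block3_conv2', 'block3_conv3', 'block3_pool',
    'block4_conv1', 'block4_conv2', 'block4_conv3', 'block4_pool',
    'block5_conv1', 'block5_conv2', 'block5_conv3', 'block5_pool',
]

def trainable_from(layer_names, first_trainable):
    """Return the names left trainable when everything before first_trainable is frozen."""
    trainable = []
    set_trainable = False
    for name in layer_names:
        if name == first_trainable:
            set_trainable = True
        if set_trainable:
            trainable.append(name)
    return trainable

unfrozen = trainable_from(VGG16_LAYER_NAMES, 'block5_conv1')
print(unfrozen)
# ['block5_conv1', 'block5_conv2', 'block5_conv3', 'block5_pool']

# Pooling layers have no weights, so only the conv layers are actually fine-tuned
fine_tuned = [name for name in unfrozen if 'conv' in name]
print(len(fine_tuned))  # 3
```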

## Fine-tuning the Model

In [29]:
model.compile(loss='binary_crossentropy', 
             optimizer=optimizers.RMSprop(lr=1e-5), 
             metrics=['acc'])

In [None]:
history = model.fit_generator(
    train_generator, 
    steps_per_epoch=100,
    epochs=100,
    validation_data=validation_generator,
    validation_steps=50
)

### Smoothing Loss/Accuracy Curves

In [30]:
def smooth_curves(points, factor=0.8):
    smoothed_points = []
    for point in points:
        if smoothed_points:
            previous = smoothed_points[-1]
            smoothed_points.append(previous * factor + point * (1 - factor))
        else:
            smoothed_points.append(point)
    return smoothed_points
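
`smooth_curves` computes an exponential moving average: each smoothed point is `factor` times the previous smoothed point plus `1 - factor` times the new raw point. A short worked example follows, repeating the function so that the snippet runs on its own; the input values and `factor=0.5` are chosen only to make the arithmetic easy to follow.

```python
def smooth_curves(points, factor=0.8):
    """Exponential moving average of a sequence of points."""
    smoothed_points = []
    for point in points:
        if smoothed_points:
            previous = smoothed_points[-1]
            smoothed_points.append(previous * factor + point * (1 - factor))
        else:
            smoothed_points.append(point)
    return smoothed_points

# 1.0 -> 1.0, then 1.0 * 0.5 + 0.0 * 0.5 = 0.5, then 0.5 * 0.5 + 0.0 * 0.5 = 0.25
print(smooth_curves([1.0, 0.0, 0.0], factor=0.5))  # [1.0, 0.5, 0.25]
```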

In [31]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# Extracting data from History object
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(val_loss) + 1)

In [None]:
# Plotting the accuracy
plt.plot(epochs, acc, 'bo', label='Training Accuracy')
plt.plot(epochs, val_acc, 'b-', label='Validation Accuracy')
plt.legend(); plt.grid(True); 
plt.title('Training and Validation Accuracy - Fine-Tuned VGG16');
plt.show()

In [None]:
# Plotting the loss
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training Loss')
plt.plot(epochs, val_loss, 'b-', label='Validation Loss')
plt.title('Training and Validation Loss - Fine-Tuned VGG16')
plt.legend(); plt.grid(True); 
plt.show()

In [None]:
# Plotting the smoothed accuracy curves using the smooth_curves helper defined above
plt.plot(epochs, 
        smooth_curves(acc), 'bo', label='Smoothed Training Accuracy')
plt.plot(epochs, 
        smooth_curves(val_acc), 'b-', label='Smoothed Validation Accuracy')
plt.title('Training and Validation Accuracy - Fine Tuned Model')
plt.legend(); plt.grid(True); plt.xlabel('Epochs'); plt.ylabel('Accuracy')
plt.show()

Accuracy can improve even when the loss does not, because the loss curve displays the average of the point-wise loss values, whereas accuracy depends on the distribution of those values and not on their average.

Accuracy is the result of binary thresholding of the class probability predicted by the model, so a prediction only has to land on the correct side of the threshold to count as correct, however large or small its individual loss.
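
A small worked example of this effect, using invented probabilities rather than outputs from this notebook. The helper names `binary_crossentropy` and `accuracy` are assumptions for this sketch and are plain-Python stand-ins, not the Keras implementations.

```python
import math

def binary_crossentropy(labels, probs):
    """Mean binary cross-entropy over all samples."""
    losses = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
              for y, p in zip(labels, probs)]
    return sum(losses) / len(losses)

def accuracy(labels, probs, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    hits = [int(p > threshold) == y for y, p in zip(labels, probs)]
    return sum(hits) / len(hits)

labels = [1, 1, 1, 1]
early = [0.45, 0.45, 0.45, 0.45]   # all slightly on the wrong side of 0.5
later = [0.55, 0.55, 0.55, 0.01]   # three barely correct, one confidently wrong

for name, probs in [('early', early), ('later', later)]:
    print(name, accuracy(labels, probs), round(binary_crossentropy(labels, probs), 3))
```

Accuracy rises from 0 to 0.75 between the two sets of predictions, yet the mean loss gets worse, because a single confidently wrong prediction dominates the average.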

## Testing the Final Model

In [None]:
# Test data is only rescaled, never augmented
test_generator = test_datagen.flow_from_directory(
    test_dir, 
    target_size=(150, 150), 
    batch_size=20,
    class_mode='binary')

In [None]:
test_loss, test_acc = model.evaluate_generator(test_generator, steps=50)

In [None]:
print('Test Accuracy: ', test_acc)