# ConvNets and Image Data

Image data is one area where neural networks stand head and shoulders above other model types.

This is because convolutional networks operate directly on the pixel grid and learn spatial features from it, rather than relying on hand-engineered features.

### Convolution

In a CNN, the input is a tensor with shape 

`(number of images) x (image height) x (image width) x (input channels)`. 

After passing through a convolutional layer, the image becomes abstracted to a feature map, with shape 

`(number of images) x (feature map height) x (feature map width) x (feature map channels)`

A convolutional layer within a neural network has the following attributes:

- Convolutional filters (kernels) defined by a width and height (hyperparameters).
- The number of output channels, i.e. the number of filters (a hyperparameter).
- The number of input channels: the depth of each filter must equal the number of channels (depth) of the input feature map.
- Other hyperparameters of the convolution operation, such as padding size and stride.
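
As a rough illustration of how these hyperparameters interact, the sketch below computes the spatial size of the output feature map and the shape of a layer's weight tensor. The helper names `conv_output_size` and `kernel_shape` are made up for this example, and the formula assumes the same amount of padding on both sides of each dimension.

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Size of one spatial dimension after a convolution."""
    return (n + 2 * padding - kernel) // stride + 1


def kernel_shape(kernel_h, kernel_w, in_channels, out_channels):
    """Shape of the learnable weights: (height, width, input channels, filters)."""
    return (kernel_h, kernel_w, in_channels, out_channels)


print(conv_output_size(28, kernel=3))             # 26: no padding shrinks the image
print(conv_output_size(28, kernel=3, padding=1))  # 28: padding of 1 preserves the size
print(conv_output_size(28, kernel=3, stride=2))   # 13: a stride of 2 roughly halves it
print(kernel_shape(3, 3, 1, 32))                  # (3, 3, 1, 32)
```

Note that the filter depth in `kernel_shape` is not a free choice: it is whatever number of channels the incoming feature map has.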

Convolutional layers convolve the input and pass the result to the next layer.
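
To make "convolve the input" concrete, here is a minimal NumPy sketch for a single-channel image and a single filter, with stride 1 and no padding. The function name `conv2d_single_channel` and the hand-picked kernel are assumptions for illustration. Like most deep learning libraries, it slides the kernel without flipping it (strictly speaking, a cross-correlation).

```python
import numpy as np


def conv2d_single_channel(image, kernel):
    """Slide a 2D kernel over a 2D image (stride 1, no padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise product of the kernel with the patch under it, then sum
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


image = np.arange(25, dtype=float).reshape(5, 5)
# A simple horizontal-difference kernel, chosen by hand for this example
kernel = np.array([[1.0, 0.0, -1.0]] * 3)

feature_map = conv2d_single_channel(image, kernel)
print(feature_map.shape)  # (3, 3)
print(feature_map)        # every entry is -6.0 because the image rises by 1 per column
```

In a real layer the kernel values are learned rather than chosen, and the same operation is summed over all input channels and repeated for every filter.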

Although fully connected feedforward neural networks can be used to learn features and classify data, this architecture is impractical for images. It would require a very high number of neurons, even in a shallow architecture, due to the very large input sizes associated with images, where each pixel is a relevant variable. 

For instance, for a (small) image of size 100 x 100, each neuron in a fully connected second layer has 10,000 weights. Convolution instead reduces the number of free parameters, allowing the network to be deeper. For example, regardless of image size, a 5 x 5 filter tiled across the image with shared weights requires only 25 learnable parameters. Having fewer parameters also acts as a form of regularization and can help mitigate the vanishing and exploding gradient problems seen during backpropagation in traditional neural networks.
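
The arithmetic behind that comparison can be sketched in a few lines. The layer width of 64 units and 64 filters is a hypothetical choice for illustration; the sketch assumes a single-channel image and ignores bias terms.

```python
image_h, image_w = 100, 100

# Fully connected: every neuron has one weight per pixel
weights_per_dense_neuron = image_h * image_w
dense_units = 64  # hypothetical layer width
dense_weights = weights_per_dense_neuron * dense_units

# Convolution: each 5 x 5 filter is shared across all image positions
kernel_h, kernel_w = 5, 5
weights_per_filter = kernel_h * kernel_w
conv_filters = 64  # hypothetical number of filters
conv_weights = weights_per_filter * conv_filters

print(weights_per_dense_neuron)  # 10000
print(dense_weights)             # 640000
print(weights_per_filter)        # 25
print(conv_weights)              # 1600
```

The convolutional count does not change if the image grows to 1000 x 1000, while the fully connected count grows with every added pixel.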

This means that the network learns to optimize the filters or convolution kernels that in traditional algorithms are hand-engineered. This independence from prior knowledge and human intervention in feature extraction is a major advantage. 

### Pooling layers

Pooling layers reduce the dimensions of the data by combining the outputs of neuron clusters at one layer into a single neuron in the next layer. 

This is a way to reduce the image size -- leaving less computation to be done.

Local pooling combines small clusters, typically 2 x 2. Global pooling acts on all the neurons of the convolutional layer. 

**Max pooling** uses the maximum value of each cluster of neurons at the prior layer.

**Average pooling** instead uses the average value.
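
A small NumPy sketch of both variants on a single-channel feature map follows. The helper `pool2d` is a made-up name for this example; it assumes non-overlapping square clusters and drops any leftover rows or columns that do not fill a full cluster.

```python
import numpy as np


def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping local pooling over size x size clusters."""
    h, w = feature_map.shape
    h_out, w_out = h // size, w // size
    trimmed = feature_map[:h_out * size, :w_out * size]
    # Split into (row block, row within block, column block, column within block)
    blocks = trimmed.reshape(h_out, size, w_out, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    if mode == "average":
        return blocks.mean(axis=(1, 3))
    raise ValueError("mode must be 'max' or 'average'")


x = np.array([[1.0, 2.0, 5.0, 6.0],
              [3.0, 4.0, 7.0, 8.0],
              [9.0, 10.0, 13.0, 14.0],
              [11.0, 12.0, 15.0, 16.0]])

print(pool2d(x, mode="max"))      # [[ 4.  8.] [12. 16.]]
print(pool2d(x, mode="average"))  # [[ 2.5  6.5] [10.5 14.5]]

# Global pooling reduces the whole map to a single value
print(x.max(), x.mean())          # 16.0 8.5
```

With 2 x 2 clusters the 4 x 4 map becomes 2 x 2, which is the halving of height and width that the pooling layers in the model below perform.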

In [1]:
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

num_classes = 10
# NOTE: the input shape is 28x28x1
# It is no longer a flat vector: each image is a 2D grid with one channel
input_shape = (28, 28, 1)

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

# Scale images to the [0, 1] range
x_train = x_train.astype("float32") / 255
x_test = x_test.astype("float32") / 255
# Make sure images have shape (28, 28, 1)
x_train = np.expand_dims(x_train, -1)
x_test = np.expand_dims(x_test, -1)
print("x_train shape:", x_train.shape)
print(x_train.shape[0], "train samples")
print(x_test.shape[0], "test samples")

# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

model = keras.Sequential(
    [
        keras.Input(shape=input_shape),
        layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Flatten(),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ]
)

model.summary()

x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 26, 26, 32)        320       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 13, 13, 32)        0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 11, 11, 64)        18496     
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 5, 5, 64)          0         
_________________________________________________________________
flatten (Flatten)            (None, 1600)              0         
_________________________________________________________________
dropout (Dropout)            (None, 1600)              0         
_________________________________________________
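
The shapes and `Param #` values in the summary above can be checked by hand. The sketch below is one way to do that, assuming unpadded 3 x 3 convolutions with stride 1 and 2 x 2 pooling that rounds down, which matches the layers as defined in the cell. The helper `conv_params` is a name invented for this example.

```python
def conv_params(kernel_h, kernel_w, in_channels, filters):
    """Weights per filter (one per kernel element per input channel) plus one bias."""
    return (kernel_h * kernel_w * in_channels + 1) * filters


print(conv_params(3, 3, 1, 32))   # 320, the conv2d row
print(conv_params(3, 3, 32, 64))  # 18496, the conv2d_1 row

# Trace the spatial size through the network
size = 28
size = size - 3 + 1   # conv2d:          26
size = size // 2      # max_pooling2d:   13
size = size - 3 + 1   # conv2d_1:        11
size = size // 2      # max_pooling2d_1: 5
flattened = size * size * 64
print(size, flattened)            # 5 1600
```

The pooling and dropout layers show 0 parameters because they have nothing to learn.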

In [2]:
batch_size = 128
epochs = 15

model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, validation_split=0.1)
score = model.evaluate(x_test, y_test, verbose=0)
print("Test loss:", score[0])
print("Test accuracy:", score[1])

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
Test loss: 0.02434978447854519
Test accuracy: 0.9908000230789185


# Pre-Trained Models

Because neural networks are widely used with image and text data, pre-trained models are commonly available in those two domains (as we've already seen with word embedding models). Here we'll use [VGG16](https://arxiv.org/abs/1409.1556), whose architecture is straightforward:

![](vgg16.png)

It's trained on the large [ImageNet](http://image-net.org/) dataset, so it's a good starting point for images in general and can classify images by type (ImageNet is a labeled image classification dataset):

In [4]:
# imports for loading VGG16 and preparing an image
from tensorflow import keras
from tensorflow.keras.applications.vgg16 import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input
from tensorflow.keras.applications.vgg16 import decode_predictions
from tensorflow.keras.preprocessing.image import load_img
from tensorflow.keras.preprocessing.image import img_to_array

# load only the convolutional base (include_top=False drops the fully connected classifier layers)
feature_extractor = VGG16(include_top=False)

# without the fixed-size dense layers, the base can be given a new input shape
new_input = keras.Input(shape=(640, 480, 3))
feature_extractor = VGG16(include_top=False, input_tensor=new_input)

# example of using the full pre-trained model as a classifier
# load an image from file
image = load_img('dog.jpg', target_size=(224, 224))
# convert the image pixels to a numpy array
image = img_to_array(image)
# reshape data for the model
image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
# prepare the image for the VGG model
image = preprocess_input(image)
# load the model
model = VGG16()
# predict the probability across all output classes
yhat = model.predict(image)
# convert the probabilities to class labels
label = decode_predictions(yhat)
# retrieve the most likely result, e.g. highest probability
label = label[0][0]
# print the classification
print('%s (%.2f%%)' % (label[1], label[2]*100))

Doberman (35.42%)
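
For intuition about what `preprocess_input` does for VGG16, the sketch below approximates it in NumPy. To the best of my understanding the Keras function converts RGB to BGR and subtracts the per-channel ImageNet means without rescaling. The function name `vgg_style_preprocess` is invented here, and the code is an illustration rather than the library's implementation.

```python
import numpy as np

# Per-channel ImageNet means in BGR order (assumed to match Keras's VGG16 preprocessing)
IMAGENET_MEAN_BGR = np.array([103.939, 116.779, 123.68], dtype="float32")


def vgg_style_preprocess(batch_rgb):
    """batch_rgb: float array of shape (N, H, W, 3), RGB order, values in [0, 255]."""
    batch_bgr = batch_rgb[..., ::-1]      # reverse the channel axis: RGB -> BGR
    return batch_bgr - IMAGENET_MEAN_BGR  # zero-center each channel, no rescaling


# A dummy pure-red image stands in for a real photo
dummy = np.zeros((1, 224, 224, 3), dtype="float32")
dummy[..., 0] = 255.0

out = vgg_style_preprocess(dummy)
print(out.shape)     # (1, 224, 224, 3)
print(out[0, 0, 0])  # red now sits in the last channel
```

Unlike the MNIST example earlier, the pixels are not divided by 255. A pre-trained model expects the same preprocessing it was trained with.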


# Fine Tuning a model

By keeping all layers of a pre-trained model except the last one, and training a replacement for that last layer on our own data, we can "fine-tune" the model to our purposes.

![](finetune.jpg)

You can find an example of this [here](https://keras.io/examples/vision/image_classification_efficientnet_fine_tuning/).
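
A minimal sketch of the idea follows. To keep it runnable without downloading weights, a tiny randomly initialized network stands in for the pre-trained base. In practice `base` would be something like `VGG16(include_top=False)`. The 32 x 32 input size and `num_new_classes = 5` are hypothetical values chosen for illustration.

```python
from tensorflow import keras
from tensorflow.keras import layers

num_new_classes = 5  # hypothetical number of classes in the new task

# Stand-in for a pre-trained feature extractor
base = keras.Sequential(
    [
        keras.Input(shape=(32, 32, 3)),
        layers.Conv2D(8, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Conv2D(16, kernel_size=(3, 3), activation="relu"),
    ],
    name="base",
)
base.trainable = False  # freeze the kept layers so training leaves them unchanged

# Attach a new, trainable output layer for the new task
inputs = keras.Input(shape=(32, 32, 3))
x = base(inputs, training=False)
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(num_new_classes, activation="softmax")(x)
model = keras.Model(inputs, outputs)

model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

print(len(model.trainable_weights))      # 2: kernel and bias of the new Dense layer
print(len(model.non_trainable_weights))  # 4: kernels and biases of the frozen conv layers
```

Calling `model.fit` on new data would then update only the final layer. A common follow-up is to unfreeze some of the base layers and continue training with a small learning rate.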