# Neural Networks for Images


## Introduction to convnets

### The convolution operation

The fundamental difference between a densely connected layer and a convolution
layer is this: Dense layers learn global patterns in their input feature space (for example, for an MNIST digit, patterns involving all pixels), whereas convolution layers learn
local patterns (see figure 5.1). In the case of images, these are patterns found in small 2D windows of the inputs. In the previous example, these windows were all 3 × 3.

This gives convnets two interesting properties:

- The patterns they learn are translation invariant. After learning a certain pattern in the lower-right corner of a picture, a convnet can recognize it anywhere: for example, in the upper-left corner. A densely connected network would have to learn the pattern anew if it appeared at a new location. This makes convnets data efficient when processing images (because the visual world is fundamentally translation invariant): they need fewer training samples to learn representations that have generalization power.
- They can learn spatial hierarchies of patterns (see figure 5.2). A first convolution layer will learn small local patterns such as edges, a second convolution layer will learn larger patterns made of the features of the first layer, and so on. This allows convnets to efficiently learn increasingly complex and abstract visual concepts (because the visual world is fundamentally spatially hierarchical).

For example, given an image of a cat, a convnet first detects small local patterns such as edges and simple shapes, and then combines them in later layers into progressively larger and more abstract visual features.



Convolutions operate over 3D tensors, called feature maps, with two spatial axes (height
and width) as well as a depth axis (also called the channels axis).

The convolution operation extracts patches from its input feature
map and applies the same transformation to all of these patches, producing an output
feature map. This output feature map is still a 3D tensor: it has a width and a height. Its
depth can be arbitrary, because the output depth is a parameter of the layer, and the different channels in that depth axis no longer stand for specific colors as in RGB
input; rather, they stand for filters. Filters encode specific aspects of the input data: at a
high level, a single filter could encode the concept “presence of a face in the input,”
for instance.

In the MNIST example, the first convolution layer takes a feature map of size (28,
28, 1) and outputs a feature map of size (26, 26, 32): it computes 32 filters over its
input. Each of these 32 output channels contains a 26 × 26 grid of values, which is a
response map of the filter over the input, indicating the response of that filter pattern at
different locations in the input (see figure 5.3). That is what the term feature map
means: every dimension in the depth axis is a feature (or filter), and the 2D tensor
output[:, :, n] is the 2D spatial map of the response of this filter over the input.

Convolutions are defined by two key parameters:
- Size of the patches extracted from the inputs—These are typically 3 × 3 or 5 × 5. In the example, they were 3 × 3, which is a common choice.
- Depth of the output feature map—The number of filters computed by the convolution. The example started with a depth of 32 and ended with a depth of 64.

In Keras Conv2D layers, these parameters are the first arguments passed to the layer: Conv2D(output_depth, (window_height, window_width)).

A convolution works by sliding these windows of size 3 × 3 or 5 × 5 over the 3D input
feature map, stopping at every possible location, and extracting the 3D patch of surrounding features (shape (window_height, window_width, input_depth)). Each
such 3D patch is then transformed (via a tensor product with the same learned weight
matrix, called the convolution kernel) into a 1D vector of shape (output_depth,). All of
these vectors are then spatially reassembled into a 3D output map of shape (height,
width, output_depth). Every spatial location in the output feature map corresponds
to the same location in the input feature map (for example, the lower-right corner of
the output contains information about the lower-right corner of the input). For
instance, with 3 × 3 windows and ignoring border effects, the vector output[i, j, :] comes from the 3D patch
centered on input location (i, j), that is, input[i-1:i+2, j-1:j+2, :] in Python slice notation, where the end index is exclusive.
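
The sliding-window procedure described above can be sketched in plain NumPy. This is only an illustration, not how Keras implements Conv2D: the function name `conv2d_valid` is made up for this example, there is no padding, stride, bias, or activation, and the kernel values are random rather than learned. Because no padding is used, `output[i, j, :]` is computed from the patch whose top-left corner is at input location `(i, j)`.

```python
import numpy as np

def conv2d_valid(feature_map, kernel):
    """Naive 2D convolution with no padding and stride 1.

    feature_map: (height, width, input_depth)
    kernel:      (window_height, window_width, input_depth, output_depth)
    returns:     (height - window_height + 1, width - window_width + 1, output_depth)
    """
    height, width, input_depth = feature_map.shape
    window_height, window_width, kernel_depth, output_depth = kernel.shape
    assert kernel_depth == input_depth, "kernel depth must match input depth"

    out_height = height - window_height + 1
    out_width = width - window_width + 1
    output = np.zeros((out_height, out_width, output_depth))

    for i in range(out_height):
        for j in range(out_width):
            # 3D patch of shape (window_height, window_width, input_depth)
            patch = feature_map[i:i + window_height, j:j + window_width, :]
            # Tensor product with the shared kernel -> 1D vector (output_depth,)
            output[i, j, :] = np.tensordot(patch, kernel, axes=3)
    return output

rng = np.random.default_rng(0)
image = rng.random((28, 28, 1))          # MNIST-sized input feature map
kernel = rng.normal(size=(3, 3, 1, 32))  # 32 filters over 3 x 3 windows
output = conv2d_valid(image, kernel)

print(output.shape)           # (26, 26, 32)
print(output[:, :, 0].shape)  # (26, 26): response map of filter 0
```

The same `kernel` is applied at every location, which is what makes the learned patterns translation invariant.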

![con](img/convwork.PNG)

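Note that the output width and height may differ from the input width and height, for two reasons:
- Border effects, which can be countered by padding the input feature map. A 3 × 3 window fits in only 26 × 26 positions of a 28 × 28 input, which is why the MNIST feature map above shrinks from 28 × 28 to 26 × 26. Padding adds rows and columns around the input so that the output keeps the same width and height as the input.
- The use of strides. The stride is the distance between two successive windows. It defaults to 1; a stride of 2 downsamples the width and height by a factor of 2 (in addition to any border effects).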
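
Both effects reduce to simple arithmetic along each spatial axis. The helper below is a small sketch of that arithmetic: `conv_output_size` is a hypothetical name, and the `"valid"` (no padding) and `"same"` (pad so the size is preserved at stride 1) modes follow the padding names used by Keras.

```python
import math

def conv_output_size(input_size, window_size, stride=1, padding="valid"):
    """Output length along one spatial axis of a convolution or pooling layer."""
    if padding == "valid":
        # No padding: count only the windows that fit entirely inside the input.
        return (input_size - window_size) // stride + 1
    if padding == "same":
        # Zero-padding chosen so that stride 1 preserves the input size.
        return math.ceil(input_size / stride)
    raise ValueError(f"unknown padding mode: {padding!r}")

print(conv_output_size(28, 3))                  # 26 (border effect, as in MNIST)
print(conv_output_size(28, 3, padding="same"))  # 28 (padding cancels the border effect)
print(conv_output_size(28, 3, stride=2))        # 13 (stride 2 roughly halves the size)
```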


### The max-pooling operation

Max pooling consists of extracting windows from the input feature maps and outputting the max value of each channel. It’s conceptually similar to convolution, except
that instead of transforming local patches via a learned linear transformation (the convolution kernel), they’re transformed via a hardcoded max tensor operation. A big difference from convolution is that max pooling is usually done with 2 × 2 windows and stride 2, in order to downsample the feature maps by a factor of 2. On the other hand,
convolution is typically done with 3 × 3 windows and no stride (stride 1).

![](img/maxpool.PNG)
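
As a rough illustration of the operation, here is a naive NumPy version of max pooling with 2 × 2 windows and stride 2. The function name `max_pool2d` is made up for this sketch; it only mirrors the description above and is not the Keras implementation.

```python
import numpy as np

def max_pool2d(feature_map, pool_size=2, stride=2):
    """Naive max pooling over an (height, width, depth) feature map."""
    height, width, depth = feature_map.shape
    out_height = (height - pool_size) // stride + 1
    out_width = (width - pool_size) // stride + 1
    output = np.zeros((out_height, out_width, depth), dtype=feature_map.dtype)

    for i in range(out_height):
        for j in range(out_width):
            top, left = i * stride, j * stride
            window = feature_map[top:top + pool_size, left:left + pool_size, :]
            # Hardcoded max over the two spatial axes, separately for each channel
            output[i, j, :] = window.max(axis=(0, 1))
    return output

feature_map = np.arange(16, dtype=np.float32).reshape(4, 4, 1)
pooled = max_pool2d(feature_map)
print(pooled[:, :, 0])
# [[ 5.  7.]
#  [13. 15.]]
```

Unlike a convolution, nothing here is learned: the only parameters are the window size and the stride.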

In short, the reason to use downsampling is to reduce the number of feature-map
coefficients to process, as well as to induce spatial-filter hierarchies by making successive convolution layers look at increasingly large windows (in terms of the fraction of
the original input they cover).

Other ways of downsampling:
- Using strides in the convolution layer
- Using average pooling instead of max pooling

Max pooling tends to work better than these alternative solutions. In a nutshell, the reason is that features tend to encode the spatial presence of some pattern
or concept over the different tiles of the feature map (hence, the term feature map),
and it’s more informative to look at the maximal presence of different features than at
their average presence.
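
A tiny worked example makes the difference concrete. The 4 × 4 response map below is hypothetical: one filter fires strongly at a single location and is silent elsewhere. Max pooling keeps the full strength of that detection, whereas average pooling dilutes it.

```python
import numpy as np

# One channel of a hypothetical 4 x 4 feature map with a single strong response.
response_map = np.zeros((4, 4))
response_map[1, 2] = 1.0

# Split into non-overlapping 2 x 2 tiles.
# Axes after the reshape: (tile_row, row_in_tile, tile_col, col_in_tile).
tiles = response_map.reshape(2, 2, 2, 2)
max_pooled = tiles.max(axis=(1, 3))
avg_pooled = tiles.mean(axis=(1, 3))

print(max_pooled.tolist())  # [[0.0, 1.0], [0.0, 0.0]]
print(avg_pooled.tolist())  # [[0.0, 0.25], [0.0, 0.0]]
```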



## Training a convnet from scratch

- The relevance of deep learning for small-data problems
- Dogs vs. Cats dataset
    - This dataset contains 25,000 images of dogs and cats (12,500 from each class) and is 543 MB (compressed). After downloading and uncompressing it, you’ll create a new dataset containing three subsets: a training set with 1,000 samples of each class, a validation set with 500 samples of each class, and a test set with 500 samples of each class.
- Building your network
    - The convnet is a stack of alternating Conv2D (with relu activation) and MaxPooling2D layers.
    ![](img/scratchconvnet.PNG)
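
To see how such a stack shrinks the feature maps, the sketch below traces shapes through alternating Conv2D and MaxPooling2D layers. It does not build a real model. The 150 × 150 RGB input and the filter counts (32, 64, 128, 128) are assumptions chosen for illustration, as are the 3 × 3 convolution windows without padding and the 2 × 2 pooling windows; `trace_shapes` is a hypothetical helper.

```python
def trace_shapes(input_shape, filters_per_block, window=3, pool=2):
    """Return (layer name, output shape) pairs for a Conv2D/MaxPooling2D stack."""
    height, width, depth = input_shape
    shapes = [("input", (height, width, depth))]
    for filters in filters_per_block:
        # Conv2D, no padding, stride 1: each spatial axis loses (window - 1)
        height, width, depth = height - window + 1, width - window + 1, filters
        shapes.append(("Conv2D", (height, width, depth)))
        # MaxPooling2D, stride equal to the pool size: each spatial axis is halved
        height, width = height // pool, width // pool
        shapes.append(("MaxPooling2D", (height, width, depth)))
    return shapes

# Assumed input size and filter counts, for illustration only
for layer, shape in trace_shapes((150, 150, 3), [32, 64, 128, 128]):
    print(f"{layer:<13}{shape}")
```

Under these assumptions the spatial size goes 150 → 148 → 74 → 72 → 36 → 34 → 17 → 15 → 7 while the depth grows from 3 to 128.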

- Data preprocessing
    1. Read the picture files.
    2. Decode the JPEG content to RGB grids of pixels.
    3. Convert these into floating-point tensors.
    4. Rescale the pixel values (between 0 and 255) to the [0, 1] interval (as you know, neural networks prefer to deal with small input values).
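
Steps 3 and 4 amount to a type conversion and a division. The sketch below uses a synthetic array as a stand-in for a decoded JPEG (steps 1 and 2 are not shown); the 150 × 150 size and the helper name `to_model_input` are assumptions made for this example.

```python
import numpy as np

def to_model_input(rgb_pixels):
    """Convert a decoded uint8 RGB grid to a float32 tensor scaled to [0, 1]."""
    return rgb_pixels.astype("float32") / 255.0

# Stand-in for steps 1-2: random pixels instead of a real decoded JPEG.
rng = np.random.default_rng(0)
decoded = rng.integers(0, 256, size=(150, 150, 3), dtype=np.uint8)

tensor = to_model_input(decoded)
print(tensor.dtype, tensor.shape)  # float32 (150, 150, 3)
print(float(tensor.min()) >= 0.0 and float(tensor.max()) <= 1.0)  # True
```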

- Using data augmentation
    - Data augmentation takes the approach of generating more training data from existing training samples, by augmenting the samples via a number of random transformations that yield believable-looking images. The goal is that at training time, your model will never see the exact same picture twice. This helps expose the model to more aspects of the data and generalize better.

![](img/dataaugkeras.PNG)
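
The idea can also be sketched without Keras. The NumPy function below applies just two random transformations, a horizontal flip and a shift; it is a simplified illustration, not the configuration pictured above. The name `augment` and the `max_shift` parameter are made up, and shifted pixels wrap around the border, which a real augmentation pipeline would typically handle with a fill mode instead.

```python
import numpy as np

def augment(image, rng, max_shift=0.2):
    """Return a randomly flipped and shifted copy of an (height, width, channels) image."""
    height, width, _ = image.shape
    augmented = image
    if rng.random() < 0.5:
        augmented = augmented[:, ::-1, :]  # horizontal flip
    limit_y, limit_x = int(max_shift * height), int(max_shift * width)
    dy = int(rng.integers(-limit_y, limit_y + 1))
    dx = int(rng.integers(-limit_x, limit_x + 1))
    # np.roll wraps pixels around the border (a simplification, see above)
    return np.roll(augmented, shift=(dy, dx), axis=(0, 1))

rng = np.random.default_rng(42)
image = rng.random((8, 8, 3))  # stand-in for one training picture
batch = np.stack([augment(image, rng) for _ in range(4)])
print(batch.shape)  # (4, 8, 8, 3)
```

Every image in `batch` contains exactly the same pixel values as `image`, only rearranged: the transformations remix existing information rather than create new information.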

> If you train a new network using this data-augmentation configuration, the network
will never see the same input twice. But the inputs it sees are still heavily intercorrelated, because they come from a small number of original images—you can’t produce new information, you can only remix existing information. As such, this may not
be enough to completely get rid of overfitting. To further fight overfitting, you’ll also
add a Dropout layer to your model, right before the densely connected classifier.
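
For reference, here is a minimal NumPy sketch of what a dropout layer does at training time: it zeroes a random fraction of its inputs and scales the remaining ones up so that the expected sum is unchanged. The function name `dropout` and the rate of 0.5 are assumptions made for this example.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero a fraction `rate` of the values, rescale the rest."""
    if not training or rate == 0.0:
        return activations  # dropout is only active during training
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
features = np.ones((4, 8))  # stand-in for the flattened convnet features
dropped = dropout(features, rate=0.5, rng=rng)

# With rate 0.5, every entry is now either 0.0 (dropped) or 2.0 (kept and rescaled)
print(np.unique(dropped))
```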

## Using a pretrained convnet

