## A first look at a neural network

In [1]:
from keras.datasets import mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

The core building block of neural networks is the layer, a data-processing module that
you can think of as a filter for data. Some data goes in, and it comes out in a more useful form. Specifically, layers extract representations out of the data fed into them—hopefully, representations that are more meaningful for the problem at hand. Most of
deep learning consists of chaining together simple layers that will implement a form
of progressive data distillation.

In [2]:
from keras import models
from keras import layers
network = models.Sequential()
network.add(layers.Dense(512, activation='relu', input_shape=(28 * 28,)))
network.add(layers.Dense(10, activation='softmax'))

To make the network ready for training, we need to pick three more things, as part
of the compilation step:
* A loss function—How the network will be able to measure its performance on
the training data, and thus how it will be able to steer itself in the right direction.
* An optimizer—The mechanism through which the network will update itself
based on the data it sees and its loss function.
* Metrics to monitor during training and testing—Here, we’ll only care about accuracy (the fraction of the images that were correctly classified).

In [3]:
network.compile(optimizer='rmsprop',
    loss='categorical_crossentropy',
    metrics=['accuracy'])

Before training, we’ll preprocess the data by reshaping it into the shape the network
expects and scaling it so that all values are in the [0, 1] interval. Previously, our training images, for instance, were stored in an array of shape (60000, 28, 28) of type
uint8 with values in the [0, 255] interval. We transform it into a float32 array of
shape (60000, 28 * 28) with values between 0 and 1.

In [4]:
train_images = train_images.reshape((60000, 28 * 28))
train_images = train_images.astype('float32') / 255
test_images = test_images.reshape((10000, 28 * 28))
test_images = test_images.astype('float32') / 255

We also need to categorically encode the labels:

In [6]:
from keras.utils import to_categorical
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)
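
To make "categorically encode" concrete, here is a minimal NumPy sketch of what one-hot encoding does to integer labels. The helper name `one_hot` is ours, for illustration only, and is not the Keras implementation; it assumes the labels are integers in the range 0 to `num_classes - 1`.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Hypothetical helper: turn integer labels into one-hot rows."""
    labels = np.asarray(labels)
    encoded = np.zeros((labels.shape[0], num_classes), dtype="float32")
    # For each sample, put a 1 in the column given by its label.
    encoded[np.arange(labels.shape[0]), labels] = 1.0
    return encoded

sample_labels = np.array([5, 0, 4])
encoded = one_hot(sample_labels, num_classes=10)
print(encoded.shape)  # (3, 10)
print(encoded[0])     # all zeros except a 1 at index 5
```

Each label becomes a vector of length 10, which matches the 10-way softmax output of the network and is the target format that `categorical_crossentropy` expects.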

We’re now ready to train the network, which in Keras is done via a call to the network’s `fit` method:

In [7]:
network.fit(train_images, train_labels, epochs=5, batch_size=128)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x23f1924e640>

In [8]:
test_loss, test_acc = network.evaluate(test_images, test_labels)
print('test_acc:', test_acc)



## Data representations for neural networks

In the previous example, we started from data stored in multidimensional Numpy
arrays, also called tensors. In general, all current machine-learning systems use tensors
as their basic data structure.

### Scalars (0D tensors)

A tensor that contains only one number is called a scalar (or 0D tensor).

In [10]:
import numpy as np
x = np.array(12)

In [11]:
x.ndim

0

### Vectors (1D tensors)

An array of numbers is called a vector, or 1D tensor. A 1D tensor is said to have exactly
one axis. 

In [12]:
x = np.array([12, 3, 6, 14, 7])

In [13]:
x.ndim

1

This vector has five entries and so is called a 5-dimensional vector. Don’t confuse a 5D
vector with a 5D tensor! A 5D vector has only one axis and has five dimensions along its
axis, whereas a 5D tensor has five axes (and may have any number of dimensions
along each axis). Dimensionality can denote either the number of entries along a specific axis (as in the case of our 5D vector) or the number of axes in a tensor (such as a
5D tensor), which can be confusing at times. In the latter case, it’s technically more
correct to talk about a tensor of rank 5 (the rank of a tensor being the number of axes).
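
The distinction is easy to check in NumPy. The following sketch contrasts a vector with five entries and a tensor with five axes; the axis sizes of the second array are arbitrary values chosen for illustration.

```python
import numpy as np

vector_5d = np.array([12, 3, 6, 14, 7])  # one axis, five entries along it
tensor_5d = np.zeros((2, 3, 4, 5, 6))    # five axes; sizes chosen arbitrarily

print(vector_5d.ndim, vector_5d.shape)   # 1 (5,)
print(tensor_5d.ndim, tensor_5d.shape)   # 5 (2, 3, 4, 5, 6)
```

Here `ndim` reports the rank (number of axes), while `shape` reports the number of entries along each axis.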

### Matrices (2D tensors)


An array of vectors is a matrix, or 2D tensor. A matrix has two axes (often referred to
as rows and columns). You can visually interpret a matrix as a rectangular grid of numbers.

In [14]:
x = np.array([[5, 78, 2, 34, 0],
    [6, 79, 3, 35, 1],
    [7, 80, 4, 36, 2]])
x.ndim

2

### 3D tensors and higher-dimensional tensors

If you pack such matrices in a new array, you obtain a 3D tensor, which you can visually
interpret as a cube of numbers.

In [15]:
x = np.array([[[5, 78, 2, 34, 0],
    [6, 79, 3, 35, 1],
    [7, 80, 4, 36, 2]],
    [[5, 78, 2, 34, 0],
    [6, 79, 3, 35, 1],
    [7, 80, 4, 36, 2]],
    [[5, 78, 2, 34, 0],
    [6, 79, 3, 35, 1],
    [7, 80, 4, 36, 2]]])

In [16]:
x.ndim

3

### Key attributes

A tensor is defined by three key attributes:
* **Number of axes (rank)**—For instance, a 3D tensor has three axes, and a matrix has
two axes. This is also called the tensor’s ndim in Python libraries such as Numpy.
* **Shape**—This is a tuple of integers that describes how many dimensions the tensor has along each axis. For instance, the previous matrix example has shape
(3, 5), and the 3D tensor example has shape (3, 3, 5). A vector has a shape
with a single element, such as (5,), whereas a scalar has an empty shape, ().
* **Data type (usually called dtype in Python libraries)**—This is the type of the data
contained in the tensor; for instance, a tensor’s type could be float32, uint8,
float64, and so on. On rare occasions, you may see a char tensor. Note that
string tensors don’t exist in Numpy (or in most other libraries), because tensors
live in preallocated, contiguous memory segments, and strings, being
variable-length, would preclude the use of this implementation.

In [18]:
from keras.datasets import mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
train_images.ndim

3

In [19]:
train_images.shape

(60000, 28, 28)

In [20]:
train_images.dtype

dtype('uint8')

### Manipulating tensors in Numpy

Selecting specific elements in a tensor is called tensor slicing.
Let’s look at the tensor-slicing operations you can do on Numpy arrays.
The following example selects digits #10 to #100 (#100 isn’t included) and puts
them in an array of shape (90, 28, 28):

In [21]:
my_slice = train_images[10:100]
my_slice.shape

(90, 28, 28)

It’s equivalent to this more detailed notation, which specifies a start index and stop
index for the slice along each tensor axis. Note that : is equivalent to selecting the
entire axis:

In [23]:
my_slice = train_images[10:100, :, :]
my_slice.shape

(90, 28, 28)

In [24]:
my_slice = train_images[10:100, 0:28, 0:28]
my_slice.shape

(90, 28, 28)

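In general, you may select between any two indices along each tensor axis. For
instance, in order to select 14 × 14 pixels in the bottom-right corner of all images, you
slice from index 14 onward along both image axes. Negative indices also work: much like in Python lists, they indicate a position relative to the end of the current axis, so `7:-7` crops each image to a 14 × 14 patch centered in the middle. The cell below does both in turn, and the second assignment overwrites the first:

In [26]:
# 14 x 14 patch in the bottom-right corner of every image
my_slice = train_images[:, 14:, 14:]
# 14 x 14 patch centered in every image (negative indices count from the end)
my_slice = train_images[:, 7:-7, 7:-7]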
my_slice.shape

(60000, 14, 14)

### The notion of data batches

In general, the first axis (axis 0, because indexing starts at 0) in all data tensors you’ll
come across in deep learning will be the samples axis (sometimes called the samples
dimension). In the MNIST example, samples are images of digits.

In addition, deep-learning models don’t process an entire dataset at once; rather,
they break the data into small batches. When considering such a batch tensor, the first axis (axis 0) is called the batch axis or
batch dimension.
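
As a sketch of how batches are cut out of a dataset, the example below slices along axis 0. It uses a zero-filled stand-in array rather than the real MNIST images, and the helper `get_batch` is a hypothetical name introduced for illustration.

```python
import numpy as np

# Stand-in for the MNIST training images (same dtype and image size, fewer samples).
images = np.zeros((1000, 28, 28), dtype="uint8")
batch_size = 128

first_batch = images[:batch_size]
second_batch = images[batch_size:2 * batch_size]

def get_batch(data, n, batch_size):
    """Hypothetical helper: return the n-th batch along the samples axis."""
    return data[batch_size * n:batch_size * (n + 1)]

third_batch = get_batch(images, 2, batch_size)
last_batch = get_batch(images, 7, batch_size)  # only 1000 - 7 * 128 = 104 samples remain

print(first_batch.shape)  # (128, 28, 28)
print(last_batch.shape)   # (104, 28, 28)
```

Note that the last batch is smaller whenever the dataset size is not a multiple of the batch size.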

## Real-world examples of data tensors

The data you’ll manipulate will almost always fall into one of the following categories:
* **Vector data**—2D tensors of shape *(samples, features)*
* **Timeseries data or sequence data**—3D tensors of shape *(samples, timesteps,
features)*
* **Images**—4D tensors of shape *(samples, height, width, channels)* or *(samples,
channels, height, width)*
* **Video**—5D tensors of shape *(samples, frames, height, width, channels)* or
*(samples, frames, channels, height, width)*

### Vector data

This is the most common case. In such a dataset, each single data point can be encoded
as a vector, and thus a batch of data will be encoded as a 2D tensor (that is, an array of
vectors), where the first axis is the samples axis and the second axis is the features axis.
Let’s take a look at two examples:
* An actuarial dataset of people, where we consider each person’s age, ZIP code,
and income. Each person can be characterized as a vector of 3 values, and thus
an entire dataset of 100,000 people can be stored in a 2D tensor of shape
(100000, 3).
* A dataset of text documents, where we represent each document by the counts
of how many times each word appears in it (out of a dictionary of 20,000 common words). Each document can be encoded as a vector of 20,000 values (one
count per word in the dictionary), and thus an entire dataset of 500 documents
can be stored in a tensor of shape (500, 20000).
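
The word-count encoding can be sketched on a toy scale. The vocabulary and documents below are made up for illustration; a real dataset would use a far larger dictionary, such as the 20,000 words mentioned above.

```python
import numpy as np

# Hypothetical toy vocabulary and documents, for illustration only.
vocabulary = ["cat", "dog", "sat", "mat", "the"]
word_index = {word: i for i, word in enumerate(vocabulary)}
documents = ["the cat sat", "the dog sat the mat", "dog dog"]

# One row per document (samples axis), one column per word (features axis).
counts = np.zeros((len(documents), len(vocabulary)), dtype="int64")
for row, document in enumerate(documents):
    for word in document.split():
        counts[row, word_index[word]] += 1

print(counts.shape)  # (3, 5)
print(counts)
```

The result is a 2D tensor of shape *(samples, features)*, exactly as in the examples above but with 3 documents and 5 words.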

### Timeseries data or sequence data

Whenever time matters in your data (or the notion of sequence order), it makes sense
to store it in a 3D tensor with an explicit time axis. Each sample can be encoded as a
sequence of vectors (a 2D tensor), and thus a batch of data will be encoded as a 3D
tensor.

The time axis is always the second axis (axis of index 1), by convention. Let’s look at a
few examples:
* A dataset of stock prices. Every minute, we store the current price of the stock,
the highest price in the past minute, and the lowest price in the past minute.
Thus every minute is encoded as a 3D vector, an entire day of trading is
encoded as a 2D tensor of shape (390, 3) (there are 390 minutes in a trading
day), and 250 days’ worth of data can be stored in a 3D tensor of shape (250,
390, 3). Here, each sample would be one day’s worth of data.
* A dataset of tweets, where we encode each tweet as a sequence of 280 characters
out of an alphabet of 128 unique characters. In this setting, each character can
be encoded as a binary vector of size 128 (an all-zeros vector except for a 1 entry
at the index corresponding to the character). Then each tweet can be encoded
as a 2D tensor of shape (280, 128), and a dataset of 1 million tweets can be
stored in a tensor of shape (1000000, 280, 128).
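
The tweet encoding can be sketched as follows. We assume 7-bit ASCII as the 128-character alphabet, and `encode_tweet` is a hypothetical helper; tweets shorter than 280 characters are simply left zero-padded at the end.

```python
import numpy as np

max_length = 280     # characters per tweet, as in the text
alphabet_size = 128  # assumption: 7-bit ASCII serves as the alphabet

def encode_tweet(text):
    """Hypothetical helper: one-hot encode a string into shape (280, 128)."""
    encoded = np.zeros((max_length, alphabet_size), dtype="uint8")
    for position, character in enumerate(text[:max_length]):
        code = ord(character)
        if code < alphabet_size:  # characters outside the alphabet stay all-zero
            encoded[position, code] = 1
    return encoded

tweets = ["hello world", "tensors are fun"]
batch = np.stack([encode_tweet(t) for t in tweets])
print(batch.shape)  # (2, 280, 128)
```

The batch has shape *(samples, timesteps, features)*, with the character position playing the role of the time axis.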

### Image data

Images typically have three dimensions: height, width, and color depth. Although
grayscale images (like our MNIST digits) have only a single color channel and could
thus be stored in 2D tensors, by convention image tensors are always 3D, with a one-dimensional color channel for grayscale images. A batch of 128 grayscale images of
size 256 × 256 could thus be stored in a tensor of shape (128, 256, 256, 1), and a
batch of 128 color images could be stored in a tensor of shape (128, 256, 256, 3).

There are two conventions for shapes of image tensors: the channels-last convention
(used by TensorFlow) and the channels-first convention (used by Theano). The TensorFlow machine-learning framework, from Google, places the color-depth axis at the
end: (samples, height, width, color_depth). Meanwhile, Theano places the color-depth
axis right after the batch axis: (samples, color_depth, height, width).
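
Converting between the two conventions is a matter of reordering axes. The sketch below uses zero-filled stand-in arrays and plain NumPy; it is an illustration of the layouts, not how any particular framework performs the conversion internally.

```python
import numpy as np

# Channels-last batch: (samples, height, width, color_depth)
channels_last = np.zeros((128, 256, 256, 3), dtype="uint8")

# Move the color-depth axis right after the samples axis.
channels_first = np.transpose(channels_last, (0, 3, 1, 2))
print(channels_first.shape)  # (128, 3, 256, 256)

# Grayscale images stored without a channel axis get one of size 1 appended.
grayscale = np.zeros((128, 256, 256), dtype="uint8")
grayscale_with_channel = grayscale[..., np.newaxis]
print(grayscale_with_channel.shape)  # (128, 256, 256, 1)
```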

### Video data

Video data is one of the few types of real-world data for which you’ll need 5D tensors.
A video can be understood as a sequence of frames, each frame being a color image.
Because each frame can be stored in a 3D tensor *(height, width, color_depth)*, a
sequence of frames can be stored in a 4D tensor *(frames, height, width, color_depth)*,
and thus a batch of different videos can be stored in a 5D tensor of shape
*(samples, frames, height, width, color_depth)*.

## Tensor operations

All transformations learned by deep neural networks can be reduced to a handful of
tensor operations applied to tensors of numeric data. For instance, it’s possible to
add tensors, multiply tensors, and so on. Some tensor operations are element-wise
(applied independently to each entry), and others, such as the dot product, combine entries across whole axes.
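
To illustrate what "element-wise" means, here is a sketch of two such operations written with explicit loops, next to their vectorized NumPy equivalents. The function names `naive_relu` and `naive_add` are ours, and the loop versions are for understanding only; in practice you would use the built-in NumPy operations.

```python
import numpy as np

def naive_relu(x):
    """Element-wise relu on a 2D tensor, written with explicit loops."""
    assert x.ndim == 2
    out = x.copy()  # avoid overwriting the input
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = max(out[i, j], 0.0)
    return out

def naive_add(x, y):
    """Element-wise addition of two 2D tensors of identical shape."""
    assert x.ndim == 2 and x.shape == y.shape
    out = x.copy()
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] += y[i, j]
    return out

x = np.array([[-1.0, 2.0], [3.0, -4.0]])
y = np.array([[0.5, 0.5], [0.5, 0.5]])

print(naive_relu(x))       # same values as np.maximum(x, 0.0)
print(naive_add(x, y))     # same values as x + y
```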


### Broadcasting

When an element-wise operation is applied to two tensors of different shapes, the smaller tensor will, when possible and if there’s no ambiguity, be broadcast to
match the shape of the larger tensor. Broadcasting consists of two steps:
1. Axes (called broadcast axes) are added to the smaller tensor to match the ndim of
the larger tensor.
2. The smaller tensor is repeated alongside these new axes to match the full shape
of the larger tensor.

Consider X with shape (32, 10) and y with shape
(10,). First, we add an empty first axis to y, whose shape becomes (1, 10). Then, we
repeat y 32 times alongside this new axis, so that we end up with a tensor Y with shape
(32, 10), where `Y[i, :] == y for i in range(0, 32)`. At this point, we can proceed to
add X and Y, because they have the same shape.

With broadcasting, you can generally apply two-tensor element-wise operations if one
tensor has shape (a, b, … n, n + 1, … m) and the other has shape (n, n + 1, … m). The
broadcasting will then automatically happen for axes a through n - 1.
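
The two broadcasting steps can be spelled out explicitly and compared with what NumPy does implicitly. This is a sketch with random data; NumPy does not actually materialize the repeated tensor `Y`, the repetition is only conceptual.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((32, 10))
y = rng.random((10,))

# Step 1: add a broadcast axis to y, giving shape (1, 10).
y_expanded = y[np.newaxis, :]
# Step 2: repeat y 32 times along the new axis, giving shape (32, 10).
Y = np.repeat(y_expanded, 32, axis=0)
explicit_sum = X + Y

# Implicit broadcasting gives the same result in one expression.
broadcast_sum = X + y

# Higher-rank case: shapes (64, 3, 32, 10) and (32, 10) are compatible.
a = rng.random((64, 3, 32, 10))
b = rng.random((32, 10))
z = np.maximum(a, b)
print(z.shape)  # (64, 3, 32, 10)
```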

### Tensor dot

The dot product between two vectors is a scalar, and only
vectors with the same number of elements are compatible for a dot product.
You can also take the dot product between a matrix x and a vector y, which returns
a vector where the coefficients are the dot products between y and the rows of x.

The most common application may be the dot product between two matrices. You
can take the dot product of two matrices x and y (`dot(x, y)`) if and only if
`x.shape[1] == y.shape[0]`. The result is a matrix with shape `(x.shape[0],
y.shape[1])`, where the coefficients are the dot products between the rows of x
and the columns of y.
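
The shape rule for the matrix dot product can be checked with a loop-based sketch compared against `np.dot`. The function name `naive_matrix_dot` is ours, and the loop version is for illustration only.

```python
import numpy as np

def naive_matrix_dot(x, y):
    """Matrix dot product written out as row-by-column dot products."""
    assert x.ndim == 2 and y.ndim == 2
    assert x.shape[1] == y.shape[0]  # compatibility condition from the text
    z = np.zeros((x.shape[0], y.shape[1]))
    for i in range(x.shape[0]):
        for j in range(y.shape[1]):
            # Coefficient (i, j) is the dot product of row i of x and column j of y.
            z[i, j] = np.sum(x[i, :] * y[:, j])
    return z

x = np.arange(6.0).reshape((2, 3))   # [[0, 1, 2], [3, 4, 5]]
y = np.arange(12.0).reshape((3, 4))
z = naive_matrix_dot(x, y)
print(z.shape)       # (2, 4)

v = np.array([1.0, 0.0, -1.0])
print(np.dot(x, v))  # matrix-vector product: [-2. -2.]
print(np.dot(v, v))  # vector-vector product is a scalar: 2.0
```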

### Tensor reshaping

Reshaping a tensor means rearranging its rows and columns to match a target shape.
Naturally, the reshaped tensor has the same total number of coefficients as the initial
tensor. Reshaping is best understood via simple examples:

In [32]:
x = np.array([[0., 1.],
    [2., 3.],
    [4., 5.]])
x.shape

(3, 2)

In [33]:
x

array([[0., 1.],
       [2., 3.],
       [4., 5.]])

In [35]:
x = x.reshape((6, 1))
x.shape

(6, 1)

In [36]:
x

array([[0.],
       [1.],
       [2.],
       [3.],
       [4.],
       [5.]])

A special case of reshaping that’s commonly encountered is transposition. Transposing a
matrix means exchanging its rows and its columns, so that `x[i, :]` becomes `x[:, i]`:

In [37]:
x = np.zeros((300, 20))
x = np.transpose(x)
x.shape

(20, 300)

>You can interpret a neural network as a very complex geometric transformation in a high-dimensional space, implemented via a long series of simple steps. Imagine two sheets of colored
paper: one red and one blue. Put one on top of the other. Now crumple them
together into a small ball. That crumpled paper ball is your input data, and each sheet
of paper is a class of data in a classification problem. What a neural network (or any
other machine-learning model) is meant to do is figure out a transformation of the
paper ball that would uncrumple it, so as to make the two classes cleanly separable
again. With deep learning, this would be implemented as a series of simple transformations of the 3D space, such as those you could apply on the paper ball with your fingers, one movement at a time.

## Gradient-based optimization

Each neural layer from our first network example
transforms its input data as follows:<br>
`output = relu(dot(W, input) + b)`<br>
In this expression, W and b are tensors that are attributes of the layer. They’re called
the weights or trainable parameters of the layer (the kernel and bias attributes, respectively). 
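
The layer expression can be sketched in NumPy for a single input vector. We follow the ordering `dot(W, input)` used in the text, so `W` is assumed to have shape (512, 784); the random input stands in for one flattened 28 × 28 image, and the initialization scale of 0.01 is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

input_size, output_size = 28 * 28, 512

# Weights start as small random values; the bias starts at zero.
W = rng.normal(scale=0.01, size=(output_size, input_size))
b = np.zeros(output_size)

input_vector = rng.random(input_size)  # stand-in for one flattened image in [0, 1)
output = relu(np.dot(W, input_vector) + b)
print(output.shape)  # (512,)
```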

Initially, these weight matrices are filled with small random values (a step called random initialization). Of course, there’s no reason to expect that `relu(dot(W, input) + b)`, when W and b are random, will yield any useful representations. The resulting representations are meaningless, but they’re a starting point. What comes next is to gradually
adjust these weights, based on a feedback signal. This gradual adjustment, also called
training, is basically the learning that machine learning is all about.
This happens within what’s called a training loop, which works as follows. Repeat
these steps in a loop, as long as necessary:
1. Draw a batch of training samples x and corresponding targets y.
2. Run the network on x (a step called the forward pass) to obtain predictions y_pred.
3. Compute the loss of the network on the batch, a measure of the mismatch
between y_pred and y.
4. Update all weights of the network in a way that slightly reduces the loss on this
batch.

To update the weights, one naive solution would be to freeze all weights in the network except the one
scalar coefficient being considered, and try different values for this coefficient. But such an approach would be horribly inefficient, because you’d need to compute two forward passes (which are expensive) for every individual coefficient (of
which there are many, usually thousands and sometimes up to millions). A much better approach is to take advantage of the fact that all operations used in the network
are differentiable, and compute the gradient of the loss with regard to the network’s
coefficients.

### Stochastic gradient descent

Given a differentiable function, it’s theoretically possible to find its minimum analytically: it’s known that a function’s minimum is a point where the derivative is 0, so all
you have to do is find all the points where the derivative goes to 0 and check for which
of these points the function has the lowest value. However, for a real neural network this means solving an equation in as many variables as the network has coefficients, which is intractable in practice. Instead, you can use the four-step algorithm outlined at the beginning of this section: modify the parameters little by little based on the current loss value on a random batch of data. Because you’re dealing with a differentiable function, you can
compute its gradient, which gives you an efficient way to implement step 4. If you
update the weights in the opposite direction from the gradient, the loss will be a little
less every time:

Concretely, step 4 becomes: move the parameters a little in the opposite direction from the gradient, for example `W -= step * gradient`, thus reducing the loss on the batch a bit.
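
The training loop with a gradient-based update can be sketched on a toy problem. This is not the MNIST network: it fits a one-parameter linear model to synthetic data, with the gradient of the mean squared error written out by hand. The data, the step size of 0.1, and the batch size of 32 are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: targets follow y = 3x + 1 plus a little noise.
x_train = rng.uniform(-1.0, 1.0, size=(1000, 1))
y_train = 3.0 * x_train[:, 0] + 1.0 + rng.normal(scale=0.01, size=1000)

W = rng.normal(scale=0.1, size=(1,))  # random initialization
b = 0.0
step = 0.1
batch_size = 32

def mean_squared_error(W, b, x, y):
    return np.mean((x @ W + b - y) ** 2)

initial_loss = mean_squared_error(W, b, x_train, y_train)

for iteration in range(500):
    # 1. Draw a random batch of samples and targets.
    indices = rng.integers(0, len(x_train), size=batch_size)
    x, y = x_train[indices], y_train[indices]
    # 2. Forward pass.
    y_pred = x @ W + b
    # 3. The loss is the mean squared error; compute its gradient.
    error = y_pred - y
    grad_W = 2.0 * (x.T @ error) / batch_size
    grad_b = 2.0 * np.mean(error)
    # 4. Move the parameters in the opposite direction from the gradient.
    W -= step * grad_W
    b -= step * grad_b

final_loss = mean_squared_error(W, b, x_train, y_train)
print(initial_loss, final_loss)
print(W, b)  # should end up close to 3 and 1
```

In a real network the gradient is not derived by hand for each model; it is obtained automatically from the chain of differentiable operations.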

### The backpropagation algorithm

A neural network function
consists of many tensor operations chained together, each of which has a simple,
known derivative. Applying the chain rule to the
computation of the gradient values of a neural network gives rise to an algorithm called backpropagation (also sometimes called reverse-mode differentiation). Backpropagation starts with the final loss value and works backward from the top layers to the bottom layers, applying the chain rule to compute the contribution that each parameter
had in the loss value.
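
The idea can be sketched on a tiny two-step "network" with scalar weights, where the chain rule is applied by hand from the loss backward and the result is checked against finite differences. The model, the function names, and the numeric values are all hypothetical and chosen for illustration; real frameworks automate this for arbitrary chains of tensor operations.

```python
def forward(w1, w2, x, target):
    h = max(w1 * x, 0.0)           # first step: relu(w1 * x)
    y_pred = w2 * h                # second step: linear
    loss = (y_pred - target) ** 2  # squared error
    return h, y_pred, loss

def backward(w1, w2, x, target):
    """Apply the chain rule from the loss back to each weight."""
    h, y_pred, _ = forward(w1, w2, x, target)
    d_loss_d_pred = 2.0 * (y_pred - target)  # derivative of the squared error
    d_loss_d_w2 = d_loss_d_pred * h          # because y_pred = w2 * h
    d_loss_d_h = d_loss_d_pred * w2
    d_h_d_pre = 1.0 if w1 * x > 0 else 0.0   # derivative of relu
    d_loss_d_w1 = d_loss_d_h * d_h_d_pre * x
    return d_loss_d_w1, d_loss_d_w2

w1, w2, x, target = 0.5, -1.5, 2.0, 1.0
grad_w1, grad_w2 = backward(w1, w2, x, target)

# Numerical check with central finite differences.
eps = 1e-6
num_w1 = (forward(w1 + eps, w2, x, target)[2]
          - forward(w1 - eps, w2, x, target)[2]) / (2 * eps)
num_w2 = (forward(w1, w2 + eps, x, target)[2]
          - forward(w1, w2 - eps, x, target)[2]) / (2 * eps)

print(grad_w1, num_w1)  # analytic value is 15.0
print(grad_w2, num_w2)  # analytic value is -5.0
```

The finite-difference check needs two forward passes per weight, which is exactly the cost that makes the naive approach impractical for large networks, whereas the backward pass yields all gradients at once.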

>* Learning means finding a combination of model parameters that minimizes a loss function for a given set of training data samples and their corresponding targets.
>* Learning happens by drawing random batches of data samples and their
targets, and computing the gradient of the loss on the batch with
respect to the network parameters. The network parameters are then moved
a bit (the magnitude of the move is defined by the learning rate) in the
opposite direction from the gradient.
>* The entire learning process is made possible by the fact that neural networks are chains of differentiable tensor operations, and thus it’s possible
to apply the chain rule of differentiation to find the gradient function mapping the current parameters and current batch of data to a gradient value.
>* Two key concepts you’ll see frequently in future chapters are loss and optimizers. These are the two things you need to define before you begin feeding data into a network.
>* The loss is the quantity you’ll attempt to minimize during training, so it
should represent a measure of success for the task you’re trying to solve.
>* The optimizer specifies the exact way in which the gradient of the loss will
be used to update parameters: for instance, it could be the RMSProp optimizer, SGD with momentum, and so on.