<a href="https://colab.research.google.com/github/Ali623/practice/blob/master/deep_learning_with_python_chapter3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Getting started with neural networks (Chapter 3)
_____
## This chapter covers
- Core components of neural networks
- An introduction to Keras
- Setting up a deep-learning workstation
- Using neural networks to solve basic classification and regression problems

This chapter covers the three most common use cases of neural networks: binary classification,
multiclass classification, and scalar regression.

We’ll take a closer look at the core components of neural networks
that we introduced in chapter 2: layers, networks, objective functions, and optimizers.
The three introductory examples are:

- Classifying movie reviews as positive or negative (binary classification)
- Classifying newswires by topic (multiclass classification)
- Estimating the price of a house, given real-estate data (regression)

## 3.1 Anatomy of a neural network

Training a neural network revolves around the following
objects:

- Layers, which are combined into a network (or model)
- The input data and corresponding targets
- The loss function, which defines the feedback signal used for learning
- The optimizer, which determines how learning proceeds

Figure 3.1 <br>
You can visualize their interaction as illustrated in figure 3.1: the network, composed
of layers that are chained together, maps the input data to predictions. The loss function
then compares these predictions to the targets, producing a loss value: a measure
of how well the network’s predictions match what was expected. The optimizer uses
this loss value to update the network’s weights.
<img src='images/f3.1.png'>
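
The interaction described above can be sketched in plain NumPy. This is only an illustrative toy, not how Keras implements training: the data, the single linear layer, the learning rate, and the step count are all assumptions chosen so the loop is easy to follow.

```python
import numpy as np

# Toy data (hypothetical): targets follow y = 3x + 1 exactly.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(64, 1))
y = 3.0 * x + 1.0

# The "network": a single linear layer with weights W and bias b.
W = np.zeros((1, 1))
b = np.zeros((1,))
learning_rate = 0.1  # assumed value, chosen for this toy problem


def mse(predictions, targets):
    """Loss function: mean squared error between predictions and targets."""
    return float(np.mean((predictions - targets) ** 2))


initial_loss = mse(x @ W + b, y)

for step in range(500):
    predictions = x @ W + b                # the network maps inputs to predictions
    error = predictions - y                # compare predictions to the targets
    grad_W = 2.0 * x.T @ error / len(x)    # gradient of the loss with respect to W
    grad_b = 2.0 * error.mean(axis=0)      # gradient of the loss with respect to b
    W -= learning_rate * grad_W            # optimizer: plain gradient-descent update
    b -= learning_rate * grad_b

final_loss = mse(x @ W + b, y)
```

After the loop, `final_loss` should be far smaller than `initial_loss`, and `W` and `b` should be close to 3 and 1.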

### 3.1.1 Layers: the building blocks of deep learning
- Some layers are stateless, but
more frequently layers have a state: the layer’s weights, one or several tensors learned
with stochastic gradient descent, which together contain the network’s knowledge.
- Different layers are appropriate for different tensor formats and different types of data
processing. For instance, simple vector data, stored in 2D tensors of shape (samples,
features), is often processed by densely connected layers, also called fully connected or dense
layers (the Dense class in Keras). Sequence data, stored in 3D tensors of shape (samples,
timesteps, features), is typically processed by recurrent layers such as an LSTM layer.
Image data, stored in 4D tensors, is usually processed by 2D convolution layers (Conv2D).
- The notion of layer compatibility here refers specifically to the fact that every layer
will only accept input tensors of a certain shape and will return output tensors of a certain
shape. Consider the following example:

In [0]:
from keras import layers
layer = layers.Dense(32, input_shape=(784,))

This layer will only accept as input 2D tensors whose first dimension is 784 (axis 0, the batch
dimension, is unspecified, so any value is accepted). It returns a tensor whose first dimension
has been transformed to 32.

Thus this layer can only be connected to a downstream layer that expects 32-dimensional
vectors as its input. When using Keras, you don’t have to worry about
compatibility, because the layers you add to your models are dynamically built to
match the shape of the incoming layer. For instance, suppose you write the following:

In [0]:
from keras import models
from keras import layers
model = models.Sequential()
model.add(layers.Dense(32, input_shape=(784,)))
model.add(layers.Dense(32))


The second layer didn’t receive an input shape argument—instead, it automatically
inferred its input shape as being the output shape of the layer that came before. 
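
To make the shape bookkeeping concrete, here is a small NumPy sketch of the same idea. It is not Keras code: `dense_forward`, the random weights, and the batch of 5 samples are hypothetical, and the point is only that the second layer's input size can be read off the first layer's output.

```python
import numpy as np


def dense_forward(inputs, weights, bias):
    """Affine part of a dense layer: (batch, n_in) -> (batch, n_out)."""
    return inputs @ weights + bias


rng = np.random.default_rng(0)
batch = rng.normal(size=(5, 784))   # 5 hypothetical samples with 784 features

# First layer: the input size (784) is given explicitly, the output size is 32.
w1, b1 = rng.normal(size=(784, 32)), np.zeros(32)
hidden = dense_forward(batch, w1, b1)

# Second layer: its input size is inferred from the previous layer's output.
n_in = hidden.shape[-1]
w2, b2 = rng.normal(size=(n_in, 32)), np.zeros(32)
output = dense_forward(hidden, w2, b2)
```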

### 3.1.2 Models: networks of layers

A linear stack of layers is the most common architecture, but a much broader variety of network
topologies is possible. Some common ones include the following:

- Two-branch networks
- Multihead networks
- Inception blocks

By choosing a network topology, you constrain your space of possibilities
(hypothesis space) to a specific series of tensor operations, mapping input data to output
data. What you’ll then be searching for is a good set of values for the weight tensors
involved in these tensor operations.

### 3.1.3 Loss functions and optimizers:<br>keys to configuring the learning process
Once the network architecture is defined, you still have to choose two more things:
- Loss function (objective function): the quantity that will be minimized during
training. It represents a measure of success for the task at hand.
- Optimizer: determines how the network will be updated based on the loss function.
It implements a specific variant of stochastic gradient descent (SGD).

A neural network that has multiple outputs may have multiple loss functions (one per
output). But the gradient-descent process must be based on a single scalar loss value;
so, for multiloss networks, all losses are combined (via averaging) into a single scalar
quantity.

Choosing the right objective function for the right problem is extremely important:
### <span style='color:green'>you’ll use binary crossentropy for a two-class classification<br>problem, categorical crossentropy for a many-class classification problem, mean squared<br>error for a regression problem, connectionist temporal classification (CTC)<br>for a sequence-learning problem, and so on.</span>

# 3.2 Introduction to Keras

Keras has the following key features:

- It allows the same code to run seamlessly on CPU or GPU.
- It has a user-friendly API that makes it easy to quickly prototype deep-learning
models.
- It has built-in support for convolutional networks (for computer vision), recurrent
networks (for sequence processing), and any combination of both.
- It supports arbitrary network architectures: multi-input or multi-output models,
layer sharing, model sharing, and so on. This means Keras is appropriate for
building essentially any deep-learning model, from a generative adversarial network
to a neural Turing machine.

Keras
is used at Google, Netflix, Uber, CERN, Yelp, Square, and hundreds of startups working
on a wide range of problems.

<img src='images/f3.2.png'>

### 3.2.1 Keras, TensorFlow, Theano, and CNTK
Keras is a model-level library, providing high-level building blocks for developing
deep-learning models. It doesn’t handle low-level operations such as tensor manipulation
and differentiation. Instead, it relies on a specialized tensor library that serves as its backend engine.<br><br>
Keras handles the
problem in a modular way (see figure 3.3), so several different backend engines can be plugged in.<br><br>
<b>The three existing backend implementations
are the TensorFlow backend, the Theano backend, and the Microsoft Cognitive
Toolkit (CNTK) backend.</b>
<img src='images/f3.3.png'>

### 3.2.2 Developing with Keras: a quick overview

1. Define your training data: input tensors and target tensors.
2. Define a network of layers (or model) that maps your inputs to your targets.
3. Configure the learning process by choosing a loss function, an optimizer, and
some metrics to monitor.
4. Iterate on your training data by calling the fit() method of your model.

There are two ways to define a model: using the Sequential class (only for linear
stacks of layers, which is the most common network architecture by far) or the functional
API (for directed acyclic graphs of layers, which lets you build completely arbitrary
architectures).

In [0]:
from keras import models
from keras import layers
model = models.Sequential()
model.add(layers.Dense(32, activation='relu', input_shape=(784,)))
model.add(layers.Dense(10, activation='softmax'))

<b>And here’s the same model defined using the functional API:</b>

In [0]:
input_tensor = layers.Input(shape=(784,))
x = layers.Dense(32, activation='relu')(input_tensor)
output_tensor = layers.Dense(10, activation='softmax')(x)
model = models.Model(inputs=input_tensor, outputs=output_tensor) 

With the functional API, you’re manipulating the data tensors that the model processes
and applying layers to this tensor as if they were functions.

<b>NOTE</b><br>
A detailed guide to what you can do with the functional API can be
found in chapter 7. Until chapter 7, we’ll only be using the Sequential class
in our code examples.

The learning process is configured in the compilation step, where you specify the
optimizer and loss function(s) that the model should use, as well as the metrics you
want to monitor during training. Here’s an example with a single loss function, which
is by far the most common case:

In [0]:
from keras import optimizers
model.compile(optimizer=optimizers.RMSprop(lr=0.001),
loss='mse',
metrics=['accuracy'])


Finally, the learning process consists of passing Numpy arrays of input data (and the
corresponding target data) to the model via the fit() method, similar to what you
would do in Scikit-Learn and several other machine-learning libraries:


model.fit(input_tensor, target_tensor, batch_size=128, epochs=10)

We’ll look at three basic examples in sections 3.4, 3.5, and 3.6: a two-class classification
example, a many-class classification example, and a regression example.

# 3.3 Setting up a deep-learning workstation

### 3.3.1 Jupyter notebooks: the preferred way to run deep-learning experiments

### 3.3.2 Getting Keras running: two options

### 3.3.3 Running deep-learning jobs in the cloud: pros and cons

__________
# 3.4 Classifying movie reviews: <br> a binary classification example
__________
Two-class classification, or binary classification, may be the most widely applied kind
of machine-learning problem. In this example, you’ll learn to classify movie reviews as
positive or negative, based on the text content of the reviews.

### 3.4.1 The IMDB dataset
You’ll work with the IMDB dataset: a set of 50,000 highly polarized reviews from the
Internet Movie Database. They’re split into 25,000 reviews for training and 25,000
reviews for testing, each set consisting of 50% negative and 50% positive reviews.<br><br>
Just like the MNIST dataset, the IMDB dataset comes packaged with Keras. It has
already been preprocessed: the reviews (sequences of words) have been turned into
sequences of integers, where each integer stands for a specific word in a dictionary.

The following code will load the dataset (when you run it the first time, about
80 MB of data will be downloaded to your machine).

 ### <span style='color:white;background-color:skyblue;padding:10px;'>Listing 3.1 Loading the IMDB dataset</span>

In [38]:
from keras.datasets import imdb

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)


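The variables train_data and test_data are lists of reviews; each review is a list of
word indices (encoding a sequence of words). train_labels and test_labels are
lists of 0s and 1s, where 0 stands for negative and 1 stands for positive: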

In [0]:
train_data[0]

In [0]:
train_labels[0]

Because you’re restricting yourself to the top 10,000 most frequent words, no word
index will exceed 10,000:


In [0]:
max([max(sequence) for sequence in train_data])

For kicks, here’s how you can quickly decode one of these reviews back to English
words:

In [37]:
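# word_index maps each word to an integer index.
word_index = imdb.get_word_index()
# Reverse it so that integer indices map back to words.
reverse_word_index = dict(
    [(value, key) for (key, value) in word_index.items()])

# Indices are offset by 3 because 0, 1, and 2 are reserved for
# "padding", "start of sequence", and "unknown".
decoded_review = ' '.join(
    [reverse_word_index.get(i - 3, '?') for i in train_data[0]])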

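
The decoding step relies on reversing a word-to-index dictionary and shifting every index by 3. A miniature, self-contained sketch of that logic follows; the three-word vocabulary and the `decode` helper are hypothetical and stand in for the real `imdb.get_word_index()` data.

```python
# Hypothetical miniature vocabulary in the same word -> rank format
# that imdb.get_word_index() returns (ranks start at 1).
toy_word_index = {"the": 1, "film": 2, "great": 3}
reverse_toy_index = {rank: word for word, rank in toy_word_index.items()}


def decode(encoded_review, reverse_index, offset=3):
    """Map encoded integers back to words; reserved or unknown indices become '?'."""
    return " ".join(reverse_index.get(i - offset, "?") for i in encoded_review)


# 1 is a reserved index (start of sequence); 4, 5, 6 correspond to ranks 1, 2, 3.
decoded = decode([1, 4, 5, 6], reverse_toy_index)
print(decoded)  # prints "? the film great"
```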

### 3.4.2 Preparing the data

You can’t feed lists of integers into a neural network. You have to turn your lists into
tensors. There are two ways to do that:

- Pad your lists so that they all have the same length, turn them into an integer
tensor of shape (samples, word_indices), and then use as the first layer in
your network a layer capable of handling such integer tensors (the Embedding
layer, which we’ll cover in detail later in the book).
- One-hot encode your lists to turn them into vectors of 0s and 1s. This would
mean, for instance, turning the sequence [3, 5] into a 10,000-dimensional vector
that would be all 0s except for indices 3 and 5, which would be 1s. Then you
could use as the first layer in your network a Dense layer, capable of handling
floating-point vector data.

Let’s go with the latter solution to vectorize the data, which you’ll do manually for
maximum clarity.

 ### <span style='color:white;background-color:skyblue;padding:10px;'>Listing 3.2 Encoding the integer sequences into a binary matrix
</span>

In [0]:
import numpy as np
def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results

x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)

In [0]:
x_train[0]

You should also vectorize your labels, which is straightforward:

In [0]:
y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')

Now the data is ready to be fed into a neural network.

### 3.4.3 Building your network


The input data is vectors, and the labels are scalars (1s and 0s). A type of network that
performs well on such a problem is a
simple stack of fully connected (Dense) layers with relu activations: Dense(16,
activation='relu').<br>
The argument being passed to each Dense layer (16) is the number of hidden
units of the layer. A hidden unit is a dimension in the representation space of the layer.
You may remember from chapter 2 that each such Dense layer with a relu activation
implements the following chain of tensor operations:<br><br>
`output = relu(dot(W, input) + b)`<br><br>
There are two key architecture decisions to be made about such a stack of Dense layers:
- How many layers to use
- How many hidden units to choose for each layer

In chapter 4, you’ll learn formal principles to guide you in making these choices. For
the time being, you’ll have to trust me with the following architecture choice:
- Two intermediate layers with 16 hidden units each
- A third layer that will output the scalar prediction regarding the sentiment of
the current review

The intermediate layers will use relu as their activation function, and the final layer
will use a sigmoid activation so as to output a probability (a score between 0 and 1,
indicating how likely the sample is to have the target “1”: how likely the review is to be
positive). A relu (rectified linear unit) is a function meant to zero out negative values
(see figure 3.4), whereas a sigmoid “squashes” arbitrary values into the [0, 1] interval
(see figure 3.5), outputting something that can be interpreted as a probability.
<img src='images/f3.4.png'>
<img src='images/f3.5.png'>
<img src='images/f3.6.png'>
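
The two activation functions described above are short enough to write out by hand. The following NumPy sketch is for illustration only (Keras ships its own implementations); the sample values are arbitrary.

```python
import numpy as np


def relu(x):
    """Zero out negative values and keep non-negative values unchanged."""
    return np.maximum(x, 0.0)


def sigmoid(x):
    """Squash arbitrary real values into the range between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-x))


values = np.array([-2.0, 0.0, 3.0])
relu_out = relu(values)         # the negative entry becomes 0
sigmoid_out = sigmoid(values)   # every entry lies between 0 and 1
```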

Figure 3.6 shows what the network looks like. And here’s the Keras implementation,
similar to the MNIST example you saw previously.

 ### <span style='color:white;background-color:skyblue;padding:10px;'>Listing 3.3 The model definition
</span>

In [0]:
from keras import models
from keras import layers
model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

<div style='background-color:#ccc;padding:20px;'>
What are activation functions, and why are they necessary?<br>
Without an activation function like relu (also called a non-linearity), the Dense layer
would consist of two linear operations, a dot product and an addition:<br>
`output = dot(W, input) + b`<br><br>
So the layer could only learn linear transformations (affine transformations) of the
input data: the hypothesis space of the layer would be the set of all possible linear
transformations of the input data into a 16-dimensional space. Such a hypothesis
space is too restricted and wouldn’t benefit from multiple layers of representations,
because a deep stack of linear layers would still implement a linear operation: adding
more layers wouldn’t extend the hypothesis space.<br><br>
In order to get access to a much richer hypothesis space that would benefit from
deep representations, you need a non-linearity, or activation function. relu is the
most popular activation function in deep learning, but there are many other candidates,
which all come with similarly strange names: prelu, elu, and so on.
</div>
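
The claim that a stack of linear layers is still a single linear operation can be checked numerically. In the sketch below, the layer sizes and random weights are arbitrary assumptions; two activation-free layers are collapsed into one layer with `W = W2·W1` and `b = W2·b1 + b2`.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(10,))                                   # hypothetical input vector
W1, b1 = rng.normal(size=(16, 10)), rng.normal(size=(16,))
W2, b2 = rng.normal(size=(16, 16)), rng.normal(size=(16,))

# Two stacked layers without any activation function.
two_layers = np.dot(W2, np.dot(W1, x) + b1) + b2

# One equivalent affine layer: W = W2 W1 and b = W2 b1 + b2.
W = np.dot(W2, W1)
b = np.dot(W2, b1) + b2
one_layer = np.dot(W, x) + b

same_output = np.allclose(two_layers, one_layer)
```

Inserting a relu between the two layers breaks this equivalence, which is why the non-linearity extends the hypothesis space.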

Finally, you need to choose a loss function and an optimizer. Because you’re facing a
binary classification problem and the output of your network is a probability (you end
your network with a single-unit layer with a sigmoid activation), it’s best to use the
binary_crossentropy loss. It isn’t the only viable choice: you could use, for instance,
mean_squared_error. But crossentropy is usually the best choice when you’re dealing
with models that output probabilities. Crossentropy is a quantity from the field of information
theory that measures the distance between probability distributions or, in this
case, between the ground-truth distribution and your predictions.

Here’s the step where you configure the model with the rmsprop optimizer and
the binary_crossentropy loss function. Note that you’ll also monitor accuracy
during training.
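
For intuition about what the crossentropy loss measures, here is a hand-written NumPy version. It is a sketch rather than the Keras implementation; the clipping constant `eps` and the example predictions are assumptions.

```python
import numpy as np


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary crossentropy between 0/1 labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip predictions to avoid log(0); eps is an assumed small constant.
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    per_sample = -(y_true * np.log(y_pred)
                   + (1.0 - y_true) * np.log(1.0 - y_pred))
    return float(per_sample.mean())


labels = [1, 0, 1]
mostly_right = binary_crossentropy(labels, [0.9, 0.1, 0.8])  # low loss
mostly_wrong = binary_crossentropy(labels, [0.1, 0.9, 0.2])  # high loss
```

Predictions that put high probability on the correct class give a loss near 0, while confident mistakes are penalized heavily.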

 ### <span style='color:white;background-color:skyblue;padding:10px;'>Listing 3.4 Compiling the model
</span>


In [0]:
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])

You’re passing your optimizer, loss function, and metrics as strings, which is possible
because `rmsprop`, `binary_crossentropy`, and `accuracy` are packaged as part of Keras.
Sometimes you may want to configure the parameters of your optimizer or pass a custom
loss function or metric function. The former can be done by passing an optimizer
class instance as the optimizer argument, as shown in listing 3.5; the latter can be
done by passing function objects as the `loss` and/or `metrics` arguments, as shown in
listing 3.6.


 ### <span style='color:white;background-color:skyblue;padding:10px;'>Listing 3.5 Configuring the optimizer
</span>


In [0]:
from keras import optimizers

model.compile(optimizer=optimizers.RMSprop(lr=0.001),
              loss='binary_crossentropy',
              metrics=['accuracy'])

### <span style='color:white;background-color:skyblue;padding:10px;'>Listing 3.6 Using custom losses and metrics</span>


In [0]:
from keras import losses
from keras import metrics

model.compile(optimizer=optimizers.RMSprop(lr=0.001),
              loss=losses.binary_crossentropy,
              metrics=[metrics.binary_accuracy])

### 3.4.4 Validating your approach
In order to monitor during training the accuracy of the model on data it has never
seen before, you’ll create a validation set by setting apart 10,000 samples from the
original training data.

### <span style='color:white;background-color:skyblue;padding:10px;'>Listing 3.7 Setting aside a validation set</span>

In [0]:
x_val = x_train[:10000]
partial_x_train = x_train[10000:]
y_val = y_train[:10000]
partial_y_train = y_train[10000:]

You’ll now train the model for 20 epochs (20 iterations over all samples in the
x_train and y_train tensors), in mini-batches of 512 samples. At the same time,
you’ll monitor loss and accuracy on the 10,000 samples that you set apart. You do so by
passing the validation data as the validation_data argument.

### <span style='color:white;background-color:skyblue;padding:10px;'>Listing 3.8 Training your model
</span>

In [0]:
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=20,
                    batch_size=512,
                    validation_data=(x_val, y_val))


On CPU, this will take less than 2 seconds per epoch—training is over in 20 seconds.
At the end of every epoch, there is a slight pause as the model computes its loss and
accuracy on the 10,000 samples of the validation data.<br>
Note that the call to model.fit() returns a History object. This object has a member
history, which is a dictionary containing data about everything that happened
during training. Let’s look at it:


In [0]:
history_dict = history.history
history_dict.keys()


In [0]:
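history_dict.items()

The dictionary contains four entries: one per metric that was being monitored during
training and during validation. In the following two listings, let’s use Matplotlib to plot
the training and validation loss side by side (see figure 3.7), as well as the training and
validation accuracy (see figure 3.8). Note that your own results may vary slightly due to
a different random initialization of your network.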

### <span style='color:white;background-color:skyblue;padding:10px;'>Listing 3.9 Plotting the training and validation loss
</span>

In [0]:
import matplotlib.pyplot as plt
history_dict = history.history
loss_values = history_dict['loss']
val_loss_values = history_dict['val_loss']
acc = history_dict['acc']
epochs = list(range(1, len(acc) + 1))
plt.plot(epochs, loss_values, 'bo', label='Training loss')
plt.plot(epochs, val_loss_values, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
print("Figure 3.7 Training and validation loss")

### <span style='color:white;background-color:skyblue;padding:10px;'>Listing 3.10 Plotting the training and validation accuracy</span>

In [0]:
plt.clf()  # clear the previous figure
acc_values = history_dict['acc']
val_acc_values = history_dict['val_acc']
plt.plot(epochs, acc_values, 'bo', label='Training acc')
plt.plot(epochs, val_acc_values, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
print("Figure 3.8 Training and validation accuracy")

Let’s train a new network from scratch for four epochs and then evaluate it on the
test data.

### <div style='color:white;background-color:skyblue;padding:10px;'>Listing 3.11 Retraining a model from scratch</div>

In [0]:
model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=4, batch_size=512)
results = model.evaluate(x_test, y_test)

This fairly naive approach achieves an accuracy of 88%. With state-of-the-art
approaches, you should be able to get close to 95%.

### 3.4.5 Using a trained network to generate predictions on new data

After having trained a network, you’ll want to use it in a practical setting. You can generate
the likelihood of reviews being positive by using the predict method:

In [0]:
model.predict(x_test)

As you can see, the network is confident for some samples (0.99 or more, or 0.01 or
less) but less confident for others (0.6, 0.4).
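
If you need hard class labels rather than probabilities, one common approach is to threshold the outputs. The sketch below uses made-up probabilities in place of real `model.predict` output, and the 0.5 cutoff is an assumption rather than something the chapter prescribes.

```python
import numpy as np

# Hypothetical model outputs with shape (samples, 1), the shape produced
# by a single-unit sigmoid layer.
probabilities = np.array([[0.99], [0.01], [0.6], [0.4]])

# Assumed decision rule: call a review positive when its probability exceeds 0.5.
predicted_labels = (probabilities > 0.5).astype("int32").ravel()

# Distance from 0.5 as a rough confidence score (0 = unsure, 0.5 = certain).
confidence = np.abs(probabilities - 0.5).ravel()
```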

### 3.4.6 Further experiments

- You used two hidden layers. Try using one or three hidden layers, and see how
doing so affects validation and test accuracy.
- Try using layers with more hidden units or fewer hidden units: 32 units, 64 units,
and so on.
- Try using the mse loss function instead of binary_crossentropy.
- Try using the tanh activation (an activation that was popular in the early days of
neural networks) instead of relu.

### 3.4.7 Wrapping up

-  You usually need to do quite a bit of preprocessing on your raw data in order to
be able to feed it—as tensors—into a neural network. Sequences of words can
be encoded as binary vectors, but there are other encoding options, too.
-  Stacks of Dense layers with relu activations can solve a wide range of problems
(including sentiment classification), and you’ll likely use them frequently.
-  In a binary classification problem (two output classes), your network should
end with a Dense layer with one unit and a sigmoid activation: the output of
your network should be a scalar between 0 and 1, encoding a probability.
- With such a scalar sigmoid output on a binary classification problem, the loss
function you should use is binary_crossentropy.
-  The rmsprop optimizer is generally a good enough choice, whatever your problem.
That’s one less thing for you to worry about.
-  As they get better on their training data, neural networks eventually start overfitting
and end up obtaining increasingly worse results on data they’ve never
seen before. Be sure to always monitor performance on data that is outside of
the training set. 
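
One simple way to act on the last point is to read the validation-loss curve and retrain for the number of epochs at which it bottoms out. The helper below is a hypothetical sketch, and the loss values are invented for illustration.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    if not val_losses:
        raise ValueError("val_losses must not be empty")
    lowest = min(range(len(val_losses)), key=lambda i: val_losses[i])
    return lowest + 1


# Hypothetical validation-loss curve: it improves, then degrades (overfitting).
example_val_loss = [0.40, 0.31, 0.28, 0.27, 0.29, 0.33, 0.38]
stop_at = best_epoch(example_val_loss)
print(stop_at)  # prints 4
```

With a real training run, you could pass `history.history['val_loss']` to the same function.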

_____
# 3.5 Classifying newswires: a multiclass classification example
_____
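In this section, you’ll build a network to classify Reuters newswires into 46 mutually
exclusive topics. Because each data point is classified into only one category, this is an
instance of single-label, multiclass classification.
If each data point could belong to multiple categories (in this case, topics), you’d be
facing a multilabel, multiclass classification problem.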

### 3.5.1 The Reuters dataset
You’ll work with the Reuters dataset, a set of short newswires and their topics, published
by Reuters in 1986. It’s a simple, widely used toy dataset for text classification. There
are 46 different topics; some topics are more represented than others, but each topic
has at least 10 examples in the training set.<br>
Like IMDB and MNIST, the Reuters dataset comes packaged as part of Keras. Let’s
take a look.

### <div style='color:white;background-color:skyblue;padding:10px;'>Listing 3.12 Loading the Reuters dataset</div>

In [0]:
from keras.datasets import reuters
(train_data, train_labels), (test_data, test_labels) = reuters.load_data(num_words=10000)

As with the IMDB dataset, the argument <b>num_words=10000</b> restricts the data to the
<b>10,000</b> most frequently occurring words found in the data.
You have <b>8,982 training examples and 2,246 test examples</b>:

In [0]:
len(train_data)

In [0]:
len(test_data)

As with the IMDB reviews, each example is a list of integers (word indices):

In [0]:
train_data[10]

Here’s how you can decode it back to words, in case you’re curious.

### <div style='color:white;background-color:skyblue;padding:10px;'>Listing 3.13 Decoding newswires back to text</div>


In [0]:
word_index = reuters.get_word_index()
reverse_word_index = dict([(value, key) 
                           for (key, value) in word_index.items()])
decoded_newswire = ' '.join([reverse_word_index.get(i - 3, '?') 
                             for i in train_data[0]])

The label associated with an example is an integer between 0 and 45—a topic index:

In [0]:
train_labels[10]

### 3.5.2 Preparing the data
You can vectorize the data with the exact same code as in the previous example.


### <div style='color:white;background-color:skyblue;padding:10px;'>Listing 3.14 Encoding the data</div>


In [0]:
import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results

x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)


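To vectorize the labels, there are two possibilities: you can cast the label list as an
integer tensor, or you can use one-hot encoding. <b>One-hot encoding</b> of
the labels consists of embedding each label as an all-zero vector with a 1 in the place of
the label index. Here’s an example: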

In [0]:
def to_one_hot(labels, dimension=46):
    results = np.zeros((len(labels), dimension))
    for i, label in enumerate(labels):
        results[i, label] = 1.
    return results

one_hot_train_labels = to_one_hot(train_labels)
one_hot_test_labels = to_one_hot(test_labels)

<b>Note</b> that there is a built-in way to do this in Keras, which you’ve already seen in action
in the MNIST example:

In [0]:
from keras.utils import to_categorical
one_hot_train_labels = to_categorical(train_labels)
one_hot_test_labels = to_categorical(test_labels)

### 3.5.3 Building your network
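This topic-classification problem looks similar to the previous movie-review classification
problem, but there is a new constraint: the number of output classes has gone from 2 to 46. The
dimensionality of the output space is much larger.<br>
In a stack of Dense layers, each layer can only access information present in the output of
the previous layer. In the previous example, you used 16-dimensional intermediate layers, but a
16-dimensional space may be too limited to learn to separate 46 different classes:
such small layers may act as information
bottlenecks, permanently dropping relevant information.

For this reason you’ll use larger layers. Let’s go with 64 units.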
 
### <span style='color:white;background-color:skyblue;padding:10px;'>Listing 3.15 Model definition
</span>


In [0]:
from keras import models
from keras import layers

model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(46, activation='softmax'))

There are two other things you should note about this architecture:

- You end the network with a Dense layer of size 46. This means for each input
sample, the network will output a 46-dimensional vector. Each entry in this vector
(each dimension) will encode a different output class.
- The last layer uses a softmax activation. You saw this pattern in the MNIST
example. It means the network will output a probability distribution over the 46
different output classes: for every input sample, the network will produce a
46-dimensional output vector, where output[i] is the probability that the sample
belongs to class i. The 46 scores will sum to 1.
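
A softmax can be written in a few lines of NumPy, which makes the "scores sum to 1" property easy to verify. This is a minimal sketch, not the Keras implementation, and the raw scores are random placeholders.

```python
import numpy as np


def softmax(scores):
    """Turn a vector of raw scores into a probability distribution."""
    shifted = scores - np.max(scores)   # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()


rng = np.random.default_rng(0)
raw_scores = rng.normal(size=(46,))     # hypothetical pre-activation outputs
probabilities = softmax(raw_scores)
predicted_class = int(np.argmax(probabilities))
```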

### <span style='color:white;background-color:skyblue;padding:10px;'>Listing 3.16 Compiling the model
</span>

In [0]:
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

### 3.5.4 Validating your approach
Let’s set apart 1,000 samples in the training data to use as a validation set.

### <span style='color:white;background-color:skyblue;padding:10px;'>Listing 3.17 Setting aside a validation set
</span>

In [0]:
x_val = x_train[:1000]
partial_x_train = x_train[1000:]
y_val = one_hot_train_labels[:1000]
partial_y_train = one_hot_train_labels[1000:]

Now, let’s train the network for 20 epochs.
### <span style='color:white;background-color:skyblue;padding:10px;'>Listing 3.18 Training the model
</span>

In [0]:
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=20,
                    batch_size=512,
                    validation_data=(x_val, y_val))

And finally, let’s display its loss and accuracy curves (see figures 3.9 and 3.10).

### <span style='color:white;background-color:skyblue;padding:10px;'>Listing 3.19 Plotting the training and validation loss
</span>

In [0]:
import matplotlib.pyplot as plt
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(loss) + 1)
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()


### <span style='color:white;background-color:skyblue;padding:10px;'>Listing 3.20 Plotting the training and validation accuracy</span>

In [0]:
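plt.clf()  # clear the previous figure
# The accuracy key is 'acc' in older Keras releases and 'accuracy' in newer ones.
acc_key = 'acc' if 'acc' in history.history else 'accuracy'
acc = history.history[acc_key]
val_acc = history.history['val_' + acc_key]
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()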


The network begins to overfit after nine epochs. Let’s train a new network from
scratch for nine epochs and then evaluate it on the test set.
### <span style='color:white;background-color:skyblue;padding:10px;'>Listing 3.21 Retraining a model from scratch
</span>

In [0]:
model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(46, activation='softmax'))

model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(partial_x_train,
          partial_y_train,
          epochs=9,
          batch_size=512,
          validation_data=(x_val, y_val))

results = model.evaluate(x_test, one_hot_test_labels)


Here are the final results:

In [0]:
results

This approach reaches an accuracy of ~80%. With a balanced binary classification
problem, the accuracy reached by a purely random classifier would be 50%. But in
this case it’s closer to 19%, so the results seem pretty good, at least when compared to
a random baseline:

In [0]:
import copy
test_labels_copy = copy.copy(test_labels)
np.random.shuffle(test_labels_copy)
hits_array = np.array(test_labels) == np.array(test_labels_copy)
float(np.sum(hits_array)) / len(test_labels)


### 3.5.5 Generating predictions on new data

You can verify that the predict method of the model instance returns a probability
distribution over all 46 topics. Let’s generate topic predictions for all of the test data.

### <span style='color:white;background-color:skyblue;padding:10px;'>Listing 3.22 Generating predictions for new data</span>

In [0]:
predictions = model.predict(x_test)

In [0]:
predictions[0].shape

The coefficients in this vector sum to 1:

In [0]:
np.sum(predictions[0])

The largest entry is the predicted class—the class with the highest probability:

In [0]:
np.argmax(predictions[0])
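
The three checks above (shape, sum, and argmax) follow from the softmax activation on the last layer. The sketch below reproduces them in plain NumPy on random scores. The `softmax` helper and the random `logits` are assumptions made for illustration; they are not the trained model's outputs.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

# Hypothetical raw scores for one sample over 46 topics (random, for illustration only).
rng = np.random.default_rng(0)
logits = rng.normal(size=(1, 46))
probs = softmax(logits)

print(probs[0].shape)    # (46,)
print(np.sum(probs[0]))  # 1.0, up to floating-point rounding
# Softmax is monotonic, so the largest score stays the largest probability.
print(np.argmax(probs[0]) == np.argmax(logits[0]))
```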

### 3.5.6 A different way to handle the labels and the loss
We mentioned earlier that another way to encode the labels would be to cast them as
an integer tensor, like this:

In [0]:
y_train = np.array(train_labels)
y_test = np.array(test_labels)

The only thing this approach would change is the choice of the loss function. The loss
function used in listing 3.21, categorical_crossentropy, expects the labels to follow
a categorical encoding. With integer labels, you should use
sparse_categorical_crossentropy:

In [0]:
model.compile(optimizer='rmsprop',
              loss='sparse_categorical_crossentropy',
              metrics=['acc'])

This new loss function is still mathematically the same as categorical_crossentropy;
it just has a different interface.
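
To make the equivalence concrete, here is a minimal NumPy sketch that computes both losses by hand on the same predictions. The two functions are simplified stand-ins written for this example, not the Keras implementations, and the probabilities are made up.

```python
import numpy as np

def categorical_crossentropy(one_hot_targets, probs):
    # Mean over samples of -sum_k y_k * log(p_k).
    return float(np.mean(-np.sum(one_hot_targets * np.log(probs), axis=1)))

def sparse_categorical_crossentropy(int_targets, probs):
    # Mean over samples of -log(p[true class]).
    picked = probs[np.arange(len(int_targets)), int_targets]
    return float(np.mean(-np.log(picked)))

# Made-up predicted distributions for two samples over three classes.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
int_labels = np.array([0, 1])
one_hot_labels = np.eye(3)[int_labels]  # same labels, one-hot encoded

loss_dense = categorical_crossentropy(one_hot_labels, probs)
loss_sparse = sparse_categorical_crossentropy(int_labels, probs)
print(np.isclose(loss_dense, loss_sparse))  # True
```

Because a one-hot vector is zero everywhere except at the true class, the sum in the first function reduces to the single term picked out by the second.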

### 3.5.7 The importance of having sufficiently large intermediate layers

We mentioned earlier that because the final outputs are 46-dimensional, you should
avoid intermediate layers with many fewer than 46 hidden units. Now let’s see what
happens when you introduce an information bottleneck by having intermediate layers
that are significantly less than 46-dimensional: for example, 4-dimensional.


### <span style='color:white;background-color:skyblue;padding:10px;'>Listing 3.23 A model with an information bottleneck</span>

In [0]:
model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(4, activation='relu'))
model.add(layers.Dense(46, activation='softmax'))

model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(partial_x_train,
          partial_y_train,
          epochs=20,
          batch_size=128,
          validation_data=(x_val, y_val))

The network now peaks at <b>~71%</b> validation accuracy, an <b>8%</b> absolute drop. This drop
is mostly due to the fact that you’re trying to compress a lot of information (enough
information to recover the separation hyperplanes of 46 classes) into an intermediate
space that is too low-dimensional. The network is able to cram most of the necessary
information into these four-dimensional representations, but not all of it.

### 3.5.8 Further experiments

-  Try using larger or smaller layers: 32 units, 128 units, and so on.
-  You used two hidden layers. Now try using a single hidden layer, or three hidden
layers.

### 3.5.9 Wrapping up

-  If you’re trying to classify data points among N classes, your network should end
with a Dense layer of size N.
- In a single-label, multiclass classification problem, your network should end
with a softmax activation so that it will output a probability distribution over the
N output classes.
- Categorical crossentropy is almost always the loss function you should use for
such problems. It minimizes the distance between the probability distributions
output by the network and the true distribution of the targets.
- There are two ways to handle labels in multiclass classification:<br>
  – Encoding the labels via categorical encoding (also known as one-hot encoding)
and using categorical_crossentropy as a loss function<br>
   – Encoding the labels as integers and using the sparse_categorical_crossentropy
loss function
- If you need to classify data into a large number of categories, you should avoid
creating information bottlenecks in your network due to intermediate layers
that are too small. 

___
# 3.6 Predicting house prices: a regression example
___

<b>NOTE</b> Don’t confuse regression with the logistic regression algorithm. Confusingly,
logistic regression isn’t a regression algorithm; it’s a classification
algorithm.

### 3.6.1 The Boston Housing Price dataset

You’ll attempt to predict the median price of homes in a given Boston suburb in the
mid-1970s, given data points about the suburb at the time, such as the crime rate, the
local property tax rate, and so on. The dataset you’ll use has an interesting difference
from the two previous examples. It has relatively few data points: only 506, split
between 404 training samples and 102 test samples. And each feature in the input data
(for example, the crime rate) has a different scale. For instance, some values are proportions,
which take values between 0 and 1; others take values between 1 and 12, others
between 0 and 100, and so on.

### <span style='color:white;background-color:skyblue;padding:10px;'> Listing 3.24 Loading the Boston housing dataset</span>

In [0]:
from keras.datasets import boston_housing
(train_data, train_targets), (test_data, test_targets) = boston_housing.load_data()

In [0]:
train_data.shape

In [0]:
test_data.shape

As you can see, you have 404 training samples and 102 test samples, each with 13
numerical features, such as per capita crime rate, average number of rooms per dwelling,
accessibility to highways, and so on.<br>
 The targets are the median values of owner-occupied homes, in thousands of
dollars:

In [0]:
train_targets

### 3.6.2 Preparing the data

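Feeding a neural network features that take wildly different ranges makes learning
harder. A widespread best practice is feature-wise normalization: for each feature in the input
data (a column in the input data matrix), you subtract the mean of the feature and
divide by the standard deviation, so that the feature is centered around 0 and has a
unit standard deviation. This is easily done in NumPy.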
### <span style='color:white;background-color:skyblue;padding:10px;'>Listing 3.25 Normalizing the data</span>

In [0]:
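# Compute the statistics on the training data only.
mean = train_data.mean(axis=0)
train_data -= mean
std = train_data.std(axis=0)
train_data /= std
# Normalize the test data with the training mean and std, never its own.
test_data -= mean
test_data /= std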
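
As a quick sanity check of the recipe, the sketch below applies the same normalization to synthetic data whose three columns mimic the ranges mentioned earlier (0 to 1, 1 to 12, 0 to 100). The arrays are randomly generated for illustration and are not the Boston housing data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic features on very different scales (illustrative only).
train = np.column_stack([rng.uniform(0, 1, size=200),
                         rng.uniform(1, 12, size=200),
                         rng.uniform(0, 100, size=200)])
test = np.column_stack([rng.uniform(0, 1, size=50),
                        rng.uniform(1, 12, size=50),
                        rng.uniform(0, 100, size=50)])

# Statistics come from the training set only.
mean = train.mean(axis=0)
std = train.std(axis=0)
train_norm = (train - mean) / std
test_norm = (test - mean) / std  # reuses the training statistics

print(np.allclose(train_norm.mean(axis=0), 0.0))  # True
print(np.allclose(train_norm.std(axis=0), 1.0))   # True
# The test columns end up only approximately centered, which is expected.
print(test_norm.mean(axis=0))
```

Computing the mean and standard deviation on the test set would leak information about the test data into the workflow, which is why the training statistics are reused.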

### 3.6.3 Building your network

Because so few samples are available, you’ll use a very small network with two hidden
layers, each with 64 units. In general, the less training data you have, the worse overfitting
will be, and using a small network is one way to mitigate overfitting.

### <span style='color:white;background-color:skyblue;padding:10px;'>Listing 3.26 Model definition</span>

In [0]:
from keras import models
from keras import layers

def build_model():
    # Wrapped in a function because the same model is instantiated several times.
    model = models.Sequential()
    model.add(layers.Dense(64, activation='relu',
                           input_shape=(train_data.shape[1],)))
    model.add(layers.Dense(64, activation='relu'))
    # Single linear unit (no activation), so the output can take any value.
    model.add(layers.Dense(1))
    model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])
    return model
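
The model is compiled with the `mse` loss and the `mae` metric. As a small worked example of what these two quantities measure, the sketch below computes them by hand in NumPy. The targets and predictions are made up, in thousands of dollars like the dataset's targets.

```python
import numpy as np

# Made-up targets and predictions, in thousands of dollars.
targets = np.array([20.0, 30.0, 25.0])
predictions = np.array([22.0, 27.0, 25.0])

errors = predictions - targets              # [ 2., -3.,  0.]
mse = float(np.mean(errors ** 2))           # (4 + 9 + 0) / 3
mae = float(np.mean(np.abs(errors)))        # (2 + 3 + 0) / 3

print(round(mse, 3))  # 4.333
print(round(mae, 3))  # 1.667
```

An MAE of 1.667 here would mean the predictions are off by about $1,667 on average. MSE penalizes large errors more heavily because each error is squared.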

### 3.6.4 Validating your approach using K-fold validation

To evaluate your network while you keep adjusting its parameters (such as the number
of epochs used for training), you could split the data into a training set and a validation
set, as you did in the previous examples. But because you have so few data points,
the validation set would end up being very small (for instance, about 100 examples).
As a consequence, the validation scores might change a lot depending on which data
points you chose to use for validation and which you chose for training: the validation
scores might have a high variance with regard to the validation split. This would prevent
you from reliably evaluating your model.<br>
 The best practice in such situations is to use K-fold cross-validation (see figure 3.11).
It consists of splitting the available data into K partitions (typically K = 4 or 5), instantiating
K identical models, and training each one on K – 1 partitions while evaluating on
the remaining partition. The validation score for the model used is then the average of
the K validation scores obtained. In terms of code, this is straightforward.
<img src='images/f3.11.png'>
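
Before looking at the Keras listing, it may help to see the partitioning on its own. The sketch below only computes which sample indices land in each fold; `kfold_slices` is a hypothetical helper written for this example, not part of Keras or NumPy.

```python
import numpy as np

def kfold_slices(num_samples, k):
    """Yield (train_indices, val_indices) for k contiguous folds.

    If num_samples is not divisible by k, the leftover samples at the end
    are never used for validation and always stay in the training part.
    """
    fold_size = num_samples // k
    indices = np.arange(num_samples)
    for i in range(k):
        val = indices[i * fold_size:(i + 1) * fold_size]
        train = np.concatenate([indices[:i * fold_size],
                                indices[(i + 1) * fold_size:]])
        yield train, val

folds = list(kfold_slices(404, 4))
for i, (train_idx, val_idx) in enumerate(folds):
    # With 404 samples and K = 4: 303 training and 101 validation samples per fold.
    print(i, len(train_idx), len(val_idx))
```

Each sample is used for validation exactly once across the four folds, which is what makes the averaged score less dependent on any single split.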

### <span style='color:white;background-color:skyblue;padding:10px;'>Listing 3.27 K-fold validation</span>

In [0]:
import numpy as np
k = 4
num_val_samples = len(train_data) // k
num_epochs = 100
all_scores = []

In [0]:
for i in range(k):
    print('processing fold #', i)
    val_data = train_data[i * num_val_samples: (i + 1) * num_val_samples]
    val_targets = train_targets[i * num_val_samples: (i + 1) * num_val_samples]
    partial_train_data = np.concatenate(
        [train_data[:i * num_val_samples],
         train_data[(i + 1) * num_val_samples:]],
        axis=0)
    partial_train_targets = np.concatenate(
        [train_targets[:i * num_val_samples],
         train_targets[(i + 1) * num_val_samples:]],
        axis=0)
    model = build_model()
    model.fit(partial_train_data, partial_train_targets,
              epochs=num_epochs, batch_size=1, verbose=0)
    val_mse, val_mae = model.evaluate(val_data, val_targets, verbose=0)
    all_scores.append(val_mae)

In [0]:
all_scores

In [0]:
np.mean(all_scores)

The different runs do indeed show rather different validation scores, from 2.6 to 3.2.
The average (3.0) is a much more reliable metric than any single score—that’s the
entire point of K-fold cross-validation. In this case, you’re off by $3,000 on average,
which is significant considering that the prices range from $10,000 to $50,000.
Let’s try training the network a bit longer: 500 epochs. To keep a record of how
well the model does at each epoch, you’ll modify the training loop to save the per-epoch
validation score log.

### <span style='color:white;background-color:skyblue;padding:10px;'>Listing 3.28 Saving the validation logs at each fold</span>

In [0]:
num_epochs = 500
all_mae_histories = []
for i in range(k):
    print('processing fold #', i)
    val_data = train_data[i * num_val_samples: (i + 1) * num_val_samples]
    val_targets = train_targets[i * num_val_samples: (i + 1) * num_val_samples]
    partial_train_data = np.concatenate(
        [train_data[:i * num_val_samples],
         train_data[(i + 1) * num_val_samples:]],
        axis=0)
    
    partial_train_targets = np.concatenate(
        [train_targets[:i * num_val_samples],
         train_targets[(i + 1) * num_val_samples:]],
        axis=0)
    model = build_model()

    history = model.fit(partial_train_data, partial_train_targets,
                        validation_data=(val_data, val_targets),
                        epochs=num_epochs, batch_size=1, verbose=0)
    # Key name depends on the Keras version: newer releases typically report 'val_mae'.
    mae_history = history.history['val_mean_absolute_error']
    all_mae_histories.append(mae_history)

You can then compute the average of the per-epoch MAE scores for all folds.


### <span style='color:white;background-color:skyblue;padding:10px;'>Listing 3.29 Building the history of successive mean K-fold validation scores</span>

In [0]:
average_mae_history = [
    np.mean([x[i] for x in all_mae_histories]) for i in range(num_epochs)]
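
The list comprehension averages the folds epoch by epoch. An equivalent formulation, sketched below on made-up numbers, stacks the histories into a 2D array of shape (folds, epochs) and takes the mean along axis 0. The values in `all_mae_histories` here are invented for illustration.

```python
import numpy as np

# Hypothetical per-epoch validation MAE for 3 folds and 4 epochs (made-up numbers).
all_mae_histories = [
    [4.0, 3.0, 2.5, 2.4],
    [5.0, 3.5, 2.7, 2.6],
    [4.5, 3.1, 2.9, 2.8],
]
num_epochs = 4

# Version used in the listing: one mean per epoch, taken across folds.
list_version = [
    np.mean([x[i] for x in all_mae_histories]) for i in range(num_epochs)]

# Array version: rows are folds, columns are epochs, so average over axis 0.
array_version = np.mean(np.array(all_mae_histories), axis=0)

print(np.allclose(list_version, array_version))  # True
print(array_version.shape)                       # (4,)
```

The array version assumes every fold was trained for the same number of epochs, which holds in the loop above.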

Let’s plot this; see figure 3.12.

### <span style='color:white;background-color:skyblue;padding:10px;'>Listing 3.30 Plotting validation scores</span>

In [0]:
import matplotlib.pyplot as plt
plt.plot(range(1, len(average_mae_history) + 1), average_mae_history)
plt.xlabel('Epochs')
plt.ylabel('Validation MAE')
plt.show()


To make the plot easier to read:

-  Omit the first 10 data points, which are on a different scale than the rest of the curve.
-  Replace each point with an exponential moving average of the previous points,
to obtain a smooth curve.

### <span style='color:white;background-color:skyblue;padding:10px;'>Listing 3.31 Plotting validation scores, excluding the first 10 data points</span>

In [0]:
def smooth_curve(points, factor=0.9):
    smoothed_points = []
    for point in points:
        if smoothed_points:
            previous = smoothed_points[-1]
            smoothed_points.append(previous * factor + point * (1 - factor))
        else:
            smoothed_points.append(point)
    return smoothed_points

smooth_mae_history = smooth_curve(average_mae_history[10:])
plt.plot(range(1, len(smooth_mae_history) + 1), smooth_mae_history)
plt.xlabel('Epochs')
plt.ylabel('Validation MAE')
plt.show()
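
To see what the smoothing does, here is `smooth_curve` applied to a tiny made-up sequence. The first point is kept as is; each later point becomes `previous * factor + point * (1 - factor)`. A factor of 0.5 is used here (instead of the default 0.9) only so the arithmetic is easy to follow by hand.

```python
def smooth_curve(points, factor=0.9):
    """Exponential moving average, as defined in the listing above."""
    smoothed_points = []
    for point in points:
        if smoothed_points:
            previous = smoothed_points[-1]
            smoothed_points.append(previous * factor + point * (1 - factor))
        else:
            smoothed_points.append(point)
    return smoothed_points

# Made-up values, chosen so each step can be checked by hand.
smoothed = smooth_curve([1.0, 2.0, 3.0, 4.0], factor=0.5)
# 1.0
# 1.0  * 0.5 + 2.0 * 0.5 = 1.5
# 1.5  * 0.5 + 3.0 * 0.5 = 2.25
# 2.25 * 0.5 + 4.0 * 0.5 = 3.125
print(smoothed)  # [1.0, 1.5, 2.25, 3.125]
```

A factor closer to 1 gives more weight to the running average and less to each new point, which yields a smoother but more delayed curve.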

 Once you’re finished tuning other parameters of the model (in addition to the
number of epochs, you could also adjust the size of the hidden layers), you can train a
final production model on all of the training data, with the best parameters, and then
look at its performance on the test data.

### <span style='color:white;background-color:skyblue;padding:10px;'>Listing 3.32 Training the final model</span>

In [0]:
model = build_model()
model.fit(train_data, train_targets,
          epochs=80, batch_size=16, verbose=0)
test_mse_score, test_mae_score = model.evaluate(test_data, test_targets)

Here’s the final result:

In [0]:
test_mae_score

You’re still off by about $2,550.

### 3.6.5 Wrapping up

Here’s what you should take away from this example:

- Regression is done using different loss functions than what we used for classification.
Mean squared error (MSE) is a loss function commonly used for regression.
- Similarly, evaluation metrics to be used for regression differ from those used for
classification; naturally, the concept of accuracy doesn’t apply for regression. A
common regression metric is mean absolute error (MAE).
- When features in the input data have values in different ranges, each feature
should be scaled independently as a preprocessing step.
- When there is little data available, using K-fold validation is a great way to reliably
evaluate a model.
- When little training data is available, it’s preferable to use a small network with
few hidden layers (typically only one or two), in order to avoid severe overfitting.

# Chapter summary

<div style='background-color:#ccc; padding:20px'>
<li> You’re now able to handle the most common kinds of machine-learning
tasks on vector data: binary classification, multiclass classification, and scalar
regression. The “Wrapping up” sections earlier in the chapter summarize
the important points you’ve learned regarding these types of tasks.
<li> You’ll usually need to preprocess raw data before feeding it into a neural
network.
<li> When your data has features with different ranges, scale each feature
independently as part of preprocessing.
<li> As training progresses, neural networks eventually begin to overfit and
obtain worse results on never-before-seen data.
<li> If you don’t have much training data, use a small network with only one or
two hidden layers, to avoid severe overfitting.
<li> If your data is divided into many categories, you may cause information
bottlenecks if you make the intermediate layers too small.
<li> Regression uses different loss functions and different evaluation metrics
than classification.
<li> When you’re working with little data, K-fold validation can help reliably
evaluate your model.

</div>