# CHAPTER - 20: Neural Networks

At the heart of a neural network is the unit (also called a node or neuron).

A unit takes one or more inputs, multiplies each input by a parameter (called a weight), sums the weighted input values along with a bias value, and then feeds the result into an activation function. The output is then sent forward to other neurons deeper in the network.
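As a minimal sketch of what a single unit computes, the snippet below uses plain Python. The input values, weights, and bias are hypothetical, and the sigmoid is just one possible activation function chosen for illustration.

```python
import math

def unit_output(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through a sigmoid activation."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-weighted_sum))

# Hypothetical values chosen only for illustration
inputs = [0.5, -1.0, 2.0]
weights = [0.1, 0.4, -0.2]
bias = 0.05

output = unit_output(inputs, weights, bias)
print(output)
```

The printed value is what this unit would send forward to the units in the next layer.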

Feedforward neural networks, also called multilayer perceptrons, are the simplest artificial neural networks used in any real-world setting.

A neural network can be visualized as a series of connected layers that form a network connecting an observation's feature values at one end and the target value at the other end.

The name feedforward comes from the fact that an observation's feature values are fed "forward" through the network, with each layer successively transforming the feature values with the goal that the output at the end is the same as the target's value.

Feedforward neural networks contain three types of layers of units:
1. Input layer, at the start of the network (if an observation has 100 features, the input layer has 100 nodes).
2. Output layer, at the end of the network. It transforms the output of the hidden layers into values useful for the task at hand (for binary classification, for example, it scales its output to between 0 and 1).
3. Hidden layers, between the input and output layers (which aren't hidden at all). They transform the input feature values and pass the result on to the output layer.
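The three layer types can be sketched in NumPy as a single forward pass. This is an illustration under assumptions: the 16-unit hidden layer, the ReLU and sigmoid activations, and the random values are hypothetical choices, not something fixed by the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Input layer: one observation with 100 features (hypothetical random values)
observation = rng.normal(size=(1, 100))

# Hidden layer: 100 inputs -> 16 units; output layer: 16 inputs -> 1 unit
W_hidden = rng.normal(scale=0.01, size=(100, 16))
b_hidden = np.zeros(16)
W_output = rng.normal(scale=0.01, size=(16, 1))
b_output = np.zeros(1)

hidden = relu(observation @ W_hidden + b_hidden)      # hidden layer transforms the features
prediction = sigmoid(hidden @ W_output + b_output)    # output layer scales to (0, 1)

print(hidden.shape)      # (1, 16)
print(prediction.shape)  # (1, 1)
```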

Neural networks with many hidden layers (e.g., 10, 100, 1000) are considered ***"deep networks"*** and their application is called deep learning.

Neural networks are typically created with all parameters initialized as small random values drawn from a Gaussian or uniform distribution. Once an observation is fed through the network, the outputted value is compared with the observation's true value using a loss function. This is called ***forward propagation***.

Next, an algorithm goes backward through the network, identifying how much each parameter contributed to the error between the predicted and true values. This is called ***backpropagation***. At each parameter, the optimization algorithm determines how much the weight should be adjusted to improve the output.

Neural networks learn by repeating this process of forward propagation and backpropagation for every observation in the training data multiple times, iteratively updating the values of the parameters.
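The loop below is a rough sketch of that learning process for the simplest possible case: a single sigmoid unit trained with per-observation gradient updates. The dataset, learning rate, and number of epochs are hypothetical, and a real network would propagate the error back through several layers rather than one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny hypothetical dataset: the target is 1 when the two features sum to a positive number
features = rng.normal(size=(200, 2))
target = (features.sum(axis=1) > 0).astype(float)

# Parameters start as small random values
weights = rng.normal(scale=0.01, size=2)
bias = 0.0
learning_rate = 0.1

def predict(x):
    return 1.0 / (1.0 + np.exp(-(x @ weights + bias)))

def binary_cross_entropy(y_true, y_pred):
    y_pred = np.clip(y_pred, 1e-12, 1 - 1e-12)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

losses = []
for epoch in range(5):
    for x, y in zip(features, target):
        y_hat = predict(x)                      # forward propagation
        error = y_hat - y                       # gradient of the loss w.r.t. the weighted sum
        weights -= learning_rate * error * x    # adjust each weight by its contribution
        bias -= learning_rate * error
    losses.append(binary_cross_entropy(target, predict(features)))

print(losses)  # loss on the training data after each pass
```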

Each time all the observations have been sent through the network is called an ***epoch***, and training typically consists of multiple epochs.

Neural networks created with Keras can be trained on both CPUs and GPUs. With larger networks and more training data, training on CPUs is significantly slower than training on GPUs.

## 20.1 Preprocessing Data for Neural Networks

Standardizing each feature using StandardScaler:

In [1]:
# Loading libraries

from sklearn import preprocessing
import numpy as np

In [2]:
# creating features
features = np.array([[-100.1, 3240.1],
                    [-200.2, -234.1],
                    [5000.5, 150.1],
                    [6000.6, -125.1],
                    [9000.9, -673.1]])

In [4]:
# Creating scaler

scaler = preprocessing.StandardScaler()

In [5]:
# transforming the features

features_standardized = scaler.fit_transform(features)

In [8]:
# show features

features_standardized

array([[-1.12541308,  1.96429418],
       [-1.15329466, -0.50068741],
       [ 0.29529406, -0.22809346],
       [ 0.57385917, -0.42335076],
       [ 1.40955451, -0.81216255]])

In [9]:
print("Mean: ", round(features_standardized[:,0].mean()))
print("Standard deviation: ", features_standardized[:,0].std())

Mean:  0
Standard deviation:  0.9999999999999999


Typically, a neural network's parameters are initialized (i.e., created) as small random numbers. Neural networks often behave poorly when the feature values are much larger than the parameter values. In addition, since an observation's feature values are combined as they pass through individual units, it is important that all features have the same scale. For these reasons, it is best practice to standardize each feature so that its values have a mean of 0 and a standard deviation of 1.
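For reference, the same standardization can be written out by hand in NumPy. This sketch assumes the population standard deviation (NumPy's default, `ddof=0`), which is what reproduces the values shown above.

```python
import numpy as np

features = np.array([[-100.1, 3240.1],
                     [-200.2, -234.1],
                     [5000.5, 150.1],
                     [6000.6, -125.1],
                     [9000.9, -673.1]])

# Column-wise mean and population standard deviation
mean = features.mean(axis=0)
std = features.std(axis=0)

# Subtract each feature's mean and divide by its standard deviation
features_standardized = (features - mean) / std

print(features_standardized.mean(axis=0))
print(features_standardized.std(axis=0))
```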

## 20.2 Designing a Neural Network

Designing a neural network using Keras's Sequential model:

In [1]:
# load the libraries

from keras import models
from keras import layers

In [3]:
# starting a neural network

network = models.Sequential()

In [4]:
# Adding fully connected layer with a ReLU activation function

network.add(layers.Dense(units=16, activation="relu", input_shape=(10,)))

In [5]:
# Adding fully connected layer with a ReLU activation function

network.add(layers.Dense(units=16, activation="relu"))

In [6]:
# Adding fully connected layer with sigmoid activation function

network.add(layers.Dense(units=1, activation="sigmoid"))

In [7]:
# compiling the neural network

network.compile(loss = "binary_crossentropy",  # Cross-entropy
               optimizer = "rmsprop", # Root Mean Square Propagation
               metrics=["accuracy"]) #Accuracy performance metric

The code above uses Keras's Sequential model to build a network with two hidden layers and an output layer. When counting layers we don't include the input layer, because it does not have any parameters to learn.

Each layer is "dense" (also called fully connected), meaning that all the units in the previous layer are connected to all the units in the next layer.

In the first hidden layer we set units=16, meaning the layer contains 16 units, each with a ReLU activation function. In Keras, the first hidden layer of a network must include an input_shape parameter, which is the shape of the feature data. In this example, input_shape=(10,) tells the first layer to expect each observation to have 10 feature values.

The second layer is the same as the first, but without the input_shape parameter.

The network is designed for binary classification, so the output layer has only one unit with a sigmoid activation function, which constrains the output to between 0 and 1.
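As a quick sanity check on this architecture, the number of learnable parameters in each dense layer can be worked out by hand: one weight per input per unit, plus one bias per unit. The helper name below is hypothetical.

```python
def dense_parameter_count(n_inputs, n_units):
    """Each unit has one weight per input plus one bias."""
    return n_inputs * n_units + n_units

# (inputs, units) for the two hidden layers and the output layer
layer_sizes = [(10, 16), (16, 16), (16, 1)]

counts = [dense_parameter_count(n_in, n_out) for n_in, n_out in layer_sizes]
total = sum(counts)

print(counts)  # [176, 272, 17]
print(total)   # 465
```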

Finally, before we can train the model, we tell Keras how we want the network to learn by using the compile method, with an optimization algorithm (RMSProp), a loss function (binary_crossentropy), and one or more performance metrics.

There is a lot of variety in the types of layers and in how they are combined to form the network's architecture.

Selecting the right architecture is mostly an art and a topic of ongoing research.

For a ***feedforward neural network*** in Keras, we need to make a number of choices about:
1. The network architecture, and
2. The training process.

Each ***unit*** in the hidden layers:
1. Receives a number of inputs.
2. Weights each input by a parameter value.
3. Sums together all weighted inputs along with some bias (typically 1).
4. Most often then applies some function (called an activation function).
5. Sends the output on to units in the next layer.

I:

1. For each hidden and output layer, we must define the number of units to include in the layer and the activation function.
2. The more units we have in a layer, the more complex the patterns our network is able to learn.
3. However, more units might make the network overfit the training data in a way that is detrimental to performance on the test data.

For hidden layers, the activation function used here is the rectified linear unit (ReLU).
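ReLU itself is a very small function: it passes positive values through unchanged and replaces negative values with zero. A minimal sketch in plain Python, with hypothetical input values:

```python
def relu(z):
    """Rectified linear unit: max(0, z)."""
    return max(0.0, z)

outputs = [relu(z) for z in [-2.0, -0.5, 0.0, 1.5, 3.0]]
print(outputs)  # [0.0, 0.0, 0.0, 1.5, 3.0]
```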

II:

1. We then define the number of hidden layers to use in the network.
2. More layers allow the network to learn more complex relationships.

III:

We have to define the structure of the output layer and its activation function. Common output layer patterns are:
1. Binary classification: one unit with a sigmoid activation function.
2. Multiclass classification: k units and a softmax activation function.
3. Regression: one unit with no activation function.
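To make the two classification activations concrete, here is a small NumPy sketch. The raw output values are hypothetical, and subtracting the maximum inside the softmax is a common numerical-stability choice rather than part of its definition.

```python
import numpy as np

def sigmoid(z):
    """Squashes a single raw output into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Turns k raw outputs into k probabilities that sum to 1."""
    shifted = z - np.max(z)   # subtracting the max avoids overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum()

raw_outputs = np.array([2.0, 1.0, 0.1])   # hypothetical outputs of k = 3 units
class_probabilities = softmax(raw_outputs)
binary_probability = sigmoid(0.0)

print(class_probabilities, class_probabilities.sum())
print(binary_probability)  # 0.5
```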

IV:

We then define a loss function, depending on the problem type:
1. Binary classification: binary cross-entropy
2. Multiclass classification: categorical cross-entropy
3. Regression: mean squared error
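The three loss functions can be sketched in NumPy as follows. These are textbook definitions written for illustration, not Keras's own implementations; the clipping constant `eps` and the example values are assumptions.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    y_pred = np.clip(y_pred, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    # y_true is one-hot encoded, y_pred holds one probability per class
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

def mean_squared_error(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

bce = binary_cross_entropy(np.array([1.0, 0.0]), np.array([0.9, 0.2]))
cce = categorical_cross_entropy(np.array([[0.0, 1.0, 0.0]]), np.array([[0.1, 0.8, 0.1]]))
mse = mean_squared_error(np.array([3.0, 5.0]), np.array([2.0, 7.0]))

print(bce, cce, mse)
```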

V:

Define an optimizer, which can be thought of as a strategy for "walking around" the loss function to find the parameter values that produce the lowest error. Common optimizers are stochastic gradient descent, stochastic gradient descent with momentum, root mean square propagation (RMSProp), and adaptive moment estimation (Adam).

VI:

We can select one or more metrics to evaluate performance, such as accuracy.
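As a toy illustration of "walking around" a loss function, the sketch below minimizes a one-parameter loss with plain gradient descent and with momentum. The loss (w - 3)^2, the learning rate, and the momentum value are all hypothetical, and no mini-batches are involved, so this shows only the update rules, not the stochastic part.

```python
def gradient(w):
    """Gradient of the toy loss (w - 3)^2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)

def gradient_descent(w, learning_rate=0.1, steps=300):
    for _ in range(steps):
        w -= learning_rate * gradient(w)   # step directly against the gradient
    return w

def gradient_descent_momentum(w, learning_rate=0.1, momentum=0.9, steps=300):
    velocity = 0.0
    for _ in range(steps):
        # keep a fraction of the previous step and add the new gradient step
        velocity = momentum * velocity - learning_rate * gradient(w)
        w += velocity
    return w

w_plain = gradient_descent(0.0)
w_momentum = gradient_descent_momentum(0.0)

print(w_plain, w_momentum)  # both end up close to 3
```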


Keras offers two ways to create neural networks:
1. The Sequential model, which creates neural networks by stacking layers together.
2. The functional API.

## 20.3 Training a Binary Classifier

Training a binary classifier neural network.
Using Keras to construct a feedforward neural network and training it using the fit method:

In [8]:
# Loading libraries

import numpy as np
from keras.datasets import imdb
from keras.preprocessing.text import Tokenizer
from keras import models
from keras import layers

In [9]:
# Setting random seed

np.random.seed(0)

In [10]:
# setting the number of features we want

number_of_features = 1000

In [11]:
# loading data and target vector from movie review data

(data_train, target_train), (data_test, target_test) = imdb.load_data(num_words = number_of_features)

In [12]:
# converting movie review data to one-hot encoded feature matrix

tokenizer = Tokenizer(num_words = number_of_features)
features_train = tokenizer.sequences_to_matrix(data_train, mode = "binary")
features_test = tokenizer.sequences_to_matrix(data_test, mode = "binary")

In [13]:
# starting a neural network

network = models.Sequential()

In [14]:
# adding a fully connected layer with ReLU activation function

network.add(layers.Dense(units = 16, activation = "relu", input_shape = (number_of_features,)))

In [15]:
# adding fully connected layer with a ReLU activation function

network.add(layers.Dense(units = 16, activation = "relu"))

In [16]:
# adding fully connected layer with a sigmoid activation function

network.add(layers.Dense(units = 1, activation = "sigmoid"))

In [17]:
network.compile(loss = "binary_crossentropy", # Cross-entropy
               optimizer = "rmsprop", # Root Mean Square Propagation
               metrics = ["accuracy"]) # Accuracy performance metric

In [18]:
# training a neural network

history = network.fit(features_train, # Features
                     target_train, # Target vector
                     epochs = 3, # No.of epochs
                     verbose = 1, # Print description after each epoch
                     batch_size = 100, # No.of observations per batch
                     validation_data = (features_test, target_test)) # test data

Epoch 1/3
Epoch 2/3
Epoch 3/3


Here we train the neural network on real data: 50,000 movie reviews (25,000 for training and 25,000 for testing).

We convert the text of each review into 1,000 binary features, each indicating the presence of one of the 1,000 most frequent words.

The neural network therefore uses 25,000 observations, each with 1,000 features, to predict whether a movie review is positive or negative.
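The binary encoding can be sketched by hand in NumPy. This is an illustrative re-implementation under assumptions, not the Keras Tokenizer code: the function name and the toy reviews are hypothetical, and each review is assumed to be a list of integer word indices.

```python
import numpy as np

def sequences_to_binary_matrix(sequences, number_of_features):
    """One row per review; column j is 1 if word index j appears in the review."""
    matrix = np.zeros((len(sequences), number_of_features))
    for row, sequence in enumerate(sequences):
        for word_index in sequence:
            if word_index < number_of_features:   # ignore words outside the vocabulary
                matrix[row, word_index] = 1.0
    return matrix

# Two hypothetical reviews, already converted to word indices
reviews = [[1, 4, 4, 7], [2, 9]]
encoded = sequences_to_binary_matrix(reviews, number_of_features=10)

print(encoded.shape)  # (2, 10)
print(encoded[0])
```

Note that a word appearing twice in a review (index 4 above) still produces a single 1, since the features record presence rather than counts.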

In Keras, we train our neural network using the fit method. There are six significant parameters to define. The first two are the features and the target vector of the training data.

epochs: how many epochs to use when training on the data.

verbose: how much information is output during the training process (0 is no output, 1 outputs a progress bar, and 2 prints one log line per epoch).

batch_size: the number of observations to propagate through the network before updating the parameters.
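With the values used in this example, the number of parameter updates follows from simple arithmetic. The variable names below are only for illustration.

```python
import math

n_observations = 25000   # training reviews
batch_size = 100
epochs = 3

# One parameter update per batch; the last batch may be smaller if sizes don't divide evenly
updates_per_epoch = math.ceil(n_observations / batch_size)
total_updates = updates_per_epoch * epochs

print(updates_per_epoch)  # 250
print(total_updates)      # 750
```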

Finally, we hold out a test set of data to evaluate the model. The test features and test target vector can be passed as the validation_data argument. Alternatively, we can use validation_split to define the fraction of the training data to hold out for evaluation.

In scikit-learn, the fit method returns a trained model, but in Keras the fit method returns a History object containing the loss values and performance metrics at each epoch.

In [20]:
features_train.shape

(25000, 1000)