training a neural network revolves around the following objects:
- the 'layers' - which are combined into a network (or model)
- the 'input data' and corresponding 'targets'
- the 'loss function' - which defines the feedback signal used for learning
- the 'optimiser' - which determines how learning proceeds
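
how these four objects fit together can be sketched as a toy training loop in plain Python. this is not Keras code: the 'network' is a single weight `w`, the loss is mean squared error, and the optimiser is plain gradient descent. the data, the learning rate and all names (`predict`, `mse_loss`, `mse_gradient`, `train`) are assumptions made up for this sketch.

```python
# toy input data and targets: the targets follow y = 3 * x (assumed for this sketch)
inputs = [1.0, 2.0, 3.0, 4.0]
targets = [3.0, 6.0, 9.0, 12.0]


def predict(w, xs):
    """the 'network': maps input data to predictions using one weight."""
    return [w * x for x in xs]


def mse_loss(preds, ys):
    """the loss function: compares predictions to targets."""
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)


def mse_gradient(w, xs, ys):
    """gradient of the mean squared error with respect to the weight w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(ys)


def train(w=0.0, learning_rate=0.01, steps=200):
    """the optimiser: repeatedly nudges w against the loss gradient."""
    history = []
    for _ in range(steps):
        history.append(mse_loss(predict(w, inputs), targets))
        w -= learning_rate * mse_gradient(w, inputs, targets)
    return w, history


final_w, loss_history = train()
print(final_w, loss_history[0], loss_history[-1])
```

the loss shrinks at every step and `w` approaches 3, the value that generated the targets.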

![Layer Loss Optimiser](./layer_loss_opt.png)

"the network, composed of layers that are chained together, maps the input data to predictions. the loss function then compares these predictions to targets, producing a loss value: a measure of how well the network's predictions match what was expected. the optimiser uses this loss value to update the network's weights"

##### Layers: the Building Blocks of Deep Learning

layer, the fundamental data structure in neural networks

a layer is a data-processing module that takes as input one or more tensors and that outputs one or more tensors

some layers are stateless, but more frequently layers have a state: the layer's weights, one or several tensors learned with stochastic gradient descent, which together contain the network's 'knowledge'
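
the difference between a stateless and a stateful layer can be sketched with NumPy. this is an illustration only, not how Keras implements layers: `relu` and `ToyDense` are hypothetical names, and the random initialisation stands in for weights that would normally be learned.

```python
import numpy as np


def relu(x):
    """stateless layer: the output depends only on the input tensor."""
    return np.maximum(x, 0.0)


class ToyDense:
    """stateful layer: owns weight tensors (kernel and bias)."""

    def __init__(self, input_dim, units, seed=0):
        rng = np.random.default_rng(seed)
        # in real training these would be updated by gradient descent
        self.kernel = rng.normal(scale=0.1, size=(input_dim, units))
        self.bias = np.zeros(units)

    def __call__(self, x):
        # (samples, input_dim) @ (input_dim, units) -> (samples, units)
        return x @ self.kernel + self.bias


batch = np.ones((4, 3))        # 4 samples, 3 features each
dense = ToyDense(input_dim=3, units=2)
output = relu(dense(batch))    # tensor in, tensor out
print(output.shape)
```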

different layers are appropriate for different tensor formats and different types of data processing
- simple vector data, stored in 2D tensors of shape (samples, features), is often processed by 'densely connected layers', also called 'fully connected' or 'dense layers'
- sequence data, stored in 3D tensors of shape (samples, timesteps, features), is typically processed by 'recurrent layers' such as an LSTM layer
- image data, stored in 4D tensors, is usually processed by 2D convolution layers (Conv2D)
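
the three tensor formats above can be made concrete with NumPy arrays. the sizes are arbitrary, and the image layout (samples, height, width, channels) is an assumption: it is the channels-last convention, but a channels-first layout also exists.

```python
import numpy as np

# vector data: 100 samples, 20 features each -> 2D tensor
vector_data = np.zeros((100, 20))

# sequence data: 100 samples, 50 timesteps, 8 features per timestep -> 3D tensor
sequence_data = np.zeros((100, 50, 8))

# image data: 100 RGB images of 28x28 pixels, channels last -> 4D tensor
image_data = np.zeros((100, 28, 28, 3))

for name, tensor in [
    ("vector", vector_data),
    ("sequence", sequence_data),
    ("image", image_data),
]:
    print(name, tensor.ndim, tensor.shape)
```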

building deep-learning models in Keras is done by clipping together compatible layers to form useful data-transformation pipelines

'layer compatibility' - every layer will only accept input tensors of a certain shape and will return output tensors of a certain shape

In [None]:
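from keras import models
from keras import layers

# a linear stack of layers
model = models.Sequential()
# first layer: expects 784 features per sample and outputs 32
model.add(layers.Dense(32, input_shape=(784,)))
# second layer: no input_shape given, so it is inferred from the layer above
model.add(layers.Dense(32))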

the first layer will only accept as input 2D tensors whose first non-batch dimension (axis 1) is 784; axis 0, the batch dimension, is unspecified, and thus any value would be accepted. the layer will return a tensor in which that dimension has been transformed to be 32

thus this layer can only be connected to a downstream layer that expects 32-dimensional vectors as its input

layers in Keras are built to dynamically match the shape of the incoming layer. the second layer did not receive an input shape argument - instead, it automatically inferred its input shape as being the output shape of the layer that came before
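
the idea of layer compatibility and shape inference can be mimicked with a small shape-only toy. this is not how Keras is implemented; `ShapeOnlyDense`, `ToySequential` and `input_dim` are hypothetical names used only to show a later layer picking up its input size from the layer before it.

```python
class ShapeOnlyDense:
    """toy stand-in for a dense layer: tracks shapes only, holds no weights."""

    def __init__(self, units, input_dim=None):
        self.units = units          # size of the output feature axis
        self.input_dim = input_dim  # None means: infer it when added to a model


class ToySequential:
    """toy linear stack that checks layer compatibility in add()."""

    def __init__(self):
        self.layers = []

    def add(self, layer):
        if not self.layers:
            if layer.input_dim is None:
                raise ValueError("the first layer needs an explicit input_dim")
        else:
            previous_units = self.layers[-1].units
            if layer.input_dim is None:
                # infer the input size from the previous layer's output
                layer.input_dim = previous_units
            elif layer.input_dim != previous_units:
                raise ValueError(
                    f"layer expects {layer.input_dim} features, "
                    f"but the previous layer outputs {previous_units}"
                )
        self.layers.append(layer)

    def shapes(self):
        """(input_shape, output_shape) per layer; None is the batch axis."""
        return [((None, l.input_dim), (None, l.units)) for l in self.layers]


model = ToySequential()
model.add(ShapeOnlyDense(32, input_dim=784))
model.add(ShapeOnlyDense(32))  # input_dim is filled in as 32
print(model.shapes())
```

in this toy, adding a layer that explicitly declares an incompatible input size raises a `ValueError`, which mirrors the rule that a layer only accepts input tensors of a certain shape.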

##### Models: Networks of Layers

a deep learning model is a directed, acyclic graph of layers. the most common instance is a linear stack of layers, mapping a single input to a single output

other common network topologies include:
- two-branch networks
- multihead networks
- inception blocks

the topology of a network defines a 'hypothesis space'. we defined machine learning as "searching for useful representations of some input data, within a predefined space of possibilities, using guidance from a feedback signal". by choosing a network topology, you constrain your 'space of possibilities' (hypothesis space) to a specific series of tensor operations, mapping input data to output data. what you will then be searching for is a good set of values for the weight tensors involved in these tensor operations

picking the right network architecture is more an art than a science

##### Loss Function and Optimisers: Keys to Configuring the Learning Process

once the network architecture is defined, you still have to choose two more things:
- loss function (objective function) -- the quantity that will be minimised during training; it represents a measure of success for the task at hand
- optimiser -- determines how the network will be updated based on the loss function; it implements a specific variant of stochastic gradient descent (SGD)

a neural network that has multiple outputs may have multiple loss functions (one per output), but the gradient-descent process must be based on a single scalar loss value; so for multiloss networks, all losses are combined (via averaging) into a single scalar quantity
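
combining several per-output losses into one scalar can be sketched as below. the plain average follows the description above; the optional `weights` argument is an assumption added for illustration (a weighted combination is another common choice), and `combine_losses` is a hypothetical helper, not a Keras function.

```python
def combine_losses(losses, weights=None):
    """average per-output loss values into a single scalar.

    with weights=None this is a plain average; otherwise a weighted average.
    """
    if not losses:
        raise ValueError("at least one loss value is required")
    if weights is None:
        weights = [1.0] * len(losses)
    if len(weights) != len(losses):
        raise ValueError("need exactly one weight per loss")
    weighted_sum = sum(w * l for w, l in zip(weights, losses))
    return weighted_sum / sum(weights)


# two outputs, each with its own loss value (made-up numbers)
per_output_losses = [1.0, 0.5]
plain_average = combine_losses(per_output_losses)
weighted_average = combine_losses(per_output_losses, weights=[3.0, 1.0])
print(plain_average, weighted_average)
```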

choosing the right objective function for the right problem is extremely important: if the objective doesn't fully correlate with success for the task at hand, the network will end up doing things you may not have wanted

choose the objective wisely, or you will have to face unintended side effects

there are simple guidelines for choosing the correct loss when it comes to common problems such as classification, regression, and sequence prediction:
- binary crossentropy for a two-class classification problem
- categorical crossentropy for a many-class classification problem
- mean-squared error for a regression problem
- connectionist temporal classification (CTC) for a sequence-learning problem
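
the first three losses in the list can be written out in plain Python to show what each one measures. this is a minimal sketch rather than the Keras implementations: the function names mirror the loss names above, the `EPSILON` clipping value is an assumption, and CTC is left out because it is considerably more involved.

```python
import math

EPSILON = 1e-7  # assumed clipping value; keeps log() away from 0


def _clip(p):
    """clip a probability into [EPSILON, 1 - EPSILON]."""
    return min(max(p, EPSILON), 1.0 - EPSILON)


def binary_crossentropy(y_true, y_pred):
    """y_true: 0/1 labels; y_pred: predicted probabilities of class 1."""
    terms = [
        -(t * math.log(_clip(p)) + (1 - t) * math.log(1.0 - _clip(p)))
        for t, p in zip(y_true, y_pred)
    ]
    return sum(terms) / len(terms)


def categorical_crossentropy(y_true, y_pred):
    """y_true: one-hot rows; y_pred: rows of class probabilities."""
    per_sample = [
        -sum(t * math.log(_clip(p)) for t, p in zip(true_row, pred_row))
        for true_row, pred_row in zip(y_true, y_pred)
    ]
    return sum(per_sample) / len(per_sample)


def mean_squared_error(y_true, y_pred):
    """average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


print(binary_crossentropy([1, 0], [0.9, 0.2]))
print(categorical_crossentropy([[0, 1, 0]], [[0.1, 0.8, 0.1]]))
print(mean_squared_error([1.0, 2.0], [1.0, 4.0]))
```

in all three cases a lower value means the predictions are closer to the targets, which is what makes them usable as quantities to minimise.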