# Better Learning

## Diagnostic Learning Curves


* Learning curves are plots that show changes in learning performance over time in terms of experience.
* Learning curves of model performance on the train and validation datasets can be used to diagnose an underfit, overfit, or well-fit model.
* Learning curves of model performance can be used to diagnose whether the train or validation dataset is relatively unrepresentative of the problem domain.

* Optimization Learning Curves: Learning curves calculated on the metric by which the parameters of the model are being optimized, e.g. loss.
* Performance Learning Curves: Learning curves calculated on the metric by which the model will be evaluated and selected, e.g. accuracy.
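The two kinds of curve can be recorded side by side during training. The sketch below is a minimal illustration, not a reference implementation: the toy dataset, the one-dimensional logistic model, and the names `train_with_history`, `make_data`, and the `history` keys are all assumptions made for this example. It stores one loss value (optimization curve) and one accuracy value (performance curve) per epoch for both the train and validation datasets.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def evaluate(w, b, data):
    """Return (mean log loss, accuracy) of a 1-D logistic model on data."""
    loss, correct = 0.0, 0
    for x, y in data:
        p = min(max(sigmoid(w * x + b), 1e-12), 1.0 - 1e-12)
        loss -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
        correct += int((p >= 0.5) == (y == 1))
    return loss / len(data), correct / len(data)


def make_data(n, rng):
    # Hypothetical toy problem: the label is 1 when a noisy copy of x is positive.
    xs = [rng.uniform(-3, 3) for _ in range(n)]
    return [(x, 1 if x + rng.gauss(0, 0.5) > 0 else 0) for x in xs]


def train_with_history(train, val, epochs=50, lr=0.5):
    w, b = 0.0, 0.0
    history = {"loss": [], "val_loss": [], "accuracy": [], "val_accuracy": []}
    for _ in range(epochs):
        # One full-batch gradient descent step on the training log loss.
        errors = [(sigmoid(w * x + b) - y, x) for x, y in train]
        w -= lr * sum(e * x for e, x in errors) / len(train)
        b -= lr * sum(e for e, _ in errors) / len(train)
        # Append one point per epoch to each of the four curves.
        for prefix, data in (("", train), ("val_", val)):
            loss, acc = evaluate(w, b, data)
            history[prefix + "loss"].append(loss)
            history[prefix + "accuracy"].append(acc)
    return history


rng = random.Random(0)
history = train_with_history(make_data(200, rng), make_data(100, rng))
print("first/last training loss:", history["loss"][0], history["loss"][-1])
```

Plotting `history["loss"]` against `history["val_loss"]` gives the optimization learning curves, and the two accuracy lists give the performance learning curves.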

## Underfit Learning Curves
* Underfitting occurs when the model is not able to obtain a sufficiently low error value on the training set.
* An underfit model can be identified from the learning curve of the training loss only. It may show a flat line or noisy values of relatively high loss, indicating that the model was unable to learn the training dataset at all.
* A plot of learning curves shows underfitting if:
  * The training loss remains flat regardless of training.
  * The training loss continues to decrease until the end of training, which suggests the model could learn further and training was halted too early.


## Training Loss Examples

![Training Loss 1](images/training-loss.png)



## Training Loss Example
![Training Loss 2](images/training-loss-2.png)
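The two underfitting signs described above (a flat training loss, or a training loss still falling when training stops) can be turned into a rough check. This is a heuristic sketch only; the function name `diagnose_training_loss` and the `window` and `tol` thresholds are assumptions chosen for illustration and would need tuning for real curves.

```python
def diagnose_training_loss(losses, window=5, tol=0.01):
    """Classify a training-loss curve as 'flat', 'still decreasing' or 'plateaued'.

    tol is a relative change: 0.01 means a 1% drop in loss.
    """
    first = max(abs(losses[0]), 1e-12)
    # Relative improvement over the whole run.
    if (losses[0] - losses[-1]) / first < tol:
        return "flat"
    # Relative improvement over the last `window` epochs only.
    recent = losses[-window:]
    if (recent[0] - recent[-1]) / max(abs(recent[0]), 1e-12) > tol:
        return "still decreasing"
    return "plateaued"


flat_curve = [2.30, 2.31, 2.29, 2.30, 2.30, 2.31, 2.30, 2.30]
falling_curve = [2.0 * 0.9 ** epoch for epoch in range(20)]
settled_curve = [2.0, 1.0, 0.5, 0.3] + [0.2] * 10

for name, curve in [("flat", flat_curve), ("falling", falling_curve), ("settled", settled_curve)]:
    print(name, "->", diagnose_training_loss(curve))
```

A result of `"plateaued"` does not by itself rule out underfitting: whether the plateau is low enough depends on the problem, which this check cannot know.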

## Overfit Learning Curves

* Overfitting refers to a model that has learned the training dataset too well, including the statistical noise or random fluctuations in the training dataset.
* Fitting a more flexible model requires estimating a greater number of parameters. These more complex models can lead to a phenomenon known as overfitting the data, which essentially means they follow the errors, or noise, too closely.
* The problem with overfitting is that the more specialized the model becomes to the training data, the less well it is able to generalize to new data, resulting in an increase in generalization error. This increase in generalization error can be measured by the performance of the model on the validation dataset.

* Overfitting often occurs if the model has more capacity than is required for the problem and, in turn, too much flexibility. It can also occur if the model is trained for too long.
* A plot of learning curves shows overfitting if:
  * The plot of training loss continues to decrease with experience.
  * The plot of validation loss decreases to a point and begins increasing again.

* The turning point in validation loss (its minimum) may be the point at which training could be halted, as experience after that point shows the dynamics of overfitting. The example plot below demonstrates a case of overfitting.

## Overfitting Curves

![Overfit learning curves](images/overfitt-img.png)
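The overfitting pattern (training loss still falling while validation loss has turned upward) and the candidate stopping point can be located programmatically. The sketch below assumes the losses are recorded once per epoch; the function names `best_epoch` and `shows_overfitting`, the `min_rise` threshold, and the example numbers are illustrative, not taken from the plot above.

```python
def best_epoch(val_losses):
    """Index of the epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__)


def shows_overfitting(train_losses, val_losses, min_rise=0.0):
    """True if validation loss rose after its minimum while training loss kept falling."""
    best = best_epoch(val_losses)
    train_kept_falling = train_losses[-1] < train_losses[best]
    val_rose = val_losses[-1] - val_losses[best] > min_rise
    return train_kept_falling and val_rose


# Made-up curves with the shape described in the text.
train_losses = [1.00, 0.70, 0.50, 0.35, 0.25, 0.18, 0.12, 0.08]
val_losses = [1.10, 0.80, 0.65, 0.60, 0.62, 0.68, 0.75, 0.85]

print("candidate stopping epoch:", best_epoch(val_losses))
print("overfitting:", shows_overfitting(train_losses, val_losses))
```

On noisy curves a single minimum can be misleading, so a nonzero `min_rise` or a smoothed validation curve may be a safer choice.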

## Good Fit Learning Curves

A good fit is the goal of the learning algorithm and exists between an overfit and underfit model. A good fit is identified by a training and validation loss that decreases to a point of stability with a minimal gap between the two final loss values.

* The plot of training loss decreases to a point of stability.
* The plot of validation loss decreases to a point of stability and has a small gap with the training loss.

## Good Fit Curves

![Good fit learning curves](images/Good-fit.png)
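The two good-fit criteria (both curves reach stability, and the final gap between them is small) can be expressed as a simple check. The thresholds `window`, `tol`, and `max_gap` below are assumptions for illustration; sensible values depend on the scale of the loss.

```python
def has_stabilized(losses, window=5, tol=0.01):
    """True if the loss varies by at most `tol` over the last `window` epochs."""
    recent = losses[-window:]
    return max(recent) - min(recent) <= tol


def looks_like_good_fit(train_losses, val_losses, max_gap=0.05, window=5, tol=0.01):
    """Both curves are stable and their final values are close together."""
    gap = abs(val_losses[-1] - train_losses[-1])
    return (
        has_stabilized(train_losses, window, tol)
        and has_stabilized(val_losses, window, tol)
        and gap <= max_gap
    )


# Made-up curves: both settle, with a final gap of about 0.04.
train_losses = [1.0, 0.6, 0.4, 0.3, 0.25, 0.222, 0.221, 0.220, 0.220, 0.220]
val_losses = [1.1, 0.7, 0.5, 0.38, 0.30, 0.262, 0.261, 0.260, 0.260, 0.260]
# Made-up validation curve that turns upward instead of settling.
rising_val = [1.1, 0.7, 0.5, 0.45, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

print(looks_like_good_fit(train_losses, val_losses))
print(looks_like_good_fit(train_losses, rising_val))
```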

## Diagnosing Unrepresentative Datasets

* An unrepresentative dataset is one that may not capture the statistical characteristics of another dataset drawn from the same domain, such as a train dataset relative to a validation dataset.
* This can commonly occur if the number of samples in a dataset is too small, relative to another dataset.

There are two common cases that could be observed:

* Training dataset is relatively unrepresentative.
* Validation dataset is relatively unrepresentative.

An unrepresentative training dataset means that the training dataset does not provide sufficient information to learn the problem, relative to the validation dataset used to evaluate it. This may occur if the training dataset has too few examples compared to the validation dataset. This situation can be identified by a learning curve for training loss that shows improvement and a learning curve for validation loss that also shows improvement, but with a large gap remaining between the two curves.

## Unrepresentative Training Dataset Curves

![Unrepresentative training dataset learning curves](images/under-rep.png)
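The effect of a too-small training dataset on the gap between training and validation loss can be reproduced in a small simulation. Everything here is an assumption made for illustration: the ground truth `y = 2x + 1` with unit Gaussian noise, the closed-form line fit, and the dataset sizes. The sketch averages the final gap (validation MSE minus training MSE) over many random draws for a tiny and a large training set.

```python
import random


def fit_line(data):
    """Closed-form least squares for y = a * x + b."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    a = sxy / sxx
    return a, mean_y - a * mean_x


def mse(model, data):
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in data) / len(data)


def sample(n, rng):
    # Hypothetical ground truth: y = 2x + 1 plus unit Gaussian noise.
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, 2 * x + 1 + rng.gauss(0, 1)) for x in xs]


def mean_gap(n_train, n_val=200, trials=200, seed=0):
    """Average of (validation MSE - training MSE) over repeated random draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        train, val = sample(n_train, rng), sample(n_val, rng)
        model = fit_line(train)
        total += mse(model, val) - mse(model, train)
    return total / trials


gap_small = mean_gap(5)    # training set far smaller than the validation set
gap_large = mean_gap(500)  # training set larger than the validation set
print("gap with 5 training examples:  ", gap_small)
print("gap with 500 training examples:", gap_large)
```

With only a handful of training examples the model fits their noise, so training loss looks better than it should while validation loss stays high, which is the large gap described above.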

## Unrepresentative Validation Dataset

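An unrepresentative validation dataset means that the validation dataset does not provide sufficient information to evaluate the ability of the model to generalize. This may occur if the validation dataset has too few examples compared to the training dataset. This situation can be identified by a training loss curve that looks like a good fit (or another fit) while the validation loss shows noisy movements around the training loss.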

## Unrepresentative Validation Dataset Curves

![Unrepresentative validation dataset learning curves](images/val-rep.png)

An unrepresentative validation dataset may also be identified by a validation loss that is lower than the training loss. In this case, the validation dataset may be easier for the model to predict than the training dataset.

![Validation loss lower than training loss](images/train-val-loss.png)
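Both symptoms of an unrepresentative validation dataset can be flagged with simple checks on the recorded curves. This is a rough sketch: counting direction changes is just one assumed way to measure how noisy a curve is, and the names `direction_changes`, `validation_flags`, and the `max_changes` threshold are hypothetical.

```python
def direction_changes(losses):
    """Count how often the curve switches between rising and falling."""
    diffs = [b - a for a, b in zip(losses, losses[1:])]
    return sum(1 for d1, d2 in zip(diffs, diffs[1:]) if d1 * d2 < 0)


def validation_flags(train_losses, val_losses, max_changes=2):
    """Flag a noisy validation curve and a validation curve below the training curve."""
    return {
        "noisy": direction_changes(val_losses) > max_changes,
        "below_train": all(v < t for t, v in zip(train_losses, val_losses)),
    }


# Made-up curves for illustration.
train_losses = [1.00, 0.80, 0.65, 0.55, 0.50, 0.47, 0.45, 0.44]
noisy_val = [0.90, 1.10, 0.60, 0.80, 0.40, 0.70, 0.35, 0.60]
easy_val = [0.90, 0.70, 0.55, 0.45, 0.40, 0.37, 0.35, 0.34]

print(validation_flags(train_losses, noisy_val))
print(validation_flags(train_losses, easy_val))
```

A validation loss below the training loss can have other causes as well, such as regularization that is active only during training, so the `below_train` flag is a prompt to investigate rather than a verdict.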

## Summary

* Learning curves are plots that show changes in learning performance over time in terms of experience.
* Learning curves of model performance on the train and validation datasets can be used to diagnose an underfit, overfit, or well-fit model.
* Learning curves of model performance can be used to diagnose whether the train or validation dataset is relatively unrepresentative of the problem domain.

## Neural Nets Learn a Mapping Function

* A neural network model uses the examples to learn how to map specific sets of input variables to the output variable. It must do this in such a way that this mapping works well for the training dataset, but also works well on new examples not seen by the model during training. This ability to work well on specific examples and new examples is called the ability of the model to generalize.
* A multilayer perceptron is just a mathematical function mapping some set of input values to output values.
* A feedforward network defines a mapping and learns the value of the parameters that result in the best function approximation.
* As such, we can describe the broader problem that neural networks solve as function approximation. They learn to approximate an unknown underlying mapping function given a training dataset.
* Training can be pictured as searching a landscape. A point on the landscape is a specific set of weights for the model, and the elevation of that point is an evaluation of the set of weights, where valleys represent good models with small values of loss. This is a common conceptualization of optimization problems, and the landscape is referred to as an error surface.
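The error surface idea can be made concrete with a model small enough to evaluate exhaustively. The sketch below assumes a two-weight linear model `w0 + w1 * x`, a mean squared error loss, and noise-free data generated from `y = 1 + 2x`; all of these are illustrative choices. Each grid point is one set of weights, and its "elevation" is the loss.

```python
def loss_at(w0, w1, data):
    """Mean squared error of the model w0 + w1 * x on the data."""
    return sum((w0 + w1 * x - y) ** 2 for x, y in data) / len(data)


# Noise-free data from the assumed target function y = 1 + 2x.
data = [(x, 1.0 + 2.0 * x) for x in (-2, -1, 0, 1, 2)]

# Candidate weight values from -4.0 to 4.0 in steps of 0.5.
grid = [i * 0.5 for i in range(-8, 9)]

# The error surface: one loss value (elevation) per point (set of weights).
surface = {(w0, w1): loss_at(w0, w1, data) for w0 in grid for w1 in grid}

lowest = min(surface, key=surface.get)
print("lowest point on the surface:", lowest, "loss:", surface[lowest])
```

A real network has far too many weights for a grid like this, which is why training searches the surface step by step instead of mapping it.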

## High-Dimensional Search Space

* The problem of navigating a high-dimensional space is that the addition of each new dimension dramatically increases the distance between points in the search space. This is often referred to as the curse of dimensionality.
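The growth in distance between points can be observed directly. The sketch below draws random points uniformly from a unit hypercube (an assumption made for illustration; real weight spaces are not uniform cubes) and measures the mean Euclidean distance between all pairs as the number of dimensions grows.

```python
import math
import random


def mean_pairwise_distance(dim, n_points=100, seed=0):
    """Mean Euclidean distance between random points in the unit hypercube."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    total, pairs = 0.0, 0
    for i in range(n_points):
        for j in range(i + 1, n_points):
            total += math.dist(points[i], points[j])
            pairs += 1
    return total / pairs


distances = {dim: mean_pairwise_distance(dim) for dim in (1, 10, 100)}
for dim, distance in distances.items():
    print(f"{dim:>4} dimensions: mean distance {distance:.2f}")
```

For this uniform setup the mean distance grows roughly with the square root of the number of dimensions, while the volume that the same number of points must cover grows exponentially, so the space becomes increasingly sparse.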

## Components of the Learning Algorithm
* Network Topology. The number of nodes (or equivalent) in the hidden layers and the number of hidden layers in the network.
* Loss Function. The function used to measure the performance of a model with a specific set of weights on examples from the training dataset.
* Weight Initialization. The procedure by which the initial small random values are assigned to model weights at the beginning of the training process.
* Batch Size. The number of examples used to estimate the error gradient before updating the model parameters.
* Learning Rate. The amount that each model parameter is updated per iteration of the learning algorithm.
* Epochs. The number of complete passes through the training dataset before the training process is terminated.
* Data Preparation. The schemes used to prepare the data prior to modeling in order to ensure that it is suitable for the problem and for developing a stable model.
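One way to keep these components together is a single configuration object. The sketch below is an assumed layout, not a standard API: the class name `TrainingConfig`, its field names, and the default values are hypothetical. It also shows how batch size and epochs combine to determine the number of weight updates.

```python
import math
from dataclasses import dataclass


@dataclass
class TrainingConfig:
    hidden_layers: tuple = (32, 32)     # network topology: nodes per hidden layer
    loss: str = "mse"                   # loss function
    weight_init: str = "small_random"   # weight initialization scheme
    batch_size: int = 32                # examples per gradient estimate
    learning_rate: float = 0.01         # size of each parameter update
    epochs: int = 100                   # complete passes over the training data
    data_prep: str = "standardize"      # data preparation scheme

    def updates_per_epoch(self, n_examples):
        # The last batch may be smaller, so round up.
        return math.ceil(n_examples / self.batch_size)

    def total_updates(self, n_examples):
        return self.epochs * self.updates_per_epoch(n_examples)


config = TrainingConfig(batch_size=64, epochs=20)
print(config.updates_per_epoch(1000))  # 1000 / 64 = 15.625, rounded up to 16
print(config.total_updates(1000))      # 20 epochs * 16 updates = 320
```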

## Configure Capacity with Nodes and Layers

* Neural network model capacity is controlled both by the number of nodes and the number of layers in the model.
* A model with a single hidden layer and a sufficient number of nodes has the capability of learning any mapping function, but the chosen learning algorithm may or may not be able to realize this capability.
* Increasing the number of layers provides a short-cut to increasing the capacity of the model with fewer resources, and modern techniques allow learning algorithms to successfully train deep models.


A model with less capacity may not be able to sufficiently learn the training dataset. A model with more capacity can represent a wider range of functions and may be able to learn a function that sufficiently maps inputs to outputs in the training dataset. A model with too much capacity, however, may memorize the training dataset and fail to generalize, or may get lost or stuck in the search for a suitable mapping function. Generally, we can think of model capacity as a control over whether the model is likely to underfit or overfit a training dataset.

* We can control whether a model is more likely to overfit or underfit by altering its capacity.

The capacity of a neural network can be controlled by two aspects of the model:

* Number of Nodes.
* Number of Layers.

## Width
* The number of nodes in a layer is referred to as the width. Developing wide networks with one layer and many nodes was relatively straightforward. In theory, a network with enough nodes in the single hidden layer can learn to approximate any mapping function, although in practice, we don’t know how many nodes are sufficient or how to train such a model.

## Depth
* The number of layers in a model is referred to as its depth. Increasing the depth increases the capacity of the model. Training deep models, e.g. those with many hidden layers, can be computationally more efficient than training a single layer network with a vast number of nodes.
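The resource cost of width versus depth can be compared by counting trainable parameters in a fully connected network. The layer sizes below are arbitrary examples, and the function name `count_parameters` is hypothetical. Note that a parameter count is only a proxy for cost; it does not by itself measure capacity.

```python
def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected network.

    layer_sizes lists the number of nodes per layer: [inputs, hidden..., outputs].
    """
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


# One wide hidden layer: (10*1000 + 1000) + (1000*1 + 1) = 12001 parameters.
wide = count_parameters([10, 1000, 1])
# Three narrower hidden layers: 550 + 2550 + 2550 + 51 = 5701 parameters.
deep = count_parameters([10, 50, 50, 50, 1])

print("wide and shallow:", wide)
print("narrow and deep: ", deep)
```

In this example the deeper network uses fewer than half the parameters of the wide one. Whether it matches the wide network's capacity on a given problem is an empirical question.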