# Deep Belief Networks

#### (Artificial) Neural Networks
- A class of mathematical models that are trained to learn and optimize a function (or distribution) over a set of input features.
- The specific objective of a given neural network application can be defined by the operator using a performance measure (typically a cost function); in this way, neural networks may be used to classify, predict, or transform their inputs.
- Composed of the following elements:
    - A learning process
        - Learns by adjusting parameters within the weight function of its nodes to minimize the cost function
    - A set of neurons or weights
        - The weight/activation function that manipulates input data - weights must be adaptive
        - Uses both visible and hidden units
    - Connectivity functions
        - Control which nodes can relay data to which other nodes
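
The learning process described above can be illustrated with a minimal sketch: a single linear neuron whose weight and bias are adjusted by gradient descent to minimize a mean squared error cost. The function name `train_linear_neuron`, the toy data, the learning rate, and the epoch count are all assumptions chosen for illustration and are not part of the original notes.

```python
def train_linear_neuron(samples, learning_rate=0.05, epochs=500):
    """Fit y ~ w * x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(samples)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in samples) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in samples) / n
        # Adjust the adaptive parameters against the gradient to lower the cost.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical toy data generated from y = 2x + 1.
samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = train_linear_neuron(samples)
```

With this data the learned `w` and `b` should approach 2 and 1. Real networks stack many such units with nonlinear activations, but the update loop has the same shape.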

### Restricted Boltzmann Machine (RBM)

#### Boltzmann Machine
- A Boltzmann machine is a particular type of stochastic, recurrent neural network.  It is an energy-based model, which means that it uses an energy function to associate an energy value with each configuration of the network.  
- A Boltzmann machine is a fully connected cyclic graph with symmetric (undirected) connections, where every node is connected to all other nodes.  This property enables it to model in a recurrent fashion, such that the model's outputs evolve and can be viewed over time.
- The learning loop in a Boltzmann machine involves maximizing the probability of the training dataset, X.  As noted, the specific performance measure used is energy, which is characterized as the negative log of the probability for dataset X, given a vector of model parameters, theta.  This measure is calculated and used to update the network's weights in such a way as to minimize the free energy in the network.
- Advantages:
    - The Boltzmann machine has seen particular success in processing image data, including photographs, facial features, and handwriting classification contexts.
- Disadvantages:
    - The Boltzmann machine is not practical for more challenging ML problems because it scales poorly: as the number of nodes increases, the compute time grows exponentially, until computing the free energy of the network becomes infeasible.
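
To make the energy function and the scaling problem concrete, here is a minimal sketch that computes the energy of each configuration of a tiny network and normalizes over all of them. It assumes binary units, symmetric pairwise weights, and the standard Boltzmann distribution (probability proportional to `exp(-energy)`); the weights, biases, and function names are hypothetical.

```python
import itertools
import math


def energy(state, weights, biases):
    """E(s) = -(sum over pairs i<j of w_ij * s_i * s_j) - (sum over i of b_i * s_i)."""
    n = len(state)
    pairwise = sum(weights[i][j] * state[i] * state[j]
                   for i in range(n) for j in range(i + 1, n))
    bias_term = sum(biases[i] * state[i] for i in range(n))
    return -(pairwise + bias_term)


def partition_function(weights, biases):
    """Sum exp(-E) over all 2**n configurations; this sum is the exponential cost."""
    n = len(biases)
    return sum(math.exp(-energy(s, weights, biases))
               for s in itertools.product((0, 1), repeat=n))


def probability(state, weights, biases):
    """Probability of one configuration under the Boltzmann distribution."""
    return math.exp(-energy(state, weights, biases)) / partition_function(weights, biases)


# Hypothetical 3-unit network; only the upper triangle of the weights is used.
weights = [[0.0, 1.0, -0.5],
           [1.0, 0.0, 0.25],
           [-0.5, 0.25, 0.0]]
biases = [0.1, -0.2, 0.0]

all_states = list(itertools.product((0, 1), repeat=3))
total = sum(probability(s, weights, biases) for s in all_states)
```

Here `partition_function` visits 8 configurations. At 30 units it would visit more than a billion, which is why exact computation stops being practical as the network grows.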

#### RBM Topology
The main topological change that delivers efficiency improvements is the restriction of connectivity between nodes.  First, one must prevent connections between nodes within the same layer.  Additionally, all skip-layer connections (that is, direct connections between non-consecutive layers) must be prevented.  A Boltzmann machine with this architecture is referred to as a Restricted Boltzmann Machine (RBM).
- One advantage of this topology is that the units within one layer are conditionally independent of each other given the other layer.  As such, it is possible to sample an entire layer at once using the activations of the other.
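
The following sketch shows what that conditional independence buys in practice: each unit's activation probability depends only on the opposite layer, so a whole layer can be sampled in one pass. It assumes binary units with logistic (sigmoid) activations, which is a common choice but is not stated in these notes; the weights, biases, and function names are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hidden_probabilities(visible, weights, hidden_bias):
    """P(h_j = 1 | v) for every hidden unit; no hidden-to-hidden terms appear."""
    return [sigmoid(hidden_bias[j]
                    + sum(visible[i] * weights[i][j] for i in range(len(visible))))
            for j in range(len(hidden_bias))]


def visible_probabilities(hidden, weights, visible_bias):
    """P(v_i = 1 | h) for every visible unit; no visible-to-visible terms appear."""
    return [sigmoid(visible_bias[i]
                    + sum(weights[i][j] * hidden[j] for j in range(len(hidden))))
            for i in range(len(visible_bias))]


def sample_binary(probabilities, rng):
    """Draw each unit independently from its Bernoulli probability."""
    return [1 if rng.random() < p else 0 for p in probabilities]


# Hypothetical RBM with 3 visible and 2 hidden units; weights[i][j] links v_i to h_j.
rng = random.Random(0)
weights = [[0.5, -0.3], [0.1, 0.8], [-0.6, 0.2]]
visible_bias = [0.0, 0.0, 0.0]
hidden_bias = [0.0, 0.0]

visible = [1, 0, 1]
h_probs = hidden_probabilities(visible, weights, hidden_bias)
hidden = sample_binary(h_probs, rng)
v_probs = visible_probabilities(hidden, weights, visible_bias)
```

Because there are no within-layer connections, neither function needs a loop over units in its own layer.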

#### RBM Training
The RBM is typically trained using a procedure built around a different learning algorithm, the Persistent Contrastive Divergence (PCD) algorithm, which provides an approximation of maximum likelihood.  PCD doesn't evaluate the energy function itself, but instead allows us to estimate the gradient of the energy function.  With this estimate, we can make very small adjustments in the direction of steepest descent and so progress toward a local minimum.

#### PCD
The PCD algorithm is made up of two phases, referred to as the positive and negative phases, each of which has a corresponding effect on the energy of the model.
- The positive phase increases the probability of the training dataset X, thus reducing the energy of the model.
- Following this, the negative phase draws samples from the model to estimate the negative phase gradient.  The overall effect of the negative phase is to decrease the probability of samples generated by the model.
- Sampling in the negative phase and throughout the update process is achieved using Gibbs sampling.
    - This is a Markov Chain Monte Carlo algorithm that draws approximate samples from a multivariate probability distribution by resampling each variable in turn, conditioned on the current values of the others.
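
The two phases can be sketched as a single update step. This is a minimal illustration under several assumptions not stated in the notes: binary units with sigmoid activations, one training vector per update, a single persistent Gibbs chain that carries over between updates, one Gibbs step per update, and a fixed learning rate. The function names, toy data, and hyperparameters are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_hidden(v, W, c, rng):
    """Return (probabilities, binary sample) for the hidden layer given v."""
    probs = [sigmoid(c[j] + sum(v[i] * W[i][j] for i in range(len(v))))
             for j in range(len(c))]
    return probs, [1 if rng.random() < p else 0 for p in probs]


def sample_visible(h, W, b, rng):
    """Return (probabilities, binary sample) for the visible layer given h."""
    probs = [sigmoid(b[i] + sum(W[i][j] * h[j] for j in range(len(h))))
             for i in range(len(b))]
    return probs, [1 if rng.random() < p else 0 for p in probs]


def pcd_step(v_data, v_chain, W, b, c, rng, learning_rate=0.1):
    """One PCD update in place; returns the new state of the persistent chain."""
    # Positive phase: hidden statistics driven by the training vector.
    h_data, _ = sample_hidden(v_data, W, c, rng)
    # Negative phase: one Gibbs step continuing from the persistent chain.
    _, h_sample = sample_hidden(v_chain, W, c, rng)
    _, v_model = sample_visible(h_sample, W, b, rng)
    h_model, _ = sample_hidden(v_model, W, c, rng)
    # Raise the probability of the data, lower that of the model's own samples.
    for i in range(len(b)):
        for j in range(len(c)):
            W[i][j] += learning_rate * (v_data[i] * h_data[j] - v_model[i] * h_model[j])
        b[i] += learning_rate * (v_data[i] - v_model[i])
    for j in range(len(c)):
        c[j] += learning_rate * (h_data[j] - h_model[j])
    return v_model


# Hypothetical toy setup: 4 visible units, 2 hidden units, two training patterns.
rng = random.Random(0)
n_visible, n_hidden = 4, 2
W = [[rng.gauss(0, 0.01) for _ in range(n_hidden)] for _ in range(n_visible)]
b = [0.0] * n_visible
c = [0.0] * n_hidden
data = [[1, 1, 0, 0], [0, 0, 1, 1]]

chain = [rng.randint(0, 1) for _ in range(n_visible)]
for epoch in range(200):
    for v in data:
        chain = pcd_step(v, chain, W, b, c, rng)
```

Note that the chain state returned by `pcd_step` is fed back into the next call rather than being reset to a training vector, which is what makes the procedure persistent. The energy itself is never evaluated; only the difference between data-driven and model-driven statistics is used.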