## Week 1: Intro

* **Feed-forward neural networks** are the most common type.
* An FF network consists of an input layer, an output layer, and one or more hidden layers.
* If there is more than one hidden layer it is known as a deep neural network.
* **Recurrent neural networks** are networks incorporating directed cycles.
* RNNs make intrinsic sense for time-series data. They act like extremely deep feed-forward neural networks where each layer is responsible for one timestamp slice of the data, except with global weights.
* A **symmetrically connected network** is a recurrent neural network with the additional restriction that connections are symmetric, i.e. the connection between two units has the same weight in both directions.


## Week 2: Perceptrons
* The standard perceptron is a feed-forward neural network without any hidden layers.
* The input layer assigns hand-coded (generally semi-randomized) weights to individual inputs, just to start off with.
* The feature layer learns weights for individual features and applies them to get a result.
* Perceptrons are still kind of useful for learning over large numbers of features.
* Perceptrons use a binary threshold activation function, true to their neuron-inspired design.


* How do you train a perceptron? Here's the example for a binary classification problem.
    * Add an extra component with the value 1 to each input vector. This is the bias term.
    * Pull the training samples, and run each one through the classifier.
    * If the output is correct, leave the weights alone.
    * If the output is incorrect, and a false negative (gives 0 when should give 1), add the input vector to the weights vector.
    * If the output is incorrect, and a false positive (gives 1 when it should give 0), subtract the input vector from the weights vector.
* This simple procedure is guaranteed to find weights that will totally solve the problem, provided that such weights exist.
* Solving a perceptron (perfect classification) seems to me to be equivalent to solving a linear support vector classifier.
* The geometric interpretation is that the set of weights that boundarize a correct solution for any given training sample form a separating hyperplane. Lots of training samples draw lots of separating hyperplanes.
* The solution space of acceptable weights is the set of coordinates in the affine space constructed by the half-planes defined by the hyperplanes. This is a convex optimization problem!
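
The training procedure above can be sketched in a few lines of NumPy. This is a minimal illustration rather than anything from the lectures: the function names, the zero initial weights, and the `max_epochs` cap are all assumptions made for the example.

```python
import numpy as np

def train_perceptron(X, targets, max_epochs=100):
    """Perceptron learning rule for binary (0/1) targets."""
    # Append a constant 1 to every input so the bias is learned as a weight.
    X = np.hstack([X, np.ones((X.shape[0], 1))])
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in zip(X, targets):
            y = 1 if w @ x >= 0 else 0   # binary threshold unit
            if y == t:
                continue                 # correct: leave the weights alone
            mistakes += 1
            if t == 1:
                w += x                   # false negative: add the input
            else:
                w -= x                   # false positive: subtract the input
        if mistakes == 0:
            break                        # every training case is classified correctly
    return w

def predict(w, X):
    X = np.hstack([X, np.ones((X.shape[0], 1))])
    return (X @ w >= 0).astype(int)

# Logical AND is linearly separable, so the procedure converges on it.
X_and = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t_and = np.array([0, 0, 0, 1])
w_and = train_perceptron(X_and, t_and)
```

Running the same function on targets that are not linearly separable never reaches a mistake-free epoch; it simply stops at `max_epochs`.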


* The use of binary thresholds in perceptrons makes for simplicity, but there are a good number of things that binary discrimination cannot do. The most obvious example is `~XOR`: when $\{(0, 0), (1, 1)\}$ maps to 1 and $\{(0, 1), (1, 0)\}$ maps to 0. This mapping is not linearly separable, so it cannot be captured by a single binary threshold unit.
* To generalize, patterns that can "wrap around" are not capturable using binary activation.
* These limitations are devastating to what a perceptron can do.


## Week 3: Hidden layers
* So instead we focus on learning with hidden layers.
* Optimizing hidden layers is a hard computational problem that was, until recently (obviously), to a certain extent intractable.
* We cannot use the perceptron procedure. The perceptron learning procedure is provably optimal if and only if the average of any two good solutions is another good solution, i.e. the problem is convex. Solving with hidden layers is non-convex by default.
* To solve the problem, reframe it. Convergence on the best set of weights in a multi-layer environment in the general case is non-convex. But convergence on the best set of outputs can still be convex, as long as you use a convex loss function.
* In the case of linear neurons with a squared error, the error surface is a quadratic bowl. Slices of the bowl at a specific error value are elliptical contours. More complex loss rules make for more complex error surfaces.
* Batch learning tends to progress smoothly towards the optimal solution in small increments. Most learning procedures are batch-based.
* Online learning tends to take a zigzag path towards the optimal solution. Recall that with respect to an individual observation, beating a certain (categorical!) loss is equivalent to pushing the weight into an affine space. In the linear case the delineation is a hyperplane, and the area of optimality is a half-space. Online learning pushes the outcome towards the boundary of that half-space by an amount dependent on the learning rate.
* Learning will be slow if the ellipse is heavily squashed. This is because the gradient will point almost perpendicular to the elongated dimension, which is the direction we actually need to move in.


* (short section of why the logistic activation is easy to solve for; the relationship between the logit and the gradient)


* How do you learn?
* You could learn by perturbing the initially set weights and seeing if the move improves the result or not. This is very simple but very inefficient. Technically this is an example of reinforcement learning.
* An interesting idea is perturbing many weights in parallel, then using statistical modeling to determine which changes were good or bad.
* Another even better idea is to randomly perturb the activities of the neurons. Changes that produce improvements can be modeled by permanently shifting the weights. There are fewer activations (as in, fewer neurons) than there are weights, so this is faster than the other two ideas.
* However, backpropagation is still best. A backprop write-up that actually makes sense is [this one](https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/).
* We use reverse accumulation to get the derivative on the previous layer.

* Online learning (case-by-case) zigzags too much. Full batch learning always moves you the right way, but is computationally inefficient. Thus mini-batch learning, a compromise between the two, is what is usually used to perform large-scale training.
* Guards against overfitting that are part of neural network methodology:
  * Weight decay: try and keep as many weights near zero as possible.
  * Weight sharing: insist that subsets of the weights are the same value.
  * Early stopping: create a validation holdout set and peek at it in between training runs. If you begin worsening performance on the holdout, stop training.
  * Model averaging: just like gradient boosting.
  * Bayesian fitting: model averaging, but fancier.
  * Dropout: randomly eject hidden units during training, in the hopes of making the model more robust.
  * Generative pre-training: "somewhat more complicated".
  

## Week 4: Reconstructing sentences, motivating softmax

**Softmax**
* Squared error is the simplest loss function. But it imparts no knowledge of mutually exclusive classes to the neural network, making it an inappropriate choice for multiclass classification.
* **Softmax activation** addresses this problem.
* Softmax is an activation function that, unlike the logistic activation function, is dependent not just on that node's input but on the input of all other nodes in the layer (or softmax group) as well.
* Specifically. Suppose that we have a softmax layer with $n$ nodes, whose inputs are $z_i$ and whose outputs are $y_i$. The softmax activation is: 

$$y_i = \frac{\exp{z_i}}{\sum_{j=1}^n \exp{z_j}}$$

* The inputs $z_i$ are referred to as logits.
* The practical effect of the softmax function is that it emphasizes maximum and near-maximum values at the cost of values which are "far away" from the maximum. It is called softmax because this is akin to a "soft maximum". For example, given an input [1, 2, 3, 4, 1, 2, 3], the softmax of that is [0.024, 0.064, 0.175, 0.475, 0.024, 0.064, 0.175].
* Softmax is not scale invariant. Larger values result in a larger spread of softmax values.
* The sum of all of the $y_i$ values generated by softmax will come to one. This makes it a great choice for the last layer of your neural network model, when placed behind a probability calibration based loss function.
* Just like logistic activation, softmax has a convenient derivative rule. $\frac{\partial y_i}{\partial z_i} = y_i ( 1 - y_i)$.
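
To make the softmax properties above concrete, here is a small NumPy sketch. The function names are made up for this example, and subtracting the maximum logit before exponentiating is a common stability trick that the notes do not mention; it leaves the result unchanged.

```python
import numpy as np

def softmax(z):
    """Softmax over a 1-D array of logits."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())   # shifting by the max avoids overflow
    return e / e.sum()

def softmax_jacobian(z):
    """Matrix of partial derivatives J[i, j] = dy_i / dz_j."""
    y = softmax(z)
    return np.diag(y) - np.outer(y, y)

logits = np.array([1, 2, 3, 4, 1, 2, 3])
y = softmax(logits)               # the worked example from the text
sharper = softmax(10 * logits)    # same logits, scaled up by 10
J = softmax_jacobian(logits)      # diagonal entries are y_i * (1 - y_i)
```

Scaling the logits up concentrates almost all of the probability mass on the largest entry, which is the lack of scale invariance noted above.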


* The best default choice for the cost function of a softmax-utilizing neural network is **cross-entropy**.
* Cross-entropy is the negative log probability of the correct target value. Mathematically:

$$C = - \sum_j t_j \log{y_j}$$

* As $y_j$ goes towards zero the log term explodes: $y_j \to 0 \implies \log{y_j} \to -\infty$. The cost stays finite on the incorrect classes only because of the other term, $t_j$, which is zero (or near-zero) there.
* You could see how this would be tricky to implement in a machine instruction.
* The practical effect of cross-entropy is that it is an exponential penalty on incorrectness. Notice how in predicting a $t_j = 1$, an incorrect $y_j = 0.1$ is penalized much less heavily than an incorrect $y_j = 0.01$.
* This results in extreme gradients on totally incorrect answers. This is good because it speeds convergence towards a correct result. But I immediately see cases where it can lead to convergence failure, as well.
* An alternative information-theoretic representation. Let $p(x)$ be the true distribution of the data, and $q(x)$ be some untrue distribution. Then cross-entropy may be defined:

$$H(p, q) = - \sum_x p(x) \log{q(x)}$$

* This is equivalent to the average number of bits necessary to represent an event drawn from a set under a coding scheme optimized for $q(x)$ instead of for $p(x)$ (wow).
* Why is cross-entropy the appropriate choice for softmax activation? Because the steepness of the cross-entropy derivative exactly matches the shallowness of the softmax derivative. In fact, when performing backprop we discover the following property:

$$\frac{\partial C}{\partial z_i} = \sum_j \frac{\partial C}{\partial y_j}\frac{\partial y_j}{\partial z_i} = y_i - t_i$$

* This slope is in the range $[-1, 1]$, and is only very small if the difference between the output and the target is very small.
* Technical note: softmax is not a loss function but an activation function. Sometimes you may hear reference to "softmax loss". This is actually linear softmax activation with a cross-entropy loss. The two are so closely intertwined that this mix-up has entered the common lexicon!
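
The $y_i - t_i$ result can be checked numerically. The sketch below compares it against a central finite-difference estimate of the cross-entropy gradient; the helper names and the particular logits and one-hot target are arbitrary choices for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(z, t):
    """C = -sum_j t_j log y_j, with y = softmax(z)."""
    return -np.sum(t * np.log(softmax(z)))

def numerical_gradient(z, t, eps=1e-6):
    """Central-difference estimate of dC/dz_i for every logit."""
    grad = np.zeros_like(z)
    for i in range(len(z)):
        step = np.zeros_like(z)
        step[i] = eps
        grad[i] = (cross_entropy(z + step, t) - cross_entropy(z - step, t)) / (2 * eps)
    return grad

z = np.array([2.0, -1.0, 0.5])
t = np.array([0.0, 1.0, 0.0])       # one-hot target: class 1 is correct
analytic = softmax(z) - t           # the y_i - t_i rule
numeric = numerical_gradient(z, t)
```

The two gradients agree to within finite-difference error, and every entry of `analytic` lies in $[-1, 1]$.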


**Speech recognition**
* At this point he introduces the neural network core architecture for speech recognition.
* Basically a word featurization layer, pointed at some number of layers or hidden units, pointed at a huge softmax, plus a skip-layer passing words from the featurization layer directly to the softmax.
* Principal limitation: 100,000 words means 100,000 weights on the softmax level.
* Not having enough data severely limits the size of the pre-softmax hidden layers we can use.
* We could make the last hidden layer smaller, but this squeezes less likely options into a constrained encoding space. In this problem, less likely words are still extremely important and relevant!
* An enormous amount of data is needed in order to make this practically possible.


**Serial architecture**
* One solution is to use a serial architecture.
* In a serial architecture we start with the same precursor inputs, but now also fit in a candidate word. The model is trained, and becomes good at determining where the candidate word likes to get placed. The output is unary: a logit score of candidate word likelihood, given precursor 2-grams. No huge softmax!
* Each candidate word now has its own model. This requires training a huge amount of models, but each of those models only needs to be trained once.
* After computing the logit scores for every candidate word, combine all of the candidate words in a big softmax. The difference between the word probabilities and the target probabilities is the cross-entropy error derivative, and this can be used to further refine the models. Clever!
* This architecture is most useful when paired with some other model for drawing candidate words. For example, a trigram model (which was the best known algo before NNs showed up) works well.
* He developed a methodology that transforms the problem from an $O(N)$ lookup to a $O(\log{N})$ lookup by placing words as leaves in a binary tree. I mostly jumped past this.

## Week 5: object recognition, motivating convolutional neural networks
* Object recognition is hard.
* One of the hardest things about it is achieving **viewpoint invariance**.


* One approach: extract a very large set of redundant viewpoint-invariant features. This is the "shape of a nose" example that I gave to Yumiko over a year ago. With enough features, you get to a point where there is only one way to assemble them into an object.
* Another approach is good preprocessing.
* Another approach is brute force normalization, by (at runtime) feeding a representative grid of transformations of the inputted images into the model and trying for the best fitting result.


**Convolutional neural networks**
* **Convolutional neural networks** are premised on the idea of replicated features.
* We set the same features up across multiple parts of the dataset. For example, we might have a "brightness scanner" feature built against each of 16 pixel collections in an image.
* The same features in different locations get the same weights. This greatly reduces the volume of weights that need to be learned.
* Modifying backpropagation to incorporate linear constraints between weights is easy. If the weights started at the same value, and every update to the weights carries the same value, then the weights will stay in sync.
* We learn $\frac{\partial E}{\partial w_1}$ and $\frac{\partial E}{\partial w_2}$ as usual, but now instead of tuning the weight by that amount we tune it by $\frac{\partial E}{\partial w_1} + \frac{\partial E}{\partial w_2}$.
* This featurization achieves equivariance (the feature activities shift along with the input when the data is translated; note that this is not "invariance" strictly speaking!) and invariant knowledge (features useful in one area of the image will be learned for all areas of the image).
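
The tied-weight update can be illustrated on a toy linear unit with two weights that are constrained to be equal. Everything here (the squared-error unit, the input values, the learning rate) is an assumption chosen to keep the example tiny.

```python
def error(w1, w2, x1, x2, t):
    """Squared error of a linear unit with two (tied) weights."""
    return 0.5 * (w1 * x1 + w2 * x2 - t) ** 2

def grads(w1, w2, x1, x2, t):
    """dE/dw1 and dE/dw2, computed as if the weights were independent."""
    residual = w1 * x1 + w2 * x2 - t
    return residual * x1, residual * x2

x1, x2, t = 1.5, -0.5, 1.0
w1 = w2 = 0.3                    # tied weights start at the same value
start_error = error(w1, w2, x1, x2, t)
lr = 0.1
for _ in range(20):
    g1, g2 = grads(w1, w2, x1, x2, t)
    shared = g1 + g2             # sum of the two gradients
    w1 -= lr * shared            # the same update goes to both copies
    w2 -= lr * shared
end_error = error(w1, w2, x1, x2, t)
```

Because both copies receive the identical update, they stay exactly equal while the error still falls.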


* One useful optimization is pooling the output of neighboring replicated feature detectors. An average works, a maximum works even better (by just a bit in practice). Pooling reduces the number of inputs to the next layer, which frees up capacity for more feature maps in the later hidden layers.
* However there is a tradeoff, as too much pooling will lose positional information.
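
As a rough sketch of what pooling does to a feature map, the function below applies non-overlapping max or average pooling with NumPy. The function name, the 2x2 block size, and the example values are all illustrative assumptions.

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping pooling over size x size blocks of a 2-D feature map."""
    h, w = feature_map.shape
    assert h % size == 0 and w % size == 0, "map must tile evenly"
    # Axes 1 and 3 index positions inside each block.
    blocks = feature_map.reshape(h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

fmap = np.array([[1, 2, 0, 0],
                 [3, 4, 0, 1],
                 [0, 0, 5, 6],
                 [0, 9, 7, 8]], dtype=float)
pooled_max = pool2d(fmap, mode="max")   # [[4, 1], [9, 8]]
pooled_avg = pool2d(fmap, mode="avg")   # [[2.5, 0.25], [2.25, 6.5]]
```

The pooled map reports that a strong activation occurred somewhere in each block but no longer says where, which is the loss of positional information mentioned above.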


* One of the first successful convolutional neural networks was a handwritten digit recognizer by Yann LeCun.
* It used many hidden layers, replicated features, and pooling. It was a wide net that was designed to deal with several characters at once even in the face of overlap.
* LeNet5 had a better than 99% accuracy, and was at one time run on 10% of all bank cheques in the US.


* How do you predispose a neural network towards making the right calls? By tuning connectivity, weight constraints, and neuron activation functions.
* Also, synthetic data is very effective, but harder to design right. In particular, "it allows optimization to discover clever ways of using the multi-layer network that we did not think of". And, "we might never understand how it does it". Hmm...


* The McNemar test may be used to test the significance of a seen change in a net's performance. Basically you take two models and put them side by side. A significant improvement would be a model that doesn't make the mistakes that the previous model makes, but also makes fewer mistakes total. A less significant (or at least, less straight) improvement would be a model that makes fewer mistakes, but also makes mistakes in areas that differ from the first model (this is where you break out model averaging to grind out those last problem spots...if you want to).
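
A minimal McNemar computation needs only the per-example correctness of the two models. The sketch below uses the continuity-corrected chi-squared statistic together with an exact two-sided binomial p-value; the function name and the example counts are hypothetical.

```python
from math import comb

def mcnemar(correct_a, correct_b):
    """McNemar's test on paired per-example correctness flags for two models."""
    # b: only model A is right.  c: only model B is right.  Agreements are ignored.
    b = sum(1 for a_ok, b_ok in zip(correct_a, correct_b) if a_ok and not b_ok)
    c = sum(1 for a_ok, b_ok in zip(correct_a, correct_b) if b_ok and not a_ok)
    n = b + c
    if n == 0:
        return 0.0, 1.0
    statistic = (abs(b - c) - 1) ** 2 / n            # continuity-corrected chi-squared
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    p_value = min(1.0, 2 * tail)                     # exact two-sided binomial test
    return statistic, p_value

# Hypothetical results on 100 test cases: both right on 80, both wrong on 6,
# only the old model right on 2, only the new model right on 12.
old_ok = [True] * 80 + [False] * 6 + [True] * 2 + [False] * 12
new_ok = [True] * 80 + [False] * 6 + [False] * 2 + [True] * 12
statistic, p_value = mcnemar(old_ok, new_ok)
```

Only the 14 cases where the models disagree enter the test; the 86 cases where they agree carry no information about which model is better.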


* The next chunk describes AlexNet, the neural network that made a splash in 2012 by winning the ImageNet competition by a huge margin.
* AlexNet is a more sophisticated version of LeNet. It has seven hidden layers, "competitive normalization", and uses rectified linear units in its hidden layers. Early layers are convolutional, the last two layers are globally connected.
* Some interesting tricks:
  * The use of competitive normalization helped with differences in intensity in the images.
  * He subsampled slightly smaller chunks of images, so he could oversample them (ten times, actually) and create more data. At test time NN results on each of the ten patches are merged.
  * He used dropout (half the hidden units were randomly removed for each training image) in the globally connected layer. This increased robustness, as it disallows "fixing up errors" and compensatory featurization in favor of "individualistic" nodes.
  * And he used a GPU, which was still a relatively new idea at the time.


## Week 6: stochastic gradient descent and mini-batch

**Techniques intro**
* The error surface of a complex NN can still be approximated locally, with some level of accuracy, by a "quadratic bowl" (vertical cross-sections are parabolas, and horizontal cross-sections are ellipses).
* The main issue faced by full batch learning is low convergence speed, due to squeezed quadratic bowl parameter spaces. There is no magic algorithm that avoids this problem entirely!
* The main issue faced by full-batch learning with too-large learning rates, and by online learning, is oscillations to and fro that do gradually move towards convergence, but do so slowly, and may have inconsistent gradients (the algorithm may "snap" between points, e.g. go from one side of the "bowl" to the other).
* We would like to have consistent gradients and steady improvement, along with relatively fast convergence.
* Mini-batches strike a nice balance. They are also easy to parallelize.


* True gradients on smooth non-linear functions are a well studied problem, one that has been a long-term focus of the optimization community. This has led to a good understanding of possible optimizations and approximations, but it often needs translation before it will work in the multilayer neural network environment, as NNs are not typical of the problems these folks have studied in the past.
* A basic mini-batch gradient descent algorithm starts by guessing a learning rate. If the error keeps getting worse or oscillates, reduce the rate. If the error is falling consistently but slowly, increase the rate. At the end of the learning process, turn down the learning rate all the way until you get a convergent solution (this improves between-trainings model consistency).
* It's recommended to train based on the error in a separate validation set (but you knew this already, right?)
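
For reference, here is a bare-bones mini-batch gradient descent loop on a linear least-squares problem. It uses a fixed learning rate and omits the rate adjustments described above; the function name, the hyperparameter values, and the synthetic noise-free data are assumptions for the sake of a runnable example.

```python
import numpy as np

def minibatch_gd(X, y, lr=0.1, batch_size=10, epochs=50, seed=0):
    """Mini-batch gradient descent for linear least squares."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(X))              # reshuffle every epoch
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w - y[idx]
            grad = X[idx].T @ residual / len(idx)    # gradient of half the mean squared error
            w -= lr * grad
    return w

data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(100, 2))
w_true = np.array([2.0, -3.0])
y = X @ w_true                                       # noise-free targets
w_hat = minibatch_gd(X, y)
```

Each batch is an independent matrix product, which is part of why mini-batches parallelize so easily.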


**ML tricks**
* A bag of tricks for learning. This lecture is a "peek into the black art of machine learning".
* Remember that it's the variation in the starting weights that determines what features get learned where. Two nodes with the same bias, input weights, output weights, and connectivity will always learn the same feature, making them redundant. It's the random weight initialization that breaks this symmetry and allows us to learn interesting things.
* A hidden unit with a big fan-in can overshoot its learning, because small changes in many incoming weights will overadjust the node. You want to start with smaller weights on these nodes. A good rule of thumb is to make the initial weights inversely proportional to the square root of the size of the fan-in.
* We can scale the learning rates for the weights the same way.
* It usually helps to shift each component of the input to mean zero. As best I understand, non zero means get transformed by non-linear activation functions in ways that may squeeze the axes of the solution space, making training harder (squished parabola bowls).
* The same can be said of scaling the input values. In this case it's more obvious to me, however. The softmax function for example is not scale invariant!
* Decorrelating the input, via e.g. PCA, is a more thorough possibility.
* Two common problems in multi-layer networks:
  * If we start with a very big learning rate the weights on the hidden units may become very large and positive or very large and negative. The error derivatives for the hidden units will become tiny. This is a plateau, which is easy to mistake for a local minimum.
  * In a squared error or cross entropy loss multiclassification scenario the network may find the "best guess" strategy very quickly, and plateau there. Again mistakable for a local minimum.
* Don't turn the learning rate down too soon. You can actually fall behind on your results!
* He details a few momentum-based and separate learning rate based ideas for increasing mini-batch convergence speed. 
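
The fan-in rule of thumb from this list can be checked empirically. The sketch below draws initial weights with standard deviation $1/\sqrt{\text{fan-in}}$ and measures the spread of the total input to a layer of hidden units. The function names, the Gaussian initializer, and the layer sizes are assumptions for illustration.

```python
import numpy as np

def init_weights(fan_in, fan_out, rng):
    """Random weights with standard deviation 1 / sqrt(fan_in)."""
    return rng.normal(0.0, 1.0 / np.sqrt(fan_in), size=(fan_in, fan_out))

def preactivation_std(fan_in, n_hidden=50, n_cases=500, seed=0):
    """Spread of the total input to the hidden units for zero-mean, unit-variance inputs."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n_cases, fan_in))
    z = x @ init_weights(fan_in, n_hidden, rng)
    return float(z.std())

small_fan_in = preactivation_std(10)
large_fan_in = preactivation_std(1000)
```

With this scaling the total input has roughly unit spread whether a unit has 10 or 1000 incoming connections, so units with a large fan-in do not start out saturated.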


**Momentum**
* Most large-scale neural networks are trained using a combination of mini-batches, stochastic gradient descent, and momentum convergence.
* Momentum methods do not follow the gradient exactly. Instead the gradient acts like a force on the weights: it changes a velocity, and that velocity is preserved between steps, with some decay.
* Consider for example a cost function with cross sections in the form of a squeezed ellipse. We saw before that ordinary gradient descent will struggle to converge in this environment. It will only move along the major axis a little at a time, but bounce between the sides of the minor axis quite a lot.
* Momentum methods will preserve and enhance the major axis movement whilst dampening the minor axis movement. Ultimately it will end up moving mostly along the major axis and mostly through the center of the ellipse!
* At the beginning of learning there tend to be very large gradients, so it helps to use a small momentum in that case (< 0.5). Once the large gradients have disappeared, the learner may become stuck in a ravine, so momentum can be smoothly raised to a final value of 0.9 or thereabouts.
* Momentum actually allows us to raise the learning rate of the learner in general. It lets us travel along ravines that pure gradient descent would just oscillate within.
* The original momentum methods naively calculated the gradient at the current position. An improvement to momentum methods first makes a jump in the direction of the accumulated velocity, *then* corrects itself using the gradient at the new position.
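
A small experiment on an elongated quadratic bowl shows the effect. The sketch compares plain gradient descent, classical momentum, and the look-ahead variant described in the last bullet. The bowl's curvatures, the learning rate, and the momentum value are assumptions picked so that all three runs are stable.

```python
import numpy as np

CURVATURE = np.array([1.0, 100.0])   # shallow along w[0], steep along w[1]

def loss(w):
    return 0.5 * np.sum(CURVATURE * w ** 2)

def grad(w):
    return CURVATURE * w

def descend(lr=0.009, momentum=0.0, lookahead=False, steps=200):
    """Run gradient descent with optional momentum and return the final loss."""
    w = np.array([1.0, 1.0])
    v = np.zeros_like(w)                             # velocity
    for _ in range(steps):
        # Look-ahead variant: measure the gradient after the momentum jump.
        point = w + momentum * v if lookahead else w
        v = momentum * v - lr * grad(point)
        w = w + v
    return loss(w)

plain = descend(momentum=0.0)
classical = descend(momentum=0.9)
with_lookahead = descend(momentum=0.9, lookahead=True)
```

With the same learning rate and step count, both momentum runs finish with a far lower loss than plain gradient descent, which is still crawling along the shallow axis.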


**Adaptive learning**
* The second tunable convergence enhancing technique described in the previous section is adaptive learning rates.
* Recall that the degree of fan-in and the stage of learning both move what an "ideal" learning rate is, and they do so on a node-by-node basis. Hence why adaptive learning rates are a good idea.
* We call the multiplicative value controlling the rate of convergence the **local gain**.
* An idea for gradient adaptation: start with a local gain of 1. So long as the sign of the gradient doesn't flip, increase the local gain linearly. As soon as the sign does flip, however, decrease it exponentially.
* This is basically additive increase, multiplicative decrease, as in TCP congestion control.


* Some tricks for adaptive learning.
* Limit the possible values to a certain reasonable range, to prevent the gradient from exploding.
* Use full batches or large mini batches, to obviate the chance of sampling error.
* You can also combine this method with momentum.


* Rmsprop is another adaptive learning rate algorithm, which GH cites as being his favorite default.
* Before rmsprop there is rprop. The fundamental issue is that the size of the gradients varies wildly. Rprop deals with this by using only the sign of the gradient, and scaling it by a step size kept separately for each weight. That step size adapts over time.
* Rprop is better at getting out of plateaus.
* Rprop doesn't work well with mini-batches. The model case is a gradient of +0.1 nine times, and then a gradient of -0.9 once. SGD would nullify the gains from the first nine iterations, in this case. rprop meanwhile would not decrease the value nearly as much as it had already increased it.


* Rmsprop is an adaptation of rprop that works well in the mini-batch setting.
* Rmsprop keeps a moving average of the squared gradient for each weight. For instance:

$$\text{MeanSquare}(w, t) = 0.9 \cdot \text{MeanSquare}(w, t-1) + 0.1 \cdot \left(\frac{\partial E}{\partial w}(t)\right)^2$$

* At runtime, we divide the gradient for each $w$ by $\sqrt{\text{MeanSquare}(w, t)}$.
* Rmsprop is a nice compromise between the convergence speed of rprop and the convergence durability (especially in the face of sampling error) of ordinary SGD.
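
The rmsprop update can be sketched as follows. The 0.9 / 0.1 decay matches the formula above, while the function name, the small `eps` added to the denominator to avoid division by zero, the learning rate, and the test problem are assumptions of this example.

```python
import numpy as np

def rmsprop(grad_fn, w, lr=0.01, decay=0.9, eps=1e-8, steps=100):
    """Divide each gradient by the root of a running mean of its square."""
    w = np.array(w, dtype=float)
    mean_square = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)
        mean_square = decay * mean_square + (1 - decay) * g ** 2
        w -= lr * g / (np.sqrt(mean_square) + eps)
    return w

def grad(w):
    # Quadratic bowl whose gradient is 100 times larger along the second axis.
    return np.array([1.0, 100.0]) * w

after_one = rmsprop(grad, [1.0, 1.0], steps=1)
after_twenty = rmsprop(grad, [1.0, 1.0], steps=20)
```

Even though the two gradients differ by a factor of 100, both weights move by essentially the same amount each step, because the division cancels the gradient's scale.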


## Week 7 and 8: Sequences and recurrent neural networks
* A broad class of problems involve solving for sequences.


**Sequences and RNNs**
* The simplest kinds of sequence models are memoryless ones.
* **Autoregressive models** predict the next word from the previous N words.
* **Feed-forward neural networks** generalize autoregressive models by incorporating hidden layers between the precursor terms and the outputs.


* For best results, you want to go beyond memoryless models. These allow you to incorporate state information about the system.
* In the wild, linear dynamical systems are one type of system model (think: ODEs and PDEs; shooting down the missile). Another is hidden Markov models (think: word spaghetti and stochastic calculus).
* HMMs are less powerful than RNNs because modeling non-trivial scenarios requires simply too much transitional state-holding to be viable.
* RNNs can express an extremely broad range of state. But their power makes them hard to train.


* An RNN features loopbacks. The equivalence to a feedforward neural network: each timestep trained on is a new layer, and weights are adaptively reused between different layers.
* This illustrates why RNNs can be hard to train!
* Recall that it is easy to modify feedforward to maintain linear constraints on the weights.
* This is equivalent to how RNNs are trained:
  1. The forward pass builds a stack of activities at each time unit layer.
  2. The backwards pass peels layers back to calculate error derivatives at each time step (backpropagation through time).
  3. After the backwards step we add together all of the precursor derivatives for each layer weight, and change the "true" weight (the looped node weight) by an amount proportional to the average of these derivatives.
* A note: you also need to tune the initial states as part of backprop.
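
The three steps above can be sketched for the smallest possible RNN: one tanh unit with a recurrent weight `w` and an input weight `u`, and a squared error on the final state only. The architecture, names, and numbers are assumptions for illustration. The sketch sums the per-time-step derivatives, which differs from averaging them only by a constant factor.

```python
import numpy as np

def forward(w, u, xs, h0=0.0):
    """Forward pass: stack up the hidden activity at every time step."""
    hs = [h0]
    for x in xs:
        hs.append(np.tanh(w * hs[-1] + u * x))
    return hs

def bptt(w, u, xs, target, h0=0.0):
    """Loss on the final state and its derivatives w.r.t. w, u and h0."""
    hs = forward(w, u, xs, h0)
    loss = 0.5 * (hs[-1] - target) ** 2
    dh = hs[-1] - target                 # dL/dh at the last time step
    dw, du = 0.0, 0.0
    for t in range(len(xs), 0, -1):      # peel the time steps back
        dz = dh * (1.0 - hs[t] ** 2)     # back through the tanh
        dw += dz * hs[t - 1]             # contribution of this step's copy of w
        du += dz * xs[t - 1]             # contribution of this step's copy of u
        dh = dz * w                      # pass the error to the previous step
    return loss, dw, du, dh              # dh is now dL/dh0

w, u = 0.7, -0.4
xs = [0.5, -1.0, 0.25, 0.8]
target = 0.3
loss, dw, du, dh0 = bptt(w, u, xs, target)
```

The returned `dh0` is the derivative with respect to the initial state, which is what lets the initial state be tuned alongside the weights.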


* There are several ways to provide input to an RNN.
* One way is to specify initial states of all units.
* Alternatively, specify initial states for a subset of the units.
* Alternatively, specify the state of a subset of the units *at each time step*. This is the most natural way to model most sequential data.


* Similarly, there are some options on how to provide targets to an RNN.
* You may specify that the very last layer must match the expected layer. This is good if latest-only information is desired.
* Or, you may specify that some number of near-final layers must match the expected layer. This is good if you desire a most recent subsequence.
* Or, you may target outputs on a subset of nodes *at each time step*. This is a natural way of encoding an expectation of a continuously-outputting model.


* A recurrent neural network can emulate a finite state automaton (this is done in this lecture series via a toy example). Taking a binary example...an RNN has access to $2^N$ possible binary activation vectors, albeit expressed via only $N^2$ weights. This means that to represent two streams of information in an input stream, a finite state automaton would have to square its number of states. An RNN, meanwhile, merely needs to double its number of neurons.


**Exploding and vanishing gradients**
* RNNs are difficult to train not only because they are computationally expensive, but because of the problem of exploding and vanishing gradients.
* The essential problem is that the computation of the gradients is a linear operation. It's very easy to get a gradient value that is very large or very small.
* Feedforward networks do not have enough layers for this to become too much of a problem. By amplifying the virtual number of layers of neurons, RNNs create a really big vanishing and exploding problem.
* There are four effective ways to learn an RNN that deals with these problems:
  1. Use the long short-term memory structural recipe (this is an LSTM).
  2. Use a Hessian free optimizer, which deals with this problem mathematically.
  3. Echo state networks carefully structure weakly coupled hidden layers between input and output that give the network a large number of weak oscillators to train over.
  4. Good initialization values with momentum can also work.
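
The scale of the problem is easy to see in the simplest case, where the backward pass multiplies the error derivative by the same recurrent weight at every time step. The function below is a toy illustration under that assumption (a single linear unit, no nonlinearity), not a model of any particular network.

```python
def backprop_scale(weight, steps):
    """Factor applied to an error derivative after `steps` linear backward steps."""
    scale = 1.0
    for _ in range(steps):
        scale *= weight
    return scale

shrinking = backprop_scale(0.9, 50)   # about 0.005: the gradient vanishes
growing = backprop_scale(1.1, 50)     # about 117: the gradient explodes
```

A weight only ten percent away from 1 in either direction changes the gradient by orders of magnitude over 50 time steps.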


**LSTMs**
* LSTMs depend on a memory cell which preserves data that is written into it at a certain time and doesn't release it back into the rest of the net until a certain later time.
* The rest of the net sees a neuron with a value of 1 (?) and a gradient of zero. Thus it makes no contribution to the gradients and learning of the rest of the network.
* At a certain iteration, the data in the memory cell is released back into the network. At this step the LSTM gets to start optimizing on that information as well. That is long-term memory!


* I skipped the rest of week 8. It felt like I'd need to implement these nets myself first to adequately understand the material!


## Week 9: Improving generalization
* NNs overfit like crazy.
* The learning capacity of the model can be tuned to prevent overfitting in four different ways (usually you use a combination of the four):
  1. Changing the architecture by e.g. limiting the number of nodes
  2. Stopping early
  3. Weight decay: L1 or L2 penalties on large weights.
  4. Adding noise to the weights or activities.


* Adding Gaussian noise is computationally very similar, at the end of the day, to performing L2 regularization. But it seems to work better with RNNs, for some reason.


* The Bayesian approach to weight tuning hinges on the fact that minimizing squared residual error is equivalent to maximizing the log probability under a Gaussian.
* The Bayesian method wants to find the "maximum a posteriori", which in full form requires finding the full posterior distribution over all possible weight vectors. This is of course computationally intractable in the direct sense.
* Bayesians have bags of tricks that they use to approximate this posterior distribution at a reasonable computational cost (but the cost is still too high in the general case: see Kaggle comments from practitioners).
* Maximizing probabilities is equivalent to minimizing the negative log probabilities, which are computationally easier to work with (no numerical explosions).
* Since the log function is monotonic, maximizing the product of the probabilities over weights is furthermore equivalent to minimizing the sum of the negative log probabilities.
* Ultimately, a lot of math later we arrive at a Bayesian cost function that we are minimizing:

$$C = E + \frac{\sigma_D^2}{\sigma_W^2}\sum_{i}w_i^2$$

* The first term of this new cost function, $E$, is the squared error.
* The second term is a penalty on the squared weights, whose coefficient is the ratio of two variances: $\sigma_D^2$, the variance of the Gaussian noise on the outputs, and $\sigma_W^2$, the variance of the Gaussian prior on the weights.
* In other words, assuming a Gaussian prior on the weights and Gaussian noise on the outputs, there is a well-specified penalty term, discoverable via Bayesian statistical trickery, which gives the best possible model performance.


* GH introduces what he calls "MacKay's quick and dirty trick" for Bayesian optimization of NN training:
  1. Start with guesses for both the Gaussian noise variance and the variance of the prior weights.
  2. Do some learning.
  3. Reset the weights prior variance to be the variance of the actual learned weights.
  4. Loop until satisfied.
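
The loop above can be sketched on a linear model, where "do some learning" has a closed form (ridge regression with penalty $\sigma_D^2 / \sigma_W^2$). Two things here are assumptions beyond the listed steps: the sketch also re-estimates the noise variance from the residuals each round, and the function names, data, and round count are made up for illustration.

```python
import numpy as np

def fit_ridge(X, y, penalty):
    """Minimize the sum of squared errors plus penalty * sum of squared weights."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + penalty * np.eye(n_features), X.T @ y)

def mackay_loop(X, y, noise_var=1.0, weight_var=1.0, rounds=5):
    for _ in range(rounds):
        w = fit_ridge(X, y, noise_var / weight_var)   # "do some learning"
        noise_var = np.mean((y - X @ w) ** 2)         # assumption: refit the noise variance too
        weight_var = np.mean(w ** 2)                  # variance of the learned weights (zero-mean prior)
    return w, noise_var, weight_var

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)
w_fit, noise_var, weight_var = mackay_loop(X, y)
```

The appeal of the trick is that the penalty strength is set from the training data itself, with no separate validation set.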
  

## Week 10: Mixture models
* Mixing models under naive cost functions encourages compensatory specialization. This is one approach to model mixtures (see for example bagging and random forests).
* Another approach is to have **expert models** that are good for one particular segment or chunk of the data. In this paradigm we need a manager that compares model performance on different segments of the data.
* Competently implemented, an expert mixture will result in the concatenation of a good number of models focused on specific segments of the data stream.
* A manager may maximize the log probability of the target values under a mixture of models, where the manager controls the scale (and therefore, the contribution) of each model along the band of possible values.
* If we model the mixture as a Gaussian, this ultimately results in an extremely complicated but provably correct smashed probability function.


* Ignored some material on full Bayesian learning: again, not ready to tackle this rigorously yet.


* **Dropout** is an efficient way of combining neural network models. There's a section which sketches dropout performance characteristics.


## Week 11: Hopfield
* **Hopfield nets** are, along with backprop, one of the main reasons for the resurgence of interest in NNs.
* Hopfield nets are very different from feedforward models. They belong to a class of models known as **energy-based models**, of which they are the simplest practical example.
* The key insight is that recurrent networks of non-linear units can get into very complicated and hard-to-understand states.
* If connections are *symmetric*, however, there is a global energy function, and the rest of the properties of the model can be interpreted in view of that function.
* A binary threshold decision rule is used, which causes the network to settle to the minimum in the energy function.


* The global energy function is the sum of many contributions.
* Each contribution depends on one connection weight, the binary state of two neurons, and a bias term.
* The form of the (quadratic) expression is such that each node can easily compute its own contribution to the global energy.
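
A tiny simulation makes the energy picture concrete. It assumes the standard Hopfield energy $E = -\sum_i s_i b_i - \sum_{i<j} s_i s_j w_{ij}$ with 0/1 states, and uses random symmetric weights; the network size, the seed, and the function names are arbitrary choices for illustration.

```python
import numpy as np

def energy(s, W, b):
    """Global energy for symmetric W with a zero diagonal."""
    return -s @ b - 0.5 * s @ W @ s      # the 0.5 undoes double counting of each pair

def settle(s, W, b, sweeps=10):
    """Asynchronous binary threshold updates; records the energy after each one."""
    s = s.copy()
    energies = [energy(s, W, b)]
    for _ in range(sweeps):
        for i in range(len(s)):
            gap = b[i] + W[i] @ s        # energy saved by turning unit i on
            s[i] = 1.0 if gap > 0 else 0.0
            energies.append(energy(s, W, b))
    return s, energies

rng = np.random.default_rng(0)
n = 8
A = rng.normal(size=(n, n))
W = (A + A.T) / 2                        # symmetric connections
np.fill_diagonal(W, 0.0)                 # no self-connections
b = rng.normal(size=n)
s0 = rng.integers(0, 2, size=n).astype(float)
final_state, energies = settle(s0, W, b)
```

Each unit needs only its own bias and incoming weights to compute `gap`, and no single update ever raises the global energy.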


* ...this is getting too difficult to understand without focused application!