# lesson 0.5: introduction to neural networks

Objectives:
  - explain what **neural network** means in the context of machine learning
  - describe at a high level how neural networks are trained
  - relate neural networks to function approximation
  - name some neural network models commonly used in applied settings

## what do machine learning researchers mean by "neural network"?
+ a mathematical model that computes functions in a way that resembles, at a very high level, how we think networks of neurons communicate in the brain

### an example neural network

![net](./images/lesson-0.5/net.gif)

+ takes as *input* a *vector* $x$ which activates *artificial neurons* in each of its *fully-connected layers* in a *feedforward* way to produce as *output* another vector $y$
  - **input**: e.g., an image, flattened to make it into a vector
  - **artificial neurons**: have an **activation** as explained below
  - **layer**: unit of a network, made up of neurons, each of which takes input from neurons in the previous layer and is connected to neurons in the next layer
    + e.g., hidden layers, input layer, output layer
  - **fully-connected**: every neuron $m_i$ in a layer $m$ connects to every neuron $n_j$ in the next layer $n$
  - **output**: e.g., a length-$k$ vector of probabilities, where entry $i$ is the probability that the input is an image of class $i$
  - **feedforward**: information only moves in one direction in this network, from the input to the output. There are no connections that let information flow backwards.
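
The terms above can be tied together in a few lines of NumPy. This is only a minimal sketch: the layer sizes, the random weights, the bias terms, and the choice of sigmoid and softmax activations are all assumptions made for illustration, not something fixed by the network in the figure.

```python
import numpy as np

def sigmoid(z):
    # Non-linear activation used for the hidden layers (an assumed choice).
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Turns the output layer's values into probabilities that sum to 1.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def forward(x, layers):
    """Feedforward pass: information only moves from input to output.

    layers is a list of (W, b) pairs. Each W has shape (n_out, n_in), so it
    holds one weight for every pair of neurons in adjacent layers, which is
    what "fully-connected" means.
    """
    a = x
    for W, b in layers[:-1]:
        a = sigmoid(W @ a + b)      # hidden layers
    W, b = layers[-1]
    return softmax(W @ a + b)       # output layer

rng = np.random.default_rng(0)

# Hypothetical sizes: a flattened 28x28 image, two hidden layers, 10 classes.
sizes = [784, 16, 16, 10]
layers = [
    (rng.normal(scale=0.1, size=(n_out, n_in)), np.zeros(n_out))
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]

x = rng.random(784)      # stand-in for a flattened image
y = forward(x, layers)   # length-10 vector of class probabilities
```

Because the weights here are random, the probabilities in `y` are meaningless until the network has been trained.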

### artificial neurons
![neuron](./images/lesson-0.5/artificial-neuron.jpeg)
image from <https://dzone.com/articles/the-artificial-neural-networks-handbook-part-4>
  + a typical artificial neuron computes a function $y = g(x)$ that can be broken down into two parts
    - (1) the dot product of the inputs $x_i$ and the weights $w_i$, i.e., $\sum_i w_i x_i$
    - (2) a non-linear activation function applied to that sum
  
**The weights are what we change during learning!** I.e., they are the parameters of our model that we optimize.
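
As a minimal sketch of those two parts, here is a single neuron in NumPy. The ReLU activation, the bias term `b`, and the example numbers are assumptions for illustration; other activations (sigmoid, tanh) fit the same pattern.

```python
import numpy as np

def relu(z):
    # One common non-linear activation (an assumed choice): max(0, z).
    return np.maximum(0.0, z)

def neuron(x, w, b=0.0):
    z = np.dot(w, x) + b   # part (1): dot product of inputs and weights (plus bias)
    return relu(z)         # part (2): non-linear activation

x = np.array([1.0, 2.0, 3.0])       # inputs
w = np.array([0.5, -0.25, 0.25])    # weights: the parameters changed during learning
out = neuron(x, w, b=0.25)          # 0.5 - 0.5 + 0.75 + 0.25 = 1.0, and relu(1.0) = 1.0
```

Changing `w` changes the function the neuron computes, which is why learning amounts to adjusting the weights.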

## how are neural networks trained?
- typically, using **stochastic gradient descent** (SGD) in combination with **backpropagation**
  + stochastic gradient descent:
    + as with other machine learning algorithms, we create a loss function and then 
      optimize our model by minimizing that loss function
      - large datasets make it expensive to compute loss on the whole dataset
      - models with many parameters make it impractical to solve analytically for the parameters that minimize the loss
    + so instead, we approximate the gradient with one sample or a batch of samples from our training set
  + backpropagation:
    - essentially, running the loss "backward" through the network to find how much each weight contributed to the error
    - mathematically speaking, an application of the chain rule
    - in practice, done with **automatic differentiation**
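
To make the two ideas concrete, here is a rough sketch of minibatch SGD with the backward pass written out by hand. Everything specific in it is an assumption for illustration: the toy task (fitting $\sin(x)$), the single hidden layer of 16 tanh neurons, the mean squared error loss, and the learning rate. In practice a library's automatic differentiation would compute these gradients for you.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set (hypothetical): learn y = sin(x) on [-pi, pi].
X = rng.uniform(-np.pi, np.pi, size=(256, 1))
Y = np.sin(X)

# One hidden layer of 16 tanh neurons, followed by a linear output neuron.
W1 = rng.normal(scale=0.5, size=(1, 16))
b1 = np.zeros(16)
W2 = rng.normal(scale=0.5, size=(16, 1))
b2 = np.zeros(1)

def forward(x):
    h = np.tanh(x @ W1 + b1)   # hidden activations
    return h, h @ W2 + b2      # (hidden activations, prediction)

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

initial_loss = mse(forward(X)[1], Y)

lr, batch_size = 0.02, 32
for step in range(3000):
    # Stochastic part: estimate the gradient from a random batch,
    # not from the whole training set.
    idx = rng.integers(0, len(X), size=batch_size)
    xb, yb = X[idx], Y[idx]
    h, pred = forward(xb)

    # Backpropagation: apply the chain rule from the loss back to each weight.
    d_pred = 2.0 * (pred - yb) / batch_size   # dLoss/dpred for MSE
    dW2 = h.T @ d_pred
    db2 = d_pred.sum(axis=0)
    d_h = d_pred @ W2.T                       # error passed back to the hidden layer
    d_z = d_h * (1.0 - h ** 2)                # derivative of tanh
    dW1 = xb.T @ d_z
    db1 = d_z.sum(axis=0)

    # Gradient descent step: move each parameter against its gradient.
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2

final_loss = mse(forward(X)[1], Y)
```

Comparing `final_loss` with `initial_loss` shows whether the updates reduced the error on the training set. The fitted network is also a small instance of function approximation: it stands in for $\sin(x)$ on the sampled interval.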

## neural networks commonly used in applied settings

### *convolutional* neural networks
![convnet](./images/lesson-0.5/convnet.gif)

The first sign that neural networks would become state of the art in computer vision came in 2012, when a convolutional neural network commonly referred to as AlexNet achieved the highest accuracy in that year's ImageNet competition. The layer types used in AlexNet are still common in state-of-the-art CNNs.
![alexnet](./images/lesson-0.5/AlexNet-Architecture.png)

### *recurrent* neural networks
