## The Theory Behind Neural Networks

Using a *simpler image classification example*, we need to load and visualize our data, edit our data (reshape, normalize, transform to categorical), create our model, compile our model and train the model on our data.
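
As a rough illustration of the data-editing steps above (reshape, normalize, transform to categorical), here is a minimal NumPy sketch. The 28x28 image size, the 10 classes, and the variable names are assumptions made for illustration, and the random pixels only stand in for real images.

```python
import numpy as np

# Hypothetical stand-in data: 5 grayscale images of 28x28 pixels, values 0-255
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(5, 28, 28))
labels = np.array([3, 0, 9, 1, 3])
num_classes = 10  # assumption: one class per digit

# Reshape: flatten each 2D image into a 1D vector of 784 values
flat = images.reshape(len(images), -1)

# Normalize: scale pixel values from [0, 255] to [0, 1]
normalized = flat / 255.0

# Transform to categorical: one-hot encode the integer labels
one_hot = np.eye(num_classes)[labels]

print(flat.shape)     # (5, 784)
print(one_hot.shape)  # (5, 10)
```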

For the *layer sizes*, we set the number of inputs, add some layers of neurons (or nodes), and finish with an output layer of `x` neurons (one for each digit).

Every neuron in each layer is connected to every neuron in the next layer. This is what the Keras “Dense” layer means. That many connections can store a whole lot of information.

So if each neuron is connected to the others, how do we store information? We store it in the “weights” between two neurons. This is analogous to the strength of the connections between the neurons in our brain (although with some differences in the mechanisms).
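
To make the idea of weights concrete, a fully connected layer can be sketched as a matrix holding one weight per connection. The layer sizes below are illustrative assumptions, and this is plain NumPy, not the Keras `Dense` implementation.

```python
import numpy as np

n_inputs = 784   # assumption: a flattened 28x28 image
n_neurons = 512  # assumption: size of one layer

rng = np.random.default_rng(0)
# One weight for every (input, neuron) connection, plus one bias per neuron
weights = rng.normal(size=(n_inputs, n_neurons))
biases = np.zeros(n_neurons)

x = rng.random(n_inputs)        # one input example
outputs = x @ weights + biases  # every neuron sees every input

print(outputs.shape)               # (512,)
print(weights.size + biases.size)  # 401920 learnable parameters
```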

1. **A Simpler Model**
    
    The earliest use case of one neuron was to build a **regression line**, using the equation $y = mx + b$. **Regression is when we use a continuous input to predict a continuous output**.
    
    For each variable coming into our neuron, we’re going to find a **slope** ($m$), or *“weight”* for it. The $b$, or *y-intercept*, from $y=mx+b$ is stored with the neuron. Our goal here is to **find a line that can go through these points**. That way, we can use that line to make predictions.
    
    Traditional regression lines are built on the concept of **least squared error**, meaning, we’re going to take the difference of each estimate from the true value and square it.
    
    ```python
    def get_rmse(data, m, b):
        """Calculates Root Mean Square Error for the line y = m * x + b."""
        n = len(data)
        squared_error = 0
        for x, y in data:
            # Find predicted y
            y_hat = m * x + b
            # Square difference between prediction and true value
            squared_error += (y - y_hat) ** 2
        # Average the squared errors, then take the square root
        mse = squared_error / n
        return mse ** .5
    ```
    
    To find the lowest point of the **loss curve**, we use:
    
    - **The Gradient**: used by the gradient descent algorithm to find the direction in which the loss decreases the most (the opposite of the gradient).
    - **$λ$ - The learning rate**: how far to travel.
    - **Epoch**: One full pass of the model over the entire dataset.
    - **Batch**: A sample of the full dataset.
    - **Step**: An update to the weight parameters.
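
    The terms above can be tied together in a minimal sketch of mini-batch gradient descent on $y = mx + b$. The data, learning rate, batch size, and number of epochs are illustrative assumptions, and the sketch minimizes the mean squared error (without the square root) to keep the gradient simple.

    ```python
    import random

    # Hypothetical data lying exactly on y = 2x + 1
    data = [(x / 10, 2 * (x / 10) + 1) for x in range(20)]

    m, b = 0.0, 0.0      # initial guesses for slope and intercept
    learning_rate = 0.1  # how far to travel on each step
    batch_size = 5
    epochs = 500
    steps = 0

    random.seed(0)
    for epoch in range(epochs):  # one epoch = one pass over the full dataset
        random.shuffle(data)
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]  # a batch = a sample of the dataset
            # Gradient of the mean squared error with respect to m and b
            grad_m = sum(-2 * x * (y - (m * x + b)) for x, y in batch) / len(batch)
            grad_b = sum(-2 * (y - (m * x + b)) for x, y in batch) / len(batch)
            # A step = one update to the parameters, moving against the gradient
            m -= learning_rate * grad_m
            b -= learning_rate * grad_b
            steps += 1

    print(steps)  # 2000 updates: 4 batches per epoch * 500 epochs
    print(round(m, 2), round(b, 2))  # close to the true slope 2 and intercept 1
    ```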
    
    A lot of research has been done on the best way to define the learning rate, and machine learning frameworks have a few tools that will automatically adjust the learning rate.
    
    > For instance, a popular one is **Adam (Adaptive Moment Estimation)**, which builds on momentum: think of our loss curve as a mountain and our position as a marble. If we drop a marble on top of a mountain, it will pick up speed, jumping over trenches (local minima) before hopefully landing at a lower minimum. Adam is not the only option: “**RMSprop**”, an earlier adaptive method, is the default optimizer in the Keras `compile` method.
    > 
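
    The marble analogy can be sketched with plain momentum, which is only one ingredient of Adam and not the full algorithm. The loss function, starting point, learning rate, and momentum coefficient below are illustrative assumptions.

    ```python
    def grad(w):
        """Gradient of the hypothetical loss (w - 3) ** 2."""
        return 2 * (w - 3)

    w = 0.0
    velocity = 0.0
    learning_rate = 0.1
    momentum = 0.9  # fraction of the previous velocity that is kept

    for step in range(200):
        # The velocity accumulates past gradients, like a marble picking up speed
        velocity = momentum * velocity - learning_rate * grad(w)
        w += velocity

    print(round(w, 3))  # 3.0, the minimum of the loss
    ```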
2. **From Neuron to Network**
    
    After explaining a few deep learning concepts, we can start building up our network. Instead of having a single `x`, we’re going to have multiple `x` inputs and find a **weight** for each of them. We just need to find the gradient for each new variable.
    
    We can also take the output of one neuron and feed it into another, and as long as we don’t make a loop, we can connect the same output to multiple inputs. And with this we officially have a **deep learning network**. When we calculate the gradients for gradient descent, we can use the error calculated in a later neuron as part of the error for the previous neuron it’s connected to.
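
    A minimal sketch of this idea, assuming two linear neurons chained together and a squared-error loss. All names and values are hypothetical, and the gradients are written out by hand with the chain rule instead of relying on a framework.

    ```python
    # First neuron: two inputs, each with its own weight
    x1, x2 = 1.0, 2.0
    w1, w2, b1 = 0.5, 0.25, 0.0
    h = w1 * x1 + w2 * x2 + b1  # output of the first neuron

    # Second neuron: takes the first neuron's output as its input
    w3, b2 = 2.0, 0.0
    y_hat = w3 * h + b2

    y = 1.0  # target value
    loss = (y - y_hat) ** 2

    # Error measured at the later neuron
    d_loss_d_yhat = -2 * (y - y_hat)
    d_loss_d_w3 = d_loss_d_yhat * h
    # The same error is passed back to the earlier neuron through w3
    d_loss_d_h = d_loss_d_yhat * w3
    d_loss_d_w1 = d_loss_d_h * x1
    d_loss_d_w2 = d_loss_d_h * x2

    print(d_loss_d_w1, d_loss_d_w2)  # 4.0 8.0
    ```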
    
    ---
    
    > **Backpropagation algorithm** - It is essential for training large neural networks quickly. The activation function has to be a non-linear function; otherwise, the neural network will only be able to learn linear models.
    > 
    
    > A commonly used activation function is the **Sigmoid function**: $f(x)=\frac{1}{1+e^{-x}}$.
    > 
    
    > The goal is to **learn the weights of the network automatically from data** such that the predicted output $y_{output}$ is close to the target $y_{target}$ for all inputs $x_{input}$.
    > 
    
    ---
    
3. **Activation Functions**
    
    There are a lot of activation functions. Some of the most commonly used are:
    
    1. **Linear**
        - $ŷ=wx + b$
        
        ```python
        # Multiply each input
        # by a weight (w) and
        # add an intercept (b)
        y_hat = w * x + b
        ```
        
    2. **ReLU (rectified linear activation function)**
        - $ŷ=\begin{cases} wx + b & \text{if } wx+b > 0 \\ 0 & \text{otherwise} \end{cases}$
        
        ```python
        # Only return result
        # if total is positive
        linear = w * x + b
        y_hat = linear * (linear > 0)
        ```
        
    3. **Sigmoid**
        - $ŷ=\frac{1}{1 + e ^ {-(wx+b)}}$
        
        ```python
        import numpy as np

        # Start with line
        linear = w * x + b
        # Warp to the range (0, inf)
        inf_to_zero = np.exp(-1 * linear)
        # Squish to the range (0, 1)
        y_hat = 1 / (1 + inf_to_zero)
        ```
        
    
    Computers like equations of a line because they’re quick to compute and it’s easy for us to give them the rules on how to differentiate them.
    
    One easy way to add non-linearity is to feed our equation of a line into a non-linear function.
    
    Given enough data, our neural network will figure out the parameters of each of these sub-components for us. Having a general understanding of the shape of the data and the relationship between the variables can help us build more efficient models for our data, saving time and computation.
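
    A small sketch of why the non-linear function matters: two lines fed into each other are still a line, while a ReLU in between is not. The weights are arbitrary values chosen for illustration.

    ```python
    def line(x, w, b):
        return w * x + b

    def relu(z):
        return max(0.0, z)

    def stacked_linear(x):
        # 3 * (2x + 1) + 2 simplifies to the single line 6x + 5
        return line(line(x, 2.0, 1.0), 3.0, 2.0)

    def stacked_relu(x):
        # The ReLU flattens the output wherever 2x + 1 is negative
        return line(relu(line(x, 2.0, 1.0)), 3.0, 2.0)

    for x in (-2.0, -1.0, 0.0, 1.0):
        print(x, stacked_linear(x), stacked_relu(x))
    ```

    For the negative inputs, `stacked_relu` stays flat at 2.0 while `stacked_linear` keeps following $6x + 5$, so only the second network has a bend a single line cannot reproduce.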
    
4. **Overfitting**
    
    Why not have a super large neural network?
    
    - The problem, overfitting, goes back to classical statistics, but it still plagues neural networks.
    - Not all problems are that simple, and it’s the job of a data scientist to determine the right model complexity for the problem at hand.
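
    One way to see the problem is to fit models of different complexity to the same noisy data. The sketch below uses polynomial degree as a stand-in for network size, which is an assumption made to keep the example small, and the data is synthetic.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def make_data(n):
        # Hypothetical data: a straight line plus noise
        x = rng.uniform(0, 1, n)
        y = 2 * x + 1 + rng.normal(0, 0.2, n)
        return x, y

    x_train, y_train = make_data(12)
    x_valid, y_valid = make_data(100)

    def rmse(coeffs, x, y):
        return float(np.sqrt(np.mean((np.polyval(coeffs, x) - y) ** 2)))

    results = {}
    for degree in (1, 6):
        coeffs = np.polyfit(x_train, y_train, degree)
        results[degree] = (rmse(coeffs, x_train, y_train),
                           rmse(coeffs, x_valid, y_valid))
        print(degree, results[degree])
    ```

    The more flexible model always fits the training points at least as well, but that says nothing about how it does on data it has not seen, which is why the validation error is the number to watch.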

5. **From neuron to classification**