# Neural networks

Now this is the big topic of the last few decades. Since the 2010s, neural networks have become the most popular method applied in the field of ML. Why is that? We'll talk a bit about the history of neural networks, what they needed and lacked and what changed.

#### History

Neural networks were first introduced in the 1940s by Warren McCulloch and Walter Pitts. They proposed a simple electrical circuit that could be used to model the behavior of neurons. Later, Donald Hebb put forward the idea that neural pathways strengthen with repeated use, especially between neurons that fire together. That was the beginning of mapping the brain to a computer.  

The first neural network algorithm came from Frank Rosenblatt in 1957. He proposed the perceptron, a simple neural network that could be used to classify linearly separable data. The perceptron was a single-layer neural network with a single output neuron. The input was processed as a weighted sum; then a threshold was applied, and the output was either 0 or 1.

![An image of the perceptron; it has an input layer, does a weighted sum over the numbers and applies a threshold to determine the output as 1 or 0](https://miro.medium.com/max/1400/1*ofVdu6L3BDbHyt1Ro8w07Q.png)  
Via [Towards Data Science](https://towardsdatascience.com/rosenblatts-perceptron-the-very-first-neural-network-37a3ec09038a)  
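
To make the weighted sum and threshold concrete, here is a minimal sketch in plain Python. The function name, the `>=` comparison, and the weights and threshold are my own illustrative choices (they happen to reproduce a logical AND); none of them come from Rosenblatt's original design.

```python
def perceptron_output(inputs, weights, threshold):
    """Return 1 if the weighted sum of the inputs reaches the threshold, else 0."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    # Assumption: "reaching" the threshold counts as firing (>= rather than >).
    return 1 if weighted_sum >= threshold else 0


# Hypothetical weights and threshold that make the perceptron act like a logical AND.
and_weights = [1.0, 1.0]
and_threshold = 1.5

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", perceptron_output([a, b], and_weights, and_threshold))
```

Only the input `1 1` pushes the weighted sum (2.0) past the threshold (1.5), so it is the only case that outputs 1. AND is linearly separable, which is why a single perceptron can handle it.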

The goal of the perceptron was to "learn" the weights of the linear combination so as to minimize the difference between the predicted output and the desired output. Unfortunately, because it only computes a linear combination, the perceptron could only learn to classify linearly separable data. Stacking multiple layers would let the model learn more complex functions and finally separate data that is not linearly separable, but that brought a second problem: scale. Training such a model was far too complex for the hardware and methods of the time, and the prevailing view became that neural networks would lead nowhere. That's how the first "AI winter" came to be, as AI research was largely abandoned for a while.  

In the 1980s, the next efforts to bring AI forward began. Backpropagation had been implemented in the late 1960s, but it was only in the 1980s that people thought to use it for neural networks (we will discuss backpropagation in a future section). By this time, so-called "expert systems" had been adopted by many companies; they relied on sets of rules to solve problems and make decisions. Around 1990, the fall of expert systems brought a second AI winter, although a shorter one. Development of multi-layer perceptrons powered by the backpropagation algorithm kept going, but hardware was the bottleneck. These "slow learners" relied on many, many iterations to reach good solutions, so progress was slow and tied to the speed of processors.  

In the 2000s, the "AI summer" came. The development of GPUs (graphics processing units) allowed for a huge increase in the speed of neural networks. The first GPU was released in 1999, and by 2006, the first GPU-based neural network was developed. GPUs are built to perform many arithmetic operations in parallel. As it so happens, this is exactly what neural networks need, so training became faster by orders of magnitude. The growth of the internet and the increasing availability of datasets enabled deep learning, which is the name given to neural networks with many layers (definitions vary, but networks with more than 3 hidden layers are generally considered deep).  
In the 2010s, GPUs leaped in performance, cementing neural network research at the forefront of the AI industry. With time, specialized hardware was introduced, such as TPUs (Tensor Processing Units) and, more recently, ML accelerators on mobile devices.


# Introduction

Now that we know a bit about how neural networks came to be, let's talk about what they are and how they work. We've mentioned them before, so now we'll go into detail.  

A neural network is a model that tries to mimic the human brain. Instead of actual neurons, we're using numbers and functions to achieve results. Each cell in a network is a **neuron**. The connections represent the pathways between neurons. In ML, they are called **weights** (mathematically), or **edges** (graphically, a term from graph theory). The neurons are organized in layers (input, hidden, output), and we apply **activation functions** after each layer's values have been computed. These activation functions are used to introduce non-linearity into the network (we'll see in more detail as we implement a network).  

![A simple network with 3 input neurons connected to a hidden layer with 2 neurons and an output layer with 2 neurons. Each layer's neurons are fully connected to the next batch](../assets/simpleNN.png)  

Let's dissect this image, shall we?  
##### Input layer
We have neurons $n_1, n_2, n_3$ in the input layer. These receive the input data. The features we've been using so far are these neurons' values. Quite simple and direct so far. We now have to use and combine them somehow to reach useful information. The weights are the connections between neurons of consecutive layers. These denote the "importance" of any given neuron's value.

##### Hidden layer
These are the intermediary layers that do the bulk of the processing. Let's jump to an example. Using the diagram above, the 2 neurons in the hidden layer would receive the following values:
$$
\begin{align}
n_{1, hidden} &= w_{1, 1} \cdot n_1 + w_{2, 1} \cdot n_2 + w_{3, 1} \cdot n_3 \\
n_{2, hidden} &= w_{1, 2} \cdot n_1 + w_{2, 2} \cdot n_2 + w_{3, 2} \cdot n_3
\end{align}
$$
I haven't numbered the weights in the diagram, since it won't be that important. What matters is that we observe the following fact: what we're doing is essentially a linear combination of the input values. Remember when we discussed linear regression? We are following the same steps here. But something is missing: the bias term. Yes, we need it, as we've needed it before. Just as a line without an intercept cannot fit data that does not pass through the origin, a neural network cannot fit certain types of data without the bias term. To cover this need, we could add an extra neuron to each layer with a constant value of 1. The weight associated with that neuron would be the bias term.
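
Here is a small sketch of that computation in plain Python, including the bias term. The function name and every number are made up for illustration; `weights[i][j]` plays the role of $w_{i+1, j+1}$ in the equations above (Python counts from 0).

```python
def layer_values(inputs, weights, biases):
    """Compute each neuron's value as a weighted sum of the inputs plus a bias.

    weights[i][j] is the weight from input neuron i to neuron j of the next layer.
    """
    return [
        sum(inputs[i] * weights[i][j] for i in range(len(inputs))) + biases[j]
        for j in range(len(biases))
    ]


# Made-up numbers, purely for illustration.
inputs = [1.0, 2.0, 3.0]    # n_1, n_2, n_3
weights = [[0.5, -1.0],     # w_{1,1}, w_{1,2}
           [0.0, 2.0],      # w_{2,1}, w_{2,2}
           [1.0, 1.0]]      # w_{3,1}, w_{3,2}
biases = [0.5, -1.0]        # one bias per hidden neuron

hidden = layer_values(inputs, weights, biases)
print(hidden)  # [4.0, 5.0]

# Same result with the "extra neuron fixed at 1" trick: append 1 to the inputs
# and append the biases as one more row of weights.
inputs_with_one = inputs + [1.0]
weights_with_bias = weights + [biases]
same_hidden = layer_values(inputs_with_one, weights_with_bias, [0.0, 0.0])
print(same_hidden)  # [4.0, 5.0]
```

Both versions give the same values, which is why the bias can be treated as just another weight.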

##### Output layer
I have drawn 2 neurons in the output layer. It's important to introduce this fact early on: neural networks can produce any number of outputs, so it's a matter of the problem at hand and the result we want to achieve. After the computing has been done, these neurons will hold the final result: the output of the neural network.

### Warning: activation functions
I would like you to consider the following issue: as we've seen, going from one layer to the next involves computing a linear combination of the values in the previous layer. Apply this any number of times, and what you get in the end is still a linear combination of the inputs. A composition of linear functions is itself linear, so no matter how many layers we stack, the network can only model a line (or a flat plane in higher dimensions). This is problematic. We need to introduce non-linearity into the network, in order to help it fit data in high-dimensional space that is not linearly separable. This is why **activation functions** exist. They are applied after each layer's values have been computed.  
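
We can check the "still a linear combination" claim numerically. The sketch below uses made-up weights for a 3 → 2 → 2 network with no biases and no activation functions; the helper names are my own. It shows that two stacked layers give exactly the same output as one layer whose weights are the product of the two weight tables.

```python
def apply_layer(inputs, weights):
    """Weighted sums only: no bias, no activation function."""
    return [
        sum(inputs[i] * weights[i][j] for i in range(len(inputs)))
        for j in range(len(weights[0]))
    ]


def combine_layers(first, second):
    """Matrix product of two weight tables: one layer equivalent to applying both."""
    return [
        [sum(first[i][j] * second[j][k] for j in range(len(second)))
         for k in range(len(second[0]))]
        for i in range(len(first))
    ]


# Made-up weights, purely for illustration.
w_hidden = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]   # input -> hidden
w_output = [[2.0, 0.0], [1.0, -1.0]]               # hidden -> output
x = [1.0, 2.0, 3.0]

two_layers = apply_layer(apply_layer(x, w_hidden), w_output)
one_layer = apply_layer(x, combine_layers(w_hidden, w_output))

print(two_layers)  # [23.0, -3.0]
print(one_layer)   # [23.0, -3.0]
```

Without something non-linear between the layers, the hidden layer buys us nothing: a single layer can do the same job.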

There are lots of activation functions out there, and we could in theory use anything, but the purpose is to introduce non-linearity with the cheapest computational cost possible. Here are the most common ones:
- Sigmoid  
![sigmoid](https://miro.medium.com/max/640/1*Xu7B5y9gp0iL5ooBj7LtWw.webp)  
Via [Towards Data Science](https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6)
- Tanh  
![tanh](https://production-media.paperswithcode.com/methods/Screen_Shot_2020-05-27_at_4.23.22_PM_dcuMBJl.png)  
Via [Papers with code](https://paperswithcode.com/method/tanh-activation)
- ReLU  
![relu](https://www.nomidl.com/wp-content/uploads/2022/04/image-10.png)  
Via [Nomidl](https://www.nomidl.com/deep-learning/what-is-relu-and-sigmoid-activation-function/)
- Leaky ReLU  
![leaky relu](https://production-media.paperswithcode.com/methods/Screen_Shot_2020-05-25_at_3.09.45_PM.png)  
Via [Papers with code](https://paperswithcode.com/method/leaky-relu)
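
For reference, the four functions above can be sketched in a few lines of plain Python. These are the textbook definitions applied to a single number; the default `slope=0.01` for Leaky ReLU is a common choice that I'm assuming here, not a fixed rule.

```python
import math


def sigmoid(x):
    """Squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Squashes any real number into the range (-1, 1)."""
    return math.tanh(x)


def relu(x):
    """Keeps positive values and replaces negative ones with 0."""
    return max(0.0, x)


def leaky_relu(x, slope=0.01):
    """Like ReLU, but negative values are scaled by a small slope instead of zeroed."""
    return x if x > 0 else slope * x


for value in (-2.0, 0.0, 2.0):
    print(value, sigmoid(value), tanh(value), relu(value), leaky_relu(value))
```

In a network, the chosen function would be applied to every neuron's value in a layer, right after the weighted sums are computed.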

Anything that isn't a linear function such as $f(x) = x$ will do. But we have to be careful about the computation involved. This is why the industry turned to ReLU, for its simplicity and speed. We've talked about gradient descent before. Imagine calculating the derivative of Tanh vs ReLU. The former is $f'(x) = 1 - \tanh^2(x)$, which requires evaluating exponentials, while the latter is simply $f'(x) = 1$ if $x > 0$, and $f'(x) = 0$ otherwise. This is why ReLU is so popular; it may well be the most widely used activation function out there. We'll talk about the *vanishing gradient* problem when discussing the training of neural networks, and we'll see why sigmoid and tanh are not used as much.  
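
To see the difference in cost, here are the two derivatives side by side as a rough sketch. The function names are my own, and the derivative of ReLU at exactly 0 is undefined, so returning 0 there is a convention I'm assuming.

```python
import math


def tanh_derivative(x):
    """d/dx tanh(x) = 1 - tanh(x)^2, which needs a call to an exponential-based function."""
    return 1.0 - math.tanh(x) ** 2


def relu_derivative(x):
    """1 for positive inputs, 0 otherwise: a single comparison, no arithmetic."""
    return 1.0 if x > 0 else 0.0


for value in (-3.0, 0.5, 3.0):
    print(value, tanh_derivative(value), relu_derivative(value))
```

Notice also how small the tanh derivative gets for inputs far from 0, while ReLU's stays at exactly 1 for any positive input.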

### Some more details

There are a few things one should know about neural networks, right from the beginning. They can be arranged in a multitude of ways, which is why lots of architectures exist. Networks in which every neuron in one layer is connected to every neuron in the next are called *fully connected*. We'll see that some problems benefit from a network with a different topology.  
The more layers and neurons we have, the more complex the data we can fit. The great thing about networks is that they really excel at finding patterns in high-dimensional, highly complex data. They don't require human intervention all that much. If we see signs of high bias (underfitting), we can simply add more neurons (to existing layers or in new ones). This increases complexity. The problem then is to gather enough data to train these larger networks. They are prone to overfitting without a lot of data, precisely because they are so good at fitting complex patterns.  

One thing to keep in mind: the parameters we train with NNs are the *weights*. Neurons are simply information-holding cells. Activation functions are static procedures we apply to values. There is nothing we should be modifying except for weights. Now, consider the number of weights. They are parameters, so each one we add increases the required compute power. In the diagram above, we have 10 weights (3 × 2 between the input and hidden layers, plus 2 × 2 between the hidden and output layers). That's 10 parameters to tweak. Add the bias terms (in practice, you really should), and we get another 4 (one per hidden and output neuron), so that's 14 total. Things get really complicated really fast.
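
The counting above generalizes to any fully connected network. Below is a small sketch; the function name is my own, and the last network (784 → 128 → 64 → 10) is a hypothetical larger example, not something from the diagram.

```python
def count_parameters(layer_sizes, with_bias=True):
    """Count weights (and optionally biases) in a fully connected network.

    layer_sizes lists the number of neurons per layer, input layer first.
    """
    # Every neuron in one layer connects to every neuron in the next layer.
    weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    # One bias per neuron in every layer except the input layer.
    biases = sum(layer_sizes[1:]) if with_bias else 0
    return weights + biases


print(count_parameters([3, 2, 2], with_bias=False))  # 10
print(count_parameters([3, 2, 2]))                   # 14
print(count_parameters([784, 128, 64, 10]))          # 109386
```

Going from the toy network in the diagram to a still-modest one takes us from 14 parameters to over a hundred thousand.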

# Training: How do we find the optimal parameters?

Starting with the basics, neural networks are a supervised learning algorithm. This means we know the desired output for each training example, which in turn means we can define error/loss functions. These tell us how far from the real answer we are.  
The next step is to define a method that brings us closer to the right values for the parameters. We could guess and check. Imagine how slow that would be, given that even a simple network like the one above has 14 parameters to tweak. So then we use math. We turn to *gradient descent*.
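
As a reminder of what a loss function looks like, here is a sketch using mean squared error. MSE is only one possible choice, picked here as an assumption for illustration, and the two "guesses" are made-up network outputs.

```python
def mean_squared_error(predicted, expected):
    """Average of the squared differences between predictions and desired outputs."""
    return sum((p - e) ** 2 for p, e in zip(predicted, expected)) / len(expected)


expected = [1.0, 0.0]       # the known, desired output for one training example
close_guess = [0.9, 0.1]    # hypothetical output of a well-tuned network
poor_guess = [0.2, 0.7]     # hypothetical output of a badly-tuned network

print(mean_squared_error(close_guess, expected))
print(mean_squared_error(poor_guess, expected))
```

The closer the network's output is to the desired one, the smaller the loss. Training is then the search for the weights that make this number as small as possible.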