# What is an artificial neural network?

An artificial neural network is a powerful but complicated machine learning technique that mimics the human brain and how it functions.
The human brain contains billions of neurons arranged in an interconnected network. Neurons connect to one another through axons and pass electrical signals across junctions called synapses. This is how we humans learn. Whenever we see, hear, feel, or think something, an electrical impulse is fired from one neuron to another across a synapse, which enables us to learn and remember things throughout our lives.


Artificial neural networks are one of the main tools used in machine learning. As the “neural” part of their name suggests, they are brain-inspired systems which are intended to replicate the way that we humans learn. Neural networks consist of input and output layers, as well as (in most cases) a hidden layer consisting of units that transform the input into something that the output layer can use. They are excellent tools for finding patterns which are far too complex or numerous for a human programmer to extract and teach the machine to recognize.

While neural networks (early versions of which were called “perceptrons”) have been around since the 1940s, it is only in the last several decades that they have become a major part of artificial intelligence. This is due to the arrival of a technique called “backpropagation,” which allows networks to adjust their hidden layers of neurons in situations where the outcome doesn’t match what the creator is hoping for, such as a network designed to recognize dogs that misidentifies a cat.

Another important advance has been the arrival of deep learning neural networks, in which different layers of a multilayer network extract different features until it can recognize what it is looking for.

# Sounds pretty complex. Can you explain it like I’m five?

For a basic idea of how a deep learning neural network learns, imagine a factory line. After the raw materials (the data set) are input, they are then passed down the conveyor belt, with each subsequent stop or layer extracting a different set of high-level features. If the network is intended to recognize an object, the first layer might analyze the brightness of its pixels.

The next layer could then identify any edges in the image, based on lines of similar pixels. After this, another layer may recognize textures and shapes, and so on. By the time the fourth or fifth layer is reached, the deep learning net will have created complex feature detectors. It can figure out that certain image elements (such as a pair of eyes, a nose, and a mouth) are commonly found together.

Once this is done, the researchers who have trained the network can give labels to the output, and then use backpropagation to correct any mistakes which have been made. After a while, the network can carry out its own classification tasks without needing humans to help every time.

Beyond this, there are different types of learning, such as supervised or unsupervised learning or reinforcement learning, in which the network learns for itself by trying to maximize its score — as memorably carried out by Google DeepMind’s Atari game-playing bot.

# How many types of neural network are there?

There are multiple types of neural network, each of which comes with its own specific use cases and levels of complexity. The most basic type of neural net is something called a **feedforward neural network,** in which information travels in only one direction from input to output.

A more widely used type of network is the **recurrent neural network,** in which data can flow in multiple directions. These neural networks possess greater learning abilities and are widely employed for more complex tasks such as learning handwriting or language recognition.

There are also **convolutional neural networks, Boltzmann machine networks, Hopfield networks,** and a variety of others. Picking the right network for your task depends on the data you have to train it with, and the specific application you have in mind. In some cases, it may be desirable to use multiple approaches, such as would be the case with a challenging task like voice recognition.

# What kind of tasks can a neural network do?
A quick scan of our archives suggests the proper question here should be “what tasks can’t a neural network do?” From making cars drive autonomously on the roads, to generating shockingly realistic CGI faces, to machine translation, to fraud detection, to reading our minds, to recognizing when a cat is in the garden and turning on the sprinklers; neural nets are behind many of the biggest advances in A.I.

Broadly speaking, however, they are designed for spotting patterns in data. Specific tasks could include classification (classifying data sets into predefined classes), clustering (classifying data into different undefined categories), and prediction (using past events to guess future ones, like the stock market or movie box office).

# How exactly do they “learn” stuff?

In the same way that we learn from experience in our lives, neural networks require data to learn. In most cases, the more data that can be thrown at a neural network, the more accurate it will become. Think of it like any task you do over and over. Over time, you gradually get more efficient and make fewer mistakes.

When researchers or computer scientists set out to train a neural network, they typically divide their data into three sets. First is a training set, which helps the network establish the various weights between its nodes. After this, they fine-tune it using a validation data set. Finally, they’ll use a test set to see if it can successfully turn the input into the desired output.
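The three-way split described above can be sketched in plain Python. The function name `split_dataset` and the 70/15/15 proportions are illustrative assumptions, not something the text prescribes.

```python
import random

def split_dataset(samples, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle samples and split them into train, validation, and test lists.

    The fractions are illustrative defaults; whatever remains after the
    training and validation portions becomes the test set.
    """
    shuffled = list(samples)               # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for reproducibility
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))
```

The training portion is used to fit the weights, the validation portion to tune choices such as the amount of regularization, and the test portion only for the final check.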

# Do neural networks have any limitations?

On a technical level, one of the bigger challenges is the amount of time it takes to train networks, which can require a considerable amount of compute power for more complex tasks. The biggest issue, however, is that neural networks are “black boxes,” in which the user feeds in data and receives answers. They can fine-tune the answers, but they don’t have access to the exact decision making process.

This is a problem a number of researchers are actively working on, but it will only become more pressing as artificial neural networks play a bigger and bigger role in our lives.

# Types of transfer functions

The transfer function (activation function) of a neuron is chosen to have a number of properties which either enhance or simplify the network containing the neuron. Crucially, for instance, any multilayer perceptron using a linear transfer function has an equivalent single-layer network; a non-linear function is therefore necessary to gain the advantages of a multi-layer network.
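To see why a linear transfer function gains nothing from extra layers, the sketch below composes two purely linear layers and checks that a single layer whose weight matrix is the product of the two gives the same output. The matrices `W1` and `W2` and the helper names are made up for illustration, and bias terms are omitted for brevity.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

# Hypothetical weights: layer 1 maps 3 inputs to 2 units, layer 2 maps 2 units to 1 output.
W1 = [[1.0, -2.0, 0.5],
      [0.0, 3.0, 1.0]]
W2 = [[2.0, -1.0]]

x = [1.0, 2.0, 3.0]

two_layer = matvec(W2, matvec(W1, x))   # linear layer followed by linear layer
one_layer = matvec(matmul(W2, W1), x)   # single equivalent linear layer

print(two_layer, one_layer)  # both are [-12.0]
```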

Below, u refers in all cases to the weighted sum of all the inputs to the neuron, i.e. for n inputs,
$$
u=\sum_{i=1}^{n} w_{i} x_{i}
$$
where w is a vector of synaptic weights and x is a vector of inputs.

## 1. Step function
The output y of this transfer function is binary, depending on whether the input meets a specified threshold, θ. The "signal" is sent, i.e. the output is set to one, if the activation meets the threshold.

$$
y=\left\{\begin{array}{ll}{1} & {\text { if } u \geq \theta} \\ {0} & {\text { if } u<\theta}\end{array}\right.
$$

This function is used in perceptrons and often shows up in many other models. It performs a division of the space of inputs by a hyperplane. It is especially useful in the last layer of a network intended to perform binary classification of the inputs. It can be approximated by other sigmoidal functions by assigning large values to the weights.
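As a small illustration, a step-function neuron can be written in a few lines. The weights and threshold below are hypothetical values chosen so that the neuron behaves like a logical AND; they are not taken from the text.

```python
def weighted_sum(weights, inputs):
    """u = sum_i w_i * x_i."""
    return sum(w * x for w, x in zip(weights, inputs))

def step_neuron(weights, inputs, theta):
    """Return 1 if the weighted sum reaches the threshold theta, else 0."""
    return 1 if weighted_sum(weights, inputs) >= theta else 0

# Hypothetical weights and threshold that make the neuron act as a logical AND.
weights, theta = [1.0, 1.0], 1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, step_neuron(weights, x, theta))
```

Only the input (1, 1) gives a weighted sum of 2.0, which is at or above the threshold of 1.5, so it is the only input that fires the neuron. The line u = 1.5 is the hyperplane that divides the input space.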

## 2. Linear combination
In this case, the output unit is simply the weighted sum of its inputs plus a bias term. A number of such linear neurons perform a linear transformation of the input vector. This is usually more useful in the first layers of a network. A number of analysis tools exist based on linear models, such as harmonic analysis, and they can all be used in neural networks with this linear neuron. The bias term allows us to make affine transformations to the data.

## 3. Sigmoid
A fairly simple non-linear function, a sigmoid function such as the logistic function also has an easily calculated derivative, which can be important when calculating the weight updates in the network. It thus makes the network easier to manipulate mathematically, and was attractive to early computer scientists who needed to minimize the computational load of their simulations. It was previously common in multilayer perceptrons. However, more recent work has shown sigmoid neurons to be less effective than rectified linear neurons. The reason is that the gradients computed by the backpropagation algorithm tend to diminish towards zero as they are propagated back through layers of sigmoidal neurons, making it difficult to optimize neural networks with many such layers.
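The easily calculated derivative and the shrinking gradient can both be seen in a short sketch. It assumes the logistic function as the sigmoid and, as a simplification, ignores the weights when multiplying per-layer derivative factors.

```python
import math

def sigmoid(v):
    """Logistic sigmoid 1 / (1 + exp(-v))."""
    return 1.0 / (1.0 + math.exp(-v))

def sigmoid_prime(v):
    """Derivative of the logistic sigmoid: sigma(v) * (1 - sigma(v))."""
    s = sigmoid(v)
    return s * (1.0 - s)

# The derivative peaks at v = 0 with value 0.25, so each sigmoid layer
# scales the back-propagated gradient by at most 0.25
# (ignoring the weights, which is a simplification).
factor = 1.0
for layer in range(1, 11):
    factor *= sigmoid_prime(0.0)
    print(layer, factor)
```

After ten layers the best-case factor is already 0.25 to the power 10, which is below one millionth.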

## 4. Rectifier (ReLU, Rectified Linear Unit)
In the context of artificial neural networks, the rectifier is an activation function defined as the positive part of its argument:

$$
f(x)=x^{+}=\max (0, x)
$$

where x is the input to a neuron. This is also known as a ramp function and is analogous to half-wave rectification in electrical engineering. This activation function was first introduced to a dynamical network by Hahnloser et al. in a 2000 paper in Nature, with strong biological motivations and mathematical justifications. It was first demonstrated in 2011 to enable better training of deeper networks, compared to the activation functions widely used before then, i.e., the logistic sigmoid (which is inspired by probability theory; see logistic regression) and its more practical counterpart, the hyperbolic tangent.
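The rectifier, being the positive part of its input, is short enough to write directly. The sketch below also includes its derivative, taking the value at exactly zero to be 0, which is a common convention and an assumption here.

```python
def relu(x):
    """Positive part of x: max(0, x)."""
    return max(0.0, x)

def relu_prime(x):
    """Derivative of the rectifier; the value at x == 0 is set to 0 by convention."""
    return 1.0 if x > 0 else 0.0

for x in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    print(x, relu(x), relu_prime(x))
```

For positive inputs the derivative is exactly 1, so the gradient passes through a rectified unit without being scaled down.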

Source: https://en.wikipedia.org/wiki/Artificial_neuron

A neural network can be thought of as a nonlinear generalization of the linear model, both for regression and classification.

# Fitting Neural Networks
 
The neural network model has unknown parameters, often called weights, and we seek values for them that make the model fit the training data well. We denote the complete set of weights by θ; its components are listed after the model definition below.

For K-class classification, there are K units at the top, with the kth unit modeling the probability of class k. There are K target measurements $Y_k$, k = 1,...,K, each being coded as a 0−1 variable for the kth class. Derived features $Z_m$ are created from linear combinations of the inputs, and then the target $Y_k$ is modeled as a function of linear combinations of the $Z_m$,

$$
\begin{aligned} Z_{m} &=\sigma\left(\alpha_{0 m}+\alpha_{m}^{T} X\right), m=1, \ldots, M \\ T_{k} &=\beta_{0 k}+\beta_{k}^{T} Z, k=1, \ldots, K \\ f_{k}(X) &=g_{k}(T), k=1, \ldots, K \end{aligned} \tag{-0}
$$

where $Z=(Z_1, Z_2, \ldots, Z_M)$ and $T=(T_1, T_2, \ldots, T_K)$.
The activation function σ(v) is usually chosen to be the sigmoid $\sigma(v)=1 /\left(1+e^{-v}\right)$.


The complete set of weights θ consists of

$$
\begin{aligned} &\left\{\alpha_{0 m}, \alpha_{m} ;\ m=1,2, \ldots, M\right\} \quad M(p+1) \text { weights, } \\ &\left\{\beta_{0 k}, \beta_{k} ;\ k=1,2, \ldots, K\right\} \quad K(M+1) \text { weights. } \end{aligned}
$$
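The model definition and the weight count can be checked with a minimal forward pass. This is a sketch under assumptions: the output function $g_k$ is taken to be the softmax (the usual choice for classification), and the weight values and the sizes p = 2, M = 3, K = 2 are made up.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def softmax(t):
    """g_k(T) = exp(T_k) / sum_l exp(T_l), shifted by max(T) for numerical stability."""
    m = max(t)
    exps = [math.exp(tk - m) for tk in t]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, alpha0, alpha, beta0, beta):
    """Single-hidden-layer forward pass.

    alpha0[m], alpha[m] : bias and weight vector of hidden unit m
    beta0[k],  beta[k]  : bias and weight vector of output unit k
    """
    z = [sigmoid(a0 + sum(a * xi for a, xi in zip(am, x)))
         for a0, am in zip(alpha0, alpha)]
    t = [b0 + sum(b * zm for b, zm in zip(bk, z))
         for b0, bk in zip(beta0, beta)]
    return z, t, softmax(t)

def count_weights(p, M, K):
    """M(p+1) hidden-layer weights plus K(M+1) output-layer weights."""
    return M * (p + 1) + K * (M + 1)

# Hypothetical network: p = 2 inputs, M = 3 hidden units, K = 2 classes.
alpha0 = [0.0, 0.1, -0.1]
alpha = [[0.5, -0.5], [0.3, 0.8], [-0.7, 0.2]]
beta0 = [0.0, 0.0]
beta = [[1.0, -1.0, 0.5], [-1.0, 1.0, 0.5]]

z, t, f = forward([1.0, 2.0], alpha0, alpha, beta0, beta)
print(f, sum(f))
print(count_weights(p=2, M=3, K=2))  # 3*(2+1) + 2*(3+1) = 17
```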

For regression, we use sum-of-squared errors as our measure of fit (error function):

$$
R(\theta)=\sum_{k=1}^{K} \sum_{i=1}^{N}\left(y_{i k}-f_{k}\left(x_{i}\right)\right)^{2}
$$

For classification we use either squared error or cross-entropy (deviance):

$$
R(\theta)=-\sum_{i=1}^{N} \sum_{k=1}^{K} y_{i k} \log f_{k}\left(x_{i}\right)
$$

and the corresponding classifier is $G(x)=\arg\max_{k} f_{k}(x)$. With the softmax activation function and the cross-entropy error function, the neural network model is exactly a linear logistic regression model in the hidden units, and all the parameters are estimated by maximum likelihood.
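The two error functions and the classifier rule can be written out for a small batch. The targets `Y` and predicted probabilities `F` below are made-up numbers for N = 2 observations and K = 3 classes; the function names are illustrative.

```python
import math

def squared_error(Y, F):
    """R = sum_i sum_k (y_ik - f_k(x_i))^2."""
    return sum((y - f) ** 2 for yi, fi in zip(Y, F) for y, f in zip(yi, fi))

def cross_entropy(Y, F):
    """R = -sum_i sum_k y_ik * log f_k(x_i), skipping terms with y_ik == 0."""
    return -sum(y * math.log(f)
                for yi, fi in zip(Y, F)
                for y, f in zip(yi, fi) if y != 0)

def classify(f):
    """G(x) = argmax_k f_k(x), returned as a 0-based class index."""
    return max(range(len(f)), key=lambda k: f[k])

# Made-up one-hot targets and predicted class probabilities.
Y = [[1, 0, 0], [0, 0, 1]]
F = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]

print(squared_error(Y, F))
print(cross_entropy(Y, F))
print([classify(f) for f in F])  # [0, 2]
```

With one-hot targets, the cross-entropy reduces to minus the sum of the log-probabilities assigned to the true classes.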

Typically we don’t want the global minimizer of R(θ), as this is likely to be an overfit solution. Instead some regularization is needed: this is achieved directly through a penalty term, or indirectly by early stopping.

**The generic approach to minimizing R(θ) is by gradient descent, called back-propagation in this setting.**
Because of the compositional form of the model, the gradient can be easily derived using the chain rule for differentiation. This can be computed by a forward and backward sweep over the network, keeping track only of quantities local to each unit.

Here is back-propagation in detail for squared error loss. Let
$$
z_{m i}=
\sigma\left(\alpha_{0 m}+\alpha_{m}^{T} x_{i}\right)
$$
and
$$
z_{i}=\left(z_{1 i}, z_{2 i}, \dots, z_{M i}\right)
$$
Then we have 
$$
\begin{aligned} R(\theta) & \equiv \sum_{i=1}^{N} R_{i} \\ &=\sum_{i=1}^{N} \sum_{k=1}^{K}\left(y_{i k}-f_{k}\left(x_{i}\right)\right)^{2} \end{aligned}
$$

with derivative

$$
\begin{array}{l}{\frac{\partial R_{i}}{\partial \beta_{k m}}=-2\left(y_{i k}-f_{k}\left(x_{i}\right)\right) g_{k}^{\prime}\left(\beta_{k}^{T} z_{i}\right) z_{m i}} \\ {\frac{\partial R_{i}}{\partial \alpha_{m \ell}}=-\sum_{k=1}^{K} 2\left(y_{i k}-f_{k}\left(x_{i}\right)\right) g_{k}^{\prime}\left(\beta_{k}^{T} z_{i}\right) \beta_{k m} \sigma^{\prime}\left(\alpha_{m}^{T} x_{i}\right) x_{i \ell}}\end{array}
$$

Given these derivatives, a gradient descent update at the (r + 1)st iteration has the form

$$
\begin{aligned} \beta_{k m}^{(r+1)} &=\beta_{k m}^{(r)}-\gamma_{r} \sum_{i=1}^{N} \frac{\partial R_{i}}{\partial \beta_{k m}^{(r)}} \\ \alpha_{m \ell}^{(r+1)} &=\alpha_{m \ell}^{(r)}-\gamma_{r} \sum_{i=1}^{N} \frac{\partial R_{i}}{\partial \alpha_{m \ell}^{(r)}} \end{aligned} \tag{-1}
$$

where $\gamma_{r}$ is the learning rate. Now write the derivatives above as

$$
\begin{array}{l}{\frac{\partial R_{i}}{\partial \beta_{k m}}=\delta_{k i} z_{m i}} \\ {\frac{\partial R_{i}}{\partial \alpha_{m \ell}}=s_{m i} x_{i \ell}}\end{array} \tag{-2}
$$

The quantities $\delta_{k i}$ and $s_{m i}$ are “errors” from the current model at the output and hidden layer units, respectively. From their definitions, these errors satisfy

$$
s_{m i}=\sigma^{\prime}\left(\alpha_{m}^{T} x_{i}\right) \sum_{k=1}^{K} \beta_{k m} \delta_{k i} \tag{-3}
$$

known as the **back-propagation equations.** Using this, the updates in **(-1)** can be implemented with a two-pass algorithm. In the forward pass, the current weights are fixed and the predicted values $\hat{f}_{k}\left(x_{i}\right)$ are computed. In the backward pass, the errors $\delta_{k i}$ are computed from **(-0)**, and then back-propagated via **(-3)** to give the errors $s_{m i}$. Both sets of errors are then used to compute the gradients for the updates in **(-1)**.

This two-pass procedure is what is known as back-propagation. It has also been called the delta rule (Widrow and Hoff, 1960). The computational components for cross-entropy have the same form as those for the sum of squares error function.
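The two-pass procedure can be sketched for a single observation with squared error loss. Several choices here are assumptions made for brevity: the output function $g_k$ is the identity (so its derivative is 1), biases are absorbed by prepending a constant 1 to the input and to the hidden vector, and the data, sizes, and learning rate are made up.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward(x, alpha, beta):
    """Forward pass. x already includes a leading 1 for the bias.

    alpha[m] are hidden-unit weights, beta[k] are output weights over
    (1, z_1, ..., z_M). The output function g_k is the identity.
    """
    z = [1.0] + [sigmoid(sum(a * xl for a, xl in zip(am, x))) for am in alpha]
    f = [sum(b * zm for b, zm in zip(bk, z)) for bk in beta]
    return z, f

def loss(x, y, alpha, beta):
    """R_i = sum_k (y_k - f_k(x))^2 for one observation."""
    _, f = forward(x, alpha, beta)
    return sum((yk - fk) ** 2 for yk, fk in zip(y, f))

def backprop(x, y, alpha, beta):
    """Return (dR_i/dbeta, dR_i/dalpha) using the two-pass algorithm."""
    # Forward pass: weights fixed, compute hidden activations and predictions.
    z, f = forward(x, alpha, beta)
    # Backward pass: output errors delta_k = -2 (y_k - f_k) g'_k, with g'_k = 1.
    delta = [-2.0 * (yk - fk) for yk, fk in zip(y, f)]
    # Hidden errors s_m = sigma'(alpha_m^T x) * sum_k beta_km delta_k.
    # z[0] is the bias unit, so hidden unit m lives at index m + 1.
    s = [z[m + 1] * (1.0 - z[m + 1]) *
         sum(beta[k][m + 1] * delta[k] for k in range(len(beta)))
         for m in range(len(alpha))]
    grad_beta = [[dk * zm for zm in z] for dk in delta]
    grad_alpha = [[sm * xl for xl in x] for sm in s]
    return grad_beta, grad_alpha

def gradient_step(x, y, alpha, beta, lr):
    """One gradient descent update on a single observation, in place."""
    grad_beta, grad_alpha = backprop(x, y, alpha, beta)
    for k, row in enumerate(grad_beta):
        for m, g in enumerate(row):
            beta[k][m] -= lr * g
    for m, row in enumerate(grad_alpha):
        for l, g in enumerate(row):
            alpha[m][l] -= lr * g

# Hypothetical tiny problem: 2 inputs (plus bias), 3 hidden units, 1 output.
rng = random.Random(0)
alpha = [[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(3)]
beta = [[rng.uniform(-0.5, 0.5) for _ in range(4)]]
x, y = [1.0, 0.5, -1.2], [0.8]

before = loss(x, y, alpha, beta)
for _ in range(50):
    gradient_step(x, y, alpha, beta, lr=0.05)
after = loss(x, y, alpha, beta)
print(before, after)
```

A useful sanity check for any back-propagation code is to compare its gradients against finite differences of the loss; the two should agree to several decimal places.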

Back-propagation can be very slow, and for that reason is usually not the method of choice. Second-order techniques such as Newton’s method are not attractive here, because the second derivative matrix of R (the Hessian) can be very large. Better approaches to fitting include conjugate gradients and variable metric methods. These avoid explicit computation of the second derivative matrix while still providing faster convergence.



# Some Issues in Training Neural Networks

## Starting value
Note that if the weights are near zero, then the operative part of the sigmoid is roughly linear, and hence the neural network collapses into an approximately linear model. Usually starting values for weights are chosen to be random values near zero. Hence the model starts out nearly linear, and becomes nonlinear as the weights increase. Individual units localize to directions and introduce nonlinearities where needed. Use of exact zero weights leads to zero derivatives and perfect symmetry, and the algorithm never moves. Starting instead with large weights often leads to poor solutions.

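## Overfitting

A more explicit method for regularization is weight decay, which is analogous to ridge regression used for linear models. We add a penalty to the error function and minimize $R(\theta)+\lambda J(\theta)$, where $J(\theta)=\sum_{k m} \beta_{k m}^{2}+\sum_{m \ell} \alpha_{m \ell}^{2}$ is the sum of squared weights and $\lambda \geq 0$ is a tuning parameter. Larger values of λ shrink the weights toward zero.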
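A minimal sketch of weight decay follows, assuming J(θ) is the sum of squared weights over both layers, as in ridge regression. The function names and the example weights are illustrative.

```python
def weight_decay_penalty(alpha, beta):
    """J(theta): sum of squared weights over both layers."""
    return (sum(a * a for row in alpha for a in row) +
            sum(b * b for row in beta for b in row))

def penalized_error(error, alpha, beta, lam):
    """R(theta) + lambda * J(theta)."""
    return error + lam * weight_decay_penalty(alpha, beta)

def decay_gradient(weights, lam):
    """Gradient of lambda * J with respect to each weight: 2 * lambda * w."""
    return [[2.0 * lam * w for w in row] for row in weights]

# Made-up weights: one hidden unit with two inputs, one output unit.
alpha = [[1.0, -2.0]]
beta = [[0.5]]

print(weight_decay_penalty(alpha, beta))        # 1 + 4 + 0.25 = 5.25
print(penalized_error(1.0, alpha, beta, 0.1))
print(decay_gradient(alpha, 0.1))
```

In a gradient descent update, the decay gradient is simply added to the gradient of R(θ), which pulls every weight toward zero in proportion to its size.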

## Scaling of the input

At the outset it is best to standardize all inputs to have mean zero and standard deviation one. This ensures all inputs are treated equally in the regularization process, and allows one to choose a meaningful range for the random starting weights. With standardized inputs, it is typical to take random uniform weights over a small range around zero.
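Standardizing an input column and drawing random uniform starting weights might look like the sketch below. The `limit` argument is a user choice rather than a value given in the text, the standard deviation is the population version (dividing by n), and a constant column (zero standard deviation) is not handled.

```python
import math
import random

def standardize(column):
    """Rescale a list of numbers to mean 0 and standard deviation 1."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    return [(v - mean) / std for v in column]

def init_weights(n_rows, n_cols, limit, seed=0):
    """Random uniform starting weights in [-limit, limit]; limit is a user choice."""
    rng = random.Random(seed)
    return [[rng.uniform(-limit, limit) for _ in range(n_cols)]
            for _ in range(n_rows)]

heights = [150.0, 160.0, 170.0, 180.0, 190.0]  # made-up input column
print(standardize(heights))
print(init_weights(2, 3, limit=0.5))
```

The mean and standard deviation should be computed on the training data only and then reused to transform validation and test inputs.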
 
## Number of Hidden Units and Layers

Generally speaking it is better to have too many hidden units than too few. With too few hidden units, the model might not have enough flexibility to capture the nonlinearities in the data; with too many hidden units, the extra weights can be shrunk toward zero if appropriate regularization is used. **Typically the number of hidden units is somewhere in the range of 5 to 100,** with the number increasing with the number of inputs and number of training cases. It is most common to put down a reasonably large number of units and train them with regularization. Some researchers use cross-validation to estimate the optimal number, but this seems unnecessary if cross-validation is used to estimate the regularization parameter. Choice of the number of hidden layers is guided by background knowledge and experimentation. Each layer extracts features of the input for regression or classification. Use of multiple hidden layers allows construction of hierarchical features at different levels of resolution.

## Multiple Minima

The error function R(θ) is nonconvex, possessing many local minima. As a result, the final solution obtained is quite dependent on the choice of starting weights. One must at least try a number of random starting configurations, and choose the solution giving lowest (penalized) error. Probably a better approach is to use the average predictions over the collection of networks as the final prediction (Ripley, 1996). This is preferable to averaging the weights, since the nonlinearity of the model implies that this averaged solution could be quite poor. Another approach is via **bagging,** which averages the predictions of networks trained on randomly perturbed versions of the training data.
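A deliberately extreme toy case shows why averaging predictions is safer than averaging weights. The two hypothetical networks below have one input, two sigmoid hidden units, and no biases; they differ only in the order of their hidden units, so they compute exactly the same function.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def predict(x, alpha, beta):
    """Tiny network without biases: f(x) = sum_m beta_m * sigmoid(alpha_m * x)."""
    return sum(b * sigmoid(a * x) for a, b in zip(alpha, beta))

# Two hypothetical fitted networks, identical up to swapping their hidden units.
net_a = ([2.0, -2.0], [1.0, -1.0])
net_b = ([-2.0, 2.0], [-1.0, 1.0])

x = 1.0
avg_prediction = (predict(x, *net_a) + predict(x, *net_b)) / 2

# Averaging the weights element by element cancels them out entirely.
avg_alpha = [(a + b) / 2 for a, b in zip(net_a[0], net_b[0])]
avg_beta = [(a + b) / 2 for a, b in zip(net_a[1], net_b[1])]
avg_weights_prediction = predict(x, avg_alpha, avg_beta)

print(avg_prediction, avg_weights_prediction)
```

The averaged prediction matches either network, while the network built from averaged weights outputs 0 for every input.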

# Some other stuff

The original goal of the ANN approach was to solve problems in the same way that a human brain would. However, over time, attention moved to performing specific tasks, leading to deviations from biology. Artificial neural networks have been used on a variety of tasks, including computer vision, speech recognition, machine translation, social network filtering, playing board and video games and medical diagnosis.

# What is Activation Function?

It is simply a function that you use to get the output of a node. It is also known as a **Transfer Function**.

### Why do we use activation functions with neural networks?
An activation function is used to determine the output of a neural network, such as yes or no. It maps the resulting values into a range such as 0 to 1 or -1 to 1 (depending on the function).
Activation functions can be divided into two basic types:

1. Linear Activation Function
2. Non-linear Activation Functions