### Crowd Counting Model using Deep Learning
#### Introduction
Artificial Intelligence and Machine Learning are going to be our biggest helpers in the coming decade!

This morning, I was reading an article which reported that an AI system won against 20 lawyers, and the lawyers were actually happy that AI can take care of the repetitive parts of their roles and help them work on complex topics. These lawyers were happy that AI will enable them to have more fulfilling roles.

Today, I will be sharing a similar example: how to count the number of people in a crowd using Deep Learning and Computer Vision (see the [Analytics Vidhya online course](https://trainings.analyticsvidhya.com/courses/course-v1:AnalyticsVidhya+CVDL101+CVDL101_T1/about)). But before we do that, let us develop a sense of how easy life is for a Crowd Counting Scientist.

**P.S.** This article assumes that you have a basic knowledge of how convolutional neural networks (CNNs) work.

#### Act like a Crowd Counting Scientist
Let’s start!

Can you help me count / estimate number of people in this picture attending this event?

![Alt text](https://github.com/5267/ML/blob/master/resources/scenario_pics/crowdcounting/crowd-at-a-stadium-in-johannesburg-south-africa-for-rugby-768x514.jpg?raw=true)

Ok – how about this one?

![Alt text](https://github.com/5267/ML/blob/master/resources/scenario_pics/crowdcounting/IMG_2-850x592.jpg?raw=true)

You get the hang of it. By the end of this tutorial, we will create an algorithm for Crowd Counting with impressive accuracy (compared to humans like you and me). Would you use such an assistant?

### Table of Contents
1. What is Crowd Counting?
2. Why is Crowd Counting useful?
3. Understanding the Different Computer Vision Techniques for Crowd Counting
4. Understanding the Architecture and Training Method of CSRNet
5. Building your own Crowd Counting model in Python

#### 1. What is Crowd Counting?
Crowd Counting is a technique to count or estimate the number of people in an image. Take a moment to analyze the below image:

![Alt text](https://github.com/5267/ML/blob/master/resources/scenario_pics/crowdcounting/IMG_2-850x592%20(1).jpg?raw=true)

Can you give me an approximate number of how many people are in the frame? Yes, including the ones present way in the background. The most direct method is to manually count each person but does that make practical sense? It’s nearly impossible when the crowd is this big!

Crowd scientists (yes, that’s a real job title!) count the number of people in certain parts of an image and then extrapolate to come up with an estimate. More commonly, we have had to rely on crude metrics to estimate this number for decades.

Surely there must be a better, more exact approach?

Yes, there is!

While we don’t yet have algorithms that can give us the EXACT number, most computer vision techniques can produce impressively precise estimates. Let’s first understand why crowd counting is important before diving into the algorithm behind it.

#### 2. Why is Crowd Counting useful?
Let’s understand the usefulness of crowd counting using an example. Picture this – your company just finished hosting a huge data science conference. Plenty of different sessions took place during the event.

You are asked to analyze and estimate the number of people who attended each session. This will help your team understand what kind of sessions attracted the biggest crowds (and which ones failed in that regard). This will shape next year’s conference, so it’s an important task!

![Alt text](https://github.com/5267/ML/blob/master/resources/scenario_pics/crowdcounting/IMG_2-850x592%20(1).jpg?raw=true)

There were hundreds of people at the event – counting them manually will take days! That’s where your data scientist skills kick in. You managed to get photos of the crowd from each session and build a computer vision model to do the rest!

There are plenty of other scenarios where crowd counting algorithms are changing the way industries work:

- Counting the number of people attending a sporting event
- Estimating how many people attended an inauguration or a march (political rallies, perhaps)
- Monitoring of high-traffic areas
- Helping with staffing allocation and resource allotment

Can you come up with some other use cases? Let me know in the [issues](https://github.com/5267/ML/issues) section! We can connect and try to figure out how we can use crowd counting techniques in your scenario.

#### 3. Understanding the Different Computer Vision Techniques for Crowd Counting
Broadly speaking, there are currently four methods we can use for counting the number of people in a crowd:

1. Detection-based methods
Here, we use a moving window-like detector to identify people in an image and count how many there are. The methods used for detection require well trained classifiers that can extract low-level features. Although these methods work well for detecting faces, they do not perform well on crowded images as most of the target objects are not clearly visible.

2. Regression-based methods
The detection-based approach struggles to extract low-level features from dense crowds. Regression-based methods fare better here. We first crop patches from the image and then, for each patch, extract the low-level features.

3. Density estimation-based methods
We first create a density map for the objects. Then, the algorithm learns a linear mapping between the extracted features and their object density maps. We can also use random forest regression to learn a non-linear mapping.

4. CNN-based methods
Ah, good old reliable convolutional neural networks (CNNs). Instead of looking at the patches of an image, we build an end-to-end regression method using CNNs. This takes the entire image as input and directly generates the crowd count. CNNs work really well with regression or classification tasks, and they have also proved their worth in generating density maps.

CSRNet, a technique we will implement in this article, deploys a deeper CNN for capturing high-level features and generating high-quality density maps without expanding the network complexity. Let’s understand what CSRNet is before jumping to the coding section.

#### 4. Understanding the Architecture and Training Method of CSRNet
CSRNet uses VGG-16 as the front end because of its strong transfer learning ability. The output size from VGG is 1/8 of the original input size. CSRNet also uses dilated convolutional layers in the back end.

But what in the world are dilated convolutions? It’s a fair question to ask. Consider the below image:

![Alt text](https://github.com/5267/ML/blob/master/resources/scenario_pics/crowdcounting/Screenshot-from-2019-02-01-16-49-21.png?raw=true)

The basic concept of using dilated convolutions is to enlarge the kernel without increasing the parameters. So, if the dilation rate is 1, we take the kernel and convolve it on the entire image. Whereas, if we increase the dilation rate to 2, the kernel extends as shown in the above image (follow the labels below each image). It can be an alternative to pooling layers.
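To make the "larger kernel, same number of parameters" idea concrete, here is a minimal sketch. The helper name `effective_kernel_size` is hypothetical and not part of CSRNet; it only applies the standard relation that a k×k kernel with dilation rate r covers a region of side k + (k - 1)(r - 1).

```python
def effective_kernel_size(kernel_size: int, dilation_rate: int) -> int:
    """Side length of the image region covered by a dilated square kernel."""
    return kernel_size + (kernel_size - 1) * (dilation_rate - 1)


for rate in (1, 2, 3):
    side = effective_kernel_size(3, rate)
    # The kernel always has 3 * 3 = 9 weights, whatever the dilation rate.
    print(f"dilation rate {rate}: covers {side}x{side} pixels with 9 weights")
```

A 3×3 kernel therefore sees a 5×5 region at dilation rate 2 and a 7×7 region at rate 3, without any extra weights to learn.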

#### 4.1 Underlying Mathematics
I’m going to take a moment to explain how the mathematics works. This will come in handy when you need to tweak or modify your model.



### Prerequisites
#### A Comprehensive Tutorial to learn Convolutional Neural Networks from Scratch

#### An Introductory Guide to Deep Learning and Neural Networks #1
#### Table of Contents
1. Understanding the Course Structure
2. Course 1: Neural Networks and Deep Learning
    - Module 1: Introduction to Deep Learning
    - Module 2: Neural Network Basics
        - Logistic Regression as a Neural Network
        - Python and Vectorization
    - Module 3: Shallow Neural Networks
    - Module 4: Deep Neural Networks

#### 1. Understanding the Course Structure

This deep learning specialization is made up of 5 courses in total. Course #1, our focus in this article, is further divided into the 4 modules listed above.

#### 2. Course 1: Neural Networks and Deep Learning
#### 2.1 Module 1: Introduction to Deep Learning

**What is a Neural Network?**

Consider an example where we have to predict the price of a house. The variables we are given are the size of the house in square feet (or square meters) and the price of the house. Now assume we have 6 houses. So first let’s pull up a plot to visualize what we’re looking at:

![Alt text](https://github.com/5267/ML/blob/master/resources/scenario_pics/deeplearning/1.png?raw=true)

On the x-axis, we have the size of the house and on the y-axis we have its corresponding price. A linear regression model will try to draw a straight line to fit the data:

![Alt text](https://github.com/5267/ML/blob/master/resources/scenario_pics/deeplearning/2.png?raw=true)

So, the input(x) here is the size of the house and output(y) is the price. Now let’s look at how we can solve this using a simple neural network:

![Alt text](https://github.com/5267/ML/blob/master/resources/scenario_pics/deeplearning/3.png?raw=true)

Here, a neuron will take an input, apply some activation function to it, and generate an output. One of the most commonly used activation functions is **ReLU** (Rectified Linear Unit):

![Alt text](https://github.com/5267/ML/blob/master/resources/scenario_pics/deeplearning/4.png?raw=true)

ReLU takes a real number as input and returns the maximum of 0 and that number. So, if we pass 10, the output will be 10, and if the input is -10, the output will be 0.
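As a quick illustration, ReLU can be sketched in one line with NumPy. The function name `relu` is just a convenient label chosen for this example.

```python
import numpy as np


def relu(x):
    """Return max(0, x), applied element-wise to scalars or arrays."""
    return np.maximum(0, x)


print(relu(10))    # 10
print(relu(-10))   # 0
```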

For now let’s stick to our example. If we use the ReLU activation function to predict the price of a house based on its size, this is how the predictions may look:

![Alt text](https://github.com/5267/ML/blob/master/resources/scenario_pics/deeplearning/5.png?raw=true)

So far, we have seen a neural network with a single neuron, i.e., we only had one feature (size of the house) to predict the house price. But in reality, we’ll have to consider multiple features like the number of bedrooms, postal code, etc. House price can also depend on the family size, neighbourhood location or school quality. How can we define a neural network in such cases?

![Alt text](https://github.com/5267/ML/blob/master/resources/scenario_pics/deeplearning/6.png?raw=true)

It gets a bit complicated here. Refer to the above image as you read: we pass 4 features as input x to the neural network, it automatically identifies some hidden features from the input, and finally generates the output y. This is what a neural network with 4 inputs, a single hidden layer, and one output looks like:

![Alt text](https://github.com/5267/ML/blob/master/resources/scenario_pics/deeplearning/7.png?raw=true)

Now that we have an intuition of what neural networks are, let’s see how we can use them for supervised learning problems.

**Supervised Learning with Neural Networks**

Supervised learning refers to a task where we need to find a function that can map input to corresponding outputs (given a set of input-output pairs). We have a defined output for each given input and we train the model on these examples. Below is a pretty handy table that looks at the different applications of supervised learning and the different types of neural networks that can be used to solve those problems:

| Input (X) | Output (y) | Application | Type of Neural Network |
| --- | --- | --- | --- |
| Home Features | Price | Real Estate | Standard Neural Network |
| Ad, user info | Click prediction (0/1) | Online Advertising | Standard Neural Network |
| Image | Image Class | Photo Tagging | CNN |
| Audio | Text Transcript | Speech Recognition | RNN |
| English | Chinese | Machine Translation | RNN |
| Image, Radar info | Position of car | Autonomous Driving | Custom / Hybrid NN |

Below is a visual representation of the most common Neural Network types:

![Alt text](https://github.com/5267/ML/blob/master/resources/scenario_pics/deeplearning/8.png?raw=true)

As you might be aware, supervised learning can be used on both structured and unstructured data.

In our house price prediction example, the given data tells us the size and the number of bedrooms. This is **structured data**, meaning that each feature, such as the size of the house, the number of bedrooms, etc. has a very well defined meaning.

In contrast, **unstructured data** refers to things like **raw audio, images, or text**, where you might want to recognize what’s in the image or text (like object detection). Here, the features might be the pixel values in an image, or the individual words in a piece of text. It’s not really clear what each pixel of the image represents and therefore this falls under the unstructured data umbrella.

Simple machine learning algorithms work well with structured data. But when it comes to unstructured data, their performance tends to take quite a dip. This is where neural networks have proven to be so effective and useful. They **perform exceptionally well on unstructured data**. Most of the ground-breaking research these days has neural networks at its core.

**Why is Deep Learning Taking off?**

To understand this, take a look at the below graph:

![Alt text](https://github.com/5267/ML/blob/master/resources/scenario_pics/deeplearning/9.png?raw=true)

As the amount of data increases, the performance of traditional learning algorithms, like SVM and logistic regression, does not improve by a whole lot. In fact, it tends to plateau after a certain point. In the case of neural networks, the performance of the model increases with an increase in the data you feed to the model.

There are basically three scales that drive a typical deep learning process:

1. Data
2. Computation Time
3. Algorithms

The activation function plays an important role in improving the computation time of the model. If we use a sigmoid activation function, this is what we end up with:

![Alt text](https://github.com/5267/ML/blob/master/resources/scenario_pics/deeplearning/10.png?raw=true)

The slope, or the gradient of this function, at the extreme ends is close to zero. Therefore, the parameters are updated very slowly, resulting in very slow learning. Hence, switching from a sigmoid activation function to ReLU (Rectified Linear Unit) is one of the biggest breakthroughs we have seen in neural networks. ReLU updates the parameters much faster as the slope is 1 when x>0. This is the primary reason for faster computation of the models.

#### 2.2 Module 2: Neural Network Basics
This module is further divided into two parts:

- Part I: Logistic Regression as a Neural Network
- Part II: Python and Vectorization

**Part I: Logistic Regression as a Neural Network**

**Binary Classification**

In a binary classification problem, we have an input x, say an image, and we have to classify it as having a cat or not. If it is a cat, we will assign it a 1, else 0. So here, we have only two outputs – either the image contains a cat or it does not. This is an example of a binary classification problem.

We can of course use the most popular classification technique, logistic regression, in this case.

**Logistic Regression**

We have an input x (an image) and we want to know the probability that the image belongs to class 1 (i.e. a cat). For a given input vector x, a linear model would output:

$$y = w^Tx + b$$

Here **w** and **b** are the parameters. Since our output y is a probability, it should range between 0 and 1. But in the above equation, it can take any real value, which doesn’t make sense for a probability. So logistic regression also uses a sigmoid function to output probabilities:

$$\hat{y} = \sigma(w^Tx + b)$$

For any value as input, it will only return values in the 0 to 1 range. The formula for a sigmoid function is:

$$\sigma(z)= \frac{1}{1+e^{-z}}$$

So, if z is a large positive number, exp(-z) will be close to 0, and therefore the output of the sigmoid will be close to 1. Similarly, if z is a large negative number, exp(-z) will be very large and hence the output of the sigmoid will be close to 0.
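The limiting behaviour described above is easy to check numerically. This is a minimal sketch; the name `sigmoid` and the sample inputs are chosen only for illustration.

```python
import numpy as np


def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z)), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))


# Large negative z gives values near 0, z = 0 gives 0.5, large positive z gives values near 1.
for z in (-10.0, 0.0, 10.0):
    print(z, sigmoid(z))
```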

Note that the parameter w is an nx-dimensional vector (nx is the number of input features), and b is a real number. Now let’s look at the cost function for logistic regression.

**Logistic Regression Cost Function**

To train the parameters w and b of logistic regression, we need a cost function. We want to find parameters w and b such that at least on the training set, the outputs you have (y-hat) are close to the actual values (y).

We can use a loss function defined below:

$$L(\hat{y},y) = \frac{1}{2}(\hat{y}-y)^2$$

The problem with this function is that the optimization problem becomes non-convex, resulting in multiple local optima. Hence, gradient descent will not work well with this loss function. So, for logistic regression, we define a different loss function that plays a similar role as that of the above loss function and also solves the optimization problem by giving a convex function:

$$L(\hat{y},y) = -\left(y\log\hat{y} + (1-y)\log(1-\hat{y})\right)$$

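The cost function J(w, b) is the average of this loss over all m training examples:

$$J(w,b) = \frac{1}{m}\sum_{i=1}^{m} L(\hat{y}^{(i)},y^{(i)})$$

We want our cost function to be as small as possible. For that, we want our parameters w and b to be optimized.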
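Here is a small sketch of the logistic loss in NumPy, assuming the cost is taken as the average loss over the training examples. The function names and the toy predictions below are made up for illustration.

```python
import numpy as np


def cross_entropy_loss(y_hat, y):
    """Logistic loss for predictions y_hat in (0, 1) and labels y in {0, 1}."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))


def cost(y_hat, y):
    """Average loss over all examples (assumed definition of the cost)."""
    return np.mean(cross_entropy_loss(y_hat, y))


y = np.array([1.0, 0.0, 1.0])          # toy labels
good = np.array([0.9, 0.1, 0.8])       # predictions close to the labels
bad = np.array([0.2, 0.7, 0.4])        # predictions far from the labels

print(cost(good, y), cost(bad, y))     # the first value is smaller
```

Predictions that agree with the labels give a low cost, while confident wrong predictions are penalised heavily.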

**Gradient Descent**

This is a technique that helps to learn the parameters w and b in such a way that the cost function is minimized. The cost function for logistic regression is convex in nature (i.e. it has a single global minimum) and that is the reason for choosing this function instead of the squared error (which can have multiple local minima).

Let’s look at the steps for gradient descent:

1. Initialize w and b (usually initialized to 0 for logistic regression)
2. Take a step in the steepest downhill direction
3. Repeat step 2 until global optimum is achieved

![Alt text](https://github.com/5267/ML/blob/master/resources/scenario_pics/deeplearning/11.png?raw=true)

The update equation for gradient descent becomes:

$$w:=w-\alpha\frac{dJ(w)}{dw}$$

Here, ⍺ is the learning rate that controls how big a step we should take after each iteration.

If we are on the right side of the graph shown above, the slope will be positive. Using the update equation, we will move to the left (i.e. in the downward direction) until the global minimum is reached. Whereas if we are on the left side, the slope will be negative and hence we will take a step towards the right (again in the downward direction) until the global minimum is reached. Pretty intuitive, right?

The update equations for the parameters of logistic regression are:

$$w:=w-\alpha\frac{dJ(w,b)}{dw}$$
$$b:=b-\alpha\frac{dJ(w,b)}{db}$$
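To see the update rule in action, here is a minimal sketch on a one-parameter convex function. The function J(w) = (w - 3)^2, the learning rate, and the helper name `gradient_descent` are all assumptions made for this illustration; they are not the logistic regression cost itself.

```python
def gradient_descent(grad, w0, alpha=0.1, steps=100):
    """Repeatedly apply w := w - alpha * dJ/dw, starting from w0."""
    w = w0
    for _ in range(steps):
        w = w - alpha * grad(w)
    return w


# Toy convex cost J(w) = (w - 3)^2 with derivative dJ/dw = 2 * (w - 3).
def grad_J(w):
    return 2 * (w - 3)


w_right = gradient_descent(grad_J, w0=10.0)   # starts right of the minimum, moves left
w_left = gradient_descent(grad_J, w0=-4.0)    # starts left of the minimum, moves right

print(w_right, w_left)   # both end up very close to 3
```

Whichever side we start from, the sign of the slope pushes w towards the same minimum.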

**Derivatives**

Consider a function, f(a) = 3a, as shown below:

![Alt text](https://github.com/5267/ML/blob/master/resources/scenario_pics/deeplearning/12.png?raw=true)

The derivative of this function at any point will give the slope at that point. So,

$$f(a=2) = 3*2 =6 $$
$$f(a=2.001) = 3*2.001 = 6.003$$

Slope/derivative of the function at a = 2 is:

$$Slope = \frac{height}{width}$$

![Alt text](https://github.com/5267/ML/blob/master/resources/scenario_pics/deeplearning/13.png?raw=true)

$$Slope = \frac{0.003}{0.001} = 3$$

This is how we calculate the derivative/slope of a function.
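The same height-over-width calculation can be sketched in code. The helper name `numerical_slope` is hypothetical; the width of 0.001 matches the example above.

```python
def f(a):
    return 3 * a


def numerical_slope(func, a, width=0.001):
    """Approximate the derivative of func at a as height / width."""
    height = func(a + width) - func(a)
    return height / width


print(numerical_slope(f, 2))   # approximately 3, up to floating-point rounding
```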

**Computation Graph**

These graphs organize the computation of a specific function. Consider the below example:

$$J(a,b,c) = 3(a+bc)$$

We have to calculate J given a, b, and c. We can divide this into three steps:

1. u = bc
2. v = a+u
3. J = 3v

Let’s visualize these steps for a = 5, b = 3 and c = 2:

![Alt text](https://github.com/5267/ML/blob/master/resources/scenario_pics/deeplearning/14.png?raw=true)

This is the **forward propagation** step where we have calculated the output, i.e., J. We can also use computation graphs for **backward propagation**, where we compute the derivatives needed to update the parameters (a, b and c in the above example).

**Derivatives with a Computation Graph**

Now let’s see how we can calculate derivatives with the help of a computation graph. Suppose we have to calculate dJ/da. The steps will be:

1. Since J is a function of v, calculate dJ/dv:
   $$\frac{dJ}{dv} = \frac{d(3v)}{dv} = 3$$
2. Since v is a function of a and u, calculate dv/da:
   $$\frac{dv}{da} = \frac{d(a+u)}{da} = 1$$
3. Calculate dJ/da:
   $$\frac{dJ}{da} = \frac{dJ}{dv}\cdot\frac{dv}{da} = 3 \cdot 1 = 3$$

Similarly, we can calculate dJ/db and dJ/dc:

![Alt text](https://github.com/5267/ML/blob/master/resources/scenario_pics/deeplearning/15.png?raw=true)
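The forward and backward passes through this computation graph can be written out directly. This is a small sketch using the values a = 5, b = 3 and c = 2 from above; the variable names for the derivatives are chosen for this example.

```python
# Forward pass: compute J step by step.
a, b, c = 5, 3, 2
u = b * c          # u = 6
v = a + u          # v = 11
J = 3 * v          # J = 33

# Backward pass: apply the chain rule from J back to the inputs.
dJ_dv = 3                  # J = 3v
dJ_du = dJ_dv * 1          # v = a + u, so dv/du = 1
dJ_da = dJ_dv * 1          # v = a + u, so dv/da = 1
dJ_db = dJ_du * c          # u = bc, so du/db = c
dJ_dc = dJ_du * b          # u = bc, so du/dc = b

print(J, dJ_da, dJ_db, dJ_dc)   # 33 3 6 9
```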

Now we will take the concept of computation graphs and gradient descent together and see how the parameters of logistic regression can be updated.

**Logistic Regression Gradient Descent**

$$z = w^Tx + b$$
$$\hat{y}=a=\sigma(z)= \frac{1}{1+e^{-z}}$$
$$L(a,y) = -\left(y\log(a) + (1-y)\log(1-a)\right)$$

where L is the logistic regression loss function.

Four loss functions are common in machine learning and statistical learning:

1. 0-1 loss function
2. Quadratic loss function
3. Absolute loss function
4. Logarithmic loss function, also called the log-likelihood loss function

For why the logistic regression loss takes the form above, see [this article](https://blog.csdn.net/yaochuyi/article/details/80001239) and [this article](https://blog.csdn.net/weixin_41537599/article/details/80585201).

Now, for two features (x1 and x2), the computation graph for calculating the loss will be:

![Alt text](https://github.com/5267/ML/blob/master/resources/scenario_pics/deeplearning/16.png?raw=true)

Here, w1, w2, and b are the parameters that need to be updated. Below are the steps to do this (for w1):

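1. Calculate da:
   $$da = \frac{dL}{da} = -\frac{y}{a} + \frac{1-y}{1-a}$$
2. Calculate dz (using the sigmoid derivative da/dz = a(1-a)):
   $$dz = \frac{dL}{da}\cdot\frac{da}{dz} = \left[-\frac{y}{a} + \frac{1-y}{1-a}\right]\cdot a(1-a) = a-y$$
3. Calculate dw1 (since z = w1x1 + w2x2 + b, we have dz/dw1 = x1):
   $$dw_1 = \frac{dL}{da}\cdot\frac{da}{dz}\cdot\frac{dz}{dw_1} = (a-y)\cdot x_1$$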

Similarly, we can calculate dw2 and db. Finally, the weights will be updated using the following equations:

![Alt text](https://github.com/5267/ML/blob/master/resources/scenario_pics/deeplearning/17.png?raw=true)
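As a sanity check on the derivation, the analytic gradient for w1 can be compared against a finite-difference estimate for a single training example. This is only a sketch: the input values, the parameter values, and the helper names are arbitrary choices for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def loss(w1, w2, b, x1, x2, y):
    """Logistic loss for one example with two features."""
    a = sigmoid(w1 * x1 + w2 * x2 + b)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))


# Arbitrary example and parameters (assumed values).
x1, x2, y = 1.5, -0.5, 1.0
w1, w2, b = 0.2, -0.3, 0.1

# Analytic gradient from the derivation: dz = a - y, dw1 = dz * x1.
a = sigmoid(w1 * x1 + w2 * x2 + b)
dw1_analytic = (a - y) * x1

# Central finite difference as an independent estimate of dL/dw1.
eps = 1e-6
dw1_numeric = (loss(w1 + eps, w2, b, x1, x2, y)
               - loss(w1 - eps, w2, b, x1, x2, y)) / (2 * eps)

print(dw1_analytic, dw1_numeric)   # the two values agree closely
```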

Keep in mind that this is for a single training example. We will have multiple examples in real-world scenarios. So, let’s look at how gradient descent can be calculated for ‘m’ training examples.

```
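import math

def gradient_step(x1, x2, Y, w1, w2, b, alpha):
    """One gradient descent step over m training examples, using an explicit loop."""
    m = len(Y)
    J = 0.0; dw1 = 0.0; dw2 = 0.0; db = 0.0
    for i in range(m):
        # Forward pass
        z = w1 * x1[i] + w2 * x2[i] + b
        a = 1.0 / (1.0 + math.exp(-z))      # sigmoid
        J -= Y[i] * math.log(a) + (1 - Y[i]) * math.log(1 - a)

        # Backward pass
        dz = a - Y[i]
        dw1 += dz * x1[i]
        dw2 += dz * x2[i]
        db += dz

    # Average over the m examples
    J /= m
    dw1 /= m
    dw2 /= m
    db /= m

    # Gradient descent update
    w1 = w1 - alpha * dw1
    w2 = w2 - alpha * dw2
    b = b - alpha * db
    return w1, w2, b, J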
```

**Part II – Python and Vectorization**

Up to this point, we have seen how to use gradient descent for updating the parameters for logistic regression. In the above example, we saw that if we have ‘m’ training examples, we have to run the loop ‘m’ number of times to get the output, which makes the computation very slow.

Instead of these for loops, we can use vectorization which is an effective and time efficient approach.

**Vectorization**

Vectorization is basically a way of getting rid of for loops in our code. It performs all the operations together for ‘m’ training examples instead of computing them individually.

Now, let’s look at the vectorized form. We can represent w and x in vector form:

![Alt text](https://github.com/5267/ML/blob/master/resources/scenario_pics/deeplearning/18.png?raw=true)

Now we can calculate Z for all the training examples using:

`Z = np.dot(W.T, X) + b` (NumPy is imported as `np`)

The dot function of NumPy library uses vectorization by default. This is how we can vectorize the multiplications. Let’s now see how we can vectorize an entire logistic regression algorithm.

**Vectorizing Logistic Regression**

Keeping with the ‘m’ training examples, the first step will be to calculate Z for all of these examples:

`Z = np.dot(W.T, X) + b`

Here, X contains the features for all the training examples (one example per column) while W is the weight vector. The next step is to calculate the output (A), which is the sigmoid of Z:

`A = 1 / (1 + np.exp(-Z))`

Now, calculate the loss and then use backpropagation to minimize the loss:

`dz = A - Y`

Finally, we will calculate the derivative of the parameters and update them:

```
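dw = np.dot(X, dz.T) / m    # gradient of the cost with respect to W
db = dz.sum() / m           # gradient of the cost with respect to b
W = W - alpha * dw          # alpha is the learning rate
b = b - alpha * db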
```
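Putting the vectorized steps together, a complete training loop might look like the sketch below. The synthetic dataset, the learning rate, the number of iterations, and the random seed are all assumptions made so that the example runs on its own.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Synthetic data (assumed): n features, m examples stored as columns of X.
rng = np.random.default_rng(0)
n, m = 2, 200
X = rng.normal(size=(n, m))
Y = (X[0:1, :] + X[1:2, :] > 0).astype(float)   # labels, shape (1, m)

W = np.zeros((n, 1))    # weights, shape (n, 1)
b = 0.0
alpha = 0.5             # learning rate (assumed)

for _ in range(500):
    Z = np.dot(W.T, X) + b          # shape (1, m), no loop over examples
    A = sigmoid(Z)
    dz = A - Y
    dw = np.dot(X, dz.T) / m        # shape (n, 1)
    db = dz.sum() / m
    W = W - alpha * dw
    b = b - alpha * db

accuracy = np.mean((A > 0.5) == (Y == 1))
print(W.ravel(), b, accuracy)
```

The only explicit loop left is over gradient descent iterations; every pass over the m examples is a handful of array operations.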

**Broadcasting in Python**

Broadcasting makes certain parts of the code much more efficient. But don’t just take my word for it! Let’s look at some examples:

- obj.sum(axis = 0) sums the columns while obj.sum(axis = 1) sums the rows
- obj.reshape(1,4) changes the shape of the array to (1,4) without changing its values, which is handy for getting operands into shapes that broadcast correctly

![Alt text](https://github.com/5267/ML/blob/master/resources/scenario_pics/deeplearning/19.png?raw=true)

If we add 100 to a (4×1) matrix, the scalar 100 is treated as if it were copied into a (4×1) matrix. Similarly, in the example below, a (1×3) matrix will be copied to form a (2×3) matrix:

The general principle will be:

![Alt text](https://github.com/5267/ML/blob/master/resources/scenario_pics/deeplearning/20.png?raw=true)

If we add, subtract, multiply or divide an (m,n) matrix with a (1,n) matrix, this will copy it m times into an (m,n) matrix. This is called broadcasting and it makes the computations much faster. Try it out yourself!
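The broadcasting rules above can be tried out with a few small arrays. The values used here are arbitrary and chosen only for illustration.

```python
import numpy as np

# A scalar is broadcast across a (4, 1) column.
column = np.array([[1], [2], [3], [4]])
print(column + 100)

# A (1, 3) row is broadcast across the rows of a (2, 3) matrix.
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])
row = np.array([[100, 200, 300]])
summed = matrix + row
print(summed)

# axis=0 sums down the columns, axis=1 sums across the rows.
col_sums = matrix.sum(axis=0)
row_sums = matrix.sum(axis=1)
print(col_sums, row_sums)
```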

**A note on Python/Numpy Vectors**

If you form an array using:

`a = np.random.randn(5)`

It will create an array of shape (5,), which is a rank 1 array. A rank 1 array behaves as neither a row vector nor a column vector: its transpose has the same shape (5,), which can lead to subtle bugs. Instead, we can use the following code to form a proper row or column vector:

```
a = np.random.randn(5,1)    # shape (5,1) column vector
a = np.random.randn(1,5)    # shape (1,5) row vector
```

To convert a (1,5) row vector to a (5,1) column vector, one can use:

`a = a.reshape((5, 1))`
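The difference between a rank 1 array and a proper column vector shows up as soon as we transpose or multiply. A short sketch, with variable names chosen for this example:

```python
import numpy as np

rank1 = np.random.randn(5)
print(rank1.shape, rank1.T.shape)        # transposing changes nothing: (5,) (5,)
inner = np.dot(rank1, rank1.T)           # a single number, not a matrix

col = np.random.randn(5, 1)              # explicit column vector
print(col.shape, col.T.shape)            # (5, 1) (1, 5)
outer = np.dot(col, col.T)               # a (5, 5) matrix

fixed = rank1.reshape((5, 1))            # turn the rank 1 array into a column vector
print(np.ndim(inner), outer.shape, fixed.shape)
```

With the rank 1 array, `np.dot(rank1, rank1.T)` silently returns a scalar, whereas the column vector gives the (5, 5) outer product you would expect from the matrix algebra.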

That’s it for module 2. In the next section, we will dive deeper into the details of a Shallow Neural Network.

#### 2.3 Module 3: Shallow Neural Networks

**Neural Networks Overview**

In logistic regression, to calculate the output (y = a), we used the below computation graph:

![Alt text](https://github.com/5267/ML/blob/master/resources/scenario_pics/deeplearning/21.png?raw=true)

In case of a neural network with a single hidden layer, the structure will look like:

![Alt text](https://github.com/5267/ML/blob/master/resources/scenario_pics/deeplearning/22.png?raw=true)

**Neural Network Representation**

Consider the following representation of a neural network:

![Alt text](https://github.com/5267/ML/blob/master/resources/scenario_pics/deeplearning/23.png?raw=true)

Can you identify the number of layers in the above neural network? Remember that while counting the number of layers in a NN, we do not count the input layer. So, there are 2 layers in the NN shown above, i.e., one hidden layer and one output layer.

**Computing a Neural Network’s Output**

Let’s look in detail at how each neuron of a neural network works. Each neuron takes its inputs, performs an operation on them (calculates z = w^T x + b), and then applies the sigmoid function:

![Alt text](https://github.com/5267/ML/blob/master/resources/scenario_pics/deeplearning/24.png?raw=true)

This step is performed by each neuron. The equations for the first hidden layer with four neurons will be:

![Alt text](https://github.com/5267/ML/blob/master/resources/scenario_pics/deeplearning/25.png?raw=true)

So, for given input X, the outputs for each neuron will be:

```
z[1] = W[1]x + b[1]
a[1] = 𝛔(z[1])
z[2] = W[2]a[1] + b[2]
a[2] = 𝛔(z[2])
```
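For a single input example, this forward pass can be sketched in NumPy as follows. Note that the second layer takes the hidden activations a1 as its input. The layer sizes (3 inputs, 4 hidden units, 1 output), the random weights, and the seed are assumptions for this illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)

x = rng.normal(size=(3, 1))               # one example with 3 features (assumed)
W1 = rng.normal(size=(4, 3)) * 0.01       # hidden layer: 4 units
b1 = np.zeros((4, 1))
W2 = rng.normal(size=(1, 4)) * 0.01       # output layer: 1 unit
b2 = np.zeros((1, 1))

z1 = np.dot(W1, x) + b1      # shape (4, 1)
a1 = sigmoid(z1)
z2 = np.dot(W2, a1) + b2     # shape (1, 1)
a2 = sigmoid(z2)             # the network's prediction

print(a1.shape, a2.shape)
```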

**Activation Function**

While calculating the output, an activation function is applied. The choice of an activation function highly affects the performance of the model. So far, we have used the sigmoid activation function:

![Alt text](https://github.com/5267/ML/blob/master/resources/scenario_pics/deeplearning/26.png?raw=true)

However, this might not be the best option in some cases. Why? Because at the extreme ends of the graph, the derivative will be close to zero and hence gradient descent will update the parameters very slowly.

There are other functions which can replace this activation function:

- tanh:

![Alt text](https://github.com/5267/ML/blob/master/resources/scenario_pics/deeplearning/27.png?raw=true)

- ReLU (already covered earlier):

![Alt text](https://github.com/5267/ML/blob/master/resources/scenario_pics/deeplearning/28.png?raw=true)

| Activation Function | Pros | Cons |
| --- | --- | --- |
| Sigmoid | Used in the output layer for binary classification | Output ranges from 0 to 1 |
| tanh | Better than sigmoid | Updates parameters slowly when points are at extreme ends |
| ReLU | Updates parameters faster as slope is 1 when x>0 | Zero slope when x<0 |

We can choose different activation functions depending on the problem we’re trying to solve.
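To compare these activation functions more concretely, the sketch below evaluates their slopes at a few points. The function names and the sample inputs are chosen for illustration only.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)


def tanh_grad(z):
    return 1 - np.tanh(z) ** 2


def relu(z):
    return np.maximum(0, z)


def relu_grad(z):
    # Slope is 1 for positive inputs and 0 otherwise.
    return (np.asarray(z) > 0).astype(float)


z = np.array([-10.0, -1.0, 1.0, 10.0])
print("sigmoid slope:", sigmoid_grad(z))   # tiny at both extremes
print("tanh slope:   ", tanh_grad(z))      # tiny at both extremes
print("ReLU slope:   ", relu_grad(z))      # 0 for negative inputs, 1 for positive
```

The near-zero slopes of sigmoid and tanh at large |z| are what slow down gradient descent, while ReLU keeps a slope of 1 for any positive input.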

**Gradient Descent for Neural Networks**

The parameters which we have to update in a two-layer neural network are: W[1], b[1], W[2] and b[2].

The gradient descent steps can be summarized as:

```
Repeat:
    Compute predictions (y'(i), i = 1,...m)
    Get derivatives: dW[1], db[1], dW[2], db[2]
    Update: W[1] = W[1] - ⍺ * dW[1]
            b[1] = b[1] - ⍺ * db[1]
            W[2] = W[2] - ⍺ * dW[2]
            b[2] = b[2] - ⍺ * db[2]
```

Let’s quickly look at the forward and backpropagation steps for a two-layer neural network.

**Forward propagation:**

```
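# W, b, Z and A are indexed by layer; A[0] is the input X
Z[1] = np.dot(W[1], A[0]) + b[1]
A[1] = g[1](Z[1])                   # g[1] is the hidden-layer activation
Z[2] = np.dot(W[2], A[1]) + b[2]
A[2] = g[2](Z[2])                   # g[2] is the output-layer activation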
```

**Backpropagation:**

```
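# np.dot is a matrix product; * is the element-wise product
dZ[2] = A[2] - Y                                    # sigmoid output with the logistic loss
dW[2] = np.dot(dZ[2], A[1].T) / m
db[2] = np.sum(dZ[2], axis=1, keepdims=True) / m
dZ[1] = np.dot(W[2].T, dZ[2]) * g_prime[1](Z[1])    # g_prime[1] is the derivative of g[1]
dW[1] = np.dot(dZ[1], A[0].T) / m                   # A[0] = X
db[1] = np.sum(dZ[1], axis=1, keepdims=True) / m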
```
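The forward and backward steps can be combined into a small training loop for a two-layer network. This is a sketch under several assumptions: a tanh hidden layer, a sigmoid output, synthetic data, and an arbitrary learning rate, iteration count, and seed.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(1)
n_x, n_h, m = 2, 4, 100
X = rng.normal(size=(n_x, m))                     # synthetic inputs (assumed)
Y = (X[0:1, :] + X[1:2, :] > 0).astype(float)     # labels, shape (1, m)

W1 = rng.normal(size=(n_h, n_x)) * 0.01
b1 = np.zeros((n_h, 1))
W2 = rng.normal(size=(1, n_h)) * 0.01
b2 = np.zeros((1, 1))
alpha = 0.5
costs = []

for _ in range(300):
    # Forward propagation
    Z1 = np.dot(W1, X) + b1
    A1 = np.tanh(Z1)
    Z2 = np.dot(W2, A1) + b2
    A2 = sigmoid(Z2)
    costs.append(float(-np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))))

    # Backpropagation (the derivative of tanh is 1 - A1**2)
    dZ2 = A2 - Y
    dW2 = np.dot(dZ2, A1.T) / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = np.dot(W2.T, dZ2) * (1 - A1 ** 2)
    dW1 = np.dot(dZ1, X.T) / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m

    # Gradient descent update
    W1 -= alpha * dW1
    b1 -= alpha * db1
    W2 -= alpha * dW2
    b2 -= alpha * db2

print(costs[0], costs[-1])   # the cost goes down over training
```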

These are the complete steps a neural network performs to generate outputs. Note that we have to initialize the weights (W) in the beginning which are then updated in the backpropagation step. So let’s look at how these weights should be initialized.

**Random Initialization**

We have previously seen that the weights are initialized to 0 in case of a logistic regression algorithm. But should we initialize the weights of a neural network to 0? It’s a pertinent question. Let’s consider the example shown below:

![Alt text](https://github.com/5267/ML/blob/master/resources/scenario_pics/deeplearning/29.png?raw=true)


Note: the number of features determines the number of units in the input layer.

If the weights are initialized to 0, the W matrix will be:

![Alt text](https://github.com/5267/ML/blob/master/resources/scenario_pics/deeplearning/30.png?raw=true)

No matter how many units we use in a layer, we are always getting the same output which is similar to that of using a single unit. So, instead of initializing the weights to 0, we randomly initialize them using the following code:

```
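import numpy as np
W1 = np.random.randn(2, 2) * 0.01   # small random weights break the symmetry between units
b1 = np.zeros((2, 1))               # biases can safely start at zero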
```

We multiply the weights with 0.01 to initialize small weights. If we initialize large weights, the activation will be large, resulting in zero slope (in case of sigmoid and tanh activation function). Hence, learning will be slow. So we generally initialize small weights randomly.
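The symmetry problem with zero initialization is easy to demonstrate. In this sketch (the input data, seed, and helper name are assumptions), both hidden units produce identical activations when the weights start at zero, but differ when the weights are small random numbers.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def hidden_activations(W1, b1, X):
    """Activations of the hidden layer for inputs X (one example per column)."""
    return sigmoid(np.dot(W1, X) + b1)


rng = np.random.default_rng(0)
X = rng.normal(size=(2, 5))          # 2 features, 5 examples (assumed)

A_zero = hidden_activations(np.zeros((2, 2)), np.zeros((2, 1)), X)
A_rand = hidden_activations(rng.normal(size=(2, 2)) * 0.01, np.zeros((2, 1)), X)

print(np.allclose(A_zero[0], A_zero[1]))   # True: the two units are indistinguishable
print(np.allclose(A_rand[0], A_rand[1]))   # False: random weights break the symmetry
```

Because identical units also receive identical gradients, they would stay identical after every update, which is why zero initialization wastes the extra units.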

#### 2.4 Module 4: Deep Neural Networks

**Deep L-Layer Neural Network**

In this section, we will look at how the concepts of forward and backpropagation can be applied to deep neural networks. But you might be wondering at this point what in the world deep neural networks actually are.

Shallow versus deep is a matter of degree. **A logistic regression is a very shallow model as it has only one layer** (remember we don’t count the input as a layer):

![Alt text](https://github.com/5267/ML/blob/master/resources/scenario_pics/deeplearning/31.png?raw=true)

A deeper neural network has more hidden layers:

![Alt text](https://github.com/5267/ML/blob/master/resources/scenario_pics/deeplearning/32.png?raw=true)

Let’s look at some of the notations related to deep neural networks:

- L is the number of layers in the neural network
- n[l] is the number of units in layer l
- a[l] is the activations in layer l
- W[l] is the weight matrix used to compute z[l]

These are some of the notations which we will be using in the upcoming sections. Keep them in mind as we proceed, or just quickly hop back here in case you miss something.

**Forward Propagation in a Deep Neural Network**

For a single training example, the forward propagation steps can be written as:

```
z[l] = W[l]a[l-1] + b[l]
a[l] = g[l](z[l])
```

We can vectorize these steps for ‘m’ training examples as shown below:

```
Z[l] = W[l]A[l-1] + b[l]   # b[l] is broadcast across the m columns
A[l] = g[l](Z[l])
```

The outputs of one layer act as the inputs to the next layer. We can’t compute forward propagation across all the layers of a neural network without a for loop, so it’s fine to have one here. Before moving further, let’s look at the dimensions of the various matrices, which will help us understand these steps better.
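The layer-by-layer loop described above can be sketched as follows. This is an illustrative version, not the course's reference code: the `parameters` dictionary layout, the choice of ReLU for hidden layers and sigmoid for the output layer, and the toy layer sizes are all assumptions.

```python
import numpy as np

def relu(Z):
    return np.maximum(0, Z)

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def forward_propagation(X, parameters):
    """Compute A[l] = g[l](W[l] A[l-1] + b[l]) for l = 1..L."""
    L = len(parameters) // 2           # each layer contributes one W and one b
    A = X                              # A[0] is the input
    caches = []
    for l in range(1, L + 1):
        W, b = parameters[f"W{l}"], parameters[f"b{l}"]
        Z = W @ A + b                  # b has shape (n[l], 1) and is broadcast over the m columns
        caches.append((A, Z))          # keep A[l-1] and Z[l] for backpropagation
        A = sigmoid(Z) if l == L else relu(Z)
    return A, caches

# Toy network with 2 inputs, 3 hidden units and 1 output (hypothetical sizes)
rng = np.random.default_rng(0)
layer_sizes = [2, 3, 1]
parameters = {}
for l in range(1, len(layer_sizes)):
    parameters[f"W{l}"] = rng.standard_normal((layer_sizes[l], layer_sizes[l - 1])) * 0.01
    parameters[f"b{l}"] = np.zeros((layer_sizes[l], 1))

X = rng.standard_normal((2, 4))        # m = 4 examples stored as columns
AL, caches = forward_propagation(X, parameters)
print(AL.shape)  # (1, 4)
```

The loop runs over layers only; the m training examples are still handled in one vectorized step per layer.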

**Getting your matrix dimensions right**

Analyzing the dimensions of a matrix is one of the best debugging tools to check how correct our code is. We will discuss what should be the correct dimension for each matrix in this section. Consider the following example:

![Alt text](https://github.com/5267/ML/blob/master/resources/scenario_pics/deeplearning/33.png?raw=true)

Can you figure out the number of layers (L) in this neural network? You are correct if you guessed **5**. There are 4 hidden layers and 1 output layer. The units in each layer are:

$$n[0] = 2,\ n[1] = 3,\ n[2] = 5,\ n[3] = 4,\ n[4] = 2 \text{ and } n[5] = 1$$

The generalized form of dimensions of W, b and their derivatives is:

```
W[l] = (n[l], n[l-1])
b[l] = (n[l], 1)
dW[l] = (n[l], n[l-1])
db[l] = (n[l],1)
Dimension of Z[l], A[l], dZ[l], dA[l] = (n[l],m)
```

where **‘m’** is the number of training examples. These are some of the generalized matrix dimensions which will help you to run your code smoothly.
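As a quick sanity check, the shapes listed above can be verified for the example network with a few assertions. This is a sketch only: the number of examples `m = 7` and the use of tanh in every layer are arbitrary assumptions, chosen just to push data through the network.

```python
import numpy as np

layer_sizes = [2, 3, 5, 4, 2, 1]    # n[0] .. n[5] from the example network
m = 7                               # hypothetical number of training examples

rng = np.random.default_rng(0)
A = rng.standard_normal((layer_sizes[0], m))
shapes = {}
for l in range(1, len(layer_sizes)):
    W = rng.standard_normal((layer_sizes[l], layer_sizes[l - 1])) * 0.01
    b = np.zeros((layer_sizes[l], 1))
    Z = W @ A + b                   # (n[l], n[l-1]) times (n[l-1], m) gives (n[l], m)
    A = np.tanh(Z)
    assert W.shape == (layer_sizes[l], layer_sizes[l - 1])
    assert Z.shape == A.shape == (layer_sizes[l], m)
    shapes[l] = {"W": W.shape, "b": b.shape, "Z": Z.shape}

print(shapes[1])  # {'W': (3, 2), 'b': (3, 1), 'Z': (3, 7)}
```

If any weight matrix had its dimensions swapped, the matrix product would raise an error at that layer, which is why shape checks make such a useful debugging aid.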

**Why Deep Representations?**

In deep neural networks, we have a large number of hidden layers. What are these hidden layers actually doing? To understand this, consider the below image:

![Alt text](https://github.com/5267/ML/blob/master/resources/scenario_pics/deeplearning/34.png?raw=true)


Deep neural networks find relations in the data, moving from simple to complex. The first hidden layer might be learning simple functions, such as detecting the edges in the above image. As we go deeper into the network, these simple functions combine to form more complex ones, such as identifying a face. Some common examples of leveraging a deep neural network are:

- Face Recognition
    - Image ==> Edges ==> Face parts ==> Faces ==> desired face
- Audio recognition
    - Audio ==> Low level sound features like (sss, bb) ==> Phonemes ==> Words ==> Sentences
    
**Building Blocks of Deep Neural Networks**

Consider any layer in a deep neural network. The input to this layer will be the activations from the previous layer (l-1), and the output of this layer will be its own activations.

- Input: a[l-1]
- Output: a[l]

This layer first calculates z[l], to which the activation function is applied. This z[l] is saved as a cache. In the backward propagation step, the layer receives da[l], i.e., the derivative with respect to the activations at layer l, and uses the cached z[l] to calculate dz[l], then the derivatives dw[l] and db[l], and finally da[l-1]. Let’s visualize these steps to reduce the complexity:

![Alt text](https://github.com/5267/ML/blob/master/resources/scenario_pics/deeplearning/35.png?raw=true)

**Forward and Backward Propagation**

The input in a forward propagation step is a[l-1] and the outputs are a[l] and cache z[l], which is a function of w[l] and b[l]. So, the vectorized form to calculate Z[l] and A[l] is:

$$Z[l] = W[l] * A[l-1] + b[l]$$

$$A[l] = g[l](Z[l])$$

We will calculate Z and A for each layer of the network. After calculating the activations, the next step is backward propagation, where we update the weights using the derivatives. The input for backward propagation is da[l] and the outputs are da[l-1], dW[l] and db[l]. Let’s look at the vectorized equations for backward propagation:

```
dZ[l] = dA[l] * g'[l](Z[l])                           # element-wise product
dW[l] = 1/m * np.dot(dZ[l], A[l-1].T)                 # matrix product
db[l] = 1/m * np.sum(dZ[l], axis = 1, keepdims = True)

dA[l-1] = np.dot(W[l].T, dZ[l])                       # matrix product
```
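The backward equations for a single layer can be sketched in NumPy and checked against a finite-difference estimate. The function names, the sigmoid activation, and the toy per-example loss (the sum of the layer's activations) are assumptions chosen to keep the example small.

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def layer_forward(A_prev, W, b):
    Z = W @ A_prev + b
    return sigmoid(Z), Z

def layer_backward(dA, Z, A_prev, W):
    """Apply the four backward equations for one sigmoid layer."""
    m = A_prev.shape[1]
    s = sigmoid(Z)
    dZ = dA * s * (1.0 - s)                      # element-wise: dA[l] * g'[l](Z[l])
    dW = (dZ @ A_prev.T) / m
    db = np.sum(dZ, axis=1, keepdims=True) / m
    dA_prev = W.T @ dZ
    return dA_prev, dW, db

rng = np.random.default_rng(0)
A_prev = rng.standard_normal((3, 5))             # n[l-1] = 3, m = 5
W = rng.standard_normal((2, 3))                  # n[l] = 2
b = np.zeros((2, 1))

A, Z = layer_forward(A_prev, W, b)
dA = np.ones_like(A)                             # gradient of the toy per-example loss
dA_prev, dW, db = layer_backward(dA, Z, A_prev, W)

def cost(W_candidate):
    """Toy cost: per-example loss averaged over the m examples."""
    A_candidate, _ = layer_forward(A_prev, W_candidate, b)
    return A_candidate.sum() / A_prev.shape[1]

# Central finite difference on a single weight
eps = 1e-6
W_plus = W.copy()
W_plus[0, 0] += eps
W_minus = W.copy()
W_minus[0, 0] -= eps
numeric_grad = (cost(W_plus) - cost(W_minus)) / (2 * eps)
gradient_error = abs(numeric_grad - dW[0, 0])
print(gradient_error)
```

A very small `gradient_error` suggests the analytic `dW` agrees with the numerical estimate; the same check can be repeated for other entries of `W` or for `b`.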

**Parameters vs Hyperparameters**

This is an oft-asked question by deep learning newcomers. The major difference between parameters and hyperparameters is that parameters are learned by the model during training, while hyperparameters are chosen by us before training and control how the parameters are learned.

Parameters of a deep neural network are W and b, which the model updates during the backpropagation step. On the other hand, there are a lot of hyperparameters for a deep NN, including:

- Learning rate – ⍺
- Number of iterations
- Number of hidden layers
- Units in each hidden layer
- Choice of activation function



#### Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization #2

#### Module 1: Practical Aspects of Deep Learning

This module is fairly comprehensive, and is thus further divided into three parts:

- Part I: Setting up your Machine Learning Application
- Part II: Regularizing your Neural Network
- Part III: Setting up your Optimization Problem

#### Part I: Setting up your Machine Learning Application

**Train / Dev / Test sets**

While training a deep neural network, we have to make decisions about several hyperparameters:

1. Number of hidden layers in the network
2. Number of hidden units for each hidden layer
3. Learning rate
4. Activation function for different layers, etc.

There is no pre-defined way of choosing these hyperparameters. The process we generally follow is:

1. Start with an idea, i.e. start with a certain number of hidden layers, certain learning rate, etc.
2. Try the idea by coding it
3. Run an experiment to see how well the idea works
4. Refine the idea and iterate this process

Now how do we identify whether the idea is working? This is where the train / dev / test sets come into play. Suppose we have an entire dataset:

![Alt text]()

1. Training Set: We train the model on the training data.
2. Dev Set: After training the model, we check how well it performs on the dev set.
3. Test Set: When we have a final model (i.e., the model that has performed well on both training as well as dev set), we evaluate it on the test set in order to get an unbiased estimate of how well our algorithm is doing.
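A random three-way split like the one described above can be sketched as follows. The function name `split_dataset`, the default fractions, and the convention of storing one example per row are assumptions made for this illustration.

```python
import numpy as np

def split_dataset(X, y, dev_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle the examples (rows) and split them into train, dev and test sets."""
    m = X.shape[0]
    rng = np.random.default_rng(seed)
    order = rng.permutation(m)
    n_dev = int(m * dev_fraction)
    n_test = int(m * test_fraction)
    dev_idx = order[:n_dev]
    test_idx = order[n_dev:n_dev + n_test]
    train_idx = order[n_dev + n_test:]
    return (X[train_idx], y[train_idx]), (X[dev_idx], y[dev_idx]), (X[test_idx], y[test_idx])

# Toy data: 50 examples with 2 features each
X = np.arange(100).reshape(50, 2)
y = np.arange(50)
train, dev, test = split_dataset(X, y)
print(len(train[1]), len(dev[1]), len(test[1]))  # 30 10 10
```

Shuffling before splitting keeps the three sets disjoint while drawing each of them from the same pool of examples.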

There is still one question left after this – what should be the size of these training, dev and test sets? It’s actually a pretty critical aspect of any machine learning project, and will end up playing a big part in deciding how well the model performs. Let’s look at some traditional guidelines that experts follow to decide the size of each set:

- In the previous era, when we had small datasets, the distribution of different sets was:

![Alt text]()

or just:

![Alt text]()

As the availability of data has increased in recent years, we can use a huge slice of it for training the model:

![Alt text]()

This is certainly one way of deciding the size of these different sets. This works fine most of the time, but indulge me and consider the following scenario:

Suppose we have scraped multiple images of cats from different sites, and also taken a few images using our own camera. The distribution of these two types of images will be different, right? Now, we split the data in such a way that the training set contains all the scraped images, while the dev and test sets have all the camera images. In this case, the distribution of the training set will be different from that of the dev and test sets, and hence there’s a good chance we might not get good results.
    
In cases like these (different distributions), we can use the following guidelines:

1. Divide the training, dev and test sets in such a way that their distribution is similar
2. Skip the test set and validate the model using the dev set only
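One way to follow the first guideline is to split each data source separately and then combine the pieces, so that every set contains the sources in the same proportion. The sketch below assumes a hypothetical array of source labels (`"scraped"` and `"camera"`) and an 80/10/10 split; neither the helper `split_by_source` nor these numbers come from the course.

```python
import numpy as np

def split_by_source(source_labels, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split indices so that each source appears in train/dev/test in the same proportion."""
    rng = np.random.default_rng(seed)
    source_labels = np.asarray(source_labels)
    splits = {"train": [], "dev": [], "test": []}
    for source in np.unique(source_labels):
        idx = rng.permutation(np.flatnonzero(source_labels == source))
        n_train = int(round(len(idx) * fractions[0]))
        n_dev = int(round(len(idx) * fractions[1]))
        splits["train"].extend(idx[:n_train])
        splits["dev"].extend(idx[n_train:n_train + n_dev])
        splits["test"].extend(idx[n_train + n_dev:])      # remainder goes to the test set
    return {name: np.array(sorted(ids)) for name, ids in splits.items()}

# Hypothetical dataset: 1000 scraped images and 200 camera images
sources = np.array(["scraped"] * 1000 + ["camera"] * 200)
splits = split_by_source(sources)

camera_share = {}
for name, idx in splits.items():
    camera_share[name] = float(np.mean(sources[idx] == "camera"))
    print(name, len(idx), round(camera_share[name], 3))
```

Each set ends up with the same share of camera images, so an improvement measured on the dev set is more likely to carry over to the test set.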

We can also use these sets to look at the bias and variance of the model. These help us decide how well the model is fitting and performing.

**Bias / Variance**

Consider a dataset which gives us the below plot:

![Alt text]()

What will happen if we fit a straight line to classify the points into different classes? The model will under-fit and have a high bias. On the other hand, if we fit the data perfectly, i.e., all the points are classified into their respective class, we will have high variance (and overfitting). The right model fit is usually found between these two extremes:

![Alt text]()

We want our model to be just right, which means having **low bias and low variance**. We can determine whether the model has high bias or high variance by checking the train set and dev set errors. Generally, we can define it as:

![Alt text]()

- If the dev set error is much more than the train set error, the model is overfitting and has a high variance
- When both train and dev set errors are high, the model is underfitting and has a high bias
- If the train set error is high and the dev set error is even worse, the model has both high bias and high variance
- And when both the train and dev set errors are small, the model fits the data reasonably and has low bias and low variance
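These four rules can be written as a small helper. The function name `diagnose` and both thresholds are illustrative assumptions; what counts as a "high" error depends on the problem and on the best error that is realistically achievable.

```python
def diagnose(train_error, dev_error, acceptable_error=0.05, gap_tolerance=0.05):
    """Label a model using the four rules above; both thresholds are illustrative."""
    high_bias = train_error > acceptable_error
    high_variance = (dev_error - train_error) > gap_tolerance
    if high_bias and high_variance:
        return "high bias and high variance"
    if high_bias:
        return "high bias"
    if high_variance:
        return "high variance"
    return "low bias and low variance"

print(diagnose(0.01, 0.11))   # high variance
print(diagnose(0.15, 0.16))   # high bias
print(diagnose(0.15, 0.30))   # high bias and high variance
print(diagnose(0.005, 0.01))  # low bias and low variance
```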

**Basic Recipe for Machine Learning**

**Question 1: Does the model have high bias (underfitting)?**

**Solution**: We can figure out whether the model has high bias by looking at the training set error. A high training error indicates high bias. In such cases, we can try a bigger network, train the model for longer, or try a different neural network architecture.

**Question 2: Does the model have high variance (overfitting)?**

**Solution**: If the dev set error is much higher than the training set error, we can say that the model has high variance. To reduce the variance, we can get more data, use regularization, or try a different neural network architecture.

One of the most popular techniques to reduce variance is called regularization. Let’s look at this concept and how it applies to neural networks in part II.

#### Part II: Regularizing your Neural Network

We can reduce the variance by increasing the amount of data. But is that really a feasible option every time? Perhaps there is no other data available, and if there is, it might be too expensive for your project to source. This is quite a common problem. And that’s why the concept of regularization plays an important role in preventing overfitting.




### Reference
- [A Comprehensive Tutorial to learn Convolutional Neural Networks from Scratch](https://www.analyticsvidhya.com/blog/2018/12/guide-convolutional-neural-network-cnn/?utm_source=blog&utm_medium=crowd-counting)
- [CSRNet: Dilated Convolutional Neural Networks for Understanding the Highly Congested Scenes](https://arxiv.org/abs/1802.10062)
- [An Introductory Guide to Deep Learning and Neural Networks (Course #1)](https://www.analyticsvidhya.com/blog/2018/10/introduction-neural-networks-deep-learning/)