# What is deep learning?
## 1.1 Artificial intelligence, machine learning, and deep learning
### 1.1.1 Artificial intelligence
A concise definition of *artificial intelligence* is: *the effort to automate intellectual tasks normally performed by humans*. For a long time, experts believed that human-level artificial intelligence could be achieved by having programmers craft a large set of explicit rules for manipulating knowledge. This is known as *symbolic AI*, and it was the dominant paradigm in AI from the 1950s to the late 1980s, reaching its peak during the *expert systems* boom of the 1980s.

Symbolic AI was suitable to solve well-defined, logical problems, but it struggled to solve more complex, fuzzy problems, such as image classification, speech recognition, and language translation. A new approach arose to take symbolic AI's place: *machine learning*.

### 1.1.2 Machine learning
Machine learning arises from this question: *could a computer go beyond "what we know how to order it to perform" and learn on its own how to perform a specific task?* Rather than programmers writing data-processing rules by hand, could a computer automatically learn rules by looking at data?

A machine learning system is *trained* rather than explicitly programmed. It's presented with many examples relevant to a task, and it finds statistical structure in the examples that eventually allows the system to develop a set of rules for automating the task.

Machine learning only started to become popular in the 1990s, but it quickly became the most popular and most successful subfield of AI, fueled by larger datasets and faster hardware. Machine learning is related to mathematical statistics, but it differs from statistics in several important ways. Unlike statistics, machine learning tends to deal with large, complex datasets (such as a dataset with millions of images, each consisting of tens of thousands of pixels) for which classical statistical analysis such as Bayesian analysis would be impractical.

### 1.1.3 Learning representations from data
To apply machine learning, we need three things:
 - *Input data points* - For instance, if the task is speech recognition, these data points could be sound files of people talking. If the task is image tagging, they could be pictures.
 - *Examples of the expected output* - In a speech-recognition task, these could be human-generated transcripts of sound files. In an image task, expected outputs could be tags such as "dog," "cat," and so on.
 - *A way to measure whether the algorithm is doing a good job* - This is necessary in order to determine the distance between the algorithm's current output and its expected output. The measurement is used as a feedback signal to adjust the way the algorithm works. This adjustment step is what we call *learning*.
 
The central problem in machine learning and deep learning is to *meaningfully transform data*, or to learn useful *representations* of the input data at hand. To better understand this process, let's look at an example. Consider an x-axis, y-axis, and some points represented by their coordinates in the image below.

![sample data](images/1_1_3_sample_data.jpg)

Let's assume we want to develop an algorithm that can take the coordinates (x, y) of a point and output whether that point is likely to be black or white. In this case:
 - The inputs are the coordinates of our points.
 - The expected outputs are the colors of points.
 - A way to measure whether the algorithm is doing a good job could be the percentage of points that are being correctly classified.
 
To do this, we need a new representation of the data that separates the white points from the black points. One transformation could be a coordinate change (see below).

![coordinate change](images/1_1_3_coordinate_change.jpg)

With the new representation, the black/white classification problem can be expressed as a simple rule: "If x > 0, then the points are Black," or "If x < 0 then the points are White." *Learning*, in the context of machine learning, describes an automatic search process for better data representations. 

All machine learning algorithms consist of automatically finding such transformations that turn data into more-useful representations for a given task. These operations can be coordinate changes (like above), or linear projections, translations, nonlinear operations, and so on.

Machine learning is essentially the process of searching for useful representations of some input data, within a predefined space of possibilities, while using guidance from a feedback signal.
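
The black/white example can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the data from the figures: the points in `POINTS` are made up, the "predefined space of possibilities" is assumed to be rotations of the axes in 15-degree steps, and the feedback signal is the percentage of correctly classified points.

```python
import math

# Hypothetical labeled points (x, y, color); not the data from the figures.
POINTS = [
    (1.0, 2.0, "black"), (3.0, -1.0, "black"),
    (0.5, 0.5, "black"), (-1.0, 3.0, "black"),
    (-1.0, -2.0, "white"), (-3.0, 1.0, "white"),
    (-0.5, -0.5, "white"), (1.0, -3.0, "white"),
]

def new_x(x, y, angle):
    """x-coordinate of (x, y) in axes rotated counterclockwise by `angle` radians."""
    return x * math.cos(angle) + y * math.sin(angle)

def accuracy(points, angle):
    """Feedback signal: fraction of points the rule 'x > 0 means black' gets right."""
    correct = 0
    for x, y, color in points:
        predicted = "black" if new_x(x, y, angle) > 0 else "white"
        correct += predicted == color
    return correct / len(points)

# Assumed search space: rotations of the axes in 15-degree steps.
candidate_angles = [math.radians(d) for d in range(0, 180, 15)]

# "Learning" here is simply keeping the candidate with the best feedback.
best_angle = max(candidate_angles, key=lambda a: accuracy(POINTS, a))

original_accuracy = accuracy(POINTS, 0.0)
best_accuracy = accuracy(POINTS, best_angle)
print(f"original axes: {original_accuracy:.2f}, best rotation: {best_accuracy:.2f}")
```

With the original axes, the rule misclassifies some of these points; after the search, one of the rotated representations separates them completely. Real algorithms search far larger spaces with smarter strategies, but the ingredients are the same.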

### 1.1.4 The "deep" in deep learning
Deep learning is a specific subfield of machine learning: a new take on learning representations from data that puts an emphasis on learning successive layers of increasingly meaningful representations. *Depth* describes the number of layers contributing to a model of the data. Modern deep learning often involves tens or even hundreds of successive layers of representations, all learned automatically from exposure to training data. Other approaches to machine learning tend to focus on learning only one or two layers of representations of the data.

In deep learning, layered representations are almost always learned via models called *neural networks*, which is a reference to neurobiology. Although some central concepts of deep learning were developed in part by drawing inspiration from our understanding of the brain, deep learning models are **not** models of the brain. For our purposes, deep learning is a mathematical framework for learning representations from data.

Let's take a look at how a neural network several layers deep transforms an image of a digit in order to recognize what digit it is.

![DNN](images/1_1_4_DNN.jpg)

The network transforms the digit image into representations that are increasingly different from the original image and increasingly informative about the final result. You can think of a deep network as a multistage information-distillation operation, where information goes through successive filters and comes out increasingly purified.

![deep representation](images/1_1_4_deep_representation.jpg)

Deep learning is technically a multistage way to learn data representations.

### 1.1.5 Understanding how deep learning works, in three figures
At this point, we know that machine learning is about mapping inputs to targets, which is done by observing many examples of inputs and targets. We also know that deep neural networks do this input-to-target mapping via a deep sequence of simple data transformations (layers) and that these data transformations are learned by exposure to examples.

The specification of what a layer does to its input is stored in the layer's *weights* (also called *parameters*), which in essence are a bunch of numbers. In this context, *learning* means finding a set of values for the weights of all layers in a network, such that the network will correctly map example inputs to their associated targets. Crazily, a deep neural network can contain tens of millions of parameters. Finding the correct value for all of them may seem like a daunting task, especially given that modifying the value of one parameter will affect the behavior of all the others!

![params](images/1_1_5_params.jpg)
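
To make "a layer is parameterized by its weights" concrete, here is a minimal sketch of a single layer. The layer type (a fully connected layer followed by a ReLU), the function name `dense_layer`, and all numeric values are assumptions chosen for illustration; the notes above do not specify any of them.

```python
def dense_layer(inputs, weights, biases):
    """Weighted sum per output unit, followed by a ReLU (max with 0)."""
    outputs = []
    for unit_weights, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(unit_weights, inputs)) + bias
        outputs.append(max(0.0, total))
    return outputs

# Hypothetical parameters: 2 inputs -> 2 outputs.
weights = [[1.0, -1.0], [0.5, 0.5]]
biases = [0.0, -1.0]

# Every number in `weights` and `biases` is one learnable parameter.
n_parameters = sum(len(row) for row in weights) + len(biases)

outputs = dense_layer([2.0, 1.0], weights, biases)
print(n_parameters, outputs)
```

Changing any single entry of `weights` or `biases` changes what the layer computes, which is why learning amounts to finding good values for these numbers.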

To control the output of a neural network, you need to be able to measure how far its output is from what you expected. This is the job of the *loss function* of the network, also called the *objective function*. The loss function takes the predictions generated by the network in addition to the true target (what you wanted the network to output) and computes a distance score, capturing how well the network has done on this specific example.

![loss function](images/1_1_5_loss_function.jpg)
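
As a small sketch of what a loss function computes, the snippet below uses mean squared error. This is only one possible choice of loss, picked here as an assumption for simplicity; the predictions and targets are made-up values.

```python
def mean_squared_error(predictions, targets):
    """Average squared difference between predictions and true targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

# Hypothetical target and two hypothetical sets of predictions.
targets = [0.0, 1.0, 0.0]
poor_predictions = [0.6, 0.3, 0.1]
good_predictions = [0.1, 0.8, 0.1]

poor_loss = mean_squared_error(poor_predictions, targets)
good_loss = mean_squared_error(good_predictions, targets)
print(f"poor: {poor_loss:.3f}, good: {good_loss:.3f}")
```

The closer the predictions are to the targets, the lower the score; a perfect prediction gives a loss of zero.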

The trick in deep learning is to use this score as a feedback signal to adjust the value of the weights a little, in a direction that will lower the loss score for the current example. This adjustment is the job of the *optimizer*, which implements what's called the *Backpropagation* algorithm: the central algorithm in deep learning (we will get into Backpropagation in later sections).

![loss score](images/1_1_5_loss_score.jpg)

Initially, the weights of the network are assigned random values, so the network merely implements a series of random transformations. Naturally, its output is far from what it should ideally be, and the loss score is accordingly very high. But with every example the network processes, the weights are adjusted a little in the correct direction, and the loss score decreases. This is the *training loop*, which, repeated a sufficient number of times (typically tens of iterations over thousands of examples), yields weight values that minimize the loss function. A network with a minimal loss is one for which the outputs are as close as they can be to the targets: a trained network. It is a simple mechanism, but once it is scaled, it looks like magic.
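
The training loop can be sketched on a deliberately tiny model. Everything here is an assumption made for illustration: the "network" is a single linear unit `w * x + b`, the loss is the squared error, the data comes from the made-up rule `y = 2x + 1`, and the gradients are written out by hand rather than computed by backpropagation through many layers.

```python
import random

# Hypothetical training data generated from y = 2x + 1.
examples = [(float(x), 2.0 * x + 1.0) for x in range(5)]

def mean_loss(w, b):
    """Mean squared error of the model w * x + b over all examples."""
    return sum((w * x + b - y) ** 2 for x, y in examples) / len(examples)

rng = random.Random(0)        # fixed seed so the run is repeatable
w = rng.uniform(-1.0, 1.0)    # weights start at random values
b = rng.uniform(-1.0, 1.0)
initial_loss = mean_loss(w, b)

learning_rate = 0.01
for epoch in range(500):      # many passes, because each step is very small
    for x, y in examples:
        error = (w * x + b) - y
        # Gradient of error ** 2 with respect to w and b.
        grad_w = 2.0 * error * x
        grad_b = 2.0 * error
        # Nudge each weight a little in the direction that lowers the loss.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

final_loss = mean_loss(w, b)
print(f"loss: {initial_loss:.4f} -> {final_loss:.8f}, w = {w:.3f}, b = {b:.3f}")
```

The loss starts high because the weights are random and shrinks as the small corrections accumulate, ending with `w` close to 2 and `b` close to 1.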

### 1.1.6 What deep learning has achieved so far
Deep learning is a fairly old subfield of machine learning, but it only rose to prominence in the early 2010s. This rise is often attributed to the availability of more and larger datasets, as well as faster, more powerful computer hardware (GPUs).

In particular, deep learning has achieved the following breakthroughs, all in historically difficult areas of machine learning:
 - Near-human-level image classification
 - Near-human-level speech recognition
 - Near-human-level handwriting transcription
 - Improved machine translation
 - Improved text-to-speech conversion
 - Digital assistants such as Siri, Google Now, and Amazon Alexa
 - Near-human-level autonomous driving
 - Improved ad targeting, as used by Google, Baidu, and Bing
 - Improved search results on the web
 - Ability to answer natural-language questions
 - Superhuman Go playing
 
### 1.1.7 Don't believe the short-term hype
Deep learning has led to some remarkable achievements, but expectations for what the field will be able to achieve in the next decade tend to run much higher than what will likely be possible. Many applications will likely remain elusive for quite some time, such as believable dialogue systems, human-level machine translation across arbitrary languages, and human-level natural-language understanding.

### 1.1.8 The promise of AI
AI has yet to transition to being central to the way we work, think, and live. Right now, it may seem difficult to believe that AI could have a large impact on our world, because it isn't yet widely deployed - much as, back in 1995, it would have been difficult to believe in the future impact of the internet. In a not-so-distant future, AI will be your assistant and your friend; it will answer your questions, educate your kids, and monitor your health. It will deliver groceries to your door and drive you from point A to point B.

## 1.2 Before deep learning: a brief history of machine learning
Here we will briefly go over machine learning approaches to describe the historical context in which they were developed as well as their respective use cases.

### 1.2.1 Probabilistic modeling
Probabilistic modeling is the application of the principles of statistics to data analysis. One of the best known algorithms in this category is the Naive Bayes algorithm.

Naive Bayes is a type of machine learning classifier based on applying Bayes' theorem while assuming that the features in the input data are all independent (a strong, or "naive" assumption).
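
A minimal sketch of the idea, assuming categorical features and add-one (Laplace) smoothing, neither of which is specified above. The tiny spam/ham dataset and the function names `train_naive_bayes` and `predict` are hypothetical. The independence assumption shows up as a plain sum of per-feature log-probabilities.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(samples, labels):
    """Count how often each feature value occurs within each class."""
    class_counts = Counter(labels)
    # feature_counts[label][feature_index][value] -> count
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    values = defaultdict(set)  # distinct values seen for each feature
    for features, label in zip(samples, labels):
        for i, value in enumerate(features):
            feature_counts[label][i][value] += 1
            values[i].add(value)
    return class_counts, feature_counts, values

def predict(model, features):
    """Pick the class maximizing log P(class) + sum of log P(feature | class)."""
    class_counts, feature_counts, values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # prior
        for i, value in enumerate(features):
            # "Naive" step: each feature contributes independently.
            numerator = feature_counts[label][i][value] + 1  # add-one smoothing
            denominator = count + len(values[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical data: (contains "offer", contains "meeting") -> label.
samples = [("yes", "no"), ("yes", "no"), ("yes", "yes"),
           ("no", "yes"), ("no", "yes"), ("no", "no")]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

model = train_naive_bayes(samples, labels)
print(predict(model, ("yes", "no")), predict(model, ("no", "yes")))
```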

A closely related model is logistic regression, which is sometimes considered to be the "hello world" of modern machine learning. Logistic regression is a classification algorithm rather than a regression algorithm. It is often the first thing a data scientist will try on a dataset to get a feel for the classification task at hand.
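
The following sketch shows why logistic regression is a classifier despite its name: it computes a weighted sum of the features, squashes it into a probability with the logistic (sigmoid) function, and thresholds that probability to get a class. The weights, bias, and threshold here are hypothetical values, not learned ones.

```python
import math

def predict_probability(features, weights, bias):
    """Probability of class 1: sigmoid of the weighted sum of the features."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(features, weights, bias, threshold=0.5):
    """Turn the probability into a class label (0 or 1)."""
    return 1 if predict_probability(features, weights, bias) >= threshold else 0

# Hypothetical parameters; in practice they would be fit to training data.
weights = [2.0, -1.0]
bias = 0.0

for point in ([1.0, 0.0], [0.0, 1.0]):
    p = predict_probability(point, weights, bias)
    print(point, f"p = {p:.3f}", "class", predict_class(point, weights, bias))
```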

### 1.2.2 Early neural networks
For a long time, the missing piece was an efficient way to train large neural networks. This changed in the mid-1980s when the Backpropagation algorithm was rediscovered. The first successful practical application of neural nets came in 1989 from Bell Labs, when Yann LeCun combined the ideas of convolutional neural networks and backpropagation, and applied them to the problem of classifying handwritten digits. The resulting network, *LeNet*, was used by the USPS to read ZIP codes on envelopes.

### 1.2.3 Kernel methods
Kernel methods are a group of classification algorithms, the best known of which is the *support vector machine* (SVM). SVMs aim at solving classification problems by finding good decision boundaries between two sets of points belonging to two different categories. SVMs find these boundaries in two steps:
 1. The data is mapped to a new high-dimensional representation where the decision boundary can be expressed as a hyperplane.
 2. A good decision boundary is computed by trying to maximize the distance between the hyperplane and the closest data points from each class, a step called maximizing the margin. This allows the boundary to generalize well to new samples outside of the training set.
    
To find good decision hyperplanes in the new representation space, you don't have to explicitly compute the coordinates of your points in the new space; you just need to compute the distance between pairs of points in that space, which can be done efficiently using a *kernel function*. A kernel function is a computationally tractable operation that maps any two points in your initial space to the distance between these points in your target representation.
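
The kernel trick can be checked numerically on a small case. The sketch below assumes a degree-2 polynomial kernel on 2-D points, chosen only because its explicit feature map is easy to write down. Kernels are usually defined as inner products in the target space; the squared distance then follows from three kernel evaluations, without ever computing the new coordinates. The names `poly_kernel` and `feature_map` are illustrative.

```python
import math

def poly_kernel(a, b):
    """Degree-2 polynomial kernel on 2-D points: (a . b) ** 2."""
    return (a[0] * b[0] + a[1] * b[1]) ** 2

def feature_map(p):
    """Explicit 3-D representation whose inner product equals poly_kernel."""
    return (p[0] ** 2, math.sqrt(2.0) * p[0] * p[1], p[1] ** 2)

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def kernel_distance_squared(a, b):
    """Squared distance in the new space, using kernel evaluations only."""
    return poly_kernel(a, a) - 2.0 * poly_kernel(a, b) + poly_kernel(b, b)

a, b = (1.0, 2.0), (3.0, -1.0)

# Explicit route: map both points, then measure the distance.
explicit = sum((u - v) ** 2 for u, v in zip(feature_map(a), feature_map(b)))
# Implicit route: never compute the new coordinates.
implicit = kernel_distance_squared(a, b)

print(poly_kernel(a, b), dot(feature_map(a), feature_map(b)))
print(explicit, implicit)
```

Both routes agree up to floating-point rounding. The saving matters when the target space has a very large (or infinite) number of dimensions, where the explicit route is impractical.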

SVMs became widely popular due to their simplicity and easy interpretation, but proved hard to scale to large datasets and didn't provide good results for perceptual problems such as image classification.

### 1.2.4 Decision trees, random forests, and gradient boosting machines
*Decision trees* are flowchart-like structures that let you classify input data points or predict output values given inputs. In particular, the *Random Forest* algorithm introduced a robust, practical take on decision tree learning that involves building a large number of specialized decision trees and then assembling their outputs. 

When the popular machine learning competition website Kaggle got started in 2010, random forests quickly became a favorite on the platform - until 2014, when *gradient boosting machines* took over. A gradient boosting machine, much like a random forest, is a machine learning technique based on ensembling weak prediction models, generally decision trees. It uses *gradient boosting*, a way to improve any machine learning model by iteratively training new models that specialize in addressing the weak points of the previous models. It may be one of the best algorithms for dealing with nonperceptual data today.
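
The boosting idea of "new models that address the weak points of the previous ones" can be sketched for regression. This is a simplified illustration under several assumptions: 1-D inputs, squared-error loss (for which the negative gradient is simply the residual), and depth-1 trees ("stumps") as the weak models. The data and all function names are hypothetical, and production libraries do considerably more.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split of a 1-D input for predicting the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)

def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def fit_boosted(xs, ys, n_rounds=20, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to what is still wrong."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump_predict(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, stumps, learning_rate

def boosted_predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)

def mse(predictions, targets):
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

# Hypothetical data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 3.1, 2.9, 5.2, 5.0]

model = fit_boosted(xs, ys)
baseline_mse = mse([sum(ys) / len(ys)] * len(ys), ys)
boosted_mse = mse([boosted_predict(model, x) for x in xs], ys)
print(f"mean only: {baseline_mse:.3f}, boosted stumps: {boosted_mse:.3f}")
```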

### 1.2.5 Back to neural networks
Around 2010, although neural networks were almost completely shunned by the scientific community at large, a number of people still working on neural networks started to make important breakthroughs. In 2011, Dan Ciresan from IDSIA began to win academic image-classification competitions with GPU-trained deep neural networks, the first practical success of modern deep learning. But the watershed moment came in 2012, when Geoffrey Hinton's group entered the image-classification competition ImageNet. The ImageNet challenge consisted of classifying high-resolution color images into 1,000 different categories after training on 1.4 million images. The competition has been dominated by deep convolutional neural networks every year since. Since 2012, deep convolutional neural networks (convnets) have become the go-to algorithm for all computer vision tasks.

### 1.2.6 What makes deep learning different
The main reason deep learning took off so quickly is that it offered better performance on many problems. Deep learning also makes problem-solving much easier, because it completely automates what used to be the most crucial step in a machine learning workflow: feature engineering. With deep learning, you learn all features in one pass rather than having to engineer them yourself. This has greatly simplified machine learning workflows, often replacing sophisticated multistage pipelines with a single, simple, end-to-end deep learning model.

What is transformative about deep learning is that it allows a model to learn all layers of representation jointly, at the same time, rather than in succession (*greedily*). Whenever the model adjusts one of its internal features, all other features that depend on it automatically adapt to the change, without requiring human intervention. 

These are the two essential characteristics of how deep learning learns from data: the *incremental, layer-by-layer way in which increasingly complex representations are developed*, and the fact that *these intermediate incremental representations are learned jointly*, each layer being updated to follow both the representational needs of the layer above and the needs of the layer below.

### 1.2.7 The modern machine-learning landscape
In 2016 and 2017, Kaggle was dominated by two approaches: gradient boosting machines and deep learning. Specifically, gradient boosting is used for problems where structured data is available, whereas deep learning is used for perceptual problems such as image classification. Practitioners of the former almost always use the excellent XGBoost library, which offers support for the two most popular languages of data science: Python and R. Most of the entrants using deep learning use the Keras library, due to its ease of use, flexibility, and support of Python.

## 1.3 Why deep learning? Why now?
In general, three technical forces are driving advancements in machine learning:
 - Hardware
 - Datasets and benchmarks
 - Algorithmic advances
 
Because the field is guided by experimental findings rather than by theory, algorithmic advances only become possible when appropriate data and hardware are available to try new ideas. Machine learning isn't mathematics or physics, where major advances can be made with a pen and a piece of paper. It's an engineering science.

### 1.3.1 Hardware
Between 1990 and 2010, off-the-shelf CPUs became faster by a factor of approximately 5,000. It's possible to run small deep learning models on your laptop today, which would not have been possible 25 years ago.

Typical deep learning models used in computer vision or speech recognition require orders of magnitude more computational power than what your laptop can deliver. In 2007, NVIDIA launched CUDA, a programming interface for its line of GPUs. Deep neural networks, consisting mostly of small matrix multiplications, are highly parallelizable, and around 2011, some researchers began to write CUDA implementations of neural nets.

What's more, the deep learning industry is starting to go beyond GPUs and is investing in increasingly specialized, efficient chips for deep learning. Google's tensor processing unit (TPU) is reportedly 10 times faster and far more energy efficient than top-of-the-line GPUs.

### 1.3.2 Data
AI is sometimes called the new industrial revolution. If deep learning is the steam engine, then data is the coal: the raw material that powers our intelligent machines. If there is one dataset that has been a catalyst for the rise of deep learning, it's the ImageNet dataset, consisting of 1.4 million images that have been hand annotated with 1,000 image categories (1 category per image).

### 1.3.3 Algorithms
In addition to hardware and data, until the late 2000s, we were missing a reliable way to train very deep neural networks. As a result, neural networks were still fairly shallow, using only one or two layers of representations; thus, they weren't able to shine against more-refined shallow methods such as SVMs and random forests. The key issue was that of *gradient propagation* through deep stacks of layers. The feedback signal used to train neural networks would fade away as the number of layers increased.
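
The fading feedback signal can be illustrated numerically. The sketch below assumes a chain of scalar layers that each apply a sigmoid activation with a weight of 1; these choices are illustrative assumptions, since the notes do not say which activations or architectures were affected. By the chain rule, the gradient that reaches the input is a product of one factor per layer, and each sigmoid factor is at most 0.25.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def input_gradient(depth, weight=1.0, x=0.5):
    """d(output)/d(input) for a chain of `depth` scalar sigmoid layers."""
    activation = x
    gradient = 1.0
    for _ in range(depth):
        activation = sigmoid(weight * activation)
        # Chain rule: multiply by this layer's local derivative,
        # weight * sigmoid'(z), where sigmoid'(z) = a * (1 - a) <= 0.25.
        gradient *= weight * activation * (1.0 - activation)
    return gradient

gradients = {depth: input_gradient(depth) for depth in (1, 2, 5, 10, 20)}
for depth, gradient in gradients.items():
    print(f"{depth:2d} layers: gradient = {gradient:.3e}")
```

Under these assumptions the signal shrinks geometrically with depth, so the earliest layers receive almost no information about how to improve.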

This changed around 2009-2010 with the advent of several simple but important algorithmic improvements that allowed for better gradient propagation:
 - Better *activation functions* for neural layers
 - Better *weight-initialization schemes*, starting with layer-wise pretraining, which was quickly abandoned
 - Better *optimization schemes*, such as RMSProp and Adam
 
Only when improvements began to allow for training models with 10 or more layers did deep learning start to shine. In 2014, 2015, and 2016, even more advanced ways to help gradient propagation were discovered, such as batch normalization, residual connections, and depth-wise separable convolutions. Today we can train from scratch models that are thousands of layers deep.

### 1.3.4 A new wave of investment
In 2011, right before deep learning took the spotlight, the total venture capital investment in AI was around $19 million, which went almost entirely to practical applications of shallow machine learning approaches. By 2014, it had risen to a staggering $394 million. As a result of this wave of investment, the number of people working on deep learning went from a few hundred to tens of thousands in just five years. There are currently no signs that this trend will slow any time soon.

### 1.3.5 The democratization of deep learning
Nowadays, basic Python scripting skills suffice to do advanced deep-learning research. This has been driven most notably by the development of Theano and then TensorFlow (two symbolic tensor-manipulation frameworks for Python that support autodifferentiation, greatly simplifying the implementation of new models) and by the rise of user-friendly libraries such as Keras, which makes deep learning as easy as manipulating LEGO bricks. After its release in 2015, Keras quickly became the go-to deep learning solution for large numbers of new startups, graduate students, and researchers pivoting into the field.
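
To give a feel for what autodifferentiation buys you, here is a toy sketch in plain Python. It uses forward-mode differentiation with dual numbers, which is an assumption chosen for brevity: it is not the API of Theano, TensorFlow, or Keras, and those frameworks work on computation graphs and typically differentiate in reverse mode. The class `Dual` and the helper `derivative` are hypothetical names.

```python
class Dual:
    """A value paired with its derivative with respect to one input."""

    def __init__(self, value, grad=0.0):
        self.value = value
        self.grad = grad

    def __add__(self, other):
        other = _lift(other)
        # Sum rule: (u + v)' = u' + v'
        return Dual(self.value + other.value, self.grad + other.grad)

    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        # Product rule: (u * v)' = u' * v + u * v'
        return Dual(self.value * other.value,
                    self.grad * other.value + self.value * other.grad)

    __rmul__ = __mul__

def _lift(x):
    """Treat plain numbers as constants (derivative 0)."""
    return x if isinstance(x, Dual) else Dual(float(x))

def derivative(f, x):
    """Evaluate f at x while tracking d f / d x automatically."""
    return f(Dual(x, 1.0)).grad

def f(x):
    return 3 * x * x + 2 * x + 1  # derivative by hand: 6x + 2

print(derivative(f, 2.0))
```

The point is that `f` is written as ordinary arithmetic and its derivative is never coded by hand, which is what makes experimenting with new models cheap.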

### 1.3.6 Will it last?
Deep learning has several properties that justify its status as an AI revolution, and it's here to stay. We may not be using neural networks two decades from now, but whatever we use will directly be inherited from modern deep learning and its core concepts. These important properties can be broadly sorted into three categories:
 - **Simplicity** - Deep learning removes the need for feature engineering, replacing complex, brittle, engineering-heavy pipelines with simple, end-to-end trainable models that are typically built using only five or six different tensor operations.
 - **Scalability** - Deep learning is highly amenable to parallelization on GPUs or TPUs, so it can take full advantage of Moore's law.
 - **Versatility and reusability** - Unlike many prior machine learning approaches, deep learning models can be trained on additional data without restarting from scratch, making them viable for continuous online learning, an important property for very large production models. Furthermore, trained deep learning models are repurposable and thus reusable: for instance, it's possible to take a deep learning model trained for image classification and drop it into a video processing pipeline.