# What is deep learning?

Deep learning is a subfield of machine learning. Even though it draws countless inspirations from real neurons, we will focus on modeling everything with formulas, intuitions, and theories that actually work.

In practice, deep learning is the scaling up of computational structures called neural networks.

Why do we take the time to develop such approaches?

Because it is currently the most effective approach we have for working with really large-scale data.

> It is important to keep in mind that deep learning is all about learning **powerful representations**.

There is a huge shift from extracting features to learning features, and that is what deep learning is all about.

# Deep learning applications

In this course, you will get a general perspective of a huge variety of problems that you can solve with deep learning.

First, you will learn to formulate problems in terms of machine and deep learning. That's a crucial skill that you will use throughout your career and projects. We will focus our applications on computer vision tasks.

Secondly, you will learn the most basic components that tackle some of the following tasks:
- Image classification
- Image regression
- Object detection
- Generative models
- Embedded deep learning

Deep learning has already transformed a variety of businesses such as web search, augmented reality, social networks, automobiles, retail, cybersecurity, and manufacturing. But the most exciting thing is the potential for novel applications that may appear in the future. These projects could radically transform every industry.

Some experts claim that **AI is the new electricity**.

While this may be a disputable idea, what is certain is that **deep learning is one of the most sought-after and well-paid skills**.

So why stay behind?


# Linear Classifiers

Explore linear classifiers, their principles, and their training process.

We will cover the following:

- What is a linear classifier?
- Training a classifier
- Loss function
- Optimization and training process

# What is a linear classifier?

Suppose we want to build a machine learning model to classify the following points into two categories based on their color. It is very easy to see that we can find a single point that separates them perfectly. The goal of our model is to find this point.


![pic](https://raw.githubusercontent.com/CUTe-EmbeddedAI/images/main/images/fig01.PNG)

The easiest way to do that is to build a linear classifier. Our classifier has the form $f(x, w) = w_{1}x_{1} + w_{2}$. The purpose of training will be to find the parameters $w_{1}$ and $w_{2}$ so that any scalar (1D) point can be classified correctly. If $f(x, w) > 0$, the point belongs to the blue category. Otherwise, it belongs to the red one.
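
As a minimal sketch of this decision rule, the snippet below classifies a few scalar points. The values of `w1` and `w2` are hand-picked for illustration (they are assumptions, not learned parameters), and the function name `classify_1d` is hypothetical.

```python
def classify_1d(x, w1, w2):
    """Return 'blue' if f(x, w) = w1 * x + w2 is positive, otherwise 'red'."""
    score = w1 * x + w2
    return "blue" if score > 0 else "red"

# Assumed parameters: the separating point is x = -w2 / w1 = 2.0
w1, w2 = 1.0, -2.0

points = [0.5, 1.9, 2.5, 4.0]
labels = [classify_1d(x, w1, w2) for x in points]
print(labels)  # ['red', 'red', 'blue', 'blue']
```

Points to the right of the separating point get a positive score and land in the blue category, while the rest are red.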

Sounds easy?

Let’s extend this idea to 2D data points!

Each point will now be represented as $(x_{1} , x_{2})$.

For the 2D case, we need to find a line (instead of a point) that separates our 2D points, so our classifier will be $f(x, w) = w_{1}x_{1} + w_{2}x_{2} + w_{3}$. Again, the classifier should be trained to find the optimal $w_{1}$, $w_{2}$, and $w_{3}$.


![pic](https://raw.githubusercontent.com/CUTe-EmbeddedAI/images/main/images/fig02.PNG)
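
The same rule can be sketched for 2D points. Here the parameters are again hand-picked assumptions that describe the line $x_{1} + x_{2} - 1 = 0$, the sign convention (positive means blue) is carried over from the 1D case, and `classify_2d` is a hypothetical helper name.

```python
def classify_2d(x1, x2, w1, w2, w3):
    """Return 'blue' if f(x, w) = w1 * x1 + w2 * x2 + w3 is positive, otherwise 'red'."""
    score = w1 * x1 + w2 * x2 + w3
    return "blue" if score > 0 else "red"

# Assumed parameters: the separating line is x1 + x2 - 1 = 0
w1, w2, w3 = 1.0, 1.0, -1.0

points = [(0.0, 0.0), (2.0, 1.0), (0.25, 0.25), (1.0, 1.0)]
labels_2d = [classify_2d(x1, x2, w1, w2, w3) for x1, x2 in points]
print(labels_2d)  # ['red', 'blue', 'red', 'blue']
```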

This idea extends naturally to arbitrary ($N$) dimensions. The line in 2D space becomes a plane in 3D space and, in higher dimensions, a hyperplane: a flat surface with one dimension fewer than the space around it. This hyperplane separates the $N$-dimensional space into 2 classes (red and blue).

We will use linear algebra and matrices to formulate it. To aid readability, matrices will be denoted with capital letters.

As a result, we now have $f(x, W) = Wx + b$, where $x$ is an $N$-dimensional input vector, $W$ is an $M \times N$ matrix, and $b$ is an $M$-dimensional vector, with $M$ being the number of outputs. We need to find the correct values of $W$ and $b$ to define a hyperplane. Once we have those, we can compute the category $y$ for any data point $x$.

From now on, we’ll denote our classifier as $f(x_{i} ,W)$
![linear](images/fig03.png)

In PyTorch, we can build a linear classifier with 5 inputs and 10 outputs using just one line of code. The following code initializes a trainable matrix and a vector, and every time we call the `classifier` instance, it performs the operation
$y = Wx + b$

```python
# Basic imports
import torch
import torch.nn as nn

# Initializes a trainable matrix W and a vector b
classifier = nn.Linear(5, 10)

print(classifier)
```

```
Linear(in_features=5, out_features=10, bias=True)
```
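
To connect the layer back to the formula, here is a small sketch that inspects the shapes of the stored parameters and checks that calling the layer matches a manual $Wx + b$. The fixed seed and the random input `x` are assumptions made only so the example is repeatable.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed so the random initialization is repeatable

classifier = nn.Linear(5, 10)

# nn.Linear stores W with shape (out_features, in_features)
# and b with shape (out_features,)
print(classifier.weight.shape)  # torch.Size([10, 5])
print(classifier.bias.shape)    # torch.Size([10])

# Dummy 5-dimensional input
x = torch.randn(5)

# Calling the layer and computing Wx + b by hand should agree
y_layer = classifier(x)
y_manual = classifier.weight @ x + classifier.bias
print(torch.allclose(y_layer, y_manual, atol=1e-6))
```

Note that a layer with 5 inputs and 10 outputs holds a 10 x 5 matrix, so the output vector has 10 elements.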

# Training a classifier

We know that we want to find the matrix $W$ and the vector $b$ in order to classify our examples. 

But how? 

First of all, we need training data. Training data are data points ($x$) whose category (target class $t$) we already know. Thus, we can use them to “train” our classifier.

> “Training” a classifier means trying to find the matrix $W$ (and the vector $b$) by feeding the classifier data points whose labels are already known.

Because we know the “labels” (categories) of the data, these training approaches are called **supervised**. The data are provided in pairs $(x,t)$. We use $x$ as the input to the classifier and the label $t$ to compute the loss (distance). Note that $y$ refers to the output of the classifier, that is, $y = Wx + b$.

> Intuitively, we will push the randomly initialized model to learn this mapping from $x \rightarrow t$

Before we describe the process of training, we need to describe two more concepts.

# Loss function

**Loss (or cost) is a measure of how good or bad a classification of a data-point is.** Alternatively, it can be defined as how far the classifier’s prediction $y$ is, for the data-point $x$, from the actual class $t$. Let’s make that crystal clear:

Given a dataset $(x_{i}, t_{i})$ of $n$ points, where each $x_{i}$ is an $N$-dimensional point in space and $t_{i}$ is an integer that defines the point's category, the loss is the distance between $f(x_{i}, W)$ and $t_{i}$.

$C_{i}(f (x_{i},W), t_{i})$ is the cost for a single example $x_{i}$.

The overall loss of the entire training data is simply the average of all the individual losses. However, in practice, we rarely average the loss over all datapoints.

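Note that the choice of the loss function depends on the problem and the form of the data. In our case, from now on, we will use the squared error distance, defined for a single example as:

> $C_{i} = \sum_{j}(f(x_{i}, W)_{j} - t_{i,j})^{2}$

Notice that the sum runs over the elements $j$ of the output vector. Dividing it by the number of elements gives the mean squared error, which is what PyTorch's `nn.MSELoss` computes by default. Here is a code example: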

```python
import torch
import torch.nn as nn

# Define a linear model with 10 inputs and 3 outputs
model = nn.Linear(10, 3)

# Define the loss function (averages the squared errors by default)
loss = nn.MSELoss()

# Dummy input x
input_vector = torch.randn(10)
# Class number 3, denoted as a vector with a 1 at the class position.
# Float values keep the dtype consistent with the float prediction.
target = torch.tensor([0.0, 0.0, 1.0])
# y in math
pred = model(input_vector)
output = loss(pred, target)

print("Prediction: ", pred)
print("Output: ", output)
```

```
Prediction:  tensor([ 1.0859, -0.3607, -1.1807], grad_fn=<AddBackward0>)
Output:  tensor(2.0215, grad_fn=<MseLossBackward0>)
```
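
To see where that number comes from, the sketch below recomputes the loss by hand. It copies the rounded prediction values printed above (so the result only matches to about three decimals) and compares the summed and averaged squared errors with `nn.MSELoss` under its `reduction="sum"` and default `reduction="mean"` settings.

```python
import torch
import torch.nn as nn

# Rounded prediction copied from the printed output above, and the one-hot target
pred = torch.tensor([1.0859, -0.3607, -1.1807])
target = torch.tensor([0.0, 0.0, 1.0])

# Squared error of each element of the output vector
squared_errors = (pred - target) ** 2

sum_loss = squared_errors.sum()    # the sum of squared errors
mean_loss = squared_errors.mean()  # the same sum divided by the 3 elements

# Both reductions agree with the built-in loss
print(torch.allclose(sum_loss, nn.MSELoss(reduction="sum")(pred, target)))
print(torch.allclose(mean_loss, nn.MSELoss()(pred, target)))
print(mean_loss)  # roughly 2.02, in line with the output shown above
```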


It is important to understand that even though the target class is a scalar (3 in the example above), we convert it to a vector. For three classes, the possible target vectors $t$ are:

class 1 $\rightarrow$ [1,0,0]

class 2 $\rightarrow$ [0,1,0]

class 3 $\rightarrow$ [0,0,1]

This is also called **one-hot encoding** in machine learning.
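
As a short sketch, PyTorch can build these vectors with `torch.nn.functional.one_hot`. One assumption to keep in mind: the function expects zero-based class indices, so classes 1, 2, and 3 from the text become indices 0, 1, and 2 here.

```python
import torch
import torch.nn.functional as F

# Zero-based class indices: class 1 -> 0, class 2 -> 1, class 3 -> 2
class_indices = torch.tensor([0, 1, 2])

one_hot = F.one_hot(class_indices, num_classes=3)
print(one_hot)
# tensor([[1, 0, 0],
#         [0, 1, 0],
#         [0, 0, 1]])

# one_hot returns integers; convert to float so the dtype matches
# the floating-point predictions when computing a loss
targets = one_hot.float()
print(targets.dtype)  # torch.float32
```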

# Optimization and training process

Optimization is the process of finding the weight matrix $W$ that minimizes the loss function. In other words, it is the process of selecting the individual weights $w_{i}$ so that the classifier’s prediction $y$ is as close as possible to the point’s real label $t$.

Mathematically this can be written as:

> $w' = \arg\min_{w} C(w)$

For now, let’s keep in mind that optimization is an abstract concept that describes how we select the matrix. We will dive into it in the next lesson where we will talk about neural networks.
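
To make the $\arg\min$ notation concrete, here is a toy sketch that tries a handful of candidate weights and keeps the one with the lowest cost. This brute-force search is only an illustration of the notation, not the method used to train deep learning models. The data, the one-parameter model $f(x, w) = wx$, and the candidate grid are all assumptions.

```python
# Toy 1D data where the targets follow t = 2 * x (assumed for illustration)
xs = [1.0, 2.0, 3.0]
ts = [2.0, 4.0, 6.0]

def cost(w):
    """Sum of squared errors of the model f(x, w) = w * x over the toy data."""
    return sum((w * x - t) ** 2 for x, t in zip(xs, ts))

# Candidate weights from 0.0 to 4.0 in steps of 0.5
candidates = [i * 0.5 for i in range(9)]

# argmin: the candidate with the smallest cost
w_best = min(candidates, key=cost)
print(w_best, cost(w_best))  # 2.0 0.0
```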

Now, we can describe the training algorithm in its entirety:

Given a set of training examples $x_{i}$ with their labels $t_{i}$, we need to:

- Initialize the classifier $f(x_{i}, W)$ with random weights $W$.

- Feed a training example $x_{i}$ into the classifier and get the output $y_{i}$.

- Compute the loss $C_{i}$ between the prediction $y_{i}$ and the target $t_{i}$.

- Adjust the weights $W$ according to the loss $C_{i}$ (next lesson).

- Repeat for all training examples.
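
The steps above can be sketched as a PyTorch loop. Since the weight adjustment is only covered in the next lesson, this sketch assumes plain stochastic gradient descent (`torch.optim.SGD`) as a stand-in for that step. The toy dataset, the learning rate, and the number of passes over the data are also assumptions chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)  # fixed seed so the run is repeatable

# Toy dataset (assumed): 30 random 3D points, labeled by their largest coordinate
inputs = torch.randn(30, 3)
class_indices = inputs.argmax(dim=1)
targets = F.one_hot(class_indices, num_classes=3).float()

# Step 1: initialize the classifier with random weights
model = nn.Linear(3, 3)
loss_fn = nn.MSELoss()
# Assumption: SGD stands in for the weight-adjustment rule of the next lesson
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

# Average loss over the dataset before training
with torch.no_grad():
    initial_loss = loss_fn(model(inputs), targets).item()

for epoch in range(20):              # several passes over the data
    for x_i, t_i in zip(inputs, targets):
        y_i = model(x_i)             # Step 2: feed one example, get the output
        c_i = loss_fn(y_i, t_i)      # Step 3: loss between prediction and target
        optimizer.zero_grad()        # clear gradients from the previous example
        c_i.backward()               # compute gradients of the loss
        optimizer.step()             # Step 4: adjust the weights

# Average loss over the dataset after training
with torch.no_grad():
    final_loss = loss_fn(model(inputs), targets).item()

print(initial_loss, final_loss)
```

The average loss after training should be lower than before, which is the sign that the weights moved toward the mapping from $x$ to $t$.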

This is the core idea behind training all deep learning models. In the end, we will have a trained classifier that can **generalize to previously unseen examples**.

> The only step that should be unclear now is how we adjust the weights. We will discuss this in the next lesson.