<h2 id="Contents">Contents<a href="#Contents"></a></h2>
        <ol>
        <li><a class="" href="#Imports">Imports</a></li>
<li><a class="" href="#Course-2">Course 2</a></li>
<ol><li><a class="" href="#Improved-Implementation-of-softmax">Improved Implementation of softmax</a></li>
<li><a class="" href="#Measuring-Purity">Measuring Purity</a></li>
<ol><li><a class="" href="#Entropy">Entropy</a></li>
<li><a class="" href="#Gini-Index">Gini Index</a></li>
</ol><li><a class="" href="#Choosing-a-split:-Information-Gain">Choosing a split: Information Gain</a></li>
</ol><li><a class="" href="#Course-3">Course 3</a></li>
<ol><li><a class="" href="#Anomaly-detection-vs.-supervised-learning">Anomaly detection vs. supervised learning</a></li>
<ol><li><a class="" href="#Anomaly-detection">Anomaly detection</a></li>
<li><a class="" href="#Supervised-learning">Supervised learning</a></li>
</ol><li><a class="" href="#Recommender-Systems">Recommender Systems</a></li>
<ol><li><a class="" href="#Deep-Learning-for-recommender-systems">Deep Learning for recommender systems</a></li>
</ol><li><a class="" href="#Reinforcement-Learning">Reinforcement Learning</a></li>
<ol><li><a class="" href="#Markov-Decision-Process">Markov Decision Process</a></li>
<li><a class="" href="#The-Mars-Rover-Example">The Mars Rover Example</a></li>
<li><a class="" href="#State">State</a></li>
<li><a class="" href="#Action">Action</a></li>
<li><a class="" href="#Reward">Reward</a></li>
<li><a class="" href="#Return">Return</a></li>
<li><a class="" href="#Policy">Policy</a></li>
<li><a class="" href="#Review">Review</a></li>
<li><a class="" href="#State-Action-Value-Function:-Q">State-Action Value Function: Q</a></li>
<li><a class="" href="#Bellman-Equation">Bellman Equation</a></li>
<li><a class="" href="#Continuous-State">Continuous State</a></li>
<li><a class="" href="#Lunar-Lander">Lunar Lander</a></li>
<li><a class="" href="#Training-a-Neural-Network">Training a Neural Network</a></li>
<li><a class="" href="#$%5Cepsilon$---Greedy-Policy">$\epsilon$ - Greedy Policy</a></li>
</ol>

# Imports

In [5]:
import numpy as np
import tensorflow as tf

# Course 2

## Improved Implementation of softmax

In [3]:
res = 2/10000
print(f"{res:.18f}")

0.000200000000000000


In [4]:
res = 1+ 1/10000 - (1-1/10000)
print(f"{res:.18f}")

0.000199999999999978


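This happens because computers have limited floating-point precision: the two expressions are mathematically equal, but the intermediate results are rounded differently.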

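Now, for logistic regression (the same idea carries over to softmax), instead of first computing the activation and then the loss:
$$
a = g(z) = \frac{1}{1+e^{-z}}\\
\mathcal{L} = -\sum_{i=1}^n \left[ y_i \log a_i + (1-y_i) \log (1-a_i) \right]
$$
we can substitute the expression for $a$ directly into the loss:
$$
\mathcal{L} = -\sum_{i=1}^n \left[ y_i \log \frac{1}{1+e^{-z_i}} + (1-y_i) \log \left(1-\frac{1}{1+e^{-z_i}}\right) \right]
$$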

This gives TensorFlow the flexibility to rearrange terms so it can compute the loss more accurately. The way we do this in TensorFlow is to use a `linear` activation in the last layer and then compute the `BinaryCrossentropy` loss from logits:

In [7]:
model = tf.keras.Sequential([
    tf.keras.layers.Dense(25, input_shape=(1,), activation='relu'),
    tf.keras.layers.Dense(15, activation='relu'),
    tf.keras.layers.Dense(1, activation='linear')
])

model.compile(loss = tf.keras.losses.BinaryCrossentropy(from_logits=True))

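> The logits are simply the raw outputs $z$ of the final linear layer, before the sigmoid (or softmax) is applied. With `from_logits=True`, the loss function applies the activation internally.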

If one of the $z$'s is very negative, then $e^{z}$ becomes very, very small, and if one of the $z$'s is very large, then $e^{z}$ becomes a very large number. By rearranging terms, TensorFlow can avoid some of these very small or very large numbers and therefore compute the loss function more accurately. The same idea applies to softmax, using a linear output layer with `SparseCategoricalCrossentropy(from_logits=True)`.
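A minimal sketch of the kind of rearrangement involved, in plain Python. The function names `naive_bce` and `stable_bce` are hypothetical, and the rearranged form `max(z, 0) - z*y + log(1 + exp(-|z|))` is one commonly used algebraic identity for this loss; it is shown as an illustration and is not claimed to be exactly what runs inside Keras.

```python
import math

def naive_bce(z, y):
    """Compute the sigmoid first, then the loss. Breaks down for large |z|."""
    a = 1.0 / (1.0 + math.exp(-z))
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

def stable_bce(z, y):
    """The same loss written directly in terms of the logit z."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

# For a moderate logit the two versions agree.
print(naive_bce(2.0, 1), stable_bce(2.0, 1))

# For a large logit, a rounds to exactly 1.0, so the naive version hits log(0).
print(stable_bce(40.0, 0))
try:
    naive_bce(40.0, 0)
except ValueError as err:
    print("naive version failed:", err)
```

The stable version never forms `1 - a`, which is where the precision is lost.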

In [8]:
model = tf.keras.Sequential([
    tf.keras.layers.Dense(25, input_shape=(1,), activation='relu'),
    tf.keras.layers.Dense(15, activation='relu'),
    tf.keras.layers.Dense(10, activation='linear')
])

model.compile(loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

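Because the last layer is now linear, the model outputs logits rather than probabilities. To get probabilities, we need to apply `tf.nn.sigmoid` to the output of the binary model, or `tf.nn.softmax` to the output of the multiclass model.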
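As a sketch of what this conversion computes, here it is in plain Python: a sigmoid for a single-logit binary model, and a softmax for a multi-output model such as the 10-unit one above (in TensorFlow, `tf.nn.sigmoid` and `tf.nn.softmax` do the equivalent on tensors). The helper names and the example logits are made up for illustration.

```python
import math

def sigmoid(z):
    """Map one logit to a probability (binary model)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    """Map a list of logits to probabilities that sum to 1 (multiclass model)."""
    m = max(logits)                        # subtract the max so exp() cannot overflow
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

print(sigmoid(0.0))                        # 0.5
probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```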

## Measuring Purity

> If the examples are all cats, that is, all of a single class, then that's very pure; if they are all not cats, that's also very pure.

We'll have a look at two metrics to measure the purity of a node:
1. Entropy
2. Gini Index

### Entropy

Entropy is a measure of impurity of data.

![](Images/0101.png)

Entropy is defined as a function of $p_1$, the fraction of examples that belong to class 1. With $p_0 = 1 - p_1$, the function is:
$$
H(p_1) = -p_1 \log_2 p_1 - (1-p_1) \log_2 (1-p_1)\\
H(p_1) = -p_1 \log_2 p_1 - p_0 \log_2 p_0
$$

If $p_0 = 0$ or $p_1 = 0$, we take $0 \log_2 0 = 0$ by convention, so the entropy is zero.
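The entropy formula, including the zero case, can be sketched in a few lines of Python. The function name `entropy` and the example fractions are illustrative.

```python
import math

def entropy(p1):
    """Binary entropy H(p1) in bits, treating 0 * log2(0) as 0."""
    if p1 in (0.0, 1.0):
        return 0.0
    p0 = 1.0 - p1
    return -p1 * math.log2(p1) - p0 * math.log2(p0)

print(entropy(0.5))    # 1.0 (a 50/50 node is maximally impure)
print(entropy(0.0))    # 0.0 (a node with a single class is pure)
print(entropy(5 / 6))  # somewhere in between
```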

### Gini Index

Gini Impurity is a measurement used when building decision trees to determine how the features of a dataset should split nodes to form the tree. More precisely, the Gini Impurity of a dataset is a number between 0 and 0.5 for two classes (in general, between 0 and $1 - 1/n$ for $n$ classes), which indicates the likelihood of new, random data being misclassified if it were given a random class label according to the class distribution in the dataset.

Mathematically, the Gini Impurity of a dataset is defined as:
$$
G = 1 - \sum_{i=1}^n p_i^2
$$

Where $p_i$ is the probability of a random element in the dataset belonging to class $i$.
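A short sketch of this formula in Python, with a hypothetical helper `gini` that takes the list of class probabilities $p_i$:

```python
def gini(probs):
    """Gini impurity G = 1 - sum(p_i^2) for a list of class probabilities."""
    return 1.0 - sum(p * p for p in probs)

print(gini([1.0, 0.0]))   # 0.0 (pure node)
print(gini([0.5, 0.5]))   # 0.5 (maximum for two classes)
print(gini([0.25] * 4))   # 0.75 (maximum for four classes)
```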

![](https://storage.googleapis.com/lds-media/images/gini-impurity-diagram.width-1200.png)

## Choosing a split: Information Gain

When building a decision tree, we decide what feature to split on at a node based on which choice of feature reduces entropy the most (equivalently, reduces impurity or maximizes purity). In decision tree learning, the reduction of entropy is called information gain.

![](Images/0102.png)

We choose the feature that results in the highest reduction in entropy when splitting. In the above example, splitting on ear shape results in the highest reduction in entropy. Concretely, the information gain is:
$$
IG = H(p_1^\text{root}) - \left(\frac{N_\text{left}}{N_\text{root}} H(p_1^\text{left}) + \frac{N_\text{right}}{N_\text{root}} H(p_1^\text{right})\right)
$$
where $N_\text{root}$ is the number of examples in the root node, $N_\text{left}$ is the number of examples in the left node, and $N_\text{right}$ is the number of examples in the right node. $p_1^\text{left}$ is the fraction of examples in the left node which are of class 1. $p_1^\text{right}$ is the fraction of examples in the right node which are of class 1. While $p_1^\text{root}$ is the fraction of examples in the root node which are of class 1.
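The information gain formula can be sketched as follows. The function names are hypothetical, the labels are 0/1 lists, both child nodes are assumed to be non-empty, and the example split (4 of 5 positives on the left, 1 of 5 on the right) uses made-up counts for illustration.

```python
import math

def entropy(p1):
    """Binary entropy in bits, treating 0 * log2(0) as 0."""
    if p1 in (0.0, 1.0):
        return 0.0
    return -p1 * math.log2(p1) - (1 - p1) * math.log2(1 - p1)

def fraction_positive(labels):
    """Fraction of examples with label 1."""
    return sum(labels) / len(labels)

def information_gain(root, left, right):
    """Entropy at the root minus the weighted entropy of the two children."""
    w_left = len(left) / len(root)
    w_right = len(right) / len(root)
    weighted = (w_left * entropy(fraction_positive(left))
                + w_right * entropy(fraction_positive(right)))
    return entropy(fraction_positive(root)) - weighted

left = [1, 1, 1, 1, 0]
right = [1, 0, 0, 0, 0]
root = left + right
print(information_gain(root, left, right))
```

To choose a split, one would compute this value for every candidate feature and keep the feature with the largest gain.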

# Course 3

## Anomaly detection vs. supervised learning

### Anomaly detection

1. Use when there is a very small number of positive examples and a large number of negative examples.
2. Use when there are many different types of anomalies. In this case, it's hard to get enough positive examples to train a classifier, because a different kind of anomaly might show up in the future. Anomaly detection copes with this because the model learns what is normal and flags anything that is not normal as an anomaly, so it can detect anomalies of a kind it has never seen before.

### Supervised learning

1. When there are enough positive examples to train a classifier.
2. Supervised learning works by getting a sense of what positive examples look like and then trying to classify future positive examples, which are assumed to resemble the ones seen in training. So, when there are not enough positive examples, it's hard to get a sense of what positive examples look like.

> For example, instead of trying to learn every type of financial fraud (which is hard), we can try to learn what is normal and then treat anything that is not normal as an anomaly.

> For spam email, there is a recurring pattern: the messages try to send us to a website or get us to click a link. In this case supervised learning will be better.

![](Images/0103.png)

## Recommender Systems

Using ratings by users, we can determine features of the movies and then use those features to predict the rating of a movie that a user hasn't seen yet.

>Collaborative filtering is a technique that can filter out items that a user might like on the basis of reactions by similar users.
>
> Content-based filtering is a technique that can filter out items that a user might like on the basis of a description of the item and a profile of the user's preferences.

### Deep Learning for recommender systems

Use neural networks to create a feature vector for each user and each movie. Then take the dot product of the two vectors, which gives a sense of whether the movie will be liked by the user or not.

The feature vectors of the movies can be used to find other similar movies.
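A toy sketch of both ideas in plain Python. The vectors below are invented stand-ins for what the two networks would output, and using squared distance to rank similar movies is an assumption made for this example; the notes do not specify a similarity measure.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def squared_distance(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

# Hypothetical learned feature vectors (not real data).
user_vec = [0.9, 0.1]
movie_vecs = {
    "Movie A": [1.0, 0.0],
    "Movie B": [0.9, 0.2],
    "Movie C": [0.0, 1.0],
}

# Predicted affinity of the user for each movie.
scores = {name: dot(user_vec, vec) for name, vec in movie_vecs.items()}
print(scores)

def most_similar(name):
    """Return the other movie whose feature vector is closest to this one."""
    others = (other for other in movie_vecs if other != name)
    return min(others, key=lambda other: squared_distance(movie_vecs[name], movie_vecs[other]))

print(most_similar("Movie A"))   # Movie B
```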

## Reinforcement Learning

Reinforcement Learning is a type of Machine Learning. It allows machines and software agents to automatically determine the ideal behavior within a specific context in order to maximize their performance. Only simple reward feedback is required for the agent to learn its behavior; this is known as the reinforcement signal.

### Markov Decision Process

A Markov Decision Process (MDP) model contains: 

* A set of possible world states S.
* A set of possible actions A.
* A real-valued reward function $R(s,a)$.
* A policy $\pi$, which is the solution of the Markov Decision Process.

The term Markov in Markov Decision Process means that the future depends only on the current state and not on anything that might have occurred prior to getting to the current state. In other words, in a Markov decision process, the future depends only on where you are now, not on how you got here.

![](Images/0107.png)

### The Mars Rover Example

![](Images/0104.png)

### State

The state set $S$ is a set of tokens that represent every state that the agent can be in. In the Mars Rover example, the state is the position of the rover, which is one of six positions.

### Action

The action set $A$ is the set of all possible actions. $A(s)$ defines the set of actions that can be taken in state $s$. In our example, the action is the direction in which the rover moves (left or right).

### Reward

The reward is given by a real-valued reward function. $R(s)$ indicates the reward for simply being in state $s$. It is an incentive for the agent to learn to be in a state that has a high reward. In our example, the reward is 100 at state 1 and 40 at state 6.

### Return

$$
\text{Return} = \sum_{t=0}^T \gamma^t R_t
$$

The return is the discounted sum of the rewards from the current time step to the end of the episode. The discount factor $\gamma$ reduces the importance of future rewards: a reward received $t$ steps from now is weighted by $\gamma^t$. Usually, $\gamma$ is set very near to 1.0.
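The return formula can be sketched directly in Python. The reward sequence and the value $\gamma = 0.5$ below are chosen only to make the arithmetic easy to follow.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^t * R_t over the reward sequence R_0, R_1, ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Three steps with no reward, then a reward of 100 on the fourth step.
print(discounted_return([0, 0, 0, 100], 0.5))   # 12.5
```

With $\gamma = 0.5$ the reward of 100 is worth only $0.5^3 \cdot 100 = 12.5$ from the starting state, which is why rewards that arrive sooner are preferred.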

![](Images/0105.png)

### Policy

Policy $\pi(s)$ tells what action to take in a given state to maximize the return. It is a mapping from states to actions.

### Review

![](Images/0106.png)

### State-Action Value Function: Q

![](Images/0108.png)

So, the best possible return from state $s$ is $ \max_{a} Q(s,a)$ while the best action to take in state $s$ is $\arg\max_a Q(s,a)$.

### Bellman Equation

The Bellman equation is a recursive equation that relates the value of a state to the values of the successor states. Mathematically, it is defined as:
$$
Q(s,a) = R(s,a) + \gamma \max_{a'} Q(s',a')
$$

Here $R(s,a)$ is the reward for taking action $a$ in state $s$. $\gamma$ is the discount factor. $s'$ is the next state. $a'$ is the next action.

![](Images/0109.png)

The first term is the reward you get right away. The second term is the return if you behave optimally from the next state. The discount factor $\gamma$ is used to reduce the importance of future rewards.

If s is terminal, then $Q(s,a) = R(s,a)$. The second term is 0.
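As a sketch, the Bellman equation can be applied repeatedly to a six-state chain like the Mars Rover example until the Q values stop changing. The rewards of 100 and 40 at states 1 and 6 come from the notes; treating those two states as terminal, giving states 2 to 5 a reward of 0, and using $\gamma = 0.5$ are assumptions made for this illustration.

```python
GAMMA = 0.5                                        # assumed discount factor
REWARD = {1: 100, 2: 0, 3: 0, 4: 0, 5: 0, 6: 40}   # zeros for states 2-5 are assumed
TERMINAL = {1, 6}                                  # assumed terminal states
ACTIONS = {"left": -1, "right": +1}

# Start from Q = 0 everywhere and apply the Bellman equation repeatedly.
Q = {(s, a): 0.0 for s in REWARD for a in ACTIONS}
for _ in range(20):                                # more sweeps than this small chain needs
    for s in REWARD:
        for a, step in ACTIONS.items():
            if s in TERMINAL:
                Q[(s, a)] = float(REWARD[s])       # terminal: Q(s, a) = R(s)
            else:
                s_next = s + step
                best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
                Q[(s, a)] = REWARD[s] + GAMMA * best_next

for s in sorted(REWARD):
    best_action = max(ACTIONS, key=lambda a: Q[(s, a)])   # arg max over actions
    print(s, Q[(s, "left")], Q[(s, "right")], "best:", best_action)
```

Under these assumptions, going left from state 2 is worth 50 and going right from state 5 is worth 20, so the best action depends on which end of the chain the rover is closer to.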

> In a stochastic environment, where the outcome of an action (and hence the next state $s'$) is random, the value of a state-action pair is the expected (average) return. The Bellman equation takes the form:
> $$Q(s,a) = R(s,a) + \gamma E[\max_{a'}Q(s',a')]$$
> $E$ is the expectation operator.

### Continuous State

For real-life problems, the state space is often continuous and may consist of more than one variable. For example, the state of a car may be defined by its position $(x, y)$ and orientation $\theta$, together with the corresponding velocities. So, the state is
$$
s = 
\begin{bmatrix}
x\\
y\\
\theta\\
\dot{x}\\
\dot{y}\\
\dot{\theta}
\end{bmatrix}
$$

Similarly, the state of a helicopter may be defined by its position $(x, y, z)$ and its orientation angles (roll $\phi$, pitch $\theta$, yaw $\psi$), together with the corresponding velocities. So, the state is
$$
s =
\begin{bmatrix}
x\\
y\\
z\\
\phi\\
\theta\\
\psi\\
\dot{x}\\
\dot{y}\\
\dot{z}\\
\dot{\phi}\\
\dot{\theta}\\
\dot{\psi}
\end{bmatrix}
$$

### Lunar Lander

The state is:
$$
s =
\begin{bmatrix}
x\\
y\\
\theta\\
\dot{x}\\
\dot{y}\\
\dot{\theta}\\
l\\
r\\
\end{bmatrix}
$$

Here $l$ and $r$ indicate whether the left and right legs are in contact with the ground. Each is either 0 or 1.

![](Images/0110.png)

### Training a Neural Network

The goal is to learn a policy $\pi$ that maximizes the return. Deep Q-Learning is used to learn the policy.

![](Images/0111.png)

For this, we start by taking random actions and using the Bellman equation to generate a training set. Then we train a neural network to predict the Q value for a given state and action. Next, we use the trained network's Q estimates to generate a new training set and train the network again. This process is repeated until the neural network converges.

![](Images/0112.png)

The input $x$ is built from $s$ and $a$, and the target $y$ is computed from $R(s)$ and the next state $s'$ as $y = R(s) + \gamma \max_{a'} Q(s', a')$.
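A minimal sketch of how such a training set could be assembled, following the Bellman equation. Everything here is hypothetical scaffolding: `build_training_set`, the `done` flag (marking that the episode ended on that step), the stand-in `initial_guess` in place of a real network, and the toy experience tuples.

```python
def build_training_set(experiences, q_values, gamma):
    """Turn experience tuples into (x, y) pairs for supervised training.

    experiences: list of (s, a, reward, s_next, done) tuples.
    q_values:    callable, q_values(s) -> list of current Q estimates,
                 one per action.
    """
    xs, ys = [], []
    for s, a, reward, s_next, done in experiences:
        if done:
            y = reward                                   # terminal: no future term
        else:
            y = reward + gamma * max(q_values(s_next))   # Bellman target
        xs.append((s, a))                                # network input
        ys.append(y)                                     # regression target
    return xs, ys

def initial_guess(state):
    """Stand-in for an untrained network; the values are arbitrary."""
    return [1.0, 3.0]

experiences = [
    ((0.0, 1.0), 1, 2.0, (0.5, 1.0), False),
    ((0.5, 1.0), 0, 100.0, (1.0, 1.0), True),
]
xs, ys = build_training_set(experiences, initial_guess, gamma=0.5)
print(xs)
print(ys)   # [3.5, 100.0]
```

In a full training loop, the network would be fitted to `(xs, ys)` and then used as the new `q_values` for the next round.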

![](Images/0113.png)

This algorithm is called Deep Q-Learning. Instead of outputting the Q value of a single action, as shown in the above figure, it is more efficient to train a model that outputs the Q values of all possible actions at once. This way, we have to run inference just one time per state.

![](Images/0114.png)

### $\epsilon$ - Greedy Policy

The $\epsilon$-greedy policy is used to balance exploration and exploitation. With probability $\epsilon$, the agent takes a random action. With probability $1-\epsilon$, the agent takes the action that maximizes $Q(s,a)$. This works better than always choosing the action that maximizes $Q(s,a)$.
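The action-selection rule can be sketched as below. The function name `epsilon_greedy`, the example Q values, and the choice of $\epsilon = 0.1$ are illustrative; `q_values` is assumed to hold one Q estimate per action for the current state.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index: random with probability epsilon, greedy otherwise."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                       # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit

rng = random.Random(0)                      # seeded only to make the demo repeatable
q_values = [1.0, 5.0, 2.0]
counts = [0, 0, 0]
for _ in range(1000):
    counts[epsilon_greedy(q_values, 0.1, rng)] += 1
print(counts)                               # action 1 dominates, the others still get tried
```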

This can be explained by the fact that the NN initializes its parameters randomly, and it is quite possible that these parameters make the NN output a very low Q value for some actions that would actually be helpful. A purely greedy agent would then never try those actions and would always choose from just a subset of the actions, so it would never learn that they are good. Occasionally taking random actions avoids this. Balancing the two is called the **exploration-exploitation dilemma**.

The exploration versus exploitation trade-off refers to how often you take actions randomly, or take actions that may not be the best, in order to learn more, versus trying to maximize your return.

![](Images/0115.png)

> Reinforcement learning is more sensitive to the choice of hyperparameters than supervised learning. Using the wrong hyperparameters can cause the agent to take a very long time to learn, or to learn nothing at all.