# Artificial Intelligence A-Z : Udemy

### What will we learn?
- What is reinforcement learning?
- the Bellman Equation
- the 'Plan'
- Markov Decision Process (MDP)
- Policy vs Plan
- Adding a "Living Penalty"
- Q-Learning Intuition
- Temporal Difference
- Q-Learning Visualization

### Reinforcement Learning

An Agent -> Performs certain actions -> In a constrained environment/set of rules -> Resultant state or reward cycle
- Nothing is preprogrammed and the AI learns to reach the optimal solution by itself.
- The only factors defined are the environment, the goal and the positive/negative feedback rewards.

### The Bellman Equation

Richard Ernest Bellman - applied mathematician who introduced dynamic programming

Concepts:
- s - state
- a - action
- R - reward
- Gamma - discount factor

The discount factor reduces the value of rewards that lie further in the future, so states closer to the reward end up with higher values. The agent therefore always has an idea of where to move and what direction to take, based on the resultant max values from the equation:
![be.jpg](attachment:be.jpg)
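
In case the image does not render, the deterministic form of the Bellman equation is usually written as below. The notation is an assumption and may differ slightly from the image: $V(s)$ is the value of state $s$, and $s'$ is the state reached by taking action $a$ in $s$.

$$V(s) = \max_a \left( R(s,a) + \gamma V(s') \right)$$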

### The 'Plan'

The plan is created by the agent after it has calculated the max value of the Bellman equation for whatever its next move may be. Therefore, if it is placed at or finds itself at any point within an environment (e.g. a maze), it knows the optimal next step.
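
A minimal sketch of how a plan can fall out of the Bellman equation, assuming a made-up deterministic corridor of five states with a reward of +1 for stepping into the last one. The environment, the names (`step`, `value_iteration`, `plan`) and the value of `GAMMA` are all illustrative assumptions, not taken from the course.

```python
# Hypothetical 1-D corridor: states 0..4, reward +1 for stepping into state 4.
GAMMA = 0.9          # discount factor (illustrative value)
N_STATES = 5
GOAL = 4             # terminal state
ACTIONS = {"left": -1, "right": +1}


def step(state, action):
    """Deterministic transition: return (next_state, reward)."""
    next_state = min(max(state + ACTIONS[action], 0), N_STATES - 1)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward


def value_iteration(sweeps=50):
    """Repeatedly apply V(s) = max_a (R(s,a) + gamma * V(s'))."""
    values = [0.0] * N_STATES
    for _ in range(sweeps):
        for s in range(N_STATES):
            if s == GOAL:
                continue  # the terminal state keeps a value of 0
            candidates = []
            for a in ACTIONS:
                next_state, reward = step(s, a)
                candidates.append(reward + GAMMA * values[next_state])
            values[s] = max(candidates)
    return values


def plan(values):
    """For each non-terminal state, pick the action with the max value."""
    best = {}
    for s in range(N_STATES):
        if s == GOAL:
            continue
        scores = {}
        for a in ACTIONS:
            next_state, reward = step(s, a)
            scores[a] = reward + GAMMA * values[next_state]
        best[s] = max(scores, key=scores.get)
    return best


values = value_iteration()
best = plan(values)
print(values)
print(best)
```

The values shrink by a factor of `GAMMA` for every step away from the goal, and following the highest value from any state leads to the goal.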

### Markov Decision Process (MDP)

- Deterministic Search - The agent will execute what it intends to at a given point with a 100% chance of fulfilling that intention.
- Non-Deterministic Search - There are smaller chances of outcomes other than the one the agent intended, due to the nature of the environment. In other words, there is an element of randomness in the outcome within the environment, much like the real world.

A Markov process is a stochastic process that satisfies the Markov property (sometimes characterized as "memorylessness"). In simpler terms, it is a process for which predictions can be made regarding future outcomes based solely on its present state and—most importantly—such predictions are just as good as the ones that could be made knowing the process's full history. In other words, conditional on the present state of the system, its future and past states are independent.

In mathematics, a Markov decision process (MDP) is a discrete-time stochastic control process. It provides a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker.
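
With randomness in the outcomes, the Bellman equation is usually written with an expectation over the possible next states. The form below is a sketch in assumed notation: $V(s)$ is the value of state $s$, $R(s,a)$ the reward, $\gamma$ the discount factor, and $P(s,a,s')$ the probability of ending up in state $s'$ after taking action $a$ in state $s$.

$$V(s) = \max_a \left( R(s,a) + \gamma \sum_{s'} P(s,a,s')\, V(s') \right)$$

In the deterministic case one $P(s,a,s')$ equals 1 and the rest are 0, so the sum collapses to a single $V(s')$.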

Additional Reading - A Survey of Applications of Markov Decision Processes (By D.J. White - 1993)

### Policy vs Plan

A plan is when you know exactly what to do next in a given state because actions are deterministic, i.e. they succeed with 100% probability. However, when there are environmental factors outside of your control, executing a plan may still lead to unexpected changes in state that weren't part of said 'plan'. This leads to the formation of policies: the best action to take in each state, with the randomness of the outcomes factored in.

### The 'Living Penalty'

By adding a small living penalty, i.e. a negative reward at non-terminal states, the agent is motivated to reach the final reward or positive outcome in the fastest time possible, thereby accumulating the smallest penalty. Tweaking the size of this penalty creates intuitive changes in the agent's optimal policy and overall approach within the given environment.
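
The effect can be seen with a little arithmetic. The sketch below is an illustration under assumptions: the helper `path_return`, the penalty of -0.04 and the path lengths are made up, and discounting is switched off (`gamma=1.0`) to isolate the effect of the penalty.

```python
def path_return(n_steps, living_penalty, final_reward=1.0, gamma=1.0):
    """Total return of a path that reaches the goal on its last step.

    Every step before the last one earns `living_penalty`;
    the last step earns `final_reward`.
    """
    total = 0.0
    for t in range(n_steps):
        is_last = t == n_steps - 1
        reward = final_reward if is_last else living_penalty
        total += (gamma ** t) * reward
    return total


# Without a penalty (and without discounting) path length does not matter.
short_no_penalty = path_return(4, living_penalty=0.0)
long_no_penalty = path_return(8, living_penalty=0.0)

# With a penalty the shorter path collects a higher return.
short_penalty = path_return(4, living_penalty=-0.04)
long_penalty = path_return(8, living_penalty=-0.04)

print(short_no_penalty, long_no_penalty)
print(short_penalty, long_penalty)
```

A larger penalty makes the agent more willing to take risky shortcuts, while a penalty near zero lets it take long, safe detours.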

### Q-Learning Intuition

Rather than choosing movements based on the values of states, Q-learning analyses the quality of each action available in a given state, written Q(s,a).

![ql.png](attachment:ql.png)
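
In case the image does not render, the Q-value for a non-deterministic environment is commonly defined as below. The notation is an assumption and may differ from the image: $P(s,a,s')$ is the probability of reaching state $s'$ after taking action $a$ in state $s$, and $a'$ ranges over the actions available in $s'$.

$$Q(s,a) = R(s,a) + \gamma \sum_{s'} P(s,a,s') \max_{a'} Q(s',a')$$

The value of a state is then the Q-value of its best action, $V(s) = \max_a Q(s,a)$.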

### Temporal Difference

The temporal difference lets an agent update its estimate of an action's value from one observed step at a time. In a non-deterministic environment, evaluating the full recursive equation directly would otherwise be extremely challenging to compute accurately.

The temporal difference (TD) is the difference between the new estimate of the Q-value, obtained after taking the action and observing the reward and the next state, and the previous estimate of Q(s,a):

$$TD(a,s) = R(s,a) + \gamma \max_{a'} Q(s',a') - Q(s,a)$$

To integrate this into what we already know:

$$Q_t(s,a) = Q_{t-1}(s,a) + \alpha \, TD_t(a,s)$$

where $\alpha$ is the learning rate. The subscripts $t-1$ and $t$ denote the estimate before and after the update, since the Q-value is refined step by step over time.

The full equation after substitution of TD in Q(s,a) is as follows:

![q1.jpg](attachment:q1.jpg)
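
A minimal sketch of a single update in tabular form, assuming the Q-values are kept in a dictionary keyed by `(state, action)`. The function name `td_update`, the values of `ALPHA` and `GAMMA`, and the example transition are illustrative assumptions.

```python
from collections import defaultdict

ALPHA = 0.1  # learning rate (illustrative value)
GAMMA = 0.9  # discount factor (illustrative value)


def td_update(q_table, state, action, reward, next_state, actions):
    """Apply one Q-learning update in place and return the temporal difference."""
    # max over a' of Q(s', a')
    best_next = max(q_table[(next_state, a)] for a in actions)
    # TD = R(s,a) + gamma * max Q(s',a') - Q(s,a)
    td = reward + GAMMA * best_next - q_table[(state, action)]
    # Q_t(s,a) = Q_{t-1}(s,a) + alpha * TD
    q_table[(state, action)] += ALPHA * td
    return td


# All Q-values start at 0; unseen (state, action) pairs default to 0.0.
q_table = defaultdict(float)
actions = ["left", "right"]

# One observed transition: in state 3, moving right gives reward 1 and leads to state 4.
td = td_update(q_table, state=3, action="right", reward=1.0, next_state=4, actions=actions)
print(td, q_table[(3, "right")])
```

With a small learning rate the Q-value moves only a fraction of the way towards the new estimate, which smooths out the randomness of individual transitions.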

## Deep Q-Learning

Deep Q-Learning is the result of combining Q-Learning with an Artificial Neural Network. The states of the environment are encoded as a vector, which is passed as input into the Neural Network. The Neural Network then tries to predict which action should be played by returning as outputs a Q-value for each of the possible actions. Finally, the action to play is chosen either by taking the one with the highest Q-value or by applying a Softmax function over the Q-values.

---------------------------------------------------------------------------------------------------------------------------

## Implementation #1

### Plan of Attack:
- Deep Q-Learning Intuition (Learning)
- Deep Q-Learning Intuition (Acting)
- Experience Replay
- Action Selection Policies

### Deep Q-Learning Intuition

- the agent's current state in the environment is passed into a neural network as input
- the input passes through the hidden layers, which output a Q-value for every possible action
- the agent compares these output Q-values with the 'target Q-values' calculated and stored in previous iterations
- the loss is calculated as the sum of the squared differences between the output and target Q-values
- the loss is backpropagated through the network to update the weights for future iterations
- the agent then chooses an action in the current state using a softmax function (or another action selection policy)
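
The loss step above can be sketched in plain Python. The Q-values are made-up numbers standing in for a network's outputs and stored targets, and `squared_error_loss` is a hypothetical helper name.

```python
def squared_error_loss(predicted_q, target_q):
    """Sum of squared differences between predicted and target Q-values."""
    return sum((q - t) ** 2 for q, t in zip(predicted_q, target_q))


# One Q-value per action, e.g. up / down / left / right (illustrative numbers).
predicted_q = [0.5, 1.2, -0.3, 0.0]
target_q = [0.4, 1.0, -0.3, 0.5]

loss = squared_error_loss(predicted_q, target_q)
print(loss)
```

In a real network this scalar is what gets backpropagated, and a deep learning library would compute both the loss and its gradients.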

### Experience Replay

An agent stores a series of 'experiences', i.e. all the data relevant to a transition (state, action, reward, next state, etc.), in memory instead of learning from each one immediately. Consecutive experiences are often very similar, so learning from them in order would skew the network towards them. Instead, the agent pulls a uniformly distributed random sample of experiences from this memory and learns from that. Keeping the memory as a rolling window also lets the agent reuse each experience several times, so it learns about the environment more quickly and can even learn from rarer experiences, which would otherwise take a long time to come up again.
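
A minimal sketch of such a memory, assuming experiences are stored as `(state, action, reward, next_state)` tuples. The class name `ReplayMemory` and its methods are illustrative, not taken from the course code.

```python
import random
from collections import deque


class ReplayMemory:
    """Rolling window of experiences with uniform random sampling."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest experience once it is full.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        """Store one experience."""
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Return `batch_size` distinct experiences chosen uniformly at random."""
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


memory = ReplayMemory(capacity=3)
for step in range(5):
    memory.push(state=step, action="right", reward=0.0, next_state=step + 1)

# Only the 3 most recent experiences (states 2, 3, 4) are still in memory.
batch = memory.sample(2)
print(len(memory), batch)
```

Sampling at random breaks up the correlation between consecutive experiences, and the fixed capacity keeps memory use bounded.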

Additional Reading - 'Prioritized Experience Replay' (2016) by Tom Schaul - Google DeepMind

### Action Selection Policies

Some commonly used ASPs:
  - Softmax
  - epsilon-greedy
  - epsilon-soft (1 - epsilon)

ASPs are important to make sure the agent keeps exploring the environment while it learns, rather than getting stuck in a local maximum.
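
Two of these policies can be sketched as follows. The function names, the `temperature` parameter and the example Q-values are assumptions made for illustration; the course code may implement them differently.

```python
import math
import random


def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a random action, otherwise the best one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=lambda i: q_values[i])  # exploit


def softmax_probabilities(q_values, temperature=1.0):
    """Turn Q-values into action probabilities; higher Q means higher probability."""
    scaled = [q / temperature for q in q_values]
    highest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - highest) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_action(q_values, temperature=1.0):
    """Sample an action index according to the softmax probabilities."""
    probs = softmax_probabilities(q_values, temperature)
    return random.choices(range(len(q_values)), weights=probs, k=1)[0]


q_values = [0.1, 0.9, 0.3]
greedy_choice = epsilon_greedy(q_values, epsilon=0.0)  # never explores
probs = softmax_probabilities(q_values)
sampled = softmax_action(q_values)
print(greedy_choice, probs, sampled)
```

Epsilon-greedy explores uniformly at random, while softmax explores in proportion to how promising each action looks, so clearly bad actions are tried less often.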

## Course Material

https://www.superdatascience.com/pages/artificial-intelligence