## Reinforcement Learning Analogy
Consider teaching a dog new tricks. The dog doesn't understand our language, so we can't tell it what to do. Instead, we follow a different strategy: we set up a situation (or a cue), and the dog tries responding in many different ways. If the dog's response is the desired one, we reward it with a snack.

The next time the dog is exposed to the same situation, it performs a similar action with even more enthusiasm, expecting more food. That's like learning "what to do" from positive experiences. Similarly, dogs tend to learn what not to do when faced with negative experiences.

That's how Reinforcement Learning works in a broader sense:
- The dog is an "agent" that is exposed to the environment.
- The situation it encounters is the `state`.
- The agent reacts by performing an `action` to transition from one `state` to another.
- After the transition, it may receive a `reward` or `penalty` in return.
- The `policy` is the strategy of choosing an `action` given a `state` in expectation of better outcomes.

There are some important things to note:
1. <b>Being greedy doesn't always work</b>:

    Some things are easy to do for instant gratification, and others provide long-term rewards. The goal is not to be greedy by chasing quick immediate rewards, but to maximize the total reward over the whole training.
    
2. <b>Sequence matters in Reinforcement Learning</b>:

    The `reward` an agent receives does not depend only on the current `state`, but on the entire history of states.
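
The first point can be made concrete with a small sketch. The two reward sequences below are made up for illustration (they do not come from any environment): a greedy agent grabs a small reward immediately, while a patient agent accepts small penalties first and collects a larger reward later.

```python
# Hypothetical per-step rewards for two strategies (illustrative numbers only).
greedy_rewards = [1, 0, 0, 0, 0]         # small reward right away, nothing after
patient_rewards = [-1, -1, -1, -1, 20]   # small penalties first, big reward at the end


def total_reward(rewards):
    """Cumulative reward collected over a whole episode."""
    return sum(rewards)


print("Greedy  first step:", greedy_rewards[0], "| total:", total_reward(greedy_rewards))
print("Patient first step:", patient_rewards[0], "| total:", total_reward(patient_rewards))
```

The greedy sequence looks better after the first step, but the patient one ends with the higher total, which is the quantity the agent is supposed to maximize.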

## The Reinforcement Learning Process
<img src="https://www.learndatasci.com/documents/14/Reinforcement-Learning-Animation.gif"/>

Broken down into steps, Reinforcement Learning looks like this:
- Observation of the environment
- Deciding how to act using some strategy
- Acting accordingly
- Receiving a reward or penalty
- Learning from the experiences and refining our strategy
- Iterating until an optimal strategy is found
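
The steps above can be sketched as a plain loop. This is only a minimal illustration: `CorridorEnv`, `run_episode`, and the two policies are hypothetical names invented here, the environment is a toy one-dimensional corridor, and no learning happens yet (the experiences are only collected).

```python
import random


class CorridorEnv:
    """Toy environment (hypothetical): walk from cell 0 to cell 4 on a line."""

    GOAL = 4

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # action is -1 (left) or +1 (right); the position is clamped to [0, GOAL]
        self.position = max(0, min(self.GOAL, self.position + action))
        done = self.position == self.GOAL
        reward = 10 if done else -1
        return self.position, reward, done


def random_policy(state):
    """Strategy that ignores the state and picks a random direction."""
    return random.choice([-1, 1])


def always_right(state):
    """Strategy that always moves towards the goal."""
    return 1


def run_episode(env, policy, max_steps=100):
    state = env.reset()                              # observe the environment
    history = []
    for _ in range(max_steps):
        action = policy(state)                       # decide how to act
        next_state, reward, done = env.step(action)  # act, receive reward or penalty
        history.append((state, action, reward))      # experience to learn from later
        state = next_state
        if done:
            break
    return history


history = run_episode(CorridorEnv(), always_right)
print("Steps:", len(history), "| total reward:", sum(r for _, _, r in history))
```

Swapping `always_right` for `random_policy` gives the kind of undirected exploration that a learning algorithm would then refine over many episodes.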

## Example Design: Self-Driving Cab
In this example, we design a self-driving cab. The main goal is to demonstrate how to use Reinforcement Learning to develop an efficient approach to the problem.

The self-driving cab's job is to pick up the passenger at one location and drop them off at another. A few things to take care of:
- Drop off the passenger at the right location.
- Save the passenger's time by taking the minimum time possible to drop them off.
- Take care of the passenger's safety and the traffic rules.

### 1. Rewards
Since the agent (the imaginary driver) is reward-motivated and will learn how to control the cab through trial experiences in the environment, we need to decide on the `rewards` and/or `penalties` and their magnitudes accordingly. Here are a few points to consider:

- The agent should receive a high positive reward for a successful dropoff, because this behavior is highly desired.
- The agent should be penalized if it tries to drop off a passenger at a wrong location.
- The agent should get a slight negative reward for every time-step in which it has not reached the destination. The reward is only "slight" because we would prefer the agent to arrive late rather than make wrong moves while trying to reach the destination as fast as possible.
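
As a rough sketch, the three points above could be written as a reward function. The function name and its parameters are hypothetical. The magnitudes (+20, -10, -1) are the ones that appear in the Taxi environment's reward table later in this document. Treating `pickup` like any other time-step is a simplification of this sketch.

```python
def taxi_reward(action, passenger_in_taxi, taxi_at_destination):
    """Hypothetical reward scheme following the three design points."""
    if action == "dropoff":
        if passenger_in_taxi and taxi_at_destination:
            return 20   # high positive reward: successful dropoff
        return -10      # penalty: dropoff attempted at a wrong location
    return -1           # slight negative reward for every other time-step


print(taxi_reward("dropoff", passenger_in_taxi=True, taxi_at_destination=True))
print(taxi_reward("dropoff", passenger_in_taxi=True, taxi_at_destination=False))
print(taxi_reward("north", passenger_in_taxi=True, taxi_at_destination=False))
```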

### 2. State Space
In Reinforcement Learning, the `agent` encounters a `state`, and then takes an `action` according to the state it's in.

The `State Space` is the set of all possible situations our taxi could be in. The state should contain the information the agent needs to choose the right action.

Let's say we have a training area for our Smartcab where we are teaching it to transport people in a parking lot to four different locations (R, G, Y, B):

<img src="https://storage.googleapis.com/lds-media/images/Reinforcement_Learning_Taxi_Env.width-1200.png"/>

Let's assume the Smartcab is the only vehicle in this parking lot. We can break up the parking lot into a 5x5 grid, which gives us 25 possible taxi locations. These 25 locations are one part of our state space. Notice that the taxi's current location in the illustration is coordinate (3, 1).

You'll also notice there are four (4) locations where we can pick up and drop off a passenger: R, G, Y, B or `[(0,0), (0,4), (4,0), (4,3)]` in (row, col) coordinates. Our illustrated passenger is in location `Y` and they wish to go to location `R`.

When we also account for one (1) additional passenger state of being inside the taxi, we can take all combinations of taxi locations, passenger locations, and destination locations to arrive at the total number of states for our taxi environment: there are four (4) destinations and five (4 + 1) passenger locations, so the environment has 5 × 5 × 5 × 4 = 500 possible states.
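
The count can be checked by enumerating every combination. This is a small sketch. The constant names and the `"in taxi"` label are chosen here for illustration only.

```python
from itertools import product

GRID_ROWS, GRID_COLS = 5, 5
PASSENGER_LOCATIONS = ["R", "G", "Y", "B", "in taxi"]  # 4 stops + inside the taxi
DESTINATIONS = ["R", "G", "Y", "B"]

# Every (taxi row, taxi col, passenger location, destination) combination
all_states = list(
    product(range(GRID_ROWS), range(GRID_COLS), PASSENGER_LOCATIONS, DESTINATIONS)
)

print(len(all_states))  # 5 * 5 * 5 * 4 = 500
```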

### 3. Action Space
The `agent` encounters one of the 500 `states` and takes an `action`. In our case, the action can be to move in a direction or to pick up or drop off a passenger.

In other words, we have six possible actions:
- south
- north
- east
- west
- pickup
- dropoff

This is the action space: the set of all the actions that our agent can take in a given state.

You'll notice in the illustration above that the taxi cannot perform certain actions in certain states due to `walls`. In the environment's code, every wall hit simply incurs a `-1` penalty and the taxi doesn't move. This racks up penalties, which pushes the taxi to learn to go around the wall.
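
The wall behaviour can be sketched with a text version of the grid. This is an illustration, not the environment's actual source: the `move` helper is hypothetical, and it assumes that `:` marks an open passage and `|` marks a wall between two neighbouring columns.

```python
# Text layout of the 5x5 parking lot (assumed: ':' = passage, '|' = wall)
TAXI_MAP = [
    "+---------+",
    "|R: | : :G|",
    "| : | : : |",
    "| : : : : |",
    "| | : | : |",
    "|Y| : |B: |",
    "+---------+",
]


def move(row, col, action):
    """Return (new_row, new_col, reward) for a movement action.

    A move into a wall or off the grid leaves the taxi where it is,
    but the -1 step penalty is still charged.
    """
    if action == "south":
        row = min(row + 1, 4)
    elif action == "north":
        row = max(row - 1, 0)
    elif action == "east" and TAXI_MAP[1 + row][2 * col + 2] == ":":
        col += 1
    elif action == "west" and TAXI_MAP[1 + row][2 * col] == ":":
        col -= 1
    return row, col, -1


print(move(3, 1, "west"))  # wall to the west of (3, 1): the taxi stays put
print(move(3, 1, "east"))  # open passage: the taxi moves to column 2
```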

## Implementation
We will use the `OpenAI Gym` library for this. First, install `gym` using the command:
```bash
pip install gym['atari']
```

There are some requirements, such as `cmake` and `scipy`, that need to be installed along with it. Once installed, load the `Taxi-v3` game environment (the older `Taxi-v2` is no longer found) and render it:

In [2]:
import gym

# Load Taxi Environment
env = gym.make('Taxi-v3').env # Taxi-v2 -> Not Found
env.render()

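+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+

In the colored terminal output, the taxi is highlighted at row 0, column 3, the passenger is at `B`, and the destination is `Y`.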



In [3]:
# Reset the Environment to Random State, and print some information
env.reset()
env.render()

print('Action Space: {}'.format(env.action_space))
print('State Space: {}'.format(env.observation_space))

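+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+

In the colored terminal output, the taxi is highlighted at row 3, column 3, the passenger is at `B`, and the destination is `R`.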

Action Space: Discrete(6)
State Space: Discrete(500)


So, the environment has an `Action Space` of size 6 and a `State Space` of size 500. The six actions are:

* 0 = south
* 1 = north
* 2 = east
* 3 = west
* 4 = pickup
* 5 = dropoff

Recall that the 500 states correspond to an encoding of the taxi's location, the passenger's location, and the destination location.

Reinforcement Learning will learn a mapping from `states` to the optimal `action` to perform in each state through exploration, i.e. the agent explores the environment and takes actions based on the rewards defined in the environment.

The optimal action for each state is the action that has the `highest cumulative long-term reward`.

Now we can encode a `state` and give it to the `env` to render. Recall that in the illustration the taxi is at `row 3`, `column 1`, the passenger is at location `2` (Y), and the destination is location `0` (R).

In [4]:
state = env.encode(3, 1, 2, 0) # (taxi row, taxi col, passenger index, destination index)
print('State: ', state)

env.s = state
env.render()

State:  328
+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+



We are using our illustration's coordinates to generate a number corresponding to a state between 0 and 499, which turns out to be `328` for our illustration's state.
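
One way to arrive at such a number is a mixed-radix encoding of the four components. The sketch below is an assumption about how the encoding could work, not a copy of the library's code. It reproduces `328` for the illustration's state, and the `decode` helper (a name chosen here) inverts it.

```python
def encode(taxi_row, taxi_col, passenger_index, destination_index):
    """Pack (row, col, passenger, destination) into one integer in 0..499."""
    state = taxi_row                          # 5 possible rows
    state = state * 5 + taxi_col              # 5 possible columns
    state = state * 5 + passenger_index       # 4 stops + "in taxi"
    state = state * 4 + destination_index     # 4 possible destinations
    return state


def decode(state):
    """Unpack a state number back into its four components."""
    state, destination_index = divmod(state, 4)
    state, passenger_index = divmod(state, 5)
    taxi_row, taxi_col = divmod(state, 5)
    return taxi_row, taxi_col, passenger_index, destination_index


print(encode(3, 1, 2, 0))
print(decode(328))
```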

## The Reward Table
When the Taxi environment is created, an initial reward table called `P` is also created. We can think of it as a matrix with the states as rows and the actions as columns, i.e. a `states x actions` matrix.

Since every state is in this matrix, we can see the default reward values assigned to our illustration's state:

In [5]:
# Reward Table
env.P[328]

{0: [(1.0, 428, -1, False)],
 1: [(1.0, 228, -1, False)],
 2: [(1.0, 348, -1, False)],
 3: [(1.0, 328, -1, False)],
 4: [(1.0, 328, -10, False)],
 5: [(1.0, 328, -10, False)]}

The dict has the structure `{action: [(probability, nextstate, reward, done)]}`.

Some key points:
* The keys 0-5 correspond to the actions (south, north, east, west, pickup, dropoff) the taxi can perform in our current state in the illustration.
* In this env, `probability` is always 1.0.
* The `nextstate` is the state we would be in if we took the action at this index of the dict.
* All the movement actions have a `-1` reward, and the pickup/dropoff actions have a `-10` reward in this particular state. If we were in a state where the taxi has a passenger and is on top of the right destination, we would see a reward of 20 at the dropoff action.
* `done` tells us when we have successfully dropped off a passenger at the right location. Each successful dropoff is the end of an `episode`.

Note that if our agent chose to explore action three `(3)` in this `state`, it would be going West into a wall: the table shows that this action leaves the taxi in state `328`. The source code makes it impossible to move the taxi across a wall, so if the taxi keeps choosing that action, it will just keep accruing `-1` penalties, which hurts the `long-term reward`.
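
The same table can be inspected programmatically. The sketch below copies the entries for state `328` from the output above into a plain dict, so it runs without `gym`. The `blocked_actions` helper is a hypothetical name, and it simply lists the actions whose `nextstate` equals the current state.

```python
ACTION_NAMES = ["south", "north", "east", "west", "pickup", "dropoff"]

# Entries for state 328, copied from the env.P[328] output shown earlier
P_328 = {
    0: [(1.0, 428, -1, False)],
    1: [(1.0, 228, -1, False)],
    2: [(1.0, 348, -1, False)],
    3: [(1.0, 328, -1, False)],
    4: [(1.0, 328, -10, False)],
    5: [(1.0, 328, -10, False)],
}


def blocked_actions(transitions, state):
    """Names of the actions that leave the agent in the same state."""
    return [
        ACTION_NAMES[action]
        for action, outcomes in transitions.items()
        if all(next_state == state for _, next_state, _, _ in outcomes)
    ]


for action, outcomes in P_328.items():
    for probability, next_state, reward, done in outcomes:
        print(f"{ACTION_NAMES[action]:>8}: next state {next_state}, reward {reward}")

print("No state change:", blocked_actions(P_328, 328))
```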

## Solving the Problem without Reinforcement Learning
For comparison, let's use a brute-force approach to solve the problem. Since we have the `P` table with default rewards for each `state`, we can try to have our taxi navigate using just that.

We'll create a loop that runs until the passenger reaches the destination (one episode), in other words until the received reward is `20`. The `env.action_space.sample()` method selects one random action from the set of all possible actions.

In [6]:
env.s = 328 # Set Env to Illustration's state

epochs = 0
penalties, reward = 0, 0

frames = []
done = False

while not done:
    action = env.action_space.sample()
    state, reward, done, info = env.step(action)

    if reward == -10:
        penalties += 1
    
    # Put each rendered frame into dict for Animation
    frames.append({
        'frame': env.render(mode='ansi'),
        'state': state,
        'action': action,
        'reward': reward
    })
    epochs += 1

print(f'Timesteps taken: {epochs}')
print(f'Penalties incurred: {penalties}')

Timesteps taken: 999
Penalties incurred: 337
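
The collected `frames` list can also be summarized after the run. The sketch below is self-contained: `summarize` is a hypothetical helper, and `sample_frames` is a made-up three-step list with the same keys as the frames recorded above, not real output.

```python
from collections import Counter


def summarize(frames):
    """Aggregate a list of frame dicts into a few episode statistics."""
    return {
        "timesteps": len(frames),
        "penalties": sum(1 for f in frames if f["reward"] == -10),
        "total_reward": sum(f["reward"] for f in frames),
        "action_counts": dict(Counter(f["action"] for f in frames)),
    }


# Made-up frames for illustration (same keys as the recorded ones)
sample_frames = [
    {"frame": "", "state": 328, "action": 3, "reward": -1},
    {"frame": "", "state": 328, "action": 5, "reward": -10},
    {"frame": "", "state": 228, "action": 1, "reward": -1},
]

print(summarize(sample_frames))
```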


In [9]:
from IPython.display import clear_output
from time import sleep


def print_frames(frames):
    for i, frame in enumerate(frames):
        clear_output(wait=True)
        rendered = frame['frame']
        # Depending on the gym version, render(mode='ansi') returns either a
        # plain str or a StringIO, so handle both
        print(rendered if isinstance(rendered, str) else rendered.getvalue())
        print(f"Timestep: {i + 1}")
        print(f"State: {frame['state']}")
        print(f"Action: {frame['action']}")
        print(f"Reward: {frame['reward']}")
        sleep(.1)


print_frames(frames)