# Workshop: OpenAI Gym and Reinforcement Learning

## Installation

In [None]:
%pip install gym
%pip install numpy
%pip install matplotlib

## Environment Setup

In [None]:
import gym
import random
import numpy as np
import matplotlib.pyplot as plt

### Context

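We will use the OpenAI Gym library (gym), now maintained as [Gymnasium](https://gymnasium.farama.org/), to create our AI algorithm.
This library offers a variety of environments to choose from.
The code in this workshop follows the classic Gym API, in which *env.reset()* returns the observation and *env.step()* returns four values; Gym 0.26 and later, as well as Gymnasium, return *(observation, info)* from *reset()* and five values from *step()*.

We will be working with the [MountainCar](https://gymnasium.farama.org/environments/classic_control/mountain_car/) environment.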
To get started, you will need to initialize the environment.

In [None]:
# Create the env here
env = gym.make('MountainCar-v0')
Qpts = None

## Training

Now that the environment has been created, we need to write the functions to train our AI.

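We use this function to calculate the reward based on the car's position: it returns 2 once the car reaches the goal (position >= 0.5), and otherwise a negative value that rises linearly from -1 at position -1.2 to about -0.06 just below the goal.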

In [None]:
def rewards_calcul(pos):
    if (pos >= 0.5):
        return 2
    else:
        return (pos + 1.2) / 1.8 - 1

### The stateSpace function returns the size of the state space as a NumPy array

First, subtract the lower bound of the observation space from the upper bound. Then, scale the result using a numpy array of [10, 50].

Finally, round the result to the nearest integer, cast it as an int, and add 1.
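As a rough illustration of the three steps above, the sketch below applies them to the MountainCar bounds written out by hand (position in [-1.2, 0.6], velocity in [-0.07, 0.07]) instead of reading them from *env.observation_space*. The name *state_space_size* and the hard-coded bounds are assumptions made for this example; it is one possible approach, not necessarily the workshop's reference solution.

```python
import numpy as np

# Assumed MountainCar-v0 bounds; in the workshop they would come from
# env.observation_space.low and env.observation_space.high.
LOW = np.array([-1.2, -0.07])
HIGH = np.array([0.6, 0.07])
SCALE = np.array([10, 50])  # scaling factors given in the instructions


def state_space_size(low, high):
    """Return the number of discrete bins per state dimension."""
    span = (high - low) * SCALE               # subtract the bounds, then scale
    return np.round(span, 0).astype(int) + 1  # round, cast to int, add 1


size_states = state_space_size(LOW, HIGH)
print(size_states)
```

With these bounds the result is 19 position bins and 8 velocity bins, which matches the first two dimensions of the Q-table created later in the training function.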

In [None]:
def stateSpace(env):
    # Calculate the state space here
    return size_states

### The lower_state function returns the normalized and rounded state of the environment

First, call the **'reset'** method on the environment to initialize its state.

Next, normalize the state of the environment by subtracting the lower bound of the observation space (observation_space.low) and scaling it using a numpy array of [10, 50].

Finally, round the result to the nearest integer and cast it as an int.
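The normalization described above can be sketched on a plain observation, without an environment. The helper name *discretize* and the hard-coded lower bounds are assumptions for illustration; in the workshop the observation would come from *env.reset()* and the bounds from *env.observation_space.low*.

```python
import numpy as np

LOW = np.array([-1.2, -0.07])  # assumed MountainCar-v0 lower bounds
SCALE = np.array([10, 50])     # scaling factors given in the instructions


def discretize(state, low=LOW):
    """Map a continuous (position, velocity) pair to integer Q-table indices."""
    shifted = (np.asarray(state, dtype=float) - low) * SCALE
    return np.round(shifted, 0).astype(int)


# Hypothetical observation: position -0.5, velocity 0.01
state_adj = discretize([-0.5, 0.01])
print(state_adj)
```

The two integers can then be used directly as the first two indices of the Q-table.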

In [None]:
def lower_state(env):
    # Get the state and calculate the lower state here
    return low_state

### The update function returns a delta to update the values in a Q-table using a Q-Learning algorithm.

To calculate the *delta* value, follow these steps:

1. Compute the target Q-value by using the *reward* of the *next state* and the *maximum Q-value* over all actions in the *adjusted next state*.

2. Calculate the *delta* as the *difference* between the *target Q-value* and the *current Q-value* for the *current state-action pair*.

3. The function should *return* the calculated *delta* value.
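The steps above can be sketched as follows. This is a minimal example under stated assumptions: the function name *update_sketch* and its explicit *discount* parameter are not part of the workshop's *update* signature, and scaling the difference by the learning rate is an assumption based on the standard Q-Learning rule.

```python
import numpy as np


def rewards_calcul(pos):
    # Same reward shaping as the workshop's rewards_calcul function
    if pos >= 0.5:
        return 2
    return (pos + 1.2) / 1.8 - 1


def update_sketch(learning_rate, discount, next_state, next_state_adj,
                  q_table, state_adj, action):
    """Return the Q-Learning delta for one (state, action) pair."""
    # Target: reward of the next state plus the discounted best next Q-value
    best_next = np.max(q_table[next_state_adj[0], next_state_adj[1]])
    target = rewards_calcul(next_state[0]) + discount * best_next
    # Delta: learning-rate-scaled gap between the target and the current estimate
    current = q_table[state_adj[0], state_adj[1], action]
    return learning_rate * (target - current)


# Hypothetical Q-table and transition, for illustration only
q_table = np.zeros((19, 8, 3))
q_table[8, 4] = [0.0, 1.0, 0.5]
delta = update_sketch(0.2, 0.9, [-0.4, 0.01], [8, 4], q_table, [7, 4], 2)
```

Adding *delta* to the Q-table entry for the current state-action pair moves that estimate a fraction of the way toward the target.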

In [None]:
def update(learningRate, nextState, nextState_adj, Q_Table, currentState, action):
    # calculate the target Q-value here
    # calculate the delta here
    return delta

### Training function

Before we begin, we need to initialize some values. After this, we can start looping over the number of episodes.

To calculate the *next action*, two cases are considered:

1. With probability *1 - epsilon* (that is, if *np.random.random() < 1 - epsilon*), compute the action by using numpy's *'argmax'* function on the *row* of the *Q-table for the current state*.

2. Otherwise, choose the action *randomly* between *0* and the number of actions in the environment (*env.action_space.n*).

Next, calculate the new *adjusted state*, similar to the *'lower_state'* function, but using the *next state variable*.

If the episode is *done* and the car has reached the goal (position >= 0.5), set the Q-value of the current state-action pair to the *reward*. Otherwise, compute *delta* with the *update* function and add it to the Q-value of the current state-action pair.

Finally, *update epsilon* if it is *greater* than the *minimum epsilon* by *multiplying* it by *eps1*.
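The action selection and the epsilon update described above might look like the sketch below. The helper names *choose_action* and *decay_epsilon*, and the use of a NumPy random generator passed in as *rng*, are assumptions for this example rather than part of the workshop's code.

```python
import numpy as np


def choose_action(q_table, state_adj, epsilon, n_actions, rng):
    """Epsilon-greedy choice: exploit with probability 1 - epsilon, else explore."""
    if rng.random() < 1 - epsilon:
        # Greedy action: index of the largest Q-value for the current state
        return int(np.argmax(q_table[state_adj[0], state_adj[1]]))
    # Random action in [0, n_actions)
    return int(rng.integers(0, n_actions))


def decay_epsilon(epsilon, min_epsilon, eps1):
    """Shrink epsilon multiplicatively while it is above its minimum."""
    if epsilon > min_epsilon:
        epsilon *= eps1
    return epsilon


rng = np.random.default_rng(0)
q_table = np.zeros((19, 8, 3))
q_table[7, 4, 2] = 1.0  # hypothetical value making action 2 the best in state (7, 4)

greedy_action = choose_action(q_table, [7, 4], 0.0, 3, rng)  # epsilon = 0: always greedy
random_action = choose_action(q_table, [7, 4], 1.0, 3, rng)  # epsilon = 1: always random
new_epsilon = decay_epsilon(0.999, 0, 0.999)
```

A high epsilon early in training favors exploration; as it decays over the episodes, the agent relies more and more on the Q-table.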

In [None]:
def training(env, learningrate, discount, epsilon, min_epsilon, episodes):

    reward_list = []
    average_rewards = []    
    eps1 = epsilon
    first = episodes + 1
    size_states = stateSpace(env)
    
    # initialisation of Q_table with random value
    Q_Table = np.random.uniform(low = -1, high = 0, size = (size_states[0], size_states[1], env.action_space.n))    
    Qinit = np.copy(Q_Table)
    
    for i in range(episodes):
        done = False
        tot_reward = 0
        reward = 0
        state = env.reset()
        
        state_adj = lower_state(env)
        
        while done != True:
            # Calculate next action
            # if np.random.random() < 1 - epsilon:
            # numpy argmax on the row of Q_table on the current state
            # else:
            # random between 0 and the env.action_space.n
                
            # we get the info when we execute the action into the env
            nextState, reward, done, info = env.step(action)
            
            # calculate nextState_adj here
            
            # if done and nextState[0] >= 0.5:
                # set reward as value for the current final state
            # else:
                # Update delta with the update function and add it to the current state in Q_table
            
            # Say when first success occurs
            if nextState[0] >= 0.5 and i < first:
                first = i
                print('First time reaching goal on episode {}'.format(first + 1))
            
            tot_reward += rewards_calcul(nextState[0])
            state_adj = nextState_adj
        
        # if epsilon is greater than min_epsilon, decrease it by multiplying it by eps1
            
        reward_list.append(tot_reward)
        
        if (i + 1) % 100 == 0:
            # Every 100 episodes, compute the average of reward_list and store it in average_rewards
            # print the average reward for these 100 episodes
            reward_list = []
        
    env.close()
    # returns the average rewards, the final Q-Table and the initial Q-Table.
    return average_rewards, Q_Table, Qinit
        

### Now we can finally use our training function

To ensure everything works, *reset the environment*.

Next, call the *training* function with the following parameters: **env, 0.2, 0.9, 0.999, 0, and 10000** (environment, learning rate, discount, initial epsilon, minimum epsilon, and number of episodes).

Once the training is finished, *save the Q-points* (Q-table) into a *file*. You'll need to *reshape* the numpy array from *3-dimensional to 2-dimensional*.
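The reshape is needed because *np.savetxt* only writes 1-D or 2-D arrays. A minimal sketch of the saving step follows; the helper name *save_q_table*, the file name, the comma delimiter, and the random stand-in table are all assumptions for illustration.

```python
import os
import tempfile

import numpy as np


def save_q_table(q_table, path):
    """Flatten a (positions, velocities, actions) table to 2-D and write it as CSV."""
    # Merge the two state axes into one so that each row holds the action values
    flat = q_table.reshape(-1, q_table.shape[-1])
    np.savetxt(path, flat, delimiter=',')


# Random stand-in for a trained Q-table with the MountainCar dimensions
q_table = np.random.uniform(low=-1, high=0, size=(19, 8, 3))
path = os.path.join(tempfile.gettempdir(), 'qtable_example.csv')  # hypothetical file name
save_q_table(q_table, path)
```

Keeping the action axis as the columns means the original 3-dimensional shape can be restored later with a single reshape.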

In [None]:
# reset env
rewards, Qpts, Qinit = training(env, 0.2, 0.9, 0.999, 0, 10000)
# save Qpts into file here

### Plot of the Rewards

Let's create a *plot* to visualize the *Average Reward per episode*.

Use *Episodes* as the *x-axis label* and *Average Reward* as the *y-axis label*.
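As a small sketch of the labelling step, the example below uses matplotlib's axes interface; *plt.xlabel*, *plt.ylabel*, and *plt.title* are the equivalent calls when plotting with *plt.plot* directly. The reward values are placeholders for illustration only, not real training results, and the plot title is an assumed wording.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Placeholder averages, one per block of 100 episodes (not real results)
rewards = [-150.0, -140.0, -120.0]

fig, ax = plt.subplots()
ax.plot(100 * (np.arange(len(rewards)) + 1), rewards)
ax.set_xlabel('Episodes')
ax.set_ylabel('Average Reward')
ax.set_title('Average Reward vs Episodes')  # assumed title
```

In a notebook, the backend line can be dropped so that the figure is displayed inline.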

In [None]:
plt.plot(100 * (np.arange(len(rewards)) + 1), rewards)
# set xlabel, ylabel, and title of the plot

## Testing

Now that our AI has finished training, we can finally test it and see if it can complete the challenge.

### Load data from a CSV

Let's write a function to *load* the data from the previously saved file so we can test our AI even after closing VSCode.

Use the *numpy loadtxt* function to load the *CSV* and *reshape* the contents of the file with the *Q_table_shape*.
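The loading step can be sketched as follows, assuming the file was written as comma-separated 2-D data with the action values as columns. The helper name *load_q_table* and the tiny in-memory table are hypothetical; the workshop's *load_data* would read the saved file and use *Q_Table_shape* instead.

```python
import io

import numpy as np


def load_q_table(source, shape):
    """Read 2-D CSV data and restore the original 3-D Q-table shape."""
    flat = np.loadtxt(source, delimiter=',')
    return flat.reshape(shape)


# Tiny made-up table written to an in-memory buffer instead of a real file
original = np.arange(12, dtype=float).reshape(2, 2, 3)
buffer = io.StringIO()
np.savetxt(buffer, original.reshape(-1, 3), delimiter=',')
buffer.seek(0)

restored = load_q_table(buffer, (2, 2, 3))
```

The reshape only works if the target shape has the same number of elements as the file, so the shape should be computed the same way as during training.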

In [None]:
def load_data():
    # Get file content
    size_states = stateSpace(env)
    Q_Table_shape = (size_states[0], size_states[1], env.action_space.n)
    # Reshape of file content
    return Qpts

### Testing function

First, if the *Q_table* is *equal* to *None*, call the *load_data* function.

In the loop, *render* the environment.

Calculate the action by taking the *argmax* of the *Q-values* for the *current adjusted state* in the *Q-table*.

Get the result of the step with this action, and finally, calculate the *adjusted state* as in the *lower_state* function, but with the *next_state* variable.

In [None]:
def testing(env, Q_Table):
    # load data if Q_table == None
    state = env.reset()
    state_adj = lower_state(env)
    done = False
    while not done:
        env.render()
        # Calculate the action by taking the argmax of the Q-values for the current adjusted state in the Q-table.
        next_state, reward, done, _ = env.step(action)
        # calculate the adjusted state as in the lower_state function, but with the next_state variable.
    env.close()

### Testing

To ensure that everything is working as expected, first call the reset method on the environment. Then, call the *testing* function.

A window should open and display the results.

In [None]:
env.reset()
testing(env, Qpts)

## More?

Now you can attempt to optimize the algorithm by changing some parameters and observing the results. 

Alternatively, you can try a different algorithm or environment.