#### Installing Dependencies

Run:
1. conda env create -f environment.yml
2. conda activate sc3000_project

In [2]:
# # Install xvfb, python-opengl, ffmpeg and cmake with conda
# !conda install -c conda-forge xvfbwrapper pyopengl ffmpeg cmake
# !pip install gym pyvirtualdisplay > /dev/null 2>&1
# !pip install gym[classic_control]
# !pip install --upgrade setuptools 2>&1
# !pip install ez_setup > /dev/null 2>&1
# !pip install tensorflow
# !pip install matplotlib

#### Importing Dependencies and Defining Helper Functions

In [3]:
import gym
from gym import logger as gymlogger
from gym.wrappers import RecordVideo
gymlogger.set_level(40) #error only
import tensorflow as tf
import numpy as np
import random
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import math
import glob
import io
import base64
from IPython.display import HTML
from IPython import display as ipythondisplay

In [5]:
env = gym.make("CartPole-v1")
obs = env.reset()
print("Initial observation:", obs)

Initial observation: (array([-0.00014852,  0.04895825, -0.04297311, -0.04529496], dtype=float32), {})


# **Task 1** : Development of an RL agent
> Development of an RL agent. Demonstrate the correctness of the implementation by sampling a random state from the cart pole environment, inputting to the agent, and outputting a chosen action. Print the values of the state and chosen action in Jupyter notebook.


## Approach
We used Q-Learning, a Temporal Difference method, with an epsilon-greedy (epsilon-soft) policy whose epsilon decays during training. The hyperparameters were chosen using our own method of random sampling and analysis.


We analysed the set of hyperparameters used in our samples along with the rewards produced by them, to derive insights on the values of each parameter we should use in training the RL agent.
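
The Q-Learning update and epsilon-greedy action selection named above can be sketched as follows. This is a minimal illustration, not the notebook's actual agent: the helper names `select_action` and `td_update`, the table shape, and the numeric values are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible

def select_action(q_table, state, epsilon):
    """Epsilon-greedy: random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(q_table.shape[-1]))
    return int(np.argmax(q_table[state]))

def td_update(q_table, state, action, reward, next_state, done, alpha, gamma):
    """One Q-learning step: Q(s,a) += alpha * (target - Q(s,a)), in place."""
    # Terminal transitions have no future value to bootstrap from.
    target = reward if done else reward + gamma * np.max(q_table[next_state])
    index = state + (action,)
    q_table[index] += alpha * (target - q_table[index])

# Hypothetical table: 4 observation dimensions with 3 bins each, 2 actions.
q_table = np.zeros((3, 3, 3, 3, 2))
state, next_state = (1, 1, 1, 1), (1, 2, 1, 1)

action = select_action(q_table, state, epsilon=0.0)  # epsilon=0 is purely greedy
td_update(q_table, state, action, reward=1.0, next_state=next_state,
          done=False, alpha=0.1, gamma=0.99)
print(action, q_table[state + (action,)])
```

A decaying epsilon would simply shrink the `epsilon` argument between episodes, so that early episodes explore and later ones mostly exploit the learned table.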

## How It Improves Our Agent
Approaching this problem for the first time, we did not know what a good set of hyperparameters looks like. By running 10 random sets of hyperparameters, we can see which values perform badly and why some perform better.

In doing so, we can also draw conclusions about this environment. For example, moving gamma closer to 1 yields higher rewards, because it is generally favourable to weigh future rewards heavily in the CartPole environment (the output is analysed below).
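
To illustrate why gamma near 1 matters here, the sketch below computes the discounted return for an episode that earns a reward of 1 per step, as CartPole does, over the 500-step limit of CartPole-v1. `discounted_return` is a hypothetical helper and is not part of the original analysis.

```python
def discounted_return(gamma, n_steps, reward=1.0):
    """Sum of reward * gamma**t for t = 0 .. n_steps - 1."""
    return sum(reward * gamma ** t for t in range(n_steps))

# Compare the low end, the middle, and the top of the gamma range.
for gamma in (0.9, 0.99, 1.0):
    print(gamma, round(discounted_return(gamma, 500), 2))
```

With gamma = 0.9 the return saturates near 1 / (1 - 0.9) = 10, so surviving 50 steps and surviving 500 steps look almost identical to the agent. Values closer to 1 keep long episodes distinguishable from short ones.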

# **Cartpole Environment**
Step 1: We use OpenAI's Gym library to load the CartPole-v1 environment, with all the rewards and conditions in place.


In [7]:
env = gym.make("CartPole-v1")

Step 2: We check its action space. The output "2" shows that there are two valid discrete actions: 0 (push the cart left) and 1 (push the cart right).

In [8]:
actionNumber = env.action_space.n
print(actionNumber)

2


# **Initialisation of Global Variables**
Step 3: We define our hyperparameter search space, since we plan to use random search to find an optimised set of hyperparameters.


In [9]:
hyperparameter_space = {
    'gamma': np.linspace(0.9, 1, 10),
    'epsilonParameter': np.linspace(7000, 8000, 5),
    'noOfEpisodesForRandom': np.linspace(300, 600, 5, dtype=int),
    'numberOfBins': np.linspace(25, 30, 5, dtype=int),
    'epsilon': np.linspace(0.1, 0.4, 5),
    'alpha': np.linspace(0.1, 0.4, 5)
}
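
One way to draw random configurations from this search space is sketched below. `sample_hyperparameters` is a hypothetical helper, and the seed and the count of 10 draws are assumptions; the notebook's own sampling code may differ.

```python
import numpy as np

hyperparameter_space = {
    'gamma': np.linspace(0.9, 1, 10),
    'epsilonParameter': np.linspace(7000, 8000, 5),
    'noOfEpisodesForRandom': np.linspace(300, 600, 5, dtype=int),
    'numberOfBins': np.linspace(25, 30, 5, dtype=int),
    'epsilon': np.linspace(0.1, 0.4, 5),
    'alpha': np.linspace(0.1, 0.4, 5)
}

def sample_hyperparameters(space, rng):
    """Pick one value per hyperparameter, uniformly and independently."""
    # .item() converts NumPy scalars to plain Python int/float.
    return {name: rng.choice(values).item() for name, values in space.items()}

rng = np.random.default_rng(42)  # assumed seed, for reproducibility only
candidates = [sample_hyperparameters(hyperparameter_space, rng) for _ in range(10)]
print(candidates[0])
```

Each candidate is a plain dictionary, so it can be passed to a training routine and stored alongside the rewards it produced for later comparison.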

Below are brief explanations of the hyperparameters:

1) gamma:
It represents the discount factor for future rewards, determining the importance of future rewards in the agent's decision-making process. A value closer to 1 indicates that the agent values future rewards highly, while a value closer to 0 indicates a preference for immediate rewards.

2) epsilonParameter:
It controls the balance between exploring new actions and exploiting known actions that have yielded favourable outcomes in the past. A higher value encourages more exploration, while a lower value favours exploitation of the current best action.

3) noOfEpisodesForRandom:
It specifies the number of episodes dedicated to random exploration during the training phase of a reinforcement learning algorithm.

4) numberOfBins:
It sets the number of discrete bins into which the continuous observation space is divided. This determines the granularity of the state space representation and therefore the agent's ability to distinguish between different states.

5) epsilon:
It is the probability of taking a random action instead of following the current policy. Whereas epsilonParameter controls the overall exploration-exploitation trade-off, epsilon sets this probability directly.

6) alpha:
It denotes the learning rate, which determines the extent to which newly acquired information overrides old information when the agent's knowledge is updated. A higher alpha value indicates a greater reliance on recent experiences, potentially leading to faster adaptation to changes in the environment. However, a very high alpha can also cause the agent to forget valuable past experiences.
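
The discretisation controlled by numberOfBins can be sketched as follows. The helper name `discretise`, the use of `np.digitize`, and the observation bounds are assumptions for illustration: CartPole's two velocity components are unbounded, so finite limits for them have to be picked by hand.

```python
import numpy as np

number_of_bins = 25  # lowest value in the search space

# Assumed bounds per dimension: cart position, cart velocity,
# pole angle (radians), pole angular velocity.
# Position and angle follow CartPole's observation limits;
# the two velocity ranges are hand-picked.
lower = np.array([-4.8, -3.0, -0.418, -10.0])
upper = np.array([4.8, 3.0, 0.418, 10.0])

# number_of_bins - 1 interior edges per dimension yield number_of_bins bins.
edges = [np.linspace(lo, hi, number_of_bins + 1)[1:-1]
         for lo, hi in zip(lower, upper)]

def discretise(observation):
    """Map a continuous 4-D observation to a tuple of bin indices."""
    # Values beyond the outermost edges fall into the first or last bin.
    return tuple(int(np.digitize(x, e)) for x, e in zip(observation, edges))

# The initial observation printed earlier in the notebook.
obs = np.array([-0.00014852, 0.04895825, -0.04297311, -0.04529496])
print(discretise(obs))
```

The resulting tuple can index a Q-table of shape `(number_of_bins,) * 4 + (2,)`. More bins give finer state distinctions but a larger table that needs more episodes to fill.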
