### Description of the environment
The CartPole-v2 environment is a control problem in reinforcement learning. The agent must balance a pole on top of a cart that moves horizontally along a track. The pole starts close to upright, and the agent must choose actions that keep it from falling over.

The state of the environment is represented by a vector of four values:

* Cart position (from -4.8 to 4.8)
* Cart velocity (from -Inf to Inf)
* Pole angle (from -24 deg to 24 deg, reported in radians as roughly -0.419 to 0.419)
* Pole angular velocity (from -Inf to Inf)

The agent can take one of three actions:

* Move the cart left
* Move the cart right
* Do nothing

The episode ends when the pole falls over or when the agent has kept the pole balanced for 500 time steps. The agent receives a reward of +1 for every time step in which the pole stays balanced.
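
The end-of-episode rule can be sketched as a small check. In the stock CartPole-v1 environment an episode terminates once the cart position leaves ±2.4 or the pole angle leaves ±12 degrees, which is half of the observation bounds listed above. The sketch below assumes `CartPole_V2` keeps those thresholds and the 500-step limit; `episode_status` is a hypothetical helper name, not part of the notebook's code.

```python
import math

# Thresholds of the stock CartPole-v1 environment. That CartPole_V2 keeps
# them unchanged is an assumption.
X_THRESHOLD = 2.4                      # cart position limit
THETA_THRESHOLD = 12 * math.pi / 180   # pole angle limit (12 deg) in radians
MAX_STEPS = 500                        # step limit stated in the text


def episode_status(state, step_count):
    """Return 'terminated', 'truncated' or 'running' for one time step."""
    x, _x_dot, theta, _theta_dot = state
    if abs(x) > X_THRESHOLD or abs(theta) > THETA_THRESHOLD:
        return "terminated"   # cart left the track or pole fell too far
    if step_count >= MAX_STEPS:
        return "truncated"    # balanced until the step limit
    return "running"
```

With a reward of +1 per step, the score of an episode is simply the number of steps taken before the status stops being `"running"`.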

### Principal steps

* Experiment with the `CartPole-v1` environment.
* Create a new class `CartPole_V2` that inherits from the `CartPoleEnv` class and overrides the `__init__` and `step` methods to add the "stay" action.
* Implement `Random_agent`, which takes a random action from the action space.
* Implement `DQL_agent`, which uses a deep Q-learning network to estimate the Q-value of each action in a given state and selects an action according to an epsilon-greedy policy.
* Run the code with different parameters and hyperparameters for a fixed network.
* Test it multiple times.
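
The epsilon-greedy selection mentioned in the steps above can be sketched in plain Python. This is a minimal illustration, not the notebook's `DQL_agent`: the function names, the `gamma` default and the sample Q-values are hypothetical, and `td_target` shows the standard one-step Q-learning target, which the notebook's agent is assumed (not confirmed) to use.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy choice: index of the largest estimated Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


def td_target(reward, next_q_values, done, gamma=0.99):
    """One-step Q-learning target: r + gamma * max over next-state Q-values."""
    if done:
        return reward  # no bootstrapping from a terminal state
    return reward + gamma * max(next_q_values)


# Hypothetical Q-values for the three actions (left, right, stay).
q = [0.1, 0.7, 0.4]
action = epsilon_greedy(q, epsilon=0.0)   # epsilon = 0 is fully greedy
target = td_target(1.0, q, done=False)
```

A large epsilon early in training favours exploration, and lowering it over time shifts the agent toward the actions its network currently rates best.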

### Possible extensions of the solution

* Try different parameters and use the best ones.
* Tune the hyperparameters of the DQL algorithm to improve the agent's performance.
* Use transfer learning to initialize the neural network's weights from a pre-trained network.
* Implement other reinforcement learning algorithms, such as policy gradient or actor-critic methods, and compare them to the DQL agent.
* Visualize the agent's performance during training. For example, plot the agent's average reward over time or visualize the Q-values learned by the neural network.
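
As a starting point for the visualization idea, the per-episode scores can be smoothed with a moving average before plotting. A minimal sketch: `moving_average` is a hypothetical helper, the window size is arbitrary, and the sample data are the scores of episodes 0 to 9 from the training log in this notebook.

```python
def moving_average(scores, window):
    """Average of each run of `window` consecutive scores."""
    if window <= 0 or window > len(scores):
        raise ValueError("window must be between 1 and len(scores)")
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


# Scores of episodes 0-9, copied from the training log in this notebook.
scores = [16, 13, 19, 18, 16, 18, 25, 18, 30, 20]
smoothed = moving_average(scores, window=5)
```

The resulting list can be passed to any plotting library to show the trend of the score over training episodes.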

### Possible shortcomings

* The DQL algorithm takes a long time to converge (50 min with the current network and parameters).
* The neural network used by the DQL algorithm is a black box, which makes it difficult to interpret the decisions made by the agent and to diagnose problems.
* The DQL algorithm is prone to overfitting to a specific environment. If the agent is trained on one environment and tested on a different environment, its performance may not generalize well.

In [3]:
from cartpole_v2 import CartPole_V2, Simu, Random_agent, DQL_agent

In [4]:
env = CartPole_V2(render_mode="human")
print("The action space is : ", env.action_space)
print("The observation space is : ", env.observation_space)
print("The state at t_0 : ", env.reset())
# action 2 is the added "do nothing" action
next_state, reward, terminated, truncated, info = env.step(2)
print("The new state at t_1 : ", next_state)
print("The reward is  : ", reward)
env.close()

The action space is :  Discrete(3)
The observation space is :  Box([-4.8000002e+00 -3.4028235e+38 -4.1887903e-01 -3.4028235e+38], [4.8000002e+00 3.4028235e+38 4.1887903e-01 3.4028235e+38], (4,), float32)
The state at t_0 :  (array([-0.03232957,  0.02868224, -0.04247316,  0.03301261], dtype=float32), {})
The new state at t_1 :  [-0.03175592  0.02929051 -0.04181291  0.01961767]
The reward is  :  1.0
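
The transition printed above can be reproduced with the standard CartPole equations of motion and a force of zero, which is consistent with action 2 leaving the cart unpushed. The sketch below is a standalone illustration, not the notebook's `CartPole_V2.step`: it assumes the physical constants and explicit Euler update of the stock CartPole environment, and the mapping 0 = left, 1 = right, 2 = stay.

```python
import math

# Constants of the stock CartPole environment (assumed unchanged in CartPole_V2).
GRAVITY = 9.8
MASS_CART = 1.0
MASS_POLE = 0.1
TOTAL_MASS = MASS_CART + MASS_POLE
HALF_LENGTH = 0.5                      # half of the pole length
POLE_MASS_LENGTH = MASS_POLE * HALF_LENGTH
FORCE_MAG = 10.0
TAU = 0.02                             # seconds between state updates

# Assumed action encoding: 0 = push left, 1 = push right, 2 = stay.
FORCE = {0: -FORCE_MAG, 1: FORCE_MAG, 2: 0.0}


def step_dynamics(state, action):
    """Advance the cart-pole state by one explicit Euler step."""
    x, x_dot, theta, theta_dot = state
    force = FORCE[action]
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    temp = (force + POLE_MASS_LENGTH * theta_dot ** 2 * sin_t) / TOTAL_MASS
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        HALF_LENGTH * (4.0 / 3.0 - MASS_POLE * cos_t ** 2 / TOTAL_MASS)
    )
    x_acc = temp - POLE_MASS_LENGTH * theta_acc * cos_t / TOTAL_MASS
    return (
        x + TAU * x_dot,
        x_dot + TAU * x_acc,
        theta + TAU * theta_dot,
        theta_dot + TAU * theta_acc,
    )


# State at t_0 as printed by the cell above, followed by the "stay" action.
state_t0 = (-0.03232957, 0.02868224, -0.04247316, 0.03301261)
state_t1 = step_dynamics(state_t0, action=2)
```

Under these assumptions `state_t1` agrees closely with the `next_state` printed by the cell, up to float32 rounding.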


### Test the environment with a random agent

In [5]:
sm = Simu()
sm.run_simu()

1--0--2--0--0--1--2--2--0--0--0--2--0--1--2--2--0--score : 17


### Train the DQL agent and test it

In [2]:
sm = Simu(agent=DQL_agent)
sm.run_simu()

episode: 0 score : 16
episode: 1 score : 13
episode: 2 score : 19
episode: 3 score : 18
episode: 4 score : 16
episode: 5 score : 18
episode: 6 score : 25
episode: 7 score : 18
episode: 8 score : 30
episode: 9 score : 20
episode: 10 score : 30
episode: 11 score : 21
episode: 12 score : 15
episode: 13 score : 12
episode: 14 score : 10
episode: 15 score : 13
episode: 16 score : 12
episode: 17 score : 10
episode: 18 score : 10
episode: 19 score : 12
episode: 20 score : 9
episode: 21 score : 12
episode: 22 score : 9
episode: 23 score : 8
episode: 24 score : 10
episode: 25 score : 9
episode: 26 score : 11
episode: 27 score : 10
episode: 28 score : 8
episode: 29 score : 9
episode: 30 score : 11
episode: 31 score : 10
episode: 32 score : 11
episode: 33 score : 9
episode: 34 score : 9
episode: 35 score : 11
episode: 36 score : 10
episode: 37 score : 10
episode: 38 score : 11
episode: 39 score : 11
episode: 40 score : 8
episode: 41 score : 9
episode: 42 score : 10
episode: 43 score : 8
episode: 

### Another simulation using the DQL agent

In [3]:
sm.run_simu()

1--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--0--1--0--1--0--1--0--1--0--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--0--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--1--0--1--0--1--0--1--0--1--2--0--1--0--1--0--1--0--1--0--1--0--1--0--1--0--1--0--1--0--1--0--1--0--2--2--1--0--2--2--1--0--2--1--0--2--2--1--0--2--1--0--2--1--0--2--1--0--2--2--1--0--2--1--0--2--1--0--2--2--1--0--2--1--0--2--2--1--0--2--1--0--2--2--1--0--2--2--1--0--2--1--0--2--2--1--0--2--2--1--0--2--2--2--1--0--2--2--1--0--2--2--1--0--2--2--1--0--2--2--2--1--0--2--2--2--1--0--2--2--2--1--0--2--2--2--1--0--2--2--2--2--1--0--2--2--2--2--1--0--2--2--2--2--2--1--0--2--2--2--2--2--1--0--2--2--2--2--2--2--1--0--2--2--2--2--2--2--1--0--2--2--2--2--2--2--2--2--2--2--2--2--1--0--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--1--0--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--2--0--1--0--1--0--1--0--2--2--2--2--2