<a href="https://colab.research.google.com/github/Seph-iroth/RoboLearning/blob/main/mecs6616_Spring2024_Project5_JL6080.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **MECS6616 Spring 2024 - Project 5**

# **Introduction**

***IMPORTANT:***
- **Before starting, make sure to read the [Assignment Instructions](https://courseworks2.columbia.edu/courses/197115/pages/assignment-instructions) page on Courseworks to understand the workflow and submission requirements for this project.**

**FOR PROJECT 5!!!**
- Apart from the link to your notebook, you are also required to submit `q_network.pth` from Part 1 and `ppo_network.zip` from Part 2 to Courseworks (stable_baselines3 saves and loads model checkpoints as zip files). Put the link to your notebook in the comment entry.

# Project Setup


In [None]:
# DO NOT CHANGE

# There will be error messages from this command. You can ignore those error messages
# as long as you see "Successfully installed setuptools-65.5.0" at the end.

# After installing setuptools, a pop-up window will appear and you will be prompted
# to restart the notebook environment. Click on the restart environment button before continuing

!pip install setuptools==65.5.0

**----------------------------**
**WAIT FOR NOTEBOOK TO RESTART**
**----------------------------**

In [None]:
# DO NOT CHANGE

# After running this cell, the folder 'mecs6616_sp24_project5' will show up in the file explorer on the left (click on the folder icon if it's not open)
# It may take a few seconds to appear
!git clone https://github.com/roamlab/mecs6616_sp24_project5.git

In [None]:
# DO NOT CHANGE

# copy all needed files into the working directory. This is simply to make accessing files easier
!cp -av /content/mecs6616_sp24_project5/* /content/

In [None]:
# DO NOT CHANGE

# There will be error messages from this command. You can ignore those error messages
# as long as you see "Successfully installed gym-0.21.0 stable-baselines3-1.5.0" at the end.

!pip install wheel==0.38.4
!pip install gym==0.21.0 stable-baselines3==1.5.0


# Part 1: Implement DQN

For this part, you will implement DQN from scratch. You SHOULD NOT use any RL libraries.

## Starter Code Explanation
In addition to code you are already familiar with from the previous project (i.e. arm dynamics, etc.), we are providing an "Environment" in the `ArmEnv` class. The environment "wraps around" the arm dynamics and provides the key functions that an RL algorithm expects: reset(...) and step(...). The implementation of `ArmEnv` follows the [OpenAI Gym](https://www.gymlibrary.dev/api/core/) API standard, which is accepted by many RL libraries and allows our problem to be solved easily with them. Take a moment to familiarize yourself with these functions! See [here](https://www.gymlibrary.dev/api/core/) for more information on the definition of the reset(...) and step(...) functions.

Important notes:

* The ArmEnv expects an action similar to the one used previously: a vector with a torque for every arm joint. Thus, the native action space for this environment is multi-dimensional and continuous. DQN requires an action space that is 1-dimensional and discrete, so you will need to convert between the two. For example, you can have an action space of [0, 1, 2], where each number just represents the identity of an action candidate, and a conversion dictionary {0: [-0.1, -0.1], 1: [0.1, 0.1], 2: [0, 0]}. Then, when the Q network outputs action 1, it is converted into [0.1, 0.1] and used by the environment. Note that this is just one example of how to implement the conversion; you do not have to follow the same procedure.
* The observation provided by the environment comprises the same state vector as before, to which we append the current position of the end-effector and the goal for the end-effector. Since your policy must learn to reach arbitrary goals, the goal must be provided as part of the observation. So the observation consists of 8 values: 4 for the state, 2 for the pos_ee and 2 for the goal.
* The maximum episode length of the environment is 200 steps. Each step is simulated for 0.01 second. This should be used for both training and testing.
* The reward function of this environment is by default r(s, a) = -dist(pos_ee, goal)^2, i.e. the negative squared L2 distance between the current position of the end-effector and the goal position.
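
As a rough illustration of the first two notes, the sketch below uses the example conversion dictionary from above and unpacks an 8-value observation. The helper names `to_continuous` and `split_observation` are hypothetical, and the sketch assumes the observation is a flat sequence ordered as state, pos_ee, goal; check `arm_env.py` for the actual shapes.

```python
# Conversion dictionary from the example above: discrete action id -> joint torques.
ACTION_TABLE = {0: [-0.1, -0.1], 1: [0.1, 0.1], 2: [0.0, 0.0]}


def to_continuous(discrete_action):
    """Return the torque command as a (2, 1) nested list, one row per joint."""
    torques = ACTION_TABLE[discrete_action]
    return [[torques[0]], [torques[1]]]


def split_observation(obs):
    """Split a flat 8-value observation into (state, pos_ee, goal).

    Assumption: values are ordered as 4 state entries, 2 end-effector
    coordinates, then 2 goal coordinates.
    """
    values = list(obs)
    if len(values) != 8:
        raise ValueError("expected 8 observation values")
    return values[0:4], values[4:6], values[6:8]


continuous_action = to_continuous(1)
state, pos_ee, goal = split_observation([0.0] * 8)
```

A dictionary lookup keeps the conversion trivially cheap, and the number of keys directly gives the size of the Q-network's output layer.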

### Arm Environment Example
You are encouraged to view the `arm_env.py` file to understand the `random_goal()`, `reset()` and `step()`  functions but do not modify the file.

The `env.reset()` method resets the arm to the vertically downward position and sets a new random goal by calling the `random_goal()` method. Understanding how the goals are set can help you guide your training in that direction. You can also provide your own goal as a (2,1) array to the reset function as an argument. This could come in handy later when training the model.

The `env.step()` function takes an action as a (2,1) shaped array and outputs the next observation, reward, done and info. `info` is a dictionary with pos_ee and vel_ee values. This can come in handy if you attempt some reward engineering.
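
To make the idea of reward engineering concrete, here is a minimal sketch of the default reward and one possible shaped variant that also penalizes end-effector speed. The function names, the `velocity_weight` parameter, and the velocity penalty itself are illustrative assumptions and not part of the starter code; positions and velocities are treated as flat 2-element sequences, whereas the environment may return (2,1) arrays.

```python
def default_reward(pos_ee, goal):
    """Negative squared L2 distance between the end-effector and the goal."""
    dx = pos_ee[0] - goal[0]
    dy = pos_ee[1] - goal[1]
    return -(dx * dx + dy * dy)


def shaped_reward(pos_ee, vel_ee, goal, velocity_weight=0.1):
    """Hypothetical shaped reward: default reward minus a speed penalty."""
    speed_squared = vel_ee[0] ** 2 + vel_ee[1] ** 2
    return default_reward(pos_ee, goal) - velocity_weight * speed_squared


# Example: end-effector at the origin, goal at (3, 4), moving at unit speed.
base = default_reward((0.0, 0.0), (3.0, 4.0))
shaped = shaped_reward((0.0, 0.0), (1.0, 0.0), (3.0, 4.0), velocity_weight=0.5)
```

Whether a velocity penalty actually helps depends on the task; it is shown only as one way the `info` values might be used.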

The cell below provides an example of random policy interacting with the ArmEnv for 50 steps (0.5 seconds)

In [None]:
from render import Renderer
from arm_env import ArmEnv
from robot import Robot
from arm_dynamics import ArmDynamics
import numpy as np

# DO NOT CHANGE arm parameters
arm = Robot(
        ArmDynamics(
            num_links=2,
            link_mass=0.1,
            link_length=1,
            joint_viscous_friction=0.1,
            dt=0.01,
            gravity=False
        )
    )
arm.reset()
# ------------------

env = ArmEnv(arm, gui=False)

# Passing our own defined goal to the reset function
# goal = np.array([[0.5], [-1.5]])
# obs = env.reset(goal)

# Resetting the environment without the goal will set a random goal position
obs = env.reset()

for _ in range(50):
  rand_action = np.random.uniform(-1.5, 1.5, (2,1))
  obs, reward, done, info = env.step(rand_action)

### QNetwork
This class defines the architecture of your network. You must fill in the `__init__(...)` function, which defines your network, and the `forward(...)` function, which performs the forward pass.

Your action space should be discrete, with whatever cardinality you decide. The size of the output layer of your Q-Network should thus be the same as the cardinality of your action space. When selecting an action, a policy must choose the one that has the highest estimated Q-value for the current state. As part of the QNetwork class, we are providing the function select_discrete_action(...) which does exactly that.

The arm environment itself, however, expects a 2-dimensional, continuous action vector. Therefore, when it comes to sending an action to the environment, you must provide the kind of action the environment expects. It is your job to determine how to convert between the discrete action space of your Q-Network and the continuous action space of the arm. You do this by filling in the action_discrete_to_continuous(...) function in your QNetwork. You can expect to call the step function of the environment like this:

```
self.env.step(self.q_network.action_discrete_to_continuous(discrete_action))
```
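
One way to build such a discrete action set without typing every entry by hand is to take the Cartesian product of a few torque levels per joint. The sketch below is only an illustration: `build_action_map` is a hypothetical helper and the torque levels are arbitrary example values.

```python
import itertools


def build_action_map(torque_levels):
    """Map each discrete index to one (joint 1 torque, joint 2 torque) pair."""
    pairs = itertools.product(torque_levels, repeat=2)
    return {index: pair for index, pair in enumerate(pairs)}


# Five torque levels per joint give a 5 x 5 grid of 25 discrete actions.
action_map = build_action_map([-1.0, -0.5, 0.0, 0.5, 1.0])
num_actions = len(action_map)      # size of the Q-network output layer
zero_torque = action_map[12]       # middle of the grid: no torque on either joint
```

Generating the map keeps the action count and the output layer size in sync when you change the grid resolution.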

In [None]:
# @title Network
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np


class QNetwork(nn.Module):
  def __init__(self, env):
    super(QNetwork, self).__init__()
    self.env = env
    # Define network architecture

    # Discrete action set: each index maps to a (joint 1 torque, joint 2 torque) pair.
    # Earlier experiments (not used here) tried uniform grids from 5x5 to 11x11
    # with maximum torque magnitudes between roughly 0.1 and 1.1.
    self.action_map = {0: (-1.0, -1.0), 1: (-1.0, -0.5), 2: (-1.0, 0.0), 3: (-1.0, 0.5), 4: (-1.0, 1.0), 5: (-0.5, -1.0),
                       6: (-0.5, -0.5), 7: (-0.5, 0.0), 8: (-0.5, 0.5), 9: (-0.5, 1.0), 10: (0.0, -1.0), 11: (0.0, -0.5),
                       12: (0.0, 0.0), 13: (0.0, 0.5), 14: (0.0, 1.0), 15: (0.5, -1.0), 16: (0.5, -0.5), 17: (0.5, 0.0),
                       18: (0.5, 0.5), 19: (0.5, 1.0), 20: (1.0, -1.0), 21: (1.0, -0.5), 22: (1.0, 0.0), 23: (1.0, 0.5),
                       24: (1.0, 1.0), 25: (0.25, 0), 26: (0, 0.25), 27: (-0.25, 0), 28: (0, -0.25), 29: (0.25, 0.25), 30: (-0.25, -0.25),
                       31: (-0.25, 0.25), 32: (0.25, -0.25), 33: (-1.5, -1.5), 34: (1.5, 1.5), 35: (-2, -2), 36: (2, 2)}

    # Fully connected layers: 8 observation values in, one Q-value per discrete action out.
    self.fc1 = nn.Linear(8, 412)
    self.fc2 = nn.Linear(412, 156)
    self.fc3 = nn.Linear(156, 128)
    self.fc4 = nn.Linear(128, len(self.action_map))

  def forward(self, x, device):
    # Accept NumPy arrays or tensors and move them to the requested device as float32
    x = torch.as_tensor(x, dtype=torch.float32, device=device)
    x = F.relu(self.fc1(x))
    x = F.relu(self.fc2(x))
    x = F.relu(self.fc3(x))
    # No activation on the output layer: Q-values are unbounded
    x = self.fc4(x)
    return x


  def select_discrete_action(self, obs, device):
    # Add a batch dimension so the observation has shape (1, obs_dim)
    obs = torch.from_numpy(obs).float().unsqueeze(0).to(device)
    with torch.no_grad():
        q_values = self.forward(obs, device)
    # Return the index of the largest Q-value, i.e. the greedy discrete action
    discrete_action = torch.argmax(q_values, dim=1).tolist()[0]
    return discrete_action

  def action_discrete_to_continuous(self, discrete_action):
    # Look up the (joint 1 torque, joint 2 torque) pair for this discrete action index
    return self.action_map[discrete_action]

  def get_mapSize(self):
    # Number of discrete actions, which equals the size of the network's output layer
    return len(self.action_map)

In [None]:
%%script false --no-raise-error
# @title Fold (skipped)
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np



class QNetwork(nn.Module):
  def __init__(self, env):
    super(QNetwork, self).__init__()
    #--------- YOUR CODE HERE --------------

    #---------------------------------------

  def forward(self, x, device):
    #--------- YOUR CODE HERE --------------

    return x
    #---------------------------------------


  def select_discrete_action(self, obs, device):
    # Put the observation through the network to estimate q values for all possible discrete actions
    est_q_vals = self.forward(obs.reshape((1,) + obs.shape), device)
    # Choose the discrete action with the highest estimated q value
    discrete_action = torch.argmax(est_q_vals, dim=1).tolist()[0]
    return discrete_action

  def action_discrete_to_continuous(self, discrete_action):
    #--------- YOUR CODE HERE --------------
    return
    #---------------------------------------
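
The commented-out `action_map` earlier in the notebook lists 121 discrete actions, an 11 x 11 grid of value pairs from -0.5 to 0.5 in steps of 0.1. A minimal sketch of how such a table could be generated instead of typed out follows; `build_action_map` and its parameters are hypothetical names used only for illustration, not part of the provided code.

```python
def build_action_map(low=-0.5, high=0.5, steps=11):
    """Map a discrete index to a pair of continuous values on a regular grid.

    Index i * steps + j corresponds to (values[i], values[j]).
    """
    step = (high - low) / (steps - 1)
    # Round to one decimal so the grid values print as -0.5, -0.4, ..., 0.5.
    values = [round(low + i * step, 1) for i in range(steps)]
    return {
        i * steps + j: (values[i], values[j])
        for i in range(steps)
        for j in range(steps)
    }


action_map = build_action_map()
print(len(action_map))   # number of discrete actions
print(action_map[0], action_map[60], action_map[120])
```

With a table like this, `action_discrete_to_continuous` could reduce to a dictionary lookup, and the output size of the Q-network would equal `len(action_map)`.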


We provide you with code to use the replay buffer in your RL implementation. You do not need to change the ReplayBuffer class.
```
rb = ReplayBuffer(buffer_limit)  # buffer_limit: maximum number of stored transitions
```
After creating a ReplayBuffer object you can add samples in the buffer using `put()`:
```
rb.put((obs, action, reward, next_obs, done))
```
Take random samples from the buffer using:
```
obs, actions, rewards, next_obses, dones = rb.sample(batch_size)
```
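
The provided buffer stores transitions in a `collections.deque` with a `maxlen`, so the oldest transitions are dropped automatically once the limit is reached. The sketch below illustrates that behaviour with the standard library only; the observation strings and the limit of 3 are made-up placeholder values.

```python
import collections
import random

# Same container the ReplayBuffer class uses, with a tiny limit for illustration.
buffer = collections.deque(maxlen=3)

# Push five placeholder transitions: (obs, action, reward, next_obs, done).
for step in range(5):
    buffer.append((f"obs{step}", step, -1.0, f"obs{step + 1}", False))

# Only the three most recent transitions remain.
kept_actions = [transition[1] for transition in buffer]
print(kept_actions)

# Uniform sampling without replacement, as in ReplayBuffer.sample().
random.seed(0)
batch = random.sample(buffer, 2)
print(len(batch))
```

`random.sample` raises a `ValueError` when asked for more items than the buffer holds, so it is worth checking the buffer length against the batch size before sampling.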


In [None]:
import collections
import random
import numpy as np


class ReplayBuffer():
    def __init__(self, buffer_limit):
        self.buffer = collections.deque(maxlen=buffer_limit)

    def put(self, transition):
        self.buffer.append(transition)

    def sample(self, n):
        mini_batch = random.sample(self.buffer, n)
        s_lst, a_lst, r_lst, s_prime_lst, done_mask_lst = [], [], [], [], []

        for transition in mini_batch:
            s, a, r, s_prime, done_mask = transition
            s_lst.append(s)
            a_lst.append(a)
            r_lst.append(r)
            s_prime_lst.append(s_prime)
            done_mask_lst.append(done_mask)

        return np.array(s_lst), np.array(a_lst), \
               np.array(r_lst), np.array(s_prime_lst), \
               np.array(done_mask_lst)

### TrainDQN
Here, you must fill in the train(...) function that actually trains your network.
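
As a rough illustration of what a training step regresses towards, the sketch below computes one-step bootstrapped targets `r + discount * max_a Q_target(s', a)` in plain Python. The function name `td_targets` is hypothetical, and zeroing the bootstrap term for terminal transitions is a common convention assumed here, not something this notebook prescribes.

```python
def td_targets(rewards, next_q_values, dones, discount):
    """One-step Q-learning targets for a batch of transitions.

    next_q_values[i] holds the target network's Q estimates for every
    discrete action in the next state of transition i.
    """
    targets = []
    for reward, q_next, done in zip(rewards, next_q_values, dones):
        # Assumption: no bootstrapping past a terminal transition.
        bootstrap = 0.0 if done else discount * max(q_next)
        targets.append(reward + bootstrap)
    return targets


targets = td_targets(
    rewards=[1.0, -1.0],
    next_q_values=[[0.5, 2.0], [3.0, 1.0]],
    dones=[False, True],
    discount=0.5,
)
print(targets)
```

The Q-network's estimate for the action actually taken is then pulled towards these targets with a regression loss such as mean squared error.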

We are providing a helper function called save_model(...) that will save the current Q-network. Use this as you see fit.

To set one network equal to another one, you can use code like this:
```
target_network.load_state_dict(self.q_network.state_dict())
```
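
A small self-contained sketch of that hard update, using two tiny `nn.Linear` layers as stand-ins for the Q-network and target network (the layer sizes are arbitrary), may help confirm that the copy is by value rather than by reference:

```python
import torch
import torch.nn as nn

# Stand-ins for the online and target Q-networks.
q_network = nn.Linear(4, 2)
target_network = nn.Linear(4, 2)

# Hard update: copy every parameter value from the online network.
target_network.load_state_dict(q_network.state_dict())
same_after_sync = all(
    torch.equal(p, q)
    for p, q in zip(q_network.parameters(), target_network.parameters())
)

# Changing the online network afterwards does not affect the target network.
with torch.no_grad():
    q_network.weight.add_(1.0)
same_after_update = torch.equal(q_network.weight, target_network.weight)

print(same_after_sync, same_after_update)
```

Because the target network keeps its own copy of the weights, it stays fixed between synchronisations while the online network continues to train.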

If you would like to be graded with a specific seed for the random number generators, make sure to change the default seed in the initialization of the TrainDQN class.

Training time depends mainly on the size of your model architecture and the number of training episodes. As a reference, a model that passed all evaluation metrics took about an hour to train for 1500 episodes.
* Reference value for clipping the gradient value as mentioned in class: 0.2
* Reference value for a typical size of Replay Buffer: >10k
* Reference value for batch size while training: 64 - 512

Note that these are just reference values and larger is not always better as it may slow things down.
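
For the gradient-clipping reference value, one possible way to apply it in PyTorch is `torch.nn.utils.clip_grad_value_`, called between `backward()` and the optimizer step. The sketch below uses a toy linear layer and random inputs as assumptions; only the 0.2 threshold comes from the list above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy network and data, scaled up so the raw gradients are large.
net = nn.Linear(4, 3)
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)
inputs = torch.randn(8, 4) * 100.0

loss = net(inputs).pow(2).mean()
optimizer.zero_grad()
loss.backward()

# Clamp every gradient element into [-0.2, 0.2] before the update.
nn.utils.clip_grad_value_(net.parameters(), clip_value=0.2)
max_abs_grad = max(p.grad.abs().max().item() for p in net.parameters())

optimizer.step()
print(max_abs_grad)
```

Clipping by value bounds each gradient element independently; `clip_grad_norm_` is the alternative that rescales the whole gradient vector instead.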

It is good practice in RL to ensure simpler things are working before complicating environments or training techniques.

If you think your training method is not working at all, you could pass a fixed goal to the `env.reset()` method during the training loop to ensure that your model is learning.

In [None]:
# @title Training center, MAX 7
!rm -rf /content/models/
import os
import time
import random
import math

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import clear_output
# os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

from score import compute_score
from render import Renderer
from arm_env import ArmEnv
from robot import Robot
from arm_dynamics import ArmDynamics
from geometry import polar2cartesian

def generate_dict(num_elements):

    return {i:0 for i in range(num_elements)}

class TrainDQN:
    def __init__(self, env, seed=44):

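        # Seed every RNG used in training: torch, numpy, and the stdlib
        # `random` module that ReplayBuffer.sample() draws from.
        torch.manual_seed(seed)
        np.random.seed(seed)
        random.seed(seed)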

        # torch.cuda.manual_seed_all(seed)
        self.env = env
        # self.device = torch.device('cpu')
        self.device = torch.device('cpu')
        self.q_network = QNetwork(env).to(self.device) #action network
        self.target_network = QNetwork(env).to(self.device)
        self.target_network.load_state_dict(self.q_network.state_dict())
        # self.optimizer = optim.Adam(self.q_network.parameters(), lr=0.00005,weight_decay=0.0001)
        self.optimizer =torch.optim.AdamW(self.q_network.parameters(), lr=0.00005, betas=(0.9, 0.999), eps=1e-08, amsgrad=False)
        self.rb = ReplayBuffer(12000)
        self.batch_size      = 287
        self.discount       = 0.897 # (0.8, 0.9, 0.95, 0.99)
        self.target_update_freq  = 28
        self.num_episodes     = 1200
        self.epsilon        = 1.2
        self.epsilon_decay     = 0.98
        self.min_epsilon      = 0.4 # not working 0.01 0.05 0.1 0.15 0.25
        self.episode_returns = []
        goal1 = polar2cartesian(1.9, -0.25  - np.pi/2.0)
        goal2 = polar2cartesian(1.6, 0.25  - np.pi/2.0)
        goal3 = polar2cartesian(1.8, 0.3  - np.pi/2.0)
        goal4 = polar2cartesian(1.5, 0.3  - np.pi/2.0)
        goal5 = polar2cartesian(1.6, 0.40  - np.pi/2.0)

        self.goal_list = [goal1,goal2,goal3,goal4,goal5,goal4,goal5,goal4,goal5,goal4,goal5,goal4,goal5,goal4,goal5,goal4,goal5]
        model_path = ''
        if model_path:
            self.q_network.load_state_dict(torch.load(model_path))
            self.target_network.load_state_dict(self.q_network.state_dict())
            print(f"Loaded model weights from {model_path}")

    # def save_model(self, episode_num, save_dir='models'):
    #     timestr = time.strftime("%m-%d_%H-%M-%S")
    #     model_dir = os.path.join(save_dir, timestr)
    #     if not os.path.exists(model_dir):
    #         os.makedirs(model_dir)

    #     savepath = os.path.join(model_dir, f'q_network_ep_{episode_num:04d}.pth')
    #     torch.save(self.q_network.state_dict(), savepath)
    #     print(f'model saved to {savepath}\n')
    #     return savepath

    def train(self):
        countAciontFreqency = generate_dict(self.q_network.get_mapSize())

        timestr = time.strftime("%Y-%m-%d_%H-%M-%S")
        model_dir = os.path.join('models', timestr)
        os.makedirs(model_dir, exist_ok=True)  # Ensure directory exists
        # step_count = 0
        best_TR = float('inf')
        whichGoal = 0
        good_boy = 0
        best_path = ''
        best_grade = 0
        success = os.path.join(model_dir, "success")
        if not os.path.exists(success):
          os.makedirs(success)

        for episode in range(self.num_episodes):
            '''Initialize episode_reward = 0; s1 = random start state; reset robot to match s1'''
            # random_radian1 = random.uniform(0, 2 * math.pi)
            # random_radian2 = random.uniform(0, 2 * math.pi)
            episode_reward = 0
            if episode <= 700:
              obs = self.env.reset()

            if episode > 700:
              obs = self.env.reset(self.goal_list[whichGoal])
            # print(obs)
            # obs[0],obs[1] = random_radian1,random_radian2
            # print(obs)

            whichGoal += 1
            if whichGoal == len(self.goal_list):
              whichGoal = 0
            done = False
            last_distance = np.inf
            if episode % 11 == 0 :
              num_random_action = 0
              num_maxQ_action = 0

            if episode % 10 == 0 and episode != 0:
              # Update exploration rate
                self.epsilon = max(self.min_epsilon, self.epsilon * self.epsilon_decay)
                # print(self.epsilon)
            for t in range(200):

               #Adaptive Epsilon:
              # obs = self.env.reset()
              # obs = self.env.get_obs()
              """Selects an action using epsilon-greedy strategy.
              With probability epsilon, a_t = random; otherwise a_t = argmax_a Q^A(s_t, a)"""
              if np.random.rand() > self.epsilon:
                  disc_action = self.q_network.select_discrete_action(obs, self.device)
                  num_maxQ_action +=1
              else:
                  disc_action = np.random.randint(0, self.q_network.get_mapSize())
                  num_random_action +=1
              countAciontFreqency[disc_action] += 1

              '''get the action '''
              cont_action = self.q_network.action_discrete_to_continuous(disc_action)
              # print(action,action_from_the_net)
              '''Execute at and observe rt+1, st+1; If  st+1 is terminal: break (end episode)'''
              last_distance = np.linalg.norm(obs[[[4, 5]]] - obs[[[6, 7]]])


              obs_next, reward, done, _ = self.env.step(cont_action)
              current_distance = np.linalg.norm(obs_next[[[4, 5]]] - obs_next[[[6, 7]]])
              # obs = obs_next
              # current_distance = np.sqrt(np.abs(reward))


              # if current_distance < last_distance:
              #     reward += 0.1  # Bonus for moving closer to the goal
              # else:
              #     reward -= 0.3
              '''episode reward += rt+1'''
              episode_reward += reward
              '''Store transition (st, at, rt+1, st+1) in D (remove old data if  needed)'''
              # if current_distance < last_distance:
                # print(current_distance,last_distance,current_distance < last_distance)
              self.rb.put((obs, disc_action, reward, obs_next, done))
              last_distance = current_distance
              # obs, actions, rewards, next_obs, dones = self.env.step(cont_action)
              #Train network

              if len(self.rb.buffer) > self.batch_size:
                self.learn()
              '''Set st = st+1'''
              obs = obs_next

            # print(len(self.rb.buffer))

            # self.episode_returns.append(episode_reward)
            starting =400
            if episode <= starting:
              clear_output(wait=True)
              print(f'{episode}' )
              print(f'{countAciontFreqency}')
              print(f'episode_reward: {episode_reward}')
              print(f'epsilon: {self.epsilon}')
            if episode > starting:#and total_reward > -50
              # if episode_reward > best_TR:
              #   best_TR = episode_reward
              #   print(f"------------------------------------------------------------")
              savepath = os.path.join(model_dir, f'ep_{episode:04d}_{round(episode_reward,2)}.pth')
              torch.save(self.q_network.state_dict(), savepath)
              perper = round(num_maxQ_action / (num_maxQ_action+num_random_action),3)
              grade = self.scoreTheThing(savepath)

              clear_output(wait=True)
              print(f'{episode}')
              print(f'{str(model_dir)}')
              print(f'{countAciontFreqency}')
              print(f'episode_reward: {episode_reward}')
              print(f'epsilon: {self.epsilon}')
              print(f'grade: {grade}')

              if grade >= 6.5 and good_boy < 10:
                good_boy += 1
                savepath = os.path.join(success, f'ep_{episode:04d}_{round(episode_reward,2)}_{grade}.pth')
                torch.save(self.q_network.state_dict(), savepath)
                print(f"Ep: {episode:04d}  {savepath}  {perper} {grade} ")
                if grade >= best_grade:
                  best_grade = grade
                  best_path = savepath
              if grade == 7.5:
                savepath = os.path.join(success, f'ep_{episode:04d}_{round(episode_reward,2)}_{grade}_lets_Freaking_go.pth')
                torch.save(self.q_network.state_dict(), savepath)
                if grade >= best_grade:
                  best_grade = grade
                  best_path = savepath
                print(f"dingdingding")
                print(f"dingdingding")
                print(f"dingdingding")
                print(f"dingdingding")
                print(f"dingdingding")
                break

            '''Every k episodes: set target network QT=QA'''
            if episode % self.target_update_freq == 0 and episode != 0:
              # print('target net updated',"  epsilon: ", self.epsilon)
              self.target_network.load_state_dict(self.q_network.state_dict())


        return best_path

    def learn(self):
        # print("learning")
        """Sample random minibatch of  transitions from D"""
        obs, action, reward, next_obs, _         = self.rb.sample(self.batch_size)
        obs       = torch.tensor(obs,      device= self.device, dtype=torch.float)
        action    = torch.tensor(action,   device= self.device, dtype=torch.long)
        reward    = torch.tensor(reward,   device= self.device, dtype=torch.float)
        next_obs  = torch.tensor(next_obs, device= self.device, dtype=torch.float)

        # disc_action = self.q_network.select_discrete_action(obs, self.device)
        # cont_action = self.q_network.action_discrete_to_continuous(disc_action)
              # print(action,action_from_the_net)
        '''Execute at and observe rt+1, st+1; If  st+1 is terminal: break (end episode)'''
        # obs_next, reward, done, _ = self.env.step(cont_action)
        action_Q     = self.q_network(obs, self.device).gather(1, action.unsqueeze(1)).squeeze(1)
        max_target_Q = self.target_network(next_obs, self.device).max(1)[0].detach()
        whatIjustLearn = reward + self.discount * max_target_Q
        '''Perform batch GD on QA with L = F( QA(s_t, a_t) - [ r_{t+1} + 𝛾 max_a QT(s_{t+1}, a) ] )'''
        loss = F.mse_loss(action_Q, whatIjustLearn)
        # loss = F.smooth_l1_loss(action_Q, whatIjustLearn)
        self.optimizer.zero_grad()
        loss.backward()
        # nn.utils.clip_grad_norm_(list(self.q_network.parameters()), 0.2)
        self.optimizer.step()

    def scoreTheThing(self,model_path):
      arm = Robot(
        ArmDynamics(num_links=2,link_mass=0.1,link_length=1,joint_viscous_friction=0.1,dt=0.01,gravity=False))
      arm.reset()
      env = ArmEnv(arm, gui=False)
      device = torch.device('cpu')
      qnet = QNetwork(env).to(device)
      qnet.load_state_dict(torch.load(str(model_path)))
      qnet.eval()
      score = compute_score(qnet, env, device)
      return score


# arm = Robot(
#         ArmDynamics(
#             num_links=2,
#             link_mass=0.1,
#             link_length=1,
#             joint_viscous_friction=0.1,
#             dt=0.01,
# 	    			gravity=False
#         )
#     )
# arm.reset()
# env = ArmEnv(arm, gui=False)
# tqdn = TrainDQN(env)
# # ---------------

# # Call your train function here
# model_path = tqdn.train()
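
The exploration schedule in the training cell above starts at `epsilon = 1.2`, multiplies it by `0.98` every 10 episodes, and floors it at `0.4`. A short sketch that replays only this schedule, without any environment or network, shows how long the agent acts purely at random; `epsilon_schedule` is a hypothetical helper written for this illustration.

```python
def epsilon_schedule(num_episodes, start=1.2, decay=0.98, floor=0.4, every=10):
    """Return the epsilon value in effect during each episode."""
    epsilon = start
    schedule = []
    for episode in range(num_episodes):
        # Same update rule as the training loop: decay every `every` episodes.
        if episode % every == 0 and episode != 0:
            epsilon = max(floor, epsilon * decay)
        schedule.append(epsilon)
    return schedule


schedule = epsilon_schedule(1200)
first_below_one = next(i for i, eps in enumerate(schedule) if eps < 1.0)
first_at_floor = next(i for i, eps in enumerate(schedule) if eps == 0.4)
print(first_below_one, first_at_floor)
```

Since the greedy branch is taken only when a uniform draw in [0, 1) exceeds epsilon, every action is random for as long as epsilon stays at or above 1.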






# PAST EXPERIENCE (ALL SKIPPED)

In [None]:
%%script false --no-raise-error
# @title Training center 3 (currently working on)
# /content/models/2024-04-25_03-16-52 success
# !rm -rf /content/models/
import os
import torch
import torch.optim as optim
from torch.utils.data import DataLoader
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import time
import os
# os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

# from score import compute_score
import torch.nn as nn
import torch
import torch.nn.functional as F
from render import Renderer
from arm_env import ArmEnv
from robot import Robot
from arm_dynamics import ArmDynamics
import numpy as np
from geometry import polar2cartesian
from IPython.display import clear_output

class TrainDQN:
    def __init__(self, env, seed=44):

        torch.manual_seed(seed)

        np.random.seed(seed)

        # torch.cuda.manual_seed_all(seed)
        self.env = env
        # self.device = torch.device('cpu')
        self.device = torch.device('cuda')
        self.q_network = QNetwork(env).to(self.device) #action network
        self.target_network = QNetwork(env).to(self.device)
        self.target_network.load_state_dict(self.q_network.state_dict())
        # self.optimizer = optim.Adam(self.q_network.parameters(), lr=0.0001,weight_decay=0.0001)
        self.optimizer =torch.optim.AdamW(self.q_network.parameters(), lr=0.0001, betas=(0.9, 0.999), eps=1e-08, amsgrad=False)
        self.rb = ReplayBuffer(800)
        self.batch_size = 66
        self.discount = 0.60 # (0.8, 0.9, 0.95, 0.99)
        self.target_update_freq = 20
        self.num_episodes = 10000
        self.epsilon = 1
        self.epsilon_decay = 0.973
        self.min_epsilon = 0.25 # not working 0.01 0.05 0.1 0.15 0.25
        self.episode_returns = []
        goal1 = polar2cartesian(1.9, -0.25 - np.pi/2.0)
        goal2 = polar2cartesian(1.6, 0.25 - np.pi/2.0)
        goal3 = polar2cartesian(1.8, 0.3 - np.pi/2.0)
        goal4 = polar2cartesian(1.5, 0.3 - np.pi/2.0)
        goal5 = polar2cartesian(1.6, 0.40 - np.pi/2.0)

        self.goal_list = [goal1,goal2,goal3,goal4,goal5]
        model_path = ''
        if model_path:
            self.q_network.load_state_dict(torch.load(model_path))
            self.target_network.load_state_dict(self.q_network.state_dict())
            print(f"Loaded model weights from {model_path}")

    def save_model(self, episode_num, save_dir='models'):
        timestr = time.strftime("%Y-%m-%d_%H-%M-%S")
        model_dir = os.path.join(save_dir, timestr)
        if not os.path.exists(model_dir):
            os.makedirs(model_dir)

        savepath = os.path.join(model_dir, f'q_network_ep_{episode_num:04d}.pth')
        torch.save(self.q_network.state_dict(), savepath)
        print(f'model saved to {savepath}\n')
        return savepath

    def train(self):
        goal = polar2cartesian(1.6, 0.25 - np.pi/2.0)
        timestr = time.strftime("%Y-%m-%d_%H-%M-%S")
        model_dir = os.path.join('models', timestr)
        os.makedirs(model_dir, exist_ok=True)  # Ensure directory exists
        # step_count = 0
        best_TR = float('inf')
        whichGoal = 0

        success = os.path.join('models', str(timestr+" success"))
        if not os.path.exists(success):
          os.makedirs(success)

        for episode in range(self.num_episodes):
            '''Initialize episode_reward = 0; s1 = random start state; reset robot to match s1'''
            episode_reward = 0

            obs = self.env.reset(self.goal_list[whichGoal])#goal = [[0.5],[-1.5]]
            # obs = self.env.reset()#goal = [[0.5],[-1.5]]
            whichGoal += 1
            if whichGoal == len(self.goal_list):  # wrap around after the last goal
              whichGoal = 0
            done = False
            last_distance = np.inf
            if episode % 11 == 0 :
              num_random_action = 0
              num_maxQ_action = 0

            if episode % 10 == 0 and episode != 0:
              # Update exploration rate
                self.epsilon = max(self.min_epsilon, self.epsilon * self.epsilon_decay)
                # print(self.epsilon)
            for t in range(200):

               #Adaptive Epsilon:
              # obs = self.env.reset()
              obs = self.env.get_obs()
              """Selects an action using epsilon-greedy strategy.
              With probability epsilon, a_t = random; otherwise a_t = argmax_a Q^A(s_t, a)"""
              if np.random.rand() > self.epsilon:
                  disc_action = self.q_network.select_discrete_action(obs, self.device)
                  num_maxQ_action +=1
              else:
                  disc_action = np.random.randint(0, self.q_network.get_mapSize())
                  num_random_action +=1


              # print(self.q_network.get_mapSize())
              # action = self.q_network.select_discrete_action(obs, self.device)

              # disc_action = q_network.select_discrete_action(obs,device)
              # cont_action = q_network.action_discrete_to_continuous(disc_action)
              # obs, reward, done, _ = env.step(cont_action)
              '''get the action '''
              cont_action = self.q_network.action_discrete_to_continuous(disc_action)
              # print(action,action_from_the_net)
              '''Execute at and observe rt+1, st+1; If  st+1 is terminal: break (end episode)'''
              last_distance = np.linalg.norm(obs[[[4, 5]]] - obs[[[6, 7]]])


              obs_next, reward, done, _ = self.env.step(cont_action)
              current_distance = np.linalg.norm(obs_next[[[4, 5]]] - obs_next[[[6, 7]]])

              # current_distance = np.sqrt(np.abs(reward))
              episode_reward += reward

              # if current_distance < last_distance:
              #     reward += 0.1  # Bonus for moving closer to the goal
              # else:
              #     reward -= 0.3
              '''episode reward += rt+1'''

              '''Store transition (st, at, rt+1, st+1) in D (remove old data if  needed)'''
              # if current_distance < last_distance:
                # print(current_distance,last_distance,current_distance < last_distance)
              self.rb.put((obs, disc_action, reward, obs_next, done))
              last_distance = current_distance
              # obs, actions, rewards, next_obs, dones = self.env.step(cont_action)
              #Train network

              if len(self.rb.buffer) > self.batch_size:
                self.learn()
              '''Set st = st+1'''
              # obs = obs_next

            # print(len(self.rb.buffer))

            self.episode_returns.append(episode_reward)

            if episode % 1 == 0 or episode > 900:#and total_reward > -50
              # if episode_reward > best_TR:
              #   best_TR = episode_reward
              #   print(f"------------------------------------------------------------")
              clear_output(wait=True)
              print(episode)


              savepath = os.path.join(model_dir, f'ep_{episode:04d}_{round(episode_reward,2)}.pth')
              torch.save(self.q_network.state_dict(), savepath)
              perper = round(num_maxQ_action / (num_maxQ_action+num_random_action),3)
              # if episode <= 0:
              #   savepath = os.path.join(model_dir, f'ep_{episode:04d}_{round(episode_reward,2)}.pth')
              #   torch.save(self.q_network.state_dict(), savepath)
              #   grade = self.scoreTheThing(savepath)
              #   print(f"Ep: {episode:04d}  {savepath}  {perper} ")
              if episode > 1:
                grade = self.scoreTheThing(savepath)
                if grade >= 6:
                  savepath = os.path.join(success, f'ep_{episode:04d}_{round(episode_reward,2)}____{grade}.pth')
                  torch.save(self.q_network.state_dict(), savepath)
                  print(f"Ep: {episode:04d}  {savepath}  {perper} {grade} ")



            '''Every k episodes: set target network QT=QA'''
            if episode % self.target_update_freq == 0 and episode != 0:
              # print('target net updated',"  epsilon: ", self.epsilon)
              self.target_network.load_state_dict(self.q_network.state_dict())


        return model_dir, self.episode_returns

    def learn(self):
        # print("learning")
        """Sample random minibatch of  transitions from D"""
        obs, action, reward, next_obs, _ = self.rb.sample(self.batch_size)
        obs       = torch.tensor(obs,      device=self.device, dtype=torch.float)
        action    = torch.tensor(action,   device=self.device, dtype=torch.long)
        reward    = torch.tensor(reward,   device=self.device, dtype=torch.float)
        next_obs  = torch.tensor(next_obs, device=self.device, dtype=torch.float)
        # done      = torch.tensor(done, device=self.device, dtype=torch.bool)

        # disc_action = self.q_network.select_discrete_action(obs, self.device)
        # cont_action = self.q_network.action_discrete_to_continuous(disc_action)
              # print(action,action_from_the_net)
        '''Execute at and observe rt+1, st+1; If  st+1 is terminal: break (end episode)'''
        # obs_next, reward, done, _ = self.env.step(cont_action)


        action_Q = self.q_network(obs, self.device).gather(1, action.unsqueeze(1)).squeeze(1)
        max_target_Q = self.target_network(next_obs, self.device).max(1)[0].detach()

        whatIjustLearn = reward + self.discount * max_target_Q

        '''Perform batch GD on QA with L = F( QA(s_t, a_t) - [ r_{t+1} + 𝛾 max_a QT(s_{t+1}, a) ] )'''
        loss = F.mse_loss(action_Q, whatIjustLearn)
        # loss = F.smooth_l1_loss(action_Q, whatIjustLearn)
        self.optimizer.zero_grad()
        loss.backward()
        # nn.utils.clip_grad_norm_(list(self.q_network.parameters()), 0.2)
        self.optimizer.step()

    def scoreTheThing(self,model_path):
      arm = Robot(
        ArmDynamics(num_links=2,link_mass=0.1,link_length=1,joint_viscous_friction=0.1,dt=0.01,gravity=False))
      arm.reset()
      env = ArmEnv(arm, gui=False)
      device = torch.device('cpu')
      qnet = QNetwork(env).to(device)
      qnet.load_state_dict(torch.load(str(model_path)))
      qnet.eval()
      score = compute_score(qnet, env, device)
      return score

    # def saveEPS(self, episode, total_reward, model_dir):
    #     """Handles the end of an episode including logging and model saving."""
    #     if episode % 10 == 0:
    #         savepath = os.path.join(model_dir, f'q_network_ep_{episode:04d}.pth')
    #         torch.save(self.q_network.state_dict(), savepath)
    #         print(f"Ep: {episode}  {savepath}  TR: {total_reward}")

# start = 0.96
# decay = 0.988
# min = 0.3
# for i in range(1500):
#   if i % 10 == 0 and i != 0:
#     start = max(min, start * decay)
#     print(f"{i:04d}  {start}")
# from robot import Robot
# from arm_dynamics import ArmDynamics
# # !rm -rf /content/models/
# # DO NOT CHANGE
# # ---------------
arm = Robot(
        ArmDynamics(
            num_links=2,
            link_mass=0.1,
            link_length=1,
            joint_viscous_friction=0.1,
            dt=0.01,
            gravity=False
        )
    )
arm.reset()
env = ArmEnv(arm, gui=False)
tqdn = TrainDQN(env)
# ---------------

# Call your train function here
model_path,LotsOfReturn = tqdn.train()






In [None]:
%%script false --no-raise-error
# @title Training center 4 (currently working on)

# !rm -rf /content/models/
import os
import torch
import torch.optim as optim
from torch.utils.data import DataLoader
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import time
import os
# os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

# from score import compute_score
import torch.nn as nn
import torch
import torch.nn.functional as F
from render import Renderer
from arm_env import ArmEnv
from robot import Robot
from arm_dynamics import ArmDynamics
import numpy as np
from geometry import polar2cartesian
from IPython.display import clear_output

class TrainDQN:
    def __init__(self, env, seed=44):

        torch.manual_seed(seed)

        np.random.seed(seed)

        # torch.cuda.manual_seed_all(seed)
        self.env = env
        # self.device = torch.device('cpu')
        self.device = torch.device('cuda')
        self.q_network = QNetwork(env).to(self.device) #action network
        self.target_network = QNetwork(env).to(self.device)
        self.target_network.load_state_dict(self.q_network.state_dict())
        # self.optimizer = optim.Adam(self.q_network.parameters(), lr=0.0001,weight_decay=0.0001)
        self.optimizer =torch.optim.AdamW(self.q_network.parameters(), lr=0.0001, betas=(0.9, 0.999), eps=1e-08, amsgrad=False)
        self.rb = ReplayBuffer(11111)
        self.batch_size = 256
        self.discount = 0.60 # (0.8, 0.9, 0.95, 0.99)
        self.target_update_freq = 20
        self.num_episodes = 10000
        self.epsilon = 1
        self.epsilon_decay = 0.973
        self.min_epsilon = 0.25 # not working 0.01 0.05 0.1 0.15 0.25
        self.episode_returns = []
        goal1 = polar2cartesian(1.9, -0.25 - np.pi/2.0)
        goal2 = polar2cartesian(1.6, 0.25 - np.pi/2.0)
        goal3 = polar2cartesian(1.8, 0.3 - np.pi/2.0)
        goal4 = polar2cartesian(1.5, 0.3 - np.pi/2.0)
        goal5 = polar2cartesian(1.6, 0.40 - np.pi/2.0)

        self.goal_list = [goal1,goal2,goal3,goal4,goal5]
        model_path = ''
        if model_path:
            self.q_network.load_state_dict(torch.load(model_path))
            self.target_network.load_state_dict(self.q_network.state_dict())
            print(f"Loaded model weights from {model_path}")

    def save_model(self, episode_num, save_dir='models'):
        timestr = time.strftime("%Y-%m-%d_%H-%M-%S")
        model_dir = os.path.join(save_dir, timestr)
        if not os.path.exists(model_dir):
            os.makedirs(model_dir)

        savepath = os.path.join(model_dir, f'q_network_ep_{episode_num:04d}.pth')
        torch.save(self.q_network.state_dict(), savepath)
        print(f'model saved to {savepath}\n')
        return savepath

    def train(self):
        goal = polar2cartesian(1.6, 0.25 - np.pi/2.0)
        timestr = time.strftime("%Y-%m-%d_%H-%M-%S")
        model_dir = os.path.join('models', timestr)
        os.makedirs(model_dir, exist_ok=True)  # Ensure directory exists
        # step_count = 0
        best_TR = float('inf')
        whichGoal = 0

        success = os.path.join('models', str(timestr+" success"))
        if not os.path.exists(success):
          os.makedirs(success)

        for episode in range(self.num_episodes):
            '''Initialize episode_reward = 0; s1 = random start state; reset robot to match s1'''
            episode_reward = 0

            obs = self.env.reset(self.goal_list[whichGoal])#goal = [[0.5],[-1.5]]
            # obs = self.env.reset()#goal = [[0.5],[-1.5]]
            whichGoal += 1
            if whichGoal == len(self.goal_list):  # wrap around after the last goal
              whichGoal = 0
            done = False
            last_distance = np.inf
            if episode % 11 == 0 :
              num_random_action = 0
              num_maxQ_action = 0

            if episode % 10 == 0 and episode != 0:
              # Update exploration rate
                self.epsilon = max(self.min_epsilon, self.epsilon * self.epsilon_decay)
                # print(self.epsilon)
            for t in range(200):

               #Adaptive Epsilon:
              # obs = self.env.reset()
              obs = self.env.get_obs()
              """Selects an action using epsilon-greedy strategy.
              With probability epsilon, a_t = random; otherwise a_t = argmax_a Q^A(s_t, a)"""
              if np.random.rand() > self.epsilon:
                  disc_action = self.q_network.select_discrete_action(obs, self.device)
                  num_maxQ_action +=1
              else:
                  disc_action = np.random.randint(0, self.q_network.get_mapSize())
                  num_random_action +=1


              # print(self.q_network.get_mapSize())
              # action = self.q_network.select_discrete_action(obs, self.device)

              # disc_action = q_network.select_discrete_action(obs,device)
              # cont_action = q_network.action_discrete_to_continuous(disc_action)
              # obs, reward, done, _ = env.step(cont_action)
              '''get the action '''
              cont_action = self.q_network.action_discrete_to_continuous(disc_action)
              # print(action,action_from_the_net)
              '''Execute at and observe rt+1, st+1; If  st+1 is terminal: break (end episode)'''
              last_distance = np.linalg.norm(obs[[[4, 5]]] - obs[[[6, 7]]])


              obs_next, reward, done, _ = self.env.step(cont_action)
              current_distance = np.linalg.norm(obs_next[[[4, 5]]] - obs_next[[[6, 7]]])

              # current_distance = np.sqrt(np.abs(reward))
              episode_reward += reward

              # if current_distance < last_distance:
              #     reward += 0.1  # Bonus for moving closer to the goal
              # else:
              #     reward -= 0.3
              '''episode reward += rt+1'''

              '''Store transition (st, at, rt+1, st+1) in D (remove old data if  needed)'''
              # if current_distance < last_distance:
                # print(current_distance,last_distance,current_distance < last_distance)
              self.rb.put((obs, disc_action, reward, obs_next, done))
              last_distance = current_distance
              # obs, actions, rewards, next_obs, dones = self.env.step(cont_action)
              #Train network

              if len(self.rb.buffer) > self.batch_size:
                self.learn()
              '''Set st = st+1'''
              # obs = obs_next

            # print(len(self.rb.buffer))

            self.episode_returns.append(episode_reward)

            if episode % 1 == 0 or episode > 900:#and total_reward > -50
              # if episode_reward > best_TR:
              #   best_TR = episode_reward
              #   print(f"------------------------------------------------------------")
              clear_output(wait=True)
              print(episode)


              savepath = os.path.join(model_dir, f'ep_{episode:04d}_{round(episode_reward,2)}.pth')
              torch.save(self.q_network.state_dict(), savepath)
              perper = round(num_maxQ_action / (num_maxQ_action+num_random_action),3)
              # if episode <= 0:
              #   savepath = os.path.join(model_dir, f'ep_{episode:04d}_{round(episode_reward,2)}.pth')
              #   torch.save(self.q_network.state_dict(), savepath)
              #   grade = self.scoreTheThing(savepath)
              #   print(f"Ep: {episode:04d}  {savepath}  {perper} ")
              if episode > 1:
                grade = self.scoreTheThing(savepath)
                if grade >= 6:
                  savepath = os.path.join(success, f'ep_{episode:04d}_{round(episode_reward,2)}_{grade}.pth')
                  torch.save(self.q_network.state_dict(), savepath)
                  print(f"Ep: {episode:04d}  {savepath}  {perper} {grade} ")



            '''Every k episodes: set target network QT=QA'''
            if episode % self.target_update_freq == 0 and episode != 0:
              # print('target net updated',"  epsilon: ", self.epsilon)
              self.target_network.load_state_dict(self.q_network.state_dict())


        return model_dir, self.episode_returns

    def learn(self):
        # print("learning")
        """Sample a random minibatch of transitions from D"""
        obs, action, reward, next_obs, _ = self.rb.sample(self.batch_size)
        obs       = torch.tensor(obs,      device=self.device, dtype=torch.float)
        action    = torch.tensor(action,   device=self.device, dtype=torch.long)
        reward    = torch.tensor(reward,   device=self.device, dtype=torch.float)
        next_obs  = torch.tensor(next_obs, device=self.device, dtype=torch.float)
        # done      = torch.tensor(done, device=self.device, dtype=torch.bool)

        # disc_action = self.q_network.select_discrete_action(obs, self.device)
        # cont_action = self.q_network.action_discrete_to_continuous(disc_action)
              # print(action,action_from_the_net)
        '''Execute at and observe rt+1, st+1; If  st+1 is terminal: break (end episode)'''
        # obs_next, reward, done, _ = self.env.step(cont_action)


        action_Q = self.q_network(obs, self.device).gather(1, action.unsqueeze(1)).squeeze(1)
        max_target_Q = self.target_network(next_obs, self.device).max(1)[0].detach()

        whatIjustLearn = reward + self.discount * max_target_Q

        '''Perform batch GD on QA with L = F( QA(st, at) - [ rt+1 + 𝛾 max_a QT(st+1, a) ] )'''
        loss = F.mse_loss(action_Q, whatIjustLearn)
        # loss = nn.SmoothL1Loss(current_q_values, whatIjustLearn)
        self.optimizer.zero_grad()
        loss.backward()
        # nn.utils.clip_grad_norm_(list(self.q_network.parameters()), 0.2)
        self.optimizer.step()

    def scoreTheThing(self,model_path):
      arm = Robot(
        ArmDynamics(num_links=2,link_mass=0.1,link_length=1,joint_viscous_friction=0.1,dt=0.01,gravity=False))
      arm.reset()
      env = ArmEnv(arm, gui=False)
      device = torch.device('cpu')
      qnet = QNetwork(env).to(device)
      qnet.load_state_dict(torch.load(str(model_path), map_location=device))
      qnet.eval()
      score = compute_score(qnet, env, device)
      return score

    # def saveEPS(self, episode, total_reward, model_dir):
    #     """Handles the end of an episode including logging and model saving."""
    #     if episode % 10 == 0:
    #         savepath = os.path.join(model_dir, f'q_network_ep_{episode:04d}.pth')
    #         torch.save(self.q_network.state_dict(), savepath)
    #         print(f"Ep: {episode}  {savepath}  TR: {total_reward}")

# start = 0.96
# decay = 0.988
# min = 0.3
# for i in range(1500):
#   if i % 10 == 0 and i != 0:
#     start = max(min, start * decay)
#     print(f"{i:04d}  {start}")
# from robot import Robot
# from arm_dynamics import ArmDynamics
# # !rm -rf /content/models/
# # DO NOT CHANGE
# # ---------------
arm = Robot(
        ArmDynamics(
            num_links=2,
            link_mass=0.1,
            link_length=1,
            joint_viscous_friction=0.1,
            dt=0.01,
            gravity=False
        )
    )
arm.reset()
env = ArmEnv(arm, gui=False)
tqdn = TrainDQN(env)
# ---------------

# Call your train function here
model_path,LotsOfReturn = tqdn.train()
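
The exploration schedule in `train()` multiplies epsilon by a decay factor every 10 episodes and clamps it at a floor. The commented-out loop above previews that schedule; a minimal standalone sketch of the same idea follows. The helper name `epsilon_schedule` and its parameter names are hypothetical, and the default values are the ones from the commented-out loop (start 0.96, decay 0.988, floor 0.3), not necessarily the ones a given run used.

```python
def epsilon_schedule(num_episodes, start=0.96, decay=0.988, floor=0.3, every=10):
    """Return the epsilon value in effect during each episode.

    Hypothetical helper: mirrors the rule `epsilon = max(floor, epsilon * decay)`
    applied at the start of every `every`-th episode (except episode 0).
    """
    eps = start
    schedule = []
    for episode in range(num_episodes):
        if episode % every == 0 and episode != 0:
            eps = max(floor, eps * decay)
        schedule.append(eps)
    return schedule


if __name__ == "__main__":
    schedule = epsilon_schedule(1500)
    # Print a few checkpoints to see how quickly exploration shrinks.
    for episode in range(0, 1500, 250):
        print(f"{episode:04d}  {schedule[episode]:.4f}")
```

Plotting or printing the schedule before a long run is a cheap way to check how many episodes are spent mostly exploring versus sitting at the floor.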






In [None]:
%%script false --no-raise-error
# @title Training center 5 (currently working on)

# !rm -rf /content/models/
import os
import time

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader
from IPython.display import clear_output
# os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

# from score import compute_score
from arm_dynamics import ArmDynamics
from arm_env import ArmEnv
from geometry import polar2cartesian
from render import Renderer
from robot import Robot

class TrainDQN:
    def __init__(self, env, seed=44):

        torch.manual_seed(seed)

        np.random.seed(seed)

        # torch.cuda.manual_seed_all(seed)
        self.env = env
        # self.device = torch.device('cpu')
        self.device = torch.device('cuda')
        self.q_network = QNetwork(env).to(self.device) #action network
        self.target_network = QNetwork(env).to(self.device)
        self.target_network.load_state_dict(self.q_network.state_dict())
        self.optimizer = optim.Adam(self.q_network.parameters(), lr=0.0001,weight_decay=0.0001)
        # self.optimizer =torch.optim.AdamW(self.q_network.parameters(), lr=0.0001, betas=(0.9, 0.999), eps=1e-08, amsgrad=False)
        self.rb = ReplayBuffer(11111)
        self.batch_size = 256
        self.discount = 0.80 # (0.8, 0.9, 0.95, 0.99)
        self.target_update_freq = 15
        self.num_episodes = 10000
        self.epsilon = 1
        self.epsilon_decay = 0.973
        self.min_epsilon = 0.28 # not working 0.01 0.05 0.1 0.15 0.25
        self.episode_returns = []
        goal1 = polar2cartesian(1.9, -0.25 - np.pi/2.0)
        goal2 = polar2cartesian(1.6, 0.25 - np.pi/2.0)
        goal3 = polar2cartesian(1.8, 0.3 - np.pi/2.0)
        goal4 = polar2cartesian(1.5, 0.3 - np.pi/2.0)
        goal5 = polar2cartesian(1.6, 0.40 - np.pi/2.0)

        self.goal_list = [goal1,goal2,goal3,goal4,goal5]
        model_path = ''
        if model_path:
            self.q_network.load_state_dict(torch.load(model_path))
            self.target_network.load_state_dict(self.q_network.state_dict())
            print(f"Loaded model weights from {model_path}")

    def save_model(self, episode_num, save_dir='models'):
        timestr = time.strftime("%Y-%m-%d_%H-%M-%S")
        model_dir = os.path.join(save_dir, timestr)
        if not os.path.exists(model_dir):
            os.makedirs(model_dir)

        savepath = os.path.join(model_dir, f'q_network_ep_{episode_num:04d}.pth')
        torch.save(self.q_network.state_dict(), savepath)
        print(f'model saved to {savepath}\n')
        return savepath

    def train(self):
        goal = polar2cartesian(1.6, 0.25 - np.pi/2.0)
        timestr = time.strftime("%Y-%m-%d_%H-%M-%S")
        model_dir = os.path.join('models', timestr)
        os.makedirs(model_dir, exist_ok=True)  # Ensure directory exists
        # step_count = 0
        best_TR = float('inf')
        whichGoal = 0

        success = os.path.join('models', str(timestr+" success"))
        if not os.path.exists(success):
          os.makedirs(success)

        for episode in range(self.num_episodes):
            '''Initialize episode_reward = 0; s1 = random start state; reset robot to match s1'''
            episode_reward = 0

            # obs = self.env.reset(self.goal_list[whichGoal])#goal = [[0.5],[-1.5]]
            obs = self.env.reset()#goal = [[0.5],[-1.5]]
            whichGoal += 1
            if whichGoal == 4:
              whichGoal = 0
            done = False
            last_distance = np.inf
            if episode % 11 == 0 :
              num_random_action = 0
              num_maxQ_action = 0

            if episode % 10 == 0 and episode != 0:
                # Decay the exploration rate every 10 episodes, down to min_epsilon
                self.epsilon = max(self.min_epsilon, self.epsilon * self.epsilon_decay)
                # print(self.epsilon)
            for t in range(200):

               #Adaptive Epsilon:
              # obs = self.env.reset()
              obs = self.env.get_obs()
              """Select an action with an epsilon-greedy strategy.
              With probability epsilon, a_t = random; otherwise a_t = argmax_a Q^A(s_t, a)"""
              if np.random.rand() > self.epsilon:
                  disc_action = self.q_network.select_discrete_action(obs, self.device)
                  num_maxQ_action +=1
              else:
                  disc_action = np.random.randint(0, self.q_network.get_mapSize())
                  num_random_action +=1


              # print(self.q_network.get_mapSize())
              # action = self.q_network.select_discrete_action(obs, self.device)

              # disc_action = q_network.select_discrete_action(obs,device)
              # cont_action = q_network.action_discrete_to_continuous(disc_action)
              # obs, reward, done, _ = env.step(cont_action)
              '''get the action '''
              cont_action = self.q_network.action_discrete_to_continuous(disc_action)
              # print(action,action_from_the_net)
              '''Execute at and observe rt+1, st+1; If  st+1 is terminal: break (end episode)'''
              last_distance = np.linalg.norm(obs[4:6] - obs[6:8])

              obs_next, reward, done, _ = self.env.step(cont_action)
              current_distance = np.linalg.norm(obs_next[4:6] - obs_next[6:8])

              # current_distance = np.sqrt(np.abs(reward))
              episode_reward += reward

              # if current_distance < last_distance:
              #     reward += 0.1  # Bonus for moving closer to the goal
              # else:
              #     reward -= 0.3
              '''episode reward += rt+1'''

              '''Store transition (st, at, rt+1, st+1) in D (remove old data if  needed)'''
              # if current_distance < last_distance:
                # print(current_distance,last_distance,current_distance < last_distance)
              self.rb.put((obs, disc_action, reward, obs_next, done))
              last_distance = current_distance
              # obs, actions, rewards, next_obs, dones = self.env.step(cont_action)
              #Train network

              if len(self.rb.buffer) > self.batch_size:
                self.learn()
              '''Set st = st+1'''
              # obs = obs_next

            # print(len(self.rb.buffer))

            self.episode_returns.append(episode_reward)

            if episode % 1 == 0 or episode > 900:#and total_reward > -50
              # if episode_reward > best_TR:
              #   best_TR = episode_reward
              #   print(f"------------------------------------------------------------")
              clear_output(wait=True)
              print(episode)


              savepath = os.path.join(model_dir, f'ep_{episode:04d}_{round(episode_reward,2)}.pth')
              torch.save(self.q_network.state_dict(), savepath)
              perper = round(num_maxQ_action / (num_maxQ_action+num_random_action),3)
              # if episode <= 0:
              #   savepath = os.path.join(model_dir, f'ep_{episode:04d}_{round(episode_reward,2)}.pth')
              #   torch.save(self.q_network.state_dict(), savepath)
              #   grade = self.scoreTheThing(savepath)
              #   print(f"Ep: {episode:04d}  {savepath}  {perper} ")
              if episode > 1:
                grade = self.scoreTheThing(savepath)
                if grade >= 6:
                  savepath = os.path.join(success, f'ep_{episode:04d}_{round(episode_reward,2)}_{grade}.pth')
                  torch.save(self.q_network.state_dict(), savepath)
                  print(f"Ep: {episode:04d}  {savepath}  {perper} {grade} ")



            '''Every k episodes: set target network QT=QA'''
            if episode % self.target_update_freq == 0 and episode != 0:
              # print('target net updated',"  epsilon: ", self.epsilon)
              self.target_network.load_state_dict(self.q_network.state_dict())


        return model_dir, self.episode_returns

    def learn(self):
        # print("learning")
        """Sample a random minibatch of transitions from D"""
        obs, action, reward, next_obs, _ = self.rb.sample(self.batch_size)
        obs       = torch.tensor(obs,      device=self.device, dtype=torch.float)
        action    = torch.tensor(action,   device=self.device, dtype=torch.long)
        reward    = torch.tensor(reward,   device=self.device, dtype=torch.float)
        next_obs  = torch.tensor(next_obs, device=self.device, dtype=torch.float)
        # done      = torch.tensor(done, device=self.device, dtype=torch.bool)

        # disc_action = self.q_network.select_discrete_action(obs, self.device)
        # cont_action = self.q_network.action_discrete_to_continuous(disc_action)
              # print(action,action_from_the_net)
        '''Execute at and observe rt+1, st+1; If  st+1 is terminal: break (end episode)'''
        # obs_next, reward, done, _ = self.env.step(cont_action)


        action_Q = self.q_network(obs, self.device).gather(1, action.unsqueeze(1)).squeeze(1)
        max_target_Q = self.target_network(next_obs, self.device).max(1)[0].detach()

        whatIjustLearn = reward + self.discount * max_target_Q

        '''Perform batch GD on QA with L = F( QA(st, at) - [ rt+1 + 𝛾 max_a QT(st+1, a) ] )'''
        loss = F.mse_loss(action_Q, whatIjustLearn)
        # loss = nn.SmoothL1Loss(current_q_values, whatIjustLearn)
        self.optimizer.zero_grad()
        loss.backward()
        # nn.utils.clip_grad_norm_(list(self.q_network.parameters()), 0.2)
        self.optimizer.step()

    def scoreTheThing(self,model_path):
      arm = Robot(
        ArmDynamics(num_links=2,link_mass=0.1,link_length=1,joint_viscous_friction=0.1,dt=0.01,gravity=False))
      arm.reset()
      env = ArmEnv(arm, gui=False)
      device = torch.device('cuda')
      qnet = QNetwork(env).to(device)
      qnet.load_state_dict(torch.load(str(model_path), map_location=device))
      qnet.eval()
      score = compute_score(qnet, env, device)
      return score

    # def saveEPS(self, episode, total_reward, model_dir):
    #     """Handles the end of an episode including logging and model saving."""
    #     if episode % 10 == 0:
    #         savepath = os.path.join(model_dir, f'q_network_ep_{episode:04d}.pth')
    #         torch.save(self.q_network.state_dict(), savepath)
    #         print(f"Ep: {episode}  {savepath}  TR: {total_reward}")

# start = 0.96
# decay = 0.988
# min = 0.3
# for i in range(1500):
#   if i % 10 == 0 and i != 0:
#     start = max(min, start * decay)
#     print(f"{i:04d}  {start}")
# from robot import Robot
# from arm_dynamics import ArmDynamics
# # !rm -rf /content/models/
# # DO NOT CHANGE
# # ---------------
arm = Robot(
        ArmDynamics(
            num_links=2,
            link_mass=0.1,
            link_length=1,
            joint_viscous_friction=0.1,
            dt=0.01,
            gravity=False
        )
    )
arm.reset()
env = ArmEnv(arm, gui=False)
tqdn = TrainDQN(env)
# ---------------

# Call your train function here
model_path,LotsOfReturn = tqdn.train()
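
The `learn()` method above builds its target as `reward + discount * max_a QT(s', a)` for every sampled transition and does not use the stored `done` flag. A common DQN variant drops the bootstrap term when the next state is terminal. The sketch below works through that arithmetic on plain Python lists; `td_targets` is a hypothetical helper, the numbers are made up for illustration, and whether masking is appropriate here depends on how `ArmEnv` sets `done`, which this cell does not show.

```python
def td_targets(rewards, next_q_values, dones, discount):
    """Compute r + discount * max_a Q_T(s', a) per transition.

    On terminal transitions (done is True) the bootstrap term is dropped,
    so the target is just the reward.
    """
    targets = []
    for r, q_next, done in zip(rewards, next_q_values, dones):
        bootstrap = 0.0 if done else max(q_next)
        targets.append(r + discount * bootstrap)
    return targets


# Illustrative batch of three transitions with two discrete actions each.
rewards = [-1.0, -0.5, -0.2]
next_q = [[-2.0, -1.0], [-0.4, -0.8], [-3.0, -2.5]]
dones = [False, False, True]

# discount=0.8 matches self.discount in this cell.
print(td_targets(rewards, next_q, dones, discount=0.8))
```

In a tensor implementation the same masking is usually written as a multiplication by `(1 - done)`, with `done` converted to a float tensor.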






In [None]:
%%script false --no-raise-error
# @title Training center 6 (Lots of 6.5)
#lots of 6.5 /content/models/2024-04-25_12-33-13 success
# !rm -rf /content/models/
import os
import time

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader
from IPython.display import clear_output
# os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

# from score import compute_score
from arm_dynamics import ArmDynamics
from arm_env import ArmEnv
from geometry import polar2cartesian
from render import Renderer
from robot import Robot

class TrainDQN:
    def __init__(self, env, seed=44):

        torch.manual_seed(seed)

        np.random.seed(seed)

        # torch.cuda.manual_seed_all(seed)
        self.env = env
        # self.device = torch.device('cpu')
        self.device = torch.device('cuda')
        self.q_network = QNetwork(env).to(self.device) #action network
        self.target_network = QNetwork(env).to(self.device)
        self.target_network.load_state_dict(self.q_network.state_dict())
        self.optimizer = optim.Adam(self.q_network.parameters(), lr=0.0001,weight_decay=0.0001)
        # self.optimizer =torch.optim.AdamW(self.q_network.parameters(), lr=0.0001, betas=(0.9, 0.999), eps=1e-08, amsgrad=False)
        self.rb = ReplayBuffer(12000)
        self.batch_size = 512
        self.discount = 0.899 # (0.8, 0.9, 0.95, 0.99)
        self.target_update_freq = 30
        self.num_episodes = 10000
        self.epsilon = 1
        self.epsilon_decay = 0.973
        self.min_epsilon = 0.32 # not working 0.01 0.05 0.1 0.15 0.25
        self.episode_returns = []
        goal1 = polar2cartesian(1.9, -0.25 - np.pi/2.0)
        goal2 = polar2cartesian(1.6, 0.25 - np.pi/2.0)
        goal3 = polar2cartesian(1.8, 0.3 - np.pi/2.0)
        goal4 = polar2cartesian(1.5, 0.3 - np.pi/2.0)
        goal5 = polar2cartesian(1.6, 0.40 - np.pi/2.0)

        self.goal_list = [goal1,goal2,goal3,goal4,goal5]
        model_path = ''
        if model_path:
            self.q_network.load_state_dict(torch.load(model_path))
            self.target_network.load_state_dict(self.q_network.state_dict())
            print(f"Loaded model weights from {model_path}")

    def save_model(self, episode_num, save_dir='models'):
        timestr = time.strftime("%Y-%m-%d_%H-%M-%S")
        model_dir = os.path.join(save_dir, timestr)
        if not os.path.exists(model_dir):
            os.makedirs(model_dir)

        savepath = os.path.join(model_dir, f'q_network_ep_{episode_num:04d}.pth')
        torch.save(self.q_network.state_dict(), savepath)
        print(f'model saved to {savepath}\n')
        return savepath

    def train(self):
        goal = polar2cartesian(1.6, 0.25 - np.pi/2.0)
        timestr = time.strftime("%Y-%m-%d_%H-%M-%S")
        model_dir = os.path.join('models', timestr)
        os.makedirs(model_dir, exist_ok=True)  # Ensure directory exists
        # step_count = 0
        best_TR = float('inf')
        whichGoal = 0

        success = os.path.join('models', str(timestr+" success"))
        if not os.path.exists(success):
          os.makedirs(success)

        for episode in range(self.num_episodes):
            '''Initialize episode_reward = 0; s1 = random start state; reset robot to match s1'''
            episode_reward = 0

            # obs = self.env.reset(self.goal_list[whichGoal])#goal = [[0.5],[-1.5]]
            obs = self.env.reset()#goal = [[0.5],[-1.5]]
            whichGoal += 1
            if whichGoal == 4:
              whichGoal = 0
            done = False
            last_distance = np.inf
            if episode % 11 == 0 :
              num_random_action = 0
              num_maxQ_action = 0

            if episode % 10 == 0 and episode != 0:
                # Decay the exploration rate every 10 episodes, down to min_epsilon
                self.epsilon = max(self.min_epsilon, self.epsilon * self.epsilon_decay)
                # print(self.epsilon)
            for t in range(200):

               #Adaptive Epsilon:
              # obs = self.env.reset()
              obs = self.env.get_obs()
              """Select an action with an epsilon-greedy strategy.
              With probability epsilon, a_t = random; otherwise a_t = argmax_a Q^A(s_t, a)"""
              if np.random.rand() > self.epsilon:
                  disc_action = self.q_network.select_discrete_action(obs, self.device)
                  num_maxQ_action +=1
              else:
                  disc_action = np.random.randint(0, self.q_network.get_mapSize())
                  num_random_action +=1


              # print(self.q_network.get_mapSize())
              # action = self.q_network.select_discrete_action(obs, self.device)

              # disc_action = q_network.select_discrete_action(obs,device)
              # cont_action = q_network.action_discrete_to_continuous(disc_action)
              # obs, reward, done, _ = env.step(cont_action)
              '''get the action '''
              cont_action = self.q_network.action_discrete_to_continuous(disc_action)
              # print(action,action_from_the_net)
              '''Execute at and observe rt+1, st+1; If  st+1 is terminal: break (end episode)'''
              last_distance = np.linalg.norm(obs[4:6] - obs[6:8])

              obs_next, reward, done, _ = self.env.step(cont_action)
              current_distance = np.linalg.norm(obs_next[4:6] - obs_next[6:8])

              # current_distance = np.sqrt(np.abs(reward))
              episode_reward += reward

              # if current_distance < last_distance:
              #     reward += 0.1  # Bonus for moving closer to the goal
              # else:
              #     reward -= 0.3
              '''episode reward += rt+1'''

              '''Store transition (st, at, rt+1, st+1) in D (remove old data if  needed)'''
              # if current_distance < last_distance:
                # print(current_distance,last_distance,current_distance < last_distance)
              self.rb.put((obs, disc_action, reward, obs_next, done))
              last_distance = current_distance
              # obs, actions, rewards, next_obs, dones = self.env.step(cont_action)
              #Train network

              if len(self.rb.buffer) > self.batch_size:
                self.learn()
              '''Set st = st+1'''
              # obs = obs_next

            # print(len(self.rb.buffer))

            self.episode_returns.append(episode_reward)

            if episode % 1 == 0 or episode > 900:#and total_reward > -50
              # if episode_reward > best_TR:
              #   best_TR = episode_reward
              #   print(f"------------------------------------------------------------")
              clear_output(wait=True)
              print(episode)


              savepath = os.path.join(model_dir, f'ep_{episode:04d}_{round(episode_reward,2)}.pth')
              torch.save(self.q_network.state_dict(), savepath)
              perper = round(num_maxQ_action / (num_maxQ_action+num_random_action),3)
              # if episode <= 0:
              #   savepath = os.path.join(model_dir, f'ep_{episode:04d}_{round(episode_reward,2)}.pth')
              #   torch.save(self.q_network.state_dict(), savepath)
              #   grade = self.scoreTheThing(savepath)
              #   print(f"Ep: {episode:04d}  {savepath}  {perper} ")
              if episode > 1:
                grade = self.scoreTheThing(savepath)
                if grade >= 6:
                  savepath = os.path.join(success, f'ep_{episode:04d}_{round(episode_reward,2)}_{grade}.pth')
                  torch.save(self.q_network.state_dict(), savepath)
                  print(f"Ep: {episode:04d}  {savepath}  {perper} {grade} ")



            '''Every k episodes: set target network QT=QA'''
            if episode % self.target_update_freq == 0 and episode != 0:
              # print('target net updated',"  epsilon: ", self.epsilon)
              self.target_network.load_state_dict(self.q_network.state_dict())


        return model_dir, self.episode_returns

    def learn(self):
        # print("learning")
        """Sample a random minibatch of transitions from D"""
        obs, action, reward, next_obs, _ = self.rb.sample(self.batch_size)
        obs       = torch.tensor(obs,      device=self.device, dtype=torch.float)
        action    = torch.tensor(action,   device=self.device, dtype=torch.long)
        reward    = torch.tensor(reward,   device=self.device, dtype=torch.float)
        next_obs  = torch.tensor(next_obs, device=self.device, dtype=torch.float)
        # done      = torch.tensor(done, device=self.device, dtype=torch.bool)

        # disc_action = self.q_network.select_discrete_action(obs, self.device)
        # cont_action = self.q_network.action_discrete_to_continuous(disc_action)
              # print(action,action_from_the_net)
        '''Execute at and observe rt+1, st+1; If  st+1 is terminal: break (end episode)'''
        # obs_next, reward, done, _ = self.env.step(cont_action)


        action_Q = self.q_network(obs, self.device).gather(1, action.unsqueeze(1)).squeeze(1)
        max_target_Q = self.target_network(next_obs, self.device).max(1)[0].detach()

        whatIjustLearn = reward + self.discount * max_target_Q

        '''Perform batch GD on QA with L = F( QA(st, at) - [ rt+1 + 𝛾 max_a QT(st+1, a) ] )'''
        loss = F.mse_loss(action_Q, whatIjustLearn)
        # loss = nn.SmoothL1Loss(current_q_values, whatIjustLearn)
        self.optimizer.zero_grad()
        loss.backward()
        # nn.utils.clip_grad_norm_(list(self.q_network.parameters()), 0.2)
        self.optimizer.step()

    def scoreTheThing(self,model_path):
      arm = Robot(
        ArmDynamics(num_links=2,link_mass=0.1,link_length=1,joint_viscous_friction=0.1,dt=0.01,gravity=False))
      arm.reset()
      env = ArmEnv(arm, gui=False)
      device = torch.device('cpu')
      qnet = QNetwork(env).to(device)
      qnet.load_state_dict(torch.load(str(model_path), map_location=device))
      qnet.eval()
      score = compute_score(qnet, env, device)
      return score

    # def saveEPS(self, episode, total_reward, model_dir):
    #     """Handles the end of an episode including logging and model saving."""
    #     if episode % 10 == 0:
    #         savepath = os.path.join(model_dir, f'q_network_ep_{episode:04d}.pth')
    #         torch.save(self.q_network.state_dict(), savepath)
    #         print(f"Ep: {episode}  {savepath}  TR: {total_reward}")

# start = 0.96
# decay = 0.988
# min = 0.3
# for i in range(1500):
#   if i % 10 == 0 and i != 0:
#     start = max(min, start * decay)
#     print(f"{i:04d}  {start}")
# from robot import Robot
# from arm_dynamics import ArmDynamics
# # !rm -rf /content/models/
# # DO NOT CHANGE
# # ---------------
arm = Robot(
        ArmDynamics(
            num_links=2,
            link_mass=0.1,
            link_length=1,
            joint_viscous_friction=0.1,
            dt=0.01,
            gravity=False
        )
    )
arm.reset()
env = ArmEnv(arm, gui=False)
tqdn = TrainDQN(env)
# ---------------

# Call your train function here
model_path,LotsOfReturn = tqdn.train()
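
The training loop above picks the greedy action when `np.random.rand() > epsilon` and a uniformly random one otherwise, then reports the fraction of greedy actions as `perper`. The sketch below isolates that selection rule with the standard-library `random` module so the expected greedy fraction can be checked without the environment. `select_action` and the toy `q_values` list are hypothetical stand-ins for `QNetwork.select_discrete_action` and the network output; the seed 44 and epsilon 0.32 are borrowed from this cell's defaults.

```python
import random


def select_action(q_values, epsilon, rng=random):
    """Return (action_index, was_greedy) under an epsilon-greedy policy.

    With probability roughly 1 - epsilon the index of the largest Q-value
    is chosen; otherwise a uniformly random index is returned.
    """
    if rng.random() > epsilon:
        greedy = max(range(len(q_values)), key=lambda a: q_values[a])
        return greedy, True
    return rng.randrange(len(q_values)), False


rng = random.Random(44)
q_values = [0.1, 0.7, 0.3]  # toy Q-values for three discrete actions
counts = {True: 0, False: 0}
for _ in range(10_000):
    _, was_greedy = select_action(q_values, epsilon=0.32, rng=rng)
    counts[was_greedy] += 1

# Fraction of steps that took the greedy branch; expected to be near 1 - epsilon.
greedy_fraction = counts[True] / (counts[True] + counts[False])
print(round(greedy_fraction, 3))
```

Because a random draw can also land on the greedy index, the share of steps that actually execute the best-known action is somewhat higher than the fraction counted by the greedy branch.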






In [None]:
%%script false --no-raise-error
# @title Training center 7 (16 13 19)

# !rm -rf /content/models/
import os
import time

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader
from IPython.display import clear_output
# os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

# from score import compute_score
from arm_dynamics import ArmDynamics
from arm_env import ArmEnv
from geometry import polar2cartesian
from render import Renderer
from robot import Robot

class TrainDQN:
    def __init__(self, env, seed=44):

        torch.manual_seed(seed)

        np.random.seed(seed)

        # torch.cuda.manual_seed_all(seed)
        self.env = env
        # self.device = torch.device('cpu')
        self.device = torch.device('cuda')
        self.q_network = QNetwork(env).to(self.device) #action network
        self.target_network = QNetwork(env).to(self.device)
        self.target_network.load_state_dict(self.q_network.state_dict())
        self.optimizer = optim.Adam(self.q_network.parameters(), lr=0.0001,weight_decay=0.0001)
        # self.optimizer =torch.optim.AdamW(self.q_network.parameters(), lr=0.0001, betas=(0.9, 0.999), eps=1e-08, amsgrad=False)
        self.rb = ReplayBuffer(12000)
        self.batch_size = 512
        self.discount = 0.899 # (0.8, 0.9, 0.95, 0.99)
        self.target_update_freq = 70
        self.num_episodes = 10000
        self.epsilon = 1
        self.epsilon_decay = 0.973
        self.min_epsilon = 0.35 # not working 0.01 0.05 0.1 0.15 0.25
        self.episode_returns = []
        goal1 = polar2cartesian(1.9, -0.25 - np.pi/2.0)
        goal2 = polar2cartesian(1.6, 0.25 - np.pi/2.0)
        goal3 = polar2cartesian(1.8, 0.3 - np.pi/2.0)
        goal4 = polar2cartesian(1.5, 0.3 - np.pi/2.0)
        goal5 = polar2cartesian(1.6, 0.40 - np.pi/2.0)

        self.goal_list = [goal1,goal2,goal3,goal4,goal5]
        model_path = ''
        if model_path:
            self.q_network.load_state_dict(torch.load(model_path))
            self.target_network.load_state_dict(self.q_network.state_dict())
            print(f"Loaded model weights from {model_path}")

    def save_model(self, episode_num, save_dir='models'):
        timestr = time.strftime("%Y-%m-%d_%H-%M-%S")
        model_dir = os.path.join(save_dir, timestr)
        if not os.path.exists(model_dir):
            os.makedirs(model_dir)

        savepath = os.path.join(model_dir, f'q_network_ep_{episode_num:04d}.pth')
        torch.save(self.q_network.state_dict(), savepath)
        print(f'model saved to {savepath}\n')
        return savepath

    def train(self):
        goal = polar2cartesian(1.6, 0.25 - np.pi/2.0)
        timestr = time.strftime("%Y-%m-%d_%H-%M-%S")
        model_dir = os.path.join('models', timestr)
        os.makedirs(model_dir, exist_ok=True)  # Ensure directory exists
        # step_count = 0
        best_TR = float('inf')
        whichGoal = 0

        success = os.path.join('models', str(timestr+" success"))
        if not os.path.exists(success):
          os.makedirs(success)

        for episode in range(self.num_episodes):
            '''Initialize episode_reward = 0; s1 = random start state; reset robot to match s1'''
            episode_reward = 0

            # obs = self.env.reset(self.goal_list[whichGoal])#goal = [[0.5],[-1.5]]
            obs = self.env.reset()#goal = [[0.5],[-1.5]]
            whichGoal += 1
            if whichGoal == 4:
              whichGoal = 0
            done = False
            last_distance = np.inf
            if episode % 11 == 0 :
              num_random_action = 0
              num_maxQ_action = 0

            if episode % 10 == 0 and episode != 0:
                # Decay the exploration rate every 10 episodes, down to min_epsilon
                self.epsilon = max(self.min_epsilon, self.epsilon * self.epsilon_decay)
                # print(self.epsilon)
            for t in range(200):

               #Adaptive Epsilon:
              # obs = self.env.reset()
              obs = self.env.get_obs()
              """Select an action with an epsilon-greedy strategy.
              With probability epsilon, a_t = random; otherwise a_t = argmax_a Q^A(s_t, a)"""
              if np.random.rand() > self.epsilon:
                  disc_action = self.q_network.select_discrete_action(obs, self.device)
                  num_maxQ_action +=1
              else:
                  disc_action = np.random.randint(0, self.q_network.get_mapSize())
                  num_random_action +=1


              # print(self.q_network.get_mapSize())
              # action = self.q_network.select_discrete_action(obs, self.device)

              # disc_action = q_network.select_discrete_action(obs,device)
              # cont_action = q_network.action_discrete_to_continuous(disc_action)
              # obs, reward, done, _ = env.step(cont_action)
              '''get the action '''
              cont_action = self.q_network.action_discrete_to_continuous(disc_action)
              # print(action,action_from_the_net)
              '''Execute at and observe rt+1, st+1; If  st+1 is terminal: break (end episode)'''
              last_distance = np.linalg.norm(obs[[[4, 5]]] - obs[[[6, 7]]])


              obs_next, reward, done, _ = self.env.step(cont_action)
              current_distance = np.linalg.norm(obs_next[[[4, 5]]] - obs_next[[[6, 7]]])

              # current_distance = np.sqrt(np.abs(reward))
              episode_reward += reward

              # if current_distance < last_distance:
              #     reward += 0.1  # Bonus for moving closer to the goal
              # else:
              #     reward -= 0.3
              '''episode reward += rt+1'''

              '''Store transition (st, at, rt+1, st+1) in D (remove old data if  needed)'''
              # if current_distance < last_distance:
                # print(current_distance,last_distance,current_distance < last_distance)
              self.rb.put((obs, disc_action, reward, obs_next, done))
              last_distance = current_distance
              # obs, actions, rewards, next_obs, dones = self.env.step(cont_action)
              #Train network

              if len(self.rb.buffer) > self.batch_size:
                self.learn()
              '''Set st = st+1'''
              # obs = obs_next

            # print(len(self.rb.buffer))

            self.episode_returns.append(episode_reward)

            if episode % 1 == 0 or episode > 900:#and total_reward > -50
              # if episode_reward > best_TR:
              #   best_TR = episode_reward
              #   print(f"------------------------------------------------------------")
              clear_output(wait=True)
              print(episode)


              savepath = os.path.join(model_dir, f'ep_{episode:04d}_{round(episode_reward,2)}.pth')
              torch.save(self.q_network.state_dict(), savepath)
              perper = round(num_maxQ_action / (num_maxQ_action+num_random_action),3)
              # if episode <= 0:
              #   savepath = os.path.join(model_dir, f'ep_{episode:04d}_{round(episode_reward,2)}.pth')
              #   torch.save(self.q_network.state_dict(), savepath)
              #   grade = self.scoreTheThing(savepath)
              #   print(f"Ep: {episode:04d}  {savepath}  {perper} ")
              if episode > 1:
                grade = self.scoreTheThing(savepath)
                if grade >= 6:
                  savepath = os.path.join(success, f'ep_{episode:04d}_{round(episode_reward,2)}_{grade}.pth')
                  torch.save(self.q_network.state_dict(), savepath)
                  print(f"Ep: {episode:04d}  {savepath}  {perper} {grade} ")



            '''Every k episodes: set target network QT=QA'''
            if episode % self.target_update_freq == 0 and episode != 0:
              # print('target net updated',"  epsilon: ", self.epsilon)
              self.target_network.load_state_dict(self.q_network.state_dict())


        return model_dir, self.episode_returns

    def learn(self):
        # print("learning")
        """Sample random minibatch of  transitions from D"""
        obs, action, reward, next_obs, _ = self.rb.sample(self.batch_size)
        obs       = torch.tensor(obs,      device=self.device, dtype=torch.float)
        action    = torch.tensor(action,   device=self.device, dtype=torch.long)
        reward    = torch.tensor(reward,   device=self.device, dtype=torch.float)
        next_obs  = torch.tensor(next_obs, device=self.device, dtype=torch.float)
        # done      = torch.tensor(done, device=self.device, dtype=torch.bool)

        # disc_action = self.q_network.select_discrete_action(obs, self.device)
        # cont_action = self.q_network.action_discrete_to_continuous(disc_action)
              # print(action,action_from_the_net)
        '''Execute at and observe rt+1, st+1; If  st+1 is terminal: break (end episode)'''
        # obs_next, reward, done, _ = self.env.step(cont_action)


        action_Q = self.q_network(obs, self.device).gather(1, action.unsqueeze(1)).squeeze(1)
        max_target_Q = self.target_network(next_obs, self.device).max(1)[0].detach()

        whatIjustLearn = reward + self.discount * max_target_Q

        '''Perform batch GD on QA with L = F( QA(st, at) - [ rt + 𝛾 max_a QT(st+1, a) ] )'''
        loss = F.mse_loss(action_Q, whatIjustLearn)
        # loss = nn.SmoothL1Loss(current_q_values, whatIjustLearn)
        self.optimizer.zero_grad()
        loss.backward()
        # nn.utils.clip_grad_norm_(list(self.q_network.parameters()), 0.2)
        self.optimizer.step()

    def scoreTheThing(self,model_path):
      arm = Robot(
        ArmDynamics(num_links=2,link_mass=0.1,link_length=1,joint_viscous_friction=0.1,dt=0.01,gravity=False))
      arm.reset()
      env = ArmEnv(arm, gui=False)
      device = torch.device('cuda')
      qnet = QNetwork(env).to(device)
      qnet.load_state_dict(torch.load(str(model_path)))
      qnet.eval()
      score = compute_score(qnet, env, device)
      return score

    # def saveEPS(self, episode, total_reward, model_dir):
    #     """Handles the end of an episode including logging and model saving."""
    #     if episode % 10 == 0:
    #         savepath = os.path.join(model_dir, f'q_network_ep_{episode:04d}.pth')
    #         torch.save(self.q_network.state_dict(), savepath)
    #         print(f"Ep: {episode}  {savepath}  TR: {total_reward}")

# start = 0.96
# decay = 0.988
# min = 0.3
# for i in range(1500):
#   if i % 10 == 0 and i != 0:
#     start = max(min, start * decay)
#     print(f"{i:04d}  {start}")
# from robot import Robot
# from arm_dynamics import ArmDynamics
# # !rm -rf /content/models/
# # DO NOT CHANGE
# # ---------------
arm = Robot(
        ArmDynamics(
            num_links=2,
            link_mass=0.1,
            link_length=1,
            joint_viscous_friction=0.1,
            dt=0.01,
            gravity=False
        )
    )
arm.reset()
env = ArmEnv(arm, gui=False)
tqdn = TrainDQN(env)
# ---------------

# Call your train function here
model_path,LotsOfReturn = tqdn.train()
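
The commented-out loop in the cell above previews the stepwise epsilon decay that `train()` applies every 10 episodes. A minimal standalone sketch of that schedule follows. The helper name `epsilon_schedule` and its parameter names are hypothetical, and the defaults are taken from the commented-out preview rather than from any particular `TrainDQN` configuration.

```python
def epsilon_schedule(num_episodes, start=0.96, decay=0.988, floor=0.3, every=10):
    """Return the epsilon in effect during each episode.

    Epsilon is multiplied by `decay` once every `every` episodes (skipping
    episode 0) and never drops below `floor`, mirroring the update in train().
    """
    epsilon = start
    schedule = []
    for episode in range(num_episodes):
        if episode % every == 0 and episode != 0:
            epsilon = max(floor, epsilon * decay)
        schedule.append(epsilon)
    return schedule


schedule = epsilon_schedule(1500)
# First episode at which the schedule is clamped to the floor value.
first_floor_episode = next(i for i, eps in enumerate(schedule) if eps == 0.3)
print(f"epsilon reaches its floor at episode {first_floor_episode}")
```

Inspecting the schedule before a long run shows how many episodes are spent mostly exploring, which is useful when comparing decay rates across the training cells.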






In [None]:
%%script false --no-raise-error
# @title Training center 8 (currently working on)
#lots of 6.5 /content/models/2024-04-25_12-33-13 success
# !rm -rf /content/models/
import os
import time
# os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader

# from score import compute_score
from render import Renderer
from arm_env import ArmEnv
from robot import Robot
from arm_dynamics import ArmDynamics
from geometry import polar2cartesian
from IPython.display import clear_output

class TrainDQN:
    def __init__(self, env, seed=44):

        torch.manual_seed(seed)

        np.random.seed(seed)

        # torch.cuda.manual_seed_all(seed)
        self.env = env
        # self.device = torch.device('cpu')
        self.device = torch.device('cuda')
        self.q_network = QNetwork(env).to(self.device) #action network
        self.target_network = QNetwork(env).to(self.device)
        self.target_network.load_state_dict(self.q_network.state_dict())
        self.optimizer = optim.Adam(self.q_network.parameters(), lr=0.0001,weight_decay=0.0001)
        # self.optimizer =torch.optim.AdamW(self.q_network.parameters(), lr=0.0001, betas=(0.9, 0.999), eps=1e-08, amsgrad=False)
        self.rb = ReplayBuffer(16000)
        self.batch_size = 700
        self.discount = 0.899 # (0.8, 0.9, 0.95, 0.99)
        self.target_update_freq = 100
        self.num_episodes = 10000
        self.epsilon = 1
        self.epsilon_decay = 0.973
        self.min_epsilon = 0.321 # not working 0.01 0.05 0.1 0.15 0.25
        self.episode_returns = []
        goal1 = polar2cartesian(1.9, -0.25 - np.pi/2.0)
        goal2 = polar2cartesian(1.6, 0.25 - np.pi/2.0)
        goal3 = polar2cartesian(1.8, 0.3 - np.pi/2.0)
        goal4 = polar2cartesian(1.5, 0.3 - np.pi/2.0)
        goal5 = polar2cartesian(1.6, 0.40 - np.pi/2.0)

        self.goal_list = [goal1,goal2,goal3,goal4,goal5]
        model_path = ''
        if model_path:
            self.q_network.load_state_dict(torch.load(model_path))
            self.target_network.load_state_dict(self.q_network.state_dict())
            print(f"Loaded model weights from {model_path}")

    def save_model(self, episode_num, save_dir='models'):
        timestr = time.strftime("%Y-%m-%d_%H-%M-%S")
        model_dir = os.path.join(save_dir, timestr)
        if not os.path.exists(model_dir):
            os.makedirs(model_dir)

        savepath = os.path.join(model_dir, f'q_network_ep_{episode_num:04d}.pth')
        torch.save(self.q_network.state_dict(), savepath)
        print(f'model saved to {savepath}\n')
        return savepath

    def train(self):
        goal = polar2cartesian(1.6, 0.25 - np.pi/2.0)
        timestr = time.strftime("%Y-%m-%d_%H-%M-%S")
        model_dir = os.path.join('models', timestr)
        os.makedirs(model_dir, exist_ok=True)  # Ensure directory exists
        # step_count = 0
        best_TR = float('inf')
        whichGoal = 0

        success = os.path.join('models', str(timestr+" success"))
        if not os.path.exists(success):
          os.makedirs(success)

        for episode in range(self.num_episodes):
            '''Initialize episode_reward = 0; s1 = random start state; reset robot to match s1'''
            episode_reward = 0

            # obs = self.env.reset(self.goal_list[whichGoal])#goal = [[0.5],[-1.5]]
            obs = self.env.reset()#goal = [[0.5],[-1.5]]
            whichGoal += 1
            if whichGoal == 4:
              whichGoal = 0
            done = False
            last_distance = np.inf
            if episode % 11 == 0 :
              num_random_action = 0
              num_maxQ_action = 0

            if episode % 10 == 0 and episode != 0:
                # Update exploration rate
                self.epsilon = max(self.min_epsilon, self.epsilon * self.epsilon_decay)
                # print(self.epsilon)
            for t in range(200):

               #Adaptive Epsilon:
              # obs = self.env.reset()
              obs = self.env.get_obs()
              """Select an action with an epsilon-greedy strategy:
              with probability epsilon, a_t is random; otherwise a_t = argmax_a Q^A(s_t, a)."""
              if np.random.rand() > self.epsilon:
                  disc_action = self.q_network.select_discrete_action(obs, self.device)
                  num_maxQ_action +=1
              else:
                  disc_action = np.random.randint(0, self.q_network.get_mapSize())
                  num_random_action +=1


              # print(self.q_network.get_mapSize())
              # action = self.q_network.select_discrete_action(obs, self.device)

              # disc_action = q_network.select_discrete_action(obs,device)
              # cont_action = q_network.action_discrete_to_continuous(disc_action)
              # obs, reward, done, _ = env.step(cont_action)
              '''get the action '''
              cont_action = self.q_network.action_discrete_to_continuous(disc_action)
              # print(action,action_from_the_net)
              '''Execute at and observe rt+1, st+1; If  st+1 is terminal: break (end episode)'''
              last_distance = np.linalg.norm(obs[[[4, 5]]] - obs[[[6, 7]]])


              obs_next, reward, done, _ = self.env.step(cont_action)
              current_distance = np.linalg.norm(obs_next[[[4, 5]]] - obs_next[[[6, 7]]])

              # current_distance = np.sqrt(np.abs(reward))
              episode_reward += reward

              # if current_distance < last_distance:
              #     reward += 0.1  # Bonus for moving closer to the goal
              # else:
              #     reward -= 0.3
              '''episode reward += rt+1'''

              '''Store transition (st, at, rt+1, st+1) in D (remove old data if  needed)'''
              # if current_distance < last_distance:
                # print(current_distance,last_distance,current_distance < last_distance)
              self.rb.put((obs, disc_action, reward, obs_next, done))
              last_distance = current_distance
              # obs, actions, rewards, next_obs, dones = self.env.step(cont_action)
              #Train network

              if len(self.rb.buffer) > self.batch_size:
                self.learn()
              '''Set st = st+1'''
              # obs = obs_next

            # print(len(self.rb.buffer))

            self.episode_returns.append(episode_reward)

            if episode % 1 == 0 or episode > 900:#and total_reward > -50
              # if episode_reward > best_TR:
              #   best_TR = episode_reward
              #   print(f"------------------------------------------------------------")
              clear_output(wait=True)
              print(f'{episode}    {str(model_dir)} ' )


              savepath = os.path.join(model_dir, f'ep_{episode:04d}_{round(episode_reward,2)}.pth')
              torch.save(self.q_network.state_dict(), savepath)
              perper = round(num_maxQ_action / (num_maxQ_action+num_random_action),3)
              # if episode <= 0:
              #   savepath = os.path.join(model_dir, f'ep_{episode:04d}_{round(episode_reward,2)}.pth')
              #   torch.save(self.q_network.state_dict(), savepath)
              #   grade = self.scoreTheThing(savepath)
              #   print(f"Ep: {episode:04d}  {savepath}  {perper} ")
              if episode > 1:
                grade = self.scoreTheThing(savepath)
                if grade >= 6:
                  savepath = os.path.join(success, f'ep_{episode:04d}_{round(episode_reward,2)}_{grade}.pth')
                  torch.save(self.q_network.state_dict(), savepath)
                  print(f"Ep: {episode:04d}  {savepath}  {perper} {grade} ")



            '''Every k episodes: set target network QT=QA'''
            if episode % self.target_update_freq == 0 and episode != 0:
              # print('target net updated',"  epsilon: ", self.epsilon)
              self.target_network.load_state_dict(self.q_network.state_dict())


        return model_dir, self.episode_returns

    def learn(self):
        # print("learning")
        """Sample random minibatch of  transitions from D"""
        obs, action, reward, next_obs, _ = self.rb.sample(self.batch_size)
        obs       = torch.tensor(obs,      device=self.device, dtype=torch.float)
        action    = torch.tensor(action,   device=self.device, dtype=torch.long)
        reward    = torch.tensor(reward,   device=self.device, dtype=torch.float)
        next_obs  = torch.tensor(next_obs, device=self.device, dtype=torch.float)
        # done      = torch.tensor(done, device=self.device, dtype=torch.bool)

        # disc_action = self.q_network.select_discrete_action(obs, self.device)
        # cont_action = self.q_network.action_discrete_to_continuous(disc_action)
              # print(action,action_from_the_net)
        '''Execute at and observe rt+1, st+1; If  st+1 is terminal: break (end episode)'''
        # obs_next, reward, done, _ = self.env.step(cont_action)


        action_Q = self.q_network(obs, self.device).gather(1, action.unsqueeze(1)).squeeze(1)
        max_target_Q = self.target_network(next_obs, self.device).max(1)[0].detach()

        whatIjustLearn = reward + self.discount * max_target_Q

        '''Perform batch GD on QA with L = F( QA(st, at) - [ rt + 𝛾 max_a QT(st+1, a) ] )'''
        loss = F.mse_loss(action_Q, whatIjustLearn)
        # loss = nn.SmoothL1Loss(current_q_values, whatIjustLearn)
        self.optimizer.zero_grad()
        loss.backward()
        # nn.utils.clip_grad_norm_(list(self.q_network.parameters()), 0.2)
        self.optimizer.step()

    def scoreTheThing(self,model_path):
      arm = Robot(
        ArmDynamics(num_links=2,link_mass=0.1,link_length=1,joint_viscous_friction=0.1,dt=0.01,gravity=False))
      arm.reset()
      env = ArmEnv(arm, gui=False)
      device = torch.device('cuda')
      qnet = QNetwork(env).to(device)
      qnet.load_state_dict(torch.load(str(model_path)))
      qnet.eval()
      score = compute_score(qnet, env, device)
      return score

    # def saveEPS(self, episode, total_reward, model_dir):
    #     """Handles the end of an episode including logging and model saving."""
    #     if episode % 10 == 0:
    #         savepath = os.path.join(model_dir, f'q_network_ep_{episode:04d}.pth')
    #         torch.save(self.q_network.state_dict(), savepath)
    #         print(f"Ep: {episode}  {savepath}  TR: {total_reward}")

# start = 0.96
# decay = 0.988
# min = 0.3
# for i in range(1500):
#   if i % 10 == 0 and i != 0:
#     start = max(min, start * decay)
#     print(f"{i:04d}  {start}")
# from robot import Robot
# from arm_dynamics import ArmDynamics
# # !rm -rf /content/models/
# # DO NOT CHANGE
# # ---------------
arm = Robot(
        ArmDynamics(
            num_links=2,
            link_mass=0.1,
            link_length=1,
            joint_viscous_friction=0.1,
            dt=0.01,
            gravity=False
        )
    )
arm.reset()
env = ArmEnv(arm, gui=False)
tqdn = TrainDQN(env)
# ---------------

# Call your train function here
model_path,LotsOfReturn = tqdn.train()
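
`learn()` in the cell above builds its target as `reward + self.discount * max_target_Q` and discards the `done` flag sampled from the replay buffer. A common DQN variant drops the bootstrap term for terminal transitions. The sketch below shows that arithmetic in NumPy on hand-made numbers. The function name `td_targets` is hypothetical, and whether the mask matters here depends on how `ArmEnv` sets `done`, which this section does not show.

```python
import numpy as np


def td_targets(rewards, next_q_values, dones, discount):
    """Compute r + gamma * max_a Q_target(s', a), with no bootstrap on terminal steps.

    rewards:       shape (batch,)
    next_q_values: shape (batch, num_actions), target-network outputs for s'
    dones:         shape (batch,), True where s' is terminal
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    next_q_values = np.asarray(next_q_values, dtype=np.float64)
    # 1.0 for non-terminal transitions, 0.0 for terminal ones.
    not_done = 1.0 - np.asarray(dones, dtype=np.float64)
    max_next_q = next_q_values.max(axis=1)
    return rewards + discount * not_done * max_next_q


targets = td_targets(
    rewards=[-1.0, -0.5],
    next_q_values=[[0.0, 2.0], [4.0, 1.0]],
    dones=[False, True],
    discount=0.5,
)
print(targets)
```

The second transition is marked terminal, so its target is just its reward. The same masking could be applied to the tensors in `learn()` by converting the sampled `done` values to a float tensor.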






In [None]:
%%script false --no-raise-error
# @title Training center 9
#lots of 6.5 /content/models/2024-04-25_12-33-13 success
# !rm -rf /content/models/
import os
import time
# os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader

# from score import compute_score
from render import Renderer
from arm_env import ArmEnv
from robot import Robot
from arm_dynamics import ArmDynamics
from geometry import polar2cartesian
from IPython.display import clear_output

class TrainDQN:
    def __init__(self, env, seed=8848):

        torch.manual_seed(seed)

        np.random.seed(seed)

        # torch.cuda.manual_seed_all(seed)
        self.env = env
        # self.device = torch.device('cpu')
        self.device = torch.device('cuda')
        self.q_network = QNetwork(env).to(self.device) #action network
        self.target_network = QNetwork(env).to(self.device)
        self.target_network.load_state_dict(self.q_network.state_dict())
        self.optimizer = optim.Adam(self.q_network.parameters(), lr=0.0001,weight_decay=0.0001)
        # self.optimizer =torch.optim.AdamW(self.q_network.parameters(), lr=0.0001, betas=(0.9, 0.999), eps=1e-08, amsgrad=False)
        self.rb = ReplayBuffer(12000)
        self.batch_size = 512
        self.discount = 0.899 # (0.8, 0.9, 0.95, 0.99)
        self.target_update_freq = 50
        self.num_episodes = 10000
        self.epsilon = 1
        self.epsilon_decay = 0.973
        self.min_epsilon = 0.35 # not working 0.01 0.05 0.1 0.15 0.25
        self.episode_returns = []
        goal1 = polar2cartesian(1.9, -0.25 - np.pi/2.0)
        goal2 = polar2cartesian(1.6, 0.25 - np.pi/2.0)
        goal3 = polar2cartesian(1.8, 0.3 - np.pi/2.0)
        goal4 = polar2cartesian(1.5, 0.3 - np.pi/2.0)
        goal5 = polar2cartesian(1.6, 0.40 - np.pi/2.0)

        self.goal_list = [goal1,goal2,goal3,goal4,goal5]
        model_path = ''
        if model_path:
            self.q_network.load_state_dict(torch.load(model_path))
            self.target_network.load_state_dict(self.q_network.state_dict())
            print(f"Loaded model weights from {model_path}")

    def save_model(self, episode_num, save_dir='models'):
        timestr = time.strftime("%Y-%m-%d_%H-%M-%S")
        model_dir = os.path.join(save_dir, timestr)
        if not os.path.exists(model_dir):
            os.makedirs(model_dir)

        savepath = os.path.join(model_dir, f'q_network_ep_{episode_num:04d}.pth')
        torch.save(self.q_network.state_dict(), savepath)
        print(f'model saved to {savepath}\n')
        return savepath

    def train(self):
        goal = polar2cartesian(1.6, 0.25 - np.pi/2.0)
        timestr = time.strftime("%Y-%m-%d_%H-%M-%S")
        model_dir = os.path.join('models', timestr)
        os.makedirs(model_dir, exist_ok=True)  # Ensure directory exists
        # step_count = 0
        best_TR = float('inf')
        whichGoal = 0

        success = os.path.join('models', str(timestr+" success"))
        if not os.path.exists(success):
          os.makedirs(success)

        for episode in range(self.num_episodes):
            '''Initialize episode_reward = 0; s1 = random start state; reset robot to match s1'''
            episode_reward = 0

            # obs = self.env.reset(self.goal_list[whichGoal])#goal = [[0.5],[-1.5]]
            obs = self.env.reset()#goal = [[0.5],[-1.5]]
            whichGoal += 1
            if whichGoal == 4:
              whichGoal = 0
            done = False
            last_distance = np.inf
            if episode % 11 == 0 :
              num_random_action = 0
              num_maxQ_action = 0

            if episode % 10 == 0 and episode != 0:
                # Update exploration rate
                self.epsilon = max(self.min_epsilon, self.epsilon * self.epsilon_decay)
                # print(self.epsilon)
            for t in range(200):

               #Adaptive Epsilon:
              # obs = self.env.reset()
              obs = self.env.get_obs()
              """Select an action with an epsilon-greedy strategy:
              with probability epsilon, a_t is random; otherwise a_t = argmax_a Q^A(s_t, a)."""
              if np.random.rand() > self.epsilon:
                  disc_action = self.q_network.select_discrete_action(obs, self.device)
                  num_maxQ_action +=1
              else:
                  disc_action = np.random.randint(0, self.q_network.get_mapSize())
                  num_random_action +=1


              # print(self.q_network.get_mapSize())
              # action = self.q_network.select_discrete_action(obs, self.device)

              # disc_action = q_network.select_discrete_action(obs,device)
              # cont_action = q_network.action_discrete_to_continuous(disc_action)
              # obs, reward, done, _ = env.step(cont_action)
              '''get the action '''
              cont_action = self.q_network.action_discrete_to_continuous(disc_action)
              # print(action,action_from_the_net)
              '''Execute at and observe rt+1, st+1; If  st+1 is terminal: break (end episode)'''
              last_distance = np.linalg.norm(obs[[[4, 5]]] - obs[[[6, 7]]])


              obs_next, reward, done, _ = self.env.step(cont_action)
              current_distance = np.linalg.norm(obs_next[[[4, 5]]] - obs_next[[[6, 7]]])

              # current_distance = np.sqrt(np.abs(reward))
              episode_reward += reward

              # if current_distance < last_distance:
              #     reward += 0.1  # Bonus for moving closer to the goal
              # else:
              #     reward -= 0.3
              '''episode reward += rt+1'''

              '''Store transition (st, at, rt+1, st+1) in D (remove old data if  needed)'''
              # if current_distance < last_distance:
                # print(current_distance,last_distance,current_distance < last_distance)
              self.rb.put((obs, disc_action, reward, obs_next, done))
              last_distance = current_distance
              # obs, actions, rewards, next_obs, dones = self.env.step(cont_action)
              #Train network

              if len(self.rb.buffer) > self.batch_size:
                self.learn()
              '''Set st = st+1'''
              # obs = obs_next

            # print(len(self.rb.buffer))

            self.episode_returns.append(episode_reward)

            if episode % 1 == 0 or episode > 900:#and total_reward > -50
              # if episode_reward > best_TR:
              #   best_TR = episode_reward
              #   print(f"------------------------------------------------------------")
              clear_output(wait=True)
              print(episode)


              savepath = os.path.join(model_dir, f'ep_{episode:04d}_{round(episode_reward,2)}.pth')
              torch.save(self.q_network.state_dict(), savepath)
              perper = round(num_maxQ_action / (num_maxQ_action+num_random_action),3)
              # if episode <= 0:
              #   savepath = os.path.join(model_dir, f'ep_{episode:04d}_{round(episode_reward,2)}.pth')
              #   torch.save(self.q_network.state_dict(), savepath)
              #   grade = self.scoreTheThing(savepath)
              #   print(f"Ep: {episode:04d}  {savepath}  {perper} ")
              if episode > 1:
                grade = self.scoreTheThing(savepath)
                if grade >= 6:
                  savepath = os.path.join(success, f'ep_{episode:04d}_{round(episode_reward,2)}_{grade}.pth')
                  torch.save(self.q_network.state_dict(), savepath)
                  print(f"Ep: {episode:04d}  {savepath}  {perper} {grade} ")



            '''Every k episodes: set target network QT=QA'''
            if episode % self.target_update_freq == 0 and episode != 0:
              # print('target net updated',"  epsilon: ", self.epsilon)
              self.target_network.load_state_dict(self.q_network.state_dict())


        return model_dir, self.episode_returns

    def learn(self):
        # print("learning")
        """Sample random minibatch of  transitions from D"""
        obs, action, reward, next_obs, _ = self.rb.sample(self.batch_size)
        obs       = torch.tensor(obs,      device=self.device, dtype=torch.float)
        action    = torch.tensor(action,   device=self.device, dtype=torch.long)
        reward    = torch.tensor(reward,   device=self.device, dtype=torch.float)
        next_obs  = torch.tensor(next_obs, device=self.device, dtype=torch.float)
        # done      = torch.tensor(done, device=self.device, dtype=torch.bool)

        # disc_action = self.q_network.select_discrete_action(obs, self.device)
        # cont_action = self.q_network.action_discrete_to_continuous(disc_action)
              # print(action,action_from_the_net)
        '''Execute at and observe rt+1, st+1; If  st+1 is terminal: break (end episode)'''
        # obs_next, reward, done, _ = self.env.step(cont_action)


        action_Q = self.q_network(obs, self.device).gather(1, action.unsqueeze(1)).squeeze(1)
        max_target_Q = self.target_network(next_obs, self.device).max(1)[0].detach()

        whatIjustLearn = reward + self.discount * max_target_Q

        '''Perform batch GD on QA with L = F( QA(st, at) - [ rt + 𝛾 max_a QT(st+1, a) ] )'''
        loss = F.mse_loss(action_Q, whatIjustLearn)
        # loss = nn.SmoothL1Loss(current_q_values, whatIjustLearn)
        self.optimizer.zero_grad()
        loss.backward()
        # nn.utils.clip_grad_norm_(list(self.q_network.parameters()), 0.2)
        self.optimizer.step()

    def scoreTheThing(self,model_path):
      arm = Robot(
        ArmDynamics(num_links=2,link_mass=0.1,link_length=1,joint_viscous_friction=0.1,dt=0.01,gravity=False))
      arm.reset()
      env = ArmEnv(arm, gui=False)
      device = torch.device('cpu')
      qnet = QNetwork(env).to(device)
      qnet.load_state_dict(torch.load(str(model_path)))
      qnet.eval()
      score = compute_score(qnet, env, device)
      return score

    # def saveEPS(self, episode, total_reward, model_dir):
    #     """Handles the end of an episode including logging and model saving."""
    #     if episode % 10 == 0:
    #         savepath = os.path.join(model_dir, f'q_network_ep_{episode:04d}.pth')
    #         torch.save(self.q_network.state_dict(), savepath)
    #         print(f"Ep: {episode}  {savepath}  TR: {total_reward}")

# start = 0.96
# decay = 0.988
# min = 0.3
# for i in range(1500):
#   if i % 10 == 0 and i != 0:
#     start = max(min, start * decay)
#     print(f"{i:04d}  {start}")
# from robot import Robot
# from arm_dynamics import ArmDynamics
# # !rm -rf /content/models/
# # DO NOT CHANGE
# # ---------------
arm = Robot(
        ArmDynamics(
            num_links=2,
            link_mass=0.1,
            link_length=1,
            joint_viscous_friction=0.1,
            dt=0.01,
            gravity=False
        )
    )
arm.reset()
env = ArmEnv(arm, gui=False)
tqdn = TrainDQN(env)
# ---------------

# Call your train function here
model_path,LotsOfReturn = tqdn.train()
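
The training loop calls `self.rb.put(...)` and `self.rb.sample(...)` and reads `self.rb.buffer`, but `ReplayBuffer` itself is defined outside this section. A minimal sketch of a buffer with that interface is given below, assuming FIFO eviction and uniform sampling. The class name `SimpleReplayBuffer` is hypothetical and this is not the notebook's actual implementation.

```python
import random
from collections import deque


class SimpleReplayBuffer:
    """Hypothetical bounded FIFO buffer; not the notebook's ReplayBuffer."""

    def __init__(self, capacity):
        # deque(maxlen=...) silently drops the oldest item once full.
        self.buffer = deque(maxlen=capacity)

    def put(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        batch = random.sample(list(self.buffer), batch_size)
        # Transpose a list of (obs, action, reward, next_obs, done) tuples
        # into five parallel tuples, one per field.
        obs, actions, rewards, next_obs, dones = zip(*batch)
        return obs, actions, rewards, next_obs, dones


rb = SimpleReplayBuffer(capacity=3)
for step in range(5):
    rb.put((step, 0, -1.0, step + 1, False))
print(len(rb.buffer))
```

With a capacity such as the 12000 used above and 200 steps per episode, a buffer like this would hold roughly the last 60 episodes of experience, so older transitions stop influencing training after that point.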






In [None]:
%%script false --no-raise-error
# @title Training center 10 18 12 55

# !rm -rf /content/models/
import os
import time
# os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader

# from score import compute_score
from render import Renderer
from arm_env import ArmEnv
from robot import Robot
from arm_dynamics import ArmDynamics
from geometry import polar2cartesian
from IPython.display import clear_output

class TrainDQN:
    def __init__(self, env, seed=44):

        torch.manual_seed(seed)

        np.random.seed(seed)

        # torch.cuda.manual_seed_all(seed)
        self.env = env
        # self.device = torch.device('cpu')
        self.device = torch.device('cuda')
        self.q_network = QNetwork(env).to(self.device) #action network
        self.target_network = QNetwork(env).to(self.device)
        self.target_network.load_state_dict(self.q_network.state_dict())
        self.optimizer = optim.Adam(self.q_network.parameters(), lr=0.0001,weight_decay=0.0001)
        # self.optimizer =torch.optim.AdamW(self.q_network.parameters(), lr=0.0001, betas=(0.9, 0.999), eps=1e-08, amsgrad=False)
        self.rb = ReplayBuffer(12000)
        self.batch_size = 128
        self.discount = 0.899 # (0.8, 0.9, 0.95, 0.99)
        self.target_update_freq = 35
        self.num_episodes = 10000
        self.epsilon = 1
        self.epsilon_decay = 0.973
        self.min_epsilon = 0.32 # not working 0.01 0.05 0.1 0.15 0.25
        self.episode_returns = []
        goal1 = polar2cartesian(1.9, -0.25 - np.pi/2.0)
        goal2 = polar2cartesian(1.6, 0.25 - np.pi/2.0)
        goal3 = polar2cartesian(1.8, 0.3 - np.pi/2.0)
        goal4 = polar2cartesian(1.5, 0.3 - np.pi/2.0)
        goal5 = polar2cartesian(1.6, 0.40 - np.pi/2.0)

        self.goal_list = [goal1,goal2,goal3,goal4,goal5]
        model_path = ''
        if model_path:
            self.q_network.load_state_dict(torch.load(model_path))
            self.target_network.load_state_dict(self.q_network.state_dict())
            print(f"Loaded model weights from {model_path}")

    def save_model(self, episode_num, save_dir='models'):
        timestr = time.strftime("%Y-%m-%d_%H-%M-%S")
        model_dir = os.path.join(save_dir, timestr)
        if not os.path.exists(model_dir):
            os.makedirs(model_dir)

        savepath = os.path.join(model_dir, f'q_network_ep_{episode_num:04d}.pth')
        torch.save(self.q_network.state_dict(), savepath)
        print(f'model saved to {savepath}\n')
        return savepath

    def train(self):
        goal = polar2cartesian(1.6, 0.25 - np.pi/2.0)
        timestr = time.strftime("%Y-%m-%d_%H-%M-%S")
        model_dir = os.path.join('models', timestr)
        os.makedirs(model_dir, exist_ok=True)  # Ensure directory exists
        # step_count = 0
        best_TR = float('inf')
        whichGoal = 0

        success = os.path.join('models', str(timestr+" success"))
        if not os.path.exists(success):
          os.makedirs(success)

        for episode in range(self.num_episodes):
            '''Initialize episode_reward = 0; s1 = random start state; reset robot to match s1'''
            episode_reward = 0

            # obs = self.env.reset(self.goal_list[whichGoal])#goal = [[0.5],[-1.5]]
            obs = self.env.reset()#goal = [[0.5],[-1.5]]
            whichGoal = (whichGoal + 1) % len(self.goal_list)  # cycle through every goal in goal_list
            done = False
            last_distance = np.inf
            if episode % 11 == 0 :
              num_random_action = 0
              num_maxQ_action = 0

            if episode % 10 == 0 and episode != 0:
              # Update exploration rate
                self.epsilon = max(self.min_epsilon, self.epsilon * self.epsilon_decay)
                # print(self.epsilon)
            for t in range(200):

               #Adaptive Epsilon:
              # obs = self.env.reset()
              obs = self.env.get_obs()
              # Select an action with an epsilon-greedy strategy:
              # with probability epsilon, a_t is random; otherwise a_t = argmax_a Q^A(s_t, a).
              if np.random.rand() > self.epsilon:
                  disc_action = self.q_network.select_discrete_action(obs, self.device)
                  num_maxQ_action +=1
              else:
                  disc_action = np.random.randint(0, self.q_network.get_mapSize())
                  num_random_action +=1


              # print(self.q_network.get_mapSize())
              # action = self.q_network.select_discrete_action(obs, self.device)

              # disc_action = q_network.select_discrete_action(obs,device)
              # cont_action = q_network.action_discrete_to_continuous(disc_action)
              # obs, reward, done, _ = env.step(cont_action)
              '''get the action '''
              cont_action = self.q_network.action_discrete_to_continuous(disc_action)
              # print(action,action_from_the_net)
              '''Execute at and observe rt+1, st+1; If  st+1 is terminal: break (end episode)'''
              last_distance = np.linalg.norm(obs[[[4, 5]]] - obs[[[6, 7]]])


              obs_next, reward, done, _ = self.env.step(cont_action)
              current_distance = np.linalg.norm(obs_next[[[4, 5]]] - obs_next[[[6, 7]]])

              # current_distance = np.sqrt(np.abs(reward))
              episode_reward += reward

              # if current_distance < last_distance:
              #     reward += 0.1  # Bonus for moving closer to the goal
              # else:
              #     reward -= 0.3
              '''episode reward += rt+1'''

              '''Store transition (st, at, rt+1, st+1) in D (remove old data if  needed)'''
              # if current_distance < last_distance:
                # print(current_distance,last_distance,current_distance < last_distance)
              self.rb.put((obs, disc_action, reward, obs_next, done))
              last_distance = current_distance
              # obs, actions, rewards, next_obs, dones = self.env.step(cont_action)
              #Train network

              if len(self.rb.buffer) > self.batch_size:
                self.learn()
              '''Set st = st+1'''
              # obs = obs_next

            # print(len(self.rb.buffer))

            self.episode_returns.append(episode_reward)

            if episode % 1 == 0 or episode > 900:#and total_reward > -50
              # if episode_reward > best_TR:
              #   best_TR = episode_reward
              #   print(f"------------------------------------------------------------")
              clear_output(wait=True)
              print(f'{episode}    {str(model_dir)} ' )


              savepath = os.path.join(model_dir, f'ep_{episode:04d}_{round(episode_reward,2)}.pth')
              torch.save(self.q_network.state_dict(), savepath)
              perper = round(num_maxQ_action / (num_maxQ_action+num_random_action),3)
              # if episode <= 0:
              #   savepath = os.path.join(model_dir, f'ep_{episode:04d}_{round(episode_reward,2)}.pth')
              #   torch.save(self.q_network.state_dict(), savepath)
              #   grade = self.scoreTheThing(savepath)
              #   print(f"Ep: {episode:04d}  {savepath}  {perper} ")
              if episode > 1:
                grade = self.scoreTheThing(savepath)
                if grade >= 6:
                  savepath = os.path.join(success, f'ep_{episode:04d}_{round(episode_reward,2)}_{grade}.pth')
                  torch.save(self.q_network.state_dict(), savepath)
                  print(f"Ep: {episode:04d}  {savepath}  {perper} {grade} ")



            '''Every k episodes: set target network QT=QA'''
            if episode % self.target_update_freq == 0 and episode != 0:
              # print('target net updated',"  epsilon: ", self.epsilon)
              self.target_network.load_state_dict(self.q_network.state_dict())


        # Return the per-episode returns rather than the last `info` dict left in `_`
        return model_dir, self.episode_returns

    def learn(self):
        # print("learnining")
        """Sample random minibatch of  transitions from D"""
        obs, action, reward, next_obs, _ = self.rb.sample(self.batch_size)
        obs       = torch.tensor(obs,      device=self.device, dtype=torch.float)
        action    = torch.tensor(action,   device=self.device, dtype=torch.long)
        reward    = torch.tensor(reward,   device=self.device, dtype=torch.float)
        next_obs  = torch.tensor(next_obs, device=self.device, dtype=torch.float)
        # done      = torch.tensor(done, device=self.device, dtype=torch.bool)

        # disc_action = self.q_network.select_discrete_action(obs, self.device)
        # cont_action = self.q_network.action_discrete_to_continuous(disc_action)
              # print(action,action_from_the_net)
        '''Execute at and observe rt+1, st+1; If  st+1 is terminal: break (end episode)'''
        # obs_next, reward, done, _ = self.env.step(cont_action)


        action_Q = self.q_network(obs, self.device).gather(1, action.unsqueeze(1)).squeeze(1)
        max_target_Q = self.target_network(next_obs, self.device).max(1)[0].detach()

        whatIjustLearn = reward + self.discount * max_target_Q

        # Perform batch gradient descent on Q^A with L = F(Q^A(s_t, a_t) - [r_{t+1} + γ max_a Q^T(s_{t+1}, a)])
        loss = F.mse_loss(action_Q, whatIjustLearn)
        # loss = nn.SmoothL1Loss(current_q_values, whatIjustLearn)
        self.optimizer.zero_grad()
        loss.backward()
        # nn.utils.clip_grad_norm_(list(self.q_network.parameters()), 0.2)
        self.optimizer.step()

    def scoreTheThing(self,model_path):
      arm = Robot(
        ArmDynamics(num_links=2,link_mass=0.1,link_length=1,joint_viscous_friction=0.1,dt=0.01,gravity=False))
      arm.reset()
      env = ArmEnv(arm, gui=False)
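      device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
      qnet = QNetwork(env).to(device)
      qnet.load_state_dict(torch.load(str(model_path), map_location=device))  # load weights onto the scoring device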
      qnet.eval()
      score = compute_score(qnet, env, device)
      return score

    # def saveEPS(self, episode, total_reward, model_dir):
    #     """Handles the end of an episode including logging and model saving."""
    #     if episode % 10 == 0:
    #         savepath = os.path.join(model_dir, f'q_network_ep_{episode:04d}.pth')
    #         torch.save(self.q_network.state_dict(), savepath)
    #         print(f"Ep: {episode}  {savepath}  TR: {total_reward}")

# start = 0.96
# decay = 0.988
# min = 0.3
# for i in range(1500):
#   if i % 10 == 0 and i != 0:
#     start = max(min, start * decay)
#     print(f"{i:04d}  {start}")
# from robot import Robot
# from arm_dynamics import ArmDynamics
# # !rm -rf /content/models/
# # DO NOT CHANGE
# # ---------------
arm = Robot(
        ArmDynamics(
            num_links=2,
            link_mass=0.1,
            link_length=1,
            joint_viscous_friction=0.1,
            dt=0.01,
            gravity=False
        )
    )
arm.reset()
env = ArmEnv(arm, gui=False)
tqdn = TrainDQN(env)
# ---------------

# Call your train function here
model_path,LotsOfReturn = tqdn.train()
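
The training loop above multiplies epsilon by `epsilon_decay` every 10 episodes and clips it at `min_epsilon`. The sketch below replays that rule in plain Python to show when the floor is reached for the values used in this cell (1.0, 0.973, 0.32). The function name `epsilon_schedule` and its parameter names are illustrative assumptions, not part of the project code.

```python
def epsilon_schedule(num_episodes, start=1.0, decay=0.973, floor=0.32, every=10):
    """Return the epsilon in effect for each episode under the stepwise decay rule."""
    epsilon = start
    schedule = []
    for episode in range(num_episodes):
        # Same rule as the training loop: decay every `every` episodes, skipping episode 0
        if episode % every == 0 and episode != 0:
            epsilon = max(floor, epsilon * decay)
        schedule.append(epsilon)
    return schedule


schedule = epsilon_schedule(1000)
# First episode at which epsilon has been clipped to the floor value
first_floor_episode = next(i for i, eps in enumerate(schedule) if eps == 0.32)
print(first_floor_episode, schedule[0], schedule[10])
```

With these settings the floor is hit after 42 decay steps, so most of a 10000-episode run explores at the minimum rate.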






In [None]:
%%script false --no-raise-error
# @title Training center 11

# !rm -rf /content/models/
import os
import time

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader
from IPython.display import clear_output
# os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

# from score import compute_score
from render import Renderer
from arm_env import ArmEnv
from robot import Robot
from arm_dynamics import ArmDynamics
from geometry import polar2cartesian

class TrainDQN:
    def __init__(self, env, seed=44):

        torch.manual_seed(seed)

        np.random.seed(seed)

        # torch.cuda.manual_seed_all(seed)
        self.env = env
        # self.device = torch.device('cpu')
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.q_network = QNetwork(env).to(self.device) #action network
        self.target_network = QNetwork(env).to(self.device)
        self.target_network.load_state_dict(self.q_network.state_dict())
        self.optimizer = optim.Adam(self.q_network.parameters(), lr=0.0001,weight_decay=0.0001)
        # self.optimizer =torch.optim.AdamW(self.q_network.parameters(), lr=0.0001, betas=(0.9, 0.999), eps=1e-08, amsgrad=False)
        self.rb = ReplayBuffer(12000)
        self.batch_size = 128
        self.discount = 0.899 # (0.8, 0.9, 0.95, 0.99)
        self.target_update_freq = 20
        self.num_episodes = 10000
        self.epsilon = 1
        self.epsilon_decay = 0.973
        self.min_epsilon = 0.32 # not working 0.01 0.05 0.1 0.15 0.25
        self.episode_returns = []
        goal1 = polar2cartesian(1.9, -0.25 - np.pi/2.0)
        goal2 = polar2cartesian(1.6, 0.25 - np.pi/2.0)
        goal3 = polar2cartesian(1.8, 0.3 - np.pi/2.0)
        goal4 = polar2cartesian(1.5, 0.3 - np.pi/2.0)
        goal5 = polar2cartesian(1.6, 0.40 - np.pi/2.0)

        self.goal_list = [goal1,goal2,goal3,goal4,goal5]
        model_path = ''
        if model_path:
            self.q_network.load_state_dict(torch.load(model_path))
            self.target_network.load_state_dict(self.q_network.state_dict())
            print(f"Loaded model weights from {model_path}")

    def save_model(self, episode_num, save_dir='models'):
        timestr = time.strftime("%Y-%m-%d_%H-%M-%S")
        model_dir = os.path.join(save_dir, timestr)
        if not os.path.exists(model_dir):
            os.makedirs(model_dir)

        savepath = os.path.join(model_dir, f'q_network_ep_{episode_num:04d}.pth')
        torch.save(self.q_network.state_dict(), savepath)
        print(f'model saved to {savepath}\n')
        return savepath

    def train(self):
        goal = polar2cartesian(1.6, 0.25 - np.pi/2.0)
        timestr = time.strftime("%Y-%m-%d_%H-%M-%S")
        model_dir = os.path.join('models', timestr)
        os.makedirs(model_dir, exist_ok=True)  # Ensure directory exists
        # step_count = 0
        best_TR = float('inf')
        whichGoal = 0

        success = os.path.join('models', str(timestr+" success"))
        if not os.path.exists(success):
          os.makedirs(success)

        for episode in range(self.num_episodes):
            '''Initialize episode_reward = 0; s1 = random start state; reset robot to match s1'''
            episode_reward = 0

            # obs = self.env.reset(self.goal_list[whichGoal])#goal = [[0.5],[-1.5]]
            obs = self.env.reset()#goal = [[0.5],[-1.5]]
            whichGoal = (whichGoal + 1) % len(self.goal_list)  # cycle through every goal in goal_list
            done = False
            last_distance = np.inf
            if episode % 11 == 0 :
              num_random_action = 0
              num_maxQ_action = 0

            if episode % 10 == 0 and episode != 0:
              # Update exploration rate
                self.epsilon = max(self.min_epsilon, self.epsilon * self.epsilon_decay)
                # print(self.epsilon)
            for t in range(200):

               #Adaptive Epsilon:
              # obs = self.env.reset()
              obs = self.env.get_obs()
              # Select an action with an epsilon-greedy strategy:
              # with probability epsilon, a_t is random; otherwise a_t = argmax_a Q^A(s_t, a).
              if np.random.rand() > self.epsilon:
                  disc_action = self.q_network.select_discrete_action(obs, self.device)
                  num_maxQ_action +=1
              else:
                  disc_action = np.random.randint(0, self.q_network.get_mapSize())
                  num_random_action +=1


              # print(self.q_network.get_mapSize())
              # action = self.q_network.select_discrete_action(obs, self.device)

              # disc_action = q_network.select_discrete_action(obs,device)
              # cont_action = q_network.action_discrete_to_continuous(disc_action)
              # obs, reward, done, _ = env.step(cont_action)
              '''get the action '''
              cont_action = self.q_network.action_discrete_to_continuous(disc_action)
              # print(action,action_from_the_net)
              '''Execute at and observe rt+1, st+1; If  st+1 is terminal: break (end episode)'''
              last_distance = np.linalg.norm(obs[[[4, 5]]] - obs[[[6, 7]]])


              obs_next, reward, done, _ = self.env.step(cont_action)
              current_distance = np.linalg.norm(obs_next[[[4, 5]]] - obs_next[[[6, 7]]])

              # current_distance = np.sqrt(np.abs(reward))
              episode_reward += reward

              # if current_distance < last_distance:
              #     reward += 0.1  # Bonus for moving closer to the goal
              # else:
              #     reward -= 0.3
              '''episode reward += rt+1'''

              '''Store transition (st, at, rt+1, st+1) in D (remove old data if  needed)'''
              # if current_distance < last_distance:
                # print(current_distance,last_distance,current_distance < last_distance)
              self.rb.put((obs, disc_action, reward, obs_next, done))
              last_distance = current_distance
              # obs, actions, rewards, next_obs, dones = self.env.step(cont_action)
              #Train network

              if len(self.rb.buffer) > self.batch_size:
                self.learn()
              '''Set st = st+1'''
              # obs = obs_next

            # print(len(self.rb.buffer))

            self.episode_returns.append(episode_reward)

            if episode % 1 == 0 or episode > 900:#and total_reward > -50
              # if episode_reward > best_TR:
              #   best_TR = episode_reward
              #   print(f"------------------------------------------------------------")
              clear_output(wait=True)
              print(f'{episode}    {str(model_dir)} ' )


              savepath = os.path.join(model_dir, f'ep_{episode:04d}_{round(episode_reward,2)}.pth')
              torch.save(self.q_network.state_dict(), savepath)
              perper = round(num_maxQ_action / (num_maxQ_action+num_random_action),3)
              # if episode <= 0:
              #   savepath = os.path.join(model_dir, f'ep_{episode:04d}_{round(episode_reward,2)}.pth')
              #   torch.save(self.q_network.state_dict(), savepath)
              #   grade = self.scoreTheThing(savepath)
              #   print(f"Ep: {episode:04d}  {savepath}  {perper} ")
              if episode > 1:
                grade = self.scoreTheThing(savepath)
                if grade >= 6:
                  savepath = os.path.join(success, f'ep_{episode:04d}_{round(episode_reward,2)}_{grade}.pth')
                  torch.save(self.q_network.state_dict(), savepath)
                  print(f"Ep: {episode:04d}  {savepath}  {perper} {grade} ")



            '''Every k episodes: set target network QT=QA'''
            if episode % self.target_update_freq == 0 and episode != 0:
              # print('target net updated',"  epsilon: ", self.epsilon)
              self.target_network.load_state_dict(self.q_network.state_dict())


        # Return the per-episode returns rather than the last `info` dict left in `_`
        return model_dir, self.episode_returns

    def learn(self):
        # print("learnining")
        """Sample random minibatch of  transitions from D"""
        obs, action, reward, next_obs, _ = self.rb.sample(self.batch_size)
        obs       = torch.tensor(obs,      device=self.device, dtype=torch.float)
        action    = torch.tensor(action,   device=self.device, dtype=torch.long)
        reward    = torch.tensor(reward,   device=self.device, dtype=torch.float)
        next_obs  = torch.tensor(next_obs, device=self.device, dtype=torch.float)
        # done      = torch.tensor(done, device=self.device, dtype=torch.bool)

        # disc_action = self.q_network.select_discrete_action(obs, self.device)
        # cont_action = self.q_network.action_discrete_to_continuous(disc_action)
              # print(action,action_from_the_net)
        '''Execute at and observe rt+1, st+1; If  st+1 is terminal: break (end episode)'''
        # obs_next, reward, done, _ = self.env.step(cont_action)


        action_Q = self.q_network(obs, self.device).gather(1, action.unsqueeze(1)).squeeze(1)
        max_target_Q = self.target_network(next_obs, self.device).max(1)[0].detach()

        whatIjustLearn = reward + self.discount * max_target_Q

        # Perform batch gradient descent on Q^A with L = F(Q^A(s_t, a_t) - [r_{t+1} + γ max_a Q^T(s_{t+1}, a)])
        loss = F.mse_loss(action_Q, whatIjustLearn)
        # loss = nn.SmoothL1Loss(current_q_values, whatIjustLearn)
        self.optimizer.zero_grad()
        loss.backward()
        # nn.utils.clip_grad_norm_(list(self.q_network.parameters()), 0.2)
        self.optimizer.step()

    def scoreTheThing(self,model_path):
      arm = Robot(
        ArmDynamics(num_links=2,link_mass=0.1,link_length=1,joint_viscous_friction=0.1,dt=0.01,gravity=False))
      arm.reset()
      env = ArmEnv(arm, gui=False)
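      device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
      qnet = QNetwork(env).to(device)
      qnet.load_state_dict(torch.load(str(model_path), map_location=device))  # load weights onto the scoring device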
      qnet.eval()
      score = compute_score(qnet, env, device)
      return score

    # def saveEPS(self, episode, total_reward, model_dir):
    #     """Handles the end of an episode including logging and model saving."""
    #     if episode % 10 == 0:
    #         savepath = os.path.join(model_dir, f'q_network_ep_{episode:04d}.pth')
    #         torch.save(self.q_network.state_dict(), savepath)
    #         print(f"Ep: {episode}  {savepath}  TR: {total_reward}")

# start = 0.96
# decay = 0.988
# min = 0.3
# for i in range(1500):
#   if i % 10 == 0 and i != 0:
#     start = max(min, start * decay)
#     print(f"{i:04d}  {start}")
# from robot import Robot
# from arm_dynamics import ArmDynamics
# # !rm -rf /content/models/
# # DO NOT CHANGE
# # ---------------
arm = Robot(
        ArmDynamics(
            num_links=2,
            link_mass=0.1,
            link_length=1,
            joint_viscous_friction=0.1,
            dt=0.01,
            gravity=False
        )
    )
arm.reset()
env = ArmEnv(arm, gui=False)
tqdn = TrainDQN(env)
# ---------------

# Call your train function here
model_path,LotsOfReturn = tqdn.train()
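
The `learn` method above builds its target as `reward + self.discount * max_target_Q` and discards the stored `done` flag. A common variant drops the bootstrap term on terminal transitions. The plain-Python sketch below shows that arithmetic on two hand-made transitions. The helper `td_targets` is hypothetical, and whether masking is appropriate here depends on whether `ArmEnv` reports `done` for a true terminal state or only for a time limit, which this notebook section does not show.

```python
def td_targets(rewards, next_q_values, dones, discount):
    """One-step TD targets: r + discount * max_a Q_target(s', a), with no bootstrap after a terminal step."""
    targets = []
    for reward, q_row, done in zip(rewards, next_q_values, dones):
        # Terminal transitions contribute only their immediate reward
        bootstrap = 0.0 if done else max(q_row)
        targets.append(reward + discount * bootstrap)
    return targets


targets = td_targets(
    rewards=[-1.0, -0.5],
    next_q_values=[[0.2, 0.4], [1.0, -1.0]],  # target-network Q-values for each next state
    dones=[False, True],
    discount=0.99,
)
print(targets)
```

The first target bootstraps from the larger next-state value (0.4), while the second keeps only its reward because the transition is marked terminal.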






In [None]:
%%script false --no-raise-error
# @title Training center 13, good amount 6.5
# /content/models/2024-04-26_00-20-41 success
# !rm -rf /content/models/
import os
import time

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader
from IPython.display import clear_output
# os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

# from score import compute_score
from render import Renderer
from arm_env import ArmEnv
from robot import Robot
from arm_dynamics import ArmDynamics
from geometry import polar2cartesian

class TrainDQN:
    def __init__(self, env, seed=44):

        torch.manual_seed(seed)

        np.random.seed(seed)

        # torch.cuda.manual_seed_all(seed)
        self.env = env
        # self.device = torch.device('cpu')
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.q_network = QNetwork(env).to(self.device) #action network
        self.target_network = QNetwork(env).to(self.device)
        self.target_network.load_state_dict(self.q_network.state_dict())
        # self.optimizer = optim.Adam(self.q_network.parameters(), lr=0.00005,weight_decay=0.0001)
        self.optimizer =torch.optim.AdamW(self.q_network.parameters(), lr=0.00005, betas=(0.9, 0.999), eps=1e-08, amsgrad=False)
        self.rb = ReplayBuffer(12000)
        self.batch_size = 256
        self.discount = 0.99 # (0.8, 0.9, 0.95, 0.99)
        self.target_update_freq = 30
        self.num_episodes = 10000
        self.epsilon = 1
        self.epsilon_decay = 0.98
        self.min_epsilon = 0.2 # not working 0.01 0.05 0.1 0.15 0.25
        self.episode_returns = []
        goal1 = polar2cartesian(1.9, -0.25  - np.pi/2.0)
        goal2 = polar2cartesian(1.6,  0.25  - np.pi/2.0)
        goal3 = polar2cartesian(1.8,  0.3   - np.pi/2.0)
        goal4 = polar2cartesian(1.5,  0.3   - np.pi/2.0)
        goal5 = polar2cartesian(1.6,  0.40  - np.pi/2.0)

        self.goal_list = [goal1,goal2,goal3,goal4,goal5]
        model_path = ''
        if model_path:
            self.q_network.load_state_dict(torch.load(model_path))
            self.target_network.load_state_dict(self.q_network.state_dict())
            print(f"Loaded model weights from {model_path}")

    def train(self):
        goal = polar2cartesian(1.6, 0.25 - np.pi/2.0)
        timestr = time.strftime("%Y-%m-%d_%H-%M-%S")
        model_dir = os.path.join('models', timestr)
        os.makedirs(model_dir, exist_ok=True)  # Ensure directory exists
        # step_count = 0
        best_TR = float('inf')
        whichGoal = 0

        success = os.path.join('models', str(timestr+" success"))
        if not os.path.exists(success):
          os.makedirs(success)

        for episode in range(self.num_episodes):
            '''Initialize episode_reward = 0; s1 = random start state; reset robot to match s1'''
            episode_reward = 0

            # obs = self.env.reset(self.goal_list[whichGoal])#goal = [[0.5],[-1.5]]
            obs = self.env.reset()#goal = [[0.5],[-1.5]]
            whichGoal = (whichGoal + 1) % len(self.goal_list)  # cycle through every goal in goal_list
            done = False
            last_distance = np.inf
            if episode % 11 == 0 :
              num_random_action = 0
              num_maxQ_action = 0

            if episode % 10 == 0 and episode != 0:
              # Update exploration rate
                self.epsilon = max(self.min_epsilon, self.epsilon * self.epsilon_decay)
                # print(self.epsilon)
            for t in range(200):


              # obs = self.env.reset()
              obs = self.env.get_obs()
              # Select an action with an epsilon-greedy strategy:
              # with probability epsilon, a_t is random; otherwise a_t = argmax_a Q^A(s_t, a).
              if np.random.rand() > self.epsilon:
                  disc_action = self.q_network.select_discrete_action(obs, self.device)
                  num_maxQ_action +=1
              else:
                  disc_action = np.random.randint(0, self.q_network.get_mapSize())
                  num_random_action +=1


              # print(self.q_network.get_mapSize())
              # action = self.q_network.select_discrete_action(obs, self.device)

              # disc_action = q_network.select_discrete_action(obs,device)
              # cont_action = q_network.action_discrete_to_continuous(disc_action)
              # obs, reward, done, _ = env.step(cont_action)
              '''get the action '''
              cont_action = self.q_network.action_discrete_to_continuous(disc_action)
              # print(action,action_from_the_net)
              '''Execute at and observe rt+1, st+1; If  st+1 is terminal: break (end episode)'''
              last_distance = np.linalg.norm(obs[[[4, 5]]] - obs[[[6, 7]]])


              obs_next, reward, done, _ = self.env.step(cont_action)
              current_distance = np.linalg.norm(obs_next[[[4, 5]]] - obs_next[[[6, 7]]])

              # current_distance = np.sqrt(np.abs(reward))
              episode_reward += reward

              # if current_distance < last_distance:
              #     reward += 0.1  # Bonus for moving closer to the goal
              # else:
              #     reward -= 0.3
              '''episode reward += rt+1'''

              '''Store transition (st, at, rt+1, st+1) in D (remove old data if  needed)'''
              # if current_distance < last_distance:
                # print(current_distance,last_distance,current_distance < last_distance)
              self.rb.put((obs, disc_action, reward, obs_next, done))
              last_distance = current_distance
              # obs, actions, rewards, next_obs, dones = self.env.step(cont_action)
              #Train network

              if len(self.rb.buffer) > self.batch_size:
                self.learn()
              '''Set st = st+1'''
              # obs = obs_next

            # print(len(self.rb.buffer))

            self.episode_returns.append(episode_reward)

            if episode % 1 == 0 or episode > 900:#and total_reward > -50
              # if episode_reward > best_TR:
              #   best_TR = episode_reward
              #   print(f"------------------------------------------------------------")
              clear_output(wait=True)
              print(f'{episode}    {str(model_dir)} ' )


              savepath = os.path.join(model_dir, f'ep_{episode:04d}_{round(episode_reward,2)}.pth')
              torch.save(self.q_network.state_dict(), savepath)
              perper = round(num_maxQ_action / (num_maxQ_action+num_random_action),3)
              # if episode <= 0:
              #   savepath = os.path.join(model_dir, f'ep_{episode:04d}_{round(episode_reward,2)}.pth')
              #   torch.save(self.q_network.state_dict(), savepath)
              #   grade = self.scoreTheThing(savepath)
              #   print(f"Ep: {episode:04d}  {savepath}  {perper} ")
              if episode > 1:
                grade = self.scoreTheThing(savepath)
                if grade >= 6.5:
                  savepath = os.path.join(success, f'ep_{episode:04d}_{round(episode_reward,2)}_{grade}.pth')
                  torch.save(self.q_network.state_dict(), savepath)
                  print(f"Ep: {episode:04d}  {savepath}  {perper} {grade} ")



            '''Every k episodes: set target network QT=QA'''
            if episode % self.target_update_freq == 0 and episode != 0:
              # print('target net updated',"  epsilon: ", self.epsilon)
              self.target_network.load_state_dict(self.q_network.state_dict())


        # Return the per-episode returns rather than the last `info` dict left in `_`
        return model_dir, self.episode_returns

    def learn(self):
        # print("learnining")
        """Sample random minibatch of  transitions from D"""
        obs, action, reward, next_obs, _ = self.rb.sample(self.batch_size)
        obs       = torch.tensor(obs,      device=self.device, dtype=torch.float)
        action    = torch.tensor(action,   device=self.device, dtype=torch.long)
        reward    = torch.tensor(reward,   device=self.device, dtype=torch.float)
        next_obs  = torch.tensor(next_obs, device=self.device, dtype=torch.float)
        # done      = torch.tensor(done, device=self.device, dtype=torch.bool)

        # disc_action = self.q_network.select_discrete_action(obs, self.device)
        # cont_action = self.q_network.action_discrete_to_continuous(disc_action)
              # print(action,action_from_the_net)
        '''Execute at and observe rt+1, st+1; If  st+1 is terminal: break (end episode)'''
        # obs_next, reward, done, _ = self.env.step(cont_action)


        action_Q = self.q_network(obs, self.device).gather(1, action.unsqueeze(1)).squeeze(1)
        max_target_Q = self.target_network(next_obs, self.device).max(1)[0].detach()

        whatIjustLearn = reward + self.discount * max_target_Q

        # Perform batch gradient descent on Q^A with L = F(Q^A(s_t, a_t) - [r_{t+1} + γ max_a Q^T(s_{t+1}, a)])
        loss = F.mse_loss(action_Q, whatIjustLearn)
        # loss = nn.SmoothL1Loss(current_q_values, whatIjustLearn)
        self.optimizer.zero_grad()
        loss.backward()
        # nn.utils.clip_grad_norm_(list(self.q_network.parameters()), 0.2)
        self.optimizer.step()

    def scoreTheThing(self,model_path):
      arm = Robot(
        ArmDynamics(num_links=2,link_mass=0.1,link_length=1,joint_viscous_friction=0.1,dt=0.01,gravity=False))
      arm.reset()
      env = ArmEnv(arm, gui=False)
      device = torch.device('cpu')
      qnet = QNetwork(env).to(device)
      qnet.load_state_dict(torch.load(str(model_path), map_location=device))  # weights were saved from the training device
      qnet.eval()
      score = compute_score(qnet, env, device)
      return score

arm = Robot(
        ArmDynamics(
            num_links=2,
            link_mass=0.1,
            link_length=1,
            joint_viscous_friction=0.1,
            dt=0.01,
            gravity=False
        )
    )
arm.reset()
env = ArmEnv(arm, gui=False)
tqdn = TrainDQN(env)
# ---------------

# Call your train function here
model_path,LotsOfReturn = tqdn.train()
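
The trainer relies on `ReplayBuffer(12000)` with `put`, `sample`, and a `buffer` attribute, defined elsewhere in this notebook. As a rough illustration of how a fixed-capacity buffer of this kind can behave, here is a minimal stand-in. `SimpleReplayBuffer` is a hypothetical name and is not the course implementation; it only assumes the same three members and the five-field transition tuple used above.

```python
import random
from collections import deque


class SimpleReplayBuffer:
    """Hypothetical stand-in for the notebook's ReplayBuffer (not the course implementation)."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest transition once capacity is reached
        self.buffer = deque(maxlen=capacity)

    def put(self, transition):
        # transition = (obs, action, reward, next_obs, done)
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform sampling without replacement, then regroup the batch field by field
        batch = random.sample(list(self.buffer), batch_size)
        obs, action, reward, next_obs, done = zip(*batch)
        return list(obs), list(action), list(reward), list(next_obs), list(done)


rb = SimpleReplayBuffer(capacity=3)
for step in range(5):
    rb.put((step, 0, -1.0, step + 1, False))
print(len(rb.buffer), rb.buffer[0][0])
```

With a capacity of 12000 and 200 steps per episode, a buffer like this would hold roughly the 60 most recent episodes.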



In [None]:
%%script false --no-raise-error
# @title Training center 14, with 49 action space
# /content/models/2024-04-26_00-20-41 success
# !rm -rf /content/models/
import os
import time

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader
from IPython.display import clear_output
# os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

# from score import compute_score
from render import Renderer
from arm_env import ArmEnv
from robot import Robot
from arm_dynamics import ArmDynamics
from geometry import polar2cartesian

class TrainDQN:
    def __init__(self, env, seed=44):

        torch.manual_seed(seed)

        np.random.seed(seed)

        # torch.cuda.manual_seed_all(seed)
        self.env = env
        # self.device = torch.device('cpu')
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.q_network = QNetwork(env).to(self.device) #action network
        self.target_network = QNetwork(env).to(self.device)
        self.target_network.load_state_dict(self.q_network.state_dict())
        # self.optimizer = optim.Adam(self.q_network.parameters(), lr=0.00005,weight_decay=0.0001)
        self.optimizer =torch.optim.AdamW(self.q_network.parameters(), lr=0.00005, betas=(0.9, 0.999), eps=1e-08, amsgrad=False)
        self.rb = ReplayBuffer(12000)
        self.batch_size = 256
        self.discount = 0.99 # (0.8, 0.9, 0.95, 0.99)
        self.target_update_freq = 30
        self.num_episodes = 10000
        self.epsilon = 1
        self.epsilon_decay = 0.98 #0.9999
        self.min_epsilon = 0.32 # not working 0.01 0.05 0.1 0.15 0.25
        self.episode_returns = []
        goal1 = polar2cartesian(1.9, -0.25  - np.pi/2.0)
        goal2 = polar2cartesian(1.6,  0.25  - np.pi/2.0)
        goal3 = polar2cartesian(1.8,  0.3   - np.pi/2.0)
        goal4 = polar2cartesian(1.5,  0.3   - np.pi/2.0)
        goal5 = polar2cartesian(1.6,  0.40  - np.pi/2.0)

        self.goal_list = [goal1,goal2,goal3,goal4,goal5]
        model_path = ''
        if model_path:
            self.q_network.load_state_dict(torch.load(model_path))
            self.target_network.load_state_dict(self.q_network.state_dict())
            print(f"Loaded model weights from {model_path}")

    def train(self):
        goal = polar2cartesian(1.6, 0.25 - np.pi/2.0)
        timestr = time.strftime("%Y-%m-%d_%H-%M-%S")
        model_dir = os.path.join('models', timestr)
        os.makedirs(model_dir, exist_ok=True)  # Ensure directory exists
        # step_count = 0
        best_TR = float('inf')
        whichGoal = 0

        success = os.path.join('models', str(timestr+" success"))
        if not os.path.exists(success):
          os.makedirs(success)

        for episode in range(self.num_episodes):
            '''Initialize episode_reward = 0; s1 = random start state; reset robot to match s1'''
            episode_reward = 0

            # obs = self.env.reset(self.goal_list[whichGoal])#goal = [[0.5],[-1.5]]
            obs = self.env.reset()#goal = [[0.5],[-1.5]]
            whichGoal += 1
            if whichGoal == 4:
              whichGoal = 0
            done = False
            last_distance = np.inf
            if episode % 11 == 0 :
              num_random_action = 0
              num_maxQ_action = 0

            if episode % 10 == 0 and episode != 0:
              # Update exploration rate
                self.epsilon = max(self.min_epsilon, self.epsilon * self.epsilon_decay)
                # print(self.epsilon)
            for t in range(200):

               #Adaptive Epsilon:
              # obs = self.env.reset()
              obs = self.env.get_obs()
              """Select an action with an epsilon-greedy strategy.
              With probability epsilon, a_t is random; otherwise a_t = argmax_a Q^A(s_t, a)."""
              if np.random.rand() > self.epsilon:
                  disc_action = self.q_network.select_discrete_action(obs, self.device)
                  num_maxQ_action +=1
              else:
                  disc_action = np.random.randint(0, self.q_network.get_mapSize())
                  num_random_action +=1


              # print(self.q_network.get_mapSize())
              # action = self.q_network.select_discrete_action(obs, self.device)

              # disc_action = q_network.select_discrete_action(obs,device)
              # cont_action = q_network.action_discrete_to_continuous(disc_action)
              # obs, reward, done, _ = env.step(cont_action)
              '''get the action '''
              cont_action = self.q_network.action_discrete_to_continuous(disc_action)
              # print(action,action_from_the_net)
              '''Execute at and observe rt+1, st+1; If  st+1 is terminal: break (end episode)'''
              last_distance = np.linalg.norm(obs[[[4, 5]]] - obs[[[6, 7]]])


              obs_next, reward, done, _ = self.env.step(cont_action)
              current_distance = np.linalg.norm(obs_next[[[4, 5]]] - obs_next[[[6, 7]]])

              # current_distance = np.sqrt(np.abs(reward))
              episode_reward += reward

              # if current_distance < last_distance:
              #     reward += 0.1  # Bonus for moving closer to the goal
              # else:
              #     reward -= 0.3
              '''episode reward += rt+1'''

              '''Store transition (st, at, rt+1, st+1) in D (remove old data if  needed)'''
              # if current_distance < last_distance:
                # print(current_distance,last_distance,current_distance < last_distance)
              self.rb.put((obs, disc_action, reward, obs_next, done))
              last_distance = current_distance
              # obs, actions, rewards, next_obs, dones = self.env.step(cont_action)
              #Train network

              if len(self.rb.buffer) > self.batch_size:
                self.learn()
              '''Set st = st+1'''
              # obs = obs_next

            # print(len(self.rb.buffer))

            self.episode_returns.append(episode_reward)
            if episode <=300:
              clear_output(wait=True)
              print(f'{episode}' )
            if episode > 300:#and total_reward > -50
              # if episode_reward > best_TR:
              #   best_TR = episode_reward
              #   print(f"------------------------------------------------------------")
              clear_output(wait=True)
              print(f'{episode}    {str(model_dir)} ' )


              savepath = os.path.join(model_dir, f'ep_{episode:04d}_{round(episode_reward,2)}.pth')
              torch.save(self.q_network.state_dict(), savepath)
              perper = round(num_maxQ_action / (num_maxQ_action+num_random_action),3)
              # if episode <= 0:
              #   savepath = os.path.join(model_dir, f'ep_{episode:04d}_{round(episode_reward,2)}.pth')
              #   torch.save(self.q_network.state_dict(), savepath)
              #   grade = self.scoreTheThing(savepath)
              #   print(f"Ep: {episode:04d}  {savepath}  {perper} ")
              if episode > 1:
                grade = self.scoreTheThing(savepath)
                if grade >= 6.5:
                  savepath = os.path.join(success, f'ep_{episode:04d}_{round(episode_reward,2)}_{grade}.pth')
                  torch.save(self.q_network.state_dict(), savepath)
                  print(f"Ep: {episode:04d}  {savepath}  {perper} {grade} ")



            '''Every k episodes: set target network QT=QA'''
            if episode % self.target_update_freq == 0 and episode != 0:
              # print('target net updated',"  epsilon: ", self.epsilon)
              self.target_network.load_state_dict(self.q_network.state_dict())


        return model_dir, self.episode_returns

    def learn(self):
        # print("learnining")
        """Sample random minibatch of  transitions from D"""
        obs, action, reward, next_obs, _ = self.rb.sample(self.batch_size)
        obs       = torch.tensor(obs,      device=self.device, dtype=torch.float)
        action    = torch.tensor(action,   device=self.device, dtype=torch.long)
        reward    = torch.tensor(reward,   device=self.device, dtype=torch.float)
        next_obs  = torch.tensor(next_obs, device=self.device, dtype=torch.float)
        # done      = torch.tensor(done, device=self.device, dtype=torch.bool)

        # disc_action = self.q_network.select_discrete_action(obs, self.device)
        # cont_action = self.q_network.action_discrete_to_continuous(disc_action)
              # print(action,action_from_the_net)
        '''Execute at and observe rt+1, st+1; If  st+1 is terminal: break (end episode)'''
        # obs_next, reward, done, _ = self.env.step(cont_action)


        action_Q = self.q_network(obs, self.device).gather(1, action.unsqueeze(1)).squeeze(1)
        max_target_Q = self.target_network(next_obs, self.device).max(1)[0].detach()

        whatIjustLearn = reward + self.discount * max_target_Q

        '''Perform batch GD on QA with L = F( QA(st, at) - [ rt + 𝛾 maxa QT(st+1, at+1)] )'''
        loss = F.mse_loss(action_Q, whatIjustLearn)
        # loss = F.smooth_l1_loss(action_Q, whatIjustLearn)
        self.optimizer.zero_grad()
        loss.backward()
        nn.utils.clip_grad_norm_(list(self.q_network.parameters()), 0.2)
        self.optimizer.step()

    def scoreTheThing(self,model_path):
      arm = Robot(
        ArmDynamics(num_links=2,link_mass=0.1,link_length=1,joint_viscous_friction=0.1,dt=0.01,gravity=False))
      arm.reset()
      env = ArmEnv(arm, gui=False)
      device = torch.device('cpu')
      qnet = QNetwork(env).to(device)
      qnet.load_state_dict(torch.load(str(model_path), map_location=device))
      qnet.eval()
      score = compute_score(qnet, env, device)
      return score

arm = Robot(
    ArmDynamics(
        num_links=2,
        link_mass=0.1,
        link_length=1,
        joint_viscous_friction=0.1,
        dt=0.01,
        gravity=False,
    )
)
arm.reset()
env = ArmEnv(arm, gui=False)
tqdn = TrainDQN(env)
# ---------------

# Call your train function here
model_path,LotsOfReturn = tqdn.train()

# /content/models/2024-04-26_02-22-36 success

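A note on the TD target used in `learn` above: the sampled `done` flags are discarded, so the bootstrap term `discount * max_target_Q` is added for every transition. If the environment ever emits a true terminal state, a common variant masks that term out. The sketch below is a standalone illustration on made-up tensors, not the notebook's `learn` method; the function name `td_loss` and the float `done` mask are assumptions introduced here.

```python
import torch
import torch.nn.functional as F


def td_loss(q_values, next_q_values, action, reward, done, discount):
    """MSE between Q(s_t, a_t) and r + discount * (1 - done) * max_a Q_target(s_{t+1}, a).

    q_values:      (B, A) Q-values from the online network for s_t
    next_q_values: (B, A) Q-values from the target network for s_{t+1}
    action:        (B,)   long tensor of discrete action indices
    reward, done:  (B,)   float tensors; done is 1.0 for terminal transitions
    """
    action_q = q_values.gather(1, action.unsqueeze(1)).squeeze(1)
    max_next_q = next_q_values.max(dim=1)[0].detach()
    # Terminal transitions have no future return, so the bootstrap term is zeroed.
    target = reward + discount * (1.0 - done) * max_next_q
    return F.mse_loss(action_q, target)


# Hypothetical batch of two transitions; the second one is terminal.
q = torch.tensor([[1.0, 2.0], [0.5, 0.0]])
nq = torch.tensor([[3.0, 1.0], [4.0, 2.0]])
a = torch.tensor([1, 0])
r = torch.tensor([-1.0, -2.0])
d = torch.tensor([0.0, 1.0])
loss = td_loss(q, nq, a, r, d, discount=0.5)
print(loss.item())
```

Whether masking helps here depends on what `done` means in `ArmEnv`: if it only signals the episode time limit, leaving the bootstrap term in place, as `learn` does, can be the better choice.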
In [None]:
%%script false --no-raise-error
# @title Training center 15, with smaller discount 0.88
# /content/models/2024-04-26_00-20-41 success
# !rm -rf /content/models/
import os
import time

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import clear_output
# os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

# from score import compute_score
from render import Renderer
from arm_env import ArmEnv
from robot import Robot
from arm_dynamics import ArmDynamics
from geometry import polar2cartesian

class TrainDQN:
    def __init__(self, env, seed=44):

        torch.manual_seed(seed)

        np.random.seed(seed)

        # torch.cuda.manual_seed_all(seed)
        self.env = env
        # Use the GPU when one is available; otherwise fall back to the CPU.
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.q_network = QNetwork(env).to(self.device) #action network
        self.target_network = QNetwork(env).to(self.device)
        self.target_network.load_state_dict(self.q_network.state_dict())
        # self.optimizer = optim.Adam(self.q_network.parameters(), lr=0.00005,weight_decay=0.0001)
        self.optimizer =torch.optim.AdamW(self.q_network.parameters(), lr=0.00005, betas=(0.9, 0.999), eps=1e-08, amsgrad=False)
        self.rb = ReplayBuffer(12000)
        self.batch_size = 256
        self.discount = 0.88 # (0.8, 0.9, 0.95, 0.99)
        self.target_update_freq = 30
        self.num_episodes = 10000
        self.epsilon = 1
        self.epsilon_decay = 0.99 #0.9999
        self.min_epsilon = 0.33 # not working 0.01 0.05 0.1 0.15 0.25
        self.episode_returns = []
        goal1 = polar2cartesian(1.9, -0.25  - np.pi/2.0)
        goal2 = polar2cartesian(1.6,  0.25  - np.pi/2.0)
        goal3 = polar2cartesian(1.8,  0.3   - np.pi/2.0)
        goal4 = polar2cartesian(1.5,  0.3   - np.pi/2.0)
        goal5 = polar2cartesian(1.6,  0.40  - np.pi/2.0)

        self.goal_list = [goal1,goal2,goal3,goal4,goal5]
        model_path = ''
        if model_path:
            self.q_network.load_state_dict(torch.load(model_path))
            self.target_network.load_state_dict(self.q_network.state_dict())
            print(f"Loaded model weights from {model_path}")

    def train(self):
        goal = polar2cartesian(1.6, 0.25 - np.pi/2.0)
        timestr = time.strftime("%Y-%m-%d_%H-%M-%S")
        model_dir = os.path.join('models', timestr)
        os.makedirs(model_dir, exist_ok=True)  # Ensure directory exists
        # step_count = 0
        best_TR = float('inf')
        whichGoal = 0

        success = os.path.join('models', str(timestr+" success"))
        if not os.path.exists(success):
          os.makedirs(success)

        for episode in range(self.num_episodes):
            '''Initialize episode_reward = 0; s1 = random start state; reset robot to match s1'''
            episode_reward = 0

            # obs = self.env.reset(self.goal_list[whichGoal])#goal = [[0.5],[-1.5]]
            obs = self.env.reset()#goal = [[0.5],[-1.5]]
            whichGoal += 1
            if whichGoal == 4:
              whichGoal = 0
            done = False
            last_distance = np.inf
            if episode % 11 == 0 :
              num_random_action = 0
              num_maxQ_action = 0

            if episode % 10 == 0 and episode != 0:
              # Update exploration rate
                self.epsilon = max(self.min_epsilon, self.epsilon * self.epsilon_decay)
                # print(self.epsilon)
            for t in range(200):

               #Adaptive Epsilon:
              # obs = self.env.reset()
              obs = self.env.get_obs()
              """Select an action with an epsilon-greedy strategy.
              With probability epsilon, a_t is random; otherwise a_t = argmax_a Q^A(s_t, a)."""
              if np.random.rand() > self.epsilon:
                  disc_action = self.q_network.select_discrete_action(obs, self.device)
                  num_maxQ_action +=1
              else:
                  disc_action = np.random.randint(0, self.q_network.get_mapSize())
                  num_random_action +=1


              # print(self.q_network.get_mapSize())
              # action = self.q_network.select_discrete_action(obs, self.device)

              # disc_action = q_network.select_discrete_action(obs,device)
              # cont_action = q_network.action_discrete_to_continuous(disc_action)
              # obs, reward, done, _ = env.step(cont_action)
              '''get the action '''
              cont_action = self.q_network.action_discrete_to_continuous(disc_action)
              # print(action,action_from_the_net)
              '''Execute at and observe rt+1, st+1; If  st+1 is terminal: break (end episode)'''
              last_distance = np.linalg.norm(obs[[[4, 5]]] - obs[[[6, 7]]])


              obs_next, reward, done, _ = self.env.step(cont_action)
              current_distance = np.linalg.norm(obs_next[[[4, 5]]] - obs_next[[[6, 7]]])

              # current_distance = np.sqrt(np.abs(reward))
              episode_reward += reward

              # if current_distance < last_distance:
              #     reward += 0.1  # Bonus for moving closer to the goal
              # else:
              #     reward -= 0.3
              '''episode reward += rt+1'''

              '''Store transition (st, at, rt+1, st+1) in D (remove old data if  needed)'''
              # if current_distance < last_distance:
                # print(current_distance,last_distance,current_distance < last_distance)
              self.rb.put((obs, disc_action, reward, obs_next, done))
              last_distance = current_distance
              # obs, actions, rewards, next_obs, dones = self.env.step(cont_action)
              #Train network

              if len(self.rb.buffer) > self.batch_size:
                self.learn()
              '''Set st = st+1'''
              # obs = obs_next

            # print(len(self.rb.buffer))

            self.episode_returns.append(episode_reward)
            if episode <=300:
              clear_output(wait=True)
              print(f'{episode}' )
            if episode > 300:#and total_reward > -50
              # if episode_reward > best_TR:
              #   best_TR = episode_reward
              #   print(f"------------------------------------------------------------")
              clear_output(wait=True)
              print(f'{episode}    {str(model_dir)} ' )


              savepath = os.path.join(model_dir, f'ep_{episode:04d}_{round(episode_reward,2)}.pth')
              torch.save(self.q_network.state_dict(), savepath)
              perper = round(num_maxQ_action / (num_maxQ_action+num_random_action),3)
              # if episode <= 0:
              #   savepath = os.path.join(model_dir, f'ep_{episode:04d}_{round(episode_reward,2)}.pth')
              #   torch.save(self.q_network.state_dict(), savepath)
              #   grade = self.scoreTheThing(savepath)
              #   print(f"Ep: {episode:04d}  {savepath}  {perper} ")
              if episode > 1:
                grade = self.scoreTheThing(savepath)
                if grade >= 6.5:
                  savepath = os.path.join(success, f'ep_{episode:04d}_{round(episode_reward,2)}_{grade}.pth')
                  torch.save(self.q_network.state_dict(), savepath)
                  print(f"Ep: {episode:04d}  {savepath}  {perper} {grade} ")



            '''Every k episodes: set target network QT=QA'''
            if episode % self.target_update_freq == 0 and episode != 0:
              # print('target net updated',"  epsilon: ", self.epsilon)
              self.target_network.load_state_dict(self.q_network.state_dict())


        return model_dir, self.episode_returns

    def learn(self):
        # print("learnining")
        """Sample random minibatch of  transitions from D"""
        obs, action, reward, next_obs, _ = self.rb.sample(self.batch_size)
        obs       = torch.tensor(obs,      device=self.device, dtype=torch.float)
        action    = torch.tensor(action,   device=self.device, dtype=torch.long)
        reward    = torch.tensor(reward,   device=self.device, dtype=torch.float)
        next_obs  = torch.tensor(next_obs, device=self.device, dtype=torch.float)
        # done      = torch.tensor(done, device=self.device, dtype=torch.bool)

        # disc_action = self.q_network.select_discrete_action(obs, self.device)
        # cont_action = self.q_network.action_discrete_to_continuous(disc_action)
              # print(action,action_from_the_net)
        '''Execute at and observe rt+1, st+1; If  st+1 is terminal: break (end episode)'''
        # obs_next, reward, done, _ = self.env.step(cont_action)


        action_Q = self.q_network(obs, self.device).gather(1, action.unsqueeze(1)).squeeze(1)
        max_target_Q = self.target_network(next_obs, self.device).max(1)[0].detach()

        whatIjustLearn = reward + self.discount * max_target_Q

        '''Perform batch GD on QA with L = F( QA(st, at) - [ rt + 𝛾 maxa QT(st+1, at+1)] )'''
        loss = F.mse_loss(action_Q, whatIjustLearn)
        # loss = F.smooth_l1_loss(action_Q, whatIjustLearn)
        self.optimizer.zero_grad()
        loss.backward()
        nn.utils.clip_grad_norm_(list(self.q_network.parameters()), 0.2)
        self.optimizer.step()

    def scoreTheThing(self,model_path):
      arm = Robot(
        ArmDynamics(num_links=2,link_mass=0.1,link_length=1,joint_viscous_friction=0.1,dt=0.01,gravity=False))
      arm.reset()
      env = ArmEnv(arm, gui=False)
      device = torch.device('cpu')
      qnet = QNetwork(env).to(device)
      qnet.load_state_dict(torch.load(str(model_path), map_location=device))
      qnet.eval()
      score = compute_score(qnet, env, device)
      return score

arm = Robot(
    ArmDynamics(
        num_links=2,
        link_mass=0.1,
        link_length=1,
        joint_viscous_friction=0.1,
        dt=0.01,
        gravity=False,
    )
)
arm.reset()
env = ArmEnv(arm, gui=False)
tqdn = TrainDQN(env)
# ---------------

# Call your train function here
model_path,LotsOfReturn = tqdn.train()


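The exploration schedule in the cell above multiplies `epsilon` by `epsilon_decay` once every 10 episodes and clamps it at `min_epsilon`. A small pure-Python sketch of that schedule makes it easy to see how long the decay takes. The helper names `epsilon_at` and `first_floor_episode` are hypothetical and exist only for this illustration.

```python
def epsilon_at(episode, decay, floor, every=10, start=1.0):
    """Epsilon in effect during `episode`, replaying the update rule from the training loop."""
    eps = start
    for ep in range(episode + 1):
        # Same condition as the loop: decay every `every` episodes, skipping episode 0.
        if ep % every == 0 and ep != 0:
            eps = max(floor, eps * decay)
    return eps


def first_floor_episode(decay, floor, every=10, start=1.0, max_episodes=100_000):
    """First episode at which epsilon has been clamped to `floor`, or None if never reached."""
    eps = start
    for ep in range(max_episodes):
        if ep % every == 0 and ep != 0:
            eps = max(floor, eps * decay)
        if eps <= floor:
            return ep
    return None


# Values from the cell above: epsilon_decay = 0.99, min_epsilon = 0.33.
print(epsilon_at(100, decay=0.99, floor=0.33))
print(first_floor_episode(decay=0.99, floor=0.33))
```

Calling `first_floor_episode` with a different `decay` and `floor` shows whether a given schedule reaches its floor within `num_episodes` at all.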

In [None]:
%%script false --no-raise-error
# @title Training center 15-2, with smaller discount 0.88 with specified goal
# /content/models/2024-04-26_00-20-41 success
# !rm -rf /content/models/
import os
import time

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import clear_output
# os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

# from score import compute_score
from render import Renderer
from arm_env import ArmEnv
from robot import Robot
from arm_dynamics import ArmDynamics
from geometry import polar2cartesian

class TrainDQN:
    def __init__(self, env, seed=44):

        torch.manual_seed(seed)

        np.random.seed(seed)

        # torch.cuda.manual_seed_all(seed)
        self.env = env
        # Use the GPU when one is available; otherwise fall back to the CPU.
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.q_network = QNetwork(env).to(self.device) #action network
        self.target_network = QNetwork(env).to(self.device)
        self.target_network.load_state_dict(self.q_network.state_dict())
        # self.optimizer = optim.Adam(self.q_network.parameters(), lr=0.00005,weight_decay=0.0001)
        self.optimizer =torch.optim.AdamW(self.q_network.parameters(), lr=0.00004, betas=(0.9, 0.999), eps=1e-08, amsgrad=False)
        self.rb = ReplayBuffer(12000)
        self.batch_size = 256
        self.discount =  0.95 # (0.8, 0.9, 0.95, 0.99)
        self.target_update_freq = 10
        self.num_episodes = 15000
        self.epsilon = 1
        self.epsilon_decay = 0.999 #0.9999
        self.min_epsilon = 0.30 # not working 0.01 0.05 0.1 0.15 0.25
        self.episode_returns = []
        goal1 = polar2cartesian(1.9, -0.25  - np.pi/2.0)
        goal2 = polar2cartesian(1.6,  0.25  - np.pi/2.0)
        goal3 = polar2cartesian(1.8,  0.3   - np.pi/2.0)
        goal4 = polar2cartesian(1.5,  0.3   - np.pi/2.0)
        goal5 = polar2cartesian(1.6,  0.40  - np.pi/2.0)

        self.goal_list = [goal1,goal2,goal3,goal4,goal5]
        model_path = '/content/models/2024-04-26_05-11-07/ep_0508_-24.5.pth'
        if model_path:
            self.q_network.load_state_dict(torch.load(model_path, map_location=self.device))
            self.target_network.load_state_dict(self.q_network.state_dict())
            print(f"Loaded model weights from {model_path}")

    def train(self):

        goal = polar2cartesian(1.6, 0.25 - np.pi/2.0)
        timestr = time.strftime("%Y-%m-%d_%H-%M-%S")
        model_dir = os.path.join('models', timestr)
        os.makedirs(model_dir, exist_ok=True)  # Ensure directory exists
        # step_count = 0
        best_TR = float('inf')
        whichGoal = 0
        good_boy = 0
        success = os.path.join(model_dir, "success")
        if not os.path.exists(success):
          os.makedirs(success)

        for episode in range(self.num_episodes):
            '''Initialize episode_reward = 0; s1 = random start state; reset robot to match s1'''
            episode_reward = 0

            # obs = self.env.reset(self.goal_list[whichGoal])#goal = [[0.5],[-1.5]]
            obs = self.env.reset()#goal = [[0.5],[-1.5]]
            whichGoal += 1
            if whichGoal == 4:
              whichGoal = 0
            done = False
            last_distance = np.inf
            if episode % 11 == 0 :
              num_random_action = 0
              num_maxQ_action = 0

            if episode % 10 == 0 and episode != 0:
              # Update exploration rate
                self.epsilon = max(self.min_epsilon, self.epsilon * self.epsilon_decay)
                # print(self.epsilon)
            for t in range(200):

               #Adaptive Epsilon:
              # obs = self.env.reset()
              obs = self.env.get_obs()
              """Select an action with an epsilon-greedy strategy.
              With probability epsilon, a_t is random; otherwise a_t = argmax_a Q^A(s_t, a)."""
              if np.random.rand() > self.epsilon:
                  disc_action = self.q_network.select_discrete_action(obs, self.device)
                  num_maxQ_action +=1
              else:
                  disc_action = np.random.randint(0, self.q_network.get_mapSize())
                  num_random_action +=1


              # print(self.q_network.get_mapSize())
              # action = self.q_network.select_discrete_action(obs, self.device)

              # disc_action = q_network.select_discrete_action(obs,device)
              # cont_action = q_network.action_discrete_to_continuous(disc_action)
              # obs, reward, done, _ = env.step(cont_action)
              '''get the action '''
              cont_action = self.q_network.action_discrete_to_continuous(disc_action)
              # print(action,action_from_the_net)
              '''Execute at and observe rt+1, st+1; If  st+1 is terminal: break (end episode)'''
              last_distance = np.linalg.norm(obs[[[4, 5]]] - obs[[[6, 7]]])


              obs_next, reward, done, _ = self.env.step(cont_action)
              current_distance = np.linalg.norm(obs_next[[[4, 5]]] - obs_next[[[6, 7]]])

              # current_distance = np.sqrt(np.abs(reward))
              episode_reward += reward

              # if current_distance < last_distance:
              #     reward += 0.1  # Bonus for moving closer to the goal
              # else:
              #     reward -= 0.3
              '''episode reward += rt+1'''

              '''Store transition (st, at, rt+1, st+1) in D (remove old data if  needed)'''
              # if current_distance < last_distance:
                # print(current_distance,last_distance,current_distance < last_distance)
              self.rb.put((obs, disc_action, reward, obs_next, done))
              last_distance = current_distance
              # obs, actions, rewards, next_obs, dones = self.env.step(cont_action)
              #Train network

              if len(self.rb.buffer) > self.batch_size:
                self.learn()
              '''Set st = st+1'''
              # obs = obs_next

            # print(len(self.rb.buffer))

            self.episode_returns.append(episode_reward)
            if episode <=300:
              clear_output(wait=True)
              print(f'{episode}' )
            if episode > 300:#and total_reward > -50
              # if episode_reward > best_TR:
              #   best_TR = episode_reward
              #   print(f"------------------------------------------------------------")
              clear_output(wait=True)
              print(f'{episode}    {str(model_dir)} ' )


              savepath = os.path.join(model_dir, f'ep_{episode:04d}_{round(episode_reward,2)}.pth')
              torch.save(self.q_network.state_dict(), savepath)
              perper = round(num_maxQ_action / (num_maxQ_action+num_random_action),3)
              # if episode <= 0:
              #   savepath = os.path.join(model_dir, f'ep_{episode:04d}_{round(episode_reward,2)}.pth')
              #   torch.save(self.q_network.state_dict(), savepath)
              #   grade = self.scoreTheThing(savepath)
              #   print(f"Ep: {episode:04d}  {savepath}  {perper} ")
              if episode > 1:
                grade = self.scoreTheThing(savepath)
                # if grade <= 6:
                #   !rm -rf savepath

                if grade >= 6.5 and good_boy < 80:
                  good_boy += 1
                  savepath = os.path.join(success, f'ep_{episode:04d}_{round(episode_reward,2)}_{grade}.pth')
                  torch.save(self.q_network.state_dict(), savepath)
                  print(f"Ep: {episode:04d}  {savepath}  {perper} {grade} ")
                if grade >=7:
                  savepath = os.path.join(success, f'ep_{episode:04d}_{round(episode_reward,2)}_{grade}_lets_Freaking_go.pth')
                  torch.save(self.q_network.state_dict(), savepath)
                  print(f"dingdingding")
                  print(f"dingdingding")
                  print(f"dingdingding")
                  print(f"dingdingding")
                  print(f"dingdingding")


            '''Every k episodes: set target network QT=QA'''
            if episode % self.target_update_freq == 0 and episode != 0:
              # print('target net updated',"  epsilon: ", self.epsilon)
              self.target_network.load_state_dict(self.q_network.state_dict())


        return model_dir, self.episode_returns

    def learn(self):
        # print("learnining")
        """Sample random minibatch of  transitions from D"""
        obs, action, reward, next_obs, _ = self.rb.sample(self.batch_size)
        obs       = torch.tensor(obs,      device=self.device, dtype=torch.float)
        action    = torch.tensor(action,   device=self.device, dtype=torch.long)
        reward    = torch.tensor(reward,   device=self.device, dtype=torch.float)
        next_obs  = torch.tensor(next_obs, device=self.device, dtype=torch.float)
        # done      = torch.tensor(done, device=self.device, dtype=torch.bool)

        # disc_action = self.q_network.select_discrete_action(obs, self.device)
        # cont_action = self.q_network.action_discrete_to_continuous(disc_action)
              # print(action,action_from_the_net)
        '''Execute at and observe rt+1, st+1; If  st+1 is terminal: break (end episode)'''
        # obs_next, reward, done, _ = self.env.step(cont_action)


        action_Q = self.q_network(obs, self.device).gather(1, action.unsqueeze(1)).squeeze(1)
        max_target_Q = self.target_network(next_obs, self.device).max(1)[0].detach()

        whatIjustLearn = reward + self.discount * max_target_Q

        '''Perform batch GD on QA with L = F( QA(st, at) - [ rt + 𝛾 maxa QT(st+1, at+1)] )'''
        loss = F.mse_loss(action_Q, whatIjustLearn)
        # loss = F.smooth_l1_loss(action_Q, whatIjustLearn)
        self.optimizer.zero_grad()
        loss.backward()
        nn.utils.clip_grad_norm_(list(self.q_network.parameters()), 0.2)
        self.optimizer.step()

    def scoreTheThing(self,model_path):
      arm = Robot(
        ArmDynamics(num_links=2,link_mass=0.1,link_length=1,joint_viscous_friction=0.1,dt=0.01,gravity=False))
      arm.reset()
      env = ArmEnv(arm, gui=False)
      device = torch.device('cpu')
      qnet = QNetwork(env).to(device)
      qnet.load_state_dict(torch.load(str(model_path), map_location=device))
      qnet.eval()
      score = compute_score(qnet, env, device)
      return score

arm = Robot(
    ArmDynamics(
        num_links=2,
        link_mass=0.1,
        link_length=1,
        joint_viscous_friction=0.1,
        dt=0.01,
        gravity=False,
    )
)
arm.reset()
env = ArmEnv(arm, gui=False)
tqdn = TrainDQN(env)
# ---------------

# Call your train function here
model_path,LotsOfReturn = tqdn.train()


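The training loop copies the online network's weights into the target network every `target_update_freq` episodes (10 in the cell above). The following sketch shows that hard-update pattern in isolation, with two small `nn.Linear` layers standing in for `QNetwork`; the layer sizes, learning rate, and input are arbitrary assumptions made for the illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

online = nn.Linear(4, 2)   # stand-in for the online Q-network
target = nn.Linear(4, 2)   # stand-in for the target network
target.load_state_dict(online.state_dict())  # start from identical weights

optimizer = torch.optim.SGD(online.parameters(), lr=0.1)

# One gradient step on the online network only.
loss = online(torch.ones(1, 4)).sum()
optimizer.zero_grad()
loss.backward()
optimizer.step()

# The target network was not touched, so it now lags behind the online network.
in_sync_before = torch.equal(online.weight, target.weight)

# Hard update: copy the online weights into the target network.
target.load_state_dict(online.state_dict())
in_sync_after = torch.equal(online.weight, target.weight)

print(in_sync_before, in_sync_after)
```

Between syncs the target network supplies fixed bootstrap values, so a smaller `target_update_freq` tracks the online network more closely at the cost of a less stable target.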

In [None]:
%%script false --no-raise-error
# @title Training center 16
# /content/models/2024-04-26_00-20-41 success
# !rm -rf /content/models/
import os
import time

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import clear_output
# os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

# from score import compute_score
from render import Renderer
from arm_env import ArmEnv
from robot import Robot
from arm_dynamics import ArmDynamics
from geometry import polar2cartesian

class TrainDQN:
    def __init__(self, env, seed=8848):

        torch.manual_seed(seed)

        np.random.seed(seed)

        # torch.cuda.manual_seed_all(seed)
        self.env = env
        # Use the GPU when one is available; otherwise fall back to the CPU.
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.q_network = QNetwork(env).to(self.device) #action network
        self.target_network = QNetwork(env).to(self.device)
        self.target_network.load_state_dict(self.q_network.state_dict())
        # self.optimizer = optim.Adam(self.q_network.parameters(), lr=0.00005,weight_decay=0.0001)
        self.optimizer =torch.optim.AdamW(self.q_network.parameters(), lr=0.00004, betas=(0.9, 0.999), eps=1e-08, amsgrad=False)
        self.rb = ReplayBuffer(12000)
        self.batch_size = 256
        self.discount =  0.95 # (0.8, 0.9, 0.95, 0.99)
        self.target_update_freq = 10
        self.num_episodes = 5000
        self.epsilon = 1
        self.epsilon_decay = 0.999 #0.9999
        self.min_epsilon = 0.4 # not working 0.01 0.05 0.1 0.15 0.25
        self.episode_returns = []
        goal1 = polar2cartesian(1.9, -0.25  - np.pi/2.0)
        goal2 = polar2cartesian(1.6,  0.25  - np.pi/2.0)
        goal3 = polar2cartesian(1.8,  0.3   - np.pi/2.0)
        goal4 = polar2cartesian(1.5,  0.3   - np.pi/2.0)
        goal5 = polar2cartesian(1.6,  0.40  - np.pi/2.0)

        self.goal_list = [goal1,goal2,goal3,goal4,goal5]
        model_path = ''
        if model_path:
            self.q_network.load_state_dict(torch.load(model_path))
            self.target_network.load_state_dict(self.q_network.state_dict())
            print(f"Loaded model weights from {model_path}")

    def train(self):

        goal = polar2cartesian(1.6, 0.25 - np.pi/2.0)
        timestr = time.strftime("%Y-%m-%d_%H-%M-%S")
        model_dir = os.path.join('models', timestr)
        os.makedirs(model_dir, exist_ok=True)  # Ensure directory exists
        # step_count = 0
        best_TR = float('inf')
        whichGoal = 0
        good_boy = 0
        success = os.path.join(model_dir, "success")
        if not os.path.exists(success):
          os.makedirs(success)

        for episode in range(self.num_episodes):
            '''Initialize episode_reward = 0; s1 = random start state; reset robot to match s1'''
            episode_reward = 0

            # obs = self.env.reset(self.goal_list[whichGoal])#goal = [[0.5],[-1.5]]
            obs = self.env.reset()#goal = [[0.5],[-1.5]]
            whichGoal += 1
            if whichGoal == 4:
              whichGoal = 0
            done = False
            last_distance = np.inf
            if episode % 11 == 0 :
              num_random_action = 0
              num_maxQ_action = 0

            if episode % 10 == 0 and episode != 0:
              # Update exploration rate
                self.epsilon = max(self.min_epsilon, self.epsilon * self.epsilon_decay)
                # print(self.epsilon)
            for t in range(200):

               #Adaptive Epsilon:
              # obs = self.env.reset()
              obs = self.env.get_obs()
              """Select an action with the epsilon-greedy strategy.
              With probability epsilon, a_t = random; otherwise a_t = argmax_a Q^A(s_t, a)"""
              if np.random.rand() > self.epsilon:
                  disc_action = self.q_network.select_discrete_action(obs, self.device)
                  num_maxQ_action +=1
              else:
                  disc_action = np.random.randint(0, self.q_network.get_mapSize())
                  num_random_action +=1


              # print(self.q_network.get_mapSize())
              # action = self.q_network.select_discrete_action(obs, self.device)

              # disc_action = q_network.select_discrete_action(obs,device)
              # cont_action = q_network.action_discrete_to_continuous(disc_action)
              # obs, reward, done, _ = env.step(cont_action)
              '''get the action '''
              cont_action = self.q_network.action_discrete_to_continuous(disc_action)
              # print(action,action_from_the_net)
              '''Execute at and observe rt+1, st+1; If  st+1 is terminal: break (end episode)'''
              last_distance = np.linalg.norm(obs[[[4, 5]]] - obs[[[6, 7]]])


              obs_next, reward, done, _ = self.env.step(cont_action)
              current_distance = np.linalg.norm(obs_next[[[4, 5]]] - obs_next[[[6, 7]]])

              # current_distance = np.sqrt(np.abs(reward))
              episode_reward += reward

              # if current_distance < last_distance:
              #     reward += 0.1  # Bonus for moving closer to the goal
              # else:
              #     reward -= 0.3
              '''episode reward += rt+1'''

              '''Store transition (st, at, rt+1, st+1) in D (remove old data if  needed)'''
              # if current_distance < last_distance:
                # print(current_distance,last_distance,current_distance < last_distance)
              self.rb.put((obs, disc_action, reward, obs_next, done))
              last_distance = current_distance
              # obs, actions, rewards, next_obs, dones = self.env.step(cont_action)
              #Train network

              if len(self.rb.buffer) > self.batch_size:
                self.learn()
              '''Set st = st+1'''
              # obs = obs_next

            # print(len(self.rb.buffer))

            self.episode_returns.append(episode_reward)
            if episode <=300:
              clear_output(wait=True)
              print(f'{episode}' )
            if episode > 300:#and total_reward > -50
              # if episode_reward > best_TR:
              #   best_TR = episode_reward
              #   print(f"------------------------------------------------------------")
              clear_output(wait=True)
              print(f'{episode}    {str(model_dir)} ' )


              savepath = os.path.join(model_dir, f'ep_{episode:04d}_{round(episode_reward,2)}.pth')
              torch.save(self.q_network.state_dict(), savepath)
              perper = round(num_maxQ_action / (num_maxQ_action+num_random_action),3)
              # if episode <= 0:
              #   savepath = os.path.join(model_dir, f'ep_{episode:04d}_{round(episode_reward,2)}.pth')
              #   torch.save(self.q_network.state_dict(), savepath)
              #   grade = self.scoreTheThing(savepath)
              #   print(f"Ep: {episode:04d}  {savepath}  {perper} ")
              if episode > 1:
                grade = self.scoreTheThing(savepath)
                # if grade <= 6:
                #   !rm -rf savepath

                if grade >= 6.5 and good_boy < 80:
                  good_boy += 1
                  savepath = os.path.join(success, f'ep_{episode:04d}_{round(episode_reward,2)}_{grade}.pth')
                  torch.save(self.q_network.state_dict(), savepath)
                  print(f"Ep: {episode:04d}  {savepath}  {perper} {grade} ")
                if grade >=7:
                  savepath = os.path.join(success, f'ep_{episode:04d}_{round(episode_reward,2)}_{grade}_lets_Freaking_go.pth')
                  torch.save(self.q_network.state_dict(), savepath)
                  print(f"dingdingding")
                  print(f"dingdingding")
                  print(f"dingdingding")
                  print(f"dingdingding")
                  print(f"dingdingding")


            '''Every k episodes: set target network QT=QA'''
            if episode % self.target_update_freq == 0 and episode != 0:
              # print('target net updated',"  epsilon: ", self.epsilon)
              self.target_network.load_state_dict(self.q_network.state_dict())


        # Return the per-episode returns rather than the last env.step() info value
        return model_dir, self.episode_returns

    def learn(self):
        # print("learnining")
        """Sample random minibatch of  transitions from D"""
        obs, action, reward, next_obs, _ = self.rb.sample(self.batch_size)
        obs       = torch.tensor(obs,      device=self.device, dtype=torch.float)
        action    = torch.tensor(action,   device=self.device, dtype=torch.long)
        reward    = torch.tensor(reward,   device=self.device, dtype=torch.float)
        next_obs  = torch.tensor(next_obs, device=self.device, dtype=torch.float)
        # done      = torch.tensor(done, device=self.device, dtype=torch.bool)

        # disc_action = self.q_network.select_discrete_action(obs, self.device)
        # cont_action = self.q_network.action_discrete_to_continuous(disc_action)
              # print(action,action_from_the_net)
        '''Execute at and observe rt+1, st+1; If  st+1 is terminal: break (end episode)'''
        # obs_next, reward, done, _ = self.env.step(cont_action)


        action_Q = self.q_network(obs, self.device).gather(1, action.unsqueeze(1)).squeeze(1)
        max_target_Q = self.target_network(next_obs, self.device).max(1)[0].detach()

        whatIjustLearn = reward + self.discount * max_target_Q

        '''Perform batch GD on QA with L = F( QA(st, at) - [ rt+1 + 𝛾 max_a QT(st+1, a) ] )'''
        loss = F.mse_loss(action_Q, whatIjustLearn)
        # loss = nn.SmoothL1Loss(current_q_values, whatIjustLearn)
        self.optimizer.zero_grad()
        loss.backward()
        nn.utils.clip_grad_norm_(list(self.q_network.parameters()), 0.2)
        self.optimizer.step()

    def scoreTheThing(self,model_path):
      arm = Robot(
        ArmDynamics(num_links=2,link_mass=0.1,link_length=1,joint_viscous_friction=0.1,dt=0.01,gravity=False))
      arm.reset()
      env = ArmEnv(arm, gui=False)
      device = torch.device('cpu')
      qnet = QNetwork(env).to(device)
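      # Checkpoints are saved from the training device; map them onto the scoring device
      qnet.load_state_dict(torch.load(str(model_path), map_location=device))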
      qnet.eval()
      score = compute_score(qnet, env, device)
      return score

arm = Robot(
        ArmDynamics(
            num_links=2,
            link_mass=0.1,
            link_length=1,
            joint_viscous_friction=0.1,
            dt=0.01,
            gravity=False
        )
    )
arm.reset()
env = ArmEnv(arm, gui=False)
tqdn = TrainDQN(env)
# ---------------

# Call your train function here
model_path,LotsOfReturn = tqdn.train()
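
The exploration schedule in the cell above multiplies `epsilon` by `epsilon_decay` once every 10 episodes and clamps it at `min_epsilon`. The sketch below replays that schedule in isolation to check where epsilon ends up. The helper names `epsilon_after` and `episodes_to_floor` are hypothetical and not part of the notebook; the default values are copied from the cell above.

```python
import math


def epsilon_after(num_episodes, start=1.0, decay=0.999, floor=0.4, every=10):
    """Replay the schedule: decay once at each non-zero multiple of `every`."""
    eps = start
    for episode in range(num_episodes):
        if episode % every == 0 and episode != 0:
            eps = max(floor, eps * decay)
    return eps


def episodes_to_floor(start=1.0, decay=0.999, floor=0.4, every=10):
    """Episode index at which the decayed value first drops to the floor."""
    # Smallest n with start * decay**n <= floor
    steps = math.ceil(math.log(floor / start) / math.log(decay))
    return steps * every


print(epsilon_after(5000))    # epsilon at the end of a 5000-episode run
print(episodes_to_floor())    # episode at which the floor would be reached
```

With these settings, epsilon is still about 0.61 after 5000 episodes, and the 0.4 floor would only be reached around episode 9160, so the floor never becomes active in a run of this length.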



In [None]:
%%script false --no-raise-error
# @title Training center 16-1
# /content/models/2024-04-26_00-20-41 success
# !rm -rf /content/models/
import os
import time

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import seaborn as sns
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader
from IPython.display import clear_output

# os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

# from score import compute_score
from render import Renderer
from arm_env import ArmEnv
from robot import Robot
from arm_dynamics import ArmDynamics
from geometry import polar2cartesian

class TrainDQN:
    def __init__(self, env, seed=999):

        torch.manual_seed(seed)

        np.random.seed(seed)

        # torch.cuda.manual_seed_all(seed)
        self.env = env
        # self.device = torch.device('cpu')
        # Fall back to CPU when no GPU runtime is attached
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.q_network = QNetwork(env).to(self.device) #action network
        self.target_network = QNetwork(env).to(self.device)
        self.target_network.load_state_dict(self.q_network.state_dict())
        # self.optimizer = optim.Adam(self.q_network.parameters(), lr=0.00005,weight_decay=0.0001)
        self.optimizer =torch.optim.AdamW(self.q_network.parameters(), lr=0.00004, betas=(0.9, 0.999), eps=1e-08, amsgrad=False)
        self.rb = ReplayBuffer(12000)
        self.batch_size = 256
        self.discount =  0.95 # (0.8, 0.9, 0.95, 0.99)
        self.target_update_freq = 10
        self.num_episodes = 5000
        self.epsilon = 1
        self.epsilon_decay = 0.999 #0.9999
        self.min_epsilon = 0.4 # not working 0.01 0.05 0.1 0.15 0.25
        self.episode_returns = []
        goal1 = polar2cartesian(1.9, -0.25  - np.pi/2.0)
        goal2 = polar2cartesian(1.6,  0.25  - np.pi/2.0)
        goal3 = polar2cartesian(1.8,  0.3   - np.pi/2.0)
        goal4 = polar2cartesian(1.5,  0.3   - np.pi/2.0)
        goal5 = polar2cartesian(1.6,  0.40  - np.pi/2.0)

        self.goal_list = [goal1,goal2,goal3,goal4,goal5]
        model_path = ''
        if model_path:
            self.q_network.load_state_dict(torch.load(model_path))
            self.target_network.load_state_dict(self.q_network.state_dict())
            print(f"Loaded model weights from {model_path}")

    def train(self):

        goal = polar2cartesian(1.6, 0.25 - np.pi/2.0)
        timestr = time.strftime("%Y-%m-%d_%H-%M-%S")
        model_dir = os.path.join('models', timestr)
        os.makedirs(model_dir, exist_ok=True)  # Ensure directory exists
        # step_count = 0
        best_TR = float('inf')
        whichGoal = 0
        good_boy = 0
        success = os.path.join(model_dir, "success")
        if not os.path.exists(success):
          os.makedirs(success)

        for episode in range(self.num_episodes):
            '''Initialize episode_reward = 0; s1 = random start state; reset robot to match s1'''
            episode_reward = 0

            # obs = self.env.reset(self.goal_list[whichGoal])#goal = [[0.5],[-1.5]]
            obs = self.env.reset()#goal = [[0.5],[-1.5]]
            whichGoal += 1
            if whichGoal == 4:
              whichGoal = 0
            done = False
            last_distance = np.inf
            if episode % 11 == 0 :
              num_random_action = 0
              num_maxQ_action = 0

            if episode % 10 == 0 and episode != 0:
              # Update exploration rate
                self.epsilon = max(self.min_epsilon, self.epsilon * self.epsilon_decay)
                # print(self.epsilon)
            for t in range(200):

               #Adaptive Epsilon:
              # obs = self.env.reset()
              obs = self.env.get_obs()
              """Select an action with the epsilon-greedy strategy.
              With probability epsilon, a_t = random; otherwise a_t = argmax_a Q^A(s_t, a)"""
              if np.random.rand() > self.epsilon:
                  disc_action = self.q_network.select_discrete_action(obs, self.device)
                  num_maxQ_action +=1
              else:
                  disc_action = np.random.randint(0, self.q_network.get_mapSize())
                  num_random_action +=1


              # print(self.q_network.get_mapSize())
              # action = self.q_network.select_discrete_action(obs, self.device)

              # disc_action = q_network.select_discrete_action(obs,device)
              # cont_action = q_network.action_discrete_to_continuous(disc_action)
              # obs, reward, done, _ = env.step(cont_action)
              '''get the action '''
              cont_action = self.q_network.action_discrete_to_continuous(disc_action)
              # print(action,action_from_the_net)
              '''Execute at and observe rt+1, st+1; If  st+1 is terminal: break (end episode)'''
              last_distance = np.linalg.norm(obs[[[4, 5]]] - obs[[[6, 7]]])


              obs_next, reward, done, _ = self.env.step(cont_action)
              current_distance = np.linalg.norm(obs_next[[[4, 5]]] - obs_next[[[6, 7]]])

              # current_distance = np.sqrt(np.abs(reward))
              episode_reward += reward

              # if current_distance < last_distance:
              #     reward += 0.1  # Bonus for moving closer to the goal
              # else:
              #     reward -= 0.3
              '''episode reward += rt+1'''

              '''Store transition (st, at, rt+1, st+1) in D (remove old data if  needed)'''
              # if current_distance < last_distance:
                # print(current_distance,last_distance,current_distance < last_distance)
              self.rb.put((obs, disc_action, reward, obs_next, done))
              last_distance = current_distance
              # obs, actions, rewards, next_obs, dones = self.env.step(cont_action)
              #Train network

              if len(self.rb.buffer) > self.batch_size:
                self.learn()
              '''Set st = st+1'''
              # obs = obs_next

            # print(len(self.rb.buffer))

            self.episode_returns.append(episode_reward)
            if episode <=300:
              clear_output(wait=True)
              print(f'{episode}' )
            if episode > 300:#and total_reward > -50
              # if episode_reward > best_TR:
              #   best_TR = episode_reward
              #   print(f"------------------------------------------------------------")
              clear_output(wait=True)
              print(f'{episode}    {str(model_dir)} ' )


              savepath = os.path.join(model_dir, f'ep_{episode:04d}_{round(episode_reward,2)}.pth')
              torch.save(self.q_network.state_dict(), savepath)
              perper = round(num_maxQ_action / (num_maxQ_action+num_random_action),3)
              # if episode <= 0:
              #   savepath = os.path.join(model_dir, f'ep_{episode:04d}_{round(episode_reward,2)}.pth')
              #   torch.save(self.q_network.state_dict(), savepath)
              #   grade = self.scoreTheThing(savepath)
              #   print(f"Ep: {episode:04d}  {savepath}  {perper} ")
              if episode > 1:
                grade = self.scoreTheThing(savepath)
                # if grade <= 6:
                #   !rm -rf savepath

                if grade >= 6.5 and good_boy < 80:
                  good_boy += 1
                  savepath = os.path.join(success, f'ep_{episode:04d}_{round(episode_reward,2)}_{grade}.pth')
                  torch.save(self.q_network.state_dict(), savepath)
                  print(f"Ep: {episode:04d}  {savepath}  {perper} {grade} ")
                if grade >=7:
                  savepath = os.path.join(success, f'ep_{episode:04d}_{round(episode_reward,2)}_{grade}_lets_Freaking_go.pth')
                  torch.save(self.q_network.state_dict(), savepath)
                  print(f"dingdingding")
                  print(f"dingdingding")
                  print(f"dingdingding")
                  print(f"dingdingding")
                  print(f"dingdingding")


            '''Every k episodes: set target network QT=QA'''
            if episode % self.target_update_freq == 0 and episode != 0:
              # print('target net updated',"  epsilon: ", self.epsilon)
              self.target_network.load_state_dict(self.q_network.state_dict())


        # Return the per-episode returns rather than the last env.step() info value
        return model_dir, self.episode_returns

    def learn(self):
        # print("learnining")
        """Sample random minibatch of  transitions from D"""
        obs, action, reward, next_obs, _ = self.rb.sample(self.batch_size)
        obs       = torch.tensor(obs,      device=self.device, dtype=torch.float)
        action    = torch.tensor(action,   device=self.device, dtype=torch.long)
        reward    = torch.tensor(reward,   device=self.device, dtype=torch.float)
        next_obs  = torch.tensor(next_obs, device=self.device, dtype=torch.float)
        # done      = torch.tensor(done, device=self.device, dtype=torch.bool)

        # disc_action = self.q_network.select_discrete_action(obs, self.device)
        # cont_action = self.q_network.action_discrete_to_continuous(disc_action)
              # print(action,action_from_the_net)
        '''Execute at and observe rt+1, st+1; If  st+1 is terminal: break (end episode)'''
        # obs_next, reward, done, _ = self.env.step(cont_action)


        action_Q = self.q_network(obs, self.device).gather(1, action.unsqueeze(1)).squeeze(1)
        max_target_Q = self.target_network(next_obs, self.device).max(1)[0].detach()

        whatIjustLearn = reward + self.discount * max_target_Q

        '''Perform batch GD on QA with L = F( QA(st, at) - [ rt+1 + 𝛾 max_a QT(st+1, a) ] )'''
        loss = F.mse_loss(action_Q, whatIjustLearn)
        # loss = nn.SmoothL1Loss(current_q_values, whatIjustLearn)
        self.optimizer.zero_grad()
        loss.backward()
        nn.utils.clip_grad_norm_(list(self.q_network.parameters()), 0.2)
        self.optimizer.step()

    def scoreTheThing(self,model_path):
      arm = Robot(
        ArmDynamics(num_links=2,link_mass=0.1,link_length=1,joint_viscous_friction=0.1,dt=0.01,gravity=False))
      arm.reset()
      env = ArmEnv(arm, gui=False)
      device = torch.device('cpu')
      qnet = QNetwork(env).to(device)
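      # Checkpoints are saved from the training device; map them onto the scoring device
      qnet.load_state_dict(torch.load(str(model_path), map_location=device))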
      qnet.eval()
      score = compute_score(qnet, env, device)
      return score

arm = Robot(
        ArmDynamics(
            num_links=2,
            link_mass=0.1,
            link_length=1,
            joint_viscous_friction=0.1,
            dt=0.01,
            gravity=False
        )
    )
arm.reset()
env = ArmEnv(arm, gui=False)
tqdn = TrainDQN(env)
# ---------------

# Call your train function here
model_path,LotsOfReturn = tqdn.train()
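
The `learn` method above builds its target as `reward + discount * max_target_Q` for every sampled transition and discards the stored `done` flag. A common variant drops the bootstrap term for terminal transitions. The sketch below shows that arithmetic in plain Python on a toy batch. The function name `td_targets` is hypothetical, and whether `ArmEnv` ever reports `done=True` before the 200-step limit is not shown in this notebook section, so the mask is an assumption rather than a confirmed fix.

```python
def td_targets(rewards, next_q_values, dones, discount=0.95):
    """One-step Q-learning targets: r + gamma * max_a Q_T(s', a).

    The bootstrap term is set to zero for transitions flagged as terminal.
    `next_q_values` holds one list of target-network Q-values per transition.
    """
    targets = []
    for reward, q_next, done in zip(rewards, next_q_values, dones):
        bootstrap = 0.0 if done else discount * max(q_next)
        targets.append(reward + bootstrap)
    return targets


# Toy batch of two transitions; the second one ends its episode
rewards = [-1.0, -0.5]
next_q_values = [[0.0, 2.0], [1.0, 3.0]]
dones = [False, True]
print(td_targets(rewards, next_q_values, dones))
```

In the tensor version used by `learn`, the same mask would typically be written as a multiplication of the bootstrap term by `(1 - done)` with `done` cast to float.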



In [None]:
%%script false --no-raise-error
# @title Training center 16-2            2024-04-26_21-07-56

# !rm -rf /content/models/
import os
import time

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import seaborn as sns
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader
from IPython.display import clear_output

# os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

# from score import compute_score
from render import Renderer
from arm_env import ArmEnv
from robot import Robot
from arm_dynamics import ArmDynamics
from geometry import polar2cartesian

class TrainDQN:
    def __init__(self, env, seed=999):

        torch.manual_seed(seed)

        np.random.seed(seed)

        # torch.cuda.manual_seed_all(seed)
        self.env = env
        # self.device = torch.device('cpu')
        # Fall back to CPU when no GPU runtime is attached
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.q_network = QNetwork(env).to(self.device) #action network
        self.target_network = QNetwork(env).to(self.device)
        self.target_network.load_state_dict(self.q_network.state_dict())
        # self.optimizer = optim.Adam(self.q_network.parameters(), lr=0.00005,weight_decay=0.0001)
        self.optimizer =torch.optim.AdamW(self.q_network.parameters(), lr=0.00004, betas=(0.9, 0.999), eps=1e-08, amsgrad=False)
        self.rb = ReplayBuffer(12000)
        self.batch_size = 256
        self.discount =  0.93 # (0.8, 0.9, 0.95, 0.99)
        self.target_update_freq = 30
        self.num_episodes = 4000
        self.epsilon = 1
        self.epsilon_decay = 0.999 #0.9999
        self.min_epsilon = 0.4 # not working 0.01 0.05 0.1 0.15 0.25
        self.episode_returns = []
        goal1 = polar2cartesian(1.9, -0.25  - np.pi/2.0)
        goal2 = polar2cartesian(1.6,  0.25  - np.pi/2.0)
        goal3 = polar2cartesian(1.8,  0.3   - np.pi/2.0)
        goal4 = polar2cartesian(1.5,  0.3   - np.pi/2.0)
        goal5 = polar2cartesian(1.6,  0.40  - np.pi/2.0)

        self.goal_list = [goal1,goal2,goal3,goal4,goal5]
        model_path = ''
        if model_path:
            self.q_network.load_state_dict(torch.load(model_path))
            self.target_network.load_state_dict(self.q_network.state_dict())
            print(f"Loaded model weights from {model_path}")

    def train(self):

        goal = polar2cartesian(1.6, 0.25 - np.pi/2.0)
        timestr = time.strftime("%Y-%m-%d_%H-%M-%S")
        model_dir = os.path.join('models', timestr)
        os.makedirs(model_dir, exist_ok=True)  # Ensure directory exists
        # step_count = 0
        best_TR = float('inf')
        whichGoal = 0
        good_boy = 0
        success = os.path.join(model_dir, "success")
        if not os.path.exists(success):
          os.makedirs(success)

        for episode in range(self.num_episodes):
            '''Initialize episode_reward = 0; s1 = random start state; reset robot to match s1'''
            episode_reward = 0

            # obs = self.env.reset(self.goal_list[whichGoal])#goal = [[0.5],[-1.5]]
            obs = self.env.reset()#goal = [[0.5],[-1.5]]
            whichGoal += 1
            if whichGoal == 4:
              whichGoal = 0
            done = False
            last_distance = np.inf
            if episode % 11 == 0 :
              num_random_action = 0
              num_maxQ_action = 0

            if episode % 10 == 0 and episode != 0:
              # Update exploration rate
                self.epsilon = max(self.min_epsilon, self.epsilon * self.epsilon_decay)
                # print(self.epsilon)
            for t in range(200):

               #Adaptive Epsilon:
              # obs = self.env.reset()
              obs = self.env.get_obs()
              """Select an action with the epsilon-greedy strategy.
              With probability epsilon, a_t = random; otherwise a_t = argmax_a Q^A(s_t, a)"""
              if np.random.rand() > self.epsilon:
                  disc_action = self.q_network.select_discrete_action(obs, self.device)
                  num_maxQ_action +=1
              else:
                  disc_action = np.random.randint(0, self.q_network.get_mapSize())
                  num_random_action +=1


              # print(self.q_network.get_mapSize())
              # action = self.q_network.select_discrete_action(obs, self.device)

              # disc_action = q_network.select_discrete_action(obs,device)
              # cont_action = q_network.action_discrete_to_continuous(disc_action)
              # obs, reward, done, _ = env.step(cont_action)
              '''get the action '''
              cont_action = self.q_network.action_discrete_to_continuous(disc_action)
              # print(action,action_from_the_net)
              '''Execute at and observe rt+1, st+1; If  st+1 is terminal: break (end episode)'''
              last_distance = np.linalg.norm(obs[[[4, 5]]] - obs[[[6, 7]]])


              obs_next, reward, done, _ = self.env.step(cont_action)
              current_distance = np.linalg.norm(obs_next[[[4, 5]]] - obs_next[[[6, 7]]])

              # current_distance = np.sqrt(np.abs(reward))
              episode_reward += reward

              # if current_distance < last_distance:
              #     reward += 0.1  # Bonus for moving closer to the goal
              # else:
              #     reward -= 0.3
              '''episode reward += rt+1'''

              '''Store transition (st, at, rt+1, st+1) in D (remove old data if  needed)'''
              # if current_distance < last_distance:
                # print(current_distance,last_distance,current_distance < last_distance)
              self.rb.put((obs, disc_action, reward, obs_next, done))
              last_distance = current_distance
              # obs, actions, rewards, next_obs, dones = self.env.step(cont_action)
              #Train network

              if len(self.rb.buffer) > self.batch_size:
                self.learn()
              '''Set st = st+1'''
              # obs = obs_next

            # print(len(self.rb.buffer))

            self.episode_returns.append(episode_reward)
            if episode <=300:
              clear_output(wait=True)
              print(f'{episode}' )
            if episode > 300:#and total_reward > -50
              # if episode_reward > best_TR:
              #   best_TR = episode_reward
              #   print(f"------------------------------------------------------------")
              clear_output(wait=True)
              print(f'{episode}    {str(model_dir)} ' )


              savepath = os.path.join(model_dir, f'ep_{episode:04d}_{round(episode_reward,2)}.pth')
              torch.save(self.q_network.state_dict(), savepath)
              perper = round(num_maxQ_action / (num_maxQ_action+num_random_action),3)
              # if episode <= 0:
              #   savepath = os.path.join(model_dir, f'ep_{episode:04d}_{round(episode_reward,2)}.pth')
              #   torch.save(self.q_network.state_dict(), savepath)
              #   grade = self.scoreTheThing(savepath)
              #   print(f"Ep: {episode:04d}  {savepath}  {perper} ")
              if episode > 1:
                grade = self.scoreTheThing(savepath)
                # if grade <= 6:
                #   !rm -rf savepath

                if grade >= 6.5 and good_boy < 80:
                  good_boy += 1
                  savepath = os.path.join(success, f'ep_{episode:04d}_{round(episode_reward,2)}_{grade}.pth')
                  torch.save(self.q_network.state_dict(), savepath)
                  print(f"Ep: {episode:04d}  {savepath}  {perper} {grade} ")
                if grade >=7:
                  savepath = os.path.join(success, f'ep_{episode:04d}_{round(episode_reward,2)}_{grade}_lets_Freaking_go.pth')
                  torch.save(self.q_network.state_dict(), savepath)
                  print(f"dingdingding")
                  print(f"dingdingding")
                  print(f"dingdingding")
                  print(f"dingdingding")
                  print(f"dingdingding")


            '''Every k episodes: set target network QT=QA'''
            if episode % self.target_update_freq == 0 and episode != 0:
              # print('target net updated',"  epsilon: ", self.epsilon)
              self.target_network.load_state_dict(self.q_network.state_dict())


        # Return the per-episode returns rather than the last env.step() info value
        return model_dir, self.episode_returns

    def learn(self):
        # print("learnining")
        """Sample random minibatch of  transitions from D"""
        obs, action, reward, next_obs, _ = self.rb.sample(self.batch_size)
        obs       = torch.tensor(obs,      device=self.device, dtype=torch.float)
        action    = torch.tensor(action,   device=self.device, dtype=torch.long)
        reward    = torch.tensor(reward,   device=self.device, dtype=torch.float)
        next_obs  = torch.tensor(next_obs, device=self.device, dtype=torch.float)
        # done      = torch.tensor(done, device=self.device, dtype=torch.bool)

        # disc_action = self.q_network.select_discrete_action(obs, self.device)
        # cont_action = self.q_network.action_discrete_to_continuous(disc_action)
              # print(action,action_from_the_net)
        '''Execute at and observe rt+1, st+1; If  st+1 is terminal: break (end episode)'''
        # obs_next, reward, done, _ = self.env.step(cont_action)


        action_Q = self.q_network(obs, self.device).gather(1, action.unsqueeze(1)).squeeze(1)
        max_target_Q = self.target_network(next_obs, self.device).max(1)[0].detach()

        whatIjustLearn = reward + self.discount * max_target_Q

        '''Perform batch GD on QA with L = F( QA(st, at) - [ rt+1 + 𝛾 max_a QT(st+1, a) ] )'''
        loss = F.mse_loss(action_Q, whatIjustLearn)
        # loss = nn.SmoothL1Loss(current_q_values, whatIjustLearn)
        self.optimizer.zero_grad()
        loss.backward()
        nn.utils.clip_grad_norm_(list(self.q_network.parameters()), 0.2)
        self.optimizer.step()

    def scoreTheThing(self,model_path):
      arm = Robot(
        ArmDynamics(num_links=2,link_mass=0.1,link_length=1,joint_viscous_friction=0.1,dt=0.01,gravity=False))
      arm.reset()
      env = ArmEnv(arm, gui=False)
      device = torch.device('cpu')
      qnet = QNetwork(env).to(device)
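      # Checkpoints are saved from the training device; map them onto the scoring device
      qnet.load_state_dict(torch.load(str(model_path), map_location=device))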
      qnet.eval()
      score = compute_score(qnet, env, device)
      return score

arm = Robot(
        ArmDynamics(
            num_links=2,
            link_mass=0.1,
            link_length=1,
            joint_viscous_friction=0.1,
            dt=0.01,
            gravity=False
        )
    )
arm.reset()
env = ArmEnv(arm, gui=False)
tqdn = TrainDQN(env)
# ---------------

# Call your train function here
model_path,LotsOfReturn = tqdn.train()
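
The trainer above relies on `ReplayBuffer(12000)` exposing `put`, `sample`, and a `.buffer` attribute; that class is defined in an earlier cell that is not part of this section. As a point of reference, a bounded FIFO buffer with the same interface could be sketched as follows. The class name `SimpleReplayBuffer` and its internals are assumptions for illustration and may differ from the notebook's actual implementation.

```python
import random
from collections import deque


class SimpleReplayBuffer:
    """Fixed-capacity store of (obs, action, reward, next_obs, done) tuples."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest item when full
        self.buffer = deque(maxlen=capacity)

    def put(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform sampling without replacement, then regroup field by field
        batch = random.sample(list(self.buffer), batch_size)
        obs, actions, rewards, next_obs, dones = zip(*batch)
        return obs, actions, rewards, next_obs, dones


rb = SimpleReplayBuffer(capacity=3)
for i in range(5):
    rb.put((i, i, float(i), i + 1, False))
print(len(rb.buffer))   # capacity caps the number of stored transitions
print(rb.sample(2))
```

Because `sample` draws without replacement, it needs at least `batch_size` stored transitions, which matches the `len(self.rb.buffer) > self.batch_size` guard in `train`.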



In [None]:
%%script false --no-raise-error
# @title Training center 17
# /content/models/2024-04-26_00-20-41 success
# !rm -rf /content/models/
import os
import time

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import seaborn as sns
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader
from IPython.display import clear_output

# os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

# from score import compute_score
from render import Renderer
from arm_env import ArmEnv
from robot import Robot
from arm_dynamics import ArmDynamics
from geometry import polar2cartesian

class TrainDQN:
    def __init__(self, env, seed=999):

        torch.manual_seed(seed)

        np.random.seed(seed)

        # torch.cuda.manual_seed_all(seed)
        self.env = env
        # self.device = torch.device('cpu')
        # Fall back to CPU when no GPU runtime is attached
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.q_network = QNetwork(env).to(self.device) #action network
        self.target_network = QNetwork(env).to(self.device)
        self.target_network.load_state_dict(self.q_network.state_dict())
        # self.optimizer = optim.Adam(self.q_network.parameters(), lr=0.00005,weight_decay=0.0001)
        self.optimizer =torch.optim.AdamW(self.q_network.parameters(), lr=0.00004, betas=(0.9, 0.999), eps=1e-08, amsgrad=False)
        self.rb = ReplayBuffer(12000)
        self.batch_size = 256
        self.discount =  0.92 # (0.8, 0.9, 0.95, 0.99)
        self.target_update_freq = 37
        self.num_episodes = 4000
        self.epsilon = 1
        self.epsilon_decay = 0.999 #0.9999
        self.min_epsilon = 0.4 # not working 0.01 0.05 0.1 0.15 0.25
        self.episode_returns = []
        goal1 = polar2cartesian(1.9, -0.25  - np.pi/2.0)
        goal2 = polar2cartesian(1.6,  0.25  - np.pi/2.0)
        goal3 = polar2cartesian(1.8,  0.3   - np.pi/2.0)
        goal4 = polar2cartesian(1.5,  0.3   - np.pi/2.0)
        goal5 = polar2cartesian(1.6,  0.40  - np.pi/2.0)

        self.goal_list = [goal1,goal2,goal3,goal4,goal5]
        model_path = ''
        if model_path:
            self.q_network.load_state_dict(torch.load(model_path))
            self.target_network.load_state_dict(self.q_network.state_dict())
            print(f"Loaded model weights from {model_path}")

    def train(self):

        goal = polar2cartesian(1.6, 0.25 - np.pi/2.0)
        timestr = time.strftime("%Y-%m-%d_%H-%M-%S")
        model_dir = os.path.join('models', timestr)
        os.makedirs(model_dir, exist_ok=True)  # Ensure directory exists
        # step_count = 0
        best_TR = float('inf')
        whichGoal = 0
        good_boy = 0
        success = os.path.join(model_dir, "success")
        os.makedirs(success, exist_ok=True)

        for episode in range(self.num_episodes):
            '''Initialize episode_reward = 0; s1 = random start state; reset robot to match s1'''
            episode_reward = 0

            # obs = self.env.reset(self.goal_list[whichGoal])#goal = [[0.5],[-1.5]]
            obs = self.env.reset()#goal = [[0.5],[-1.5]]
            whichGoal = (whichGoal + 1) % len(self.goal_list)  # cycle through every goal, including the last one
            done = False
            last_distance = np.inf
            if episode % 11 == 0 :
              num_random_action = 0
              num_maxQ_action = 0

            if episode % 10 == 0 and episode != 0:
              # Update exploration rate
                self.epsilon = max(self.min_epsilon, self.epsilon * self.epsilon_decay)
                # print(self.epsilon)
            for t in range(200):

               #Adaptive Epsilon:
              # obs = self.env.reset()
              obs = self.env.get_obs()
              """Select an action with an epsilon-greedy strategy.
              With probability epsilon, a_t is random; otherwise a_t = argmax_a Q^A(s_t, a)."""
              if np.random.rand() > self.epsilon:
                  disc_action = self.q_network.select_discrete_action(obs, self.device)
                  num_maxQ_action +=1
              else:
                  disc_action = np.random.randint(0, self.q_network.get_mapSize())
                  num_random_action +=1


              # print(self.q_network.get_mapSize())
              # action = self.q_network.select_discrete_action(obs, self.device)

              # disc_action = q_network.select_discrete_action(obs,device)
              # cont_action = q_network.action_discrete_to_continuous(disc_action)
              # obs, reward, done, _ = env.step(cont_action)
              '''get the action '''
              cont_action = self.q_network.action_discrete_to_continuous(disc_action)
              # print(action,action_from_the_net)
              '''Execute at and observe rt+1, st+1; If  st+1 is terminal: break (end episode)'''
              last_distance = np.linalg.norm(obs[[[4, 5]]] - obs[[[6, 7]]])


              obs_next, reward, done, _ = self.env.step(cont_action)
              current_distance = np.linalg.norm(obs_next[[[4, 5]]] - obs_next[[[6, 7]]])

              # current_distance = np.sqrt(np.abs(reward))
              episode_reward += reward

              # if current_distance < last_distance:
              #     reward += 0.1  # Bonus for moving closer to the goal
              # else:
              #     reward -= 0.3
              '''episode reward += rt+1'''

              '''Store transition (st, at, rt+1, st+1) in D (remove old data if  needed)'''
              # if current_distance < last_distance:
                # print(current_distance,last_distance,current_distance < last_distance)
              self.rb.put((obs, disc_action, reward, obs_next, done))
              last_distance = current_distance
              # obs, actions, rewards, next_obs, dones = self.env.step(cont_action)
              #Train network

              if len(self.rb.buffer) > self.batch_size:
                self.learn()
              '''Set st = st+1'''
              # obs = obs_next

            # print(len(self.rb.buffer))

            self.episode_returns.append(episode_reward)
            if episode <=300:
              clear_output(wait=True)
              print(f'{episode}' )
            if episode > 300:#and total_reward > -50
              # if episode_reward > best_TR:
              #   best_TR = episode_reward
              #   print(f"------------------------------------------------------------")
              clear_output(wait=True)
              print(f'{episode}    {str(model_dir)} ' )


              savepath = os.path.join(model_dir, f'ep_{episode:04d}_{round(episode_reward,2)}.pth')
              torch.save(self.q_network.state_dict(), savepath)
              perper = round(num_maxQ_action / (num_maxQ_action+num_random_action),3)
              # if episode <= 0:
              #   savepath = os.path.join(model_dir, f'ep_{episode:04d}_{round(episode_reward,2)}.pth')
              #   torch.save(self.q_network.state_dict(), savepath)
              #   grade = self.scoreTheThing(savepath)
              #   print(f"Ep: {episode:04d}  {savepath}  {perper} ")
              if episode > 1:
                grade = self.scoreTheThing(savepath)
                # if grade <= 6:
                #   !rm -rf savepath

                if grade >= 6.5 and good_boy < 80:
                  good_boy += 1
                  savepath = os.path.join(success, f'ep_{episode:04d}_{round(episode_reward,2)}_{grade}.pth')
                  torch.save(self.q_network.state_dict(), savepath)
                  print(f"Ep: {episode:04d}  {savepath}  {perper} {grade} ")
                if grade >=7:
                  savepath = os.path.join(success, f'ep_{episode:04d}_{round(episode_reward,2)}_{grade}_lets_Freaking_go.pth')
                  torch.save(self.q_network.state_dict(), savepath)
                  print(f"dingdingding")
                  print(f"dingdingding")
                  print(f"dingdingding")
                  print(f"dingdingding")
                  print(f"dingdingding")


            '''Every k episodes: set target network QT=QA'''
            if episode % self.target_update_freq == 0 and episode != 0:
              # print('target net updated',"  epsilon: ", self.epsilon)
              self.target_network.load_state_dict(self.q_network.state_dict())


        return model_dir, self.episode_returns

    def learn(self):
        # print("learning")
        """Sample random minibatch of  transitions from D"""
        obs, action, reward, next_obs, _ = self.rb.sample(self.batch_size)
        obs       = torch.tensor(obs,      device=self.device, dtype=torch.float)
        action    = torch.tensor(action,   device=self.device, dtype=torch.long)
        reward    = torch.tensor(reward,   device=self.device, dtype=torch.float)
        next_obs  = torch.tensor(next_obs, device=self.device, dtype=torch.float)
        # done      = torch.tensor(done, device=self.device, dtype=torch.bool)

        # disc_action = self.q_network.select_discrete_action(obs, self.device)
        # cont_action = self.q_network.action_discrete_to_continuous(disc_action)
              # print(action,action_from_the_net)
        '''Execute at and observe rt+1, st+1; If  st+1 is terminal: break (end episode)'''
        # obs_next, reward, done, _ = self.env.step(cont_action)


        action_Q = self.q_network(obs, self.device).gather(1, action.unsqueeze(1)).squeeze(1)
        max_target_Q = self.target_network(next_obs, self.device).max(1)[0].detach()

        whatIjustLearn = reward + self.discount * max_target_Q

        '''Perform batch GD on QA with L = F( QA(s_t, a_t) - [ r_{t+1} + 𝛾 max_a QT(s_{t+1}, a) ] )'''
        loss = F.mse_loss(action_Q, whatIjustLearn)
        # loss = nn.SmoothL1Loss(current_q_values, whatIjustLearn)
        self.optimizer.zero_grad()
        loss.backward()
        nn.utils.clip_grad_norm_(list(self.q_network.parameters()), 0.2)
        self.optimizer.step()

    def scoreTheThing(self,model_path):
      arm = Robot(
        ArmDynamics(num_links=2,link_mass=0.1,link_length=1,joint_viscous_friction=0.1,dt=0.01,gravity=False))
      arm.reset()
      env = ArmEnv(arm, gui=False)
      device = torch.device('cpu')
      qnet = QNetwork(env).to(device)
      qnet.load_state_dict(torch.load(str(model_path), map_location=device))  # checkpoint was saved from the training device
      qnet.eval()
      score = compute_score(qnet, env, device)
      return score

arm = Robot(
        ArmDynamics(
            num_links=2,
            link_mass=0.1,
            link_length=1,
            joint_viscous_friction=0.1,
            dt=0.01,
            gravity=False
        )
    )
arm.reset()
env = ArmEnv(arm, gui=False)
tqdn = TrainDQN(env)
# ---------------

# Call your train function here
model_path,LotsOfReturn = tqdn.train()
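
The exploration schedule in `train()` multiplies epsilon by `epsilon_decay` once every 10 episodes and clips it at `min_epsilon`. The sketch below is a quick way to check how far that schedule gets within a run, assuming those are the only updates to epsilon. The helper names `epsilon_at` and `first_floor_episode` are hypothetical and not part of the project code.

```python
def epsilon_at(episode, start=1.0, decay=0.999, floor=0.4, every=10):
    """Epsilon in effect during `episode`, mirroring the loop in train():
    one multiplicative decay at episodes 10, 20, 30, ... clipped at `floor`."""
    eps = start
    for _ in range(episode // every):
        eps = max(floor, eps * decay)
    return eps


def first_floor_episode(start=1.0, decay=0.999, floor=0.4, every=10):
    """First episode index at which epsilon has been clipped to `floor`."""
    eps, episode = start, 0
    while eps > floor:
        episode += every
        eps = max(floor, eps * decay)
    return episode


print(epsilon_at(3999))        # epsilon during the last of 4000 episodes
print(first_floor_episode())   # episode at which the floor is first reached
```

With the settings of this cell (decay 0.999, floor 0.4, 4000 episodes), epsilon is still about 0.67 in the final episode, and the floor would only be reached at episode 9160, so `min_epsilon` never takes effect in this run.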



In [None]:
%%script false --no-raise-error
# @title Training center 18
# /content/models/2024-04-26_00-20-41 success
# !rm -rf /content/models/
import os
import time

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader
import seaborn as sns
import matplotlib.pyplot as plt
# os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

# from score import compute_score
from render import Renderer
from arm_env import ArmEnv
from robot import Robot
from arm_dynamics import ArmDynamics
from geometry import polar2cartesian
from IPython.display import clear_output

class TrainDQN:
    def __init__(self, env, seed=999):

        torch.manual_seed(seed)

        np.random.seed(seed)

        # torch.cuda.manual_seed_all(seed)
        self.env = env
        # self.device = torch.device('cpu')
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  # fall back to CPU when no GPU is available
        self.q_network = QNetwork(env).to(self.device) #action network
        self.target_network = QNetwork(env).to(self.device)
        self.target_network.load_state_dict(self.q_network.state_dict())
        # self.optimizer = optim.Adam(self.q_network.parameters(), lr=0.00005,weight_decay=0.0001)
        self.optimizer =torch.optim.AdamW(self.q_network.parameters(), lr=0.00008, betas=(0.9, 0.999), eps=1e-08, amsgrad=False)
        self.rb = ReplayBuffer(12000)
        self.batch_size = 256
        self.discount =  0.92 # (0.8, 0.9, 0.95, 0.99)
        self.target_update_freq = 37
        self.num_episodes = 4000
        self.epsilon = 1
        self.epsilon_decay = 0.999 #0.9999
        self.min_epsilon = 0.32 # not working 0.01 0.05 0.1 0.15 0.25
        self.episode_returns = []
        goal1 = polar2cartesian(1.9, -0.25  - np.pi/2.0)
        goal2 = polar2cartesian(1.6,  0.25  - np.pi/2.0)
        goal3 = polar2cartesian(1.8,  0.3   - np.pi/2.0)
        goal4 = polar2cartesian(1.5,  0.3   - np.pi/2.0)
        goal5 = polar2cartesian(1.6,  0.40  - np.pi/2.0)

        # self.goal_list = [goal1,goal2,goal3,goal4,goal5]
        self.goal_list = [goal4,goal5]
        model_path = ''
        if model_path:
            self.q_network.load_state_dict(torch.load(model_path))
            self.target_network.load_state_dict(self.q_network.state_dict())
            print(f"Loaded model weights from {model_path}")

    def train(self):
        goal = polar2cartesian(1.6, 0.25 - np.pi/2.0)
        timestr = time.strftime("%Y-%m-%d_%H-%M-%S")
        model_dir = os.path.join('models', timestr)
        os.makedirs(model_dir, exist_ok=True)  # Ensure directory exists
        # step_count = 0
        best_TR = float('inf')
        whichGoal = 0
        good_boy = 0
        success = os.path.join(model_dir, "success")
        os.makedirs(success, exist_ok=True)

        for episode in range(self.num_episodes):
            '''Initialize episode_reward = 0; s1 = random start state; reset robot to match s1'''
            episode_reward = 0

            obs = self.env.reset(self.goal_list[whichGoal])#goal = [[0.5],[-1.5]]
            # obs = self.env.reset()#goal = [[0.5],[-1.5]]
            whichGoal += 1
            if whichGoal == 2:
              whichGoal = 0
            done = False
            last_distance = np.inf
            if episode % 11 == 0 :
              num_random_action = 0
              num_maxQ_action = 0

            if episode % 10 == 0 and episode != 0:
              # Update exploration rate
                self.epsilon = max(self.min_epsilon, self.epsilon * self.epsilon_decay)
                # print(self.epsilon)
            for t in range(200):

               #Adaptive Epsilon:
              # obs = self.env.reset()
              obs = self.env.get_obs()
              """Select an action with an epsilon-greedy strategy.
              With probability epsilon, a_t is random; otherwise a_t = argmax_a Q^A(s_t, a)."""
              if np.random.rand() > self.epsilon:
                  disc_action = self.q_network.select_discrete_action(obs, self.device)
                  num_maxQ_action +=1
              else:
                  disc_action = np.random.randint(0, self.q_network.get_mapSize())
                  num_random_action +=1


              # print(self.q_network.get_mapSize())
              # action = self.q_network.select_discrete_action(obs, self.device)

              # disc_action = q_network.select_discrete_action(obs,device)
              # cont_action = q_network.action_discrete_to_continuous(disc_action)
              # obs, reward, done, _ = env.step(cont_action)
              '''get the action '''
              cont_action = self.q_network.action_discrete_to_continuous(disc_action)
              # print(action,action_from_the_net)
              '''Execute at and observe rt+1, st+1; If  st+1 is terminal: break (end episode)'''
              last_distance = np.linalg.norm(obs[[[4, 5]]] - obs[[[6, 7]]])


              obs_next, reward, done, _ = self.env.step(cont_action)
              current_distance = np.linalg.norm(obs_next[[[4, 5]]] - obs_next[[[6, 7]]])

              # current_distance = np.sqrt(np.abs(reward))
              episode_reward += reward

              # if current_distance < last_distance:
              #     reward += 0.1  # Bonus for moving closer to the goal
              # else:
              #     reward -= 0.3
              '''episode reward += rt+1'''

              '''Store transition (st, at, rt+1, st+1) in D (remove old data if  needed)'''
              # if current_distance < last_distance:
                # print(current_distance,last_distance,current_distance < last_distance)
              self.rb.put((obs, disc_action, reward, obs_next, done))
              last_distance = current_distance
              # obs, actions, rewards, next_obs, dones = self.env.step(cont_action)
              #Train network

              if len(self.rb.buffer) > self.batch_size:
                self.learn()
              '''Set st = st+1'''
              # obs = obs_next

            # print(len(self.rb.buffer))

            self.episode_returns.append(episode_reward)
            if episode <=300:
              clear_output(wait=True)
              print(f'{episode}' )
            if episode > 300:#and total_reward > -50
              # if episode_reward > best_TR:
              #   best_TR = episode_reward
              #   print(f"------------------------------------------------------------")
              clear_output(wait=True)
              print(f'{episode}    {str(model_dir)} ' )


              savepath = os.path.join(model_dir, f'ep_{episode:04d}_{round(episode_reward,2)}.pth')
              torch.save(self.q_network.state_dict(), savepath)
              perper = round(num_maxQ_action / (num_maxQ_action+num_random_action),3)
              # if episode <= 0:
              #   savepath = os.path.join(model_dir, f'ep_{episode:04d}_{round(episode_reward,2)}.pth')
              #   torch.save(self.q_network.state_dict(), savepath)
              #   grade = self.scoreTheThing(savepath)
              #   print(f"Ep: {episode:04d}  {savepath}  {perper} ")

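              # run_that_episode() requires a goal; whichGoal was already advanced above,
              # so index back by one to get the goal used in this episode
              total_reward = self.run_that_episode(savepath, self.goal_list[whichGoal - 1])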


              if episode > 1:
                grade = self.scoreTheThing(savepath)
                # if grade <= 6:
                #   !rm -rf savepath

                if grade >= 6.5 and good_boy < 80:
                  good_boy += 1
                  savepath = os.path.join(success, f'ep_{episode:04d}_{round(episode_reward,2)}_{grade}.pth')
                  torch.save(self.q_network.state_dict(), savepath)
                  print(f"Ep: {episode:04d}  {savepath}  {perper} {grade} ")
                if grade >=7:
                  savepath = os.path.join(success, f'ep_{episode:04d}_{round(episode_reward,2)}_{grade}_lets_Freaking_go.pth')
                  torch.save(self.q_network.state_dict(), savepath)
                  print(f"dingdingding")
                  print(f"dingdingding")
                  print(f"dingdingding")
                  print(f"dingdingding")
                  print(f"dingdingding")


            '''Every k episodes: set target network QT=QA'''

            if episode % self.target_update_freq == 0 and episode != 0:
              # print('target net updated',"  epsilon: ", self.epsilon)
              self.target_network.load_state_dict(self.q_network.state_dict())


        return model_dir, self.episode_returns

    def learn(self):
        # print("learning")
        """Sample random minibatch of  transitions from D"""
        obs, action, reward, next_obs, _ = self.rb.sample(self.batch_size)
        obs       = torch.tensor(obs,      device=self.device, dtype=torch.float)
        action    = torch.tensor(action,   device=self.device, dtype=torch.long)
        reward    = torch.tensor(reward,   device=self.device, dtype=torch.float)
        next_obs  = torch.tensor(next_obs, device=self.device, dtype=torch.float)
        # done      = torch.tensor(done, device=self.device, dtype=torch.bool)

        # disc_action = self.q_network.select_discrete_action(obs, self.device)
        # cont_action = self.q_network.action_discrete_to_continuous(disc_action)
              # print(action,action_from_the_net)
        '''Execute at and observe rt+1, st+1; If  st+1 is terminal: break (end episode)'''
        # obs_next, reward, done, _ = self.env.step(cont_action)


        action_Q = self.q_network(obs, self.device).gather(1, action.unsqueeze(1)).squeeze(1)
        max_target_Q = self.target_network(next_obs, self.device).max(1)[0].detach()

        whatIjustLearn = reward + self.discount * max_target_Q

        '''Perform batch GD on QA with L = F( QA(s_t, a_t) - [ r_{t+1} + 𝛾 max_a QT(s_{t+1}, a) ] )'''
        loss = F.mse_loss(action_Q, whatIjustLearn)
        # loss = nn.SmoothL1Loss(current_q_values, whatIjustLearn)
        self.optimizer.zero_grad()
        loss.backward()
        nn.utils.clip_grad_norm_(list(self.q_network.parameters()), 0.2)
        self.optimizer.step()

    def scoreTheThing(self,model_path):
      arm = Robot(
        ArmDynamics(num_links=2,link_mass=0.1,link_length=1,joint_viscous_friction=0.1,dt=0.01,gravity=False))
      arm.reset()
      env = ArmEnv(arm, gui=False)
      device = torch.device('cpu')
      qnet = QNetwork(env).to(device)
      qnet.load_state_dict(torch.load(str(model_path), map_location=device))  # checkpoint was saved from the training device
      qnet.eval()
      score = compute_score(qnet, env, device)
      return score

    def run_that_episode(self,model_path,goalgo):
      arm = Robot(
        ArmDynamics(num_links=2,link_mass=0.1,link_length=1,joint_viscous_friction=0.1,dt=0.01,gravity=False))
      arm.reset()
      env = ArmEnv(arm, gui=False)
      device = torch.device('cpu')
      qnet = QNetwork(env).to(device)
      qnet.load_state_dict(torch.load(str(model_path), map_location=device))  # checkpoint was saved from the training device
      qnet.eval()
      score = run_episode(qnet, env, device,goal = goalgo)
      return score


arm = Robot(
        ArmDynamics(
            num_links=2,
            link_mass=0.1,
            link_length=1,
            joint_viscous_friction=0.1,
            dt=0.01,
            gravity=False
        )
    )
arm.reset()
env = ArmEnv(arm, gui=False)
tqdn = TrainDQN(env)
# ---------------

# Call your train function here
model_path,LotsOfReturn = tqdn.train()
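
`learn()` forms its target as `reward + self.discount * max_target_Q` for every sampled transition and discards the `done` flag. That may be intentional if `done` only marks the 200-step time limit, but if the environment ever reports a true terminal state, the usual DQN target drops the bootstrap term for that transition. The sketch below shows the masked target in plain Python on a toy batch. `td_targets` is a hypothetical helper, not part of the project code; in the tensor code the equivalent expression would be `reward + self.discount * max_target_Q * (1 - done)` with `done` converted to a float tensor.

```python
def td_targets(rewards, next_q_max, dones, gamma=0.92):
    """One-step TD targets r + gamma * max_a Q_T(s', a), with the bootstrap
    term zeroed for transitions whose next state is terminal."""
    return [
        r + gamma * q_next * (0.0 if done else 1.0)
        for r, q_next, done in zip(rewards, next_q_max, dones)
    ]


# Toy batch of three transitions; the last one ends the episode.
rewards = [-1.0, -0.5, -0.25]
next_q_max = [-4.0, -2.0, -3.0]
dones = [False, False, True]
targets = td_targets(rewards, next_q_max, dones)
print(targets)
```

For the terminal transition the target reduces to the immediate reward, so the value estimate of the last state is not propagated past the end of the episode.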



In [None]:
%%script false --no-raise-error
# @title Training center 19  models/2024-04-26_18-02-48
# /content/models/2024-04-26_00-20-41 success
# !rm -rf /content/models/
import os
import time

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader
import seaborn as sns
import matplotlib.pyplot as plt
# os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

# from score import compute_score
from render import Renderer
from arm_env import ArmEnv
from robot import Robot
from arm_dynamics import ArmDynamics
from geometry import polar2cartesian
from IPython.display import clear_output

class TrainDQN:
    def __init__(self, env, seed=999):

        torch.manual_seed(seed)

        np.random.seed(seed)

        # torch.cuda.manual_seed_all(seed)
        self.env = env
        # self.device = torch.device('cpu')
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  # fall back to CPU when no GPU is available
        self.q_network = QNetwork(env).to(self.device) #action network
        self.target_network = QNetwork(env).to(self.device)
        self.target_network.load_state_dict(self.q_network.state_dict())
        # self.optimizer = optim.Adam(self.q_network.parameters(), lr=0.00005,weight_decay=0.0001)
        self.optimizer =torch.optim.AdamW(self.q_network.parameters(), lr=0.00008, betas=(0.9, 0.999), eps=1e-08, amsgrad=False)
        self.rb = ReplayBuffer(4000)
        self.batch_size = 256
        self.discount =  0.92 # (0.8, 0.9, 0.95, 0.99)
        self.target_update_freq = 37
        self.num_episodes = 15000
        self.epsilon = 1
        self.epsilon_decay = 0.999 #0.9999
        self.min_epsilon = 0.6 # not working 0.01 0.05 0.1 0.15 0.25
        self.episode_returns = []
        goal1 = polar2cartesian(1.9, -0.25  - np.pi/2.0)
        goal2 = polar2cartesian(1.6,  0.25  - np.pi/2.0)
        goal3 = polar2cartesian(1.8,  0.3   - np.pi/2.0)
        goal4 = polar2cartesian(1.5,  0.3   - np.pi/2.0)
        goal5 = polar2cartesian(1.6,  0.40  - np.pi/2.0)
        self.goal_list = [goal1,goal2,goal3,goal4,goal5]
        model_path = ''
        if model_path:
            self.q_network.load_state_dict(torch.load(model_path))
            self.target_network.load_state_dict(self.q_network.state_dict())
            print(f"Loaded model weights from {model_path}")

    def train(self):

        goal = polar2cartesian(1.6, 0.25 - np.pi/2.0)
        timestr = time.strftime("%Y-%m-%d_%H-%M-%S")
        model_dir = os.path.join('models', timestr)
        os.makedirs(model_dir, exist_ok=True)  # Ensure directory exists
        # step_count = 0
        best_TR = float('inf')
        whichGoal = 0
        good_boy = 0
        success = os.path.join(model_dir, "success")
        os.makedirs(success, exist_ok=True)

        for episode in range(self.num_episodes):
            '''Initialize episode_reward = 0; s1 = random start state; reset robot to match s1'''
            episode_reward = 0

            # obs = self.env.reset(self.goal_list[whichGoal])#goal = [[0.5],[-1.5]]
            obs = self.env.reset()#goal = [[0.5],[-1.5]]
            whichGoal = (whichGoal + 1) % len(self.goal_list)  # cycle through every goal, including the last one
            done = False
            last_distance = np.inf
            if episode % 11 == 0 :
              num_random_action = 0
              num_maxQ_action = 0

            if episode % 10 == 0 and episode != 0:
              # Update exploration rate
                self.epsilon = max(self.min_epsilon, self.epsilon * self.epsilon_decay)
                # print(self.epsilon)
            for t in range(200):

               #Adaptive Epsilon:
              # obs = self.env.reset()
              obs = self.env.get_obs()
              """Select an action with an epsilon-greedy strategy.
              With probability epsilon, a_t is random; otherwise a_t = argmax_a Q^A(s_t, a)."""
              if np.random.rand() > self.epsilon:
                  disc_action = self.q_network.select_discrete_action(obs, self.device)
                  num_maxQ_action +=1
              else:
                  disc_action = np.random.randint(0, self.q_network.get_mapSize())
                  num_random_action +=1


              # print(self.q_network.get_mapSize())
              # action = self.q_network.select_discrete_action(obs, self.device)

              # disc_action = q_network.select_discrete_action(obs,device)
              # cont_action = q_network.action_discrete_to_continuous(disc_action)
              # obs, reward, done, _ = env.step(cont_action)
              '''get the action '''
              cont_action = self.q_network.action_discrete_to_continuous(disc_action)
              # print(action,action_from_the_net)
              '''Execute at and observe rt+1, st+1; If  st+1 is terminal: break (end episode)'''
              last_distance = np.linalg.norm(obs[[[4, 5]]] - obs[[[6, 7]]])


              obs_next, reward, done, _ = self.env.step(cont_action)
              current_distance = np.linalg.norm(obs_next[[[4, 5]]] - obs_next[[[6, 7]]])

              # current_distance = np.sqrt(np.abs(reward))
              episode_reward += reward

              # if current_distance < last_distance:
              #     reward += 0.1  # Bonus for moving closer to the goal
              # else:
              #     reward -= 0.3
              '''episode reward += rt+1'''

              '''Store transition (st, at, rt+1, st+1) in D (remove old data if  needed)'''
              # if current_distance < last_distance:
                # print(current_distance,last_distance,current_distance < last_distance)
              self.rb.put((obs, disc_action, reward, obs_next, done))
              last_distance = current_distance
              # obs, actions, rewards, next_obs, dones = self.env.step(cont_action)
              #Train network

              if len(self.rb.buffer) > self.batch_size:
                self.learn()
              '''Set st = st+1'''
              # obs = obs_next

            # print(len(self.rb.buffer))

            self.episode_returns.append(episode_reward)
            if episode <=300:
              clear_output(wait=True)
              print(f'{episode}' )
            if episode > 300:#and total_reward > -50
              # if episode_reward > best_TR:
              #   best_TR = episode_reward
              #   print(f"------------------------------------------------------------")
              clear_output(wait=True)
              print(f'{episode}    {str(model_dir)} ' )


              savepath = os.path.join(model_dir, f'ep_{episode:04d}_{round(episode_reward,2)}.pth')
              torch.save(self.q_network.state_dict(), savepath)
              perper = round(num_maxQ_action / (num_maxQ_action+num_random_action),3)
              # if episode <= 0:
              #   savepath = os.path.join(model_dir, f'ep_{episode:04d}_{round(episode_reward,2)}.pth')
              #   torch.save(self.q_network.state_dict(), savepath)
              #   grade = self.scoreTheThing(savepath)
              #   print(f"Ep: {episode:04d}  {savepath}  {perper} ")
              if episode > 1:
                grade = self.scoreTheThing(savepath)
                # if grade <= 6:
                #   !rm -rf savepath

                if grade >= 6.5 and good_boy < 80:
                  good_boy += 1
                  savepath = os.path.join(success, f'ep_{episode:04d}_{round(episode_reward,2)}_{grade}.pth')
                  torch.save(self.q_network.state_dict(), savepath)
                  print(f"Ep: {episode:04d}  {savepath}  {perper} {grade} ")
                if grade >=7:
                  savepath = os.path.join(success, f'ep_{episode:04d}_{round(episode_reward,2)}_{grade}_lets_Freaking_go.pth')
                  torch.save(self.q_network.state_dict(), savepath)
                  print(f"dingdingding")
                  print(f"dingdingding")
                  print(f"dingdingding")
                  print(f"dingdingding")
                  print(f"dingdingding")


            '''Every k episodes: set target network QT=QA'''
            if episode % self.target_update_freq == 0 and episode != 0:
              # print('target net updated',"  epsilon: ", self.epsilon)
              self.target_network.load_state_dict(self.q_network.state_dict())


        return model_dir, self.episode_returns

    def learn(self):
        # print("learning")
        """Sample random minibatch of  transitions from D"""
        obs, action, reward, next_obs, _ = self.rb.sample(self.batch_size)
        obs       = torch.tensor(obs,      device=self.device, dtype=torch.float)
        action    = torch.tensor(action,   device=self.device, dtype=torch.long)
        reward    = torch.tensor(reward,   device=self.device, dtype=torch.float)
        next_obs  = torch.tensor(next_obs, device=self.device, dtype=torch.float)
        # done      = torch.tensor(done, device=self.device, dtype=torch.bool)

        # disc_action = self.q_network.select_discrete_action(obs, self.device)
        # cont_action = self.q_network.action_discrete_to_continuous(disc_action)
              # print(action,action_from_the_net)
        '''Execute at and observe rt+1, st+1; If  st+1 is terminal: break (end episode)'''
        # obs_next, reward, done, _ = self.env.step(cont_action)


        action_Q = self.q_network(obs, self.device).gather(1, action.unsqueeze(1)).squeeze(1)
        max_target_Q = self.target_network(next_obs, self.device).max(1)[0].detach()

        whatIjustLearn = reward + self.discount * max_target_Q

        '''Perform batch GD on QA with L = F( QA(s_t, a_t) - [ r_{t+1} + 𝛾 max_a QT(s_{t+1}, a) ] )'''
        loss = F.mse_loss(action_Q, whatIjustLearn)
        # loss = nn.SmoothL1Loss(current_q_values, whatIjustLearn)
        self.optimizer.zero_grad()
        loss.backward()
        nn.utils.clip_grad_norm_(list(self.q_network.parameters()), 0.2)
        self.optimizer.step()

    def scoreTheThing(self,model_path):
      arm = Robot(
        ArmDynamics(num_links=2,link_mass=0.1,link_length=1,joint_viscous_friction=0.1,dt=0.01,gravity=False))
      arm.reset()
      env = ArmEnv(arm, gui=False)
      device = torch.device('cpu')
      qnet = QNetwork(env).to(device)
      qnet.load_state_dict(torch.load(str(model_path), map_location=device))  # checkpoint was saved from the training device
      qnet.eval()
      score = compute_score(qnet, env, device)
      return score

arm = Robot(
        ArmDynamics(
            num_links=2,
            link_mass=0.1,
            link_length=1,
            joint_viscous_friction=0.1,
            dt=0.01,
            gravity=False
        )
    )
arm.reset()
env = ArmEnv(arm, gui=False)
tqdn = TrainDQN(env)
# ---------------

# Call your train function here
model_path,LotsOfReturn = tqdn.train()
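
This cell writes a checkpoint for every episode after 300, which over 15000 episodes leaves thousands of `.pth` files in `model_dir`. One way to bound that, sketched here with only the standard library, is to keep the `k` best-scoring checkpoints and delete the rest. `TopKCheckpoints` and its methods are hypothetical names, the files are empty placeholders rather than real `torch.save` output, and the scores stand in for whatever `compute_score` returns.

```python
import heapq
import os
import tempfile


class TopKCheckpoints:
    """Track checkpoint files by score and keep only the k best on disk."""

    def __init__(self, k):
        self.k = k
        self._heap = []  # min-heap of (score, path); worst kept entry on top

    def add(self, score, path):
        """Register a saved checkpoint; delete whichever file falls out of the top k."""
        heapq.heappush(self._heap, (score, path))
        if len(self._heap) > self.k:
            _, evicted = heapq.heappop(self._heap)
            if os.path.exists(evicted):
                os.remove(evicted)

    def best(self):
        """Return (score, path) pairs, best first."""
        return sorted(self._heap, reverse=True)


# Demo with empty placeholder files standing in for saved state dicts.
workdir = tempfile.mkdtemp()
keeper = TopKCheckpoints(k=2)
for episode, score in [(301, 5.0), (302, 6.5), (303, 4.0), (304, 7.0)]:
    path = os.path.join(workdir, f"ep_{episode:04d}_{score}.pth")
    open(path, "w").close()
    keeper.add(score, path)

kept = sorted(os.listdir(workdir))
print(kept)
```

In the training loop, `keeper.add(grade, savepath)` would follow each `torch.save` call, replacing the fixed `grade >= 6.5 and good_boy < 80` cap with a bound that always favors the highest grades seen so far.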



In [None]:
%%script false --no-raise-error
# @title Training center 20 with clipping models/2024-04-26_15-13-24
# /content/models/2024-04-26_00-20-41 success
# !rm -rf /content/models/
import os
import time

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader
import seaborn as sns
import matplotlib.pyplot as plt
# os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

# from score import compute_score
from render import Renderer
from arm_env import ArmEnv
from robot import Robot
from arm_dynamics import ArmDynamics
from geometry import polar2cartesian
from IPython.display import clear_output

class TrainDQN:
    def __init__(self, env, seed=999):

        torch.manual_seed(seed)

        np.random.seed(seed)

        # torch.cuda.manual_seed_all(seed)
        self.env = env
        # self.device = torch.device('cpu')
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  # fall back to CPU when no GPU is available
        self.q_network = QNetwork(env).to(self.device) #action network
        self.target_network = QNetwork(env).to(self.device)
        self.target_network.load_state_dict(self.q_network.state_dict())
        # self.optimizer = optim.Adam(self.q_network.parameters(), lr=0.00005,weight_decay=0.0001)
        self.optimizer =torch.optim.AdamW(self.q_network.parameters(), lr=0.00008, betas=(0.9, 0.999), eps=1e-08, amsgrad=False)
        self.rb = ReplayBuffer(5000)
        self.batch_size = 512
        self.discount =  0.92 # (0.8, 0.9, 0.95, 0.99)
        self.target_update_freq = 37
        self.num_episodes = 15000
        self.epsilon = 1
        self.epsilon_decay = 0.999 #0.9999
        self.min_epsilon = 0.6 # not working 0.01 0.05 0.1 0.15 0.25
        self.episode_returns = []
        goal1 = polar2cartesian(1.9, -0.25  - np.pi/2.0)
        goal2 = polar2cartesian(1.6,  0.25  - np.pi/2.0)
        goal3 = polar2cartesian(1.8,  0.3   - np.pi/2.0)
        goal4 = polar2cartesian(1.5,  0.3   - np.pi/2.0)
        goal5 = polar2cartesian(1.6,  0.40  - np.pi/2.0)

        self.goal_list = [goal1,goal2,goal3,goal4,goal5]
        model_path = ''
        if model_path:
            self.q_network.load_state_dict(torch.load(model_path))
            self.target_network.load_state_dict(self.q_network.state_dict())
            print(f"Loaded model weights from {model_path}")

    def train(self):

        goal = polar2cartesian(1.6, 0.25 - np.pi/2.0)
        timestr = time.strftime("%Y-%m-%d_%H-%M-%S")
        model_dir = os.path.join('models', timestr)
        os.makedirs(model_dir, exist_ok=True)  # Ensure directory exists
        # step_count = 0
        best_TR = float('inf')
        whichGoal = 0
        good_boy = 0
        success = os.path.join(model_dir, "success")
        if not os.path.exists(success):
          os.makedirs(success)

        for episode in range(self.num_episodes):
            '''Initialize episode_reward = 0; s1 = random start state; reset robot to match s1'''
            episode_reward = 0

            # obs = self.env.reset(self.goal_list[whichGoal])#goal = [[0.5],[-1.5]]
            obs = self.env.reset()#goal = [[0.5],[-1.5]]
            whichGoal += 1
            if whichGoal == 4:
              whichGoal = 0
            done = False
            last_distance = np.inf
            if episode % 11 == 0 :
              num_random_action = 0
              num_maxQ_action = 0

            if episode % 10 == 0 and episode != 0:
              # Update exploration rate
                self.epsilon = max(self.min_epsilon, self.epsilon * self.epsilon_decay)
                # print(self.epsilon)
            for t in range(200):

              # Read the current state s_t from the environment
              obs = self.env.get_obs()
              # Epsilon-greedy action selection: with probability epsilon, a_t is a random action;
              # otherwise a_t = argmax_a Q^A(s_t, a)
              if np.random.rand() > self.epsilon:
                  disc_action = self.q_network.select_discrete_action(obs, self.device)
                  num_maxQ_action +=1
              else:
                  disc_action = np.random.randint(0, self.q_network.get_mapSize())
                  num_random_action +=1


              # print(self.q_network.get_mapSize())
              # action = self.q_network.select_discrete_action(obs, self.device)

              # disc_action = q_network.select_discrete_action(obs,device)
              # cont_action = q_network.action_discrete_to_continuous(disc_action)
              # obs, reward, done, _ = env.step(cont_action)
              '''get the action '''
              cont_action = self.q_network.action_discrete_to_continuous(disc_action)
              # print(action,action_from_the_net)
              '''Execute at and observe rt+1, st+1; If  st+1 is terminal: break (end episode)'''
              last_distance = np.linalg.norm(obs[[[4, 5]]] - obs[[[6, 7]]])


              obs_next, reward, done, _ = self.env.step(cont_action)
              current_distance = np.linalg.norm(obs_next[[[4, 5]]] - obs_next[[[6, 7]]])

              # current_distance = np.sqrt(np.abs(reward))
              episode_reward += reward

              # if current_distance < last_distance:
              #     reward += 0.1  # Bonus for moving closer to the goal
              # else:
              #     reward -= 0.3
              '''episode reward += rt+1'''

              '''Store transition (st, at, rt+1, st+1) in D (remove old data if  needed)'''
              # if current_distance < last_distance:
                # print(current_distance,last_distance,current_distance < last_distance)
              self.rb.put((obs, disc_action, reward, obs_next, done))
              last_distance = current_distance
              # obs, actions, rewards, next_obs, dones = self.env.step(cont_action)
              #Train network

              if len(self.rb.buffer) > self.batch_size:
                self.learn()
              '''Set st = st+1'''
              # obs = obs_next

            # print(len(self.rb.buffer))

            self.episode_returns.append(episode_reward)
            if episode <=300:
              clear_output(wait=True)
              print(f'{episode}' )
            if episode > 300:#and total_reward > -50
              # if episode_reward > best_TR:
              #   best_TR = episode_reward
              #   print(f"------------------------------------------------------------")
              clear_output(wait=True)
              print(f'{episode}    {str(model_dir)} ' )


              savepath = os.path.join(model_dir, f'ep_{episode:04d}_{round(episode_reward,2)}.pth')
              torch.save(self.q_network.state_dict(), savepath)
              perper = round(num_maxQ_action / (num_maxQ_action+num_random_action),3)
              # if episode <= 0:
              #   savepath = os.path.join(model_dir, f'ep_{episode:04d}_{round(episode_reward,2)}.pth')
              #   torch.save(self.q_network.state_dict(), savepath)
              #   grade = self.scoreTheThing(savepath)
              #   print(f"Ep: {episode:04d}  {savepath}  {perper} ")
              if episode > 1:
                grade = self.scoreTheThing(savepath)
                # if grade <= 6:
                #   !rm -rf savepath

                if grade >= 6.5 and good_boy < 80:
                  good_boy += 1
                  savepath = os.path.join(success, f'ep_{episode:04d}_{round(episode_reward,2)}_{grade}.pth')
                  torch.save(self.q_network.state_dict(), savepath)
                  print(f"Ep: {episode:04d}  {savepath}  {perper} {grade} ")
                if grade >=7:
                  savepath = os.path.join(success, f'ep_{episode:04d}_{round(episode_reward,2)}_{grade}_lets_Freaking_go.pth')
                  torch.save(self.q_network.state_dict(), savepath)
                  print(f"dingdingding")
                  print(f"dingdingding")
                  print(f"dingdingding")
                  print(f"dingdingding")
                  print(f"dingdingding")


            '''Every k episodes: set target network QT=QA'''
            if episode % self.target_update_freq == 0 and episode != 0:
              # print('target net updated',"  epsilon: ", self.epsilon)
              self.target_network.load_state_dict(self.q_network.state_dict())


        return model_dir, self.episode_returns

    def learn(self):
        """Run one gradient step on a random minibatch of transitions from the replay buffer D."""
        obs, action, reward, next_obs, _ = self.rb.sample(self.batch_size)
        obs       = torch.tensor(obs,      device=self.device, dtype=torch.float)
        action    = torch.tensor(action,   device=self.device, dtype=torch.long)
        reward    = torch.tensor(reward,   device=self.device, dtype=torch.float)
        next_obs  = torch.tensor(next_obs, device=self.device, dtype=torch.float)
        # The sampled done flags are discarded above, so terminal transitions are not masked.
        # done      = torch.tensor(done, device=self.device, dtype=torch.bool)

        # Q^A(s_t, a_t) for the actions that were actually taken
        action_Q = self.q_network(obs, self.device).gather(1, action.unsqueeze(1)).squeeze(1)
        # max_a Q^T(s_{t+1}, a) from the target network, detached from the graph
        max_target_Q = self.target_network(next_obs, self.device).max(1)[0].detach()

        # TD target: r_{t+1} + gamma * max_a Q^T(s_{t+1}, a)
        whatIjustLearn = reward + self.discount * max_target_Q

        # Batch gradient step on Q^A with L = MSE(Q^A(s_t, a_t), TD target)
        loss = F.mse_loss(action_Q, whatIjustLearn)
        # loss = F.smooth_l1_loss(action_Q, whatIjustLearn)  # Huber alternative
        self.optimizer.zero_grad()
        loss.backward()
        # Clip the global gradient norm to 0.2 before the optimizer step
        nn.utils.clip_grad_norm_(list(self.q_network.parameters()), 0.2)
        self.optimizer.step()

    def scoreTheThing(self,model_path):
      arm = Robot(
        ArmDynamics(num_links=2,link_mass=0.1,link_length=1,joint_viscous_friction=0.1,dt=0.01,gravity=False))
      arm.reset()
      env = ArmEnv(arm, gui=False)
      device = torch.device('cpu')
      qnet = QNetwork(env).to(device)
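      # map_location keeps the load on the CPU even though the checkpoint was saved from the GPU
      qnet.load_state_dict(torch.load(str(model_path), map_location=device))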
      qnet.eval()
      score = compute_score(qnet, env, device)
      return score

arm = Robot(
        ArmDynamics(
            num_links=2,
            link_mass=0.1,
            link_length=1,
            joint_viscous_friction=0.1,
            dt=0.01,
            gravity=False
        )
    )
arm.reset()
env = ArmEnv(arm, gui=False)
tqdn = TrainDQN(env)
# ---------------

# Call your train function here
model_path,LotsOfReturn = tqdn.train()
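
The `learn` method in the cell above discards the sampled `done` flags, so the target bootstraps from `next_obs` on every transition. A common DQN variant zeroes the bootstrap term on terminal transitions. The block below is a minimal NumPy sketch of that arithmetic, not the project's implementation: `td_target` and its argument names are hypothetical, and it assumes `done` is 1 only for true terminal states (not for time-limit truncation).

```python
import numpy as np

def td_target(reward, max_next_q, done, discount=0.92):
    """Return r + discount * max_a Q_target(s', a), with the bootstrap term zeroed when done."""
    reward = np.asarray(reward, dtype=np.float64)
    max_next_q = np.asarray(max_next_q, dtype=np.float64)
    not_done = 1.0 - np.asarray(done, dtype=np.float64)
    return reward + discount * not_done * max_next_q

# Hypothetical minibatch of three transitions; only the last one is terminal.
targets = td_target(
    reward=[-1.0, -0.5, -0.1],
    max_next_q=[-2.0, -1.0, -4.0],
    done=[0, 0, 1],
)
print(targets)
```

For the terminal transition the target reduces to the reward alone, so the target network's estimate for `next_obs` no longer leaks into the loss.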



In [None]:
%%script false --no-raise-error
# @title Training center 21
# /content/models/2024-04-26_00-20-41 success
# !rm -rf /content/models/
import os
import time

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from IPython.display import clear_output
from torch.utils.data import DataLoader

from arm_dynamics import ArmDynamics
from arm_env import ArmEnv
from geometry import polar2cartesian
from render import Renderer
from robot import Robot
# from score import compute_score
# os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

class TrainDQN:
    def __init__(self, env, seed=999):

        torch.manual_seed(seed)

        np.random.seed(seed)

        # torch.cuda.manual_seed_all(seed)
        self.env = env
        # Use the GPU when one is available; otherwise fall back to the CPU
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.q_network = QNetwork(env).to(self.device) #action network
        self.target_network = QNetwork(env).to(self.device)
        self.target_network.load_state_dict(self.q_network.state_dict())
        # self.optimizer = optim.Adam(self.q_network.parameters(), lr=0.00005,weight_decay=0.0001)
        self.optimizer =torch.optim.AdamW(self.q_network.parameters(), lr=0.00008, betas=(0.9, 0.999), eps=1e-08, amsgrad=False)
        self.rb = ReplayBuffer(12000)
        self.batch_size = 512
        self.discount =  0.92 # (0.8, 0.9, 0.95, 0.99)
        self.target_update_freq = 37
        self.num_episodes = 5000
        self.epsilon = 1
        self.epsilon_decay = 0.999 #0.9999
        self.min_epsilon = 0.15 # not working 0.01 0.05 0.1 0.15 0.25
        self.episode_returns = []
        goal1 = polar2cartesian(1.9, -0.25  - np.pi/2.0)
        goal2 = polar2cartesian(1.6,  0.25  - np.pi/2.0)
        goal3 = polar2cartesian(1.8,  0.3   - np.pi/2.0)
        goal4 = polar2cartesian(1.5,  0.3   - np.pi/2.0)
        goal5 = polar2cartesian(1.6,  0.40  - np.pi/2.0)

        self.goal_list = [goal1,goal2,goal3,goal4,goal5]
        model_path = ''
        if model_path:
            self.q_network.load_state_dict(torch.load(model_path))
            self.target_network.load_state_dict(self.q_network.state_dict())
            print(f"Loaded model weights from {model_path}")

    def train(self):

        goal = polar2cartesian(1.6, 0.25 - np.pi/2.0)
        timestr = time.strftime("%Y-%m-%d_%H-%M-%S")
        model_dir = os.path.join('models', timestr)
        os.makedirs(model_dir, exist_ok=True)  # Ensure directory exists
        # step_count = 0
        best_TR = float('inf')
        whichGoal = 0
        good_boy = 0
        success = os.path.join(model_dir, "success")
        if not os.path.exists(success):
          os.makedirs(success)

        for episode in range(self.num_episodes):
            '''Initialize episode_reward = 0; s1 = random start state; reset robot to match s1'''
            episode_reward = 0

            # obs = self.env.reset(self.goal_list[whichGoal])#goal = [[0.5],[-1.5]]
            obs = self.env.reset()#goal = [[0.5],[-1.5]]
            whichGoal += 1
            if whichGoal == 4:
              whichGoal = 0
            done = False
            last_distance = np.inf
            if episode % 11 == 0 :
              num_random_action = 0
              num_maxQ_action = 0

            if episode % 10 == 0 and episode != 0:
              # Update exploration rate
                self.epsilon = max(self.min_epsilon, self.epsilon * self.epsilon_decay)
                # print(self.epsilon)
            for t in range(200):

              # Read the current state s_t from the environment
              obs = self.env.get_obs()
              # Epsilon-greedy action selection: with probability epsilon, a_t is a random action;
              # otherwise a_t = argmax_a Q^A(s_t, a)
              if np.random.rand() > self.epsilon:
                  disc_action = self.q_network.select_discrete_action(obs, self.device)
                  num_maxQ_action +=1
              else:
                  disc_action = np.random.randint(0, self.q_network.get_mapSize())
                  num_random_action +=1


              # print(self.q_network.get_mapSize())
              # action = self.q_network.select_discrete_action(obs, self.device)

              # disc_action = q_network.select_discrete_action(obs,device)
              # cont_action = q_network.action_discrete_to_continuous(disc_action)
              # obs, reward, done, _ = env.step(cont_action)
              '''get the action '''
              cont_action = self.q_network.action_discrete_to_continuous(disc_action)
              # print(action,action_from_the_net)
              '''Execute at and observe rt+1, st+1; If  st+1 is terminal: break (end episode)'''
              last_distance = np.linalg.norm(obs[[[4, 5]]] - obs[[[6, 7]]])


              obs_next, reward, done, _ = self.env.step(cont_action)
              current_distance = np.linalg.norm(obs_next[[[4, 5]]] - obs_next[[[6, 7]]])

              # current_distance = np.sqrt(np.abs(reward))
              episode_reward += reward

              # if current_distance < last_distance:
              #     reward += 0.1  # Bonus for moving closer to the goal
              # else:
              #     reward -= 0.3
              '''episode reward += rt+1'''

              '''Store transition (st, at, rt+1, st+1) in D (remove old data if  needed)'''
              # if current_distance < last_distance:
                # print(current_distance,last_distance,current_distance < last_distance)
              self.rb.put((obs, disc_action, reward, obs_next, done))
              last_distance = current_distance
              # obs, actions, rewards, next_obs, dones = self.env.step(cont_action)
              #Train network

              if len(self.rb.buffer) > self.batch_size:
                self.learn()
              '''Set st = st+1'''
              # obs = obs_next

            # print(len(self.rb.buffer))

            self.episode_returns.append(episode_reward)
            if episode <=300:
              clear_output(wait=True)
              print(f'{episode}' )
            if episode > 300:#and total_reward > -50
              # if episode_reward > best_TR:
              #   best_TR = episode_reward
              #   print(f"------------------------------------------------------------")
              clear_output(wait=True)
              print(f'{episode}    {str(model_dir)} ' )


              savepath = os.path.join(model_dir, f'ep_{episode:04d}_{round(episode_reward,2)}.pth')
              torch.save(self.q_network.state_dict(), savepath)
              perper = round(num_maxQ_action / (num_maxQ_action+num_random_action),3)
              # if episode <= 0:
              #   savepath = os.path.join(model_dir, f'ep_{episode:04d}_{round(episode_reward,2)}.pth')
              #   torch.save(self.q_network.state_dict(), savepath)
              #   grade = self.scoreTheThing(savepath)
              #   print(f"Ep: {episode:04d}  {savepath}  {perper} ")
              if episode > 1:
                grade = self.scoreTheThing(savepath)
                # if grade <= 6:
                #   !rm -rf savepath

                if grade >= 6.5 and good_boy < 80:
                  good_boy += 1
                  savepath = os.path.join(success, f'ep_{episode:04d}_{round(episode_reward,2)}_{grade}.pth')
                  torch.save(self.q_network.state_dict(), savepath)
                  print(f"Ep: {episode:04d}  {savepath}  {perper} {grade} ")
                if grade >=7:
                  savepath = os.path.join(success, f'ep_{episode:04d}_{round(episode_reward,2)}_{grade}_lets_Freaking_go.pth')
                  torch.save(self.q_network.state_dict(), savepath)
                  print(f"dingdingding")
                  print(f"dingdingding")
                  print(f"dingdingding")
                  print(f"dingdingding")
                  print(f"dingdingding")


            '''Every k episodes: set target network QT=QA'''
            if episode % self.target_update_freq == 0 and episode != 0:
              # print('target net updated',"  epsilon: ", self.epsilon)
              self.target_network.load_state_dict(self.q_network.state_dict())


        return model_dir, self.episode_returns

    def learn(self):
        """Run one gradient step on a random minibatch of transitions from the replay buffer D."""
        obs, action, reward, next_obs, _ = self.rb.sample(self.batch_size)
        obs       = torch.tensor(obs,      device=self.device, dtype=torch.float)
        action    = torch.tensor(action,   device=self.device, dtype=torch.long)
        reward    = torch.tensor(reward,   device=self.device, dtype=torch.float)
        next_obs  = torch.tensor(next_obs, device=self.device, dtype=torch.float)
        # The sampled done flags are discarded above, so terminal transitions are not masked.
        # done      = torch.tensor(done, device=self.device, dtype=torch.bool)

        # Q^A(s_t, a_t) for the actions that were actually taken
        action_Q = self.q_network(obs, self.device).gather(1, action.unsqueeze(1)).squeeze(1)
        # max_a Q^T(s_{t+1}, a) from the target network, detached from the graph
        max_target_Q = self.target_network(next_obs, self.device).max(1)[0].detach()

        # TD target: r_{t+1} + gamma * max_a Q^T(s_{t+1}, a)
        whatIjustLearn = reward + self.discount * max_target_Q

        # Batch gradient step on Q^A with L = MSE(Q^A(s_t, a_t), TD target)
        loss = F.mse_loss(action_Q, whatIjustLearn)
        # loss = F.smooth_l1_loss(action_Q, whatIjustLearn)  # Huber alternative
        self.optimizer.zero_grad()
        loss.backward()
        # Clip the global gradient norm to 0.2 before the optimizer step
        nn.utils.clip_grad_norm_(list(self.q_network.parameters()), 0.2)
        self.optimizer.step()

    def scoreTheThing(self,model_path):
      arm = Robot(
        ArmDynamics(num_links=2,link_mass=0.1,link_length=1,joint_viscous_friction=0.1,dt=0.01,gravity=False))
      arm.reset()
      env = ArmEnv(arm, gui=False)
      device = torch.device('cpu')
      qnet = QNetwork(env).to(device)
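      # map_location keeps the load on the CPU even though the checkpoint was saved from the GPU
      qnet.load_state_dict(torch.load(str(model_path), map_location=device))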
      qnet.eval()
      score = compute_score(qnet, env, device)
      return score

arm = Robot(
        ArmDynamics(
            num_links=2,
            link_mass=0.1,
            link_length=1,
            joint_viscous_friction=0.1,
            dt=0.01,
            gravity=False
        )
    )
arm.reset()
env = ArmEnv(arm, gui=False)
tqdn = TrainDQN(env)
# ---------------

# Call your train function here
model_path,LotsOfReturn = tqdn.train()
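
The cell above multiplies `epsilon` by `epsilon_decay = 0.999` once every 10 episodes (skipping episode 0) and clamps it at `min_epsilon = 0.15` over `num_episodes = 5000`. To see what exploration rate that actually produces, the schedule can be replayed on its own. This is a standalone sketch; `epsilon_schedule` is a hypothetical helper, not part of the project code.

```python
def epsilon_schedule(num_episodes, decay=0.999, every=10, floor=0.15, start=1.0):
    """Replay the decay rule from the training loop and return epsilon at each episode."""
    epsilon = start
    history = []
    for episode in range(num_episodes):
        # Same condition as the training loop: decay every `every` episodes, skipping episode 0
        if episode % every == 0 and episode != 0:
            epsilon = max(floor, epsilon * decay)
        history.append(epsilon)
    return history

floor = 0.15
history = epsilon_schedule(num_episodes=5000, floor=floor)
final_epsilon = history[-1]
# First episode at which the floor is reached, or None if it never is
first_floor_episode = next((ep for ep, eps in enumerate(history) if eps <= floor), None)
print(f"final epsilon: {final_epsilon:.3f}, floor reached at: {first_floor_episode}")
```

With these settings only 499 decay steps happen in 5000 episodes, so epsilon is still about 0.61 at the last episode and the 0.15 floor is never reached. Reaching the floor would need a faster decay, more frequent decay steps, or more episodes.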



In [None]:
%%script false --no-raise-error
# @title Training center 22, dist smaller than 0.05 extra reward
# /content/models/2024-04-26_00-20-41 success
# !rm -rf /content/models/
import os
import time

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from IPython.display import clear_output
from torch.utils.data import DataLoader

from arm_dynamics import ArmDynamics
from arm_env import ArmEnv
from geometry import polar2cartesian
from render import Renderer
from robot import Robot
# from score import compute_score
# os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

class TrainDQN:
    def __init__(self, env, seed=44):

        torch.manual_seed(seed)

        np.random.seed(seed)

        # torch.cuda.manual_seed_all(seed)
        self.env = env
        # Use the GPU when one is available; otherwise fall back to the CPU
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.q_network = QNetwork(env).to(self.device) #action network
        self.target_network = QNetwork(env).to(self.device)
        self.target_network.load_state_dict(self.q_network.state_dict())
        # self.optimizer = optim.Adam(self.q_network.parameters(), lr=0.00005,weight_decay=0.0001)
        self.optimizer =torch.optim.AdamW(self.q_network.parameters(), lr=0.00004, betas=(0.9, 0.999), eps=1e-08, amsgrad=False)
        self.rb = ReplayBuffer(12000)
        self.batch_size = 512
        self.discount =  0.9 # (0.8, 0.9, 0.95, 0.99)
        self.target_update_freq = 30
        self.num_episodes = 10000
        self.epsilon = 1
        self.epsilon_decay = 0.999 #0.9999
        self.min_epsilon = 0.33 # not working 0.01 0.05 0.1 0.15 0.25
        self.episode_returns = []
        goal1 = polar2cartesian(1.9, -0.25  - np.pi/2.0)
        goal2 = polar2cartesian(1.6,  0.25  - np.pi/2.0)
        goal3 = polar2cartesian(1.8,  0.3   - np.pi/2.0)
        goal4 = polar2cartesian(1.5,  0.3   - np.pi/2.0)
        goal5 = polar2cartesian(1.6,  0.40  - np.pi/2.0)

        self.goal_list = [goal1,goal2,goal3,goal4,goal5]
        model_path = ''
        if model_path:
            self.q_network.load_state_dict(torch.load(model_path))
            self.target_network.load_state_dict(self.q_network.state_dict())
            print(f"Loaded model weights from {model_path}")

    def train(self):

        goal = polar2cartesian(1.6, 0.25 - np.pi/2.0)
        timestr = time.strftime("%Y-%m-%d_%H-%M-%S")
        model_dir = os.path.join('models', timestr)
        os.makedirs(model_dir, exist_ok=True)  # Ensure directory exists
        # step_count = 0
        best_TR = float('inf')
        whichGoal = 0
        good_boy = 0
        success = os.path.join(model_dir, "success")
        if not os.path.exists(success):
          os.makedirs(success)

        for episode in range(self.num_episodes):
            '''Initialize episode_reward = 0; s1 = random start state; reset robot to match s1'''
            episode_reward = 0

            # obs = self.env.reset(self.goal_list[whichGoal])#goal = [[0.5],[-1.5]]
            obs = self.env.reset()#goal = [[0.5],[-1.5]]
            whichGoal += 1
            if whichGoal == 4:
              whichGoal = 0
            done = False
            last_distance = np.inf
            if episode % 11 == 0 :
              num_random_action = 0
              num_maxQ_action = 0

            if episode % 10 == 0 and episode != 0:
              # Update exploration rate
                self.epsilon = max(self.min_epsilon, self.epsilon * self.epsilon_decay)
                # print(self.epsilon)
            for t in range(200):

              # Read the current state s_t from the environment
              obs = self.env.get_obs()
              # Epsilon-greedy action selection: with probability epsilon, a_t is a random action;
              # otherwise a_t = argmax_a Q^A(s_t, a)
              if np.random.rand() > self.epsilon:
                  disc_action = self.q_network.select_discrete_action(obs, self.device)
                  num_maxQ_action +=1
              else:
                  disc_action = np.random.randint(0, self.q_network.get_mapSize())
                  num_random_action +=1


              # print(self.q_network.get_mapSize())
              # action = self.q_network.select_discrete_action(obs, self.device)

              # disc_action = q_network.select_discrete_action(obs,device)
              # cont_action = q_network.action_discrete_to_continuous(disc_action)
              # obs, reward, done, _ = env.step(cont_action)
              '''get the action '''
              cont_action = self.q_network.action_discrete_to_continuous(disc_action)
              # print(action,action_from_the_net)
              '''Execute at and observe rt+1, st+1; If  st+1 is terminal: break (end episode)'''
              last_distance = np.linalg.norm(obs[[[4, 5]]] - obs[[[6, 7]]])


              obs_next, reward, done, _ = self.env.step(cont_action)
              current_distance = np.linalg.norm(obs_next[[[4, 5]]] - obs_next[[[6, 7]]])
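              # Log the unshaped environment reward so episode returns are not all zero
              episode_reward += reward
              # Shaping bonus: extra reward while the distance to the goal is below 0.05
              if current_distance < 0.05:
                reward += 0.05
              # current_distance = np.sqrt(np.abs(reward))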

              # if current_distance < last_distance:
              #     reward += 0.1  # Bonus for moving closer to the goal
              # else:
              #     reward -= 0.3
              '''episode reward += rt+1'''

              '''Store transition (st, at, rt+1, st+1) in D (remove old data if  needed)'''
              # if current_distance < last_distance:
                # print(current_distance,last_distance,current_distance < last_distance)
              self.rb.put((obs, disc_action, reward, obs_next, done))
              last_distance = current_distance
              # obs, actions, rewards, next_obs, dones = self.env.step(cont_action)
              #Train network

              if len(self.rb.buffer) > self.batch_size:
                self.learn()
              '''Set st = st+1'''
              # obs = obs_next

            # print(len(self.rb.buffer))

            self.episode_returns.append(episode_reward)
            if episode <=300:
              clear_output(wait=True)
              print(f'{episode}' )
            if episode > 300:#and total_reward > -50
              # if episode_reward > best_TR:
              #   best_TR = episode_reward
              #   print(f"------------------------------------------------------------")
              clear_output(wait=True)
              print(f'{episode}    {str(model_dir)} ' )


              savepath = os.path.join(model_dir, f'ep_{episode:04d}_{round(episode_reward,2)}.pth')
              torch.save(self.q_network.state_dict(), savepath)
              perper = round(num_maxQ_action / (num_maxQ_action+num_random_action),3)
              # if episode <= 0:
              #   savepath = os.path.join(model_dir, f'ep_{episode:04d}_{round(episode_reward,2)}.pth')
              #   torch.save(self.q_network.state_dict(), savepath)
              #   grade = self.scoreTheThing(savepath)
              #   print(f"Ep: {episode:04d}  {savepath}  {perper} ")
              if episode > 1:
                grade = self.scoreTheThing(savepath)
                # if grade <= 6:
                #   !rm -rf savepath

                if grade >= 6.5 and good_boy < 80:
                  good_boy += 1
                  savepath = os.path.join(success, f'ep_{episode:04d}_{round(episode_reward,2)}_{grade}.pth')
                  torch.save(self.q_network.state_dict(), savepath)
                  print(f"Ep: {episode:04d}  {savepath}  {perper} {grade} ")
                if grade >=7:
                  savepath = os.path.join(success, f'ep_{episode:04d}_{round(episode_reward,2)}_{grade}_lets_Freaking_go.pth')
                  torch.save(self.q_network.state_dict(), savepath)
                  print(f"dingdingding")
                  print(f"dingdingding")
                  print(f"dingdingding")
                  print(f"dingdingding")
                  print(f"dingdingding")


            '''Every k episodes: set target network QT=QA'''
            if episode % self.target_update_freq == 0 and episode != 0:
              # print('target net updated',"  epsilon: ", self.epsilon)
              self.target_network.load_state_dict(self.q_network.state_dict())


        return model_dir, self.episode_returns

    def learn(self):
        """Run one gradient step on a random minibatch of transitions from the replay buffer D."""
        obs, action, reward, next_obs, _ = self.rb.sample(self.batch_size)
        obs       = torch.tensor(obs,      device=self.device, dtype=torch.float)
        action    = torch.tensor(action,   device=self.device, dtype=torch.long)
        reward    = torch.tensor(reward,   device=self.device, dtype=torch.float)
        next_obs  = torch.tensor(next_obs, device=self.device, dtype=torch.float)
        # The sampled done flags are discarded above, so terminal transitions are not masked.
        # done      = torch.tensor(done, device=self.device, dtype=torch.bool)

        # Q^A(s_t, a_t) for the actions that were actually taken
        action_Q = self.q_network(obs, self.device).gather(1, action.unsqueeze(1)).squeeze(1)
        # max_a Q^T(s_{t+1}, a) from the target network, detached from the graph
        max_target_Q = self.target_network(next_obs, self.device).max(1)[0].detach()

        # TD target: r_{t+1} + gamma * max_a Q^T(s_{t+1}, a)
        whatIjustLearn = reward + self.discount * max_target_Q

        # Batch gradient step on Q^A with L = MSE(Q^A(s_t, a_t), TD target)
        loss = F.mse_loss(action_Q, whatIjustLearn)
        # loss = F.smooth_l1_loss(action_Q, whatIjustLearn)  # Huber alternative
        self.optimizer.zero_grad()
        loss.backward()
        # Clip the global gradient norm to 0.2 before the optimizer step
        nn.utils.clip_grad_norm_(list(self.q_network.parameters()), 0.2)
        self.optimizer.step()

    def scoreTheThing(self,model_path):
      arm = Robot(
        ArmDynamics(num_links=2,link_mass=0.1,link_length=1,joint_viscous_friction=0.1,dt=0.01,gravity=False))
      arm.reset()
      env = ArmEnv(arm, gui=False)
      device = torch.device('cpu')
      qnet = QNetwork(env).to(device)
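      # map_location keeps the load on the CPU even though the checkpoint was saved from the GPU
      qnet.load_state_dict(torch.load(str(model_path), map_location=device))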
      qnet.eval()
      score = compute_score(qnet, env, device)
      return score

arm = Robot(
        ArmDynamics(
            num_links=2,
            link_mass=0.1,
            link_length=1,
            joint_viscous_friction=0.1,
            dt=0.01,
            gravity=False
        )
    )
arm.reset()
env = ArmEnv(arm, gui=False)
tqdn = TrainDQN(env)
# ---------------

# Call your train function here
model_path,LotsOfReturn = tqdn.train()
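
The cell above adds a fixed bonus of 0.05 to the reward whenever the distance to the goal drops below 0.05, and stores that shaped reward in the replay buffer. The sketch below isolates the rule so the threshold and bonus can be tried without running training. `shaped_reward`, its arguments, and the sample positions are hypothetical; the goal `(0.5, -1.5)` is taken from the comment next to `env.reset()`.

```python
import math

def shaped_reward(env_reward, ee_xy, goal_xy, threshold=0.05, bonus=0.05):
    """Return (reward_for_replay, distance); adds `bonus` when distance is under `threshold`."""
    distance = math.dist(ee_xy, goal_xy)
    reward = env_reward + bonus if distance < threshold else env_reward
    return reward, distance

goal_xy = (0.50, -1.50)
episode_return = 0.0  # unshaped return, kept separately for logging
# Hypothetical (environment reward, end-effector position) pairs for two steps
for env_reward, ee_xy in [(-0.30, (0.40, -1.50)), (-0.02, (0.52, -1.49))]:
    reward, distance = shaped_reward(env_reward, ee_xy, goal_xy)
    episode_return += env_reward
    print(f"distance={distance:.3f} replay_reward={reward:.2f}")
```

Keeping the logged return unshaped makes episode returns comparable between runs that use different thresholds or bonus sizes.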



In [None]:
%%script false --no-raise-error
# @title Training center 23, dist smaller than 0.01 extra reward
# /content/models/2024-04-26_00-20-41 success
# !rm -rf /content/models/
import os
import time

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from IPython.display import clear_output
from torch.utils.data import DataLoader

from arm_dynamics import ArmDynamics
from arm_env import ArmEnv
from geometry import polar2cartesian
from render import Renderer
from robot import Robot
# from score import compute_score
# os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

class TrainDQN:
    def __init__(self, env, seed=44):

        torch.manual_seed(seed)

        np.random.seed(seed)

        # torch.cuda.manual_seed_all(seed)
        self.env = env
        # Use the GPU when one is available; otherwise fall back to the CPU
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.q_network = QNetwork(env).to(self.device) #action network
        self.target_network = QNetwork(env).to(self.device)
        self.target_network.load_state_dict(self.q_network.state_dict())
        # self.optimizer = optim.Adam(self.q_network.parameters(), lr=0.00005,weight_decay=0.0001)
        self.optimizer =torch.optim.AdamW(self.q_network.parameters(), lr=0.00004, betas=(0.9, 0.999), eps=1e-08, amsgrad=False)
        self.rb = ReplayBuffer(12000)
        self.batch_size = 512
        self.discount =  0.9 # (0.8, 0.9, 0.95, 0.99)
        self.target_update_freq = 30
        self.num_episodes = 6000
        self.epsilon = 1
        self.epsilon_decay = 0.999 #0.9999
        self.min_epsilon = 0.33 # not working 0.01 0.05 0.1 0.15 0.25
        self.episode_returns = []
        goal1 = polar2cartesian(1.9, -0.25  - np.pi/2.0)
        goal2 = polar2cartesian(1.6,  0.25  - np.pi/2.0)
        goal3 = polar2cartesian(1.8,  0.3   - np.pi/2.0)
        goal4 = polar2cartesian(1.5,  0.3   - np.pi/2.0)
        goal5 = polar2cartesian(1.6,  0.40  - np.pi/2.0)

        self.goal_list = [goal1,goal2,goal3,goal4,goal5]
        model_path = ''
        if model_path:
            self.q_network.load_state_dict(torch.load(model_path))
            self.target_network.load_state_dict(self.q_network.state_dict())
            print(f"Loaded model weights from {model_path}")

    def train(self):

        goal = polar2cartesian(1.6, 0.25 - np.pi/2.0)
        timestr = time.strftime("%Y-%m-%d_%H-%M-%S")
        model_dir = os.path.join('models', timestr)
        os.makedirs(model_dir, exist_ok=True)  # Ensure directory exists
        # step_count = 0
        best_TR = float('inf')
        whichGoal = 0
        good_boy = 0
        success = os.path.join(model_dir, "success")
        if not os.path.exists(success):
          os.makedirs(success)

        for episode in range(self.num_episodes):
            '''Initialize episode_reward = 0; s1 = random start state; reset robot to match s1'''
            episode_reward = 0

            # obs = self.env.reset(self.goal_list[whichGoal])#goal = [[0.5],[-1.5]]
            obs = self.env.reset()#goal = [[0.5],[-1.5]]
            whichGoal += 1
            if whichGoal == 4:
              whichGoal = 0
            done = False
            last_distance = np.inf
            if episode % 11 == 0 :
              num_random_action = 0
              num_maxQ_action = 0

            if episode % 10 == 0 and episode != 0:
              # Update exploration rate
                self.epsilon = max(self.min_epsilon, self.epsilon * self.epsilon_decay)
                # print(self.epsilon)
            for t in range(200):

               #Adaptive Epsilon:
              # obs = self.env.reset()
              obs = self.env.get_obs()
              """Select an action using an epsilon-greedy strategy.
              With probability epsilon, a_t is random; otherwise a_t = argmax_a Q^A(s_t, a)."""
              if np.random.rand() > self.epsilon:
                  disc_action = self.q_network.select_discrete_action(obs, self.device)
                  num_maxQ_action +=1
              else:
                  disc_action = np.random.randint(0, self.q_network.get_mapSize())
                  num_random_action +=1


              # print(self.q_network.get_mapSize())
              # action = self.q_network.select_discrete_action(obs, self.device)

              # disc_action = q_network.select_discrete_action(obs,device)
              # cont_action = q_network.action_discrete_to_continuous(disc_action)
              # obs, reward, done, _ = env.step(cont_action)
              '''get the action '''
              cont_action = self.q_network.action_discrete_to_continuous(disc_action)
              # print(action,action_from_the_net)
              '''Execute at and observe rt+1, st+1; If  st+1 is terminal: break (end episode)'''
              last_distance = np.linalg.norm(obs[[[4, 5]]] - obs[[[6, 7]]])


              obs_next, reward, done, _ = self.env.step(cont_action)
              current_distance = np.linalg.norm(obs_next[[[4, 5]]] - obs_next[[[6, 7]]])
              if current_distance < 0.05:
                reward += 0.01
              # current_distance = np.sqrt(np.abs(reward))
              # episode_reward += reward

              # if current_distance < last_distance:
              #     reward += 0.1  # Bonus for moving closer to the goal
              # else:
              #     reward -= 0.3
              '''episode reward += rt+1'''
              episode_reward += reward  # accumulate the (shaped) return used for logging and checkpoint names

              '''Store transition (st, at, rt+1, st+1) in D (remove old data if  needed)'''
              # if current_distance < last_distance:
                # print(current_distance,last_distance,current_distance < last_distance)
              self.rb.put((obs, disc_action, reward, obs_next, done))
              last_distance = current_distance
              # obs, actions, rewards, next_obs, dones = self.env.step(cont_action)
              #Train network

              if len(self.rb.buffer) > self.batch_size:
                self.learn()
              '''Set st = st+1'''
              # obs = obs_next

            # print(len(self.rb.buffer))

            self.episode_returns.append(episode_reward)
            if episode <=300:
              clear_output(wait=True)
              print(f'{episode}' )
            if episode > 300:#and total_reward > -50
              # if episode_reward > best_TR:
              #   best_TR = episode_reward
              #   print(f"------------------------------------------------------------")
              clear_output(wait=True)
              print(f'{episode}    {str(model_dir)} ' )


              savepath = os.path.join(model_dir, f'ep_{episode:04d}_{round(episode_reward,2)}.pth')
              torch.save(self.q_network.state_dict(), savepath)
              perper = round(num_maxQ_action / (num_maxQ_action+num_random_action),3)
              # if episode <= 0:
              #   savepath = os.path.join(model_dir, f'ep_{episode:04d}_{round(episode_reward,2)}.pth')
              #   torch.save(self.q_network.state_dict(), savepath)
              #   grade = self.scoreTheThing(savepath)
              #   print(f"Ep: {episode:04d}  {savepath}  {perper} ")
              if episode > 1:
                grade = self.scoreTheThing(savepath)
                # if grade <= 6:
                #   !rm -rf savepath

                if grade >= 6.5 and good_boy < 80:
                  good_boy += 1
                  savepath = os.path.join(success, f'ep_{episode:04d}_{round(episode_reward,2)}_{grade}.pth')
                  torch.save(self.q_network.state_dict(), savepath)
                  print(f"Ep: {episode:04d}  {savepath}  {perper} {grade} ")
                if grade >=7:
                  savepath = os.path.join(success, f'ep_{episode:04d}_{round(episode_reward,2)}_{grade}_lets_Freaking_go.pth')
                  torch.save(self.q_network.state_dict(), savepath)
                  print(f"dingdingding")
                  print(f"dingdingding")
                  print(f"dingdingding")
                  print(f"dingdingding")
                  print(f"dingdingding")


            '''Every k episodes: set target network QT=QA'''
            if episode % self.target_update_freq == 0 and episode != 0:
              # print('target net updated',"  epsilon: ", self.epsilon)
              self.target_network.load_state_dict(self.q_network.state_dict())


        return model_dir, self.episode_returns

    def learn(self):
        # print("learnining")
        """Sample random minibatch of  transitions from D"""
        obs, action, reward, next_obs, _ = self.rb.sample(self.batch_size)
        obs       = torch.tensor(obs,      device=self.device, dtype=torch.float)
        action    = torch.tensor(action,   device=self.device, dtype=torch.long)
        reward    = torch.tensor(reward,   device=self.device, dtype=torch.float)
        next_obs  = torch.tensor(next_obs, device=self.device, dtype=torch.float)
        # done      = torch.tensor(done, device=self.device, dtype=torch.bool)

        # disc_action = self.q_network.select_discrete_action(obs, self.device)
        # cont_action = self.q_network.action_discrete_to_continuous(disc_action)
              # print(action,action_from_the_net)
        '''Execute at and observe rt+1, st+1; If  st+1 is terminal: break (end episode)'''
        # obs_next, reward, done, _ = self.env.step(cont_action)


        action_Q = self.q_network(obs, self.device).gather(1, action.unsqueeze(1)).squeeze(1)
        max_target_Q = self.target_network(next_obs, self.device).max(1)[0].detach()

        whatIjustLearn = reward + self.discount * max_target_Q

        '''Perform batch GD on QA with L = F( QA(st, at) - [ rt + 𝛾 maxa QT(st+1, at+1)] )'''
        loss = F.mse_loss(action_Q, whatIjustLearn)
        # loss = nn.SmoothL1Loss(current_q_values, whatIjustLearn)
        self.optimizer.zero_grad()
        loss.backward()
        nn.utils.clip_grad_norm_(list(self.q_network.parameters()), 0.2)
        self.optimizer.step()

    def scoreTheThing(self,model_path):
      arm = Robot(
        ArmDynamics(num_links=2,link_mass=0.1,link_length=1,joint_viscous_friction=0.1,dt=0.01,gravity=False))
      arm.reset()
      env = ArmEnv(arm, gui=False)
      device = torch.device('cpu')
      qnet = QNetwork(env).to(device)
      qnet.load_state_dict(torch.load(str(model_path)))
      qnet.eval()
      score = compute_score(qnet, env, device)
      return score

arm = Robot(
        ArmDynamics(
            num_links=2,
            link_mass=0.1,
            link_length=1,
            joint_viscous_friction=0.1,
            dt=0.01,
            gravity=False
        )
    )
arm.reset()
env = ArmEnv(arm, gui=False)
tqdn = TrainDQN(env)
# ---------------

# Call your train function here
model_path,LotsOfReturn = tqdn.train()



# Start Training

In [None]:
from robot import Robot
from arm_dynamics import ArmDynamics

# DO NOT CHANGE
# ---------------
arm = Robot(
        ArmDynamics(
            num_links=2,
            link_mass=0.1,
            link_length=1,
            joint_viscous_friction=0.1,
            dt=0.01,
            gravity=False
        )
    )
arm.reset()
env = ArmEnv(arm, gui=False)
tqdn = TrainDQN(env)
# ---------------

# Call your train function here
model_path = tqdn.train()

To keep track of your experiments, it is good practice to plot returns against episodes and check how well your model is training. With a large number of episodes, this plot may look very jagged, making it difficult to ascertain how well you are doing. We provide code to smooth the plot: it takes the list of per-episode returns and plots the average of each consecutive block of `smoothing` episodes. Feel free to use it if it helps.
```
import seaborn as sns
returns = __
smoothing = 10

smoothened = [sum(returns[i:i+smoothing])/smoothing for i in range(0, len(returns), smoothing)]
sns.lineplot(smoothened)
```
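
As a quick check of what the smoothing does, here is a minimal sketch on made-up returns. The helper name `smooth_returns` and the sample data are hypothetical and not part of the assignment code. Unlike the snippet above, it divides by the actual chunk length, so a shorter final chunk is not scaled down; that is a deliberate variation, not the provided behavior.

```python
def smooth_returns(returns, smoothing=10):
    """Average consecutive, non-overlapping chunks of `smoothing` returns."""
    chunks = [returns[i:i + smoothing] for i in range(0, len(returns), smoothing)]
    # Divide by the real chunk length so a partial last chunk is averaged correctly.
    return [sum(chunk) / len(chunk) for chunk in chunks]


# Made-up returns purely for illustration (not real training data).
fake_returns = [-20.0 + 0.1 * i for i in range(25)]
smoothened = smooth_returns(fake_returns, smoothing=10)
print(smoothened)  # three averaged values, one per chunk of up to 10 episodes
```

The resulting list can be passed to `sns.lineplot` exactly as in the snippet above.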

### Load your model and test its performance
Change your model path and the goal to see how well your learned model performs.

In [None]:
%%script false --no-raise-error
# @title Skip Load your model and test its performance
import collections
import random
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import time
from render import Renderer
from arm_env import ArmEnv
import numpy as np
import os
from math import dist
import seaborn as sns
from robot import Robot
from arm_dynamics import ArmDynamics
from geometry import polar2cartesian



# DO NOT CHANGE arm parameters
arm = Robot(
        ArmDynamics(
            num_links=2,
            link_mass=0.1,
            link_length=1,
            joint_viscous_friction=0.1,
            dt=0.01,
            gravity=False
        )
    )
arm.reset()
# ------------------

env = ArmEnv(arm, gui=False)
model_path = '' # Fill in the model_path
device = torch.device('cpu')
qnet = QNetwork(env).to(device)
qnet.load_state_dict(torch.load(model_path))
qnet.eval()
goal = polar2cartesian(1.6, 0.25 - np.pi/2.0)
done = False
obs = env.reset(goal)

episode_return = 0
while not done:
  action = qnet.select_discrete_action(obs, device)
  action = qnet.action_discrete_to_continuous(action)
  new_obs, reward, done, info = env.step(action)
  episode_return += reward

  pos_ee = info['pos_ee']
  vel_ee = info['vel_ee']
  dist = np.linalg.norm(pos_ee - goal)

  obs = new_obs
print('Episode return: ', episode_return)


### Grading and Evaluation
You will be evaluated on 5 different goal positions worth 1.5 points each. You must pass the best `model_path` for your network. The scoring function will run one episode for every goal position and find the total reward (aka return) for the episode. For every goal you get:

* 1 Point if `easy target < total reward < hard target`
* 1.5 Points if `hard target < total reward`
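
The per-goal rule above can be sketched as a small function. This is only an illustration of the stated thresholds, not the code in `score.py`; the function name `points_for_goal` is hypothetical, and two details are assumptions: a total reward below the easy target earns 0 points, and a reward exactly equal to a target falls into the lower tier.

```python
def points_for_goal(total_reward, easy_target, hard_target):
    """Map an episode's total reward to points using the rule stated above."""
    if total_reward > hard_target:
        return 1.5
    if total_reward > easy_target:
        return 1.0
    # Assumption: anything at or below the easy target earns nothing.
    return 0.0


# Illustrative values only: easy target -7, hard target -5.
for reward in (-4.7, -6.0, -8.0):
    print(reward, points_for_goal(reward, easy_target=-7, hard_target=-5))
```

Summing the result over the 5 goals gives a maximum of 7.5 points.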

In [None]:
from score import compute_score
import torch.nn as nn
import torch
import torch.nn.functional as F
from render import Renderer
from arm_env import ArmEnv
from robot import Robot
from arm_dynamics import ArmDynamics
import numpy as np


# DO NOT CHANGE arm parameters
arm = Robot(
        ArmDynamics(
            num_links=2,
            link_mass=0.1,
            link_length=1,
            joint_viscous_friction=0.1,
            dt=0.01,
            gravity=False
        )
    )
arm.reset()
# ------------------

env = ArmEnv(arm, gui=False)
# model_path = '/content/models/2024-04-27_19-33-37/success/ep_0470_-8.78_7.5_lets_Freaking_go.pth' # Fill in the model_path
model_path='q_network.pth'
device = torch.device('cpu')
qnet = QNetwork(env).to(device)
qnet.load_state_dict(torch.load(model_path))
qnet.eval()
score = compute_score(qnet, env, device)

---Computing score---

Goal 1:
Total reward: -4.716701375045821
easy target: -7
hard target: -5
points: 1.5

Goal 2:
Total reward: -4.029325360503279
easy target: -7
hard target: -5
points: 1.5

Goal 3:
Total reward: -4.391548792094321
easy target: -7
hard target: -5
points: 1.5

Goal 4:
Total reward: -4.80163454409217
easy target: -7
hard target: -5
points: 1.5

Goal 5:
Total reward: -6.010789964008882
easy target: -10
hard target: -7
points: 1.5


Final score: 7.5


# Part 2: PPO with an open source RL library

In this part, you will use one of the most popular open source RL libraries ([Stable-Baselines3](https://stable-baselines3.readthedocs.io/en/master/)) to solve the same goal reaching problem as Part 1. We will use the same `ArmEnv` gym environment. The algorithm you should choose to use is PPO.

## PPO training

We provide the code to construct parallel environments. Running several environments in parallel can speed up training considerably if your machine has enough CPU cores.

In [None]:
# DO NOT CHANGE

from stable_baselines3.common.vec_env.subproc_vec_env import SubprocVecEnv
from stable_baselines3.common.vec_env.vec_monitor import VecMonitor
from copy import deepcopy
from robot import Robot
from arm_dynamics import ArmDynamics
from arm_env import ArmEnv

class EnvMaker:
    def __init__(self,  arm, seed):
        self.seed = seed
        self.arm = arm

    def __call__(self):
        arm = deepcopy(self.arm)
        env = ArmEnv(arm)
        env.seed(self.seed)
        return env

def make_vec_env(arm, nenv, seed):
    return VecMonitor(SubprocVecEnv([EnvMaker(arm, seed  + 100 * i) for i in range(nenv)]))

# convenient function to create a robot arm
def make_arm():
    arm = Robot(
        ArmDynamics(
            num_links=2,
            link_mass=0.1,
            link_length=1,
            joint_viscous_friction=0.1,
            dt=0.01
        )
    )
    arm.reset()
    return arm


You will need to complete the code to train the policy using the [PPO class](https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html) from stable_baselines3. We provide the code to generate the name of the directory in which to save the checkpoint; an example is `ppo_models/2024-04-13_01-14-13`. Your checkpoint model should be named `ppo_network.zip`. See the [save](https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html#stable_baselines3.ppo.PPO.save) function. Training should take less than 40 minutes.

In [None]:

from stable_baselines3.ppo import PPO
import os
import time
from stable_baselines3.common.utils import set_random_seed
from stable_baselines3.common.callbacks import CheckpointCallback

# Default parameters
timesteps = 500000
nenv = 8  # number of parallel environments. This can speed up training when you have good CPUs
seed = 8
batch_size = 2048
save_freqss = 1000
num_ep = 100
# Generate path of the directory to save the checkpoint
timestr = time.strftime("%Y-%m-%d_%H-%M-%S")
save_dir = os.path.join('ppo_models', timestr)

# Set random seed
set_random_seed(seed)

# Create arm
arm = make_arm()

# Create parallel envs
vec_env = make_vec_env(arm=arm, nenv=nenv, seed=seed)

'''
PPO(policy, env, learning_rate=0.0003, n_steps=2048, batch_size=64, n_epochs=10, gamma=0.99, gae_lambda=0.95,
    clip_range=0.2, clip_range_vf=None, normalize_advantage=True, ent_coef=0.0, vf_coef=0.5, max_grad_norm=0.5,
    use_sde=False, sde_sample_freq=-1, rollout_buffer_class=None, rollout_buffer_kwargs=None, target_kl=None,
    stats_window_size=100, tensorboard_log=None, policy_kwargs=None, verbose=0, seed=None,
    device='auto', _init_setup_model=True)
'''

# ------ IMPLEMENT YOUR TRAINING CODE HERE ------------
model = PPO("MlpPolicy", vec_env, verbose=1,
            batch_size=batch_size,
            tensorboard_log="./ppo_tensorboard/",
            learning_rate=0.0001,
            device = 'cpu',
            clip_range=0.2,
            n_epochs=num_ep)

#https://stable-baselines3.readthedocs.io/en/master/guide/callbacks.html
model.learn(total_timesteps=timesteps)
model_path = os.path.join(save_dir, 'ppo_network.zip')
model.save(model_path)
print(model_path)

## Grading and evaluation

The total number of points for Part 2 is 7.5. We will evaluate your trained model on 5 random goal locations. For each test, we assign points based on the distance between the end effector and the goal location at the end of the episode.

- If 0 < distance < 0.05, you get 1.5 points.
- If 0.05 <= distance < 0.1, you get 1 point.
- If distance >= 0.1, you get 0 points.
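
The distance thresholds above can be sketched as follows. This is an illustration of the stated rule rather than the actual `score_policy` implementation; the function name `points_for_distance` and the sample distances are hypothetical, and treating a distance of exactly 0 as a full-score case is an assumption.

```python
def points_for_distance(distance):
    """Map the final end-effector-to-goal distance to points."""
    if distance < 0.05:
        return 1.5  # assumption: a distance of exactly 0 also earns full points
    if distance < 0.1:
        return 1.0
    return 0.0


# Made-up final distances for five test goals (not real results).
final_distances = [0.01, 0.04, 0.07, 0.1, 0.3]
total = sum(points_for_distance(d) for d in final_distances)
print(total)  # 1.5 + 1.5 + 1.0 + 0.0 + 0.0 = 4.0
```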



In [None]:
from score import score_policy
from stable_baselines3 import PPO
from stable_baselines3.common.utils import set_random_seed
from robot import Robot
from arm_dynamics import ArmDynamics
from render import Renderer
import time

# Set the path to your model
# model_path = '/content/ppo_models/2024-04-27_19-08-50/ppo_network.zip'
# Reuses model_path from the training cell; uncomment the line above to load a different checkpoint
print(model_path)
set_random_seed(seed=100)

# Create arm robot
arm = make_arm()

# Create environment
env = ArmEnv(arm, gui=False)
env.seed(100)

# Load and test policy
policy = PPO.load(model_path)
score_policy(policy, env)