# Project Report: Robot Arm Motion Planning with Reinforcement Learning

## Introduction


This project focuses on solving the problem of robotic arm motion planning using Reinforcement Learning (RL). The robotic arm works in the PandaReachDense-v3 environment, which simulates a real-world task of reaching a specified target in a dense reward setting.

The problem can be seen as teaching the robotic arm how to move in a controlled and efficient manner to reach a target. This involves using RL to allow the agent to learn from its interactions with the environment.

The project is divided into three main parts:

* Training the model from scratch.
* Fine-tuning the pre-trained model.
* Running and evaluating the fine-tuned model in the environment.

Our original plan was quite different: we intended to write our own PPO training loop in PyTorch and use PyBullet as the environment. Combining PyTorch-based RL with PyBullet turned out to be trickier than we expected, so we scrapped that idea and instead used the PPO model from stable-baselines3, which we trained and then fine-tuned. We also switched to a Gymnasium environment that uses a robot model from panda_gym.

## Training the Model

The training process is the foundation of this project: the agent learns to solve the task by interacting with the environment. The goal of training is to create a policy that enables the robotic arm to reach its target effectively.

Dense rewards provide the agent with immediate feedback based on how close it is to the target, allowing it to adjust its actions effectively.
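
As an illustration of the difference between dense and sparse feedback, here is a minimal sketch. It assumes the dense reward is the negative Euclidean distance between the end effector and the target, which is how reach tasks are commonly set up. The function names and the 0.05 threshold are hypothetical and are not taken from the project code.

```python
import math


def dense_reward(end_effector_pos, target_pos):
    """Negative Euclidean distance: the closer the arm, the higher the reward."""
    return -math.dist(end_effector_pos, target_pos)


def sparse_reward(end_effector_pos, target_pos, threshold=0.05):
    """For contrast: 0 inside the (assumed) success threshold, -1 otherwise."""
    return 0.0 if math.dist(end_effector_pos, target_pos) <= threshold else -1.0


target = (0.1, 0.0, 0.2)
far = (0.4, 0.0, 0.6)    # 0.5 away from the target
near = (0.1, 0.0, 0.23)  # 0.03 away from the target

# The dense reward changes with every small movement toward the target,
# while the sparse reward only changes once the threshold is crossed.
print(dense_reward(far, target), dense_reward(near, target))
print(sparse_reward(far, target), sparse_reward(near, target))
```

With the dense variant, every step that reduces the distance is rewarded immediately, which is the property the agent relies on during training.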

For this task, we chose the Proximal Policy Optimization (PPO) algorithm, which is well-suited for continuous control problems like this one. PPO is known for its stability and efficiency, making it a good fit for tasks requiring precise control.
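
Much of PPO's stability comes from its clipped surrogate objective, which limits how far a single update can move the policy. The sketch below shows the per-sample term in plain Python. It is an illustration of the general formula, not the stable-baselines3 implementation, and the function name is our own. The clip range of 0.2 matches the value used later in the fine-tuning parameters.

```python
def clipped_surrogate(ratio, advantage, clip_range=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    ratio:     new_policy_prob / old_policy_prob for the taken action
    advantage: estimate of how much better the action was than average
    """
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    return min(ratio * advantage, clipped_ratio * advantage)


# A good action (positive advantage) whose probability rose by 50%:
# the objective is capped at 1.2 * A, so there is no gain from pushing further.
print(clipped_surrogate(ratio=1.5, advantage=1.0))

# A bad action (negative advantage) whose probability already dropped by 50%:
# the objective is capped at 0.8 * A, so the policy is not pushed further away.
print(clipped_surrogate(ratio=0.5, advantage=-1.0))
```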

The model was initialized with the following settings:

* Policy Type: MultiInputPolicy to handle environments with complex observation spaces.
* Verbosity: Set to 1 to display training progress.

Training ran for 100,000 timesteps, during which the agent interacted with the environment to maximize its cumulative reward. Each interaction step involved:

* The agent observing the state of the environment.
* The agent taking an action based on its current policy.
* The environment providing feedback in the form of rewards.
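
The three steps above can be sketched with a toy example. `ToyReachEnv` is a hypothetical one-dimensional stand-in that we made up for illustration. It is not the panda_gym environment and does not follow the full Gymnasium API, but the observe, act, reward cycle is the same.

```python
import random


class ToyReachEnv:
    """Hypothetical 1-D reach task: move a point from 0.0 toward a target."""

    def __init__(self, target=1.0, max_steps=50, tolerance=0.05):
        self.target = target
        self.max_steps = max_steps
        self.tolerance = tolerance
        self.position = 0.0
        self.steps = 0

    def reset(self):
        self.position = 0.0
        self.steps = 0
        return self.position

    def step(self, action):
        # Bound each move, loosely like a velocity limit on a joint.
        action = max(-0.1, min(0.1, action))
        self.position += action
        self.steps += 1
        distance = abs(self.target - self.position)
        reward = -distance  # dense reward: negative distance to the target
        done = distance <= self.tolerance or self.steps >= self.max_steps
        return self.position, reward, done


def run_episode(env, policy):
    """Run one episode and return the cumulative reward."""
    obs = env.reset()                         # 1. observe the initial state
    total_reward, done = 0.0, False
    while not done:
        action = policy(obs)                  # 2. act based on the policy
        obs, reward, done = env.step(action)  # 3. receive feedback
        total_reward += reward
    return total_reward


def random_policy(obs):
    """Untrained behavior: ignore the observation entirely."""
    return random.uniform(-0.1, 0.1)


def toward_target_policy(obs, target=1.0):
    """Hand-written stand-in for a trained policy: always move toward the target."""
    return target - obs


env = ToyReachEnv()
print("random policy:", run_episode(env, random_policy))
print("toward-target policy:", run_episode(env, toward_target_policy))
```

A policy that moves toward the target collects a much higher cumulative reward than a random one. Training aims to close exactly this gap.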

In [None]:
import gymnasium as gym
import panda_gym  # importing registers the Panda environments with Gymnasium
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
import os

LOG_DIR = "./ppo_logs/"
MODEL_PATH = "ppo_robot_arm.zip"

def train_model(env_id="PandaReachDense-v3", total_timesteps=100000):
    """
    Train a PPO model on the specified environment.

    Args:
        env_id (str): The environment ID.
        total_timesteps (int): Number of timesteps for training.
    """
    os.makedirs(LOG_DIR, exist_ok=True)

    print(f"Creating environment: {env_id}")
    env = gym.make(env_id)
    env = DummyVecEnv([lambda: env])

    print("Initializing PPO model...")
    model = PPO("MultiInputPolicy", env, verbose=1)

    print("Starting training...")
    model.learn(total_timesteps=total_timesteps)
    print("Training completed!")

    print(f"Saving the model to {MODEL_PATH}...")
    model.save(MODEL_PATH)

    print("Testing the trained model...")
    obs = env.reset()

    for _ in range(1000):
        action, _states = model.predict(obs)
        obs, rewards, dones, info = env.step(action)
        env.render()

if __name__ == "__main__":
    train_model()


## Fine-Tuning

The objective of fine-tuning is to leverage the knowledge gained during the initial training phase while refining the model’s behavior in the environment. This approach allows for:

* Improved task-specific performance.
* Reduced training time compared to training from scratch.
* The ability to adapt the model to changes in environment parameters or reward structures

The model was initialized with fine-tuning-specific hyperparameters, such as:

* Learning rate: $1 \times 10^{-4}$ (slower to avoid overwriting learned knowledge).
* Batch size: 64 (for efficient training without overwhelming memory).
* Discount factor (gamma): 0.98 (slightly lower to prioritize near-term rewards).
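
To make the effect of the discount factor concrete, the sketch below computes a discounted return and the rough "effective horizon" $1 / (1 - \gamma)$ for two values of gamma. The helper names are our own. The value 0.99 is the stable-baselines3 default for PPO and is used here only for comparison.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward at step t is weighted by gamma**t."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def effective_horizon(gamma):
    """Rough number of future steps that still matter: 1 / (1 - gamma)."""
    return 1.0 / (1.0 - gamma)


# With gamma = 0.98 the agent effectively looks about 50 steps ahead,
# compared with about 100 steps for gamma = 0.99.
print(effective_horizon(0.98), effective_horizon(0.99))

# The same constant penalty of -0.1 per step over 100 steps is valued
# less heavily when gamma is lower.
rewards = [-0.1] * 100
print(discounted_return(rewards, 0.98), discounted_return(rewards, 0.99))
```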

To monitor performance improvements, we used the EvalCallback. It saved the best-performing model and logged evaluation metrics at regular intervals (every 10,000 steps).

A custom evaluation function was also implemented to measure the total reward over five episodes. This function helped validate whether fine-tuning improved the model’s ability to reach the target.
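
The idea behind such an evaluation can be sketched independently of the model: run a fixed number of episodes, collect each episode's total reward, and report the mean and spread. In this sketch, `evaluate` and `fake_episode` are hypothetical names, and the reward values are made-up placeholders rather than results from the project. stable-baselines3 also provides `evaluate_policy` in `stable_baselines3.common.evaluation`, which does something similar for a real model.

```python
import statistics


def evaluate(run_episode, n_episodes=5):
    """Call `run_episode` n_episodes times and summarize the total rewards.

    `run_episode` is any zero-argument callable returning one episode's
    total reward, for example a wrapper around a model/environment rollout.
    """
    rewards = [run_episode() for _ in range(n_episodes)]
    return statistics.mean(rewards), statistics.pstdev(rewards)


# Placeholder episode results, for illustration only (not measured values).
placeholder_rewards = iter([-2.0, -1.0, -3.0, -2.0, -2.0])


def fake_episode():
    return next(placeholder_rewards)


mean_reward, std_reward = evaluate(fake_episode, n_episodes=5)
print(f"mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")
```

Reporting the spread alongside the mean helps to judge whether a difference between two models is more than episode-to-episode noise, which matters when only five episodes are used.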



In [None]:
import torch
import torch.optim as optim
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.callbacks import EvalCallback
import gymnasium as gym
import panda_gym
import os

# Path to the saved model
SAVED_MODEL_PATH = "ppo_robot_arm.zip"
FINE_TUNED_MODEL_PATH = "ppo_robot_arm_fine_tuned.zip"

# Hyperparameters for fine-tuning
FINE_TUNE_PARAMS = {
    "learning_rate": 1e-4,
    "n_steps": 2048,
    "batch_size": 64,
    "gamma": 0.98,
    "gae_lambda": 0.95,
    "clip_range": 0.2,
    "ent_coef": 0.01
}

# Fine-tuning script
def fine_tune(env_id="PandaReachDense-v3", total_timesteps=100000):
    """
    Fine-tune a pre-trained PPO model on a specified environment.

    Args:
        env_id (str): The environment ID (default is PandaReachDense-v3).
        total_timesteps (int): Number of timesteps for fine-tuning.
    """
    # Create the environment and wrap with Monitor
    print(f"Creating environment: {env_id}")
    base_env = gym.make(env_id)
    env = make_vec_env(lambda: Monitor(base_env), n_envs=1)

    if not os.path.exists(SAVED_MODEL_PATH):
        raise FileNotFoundError(f"Saved model not found at {SAVED_MODEL_PATH}")

    print("Loading pre-trained model...")
    model = PPO.load(SAVED_MODEL_PATH, env=env, **FINE_TUNE_PARAMS)

    def custom_eval():
        print("Running custom evaluation...")
        eval_rewards = []
        for episode in range(5):
            obs = env.reset()
            total_reward = 0.0
            done = False
            while not done:
                action, _ = model.predict(obs)
                # The vectorized env returns arrays with one entry per env,
                # so take index 0 to get plain Python values.
                obs, rewards, dones, infos = env.step(action)
                total_reward += float(rewards[0])
                done = bool(dones[0])
            eval_rewards.append(total_reward)
        print(f"Evaluation rewards: {eval_rewards}")
        print(f"Average reward: {sum(eval_rewards) / len(eval_rewards)}")

    eval_callback = EvalCallback(env, best_model_save_path="./logs/",
                                  log_path="./logs/",
                                  eval_freq=10000,
                                  deterministic=True,
                                  render=False)

    # Fine-tune the model
    print("Starting fine-tuning...")
    model.learn(total_timesteps=total_timesteps, callback=eval_callback)

    # Custom evaluation
    custom_eval()

    # Save the fine-tuned model
    print(f"Saving fine-tuned model to {FINE_TUNED_MODEL_PATH}...")
    model.save(FINE_TUNED_MODEL_PATH)
    print("Fine-tuning completed!")

if __name__ == "__main__":
    fine_tune()

## Environment

Once the fine-tuned model has been saved, it is loaded and tested in the environment. This involves observing how well the agent performs the task of reaching the target while the environment is rendered in real time.

During this phase, the model predicts actions based on observations, and the robotic arm moves accordingly. The rendering helps to visualize the agent’s behavior and evaluate its success.

The environment was my main task for the project.

## main.py

I chose to include this script here because it brings everything together.

In [None]:
from trainer.training_script import train_model
from trainer.fine_tuning import fine_tune
import gymnasium as gym
import panda_gym
import time

def main():
    """
    Main function to manage the training, fine-tuning, and running the environment.
    """
    print("Starting the Robot Arm Motion Planning Project!")

    # Step 1: Training the model from scratch
    print("\nStep 1: Training the model...")
    try:
        train_model(env_id="PandaReachDense-v3", total_timesteps=100000)
        print("Model training completed successfully!")
    except Exception as e:
        print(f"Error during training: {e}")
        return

    # Step 2: Fine-tuning the pre-trained model
    print("\nStep 2: Fine-tuning the model...")
    try:
        fine_tune(env_id="PandaReachDense-v3", total_timesteps=50000)
        print("Model fine-tuning completed successfully!")
    except Exception as e:
        print(f"Error during fine-tuning: {e}")
        return

    # Step 3: Running the environment using the fine-tuned model
    print("\nStep 3: Running the environment with the fine-tuned model...")
    try:
        env = gym.make("PandaReachDense-v3", render_mode="human")
        obs, _ = env.reset()

        from stable_baselines3 import PPO
        model = PPO.load("ppo_robot_arm_fine_tuned.zip", env=env)

        for _ in range(1000):
            time.sleep(0.3)
            action, _ = model.predict(obs)
            # Gymnasium's step() returns separate terminated/truncated flags.
            obs, reward, terminated, truncated, info = env.step(action)
            env.render()
            if terminated or truncated:
                obs, _ = env.reset()

        print("Environment run completed successfully!")
    except Exception as e:
        print(f"Error during environment run: {e}")
    finally:
        env.close()

if __name__ == "__main__":
    main()

## Results

So how did the fine-tuning and training improve the model?

Before any training was applied, I used a different model. It acted almost randomly and showed no real logic, as can be seen in the environment file "robot_arm_env_gym.py". We then tested the current model before training, and the same applied to it.

After training and fine-tuning the model, a drastic change was immediately visible. This time we also slowed down the arm to see its behavior more clearly. It is not 100% accurate, because flaws are visible from time to time when moving the ball/object, but it is an improvement nonetheless.

If we were to improve it further, we could try:

* Changing the number of timesteps
* Experimenting with the fine-tuning parameters


## Conclusion
So what did the project achieve in the end?

This project demonstrated the effectiveness of the PPO algorithm for robotic arm motion planning. By combining training from scratch with fine-tuning, the model achieved reliable, though not perfect, performance on the reaching task.