## Initialize the Custom `Flappyenv` Environment

This environment simulates a Flappy Bird–style RL task.

### Action Space
- **Discrete(2)**
  - `0` — Do nothing  
  - `1` — Flap

### Observation Space
The observation is a 4-element `float32` vector containing:

1. Bird’s Y-axis position  
2. Bird’s vertical velocity  
3. Center Y-position of the gap between the pipes  
4. Horizontal distance to the next set of pipes  
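
As a quick sanity check on the layout above, the sketch below tests whether a 4-component observation falls inside the bounds declared by the `Box` in the class that follows. The bounds are copied from that definition. The helper name `in_bounds` and the sample values are hypothetical and only serve as illustration.

```python
# Bounds copied from the Box definition used by Flappyenv (width=288, height=512).
WIDTH, HEIGHT = 288, 512
LOW = (0.0, -20.0, -float(WIDTH), 0.0)
HIGH = (float(HEIGHT), 20.0, float(WIDTH), float(HEIGHT))


def in_bounds(obs):
    """Return True if obs has four components, each within [LOW[i], HIGH[i]]."""
    if len(obs) != len(LOW):
        return False
    return all(lo <= x <= hi for x, lo, hi in zip(obs, LOW, HIGH))


# Hypothetical sample values, for illustration only.
sample = (256.0, -9.0, 144.0, 200.0)
print(in_bounds(sample))                       # True
print(in_bounds((256.0, 25.0, 144.0, 200.0)))  # False: second component exceeds 20
```

Note that the third bound spans ±`width` and the fourth spans 0 to `height`, which would suit a horizontal distance and a vertical position respectively. Whether the vector order returned by `flappy` matches the list above cannot be confirmed from this notebook alone, so it may be worth checking against the `flappy` module.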

In [1]:
import os

import numpy as np
from gymnasium import Env, spaces
from stable_baselines3 import A2C, DQN, PPO

from flappy import close, render_game, reset_game, step_game


class Flappyenv(Env):
    """Gymnasium wrapper around the functions exposed by the `flappy` module."""

    def __init__(self):
        super().__init__()
        self.width = 288
        self.height = 512
        self.action_space = spaces.Discrete(2)  # 0 = do nothing, 1 = flap
        self.observation_space = spaces.Box(
            low=np.array([0, -20, -self.width, 0], dtype=np.float32),
            high=np.array([self.height, 20, self.width, self.height], dtype=np.float32),
            dtype=np.float32,
        )
        self.current_obs = None

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)  # seeds self.np_random, as the Gymnasium API expects
        # Match the dtype and (4,) shape declared in observation_space.
        self.current_obs = np.asarray(reset_game(), dtype=np.float32).reshape(
            self.observation_space.shape
        )
        return self.current_obs, {}

    def step(self, action):
        # step_game is expected to return (obs, reward, terminated, truncated, info).
        return step_game(action=action)

    def render(self, mode="human"):
        render_game(mode=mode)

    def close(self):
        close()
        
        



## Training the RL Agent

The following function trains a reinforcement learning agent using a chosen Stable-Baselines3 algorithm.  
Each training session runs multiple iterations, periodically saving model checkpoints.

### Function: `train_agent`

**Parameters:**
- `algo` — The RL algorithm class (e.g., `PPO`, `A2C`, `DQN`)
- `algoname` — Name of the algorithm (used in saved model filenames)
- `model_dir` — Directory where model checkpoints will be saved
- `log_dir` — Directory for TensorBoard logs

### Training Logic
- Initializes the custom environment: `Flappyenv()`
- Creates the model using:
  - `"MlpPolicy"`
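  - `device="cuda"` (requires a CUDA-capable GPU)
  - TensorBoard logging enabled
- Trains in rounds of **100,000 timesteps**, with `reset_num_timesteps=False` so the step counter continues across rounds
- Saves the model after every round
- Stops after **5 rounds** (500,000 timesteps in total)
- The driver loop after the function repeats the whole session for five independent runs, writing to `models_test/{i}` and `logs_test/{i}`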

### Code Overview

In [None]:
def train_agent(algo, algoname, model_dir, log_dir):
    env = Flappyenv()
    model = algo("MlpPolicy", env, verbose=1, tensorboard_log=log_dir, device="cuda")
    TIMESTEPS = 100_000
    # Five rounds of 100,000 steps; reset_num_timesteps=False keeps one continuous
    # step counter (and TensorBoard curve) across rounds.
    for iters in range(1, 6):
        model.learn(total_timesteps=TIMESTEPS, reset_num_timesteps=False)
        model.save(f"{model_dir}/{algoname}_{TIMESTEPS * iters}")


# Train five independent PPO runs, each with its own model and log directory.
for i in range(5):
    model_dir = f"models_test/{i}"
    log_dir = f"logs_test/{i}"
    os.makedirs(model_dir, exist_ok=True)
    os.makedirs(log_dir, exist_ok=True)
    train_agent(PPO, "PPO", model_dir=model_dir, log_dir=log_dir)
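
To make the checkpoint naming concrete, here is a small sketch that lists the files one training session should leave behind. `checkpoint_paths` is a hypothetical helper, not part of the notebook. It relies on the fact that Stable-Baselines3's `save` appends `.zip` when the given path has no extension.

```python
def checkpoint_paths(model_dir, algoname, timesteps=100_000, iterations=5):
    """List the checkpoint files one training session writes, in order."""
    return [
        f"{model_dir}/{algoname}_{timesteps * i}.zip"
        for i in range(1, iterations + 1)
    ]


paths = checkpoint_paths("models_test/0", "PPO")
print(paths[0])   # models_test/0/PPO_100000.zip
print(paths[-1])  # models_test/0/PPO_500000.zip
```

The last entry of each run is the 500,000-step model, so a later `PPO.load` call needs a path that points into the matching `models_test/{i}` directory (or a copy of that file).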


## Testing the Trained Agent

This script loads a saved PPO model and runs it for one episode inside the custom `Flappyenv` environment.  
Actions are chosen deterministically (`deterministic=True`), so the rendered episode shows the learned policy without sampling noise.

### Code Overview

In [4]:

env = Flappyenv()
model = PPO.load("PPO_500000.zip", env=env)
obs, info = env.reset()

while True:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    env.render()
    if terminated or truncated:
        break
env.close()

Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
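
The loop above only renders the episode. To also record how well the agent did, the same rollout can accumulate reward and count steps. The sketch below is a minimal version under stated assumptions: `CountdownEnv` is a hypothetical stand-in environment and the policy is a plain function, so the example runs without the `flappy` module or a trained model. It also stops on `truncated`, and caps the episode length with `max_steps`.

```python
class CountdownEnv:
    """Hypothetical stand-in env: the episode terminates after `length` steps."""

    def __init__(self, length=3):
        self.length = length
        self.steps = 0

    def reset(self):
        self.steps = 0
        return 0.0, {}

    def step(self, action):
        self.steps += 1
        terminated = self.steps >= self.length
        # Same 5-tuple layout as Gymnasium: obs, reward, terminated, truncated, info
        return float(self.steps), 1.0, terminated, False, {}


def run_episode(env, policy, max_steps=10_000):
    """Roll out one episode and return (total_reward, steps_taken)."""
    obs, _ = env.reset()
    total_reward, steps = 0.0, 0
    for _ in range(max_steps):
        obs, reward, terminated, truncated, _ = env.step(policy(obs))
        total_reward += reward
        steps += 1
        if terminated or truncated:
            break
    return total_reward, steps


total, steps = run_episode(CountdownEnv(length=3), policy=lambda obs: 0)
print(total, steps)  # 3.0 3
```

With the real objects, the policy would be something like `lambda obs: model.predict(obs, deterministic=True)[0]` and the environment would be a `Flappyenv()` instance.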
