# TeamsClone-RL Training with Stable-Baselines3

This notebook demonstrates how to train a reinforcement learning agent using PPO (Proximal Policy Optimization) on the TeamsClone environment.

## Prerequisites
- Backend server running on `http://localhost:3001`
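- Stable-Baselines3 and Gym installed (the wrapper below uses the classic Gym API, where `reset()` returns only the observation and `step()` returns a 4-tuple; Stable-Baselines3 2.0 and later are built on Gymnasium, so a 1.x release is the closer match for this code)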
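
Before running the cells below, it can help to confirm that something is listening on the backend port. The following is a minimal sketch using only the standard library. The helper name `backend_is_reachable` is hypothetical, and it checks only that a TCP connection succeeds, not that the TeamsClone API responds correctly.

```python
import socket
from urllib.parse import urlparse


def backend_is_reachable(base_url="http://localhost:3001", timeout=2.0):
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parsed = urlparse(base_url)
    host = parsed.hostname or "localhost"
    # Fall back to the scheme's default port when the URL does not name one
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures
        return False


print("Backend reachable:", backend_is_reachable())
```

If this prints `False`, start the backend server before creating the environment.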

## 1. Install Dependencies

In [None]:
!pip install stable-baselines3 gym matplotlib requests numpy

## 2. Import Libraries

In [None]:
import sys
import os
import gym
from gym import spaces
import numpy as np
import matplotlib.pyplot as plt
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.callbacks import EvalCallback

# Add python_agent to path
sys.path.append(os.path.join(os.path.dirname(os.getcwd()), 'python_agent'))
from client import TeamsEnvClient

print("✅ All imports successful!")

## 3. Create Gym-Compatible Environment Wrapper

Wrap the TeamsEnvClient in a Gym environment for compatibility with Stable-Baselines3.

In [None]:
class TeamsGymEnv(gym.Env):
    """Custom Gym environment for TeamsClone-RL"""
    
    metadata = {'render.modes': ['human']}
    
    def __init__(self, base_url="http://localhost:3001", task_type="greeting_response"):
        super().__init__()
        
        self.client = TeamsEnvClient(base_url)
        self.task_type = task_type
        
        # Define action and observation spaces
        # 5 discrete actions: send_message, switch_channel, react_to_message, join_call, set_status
        self.action_space = spaces.Discrete(5)
        
        # Observation space: simplified state representation
        # 10 features: [step_count, message_count, channel_id, user_status, ...]
        self.observation_space = spaces.Box(
            low=0, high=100, shape=(10,), dtype=np.float32
        )
        
        self.action_map = [
            {"type": "send_message", "content": "Hello!"},
            {"type": "switch_channel", "channelId": "general"},
            {"type": "react_to_message", "messageId": "msg-1", "reaction": "👍"},
            {"type": "join_call"},
            {"type": "set_status", "status": "available"}
        ]
        
    def _get_obs(self, state_response):
        """Convert environment state to observation vector"""
        state = state_response.get("state", {})
        
        obs = np.zeros(10, dtype=np.float32)
        obs[0] = state.get("stepCount", 0)
        obs[1] = len(state.get("messages", []))
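        # Python's built-in hash() is salted per process for strings, so use a
        # deterministic byte-sum encoding that stays stable across kernel restarts
        obs[2] = sum(str(state.get("currentChannel", "")).encode("utf-8")) % 100
        obs[3] = sum(str(state.get("userPresence", "")).encode("utf-8")) % 10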
        
        return obs
    
    def reset(self):
        """Reset the environment"""
        episode_info = self.client.reset(task_type=self.task_type)
        state_response = self.client.get_state()
        return self._get_obs(state_response)
    
    def step(self, action):
        """Execute action in environment"""
        # Map integer action to environment action
        # (model.predict() returns a NumPy value, so cast it to a plain int)
        env_action = self.action_map[int(action)]
        
        # Execute step
        result = self.client.step(env_action)
        
        # Get new state
        state_response = self.client.get_state()
        obs = self._get_obs(state_response)
        
        reward = result.get("reward", 0.0)
        done = result.get("done", False)
        info = {"step_result": result}
        
        return obs, reward, done, info
    
    def render(self, mode='human'):
        """Render the environment (optional)"""
        pass
    
    def close(self):
        """Cleanup"""
        pass

print("✅ TeamsGymEnv class defined successfully!")
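
The `currentChannel` and `userPresence` features are categorical, so mapping them through a fixed vocabulary gives values that are stable between runs and easy to interpret. The sketch below shows one way to do that. The vocabularies and the helper name `encode_category` are assumptions for illustration, not values taken from the TeamsClone backend.

```python
# Hypothetical vocabularies: replace with the channel names and presence
# values that the backend actually reports.
CHANNELS = ["general", "random", "dev"]
PRESENCE = ["available", "busy", "away", "offline"]


def encode_category(value, vocabulary):
    """Map a string to a stable integer index; unknown values map to 0."""
    try:
        return vocabulary.index(value) + 1  # 1-based so 0 can mean "unknown"
    except ValueError:
        return 0


print(encode_category("general", CHANNELS))   # known value
print(encode_category("mystery", CHANNELS))   # unknown value
```

Reserving 0 for unknown values keeps the observation inside the declared `Box` bounds even when the backend reports a channel that is not in the vocabulary.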

## 4. Create and Test Environment Instance

In [None]:
# Create environment
env = TeamsGymEnv()

# Test reset
obs = env.reset()
print(f"Initial observation shape: {obs.shape}")
print(f"Initial observation: {obs}")

# Test single step
obs, reward, done, info = env.step(0)
print(f"\nAfter one step:")
print(f"  Observation: {obs}")
print(f"  Reward: {reward}")
print(f"  Done: {done}")
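
A random-action baseline gives a reference point for judging the trained agent later. The sketch below assumes only the classic Gym interface used above (`reset()` returning an observation, `step()` returning a 4-tuple). `StubEnv` is a hypothetical stand-in so the example runs without the backend; with the real environment you would pass `env` and `n_actions=5` instead.

```python
import random


def run_random_episode(env, n_actions, max_steps=50, seed=None):
    """Play one episode with uniformly random actions; return (total_reward, steps)."""
    rng = random.Random(seed)
    env.reset()
    total_reward, steps, done = 0.0, 0, False
    while not done and steps < max_steps:
        action = rng.randrange(n_actions)
        _obs, reward, done, _info = env.step(action)
        total_reward += reward
        steps += 1
    return total_reward, steps


class StubEnv:
    """Hypothetical stand-in: reward 1.0 per step, episode ends after 3 steps."""

    def reset(self):
        self.t = 0
        return [0.0]

    def step(self, action):
        self.t += 1
        return [float(self.t)], 1.0, self.t >= 3, {}


print(run_random_episode(StubEnv(), n_actions=5, seed=0))  # (3.0, 3)
```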

## 5. Initialize PPO Agent

In [None]:
# Wrap in DummyVecEnv for Stable-Baselines3 compatibility
vec_env = DummyVecEnv([lambda: TeamsGymEnv()])

# Initialize PPO agent
model = PPO(
    "MlpPolicy",
    vec_env,
    verbose=1,
    learning_rate=3e-4,
    n_steps=128,
    batch_size=64,
    n_epochs=4,
    gamma=0.99,
    tensorboard_log="./ppo_teams_tensorboard/"
)

print("✅ PPO agent initialized!")

## 6. Train the Agent

Train for a small number of timesteps for demonstration purposes.

In [None]:
# Train the model
print("🚀 Starting training...")
model.learn(total_timesteps=2000)
print("✅ Training complete!")

# Save the model
model.save("ppo_teams_agent")
print("💾 Model saved as 'ppo_teams_agent'")
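
With the settings from sections 5 and 6, PPO collects experience in rollouts of `n_steps` transitions per environment and then runs `n_epochs` passes of minibatch updates over each rollout. The arithmetic below is a sketch of what that implies for this run, assuming a single environment and that training stops at the first rollout boundary at or beyond `total_timesteps`. The helper name `ppo_budget` is hypothetical.

```python
import math


def ppo_budget(total_timesteps, n_steps, batch_size, n_epochs, n_envs=1):
    """Estimate rollouts, timesteps actually collected, and gradient steps."""
    rollout_size = n_steps * n_envs
    rollouts = math.ceil(total_timesteps / rollout_size)
    minibatches = math.ceil(rollout_size / batch_size)
    return {
        "rollouts": rollouts,
        "timesteps_collected": rollouts * rollout_size,
        "gradient_steps": rollouts * n_epochs * minibatches,
    }


print(ppo_budget(total_timesteps=2000, n_steps=128, batch_size=64, n_epochs=4))
# {'rollouts': 16, 'timesteps_collected': 2048, 'gradient_steps': 128}
```

Under these assumptions the 2,000-step request rounds up to 2,048 collected steps and about 128 minibatch updates, which is a very small budget, so the resulting agent should be treated as lightly trained.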

## 7. Evaluate the Trained Agent

In [None]:
# Load the trained model
model = PPO.load("ppo_teams_agent")

# Evaluate over multiple episodes
num_eval_episodes = 5
eval_rewards = []
eval_steps = []

for episode in range(num_eval_episodes):
    obs = env.reset()
    done = False
    total_reward = 0
    step_count = 0
    
    while not done and step_count < 50:
        action, _states = model.predict(obs, deterministic=True)
        obs, reward, done, info = env.step(action)
        total_reward += reward
        step_count += 1
    
    eval_rewards.append(total_reward)
    eval_steps.append(step_count)
    print(f"Episode {episode + 1}: Reward = {total_reward:.2f}, Steps = {step_count}")

print(f"\n📊 Average Reward: {np.mean(eval_rewards):.2f}")
print(f"📊 Average Steps: {np.mean(eval_steps):.2f}")
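
Five episodes is a small sample, so it may be worth reporting the spread alongside the mean. The sketch below uses only the standard library. The helper name `summarize_rewards` is hypothetical, and the example numbers are illustrative, not results from this notebook.

```python
import statistics


def summarize_rewards(rewards):
    """Return mean, sample standard deviation, min, and max of episode rewards."""
    mean = statistics.fmean(rewards)
    # Sample standard deviation needs at least two episodes
    std = statistics.stdev(rewards) if len(rewards) > 1 else 0.0
    return {"mean": mean, "std": std, "min": min(rewards), "max": max(rewards)}


# Illustrative numbers only, not results from this notebook
example_rewards = [1.0, 2.0, 3.0, 4.0, 5.0]
print(summarize_rewards(example_rewards))
```

In the notebook, calling `summarize_rewards(eval_rewards)` after the evaluation loop would give the same summary for the real episodes.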

## 8. Visualize Evaluation Results

In [None]:
# Plot evaluation results
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Plot rewards
ax1.plot(eval_rewards, marker='o', color='green')
ax1.set_xlabel('Episode')
ax1.set_ylabel('Total Reward')
ax1.set_title('PPO Agent - Episode Rewards')
ax1.grid(True, alpha=0.3)

# Plot steps
ax2.plot(eval_steps, marker='s', color='blue')
ax2.set_xlabel('Episode')
ax2.set_ylabel('Steps')
ax2.set_title('PPO Agent - Steps per Episode')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Summary

This notebook demonstrated:
1. Creating a Gym-compatible wrapper for TeamsClone-RL
2. Training a PPO agent using Stable-Baselines3
3. Evaluating the trained agent's performance
4. Visualizing evaluation results

### Next Steps
- Tune hyperparameters for better performance
- Train for more timesteps
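- Try other RL algorithms that support discrete actions (DQN, A2C); SAC requires a continuous action space, so it does not apply to this `Discrete(5)` setup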
- Improve state representation
- Add curriculum learning for complex tasks
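
As one concrete starting point for hyperparameter tuning, Stable-Baselines3 accepts a callable for `learning_rate` that receives the remaining training progress (1.0 at the start, 0.0 at the end). The sketch below builds a linear decay. The helper name `linear_schedule` is illustrative, and whether decay helps on this environment is untested here.

```python
def linear_schedule(initial_value):
    """Return a function mapping progress_remaining (1.0 -> 0.0) to a learning rate."""

    def schedule(progress_remaining):
        # Learning rate shrinks in proportion to the training left
        return progress_remaining * initial_value

    return schedule


lr = linear_schedule(3e-4)
print(lr(1.0), lr(0.5), lr(0.0))
```

The schedule could then be passed as `learning_rate=linear_schedule(3e-4)` when constructing `PPO` in place of the constant `3e-4`.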