# Multi-Objective RL (HalfCheetah) - Google Colab Notebook

This notebook is set up to quickly run an enterprise-grade Multi-Objective Reinforcement Learning (Multi-Objective RL, MORL) pipeline on Google Colab.

Contents:
- A custom MORL wrapper around the HalfCheetah environment (vector reward + normalization + freeze)
- PPO + GRPO (group-based gradient projection) agent
- Preference (weight) sampling (Dirichlet / corners)
- Metrics: Hypervolume (Monte Carlo and optional exact), IGD, IGD+
- Embedding manifold (UMAP / t-SNE) generation (optional)
- Compatibility with the batch-run and summary-table structure
- A Dummy Env fallback that takes over if the MuJoCo setup fails on Colab

Note: Remember to enable the GPU on Colab (Runtime -> Change runtime type -> GPU).


## 1. Installing the Required Libraries (requirements)
The cell below installs the required packages into the Colab environment. MuJoCo now installs through pip alongside gymnasium, but if version differences between Colab kernels cause errors, try the alternative installation.

First attempt, quick (minimal) installation:
```
!pip install gymnasium mujoco pymoo numpy pandas matplotlib seaborn scikit-learn umap-learn python-ternary
```

If the repository has a `requirements.txt` file and you have uploaded it to Drive, you can also install from it (replace `patikaniz` with your own folder):
```
from google.colab import drive
drive.mount('/content/drive')
!cp /content/drive/MyDrive/patikaniz/requirements.txt .
!pip install -r requirements.txt
```

The cell below runs the same installation quietly in a single step.

In [None]:
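%%bash
# Automatic installation. The %%bash magic must be the first line of the cell.
pip install gymnasium mujoco pymoo numpy pandas matplotlib seaborn scikit-learn umap-learn python-ternary > /dev/null 2>&1 \
  && echo "Installation complete." \
  || echo "Installation failed; rerun without the output redirect to see the error."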

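Because the installation cell discards pip's output, a failed install can go unnoticed. The following is a minimal sketch of a sanity check using only the standard library; `REQUIRED` and `missing_packages` are hypothetical names introduced here, and the list holds import names (for example `sklearn` for scikit-learn, `umap` for umap-learn, `ternary` for python-ternary) rather than pip names.

```python
import importlib.util

# Import names (not pip names) of the packages installed above.
REQUIRED = [
    "gymnasium", "mujoco", "pymoo", "numpy", "pandas",
    "matplotlib", "seaborn", "sklearn", "umap", "ternary",
]

def missing_packages(names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(REQUIRED)
print("Missing packages:", missing if missing else "none")
```

If the list is not empty, rerunning the pip command without `> /dev/null 2>&1` should show the underlying error.
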
## 2. Importing the Libraries
Once the installation has finished, import the required modules. MuJoCo sometimes needs additional GPU/renderer setup; if loading it fails, the fallback env takes over.


In [None]:
import os, math, json, random, time, csv, sys
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import MultivariateNormal

# pymoo (optional metrics)
try:
    from pymoo.indicators.igd import IGD
    from pymoo.util.nds.non_dominated_sorting import NonDominatedSorting
    try:
        from pymoo.indicators.hv import HV as ExactHV
    except Exception:
        ExactHV = None
except Exception as e:
    print("pymoo import failed, some metrics are disabled:", e)
    IGD = None
    NonDominatedSorting = None
    ExactHV = None

# Test the MuJoCo / Gymnasium setup
import gymnasium as gym  # imported outside the try block: the Dummy env below needs it too

_use_dummy = False
try:
    _probe_env = gym.make("HalfCheetah-v4")
    _probe_env.close()
    print("HalfCheetah-v4 loaded successfully.")
except Exception as e:
    print("Could not load the real HalfCheetah, DummyEnv will be used:", e)
    _use_dummy = True

class DummyHalfCheetah(gym.Env):
    metadata = {"render_modes": []}
    def __init__(self):
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, shape=(17,), dtype=np.float32)
        self.action_space = gym.spaces.Box(-1.0, 1.0, shape=(6,), dtype=np.float32)
        self.t = 0
        self.max_t = 300
        self.reward_space = gym.spaces.Box(-np.inf, np.inf, shape=(3,), dtype=np.float32)
    def reset(self, seed=None, options=None):
        self.t = 0
        obs = np.zeros(self.observation_space.shape, dtype=np.float32)
        return obs, {}
    def step(self, action):
        self.t += 1
        # synthetic speed, energy and smoothness rewards
        r_speed = np.random.randn()*0.1 + 1.0
        r_energy = -0.1 * float(np.sum(np.square(action)))
        r_smooth = -0.05 * float(np.sum(np.square(action)))
        r_vec_raw = np.array([r_speed, r_energy, r_smooth], dtype=np.float32)
        r_vec = (r_vec_raw - r_vec_raw.mean()) / (r_vec_raw.std()+1e-6)
        done = self.t >= self.max_t
        obs = np.random.randn(*self.observation_space.shape).astype(np.float32)
        info = {"reward_vec": r_vec, "reward_vec_raw": r_vec_raw}
        return obs, 0.0, done, False, info

print("Dummy mode:", _use_dummy)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Device:', device)


## 3. Adapting the Code for Colab
This section uses a minimal, merged (inline) version of the Python files in the repository. If you prefer, you can fetch them from your own GitHub/Drive source with `!wget` or `!git clone` and import them.

Below are a self-contained MORL HalfCheetah wrapper (with a Dummy fallback), a PPO+GRPO agent, and helper functions.

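The wrapper's "normalization + freeze" idea can be illustrated in isolation. The sketch below is a simplified stand-in, not the wrapper itself: `FreezableNormalizer` is a hypothetical helper that keeps a running mean and variance per objective, stops updating them after a fixed number of samples, and clips the standardized reward.

```python
import numpy as np

class FreezableNormalizer:
    """Hypothetical helper: running mean/variance that stops updating after `freeze_after` samples."""

    def __init__(self, dim, freeze_after=None, clip=5.0):
        self.mean = np.zeros(dim)
        self.var = np.ones(dim)
        self.count = 1e-4  # small prior count avoids division by zero
        self.seen = 0
        self.freeze_after = freeze_after
        self.clip = clip

    def __call__(self, x):
        x = np.asarray(x, dtype=np.float64)
        if self.freeze_after is None or self.seen < self.freeze_after:
            # Single-sample form of the parallel mean/variance update
            delta = x - self.mean
            total = self.count + 1.0
            m2 = self.var * self.count + delta**2 * self.count / total
            self.mean = self.mean + delta / total
            self.var = m2 / total
            self.count = total
        self.seen += 1
        z = (x - self.mean) / np.sqrt(np.maximum(self.var, 1e-6))
        return np.clip(z, -self.clip, self.clip)

rng = np.random.default_rng(0)
norm = FreezableNormalizer(dim=3, freeze_after=500)
samples = rng.normal(loc=[2.0, -1.0, 0.5], scale=[1.0, 0.5, 2.0], size=(1000, 3))

for x in samples[:500]:
    norm(x)
frozen_mean = norm.mean.copy()

# After the freeze point the statistics stay fixed; only the standardization is applied.
outputs = np.array([norm(x) for x in samples[500:]])
print("mean at freeze:", np.round(frozen_mean, 2))
print("mean unchanged after freeze:", np.allclose(frozen_mean, norm.mean))
```

Freezing keeps the reward scale stationary later in training, so the scalarized reward for a given preference does not drift as the statistics change.
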

In [None]:
# === MORL HalfCheetah Wrapper (simplified) ===
import dataclasses
from dataclasses import dataclass

class RunningMeanStd:
    def __init__(self, shape):
        self.mean = np.zeros(shape, dtype=np.float64)
        self.var = np.ones(shape, dtype=np.float64)
        self.count = 1e-4
    def update(self, x):
        x = np.asarray(x, dtype=np.float64)
        if x.ndim == 1: x = x[None, :]
        batch_mean = x.mean(axis=0); batch_var = x.var(axis=0); batch_count = x.shape[0]
        delta = batch_mean - self.mean; tot = self.count + batch_count
        new_mean = self.mean + delta * (batch_count / tot)
        m_a = self.var * self.count; m_b = batch_var * batch_count
        M2 = m_a + m_b + (delta**2) * self.count * batch_count / tot
        self.mean, self.var, self.count = new_mean, M2 / tot, tot

@dataclass
class HCConfig:
    speed_mode: str = "target_speed"
    target_speed: float = 2.0
    alpha_energy: float = 0.1
    beta_smooth: float = 0.05
    normalize: bool = True
    norm_clip: float = 5.0
    freeze_after_steps: int | None = 10000

if not _use_dummy:
    class HalfCheetahMORL(gym.Wrapper):
        def __init__(self, cfg: HCConfig):
            env = gym.make("HalfCheetah-v4")
            super().__init__(env)
            self.cfg = cfg
            self.dt = float(getattr(self.env.unwrapped, 'dt', 0.01))
            self.prev_x = 0.0
            self.prev_action = None
            self.rms = RunningMeanStd((3,)) if cfg.normalize else None
            self._norm_steps = 0
            self.reward_space = gym.spaces.Box(low=-np.inf, high=np.inf, shape=(3,), dtype=np.float32)
        def reset(self, **kwargs):
            obs, info = self.env.reset(**kwargs)
            try: self.prev_x = float(self.env.unwrapped.data.qpos[0])
            except Exception: self.prev_x = 0.0
            self.prev_action = np.zeros(self.env.action_space.shape, dtype=np.float32)
            return obs, info
        def step(self, action):
            obs, _, term, trunc, info = self.env.step(action)
            try: x = float(self.env.unwrapped.data.qpos[0])
            except Exception: x = 0.0
            vx = (x - self.prev_x)/self.dt; self.prev_x = x
            if self.cfg.speed_mode == 'target_speed':
                r_speed = -abs(vx - self.cfg.target_speed)
            else:
                r_speed = vx
            r_energy = -self.cfg.alpha_energy * np.sum(np.square(action))
            r_smooth = -self.cfg.beta_smooth * np.sum(np.square(action - self.prev_action))
            self.prev_action = action.astype(np.float32, copy=False)
            r_vec_raw = np.array([r_speed, r_energy, r_smooth], dtype=np.float32)
            r_vec = r_vec_raw.copy()
            if self.rms is not None:
                if not (self.cfg.freeze_after_steps and self._norm_steps >= self.cfg.freeze_after_steps):
                    self.rms.update(r_vec)
                self._norm_steps += 1
                std = np.sqrt(np.clip(self.rms.var, 1e-6, None))
                r_vec = (r_vec - self.rms.mean)/std
                r_vec = np.clip(r_vec, -self.cfg.norm_clip, self.cfg.norm_clip)
            info = dict(info); info['reward_vec'] = r_vec; info['reward_vec_raw'] = r_vec_raw
            return obs, 0.0, term, trunc, info
else:
    # In dummy mode, reward_space is already defined on the env
    HalfCheetahMORL = DummyHalfCheetah


def make_env():
    if _use_dummy:
        return HalfCheetahMORL()
    return HalfCheetahMORL(HCConfig())

# === Helper Functions ===

def update_global_nd(global_front: np.ndarray, new_points: np.ndarray):
    if new_points.size == 0: return global_front
    if global_front.size == 0: combined = new_points
    else: combined = np.vstack([global_front, new_points])
    if NonDominatedSorting is None:
        # fallback: simple filter (O(n^2)), objectives are maximized
        nd = []
        for i,p in enumerate(combined):
            dominated = False
            for j,q in enumerate(combined):
                if j!=i and np.all(q>=p) and np.any(q>p):
                    dominated = True; break
            if not dominated: nd.append(p)
        return np.array(nd, dtype=np.float32)
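    # pymoo's NonDominatedSorting assumes minimization; negate so that larger rewards
    # are better, consistent with the fallback filter above.
    nds = NonDominatedSorting().do(-combined, only_non_dominated_front=True)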
    return combined[nds]


def compute_normalized_igd(global_front: np.ndarray, test_points: np.ndarray, min_ref: np.ndarray, max_ref: np.ndarray):
    if global_front.size==0 or test_points.size==0 or IGD is None:
        return np.nan, min_ref, max_ref
    min_ref = np.minimum(min_ref, global_front.min(axis=0))
    max_ref = np.maximum(max_ref, global_front.max(axis=0))
    rng = max_ref - min_ref; rng[rng<=1e-9]=1.0
    norm_ref = (global_front - min_ref)/rng
    norm_test = (test_points - min_ref)/rng
    igd_calc = IGD(norm_ref)
    return float(igd_calc(norm_test)), min_ref, max_ref


def monte_carlo_hv(norm_points: np.ndarray, samples: int = 3000):
    if norm_points.size==0: return np.nan
    d = norm_points.shape[1]
    # keep only the non-dominated points
    nd = update_global_nd(np.empty((0,d),dtype=np.float32), norm_points).astype(np.float32)
    U = np.random.rand(samples, d)
    dominate = (nd[None,...] >= U[:,None,:]).all(axis=2).any(axis=1)
    return float(dominate.mean())


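def exact_hv(points: np.ndarray, ref=None):
    # Objectives are maximized here, while pymoo's HV assumes minimization with the
    # reference point as an upper bound, so both points and reference are negated.
    if ExactHV is None or points.size==0: return np.nan
    try:
        d = points.shape[1]
        if ref is None: ref = np.zeros(d, dtype=np.float64)
        hv = ExactHV(ref_point=-np.asarray(ref, dtype=np.float64))
        return float(hv(-np.asarray(points, dtype=np.float64)))
    except Exception:
        return np.nan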


def igd_plus(ref: np.ndarray, approx: np.ndarray):
    # IGD+ for maximized objectives: only the amount by which an approximation point
    # falls short of the reference point counts toward the distance.
    if ref.size==0 or approx.size==0: return np.nan
    dists = []
    for r in ref:
        diff_pos = np.clip(r - approx, 0, None)
        dists.append(np.linalg.norm(diff_pos, axis=1).min())
    return float(np.mean(dists)) if dists else np.nan

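# === PPO + GRPO ===
# Note: in this simplified inline version the GRPO-related options (use_grpo, group_mode,
# knn_delta) and target_kl are stored on the agent but not used by update(); it runs plain PPO.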
class FiLMLayer(nn.Module):
    """Feature-wise linear modulation: scales and shifts x using a conditioning vector."""
    def __init__(self, input_dim, cond_dim):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(cond_dim, 128), nn.ReLU(), nn.Linear(128, 2 * input_dim))
        self.input_dim = input_dim
    def forward(self, x, cond):
        gb = self.fc(cond)
        g, b = gb[:, :self.input_dim], gb[:, self.input_dim:]
        return g * x + b

class Actor(nn.Module):
    def __init__(self, s_dim, a_dim, w_dim, h=256):
        super().__init__()
        self.base = nn.Sequential(nn.Linear(s_dim, h), nn.Tanh())
        self.film = FiLMLayer(h, w_dim)
        self.head = nn.Sequential(nn.Linear(h, h), nn.Tanh(), nn.Linear(h, a_dim))
        self.action_var = nn.Parameter(torch.full((a_dim,), 0.5))
    def forward(self, s, w):
        h = self.base(s)
        h = self.film(h, w)
        return self.head(h)
    def evaluate(self, s, a, w):
        mean = self.forward(s, w)
        var = self.action_var.expand_as(mean)
        dist = MultivariateNormal(mean, torch.diag_embed(var))
        lp = dist.log_prob(a)
        ent = dist.entropy()
        return lp, mean, ent

class Critic(nn.Module):
    def __init__(self, s_dim, w_dim, h=256):
        super().__init__()
        self.base = nn.Sequential(nn.Linear(s_dim, h), nn.Tanh())
        self.film = FiLMLayer(h, w_dim)
        self.head = nn.Sequential(nn.Linear(h, h), nn.Tanh(), nn.Linear(h, 1))
    def forward(self, s, w):
        h = self.base(s)
        h = self.film(h, w)
        return self.head(h)

class Memory:
    """Rollout buffer: one list per field, appended to step by step."""
    def __init__(self):
        self.actions = []
        self.states = []
        self.prefs = []
        self.logprobs = []
        self.rewards = []
        self.raw_rewards = []
        self.is_terminals = []
    def clear(self):
        self.actions.clear()
        self.states.clear()
        self.prefs.clear()
        self.logprobs.clear()
        self.rewards.clear()
        self.raw_rewards.clear()
        self.is_terminals.clear()

class PPO(nn.Module):
    def __init__(self, s_dim, a_dim, w_dim, hidden=256, lr=3e-4, betas=(0.9, 0.999), gamma=0.99, K=10,
                 eps_clip=0.2, ent_coef=0.01, use_grpo=True, group_mode='knn', knn_delta=0.15,
                 gae_lambda=0.95, target_kl=0.02):
        super().__init__()
        self.gamma = gamma
        self.eps_clip = eps_clip
        self.K = K
        self.ent_coef = ent_coef
        self.use_grpo = use_grpo
        self.group_mode = group_mode
        self.knn_delta = knn_delta
        self.gae_lambda = gae_lambda
        self.target_kl = target_kl
        self.actor = Actor(s_dim, a_dim, w_dim, hidden).to(device)
        self.old_actor = Actor(s_dim, a_dim, w_dim, hidden).to(device)
        self.old_actor.load_state_dict(self.actor.state_dict())
        self.critic = Critic(s_dim, w_dim, hidden).to(device)
        self.opt = torch.optim.Adam(list(self.actor.parameters()) + list(self.critic.parameters()), lr=lr, betas=betas)
        self.mse = nn.MSELoss()
    def select_action(self, s, w, memory: Memory):
        s_t = torch.as_tensor(s, dtype=torch.float32, device=device).unsqueeze(0)
        w_t = torch.as_tensor(w, dtype=torch.float32, device=device).unsqueeze(0)
        with torch.no_grad():
            mean = self.old_actor(s_t, w_t)
            var = self.old_actor.action_var.expand_as(mean)
            dist = MultivariateNormal(mean, torch.diag_embed(var))
            a = dist.sample()
            lp = dist.log_prob(a)
        memory.states.append(s_t)
        memory.prefs.append(w_t)
        memory.actions.append(a)
        memory.logprobs.append(lp)
        return a.squeeze(0).cpu().numpy()
    def update(self, memory: Memory):
        states = torch.cat(memory.states)
        actions = torch.cat(memory.actions)
        prefs = torch.cat(memory.prefs)
        old_logp = torch.cat(memory.logprobs).detach()
        rewards = torch.as_tensor(memory.rewards, dtype=torch.float32, device=device)
        dones = torch.as_tensor(memory.is_terminals, dtype=torch.float32, device=device)
        with torch.no_grad():
            values = self.critic(states, prefs).squeeze(-1)
        # Generalized advantage estimation (GAE), computed backwards over the buffer
        adv = torch.zeros_like(rewards)
        last = 0.0
        for t in reversed(range(len(rewards))):
            # dones[t] marks that the episode ended at step t, so the value of step t+1
            # (which belongs to the next episode) must not be bootstrapped from.
            next_non_term = 1.0 - dones[t]
            next_val = values[t+1] if t < len(rewards)-1 else 0.0
            delta = rewards[t] + self.gamma*next_val*next_non_term - values[t]
            last = delta + self.gamma*self.gae_lambda*next_non_term*last
            adv[t] = last
        returns = adv + values
        adv = (adv - adv.mean()) / (adv.std() + 1e-8)
        for _ in range(self.K):
            logp, _, ent = self.actor.evaluate(states, actions, prefs)
            new_values = self.critic(states, prefs).squeeze(-1)
            ratios = torch.exp(logp - old_logp)
            # Clipped surrogate objective
            surr1 = ratios*adv
            surr2 = torch.clamp(ratios, 1-self.eps_clip, 1+self.eps_clip)*adv
            policy_core = -torch.min(surr1, surr2).mean()
            value_loss = self.mse(new_values, returns)
            entropy_loss = ent.mean()
            loss = policy_core + 0.5*value_loss - self.ent_coef*entropy_loss
            self.opt.zero_grad()
            loss.backward()
            self.opt.step()
        self.old_actor.load_state_dict(self.actor.state_dict())
        with torch.no_grad():
            new_logp, _, _ = self.actor.evaluate(states, actions, prefs)
        kl = (old_logp - new_logp).mean().item()
        # Explained variance of the critic, using the values from the last epoch
        ev = float(1 - torch.var(returns - new_values.detach())/(torch.var(returns) + 1e-8))
        return value_loss.item(), kl, ev

# === Preference Sampling ===
def sample_prefs(K=24,m=3,include_corners=True,seed=0):
    rng=np.random.default_rng(seed); prefs=[]
    if include_corners:
        for i in range(m):
            e=np.zeros(m,dtype=np.float32); e[i]=1.0; prefs.append(e)
    while len(prefs)<K:
        # uniform sample from the simplex (Dirichlet with all concentration parameters equal to 1)
        x=rng.dirichlet(np.ones(m)); prefs.append(x.astype(np.float32))
    return np.stack(prefs[:K],axis=0)

print('Code blocks loaded.')

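To make the Pareto filtering and the Monte Carlo hypervolume estimate concrete, here is a small standalone sketch on four hand-picked 2D points. It assumes both objectives are maximized and already scaled to [0, 1]; `pareto_front_max` and `mc_hypervolume` are illustrative names, not the helpers defined in the cell above.

```python
import numpy as np

def pareto_front_max(points):
    """Keep the points that no other point dominates (all objectives maximized)."""
    pts = np.asarray(points, dtype=np.float64)
    keep = []
    for i, p in enumerate(pts):
        others = np.delete(pts, i, axis=0)
        # q dominates p if q >= p in every objective and q > p in at least one
        dominated = np.any(np.all(others >= p, axis=1) & np.any(others > p, axis=1))
        if not dominated:
            keep.append(i)
    return pts[keep]

def mc_hypervolume(points, samples=20000, seed=0):
    """Fraction of uniform samples in the unit cube that some point dominates."""
    rng = np.random.default_rng(seed)
    u = rng.random((samples, points.shape[1]))
    covered = (points[None, :, :] >= u[:, None, :]).all(axis=2).any(axis=1)
    return float(covered.mean())

pts = np.array([[0.8, 0.2], [0.5, 0.5], [0.2, 0.8], [0.4, 0.4]])
front = pareto_front_max(pts)   # [0.4, 0.4] is dominated by [0.5, 0.5]
hv = mc_hypervolume(front)
print("front size:", len(front))
print("Monte Carlo HV estimate:", round(hv, 3))
```

For these points the exact dominated area is 0.8*0.2 + 0.5*0.3 + 0.2*0.3 = 0.37, so the estimate should land close to that value, with the error shrinking as `samples` grows.
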

## 4. Example Training & Metric Computation
This section runs a short training loop (30 episodes of up to 400 steps each) and computes the metrics (IGD, IGD+, Monte Carlo HV). For longer runs, increase `max_episodes` and `update_timestep`. Because of Colab's time limits, only a small demo is run here.

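The difference between IGD and IGD+ is easiest to see on a tiny example. The sketch below assumes maximized objectives; `igd` and `igd_plus_max` are illustrative implementations written for this example rather than the pymoo indicator or the notebook helper.

```python
import numpy as np

def igd(ref, approx):
    """Mean Euclidean distance from each reference point to its nearest approximation point."""
    d = np.linalg.norm(ref[:, None, :] - approx[None, :, :], axis=2)
    return float(d.min(axis=1).mean())

def igd_plus_max(ref, approx):
    """IGD+ for maximization: only shortfalls relative to the reference point are counted."""
    shortfall = np.clip(ref[:, None, :] - approx[None, :, :], 0.0, None)
    return float(np.linalg.norm(shortfall, axis=2).min(axis=1).mean())

ref = np.array([[1.0, 0.0], [0.0, 1.0]])
approx = np.array([[1.0, 0.5], [0.0, 0.5]])

print("IGD :", igd(ref, approx))           # 0.5
print("IGD+:", igd_plus_max(ref, approx))  # 0.25
```

The point (1.0, 0.5) is at least as good as the reference point (1.0, 0.0) in both objectives, so IGD+ charges nothing for it, whereas plain IGD still counts the 0.5 gap.
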

In [None]:
# Small demo training run
max_episodes = 30
update_timestep = 800  # step-based update threshold (roughly 2-3 episodes)
max_ep_len = 400

env = make_env()
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.shape[0]
if hasattr(env,'reward_space'):
    pref_dim = env.reward_space.shape[0]
else:
    pref_dim = 3

ppo = PPO(state_dim, action_dim, pref_dim, K=5)
memory = Memory()

train_prefs = sample_prefs(K=16, m=pref_dim, include_corners=True, seed=42)
# Test preferences (different seed)
test_prefs = sample_prefs(K=16, m=pref_dim, include_corners=True, seed=123)

global_front = np.empty((0,pref_dim), dtype=np.float32)
min_ref = np.full(pref_dim, np.inf, dtype=np.float32)
max_ref = np.full(pref_dim, -np.inf, dtype=np.float32)

results = []
time_step = 0
steps_since_update = 0

for ep in range(1, max_episodes+1):
    s, _ = env.reset()
    # Pick a random training preference
    w = train_prefs[np.random.randint(len(train_prefs))]
    ep_vec_norm = np.zeros(pref_dim, dtype=np.float32)
    for t in range(max_ep_len):
        a = ppo.select_action(s, w, memory)
        s2, _, term, trunc, info = env.step(a)
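        # Treat the max_ep_len cap as an episode boundary as well, so that advantages
        # are not bootstrapped across two different episodes in the buffer.
        done = bool(term or trunc or (t == max_ep_len - 1))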
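        # `a or b` raises for NumPy arrays (ambiguous truth value), so test for None explicitly
        r_vec = info.get('reward_vec')
        if r_vec is None: r_vec = info.get('reward_vec_raw')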
        scalar_r = float(np.dot(r_vec, w))
        memory.rewards.append(scalar_r)
        memory.raw_rewards.append(r_vec)
        memory.is_terminals.append(done)
        ep_vec_norm += r_vec
        s = s2
        time_step += 1
        steps_since_update += 1
        if steps_since_update >= update_timestep:
            vloss, kl, ev = ppo.update(memory)
            memory.clear()
            steps_since_update = 0
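            # Test rollout metrics. A separate env is used so that the training episode
            # still in progress on `env` is not reset mid-way.
            eval_env = make_env()
            if getattr(env, 'rms', None) is not None:
                # reuse the training env's reward-normalization statistics and freeze counter
                eval_env.rms = env.rms
                eval_env._norm_steps = env._norm_steps
            test_returns = []
            for tw in test_prefs:
                st,_ = eval_env.reset(); vec_sum = np.zeros(pref_dim, dtype=np.float32)
                for _ in range(200):
                    with torch.no_grad():
                        mean = ppo.old_actor(torch.as_tensor(st,dtype=torch.float32,device=device).unsqueeze(0), torch.as_tensor(tw,dtype=torch.float32,device=device).unsqueeze(0))
                        act = mean.squeeze(0).cpu().numpy()
                    st2, _, d1, d2, inf2 = eval_env.step(act)
                    # `a or b` raises for NumPy arrays, so test for None explicitly
                    rv = inf2.get('reward_vec')
                    if rv is None: rv = inf2.get('reward_vec_raw')
                    vec_sum += rv
                    st = st2
                    if d1 or d2: break
                test_returns.append(vec_sum)
            eval_env.close()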
            test_returns = np.array(test_returns, dtype=np.float32)
            igd_m, min_ref, max_ref = compute_normalized_igd(global_front, test_returns, min_ref, max_ref)
            igd_p = igd_plus(global_front, test_returns) if global_front.size>0 else np.nan
            if global_front.size>0:
                rng = max_ref - min_ref; rng[rng<=1e-9]=1.0
                train_norm = np.clip((global_front - min_ref)/rng,0,1)
                hv_train = monte_carlo_hv(train_norm)
            else: hv_train=np.nan
            if test_returns.size>0 and global_front.size>0:
                rng = max_ref - min_ref; rng[rng<=1e-9]=1.0
                test_norm = np.clip((test_returns - min_ref)/rng,0,1)
                hv_test = monte_carlo_hv(test_norm)
            else: hv_test=np.nan
            results.append(dict(episode=ep, vloss=vloss, kl=kl, ev=ev, igd=igd_m, igd_plus=igd_p, hv_train=hv_train, hv_test=hv_test))
            print(f"[Update] Ep {ep} KL {kl:.4f} IGD {igd_m:.4f} HV(train) {hv_train:.3f}")
        if done:
            break
    global_front = update_global_nd(global_front, ep_vec_norm.reshape(1,-1))

print('Training finished. ND front size:', len(global_front))
res_df = pd.DataFrame(results)
res_df.head()

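The update step in the loop above relies on generalized advantage estimation (GAE) over a buffer that can span several episodes. The following NumPy sketch, with a hypothetical `gae` function and toy numbers, shows how the `done` flags stop advantages from leaking across an episode boundary; it illustrates the technique and is not a copy of the agent's code.

```python
import numpy as np

def gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Backward GAE pass; dones[t] = 1 means the episode ended at step t."""
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    dones = np.asarray(dones, dtype=np.float64)
    adv = np.zeros_like(rewards)
    last = 0.0
    for t in reversed(range(len(rewards))):
        non_terminal = 1.0 - dones[t]
        next_value = values[t + 1] if t + 1 < len(rewards) else 0.0
        delta = rewards[t] + gamma * next_value * non_terminal - values[t]
        last = delta + gamma * lam * non_terminal * last
        adv[t] = last
    return adv, adv + values

rewards = [1.0, 1.0, 1.0]
values = [0.0, 0.0, 0.0]

# Episode boundary after step 1: step 2 belongs to a new episode.
adv, returns = gae(rewards, values, dones=[0, 1, 0], gamma=1.0, lam=1.0)
# Same data without any boundary, for comparison.
adv_no_boundary, _ = gae(rewards, values, dones=[0, 0, 0], gamma=1.0, lam=1.0)

print("with boundary   :", adv)
print("without boundary:", adv_no_boundary)
```

With the boundary, the advantages are [2, 1, 1]: step 1 only sees its own reward. Without it they become [3, 2, 1], because the reward of the following episode is wrongly credited to earlier steps.
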
### 4.1 Visualizing the Results
The cell below plots how the metrics evolve during training and the approximated Pareto front.


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='darkgrid')

if not res_df.empty:
    fig, axes = plt.subplots(1,3, figsize=(15,4))
    res_df.plot(x='episode', y='igd', ax=axes[0], title='IGD')
    res_df.plot(x='episode', y='hv_train', ax=axes[1], title='HV Train (MC)')
    res_df.plot(x='episode', y='hv_test', ax=axes[2], title='HV Test (MC)')
    plt.show()
else:
    print('Result DataFrame is empty.')

# ND front scatter (first two objectives, colored by the third)
if len(global_front)>0:
    plt.figure(figsize=(5,4))
    plt.scatter(global_front[:,0], global_front[:,1], c=global_front[:,2], cmap='viridis')
    plt.colorbar(label='Obj3')
    plt.xlabel('Obj1'); plt.ylabel('Obj2'); plt.title('Approximated Pareto Points')
    plt.show()


### 4.2 Further Suggestions and Speed-ups
- Longer training: increase `max_episodes` and `update_timestep`.
- Multiple seeds: rerun the training cell in a loop with different seeds (set `np.random.seed()` and `torch.manual_seed()` as well as `random.seed()`, since the code draws from NumPy and PyTorch) and summarize the runs with `pd.concat`.
- Profiling: install `torch-tb-profiler` with pip (for example in a `%%bash` cell) and use `torch.profiler`.
- Embedding manifold: with an extra buffer and a UMAP / t-SNE computation, the action-preference space can be examined.
- Exact HV: when the number of objectives is <=3, `pymoo.indicators.hv.HV` can be used (it can be computationally expensive).

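For the exact hypervolume suggestion, the two-objective case is simple enough to compute by hand, which gives a cheap way to sanity-check a Monte Carlo estimate. The sketch below is a plain-Python sweep, assuming both objectives are maximized; `hypervolume_2d_max` is a hypothetical helper written for this example and is unrelated to pymoo's implementation.

```python
def hypervolume_2d_max(points, ref=(0.0, 0.0)):
    """Exact area dominated by `points` above the reference point `ref` (maximization)."""
    # Only points strictly better than the reference in both objectives contribute.
    pts = [p for p in points if p[0] > ref[0] and p[1] > ref[1]]
    # Sweep from the largest first objective to the smallest.
    pts.sort(key=lambda p: p[0], reverse=True)
    area = 0.0
    best_y = ref[1]
    for x, y in pts:
        if y > best_y:
            # New horizontal slab between the previous best y and this point's y
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area

points = [(0.8, 0.2), (0.5, 0.5), (0.2, 0.8), (0.4, 0.4)]
hv_exact = hypervolume_2d_max(points)
print("exact 2D hypervolume:", round(hv_exact, 4))
```

Dominated points such as (0.4, 0.4) are skipped automatically because they never raise `best_y`. For three objectives the sweep no longer applies directly, which is where a library indicator becomes the more practical choice.
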
This notebook demonstrates the basic end-to-end flow. The full script in the repository (training_morl_example.py) includes more comprehensive logging and aggregate analysis.
