# Dataset

## Description
We use a reinforcement learning problem to visualize the trajectories that different algorithms take from the start state to the goal state. The environment we selected is ***Cliff Walking***, provided by ***gym***: https://www.gymlibrary.dev/environments/toy_text/cliff_walking/

We explore the behaviour of four algorithms: a **random policy**, **SARSA** with an epsilon-greedy policy, **Q Learning** with an epsilon-greedy policy, and **Expected SARSA** with an epsilon-greedy policy.
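
The three learning algorithms differ mainly in the value they bootstrap from at the next state. The following is a minimal sketch of the textbook update targets, not the code used to generate this dataset; the function names and the values of `gamma` and `epsilon` are placeholders chosen for illustration, since the hyperparameters are not listed yet.

```python
def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of an epsilon-greedy policy (ties go to the first maximum)."""
    n_actions = len(q_values)
    greedy = max(range(n_actions), key=lambda a: q_values[a])
    probs = [epsilon / n_actions] * n_actions
    probs[greedy] += 1.0 - epsilon
    return probs


def td_target(algorithm, reward, next_q, next_action, gamma, epsilon, done):
    """Textbook TD target for one transition; next_q holds Q(next_state, a) for every action a."""
    if done:
        return reward  # no bootstrapping from a terminal state
    if algorithm == "sarsa":
        bootstrap = next_q[next_action]  # value of the action actually taken next
    elif algorithm == "q_learning":
        bootstrap = max(next_q)  # value of the greedy action
    elif algorithm == "expected_sarsa":
        probs = epsilon_greedy_probs(next_q, epsilon)
        bootstrap = sum(p * q for p, q in zip(probs, next_q))  # expectation under the policy
    else:
        raise ValueError(f"unknown algorithm: {algorithm}")
    return reward + gamma * bootstrap


# Illustrative numbers only (assumed, not taken from the dataset)
next_q = [-3.0, -1.0, -2.0, -4.0]
targets = {
    name: td_target(name, reward=-1, next_q=next_q, next_action=2,
                    gamma=1.0, epsilon=0.1, done=False)
    for name in ("sarsa", "q_learning", "expected_sarsa")
}
```

With these numbers SARSA bootstraps from the sampled next action, Q-learning from the best next action, and Expected SARSA from the policy-weighted average, so the three targets differ even for the same transition.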

For each of these algorithms, we ran 5000 episodes, which is intended to give the learning algorithms enough time to converge. TODO: insert the hyperparameters for each algorithm.

The dataset consists of 5000 episodes per algorithm. Each episode is a list of (state, action, reward, next_state, done) tuples; the `done` flag is not carried over into the DataFrame built below.
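
To make the episode structure concrete, here is a small sketch that reads one episode and derives its return and the grid cells it visits. The first four fields of each tuple are copied from episode 0 of Expected SARSA as displayed further down; the `done` flags are assumed, and the row-major 4 x 12 layout (`state = row * 12 + col`) follows the gym documentation and should be checked against the environment that actually produced the data.

```python
N_COLS = 12  # assumed grid width (4 x 12 Cliff Walking layout)


def state_to_cell(state, n_cols=N_COLS):
    """Convert a flat state index into a (row, col) grid cell, assuming row-major order."""
    return divmod(state, n_cols)


# (state, action, reward, next_state, done); the done flags are assumed for illustration
episode = [
    (36, 3, -1, 24, False),
    (24, 0, -1, 24, False),
    (24, 1, -1, 36, False),
    (36, 2, -100, 37, True),
]

episode_length = len(episode)
episode_return = sum(reward for _, _, reward, _, _ in episode)
visited_cells = [state_to_cell(state) for state, *_ in episode]
```

Under these assumptions the episode starts in the bottom-left cell `(3, 0)` and ends with the -100 penalty after four steps.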

In [3]:
# Feel free to add dependencies, but make sure that they are included in environment.yml

# Suppress FutureWarning messages
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)

# Render figures inline instead of in a new window
%matplotlib inline

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import altair as alt
from altair import datum
alt.data_transformers.disable_max_rows()

from sklearn import manifold
from sklearn.decomposition import PCA
from openTSNE import TSNE
from umap import UMAP

In [4]:
with open('expected_sarsa.npy', 'rb') as f:
    expected_sarsa = np.load(f, allow_pickle=True)
with open('q_learning.npy', 'rb') as f:
    q_learning = np.load(f, allow_pickle=True)
with open('random.npy', 'rb') as f:
    random = np.load(f, allow_pickle=True)
with open('sarsa.npy', 'rb') as f:
    sarsa = np.load(f, allow_pickle=True)
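
The files are loaded with `allow_pickle=True`, which is what NumPy requires for object arrays, and episodes of different lengths can only be stored as an object array. The sketch below reproduces this on a tiny hand-made array; the file name `demo_trajectories.npy` and the two episodes are made up for illustration. Note that unpickling can execute arbitrary code, so this flag should only be used on files from a trusted source.

```python
import os
import tempfile

import numpy as np

# Two episodes of different length, stored as Python lists inside an object array
episodes = np.empty(2, dtype=object)
episodes[0] = [(36, 3, -1, 24, False)]
episodes[1] = [(36, 1, -1, 36, False), (36, 2, -100, 37, True)]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_trajectories.npy")  # hypothetical file name
    np.save(path, episodes, allow_pickle=True)

    # Object arrays can only be read back when pickling is explicitly allowed
    loaded = np.load(path, allow_pickle=True)

    try:
        np.load(path, allow_pickle=False)
        refused_without_pickle = False
    except ValueError:
        refused_without_pickle = True

episode_lengths = [len(episode) for episode in loaded]
```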

## Create a pandas DataFrame

TODO: decide how to filter the data so that the plots are not too crowded while the convergence behaviour stays visible.
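
One possible strategy, offered only as a sketch and not as a decision made in this notebook, is to keep episodes at logarithmically spaced indices. Early episodes, where behaviour changes most, are then sampled densely and late episodes sparsely. The helper name `log_spaced_indices` and the choice of 50 kept episodes are assumptions.

```python
import math


def log_spaced_indices(n_episodes, n_keep):
    """Return sorted, unique episode indices in [0, n_episodes), denser near 0.

    Fewer than n_keep indices may come back because early duplicates are merged.
    """
    if n_episodes <= 0 or n_keep <= 0:
        return []
    if n_keep == 1:
        return [0]
    indices = set()
    for i in range(n_keep):
        # Interpolate between 1 and n_episodes in log space, then shift to 0-based
        fraction = i / (n_keep - 1)
        index = int(round(math.exp(fraction * math.log(n_episodes)))) - 1
        indices.add(min(max(index, 0), n_episodes - 1))
    return sorted(indices)


kept = log_spaced_indices(n_episodes=5000, n_keep=50)
kept_set = set(kept)  # use `episode_index in kept_set` as the filter condition
```

A plain stride such as `episode_index % 2 == 0` is simpler but spends most of the kept episodes on the late, already converged part of training.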

In [8]:
import pandas as pd

# Algorithm variable
algorithms = ["Expected SARSA", "Q LEARNING", "RANDOM", "SARSA"]
trajectories_algos = [expected_sarsa, q_learning, random, sarsa]

# List to hold data for the DataFrame
data = []

for trajectories_algo, algo in zip(trajectories_algos, algorithms):
    for episode_index, trajectory in enumerate(trajectories_algo):
        # TODO: add/adapt a filtering strategy here to get a smaller dataset,
        # e.g. uncomment the next two lines to keep only every second episode:
        # if episode_index % 2 != 0:
        #     continue
        episode_length = len(trajectory)

        for step_index, step in enumerate(trajectory):
            # The done flag is unpacked but not stored in the DataFrame
            state, action, reward, next_state, _done = step

            # Mark the position of the step within its episode
            if step_index == 0:
                cp = 'start'
            elif step_index == episode_length - 1:
                cp = 'end'
            else:
                cp = 'intermediate'

            # Append one row per step
            data.append({
                'line': episode_index,
                'cp': cp,
                'algorithm': algo,
                'state': state,
                'action': action,
                'reward': reward,
                'next_state': next_state
            })

# Create the DataFrame
df = pd.DataFrame(data)

# Display the first 5000 rows of the DataFrame
df.head(5000)

|  | line | cp | algorithm | state | action | reward | next_state |
|---|---|---|---|---|---|---|---|
| 0 | 0 | start | Expected SARSA | 36 | 3 | -1 | 24 |
| 1 | 0 | intermediate | Expected SARSA | 24 | 0 | -1 | 24 |
| 2 | 0 | intermediate | Expected SARSA | 24 | 1 | -1 | 36 |
| 3 | 0 | end | Expected SARSA | 36 | 2 | -100 | 37 |
| 4 | 1 | start | Expected SARSA | 36 | 1 | -1 | 36 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 4995 | 254 | intermediate | Expected SARSA | 14 | 2 | -1 | 15 |
| 4996 | 254 | intermediate | Expected SARSA | 15 | 2 | -1 | 16 |
| 4997 | 254 | intermediate | Expected SARSA | 16 | 2 | -1 | 17 |
| 4998 | 254 | intermediate | Expected SARSA | 17 | 2 | -1 | 18 |
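
Because every row carries its algorithm and episode index (`line`), per-episode summaries can be derived with a group-by. The sketch below uses a five-row stand-in built from the first rows displayed for Expected SARSA (episode 1 is therefore truncated to a single step); the column names `episode_return` and `episode_length` are illustrative choices. Plotting the return against `line` for each algorithm would be one way to inspect convergence.

```python
import pandas as pd

# Stand-in for df: only the columns needed for the summary
demo = pd.DataFrame({
    "algorithm": ["Expected SARSA"] * 5,
    "line": [0, 0, 0, 0, 1],
    "reward": [-1, -1, -1, -100, -1],
})

# One row per (algorithm, episode) with the summed reward and the number of steps
summary = (
    demo.groupby(["algorithm", "line"])["reward"]
    .agg(episode_return="sum", episode_length="count")
    .reset_index()
)
```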


In [10]:
df.to_csv('cliff_walking.csv', index=False) 

In [None]:
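# First three columns (line, cp, algorithm) describe each step
meta_data = df.iloc[:, :3]
# Remaining columns (state, action, reward, next_state) are the features to project
proj_data = df.iloc[:, 3:]
proj_data.head()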

In [None]:
# TODO: do we need to one hot encode the data?
one_hot_df = pd.get_dummies(proj_data, columns=['state', 'action', 'reward', 'next_state'], 
                            prefix=['state', 'action', 'reward', 'next_state'])
one_hot_df.head()
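
To help with the open question above, this sketch shows what `pd.get_dummies` produces on a tiny stand-in frame (three rows taken from the displayed data, two columns only). Each distinct value becomes its own indicator column, so states are treated as unordered categories and the numeric closeness of indices such as 24 and 25 is discarded. Whether that is desirable for the projections is a judgment call; `dtype=int` is passed only to make the output type the same across pandas versions.

```python
import pandas as pd

# Stand-in for proj_data, restricted to two columns for readability
tiny = pd.DataFrame({"state": [36, 24, 24], "action": [3, 0, 1]})

encoded = pd.get_dummies(tiny, columns=["state", "action"], dtype=int)

# Every row has exactly one active indicator per encoded column
active_per_row = encoded.sum(axis=1).tolist()
column_names = list(encoded.columns)
```

On the full dataset the number of columns grows with the number of distinct states, actions, and rewards, which is worth keeping in mind before running PCA, t-SNE, or UMAP on the result.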