## TradingEnv-v0

### Open AI 'Gym' for reinforcement-learning based trading algorithms

This gym implements a very simple trading environment for reinforcement learning.

The gym provides daily observations based on real market data pulled from Quandl on, by default, the SPY etf.  An episode is defined as 252 contiguous days sampled from the overall dataset.  Each day is one 'step' within the gym and for each step, the algo has a choice:

 - SHORT (0)
 - FLAT (1)
 - LONG  (2)
 
If you trade, you will be charged, by default, 10 BPS of the size of your trade.  Thus, going from short to long costs twice as much as going from short to/from flat.  Not trading also has a default cost of 1 BPS per step.  Nobody said it would be easy!
 
At the beginning of your episode, you are allocated 1 unit of cash.  This is your starting Net Asset Value (NAV). 

### Beating the trading game 

For our purposes, we'll say that beating a buy & hold strategy, on average, over one hundred episodes will notch a win to the proud ai player.  We'll illustrate exactly what that means below.

### Let's look at some code using the environment


###  imports

In [None]:
import gym
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib import interactive
interactive(True)
import gym_trading


### create the environment

This may take a moment as we are pulling historical data from quandl.

In [None]:
env = gym.make('trading-v0')
#env.time_cost_bps = 0 # 


### the trading model

Each time step is a day.  Each episode is 252 trading days - a year.  Each day, we can choose to be short (0), flat (1) or long (2) the single instrument in our trading universe.

Let's run through a day and stay flat.

In [None]:
observation = env.reset()
done = False
navs = []
while not done:
    action = 1 # stay flat
    observation, reward, done, info = env.step(action)
    navs.append(info['nav'])
    if done:
        #print 'Annualized return: ',navs[len(navs)-1]-1
        pd.DataFrame(navs).plot()


### Note that you are charged just for playing - to the tune of 1 basis point per day!


### Rendering

For now, no rendering has been implemented for this gym, but with each step, the following datum are provided which you can easily graph and otherwise visualize as we see above with the NAV:

 - pnl - how much did we make or lose between yesterday and today?
 - costs  - how much did we pay in costs today
 - nav    - our current nav
 

## utility methods: running strategies once or repeatedly

Although the gym can be 'exercised' directly as seen above, we've also written utility methods which allow for the running of a strategy once or over many episodes, facilitating training or other sorts of analysis.

To utilize these methods, strategies should be exposed as a function or lambda with the following signature:

`Action a = strategy( observation, environment )`
    
Below, we define some simple strategies and look briefly at their behavior to better understand the trading gym. 

In [None]:
import trading_env as te

stayflat     = lambda o,e: 1   # stand pat
buyandhold   = lambda o,e: 2   # buy on day #1 and hold
randomtrader = lambda o,e: e.action_space.sample() # retail trader

# to run singly, we call run_strat.  we are returned a dataframe containing 
#  all steps in the sim.
env = env.unwrapped
bhdf = env.run_strat(buyandhold)

print(bhdf.head())

# we can easily plot our nav in time:
bhdf.bod_nav.plot(title='buy & hold nav')


### running the same strategy multiple times will likely yield different results as underlying data changes

In [None]:
env.run_strat(buyandhold).bod_nav.plot(title='same strat, different results')
env.run_strat(buyandhold).bod_nav.plot()
env.run_strat(buyandhold).bod_nav.plot()

### comparing the buyandhold and random traders

In [None]:
# running a strategy multiple times should yield insights 
#   into its expected behavior or give it oppty to learn
bhdf = env.run_strats(buyandhold,100)
rdf = env.run_strats(randomtrader,100)

comparo = pd.DataFrame({'buyhold':bhdf.mean(),
                        'random': rdf.mean()})
comparo

## Object of the game

From the above examples, we can see that buying and holding will, over the long run, give you the market return with low costs.

Randomly trading will instead destroy value rather quickly as costs overwhelm.

### So, what does it mean to win the trading game?  

For our purposes, we'll say that beating a buy & hold strategy, on average, over one hundred episodes will notch a win to the proud ai player.

To support this, the trading environment maintains the *mkt_return* which can be compared with the *sim_return*.

Note that the *mkt_return* is frictionless while the *sim_return* incurs both trading costs and the decay cost of 1 basis point per day, so overcoming the hurdle we've set here should be challenging.


### Playing the game: purloined policy gradients

I've taken and adapted (see [code](policy_gradient.py) for details) a policy gradient implementation based on tensorflow to try to play the single-instrument trading game.  Let's see how it does.

In [1]:
import gym
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib import interactive
interactive(True)
import gym_trading

env = gym.make('trading-v0')
#env.time_cost_bps = 0 # 

env = env.unwrapped

[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
[33mWARN: Environment '<class 'gym_trading.envs.trading_env.TradingEnv'>' has deprecated methods. Compatibility code invoked.[0m


In [2]:
import tensorflow as tf
import policy_gradient

  return f(*args, **kwds)
INFO:policy_gradient:policy_gradient logger started.


In [3]:
# create the tf session
sess = tf.InteractiveSession()

# create policygradient
pg = policy_gradient.PolicyGradient(sess, obs_dim=6, num_actions=2, learning_rate=1e-2 )

# and now let's train it and evaluate its progress.  NB: this could take some time...
df,sf = pg.train_model(env, episodes=25001, log_freq=100)#, load_model=True)


0.00147891588413
0.00434732536291
0.00532422275251
0.00561084147205
0.00807496861787
0.00833446084831
0.0107575165779
0.0107282836155
0.0101432803609
0.0119689177002
0.0129997181651
0.012438926793
0.013642002062
0.0175077457403
0.0217396461308
0.0251575077147
0.0273777526149
0.0274227496605
0.0300536844112
0.0330552366783
0.0331669299121
0.0358202518217
0.0378596263648
0.0403238329384
0.0398505889726
0.0404763195713
0.0389654109804
0.0386707555371
0.0380667967114
0.0391485265282
0.0385399576068
0.0383864931882
0.0408227953547
0.0409888148714
0.0392066090256
0.0389656259179
0.0388763016649
0.038602746022
0.0405797948036
0.0404405277266
0.0393511577078
0.0381543476971
0.0381966152381
0.0370060271175
0.0365335670564
0.0399311768303
0.0423210533506
0.0406189504251
0.0381192234501
0.0387936682922
0.0411744717572
0.0410275138069
0.0394431516334
0.0434556979649
0.0439727129609
0.0434661456111
0.043775907647
0.0423136181926
0.0435946009151
0.0427382094628
0.0432246104443
0.0428543942084
0.0418

KeyboardInterrupt: 

### Results

Policy gradients beat the trading game!  That said, it doesn't work every time and it seems, looking at the charts below, as though it's a bit of a lucky thing.  But luck counts in the trading game as in life!


In [None]:
sf['net'] = sf.simror - sf.mktror
#sf.net.plot()
sf.net.expanding().mean().plot()
sf.net.rolling(100).mean().plot()

In [None]:
sf.net.rolling(100).mean().tail()