# Readme
To make tabular Q-learning feasible, I had to simplify the state
representation. The original state has the form:

`[server_type, response_time_on_server_0, response_time_on_server_1, ..., response_time_on_server_10]`,
    
for which the Q-values cannot be stored in a table, because the continuous response times yield infinitely many possible states.

The change I made is to sort the response times across all servers and use the server indices in
ascending order of response time (the argsort) as the new state, so the new state looks like this:

`[server_type, argsort(original_response_times_of_the_servers)]`
    
For example, if the original state is:

`[1, 22.2324, 5.231, 0.1645, 3.21]`
    
then the new state will be:

`[1, 2, 3, 1, 0]`,
    
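where the first element of each state is the server type (1 for CPU and 0 for I/O), and the remaining elements of the new state are the server indices ordered from shortest to longest response time.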
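
The mapping above can be sketched in a few lines of plain Python. This is only an illustration, not the code used in `q_learning`: the helper names `argsort`, `discretize`, and `num_discrete_states` are hypothetical, and the state count assumes 11 servers (indices 0 through 10), two server types, and no ties between response times.

```python
from math import factorial


def argsort(values):
    """Return the indices that would sort `values` in ascending order."""
    return sorted(range(len(values)), key=values.__getitem__)


def discretize(state):
    """Map [server_type, t_0, ..., t_n] to (server_type, *argsort(t))."""
    server_type, response_times = state[0], list(state[1:])
    # A tuple is hashable, so it can be used directly as a Q-table key.
    return (int(server_type), *argsort(response_times))


def num_discrete_states(n_servers, n_server_types=2):
    """Count discretized states: one per server type and per ordering."""
    return n_server_types * factorial(n_servers)


print(discretize([1, 22.2324, 5.231, 0.1645, 3.21]))  # (1, 2, 3, 1, 0)
print(num_discrete_states(11))  # 79833600
```

Under these assumptions the state space becomes finite, but it is still large: roughly 80 million states for 11 servers, so many of them may be visited rarely or never during training.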

In [2]:
%load_ext autoreload
%autoreload 2
import sys
sys.path.append('..')
from env_test import env
from env_test import policies as baseline_policy
from q_learning import trainer as QTrainer
from q_learning import QLearner

The initialization template is either None or invalid, the servers and jobs will be generated randomly


In [3]:
def policy_reward(env, policy):
    """Run one full episode with `policy` and return the cumulative reward."""
    env.reset()
    while not env.is_terminal():
        env.step(policy())
    return env.cum_reward()
print("Reward by baseline policies:\n"
      "random:      {:3f}\n"
      "earlist:     {:3f}\n"
      "round_robin: {:3f}\n"
      "sensible:    {:3f}\n"
      "bestfit:     {:3f}".format(policy_reward(env, baseline_policy.random_policy),
                           policy_reward(env, baseline_policy.earlist_policy),
                           policy_reward(env, baseline_policy.round_robin_policy),
                           policy_reward(env, baseline_policy.sensible_policy),
                           policy_reward(env, baseline_policy.bestfit_policy)))

Reward by baseline policies:
random:      -2391.301937
earlist:     -2370.109287
round_robin: -2396.273717
sensible:    -2392.052417
bestfit:     -2275.335028


In [4]:
q_learner = QLearner(env, 0.05, 0.99)
q_trainer = QTrainer(q_learner, 50)
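
For reference, a generic tabular Q-learning update over such discretized states might look like the sketch below. It is not the actual `QLearner` implementation: the class `TabularQ` and its parameters are hypothetical, and treating 0.05 as the learning rate and 0.99 as the discount factor is an assumption, since the constructor arguments above are not documented here.

```python
from collections import defaultdict


class TabularQ:
    """Hypothetical Q-table keyed by hashable (discretized) states."""

    def __init__(self, n_actions, alpha, gamma):
        self.n_actions = n_actions
        self.alpha = alpha  # learning rate (assumed role)
        self.gamma = gamma  # discount factor (assumed role)
        # States never seen before start with a row of zeros.
        self.table = defaultdict(lambda: [0.0] * n_actions)

    def greedy_action(self, state):
        """Return the action with the highest Q-value (first one on ties)."""
        row = self.table[state]
        return max(range(self.n_actions), key=row.__getitem__)

    def update(self, state, action, reward, next_state, terminal):
        """Apply one Q-learning update and return the TD error."""
        target = reward
        if not terminal:
            # Bootstrap from the best action value in the next state.
            target += self.gamma * max(self.table[next_state])
        td_error = target - self.table[state][action]
        self.table[state][action] += self.alpha * td_error
        return td_error


q = TabularQ(n_actions=4, alpha=0.05, gamma=0.99)
q.update((1, 2, 3, 1, 0), action=2, reward=-1.0,
         next_state=(0, 0, 1, 2, 3), terminal=False)
print(q.table[(1, 2, 3, 1, 0)][2])  # -0.05
```

Because the table is a dictionary, only states that are actually visited consume memory, which matters when the number of possible states is large.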

In [5]:
q_trainer.train(episodes=1000000, report_interval=100) 

Episode 100, Average_updates: 0.238341, iterations: 49900,reward: -2379.088556
Episode 200, Average_updates: 0.239164, iterations: 99800,reward: -2389.359307
Episode 300, Average_updates: 0.239757, iterations: 149700,reward: -2393.931562
Episode 400, Average_updates: 0.238656, iterations: 199600,reward: -2385.240810
Episode 500, Average_updates: 0.238345, iterations: 249500,reward: -2381.485139
Episode 600, Average_updates: 0.237800, iterations: 299400,reward: -2373.944281
Episode 700, Average_updates: 0.237621, iterations: 349300,reward: -2373.738106
Episode 800, Average_updates: 0.239764, iterations: 399200,reward: -2395.778802
Episode 900, Average_updates: 0.238902, iterations: 449100,reward: -2386.371301
Episode 1000, Average_updates: 0.238775, iterations: 499000,reward: -2385.719060
Episode 1100, Average_updates: 0.238723, iterations: 548900,reward: -2386.772713
Episode 1200, Average_updates: 0.240000, iterations: 598800,reward: -2392.445082
Episode 1300, Average_updates: 0.237229

KeyboardInterrupt: 