# Bit flipping game with DQN solver

This notebook implements a DQN solver for the bit flipping game described in [**Hindsight Experience Replay**](https://arxiv.org/abs/1707.01495).

**Reference**:

1. Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, Wojciech Zaremba. *Hindsight Experience Replay*. arXiv:1707.01495, 2017.


In [1]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
from bitflipping import bitflipping as bf
from DQN import DQN

plt.rcParams['figure.figsize'] = [15, 20]
%matplotlib inline



## Set up the bit flipping game environment

In [2]:
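n = 3                             # number of bits
init_state = np.array([0, 1, 0])  # n-bit start state (assumes bf expects length-n vectors)
goal = np.ones((n,))              # all-ones goal of the same length
bf_env = bf(init_state, goal, n)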
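
The `bitflipping` module itself is not shown in this notebook. As a rough sketch, assuming it follows the environment described in the paper (the state is a vector of `n` bits, action `i` flips bit `i`, and the reward is -1 unless the state equals the goal), an equivalent class might look like the one below. The class name `BitFlipEnv` is hypothetical. The attribute `state` and the methods `update_state` and `reward` mirror the calls made in the test cell, but their exact behaviour in `bitflipping` is an assumption.

```python
import numpy as np


class BitFlipEnv(object):
    """Hypothetical stand-in for `bitflipping.bitflipping` (an assumption)."""

    def __init__(self, init_state, goal, n):
        assert len(init_state) == n and len(goal) == n
        self.n = n
        self.state = np.array(init_state, dtype=np.int64)
        self.goal = np.array(goal, dtype=np.int64)

    def update_state(self, action):
        """Flip the bit selected by `action` (an index in [0, n))."""
        self.state[action] = 1 - self.state[action]
        return self.state

    def reward(self, state):
        """Sparse reward as in the paper: 0 at the goal, -1 elsewhere."""
        return 0 if np.array_equal(state, self.goal) else -1


env = BitFlipEnv([0, 1, 0], [1, 1, 1], n=3)
env.update_state(0)             # state becomes [1, 1, 0]
first = env.reward(env.state)   # goal not reached yet
env.update_state(2)             # state becomes [1, 1, 1]
second = env.reward(env.state)  # goal reached
```

With this reward, an episode counts as a success as soon as `reward` returns 0, which is the check the test cell relies on.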

## Build up the DQN neural network

In [3]:
tf.reset_default_graph()

x = tf.placeholder(tf.float32, shape=(None, 2*n))
y = tf.placeholder(tf.float32, shape=(None, 1))


hid = [256]
agent = DQN(x, hid, n, discount=0.98, eps=0.9, annealing=0.8, tau=0.95, replay_buffer_size=int(1e6), batch_size=128)
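
The arguments `discount=0.98` and `tau=0.95` suggest a standard DQN target combined with a slowly updated target network. The `DQN` class is not shown, so the following is only a sketch. It assumes that `discount` is the factor in the Bellman target and that `tau` is the decay coefficient of a Polyak-averaged target network. The function names `td_targets` and `soft_update` are hypothetical, and skipping the bootstrap term on terminal transitions is a common DQN convention assumed here.

```python
import numpy as np


def td_targets(rewards, next_q_values, done, discount=0.98):
    """y = r + discount * max_a' Q_target(s', a'), no bootstrap at terminal steps."""
    rewards = np.asarray(rewards, dtype=np.float64)
    done = np.asarray(done, dtype=np.float64)
    max_next = np.max(np.asarray(next_q_values, dtype=np.float64), axis=1)
    return rewards + discount * (1.0 - done) * max_next


def soft_update(target_weights, main_weights, tau=0.95):
    """Polyak averaging: target <- tau * target + (1 - tau) * main."""
    return [tau * t + (1.0 - tau) * m for t, m in zip(target_weights, main_weights)]


# Two transitions with n = 3 actions; the second one reached the goal.
y_target = td_targets(rewards=[-1.0, 0.0],
                      next_q_values=[[-2.0, -1.0, -3.0], [-1.0, -1.0, -1.0]],
                      done=[False, True])
new_target = soft_update([np.zeros(2)], [np.ones(2)])
```

Under these assumptions, a large `tau` keeps the target network close to its previous weights, so the regression targets fed to `y` change slowly between updates.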

In [None]:
losses, success_all = agent.train_Q(x, y, epoch=10, cycles=25, episode=16, T=n, iteration=40)

Epoch 0 Cycle 0 Episode 15: loss is 0.242
Epoch 0 Cycle 1 Episode 15: loss is 0.114
Epoch 0 Cycle 2 Episode 15: loss is 0.0869
Epoch 0 Cycle 3 Episode 15: loss is 0.0611
Epoch 0 Cycle 4 Episode 15: loss is 0.0513
Epoch 0 Cycle 5 Episode 15: loss is 0.0589
Epoch 0 Cycle 6 Episode 15: loss is 0.0728
Epoch 0 Cycle 7 Episode 15: loss is 0.0605
Epoch 0 Cycle 8 Episode 15: loss is 0.0928
Epoch 0 Cycle 9 Episode 15: loss is 0.111
Epoch 0 Cycle 10 Episode 15: loss is 0.0774
Epoch 0 Cycle 11 Episode 15: loss is 0.0919
Epoch 0 Cycle 12 Episode 15: loss is 0.099
Epoch 0 Cycle 13 Episode 15: loss is 0.0907
Epoch 0 Cycle 14 Episode 15: loss is 0.145
Epoch 0 Cycle 15 Episode 15: loss is 0.136
Epoch 0 Cycle 16 Episode 15: loss is 0.124
Epoch 0 Cycle 17 Episode 15: loss is 0.136
Epoch 0 Cycle 18 Episode 15: loss is 0.136
Epoch 0 Cycle 19 Episode 15: loss is 0.117
Epoch 0 Cycle 20 Episode 15: loss is 0.097
Epoch 0 Cycle 21 Episode 15: loss is 0.147
Epoch 0 Cycle 22 Episode 15: loss is 0.16
...

Epoch 7 Cycle 20 Episode 15: loss is 2.76
Epoch 7 Cycle 21 Episode 15: loss is 2.32
Epoch 7 Cycle 22 Episode 15: loss is 2.23
Epoch 7 Cycle 23 Episode 15: loss is 1.78
Epoch 7 Cycle 24 Episode 15: loss is 2.29
Epoch 8 Cycle 0 Episode 15: loss is 2.21
Epoch 8 Cycle 1 Episode 15: loss is 1.95
Epoch 8 Cycle 2 Episode 15: loss is 2.04
Epoch 8 Cycle 3 Episode 15: loss is 2.7
Epoch 8 Cycle 4 Episode 15: loss is 1.87
Epoch 8 Cycle 5 Episode 15: loss is 3.2
Epoch 8 Cycle 6 Episode 15: loss is 1.65
Epoch 8 Cycle 7 Episode 15: loss is 2.82
Epoch 8 Cycle 8 Episode 15: loss is 2.07
Epoch 8 Cycle 9 Episode 15: loss is 2.5
Epoch 8 Cycle 10 Episode 15: loss is 2.47
Epoch 8 Cycle 11 Episode 15: loss is 3.16
Epoch 8 Cycle 12 Episode 15: loss is 1.98
Epoch 8 Cycle 13 Episode 15: loss is 2.55
Epoch 8 Cycle 14 Episode 15: loss is 2.15
Epoch 8 Cycle 15 Episode 15: loss is 2.96
Epoch 8 Cycle 16 Episode 15: loss is 1.76
Epoch 8 Cycle 17 Episode 15: loss is 2.13
Epoch 8 Cycle 18 Episode 15: loss is 2.94
...
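
Whether `train_Q` stores hindsight transitions in its replay buffer is not visible from this notebook. To illustrate the idea from the referenced paper, here is a minimal sketch of the `final` replay strategy: each transition of an episode is stored a second time with the goal replaced by the state actually reached at the end of the episode, and the reward is recomputed for that goal. The tuple layout and the function name `relabel_with_final_goal` are hypothetical.

```python
import numpy as np


def relabel_with_final_goal(episode):
    """Return hindsight copies of `episode` whose goal is the last state reached.

    `episode` is a list of (state, action, next_state, goal) tuples.
    Each returned tuple is (state, action, next_state, new_goal, reward).
    """
    achieved = np.asarray(episode[-1][2])
    relabelled = []
    for state, action, next_state, _ in episode:
        # Sparse reward recomputed against the substituted goal.
        reward = 0.0 if np.array_equal(next_state, achieved) else -1.0
        relabelled.append(
            (np.asarray(state), action, np.asarray(next_state), achieved, reward))
    return relabelled


original_goal = np.array([1, 1, 1])
episode = [
    (np.array([0, 0, 0]), 0, np.array([1, 0, 0]), original_goal),
    (np.array([1, 0, 0]), 2, np.array([1, 0, 1]), original_goal),
]
hindsight = relabel_with_final_goal(episode)
hindsight_rewards = [t[4] for t in hindsight]
```

The episode never reaches the original goal, yet the relabelled copy ends with a reward of 0, which gives the Q-function a non-trivial learning signal.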

In [None]:
# Plot the training loss.
plt.figure()
plt.plot(losses)
plt.show()
# Plot the success values returned by train_Q.
plt.figure()
plt.plot(success_all)
plt.show()

## Test DQN

In [None]:
with tf.Session() as sess:
    # Restore the checkpoint stored at /tmp/model.ckpt.
    saver = tf.train.Saver()
    saver.restore(sess, '/tmp/model.ckpt')

    num_trials = 100
    success = 0
    for trial in range(num_trials):
        # Sample a start state and a different goal state.
        s_0 = agent._sample_state()
        goal = agent._sample_state()
        while np.array_equal(s_0, goal):
            goal = agent._sample_state()

        env = bf(s_0, goal, n)

        # Act greedily for at most n steps.
        for t in range(n):
            X = np.concatenate((env.state.reshape((1, -1)), goal.reshape((1, -1))), axis=1)
            Q = sess.run(agent.targetModel, feed_dict={x: X})
            action = np.argmax(Q)
            env.update_state(action)
            if env.reward(env.state) == 0:
                print('Success! Initial state:{0}\t Goal state:{1}'.format(s_0, goal))
                success += 1
                break
            elif t == n - 1:
                print('Fail! Initial state:{0}\t Goal state:{1}'.format(s_0, goal))

    print('Success rate {}%'.format(100.0 * success / num_trials))
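
The evaluation loop above can be tried without a trained checkpoint by replacing the network with any function that maps a concatenated (state, goal) row to `n` action values. The sketch below uses a hypothetical hand-written `oracle_q` that gives a higher value to bits that differ from the goal. It is not the trained network, only a stand-in that exercises the greedy rollout logic; `greedy_rollout` is also a hypothetical helper.

```python
import numpy as np


def oracle_q(x, n):
    """Stand-in for the Q-network: 0 for bits that differ from the goal, -1 otherwise."""
    state, goal = x[0, :n], x[0, n:]
    return np.where(state != goal, 0.0, -1.0).reshape(1, n)


def greedy_rollout(q_fn, s_0, goal, n):
    """Act greedily for at most n steps; return True if the goal is reached."""
    state = np.array(s_0)  # copy, so the caller's start state is not modified
    for _ in range(n):
        x = np.concatenate((state.reshape(1, -1), goal.reshape(1, -1)), axis=1)
        action = int(np.argmax(q_fn(x, n)))
        state[action] = 1 - state[action]  # flip the chosen bit
        if np.array_equal(state, goal):
            return True
    return False


rng = np.random.RandomState(0)
n_bits = 3
trials, successes = 100, 0
for _ in range(trials):
    start = rng.randint(0, 2, size=n_bits)
    target = rng.randint(0, 2, size=n_bits)
    while np.array_equal(start, target):
        target = rng.randint(0, 2, size=n_bits)
    successes += int(greedy_rollout(oracle_q, start, target, n_bits))
success_rate = 100.0 * successes / trials
```

Because the oracle always flips a mismatched bit and at most `n` bits can differ, every rollout succeeds within `n` steps, which makes it a convenient upper bound to compare the trained agent against.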

In [None]:
a = np.array([[1, 2, 3, 2, 1, 3]])

In [None]:
a.shape

In [None]:
s = np.argmax(a)

In [None]:
tf.trainable_variables()