# Bit flipping game with DQN solver

This notebook implements a DQN solver for the bit flipping game introduced in [**Hindsight Experience Replay**](https://arxiv.org/abs/1707.01495).

**Reference**:

1. Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, Wojciech Zaremba. Hindsight Experience Replay. arXiv:1707.01495, 2017.


In [1]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
from bitflipping import bitflipping as bf
from DQN import DQN

plt.rcParams['figure.figsize'] = [15, 20]
%matplotlib inline

## Set up the bit flipping game environment

In [2]:
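# Note: init_state and goal are not used by the cells below,
# and they have length 2 while the game is created with n = 4 bits.
init_state = np.array([0, 1])
goal = np.ones((2,))

n = 4           # number of bits in the game
bf_env = bf(n)  # bit flipping environment with n bits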
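
The `bitflipping` module itself is not shown in this notebook. As a rough illustration of the game described in the HER paper (a state of n bits, one action per bit that flips it, and a reward of -1 until the state matches the goal and 0 once it does), a minimal sketch might look like the following. The class name `BitFlipSketch` is hypothetical; its method names mirror the calls used later in this notebook (`reset`, `update_state`, `reward`), but the real module may behave differently.

```python
import random


class BitFlipSketch:
    """Hypothetical stand-in for the bit flipping environment (not the real `bitflipping` module)."""

    def __init__(self, n, seed=None):
        self.n = n
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        # Draw a random initial state and a random goal, each a list of n bits.
        self.state = [self.rng.randint(0, 1) for _ in range(self.n)]
        self.goal = [self.rng.randint(0, 1) for _ in range(self.n)]

    def update_state(self, action):
        # Action i flips bit i of the current state.
        self.state[action] = 1 - self.state[action]

    def reward(self, state):
        # Sparse reward: 0 when the goal is reached, -1 otherwise.
        return 0 if state == self.goal else -1


env = BitFlipSketch(4, seed=0)
# Flip every bit that differs from the goal; at most n moves are ever needed.
for i in range(env.n):
    if env.state[i] != env.goal[i]:
        env.update_state(i)
print(env.reward(env.state))  # prints 0
```

Because every differing bit can be fixed with one move, an optimal policy always solves the game within n steps, which is why the test loop further down gives the agent exactly n moves.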

## Build up the DQN neural network

In [3]:
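tf.reset_default_graph()

# Network input: the current state concatenated with the goal (2*n bits).
x = tf.placeholder(tf.float32, shape=(None, 2*n))
# Placeholder for the training targets, one value per sample.
y = tf.placeholder(tf.float32, shape=(None, 1))

hid = [256]  # hidden layer sizes
agent = DQN(x, hid, n, discount=0.98, eps=0.5, tau=0.95, replay_buffer_size=1e5, batch_size=128)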
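
The `DQN` class comes from a module that is not shown here, so how it uses `discount` and `tau` is not confirmed. Assuming the usual conventions (a one-step TD target that bootstraps from the target network, and a target network updated as an exponential moving average with decay `tau`), the two quantities can be sketched in plain Python. The function names `td_target` and `soft_update` are hypothetical, and treating the end of an episode as terminal is also an assumption.

```python
def td_target(reward, next_q_values, done, discount=0.98):
    """One-step Q-learning target: r + discount * max_a' Q_target(s', a')."""
    if done:
        # Assumption: no bootstrapping once the episode has ended.
        return reward
    return reward + discount * max(next_q_values)


def soft_update(target_weights, online_weights, tau=0.95):
    """Assumed update: target <- tau * target + (1 - tau) * online, elementwise."""
    return [tau * t + (1.0 - tau) * w
            for t, w in zip(target_weights, online_weights)]


# Target for a non-terminal step with reward -1 and three next-state Q-values.
print(td_target(-1, [-3.0, -2.0, -4.0], done=False))
# Move the target weights 5% of the way toward the online weights.
print(soft_update([0.0, 1.0], [1.0, 1.0]))
```

With `tau` close to 1 the target network changes slowly, which keeps the bootstrapped targets stable between updates.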

In [None]:
losses, success_all = agent.train_Q(x, y, epoch=5, cycles=50, episode=16, iteration=40)

Epoch 0 Cycle 0: loss is 0.0314
Epoch 0 Cycle 1: loss is 0.0159
Epoch 0 Cycle 2: loss is 0.0179
Epoch 0 Cycle 3: loss is 0.0266
Epoch 0 Cycle 4: loss is 0.0204
Epoch 0 Cycle 5: loss is 0.0319
Epoch 0 Cycle 6: loss is 0.0181
Epoch 0 Cycle 7: loss is 0.0303
Epoch 0 Cycle 8: loss is 0.0157
Epoch 0 Cycle 9: loss is 0.011
Epoch 0 Cycle 10: loss is 0.03
Epoch 0 Cycle 11: loss is 0.0308
Epoch 0 Cycle 12: loss is 0.017
Epoch 0 Cycle 13: loss is 0.0333
Epoch 0 Cycle 14: loss is 0.00907
Epoch 0 Cycle 15: loss is 0.036
Epoch 0 Cycle 16: loss is 0.0545
Epoch 0 Cycle 17: loss is 0.0512
Epoch 0 Cycle 18: loss is 0.0397
Epoch 0 Cycle 19: loss is 0.035
Epoch 0 Cycle 20: loss is 0.037
Epoch 0 Cycle 21: loss is 0.00845
Epoch 0 Cycle 22: loss is 0.0458
Epoch 0 Cycle 23: loss is 0.0397
Epoch 0 Cycle 24: loss is 0.0412
Epoch 0 Cycle 25: loss is 0.0425
Epoch 0 Cycle 26: loss is 0.0608
Epoch 0 Cycle 27: loss is 0.0452
Epoch 0 Cycle 28: loss is 0.0555
Epoch 0 Cycle 29: loss is 0.0485
Epoch 0 Cycle 30: loss is
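
The central idea of the HER paper is to replay each episode with the goal replaced by a state that was actually reached, so that even failed episodes yield transitions with a non-negative reward. Whether `train_Q` does this is not visible from this notebook. A minimal sketch of the "final" relabeling strategy, using the hypothetical name `relabel_final` and plain tuples for states, could look like this:

```python
def relabel_final(episode):
    """Relabel an episode with the 'final' hindsight strategy.

    `episode` is a list of (state, action, next_state) tuples, with states
    given as tuples of bits. Returns (state, action, reward, next_state, goal)
    transitions whose goal is the last state actually reached.
    """
    achieved_goal = episode[-1][2]
    relabeled = []
    for state, action, next_state in episode:
        # Sparse reward of the bit flipping game: 0 at the goal, -1 elsewhere.
        reward = 0 if next_state == achieved_goal else -1
        relabeled.append((state, action, reward, next_state, achieved_goal))
    return relabeled


# A two-step episode on 3 bits that never reached its original goal.
episode = [((0, 0, 0), 0, (1, 0, 0)), ((1, 0, 0), 2, (1, 0, 1))]
for transition in relabel_final(episode):
    print(transition)
```

The relabeled transitions would be stored in the replay buffer alongside the original ones, so the last step of every episode carries a reward of 0.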

In [None]:
plt.figure()
plt.plot(losses)
plt.show()

## Test DQN

In [None]:
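with tf.Session() as sess:
    # Restore the saved checkpoint into the current graph.
    saver = tf.train.Saver()
    saver.restore(sess, '/tmp/model.ckpt')

    success = 0
    for trial in range(100):
        bf_env.reset()

        # The agent gets at most n moves (one per bit) to reach the goal.
        # The inner loop uses its own variable so it does not shadow `trial`.
        for step in range(n):
            # Network input: current state concatenated with the goal.
            X = np.concatenate((bf_env.state.reshape((1, -1)), bf_env.goal.reshape((1, -1))), axis=1)
            Q = sess.run(agent.targetModel, feed_dict={x: X})
            action = np.argmax(Q)  # act greedily, without exploration
            bf_env.update_state(action)
            if bf_env.reward(bf_env.state) == 0:
                print('Success! state:{0}\t Goal state:{1}'.format(bf_env.state, bf_env.goal))
                success += 1
                break
            elif step == n - 1:
                print('Fail! state:{0}\t Goal state:{1}'.format(bf_env.state, bf_env.goal))

    # With exactly 100 trials, the success count is already a percentage.
    print('Success rate {}%'.format(success))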

In [None]:
# Scratch check: how np.argmax behaves on a 2-D row vector with a repeated maximum.
a = np.array([[1, 2, 3, 2, 1, 3]])

In [None]:
a.shape

In [None]:
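# Without an axis argument, np.argmax indexes the flattened array and
# returns the first occurrence of the maximum (here 2).
s = np.argmax(a)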