# Reinforcement Learning
### For the second part of this project, we will implement a simple Q-Learning algorithm on an RL environment called Cart Pole. The idea of Q-Learning is to estimate the expected future reward, or Q-value, of taking a certain action in a given state. Then at each step we take the action with the highest expected future reward.

### In reinforcement learning, we refer to algorithms that attempt to solve environments as "agents", so in this part of the project we will be making a Deep Q Network Agent that will solve the Cart Pole environment.
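
As a rough illustration of the Q-value idea described above, the sketch below applies the standard tabular Q-learning update to a tiny made-up table. The table size, `alpha`, `gamma`, the helper name `q_learning_update`, and the sample transition are all assumptions chosen for illustration; the project itself replaces the table with a neural network.

```python
import numpy as np

def q_learning_update(q_table, state, action, reward, next_state, done,
                      alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning update in place and return the new Q-value."""
    # Target: the reward plus the discounted best Q-value of the next state
    # (no bootstrap term if the episode ended on this step).
    target = reward
    if not done:
        target += gamma * np.max(q_table[next_state])
    # Move the current estimate a fraction alpha toward the target.
    q_table[state, action] += alpha * (target - q_table[state, action])
    return q_table[state, action]

# Hypothetical toy problem: 3 states, 2 actions, Q-values start at zero.
q_table = np.zeros((3, 2))
q_table[1] = [0.5, 2.0]  # pretend we already have estimates for state 1

new_q = q_learning_update(q_table, state=0, action=1, reward=1.0,
                          next_state=1, done=False)
greedy_action = int(np.argmax(q_table[0]))  # act greedily in state 0
```

After the update, acting greedily in state 0 means picking the column with the largest entry in `q_table[0]`, which is the same `argmax` step the DQN agent performs on its network output.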

In [7]:
!pip install gym tqdm

Collecting gym
  Using cached https://files.pythonhosted.org/packages/9b/50/ed4a03d2be47ffd043be2ee514f329ce45d98a30fe2d1b9c61dea5a9d861/gym-0.10.5.tar.gz
Collecting tqdm
  Using cached https://files.pythonhosted.org/packages/78/bc/de067ab2d700b91717dc5459d86a1877e2df31abfb90ab01a5a5a5ce30b4/tqdm-4.23.0-py2.py3-none-any.whl
Collecting pyglet>=1.2.0 (from gym)
  Using cached https://files.pythonhosted.org/packages/1c/fc/dad5eaaab68f0c21e2f906a94ddb98175662cc5a654eee404d59554ce0fa/pyglet-1.3.2-py2.py3-none-any.whl
Building wheels for collected packages: gym
  Running setup.py bdist_wheel for gym ... done
  Stored in directory: /Users/Radhika/Library/Caches/pip/wheels/cb/14/71/f4ab006b1e6ff75c2b54985c2f98d0644fffe9c1dddc670925
Successfully built gym
Installing collected packages: pyglet, gym, tqdm
Successfully installed gym-0.10.5 pyglet-1.3.2 tqdm-4.23.0

# Part 1: Set Up the Environment

In [8]:
import gym
env = gym.make('CartPole-v0')

WARN: gym.spaces.Box autodetected dtype as <type 'numpy.float32'>. Please provide explicit dtype.


# Part 2: Create The DQN Agent

In [26]:
import keras 
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Activation
from collections import deque
import random
from keras.optimizers import Adam

import numpy as np


class DQNAgent:
    
    def __init__(self, env, replay_size=1000, epsilon=1.0, epsilon_min=0.01, epsilon_decay=0.995, gamma=0.99):
        self.state_size = env.observation_space.shape[0]
        self.num_actions = env.action_space.n
        self.model = self.build_model()
        self.replay_buffer = deque(maxlen=replay_size)
        self.epsilon = epsilon
        self.epsilon_min = epsilon_min
        self.epsilon_decay = epsilon_decay
        self.gamma = gamma

        
    def build_model(self):
        model = Sequential()
        # TODO: add 2 dense layers each with 32 neurons, the input dim to the first
        # layer should be the state size, also add relu activations, for both these layers
        # Then add another Dense layer with num_actions neurons.
        # Then use model.compile to compile the model with mse loss and an Adam optimizer
        # with learning rate 0.001.
        model.add(Dense(32, input_dim = self.state_size))
        model.add(Activation("relu"))
        model.add(Dense(32))
        model.add(Activation("relu"))
        model.add(Dense(self.num_actions))
        # Pass the optimizer instance itself so the learning rate set here is the one used.
        model.compile(optimizer=Adam(lr=0.001), loss="mse")

        
        return model
        
    def action(self, state):
        # Whenever a random number between 0 and 1 is less than epsilon we want to return
        # a random action. This means that with probability epsilon we return a random action.
        if np.random.random() <= self.epsilon:
            #TODO: return random action here
            return np.random.randint(self.num_actions)
        # Now we want to use our model to get the q values
        # HINT: we want to do prediction
        
        q_values = self.model.predict(state)
        return np.argmax(q_values[0])
    
    def add_to_replay_buffer(self, state, action, reward, next_state, done):
        self.replay_buffer.append((state, action, reward, next_state, done))

    def train_batch_from_replay(self, batch_size):
        # if we don't have enough samples in our replay buffer just return
        if len(self.replay_buffer) < batch_size:
            return False
        # TODO: randomly sample batch_size samples from the replay buffer
        # hint: use random.sample
        minibatch = random.sample(self.replay_buffer, batch_size)
        for state, action, reward, next_state, done in minibatch:
            target = reward
            if not done:
                next_Qs = self.model.predict(next_state)[0]
                # TODO: we want to add to our target GAMMA * max Q(next_state)
                target += self.gamma * np.max(next_Qs)

            # our target should only take into account the current action
            # so we set all the Q values except the current action, to the 
            # current output of our model so that they get ignored in the loss function.
            target_Qs = self.model.predict(state)
            target_Qs[0][action] = target
            self.model.fit(state, target_Qs, epochs=1, verbose=0)
        
        # Now we want to slowly decay how many random actions we take
        # to do this we can multiply epsilon by our epsilon decay parameter
        # each iteration
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay
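
The training method above builds its targets one sample at a time. As a sketch of the same target rule applied to a whole batch at once, the NumPy function below may help clarify what ends up in `target_Qs`. It is a vectorized alternative, not the code the class uses; the function name `bellman_targets` and the small arrays standing in for `model.predict` output are assumptions made for illustration.

```python
import numpy as np

def bellman_targets(current_qs, next_qs, actions, rewards, dones, gamma=0.99):
    """Return a copy of current_qs where each taken action's entry is replaced
    by reward + gamma * max Q(next_state), or just reward if the episode ended."""
    targets = current_qs.copy()
    not_done = 1.0 - dones.astype(np.float64)
    bootstrapped = rewards + gamma * not_done * next_qs.max(axis=1)
    # Only the entry for the action actually taken changes; the other entries
    # keep the current predictions, so they contribute zero error to the loss.
    targets[np.arange(len(actions)), actions] = bootstrapped
    return targets

# Made-up batch of two transitions with two actions each.
current_qs = np.array([[1.0, 2.0], [3.0, 4.0]])  # stand-in for predictions on states
next_qs = np.array([[0.5, 1.5], [9.0, 9.0]])     # stand-in for predictions on next states
actions = np.array([0, 1])
rewards = np.array([1.0, 1.0])
dones = np.array([False, True])

targets = bellman_targets(current_qs, next_qs, actions, rewards, dones, gamma=0.9)
```

The second transition is terminal, so its target is the reward alone even though the next-state predictions are large.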


# Part 3: Train the Model

In [27]:
agent = DQNAgent(env)
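
With the default arguments, `train_batch_from_replay` multiplies epsilon by `epsilon_decay` once per call, as long as the replay buffer holds at least `batch_size` samples. The sketch below counts how many such multiplications it takes for epsilon to drop below `epsilon_min`. The helper `decay_steps` is hypothetical and only mirrors the decay rule in the class.

```python
import math

def decay_steps(epsilon=1.0, epsilon_min=0.01, epsilon_decay=0.995):
    """Count the decay multiplications applied before epsilon stops shrinking."""
    steps = 0
    while epsilon > epsilon_min:
        epsilon *= epsilon_decay
        steps += 1
    return steps

steps = decay_steps()
# Closed-form estimate of the same quantity, for comparison.
estimate = math.log(0.01 / 1.0) / math.log(0.995)
```

With these defaults the count comes out on the order of nine hundred training calls. Because training runs once per environment step, exploration fades within roughly the first nine hundred steps that train, a small fraction of the 800-episode run.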

In [None]:
from tqdm import tqdm

done = False
batch_size = 32
num_episodes = 800

for episode in tqdm(range(num_episodes)):
    state = env.reset()
    state = np.reshape(state, [1, agent.state_size])
    
    for t in range(200):
        action = agent.action(state)
        next_state, reward, done, _ = env.step(action)
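        # Penalize the step that ends the episode (usually the pole falling) so the
        # agent learns to avoid it; a large positive reward here would encourage
        # ending episodes early. The value -10 is an untuned, conventional choice.
        reward = reward if not done else -10
        next_state = np.reshape(next_state, [1, agent.state_size])
        # TODO: add this sample to the replay buffer
        agent.add_to_replay_buffer(state, action, reward, next_state, done)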
        
        state = next_state
        
        # TODO: train on a batch from the replay buffer
        agent.train_batch_from_replay(batch_size)
        if done: 
            break



# Part 4: Test the Model

In [None]:
#TODO: set the agent's epsilon so that we don't take any random actions.
agent.epsilon = -1
for _ in range(10):
    state = env.reset()
    state = np.reshape(state, [1, agent.state_size])
    total_reward = 0
    for t in range(200):
        action = agent.action(state)
        next_state, reward, done, _ = env.step(action)
        total_reward += reward
        state = np.reshape(next_state, [1, agent.state_size])
        # TODO: if you want to see the rendered version of your agent running
        # uncomment this line
        #env.render()
        if done:
            # Stop once the episode ends instead of stepping a finished environment.
            break
    print(total_reward)

# Part 5: Writeup

#### Now for the writeup portion, write a paragraph on your understanding of how Deep Q Learning works.

Q-learning uses a simple update rule to perform Q-value iteration from sampled experience, which lets us avoid explicitly keeping track of state values, transition functions, and reward functions. Deep Q-Learning approximates the Q-value function with a neural network that takes a state as input and outputs one Q-value per action. To act, we pick the output neuron from our network with the highest value and take the action corresponding to that neuron.
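
For reference, the target and loss that the training code in Part 2 appears to implement can be written as follows. Here $\theta$ denotes the network weights, $(s, a, r, s')$ is one transition sampled from the replay buffer, and $\gamma$ is the discount factor; the notation is an assumption introduced for this summary, not something defined in the assignment.

```latex
y =
\begin{cases}
  r & \text{if } s' \text{ is terminal} \\
  r + \gamma \max_{a'} Q(s', a'; \theta) & \text{otherwise}
\end{cases}
\qquad
L(\theta) = \bigl( y - Q(s, a; \theta) \bigr)^2
```

Only the output for the action $a$ that was taken contributes to the loss, because the targets for the other actions are set equal to the network's current predictions.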