Binary file added CEDL_3_pic/p3_1.JPG
Binary file added CEDL_3_pic/p4_1.JPG
Binary file added CEDL_3_pic/p6_1.JPG
513 changes: 309 additions & 204 deletions Lab3-policy-gradient.ipynb

Large diffs are not rendered by default.

3 changes: 3 additions & 0 deletions policy_gradient/policy.py
@@ -30,6 +30,8 @@ def __init__(self, in_dim, out_dim, hidden_dim, optimizer, session):
Sample solution is about 2~4 lines.
"""
# YOUR CODE HERE >>>>>>
hidden_layer = tf.contrib.layers.fully_connected(inputs=self._observations, num_outputs=hidden_dim, activation_fn=tf.tanh)
probs = tf.contrib.layers.fully_connected(inputs=hidden_layer, num_outputs=out_dim, activation_fn=tf.nn.softmax)
# <<<<<<<<

# --------------------------------------------------
@@ -72,6 +74,7 @@ def __init__(self, in_dim, out_dim, hidden_dim, optimizer, session):
Sample solution is about 1~3 lines.
"""
# YOUR CODE HERE >>>>>>
surr_loss = -tf.reduce_mean(tf.multiply(log_prob, self._advantages))
# <<<<<<<<

grads_and_vars = self._opt.compute_gradients(surr_loss)
3 changes: 3 additions & 0 deletions policy_gradient/util.py
@@ -32,6 +32,9 @@ def discount_bootstrap(x, discount_rate, b):
Sample code should be about 3 lines
"""
# YOUR CODE >>>>>>>>>>>>>>>>>>>
# Bootstrapped one-step return: r_t + discount_rate * b_{t+1},
# where the baseline is shifted left and padded with 0 for the terminal state.
return np.add(x, np.multiply(np.append(b[1:], 0), discount_rate))
# <<<<<<<<<<<<<<<<<<<<<<<<<<<<

def plot_curve(data, key, filename=None):
29 changes: 28 additions & 1 deletion report.md
@@ -1,3 +1,30 @@
# Homework3-Policy-Gradient report
105062575 何元通

TA: try to elaborate the algorithms that you implemented and any details worth mentioning.
## Introduction

&nbsp; &nbsp; &nbsp; &nbsp; In this assignment, we implement policy gradient to solve the CartPole problem. We build a neural network whose parameters are learned directly, so the agent can select better actions without consulting a value function.

## Implementation

&nbsp; &nbsp; &nbsp; &nbsp; The skeleton provides all of the functions but leaves out a few critical lines of code, so the implementation simply consists of filling in the correct code at the marked positions.

&nbsp; &nbsp; &nbsp; &nbsp; For problem 1, we have to build a two-layer fully-connected neural network. I directly use `tf.contrib.layers.fully_connected` from the TensorFlow package, which builds a single fully-connected layer, so I call it twice and specify each layer's input, output dimension, and activation function. The network then outputs a probability distribution over actions for action selection.

&nbsp; &nbsp; &nbsp; &nbsp; For problem 2, we finish the surrogate loss function of the network. I follow the loss function given in the .ipynb file: the accumulated discounted rewards are provided and the log action probabilities have already been computed, so I multiply the two vectors element-wise and take the mean of the result. Note that since we want to maximize this objective while the optimizer minimizes, it is multiplied by negative one.
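
Concretely, the surrogate loss filled in above corresponds to the standard vanilla policy gradient objective (the notation here is mine; the notebook's may differ slightly):

$$
L(\theta) \;=\; -\,\frac{1}{T}\sum_{t=0}^{T-1} \log \pi_\theta(a_t \mid s_t)\, A_t ,
$$

where $A_t$ is the accumulated discounted reward (later, the advantage) at step $t$, and the leading minus sign turns maximization into a minimization problem for the optimizer.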

&nbsp; &nbsp; &nbsp; &nbsp; For problem 3, I simply follow the given formulas, which only change the loss by subtracting the baseline from the accumulated rewards. Problem 4 is almost identical: I copy the code from problem 3 but do not subtract the baseline, which makes the results of the two runs convenient to compare.
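
As a rough sketch of the difference (the function and argument names here are illustrative, not the notebook's exact API), the only change between problems 3 and 4 is whether the baseline is subtracted before the returns are fed to the loss:

```python
import numpy as np

def compute_advantages(discounted_returns, baseline_values, use_baseline=True):
    """Problem 3 subtracts the predicted state values (baseline) from the
    accumulated discounted returns; problem 4 uses the raw returns directly."""
    returns = np.asarray(discounted_returns, dtype=np.float64)
    if not use_baseline:
        return returns
    return returns - np.asarray(baseline_values, dtype=np.float64)
```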

&nbsp; &nbsp; &nbsp; &nbsp; For problem 5, we are asked to complete the `discount_bootstrap` function in util.py. Following the hints and formulas in the .py and .ipynb files, I originally just multiplied the baseline by the discount rate and added it element-wise to the immediate rewards. However, the baseline must first be shifted: we drop its first element and append a zero, so that each reward is paired with the value of the next state while the array keeps its original size. In problem 6, I then use the discount function to compute the generalized advantage estimation with the given parameters.
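
A minimal NumPy sketch of how problems 5 and 6 fit together, assuming `rewards` and `values` are per-timestep arrays for one episode (the helper names are mine, not the lab's exact API):

```python
import numpy as np

def discount(x, rate):
    """Discounted cumulative sum: y[t] = sum over l >= 0 of rate**l * x[t+l]."""
    y = np.zeros(len(x), dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(x))):
        running = x[t] + rate * running
        y[t] = running
    return y

def bootstrapped_returns(rewards, values, gamma):
    """Problem 5: one-step bootstrap r_t + gamma * V(s_{t+1}),
    with the value after the last step taken as 0."""
    next_values = np.append(values[1:], 0.0)
    return rewards + gamma * next_values

def gae(rewards, values, gamma, lam):
    """Problem 6: generalized advantage estimation from per-step TD errors."""
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    deltas = bootstrapped_returns(rewards, values, gamma) - values
    return discount(deltas, gamma * lam)
```

With `lam = 1` this reduces to the discounted return minus the baseline (as in problem 3), and with `lam = 0` to the one-step TD error.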

## Discussion

&nbsp; &nbsp; &nbsp; &nbsp; The main difference between problem 3 and problem 4 is that the former subtracts a baseline from the rewards while the latter does not. According to the results of the two parts, the version with a baseline converges faster than the one without (as shown in the following two plots; the first is from problem 3 and the second from problem 4). In the code, the baseline is a value function, and subtracting it does not change the expectation of the gradient, so it can be added safely to obtain faster convergence. As for why it converges faster: maximizing the objective is the same as minimizing it with negated rewards, and the value function represents the predicted reward over the next few states, so subtracting it means the objective is informed by an estimate of future states; in other words, the baseline reduces the variance of the gradient estimate without biasing it. Compared with problem 4, which uses no baseline, this makes the policy updates more precise and lets training converge faster.

<img src="CEDL_3_pic/p3_1.JPG" width="400" height="200" />&nbsp;<img src="CEDL_3_pic/p4_1.JPG" width="400" height="200" />

&nbsp; &nbsp; &nbsp; &nbsp; Problems 5 and 6 introduce a hyperparameter lambda to improve the original policy gradient. From discussions with my classmates and material found online, we believe this change improves the original method: the advantage is now estimated from per-step temporal-difference errors rather than only from whole-episode returns, and lambda controls how many future steps are weighted in that estimate, which affects the speed of convergence. Moreover, as shown in the following two plots (the first from problem 3 and the second from problem 6), the training curve of problem 6 is more stable than that of problem 3, with far fewer sharp rises and drops. I believe this is also due to the improvement, since the per-step estimates keep each update closer to a good value in the local neighbourhood.

<img src="CEDL_3_pic/p3_1.JPG" width="400" height="200" />&nbsp;<img src="CEDL_3_pic/p6_1.JPG" width="400" height="200" />

&nbsp; &nbsp; &nbsp; &nbsp; By the way, we also found a disadvantage. According to material from the Internet, this algorithm can be hard to converge, and this indeed happened to us: we had to re-run training several times to satisfy the requirement of converging in around 80 iterations. I therefore think this remains a crucial challenge for improving it.