### Instructions 
- Install all the necessary libraries.
- Download the ROMs from https://github.com/openai/atari-py#roms
- Then run `python -m atari_py.import_roms <path to folder>` (I recommend using the same folder you are working in).

### Import libraries

In [None]:
import gym
import numpy as np 
import datetime
import functools
import matplotlib.pyplot as plt
%matplotlib inline
import tensorflow as tf 

### Define the environment and see the game

In [None]:
env = gym.make('Pong-v0')
observation = env.reset()
n_actions = env.action_space.n
observation.shape

In [None]:
for i in range(30):
    observation, reward, done, info = env.step(0)  # 0 means stay in place (do nothing)
plt.imshow(observation)

#### Basic explanation
- In the image above, we control the paddle on the right.
- If the ball passes our paddle and exits on the right, we receive a reward of -1 for losing the point; if it passes the opponent's paddle and exits on the left, we receive a reward of +1. The game ends when one of the players reaches 21 points.

#### System definition:
- The state is the game screen.
- The actions are move up, move down, and stay.

#### Possible approaches
- Because we use a neural network as the policy, we need to give it enough information to figure out where the ball is going.
- One option is to define the state as the game's latest 10 frames and feed them to a NN. Alternatively, we could feed the frames one by one to an RNN so that it learns the sequence of the game.
- For simplicity we use another technique: we subtract two consecutive frames and use the resulting image as the input to the network.
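
The frame-difference idea can be illustrated on a toy example. This is only a minimal sketch, assuming NumPy is available; the one-row "screen" and the helper name `frame_diff` are made up for illustration and are not part of the notebook.

```python
import numpy as np

def frame_diff(new_frame, last_frame):
    # Hypothetical helper: signed difference of two frames.
    # Cast to a signed type first, otherwise uint8 subtraction would wrap around.
    return new_frame.astype(np.int32) - last_frame.astype(np.int32)

# Toy one-row "screen": the ball (value 1) moves one pixel to the right.
last_frame = np.array([[0, 0, 1, 0, 0, 0]], dtype=np.uint8)
new_frame = np.array([[0, 0, 0, 1, 0, 0]], dtype=np.uint8)

diff = frame_diff(new_frame, last_frame)
print(diff)  # negative where the ball was, positive where it is now
```

The sign pattern tells the network both the position and the direction of the moving object, which a single frame cannot do.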

### Preprocess the data

In [None]:
def preprocess_frames(new_frame: np.ndarray, last_frame: np.ndarray) -> np.ndarray:
    # inputs are two RGB frames, numpy arrays of shape (210, 160, 3)
    n_frame = new_frame.astype(np.int32)
    # remove background colors
    n_frame[(n_frame == 144) | (n_frame == 109)] = 0
    l_frame = last_frame.astype(np.int32)
    # remove background colors
    l_frame[(l_frame == 144) | (l_frame == 109)] = 0
    diff = n_frame - l_frame
    # crop the top and bottom of the screen
    diff = diff[35:195]
    # downsample by a factor of 2 in both directions
    diff = diff[::2, ::2]
    # convert to grayscale
    diff = diff[:, :, 0] * 299. / 1000 + diff[:, :, 1] * 587. / 1000 + diff[:, :, 2] * 114. / 1000
    # rescale values to the range [-1, 1]
    max_val = max(diff.max(), abs(diff.min()))
    if max_val != 0:
        diff = diff / max_val
    return diff

In [None]:
new_observation, reward, done, info = env.step(2)
plt.imshow(preprocess_frames(new_observation,observation),plt.cm.gray)

#### Explanation
- We removed superfluous information such as color.
- We cropped the top and bottom of the game screen because they carry no useful information.
- The background colors are removed.

Our state is now an 80×80 image created by subtracting two consecutive frames. Most values are 0; non-zero values appear only where the paddles or the ball have moved.
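
To see where the 80×80 shape comes from, the crop and downsample steps can be replayed on an empty array. This is a rough sketch that assumes the raw Pong frame has shape (210, 160, 3); taking the first channel is only a stand-in for the grayscale conversion.

```python
import numpy as np

# Assumed raw frame shape: 210 rows, 160 columns, 3 color channels.
frame = np.zeros((210, 160, 3), dtype=np.uint8)

cropped = frame[35:195]          # keep rows 35..194 -> 160 rows
downsampled = cropped[::2, ::2]  # keep every second row and column
gray = downsampled[:, :, 0]      # stand-in for the grayscale conversion

print(cropped.shape, downsampled.shape, gray.shape)
```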

### Creating the model

In [None]:
inputs = tf.keras.layers.Input(shape=(80, 80))
inputs_reshape = tf.keras.layers.Reshape((80, 80, 1))(inputs)
conv2d1 = tf.keras.layers.Conv2D(filters=10, kernel_size=20, padding='valid', 
                                 activation='relu', strides=(4,4), use_bias=False)(inputs_reshape)
conv2d2 = tf.keras.layers.Conv2D(filters=20, kernel_size=10, padding='valid',
                                 activation='relu', strides=(2,2), use_bias=False)(conv2d1)
conv2d3 = tf.keras.layers.Conv2D(filters=40, kernel_size=3, padding='valid',
                                 activation='relu', use_bias=False)(conv2d2)
flattened_layer = tf.keras.layers.Flatten()(conv2d3)
sigmoid_output = tf.keras.layers.Dense(1, activation='sigmoid',use_bias=False)(flattened_layer)
model = tf.keras.models.Model(inputs=inputs,outputs=sigmoid_output)
model.summary()
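
The layer shapes and parameter counts can also be worked out by hand and compared against the summary. The sketch below is plain Python and assumes the standard output-size formula for `'valid'` padding, `(size - kernel) // stride + 1`; the helper name `conv_out` is hypothetical.

```python
def conv_out(size, kernel, stride=1):
    # Output length of a 'valid' convolution along one dimension.
    return (size - kernel) // stride + 1

size = 80          # input is 80 x 80 x 1
in_channels = 1
layers = [(20, 4, 10), (10, 2, 20), (3, 1, 40)]  # (kernel_size, stride, filters)

shapes = []
n_params = 0
for kernel, stride, filters in layers:
    size = conv_out(size, kernel, stride)
    # use_bias=False, so only the kernel weights are counted
    n_params += kernel * kernel * in_channels * filters
    in_channels = filters
    shapes.append((size, size, filters))

flat = size * size * in_channels  # length of the flattened vector
n_params += flat * 1              # Dense(1) without bias

print(shapes, flat, n_params)
```

The feature map shrinks from 80×80 to 2×2 over the three convolutions, so the final dense layer sees only 160 values.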

### Defining Loss

We can define the loss as:

L = −G log(π)

whose gradient with respect to the network parameters θ is ∇<sub>θ</sub>L = G ∇<sub>θ</sub>(−log(π)).

Here G is the reward assigned to the state we are updating, and π is the probability of the action we actually took while playing.

In [None]:
episode_reward = tf.keras.layers.Input(shape=(1,),name='episode_reward')
def m_loss(episode_reward):
    def loss(y_true,y_pred):
        # y_true is the action actually taken:
        # 1 if the action was up, 0 otherwise
        # y_pred is the network output (probability of taking the up action)
        tmp_pred = tf.keras.layers.Lambda(lambda x: tf.keras.backend.clip(x,0.05,0.95))(y_pred)
        # compute the negative log-probability of the action taken:
        # -log(y_pred) when we chose up, -log(1 - y_pred) when we chose down
        tmp_loss = tf.keras.layers.Lambda(lambda x:-y_true*tf.keras.backend.log(x)-
                                          (1-y_true)*(tf.keras.backend.log(1-x)))(tmp_pred)
        # multiply log of policy by reward
        policy_loss = tf.keras.layers.Multiply()([tmp_loss,episode_reward])
        return policy_loss
    return loss
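
The same computation can be checked on single numbers. This is a plain-Python sketch of the per-sample loss, not the Keras implementation itself; the function name `policy_loss` is hypothetical, and the clipping bounds are copied from the cell above.

```python
import math

def policy_loss(y_true, p_up, reward, low=0.05, high=0.95):
    # Clip the predicted probability, as the Keras loss does.
    p = min(max(p_up, low), high)
    # Negative log-probability of the action that was actually taken.
    neg_log_prob = -y_true * math.log(p) - (1 - y_true) * math.log(1 - p)
    # Scale by the reward assigned to this step.
    return neg_log_prob * reward

print(policy_loss(1, 0.5, 1.0))    # took "up", positive reward
print(policy_loss(1, 0.5, -1.0))   # same action, negative reward flips the sign
print(policy_loss(0, 0.999, 1.0))  # prediction is clipped to 0.95 first
```

With a positive reward, minimizing the loss raises the probability of the action taken; with a negative reward it lowers it.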

### Creating optimizer and network

In [None]:
episode_reward = tf.keras.layers.Input(shape=(1,),name='episode_reward')
policy_network_train = tf.keras.models.Model(inputs=[inputs,episode_reward],outputs=sigmoid_output)
my_optimizer = tf.keras.optimizers.Adam(learning_rate=0.0001)
policy_network_train.compile(optimizer=my_optimizer, loss=m_loss(episode_reward))

Note that `policy_network_train` and `model` share the same layers (from inputs to outputs), so their weights are the same. As a result, we use `policy_network_train` only for training and `model` for playing and simulating.

### Reward (maybe the most important step in RL)

#### Problem definition
- The environment only rewards us when the ball passes our paddle (we score -1) or our opponent's paddle (we score +1).
- The reward is almost always zero, so the agent gets no positive or negative feedback for most steps and the majority of gradients become zero.
- Only the actions we took to hit the ball mattered for winning the point; everything that happened after we hit the ball had no bearing on it.

#### Solution
- We assign each action taken before a reward a decayed copy of that reward. For example, if we got reward +1 at time 200, the reward at time 199 is +0.99, at time 198 about +0.98, and so on.
- With this scheme, every action that led up to a +1 or -1 gets a reward. We assume that the closer an action is to the reward received, the more significant it is.
- Some notes about normalization
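
The backward propagation of rewards can be traced on a short list. This is a minimal plain-Python sketch, assuming a decay of 0.99 as in the example above; the name `discount_rewards` is hypothetical.

```python
def discount_rewards(rewards, decay=0.99):
    discounted = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step sees the next non-zero reward first.
    for i in reversed(range(len(rewards))):
        if rewards[i] != 0:
            running = float(rewards[i])  # a point was scored: restart from it
        else:
            running *= decay             # otherwise decay the later reward
        discounted[i] = running
    return discounted

# A won point at step 2 followed by a lost point at step 4.
result = discount_rewards([0, 0, 1, 0, -1])
print(result)
```

The steps leading up to the +1 receive decaying positive values, while the step before the -1 receives a negative value, because the running value restarts at every scored point.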

In [None]:
def generate_episode(policy_network, env):
    states = [] # shape = (x,80,80)
    up_or_down_action=[]
    rewards=[]
    network_output=[]
    observation = env.reset()
    new_observation = observation
    done = False
    policy_output = []
    while not done:
        processed_network_input = preprocess_frames(new_frame=new_observation,last_frame=observation)
        states.append(processed_network_input)
        reshaped_input = np.expand_dims(processed_network_input,axis=0) 
        up_probability = policy_network.predict(reshaped_input,batch_size=1)[0][0]
        network_output.append(up_probability)
        policy_output.append(up_probability)
        # sample a single action id and convert it to a plain int for env.step
        actual_action = int(np.random.choice(a=[2, 3], p=[up_probability, 1 - up_probability]))
        # 2 is up and 3 is down 
        if actual_action==2:
            up_or_down_action.append(1)
        else:
            up_or_down_action.append(0)
        observation= new_observation
        new_observation, reward, done, info = env.step(actual_action)
        rewards.append(reward)  
        if done:
            break        
    env.close()
    return states, up_or_down_action,rewards, network_output
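

def process_rewards(r):
    # propagate each non-zero reward backwards in time, decaying it at every step
    reward_decay = 0.99
    tmp_r = 0
    rew = np.zeros_like(r, dtype=np.float16)
    for i in range(len(r) - 1, -1, -1):
        if r[i] == 0:
            tmp_r = tmp_r * reward_decay
            rew[i] = tmp_r
        else:
            # a point was scored here: restart from the actual reward
            tmp_r = r[i]
            rew[i] = tmp_r
    return rew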
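
The action-sampling step inside `generate_episode` can be tried in isolation. This is a small sketch using only the standard library; the helper `sample_action` and the fixed seed are illustrative assumptions, while the action ids 2 (up) and 3 (down) follow the cell above.

```python
import random

UP, DOWN = 2, 3  # action ids used for up and down

def sample_action(up_probability, rng):
    # Choose "up" with probability up_probability, otherwise "down".
    return UP if rng.random() < up_probability else DOWN

rng = random.Random(0)  # fixed seed so the run is repeatable
actions = [sample_action(0.5, rng) for _ in range(1000)]
up_fraction = actions.count(UP) / len(actions)
print(up_fraction)  # close to 0.5 for an untrained policy
```

Sampling instead of always picking the more likely action keeps the agent exploring, which matters most early on when the network's probabilities are uninformative.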

In [None]:
states, up_or_down_action, rewards, network_output = generate_episode(model, env)

In [None]:
print("length of states= "+str(len(states)))# this is the number of frames
print("shape of each state="+str(states[0].shape))
print("length of rewards= "+str(len(rewards)))
# let's see how many points we won and lost over the whole game:
print("count win="+str(len(list(filter(lambda r: r>0,rewards)))))
print("count lose="+str(len(list(filter(lambda r: r<0,rewards)))))
print("count zero rewards="+str(len(list(filter(lambda r: r==0,rewards)))))

In [None]:
up_or_down_action[10:20]

Because the network has not been trained yet, its output is always close to 0.5: it does not know which action is better, so it assigns every state an up probability of approximately 0.5.

In [None]:
plt.plot(process_rewards(rewards),'-',)
ax=plt.gca()
ax.grid(True)

### Training and simulation

In [None]:
def generate_episode_batch(model, env, n_batches=10):
    # play n_batches episodes in the given environment, then train once on all of them
    batch = []
    batch_up_or_down_action = []
    batch_rewards = []
    batch_network_output = []
    for i in range(n_batches):
        states, up_or_down_action, rewards, network_output = generate_episode(model, env)
        # skip the first 15 steps of each episode
        batch.extend(states[15:])
        batch_network_output.extend(network_output[15:])
        batch_up_or_down_action.extend(up_or_down_action[15:])
        batch_rewards.extend(rewards[15:])
    episode_reward = np.expand_dims(process_rewards(batch_rewards), 1)
    X = np.array(batch)
    y_true = np.expand_dims(np.array(batch_up_or_down_action), 1)
    # keep the decayed rewards as floats; casting to an integer type
    # would truncate every fractional value to 0
    episode_reward = episode_reward.astype(np.float32)
    policy_network_train.fit(x=[X, episode_reward], y=y_true)
    return batch, batch_up_or_down_action, batch_rewards, batch_network_output

In [None]:
train_n_times = 2000
for i in range(train_n_times):
    states, up_or_down_action, rewards, network_output = generate_episode_batch(model, env, 10)
    if i%500==0:
        print("i="+str(i))
        rr = np.array(rewards)
        print('count win='+str(len(rr[rr>0]))) 
        model.save("policy_network_model_simple.h5")
        model.save("policy_network_model_simple"+str(i)+".h5")
        with open('rews_model_simple.txt','a') as f_rew:
            f_rew.write("i="+str(i)+'       reward= '+str(len(rr[rr > 0])))
            f_rew.write("\n")