# What is Reinforcement Learning?

- It is the training of ML models to make a sequence of decisions
- The method employs episodes and Trial versions to get solution
- Rewards are given by games (it can be a positive and negative rewards). We focus on getting positive rewards.
- Goal is to Maximize the Total Rewards

## What is Keras?

##### Keras is one of the leading high-level neural networks APIs. It is written in Python and supports multiple back-end neural network computation engines.

- With Keras we do not need to make backpropogation algorithms(it is built in keras module)
- Many Layers could be added in just few lines of code
- All the types of models are built on same principles hence it becomes easier to master

## What is OpenAI Gym?

- Gym is a toolkit for developing and comparing reinforcement learning algorithms.
- It supports teaching agents everything from walking to playing games like Pong or Pinball.
- Has easy implementation in Python

## Learning On CartPole

- A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart.
- The pendulum starts upright, and the goal is to prevent it from falling over.
- A reward of +1 is provided for every timestep that the pole remains upright.
- The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center.

## Understanding OpenAI Gym Environment

- OpenAI Gym environments are structured around two main parts: an observation space and an action space
- Based On The current state of observation, we determine the action

## Using Gym

- create environment of gym env = gym.make('env-name')
- reset the environment env.reset()
- render the environment onto visible game env.render()
- Take next step in game env.step('give your action')
- close the rendering window env.close()

### import Libraries

In [1]:
import gym #gym use for agent and environment(reinforcement learning algorithms)
import numpy as np #numpy use for matrics 

In [2]:
env = gym.make('CartPole-v0')
# Create an environment of gym

In [3]:
obs = env.reset()
#reset the environment

## Data Collection

- We will collect data by running a certain number of Random trials
- Only those trials will be considered that have got us a min score
- One Hot Encoding will be used for passing action

In [4]:
#we will take random data for 10,000 steps
#The more data you run the more accurate results you gain
#Neural netwrorks need maximum amount of data for scores or total rewards

def collect_data(env):
    num_episodes = 10000
    # score is between 0 to 200 (0 for minimum and 200 for maximum)
    
    # I will take 50 for average because of random data 
    min_score = 50 
    
    #300 maximum steps at every episode/trial(usually it does not exceed 200 but for safety purpose we are taking 300 steps)
    t_steps = 300
    
    #X is for input and y is for output in trainig data for modeling
    training_X,training_y = [],[]
    
    #declare scores as empty as current score for comparison(array)
    scores = []
    
    #Iteration of number of trials
    
    for episode in range(num_episodes):
        obs = env.reset() #reset environment after each game
        score = 0  #declare score as integer
        training_sample_X,training_sample_y = [],[] #declare training sample data for current  
        
    #show only 400th time game in order to save time
        for step in range(t_steps):
            if(episode%400==0):
                env.render()
                
            #Take action randomly
            action = np.random.randint(0,2) # left or right
            
            #Convert action in one hot as Keras also recommend it
            #Keras also recommend it because it makes easier for modeling 
            one_hot_action = np.zeros(2) 
            
            #randomly action equal to 1 in one hot 
            one_hot_action[action] = 1
            
            #store current training data by append
            #4 input velocity,angel,angular velocity, position of poles store in observation
            training_sample_X.append(obs)
            training_sample_y.append(one_hot_action) #action is left(0) or right(1)
            
            #env.step() function returns observation,reward,done,info
            #observation keeps input as I have mentioned above
            #In reward 199 time step is maximum and 0 is minimum   
            #maximum time steps are 190 after 190, game will be done. In that time you win or lose
            
            observation , reward, done, info = env.step(action)
            score += reward
            if done: 
                break 
            #If your game is finised or done then it must be stop otherwise it is not useful. 
                
        #We will take scores only greater than minimum scores for training data               
        if score>min_score:
            scores.append(score)
            
            #store sample training variables into main training variables.
            training_X+=training_sample_X 
            training_y+=training_sample_y
    
    #Convert training input and output in array form
    training_X,training_y = np.array(training_X), np.array(training_y)
    
    #Pint mean and median of scores
    print("Average: {}".format(np.mean(scores)))
    print("Median: {}".format(np.median(scores)))
    
    return training_X,training_y

In [5]:
print(collect_data(env))
env.close()

Average: 61.75757575757576
Median: 59.0
(array([[ 0.03986029, -0.04820817, -0.03865588, -0.04262823],
       [ 0.03986029, -0.04820817, -0.03865588, -0.04262823],
       [ 0.03986029, -0.04820817, -0.03865588, -0.04262823],
       ...,
       [-0.01718096, -0.00128567,  0.00411215, -0.04054484],
       [-0.01718096, -0.00128567,  0.00411215, -0.04054484],
       [-0.01718096, -0.00128567,  0.00411215, -0.04054484]],
      dtype=float32), array([[1., 0.],
       [1., 0.],
       [0., 1.],
       ...,
       [0., 1.],
       [1., 0.],
       [0., 1.]]))


#### Average is 63.2 and median is 60 in minimum score 50. It is quite good average according to minimum score. 

##### Training X is showing four elements (position,velocity,angular velocity,angel) and y is showing 0,1 (left,right)

# Model Build

- We will use keras for model definition
- The model we use here is a very simple one: several fully-connected layers
- We can use enhancement such as Convolutions, LSTM,Dropouts etc.
- Input will be the observation and output will be action
- Loss can be used are mean_squared_error, categorical_crossentropy etc.
- Preferred optimizer is usually adam

In [6]:
from tensorflow.keras.models import Sequential #use for sequentional series of neural network
from tensorflow.keras.layers import Dense,Dropout #Dense layers are most basic and important layers of deep Q network(hidden layers) 
#Dropout function is use to drop some % deep q network or neural network

In [7]:
def Our_model():
    model = Sequential() #basic model for deep Q network
    model.add(Dense(128,input_shape=(4,),activation='relu')) # output is 128 layers, input size is 4
    # relu function has its derivative both are monotonic.
    model.add(Dropout(0.6)) #it removes 60% of neurons from the neural network to overcome the overfitting

    model.add(Dense(256,activation='relu'))
    model.add(Dropout(0.6))

    model.add(Dense(512,activation='relu'))
    model.add(Dropout(0.6))
     
    model.add(Dense(256,activation='relu'))
    model.add(Dropout(0.6))

    model.add(Dense(128,activation='relu'))
    model.add(Dropout(0.6))
    
    #Softmax function takes higher probability value amongst all
    #activation function in a neural network model
    model.add(Dense(2, activation='softmax'))
    
    #summary of the model
    model.summary()

    #Taking mean square error for finding loss 
    #we will use adam for optimization algorithm
    #accuracy will be measured by metrics
    model.compile(loss='mse',optimizer='adam',metrics=['accuracy'])
    return model


# Predictions

- From the data collection and model above we train our data
- We will go through several trials to check on multiple cases
- In each trial we get a score

In [8]:
def predict():
    env1 = gym.make('CartPole-v0') #create new environment
    training_X,training_y = collect_data(env1) #gather the data from the above function
    model = Our_model() # call created model function in variable model
    model.fit(training_X,training_y,epochs=5) #fit training data into model
    #In each epoch sending the data from the input layer to output layer
    #We increase epoch value for better accuracy 
    
    #create new environment for new games
    scores = []
    num_trails = 50 #for 50 trial games
    t_steps = 300
    
    for trial in range(num_trails):
        obs = env1.reset()
        score = 0
        
        for step in range(t_steps):
            if(trial%4==0): #takes every 4th trail only to render
                env1.render()
            
            #argmax function takes maximum value of arguement
            #Use model for prediction
            #observation will be change after every game
            #model will predict on the base of observation
            
            action = np.argmax(model.predict(obs.reshape(1,4)))
            observation,reward,done,info = env1.step(action)
            
            #if game is finished or done, the reward will be store in score
            if done:
                score+=reward
                break
                
        #update scores        
        scores.append(score)
        
        #print mean of the scores
        print(np.mean(scores))
        
    #close the environment    
    env1.close()

In [9]:
#call the predict function
predict()


Average: 62.67966573816156
Median: 59.0
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 128)               640       
                                                                 
 dropout (Dropout)           (None, 128)               0         
                                                                 
 dense_1 (Dense)             (None, 256)               33024     
                                                                 
 dropout_1 (Dropout)         (None, 256)               0         
                                                                 
 dense_2 (Dense)             (None, 512)               131584    
                                                                 
 dropout_2 (Dropout)         (None, 512)               0         
                                                                 
 dense_3 (Dense)

##### The results you can see is 1.0 in every game which shows the pole is consistantly on upright (perpendicular) neither it is exceeding their boundries. Due to the 60% of dropout function accuracy is 60% which is showing it is not overfitting model. 