## Project Navigation - Report

### Training 

For each training session, we construct the **agent** with different parameters
and we run the *Deep-Q-Network* procedure **dqn** as follows:

        agent = Agent(state_size=37, action_size=4, seed=1, fc1_units=fc1_nodes, fc2_units=fc2_nodes)
        scores, episodes = dqn(n_episodes=2000, eps_start=epsilon_start, train_num=i)
    
The obtained weights are then saved to the file 'weights_'+str(train_numb)+'.trn'.



### Deep-Q-Network algorithm

The _Deep-Q-Network_ procedure **dqn** runs the outer loop (over _episodes_) until either the maximal
number of episodes _n_episodes = 2000_ is reached or the _completion criterion_ is met.
For the completion criterion, we check

        np.mean(scores_window) >= 15,  

where _scores_\_window_ is a deque that implements a sliding window of length <= 100.
The element _scores_\_window_[i] contains the _score_ achieved by the algorithm in the _i_-th episode of the window.
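
The check above can be sketched as follows. This is a minimal illustration, assuming the window is a `collections.deque` with `maxlen=100`; the helper name `is_solved` and the plain-Python mean (used here instead of `np.mean`) are hypothetical and not taken from the project code.

```python
from collections import deque

def is_solved(scores_window, target=15.0):
    """Return True when the mean score over the window reaches the target."""
    if not scores_window:          # an empty window cannot satisfy the criterion
        return False
    return sum(scores_window) / len(scores_window) >= target

# With maxlen=100 the deque drops the oldest score once 100 entries are stored,
# so it always holds at most the 100 most recent episode scores.
scores_window = deque(maxlen=100)
for episode_score in [10.0] * 100 + [20.0] * 50:   # made-up scores for illustration
    scores_window.append(episode_score)

window_mean = sum(scores_window) / len(scores_window)
```

Because only the most recent scores are kept, early low-scoring episodes stop influencing the criterion once they slide out of the window.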


In the inner loop, **dqn** gets the current _action_ from the **agent**.
With this _action_, **dqn** obtains the next state and the _reward_ from the Unity environment _env_.
Then the **agent** receives the parameters _state_, _action_, _reward_, _next_\__state_, _done_
for the next training step. The variable _score_ accumulates the obtained rewards.
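
The inner loop described above might look roughly like the sketch below. `StubEnv` and `StubAgent` are hypothetical stand-ins written only for this illustration: the real Unity environment interface and the project's **Agent** class are not reproduced here, and the reward rule is made up.

```python
import random

class StubEnv:
    """Hypothetical stand-in for the Unity environment: fixed-length episodes."""
    def __init__(self, episode_length=5):
        self.episode_length = episode_length
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0] * 37                      # dummy state of size 37

    def step(self, action):
        self.t += 1
        next_state = [float(self.t)] * 37
        reward = 1.0 if action == 0 else 0.0   # made-up reward rule
        done = self.t >= self.episode_length
        return next_state, reward, done

class StubAgent:
    """Hypothetical agent exposing act/step; it only records experiences."""
    def __init__(self, action_size=4, seed=1):
        self.action_size = action_size
        self.rng = random.Random(seed)
        self.experiences = []

    def act(self, state, eps):
        return self.rng.randrange(self.action_size)   # random policy, ignores eps

    def step(self, state, action, reward, next_state, done):
        self.experiences.append((state, action, reward, next_state, done))

def run_episode(env, agent, eps):
    """One pass of the inner loop; returns the accumulated score."""
    state = env.reset()
    score = 0.0
    while True:
        action = agent.act(state, eps)                        # agent picks the action
        next_state, reward, done = env.step(action)           # environment responds
        agent.step(state, action, reward, next_state, done)   # hand over the experience
        state = next_state
        score += reward
        if done:
            return score

env, agent = StubEnv(), StubAgent()
score = run_episode(env, agent, eps=1.0)
```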


### Mechanisms of Agent

The class **Agent** is the well-known class implementing the following mechanisms:

* Two Q-Networks (local and target) using the simple neural network.

        self.qnetwork_local = QNetwork(state_size, action_size, seed, fc1_units, fc2_units).to(device)
        self.qnetwork_target = QNetwork(state_size, action_size, seed, fc1_units, fc2_units).to(device)

* Replay memory (using the class ReplayBuffer)

       self.experience = namedtuple("Experience", field_names=["state", "action", "reward", "next_state", "done"])
       ...
       e = self.experience(state, action, reward, next_state, done)
       self.memory.append(e)
     
* Epsilon-greedy mechanism

        if random.random() > eps:
            return np.argmax(action_values.cpu().data.numpy())
        else:
            return random.choice(np.arange(self.action_size))
            
   Epsilon becomes slightly smaller with each episode:
   
        eps = max(eps_end, eps_decay*eps)
        
   where eps_end = 0.01 and eps_decay = 0.999.
   
* Q-learning, i.e., using the maximum Q-value over all possible actions
* Computing the loss function as the mean squared error (MSE) loss

       loss = F.mse_loss(Q_expected, Q_targets)
     
* Minimizing the loss by gradient descent using the Adam optimizer
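
To make the epsilon decay, the Q-learning target and the MSE loss from the list above concrete, here is a small plain-Python sketch. It is illustrative only: the function names are hypothetical, the discount factor `gamma=0.99` is an assumption (the report does not state it), and the project itself computes these quantities with PyTorch tensors.

```python
def decay_epsilon(eps, eps_end=0.01, eps_decay=0.999):
    """One decay step: eps = max(eps_end, eps_decay*eps)."""
    return max(eps_end, eps_decay * eps)

def q_target(reward, next_q_values, done, gamma=0.99):
    """Q-learning target: reward plus discounted max Q-value of the next state.

    gamma=0.99 is an assumed discount factor, not taken from the report.
    """
    if done:                       # no bootstrap term after a terminal step
        return reward
    return reward + gamma * max(next_q_values)

def mse_loss(expected, targets):
    """Mean squared error between predicted and target Q-values."""
    return sum((e - t) ** 2 for e, t in zip(expected, targets)) / len(targets)

# Epsilon after 2000 episodes, starting from eps_start = 0.989.
eps = 0.989
for _ in range(2000):
    eps = decay_epsilon(eps)

# Two made-up transitions: one non-terminal, one terminal.
targets = [q_target(1.0, [0.5, 2.0, 1.0, 0.0], done=False),
           q_target(0.0, [3.0, 1.0, 0.0, 0.0], done=True)]
loss = mse_loss([2.0, 0.0], targets)
```

With eps_decay = 0.999, epsilon after 2000 episodes is still around 0.13, so roughly one action in eight remains random at the end of training.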

### Model Q-Network

Both Q-Networks (local and target) are implemented by the class
**QNetwork**. This class implements a simple neural network
with 3 fully-connected layers and 2 rectified (ReLU) nonlinearities.
The class **QNetwork** is implemented with the package **PyTorch**.
The sizes (inputs x outputs) of the fully-connected layers are as follows:

 * Layer fc1,  size: _state_\__size_ x _fc1_\__units_, 
 * Layer fc2,  size: _fc1_\__units_ x _fc2_\__units_,
 * Layer fc3,  size: _fc2_\__units_ x _action_\__size_,
 
where _state_\__size_ = 37, _action_\__size_ = 4, and _fc1_\__units_ and _fc2_\__units_
are the input params.
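
As a rough illustration of what these sizes mean, the sketch below counts the trainable parameters for fc1_units = 80 and fc2_units = 72, using action_size = 4 as in the **Agent** constructor call. It assumes each layer also has one bias per output (the default for PyTorch's `nn.Linear`); the helper names are hypothetical.

```python
def layer_shapes(state_size, fc1_units, fc2_units, action_size):
    """(inputs, outputs) of the three fully-connected layers fc1, fc2, fc3."""
    return [(state_size, fc1_units),
            (fc1_units, fc2_units),
            (fc2_units, action_size)]

def count_parameters(shapes):
    """Weights (inputs * outputs) plus one bias per output, summed over layers.

    The bias term is an assumption about how the layers are configured.
    """
    return sum(n_in * n_out + n_out for n_in, n_out in shapes)

shapes = layer_shapes(state_size=37, fc1_units=80, fc2_units=72, action_size=4)
total = count_parameters(shapes)   # 3040 + 5832 + 292
```

Under these assumptions the network has 9164 parameters, most of them in fc2.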

### Training and Testing 
 
We run 5 training sessions with different parameters _fc1_\__units_,  _fc2_\__units_, _eps_\__start_,
and we save the obtained weights with the **PyTorch** function:

    torch.save(agent.qnetwork_local.state_dict(), 'weights_'+str(train_numb)+'.trn') 
     
For input fc1_units = 80, fc2_units = 72, we get the following training output:

    train_num:  0 eps_start:  0.989
    Episode: 2000, elapsed: 0:47:17.485940, Avg.Score: 11.59,  score 12.0, How many scores >= 15: 17, eps.: 0.13

For input fc1_units = 80, fc2_units = 80:

    train_num:  1 eps_start:  0.998
    Episode: 2000, elapsed: 0:45:05.674792, Avg.Score: 11.96,  score 9.0, How many scores >= 15: 26, eps.: 0.134

For input fc1_units = 80, fc2_units = 72:

    train_num:  2 eps_start:  0.988
    Episode: 2000, elapsed: 0:45:17.810554, Avg.Score: 12.43,  score 11.0, How many scores >= 15: 29, eps.: 0.13

For input fc1_units = 112, fc2_units = 112:

    train_num:  3 eps_start:  0.991
    Episode: 2000, elapsed: 0:45:30.095643, Avg.Score: 11.53,  score 16.0, How many scores >= 15: 17, eps.: 0.13

For input fc1_units = 112, fc2_units = 120:

    train_num:  4 eps_start:  0.994
    Episode: 2000, elapsed: 0:45:28.997457, Avg.Score: 11.86,  score 13.0, How many scores >= 15: 23, eps.: 0.13

### Solved environment (from my local machine)

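From the tests done on my local machine, it seems that the model performs best when fc1_units > fc2_units. In the third test run (train_num 2, with fc1_units = 80 and fc2_units = 72), the model exceeded the 15-point threshold most often (29 times) and reached the highest average score, 12.43. Note that the first run (train_num 0) used the same layer sizes but reached a lower average score (11.59), so this observation is tentative.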


### Future ideas 

Some ideas for improving the agent's performance:

1. A possible improvement can be achieved by adding one or more nonlinear (or also linear) layers to the neural network, albeit at the cost of more computation power.

2. Running the test with many more combinations of fc1_units and fc2_units (only 5 in my case).

3. Treating the starting value of epsilon as a parameter and assessing it more systematically.

### References to possible improvements

It would be very useful to check the improvements described in [Deep Q Learning: Dueling Double DQN, Prioritized Experience Replay, and fixed Q-targets](https://medium.freecodecamp.org/improvements-in-deep-q-learning-dueling-double-dqn-prioritized-experience-replay-and-fixed-58b130cc5682).

One effective way to improve the performance is to use Prioritized Experience Replay. It will be interesting to check this [github repo](https://github.com/rlcode/per) for a fast implementation of Prioritized Experience Replay using a special data structure, the Sum Tree.

### The recent achievement 

The [OpenAI group trained a bot to play Dota 2](https://openai.com/blog/dota-2/) using Reinforcement Learning. The bot beats the world’s top professionals at 1v1 matches of [Dota 2](http://blog.dota2.com/?l=english) under standard tournament rules. It learned the game from scratch by self-play, and does not use imitation learning or tree search. This is a step towards building AI systems which accomplish well-defined goals in messy, complicated situations involving real humans.


### Tensorflow or PyTorch
TensorFlow has been developed by Google, whereas PyTorch is based on Torch and has been developed by Facebook. [The force is strong with which one?](https://medium.com/@UdacityINDIA/tensorflow-or-pytorch-the-force-is-strong-with-which-one-68226bb7dab4)

