### **SOW-MKI49-2019-SEM1-V: NeurIPS**
#### Project: Neurosmash

This is the info document on the (updated*) Neurosmash environment that you will be using for your project. It contains background info and skeleton code to get you started.

### Project

In the next 4 + 1 weeks, you will be working exclusively on your project in the practicals. The goal is to take what has been discussed in class and what you have already worked on in the earlier practicals, and apply it to an RL problem in a novel environment. Note that while the earlier practicals were intended to give you the opportunity to gain experience with various RL topics and were not graded, your project will constitute 50% of your final grade.

Your project grade will be based on the following components:
- Online demonstration
- Source code
- Written report

These components will be evaluated based on performance, creativity, elegance, rigor and plausibility.

While you can use the material from earlier practicals (e.g., REINFORCE, DQN, etc.) as boilerplate, you are also free to take any other approach for your project, be it imitation learning or world models.

In addition to the practical sessions, we will provide additional support in the coming four weeks. You can email any of us to set up an appointment for discussing your project.

### Environment

Briefly, there are two agents: Red and Blue. Red is controlled by you. Blue is controlled by the environment "AI".* Both agents always run forward with a speed of 3.5 m/s*. If one of them gets within reach of the other (a frontal sphere with a 0.5 m radius), it gets pushed away automatically with a speed of 3.5 m/s. The only thing that the agents can do is turn left or right with an angular speed of 180 degrees/s. This means that there are three possible discrete actions that your agent can take at every step: turn nowhere, turn left and turn right. For convenience, there is also a fourth built-in action which turns left or right with uniform probability. An episode begins when you reset the environment and ends when one of the agents falls off the platform. At the end of the episode, the winning agent gets a reward of 10 while the other gets nothing. Therefore, your goal is to train an agent that maximizes its reward by pushing the other agent off the platform or by making it fall off the platform by itself.

* Note that all times are simulation time, that is, 0.02 s per step when timescale is set to 1.
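To make these numbers concrete, the per-step motion follows directly from the speeds and the step duration given above. The following is a small sketch, assuming that the simulation time per step scales linearly with the timescale setting; the helper name `per_step_motion` is made up for illustration and is not part of the Neurosmash module.

```python
STEP_SECONDS = 0.02    # simulation time per step when timescale is 1
FORWARD_SPEED = 3.5    # m/s, both agents always run forward
TURN_SPEED = 180.0     # degrees/s when turning left or right


def per_step_motion(timescale=1.0):
    """Return (metres moved, maximum degrees turned) in one environment step.

    Assumes simulation time per step is STEP_SECONDS * timescale.
    """
    dt = STEP_SECONDS * timescale
    return FORWARD_SPEED * dt, TURN_SPEED * dt


distance, angle = per_step_motion(1.0)
print(f"{distance:.2f} m and {angle:.1f} degrees per step")
```

Under that assumption, a timescale of 10 would make each action move the agent ten times further and turn it ten times more, which is where the loss of precision comes from.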

* Basically, Blue is artificial but not really intelligent. Every 0.5 s, it updates its destination to the current position of Red plus some random variation (a point within a surrounding circle with a radius of 1.75 m) and smoothly turns toward that position.
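Blue's targeting rule can be mimicked in a few lines, which may be handy if you want a crude opponent model. This is only a sketch: sampling uniformly over the disc is an assumption (the actual distribution is not specified here), and the function names are hypothetical.

```python
import math
import random

UPDATE_INTERVAL = 0.5   # seconds of simulation time between destination updates
NOISE_RADIUS = 1.75     # metres, radius of the circle around Red


def sample_blue_destination(red_xy, rng=random):
    """Pick a point inside the circle of radius NOISE_RADIUS around Red.

    Uniform sampling over the disc is an assumption made for this sketch.
    """
    r = NOISE_RADIUS * math.sqrt(rng.random())   # sqrt gives uniform density over the area
    theta = rng.uniform(0.0, 2.0 * math.pi)
    return red_xy[0] + r * math.cos(theta), red_xy[1] + r * math.sin(theta)


def steps_between_updates(step_seconds=0.02):
    """Number of environment steps between two destination updates."""
    return round(UPDATE_INTERVAL / step_seconds)


print(steps_between_updates())             # steps per update at timescale 1
print(sample_blue_destination((0.0, 0.0)))
```

At a timescale of 1, Blue therefore commits to a destination for 25 steps, which leaves your agent a short window in which Blue's heading is predictable.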

### *Updates

* Several small changes have been made to the lite version based on your feedback. The most notable ones are:
- Bugs have (hopefully) been completely eliminated. Any remaining bug/glitch that your agent "learns" to exploit will be considered fair game.
- The TCP/IP interface has been made more robust (you can now stop and start the simulation with the GUI; there is no need to quit and rerun the environment to reset it if something goes wrong).
- Animations/graphics have been updated (you can now tell what is going on more easily; agents actually fall down, etc.).
- Last but not least, size and timescale settings have been added (you can now change the resolution and the speed of the environment to make the simulation run faster). In other words:

Size => This is the size of the texture that the environment is rendered to. It is set to 784 by default, which results in a crisp image but slow speed. You can change the size to a value that works well for your environment, but it should not go too low.

Timescale => This is the simulation speed of the environment. It is set to 1 by default. Setting it to n makes the simulation n times faster. In other words, less (if n < 1) or more (if n > 1) simulation time passes per step. If you cannot train your models fast enough, you might want to increase this value to around 10 so that they can sample more states in fewer steps, at the expense of precision.
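The effect of the size setting on your model's input is easy to quantify, because the state is a flattened size x size x 3 vector of pixel values (see the skeleton code below). A minimal sketch; the helper name `state_length` and the value 64 are only illustrative.

```python
def state_length(size):
    """Number of values in the flattened size x size x 3 state vector."""
    return size * size * 3


for size in (784, 150, 64):
    print(f"size={size:>3}: {state_length(size):>9,} values per state")
```

Going from the default of 784 down to 150 shrinks each state from 1,843,968 to 67,500 values, which matters a lot if you keep many states in a replay memory.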

### Misc. FAQs

Q: Can we get HPC access?  
A: I will try to provide access to the AI HPC cluster if you require additional resources. If this is something that you would like, please contact me. Note, however, that you should use the cluster for training your final model and not for development.

Q: Will the environment code be shared?  
A: Yes. I will share the entire Unity project at the end of the course (but without the 3D agent models).

Q: Is there an environment version that can be played with a mouse/keyboard?  
A: No, but I will make one and update Brightspace when I have some free time.

Q: I found a bug/glitch. What should I do?  
A: Please let me know and I will fix it. Do note however that any updates from this point on will be optional to adopt. That is, you can keep working on the current environment if you so wish or think that updating will disadvantage you in any way.

### Skeleton code

- You should first add the Neurosmash file to your working directory or Python path.

In [1]:
import Neurosmash
import numpy as np  # used further down to convert states into arrays
import torch

# These are the default environment arguments. They must be the same as the values that are set in the environment GUI.
ip         = "127.0.0.1" # IP address that the TCP/IP interface listens to
port       = 13000       # Port number that the TCP/IP interface listens to
size       = 150         # Please check the Updates section above for more details
timescale  = 10          # Please check the Updates section above for more details

agent = Neurosmash.Agent() # This is an example agent.
                           # It has a step function, which gets reward/state as arguments and returns an action.
                           # Right now, it always outputs a random action (3) regardless of reward/state.
                           # The real agent should output one of the following three actions:
                           # none (0), left (1) and right (2)

env = Neurosmash.Environment(ip, port, size, timescale) # This is the main environment.
                                       # It has a reset function, which is used to reset the environment before episodes.
                                       # It also has a step function, which advances the simulation by one time point.
                                       # It gets an action (as defined above) as input and outputs the following:
                                       # end (true if the episode has ended, false otherwise)
                                       # reward (10 if won, 0 otherwise)
                                       # state (flattened size x size x 3 vector of pixel values)
                                       # The state can be converted into an image as follows:
                                       # image = np.array(state, "uint8").reshape(size, size, 3)
                                       # You can also use the Neurosmash.Environment.state2image(state) function, which returns
                                       # the state as a PIL image

In [2]:
# The following runs a training loop: every "episode" takes a fixed number of steps,
# and the environment is reset whenever a match ends within those steps
episode_count = 50
action_iter_count = 500

end, reward, state = env.reset()

# gpu calc
print("search cuda")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # fall back to CPU if no GPU is found
print(device)
agent.model.to(device)
agent.target_model.to(device)

for ep in range(episode_count):
    print("episode: {}".format(ep+1))
    for s in range(action_iter_count):
        action = agent.step(end, reward, state)
        end, reward, state_n = env.step(action)
        #if end > 0:
            #reward = -20
        if reward != 0:
            print("reward found: {}".format(reward))
        agent.memorize(state, action, reward, state_n, end)
        state = state_n
        agent.replay(device)
        if end != 0:
            end, reward, state = env.reset()
    agent.epsilon -= agent.eps_decay  # linear epsilon decay, applied once per episode


search cuda
cuda
episode: 1
episode: 2
replay buffer started
memory size: 501
2020-01-18 15:10:44.055305
start calculating error
end calculating error
2020-01-18 15:10:45.889789
start learning
end learning
2020-01-18 15:10:47.125683
replay buffer ended
current epsilon: 0.99
replay buffer started
memory size: 502
2020-01-18 15:10:47.202477
start calculating error
end calculating error
2020-01-18 15:10:47.666237
start learning
end learning
2020-01-18 15:10:48.837406
replay buffer ended
current epsilon: 0.99
replay buffer started
memory size: 503
2020-01-18 15:10:49.019801
start calculating error
end calculating error
2020-01-18 15:10:49.488547
start learning


KeyboardInterrupt: 
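The loop above lowers `agent.epsilon` by `agent.eps_decay` after every episode without a lower bound, so with enough episodes epsilon can reach zero or go negative. A minimal sketch of epsilon-greedy action selection with a clamped linear decay follows. The function names, `eps_min` and the numeric values are assumptions chosen for illustration; they are not taken from the agent used above.

```python
import random

N_ACTIONS = 3  # none (0), left (1), right (2)


def select_action(q_values, epsilon, rng=random):
    """Epsilon-greedy choice over the three real actions."""
    if rng.random() < epsilon:
        return rng.randrange(N_ACTIONS)                        # explore: uniform random action
    return max(range(N_ACTIONS), key=lambda a: q_values[a])    # exploit: best estimated action


def decay_epsilon(epsilon, eps_decay=0.01, eps_min=0.05):
    """Linear decay per episode, clamped so that exploration never drops below eps_min."""
    return max(eps_min, epsilon - eps_decay)


epsilon = 1.0
for episode in range(200):
    epsilon = decay_epsilon(epsilon)
print(epsilon)
```

Keeping a small floor on epsilon is a common design choice: the agent keeps exploring a little even late in training instead of becoming fully greedy.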

In [None]:
# Let's also do it step by step while displaying the state

end, reward, state = env.reset()

# env.state2image(state)  # uncomment to display the state as a PIL image

In [None]:
#create a network (test)


action = agent.step(end, reward, state)
end, reward, state = env.step(action)
# state = 768*768*3
#state_1 = environment.state2image(state)
#print(type(np.array(environment.state2image(state))[1][1][1]))
data = [np.swapaxes(np.array(env.state2image(state), dtype='d'),0,2)]
x = torch.tensor(data, dtype=torch.double)
# print(type(environment.state2image(state)))
# print(type(np.array(environment.state2image(state), dtype='d')[0][0][0]))

# print(x)
#my_network = Network().double()
#my_network.forward(x)
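Note that `np.swapaxes(image, 0, 2)` turns a height x width x channel array into channel x width x height, so the image ends up transposed. If you want the usual channel x height x width layout for a convolutional network, `transpose(2, 0, 1)` does that. The sketch below uses a tiny fake state so that it runs without the environment; scaling the pixels to [0, 1] is a common preprocessing choice, not something the environment requires.

```python
import numpy as np

size = 4                                    # tiny stand-in resolution for this example
state = list(range(size * size * 3))        # fake flattened state; the values are arbitrary

# Same conversion as in the skeleton comments: height x width x channel
image = np.array(state, dtype=np.uint8).reshape(size, size, 3)

# Channel-first layout, scaled to [0, 1]
chw = image.transpose(2, 0, 1).astype(np.float32) / 255.0

# Add a batch dimension: (1, 3, size, size)
batch = chw[np.newaxis]
print(batch.shape)
```

The resulting array can be handed to PyTorch with `torch.from_numpy(batch)`.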

In [None]:
action = agent.step(end, reward, state)
end, reward, state = env.step(action)
action = agent.step(end, reward, state)  # step expects (end, reward, state), as in the cells above