# A project in Multi-agent Systems: Stage 1

## Reward Structure

The reward structure is as follows. The agent receives:

- +5 points for reaching location A (the item's position) for the first time, which picks up the item.
- +20 points for reaching location B (the target position) while carrying the item. This is possible because the environment tracks whether the agent has picked up the item.
- -1 point for each step taken.
- 0 points for standing still.
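
The rules in the list above can be sketched as a single per-step function. This is only an illustration: `step_reward` and its three boolean parameters are hypothetical names that do not appear in the notebook, and the environment below exposes only the four movement actions, so the "standing still" case is included purely to mirror the list.

```python
def step_reward(moved, picked_up_item, delivered_item):
    """Reward for one time step, following the list above.

    Hypothetical helper for illustration; not part of the notebook's GridWorld.
    """
    reward = 0
    if moved:
        reward -= 1   # -1 point for each step taken
    if picked_up_item:
        reward += 5   # +5 points for reaching location A
    if delivered_item:
        reward += 20  # +20 points for reaching location B while carrying the item
    return reward


print(step_reward(moved=True, picked_up_item=True, delivered_item=False))    # 4
print(step_reward(moved=True, picked_up_item=False, delivered_item=True))    # 19
print(step_reward(moved=False, picked_up_item=False, delivered_item=False))  # 0
```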


## Imports

In [13]:
import numpy as np

## Environment Setup

In [14]:
class GridWorld:
    def __init__(self, grid_size=5):
        self.grid_size = grid_size
        self.reset()

    def reset(self):
        self.reward = 0
        # Fixed positions for testing; swap in the commented lines for random placement.
        # np.random.randint excludes the upper bound, so grid_size (not grid_size - 1) covers the whole grid.
        self.agent_position = (0, 0)
        #self.agent_position = (np.random.randint(0, self.grid_size), np.random.randint(0, self.grid_size))
        self.item_position = (1, 0)
        #self.item_position = (np.random.randint(0, self.grid_size), np.random.randint(0, self.grid_size))
        self.target_position = (2, 0)
        #self.target_position = (np.random.randint(0, self.grid_size), np.random.randint(0, self.grid_size))

        self.carrying_item = False
        self.done = False
        return self._get_state()
    
    def _get_state(self):
        return (self.agent_position, self.item_position, self.carrying_item, self.reward, self.done)

    def step(self, action):
        x, y = self.agent_position
        # x is the vertical position (row); y is the horizontal position (column).
        # Moves are clamped so the agent cannot leave the grid.
        if action == 0:  # north
            pos = (max(x - 1, 0), y)
        elif action == 1:  # south
            pos = (min(x + 1, self.grid_size - 1), y)
        elif action == 2:  # west
            pos = (x, max(y - 1, 0))
        elif action == 3:  # east
            pos = (x, min(y + 1, self.grid_size - 1))
        else:
            raise ValueError("Invalid action. Choose an integer from 0 to 3.")
        
        self.agent_position = pos

        # Every step costs 1 point, including a move that is blocked by the grid edge.
        # Note that self.reward is a running total for the episode.
        self.reward -= 1

        # Check whether we picked up the item
        if self.agent_position == self.item_position and not self.carrying_item:
            print('gotitem')
            self.reward += 5  # +5 points for reaching location A
            self.carrying_item = True

        # Check whether we're done
        if self.agent_position == self.target_position and self.carrying_item:
            self.reward += 20  # +20 points for reaching location B when carrying the item
            print('gotthere')
            self.done = True

        return self._get_state(), self.reward, self.done




In [15]:
env = GridWorld(grid_size=5) 
state = env.reset()

# Quick test. Change the positions in reset() to try other layouts.
actions = [1, 1, 1, 1, 1, 1, 1]  # 0 = north, 1 = south, 2 = west, 3 = east (see step())
for action in actions:
    state, reward, done = env.step(action)
    print(f"reward: {reward}, done: {done}")
    if done:
        break


gotitem
reward: 4, done: False
gotthere
reward: 23, done: True
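
The rewards printed above are running totals, because `step` returns `self.reward`, which accumulates over the episode. A Q-learning update normally works with the reward of a single transition, so one possible way to recover per-step rewards is to take differences between consecutive totals. The sketch below assumes the two totals from the run above; the variable names are illustrative and not part of the notebook.

```python
# Running totals as printed by the test run above.
cumulative_rewards = [4, 23]

# Pair each total with the one before it (0 before the first step)
# and subtract to get the reward earned on that step alone.
per_step_rewards = [
    current - previous
    for previous, current in zip([0] + cumulative_rewards, cumulative_rewards)
]

print(per_step_rewards)  # [4, 19]
```

The first step yields -1 + 5 = 4 (one move plus the item pickup) and the second yields -1 + 20 = 19 (one move plus the delivery), which matches the reward structure.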


## Q-learning Algorithm

## Training Phase

## Evaluation Phase

## Visualisation

## Conclusion