# INM707 Coursework
### Aaron Mir (Student Number: 160001207)
<img src="All_Tasks.png" alt="All_TasksOverview" width="700"/>  <img src="Task_1.png" alt="Task_1" width="700"/>

In [None]:
##--------------------Coding References---------------------##
# Percentage of borrowed code: X% - 
# [1] 

Environment: The agent is preparing for an upcoming mission in which it must infiltrate an enemy stronghold to gather intelligence on a potential coup d'état. The enemy land is an NxN grid containing one stronghold of size N/2 x N or N x N/2, which starts on either side of the middle column (decided randomly), and a wide river surrounds the entire enemy land. The stronghold has (1/4)N entrances from the mainland and contains dangerous enemy combatants, placed at random positions, that move at every step (stochasticity). Movement inside the stronghold is deterministic: no transition probability is associated with it. The land surrounding the stronghold (the shore) is booby-trapped and covered in mist, so it contains both normal land and land that can kill or hurt the agent, and every step outside the stronghold is stochastic: the agent moves to the chosen state with probability 0.7 and to each of the other 3 with probability 0.1. The amount of shore that has traps on it is (1/4)N x rows of shore. The row/column of cells immediately beside the stronghold is normal land. The goal of the agent is to infiltrate the stronghold and gather the intelligence without being seen or killed by enemy combatants or booby traps.
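
The shore transition rule described above (0.7 for the chosen direction, 0.1 for each of the other three) can be sketched as a small sampling helper. This is only an illustration and not part of the `Stronghold` class: the name `sample_shore_action`, the `p_chosen` parameter and the use of `np.random.default_rng` are assumptions made for this example.

```python
import numpy as np

ACTIONS = ('up', 'down', 'left', 'right')

def sample_shore_action(chosen, rng, p_chosen=0.7):
    """Return the action actually executed on the misty shore (hypothetical helper)."""
    others = [a for a in ACTIONS if a != chosen]
    p_other = (1.0 - p_chosen) / len(others)        # 0.1 each when p_chosen is 0.7
    probs = [p_chosen] + [p_other] * len(others)
    return str(rng.choice([chosen] + others, p=probs))

rng = np.random.default_rng(0)
draws = [sample_shore_action('up', rng) for _ in range(10_000)]
frac_up = draws.count('up') / len(draws)            # should be close to 0.7
```

Inside the stronghold the chosen action would be executed directly, since the land there is deterministic.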

The agent starts in a random cell which belongs to the land surrounding the stronghold. 

State of the agent: Governed by the index of the cell it is on.
Set of states of the environment: Governed by the index of the agent and the index of moving enemies.
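
As a rough sketch of the two state definitions above, a cell coordinate can be flattened to a single index, and the environment state can be formed from the agent index together with the enemy indices. The helper names `coord_to_index`, `index_to_coord` and `env_state` are hypothetical and do not appear in the notebook.

```python
import numpy as np

def coord_to_index(row, col, size):
    """Flatten a (row, col) cell of a size x size grid to one integer index."""
    return int(np.ravel_multi_index((row, col), (size, size)))

def index_to_coord(index, size):
    """Inverse of coord_to_index."""
    row, col = np.unravel_index(index, (size, size))
    return int(row), int(col)

def env_state(agent, enemies, size):
    """Environment state: agent index plus the sorted indices of all enemies."""
    enemy_idx = tuple(sorted(coord_to_index(r, c, size) for r, c in enemies))
    return coord_to_index(agent[0], agent[1], size), enemy_idx

state = env_state((4, 19), [(2, 1), (2, 2)], size=21)
```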

Rewards: +10 for getting intelligence, -100 for getting hurt by a combatant or booby trap, -1 for moving into a wall or water

Terminal States: the agent reaches the intelligence, the agent moves onto a trap and dies, the agent moves onto an enemy combatant, or the agent runs out of time

_ represents normal land [0], X represents a river/wall [1], A represents the agent [2], T represents a booby trap [3], E represents an enemy [4], I represents the intelligence [5]
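
The rewards, terminal states and cell codes listed above can be gathered into one lookup, sketched below. The names `CELL_REWARD`, `TERMINAL_CELLS` and `reward_and_done` are assumptions for illustration, and the sketch covers only the per-cell rewards stated in the Rewards line.

```python
# cell code -> reward for moving onto that cell
CELL_REWARD = {0: 0,      # normal land
               1: -1,     # river/wall
               3: -100,   # booby trap
               4: -100,   # enemy combatant
               5: 10}     # intelligence
TERMINAL_CELLS = {3, 4, 5}

def reward_and_done(cell_code, time_elapsed, time_limit):
    """Return (reward, done) for a move onto a cell with the given code."""
    code = int(cell_code)                                   # land is stored as floats
    done = code in TERMINAL_CELLS or time_elapsed >= time_limit
    return CELL_REWARD[code], done
```

For example, `reward_and_done(5.0, 0, 441)` gives the intelligence reward and ends the episode, while a move onto normal land returns a reward of 0 and continues.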


Checklist:

The MDP consists of states, a transition probability, a reward function, and also actions

    1. Establish states (done, represented by the coordinates of the agent on the grid)
    2. Establish transitions matrix.
    3. Establish transition probability matrix. - deterministic right now
    4. Establish rewards matrix.
    5. Establish return G.
    6. Sort out enemy movement.
    7. Implement partial observability?
    8. Improve visualisation.
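
For the return G item in the checklist, one common choice is the discounted sum of rewards over an episode. The sketch below assumes a discount factor `gamma`, which the notebook does not specify; both the function name and the default value are placeholders.

```python
def discounted_return(rewards, gamma=0.9):
    """G = r_0 + gamma * r_1 + gamma^2 * r_2 + ... computed backwards over one episode."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

episode_rewards = [-1, -1, 10]                 # two ordinary steps, then the intelligence
G = discounted_return(episode_rewards, gamma=0.5)
```

With `gamma=1.0` this reduces to the plain sum of the rewards collected in the episode.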

In [1]:
import numpy as np
import matplotlib.pyplot as plt

In [2]:
class Stronghold():
    def __init__(self, size):
        self.size = size
        self.actions = {0: 'Up', 1: 'Down', 2: 'Left', 3: 'Right'}
        self.land = self.env_gen()
        self.position_agent = None                                 # initial position of the agent will be decided by resetting the environment
        self.time_elapsed = 0                                      # run time
        self.time_limit = self.size**2
        self.R = self.fill_reward_matrix()
        self.P = self.fill_transition_probability_matrix()
        self.dict_map_display={ 0:'_',  # normal land
                                1:'X',  # river/wall
                                2:'A',  # agent
                                3:'T',  # trap
                                4:'E',  # enemy
                                5:'I'}  # intelligence
                    
    def env_gen(self):
        land = np.zeros((self.size, self.size))
        land[0,:] = 1                                               # establish the river
        land[:,0] = 1
        land[self.size-1,:] = 1
        land[:,self.size-1] = 1
        self.column_choice = np.random.choice((self.size//2-1, self.size//2+1)) # random choice whether stronghold starts from the left or right of the 'central' column
        land[1:self.size-1, self.column_choice] = 1
        if self.column_choice == self.size//2-1:                              # if stronghold is on left
            land[1, 0:self.column_choice] = 1                                # establish the walls of stronghold
            land[self.size-2, 0:self.column_choice] = 1
            land[1:self.size-1, 0] = 1                                   
            for col in land[1:self.size-1, self.column_choice+2:self.size-1].T:
                traps = []
                for i in range(int(np.round(1/4*len(col)))):        # make as many traps as 1/4 of the length of each column in the shore
                    trap = np.random.choice(np.setdiff1d(range(len(col)), traps))
                    col[trap] = 3
                    traps.append(trap)
            for col in land[2:self.size-2, 1:self.column_choice].T:
                enemies = []
                for i in range(int(np.round(1/4*len(col)))):        # populate enemies randomly inside the stronghold
                    enemy = np.random.choice(np.setdiff1d(range(len(col)), enemies))
                    col[enemy] = 4
                    enemies.append(enemy)
            intel_row = np.random.randint(1, len(land[2:self.size-2, 1:self.column_choice-1]))
            intel_col = np.random.randint(1, len(land[2:self.size-2, 1:self.column_choice-1].T))
            land[intel_row+2][self.column_choice-intel_col] = 5            # randomly insert intelligence into stronghold 
        else:                                                              # if stronghold is on right                          
            land[1, self.column_choice:self.size] = 1
            land[self.size-2, self.column_choice:self.size] = 1
            land[1:self.size-1, self.size-1] = 1
            for col in land[1:self.size-1, 1:self.column_choice-1].T:
                traps = []
                for i in range(int(np.round(1/4*len(col)))):        # make as many traps as 1/4 of the length of each column in the shore
                    trap = np.random.choice(np.setdiff1d(range(len(col)), traps))
                    col[trap] = 3
                    traps.append(trap)
            for col in land[2:self.size-2, self.column_choice+1:self.size-1].T:
                enemies = []
                for i in range(int(np.round(1/4*len(col)))):        # populate enemies randomly inside the stronghold
                    enemy = np.random.choice(np.setdiff1d(range(len(col)), enemies))
                    col[enemy] = 4
                    enemies.append(enemy)
            intel_row = np.random.randint(1, len(land[2:self.size-2, self.column_choice+1:self.size-1]))              
            intel_col = np.random.randint(1, len(land[2:self.size-2, self.column_choice+1:self.size-1].T))
            land[intel_row+2][self.column_choice+intel_col] = 5            # randomly insert intelligence into stronghold
        entrances = []
        for i in range(int(np.round(1/4*len(land[2:self.size-2, self.column_choice])))):        # make as many entrances as 1/4 of the length of the front wall 
            entrance = np.random.choice(np.setdiff1d(range(len(land[2:self.size-2, self.column_choice])), entrances))
            land[2:self.size-2, self.column_choice][entrance] = 0
            entrances.append(entrance)
        return land
        
    def get_empty_cells_shore(self, n_cells):
        if self.column_choice == self.size//2-1:
            empty_cells_coord = np.where(self.land[1:self.size-1, self.column_choice+1:self.size-1] == 0)
            selected_indices = np.random.choice(np.arange(len(empty_cells_coord[0])), n_cells)
            selected_coordinates = empty_cells_coord[0][selected_indices]+1, empty_cells_coord[1][selected_indices]+len(self.land[1][:self.column_choice+1])
        if self.column_choice == self.size//2+1:
            empty_cells_coord = np.where(self.land[1:self.size-1, 1:self.column_choice] == 0)
            selected_indices = np.random.choice(np.arange(len(empty_cells_coord[0])), n_cells)
            selected_coordinates = empty_cells_coord[0][selected_indices]+1, empty_cells_coord[1][selected_indices]+1   # +1 on both axes: the slice starts at row 1 and column 1
        if n_cells == 1:
            return np.asarray(selected_coordinates).reshape(2,)
        return selected_coordinates

    def step(self, action):
        # enemies move randomly - they do not move if their choice is a wall, the intelligence, another enemy or the stronghold entrance column
        for i, j in zip(*np.where(self.land == 4)):
            move = np.random.choice(('up', 'down', 'left', 'right'))
            if move == 'up' and self.land[i-1][j] != 1 and self.land[i-1][j] != 4 \
                and self.land[i-1][j] != 5:
                self.land[i][j] = 0
                self.land[i-1][j] = 4
            if move == 'down' and self.land[i+1][j] != 1 and self.land[i+1][j] != 4 \
                and self.land[i+1][j] != 5:
                self.land[i][j] = 0
                self.land[i+1][j] = 4
            if move == 'left' and self.land[i][j-1] != 1 and self.land[i][j-1] != 4 \
                and self.land[i][j-1] != 5 and j-1 != self.column_choice:
                self.land[i][j] = 0
                self.land[i][j-1] = 4
            if move == 'right' and self.land[i][j+1] != 1 and self.land[i][j+1] != 4 \
                and self.land[i][j+1] != 5 and j+1 != self.column_choice:
                self.land[i][j] = 0
                self.land[i][j+1] = 4

        # add partial observability?
        
        # agent moves
        current_position = np.array((self.position_agent[0], self.position_agent[1])) # saving the current position in case agent hits a wall
        reward_step = 0
        done = False
        if action == 'up':                                          # action is 'up', 'down', 'left', or 'right'
            self.position_agent[0] -= 1
        elif action == 'down':
            self.position_agent[0] += 1
        elif action == 'left':
            self.position_agent[1] -= 1
        elif action == 'right':
            self.position_agent[1] += 1

        # calculate total reward
        row, col = self.position_agent
        if self.land[row][col] == 1:                                # wall or water: penalise and stay in place
            reward_step -= 1
            self.position_agent = current_position
            row, col = self.position_agent
        if self.land[row][col] == 3 or self.land[row][col] == 4:    # trap or enemy: terminal state
            reward_step -= 100
            done = True
        if self.land[row][col] == 5:                                # intelligence: terminal state
            reward_step += 10
            done = True

        # implement transition probabilities proba_of_tripping = [0.1, 0.1, 0.1, 0.7]

        # calculate observations
        # observations = self.observe()                             # observe() is not implemented yet

        # time-limit termination condition (must not overwrite a termination set above)
        if self.time_elapsed >= self.time_limit:
            done = True
        else:
            self.time_elapsed += 1              # update time
            reward_step -= 1                    # negative reward per time-step

        new_state = self.position_agent.copy()  # the state of the agent is the cell it is on
        return new_state, reward_step, done
    
    def reset(self):
        self.time_elapsed = 0                                                 # put time_elapsed to 0
        self.position_agent = np.asarray(self.get_empty_cells_shore(1))       # position of the agent is a random cell on the shore numpy array
        
        # Calculate observations
        #observations = self.calculate_observations()
        #return observations

    def render(self):                                                       # displays the land
        envir_with_agent = self.land.copy()
        envir_with_agent[self.position_agent[0], self.position_agent[1]] = 2
        full_repr = ""
        for r in range(self.size):
            line = ""
            for c in range(self.size):
                string_repr = self.dict_map_display[envir_with_agent[r,c]]    
                line += "{0:2}".format(string_repr)
            full_repr += line + "\n"
        print(full_repr)

    def fill_transition_probability_matrix(self):
        # P[i][j][a] is the cell reached from cell (i, j) under action a (0: up, 1: down, 2: left, 3: right),
        # or None if the move would leave the grid - N*N states, 4 possible actions, deterministic for now
        n = len(self.land)
        P = np.empty((n, n, 4), dtype=object)   # explicit object array, since cells hold either coordinate arrays or None
        for i in range(n):
            for j in range(n):
                P[i, j, 0] = np.array([i-1, j]) if i-1 >= 0 else None      # up
                P[i, j, 1] = np.array([i+1, j]) if i+1 <= n-1 else None    # down
                P[i, j, 2] = np.array([i, j-1]) if j-1 >= 0 else None      # left
                P[i, j, 3] = np.array([i, j+1]) if j+1 <= n-1 else None    # right
        return P
        
    def fill_reward_matrix(self): # change the values to reward values?
        reward_matrix = []
        for i in range(len(self.land)):
            for j in range(len(self.land)):
                if i-1 < 0:
                    reward_up = None
                else: reward_up = self.land[i-1][j]
                if i+1 > len(self.land)-1:
                    reward_down = None
                else: reward_down = self.land[i+1][j]
                if j-1 < 0:
                    reward_left = None
                else: reward_left = self.land[i][j-1]
                if j+1 > len(self.land)-1:
                    reward_right = None
                else: reward_right = self.land[i][j+1]
                reward_matrix.append([reward_up, reward_down, reward_left, reward_right])
        reward_array = np.array(reward_matrix).reshape(self.size*self.size, 4) 
        R = np.vsplit(reward_array, self.size) # R maps the position of the agent (the state) and action to rewards
        return R


In [3]:
# the state of the agent is entirely described by either the coordinates of the cell it is on, or the index of the cell.
# the value function (can be represented as a dictionary, or an array) that maps the state to the value of the state. As we don't know the true value at the beginning, it will be initialized at 1.
# the current state is given by self.position_agent, i.e. the coordinates (index) of the cell the agent is on
# the actions will be the direct cells that an agent can go to from a particular cell or up down left right
# the rewards will be given to a robot if a cell/state is directly reachable from the current state.
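
The notes above describe a value function that maps each state to a value and is initialised at 1. A minimal sketch of that table as an array follows; the names `V` and `value_of` are assumptions, and a dictionary keyed by cell index would serve equally well.

```python
import numpy as np

size = 21
V = np.ones((size, size))            # one value per cell, initialised at 1 as noted above

def value_of(position, V):
    """Look up the value of the state given by a (row, col) position."""
    return V[position[0], position[1]]

current_value = value_of(np.array([4, 19]), V)
```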

In [91]:
stronghold = Stronghold(21)
stronghold.reset()
stronghold.render()

X X X X X X X X X X X X X X X X X X X X X 
X X X X X X X X X X _ T T _ T _ _ T _ _ X 
X E E _ E _ _ _ _ _ _ _ _ _ _ _ _ _ T _ X 
X _ _ E _ _ E _ _ X _ _ T _ _ T _ _ _ T X 
X _ E _ _ _ _ _ _ X _ T _ _ _ T _ _ _ _ X 
X _ _ _ _ _ _ _ _ _ _ _ T A _ T T T _ _ X 
X _ _ _ _ _ _ _ _ X _ _ _ _ T _ T _ T _ X 
X _ _ E _ E _ _ _ X _ _ T T _ _ _ T T T X 
X _ _ _ E _ _ E _ X _ _ _ T _ T _ _ _ _ X 
X _ _ _ _ E _ E E _ _ _ _ _ _ _ _ _ _ _ X 
X _ _ _ E _ _ _ _ X _ _ _ T _ _ _ _ T _ X 
X E _ _ _ _ I _ E X _ T _ _ _ _ T T _ T X 
X E _ _ E E _ _ _ X _ T _ T _ T _ T _ T X 
X _ E E _ _ _ E _ X _ _ T _ _ _ _ _ _ _ X 
X _ _ _ _ _ E _ _ _ _ _ _ _ _ _ _ _ _ _ X 
X _ E _ _ E E E _ X _ _ _ _ _ _ _ _ _ _ X 
X E _ E _ _ E _ E X _ _ _ _ T _ T _ _ _ X 
X _ _ _ _ _ _ _ E X _ T _ _ T _ _ _ _ _ X 
X _ _ _ _ _ _ _ _ X _ _ _ T _ _ T _ T _ X 
X X X X X X X X X X _ _ _ _ T _ _ _ _ T X 
X X X X X X X X X X X X X X X X X X X X X 



In [16]:
stronghold.land[0][0]

1.0

In [154]:
stronghold.position_agent

array([ 4, 19], dtype=int64)

In [155]:
np.shape(stronghold.P)

(21, 21, 4)

In [186]:
stronghold.P[9][4]

array([array([list([8, 4]), 1.0], dtype=object),
       array([list([10, 4]), 1.0], dtype=object),
       array([list([9, 3]), 1.0], dtype=object),
       array([list([9, 5]), 1.0], dtype=object)], dtype=object)

In [160]:
stronghold.R[9][4]

array([4.0, 0.0, 0.0, 0.0], dtype=object)

In [16]:
class Policies():
    def __init__(self):
        pass
    
    def random_policy(self):
        pass
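
The empty `random_policy` above could be filled in along the following lines. This is a sketch under assumptions rather than the intended implementation: the class name `RandomPolicy`, the `seed` argument and the choice to return an action index (matching the keys of `Stronghold.actions`) are all illustrative.

```python
import numpy as np

class RandomPolicy:
    """Picks one of the four actions uniformly at random, ignoring the state."""
    def __init__(self, n_actions=4, seed=None):
        self.n_actions = n_actions
        self.rng = np.random.default_rng(seed)

    def __call__(self, state=None):
        # index into {0: 'Up', 1: 'Down', 2: 'Left', 3: 'Right'}
        return int(self.rng.integers(self.n_actions))

policy = RandomPolicy(seed=0)
sampled = [policy() for _ in range(1000)]
```

Converting the returned index to the action name expected by the environment is left to the caller.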

# Ignore

In [None]:
    def fill_transition_probability_matrix(self):
        # deterministic version of P: every in-grid move succeeds with probability 1.0, whatever the cell type
        n = len(self.land)
        det_P = np.empty((n, n, 4), dtype=object)
        for i in range(n):
            for j in range(n):
                neighbours = [(i-1, j), (i+1, j), (i, j-1), (i, j+1)]               # up, down, left, right
                for a, (r, c) in enumerate(neighbours):
                    if 0 <= r < n and 0 <= c < n:
                        det_P[i, j, a] = np.array([[r, c], 1.0], dtype=object)     # [next cell, transition probability]
                    else:
                        det_P[i, j, a] = None                                      # move would leave the grid
        return det_P