# Reinforcement learning with Foolsball
- Reinforcement learning is learning to make decisions from experience.
- Games are a good testbed for RL.
 

# About Foolsball v3.0
- 5x4 playground that provides a football/foosball-like environment.
- A controllable player:
  - always spawned in the top-left corner
  - displayed as '⚽'
  - can move North, South, East or West.
  - **Movements have some uncertainty associated with them.**
  - can be controlled algorithmically
- A number of **dynamic** opponents, each represented by 👕, that occupy certain locations on the field.
- **The opponents can move up or down randomly independent of each other**
- A goalpost 🥅 that is fixed in the bottom right corner

## Goals
### Primary goal
- We want the agent to learn to reach the goalpost 

### Secondary goals
- We may want the agent to learn to be efficient in some sense, for example, take the shortest path to the goalpost. 

## Rules 
- Initial rules:
    - **The ball can be moved (or attempted to be moved) in five ways: \['n','e','w','s','x'\], with 'x' meaning the ball is held in its current position.**
    - **The environment is stochastic. So there's a small probability that a bad action gets triggered instead of the desired action.**
    - Move the ball to an unmarked position: -1 points, game continues
    - Move the ball to a position marked by a defender: -5 points, **game continues**
    - Try to move the ball outside the field: -1 points (ball stays in its previous position), game continues
    - Move the ball into the goal post position: +5, game terminates
    - **Each opponent can randomly move up or down in its column**
    - **The agent can sense the presence or absence of defenders in the adjacent cells.**
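
The stochastic-movement rule can be made concrete with a small sketch. It assumes, as the `step()` method further down does, that the slip probability is spread evenly over the four unintended actions; the helper name `action_distribution` is hypothetical and not part of the environment.

```python
import numpy as np

ACTIONS = ['n', 'e', 'w', 's', 'x']

def action_distribution(intended, slip_prob, actions=ACTIONS):
    """Return the probability of each action actually being executed."""
    # Assumption: the slip probability is shared equally by the other actions.
    probs = np.full(len(actions), slip_prob / (len(actions) - 1))
    probs[actions.index(intended)] = 1.0 - slip_prob
    return probs

# With slip_prob=0.1 the intended action 'e' is executed 90% of the time,
# and each of the other four actions 2.5% of the time.
probs = action_distribution('e', slip_prob=0.1)
print(dict(zip(ACTIONS, probs.round(3))))

# Sample the action that actually gets executed.
rng = np.random.default_rng(0)
executed = rng.choice(ACTIONS, p=probs)
print(executed)
```

A larger `slip_prob` makes the agent's control over the ball less reliable, which is worth keeping in mind when interpreting learned policies.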


# Create the environment

In [1]:
import numpy as np

agent = '⚽'
opponent = '👕'
goal = '🥅'

arena = [['⚽', ' ' , '👕', ' ' ],
         [' ' , ' ' , ' ' , '👕'],
         [' ' , '👕', ' ' , ' ' ],
         [' ' , ' ' , ' ' , '👕'],
         [' ' , '👕', ' ' , '🥅']]

### Key questions
- What information should make up the state of the environment+agent now?
- How and why should the state be encoded?
- Is it possible to segregate next_state calculation and reward calculation as we have done in the past? 
- Think about the atomicity and order of the actions. Is the agent's movement followed by the defenders' movement equivalent to the defenders' movement followed by the agent's movement?

### Todos:
- Implement the new rules in the code below.

In [2]:
import pdb

class Foolsball(object):
    def __encode_indices__(self,row,col):
        """Convert from indices (row,col) to integer state."""
        return row*self.n_cols + col
    
    
    def __decode_indices__(self, state):
        """Convert from integer state to indices(row,col)"""
        row = state // self.n_cols
        col = state % self.n_cols
        return row,col
        
    def __make_observation__(self):
        player_obs = self.state['agent']
        
        player_row, player_col = self.__decode_indices__(player_obs)
        delta = ({'n':(-1,0), 'e':(0,+1), 'w':(0, -1), 's':(+1,0)})
        
        opponent_obs = {}
        for k in delta:
            scan_row, scan_column = player_row + delta[k][0], player_col + delta[k][1]
            
            if (0<=scan_row<self.n_rows) and (0<=scan_column<self.n_cols) and\
            self.__encode_indices__(scan_row, scan_column) in self.state['opponents']:
                opponent_obs[k] = 1
            else:
                opponent_obs[k] = 0
        
        return(player_obs,opponent_obs)
                
            
            
        

    def __deserialize__(self,map:list,agent:str,opponent:str, goal:str):
        """Convert a string representation of a map into a 2D numpy array
        Param map: list of lists of strings representing the player, opponents and goal.
        Param agent: string representing the agent on the map 
        Param opponent: string representing every instance of an opponent player
        Param goal: string representing the location of the goal on the map
        """
        ## Capture dimensions and map.
        self.n_rows = len(map)
        self.n_cols = len(map[0])
        self.map = np.asarray(map)

        ## Store string representations for printing the map, etc.
        self.agent_repr = agent
        self.opponent_repr  = opponent
        self.goal_repr = goal

        ## Find initial state, the desired goal state and the state of the opponents. 
        player_state = None
        goal_state = None
        opponents_states = []

        for row in range(self.n_rows):
            for col in range(self.n_cols):
                if map[row][col] == agent:
                    player_state = self.__encode_indices__(row,col)
                    self.map[row,col] = ' ' 

                elif map[row][col] == opponent:
                    opponents_states.append(self.__encode_indices__(row,col))
                    self.map[row,col] = ' ' 

                elif map[row][col] == goal:
                    goal_state = self.__encode_indices__(row,col)
        
        self.init_state = {'agent':player_state,'opponents':opponents_states, 'goal':goal_state}
       
        
        assert player_state is not None, f"Map {map} does not specify an agent {agent} location"
        assert goal_state is not None,  f"Map {map} does not specify a goal {goal} location"
        assert opponents_states,  f"Map {map} does not specify any opponents {opponent} location"

        return self.init_state
    
    
    def __init__(self,map,agent,opponent,goal,slip_prob):
        """Spawn the world, create variables to track state and actions."""
        # We need to track the location of the agent (the ball)
        # and the opponents.
        self.state = self.__deserialize__(map,agent,opponent,goal)
        self.done = False
        self.actions = ['n','e','w','s','x']
        self.n_actions = len(self.actions)
        self.slip_prob = slip_prob

        # Set up the rewards
        self.default_rewards = {'unmarked':-1, 'opponent':-5, 'outside':-1, 'goal':+5}
        self.set_rewards(self.default_rewards)
        
    def set_rewards(self,rewards):
        if not self.state == self.init_state:
            print('Warning: Setting reward while not in initial state! You may want to call reset() first.')
        for key in self.default_rewards:
            assert key in rewards, f'Key {key} missing from reward.'
        self.rewards = rewards
            
            
    def reset(self):
        """Reset the environment to its initial state."""
        # There's really just two things we need to reset: the state, which should
        # be reset to the initial state, and the `done` flag which should be 
        # cleared to signal that we are not in a terminal state anymore, even if we 
        # were earlier. 
        self.state = self.init_state
        self.done  = False
        return self.__make_observation__()
    
    
    def step(self,action):
        """Simulate state transition based on current state and action received."""
        assert not self.done, \
        f'You cannot call step() in a terminal state({self.state}). Check the "done" flag before calling step() to avoid this.'
        
        actions = self.actions
        selected_action_index = actions.index(action)
        bad_action_prob = self.slip_prob/(len(actions)-1)
        action_probs = np.ones(len(actions))* bad_action_prob
        action_probs[selected_action_index] = 1-self.slip_prob
        
        executed_action = np.random.choice(actions,p=action_probs)
        
        
        player_position, opponent_positions, goal_position =\
            self.state['agent'], self.state['opponents'], self.state['goal']
        
        action_to_index_delta = {'n':[-1,0], 'e':[0,+1], 'w':[0,-1], 's':[+1,0], 'x':[0,0]}
        
        # Simulate player's movement
        row, col = self.__decode_indices__(player_position)
        
        row_delta, col_delta = action_to_index_delta[executed_action]
        new_row , new_col = row+row_delta, col+col_delta

        ## Check and advance player
        if (0<=new_row<self.n_rows) and (0<=new_col<self.n_cols):
            new_player_position = self.__encode_indices__(new_row, new_col) 
            outside = False
        else:
            new_player_position = player_position 
            outside = True

        
        # Simulate opponents' movements
        new_opponent_positions = opponent_positions.copy()
        for i in range(len(opponent_positions)):
            row, col = self.__decode_indices__(opponent_positions[i])
            random_action = np.random.choice(['n','s','x'])
            
            row_delta, col_delta = action_to_index_delta[random_action]
            new_row , new_col = row+row_delta, col+col_delta
            
            if (0<=new_row<self.n_rows) and (0<=new_col<self.n_cols):
                new_opponent_position = self.__encode_indices__(new_row, new_col)
                
                ### Check that there's no other opponent in the new position
                ### and it is not a goal state either and update the position in 
                ### place
                if new_opponent_position not in new_opponent_positions\
                and new_opponent_position != goal_position:
                    new_opponent_positions[i] = new_opponent_position
                    
        
        # Calculate reward and done flags.
        # The if conditions can overlap and reward needs to be accumulated.
        # For example an (attempted) outside followed by a capture by an opponent.
        reward = 0
        done = False
        normal_move = True
        if new_player_position == goal_position:
            reward += self.rewards['goal']
            done = True
            normal_move = False
            
        if new_player_position in opponent_positions:
            reward += self.rewards['opponent']
            normal_move = False
        
        if outside:
            reward += self.rewards['outside']
            normal_move = False
        
        if normal_move:
            reward = self.rewards['unmarked']
            

        self.state = {'agent':new_player_position,'opponents':new_opponent_positions, 'goal':goal_position}
        self.done = done
        
        observation = self.__make_observation__()
        
        return observation, reward, done
    

    
    
    def render(self):
        """Pretty-print the environment and agent."""
        ## Create a copy of the map and change data type to accommodate
        ## 3-character strings
        _map = np.array(self.map, dtype='<U3')
        
        for opponent_position in self.state['opponents']:
            opp_row, opp_col = self.__decode_indices__(opponent_position)
            _map[opp_row,opp_col] = self.opponent_repr

        ## Mark unoccupied positions with special symbol.
        ## And add extra spacing to align all columns.
        for row in range(_map.shape[0]):
            for col in range(_map.shape[1]):
                if _map[row,col] == ' ':
                    _map[row,col] = ' + '

                elif _map[row,col] == self.opponent_repr: 
                    _map[row,col] =  self.opponent_repr + ' '

                elif _map[row,col] == self.goal_repr:
                    _map[row,col] = ' ' + self.goal_repr + ' '

        ## If current state overlaps with the goal state or one of the opponents'
        ## states, substitute a distinct marker.
        if self.state['agent'] == self.state['goal']:
            r,c = self.__decode_indices__(self.state['agent'])
            _map[r,c] = ' 🏁 '
        elif self.state['agent'] in self.state['opponents']:
            r,c = self.__decode_indices__(self.state['agent'])
            _map[r,c] = ' ❗ '
        else:
            r,c = self.__decode_indices__(self.state['agent'])
            _map[r,c] = ' ' + self.agent_repr

        for row in range(_map.shape[0]):
            for col in range(_map.shape[1]):
                print(f' {_map[row,col]} ',end="")
            print('\n') 

        print()


In [None]:
foolsball = Foolsball(arena, agent, opponent, goal, slip_prob=0.1)

In [None]:
foolsball.reset()
foolsball.render()

# Interact with the Environment

In [None]:
## Move: n,s,e,w
## Hold position: x
## Reset: r
## Quit: q
while True:
    try:
        act = input('>>')

        if act in foolsball.actions:
            print(foolsball.step(act))
            print()
            foolsball.render()
        elif act == 'r':
            print(foolsball.reset())
            print()
            foolsball.render()
        elif act == 'q':
            break
        else:
            print(f'Invalid input:{act}')
    except Exception as e:
        print(e)

# Override the default reward structure.
- Use a more sparse reward: {'unmarked':0, 'opponent':-5, 'outside':-1, 'goal':+5}

In [None]:
## Update reward structure to: {'unmarked':0, 'opponent':-5, 'outside':-1, 'goal':+5}
foolsball.reset()
foolsball.set_rewards({'unmarked':0, 'opponent':-5, 'outside':-1, 'goal':+5})

# Implement discounted returns
$$\text{Discounted Return} = R_{t_1} + \gamma R_{t_2} + \gamma^2 R_{t_3} + \dots + \gamma^{n-1} R_{t_n}$$
where $R_{t_k}$ is the reward after step $k$ and $\gamma$ is called the discount factor.
- Set the discount factor $\gamma$ to 0.9
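
As a quick numeric check of the formula, the sketch below computes the discounted return for a made-up reward sequence. The function name `discounted_return` and the rewards are illustrative assumptions and do not depend on the environment.

```python
def discounted_return(rewards, gamma):
    """Sum rewards weighted by increasing powers of gamma."""
    total, coeff = 0.0, 1.0
    for r in rewards:
        total += coeff * r   # adds gamma**k * r for the k-th reward (k from 0)
        coeff *= gamma
    return total

# Hypothetical path: two ordinary moves (-1 each), then the goal (+5).
rewards = [-1, -1, 5]

# -1 + 0.9*(-1) + 0.81*5, which is approximately 2.15
print(discounted_return(rewards, gamma=0.9))
```

With $\gamma = 1$ the same rewards would sum to 3, so discounting makes a goal reached later worth less than one reached sooner.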

In [None]:
def get_discounted_return(path, gamma=0):
    foolsball.reset()
    foolsball.render()
    _return_ = 0
    discount_coeff = 1
    for act in path: 
        next_state, reward, done = foolsball.step(act)
        _return_ += discount_coeff*reward
        discount_coeff *= gamma    

        foolsball.render()
        if done:
            break
            
    print(f'Return (accumulated reward): {_return_}')

In [None]:
HYPER_PARAMS = {'gamma':0.9}

# Intro to Policies

In [None]:
def greedy_policy_from_returns_tbl(table):
    policy = {s:None for s in table.index }
    for state in table.index:
        greedy_action = table.loc[state].idxmax()
        policy[state] = greedy_action
            
    return policy

In [None]:
def pretty_print_policy(policy):
    direction_repr = {'n':' 🡑 ', 'e':' 🡒 ', 'w':' 🡐 ', 's':' 🡓 ', 'x':' ✕ ', None:' ⬤ '}

    for row in range(foolsball.n_rows):
        for col in range(foolsball.n_cols):
            state = foolsball.__encode_indices__(row, col)
            print(direction_repr[policy[state]],end='')
        print()

# Dealing with incomplete knowledge of the environment

In [None]:
import numpy as np
def collect_random_episode():
    state = foolsball.reset()
    done = False
    episode = []
    
    while not done:
        action = np.random.choice(foolsball.actions)
        next_state, reward, done = foolsball.step(action)
        episode.append([state, action, reward])
        state = next_state
        
    return episode

In [None]:
ep = collect_random_episode()
foolsball.render()
print(ep)

# Implement discounted returns for episodes
- If an episode is $(s_1,a_1,r_1), (s_2,a_2,r_2), (s_3,a_3,r_3), (s_4)$, with $s_4$ being a terminal state:
  - The (discounted) return for $(s_1,a_1)$ is $r_1 + \gamma r_2 + \gamma^2 r_3$
  - The (discounted) return for $(s_2,a_2)$ is $r_2 + \gamma r_3$
  - The (discounted) return for $(s_3,a_3)$ is $r_3$
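
The three returns above satisfy the recursion $G_t = r_t + \gamma G_{t+1}$, so they can also be computed in a single backward pass. The sketch below is one possible alternative to a dot-product formulation; the name `returns_backward` and the example rewards are assumptions for illustration.

```python
def returns_backward(rewards, gamma):
    """Compute the discounted return of every step in one backward pass."""
    returns = [0.0] * len(rewards)
    running = 0.0  # return that follows the terminal state
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Hypothetical episode rewards r1, r2, r3.
# Approximately [2.15, 3.5, 5.0] for gamma = 0.9.
print(returns_backward([-1, -1, 5], gamma=0.9))
```

The backward pass does a constant amount of work per step, whereas recomputing a discounted sum for every suffix grows quadratically with episode length.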


In [None]:
def discounted_return_from_episode(ep, gamma=0):
    states, actions, rewards = list(zip(*ep))
    rewards = np.asarray(rewards)
    discount_coeffs = np.asarray([np.power(gamma,p) for p in range(len(rewards))])
    
    l = len(rewards)
    discounted_returns = [np.dot(rewards[i:],discount_coeffs[:l-i]) for i in range(l)]
    
    return (states, actions, discounted_returns)

In [None]:
discounted_return_from_episode(ep, gamma=HYPER_PARAMS['gamma'])

# Exploration-Exploitation with Epsilon Decay

In [None]:
def collect_epsilon_greedy_episode_from_returns_tbl(table, max_ep_len=20, epsilon=0.1):
    state = foolsball.reset()
    done = False
    episode = []
    
    for _ in range(max_ep_len):
        if done:
            break
    
        actions = table.columns
        action_probs = np.asarray([epsilon/len(actions)]*len(actions),dtype=float)

        greedy_action_index = np.argmax(table.loc[state].values)
        action_probs[greedy_action_index] += 1-epsilon

        epsilon_greedy_action = np.random.choice(table.columns,p=action_probs)

        next_state, reward, done = foolsball.step(epsilon_greedy_action)
        episode.append([state, epsilon_greedy_action, reward])
        state = next_state

    return episode

# Encoding the observations
- Number of states
- Encoding scheme
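
One possible encoding, offered only as a sketch: an observation returned by the environment is a pair of the agent's position (an integer) and a dictionary of four binary sensor readings, so the readings can be packed into 4 bits and combined with the position. The names `SENSOR_ORDER`, `encode_observation` and `n_encoded_states` are hypothetical, and the bit order is an arbitrary choice.

```python
SENSOR_ORDER = ('n', 'e', 'w', 's')  # assumed fixed bit order, most significant first

def encode_observation(observation):
    """Map (position, {'n':0/1, 'e':0/1, 'w':0/1, 's':0/1}) to a single integer."""
    position, sensors = observation
    bits = 0
    for key in SENSOR_ORDER:
        bits = (bits << 1) | int(sensors[key])
    return position * 2 ** len(SENSOR_ORDER) + bits

def n_encoded_states(n_rows, n_cols):
    """Number of distinct integers the encoding can produce."""
    return n_rows * n_cols * 2 ** len(SENSOR_ORDER)

# Agent in the top-left cell with a defender sensed to the south.
print(encode_observation((0, {'n': 0, 'e': 0, 'w': 0, 's': 1})))  # 1
print(n_encoded_states(5, 4))                                      # 320
```

Under this scheme a 5x4 field yields 320 table rows, many of which can never occur (for example, a defender sensed beyond the edge of the field), which is a simple but wasteful trade-off.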

# Constant Alpha

## The idea:
- Dividing the accumulated returns by the visit count has a nonlinear effect on the updates. (Go back to the previous step and see for yourself.)
- Don't divide at all!
- But we need to ensure that updates are small
- Idea:
 - ESTIMATED_RETURNS_TBL.loc[s,a] and ret are both estimates of the same quantity.
 - Use the difference between the two estimates, scaled by a small step size alpha, to update ESTIMATED_RETURNS_TBL.loc[s,a], much like a learning rate scales updates in Deep Learning.
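
The update rule can be tried in isolation on a single number. In the sketch below, `constant_alpha_update` is a hypothetical helper, and the sampled return is held fixed at 2.0 purely for illustration; in practice each episode produces a different, noisy return.

```python
def constant_alpha_update(estimate, sampled_return, alpha):
    """Move the estimate a fraction alpha of the way toward the sampled return."""
    return estimate + alpha * (sampled_return - estimate)

# One step with alpha=0.1 moves an estimate of 0 one tenth of the way to 2.
print(constant_alpha_update(0.0, 2.0, alpha=0.1))

# Repeated small steps drift toward the target without any visit counts.
estimate = 0.0
for _ in range(1000):
    estimate = constant_alpha_update(estimate, 2.0, alpha=0.01)
print(estimate)  # close to 2.0
```

A smaller alpha gives smoother but slower updates; a larger alpha tracks recent returns more closely at the cost of more noise.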

In [None]:
import pandas as pd
ESTIMATED_RETURNS_TBL = pd.DataFrame.from_dict({s:{a:0 for a in foolsball.actions} for s in range(foolsball.n_states)}, orient='index')

n_episodes = 5000
epsilon = 1
min_epsilon = 0.1
epsilon_decay = 0.999

alpha = 0.001

for i in range(n_episodes):
    estimated_returns = ESTIMATED_RETURNS_TBL
  
    epsilon = max(epsilon,min_epsilon)
    episode_i = collect_epsilon_greedy_episode_from_returns_tbl(estimated_returns,epsilon=epsilon)
    epsilon *= epsilon_decay
    states, actions, discounted_returns = discounted_return_from_episode(episode_i, gamma=HYPER_PARAMS['gamma'])

    for s,a,ret in zip(states, actions, discounted_returns):
        ESTIMATED_RETURNS_TBL.loc[s,a] += alpha*(ret - ESTIMATED_RETURNS_TBL.loc[s,a])

In [None]:
estimated_returns = ESTIMATED_RETURNS_TBL
print(estimated_returns)

policy0 = greedy_policy_from_returns_tbl(estimated_returns)
print(policy0)

pretty_print_policy(policy0)

# How can we get faster convergence?

- Try the SARSA and Q-learning approaches described [here](https://lilianweng.github.io/lil-log/2018/02/19/a-long-peek-into-reinforcement-learning.html#sarsa-on-policy-td-control) 

# Temporal Difference(TD) Learning
 - Learn at every step of an episode 
 - This translates into using an updated policy for selecting the action at each step 
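
To make the per-step update concrete, the sketch below compares the two standard TD targets on a tiny hand-written table. The dictionary `Q`, the helper names and the `done` handling (no bootstrapping from a terminal state) are assumptions for illustration, not the notebook's own code.

```python
def sarsa_target(Q, reward, next_state, next_action, gamma, done):
    """On-policy target: bootstrap from the action actually chosen next."""
    bootstrap = 0.0 if done else Q[next_state][next_action]
    return reward + gamma * bootstrap

def q_learning_target(Q, reward, next_state, gamma, done):
    """Off-policy target: bootstrap from the best action in the next state."""
    bootstrap = 0.0 if done else max(Q[next_state].values())
    return reward + gamma * bootstrap

# Toy table with two states and two actions (values are made up).
Q = {0: {'e': 0.0, 's': 0.0},
     1: {'e': 1.0, 's': 3.0}}

# Transition: reward -1, landing in state 1, next action chosen is 'e'.
sarsa = sarsa_target(Q, -1, 1, 'e', gamma=0.9, done=False)   # -1 + 0.9*1.0
qlearn = q_learning_target(Q, -1, 1, gamma=0.9, done=False)  # -1 + 0.9*3.0
print(sarsa, qlearn)
```

The targets differ only when the next action is not the greedy one, which happens whenever the epsilon-greedy policy explores.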

In [None]:
def epsilon_greedy_action_from_Q(Q, state, epsilon):
    actions = Q.columns
    action_probs = np.asarray([epsilon/len(actions)]*len(actions),dtype=float)
    
    greedy_action_index = np.argmax(Q.loc[state].values)
    action_probs[greedy_action_index] += 1-epsilon

    epsilon_greedy_action = np.random.choice(Q.columns,p=action_probs)
    
    return epsilon_greedy_action

## SARSA (State-Action-Reward-State-Action)

In [None]:
from tqdm import tqdm 
Q = pd.DataFrame.from_dict({s:{a:0 for a in foolsball.actions} for s in range(foolsball.n_states)}, orient='index')

n_episodes = 2000
epsilon = 1
min_epsilon = 0.1
epsilon_decay = 0.9995

alpha = 0.01


for i in tqdm(range(n_episodes)):
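    # reset() returns the initial observation; it must be encoded to an
    # integer state (matching the index of Q) before it can index the table.
    s0 = foolsball.reset()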
    a0 = epsilon_greedy_action_from_Q(Q,s0,epsilon)
    done = False
    
    while not done:
        s1, reward, done  = foolsball.step(a0)
        a1 = epsilon_greedy_action_from_Q(Q,s1,epsilon)
        
        Q.loc[s0,a0] += alpha*(reward + HYPER_PARAMS['gamma']*Q.loc[s1,a1] - Q.loc[s0,a0])
        
        s0, a0 = s1, a1
  
    epsilon *= epsilon_decay
    epsilon = max(epsilon,min_epsilon)
    
    if (i+1)%500 == 0:
        print(f'Iteration {i+1}')
        policy = greedy_policy_from_returns_tbl(Q)
        pretty_print_policy(policy)
        

policy_SARSA = greedy_policy_from_returns_tbl(Q)
print(policy_SARSA)

pretty_print_policy(policy_SARSA)
        

## Q-Learning

In [None]:
from tqdm import tqdm 
Q = pd.DataFrame.from_dict({s:{a:0 for a in foolsball.actions} for s in range(foolsball.n_states)}, orient='index')

n_episodes = 10000
epsilon = 1
min_epsilon = 0.1
epsilon_decay = 0.9995

alpha = 0.01
rewards = np.zeros(n_episodes)


for i in tqdm(range(n_episodes)):
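    # reset() returns the initial observation; it must be encoded to an
    # integer state (matching the index of Q) before it can index the table.
    s0 = foolsball.reset()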
    done = False
    
    episode_reward = 0
    while not done:
        a0 = epsilon_greedy_action_from_Q(Q,s0,epsilon)
        s1, reward, done  = foolsball.step(a0)
        
        Q.loc[s0,a0] += alpha*(reward + HYPER_PARAMS['gamma']*Q.loc[s1].max() - Q.loc[s0,a0])
        episode_reward += reward
        
        s0 = s1
  
    epsilon *= epsilon_decay
    epsilon = max(epsilon,min_epsilon)
    
    rewards[i] = episode_reward
    
    if (i+1)%500 == 0:
        print(f'Iteration {i+1}')
        policy = greedy_policy_from_returns_tbl(Q)
        pretty_print_policy(policy)
        #print(Q)
        

policy_Q_Learning = greedy_policy_from_returns_tbl(Q)
print(policy_Q_Learning)

pretty_print_policy(policy_Q_Learning)

In [None]:
Q

In [None]:
import matplotlib.pyplot as plt
windowed_rewards = np.convolve(rewards, np.ones(100), 'valid')
plt.plot(windowed_rewards/100)
plt.show()