[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/PetiteIA/schema_mechanism/blob/master/notebooks/agent6.ipynb)

# THE AGENT WHO FOLLOWED ANCIENT CLUES

# Learning objectives

Upon completing this lab, you will be able to implement a developmental agent that learns sequences of interactions.

## Define the necessary classes

Let's improve the Interaction class a little bit.

In [1]:
class Interaction:
    """An interaction is a tuple (action, outcome) with a valence"""
    def __init__(self, action, outcome, valence):
        self._action = action
        self._outcome = outcome
        self._valence = valence

    def get_action(self):
        """Return the action"""
        return self._action

    def get_outcome(self):
        """Return the outcome"""
        return self._outcome

    def get_valence(self):
        """Return the valence"""
        return self._valence

    def key(self):
        """ The key to find this interaction in the dictionary is the string '<action><outcome>'. """
        return f"{self._action}{self._outcome}"

    def pre_key(self):
        return self.key()

    def __str__(self):
        """ Print interaction in the form '<action><outcome>:<valence>' for debug."""
        return f"{self._action}{self._outcome}:{self._valence}"

    def __eq__(self, other):
        """ Interactions are equal if they have the same key """
        if isinstance(other, self.__class__):
            return self.key() == other.key()
        else:
            return False            

In [2]:
class CompositeInteraction:
    """A composite interaction is a tuple (pre_interaction, post_interaction) and a weight"""
    def __init__(self, pre_interaction, post_interaction):
        self.pre_interaction = pre_interaction
        self.post_interaction = post_interaction
        self.weight = 1
        self.isActivated = False

    def get_action(self):
        """Return the action of the post interaction"""
        return self.post_interaction.get_action()
    
    def get_valence(self):
        """Return the valence of the pre_interaction plus the valence of the post_interaction"""
        return self.pre_interaction.get_valence() + self.post_interaction.get_valence()

    def reinforce(self):
        """Increment the composite interaction's weight"""
        self.weight += 1

    def key(self):
        """ The key to find this interaction in the dictionary is the string '<pre_interaction><post_interaction>'. """
        return f"({self.pre_interaction.key()},{self.post_interaction.key()})"

    def pre_key(self):
        """Return the key of the pre_interaction"""
        return self.pre_interaction.pre_key()

    def __str__(self):
        """ Print the interaction in the Newick tree format (pre_interaction, post_interaction: weight) """
        return f"({self.pre_interaction}, {self.post_interaction}: {self.weight})"

    def __eq__(self, other):
        """ Interactions are equal if they have the same pre and post interactions """
        if isinstance(other, self.__class__):
            return (self.pre_interaction == other.pre_interaction) and (self.post_interaction == other.post_interaction)
        else:
            return False
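
To make the key scheme concrete, here is a minimal standalone sketch that rebuilds the same key strings. The helpers `primitive_key` and `composite_key` are hypothetical (they are not part of the notebook) and only mirror the f-strings used in the two `key()` methods above.

```python
# Hypothetical helpers mirroring the f-strings used in Interaction.key()
# and CompositeInteraction.key(); they are not part of the notebook.
def primitive_key(action, outcome):
    """Key of a primitive interaction: '<action><outcome>'."""
    return f"{action}{outcome}"


def composite_key(pre_key, post_key):
    """Key of a composite interaction: '(<pre_key>,<post_key>)'."""
    return f"({pre_key},{post_key})"


first_level = composite_key(primitive_key(0, 0), primitive_key(0, 1))
second_level = composite_key(first_level, primitive_key(1, 1))
print(first_level)   # (00,01)
print(second_level)  # ((00,01),11)
```

Because the keys nest, a composite interaction whose pre_interaction is itself composite gets its own dictionary entry, distinct from every first-level entry.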

## Define the Agent class

We will use a Pandas DataFrame to compute the selection of the next intended interaction and to predict its most likely outcome.

In [None]:
!pip install pandas

Let's implement a base Agent that has the functionalities of Agent5, implemented using pandas.

In [34]:
import pandas as pd

class Agent:
    def __init__(self, _interactions):
        """ Initialize our agent """
        self._interactions = {interaction.key(): interaction for interaction in _interactions}
        self._composite_interactions = {}
        self._intended_interaction = self._interactions["00"]
        self._last_interaction = None
        self._previous_interaction = None
        self._penultimate_interaction = None
        self._last_composite_interaction = None
        self._previous_composite_interaction = None
        # Create a dataframe of default primitive interactions 
        default_interactions = [interaction for interaction in _interactions if interaction.get_outcome() == 0]
        data = {'post_action': [i.get_action() for i in default_interactions],
                'weight': [0] * len(default_interactions),
                'proclivity': [0] * len(default_interactions),
                'post_interaction': [i.key() for i in default_interactions]}
        self.primitive_df = pd.DataFrame(data)
        # Store the selection dataframe as a class attribute so we can display it in the notebook
        self.selection_df = None

    def action(self, _outcome):
        """Implement the agent's policy"""
        # tracing the previous cycle
        self._previous_composite_interaction = self._last_composite_interaction
        self._penultimate_interaction = self._previous_interaction
        self._previous_interaction = self._last_interaction
        self._last_interaction = self._interactions[f"{self._intended_interaction.get_action()}{_outcome}"]
        print(f"Action: {self._intended_interaction.get_action()}, Prediction: {self._intended_interaction.get_outcome()}, "
              f"Outcome: {_outcome}, Prediction_correct: {self._intended_interaction.get_outcome() == _outcome}, "
              f"Valence: {self._last_interaction.get_valence()}")

        # Call the learning mechanism
        self.learn()
        
        # Create a dataframe from the activated composite interactions 
        activated_keys = [composite_interaction.key() for composite_interaction in self._composite_interactions.values() 
                          if composite_interaction.pre_interaction == self._last_interaction or 
                          composite_interaction.pre_interaction == self._last_composite_interaction]
        data = {'composite': activated_keys,
                'weight': [self._composite_interactions[k].weight for k in activated_keys],
                'post_valence': [self._composite_interactions[k].post_interaction.get_valence() for k in activated_keys],
                'post_action': [self._composite_interactions[k].post_interaction.get_action() for k in activated_keys],
                'post_interaction': [self._composite_interactions[k].post_interaction.pre_key() for k in activated_keys]
                }
        activated_df = pd.DataFrame(data)

        # Create the selection dataframe from the primitive and the activated dataframes
        df = pd.concat([self.primitive_df, activated_df], ignore_index=True)

        # Compute the proclivity for each action
        df['proclivity'] = df['weight'] * df['post_valence']
        grouped_df = df.groupby('post_action').agg({'proclivity': 'sum'}).reset_index()
        df = df.merge(grouped_df, on='post_action', suffixes=('', '_sum'))

        # Find the most probable outcome for each action
        max_weight_df = df.loc[df.groupby('post_action')['weight'].idxmax(), ['post_action', 'post_interaction']].reset_index(drop=True)
        max_weight_df.columns = ['post_action', 'intended']
        df = df.merge(max_weight_df, on='post_action')

        # Find the first row that has the highest proclivity
        max_index = df['proclivity_sum'].idxmax()
        intended_interaction_key = df.loc[max_index, ['intended']].values[0]
        self._intended_interaction = self._interactions[intended_interaction_key]
        print("Intended", self._intended_interaction)

        # Store the selection dataframe for printing
        self.selection_df = df.copy()
        
        return self._intended_interaction.get_action()

    def learn(self):
        """Record or reinforce the composite interaction (previous_interaction, last_interaction)"""
        if self._previous_interaction is not None:
            composite_interaction = CompositeInteraction(self._previous_interaction, self._last_interaction)
            if composite_interaction.key() not in self._composite_interactions:
                self._composite_interactions[composite_interaction.key()] = composite_interaction
                print(f"Learning {composite_interaction}")
                self._last_composite_interaction = composite_interaction
            else:
                self._composite_interactions[composite_interaction.key()].reinforce()
                print(f"Reinforcing {self._composite_interactions[composite_interaction.key()]}")
                self._last_composite_interaction = self._composite_interactions[composite_interaction.key()]

# PRELIMINARY EXERCISE

Let's test this agent in Environment4.

In [35]:
class Environment4:
    """ Environment4 """
    def __init__(self):
        """ Initializing Environment4 """
        self.step = 0

    def outcome(self, _action):
        """Take the action and generate the next outcome """
        self.step += 1
        # Behave like Environment1 while self.step < 10, i.e. during the first 9 calls
        if self.step < 10:
            if _action == 0:
                return 0
            else:
                return 1            
        # Behave like Environment2 from the 10th call onward
        else: 
            if _action == 0:
                return 1
            else:
                return 0          

Initialize the simulation

In [36]:
# Instantiate a new agent
interactions = [
    Interaction(0,0,-1),
    Interaction(0,1,1),
    Interaction(1,0,-1),
    Interaction(1,1,1),
    Interaction(2,0,-1),
    Interaction(2,1,1)
]
a = Agent(interactions)
e = Environment4()

# Run the interaction loop
step = 0
outcome = 0

Run the simulation step by step to see the Selection DataFrame. Use `Ctrl+Enter` to run the cell below and stay on it.

In [37]:
print(f"Step {step}")
step += 1
action = a.action(outcome)
outcome = e.outcome(action)
a.selection_df[['composite', 'weight', 'post_valence', 'post_action', 'proclivity', 'proclivity_sum', 'intended']]

Step 0
Action: 0, Prediction: 0, Outcome: 0, Prediction_correct: True, Valence: -1
Intended 00:-1


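| | composite | weight | post_valence | post_action | proclivity | proclivity_sum | intended |
|---|---|---|---|---|---|---|---|
| 0 | NaN | 0.0 | NaN | 0.0 | NaN | 0.0 | 00 |
| 1 | NaN | 0.0 | NaN | 1.0 | NaN | 0.0 | 10 |
| 2 | NaN | 0.0 | NaN | 2.0 | NaN | 0.0 | 20 |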


Observe the Selection DataFrame above as you run the agent step by step.
Each activated composite interaction proposes its post_action with a proclivity equal to the composite interaction's weight multiplied by the valence of its post interaction.

The proclivities are summed for each action according to the formula given in Agent5: 

$\displaystyle proclivity_a = \sum_{c \in A_a} w_c \cdot v_{i}$

in which $A_a$ is the set of activated composite interactions $c$ that propose action $a$, $w_c$ is the weight of $c$, $v_i$ is the valence of the interaction proposed by $c$.

The action with the highest $proclivity_a$ (column `proclivity_sum`) is selected.
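
The summation above can be sketched without pandas on a handful of toy values. The tuples below are illustrative assumptions (key, weight, post action, post valence), and `proclivities` is a hypothetical helper rather than the notebook's implementation.

```python
# Toy activated composite interactions (illustrative values):
# (key, weight, post_action, post_valence)
activated = [
    ("(01,01)", 5, 0, 1),
    ("(01,00)", 4, 0, -1),
]


def proclivities(activated, actions):
    """Sum weight * valence over the activated composites proposing each action."""
    totals = {a: 0 for a in actions}  # actions with no proposal keep proclivity 0
    for _key, weight, post_action, post_valence in activated:
        totals[post_action] += weight * post_valence
    return totals


totals = proclivities(activated, actions=[0, 1])
selected = max(totals, key=totals.get)  # first action with the highest sum
print(totals)    # {0: 1, 1: 0}
print(selected)  # 0
```

Like `idxmax` in the agent, `max` returns the first action reaching the highest sum, so ties are broken in favour of the action listed first.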

* During the first 10 steps, the composite interaction (11,11) is progressively reinforced as the agent learns that, in the context of interaction 11, it can enact interaction 11 again.
* After Step 10, the agent learns the composite interaction (01,01), which tells it that, in the context of interaction 01, it can enact 01 again.

## Remark on the relation between proclivity and expected reward 

Notably, $proclivity_a$ relates to the "expected reward" traditionally used in reward optimisation under uncertainty.
The expected reward weighs the reward of each sequence by the probability of successfully enacting the sequence.
In the reinforcement learning literature, the expected reward for action $a$ is sometimes denoted $\mathbb{E}(R_a)$ and its estimate $\hat{\mathbb{E}}(R_a)$.

In our case, however, the agent does not know the probability of successfully enacting the post interaction $i$.
The agent can only compute an estimated expected valence $\hat{v}_a$ based on an estimated probability $\hat{p}_{ia}$ of enacting the post interaction $i$ when selecting action $a$:

$\displaystyle \hat{v}_a = \sum_{i \in I} \hat{p}_{ia} \cdot v_{i}$

in which $I$ is the set of primitive interactions that can result from action $a$.

If there is at least one activated composite interaction proposing $a$ then, for any interaction $i$ whose action is $a$, the estimated probability $\hat{p}_{ia}$ can be computed as the ratio of the total weight of the activated composite interactions proposing $i$ to the total weight of the activated composite interactions proposing $a$:

$\displaystyle \hat{p}_{ia} = \frac{\sum_{c \in A_{ia}} w_c }{\sum_{c \in A_a} w_c }$ 

in which $A_a$ is the set of activated composite interactions proposing $a$, and $A_{ia} \subset A_a$ is the set of activated composite interactions proposing interaction $i$ whose action is $a$. 

The proclivity $proclivity_a$ for selecting action $a$ is thus equal to the estimated expected valence resulting from $a$ multiplied by the sum of the weights of the activated composite interactions proposing $a$:

$\displaystyle proclivity_a = \hat{v}_a \cdot \sum_{c \in A_a} w_c $

Two observations follow: 
* The $\sum_{c \in A_a} w_c$ factor, called the "proposal weight", increases the chances of selecting actions that yield positive valence over time. The more often an action has been selected, the more likely it is to be selected in the future, leading the agent to learn habits of interaction that yield positive valences.
* The estimate $\hat{v}_a$ may well be inaccurate because the estimated probabilities $\hat{p}_{ia}$ are biased by the fact that the agent preferentially tries actions that already have a positive expected valence.

In future agents, we may want to compute the proposal weight differently, and improve the estimation of expected valences.
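
The identity between the proclivity and the estimated expected valence times the proposal weight can be checked numerically. The sketch below uses toy weights and valences (assumptions chosen for illustration) for a single action $a = 0$, and exact fractions to avoid rounding.

```python
from fractions import Fraction

# Toy activated composites proposing action a = 0 (illustrative values):
# post interaction key -> (summed weight, valence of that interaction)
proposals = {"01": (5, 1), "00": (4, -1)}

# Proposal weight: sum of w_c over the activated composites proposing a
proposal_weight = sum(w for w, _ in proposals.values())
# Estimated probability of each post interaction given a
p_hat = {i: Fraction(w, proposal_weight) for i, (w, _) in proposals.items()}
# Estimated expected valence of a
v_hat = sum(p_hat[i] * v for i, (_, v) in proposals.items())
# Proclivity computed directly as the weighted sum of valences
proclivity = sum(w * v for w, v in proposals.values())

print(p_hat["01"], p_hat["00"])               # 5/9 4/9
print(v_hat)                                  # 1/9
print(v_hat * proposal_weight == proclivity)  # True
```

Here the estimated expected valence is small (1/9), yet the proclivity is 1 because it is scaled by the proposal weight of 9.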

## Environment5

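Let's define Environment5, in which the outcome is 0 when the agent performs the same action three times in a row and 1 otherwise. To keep getting an outcome of 1, the agent must therefore never repeat an action more than twice in a row.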

In [106]:
class Environment5:
    """ Environment5 """
    def __init__(self):
        """ Initializing Environment5 """
        self._previous_action = 0
        self._last_action = 0

    def outcome(self, _action):
        """Take the action and generate the next outcome """
        if _action == self._last_action and _action == self._previous_action:
            # If same action during the last 3 steps then outcome 0
            _outcome = 0
        else:
            # If different action then outcome 1
            _outcome = 1
        self._previous_action = self._last_action
        self._last_action = _action
        return _outcome
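
The outcome rule coded above can be checked by hand on short action sequences. The function `env5_outcomes` below is a hypothetical standalone restatement of that rule (it is not part of the notebook), starting from the same initial state in which the two preceding actions are 0.

```python
def env5_outcomes(actions, previous_action=0, last_action=0):
    """Replay the Environment5 outcome rule on a list of actions."""
    outcomes = []
    for action in actions:
        # Outcome 0 only if the action equals both of the two preceding actions
        same_three = action == last_action and action == previous_action
        outcomes.append(0 if same_three else 1)
        previous_action, last_action = last_action, action
    return outcomes


print(env5_outcomes([0, 0, 0, 0]))        # [0, 0, 0, 0]
print(env5_outcomes([1, 1, 1, 0, 0, 0]))  # [1, 1, 0, 1, 1, 0]
print(env5_outcomes([1, 1, 0, 0, 1, 1]))  # [1, 1, 1, 1, 1, 1]
```

The second sequence shows that the outcome of an action depends on the two preceding actions, not only on the last one.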

In [84]:
# Instantiate a new agent
interactions = [
    Interaction(0,0,-1),
    Interaction(0,1,1),
    Interaction(1,0,-1),
    Interaction(1,1,1),
]
a = Agent(interactions)
e = Environment5()

# Run the interaction loop
step = 0
outcome = 0

In [105]:
print(f"Step {step}")
step += 1
action = a.action(outcome)
outcome = e.outcome(action)
a.selection_df[['composite', 'weight', 'post_valence', 'post_action', 'proclivity', 'proclivity_sum', 'intended']]

Step 20
Action: 0, Prediction: 1, Outcome: 1, Prediction_correct: True, Valence: 1
Reinforcing (01:1, 01:1: 5)
Intended 01:1


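| | composite | weight | post_valence | post_action | proclivity | proclivity_sum | intended |
|---|---|---|---|---|---|---|---|
| 0 | NaN | 0 | NaN | 0 | NaN | 1.0 | 01 |
| 1 | NaN | 0 | NaN | 1 | NaN | 0.0 | 10 |
| 2 | (01,01) | 5 | 1.0 | 0 | 5.0 | 1.0 | 01 |
| 3 | (01,00) | 4 | -1.0 | 0 | -4.0 | 1.0 | 01 |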


Observe that this agent cannot obtain a positive valence on every step.
To do so, it would need to select its action based on the last two interactions.
We are going to design Agent6, which can do this.

# ASSIGNMENT

Implement Agent6 that learns a higher level of composite interactions as shown in Figure 1.

![Agent6](img/Figure_1_Agent6.svg)

Figure 1: Agent6 records and reinforces two levels of composite interactions:
* First-level composite interaction $c_{t-1} = (i_{t-2}, i_{t-1}: weight)$, 
* Second-level composite interaction $(c_{t-2}, i_{t-1}: weight)$.

The last enacted primitive interaction $i_{t-1}$ and the last enacted composite interaction $c_{t-1}$ activate previously learned composite interactions, which propose the action of their post interaction.
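
The activation step can be sketched with plain tuples standing in for interactions. The memory content, its weights, and the `activated` helper are hypothetical and only illustrate the matching rule described above.

```python
# Toy memory of learned composite interactions: (pre, post) -> weight.
# A primitive interaction is a string such as "01"; a first-level composite
# has a primitive as its pre; a second-level composite has a pair as its pre.
memory = {
    ("00", "11"): 3,
    ("11", "01"): 2,
    (("00", "11"), "01"): 1,
    (("11", "01"), "01"): 1,
}


def activated(memory, last_interaction, last_composite):
    """Return the composites whose pre interaction was just enacted."""
    return {key: weight for key, weight in memory.items()
            if key[0] == last_interaction or key[0] == last_composite}


# Suppose the agent just enacted 11 after 00:
# i_{t-1} = "11" and c_{t-1} = ("00", "11")
result = activated(memory, "11", ("00", "11"))
print(result)  # {('11', '01'): 2, (('00', '11'), '01'): 1}
```

Both activated entries propose interaction 01, one from the first-level context and one from the second-level context.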

## Implement the learning mechanism of Agent6

In the `learn` method of `Agent6` below, add the learning of the second-level composite interaction, following the same pattern as the first level.

* Its pre_interaction will be `self._previous_composite_interaction`
* Its post_interaction will be `self._last_interaction`
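
The "record or reinforce" pattern shared by both levels can be sketched on a plain dictionary of weights. This is a generic illustration with a hypothetical helper `record_or_reinforce`, not the expected solution. In the agent, the second level can presumably only be recorded once `self._previous_composite_interaction` is not None.

```python
def record_or_reinforce(weights, pre_key, post_key):
    """Create the (pre, post) entry with weight 1, or increment it if known."""
    key = (pre_key, post_key)
    weights[key] = weights.get(key, 0) + 1
    return key


weights = {}
record_or_reinforce(weights, "00", "11")          # first level, learned
record_or_reinforce(weights, ("00", "00"), "11")  # second level, learned
record_or_reinforce(weights, ("00", "00"), "11")  # second level, reinforced
print(weights)  # {('00', '11'): 1, (('00', '00'), '11'): 2}
```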

In [134]:
class Agent6(Agent):
    def learn(self):
        # Recording previous composite interaction
        if self._previous_interaction is not None:
            # Record or reinforce the first level composite interaction
            composite_interaction = CompositeInteraction(self._previous_interaction, self._last_interaction)
            if composite_interaction.key() not in self._composite_interactions:
                self._composite_interactions[composite_interaction.key()] = composite_interaction
                print(f"Learning {composite_interaction}")
                self._last_composite_interaction = composite_interaction
            else:
                self._composite_interactions[composite_interaction.key()].reinforce()
                print(f"Reinforcing {self._composite_interactions[composite_interaction.key()]}")
                self._last_composite_interaction = self._composite_interactions[composite_interaction.key()]
            # *** Implement the learning mechanism of the second level composite interaction here ****


## Test your Agent6 in Environment5

It is OK if your Agent6 requires several tens of interactions to learn, but we expect it to eventually obtain a positive valence on every cycle.

In [135]:
# Instantiate a new agent
interactions = [
    Interaction(0,0,-1),
    Interaction(0,1,1),
    Interaction(1,0,-1),
    Interaction(1,1,1),
]
a = Agent6(interactions)
e = Environment5()

# Run the interaction loop
step = 0
outcome = 0

In [146]:
print(f"Step {step}")
step += 1
action = a.action(outcome)
outcome = e.outcome(action)
a.selection_df[['composite', 'weight', 'post_valence', 'post_action', 'proclivity', 'proclivity_sum', 'intended']]

Step 10
Action: 1, Prediction: 1, Outcome: 1, Prediction_correct: True, Valence: 1
Reinforcing (00:-1, 11:1: 3)
Intended 01:1


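| | composite | weight | post_valence | post_action | proclivity | proclivity_sum | intended |
|---|---|---|---|---|---|---|---|
| 0 | NaN | 0 | NaN | 0 | NaN | 2.0 | 01 |
| 1 | NaN | 0 | NaN | 1 | NaN | 0.0 | 10 |
| 2 | (11,01) | 2 | 1.0 | 0 | 2.0 | 2.0 | 01 |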


You should observe that activated composite interactions in column `composite` may have a pre_interaction that is itself a composite interaction.
This allows Agent6 to select behavior adapted to the previous two interactions, and thus to learn to get a positive valence every time in Environment5.
