# Nash-Q Learning Educational Software
In this notebook you can interact with educational software that simulates the Nash-Q learning algorithm in a multi-agent system.

More specifically, the software lets you specify a **stochastic game** as an environment for the agents to explore. The agents then learn to play the game by interacting with it and updating their knowledge of the environment, trying to reach a Nash equilibrium.

This notebook is divided into four main widgets:

1. **Presets** : In this widget you can select a preset environment to play with.

2. **Environment Editor** : In this widget you can edit the stochastic game environment. You can specify the number of agents, the number of games, the number of actions per player per game, and the rewards and transition probabilities for each action profile in each game.

3. **Nash Q Learning Editor** : In this widget you can edit the hyperparameters of the agents and train them on the environment.
    
4. **Learning Display** : In this widget you can view the learning process of the agents and the Nash equilibrium reached.

> **❗Note❗**: The widgets are **interactive** and they are all **linked together**. This means that if you change a parameter in one widget the other widgets will update accordingly.

> Unfortunately, things can sometimes go wrong in unexpected ways. In that case you can always rerun the cell to reset the widgets, or even restart the kernel.

## Imports
We developed a custom Python package with all the classes and functions needed to run the Nash-Q learning algorithm. The package is called `LearningNashQLearning` and it is available on PyPI under an open-source licence.

You are highly encouraged to check the source code of the package, which is located in this same repository in the folder `LearningNashQLearning`.

Any form of contribution is encouraged and appreciated: feel free to open an issue or a pull request.

In [1]:
# If you encounter issues in the installation of the requirements,
# you may need to substitute the pip command with pip3

!pip install LearningNashQLearning==0.10
!jupyter labextension enable widgetsnbextension



`sys_prefix` level settings are read-only, using `user` level for migration to `lockedExtensions`


In [2]:
from LearningNashQLearning.Model.Environment import Environment
from LearningNashQLearning.Model.NashQLearning import NashQLearning
from LearningNashQLearning.View.PresetGames import PresetGames
from LearningNashQLearning.View.GameEditor import EnvironmentWidget
from LearningNashQLearning.View.EnvGraphDisplay import EnvGraphDisplay
from LearningNashQLearning.View.FinalDisplay import FinalDisplay


import ipywidgets as widgets
from IPython.display import display

%matplotlib widget

# autoreload   
%load_ext autoreload
%autoreload 2

## Environment Editor ##
In the interface below you can create your own environment.

First of all, you need to choose the number of **players** (that is, the number of agents), limited to 4 to keep the representation manageable, and the number of **games** (the states that can be reached by the agents).

When defining the number of states, keep in mind that the state space is defined as ***S*** = *s<sub>1</sub> x s<sub>2</sub> x ... x s<sub>n</sub>*, where *n* is the number of agents and *s<sub>i</sub>*, with *i* = 1, ..., *n*, is the state space of the *i-th* agent. In other words, the state space is the Cartesian product of the individual state spaces of the agents. Therefore, every state represents one combination of the positions of the agents, and transitions between states always involve all the agents together.
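
As a minimal sketch of this definition, the joint state space can be enumerated with `itertools.product`. The two individual state spaces below are hypothetical and only serve as an illustration; they are not taken from the package.

```python
from itertools import product

# Hypothetical individual state spaces: each agent can stand on one of two tiles.
agent_states = [
    ["left", "right"],  # s_1: positions available to agent 1
    ["left", "right"],  # s_2: positions available to agent 2
]

# The joint state space S is the Cartesian product s_1 x s_2 x ... x s_n.
joint_states = list(product(*agent_states))

# Each joint state corresponds to one "game" of the environment.
for index, state in enumerate(joint_states):
    print(f"game {index}: {state}")
```

With two agents and two positions each, this yields four joint states, so the environment would need four games.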

After setting the number of **players** and **games**, you can define either a global number of possible **actions**, equal for every **player** in every **game**, or a different number of **actions** for each **player** in each **game**.

Then, in every **game** defined, for every **action profile** in the set of action profiles ***A*** = *a<sub>1</sub> x a<sub>2</sub> x ... x a<sub>n</sub>*, where *a<sub>i</sub>* is the set of possible actions of the *i-th* **player**, you can set the **probability** of the transition towards each of the **games** in the environment, along with the associated **payoff**.
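
The structure described above can be sketched in plain Python. This is only an illustration under assumed numbers: the dictionaries and the helper `is_valid_distribution` are hypothetical and do not reflect how the `LearningNashQLearning` package stores its data internally.

```python
from itertools import product
import math

# Hypothetical setup: 2 players, 2 games, 2 actions per player in every game.
n_games = 2
actions_per_player = [2, 2]

# Every action profile is one element of a_1 x a_2 x ... x a_n.
action_profiles = list(product(*(range(k) for k in actions_per_player)))

# For a single game, map each action profile to a distribution over the next
# games and to one payoff per player. All numbers are made up for illustration.
transitions = {profile: [0.5, 0.5] for profile in action_profiles}
payoffs = {profile: (0.0, 0.0) for profile in action_profiles}
transitions[(0, 0)] = [1.0, 0.0]  # profile (0, 0) always leads to game 0
payoffs[(0, 0)] = (1.0, 1.0)


def is_valid_distribution(probabilities, n_games):
    """Return True if there is one non-negative probability per game and they sum to 1."""
    return (
        len(probabilities) == n_games
        and all(p >= 0 for p in probabilities)
        and math.isclose(sum(probabilities), 1.0)
    )


for profile in action_profiles:
    assert is_valid_distribution(transitions[profile], n_games)
```

The check at the end captures the one constraint that every row of transition probabilities has to satisfy: for each action profile, the probabilities over the destination games must sum to 1.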

The graph below the settings interface shows the current state space and the possible transitions between states, with the associated **probability** or **reward**, depending on the option chosen.

> **❗Note❗**: Don't worry if this sounds overwhelming, you can always use the presets to get started.

In [3]:
env = Environment()
envGraph = EnvGraphDisplay(env, timeBuffer=0.5)
envWidget = EnvironmentWidget(env)
vBox = widgets.VBox([ envGraph.get_widget(), envWidget.getWidget()])
display(vBox)

VBox(children=(VBox(children=(Dropdown(description='Labels:', options=('Transition Probabilities', 'Payoffs'),…

## 2. Presets
Here you can load one of the presets that we have prepared for you. These are designed to show you the capabilities of the Nash-Q learning algorithm in different scenarios.

> Remember that you can always view and edit the environment to your liking in the Environment Editor widget.

The presets are:

0. **Empty preset**: This only selects the minimum number of players (2) and 2 games, without any transitions or payoffs. View this preset as a blank canvas for your first simple environments.

1. **Basic 2-2**: Defines an environment with 2 players and 2 games, with deterministic transitions and equal payoffs for each player on every transition.
In this setup the most desirable game to be in is game 1, where it is best for the players to stay because of the payoffs. This means that the only Nash-Q equilibrium is playing the action profile (0, 0) in both games.

2. **Basic 3-2**: Defines an environment with 3 players and 2 games with the same properties as the previous one. Here the *Nash-Q equilibrium* consists of playing the **action profile** (0, 0, 0) in each game.

3. **Basic 4-2**: Once again the same settings, but this time with 4 players. The *Nash-Q equilibrium* is the action profile (0, 0, 0, 0).

4. **Stochastic "Prisoner's Dilemma"**: A more structured preset that reinterprets the famous "Prisoner's Dilemma" in a stochastic game setting. More specifically, there are 2 players, each of whom can be either free (F) or imprisoned (I); this implies the existence of 4 games:

    **0 - FF** - In this game the players are both free. Unfortunately, their nature doesn't allow them to stay free for long. Sooner or later they always end up committing a crime, and years of friendship compel them to commit crimes together. That friendship is inevitably tested, though, because they always end up getting caught.
    Each of them is then presented with a choice: they can either stay silent (S) or testify against the other (T). If both players testify, they get 2 years of prison each; if one testifies and the other stays silent, the one who testified goes free and the other gets 3 years of prison; if they both stay silent, they get 1 year of prison each.
    The payoffs are set to reflect these conditions.
    The action profiles SS and TT lead to game 1 (II), while the action profiles TS and ST lead to games 2 (FI) and 3 (IF) respectively.
    The payoffs are set in such a way that there is no difference in testifying or not if the other testifies, but remaining silent when the other testifies causes a pretty big negative reward. In this game the Nash-Q equilibrium is for both players to testify (T, T).
    
    **1 - II** - In this game both players are imprisoned, but not all hope is lost. Since they are both in prison, they can work together to try to escape through the sewers. Unfortunately, only one of them can fit, and then the alarm will go off. They therefore play a game of rock, paper, scissors to decide who gets out. The winner is released and the other has his sentence increased for trying to escape. If they tie, they both return to their cells but still get a smaller increase in their sentences.
    This means that they can play 3 actions: rock (R), paper (P), and scissors (S). The one who wins gets a positive reward because he is released, while the one who loses gets a bigger negative reward because his sentence is increased; in the case of a tie, they both get a small negative reward, because they have to stay in prison and serve their sentences.

    **2 - FI** - In game 2 one of the criminals has testified while the other has remained silent, so the first is free and the second is in prison. The free one can choose between *(H)* helping the other criminal escape or *(I)* ignoring him, while the prisoner can choose between *(E)* escaping or *(R)* remaining in prison. If the first tries to help and the other wants to escape *(H, E)*, they both get a small positive reward for tasting freedom once again. If the second decides to remain in prison *(H, R)*, he gets a small negative reward, while the other gets a negative reward because he gets caught while helping. If the first decides not to help and the second tries to escape *(I, E)*, the second gets a negative reward because he can't actually escape without help and his sentence is increased, while the first gets a neutral reward. If the second remains in prison *(I, R)*, he gets a small negative reward while the first still gets a neutral reward.
    
    **3 - IF** - Game 3 is the same as game 2, but with reversed roles.

5. **"Little(grid)world"**: A small version of a grid-world game where the grid is composed of four tiles in an upside-down T shape (3 horizontal tiles and one on top of the middle one). There are two agents on opposite sides of the board, and each aims to reach the other player's position. Each agent can perform three actions: go up (action 0), go down (action 2), or move towards the side of the board opposite to its starting position (action 1). When an agent hits a wall, it gets a unit negative reward (-1); when the two agents collide, they both get a double negative reward (-2). When an agent reaches its final position, it gets a strong positive reward (+5), regardless of whether a wall or the other agent has been hit.
> **❗Note❗**: The graph representation of preset 5 might be quite hard to comprehend and slow to render (the system has 12 possible states and 9 possible action profiles in each state). For a graphic representation, refer to the report published on GitHub (https://github.com/MultiagentSystemsProject-Polimi2024/LearningNashQLearning.git).
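
To make the equilibrium reasoning for game 0 (FF) of the Stochastic "Prisoner's Dilemma" more concrete, here is a small sketch that looks for pure Nash equilibria in a single stage game. The payoffs are an assumption: they are simply minus the years of prison described above, and the actual values in the preset may differ. The sketch also ignores the transitions to the other games, which the Nash-Q algorithm does take into account through the Q-values.

```python
from itertools import product

ACTIONS = ("S", "T")  # S = stay silent, T = testify

# Assumed payoffs: minus the years of prison described for game 0 (FF).
# Keys are (action of player 0, action of player 1);
# values are (payoff of player 0, payoff of player 1).
PAYOFFS = {
    ("S", "S"): (-1, -1),
    ("S", "T"): (-3, 0),
    ("T", "S"): (0, -3),
    ("T", "T"): (-2, -2),
}


def is_pure_nash(profile, payoffs, actions):
    """Return True if no player can gain by unilaterally changing action."""
    for player in range(len(profile)):
        current = payoffs[profile][player]
        for alternative in actions:
            deviation = list(profile)
            deviation[player] = alternative
            if payoffs[tuple(deviation)][player] > current:
                return False
    return True


pure_equilibria = [
    profile
    for profile in product(ACTIONS, repeat=2)
    if is_pure_nash(profile, PAYOFFS, ACTIONS)
]
print(pure_equilibria)
```

Under these assumed payoffs, the only profile from which neither player wants to deviate is the one where both testify, which matches the equilibrium described for game 0.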

In [4]:
preset = PresetGames(env)
display(preset.getWidget())

Dropdown(description='Preset:', options=(0, 1, 2, 3, 4, 5), value=0)

## 3. Training of the agents ##
In this section you can set the parameters for the training phase.  
The parameters that can be customized are the following:
- *Episodes*: number of training episodes
- *Gamma*: **discount factor**, which defines how much the rewards from actions taken in the future are reduced in value
- *Epsilon*: **exploration-exploitation** parameter, defined as the probability with which the algorithm chooses a random action (exploration) over the best possible action (exploitation)
- *Pure epsilon episodes*: number of episodes after which *epsilon* starts decreasing, favoring *exploitation* more and more over *exploration*
- *Alpha*: **learning rate**, which determines the weight of new updates relative to the values already known, and therefore the speed at which the agents learn
- *Pure training episodes*: number of episodes after which *alpha* starts decreasing, reducing the weight of new updates
- *Reset on goal state*: when enabled, the agents are brought back to the start state after reaching the goal state, so that they learn again from the beginning
- *Start state*: defines the state from which the agents start
- *Goal state*: defines the goal state

After that, the training phase can be run by pressing the *Train* button.
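
The roles of *gamma*, *alpha*, and the two "pure episodes" parameters can be sketched as follows. This is an illustration rather than the package's code: the linear decay in `decayed_value` is an assumption (the actual schedule may have a different shape), and `nash_q_update` follows the standard Nash-Q update rule, with the Nash value of the next state passed in as a plain number instead of being computed from the Q-tables.

```python
def decayed_value(initial, episode, pure_episodes, total_episodes):
    """Keep `initial` for the first `pure_episodes`, then decay linearly to 0.

    The linear shape is an assumption made for this sketch.
    """
    if episode < pure_episodes:
        return initial
    remaining = max(total_episodes - pure_episodes, 1)
    return initial * max(0.0, 1.0 - (episode - pure_episodes) / remaining)


def nash_q_update(q_value, reward, next_nash_value, alpha, gamma):
    """One Nash-Q update for a single player, state, and action profile.

    `next_nash_value` is that player's payoff in a Nash equilibrium of the
    stage game defined by the Q-tables of the next state.
    """
    return (1 - alpha) * q_value + alpha * (reward + gamma * next_nash_value)


episodes = 100
for episode in (0, 50, 75, 99):
    epsilon = decayed_value(0.9, episode, pure_episodes=50, total_episodes=episodes)
    print(episode, round(epsilon, 3))

print(nash_q_update(q_value=0.0, reward=1.0, next_nash_value=2.0, alpha=0.5, gamma=0.8))
```

A higher *alpha* moves the Q-value faster towards the new estimate, while a higher *gamma* gives more weight to what the agents expect to earn from the next state onwards.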


In [5]:
nashQLearning = NashQLearning(env)
display(nashQLearning.getDisplayable())

GridBox(children=(IntText(value=1, description='Steps:'), FloatText(value=0.8, description='Gamma:'), FloatTex…

## 4. Results ##
The last section shows the results from the training.

First of all you can define a *window size*, as a percentage of the number of episodes, that is used to smooth the rewards and make them more readable in the graph. Lower values of *window size* let you see the variations in the rewards better, while higher values make the graph smoother, reducing the variations.
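
The effect of the *window size* can be illustrated with a simple moving average. This is a sketch under assumptions: the function `smooth_rewards` is hypothetical, and the trailing average used here is just one possible smoothing; the widget may compute it differently.

```python
def smooth_rewards(rewards, window_percent):
    """Trailing moving average whose window is a percentage of the episode count."""
    window = max(1, round(len(rewards) * window_percent / 100))
    smoothed = []
    running_sum = 0.0
    for i, reward in enumerate(rewards):
        running_sum += reward
        if i >= window:
            # Drop the reward that just left the window.
            running_sum -= rewards[i - window]
        smoothed.append(running_sum / min(i + 1, window))
    return smoothed


# Made-up rewards that oscillate between 0 and 4.
rewards = [0, 4, 0, 4, 0, 4, 0, 4, 0, 4]
print(smooth_rewards(rewards, window_percent=10))  # window of 1 episode
print(smooth_rewards(rewards, window_percent=20))  # window of 2 episodes
```

With a window of one episode the oscillation is fully visible, while a window of two episodes already flattens it, which is the trade-off described above.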

The reward graph displays the reward received by every player, and the sum of their rewards, for every training episode. By using the slider, typing in a value, or navigating with the arrows, you can select a specific episode and see its rewards marked by the red vertical line. You can also advance through the episodes automatically by pressing the play button, with the desired speed expressed as the number of episodes advanced per second.

When a training episode is selected, all the information related to it is displayed below:
- the state-space graph shows the states, highlighting the current state and the transition that has been taken from it; depending on your selection in the dropdown menus, it can show the **Q-tables** or the **policy** associated with the states in the chosen episode for the selected agent. On the edge of the graph corresponding to the current action and transition, you can also see the differences from the previous values in the *Q-Table*.
- below that, the interface shows the *current game*, the *current action profile* (the actions taken by every agent in the state), the *current payoff* (the reward received by every agent when taking the chosen action in the current state), and finally the *Current Policy* and the *Q-Tables*, for each player in each game, for the chosen episode. Note that the *Q-Tables* are shown only when the number of players is less than 3, because a structure with more than 2 dimensions cannot be displayed in the interface.
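
The dimensionality limit mentioned for the *Q-Tables* follows from the fact that a Q-table has one axis per player. The sketch below uses a hypothetical helper, `empty_q_table`, that stores one value per joint action profile; it is not the package's data structure.

```python
from itertools import product


def empty_q_table(actions_per_player):
    """Return a dict mapping every joint action profile to an initial Q-value of 0."""
    return {
        profile: 0.0
        for profile in product(*(range(k) for k in actions_per_player))
    }


two_players = empty_q_table([2, 3])       # 2 x 3 entries: fits a 2-D table
three_players = empty_q_table([2, 3, 2])  # 2 x 3 x 2 entries: needs a third axis

print(len(two_players), len(three_players))
```

With two players every entry is indexed by a pair of actions and can be laid out as rows and columns, while with three players every entry needs a third index.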

In [9]:
finalDisplay = FinalDisplay(nashQLearning, env)
display(finalDisplay.get_widget())

VBox(children=(HTML(value='<h1>Training Display</h1>'), HTML(value='<h2>Training History</h2>'), FloatSlider(v…

KeyError: (0, 4)