🏥 BVSOP-MADRL: Blockchain Validator Selection & Optimization via Multi-Agent Deep Reinforcement Learning
- What Problem Are We Solving?
- How We Solve It (The Big Picture)
- Repository Structure
- Environment Setup
- Running in Jupyter Notebook
- Running in VS Code
- Configuration Reference
- Code Architecture
- Training the Agents
- Testing & Evaluation
- Saved Models
- Troubleshooting
- Research Background
- Glossary
Imagine multiple hospitals, pharmacies, insurance companies, and the Ministry of Public Health all need to share medical data securely and quickly over a shared blockchain network. Each participant (called an Intelligent Participant, IP) has:
- Medical transactions waiting in a queue (patient records, test results, drug prescriptions)
- Each transaction has an urgency level (how fast it needs to be processed), a security level (how many validators are needed), and a queuing time (how long it has been waiting)
- A shared pool of blockchain validators (nodes with different computational powers) to verify data blocks
Every time a participant wants to send a data block, they must decide three things simultaneously:
| Decision | Effect |
|---|---|
| How many transactions to pack into one block | More → higher latency, lower cost per transaction |
| How many validators to involve | More → more secure, but slower and more expensive |
| Whether to compress the data | Compression → faster, but only appropriate for urgent (non-sensitive) data |
Minimizing latency, maximizing security, and minimizing cost are fundamentally conflicting. You cannot optimize all three perfectly at the same time.
The classical optimization approach requires solving a non-convex Multi-Objective Optimization Problem (MOOP) at every single time step. With 500 possible transaction counts × 40 possible validator counts × 3 compression choices, there are 60,000 combinations per agent per step. With multiple agents and real-time requirements in healthcare, this is computationally infeasible in practice.
We use Multi-Agent Deep Reinforcement Learning (MARL), specifically the MAD3QN algorithm (Multi-Agent Dueling Double Deep Q-Network), to train agents that learn a near-optimal policy offline and can make near-instant decisions at inference time.
┌─────────────────────────────────────────────────────────────────────┐
│                        IP-HealthChain System                        │
│                                                                     │
│   Hospital 1 (Agent 1)               Hospital 2 (Agent 2)           │
│   ┌─────────────────┐                ┌─────────────────┐            │
│   │ Observation:    │                │ Observation:    │            │
│   │ - Queue state   │                │ - Queue state   │            │
│   │ - Urgency lvls  │                │ - Urgency lvls  │            │
│   │ - Security lvl  │                │ - Security lvl  │            │
│   │ - Queuing time  │                │ - Queuing time  │            │
│   │ - Validator x_i │                │ - Validator x_i │            │
│   └────────┬────────┘                └────────┬────────┘            │
│            │ action (d, tr, v)                │ action (d, tr, v)   │
│            └────────────────┬─────────────────┘                     │
│                             ▼                                       │
│                 ┌───────────────────────┐                           │
│                 │  Blockchain Network   │                           │
│                 │  (Shared Validators)  │                           │
│                 │  x₁, x₂, ..., x_M     │                           │
│                 └───────────┬───────────┘                           │
│                             │                                       │
│                 ┌───────────▼───────────┐                           │
│                 │  Bipartite Reward     │                           │
│                 │  r = δ·r_g + (1-δ)·r_l│                           │
│                 │  r_g: global (shared) │                           │
│                 │  r_l: local (per IP)  │                           │
│                 └───────────────────────┘                           │
└─────────────────────────────────────────────────────────────────────┘
The reward has two parts:
Global Reward (r_g): Shared equally by ALL agents. Encourages efficient use of the shared validator pool. If agents waste resources or send when none are available, they are penalized.
Local Reward (r_l): Unique to each agent. Reflects how well the agent optimized its own latency, security, and cost for the transactions it just sent.
r = δ × r_g + (1 - δ) × r_l
Where δ (delta) controls the balance between cooperation (high δ) and competition (low δ).
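The blend above can be sketched in a few lines. This is a minimal illustration, not the notebook's code: the function name `blend_reward` and the sample reward values are assumptions chosen for the example.

```python
def blend_reward(r_global: float, r_local: float, delta: float) -> float:
    """Combine shared and per-agent rewards: r = delta * r_g + (1 - delta) * r_l."""
    if not 0.0 <= delta <= 1.0:
        raise ValueError("delta must lie in [0, 1]")
    return delta * r_global + (1.0 - delta) * r_local


# Two agents receive the same global reward but keep their own local rewards.
r_g = 0.6                   # hypothetical shared reward
local_rewards = [0.9, 0.4]  # hypothetical per-agent rewards
blended = [blend_reward(r_g, r_l, delta=0.2) for r_l in local_rewards]
print(blended)
```

With the default δ = 0.2, each agent's reward is dominated by its local term, so the agents behave mostly competitively.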
BVSOP-MADRL/
│
├── BVSOP-MADRL-1.ipynb          ← Main Jupyter Notebook (all code lives here)
│
├── hospital_1_mad3qn_M25.pth    ← Trained model: Agent 1, M=25 validators
├── hospital_1_mad3qn_M40.pth    ← Trained model: Agent 1, M=40 validators
├── hospital_2_mad3qn_M25.pth    ← Trained model: Agent 2, M=25 validators
├── hospital_2_mad3qn_M40.pth    ← Trained model: Agent 2, M=40 validators
│
├── README.md                    ← This file
└── requirements.txt             ← All Python dependencies with pinned versions
Note on `.pth` files: PyTorch model weights. These are the result of completed training runs and allow you to skip training and go straight to evaluation.
Windows
Step 1: Install Python 3.11
Download the installer from python.org. During installation:
- ✅ Check "Add Python to PATH"
- ✅ Check "Install for all users"
Verify in PowerShell:
python --version
# Expected: Python 3.11.x
Step 2: Clone or download the repository
# If you have git installed:
git clone https://github.com/Harsh-Raj-Jordan/Optimization-BNet-MADRL.git
cd Optimization-BNet-MADRL
# OR simply download the ZIP from GitHub and extract it, then open PowerShell inside the folder.
Step 3: Create a virtual environment
python -m venv venv
venv\Scripts\activate
# Your prompt should now show: (venv) PS C:\...\Optimization-BNet-MADRL>
Step 4: Install dependencies
python -m pip install --upgrade pip
pip install -r requirements.txt
Step 5: (Optional) Install Jupyter
pip install jupyter notebook ipykernel
python -m ipykernel install --user --name=bvsop-env --display-name "BVSOP MADRL"
macOS
Step 1: Install Homebrew and Python
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
brew install python@3.11
Verify:
python3.11 --version
# Expected: Python 3.11.x
Step 2: Clone the repository
git clone https://github.com/Harsh-Raj-Jordan/Optimization-BNet-MADRL.git
cd Optimization-BNet-MADRL
Step 3: Create a virtual environment
python3.11 -m venv venv
source venv/bin/activate
# Your prompt should now show: (venv) user@machine Optimization-BNet-MADRL %
Step 4: Install dependencies
pip install --upgrade pip
pip install -r requirements.txt
Apple Silicon (M1/M2/M3) Note: PyTorch supports Apple's Metal Performance Shaders (MPS) backend. The notebook only checks for CUDA, so MPS is not selected automatically. If you want GPU acceleration on Apple Silicon, change the device line in the notebook:
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
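If the same notebook has to run on NVIDIA, Apple Silicon, and CPU-only machines, a fallback chain may be more convenient than editing the device line by hand. The sketch below is an assumption about how that could look, not the notebook's code; the helper name `pick_device_name` is hypothetical, and the PyTorch import is guarded so the snippet still runs where PyTorch is not installed.

```python
def pick_device_name(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then Apple's MPS backend, then fall back to the CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


try:
    import torch

    # torch.backends.mps is absent in older PyTorch builds, hence getattr.
    mps_backend = getattr(torch.backends, "mps", None)
    name = pick_device_name(
        torch.cuda.is_available(),
        mps_backend is not None and mps_backend.is_available(),
    )
    device = torch.device(name)
except ImportError:
    # PyTorch is not installed in this interpreter; keep only the name.
    device = pick_device_name(False, False)

print(device)
```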
Step 5: (Optional) Install Jupyter
pip install jupyter notebook ipykernel
python -m ipykernel install --user --name=bvsop-env --display-name "BVSOP MADRL"
Linux (Ubuntu/Debian)
Step 1: Install Python and build tools
sudo apt update && sudo apt upgrade -y
sudo apt install python3.11 python3.11-venv python3.11-dev build-essential git -y
Verify:
python3.11 --version
Step 2: Clone the repository
git clone https://github.com/Harsh-Raj-Jordan/Optimization-BNet-MADRL.git
cd Optimization-BNet-MADRL
Step 3: Create a virtual environment
python3.11 -m venv venv
source venv/bin/activate
Step 4: Install dependencies
pip install --upgrade pip
pip install -r requirements.txt
Step 5: (Optional) CUDA Setup for NVIDIA GPU
First verify your CUDA version:
nvidia-smi
# Note the "CUDA Version" shown in the top-right corner
If CUDA 12.x is available, PyTorch from requirements.txt should work. If you need a specific CUDA build, visit pytorch.org/get-started and replace the torch line accordingly.
Step 6: (Optional) Install Jupyter
pip install jupyter notebook ipykernel
python -m ipykernel install --user --name=bvsop-env --display-name "BVSOP MADRL"
Running in Jupyter Notebook
Step 1: Activate your virtual environment (if not already active)
# Windows:
venv\Scripts\activate
# macOS/Linux:
source venv/bin/activate
Step 2: Launch Jupyter
jupyter notebook
This opens a browser tab at http://localhost:8888.
Step 3: Open the notebook
Click on BVSOP-MADRL-1.ipynb in the browser file tree.
Step 4: Select the correct kernel
In the notebook menu: Kernel → Change kernel → BVSOP MADRL
If you skipped the ipykernel installation step, select any available Python 3 kernel that uses the venv.
Step 5: Run cells in order
The notebook is divided into the following sections. Run them top to bottom:
| Section | What it Does |
|---|---|
| Cell 1: Configuration & Imports | Sets all hyperparameters, imports libraries, defines the XI_VECTOR of validator compute powers |
| Cell 2: Environment (`IPHealthChainEnv`) | Defines the simulation world: queues, validators, state transitions, reward computation |
| Cell 3: Neural Network (`BranchingDuelingDQN`) | Defines the Dueling DQN architecture with separate heads for each action sub-space |
| Cell 4: Agents | Defines MAD3QN, RS (Random), SB (Static), and ES (Exhaustive Search) agents |
| Cell 5: Training Loop | Runs the main training, saves .pth model files, records history |
| Cell 6: Evaluation | Loads a saved model and runs inference on a fixed environment |
| Cell 7: Plotting | Generates all result visualizations |
Tip: Use Kernel → Restart & Run All for a clean full run. Expected training time: ~15–30 minutes on CPU for 3500 episodes.
Running in VS Code
Step 1: Install VS Code
Download from code.visualstudio.com.
Step 2: Install required extensions
Open VS Code, press Ctrl+Shift+X (or Cmd+Shift+X on Mac), and install:
- Python (by Microsoft)
- Jupyter (by Microsoft)
Step 3: Open the project folder
File → Open Folder → select the BVSOP-MADRL folder
Step 4: Select the Python interpreter
Press Ctrl+Shift+P → type Python: Select Interpreter → choose the one that shows venv in its path, e.g.:
- Windows: `.\venv\Scripts\python.exe`
- macOS/Linux: `./venv/bin/python`
Step 5: Open and run the notebook
Click on BVSOP-MADRL-1.ipynb in the Explorer panel. VS Code will render it as a notebook. Click "Select Kernel" in the top right → choose venv (Python 3.11).
Run cells individually with the ▶ button, or run all with Run All in the toolbar.
Tip: Look for `Python 3.11.x 64-bit ('venv')` in the status bar or kernel picker as confirmation that you are using the right environment. To check whether CUDA is detected, run `torch.cuda.is_available()` in a notebook cell.
All parameters are defined in the CONFIG dictionary at the top of the notebook. Here is a complete explanation:
| Parameter | Default | Meaning |
|---|---|---|
| `w` | 0.5e6 Hz | Channel bandwidth |
| `SNR_d_db` | 10 dB | Downlink signal-to-noise ratio |
| `SNR_u_db` | 12 dB | Uplink signal-to-noise ratio |
| `r_d` | 1.2e6 bps | Downlink transmission rate (BM → Validators) |
| `r_u` | 1.3e6 bps | Uplink transmission rate (Validators → BM) |
| `xi_hat` | 0.5e6 bits | Verification feedback size (ξ̂) |
| `zeta` | 500 bytes | Average transaction size (ζ) |
| `q` | 4.0 | Network scale indicator for security function |
| `kappa` | 1.0 | Security coefficient (κ) |
| `psi` | 0.001 | Consensus latency coefficient (ψ) |
| `G_required` | 100.0 | Computation required for block verification (K) |
| Parameter | Default | Meaning |
|---|---|---|
| `T_r` / `X_max` | 500 | Maximum transactions per block |
| `M` | 40 | Total number of available validators |
| `N_agents` | 2 | Number of intelligent participants (hospitals) |
Switching Settings: To reproduce the paper's "Setting 1" (M=25), change `"M": 40` → `"M": 25` and uncomment the `XI_VECTOR` for M=25. The code will automatically use the correct pre-trained model file (`_M25.pth`).
| Parameter | Default | Meaning |
|---|---|---|
| `alpha` | 0.33 | Weight for latency objective |
| `beta` | 0.33 | Weight for security objective |
| `gamma` | 0.34 | Weight for cost objective |
| `delta` | 0.2 | Global vs local reward balance (0=fully competitive, 1=fully cooperative) |
| Parameter | Default | Meaning |
|---|---|---|
| `u_th` | 0.5 | Urgency threshold for compression eligibility |
| `a_th` | 5 | Maximum allowed queuing time (zombie threshold) |
| `l_p` | 0.5 | Latency penalty coefficient |
| `s_p` | 0.5 | Security penalty coefficient |
| `d_p` | 0.5 | Wrong compression decision penalty |
| `U_p` | 0.5 | Insufficient resource utilization penalty |
| `epsilon_error` | 0.2 | Accepted security error margin (ε) |
| Parameter | Default | Meaning |
|---|---|---|
| `LR` | 3e-4 | Learning rate for RMSprop optimizer |
| `GAMMA` | 0.99 | Discount factor (λ): how much to value future rewards |
| `BATCH_SIZE` | 128 | Number of experiences sampled per learning step |
| `MEMORY_SIZE` | 50000 | Replay buffer capacity |
| `EPSILON_START` | 1.0 | Initial exploration rate (100% random actions) |
| `EPSILON_END` | 0.01 | Final exploration rate (1% random actions) |
| `EPSILON_DECAY` | 59500 | Number of steps to linearly decay epsilon |
| `TAU` | 0.001 | Soft update rate for target network |
| `EPISODES` | 3500 | Total training episodes |
| `STEPS_PER_EP` | 20 | Environment steps per episode |
| `HIDDEN_SIZE` | 512 | Neurons per hidden layer in the neural network |
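To make the three epsilon parameters concrete, here is a minimal sketch of a linear decay schedule using the defaults from the table. The function name `linear_epsilon` is illustrative and the notebook's own schedule may be written differently. If steps are counted once per environment step, 3500 episodes × 20 steps gives 70,000 steps, so the decay would finish about 85% of the way through training.

```python
def linear_epsilon(step: int, start: float = 1.0, end: float = 0.01,
                   decay_steps: int = 59500) -> float:
    """Linearly interpolate from `start` to `end`, then hold at `end`."""
    fraction = min(max(step, 0) / decay_steps, 1.0)
    return start + fraction * (end - start)


for step in (0, 29750, 59500, 70000):
    print(step, round(linear_epsilon(step), 4))
```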
This class simulates the blockchain network world. Think of it as a game environment.
IPHealthChainEnv
│
├── __init__()          → Sets up validators, computes normalization bounds
├── _init_resources()   → Assigns random or fixed computational power (x_i) to validators
├── reset()             → Starts a fresh episode with a new random queue state
├── _get_obs()          → Builds the observation vector each agent sees
├── compute_security()  → S(v) = κ · v^q
├── compute_latency()   → L(tr, v, ζ) = download + verify + consensus + upload
├── compute_cost()      → C(tr, v) = Σx_i / tr
└── step(actions)       → Takes actions from all agents, computes rewards, advances state
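The security and cost formulas listed above are simple enough to sketch directly. This is a standalone illustration under stated assumptions: the functions below are free functions rather than the notebook's methods, the validator powers are made-up values, and any normalization the environment applies is omitted. The latency formula is left out because its four terms are not spelled out here.

```python
def compute_security(v: int, kappa: float = 1.0, q: float = 4.0) -> float:
    """S(v) = kappa * v^q: security grows polynomially with validator count."""
    return kappa * v ** q


def compute_cost(selected_powers, tr: int) -> float:
    """C(tr, v) = sum(x_i) / tr: total validator compute shared across transactions."""
    if tr <= 0:
        raise ValueError("tr must be a positive number of transactions")
    return sum(selected_powers) / tr


powers = [10.0, 20.0, 30.0]  # hypothetical x_i for three selected validators
print(compute_security(len(powers)))
print(compute_cost(powers, tr=6))
```

The two functions show the trade-off from the decision table: adding validators raises security sharply but also raises the numerator of the cost, while packing more transactions lowers the cost per transaction.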
What is the observation vector?
Each agent sees a flattened vector containing:
- Its transaction queue: up to `X_max` transactions, each with `(urgency, security, queuing_time)` → shape `(X_max × 3,)`
- The shared validator resources: each validator's normalized compute power → shape `(M × 2,)`
Total observation size: 500 × 3 + 40 × 2 = 1580 floats per agent.
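The size arithmetic can be checked with a small helper. The function name and its parameter names are illustrative; the per-validator feature count of 2 is taken from the shape above, without assuming what the two features are.

```python
def observation_size(x_max: int = 500, m: int = 40,
                     queue_features: int = 3, validator_features: int = 2) -> int:
    """Length of the flattened per-agent observation vector."""
    return x_max * queue_features + m * validator_features


print(observation_size())      # M=40 -> 1580
print(observation_size(m=25))  # M=25 -> 1550
```

Because the size depends on `M`, a network trained for one setting would not accept observations from the other, which is consistent with the separate model files per setting.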
What is the zombie prevention algorithm?
If a transaction has been waiting longer than a_th steps, it becomes a "zombie": it risks never being processed because higher-urgency transactions keep jumping the queue. The environment forces a send action when zombies are detected, overriding the agent's idle decision. This prevents starvation of low-urgency medical data.
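A minimal sketch of the zombie check, assuming each transaction is a `(urgency, security, queuing_time)` tuple as in the observation description. The helper names `has_zombie` and `resolve_send` are hypothetical and the environment's actual override logic may differ.

```python
from typing import List, Tuple

Transaction = Tuple[float, float, int]  # (urgency, security, queuing_time)


def has_zombie(queue: List[Transaction], a_th: int = 5) -> bool:
    """True if any transaction has waited longer than a_th steps."""
    return any(queuing_time > a_th for _, _, queuing_time in queue)


def resolve_send(agent_wants_to_send: bool, queue: List[Transaction],
                 a_th: int = 5) -> bool:
    """Force a send when a zombie is present, overriding an idle decision."""
    return agent_wants_to_send or has_zombie(queue, a_th)


queue = [(0.9, 0.3, 1), (0.1, 0.8, 6)]  # second transaction has waited 6 > 5 steps
print(resolve_send(False, queue))
```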
Input: observation vector (1580-dim)
        │
[Conv1D 1→16, k=3] [ReLU]
        │
[Conv1D 16→32, k=3] [ReLU]
        │
[Flatten → conv_out_size]
        ├───────────────────┬───────────────────┬───────────────────┐
        │                   │                   │                   │
[Value Stream V]    [Advantage A_d]     [Advantage A_tr]    [Advantage A_v]
[Linear→512→ReLU]   [Linear→512→ReLU]   [Linear→512→ReLU]   [Linear→512→ReLU]
[Linear→1]          [Linear→3]          [Linear→500]        [Linear→40]
        │                   │                   │                   │
        └───────────────────┴───────────────────┴───────────────────┘
Q_d  = V + (A_d  - mean(A_d))
Q_tr = V + (A_tr - mean(A_tr))
Q_v  = V + (A_v  - mean(A_v))
Why three output heads? The action space is multi-discrete: d ∈ {0,1,2}, tr ∈ {1..500}, v ∈ {1..40}. A single head over the joint action space would require 3×500×40 = 60,000 outputs, which is impractically large to learn reliably. The branching architecture handles each sub-action independently.
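The size difference between a joint head and three branches, and the way a greedy action is read off the branches, can be sketched as follows. The function name `greedy_branch_action` and the tiny Q-value lists are illustrative; the 1-based offset for `tr` and `v` follows the ranges given above.

```python
D_CHOICES, TR_CHOICES, V_CHOICES = 3, 500, 40

joint_outputs = D_CHOICES * TR_CHOICES * V_CHOICES   # one output per joint action
branch_outputs = D_CHOICES + TR_CHOICES + V_CHOICES  # one output per sub-action
print(joint_outputs, branch_outputs)


def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def greedy_branch_action(q_d, q_tr, q_v):
    """Pick the best sub-action in each branch; tr and v are 1-based."""
    return argmax(q_d), argmax(q_tr) + 1, argmax(q_v) + 1


# Toy branches with 3, 3, and 2 entries instead of 3, 500, and 40.
print(greedy_branch_action([0.1, 0.5, 0.2], [0.0, 0.3, 0.9], [0.7, 0.1]))
```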
Why Dueling? The dueling architecture separates the value of a state (V) from the advantage of a specific action (A). This makes learning more stable because every update refines the shared state-value estimate, regardless of which action was taken, which tends to speed up convergence.
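The aggregation `Q = V + (A - mean(A))` from the diagram can be written out for a single branch. This is a plain-Python sketch with made-up numbers; the notebook would apply the same formula to tensors, once per branch.

```python
def dueling_q(value: float, advantages: list) -> list:
    """Combine a state value with mean-centered advantages: Q = V + (A - mean(A))."""
    mean_advantage = sum(advantages) / len(advantages)
    return [value + (a - mean_advantage) for a in advantages]


q_d = dueling_q(value=2.0, advantages=[1.0, 2.0, 3.0])
print(q_d)  # [1.0, 2.0, 3.0]
```

Subtracting the mean makes the average of the Q-values in a branch equal to V, so the value and advantage streams cannot drift against each other by a constant.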
Why Double DQN? Standard DQN overestimates Q-values due to using the same network for both action selection and evaluation. Double DQN uses the online network to select the best action and the target network to evaluate it, which reduces this bias.
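The difference between the two targets is easiest to see side by side. The sketch below covers a single branch with made-up Q-values; the function names are illustrative, and a branching network would presumably compute such a target per branch.

```python
def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


def vanilla_dqn_target(reward, done, next_q_target, gamma=0.99):
    """Standard DQN: the target network both selects and evaluates the action."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def double_dqn_target(reward, done, next_q_online, next_q_target, gamma=0.99):
    """Double DQN: online network selects the action, target network evaluates it."""
    if done:
        return reward
    best_action = argmax(next_q_online)
    return reward + gamma * next_q_target[best_action]


online_q = [0.2, 0.9, 0.4]  # hypothetical online-network estimates for the next state
target_q = [0.5, 0.3, 0.8]  # hypothetical target-network estimates for the next state
print(vanilla_dqn_target(1.0, False, target_q))
print(double_dqn_target(1.0, False, online_q, target_q))
```

In this example the Double DQN target is lower because the action preferred by the online network is not the one the target network rates highest.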
| Agent | Strategy | Purpose |
|---|---|---|
| `MAD3QNAgent` | Learns optimal policy via experience replay + neural network | Our proposed method |
| `RSAgent` | Picks actions uniformly at random | Lower-bound baseline; represents no intelligence |
| `SBAgent` | Always picks the same fixed action | Baseline for zero-adaptation behavior |
| `ESAgent` | Exhaustively evaluates all possible actions at each step | Near-optimal greedy baseline (infeasible at scale) |
Open the notebook and run all cells in order. Training begins at the cell labeled "Main Training Loop".
- The environment resets → fresh random transaction queues and validator resources
- Each agent observes its local state (partial observability)
- With probability ε (epsilon), the agent picks a random action (exploration); otherwise it picks the greedy action from its Q-network (exploitation)
- The environment executes all actions simultaneously, computes bipartite rewards, and transitions to the next state
- Each experience tuple `(obs, action, reward, next_obs, done)` is stored in the replay buffer
- Every step, a random mini-batch of 128 experiences is sampled to update the Q-network (this breaks temporal correlations)
- The target network is soft-updated toward the online network every step: `θ_target ← τ·θ_online + (1-τ)·θ_target`
- Epsilon decays linearly over 59,500 steps until it reaches 0.01
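Two of the steps above, replay sampling and the soft target update, can be sketched without any deep-learning library. Everything here is illustrative: parameters are plain floats in a dict rather than network tensors, and the helper names are assumptions, not the notebook's API.

```python
import random
from collections import deque

# MEMORY_SIZE from the configuration; the oldest experiences are evicted first.
replay_buffer = deque(maxlen=50000)


def store(obs, action, reward, next_obs, done):
    """Append one experience tuple to the buffer."""
    replay_buffer.append((obs, action, reward, next_obs, done))


def sample_batch(batch_size=128):
    """Uniform random mini-batch; smaller while the buffer is still filling."""
    population = list(replay_buffer)
    return random.sample(population, min(batch_size, len(population)))


def soft_update(target, online, tau=0.001):
    """theta_target <- tau * theta_online + (1 - tau) * theta_target, per parameter."""
    return {name: tau * online[name] + (1.0 - tau) * target[name] for name in target}


for step in range(200):
    store(obs=step, action=(0, 1, 1), reward=0.0, next_obs=step + 1, done=False)

batch = sample_batch()
target_params = soft_update({"w": 0.0}, {"w": 1.0})
print(len(batch), target_params["w"])
```

With τ = 0.001 the target moves only 0.1% of the way toward the online weights per step, which keeps the learning targets slowly changing.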
Every 100 episodes, the console prints:
Episode 100/3500 | H1 Reward: 45.2341 | H2 Reward: 44.8901 | Resources: 67.3%
Healthy training shows gradually increasing rewards and increasing resource utilization over time. Expect convergence around episode 2000–3000.
After training completes, two model files are saved:
hospital_1_mad3qn_M40.pth ← Agent 1's learned weights
hospital_2_mad3qn_M40.pth ← Agent 2's learned weights
The evaluation cell creates a fixed-resource environment using the XI_VECTOR from the paper (so results are deterministic and reproducible) and loads saved weights:
eval_env = IPHealthChainEnv(CONFIG, fixed_resources=XI_VECTOR)
eval_agents[0].online_net.load_state_dict(
torch.load("hospital_1_mad3qn_M40.pth", map_location=device)
)
eval_agents[0].epsilon = 0.0  # Disables all exploration; purely greedy
----------------------------------------------------------------------------------
Selected: [0 0 1 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 1 0 ... ]
m: x
n: y
Utility U: 0.abcd...
----------------------------------------------------------------------------------
| Field | Meaning |
|---|---|
| `Selected` | Binary mask of length M; a 1 means that validator was chosen |
| `m` | Number of validators selected for this block |
| `n` | Number of transactions packed into this block |
| `Utility U` | The final bipartite reward value (0–1 range, higher is better) |
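The relationship between `Selected` and `m` can be checked with a few lines. The mask below is a short made-up example of length 10, not the truncated mask from the sample output.

```python
selected = [0, 0, 1, 0, 0, 1, 0, 0, 1, 0]  # hypothetical mask for M=10 validators

m = sum(selected)  # number of validators chosen
chosen_indices = [i for i, bit in enumerate(selected) if bit == 1]

print(m, chosen_indices)  # 3 [2, 5, 8]
```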
Four pre-trained model files are included:
| File | Setting | Agents | Episodes Trained |
|---|---|---|---|
| `hospital_1_mad3qn_M25.pth` | M=25 validators, Setting 1 | Agent 1 | 3500 |
| `hospital_2_mad3qn_M25.pth` | M=25 validators, Setting 1 | Agent 2 | 3500 |
| `hospital_1_mad3qn_M40.pth` | M=40 validators, Setting 2 | Agent 1 | 3500 |
| `hospital_2_mad3qn_M40.pth` | M=40 validators, Setting 2 | Agent 2 | 3500 |
To load a specific model, update the path in the evaluation cell:
# For M=25 Setting 1:
eval_agents[0].online_net.load_state_dict(
torch.load("hospital_1_mad3qn_M25.pth", map_location=device)
)
# For M=40 Setting 2 (default):
eval_agents[0].online_net.load_state_dict(
torch.load("hospital_1_mad3qn_M40.pth", map_location=device)
)
Remember: When switching between M=25 and M=40, also update CONFIG["M"] and the XI_VECTOR accordingly before running any cell.
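Since the four file names follow one pattern, the path can be derived from the agent number and `M` instead of being edited by hand. The helper `model_path` is a hypothetical convenience, not something defined in the notebook.

```python
def model_path(agent_number: int, m: int) -> str:
    """Build a checkpoint file name such as hospital_1_mad3qn_M40.pth."""
    return f"hospital_{agent_number}_mad3qn_M{m}.pth"


for m in (25, 40):
    for agent_number in (1, 2):
        print(model_path(agent_number, m))
```

The result could then be passed to `torch.load(..., map_location=device)` in place of the hard-coded string, which keeps the file name in step with `CONFIG["M"]`.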
Your virtual environment is not activated. Run source venv/bin/activate (macOS/Linux) or venv\Scripts\activate (Windows) and retry.
You are trying to load a model that hasn't been trained yet. Either run the training cell first, or ensure the .pth files from the repository are in the same directory as the notebook.
Your GPU doesn't have enough VRAM. Add this line to force CPU usage:
device = torch.device("cpu")
This usually means insufficient RAM. Try reducing MEMORY_SIZE from 50000 to 20000 and HIDDEN_SIZE from 512 to 128.
Epsilon may be decaying too fast, or learning rate is too high. Try:
CONFIG["EPSILON_DECAY"] = 100000 # Slower decay
CONFIG["LR"] = 1e-4  # Lower learning rate
This happens if CONFIG["M"] and the length of XI_VECTOR don't match. Make sure XI_VECTOR is sliced to [:CONFIG["M"]]; the code already does this, but verify no manual edits broke it.
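A quick guard like the one below makes a mismatch fail with a clear message instead of a shape error deep inside the network. The function name `slice_resources` and the sample values are illustrative assumptions.

```python
def slice_resources(xi_vector, m):
    """Return the first m validator powers, failing loudly if there are too few."""
    if len(xi_vector) < m:
        raise ValueError(
            f"XI_VECTOR has {len(xi_vector)} entries but CONFIG['M'] = {m}"
        )
    return list(xi_vector[:m])


xi_example = [1.0, 2.0, 3.0, 4.0]  # hypothetical validator powers
print(slice_resources(xi_example, 3))
```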
Re-register the kernel:
source venv/bin/activate # or venv\Scripts\activate on Windows
pip install ipykernel
python -m ipykernel install --user --name=bvsop-env --display-name "BVSOP MADRL"
jupyter notebook
If you are new to this topic, here is a quick mental model for each technology used:
Blockchain: A distributed ledger where data is grouped into blocks and verified by multiple nodes (validators) before being added. No single party controls it. Think of it as a shared Google Doc that no one can secretly edit, because every change requires approval from multiple trusted witnesses.
Reinforcement Learning (RL): A training paradigm where an agent learns by trial and error. It takes actions in an environment, receives a reward signal (good/bad), and gradually learns which actions lead to higher long-term rewards. Like training a dog with treats, but the dog is a neural network and the tricks are blockchain configuration decisions.
Multi-Agent RL (MARL): Multiple agents learning simultaneously in the same environment. They can cooperate (sharing a global reward) or compete (pursuing individual rewards). The challenge: from each agent's perspective, the environment appears non-stationary because other agents are also changing their behaviour.
Deep Q-Network (DQN): Uses a neural network to approximate the Q-function: "given this state, what is the expected future reward of taking this action?" Instead of storing a table (which would be astronomically large), the network generalizes across similar states.
Dueling DQN: Splits the Q-network into two streams: one estimates how good the current state is overall (V), and one estimates how much better/worse each specific action is compared to average (A). This is more efficient because many states don't require distinguishing between actions.
Double DQN: Fixes the overestimation bias in standard DQN by using two separate networks: one chooses the action, the other evaluates it. This leads to more conservative, accurate Q-value estimates.
Dec-POMDP: Decentralized Partially Observable Markov Decision Process. Each agent only sees part of the global state (partial observability). Agents make decisions independently (decentralized) without communicating directly. The "Markov" part means the future depends only on the current state, not the full history.
| Term | Definition |
|---|---|
| IP | Intelligent Participant: a hospital, pharmacy, etc. acting as a blockchain node |
| Validator | A computational node that verifies blockchain transactions |
| Block | A batch of transactions packaged together for blockchain submission |
| MOOP | Multi-Objective Optimization Problem |
| MARL | Multi-Agent Reinforcement Learning |
| MAD3QN | Multi-Agent Dueling Double Deep Q-Network, the proposed algorithm |
| D3QN | Dueling Double Deep Q-Network (single-agent version) |
| Dec-POMDP | Decentralized Partially Observable Markov Decision Process |
| Bipartite Reward | The paper's novel two-part reward (global + local) that enables implicit cooperation |
| Zombie Transaction | A transaction that has been waiting longer than the threshold a_th and risks never being processed |
| Exhaustive Search (ES) | A brute-force policy that evaluates all possible actions and picks the best (optimal at each step, but slow) |
| Random Selection (RS) | A policy that picks actions uniformly at random |
| Static-Based (SB) | A policy that always picks the same fixed action regardless of state |
| DPoS | Delegated Proof of Stake: the consensus mechanism used; validators are selected by computational resources |
| CSLR | Cost-Security-Latency-Resource: the combined optimization objective of the system |
| ε-greedy | Exploration strategy: with probability ε take a random action, otherwise take the best known action |
| Replay Buffer | A memory bank storing past experiences (state, action, reward, next_state) for random sampling during training |
| Soft Update | Slowly blending online network weights into the target network: θ_target ← τ·θ_online + (1-τ)·θ_target |
Last updated: May 2026