Harsh-Raj-Jordan/Optimization-BNet-MADRL

🏥 BVSOP-MADRL: Blockchain Validator Selection & Optimization via Multi-Agent Deep Reinforcement Learning


📑 Table of Contents

  1. What Problem Are We Solving?
  2. How We Solve It (The Big Picture)
  3. Repository Structure
  4. Environment Setup
  5. Running in Jupyter Notebook
  6. Running in VS Code
  7. Configuration Reference
  8. Code Architecture
  9. Training the Agents
  10. Testing & Evaluation
  11. Saved Models
  12. Troubleshooting
  13. Research Background
  14. Glossary

🔍 What Problem Are We Solving?

The Real-World Scenario

Imagine multiple hospitals, pharmacies, insurance companies, and the Ministry of Public Health all need to share medical data securely and quickly over a shared blockchain network. Each participant (called an Intelligent Participant, IP) has:

  • Medical transactions waiting in a queue (patient records, test results, drug prescriptions)
  • Each transaction has an urgency level (how fast it needs to be processed), a security level (how many validators are needed), and a queuing time (how long it has been waiting)
  • A shared pool of blockchain validators (nodes with different computational powers) to verify data blocks

The Challenge: Three Conflicting Objectives

Every time a participant wants to send a data block, they must decide three things simultaneously:

| Decision | Effect |
|---|---|
| How many transactions to pack into one block | More → higher latency, lower cost per transaction |
| How many validators to involve | More → more secure, but slower and more expensive |
| Whether to compress the data | Compression → faster, but only appropriate for urgent (non-sensitive) data |

Minimizing latency, maximizing security, and minimizing cost are fundamentally conflicting. You cannot optimize all three perfectly at the same time.

Why Not Just Use Maths?

The classical optimization approach requires solving a non-convex Multi-Objective Optimization Problem (MOOP) at every time step. With 500 possible transaction counts × 40 possible validator counts × 3 compression choices, each agent faces 60,000 candidate actions per step. With multiple agents and the real-time requirements of healthcare, solving this exactly at every step is computationally infeasible in practice.

Our Solution

We use Multi-Agent Deep Reinforcement Learning (MADRL), specifically the MAD3QN algorithm (Multi-Agent Dueling Double Deep Q-Network), to train agents that learn a near-optimal policy offline and can make near-instant decisions at inference time.


🧠 How We Solve It (The Big Picture)

┌─────────────────────────────────────────────────────────────────────┐
│                        IP-HealthChain System                        │
│                                                                     │
│  Hospital 1 (Agent 1)        Hospital 2 (Agent 2)                   │
│  ┌─────────────────┐         ┌─────────────────┐                    │
│  │  Observation:   │         │  Observation:   │                    │
│  │  - Queue state  │         │  - Queue state  │                    │
│  │  - Urgency lvls │         │  - Urgency lvls │                    │
│  │  - Security lvl │         │  - Security lvl │                    │
│  │  - Queuing time │         │  - Queuing time │                    │
│  │  - Validator x_i│         │  - Validator x_i│                    │
│  └────────┬────────┘         └────────┬────────┘                    │
│           │ action (d, tr, v)         │ action (d, tr, v)           │
│           └──────────────┬────────────┘                             │
│                          ▼                                          │
│              ┌───────────────────────┐                              │
│              │   Blockchain Network  │                              │
│              │  (Shared Validators)  │                              │
│              │  x₁, x₂, ..., x_M     │                              │
│              └──────────┬────────────┘                              │
│                         │                                           │
│              ┌──────────▼────────────┐                              │
│              │   Bipartite Reward    │                              │
│              │  r = δ·r_g + (1-δ)·r_l│                              │
│              │  r_g: global (shared) │                              │
│              │  r_l: local (per IP)  │                              │
│              └───────────────────────┘                              │
└─────────────────────────────────────────────────────────────────────┘

The Reward Signal (How the Agent Learns)

The reward has two parts:

Global Reward (r_g): Shared equally by ALL agents. Encourages efficient use of the shared validator pool. If agents waste resources or send when none are available, they are penalized.

Local Reward (r_l): Unique to each agent. Reflects how well the agent optimized its own latency, security, and cost for the transactions it just sent.

r = δ × r_g  +  (1 - δ) × r_l

Where δ (delta) controls the balance between cooperation (high δ) and competition (low δ).
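
As a minimal sketch of the blending formula above: the function name `blend_reward` and the sample reward values are illustrative assumptions, not taken from the notebook. Only the formula and the default δ = 0.2 come from this README.

```python
def blend_reward(r_global: float, r_local: float, delta: float = 0.2) -> float:
    """Bipartite reward: r = delta * r_g + (1 - delta) * r_l."""
    if not 0.0 <= delta <= 1.0:
        raise ValueError("delta must lie in [0, 1]")
    return delta * r_global + (1.0 - delta) * r_local


# Hypothetical values: both agents share the same r_g but earn different r_l.
shared_r_g = 0.5
agent_rewards = [blend_reward(shared_r_g, r_l) for r_l in (0.9, 0.4)]
print(agent_rewards)
```

With the default δ = 0.2, each agent's reward is dominated by its own local term, so the shared term acts as a mild cooperative nudge rather than the main signal.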


📁 Repository Structure

BVSOP-MADRL/
│
├── BVSOP-MADRL-1.ipynb          ← Main Jupyter Notebook (all code lives here)
│
├── hospital_1_mad3qn_M25.pth    ← Trained model: Agent 1, M=25 validators
├── hospital_1_mad3qn_M40.pth    ← Trained model: Agent 1, M=40 validators
├── hospital_2_mad3qn_M25.pth    ← Trained model: Agent 2, M=25 validators
├── hospital_2_mad3qn_M40.pth    ← Trained model: Agent 2, M=40 validators
│
├── README.md                    ← This file
└── requirements.txt             ← All Python dependencies with pinned versions

Note on .pth files: These contain PyTorch model weights from completed training runs, so you can skip training and go straight to evaluation.


⚙️ Environment Setup

Windows

Step 1: Install Python 3.11

Download the installer from python.org. During installation:

  • ✅ Check "Add Python to PATH"
  • ✅ Check "Install for all users"

Verify in PowerShell:

python --version
# Expected: Python 3.11.x

Step 2: Clone or download the repository

# If you have git installed:
git clone https://github.com/Harsh-Raj-Jordan/Optimization-BNet-MADRL.git
cd Optimization-BNet-MADRL

# OR simply download the ZIP from GitHub and extract it, then open PowerShell inside the folder.

Step 3: Create a virtual environment

python -m venv venv
venv\Scripts\activate
# Your prompt should now show: (venv) PS C:\...\Optimization-BNet-MADRL>

Step 4: Install dependencies

pip install --upgrade pip
pip install -r requirements.txt

Step 5: (Optional) Install Jupyter

pip install jupyter notebook ipykernel
python -m ipykernel install --user --name=bvsop-env --display-name "BVSOP MADRL"

macOS (Apple Silicon & Intel)

Step 1: Install Homebrew and Python

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
brew install python@3.11

Verify:

python3.11 --version
# Expected: Python 3.11.x

Step 2: Clone the repository

git clone https://github.com/Harsh-Raj-Jordan/Optimization-BNet-MADRL.git
cd Optimization-BNet-MADRL

Step 3: Create a virtual environment

python3.11 -m venv venv
source venv/bin/activate
# Your prompt should now show: (venv) user@machine Optimization-BNet-MADRL %

Step 4: Install dependencies

pip install --upgrade pip
pip install -r requirements.txt

Apple Silicon (M1/M2/M3) Note: PyTorch supports Apple's Metal Performance Shaders (MPS) backend. The notebook only checks for CUDA, so MPS is not selected automatically. If you want GPU acceleration on Apple Silicon, change the device line in the notebook:

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
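
If the same notebook should run unchanged on NVIDIA, Apple Silicon, and CPU-only machines, one option is a small fallback chain. The helper below is a hypothetical sketch (the name `pick_device_name` is not from the notebook); it takes the availability flags as plain booleans so that it can be read and tested without PyTorch installed.

```python
def pick_device_name(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then Apple's MPS backend, then fall back to the CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


# In the notebook (requires torch), this could be wired up as:
# device = torch.device(pick_device_name(
#     torch.cuda.is_available(),
#     torch.backends.mps.is_available(),
# ))
print(pick_device_name(cuda_available=False, mps_available=True))
```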

Step 5: (Optional) Install Jupyter

pip install jupyter notebook ipykernel
python -m ipykernel install --user --name=bvsop-env --display-name "BVSOP MADRL"

Linux (Ubuntu/Debian)

Step 1: Install Python and build tools

sudo apt update && sudo apt upgrade -y
sudo apt install python3.11 python3.11-venv python3.11-dev build-essential git -y

Verify:

python3.11 --version

Step 2: Clone the repository

git clone https://github.com/Harsh-Raj-Jordan/Optimization-BNet-MADRL.git
cd Optimization-BNet-MADRL

Step 3: Create a virtual environment

python3.11 -m venv venv
source venv/bin/activate

Step 4: Install dependencies

pip install --upgrade pip
pip install -r requirements.txt

Step 5: (Optional) CUDA Setup for NVIDIA GPU

First verify your CUDA version:

nvidia-smi
# Note the "CUDA Version" shown in the top-right corner

If CUDA 12.x is available, PyTorch from requirements.txt should work. If you need a specific CUDA build, visit pytorch.org/get-started and replace the torch line accordingly.

Step 6: (Optional) Install Jupyter

pip install jupyter notebook ipykernel
python -m ipykernel install --user --name=bvsop-env --display-name "BVSOP MADRL"

📓 Running in Jupyter Notebook

Step 1: Activate your virtual environment (if not already active)

# Windows:
venv\Scripts\activate

# macOS/Linux:
source venv/bin/activate

Step 2: Launch Jupyter

jupyter notebook

This opens a browser tab at http://localhost:8888.

Step 3: Open the notebook

Click on BVSOP-MADRL-1.ipynb in the browser file tree.

Step 4: Select the correct kernel

In the notebook menu: Kernel → Change kernel → BVSOP MADRL

If you skipped the ipykernel installation step, select any available Python 3 kernel that uses the venv.

Step 5: Run cells in order

The notebook is divided into the following sections. Run them top to bottom:

| Section | What it Does |
|---|---|
| Cell 1: Configuration & Imports | Sets all hyperparameters, imports libraries, defines the XI_VECTOR of validator compute powers |
| Cell 2: Environment (IPHealthChainEnv) | Defines the simulation world: queues, validators, state transitions, reward computation |
| Cell 3: Neural Network (BranchingDuelingDQN) | Defines the Dueling DQN architecture with separate heads for each action sub-space |
| Cell 4: Agents | Defines MAD3QN, RS (Random), SB (Static), and ES (Exhaustive Search) agents |
| Cell 5: Training Loop | Runs the main training, saves .pth model files, records history |
| Cell 6: Evaluation | Loads a saved model and runs inference on a fixed environment |
| Cell 7: Plotting | Generates all result visualizations |

Tip: Use Kernel → Restart & Run All for a clean full run. Expected training time: ~15–30 minutes on CPU for 3500 episodes.


🖥️ Running in VS Code

Step 1: Install VS Code

Download from code.visualstudio.com.

Step 2: Install required extensions

Open VS Code, press Ctrl+Shift+X (or Cmd+Shift+X on Mac), and install:

  • Python (by Microsoft)
  • Jupyter (by Microsoft)

Step 3: Open the project folder

File → Open Folder → select the repository folder (Optimization-BNet-MADRL if you cloned it with git)

Step 4: Select the Python interpreter

Press Ctrl+Shift+P (Cmd+Shift+P on Mac), type Python: Select Interpreter, and choose the one that shows venv in its path, e.g.:

  • Windows: .\venv\Scripts\python.exe
  • macOS/Linux: ./venv/bin/python

Step 5: Open and run the notebook

Click on BVSOP-MADRL-1.ipynb in the Explorer panel. VS Code will render it as a notebook. Click "Select Kernel" in the top right and choose venv (Python 3.11).

Run cells individually with the ▶ button, or run all with Run All in the toolbar.

Tip: Look for Python 3.11.x ('venv') in the kernel picker or the status bar to confirm that you are using the right environment. To check whether CUDA is detected, run torch.cuda.is_available() in a notebook cell.


⚙️ Configuration Reference

All parameters are defined in the CONFIG dictionary at the top of the notebook. Here is a complete explanation:

Blockchain & Physics Parameters

| Parameter | Default | Meaning |
|---|---|---|
| w | 0.5e6 Hz | Channel bandwidth |
| SNR_d_db | 10 dB | Downlink signal-to-noise ratio |
| SNR_u_db | 12 dB | Uplink signal-to-noise ratio |
| r_d | 1.2e6 bps | Downlink transmission rate (BM → Validators) |
| r_u | 1.3e6 bps | Uplink transmission rate (Validators → BM) |
| xi_hat | 0.5e6 bits | Verification feedback size (ξ̂) |
| zeta | 500 bytes | Average transaction size (ζ) |
| q | 4.0 | Network scale indicator for security function |
| kappa | 1.0 | Security coefficient (κ) |
| psi | 0.001 | Consensus latency coefficient (ψ) |
| G_required | 100.0 | Computation required for block verification (K) |

System Scale Parameters

| Parameter | Default | Meaning |
|---|---|---|
| T_r / X_max | 500 | Maximum transactions per block |
| M | 40 | Total number of available validators |
| N_agents | 2 | Number of intelligent participants (hospitals) |

Switching Settings: To reproduce the paper's "Setting 1" (M=25), change "M": 40 to "M": 25 in CONFIG and uncomment the XI_VECTOR for M=25. The code will automatically use the correct pre-trained model file (_M25.pth).
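
The file names shipped with the repository follow a regular pattern, so the model file for a given setting can be derived from M. The helper below is a hypothetical illustration of that pattern (`model_filename` is not a function from the notebook, and how the notebook builds the path internally is not shown in this README).

```python
def model_filename(agent_index: int, num_validators: int) -> str:
    """Build a weights file name such as 'hospital_1_mad3qn_M25.pth'."""
    return f"hospital_{agent_index}_mad3qn_M{num_validators}.pth"


# Setting 1 (M=25) for both hospitals.
setting_1_files = [model_filename(agent, 25) for agent in (1, 2)]
print(setting_1_files)
```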

Reward Weighting Parameters

| Parameter | Default | Meaning |
|---|---|---|
| alpha | 0.33 | Weight for latency objective |
| beta | 0.33 | Weight for security objective |
| gamma | 0.34 | Weight for cost objective |
| delta | 0.2 | Global vs local reward balance (0=fully competitive, 1=fully cooperative) |

Application-Level Thresholds

| Parameter | Default | Meaning |
|---|---|---|
| u_th | 0.5 | Urgency threshold for compression eligibility |
| a_th | 5 | Maximum allowed queuing time (zombie threshold) |
| l_p | 0.5 | Latency penalty coefficient |
| s_p | 0.5 | Security penalty coefficient |
| d_p | 0.5 | Wrong compression decision penalty |
| U_p | 0.5 | Insufficient resource utilization penalty |
| epsilon_error | 0.2 | Accepted security error margin (ε) |

RL Hyperparameters

| Parameter | Default | Meaning |
|---|---|---|
| LR | 3e-4 | Learning rate for RMSprop optimizer |
| GAMMA | 0.99 | Discount factor (λ): how much to value future rewards |
| BATCH_SIZE | 128 | Number of experiences sampled per learning step |
| MEMORY_SIZE | 50000 | Replay buffer capacity |
| EPSILON_START | 1.0 | Initial exploration rate (100% random actions) |
| EPSILON_END | 0.01 | Final exploration rate (1% random actions) |
| EPSILON_DECAY | 59500 | Number of steps to linearly decay epsilon |
| TAU | 0.001 | Soft update rate for target network |
| EPISODES | 3500 | Total training episodes |
| STEPS_PER_EP | 20 | Environment steps per episode |
| HIDDEN_SIZE | 512 | Neurons per hidden layer in the neural network |
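
To see how the exploration settings interact with the training length, the linear schedule can be sketched as follows. The function `epsilon_at` is a hypothetical helper, and it assumes the decay is counted in environment steps; the README does not say how the notebook counts them.

```python
EPSILON_START = 1.0
EPSILON_END = 0.01
EPSILON_DECAY = 59_500   # steps over which epsilon decays linearly
EPISODES = 3_500
STEPS_PER_EP = 20


def epsilon_at(step: int) -> float:
    """Linearly interpolate from EPSILON_START to EPSILON_END, then hold."""
    fraction = min(max(step, 0) / EPSILON_DECAY, 1.0)
    return EPSILON_START + fraction * (EPSILON_END - EPSILON_START)


total_steps = EPISODES * STEPS_PER_EP          # 3500 * 20 = 70,000
decay_share = EPSILON_DECAY / total_steps      # 0.85
for step in (0, 29_750, 59_500, total_steps):
    print(step, round(epsilon_at(step), 4))
```

Under these assumptions, exploration decays over the first 85% of training and the last 10,500 steps run at the final rate of 0.01.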

🏗️ Code Architecture

IPHealthChainEnv: The Simulation Environment

This class simulates the blockchain network world. Think of it as a game environment.

IPHealthChainEnv
│
├── __init__()         - Sets up validators, computes normalization bounds
├── _init_resources()  - Assigns random or fixed computational power (x_i) to validators
├── reset()            - Starts a fresh episode with a new random queue state
├── _get_obs()         - Builds the observation vector each agent sees
├── compute_security() - S(v) = κ · v^q
├── compute_latency()  - L(tr, v, ζ) = download + verify + consensus + upload
├── compute_cost()     - C(tr, v) = Σx_i / tr
└── step(actions)      - Takes actions from all agents, computes rewards, advances state
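
The two closed-form objectives listed above can be sketched as standalone functions. This is an illustration under assumptions, not the notebook's code: the signatures are hypothetical, and the sum Σx_i is assumed to run over the compute powers of the validators selected for the block. Latency is omitted because its four terms are not spelled out in this README.

```python
def compute_security(v: int, kappa: float = 1.0, q: float = 4.0) -> float:
    """Security S(v) = kappa * v**q for v selected validators."""
    return kappa * v ** q


def compute_cost(selected_powers, tr: int) -> float:
    """Cost C(tr, v) = sum(x_i) / tr, i.e. compute spent per transaction.

    Assumption: selected_powers holds x_i for the chosen validators only.
    """
    if tr <= 0:
        raise ValueError("tr must be a positive number of transactions")
    return sum(selected_powers) / tr


# Hypothetical example: 3 validators verify a block of 30 transactions.
print(compute_security(3))                       # 1.0 * 3**4
print(compute_cost([10.0, 20.0, 30.0], tr=30))   # 60 / 30
```

With q = 4.0, security grows steeply with the number of validators, while cost per transaction falls as more transactions share one block; this is the tension the agents have to balance.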

What is the observation vector?

Each agent sees a flattened vector containing:

  1. Its transaction queue: up to X_max transactions, each with (urgency, security, queuing_time), shape (X_max × 3,)
  2. The shared validator resources: each validator's normalized compute power, shape (M × 2,)

Total observation size ≈ 500 × 3 + 40 × 2 = 1580 floats per agent.
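
The size arithmetic above can be checked with a small flattening sketch. Several details here are assumptions for illustration: `build_observation` is a hypothetical helper, short queues are zero-padded to X_max, and each validator is represented by two features whose exact meaning is not stated in this README.

```python
X_MAX = 500   # maximum transactions per block
M = 40        # number of validators


def build_observation(queue, validators):
    """Flatten a queue of (urgency, security, queuing_time) triples and
    a list of 2-feature validator entries into one list of floats."""
    padded = list(queue)[:X_MAX]
    padded += [(0.0, 0.0, 0.0)] * (X_MAX - len(padded))  # assumed zero-padding
    flat = [value for tx in padded for value in tx]
    flat += [value for entry in validators for value in entry]
    return flat


# Hypothetical inputs: two queued transactions and 40 identical validators.
queue = [(0.8, 0.3, 1.0), (0.2, 0.9, 4.0)]
validators = [(0.5, 1.0)] * M
obs = build_observation(queue, validators)
print(len(obs))   # X_MAX * 3 + M * 2
```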

What is the zombie prevention algorithm?

If a transaction has been waiting longer than a_th steps, it becomes a "zombie": it risks never being processed because higher-urgency transactions keep jumping the queue. The environment forces a send action when zombies are detected, overriding the agent's idle decision. This prevents starvation of low-urgency medical data.
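
The override rule described above can be sketched in a few lines. This is a simplified illustration, not the environment's implementation: the helper names are hypothetical, and the agent's decision is reduced to a boolean "wants to send" flag because the README does not say how an idle action is encoded.

```python
A_TH = 5  # zombie threshold from the configuration table


def has_zombie(queue, a_th: int = A_TH) -> bool:
    """True if any (urgency, security, queuing_time) entry waited longer than a_th."""
    return any(queuing_time > a_th for _, _, queuing_time in queue)


def effective_send(agent_wants_to_send: bool, queue) -> bool:
    """Force a send when a zombie is present, overriding an idle decision."""
    return agent_wants_to_send or has_zombie(queue)


fresh_queue = [(0.9, 0.2, 1), (0.1, 0.5, 5)]   # nothing above the threshold
stale_queue = [(0.9, 0.2, 1), (0.1, 0.5, 6)]   # second transaction is a zombie
print(effective_send(False, fresh_queue), effective_send(False, stale_queue))
```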


BranchingDuelingDQN: The Neural Network

Input: observation vector (1580-dim)
    ↓
[Conv1D 1→16, k=3]  [ReLU]
    ↓
[Conv1D 16→32, k=3] [ReLU]
    ↓
[Flatten → conv_out_size]
    ↓
    ├─────────────────┬─────────────────┬─────────────────┐
    │                 │                 │                 │
[Value Stream V]  [Advantage A_d]   [Advantage A_tr]  [Advantage A_v]
[Linear→512→ReLU] [Linear→512→ReLU] [Linear→512→ReLU] [Linear→512→ReLU]
[Linear→1]        [Linear→3]        [Linear→500]      [Linear→40]
    │                 │                 │                 │
    └─────────────────┴─────────────────┴─────────────────┘
              Q_d  = V + (A_d  - mean(A_d))
              Q_tr = V + (A_tr - mean(A_tr))
              Q_v  = V + (A_v  - mean(A_v))

Why three output heads? The action space is multi-discrete: d ∈ {0,1,2}, tr ∈ {1..500}, v ∈ {1..40}. A single head over the joint action space would need 3×500×40 = 60,000 outputs, which is impractical to train. The branching architecture instead gives each sub-action its own head, for 3 + 500 + 40 = 543 outputs in total.

Why Dueling? The dueling architecture separates the value of a state (V) from the advantage of a specific action (A). This makes learning more stable because the shared state-value estimate is updated on every learning step, whichever action was taken, which typically speeds up convergence.
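
The per-branch aggregation Q = V + (A - mean(A)) from the diagram can be written out with plain lists. This is a minimal numeric sketch with made-up values; the notebook itself presumably does this on tensors, and `dueling_q` is a hypothetical name.

```python
def dueling_q(value: float, advantages):
    """Combine a scalar state value with per-action advantages.

    Subtracting the mean advantage keeps V identifiable: the mean of the
    resulting Q-values equals V.
    """
    mean_advantage = sum(advantages) / len(advantages)
    return [value + (a - mean_advantage) for a in advantages]


# Hypothetical outputs for the compression branch (3 choices of d).
state_value = 2.0
q_d = dueling_q(state_value, [1.0, 2.0, 3.0])
print(q_d)
```

The same function would be applied separately to the A_tr and A_v branches, each sharing the single V output.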

Why Double DQN? Standard DQN tends to overestimate Q-values because it uses the same network for both action selection and evaluation. Double DQN uses the online network to select the best next action and the target network to evaluate it, which reduces this bias.
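
The difference between the two targets is easiest to see on a single transition. The sketch below handles one action branch with plain lists; the function name and the Q-values are illustrative assumptions, and how the notebook combines targets across the three branches is not described in this README.

```python
def double_dqn_target(reward, done, q_online_next, q_target_next, gamma=0.99):
    """Double DQN: the online net selects the action, the target net scores it."""
    if done:
        return reward
    best_action = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return reward + gamma * q_target_next[best_action]


def standard_dqn_target(reward, done, q_target_next, gamma=0.99):
    """Standard DQN: the target net both selects and scores (max operator)."""
    if done:
        return reward
    return reward + gamma * max(q_target_next)


# Hypothetical next-state Q-values for a 3-action branch.
q_online_next = [0.1, 0.9, 0.3]   # online net prefers action 1
q_target_next = [5.0, 2.0, 7.0]   # target net scores action 1 at 2.0
print(double_dqn_target(1.0, False, q_online_next, q_target_next))
print(standard_dqn_target(1.0, False, q_target_next))
```

In this example the standard target chases the largest (possibly noisy) target value, while the Double DQN target uses the more modest estimate for the action the online network actually prefers.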


The Four Competing Agents

| Agent | Strategy | Purpose |
|---|---|---|
| MAD3QNAgent | Learns optimal policy via experience replay + neural network | Our proposed method |
| RSAgent | Picks actions uniformly at random | Lower-bound baseline; represents no intelligence |
| SBAgent | Always picks the same fixed action | Baseline for zero-adaptation behavior |
| ESAgent | Exhaustively evaluates all possible actions at each step | Near-optimal greedy baseline (infeasible at scale) |
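
To make the two extreme baselines concrete, here is a hedged sketch of random selection and exhaustive search over the (d, tr, v) action space. The function names and `toy_score` are hypothetical stand-ins; the real agents score actions with the environment's reward, which is not reproduced here.

```python
import itertools
import random

D_CHOICES = range(3)          # compression choice d in {0, 1, 2}
TR_CHOICES = range(1, 501)    # transactions per block
V_CHOICES = range(1, 41)      # validators per block


def random_action(rng: random.Random):
    """RS-style baseline: sample each sub-action uniformly at random."""
    return (rng.choice(D_CHOICES), rng.choice(TR_CHOICES), rng.choice(V_CHOICES))


def exhaustive_action(score):
    """ES-style baseline: evaluate all 60,000 joint actions, keep the best."""
    return max(itertools.product(D_CHOICES, TR_CHOICES, V_CHOICES), key=score)


def toy_score(action):
    """Made-up objective with a single best action at (0, 120, 7)."""
    d, tr, v = action
    return -abs(tr - 120) - abs(v - 7) - d


print(random_action(random.Random(0)))
print(exhaustive_action(toy_score))
```

The exhaustive variant needs 60,000 score evaluations per agent per step, which is why it serves as a reference point rather than a deployable policy.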

🚀 Training the Agents

Quick Start

Open the notebook and run all cells in order. Training begins at the cell labeled "Main Training Loop".

What Happens During Training

  1. The environment resets: fresh random transaction queues and validator resources
  2. Each agent observes its local state (partial observability)
  3. With probability ε (epsilon), the agent picks a random action (exploration); otherwise it picks the greedy action from its Q-network (exploitation)
  4. The environment executes all actions simultaneously, computes bipartite rewards, and transitions to the next state
  5. Each experience tuple (obs, action, reward, next_obs, done) is stored in the replay buffer
  6. Every step, a random mini-batch of 128 experiences is sampled to update the Q-network (this breaks temporal correlations)
  7. The target network is soft-updated toward the online network every step: θ_target ← τ·θ_online + (1-τ)·θ_target
  8. Epsilon decays linearly over 59,500 steps until it reaches 0.01
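
Steps 5 to 7 can be sketched without any deep learning library. The class and function below are simplified stand-ins, not the notebook's implementation: network weights are modelled as a dict of floats instead of tensors, and the names `ReplayBuffer` and `soft_update` are assumptions.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity memory; the oldest experiences are dropped first."""

    def __init__(self, capacity: int = 50_000):
        self.memory = deque(maxlen=capacity)

    def push(self, obs, action, reward, next_obs, done):
        self.memory.append((obs, action, reward, next_obs, done))

    def sample(self, batch_size: int = 128):
        # Uniform sampling breaks the temporal correlation between steps.
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)


def soft_update(target_params: dict, online_params: dict, tau: float = 0.001):
    """theta_target <- tau * theta_online + (1 - tau) * theta_target, in place."""
    for name, online_value in online_params.items():
        target_params[name] = tau * online_value + (1.0 - tau) * target_params[name]


buffer = ReplayBuffer(capacity=3)
for step in range(5):                       # only the last 3 pushes survive
    buffer.push([step], (0, 1, 1), 0.0, [step + 1], False)
batch = buffer.sample(batch_size=2)

target = {"w": 0.0}
soft_update(target, {"w": 1.0})
print(len(buffer), len(batch), target)
```

With τ = 0.001 the target network moves only 0.1% of the way toward the online network per step, which keeps the bootstrapped targets slowly changing.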

Progress Monitoring

Every 100 episodes, the console prints:

Episode 100/3500 | H1 Reward: 45.2341 | H2 Reward: 44.8901 | Resources: 67.3%

Healthy training shows gradually increasing rewards and increasing resource utilization over time. Expect convergence around episode 2000–3000.

Saved Outputs

After training completes, two model files are saved:

hospital_1_mad3qn_M40.pth   ← Agent 1's learned weights
hospital_2_mad3qn_M40.pth   ← Agent 2's learned weights

🧪 Testing & Evaluation

Loading a Pre-Trained Model

The evaluation cell creates a fixed-resource environment using the XI_VECTOR from the paper (so results are deterministic and reproducible) and loads saved weights:

eval_env = IPHealthChainEnv(CONFIG, fixed_resources=XI_VECTOR)
eval_agents[0].online_net.load_state_dict(
    torch.load("hospital_1_mad3qn_M40.pth", map_location=device)
)
eval_agents[0].epsilon = 0.0   # ← Disables all exploration; purely greedy

What the Output Means

----------------------------------------------------------------------------------
Selected: [0 0 1 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 1 0 ... ]
m: x
n: y
Utility U: 0.abcd...
----------------------------------------------------------------------------------
| Field | Meaning |
|---|---|
| Selected | Binary mask of length M: a 1 means that validator was chosen |
| m | Number of validators selected for this block |
| n | Number of transactions packed into this block |
| Utility U | The final bipartite reward value (0–1 range, higher is better) |
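
For post-processing logged results, the relationship between the Selected mask and m can be checked with a small helper. This is a hypothetical convenience function with made-up input, not part of the notebook's output code.

```python
def summarize_selection(selected_mask):
    """Return the chosen validator indices and their count (m)."""
    chosen = [index for index, bit in enumerate(selected_mask) if bit == 1]
    return {"m": len(chosen), "indices": chosen}


# Hypothetical mask for M = 10 validators.
mask = [0, 0, 1, 0, 0, 1, 0, 0, 1, 0]
print(summarize_selection(mask))
```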

💾 Saved Models

Four pre-trained model files are included:

| File | Setting | Agent | Episodes Trained |
|---|---|---|---|
| hospital_1_mad3qn_M25.pth | M=25 validators, Setting 1 | Agent 1 | 3500 |
| hospital_2_mad3qn_M25.pth | M=25 validators, Setting 1 | Agent 2 | 3500 |
| hospital_1_mad3qn_M40.pth | M=40 validators, Setting 2 | Agent 1 | 3500 |
| hospital_2_mad3qn_M40.pth | M=40 validators, Setting 2 | Agent 2 | 3500 |

To load a specific model, update the path in the evaluation cell:

# For M=25 Setting 1:
eval_agents[0].online_net.load_state_dict(
    torch.load("hospital_1_mad3qn_M25.pth", map_location=device)
)

# For M=40 Setting 2 (default):
eval_agents[0].online_net.load_state_dict(
    torch.load("hospital_1_mad3qn_M40.pth", map_location=device)
)

Remember: When switching between M=25 and M=40, also update CONFIG["M"] and the XI_VECTOR accordingly before running any cell.


🔧 Troubleshooting

ModuleNotFoundError: No module named 'torch'

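This usually means your virtual environment is not activated, or the dependencies were never installed into it. Run source venv/bin/activate (macOS/Linux) or venv\Scripts\activate (Windows) and retry; if the error persists, run pip install -r requirements.txt inside the activated environment.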

FileNotFoundError: hospital_1_mad3qn_M40.pth

You are trying to load a model that hasn't been trained yet. Either run the training cell first, or ensure the .pth files from the repository are in the same directory as the notebook.

CUDA errors like RuntimeError: CUDA out of memory

Your GPU doesn't have enough VRAM. Add this line to force CPU usage:

device = torch.device("cpu")

Notebook kernel crashes immediately

This usually means insufficient RAM. Try reducing MEMORY_SIZE from 50000 to 20000 and HIDDEN_SIZE from 512 to 128.

Training reward stuck at ~10 and not improving

Epsilon may be decaying too fast, or learning rate is too high. Try:

CONFIG["EPSILON_DECAY"] = 100000  # Slower decay
CONFIG["LR"] = 1e-4               # Lower learning rate

ValueError: operands could not be broadcast together

This happens if CONFIG["M"] and the length of XI_VECTOR don't match. Make sure XI_VECTOR is sliced to [:CONFIG["M"]]. The code already does this, but verify that no manual edits broke it.
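
A quick guard like the one below can surface this mismatch with a clearer message before any array math runs. It is a hypothetical helper (`validator_powers` is not in the notebook), and the XI_VECTOR values shown are placeholders rather than the paper's numbers.

```python
def validator_powers(xi_vector, m: int):
    """Return the first m validator powers, failing early if too few exist."""
    if len(xi_vector) < m:
        raise ValueError(
            f"XI_VECTOR has {len(xi_vector)} entries but CONFIG['M'] is {m}"
        )
    return list(xi_vector[:m])


# Placeholder values for illustration only.
XI_VECTOR = [10.0, 20.0, 30.0, 40.0]
CONFIG = {"M": 3}
print(validator_powers(XI_VECTOR, CONFIG["M"]))
```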

Jupyter not finding the venv kernel

Re-register the kernel:

source venv/bin/activate   # or venv\Scripts\activate on Windows
pip install ipykernel
python -m ipykernel install --user --name=bvsop-env --display-name "BVSOP MADRL"
jupyter notebook

📚 Research Background

If you are new to this topic, here is a quick mental model for each technology used:

Blockchain: A distributed ledger where data is grouped into blocks and verified by multiple nodes (validators) before being added. No single party controls it. Think of it as a shared Google Doc that no one can secretly edit: every change requires approval from multiple trusted witnesses.

Reinforcement Learning (RL): A training paradigm where an agent learns by trial and error. It takes actions in an environment, receives a reward signal (good/bad), and gradually learns which actions lead to higher long-term rewards. Like training a dog with treats, but the dog is a neural network and the tricks are blockchain configuration decisions.

Multi-Agent RL (MARL): Multiple agents learning simultaneously in the same environment. They can cooperate (sharing a global reward) or compete (pursuing individual rewards). The challenge: from each agent's perspective, the environment appears non-stationary because other agents are also changing their behaviour.

Deep Q-Network (DQN): Uses a neural network to approximate the Q-function: "given this state, what is the expected future reward of taking this action?" Instead of storing a table (which would be astronomically large), the network generalizes across similar states.

Dueling DQN: Splits the Q-network into two streams: one estimates how good the current state is overall (V), and one estimates how much better or worse each specific action is compared to average (A). This is more efficient because in many states the choice of action matters little.

Double DQN: Reduces the overestimation bias in standard DQN by decoupling action selection from evaluation: one network chooses the action, the other evaluates it. This leads to more conservative, more accurate Q-value estimates.

Dec-POMDP: Decentralized Partially Observable Markov Decision Process. Each agent only sees part of the global state (partial observability). Agents make decisions independently (decentralized) without communicating directly. The "Markov" part means the future depends only on the current state, not the full history.


📖 Glossary

| Term | Definition |
|---|---|
| IP | Intelligent Participant: a hospital, pharmacy, etc. acting as a blockchain node |
| Validator | A computational node that verifies blockchain transactions |
| Block | A batch of transactions packaged together for blockchain submission |
| MOOP | Multi-Objective Optimization Problem |
| MARL | Multi-Agent Reinforcement Learning |
| MAD3QN | Multi-Agent Dueling Double Deep Q-Network, the proposed algorithm |
| D3QN | Dueling Double Deep Q-Network (single-agent version) |
| Dec-POMDP | Decentralized Partially Observable Markov Decision Process |
| Bipartite Reward | The paper's novel two-part reward (global + local) that enables implicit cooperation |
| Zombie Transaction | A transaction that has been waiting longer than the threshold a_th and risks never being processed |
| Exhaustive Search (ES) | A brute-force policy that evaluates all possible actions and picks the best; optimal but slow |
| Random Selection (RS) | A policy that picks actions uniformly at random |
| Static-Based (SB) | A policy that always picks the same fixed action regardless of state |
| DPoS | Delegated Proof of Stake, the consensus mechanism used; validators are selected by computational resources |
| CSLR | Cost-Security-Latency-Resource, the combined optimization objective of the system |
| ε-greedy | Exploration strategy: with probability ε take a random action, otherwise take the best known action |
| Replay Buffer | A memory bank storing past experiences (state, action, reward, next_state) for random sampling during training |
| Soft Update | Slowly blending online network weights into the target network: θ_target ← τ·θ_online + (1-τ)·θ_target |

Last updated: May 2026

About

Optimization of blockchain-networks using multi-agent deep reinforcement learning
