# Neuromorphic Computing meets Reinforcement Learning: A Hands-On Workshop

**Welcome!** This workshop will guide you through the fundamentals of neuromorphic computing (NC) and reinforcement learning (RL), culminating in a project that combines both fields. We'll focus on intuitive explanations, mathematical foundations, and practical coding examples.

**Target Audience:** Beginners with some Python programming experience. No prior knowledge of NC or RL is strictly required.

**Estimated Duration:** 6-8 hours (including breaks and exercises)

**Learning Objectives:**
* Understand the biological inspiration and core concepts of Neuromorphic Computing.
* Simulate basic Spiking Neuron Models (like Leaky Integrate-and-Fire).
* Build and simulate simple Spiking Neural Networks (SNNs).
* Grasp the fundamentals of Reinforcement Learning (Agents, Environments, Rewards, Policies).
* Implement a basic RL algorithm (Q-Learning).
* Explore how SNNs can be integrated with RL for potential benefits like energy efficiency.
* Implement a simple project combining SNNs and RL.

**Libraries We'll Use:**
*   **Brian2:** A powerful Python simulator for spiking neural networks.
*   **NumPy:** For numerical operations.
*   **Matplotlib:** For plotting and visualization.
*   **(Optional) Gym/Gymnasium:** For standard RL environments (though we might use custom simple ones).

## Part 0: Setup and Introduction (15 mins)

Let's ensure you have the necessary libraries installed and briefly set the stage.

### 0.1. Installation

If you haven't already, please install the required libraries. You can do this by running the following cell (uncomment the lines):

In [None]:
# !pip install brian2 numpy matplotlib notebook ipywidgets
# Optional for later RL parts, or if using standard environments:
# !pip install gymnasium

### 0.2. Importing Libraries
We'll import the necessary libraries as we need them, but let's start with the basics.

In [None]:
import brian2 as b2
import numpy as np
import matplotlib.pyplot as plt
import time  # For timing simulations
from random import sample  # Python's sampler; note that Brian2 connect() strings use their own sample() syntax

# Configure Matplotlib for inline display in Jupyter
%matplotlib inline
%config InlineBackend.figure_format = 'retina' # For higher resolution plots

print("Libraries imported successfully!")
try:
    print(f"Brian2 version: {b2.__version__}")
    print(f"NumPy version: {np.__version__}")
except NameError:
    print("One or more libraries might not be installed correctly.")

### 0.3. Workshop Overview

*   **Module 1: Neuromorphic Computing Fundamentals:** Biological inspiration, SNNs, LIF neuron model. (≈ 1.5 hours)
*   **Module 2: Building Simple Spiking Networks:** Connecting neurons, encoding information, simulation. (≈ 1.5 hours)
*   **Module 3: Reinforcement Learning Fundamentals:** The RL loop, Q-Learning. (≈ 1.5 hours)
*   **Module 4: Bridging NC and RL:** Why combine them? Simple integrated project. (≈ 1.5 hours)

## Module 1: Neuromorphic Computing Fundamentals (≈ 1.5 hours)

Let's dive into the world inspired by the brain!

### 1.1. What is Neuromorphic Computing?

*   **Biological Inspiration:** Mimicking the structure and function of the biological nervous system (neurons, synapses).
*   **Key Differences from Traditional Computing (von Neumann):**
    *   *Colocated Memory and Processing:* Reduces the bottleneck of data movement.
    *   *Event-Driven Computation:* Computations happen only when "spikes" (events) occur, leading to potential energy savings.
    *   *Massive Parallelism:* Similar to the brain's architecture.
*   **Why Neuromorphic? Potential Advantages:**
    *   **Energy Efficiency:** Especially for tasks involving sparse, real-time data.
    *   **Real-time Processing:** Handling sensory data streams naturally.
    *   **Robustness:** Potential for fault tolerance.
    *   **Online Learning:** Adapting continuously, like biological systems.
*   **Spiking Neural Networks (SNNs):** The primary computational model in neuromorphic computing. Unlike traditional Artificial Neural Networks (ANNs) that pass continuous values, SNNs communicate using discrete events (spikes) occurring at specific points in time.

### 1.2. The Neuron: Biological vs. Artificial

*   **Biological Neuron:** Receives signals through dendrites, integrates them at the soma (cell body), and if a threshold is reached, fires an action potential (spike) down the axon.
*   **Spiking Neuron Models:** Mathematical abstractions capturing this behavior. Many models exist (Hodgkin-Huxley, Izhikevich, LIF). We'll focus on the **Leaky Integrate-and-Fire (LIF)** model due to its simplicity and computational efficiency.

### 1.3. The Leaky Integrate-and-Fire (LIF) Model

Imagine a bucket (membrane potential `V`) with a small leak (leak conductance). Water (current `I`) flows into the bucket. If the water level reaches a certain height (threshold `V_th`), the bucket is instantly emptied (reset `V_reset`) and a signal (spike) is sent. The leak ensures that without continuous input, the water level gradually drops back to a resting level (`V_rest`).

**Mathematical Foundation:**

The change in membrane potential `V` over time `t` is described by the differential equation:

$$ \tau_m \frac{dV}{dt} = -(V - V_{rest}) + R_m \cdot I $$

Where:
*   `τ_m` (tau_m): Membrane time constant (how quickly the potential changes/leaks). `τ_m = R_m * C_m`, where `C_m` is the membrane capacitance.
*   `V`: Membrane potential.
*   `V_rest`: Resting potential.
*   `R_m`: Membrane resistance.
*   `I`: Input current.

**Spiking Mechanism:**
*   If `V >= V_th` (threshold potential):
    1.  A spike is generated.
    2.  `V` is reset to `V_reset`.
    3.  (Optional) Refractory period: The neuron cannot spike for a short duration after firing.
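
To make the equation and the spiking mechanism concrete without any simulator, here is a minimal forward-Euler sketch in plain NumPy. The function name `simulate_lif_euler`, the time step `dt`, and the default values (chosen to match the Brian2 example in the next section) are illustrative assumptions, not part of Brian2.

```python
import numpy as np

def simulate_lif_euler(i_input=200e-12, t_total=0.1, dt=1e-4,
                       tau_m=10e-3, v_rest=-65e-3, v_reset=-65e-3,
                       v_th=-50e-3, r_m=100e6, t_ref=5e-3):
    """Forward-Euler integration of a single LIF neuron (SI units throughout)."""
    n_steps = int(round(t_total / dt))
    ref_steps = int(round(t_ref / dt))   # refractory period in time steps
    v = v_rest
    ref_left = 0                         # remaining refractory steps
    v_trace = np.empty(n_steps)
    spike_times = []
    for k in range(n_steps):
        if ref_left > 0:
            ref_left -= 1                # hold V at its reset value while refractory
        else:
            # tau_m * dV/dt = -(V - V_rest) + R_m * I
            v += dt * (-(v - v_rest) + r_m * i_input) / tau_m
            if v >= v_th:                # threshold reached: spike and reset
                spike_times.append(k * dt)
                v = v_reset
                ref_left = ref_steps
        v_trace[k] = v
    return v_trace, spike_times

v_trace, spike_times = simulate_lif_euler()
print(f"{len(spike_times)} spikes at (ms): {np.round(np.array(spike_times) * 1e3, 1)}")
```

With the default 200 pA the steady-state potential `V_rest + R_m * I` is -45 mV, which lies above the threshold, so the neuron fires repeatedly. At 100 pA it settles at -55 mV and stays silent.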

### 1.4. Simulating a Single LIF Neuron (Brian2)

Let's simulate one LIF neuron receiving a constant input current.

In [None]:
b2.start_scope() # Ensure a clean Brian2 environment

# Define LIF parameters
tau_m = 10 * b2.ms       # Membrane time constant
V_rest = -65 * b2.mV     # Resting potential
V_reset = -65 * b2.mV    # Reset potential
V_th = -50 * b2.mV       # Firing threshold
R_m = 100 * b2.Mohm      # Membrane resistance (using explicit Rm*I)
refractory_period_base = 5 * b2.ms # Base refractory period

# Define the LIF neuron equations
# Note: Brian2 uses string-based equations
lif_eqs = '''
dv/dt = (-(v - V_rest) + R_m * I) / tau_m : volt (unless refractory)
I : amp # Input current - defined externally
'''

# Create a NeuronGroup (1 neuron)
# threshold='v > V_th': condition for firing
# reset='v = V_reset': action taken after firing
# refractory=refractory_period_base: period after firing where neuron cannot fire again
# method='exact': numerical integration method (exact for linear LIF)
G = b2.NeuronGroup(1, lif_eqs, threshold='v > V_th', reset='v = V_reset',
                   refractory=refractory_period_base, method='exact')

# Initialize membrane potential
G.v = V_rest

# Provide a constant input current
input_current = 200 * b2.pA # Picoamperes
G.I = input_current

# Set up Monitors to record data
# SpikeMonitor records spike times and indices of firing neurons
spike_monitor = b2.SpikeMonitor(G)
# StateMonitor records the evolution of variables (like 'v') over time
state_monitor = b2.StateMonitor(G, 'v', record=0) # record=0 means record for neuron index 0

# --- Run the Simulation ---
simulation_duration = 100 * b2.ms
b2.run(simulation_duration)

# --- Plot the Results ---
plt.figure(figsize=(12, 6))

# Plot Membrane Potential
plt.subplot(2, 1, 1)
plt.plot(state_monitor.t / b2.ms, state_monitor.v[0] / b2.mV, label='Membrane Potential')
plt.axhline(V_th / b2.mV, color='red', linestyle='--', label='Threshold V_th')
plt.axhline(V_rest / b2.mV, color='gray', linestyle=':', label='Resting V_rest')
plt.xlabel('Time (ms)')
plt.ylabel('Potential (mV)')
plt.title(f'LIF Neuron Response to Constant Current ({input_current})')
plt.legend()
plt.grid(True)

# Plot Spikes (Raster Plot)
plt.subplot(2, 1, 2)
if spike_monitor.num_spikes > 0:
    plt.plot(spike_monitor.t / b2.ms, spike_monitor.i, '.k', label='Spikes') # '.k' means black dots
else:
    plt.plot([], [], '.k', label='Spikes') # Plot empty if no spikes
plt.xlabel('Time (ms)')
plt.ylabel('Neuron Index')
plt.yticks([]) # Only one neuron, so hide y-axis ticks
plt.title('Spike Output')
plt.grid(True)
plt.ylim(-0.5, 0.5) # Adjust y-limits for single neuron visibility
plt.xlim(0, simulation_duration / b2.ms)

plt.tight_layout()
plt.show()

print(f"Number of spikes: {spike_monitor.num_spikes}")
if spike_monitor.num_spikes > 0:
    print(f"Spike times: {spike_monitor.t / b2.ms} ms")
else:
    print("No spikes occurred.")
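
As a cross-check on the simulation, the constant-current LIF equation can be solved in closed form. Starting from `V_reset = V_rest`, the potential relaxes exponentially toward `V_rest + R_m * I`, so the time to reach threshold is `tau_m * ln(R_m*I / (R_m*I - (V_th - V_rest)))` whenever `R_m * I > V_th - V_rest`. The sketch below evaluates this; the helper name `lif_rate_hz` and its unit convention (pA, MOhm, mV, ms) are assumptions made for illustration, and it only holds when the reset and resting potentials coincide, as in the cell above.

```python
import math

def lif_rate_hz(i_pa, tau_m_ms=10.0, r_m_mohm=100.0,
                v_rest_mv=-65.0, v_th_mv=-50.0, t_ref_ms=5.0):
    """Analytic firing rate of an LIF neuron with V_reset == V_rest."""
    drive_mv = r_m_mohm * i_pa / 1000.0   # R_m * I in mV (MOhm * pA = 1e-3 mV)
    gap_mv = v_th_mv - v_rest_mv          # distance from rest to threshold
    if drive_mv <= gap_mv:                # potential never exceeds threshold
        return 0.0
    t_charge_ms = tau_m_ms * math.log(drive_mv / (drive_mv - gap_mv))
    return 1000.0 / (t_ref_ms + t_charge_ms)

for i_pa in (100, 150, 200, 300):
    print(f"I = {i_pa} pA -> {lif_rate_hz(i_pa):.1f} Hz")
```

For 200 pA this gives roughly 53 Hz. In the ideal model, 150 pA sits exactly at the rheobase: the potential approaches the threshold asymptotically and never crosses it.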

### 1.5. Exercise 1: Explore Neuron Behavior (20 mins)

1.  **Modify the Input Current:** Rerun the simulation above with different values for `input_current` (e.g., `100 * b2.pA`, `150 * b2.pA`, `300 * b2.pA`). How does the firing rate (number of spikes per second) change?
2.  **Change the Time Constant:** Reset the current to `200 * b2.pA`. Now, change `tau_m` (e.g., to `5 * b2.ms` or `20 * b2.ms`). How does this affect how quickly the neuron reaches threshold and the resulting firing pattern?
3.  **Introduce Refractoriness:** If you set `refractory=0*b2.ms`, what happens? Compare it to a longer refractory period like `10*b2.ms`.

In [None]:
# --- Exercise 1 Code Space ---

# Remember to clear the default Brian2 network if running multiple times:
b2.start_scope()

# --- Your Code Here ---

# 1. Define Neuron Parameters

# 2. Define Neuron Equations (usually same as before for LIF)

# 3. Create the NeuronGroup

# 4. Initialize Neuron State

# 5. Set up Monitors

# 6. Run the Simulation

# 7. Plot the Results (copy or adapt the plotting code from section 1.4)

# 8. Print relevant information (spike count, firing rate)

print("\nExercise complete. Reflect on the changes observed.")

### 1.6. Synapses and Basic Plasticity (Brief Overview)

*   **Synapses:** Connections between neurons. When a presynaptic neuron spikes, it typically causes a change (increase or decrease) in the postsynaptic neuron's membrane potential after some delay. This change is often modeled as a brief pulse of current or a change in conductance.
*   **Synaptic Weight (`w`):** Represents the strength of a connection. A positive weight is excitatory (increases V_post), a negative weight is inhibitory (decreases V_post).
*   **Synaptic Plasticity:** The ability of synaptic weights to change over time based on neural activity. This is the basis of learning and memory in the brain.
    *   **Spike-Timing-Dependent Plasticity (STDP):** A common biologically observed rule where the change in synaptic weight depends on the *relative timing* of pre- and postsynaptic spikes.
        *   If pre spikes just *before* post -> Strengthen synapse (Potentiation, LTP).
        *   If pre spikes just *after* post -> Weaken synapse (Depression, LTD).

(We won't implement STDP in detail now, but it's a key concept in neuromorphic learning).
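
Although STDP is not used in the rest of this workshop, the timing rule above is often written as a pair of exponential windows. The sketch below assumes that common pair-based form. The amplitudes `a_plus`, `a_minus` and the time constants `tau_plus_ms`, `tau_minus_ms` are illustrative values, not parameters taken from this notebook.

```python
import numpy as np

def stdp_dw(delta_t_ms, a_plus=0.01, a_minus=0.012,
            tau_plus_ms=20.0, tau_minus_ms=20.0):
    """Weight change for one pre/post spike pair, delta_t = t_post - t_pre (ms)."""
    delta_t_ms = np.asarray(delta_t_ms, dtype=float)
    # Pre before post (delta_t > 0): potentiation, shrinking with the time gap
    ltp = a_plus * np.exp(-delta_t_ms / tau_plus_ms)
    # Pre after post (delta_t < 0): depression, shrinking with the time gap
    ltd = -a_minus * np.exp(delta_t_ms / tau_minus_ms)
    return np.where(delta_t_ms > 0, ltp, np.where(delta_t_ms < 0, ltd, 0.0))

for dt_ms in (-40, -5, 5, 40):
    print(f"t_post - t_pre = {dt_ms:+d} ms -> dw = {float(stdp_dw(dt_ms)):+.4f}")
```

Spike pairs that are close in time produce the largest weight changes, and the sign depends only on which neuron fired first.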

## Module 2: Building Simple Spiking Networks (≈ 1.5 hours)

Now let's connect multiple neurons and see how they interact.

### 2.1. Connecting Neurons: Synapses in Brian2

Brian2's `Synapses` object connects `NeuronGroup`s. We need to define:
*   The presynaptic group (`source`).
*   The postsynaptic group (`target`).
*   The *model* of the synapse (what happens on a spike).
*   The connection pattern (`connect()`).

### 2.2. Encoding Information: From Data to Spikes

Since SNNs process spikes, we need ways to convert real-world data into spike trains:

*   **Rate Coding:** The *frequency* of spikes represents the intensity of a stimulus. Higher intensity = higher firing rate. Simple but potentially slow.
*   **Temporal Coding:** The precise *timing* of individual spikes carries information. Potentially much faster and more efficient.
*   **Population Coding:** Information is encoded in the *pattern* of activity across a group of neurons.

For simplicity, we'll often use **Poisson Spike Trains:** Spikes occur randomly with a specific average rate. A `PoissonGroup` in Brian2 generates these easily.
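
To build intuition for rate-coded Poisson input, a spike train can be approximated in plain NumPy by drawing one independent Bernoulli sample per neuron per time step with probability `rate * dt`. The function `poisson_spike_trains`, the `max_rate_hz` scaling, and the example intensities are assumptions for illustration. This is a sketch of the idea, not a description of how `PoissonGroup` is implemented.

```python
import numpy as np

def poisson_spike_trains(rates_hz, duration_s=1.0, dt_s=1e-4, seed=0):
    """Return a boolean array (n_steps, n_neurons); True marks a spike."""
    rng = np.random.default_rng(seed)
    rates_hz = np.asarray(rates_hz, dtype=float)
    n_steps = int(round(duration_s / dt_s))
    p_spike = rates_hz * dt_s            # reasonable while rate * dt << 1
    return rng.random((n_steps, rates_hz.size)) < p_spike

# Rate coding: map stimulus intensities in [0, 1] to firing rates
intensities = np.array([0.1, 0.5, 1.0])
max_rate_hz = 100.0                      # assumed upper bound on the firing rate
spikes = poisson_spike_trains(intensities * max_rate_hz, duration_s=2.0)
measured_hz = spikes.sum(axis=0) / 2.0   # spike count divided by duration
print("target rates (Hz):  ", intensities * max_rate_hz)
print("measured rates (Hz):", measured_hz)
```

The measured rates fluctuate around their targets, which is why rate codes need a sufficiently long observation window to be read out reliably.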

### 2.3. Example: A Simple Feedforward Network

Let's build a network: Input Layer -> Output Layer.
*   Input Layer: Generates Poisson spikes.
*   Output Layer: LIF neurons receiving spikes from the input layer.

In [None]:
b2.start_scope() # Clear previous Brian2 objects

# --- Parameters ---
num_inputs = 50
num_outputs = 10
input_rate = 20 * b2.Hz  # Average firing rate for input neurons
simulation_duration = 200 * b2.ms

# Output LIF neuron parameters (same as before)
tau_m = 10 * b2.ms
V_rest = -65 * b2.mV
V_reset = -65 * b2.mV
V_th = -50 * b2.mV
lif_eqs = '''
dv/dt = -(v - V_rest) / tau_m : volt (unless refractory)
''' # Removed Rm*I, current comes from synapses

# --- Network Components ---
# Input Layer: Poisson neurons
input_group = b2.PoissonGroup(num_inputs, rates=input_rate)

# Output Layer: LIF neurons
output_group = b2.NeuronGroup(num_outputs, lif_eqs, threshold='v > V_th',
                              reset='v = V_reset', refractory=5*b2.ms, method='exact')
output_group.v = V_rest # Initialize potential

# --- Synapses: Connecting Input to Output ---
# Define synaptic model: On a presynaptic spike, increase postsynaptic 'v' by 'w'
# 'w' represents the synaptic weight
synapse_model = 'w : volt' # Define 'w' as a synaptic variable (units of voltage change)
on_pre_eq = 'v_post += w' # Action executed when a presynaptic spike arrives

# Create synapses
synapses = b2.Synapses(input_group, output_group, model=synapse_model, on_pre=on_pre_eq)

# Connect all input neurons to all output neurons (full connectivity)
synapses.connect()

# Set synaptic weights (e.g., excitatory, randomly distributed)
# Let's make the weights cause a small depolarization
min_weight_val = 0.5 # value in mV
max_weight_val = 1.5 # value in mV
synapses.w = np.random.uniform(min_weight_val, max_weight_val, size=len(synapses)) * b2.mV

# --- Monitors ---
input_spike_mon = b2.SpikeMonitor(input_group, name='InputSpikes')
output_spike_mon = b2.SpikeMonitor(output_group, name='OutputSpikes')
output_state_mon = b2.StateMonitor(output_group, 'v', record=range(num_outputs), name='OutputState') # Record potential for all output neurons

# --- Run Simulation ---
print(f"Running simulation for {simulation_duration}...")
b2.run(simulation_duration)
print("Simulation complete.")

# --- Visualization: Raster Plots ---
plt.figure(figsize=(12, 8))

# Input Spikes
plt.subplot(2, 1, 1)
if input_spike_mon.num_spikes > 0:
    plt.plot(input_spike_mon.t / b2.ms, input_spike_mon.i, '.k', markersize=2)
plt.xlabel('Time (ms)')
plt.ylabel('Input Neuron Index')
plt.title(f'Input Layer Spikes ({num_inputs} Poisson Neurons @ {input_rate})')
plt.grid(True, alpha=0.3)
plt.xlim(0, simulation_duration / b2.ms)
plt.ylim(-1, num_inputs)

# Output Spikes
plt.subplot(2, 1, 2)
if output_spike_mon.num_spikes > 0:
    plt.plot(output_spike_mon.t / b2.ms, output_spike_mon.i, '.r', markersize=4)
plt.xlabel('Time (ms)')
plt.ylabel('Output Neuron Index')
plt.title(f'Output Layer Spikes ({num_outputs} LIF Neurons)')
plt.grid(True, alpha=0.3)
plt.xlim(0, simulation_duration / b2.ms)
plt.ylim(-1, num_outputs)

plt.tight_layout()
plt.show()

# --- Visualization: Membrane Potential of Output Neurons ---
plt.figure(figsize=(12, 6))
for i in range(num_outputs): # Plot potential for all output neurons
    plt.plot(output_state_mon.t / b2.ms, output_state_mon.v[i] / b2.mV, label=f'Neuron {i}')

plt.axhline(V_th / b2.mV, color='red', linestyle='--', label='Threshold V_th')
plt.xlabel('Time (ms)')
plt.ylabel('Potential (mV)')
plt.title('Membrane Potential of Output Neurons')
plt.legend(loc='upper right', fontsize='small', ncol=2) # Show the neuron and threshold labels
plt.grid(True)
plt.show()

print(f"Total input spikes: {input_spike_mon.num_spikes}")
print(f"Total output spikes: {output_spike_mon.num_spikes}")

### 2.4. Exercise 2: Network Dynamics (25 mins)

1.  **Synaptic Strength:** What happens if you significantly increase the average synaptic weight `w` (e.g., make `min_weight_val = 2.0`, `max_weight_val = 3.0`)? What if you make the weights inhibitory (negative)? Modify the code above to test this.
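2.  **Connectivity:** Instead of `synapses.connect()`, try connecting neurons sparsely. For example, connect each input neuron to only 5 random output neurons using Brian2's generator syntax, `synapses.connect(j='k for k in sample(N_post, size=5)')` (here `sample` is Brian2's own keyword inside the string, not Python's `random.sample`, and the `size` argument requires a recent Brian2 release), or use a connection probability, `synapses.connect(p=0.1)`. How does this affect the output activity?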
3.  **Input Rate:** Change `input_rate`. How does the output firing rate respond? Is the relationship linear?

In [None]:
# --- Exercise 2 Code Space ---
b2.start_scope()

# --- Your Code Here ---

# 1. Define Network Parameters

# 2. Define Neuron Parameters (e.g., LIF, usually same as before)

# 3. Create Neuron Groups

# 4. Create Synapses Object

# 5. Define Connectivity

# 6. Set Synaptic Weights

# 7. Set up Monitors

# 8. Run the Simulation

# 9. Plot the Results (copy or adapt plotting code from section 2.3)

# 10. Print relevant info

print("\nExercise complete. Reflect on how network structure and parameters influence activity.")

### 2.5. Neuromorphic Hardware (Brief Mention)

*   Specialized hardware designed to run SNNs efficiently.
*   Examples: Intel Loihi/Loihi 2, SpiNNaker (University of Manchester), TrueNorth (IBM - older), BrainScaleS (Heidelberg University), Akida (BrainChip), DYNAP-SE (SynSense, formerly aiCTX).
*   Often feature asynchronous, event-driven processing and low power consumption.
*   Simulators like Brian2 are essential for developing algorithms before deploying to hardware.

## Module 3: Reinforcement Learning Fundamentals (≈ 1.5 hours)

Shifting gears to learning from interaction.

### 3.1. What is Reinforcement Learning?

*   RL is a type of machine learning where an **agent** learns to make decisions by interacting with an **environment**.
*   The agent performs **actions**, receives **observations** (about the state of the environment), and gets **rewards** (or penalties).
*   The goal of the agent is to learn a **policy** (a strategy for choosing actions) that maximizes its cumulative reward over time.

**The Agent-Environment Loop:**

1.  Agent observes the current **state** (`s_t`).
2.  Agent chooses an **action** (`a_t`) based on its policy.
3.  Environment transitions to a new **state** (`s_{t+1}`) based on (`s_t`, `a_t`).
4.  Environment provides a **reward** (`r_{t+1}`) to the agent.
5.  Repeat.

*(Diagram: A simple loop showing Agent -> Action -> Environment -> State/Reward -> Agent)*

```mermaid
graph LR
    A[Agent] -- Action (a_t) --> E[Environment];
    E -- State (s_{t+1}), Reward (r_{t+1}) --> A;
```
(Requires mermaid rendering support in your Jupyter environment)
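
The loop above can also be sketched in a few lines of Python. The `CorridorEnv` class and the `run_episode` helper are hypothetical names made up for this illustration: a 1-D corridor in which the agent starts at cell 0 and receives a reward of +1 on reaching the last cell.

```python
import random

class CorridorEnv:
    """Hypothetical 1-D corridor: start at cell 0, reward +1 at the last cell."""
    def __init__(self, length=5):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # action: 0 = left, 1 = right; walls clamp the position
        move = 1 if action == 1 else -1
        self.pos = min(max(self.pos + move, 0), self.length - 1)
        done = self.pos == self.length - 1
        reward = 1.0 if done else 0.0
        return self.pos, reward, done

def run_episode(env, policy, max_steps=200):
    state = env.reset()                          # 1. observe the initial state
    total_reward, steps = 0.0, 0
    for _ in range(max_steps):
        action = policy(state)                   # 2. choose an action
        state, reward, done = env.step(action)   # 3-4. new state and reward
        total_reward += reward
        steps += 1
        if done:                                 # 5. repeat until the episode ends
            break
    return total_reward, steps

random.seed(0)
print("random policy:", run_episode(CorridorEnv(), lambda s: random.choice([0, 1])))
print("always right: ", run_episode(CorridorEnv(), lambda s: 1))
```

The policy here is just a function from state to action. Learning, covered next, amounts to improving that function from the rewards collected in the loop.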

### 3.2. Key Concepts

*   **Agent:** The learner and decision-maker.
*   **Environment:** Everything outside the agent that it interacts with.
*   **State (`s`):** A representation of the environment's current situation.
*   **Action (`a`):** A choice the agent can make.
*   **Reward (`r`):** A scalar feedback signal indicating how good the last action was in that state.
*   **Policy (`π(a|s)`):** The agent's strategy; defines the probability of taking action `a` in state `s`.
*   **Value Function:** Predicts the expected cumulative (discounted) future reward.
    *   **State-Value Function (`V(s)`):** Expected cumulative reward starting from state `s` and following policy `π`.
    *   **Action-Value Function (`Q(s, a)`):** Expected cumulative reward starting from state `s`, taking action `a`, and then following policy `π`. This is often more useful for choosing actions.
*   **Discount Factor (`γ`, gamma):** A value between 0 and 1 that determines the importance of future rewards. Rewards received sooner are often valued more than rewards received later. `γ=0` means only immediate reward matters, `γ≈1` means future rewards are highly valued.
*   **Markov Decision Process (MDP):** The mathematical framework for RL problems assuming the "Markov property" (the current state fully captures all necessary information from the past). Defined by (S, A, P, R, γ): States, Actions, Transition Probabilities `P(s'|s, a)`, Reward Function `R(s, a, s')`, Discount Factor `γ`.
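
To see how the discount factor changes what the agent values, the sketch below computes the discounted return of one short reward sequence for several values of `γ`. The function name `discounted_return` and the example rewards (two small step penalties followed by a goal reward) are illustrative assumptions.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * r_k over a reward sequence, computed back to front."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [-0.1, -0.1, 10.0]   # two step penalties, then a goal reward
for gamma in (0.0, 0.9, 1.0):
    print(f"gamma = {gamma}: return = {discounted_return(rewards, gamma):.2f}")
```

With `γ = 0` only the first step penalty counts, while larger values let the later goal reward dominate the return.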

### 3.3. Q-Learning: A Simple RL Algorithm

*   Q-Learning is a **model-free**, **off-policy** RL algorithm.
    *   *Model-free:* It doesn't need to know the environment's transition probabilities (`P`) or reward function (`R`). It learns directly from experience.
    *   *Off-policy:* It learns the optimal Q-values independently of the policy used to collect experience (e.g., epsilon-greedy), provided that policy keeps visiting all state-action pairs.
*   **Goal:** Learn the optimal action-value function `Q*(s, a)`.
*   **Q-Table:** In simple problems with discrete states and actions, we can store the Q-values in a table (e.g., a dictionary or array) where rows are states and columns are actions.

**The Q-Learning Update Rule:**

When the agent takes action `a_t` in state `s_t`, observes reward `r_{t+1}` and next state `s_{t+1}`, it updates the Q-value using:

$$ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right] $$

Where:
*   `α` (alpha): Learning rate (0 < α ≤ 1), controls how much new information overrides old information.
*   `γ` (gamma): Discount factor (0 ≤ γ ≤ 1).
*   `max_{a'} Q(s_{t+1}, a')`: The maximum Q-value for the *next* state `s_{t+1}` over all possible next actions `a'`. This represents the agent's current estimate of the best possible future value from `s_{t+1}`.
*   The term in the square brackets `[...]` is the **Temporal Difference (TD) error**: the difference between the estimated return (`r + γ * max Q`) and the current Q-value.
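
A single update can be worked through numerically. The helper `q_learning_update` below is a hypothetical standalone function, and the numbers are made up for illustration. It also shows the usual convention that a terminal transition uses the reward alone as its target.

```python
def q_learning_update(q_sa, reward, next_max_q, alpha, gamma, done=False):
    """Return the updated Q(s, a) after one observed transition."""
    # TD target: reward plus discounted best next value (no bootstrap if terminal)
    td_target = reward if done else reward + gamma * next_max_q
    td_error = td_target - q_sa
    return q_sa + alpha * td_error

new_q = q_learning_update(q_sa=0.0, reward=-0.1, next_max_q=10.0,
                          alpha=0.1, gamma=0.99)
print(f"updated Q(s, a) = {new_q:.3f}")
```

Here the TD target is `-0.1 + 0.99 * 10 = 9.8`, the TD error is 9.8, and with `α = 0.1` the estimate moves one tenth of the way, from 0 to 0.98.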

**Exploration vs. Exploitation:**
*   To find the optimal policy, the agent needs to explore different actions.
*   But it also needs to exploit its current knowledge to get rewards.
*   **Epsilon-Greedy (`ε`-greedy):** A common strategy:
    *   With probability `ε` (epsilon), choose a random action (explore).
    *   With probability `1-ε`, choose the action with the highest Q-value for the current state (exploit).
    *   `ε` often starts high (e.g., 1.0) and decays over time.
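
When choosing a decay schedule, it helps to know how long exploration lasts. Assuming `ε` is multiplied by a fixed factor after every episode and clipped at a minimum, the sketch below computes how many episodes pass before the minimum is reached. The helper name `episodes_until_min` and the example values are illustrative.

```python
import math

def episodes_until_min(epsilon_start, epsilon_decay, epsilon_min):
    """Smallest n with epsilon_start * epsilon_decay**n <= epsilon_min."""
    return math.ceil(math.log(epsilon_min / epsilon_start) / math.log(epsilon_decay))

for decay in (0.9, 0.995, 0.999):
    n = episodes_until_min(1.0, decay, 0.01)
    print(f"decay = {decay}: epsilon reaches 0.01 after about {n} episodes")
```

A factor of 0.995 leaves the agent exploring for roughly 900 episodes, whereas 0.9 ends exploration after a few dozen.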

### 3.4. Implementing Tabular Q-Learning: Simple Grid World Example

Let's create a simple text-based grid world environment and apply Q-Learning.

**Environment:**
*   A 4x4 grid.
*   Agent starts at (0, 0).
*   Goal is at (3, 3) (reward +10).
*   A "hole" is at (1, 1) (reward -10).
*   Moving into a wall keeps the agent in place.
*   Small negative reward (-0.1) for each step to encourage efficiency.
*   Actions: Up, Down, Left, Right.

In [None]:
# Simple Grid World Environment
class GridWorldEnv:
    def __init__(self, size=4):
        self.size = size
        self.agent_pos = (0, 0)
        self.goal_pos = (size - 1, size - 1)
        self.hole_pos = (1, 1)
        # Actions: 0: Up, 1: Down, 2: Left, 3: Right
        self.actions = [0, 1, 2, 3]
        self.action_delta = {
            0: (-1, 0), # Up
            1: (1, 0),  # Down
            2: (0, -1), # Left
            3: (0, 1)   # Right
        }

    def reset(self):
        self.agent_pos = (0, 0)
        return self.get_state()

    def get_state(self):
        # Represent state as a unique integer or tuple
        return self.agent_pos

    def step(self, action):
        if action not in self.actions:
            raise ValueError("Invalid action")

        delta = self.action_delta[action]
        current_r, current_c = self.agent_pos
        next_r, next_c = current_r + delta[0], current_c + delta[1]

        # Check boundaries
        if not (0 <= next_r < self.size and 0 <= next_c < self.size):
            next_r, next_c = current_r, current_c # Stay in place if wall hit

        self.agent_pos = (next_r, next_c)
        next_state = self.get_state()

        # Determine reward and done status
        if self.agent_pos == self.goal_pos:
            reward = 10.0
            done = True
        elif self.agent_pos == self.hole_pos:
            reward = -10.0
            done = True
        else:
            reward = -0.1 # Step penalty
            done = False

        return next_state, reward, done

    def render(self):
        grid = np.full((self.size, self.size), '_', dtype=str)
        grid[self.goal_pos] = 'G'
        grid[self.hole_pos] = 'H'
        grid[self.agent_pos] = 'A'
        print("\n".join(" ".join(row) for row in grid))
        print("-" * (self.size * 2 - 1))

In [None]:
# Q-Learning Agent
class QLearningAgent:
    def __init__(self, env, alpha=0.1, gamma=0.99, epsilon=1.0, epsilon_decay=0.995, epsilon_min=0.01):
        self.env = env
        self.q_table = {} # Using dict: {(state): {action: q_value}}
        self.alpha = alpha
        self.gamma = gamma
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay
        self.epsilon_min = epsilon_min
        # Check if env provides actions, otherwise assume standard [0, 1, 2, 3]
        self.actions = getattr(env, 'actions', [0, 1, 2, 3]) 

    def get_q_value(self, state, action):
        # Return Q-value, default to 0 if state or action not seen
        return self.q_table.get(state, {}).get(action, 0.0)

    def choose_action(self, state):
        if np.random.rand() < self.epsilon:
            # Explore: choose random action
            return np.random.choice(self.actions)
        else:
            # Exploit: choose best action based on Q-values
            q_values = [self.get_q_value(state, a) for a in self.actions]
            # Break ties randomly among all actions sharing the maximal Q-value
            # (for an unseen state every Q-value is 0, so this picks uniformly)
            max_q = np.max(q_values)
            best_actions = [a for a, q in zip(self.actions, q_values) if q == max_q]
            return np.random.choice(best_actions)

    def update_q_table(self, state, action, reward, next_state, done):
        # Q-learning update rule
        old_q = self.get_q_value(state, action)
        
        # If the episode is done, the future reward estimate (from next_state) is 0
        if done:
            td_target = reward 
        else:
            next_max_q = np.max([self.get_q_value(next_state, a) for a in self.actions])
            td_target = reward + self.gamma * next_max_q
            
        td_error = td_target - old_q
        new_q = old_q + self.alpha * td_error

        # Update the table
        if state not in self.q_table:
            self.q_table[state] = {act: 0.0 for act in self.actions}
        self.q_table[state][action] = new_q

    def update_epsilon(self):
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)

In [None]:
# --- Training Loop ---
env = GridWorldEnv(size=4)
agent = QLearningAgent(env, alpha=0.1, gamma=0.99, epsilon=1.0, epsilon_decay=0.995, epsilon_min=0.01)

num_episodes = 5000
max_steps_per_episode = 100
rewards_per_episode = []

start_time = time.time()

for episode in range(num_episodes):
    state = env.reset()
    total_reward = 0
    done = False

    for step in range(max_steps_per_episode):
        action = agent.choose_action(state)
        next_state, reward, done = env.step(action)
        agent.update_q_table(state, action, reward, next_state, done)
        state = next_state
        total_reward += reward
        if done:
            break

    agent.update_epsilon() # Decay epsilon after each episode
    rewards_per_episode.append(total_reward)

    if (episode + 1) % (num_episodes // 10) == 0:
        # Calculate average reward over the last N episodes for smoother reporting
        avg_reward = np.mean(rewards_per_episode[-(num_episodes // 10):])
        print(f"Episode {episode + 1}/{num_episodes} | Avg Reward (last {num_episodes // 10}): {avg_reward:.2f} | Epsilon: {agent.epsilon:.3f}")

end_time = time.time()
print(f"\nTraining finished in {end_time - start_time:.2f} seconds.")

# --- Plot Training Progress ---
plt.figure(figsize=(10, 5))
# Smooth rewards for better visualization (moving average)
window_size = 100
if len(rewards_per_episode) >= window_size:
    smoothed_rewards = np.convolve(rewards_per_episode, np.ones(window_size)/window_size, mode='valid')
    plt.plot(smoothed_rewards)
    plt.xlabel(f'Episode (Moving Average over {window_size} episodes)')
else:
    plt.plot(rewards_per_episode) # Plot raw rewards if not enough episodes for smoothing
    plt.xlabel('Episode')
    
plt.title('Episode Rewards over Time')
plt.ylabel('Total Reward')
plt.grid(True)
plt.show()

# --- Display Learned Policy (Optional Visualization) ---
print("\nLearned Policy (Arrows indicate best action):")
action_arrows = {0: '^', 1: 'v', 2: '<', 3: '>'}
policy_grid = np.full((env.size, env.size), ' ', dtype=str)
policy_grid[env.goal_pos] = 'G'
policy_grid[env.hole_pos] = 'H'

for r in range(env.size):
    for c in range(env.size):
        state = (r, c)
        if state != env.goal_pos and state != env.hole_pos:
            if state in agent.q_table:
                # Choose best action based on learned Q-values (exploitation)
                q_values = [agent.get_q_value(state, a) for a in agent.actions]
                best_action = agent.actions[np.argmax(q_values)]
                policy_grid[r, c] = action_arrows[best_action]
            else:
                policy_grid[r, c] = '.' # State not visited or no clear policy

print("\n".join(" ".join(row) for row in policy_grid))

### 3.5. Exercise 3: Tune Q-Learning Parameters (20 mins)

1.  **Learning Rate (`alpha`):** Rerun the training with a much smaller `alpha` (e.g., 0.01) and a much larger `alpha` (e.g., 0.9). How does this affect the learning speed and the stability of the final rewards/policy?
2.  **Discount Factor (`gamma`):** What happens if `gamma` is very low (e.g., 0.1)? The agent becomes "myopic". What if `gamma` is 1.0? (Note: gamma=1 can sometimes cause issues if rewards can accumulate indefinitely without termination). Try `gamma=0.9`.
3.  **Epsilon Decay (`epsilon_decay`):** Make `epsilon_decay` very close to 1 (e.g., 0.999) so exploration lasts longer. Make it decay faster (e.g., 0.9). How does the shape of the reward curve change?

In [None]:
# --- Exercise 3 Code Space ---

# --- Your Code Here ---

# 1. Define the Hyperparameters to test

# 2. Create the Environment (same as before)

# 3. Create the QLearningAgent with the modified hyperparameters

# 4. Implement the Training Loop (copy or adapt from section 3.4)

# 5. Plot the results (copy or adapt from section 3.4)

# 6. Display the learned policy (copy or adapt from section 3.4)

print("\nExercise complete. Reflect on how hyperparameters influence RL performance.")

## Module 4: Bridging Neuromorphic Computing and RL (≈ 1.5 hours)

Let's bring the two fields together!

### 4.1. Why Combine Neuromorphic Computing and RL?

*   **Energy Efficiency:** Running RL agents (especially complex ones) can be computationally expensive. Neuromorphic hardware offers the potential for significant power savings, particularly for agents interacting with real-time, sparse sensory data (like robots).
*   **Biologically Plausible Learning:** The brain learns through mechanisms that resemble both RL (dopamine signals acting like rewards) and SNNs (spiking activity, synaptic plasticity like STDP). Combining them moves towards more brain-like AI.
*   **Event-Based Processing:** SNNs naturally handle asynchronous, event-based inputs, which is common in robotics and sensory processing tasks where RL is applied.
*   **Temporal Dynamics:** SNNs inherently process information over time, which could be advantageous for RL tasks requiring memory or sensitivity to timing.

### 4.2. Challenges

*   **Credit Assignment:** How do you determine which specific synapse or neuron contributed to a reward that might arrive much later? Traditional ANNs handle this largely through backpropagation, but spikes are discrete, non-differentiable events with complex temporal dynamics, so gradients cannot be applied directly in SNNs.
*   **Learning Rules:** Developing effective, stable, and efficient learning rules for SNNs in an RL context is an active area of research (e.g., Reward-modulated STDP, approximations of backpropagation for SNNs).
*   **Simulation Speed:** Simulating large SNNs on conventional hardware can be slower than training comparable ANNs on GPUs (though neuromorphic hardware aims to overcome this).
*   **Encoding/Decoding:** Converting states and actions between the continuous/discrete world of RL environments and the spiking world of SNNs requires careful design.
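
To make the encoding/decoding challenge concrete, here is a minimal NumPy-only sketch of rate coding: a scalar is mapped to a firing rate, turned into a spike train, and recovered by counting spikes. The function names, the 100 Hz ceiling, and the 1 ms time step are illustrative assumptions, not part of the workshop code.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_rate(x, x_min, x_max, max_rate_hz=100.0):
    """Map a scalar in [x_min, x_max] linearly onto a firing rate in Hz."""
    frac = np.clip((x - x_min) / (x_max - x_min), 0.0, 1.0)
    return frac * max_rate_hz

def poisson_spike_train(rate_hz, duration_s, dt_s=0.001):
    """Bernoulli approximation of a Poisson train: one draw per time step."""
    n_steps = int(round(duration_s / dt_s))
    return rng.random(n_steps) < rate_hz * dt_s

def decode_rate(spikes, duration_s):
    """Recover a rate estimate by counting spikes over the window."""
    return spikes.sum() / duration_s

rate = encode_rate(0.25, x_min=0.0, x_max=1.0)
spikes = poisson_spike_train(rate, duration_s=2.0)
print(f"target rate: {rate:.1f} Hz, decoded: {decode_rate(spikes, 2.0):.1f} Hz")
```

The decoded value is only an estimate: shorter windows mean fewer spikes and noisier rates, which is one reason the choice of observation window matters when SNN outputs feed an RL agent.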

### 4.3. Approach: SNN as a Function Approximator for RL

A common approach, and a natural first step toward combining the two fields, is to use the SNN to approximate one component of the RL algorithm, much as Deep RL uses ANNs:

1.  **SNN as Policy Network:** The SNN receives the state (encoded as spikes) and its output firing rates determine the probability of taking each action. Learning adjusts synaptic weights to favor actions leading to higher rewards.
2.  **SNN as Value Network:** The SNN receives the state (and potentially action) encoded as spikes, and its output firing rate (or some other measure) represents the estimated Q-value or V-value. Learning adjusts weights to make these estimates more accurate.
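
As an illustration of the policy-network idea, the sketch below turns a vector of output firing rates into action probabilities with a softmax. The softmax readout and the `temperature_hz` parameter are assumptions chosen for illustration; other readouts (for example, picking the neuron that fires most) are equally plausible.

```python
import numpy as np

def action_probabilities(output_rates_hz, temperature_hz=10.0):
    """Softmax over firing rates: higher-rate neurons get higher probability."""
    rates = np.asarray(output_rates_hz, dtype=float)
    logits = (rates - rates.max()) / temperature_hz  # subtract max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

rng = np.random.default_rng(1)
probs = action_probabilities([40.0, 20.0])  # one output neuron per action
action = rng.choice(len(probs), p=probs)    # sample an action from the policy
print(f"action probabilities: {probs}, sampled action: {action}")
```

A lower temperature makes the policy greedier, while a higher one keeps it closer to uniform, which plays a similar role to epsilon in epsilon-greedy exploration.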

**Simplification for this Workshop:**
Implementing complex SNN learning rules (like reward-modulated STDP) from scratch is time-consuming. We will take a *simpler, hybrid approach*:

*   Use an SNN to *process* the input (like a feature extractor).
*   Use the *output firing rates* of the SNN as the *state representation* for our existing Tabular Q-Learner.

This demonstrates the integration concept without delving into complex SNN-specific RL algorithms immediately. The SNN adds a temporal processing layer before the standard RL decision-making.

### 4.4. Project: SNN-Enhanced Agent for a Pattern Recognition Task

**Goal:** Train an agent to distinguish between two different input *spike patterns* using RL, where the SNN processes the patterns.

**Environment:**
*   Generates one of two predefined spike patterns as input.
*   Agent must output Action 0 if Pattern A is detected, Action 1 if Pattern B is detected.
*   Reward: +1 for correct classification, -1 for incorrect.

**SNN Structure:**
*   Input Layer: `PoissonGroup` neurons, whose *rates* are modulated according to the pattern being presented.
*   Output Layer: A small number of LIF neurons receiving input from the Input Layer.

**RL Integration:**
*   **State:** The *average firing rates* of the Output Layer neurons over the pattern presentation window (`pattern_duration`, 100 ms here). This rate vector becomes the `state` for the Q-learner (after discretization).
*   **Action:** Chosen by the Q-learner based on the SNN-derived state (Action 0 or 1).
*   **Learning:** The Q-learner updates its table based on the (SNN state, chosen action, reward). The SNN weights themselves remain *fixed* in this simplified example.
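
The integration above can be sketched in a few lines without Brian2: bin a rate vector into a hashable tuple, then apply a one-step Q-update. The bin edges match the ones used in the environment below; the dict-of-dicts Q-table layout and the helper `update_terminal` are assumptions for illustration and may differ from your `QLearningAgent`.

```python
import numpy as np

bins = [-np.inf, 10, 30, np.inf]  # <10 Hz -> 1, 10-30 Hz -> 2, >=30 Hz -> 3

def discretize(rates_hz):
    """Turn a vector of firing rates into a tuple usable as a Q-table key."""
    return tuple(int(b) for b in np.digitize(rates_hz, bins))

def update_terminal(q_table, state, action, reward, alpha):
    """One-step episode: the target is just the reward (no bootstrapping)."""
    q = q_table.setdefault(state, {0: 0.0, 1: 0.0})
    q[action] += alpha * (reward - q[action])

state = discretize([0.0, 10.0, 20.0, 40.0])
q_table = {}
update_terminal(q_table, state, action=0, reward=1.0, alpha=0.1)
print(state, q_table[state])
```

Because every episode ends after a single action, the discounted future term drops out of the update, which is why `gamma` has little influence in this project.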

In [None]:
# --- SNN Setup ---
b2.start_scope()

# SNN Parameters
num_inputs = 20
num_outputs = 4 # SNN output neurons (can be tuned in Exercise 4)
tau_m = 10 * b2.ms
V_rest = -65 * b2.mV
V_reset = -65 * b2.mV
V_th = -50 * b2.mV
lif_eqs = 'dv/dt = -(v - V_rest) / tau_m : volt (unless refractory)'

# Input Group (rates will be set dynamically)
input_group = b2.PoissonGroup(num_inputs, rates=0*b2.Hz, name='input_layer')

# Output Group
output_group = b2.NeuronGroup(num_outputs, lif_eqs, threshold='v > V_th',
                              reset='v = V_reset', refractory=5*b2.ms, method='exact',
                              name='output_layer')
output_group.v = V_rest

# Synapses (fixed weights for this example)
synapses = b2.Synapses(input_group, output_group, 'w : volt', on_pre='v_post += w',
                         name='synapses')
connection_probability = 0.5 # Can be tuned in Exercise 4
synapses.connect(p=connection_probability) 
min_weight_snn = 0.5 # mV, Can be tuned in Exercise 4
max_weight_snn = 2.0 # mV, Can be tuned in Exercise 4
if len(synapses) > 0:
    synapses.w = np.random.uniform(min_weight_snn, max_weight_snn, size=len(synapses)) * b2.mV 

# Monitors
output_spike_mon = b2.SpikeMonitor(output_group, name='output_spikes')
# We need spikes to calculate average rate for the state

# Collect all Brian2 objects created in this cell into an explicit Network object.
# Because we manage our own Network, always call snn_network.run(...) rather than b2.run(...).
snn_network = b2.Network(b2.collect())
snn_network.store('initial_snn') # Snapshot of the initial state (state variables, time, empty monitors)
print("SNN components defined.")
print(f"Input: {input_group.N}, Output: {output_group.N}, Synapses: {len(synapses)}")
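
Before training, a quick back-of-the-envelope check can indicate whether the output neurons are likely to spike at all. For a leaky integrator with instantaneous synaptic jumps, the mean depolarization above rest is approximately `tau_m * sum(rate_i * w_i)`. The sketch below applies that estimate to the default parameters, using expected values for the random connectivity and weights. It ignores fluctuations and refractoriness, so treat it as a rough guide rather than a prediction; the helper name `mean_depolarization_mV` is hypothetical.

```python
import numpy as np

def mean_depolarization_mV(rates_hz, weights_mV, tau_m_s):
    """Mean-field estimate of steady-state depolarization above rest, in mV."""
    return tau_m_s * float(np.dot(rates_hz, weights_mV))

# Values copied from the SNN setup and the pattern definition (Pattern A)
num_inputs = 20
connection_probability = 0.5
mean_weight_mV = (0.5 + 2.0) / 2        # mean of the uniform weight range
tau_m_s = 0.010                          # 10 ms
gap_mV = -50.0 - (-65.0)                 # V_th - V_rest

rates_hz = np.where(np.arange(num_inputs) < num_inputs // 2, 50.0, 10.0)
# Expected weight per input: connection probability times mean weight
expected_weights_mV = np.full(num_inputs, connection_probability * mean_weight_mV)

drive_mV = mean_depolarization_mV(rates_hz, expected_weights_mV, tau_m_s)
print(f"Estimated mean depolarization: {drive_mV:.2f} mV (threshold gap: {gap_mV:.1f} mV)")
```

If the estimate sits well below the threshold gap, output spikes may be rare and most discretized states may fall into the lowest bin, which would make the two patterns hard to tell apart. The weight range and connection probability explored in Exercise 4 are the natural knobs to adjust in that case.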

In [None]:
# --- Define Spike Patterns ---
pattern_duration = 100 * b2.ms
base_rate_val = 10 # Hz value (can be tuned in Exercise 4)
high_rate_val = 50 # Hz value (can be tuned in Exercise 4)

# Pattern A: First half of inputs fire at high rate
pattern_A_rates = np.zeros(num_inputs) * b2.Hz
pattern_A_rates[:num_inputs // 2] = high_rate_val * b2.Hz
pattern_A_rates[num_inputs // 2:] = base_rate_val * b2.Hz

# Pattern B: Second half of inputs fire at high rate
pattern_B_rates = np.zeros(num_inputs) * b2.Hz
pattern_B_rates[:num_inputs // 2] = base_rate_val * b2.Hz
pattern_B_rates[num_inputs // 2:] = high_rate_val * b2.Hz

patterns = {'A': pattern_A_rates, 'B': pattern_B_rates}
pattern_labels = {'A': 0, 'B': 1} # Target actions for RL

print("Spike patterns defined.")
print(f"Pattern A rates (first 5): {pattern_A_rates[:5]}")
print(f"Pattern B rates (last 5): {pattern_B_rates[-5:]}")

In [None]:
# --- RL Environment using the SNN ---
class SNNPatternEnv:
    def __init__(self, snn_network, patterns, pattern_labels, duration):
        self.snn_network = snn_network
        self.patterns = patterns
        self.pattern_labels = pattern_labels
        self.duration = duration
        self.current_pattern_name = None
        # RL specifics
        self.actions = [0, 1] # Action 0 for Pattern A, Action 1 for Pattern B

    def reset(self):
        # Choose a random pattern to present
        self.current_pattern_name = np.random.choice(list(self.patterns.keys()))
        input_rates = self.patterns[self.current_pattern_name]

        # Restore the SNN to its initial state and set input rates
        self.snn_network.restore('initial_snn') # Reset neuron states, time etc.
        # Access PoissonGroup by name ('input_layer') defined during creation
        self.snn_network['input_layer'].rates = input_rates 

        # Run SNN for the specified duration to get initial state
        self.snn_network.run(self.duration, report='off') 

        # Get the state representation (discretized average firing rates)
        state = self._get_snn_state()
        return state

    def _get_snn_state(self):
        # Average firing rate of each output neuron over the presentation window
        spike_monitor = self.snn_network['output_spikes']
        duration_sec = float(self.duration / b2.second)

        # SpikeMonitor.count holds the number of recorded spikes per neuron
        counts = np.asarray(spike_monitor.count[:])
        if duration_sec > 0:
            rates = counts / duration_sec # spikes / duration in seconds -> Hz
        else:
            rates = np.zeros(len(counts))

        # Discretize the rates to use as keys in the Q-table
        # Simple binning: <10 Hz -> 1, 10-30 Hz -> 2, >=30 Hz -> 3
        bins = [-np.inf, 10, 30, np.inf] # Define rate bins (adjust as needed)
        # Convert to plain ints so the tuple is a clean, hashable dictionary key
        discretized_rates = tuple(int(b) for b in np.digitize(rates, bins))

        return discretized_rates

    def step(self, action):
        # In this simple task, the episode ends after one action
        correct_action = self.pattern_labels[self.current_pattern_name]

        if action == correct_action:
            reward = 1.0
        else:
            reward = -1.0

        done = True
        # Get the SNN output rates again to represent the 'next_state'
        # In this specific task, the state doesn't change after the action,
        # but we return it for consistency with RL loop. Can also return None.
        next_state = self._get_snn_state() 

        return next_state, reward, done

print("SNN Pattern Environment defined.")

In [None]:
# --- Q-Learning Agent (using the same class as before) ---
import time  # Used to time the training loop below

env_snn = SNNPatternEnv(snn_network, patterns, pattern_labels, pattern_duration)

# Test reset and state generation
print("Testing environment reset...")
try:
    test_state = env_snn.reset()
    print(f"Initial SNN state (discretized rates): {test_state}")
    print(f"Presented pattern: {env_snn.current_pattern_name}")
except Exception as e:
    print(f"Error during env reset test: {e}")
    print("Check SNN setup and environment code.")

# Create the agent
agent_snn = QLearningAgent(env_snn, alpha=0.1, gamma=0.9, # Gamma less important here (1-step episodes)
                           epsilon=1.0, epsilon_decay=0.99, epsilon_min=0.05)
print("\nQ-Learning agent created for SNN environment.")

# --- Training Loop ---
num_episodes_snn = 3000
rewards_per_episode_snn = []
history = [] # Store (pattern, chosen_action, correct_action)

print(f"\nStarting SNN+RL training for {num_episodes_snn} episodes...")
start_time_snn = time.time()

for episode in range(num_episodes_snn):
    state = env_snn.reset()
    # env_snn.reset() runs the SNN and returns the discretized rate state

    action = agent_snn.choose_action(state)
    # The 'step' in this env mainly determines reward based on the action
    next_state, reward, done = env_snn.step(action) 

    # Store history for analysis
    correct_action = env_snn.pattern_labels[env_snn.current_pattern_name]
    history.append({'pattern': env_snn.current_pattern_name, 
                      'state': state,
                      'chosen': action, 
                      'correct': correct_action, 
                      'reward': reward})
    
    # Q-learning update uses the state observed *before* the action
    agent_snn.update_q_table(state, action, reward, next_state, done) 
    agent_snn.update_epsilon()
    rewards_per_episode_snn.append(reward)

    report_every = max(1, num_episodes_snn // 10)  # Avoid modulo-by-zero for runs shorter than 10 episodes
    if (episode + 1) % report_every == 0:
        # Calculate accuracy over the most recent reporting window
        recent_history = history[-report_every:]
        recent_accuracy = sum(1 for h in recent_history if h['chosen'] == h['correct']) / len(recent_history)
        print(f"Episode {episode + 1}/{num_episodes_snn} | Recent Acc: {recent_accuracy:.3f} | Epsilon: {agent_snn.epsilon:.3f} | Q-States: {len(agent_snn.q_table)}")

end_time_snn = time.time()
if num_episodes_snn > 0:
    total_accuracy = sum(1 for h in history if h['chosen'] == h['correct']) / num_episodes_snn
else:
    total_accuracy = 0.0
print(f"\nSNN+RL Training finished in {end_time_snn - start_time_snn:.2f} seconds.")
print(f"Overall Accuracy: {total_accuracy:.3f}")

# --- Plot SNN+RL Training Progress ---
plt.figure(figsize=(10, 5))
window_size = 100 
# Calculate accuracy in windows
accuracy_history = [1 if h['chosen'] == h['correct'] else 0 for h in history]
if len(accuracy_history) >= window_size:
    smoothed_accuracy = np.convolve(accuracy_history, np.ones(window_size)/window_size, mode='valid')
    plot_indices = np.arange(window_size - 1, len(accuracy_history))
    plt.plot(plot_indices, smoothed_accuracy)
    plt.xlabel(f'Episode (Smoothed over {window_size})')
else:
    plt.plot(accuracy_history) # Plot raw if not enough data
    plt.xlabel('Episode')

plt.title('Accuracy over Time (SNN+RL)')
plt.ylabel('Accuracy')
plt.ylim(-0.05, 1.05)
plt.grid(True)
plt.show()

# --- Inspect Q-Table (Optional) ---
print(f"\nQ-Table size: {len(agent_snn.q_table)} states encountered.")
print("Sample Q-Table entries:")
count = 0
for state, actions in agent_snn.q_table.items():
    # Format actions for better readability
    action_values = {f"Action {k}": f"{v:.2f}" for k, v in actions.items()}
    print(f"  State {state}: {action_values}")
    count += 1
    if count >= 5:
        break
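
Overall accuracy can hide why the agent succeeds or fails. The sketch below shows two diagnostics computed from a `history` list with the same keys as the one recorded in the training loop: accuracy per pattern, and which patterns produced each discretized state. The helper names and the toy data are hypothetical; replace `toy_history` with your real `history`.

```python
from collections import Counter, defaultdict

def per_pattern_accuracy(history):
    """Fraction of correct choices for each presented pattern."""
    totals, correct = Counter(), Counter()
    for h in history:
        totals[h['pattern']] += 1
        correct[h['pattern']] += int(h['chosen'] == h['correct'])
    return {p: correct[p] / totals[p] for p in totals}

def patterns_per_state(history):
    """For each SNN-derived state, count how often each pattern produced it."""
    table = defaultdict(Counter)
    for h in history:
        table[h['state']][h['pattern']] += 1
    return {s: dict(c) for s, c in table.items()}

# Toy data in the same format as `history` from the training loop
toy_history = [
    {'pattern': 'A', 'state': (1, 2), 'chosen': 0, 'correct': 0, 'reward': 1.0},
    {'pattern': 'A', 'state': (1, 1), 'chosen': 1, 'correct': 0, 'reward': -1.0},
    {'pattern': 'B', 'state': (1, 1), 'chosen': 1, 'correct': 1, 'reward': 1.0},
    {'pattern': 'B', 'state': (2, 1), 'chosen': 1, 'correct': 1, 'reward': 1.0},
]

print(per_pattern_accuracy(toy_history))
print(patterns_per_state(toy_history))
```

A state that is produced by both patterns about equally often carries no information for the Q-learner, so accuracy on those episodes cannot exceed chance no matter how long training runs.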

### 4.5. Discussion and Exercise 4 (30 mins)

*   **Interpretation:** The SNN acted as a fixed (non-learning) but temporally dynamic feature extractor. The Q-learner learned to map the SNN's output firing patterns (our discretized 'state') to the correct classification action.
*   **Limitations:**
    *   The SNN itself didn't learn; its weights were fixed. True neuromorphic RL would involve adapting the SNN's synapses based on reward (e.g., using reward-modulated plasticity).
    *   The state discretization (binning firing rates into three coarse levels) was basic and might lose information. With `num_outputs` neurons there are up to 3^`num_outputs` possible states, so larger SNNs lead to a large, sparsely visited state space that can hinder learning. Using function approximation (like Deep Q-Networks, potentially with SNNs) is more scalable.
    *   The task was simple (single step, binary classification).
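
To give a flavor of what reward-based synaptic adaptation could look like, here is a deliberately simplified sketch of a three-factor rule: a Hebbian eligibility trace records recent pre/post coincidences, and a reward signal converts that trace into a weight change. This is not full reward-modulated STDP (it ignores the order and precise timing of spikes), and the function name and all parameter values are assumptions for illustration.

```python
import numpy as np

def reward_modulated_step(weights, eligibility, pre_spikes, post_spikes, reward,
                          lr=0.01, trace_decay=0.9, w_min=0.0, w_max=2.0):
    """One time step of a reward-modulated Hebbian update.

    weights, eligibility: arrays of shape (n_pre, n_post)
    pre_spikes, post_spikes: 0/1 arrays marking which neurons spiked this step
    """
    # Hebbian term: 1 where a presynaptic and a postsynaptic neuron both spiked
    coincidence = np.outer(pre_spikes, post_spikes).astype(float)
    # The trace decays over time and remembers recent coincidences
    eligibility = trace_decay * eligibility + coincidence
    # Reward gates the change: positive reward strengthens, negative weakens
    weights = np.clip(weights + lr * reward * eligibility, w_min, w_max)
    return weights, eligibility

w = np.ones((2, 2))
e = np.zeros((2, 2))
w, e = reward_modulated_step(w, e, pre_spikes=np.array([1, 0]),
                             post_spikes=np.array([0, 1]), reward=1.0)
print(w)
```

Only the synapse from the first input to the second output changes here, because it is the only one whose pre- and postsynaptic neurons were both active when the reward arrived.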

**Exercise 4: Explore the SNN-RL System:**

1.  **SNN Output Neurons:** Change `num_outputs` in the SNN setup (e.g., to 2 or 8). Re-run the SNN setup, environment creation, agent creation, and training loop. How does this affect the size of the Q-table state space (number of unique discretized rate tuples) and the final accuracy? Does more complex SNN output help or hinder the simple Q-learner?
2.  **SNN Connectivity/Weights:** Modify the SNN's connection probability (`connection_probability`) or the weight range (`min_weight_snn`, `max_weight_snn`). Re-run everything from the SNN setup onwards. Does a significantly different SNN structure make it harder or easier for the Q-learner to solve the task? (Remember the SNN weights are *not* learning here, so you're changing the fixed feature extractor).
3.  **Pattern Similarity:** Make Pattern A and Pattern B more similar (e.g., change `high_rate_val` to be closer to `base_rate_val`, or make the active neuron groups overlap more). Re-run the pattern definition, environment, agent, and training. Can the system still distinguish them? How does accuracy change?
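
Two small helpers may be useful while working through these exercises. The first gives the upper bound on distinct Q-table states for a given number of output neurons (assuming the three rate bins used in the environment). The second builds rate patterns with a configurable overlap between the two high-rate groups. Both `max_q_states` and `make_patterns` are hypothetical helpers, not part of the workshop code.

```python
import numpy as np

def max_q_states(num_outputs, num_bins=3):
    """Upper bound on distinct discretized rate tuples."""
    return num_bins ** num_outputs

def make_patterns(num_inputs, base_rate_hz, high_rate_hz, overlap=0):
    """Return two rate arrays (plain Hz values) whose high-rate groups share
    2 * overlap inputs around the midpoint; overlap=0 reproduces A and B."""
    half = num_inputs // 2
    a = np.full(num_inputs, float(base_rate_hz))
    b = np.full(num_inputs, float(base_rate_hz))
    a[:half + overlap] = high_rate_hz   # Pattern A: first half (plus overlap) is high
    b[half - overlap:] = high_rate_hz   # Pattern B: second half (plus overlap) is high
    return a, b

for n in (2, 4, 8):
    print(f"num_outputs={n}: at most {max_q_states(n)} states")

a, b = make_patterns(20, base_rate_hz=10, high_rate_hz=50, overlap=3)
print(f"inputs that differ between patterns: {int(np.sum(a != b))} of {len(a)}")
```

To use the generated arrays with the SNN, multiply them by `b2.Hz` before assigning them to the `patterns` dictionary.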

In [None]:
# --- Exercise 4 Code Space ---

# This exercise requires modifying parameters in the cells above and re-running them.
# There's no single block of code to write here.

# --- Your Actions ---

# 1. Choose ONE parameter area to explore first (e.g., num_outputs).

# 2. Go back to the relevant cell:
#    - For num_outputs: '# --- SNN Setup ---' cell.
#    - For connectivity/weights: '# --- SNN Setup ---' cell.
#    - For pattern similarity: '# --- Define Spike Patterns ---' cell.

# 3. Modify the parameter(s) in that cell.
#    - Example (for num_outputs): Change `num_outputs = 4` to `num_outputs = 2`.
#    - Example (for connectivity): Change `connection_probability = 0.5` to `0.2`.
#    - Example (for weights): Change `min_weight_snn`, `max_weight_snn`.
#    - Example (for patterns): Change `high_rate_val = 50` to `30`.

# 4. Re-run the modified cell AND all subsequent cells in Module 4:
#    - SNN Setup (if modified)
#    - Pattern Definition (if modified)
#    - SNN Pattern Environment Definition (re-run to pick up changes)
#    - Agent Creation & Training Loop (re-run to train with new setup)

# 5. Observe the results:
#    - Look at the printed output during training (accuracy, Q-states encountered).
#    - Examine the final accuracy plot.
#    - Note the final Q-table size.

# 6. Reflect: How did the change affect the system's ability to learn?
#    - Did accuracy improve or decrease?
#    - Did learning take longer (more episodes to reach good accuracy)?
#    - Did the state space size (Q-States) change significantly?

# 7. (Optional) Reset the parameter you changed, then try modifying a different one.

print("--- Exercise 4 Instructions ---")
print("Modify parameters in the cells above as described in the markdown.")
print("Re-run the sequence of cells starting from your modification down to the plotting cell.")
print("Observe the impact on accuracy, learning speed, and Q-table size.")


## Wrap-up and Further Learning (30 mins)

Congratulations! You've covered the basics of Neuromorphic Computing and Reinforcement Learning, and even built a simple system combining them.

### Key Takeaways:
*   Neuromorphic computing uses brain-inspired principles (spikes, parallelism, co-location of memory and computation) for efficient computation, especially with SNNs.
*   SNNs communicate with timed spikes, simulated using models like LIF. Brian2 is a powerful tool for this.
*   Reinforcement Learning trains agents to make decisions by maximizing rewards through environmental interaction (Q-Learning is a fundamental algorithm).
*   Combining NC and RL holds promise for energy-efficient, biologically plausible AI, but presents challenges in learning rules and credit assignment.
*   Even simple integrations (like using SNN outputs as RL states) demonstrate the potential synergy.