# Using LLMs to Design a Reinforcement Learning Algorithm for Energy-Efficient Workload Placement in Data Centers

In this notebook, we explore how Large Language Models (LLMs) can assist in designing reinforcement learning (RL) algorithms for a practical and impactful use case: **energy-efficient workload placement in data centers**.

Our approach prompts several state-of-the-art LLMs to act as AI researchers and independently propose RL-based solutions to this problem. The models queried include:

- **GPT-4o-mini**
- **Gemini 2.0 Flash**
- **DeepSeek Chat**
- **DeepSeek Reasoner**
- **LLaMA 3.3 70B Versatile**
- **LLaMA 3.2**

Each of these models was tasked with generating an RL algorithm tailored to the energy-aware scheduling of workloads within a data center environment.

To evaluate the quality and viability of the responses, we used **o3-mini**, a compact reasoning model, as a judge to critically assess the outputs across key dimensions.

At the end of this notebook, we present a concise summary of the findings, highlighting how each model contributed to the overall solution space and reflecting on how effectively LLMs can automate or accelerate the design of RL algorithms for real-world infrastructure optimization problems.

### Workflow

1. Use LLMs such as GPT-4o-mini, Gemini, DeepSeek, and LLaMA to design RL solutions.
2. Give each LLM the same prompt and collect its response.
3. Evaluate the responses with a separate model (o3-mini) on quality and reasoning.
4. Compare the models' response times and output quality.


### Importing Required Packages

The following cell imports all necessary libraries and modules used throughout the notebook.


In [1]:
from dotenv import load_dotenv
import os
from IPython.display import display, Markdown
from openai import OpenAI
import json
import time

### API Key Management

To access different LLMs, you need API keys for each provider. Some APIs are free to use, while others require a paid subscription.  
For security and convenience, all API keys are stored in a `.env` file.

The following cell loads the keys from the `.env` file so they can be used throughout the notebook.


In [2]:
load_dotenv(override=True)
openai_api_key = os.getenv("OPENAI_API_KEY")
google_api_key = os.getenv("GOOGLE_API_KEY")
groq_api_key = os.getenv("GROQ_API_KEY")
deepseek_api_key = os.getenv("DEEPSEEK_API_KEY")

if openai_api_key:
    print(f"This is the OpenAI key:{openai_api_key[:4]}")
else:
    print("No API key found for OpenAI")

if google_api_key:
    print(f"This is the Gemini key:{google_api_key[:4]}")
else:
    print("No API key found for Google")

if groq_api_key:
    print(f"This is the GROQ key:{groq_api_key[:4]}")
else:
    print("No API key found for Groq")

if deepseek_api_key:
    print(f"This is the DeepSeek key:{deepseek_api_key[:4]}")
else:
    print("No API key found for DeepSeek")

This is the OpenAI key:sk-p
This is the Gemini key:AIza
This is the GROQ key:gsk_
This is the DeepSeek key:sk-e
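
Printing even the first characters of a key writes part of a secret into the saved notebook output. A minimal sketch of an alternative check, assuming a hypothetical helper `describe_key` that is not part of the original notebook, reports only whether each key is set and how long it is:

```python
import os
from typing import Optional


def describe_key(name: str, value: Optional[str]) -> str:
    """Return a status line for an API key without revealing any of its characters."""
    if not value:
        return f"{name}: not set"
    return f"{name}: set ({len(value)} characters)"


# Environment variable names are the ones read earlier in this notebook.
for env_name in ("OPENAI_API_KEY", "GOOGLE_API_KEY", "GROQ_API_KEY", "DEEPSEEK_API_KEY"):
    print(describe_key(env_name, os.getenv(env_name)))
```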


### Prompting the LLMs

We craft a detailed prompt instructing the LLMs to act as researchers and design an RL-based solution. The LLMs are expected to define:
- The environment (observation space, action space, and the reset and step functions)
- The reward function
- The agent
- Suitable algorithms
- Constraints such as node capacity


In [3]:
task_description = """
You are a computer science researcher. Your task is to design an RL system to optimize energy in a datacenter. To complete the task, you need to follow these steps:
1. Design the cluster environment: Workloads arrive one by one and send their requests, including the demanded amounts of CPU, Memory, and Storage. These workloads are placed on a node and, after some time, leave the system. Therefore, you should define the observation space, action space, and the reset and step functions.
2. Design the reward function
3. Design the agent
4. Select the best algorithms for this problem
The amount of demanded resources for all workloads running on a node should not exceed the capacity of that node.
give me the answer in scientific tone and in markdown format.
"""

### Initializing Variables and Setting Up the LLM Request

- `competitors`, `answers`, and `times` are lists that store each LLM's name, its generated response, and its response time in seconds.  
- `openai = OpenAI()` initializes the OpenAI client to interact with the GPT models.  
- `messages` defines the input prompt to be sent to the LLM, specifying the user role and task description.


In [4]:
competitors = []
answers = []
times = []
openai = OpenAI()
messages = [{"role": "user", "content": task_description}]
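
The three parallel lists stay aligned only as long as every cell appends to all of them in the same order. As an alternative sketch (the `ModelResult` name and its fields are hypothetical and not part of the original notebook), a single list of records keeps each model's name, answer, and elapsed time together:

```python
from dataclasses import dataclass


@dataclass
class ModelResult:
    """One LLM's response together with how long the request took."""
    model_name: str
    answer: str
    elapsed_seconds: float


results: list[ModelResult] = []

# Illustrative placeholder values, not real model output.
results.append(ModelResult("example-model", "example answer", 1.5))

# The parallel lists used in this notebook can still be derived on demand.
competitors = [r.model_name for r in results]
answers = [r.answer for r in results]
times = [r.elapsed_seconds for r in results]
```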

### Querying GPT-4o-mini and Storing Its Response

- Set the target model with `model_name`.  
- Send the prompt (`messages`) to the model using the OpenAI client, timing the call.  
- Extract the generated answer from the response.  
- Display the answer in Markdown format for readability.  
- Append the model name, its answer, and the elapsed time to the respective lists for later comparison.

*(The cells for the other models repeat these steps.)*
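
Because every provider cell repeats these steps, they could be factored into one function. The sketch below is an assumption about how that might look, not the notebook's actual code: `query_model` is a hypothetical name, and `client` is assumed to expose the OpenAI-compatible `chat.completions.create(model=..., messages=...)` call used throughout this notebook. It uses `time.perf_counter()`, a monotonic clock intended for measuring intervals.

```python
import time


def query_model(client, model_name, messages):
    """Send `messages` to `model_name` and return (answer, elapsed_seconds)."""
    start = time.perf_counter()
    response = client.chat.completions.create(model=model_name, messages=messages)
    elapsed = time.perf_counter() - start
    # The first choice holds the generated text, as in the cells of this notebook.
    answer = response.choices[0].message.content
    return answer, elapsed
```

With such a helper, a provider cell would reduce to `answer, elapsed = query_model(openai, "gpt-4o-mini", messages)` followed by the three `append` calls.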


In [5]:
model_name = "gpt-4o-mini"

start = time.time()
response = openai.chat.completions.create(model=model_name, messages=messages)
end = time.time()
answer = response.choices[0].message.content

elapsed = end - start
print(f"running time: {elapsed}")
times.append(elapsed)

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)

running time: 10.346619367599487


# Designing a Reinforcement Learning System for Energy Optimization in a Data Center

## 1. Cluster Environment Design

### 1.1 Observation Space
The observation space should capture the current state of the datacenter. It could include the following components:
- **Node States**: 
  - CPU utilization (as a percentage)
  - Memory utilization (as a percentage)
  - Storage utilization (as a percentage)
  - Current temperature of the node
  - Power consumption of the node in Watts
  
- **Workload Queue**:
  - Number of pending workloads
  - Attributes of the workloads (e.g., required CPU, Memory, Storage for each workload)

Mathematically, the observation space can be represented as:
\[ \text{Observation} = \{ \text{Node States}, \text{Workload Queue} \} \]

### 1.2 Action Space
The action space will dictate how workloads are assigned to nodes. Actions could consist of:
- Assigning a particular workload to a specific node
- Moving a workload from one node to another (migration)
- Releasing resources when a workload completes
- Adjusting the power state of nodes (e.g., turn on/off)

The action space can be defined formally as:
\[ \text{Action} = \{ \text{Assign (workload, node)}, \text{Migrate (workload, from node to node)}, \text{Release (workload)}, \text{Power State Change (node)} \} \]

### 1.3 Reset Function
The reset function initializes or reinitializes the state of the environment. It typically involves:
- Setting all nodes to an idle state (0% CPU, Memory, Storage)
- Clearing the workload queue
- Resetting power states and thermal states

### 1.4 Step Function
The step function executes the chosen action, updates the state of the environment, and computes the reward. It operates as follows:
1. Update the workload queue based on incoming/outgoing workloads.
2. Allocate resources based on the selected action.
3. Update node metrics such as CPU, Memory, Storage utilization, temperature, and power consumption.
4. Return the new observation, reward, and a flag indicating whether the episode has ended.

## 2. Reward Function
The reward function is critical for reinforcing desirable behaviors in the RL agent. A suitable reward structure may include:
- **Negative reward for energy consumption**: Proportional to the power consumed by all nodes during a time step.
- **Positive reward for successful workload handling**: Allocating nodes without exceeding capacity or leading to overheating.
- **Extra penalty for idle nodes**: To encourage resource utilization rather than idleness.
- **Penalty for migration**: To discourage frequent workload migrations unless necessary.

Thus, the reward function \( R \) can be expressed as:
\[ R = - (\text{Total Power Consumption}) + k \times (\text{Successful Workloads}) - m \times (\text{Idle Nodes}) - n \times (\text{Migrations}) \]
Where \( k, m, n \) are constants that balance the importance of different factors.

## 3. Agent Design
The agent must be designed to operate effectively within the specified environment. Key components include:
- **Policy Network**: A neural network that takes observations as inputs and outputs action distributions. This could be implemented through Deep Q-Networks (DQN) or policy gradients.
- **Experience Replay**: To enhance learning stability, maintain an experience replay buffer for off-policy learning.
- **Exploration Strategy**: Incorporate strategies such as ε-greedy to balance exploration of new actions with exploitation of known rewarding actions.

## 4. Algorithm Selection
Based on the characteristics of the problem, the following algorithms could be well-suited:
- **Deep Q-Network (DQN)**: For discrete action spaces; leverages deep learning to approximate Q-values.
- **Proximal Policy Optimization (PPO)**: For continuous action spaces; known for its stability and adaptability in training.
- **Trust Region Policy Optimization (TRPO)**: Offers a more formal constraint for policy updates, ensuring that the agent does not stray excessively from prior policies.
- **Deep Deterministic Policy Gradient (DDPG)**: Suitable for environments with continuous action spaces.

Each algorithm will require careful tuning of hyperparameters and may benefit from domain-specific modifications.

---

This structured approach provides a systematic framework to develop an RL-based system for optimizing energy use within a datacenter, leveraging the interaction between workloads and resource management effectively.
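
As an annotation to the response above, its reward formula can be written out directly. This is a minimal sketch: the function name and the default values of `k`, `m`, and `n` are assumptions, since the response leaves the constants unspecified.

```python
def reward(total_power_watts, successful_workloads, idle_nodes, migrations,
           k=1.0, m=1.0, n=1.0):
    """R = -(total power) + k * (successful workloads) - m * (idle nodes) - n * (migrations)."""
    return (-total_power_watts
            + k * successful_workloads
            - m * idle_nodes
            - n * migrations)


# Illustrative numbers only: 500 W drawn, 3 workloads placed, 2 idle nodes, 1 migration.
example = reward(500.0, 3, 2, 1, k=10.0, m=5.0, n=2.0)
print(example)  # -482.0
```

In this example the raw power term dominates the other three, which suggests that power would likely need to be normalized (or the constants scaled) before the remaining terms can influence the policy.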

### Using the Gemini API to Get a Model Response

- Initialize the Gemini client with the Google API key and the base URL of Gemini's OpenAI-compatible endpoint, so the same `OpenAI` client class can be reused.  
- Specify the Gemini model name (`gemini-2.0-flash`).


In [6]:
gemini = OpenAI(api_key=google_api_key, base_url="https://generativelanguage.googleapis.com/v1beta/openai/")
model_name = "gemini-2.0-flash"

start = time.time()
response = gemini.chat.completions.create(model=model_name, messages=messages)
end = time.time()
answer = response.choices[0].message.content

elapsed = end - start
print(f"running time: {elapsed}")
times.append(elapsed)


display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)

running time: 14.083750247955322


## Reinforcement Learning System for Datacenter Energy Optimization

This document outlines the design of a Reinforcement Learning (RL) system for optimizing energy consumption in a datacenter environment. The core objective is to dynamically manage workload placement to minimize energy expenditure while ensuring Quality of Service (QoS) by adhering to resource constraints.

### 1. Datacenter Environment Design

The datacenter environment is modeled as a discrete-time simulation. It comprises a cluster of `N` nodes, each with a defined capacity of CPU, Memory, and Storage.  Workloads arrive dynamically, requesting specific amounts of these resources.  These requests are served by placing the workload on a suitable node. Workloads remain active for a variable duration and then depart.

**1.1 Observation Space:**

The observation space, `S`, needs to represent the current state of the datacenter.  It includes node utilization and workload characteristics. A potential design is:

`S = [s_1, s_2, ..., s_N, w]`

where:

*   `s_i` represents the state of node *i*, containing:
    *   `cpu_util_i`: CPU utilization of node *i* (percentage).
    *   `mem_util_i`: Memory utilization of node *i* (percentage).
    *   `storage_util_i`: Storage utilization of node *i* (percentage).
    *   `node_temp_i`: Temperature of node *i* (Celsius). (Optional, but relevant for energy modeling).
*   `w` represents the current workload's resource demands:
    *   `cpu_req`: CPU requested by the current workload.
    *   `mem_req`: Memory requested by the current workload.
    *   `storage_req`: Storage requested by the current workload.
    *   `duration_req`: Duration of the workload (discrete time steps). (Optional, for workload scheduling).

The observation space is continuous (for utilization percentages and resource demands) and potentially high-dimensional, depending on the number of nodes.  Normalization techniques (e.g., min-max scaling or standardization) are crucial to improve learning stability.

**1.2 Action Space:**

The action space, `A`, represents the possible actions the agent can take, which is the node where to place the incoming workload.

`A = {0, 1, 2, ..., N-1}`

where each integer represents the index of a node in the cluster. Action `i` corresponds to placing the workload on node *i*. If no feasible node is available (i.e., placing the workload on any node would exceed its capacity), a designated "reject" action `N` could be included.

**1.3 Reset Function:**

The `reset()` function initializes the datacenter environment to a clean state. This includes:

*   Setting all node utilizations to zero.
*   Generating a new, random workload according to a predefined workload arrival distribution (e.g., Poisson process).
*   Setting the simulation time to zero.

**1.4 Step Function:**

The `step(action)` function executes an action and updates the environment. It comprises the following steps:

1.  **Action Execution:**
    *   If `action` is `0` to `N-1`: Attempt to place the current workload on node `action`.  Check if sufficient resources are available (CPU, Memory, Storage) on the target node.
    *   If `action` is `N` (the "reject" action): Reject the current workload.

2.  **Resource Allocation (if action successful):**
    *   Update node utilizations based on the workload's resource demands.

3.  **Reward Calculation (described in Section 2).**

4.  **Environment Update:**
    *   Advance the simulation time by one time step.
    *   Decrement the remaining duration of currently running workloads.
    *   Remove workloads that have completed their execution (duration reaches zero).
    *   Generate a new workload according to the workload arrival distribution.

5.  **Observation Update:**
    *   Update the observation space `S` based on the new node utilizations and the new workload.

6.  **Termination Condition:** The simulation episode terminates after a fixed number of time steps or when a critical failure occurs (e.g., a node overheats).

7.  **Return:** The function returns `(next_state, reward, done, info)`, where:
    *   `next_state`: The updated observation space `S`.
    *   `reward`: The reward received for taking the action.
    *   `done`: A boolean indicating whether the episode has terminated.
    *   `info`:  A dictionary containing debugging information (e.g., resource allocation details, rejected workload count).

### 2. Reward Function Design

The reward function, `R`, is critical for guiding the RL agent towards energy-efficient workload placement.  It should incentivize low energy consumption while penalizing resource violations and workload rejections.  A potential design is:

`R = w_e * R_energy + w_r * R_reject + w_v * R_violation`

where:

*   `R_energy`: Reward related to energy consumption.  This should be *negative* to penalize high energy usage. A suitable formulation is:

    `R_energy = -Energy_consumption`

    `Energy_consumption` can be estimated based on node utilization and temperature. A simplified model could be:

    `Energy_consumption = sum(P_idle_i + (P_max_i - P_idle_i) * cpu_util_i / 100)` summed over all nodes `i`, where `P_idle_i` and `P_max_i` are the idle and maximum power consumption of node *i*, respectively.
*   `R_reject`: Reward for rejecting a workload. This should also be *negative* to penalize workload rejections.

    `R_reject = -Penalty_reject  if action == N else 0`

    `Penalty_reject` is a predefined penalty value.  A higher penalty encourages the agent to accept workloads whenever possible.

*   `R_violation`: Reward for resource violations (exceeding node capacity).  This should be a large *negative* value to strongly discourage violations.

    `R_violation = -Penalty_violation if any_node_violated else 0`

    `Penalty_violation` is a significantly larger penalty than `Penalty_reject`.  `any_node_violated` is a boolean indicating whether any node's resource capacity (CPU, Memory, Storage) was exceeded after placing the workload.

*   `w_e`, `w_r`, and `w_v` are weights that balance the relative importance of each reward component.  These weights require careful tuning to achieve the desired performance.  A good starting point might be `w_e = 1`, `w_r = 10`, and `w_v = 100`.

### 3. Agent Design

The agent is a neural network that maps the observed state `S` to an action `A`.  The architecture of the neural network depends on the chosen RL algorithm (see Section 4).  Common choices include:

*   **Deep Q-Network (DQN):**  The network estimates the Q-values (action-value function) for each possible action given the current state. The agent selects the action with the highest Q-value (exploitation) or a random action (exploration) according to an epsilon-greedy policy.  Requires a separate network for Q-value approximation.
*   **Actor-Critic Methods (e.g., A2C, PPO, DDPG):** These methods employ two networks: an actor network that determines the policy (probability distribution over actions) and a critic network that estimates the value function (expected cumulative reward). They are suitable for continuous action spaces (though discretizing the action space is also possible). The actor uses the critic's feedback to improve its policy.

**Agent Components:**

*   **Neural Network:**  A deep neural network (DNN) with multiple layers, parameterized by weights `θ`. The input is the state `S`, and the output depends on the chosen RL algorithm.  For DQN, the output is a Q-value for each action. For actor-critic methods, the output includes parameters for the policy (e.g., mean and standard deviation of a Gaussian distribution for continuous actions).
*   **Experience Replay Buffer (DQN):** A memory buffer that stores past experiences `(s, a, r, s', done)`.  The agent samples mini-batches from the buffer to train the neural network, breaking temporal correlations and improving learning stability.
*   **Optimizer:** An optimization algorithm (e.g., Adam, RMSprop) used to update the neural network weights `θ` based on the gradients calculated from the reward signal.
*   **Exploration Strategy:**  A strategy for balancing exploration (trying new actions) and exploitation (choosing the best-known action). Common strategies include epsilon-greedy (DQN) and adding noise to the action selection (DDPG).

### 4. Algorithm Selection

Given the nature of the problem, the following RL algorithms are considered suitable:

*   **Deep Q-Network (DQN):** A classic value-based algorithm suitable for discrete action spaces. DQN is relatively straightforward to implement but can suffer from instability, especially with high-dimensional state spaces. Techniques like Double DQN, Dueling DQN, and prioritized experience replay can improve its performance.  *Justification:* Simple to implement and understand, good baseline to compare to more complex algorithms.

*   **Proximal Policy Optimization (PPO):** A state-of-the-art policy gradient algorithm that balances exploration and exploitation effectively. PPO is more stable than other policy gradient methods due to its clipping mechanism, which prevents overly large policy updates. It is suitable for both discrete and continuous action spaces. *Justification:* Stable, well-performing policy gradient method suitable for complex environments.

*   **Soft Actor-Critic (SAC):**  An off-policy actor-critic algorithm that incorporates entropy regularization, encouraging exploration and preventing premature convergence to suboptimal policies. SAC is known for its sample efficiency and robustness. *Justification:* Efficient and robust exploration, especially suited for complex, high-dimensional environments.

**Justification for Algorithm Selection:**

*   The datacenter environment is complex, with a continuous state space and a discrete action space.
*   The reward function is sparse and delayed, making it challenging for the agent to learn.
*   The high dimensionality of the state space requires powerful function approximation techniques, such as deep neural networks.

**Training Procedure:**

1.  **Initialization:** Initialize the environment, the agent (neural network and optimizer), and the experience replay buffer (if applicable).
2.  **Episode Loop:** Repeat until the maximum number of episodes is reached:
    *   Reset the environment to a new initial state.
    *   **Time Step Loop:** Repeat until the episode terminates:
        *   Observe the current state `s`.
        *   Select an action `a` based on the agent's policy (e.g., epsilon-greedy for DQN, policy network for actor-critic methods).
        *   Execute the action in the environment and observe the next state `s'`, the reward `r`, and the done flag.
        *   Store the experience `(s, a, r, s', done)` in the experience replay buffer (if applicable).
        *   Update the agent's neural network weights by sampling a mini-batch from the experience replay buffer (if applicable) and applying the chosen RL algorithm's update rule (e.g., Q-learning update for DQN, policy and value function updates for actor-critic methods).
        *   Update the current state `s = s'`.
3.  **Evaluation:** After training, evaluate the agent's performance on a separate test set to assess its generalization ability.

**Hyperparameter Tuning:**

The performance of the RL system is highly sensitive to hyperparameters, such as learning rate, discount factor, exploration rate, and neural network architecture.  Hyperparameter tuning is crucial for achieving optimal performance. Techniques like grid search, random search, or Bayesian optimization can be used to find the best hyperparameter values.

This document provides a comprehensive framework for designing an RL system for datacenter energy optimization. The specific implementation details will depend on the available resources and the desired level of performance. The design choices presented here provide a solid foundation for further research and development.
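
As an annotation to the response above, its linear power model and weighted reward can be sketched in a few lines. The weights default to the starting point the response suggests (`w_e = 1`, `w_r = 10`, `w_v = 100`); the function names, the tuple layout for nodes, and the two penalty values are assumptions made for illustration.

```python
def energy_consumption(nodes):
    """Sum P_idle + (P_max - P_idle) * cpu_util / 100 over all nodes.

    `nodes` is a list of (p_idle_watts, p_max_watts, cpu_util_percent) tuples.
    """
    return sum(p_idle + (p_max - p_idle) * cpu_util / 100.0
               for p_idle, p_max, cpu_util in nodes)


def composite_reward(nodes, action, n_nodes, any_node_violated,
                     w_e=1.0, w_r=10.0, w_v=100.0,
                     penalty_reject=1.0, penalty_violation=10.0):
    """R = w_e * R_energy + w_r * R_reject + w_v * R_violation."""
    r_energy = -energy_consumption(nodes)
    # Action index n_nodes is the designated "reject" action.
    r_reject = -penalty_reject if action == n_nodes else 0.0
    r_violation = -penalty_violation if any_node_violated else 0.0
    return w_e * r_energy + w_r * r_reject + w_v * r_violation


# Two nodes with 100 W idle and 200 W peak power: one at 50% CPU, one idle.
cluster = [(100.0, 200.0, 50.0), (100.0, 200.0, 0.0)]
print(energy_consumption(cluster))                 # 250.0
print(composite_reward(cluster, 0, 2, False))      # -250.0
print(composite_reward(cluster, 2, 2, False))      # -260.0
```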


### Querying DeepSeek's Chat and Reasoner Models

- Initialize the DeepSeek client using the corresponding API key and base URL.  
- Set the model name to `"deepseek-chat"` in the first cell and to `"deepseek-reasoner"` in the second.  




In [7]:
deepseek = OpenAI(api_key=deepseek_api_key, base_url="https://api.deepseek.com/v1")
model_name = "deepseek-chat"

start = time.time()
response = deepseek.chat.completions.create(model=model_name, messages=messages)
end = time.time()
answer = response.choices[0].message.content

elapsed = end - start
print(f"running time: {elapsed}")
times.append(elapsed)


display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)

running time: 61.84269118309021


# Reinforcement Learning System for Datacenter Energy Optimization

## 1. Cluster Environment Design

### Observation Space
The observation space consists of:
- **Node states**: For each node (n ∈ N), we track:
  - Available CPU (percentage)
  - Available memory (percentage)
  - Available storage (percentage)
  - Current power consumption (normalized)
  - Temperature (normalized)
- **Workload characteristics**: For incoming workload (w):
  - Requested CPU cores (normalized)
  - Requested memory (GB, normalized)
  - Requested storage (GB, normalized)
  - Expected duration (normalized)
- **System metrics**:
  - Current PUE (Power Usage Effectiveness)
  - Time of day (for workload pattern awareness)

Total observation dimension: `5|N| + 4 + 2`

### Action Space
The action space is discrete with `|N| + 1` possible actions:
- Place workload on node 1
- Place workload on node 2
- ...
- Place workload on node |N|
- Reject workload (when no suitable node available)

### Environment Dynamics

#### Reset Function
```python
def reset():
    # Initialize all nodes to empty state
    for node in nodes:
        node.available_cpu = node.total_cpu
        node.available_mem = node.total_mem
        node.available_storage = node.total_storage
        node.power = idle_power
        node.temperature = ambient_temp
    
    # Clear workload queue
    workload_queue.clear()
    
    # Reset metrics
    metrics = {
        'total_energy': 0,
        'rejected_workloads': 0,
        'pue': 1.0
    }
    
    # Generate initial observation
    return get_observation()
```

#### Step Function
```python
def step(action):
    # Get current workload
    workload = workload_queue.pop(0)
    
    if action < len(nodes):  # Placement action
        node = nodes[action]
        
        # Check resource constraints
        if (node.available_cpu >= workload.cpu and
            node.available_mem >= workload.mem and
            node.available_storage >= workload.storage):
            
            # Update node resources
            node.available_cpu -= workload.cpu
            node.available_mem -= workload.mem
            node.available_storage -= workload.storage
            
            # Calculate power increase
            power_delta = calculate_power_increase(node, workload)
            node.power += power_delta
            
            # Schedule workload completion
            schedule_completion(node, workload)
            
            reward = calculate_reward(node, workload, 'placed')
        else:
            reward = calculate_reward(None, workload, 'rejected')
    else:  # Reject action
        reward = calculate_reward(None, workload, 'rejected')
    
    # Get next workload
    workload_queue.append(generate_workload())
    
    # Update system metrics
    update_metrics()
    
    # Check termination
    done = check_termination()
    
    return get_observation(), reward, done, get_metrics()
```

## 2. Reward Function Design

The reward function balances multiple objectives:

```python
def calculate_reward(node, workload, action_type):
    if action_type == 'rejected':
        return -α  # Penalty for rejection
    
    # Energy efficiency components
    power_reward = -β * (node.power / node.max_power)
    pue_reward = -γ * (current_pue - ideal_pue)
    
    # Resource utilization components
    util_reward = δ * (
        (1 - node.available_cpu/node.total_cpu) +
        (1 - node.available_mem/node.total_mem) +
        (1 - node.available_storage/node.total_storage)
    ) / 3
    
    # Temperature penalty
    temp_penalty = -ε * max(0, (node.temperature - threshold_temp)/threshold_temp)
    
    # Workload affinity bonus (if applicable)
    affinity_bonus = η * check_affinity(node, workload)
    
    total_reward = (
        power_reward + 
        pue_reward + 
        util_reward + 
        temp_penalty + 
        affinity_bonus
    )
    
    return total_reward
```

Where α, β, γ, δ, ε, η are tunable hyperparameters that control the trade-off between:
- Workload acceptance rate
- Energy efficiency
- Resource utilization
- Thermal management
- Workload affinity considerations

## 3. Agent Design

We propose a hybrid agent architecture:

### Feature Extraction Module
- Multi-layer perceptron for processing node states
- Embedding layer for workload characteristics
- Temporal attention mechanism for time-dependent patterns

### Policy Network
- Dueling DQN architecture for value estimation
- Separate streams for state value and action advantages
- ϵ-greedy exploration with adaptive decay

### Experience Replay
- Prioritized experience replay buffer
- Importance sampling for bias correction
- Hindsight experience replay for rare events

### Auxiliary Tasks
- Joint learning of:
  - Node power consumption prediction
  - Workload duration prediction
  - Temperature forecasting

## 4. Algorithm Selection

### Baseline Algorithms
1. **Deep Q-Network (DQN)**
   - Suitable for discrete action space
   - Stable learning with experience replay
   - Extensible with double Q-learning and dueling architecture

2. **Proximal Policy Optimization (PPO)**
   - Policy gradient method with clipped updates
   - Better for handling continuous aspects of the problem
   - More sample-efficient than vanilla policy gradients

### Advanced Options
1. **Multi-Agent RL (MADRLN)**
   - Treat each node as an independent agent
   - Centralized training with decentralized execution
   - Uses attention mechanisms for coordination

2. **Hierarchical RL**
   - High-level controller for workload distribution
   - Low-level controllers for individual node optimization
   - Temporal abstraction improves scalability

3. **Model-Based RL**
   - Learn transition dynamics of the datacenter
   - Combine with model predictive control
   - Particularly effective for thermal management

### Recommended Approach
A **Rainbow DQN** variant combining:
- Double Q-learning
- Prioritized experience replay
- Dueling networks
- Multi-step learning
- Distributional RL
- Noisy nets for exploration

This provides state-of-the-art performance while maintaining relative implementation simplicity compared to more complex actor-critic methods.

## Implementation Considerations

1. **Simulation Fidelity**:
   - Calibrate power models using real datacenter measurements
   - Validate workload patterns against production traces

2. **Training Protocol**:
   - Curriculum learning starting with simple scenarios
   - Progressive neural networks for transfer learning across configurations

3. **Safety Mechanisms**:
   - Action masking to prevent invalid placements
   - Runtime constraints on decision latency
   - Fallback to heuristic methods during exploration

4. **Evaluation Metrics**:
   - Energy efficiency (PUE improvement)
   - Workload completion rate
   - Resource utilization variance
   - Thermal violation frequency
   - Decision latency percentiles
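
As an annotation to the response above, the action masking it lists under safety mechanisms can be sketched as follows. This assumes the `|N| + 1` discrete action layout described in the response, with the last index as the reject action; the function names and the tuple layout are hypothetical.

```python
def valid_action_mask(available, demand):
    """Return one boolean per action: True if the placement fits.

    `available` is a list of (cpu, mem, storage) free capacities per node,
    `demand` is the (cpu, mem, storage) request of the incoming workload.
    The final entry is the reject action, which is always allowed.
    """
    mask = [all(free >= need for free, need in zip(node, demand))
            for node in available]
    mask.append(True)
    return mask


def masked_argmax(q_values, mask):
    """Pick the highest-valued action among those the mask allows."""
    best = None
    for action, (q, allowed) in enumerate(zip(q_values, mask)):
        if allowed and (best is None or q > q_values[best]):
            best = action
    return best


free_capacity = [(4, 16, 100), (1, 2, 10)]   # node 1 is too small for the request
request = (2, 4, 50)
mask = valid_action_mask(free_capacity, request)
print(mask)                                   # [True, False, True]
print(masked_argmax([0.1, 0.9, -1.0], mask))  # 0
```

Masking at selection time means the agent never takes an infeasible placement, so the environment does not need a separate penalty for capacity violations.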

In [8]:
model_name = "deepseek-reasoner"

start = time.time()
response = deepseek.chat.completions.create(model=model_name, messages=messages)
end = time.time()
answer = response.choices[0].message.content

elapsed = end - start
print(f"running time: {elapsed}")
times.append(elapsed)

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)

running time: 214.10724592208862


### Reinforcement Learning System for Datacenter Energy Optimization

#### 1. **Cluster Environment Design**
The environment simulates a datacenter cluster with `N` heterogeneous nodes. Workloads arrive sequentially, each characterized by a resource demand vector and duration. Nodes have finite capacity for CPU, Memory, and Storage.

##### **State Space (`observation_space`)**  
- **Node States**:  
  For each node `i` (1 ≤ `i` ≤ `N`):  
  - Normalized remaining CPU: `[0, 1]`  
  - Normalized remaining Memory: `[0, 1]`  
  - Normalized remaining Storage: `[0, 1]`  
- **Workload State**:  
  - Normalized demanded CPU: `[0, 1]`  
  - Normalized demanded Memory: `[0, 1]`  
  - Normalized demanded Storage: `[0, 1]`  
- **Time State**:  
  - Normalized time since workload arrival: `[0, 1]`  
- **Dimensionality**:  
  `(N × 3) + 3 + 1 = 3N + 4` continuous features.  

##### **Action Space (`action_space`)**  
- Discrete action space of size `N + 1`:  
  - Actions `{0, 1, ..., N-1}`: Place workload on node `i`.  
  - Action `N`: Reject workload.  
- **Constraints**:  
  Action `a` is invalid if `node_resources[a] < workload_demand` (unless `a = N`).

##### **Reset Function**  
```python
def reset(self):
    # Initialize all nodes with full resources
    self.node_resources = np.copy(self.node_capacities)  # Shape: [N, 3]
    # Clear the clock and the running-workload list that step() relies on
    self.current_time = 0.0
    self.running_workloads = []
    # Generate first workload
    self.workload = self._generate_workload()
    self.arrival_time = 0.0
    # Return initial state
    return self._get_state()
```

##### **Step Function**  
```python
def step(self, action):
    # 1. Validate action
    if action < self.n_nodes:
        assert np.all(self.workload.demand <= self.node_resources[action]), "Invalid placement"
    
    # 2. Update cluster state
    if action < self.n_nodes:  # Placement
        self.node_resources[action] -= self.workload.demand
        self.running_workloads.append((action, self.workload))
    # (Rejection requires no state change)
    
    # 3. Simulate time progression
    self.current_time += self.timestep
    completed = self._remove_completed_workloads()  # Free resources
    self.workload = self._generate_workload()  # New workload
    
    # 4. Compute reward, next state, and done
    reward = self._calculate_reward(action, completed)
    next_state = self._get_state()
    done = (self.current_time >= self.max_episode_time)
    
    return next_state, reward, done, {}
```

---

#### 2. **Reward Function**  
The reward balances energy efficiency, workload rejection penalties, and SLA compliance:  

$$r_t = -\underbrace{\sum_{i=1}^{N} P_i(\mathbf{u}_i) \cdot \Delta t}_{\text{Energy cost}} - \underbrace{\lambda_r \cdot \mathbb{I}_{\text{reject}}}_{\text{Rejection penalty}} + \underbrace{\lambda_s \cdot \mathbb{I}_{\text{SLA}}}_{\text{SLA bonus}}$$  

- **Energy Cost**:  
  $P_i(\mathbf{u}_i) = P_{\text{idle}} + (P_{\text{max}} - P_{\text{idle}}) \cdot \|\mathbf{u}_i\|_2$  
  where $\mathbf{u}_i$ is the utilization vector (CPU, Mem, Storage) of node $i$.  
- **Rejection Penalty**:  
  $\lambda_r$ (e.g., 100) applied when workload is rejected.  
- **SLA Bonus**:  
  $\lambda_s$ (e.g., 10) if workload completes within its deadline.  
- **Time Step**: $\Delta t$ (e.g., 1 second).  
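
To make the reward formula concrete, here is a minimal sketch that evaluates it for one time step. The default values for `p_idle` and `p_max` are illustrative assumptions; `lambda_r`, `lambda_s`, and `dt` use the example values quoted above.

```python
import numpy as np

def node_power(utilization, p_idle=100.0, p_max=250.0):
    """P_i = P_idle + (P_max - P_idle) * ||u_i||_2 for each node (rows of `utilization`)."""
    return p_idle + (p_max - p_idle) * np.linalg.norm(utilization, axis=1)

def step_reward(utilization, rejected, met_sla, dt=1.0, lambda_r=100.0, lambda_s=10.0):
    """r_t = -(sum_i P_i * dt) - lambda_r * 1[reject] + lambda_s * 1[SLA met]."""
    energy_cost = np.sum(node_power(utilization)) * dt
    return -energy_cost - lambda_r * float(rejected) + lambda_s * float(met_sla)

# Two nodes: one with utilization norm 1.0, one idle.
u = np.array([[0.6, 0.8, 0.0],
              [0.0, 0.0, 0.0]])
reward = step_reward(u, rejected=False, met_sla=True)
print(reward)  # approximately -340
```

Because the utilization vector has three components in `[0, 1]`, its L2 norm can reach √3, so `P_i` may exceed `P_max` unless the norm is rescaled. That is worth keeping in mind when tuning the weights.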

---

#### 3. **Agent Design**  
##### **Policy Architecture**  
- **Input Layer**: `3N + 4` units (state features).  
- **Hidden Layers**:  
  - 256-unit Dense (ReLU)  
  - 128-unit Dense (ReLU)  
- **Output Layer**: `N + 1` units (Softmax for action probabilities).  

##### **Key Components**  
- **Experience Replay**: Stores transitions `(s, a, r, s')` to decorrelate updates.  
- **Target Network**: Stabilizes Q-learning (if using DQN variant).  
- **Constraints Handling**:  
  - Mask invalid actions (e.g., nodes with insufficient resources) during training/inference.  
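
Action masking is commonly implemented by setting the logits of invalid actions to negative infinity before the softmax, so those actions receive zero probability. The sketch below shows that idea in NumPy; it assumes at least one action (for instance, rejection) is always valid, and `masked_softmax` is a name chosen for this example.

```python
import numpy as np

def masked_softmax(logits, mask):
    """Softmax over valid actions only; invalid actions get probability 0."""
    masked = np.where(mask, logits, -np.inf)  # invalid logits become -inf
    masked = masked - masked.max()            # subtract the max for numerical stability
    exp = np.exp(masked)                      # exp(-inf) evaluates to 0.0
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.5])     # scores for node 0, node 1, reject
mask = np.array([True, False, True])   # node 1 lacks resources
probs = masked_softmax(logits, mask)
print(probs)
```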

---

#### 4. **Algorithm Selection**  
##### **Proximal Policy Optimization (PPO)**  
- **Advantages**:  
  - Sample-efficient and stable for high-dimensional state spaces.  
  - Handles continuous and discrete actions via policy gradient.  
  - Supports constraint masking natively.  
- **Training Objective**:  
  $$L^{\text{CLIP}} = \mathbb{E}_t \left[ \min\left( \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} \hat{A}_t, \text{clip}\left(\frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)}, 1-\epsilon, 1+\epsilon\right) \hat{A}_t \right) \right]$$  
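
For readers who want to see the clipped objective evaluated numerically, here is a small NumPy sketch. It works on a batch of log-probabilities and advantage estimates; the function name and the default `epsilon=0.2` are assumptions for illustration.

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, epsilon=0.2):
    """Mean of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A) over the batch."""
    ratio = np.exp(logp_new - logp_old)  # pi_theta(a|s) / pi_theta_old(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    return np.mean(np.minimum(unclipped, clipped))

logp_old = np.log(np.array([0.5, 0.5]))
logp_new = np.log(np.array([0.75, 0.25]))  # ratios of 1.5 and 0.5
advantages = np.array([1.0, -1.0])
objective = ppo_clip_objective(logp_new, logp_old, advantages)
print(objective)  # approximately 0.2
```

In the example, the first ratio is clipped from 1.5 to 1.2 and the second from 0.5 to 0.8, giving per-sample terms of 1.2 and -0.8.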

##### **Alternative: Attention-based Actor-Critic**  
- **Graph Representation**:  
  Nodes as vertices, workloads as context.  
- **Multi-Head Attention**:  
  Captures node-workload affinity for placement decisions.  
- **Advantage**:  
  Generalizes to variable cluster sizes.  

##### **Hybrid Training**  
1. **Imitation Learning**:  
   - Pre-train with heuristic rules (e.g., Best-Fit, First-Fit).  
2. **RL Fine-tuning**:  
   - PPO with KL-divergence penalty to prevent policy collapse.  

---

### **System Summary**  
| **Component**       | **Design Choice**                                  |
|----------------------|---------------------------------------------------|
| **State Space**      | Normalized node resources + workload demand + time |
| **Action Space**     | Discrete placement/rejection (`N+1` actions)       |
| **Reward**           | Energy cost + rejection penalty + SLA bonus        |
| **Agent**            | PPO with masked invalid actions                    |
| **Optimization**     | Attention mechanisms for scalability              |

**Key Innovations**:  
- **Multi-resource utilization norm** in energy modeling.  
- **SLA-awareness** via deadline tracking in workload lifecycle.  
- **Action masking** ensures resource constraints are never violated.  

This design optimizes energy while maintaining workload QoS, leveraging modern RL algorithms for scalable datacenter control.

### Querying Groq’s LLM Model

- Initialize the Groq client using the appropriate API key and base URL.  
- Set the model name to `"llama-3.3-70b-versatile"`. 



In [9]:
groq = OpenAI(api_key=groq_api_key, base_url="https://api.groq.com/openai/v1")
model_name = "llama-3.3-70b-versatile"

start = time.time()
response = groq.chat.completions.create(model=model_name, messages=messages)
end = time.time()
answer = response.choices[0].message.content

elapsed = end - start
print(f"running time: {elapsed}")
times.append(elapsed)

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)

running time: 5.06576681137085


Designing a Reinforcement Learning (RL) System for Optimizing Energy in a Datacenter
====================================================================

### Step 1: Designing the Cluster Environment

The cluster environment is modeled as a Markov Decision Process (MDP), which consists of the following components:

* **Observation Space**: The observation space represents the current state of the datacenter. It includes:
	+ The current utilization of each node (CPU, Memory, Storage)
	+ The number of running workloads on each node
	+ The demanded resources (CPU, Memory, Storage) of the new arriving workload
	+ The available capacity of each node (CPU, Memory, Storage)
* **Action Space**: The action space represents the possible decisions that can be made by the agent. It includes:
	+ Selecting a node to place the new workload
	+ Rejecting the new workload (e.g., due to insufficient resources)
* **Reset Function**: The reset function is used to initialize the environment at the beginning of each episode. It includes:
	+ Initializing the nodes with their maximum capacity
	+ Setting the number of running workloads on each node to zero
* **Step Function**: The step function is used to transition the environment from one state to another. It includes:
	+ Updating the utilization of each node based on the new workload placement
	+ Updating the number of running workloads on each node
	+ Checking for any node overutilization and taking corrective actions (e.g., rejecting new workloads)

### Step 2: Designing the Reward Function

The reward function is designed to encourage energy-efficient workload placement. It includes:

* **Energy Consumption**: The energy consumption of each node is calculated based on its utilization
* **Resource Utilization**: The utilization of each resource (CPU, Memory, Storage) is calculated for each node
* **Overutilization Penalty**: A penalty is imposed when the demand for resources exceeds the capacity of a node
* **Reward Calculation**: The reward is calculated as a weighted sum of the negative energy consumption, resource utilization, and overutilization penalty

The reward function can be formulated as:

R = - (α \* Energy_Consumption + β \* Resource_Utilization + γ \* Overutilization_Penalty)

where α, β, and γ are weighting factors that can be adjusted to prioritize different objectives.

### Step 3: Designing the Agent

The agent is designed to learn an optimal policy for workload placement. It includes:

* **State Representation**: The agent receives the current state of the environment as input
* **Action Selection**: The agent selects an action based on its policy
* **Learning**: The agent updates its policy based on the reward received from the environment

### Step 4: Selecting the Best Algorithms for this Problem

Several RL algorithms can be applied to this problem, including:

* **Deep Q-Networks (DQN)**: A value-based algorithm that learns to estimate the expected return for each state-action pair
* **Policy Gradient Methods (PGMs)**: A policy-based algorithm that learns to optimize the policy directly
* **Proximal Policy Optimization (PPO)**: A model-free, on-policy algorithm that learns to optimize the policy using trust region optimization
* **Deep Deterministic Policy Gradients (DDPG)**: A model-free, off-policy algorithm that learns to optimize the policy using actor-critic methods

The choice of algorithm depends on the specific requirements of the problem, such as the size of the state and action spaces, the complexity of the reward function, and the availability of computational resources.

Example Use Case
---------------

Suppose we have a datacenter with 10 nodes, each with a maximum capacity of 100 CPU, 100 Memory, and 100 Storage. We want to optimize the energy consumption of the datacenter while ensuring that the demand for resources does not exceed the capacity of each node. We can use the designed RL system to learn an optimal policy for workload placement.

```python
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

# Define the observation space
observation_space = {
    'node_utilization': np.array([0.0, 0.0, 0.0]),  # CPU, Memory, Storage
    'num_running_workloads': 0,
    'new_workload_demands': np.array([0.0, 0.0, 0.0]),  # CPU, Memory, Storage
    'available_capacity': np.array([100.0, 100.0, 100.0])  # CPU, Memory, Storage
}

# Define the action space
action_space = {
    'node_selection': np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),  # 10 nodes
    'reject_workload': 0
}

# Define the reward function
def reward_function(state, action):
    energy_consumption = np.sum(state['node_utilization'] * state['available_capacity'])
    resource_utilization = np.sum(state['node_utilization'])
    overutilization_penalty = 0
    if np.any(state['node_utilization'] > state['available_capacity']):
        overutilization_penalty = 1
    reward = - (0.5 * energy_consumption + 0.3 * resource_utilization + 0.2 * overutilization_penalty)
    return reward

# Define the agent
class Agent(nn.Module):
    def __init__(self):
        super(Agent, self).__init__()
        self.fc1 = nn.Linear(10, 128)  # input layer (10) -> hidden layer (128)
        self.fc2 = nn.Linear(128, 10)  # hidden layer (128) -> output layer (10)

    def forward(self, x):
        x = torch.relu(self.fc1(x))  # activation function for hidden layer
        x = self.fc2(x)
        return x

# Set up the agent, loss, and optimizer (the training loop itself is omitted here)
agent = Agent()
criterion = nn.MSELoss()
optimizer = optim.Adam(agent.parameters(), lr=0.001)
```

Note that this is a simplified example and may not represent a real-world datacenter scenario. The designed RL system can be extended and modified to accommodate more complex scenarios and requirements.

In [10]:
ollama = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')
model_name = "llama3.2"

start = time.time()
response = ollama.chat.completions.create(model=model_name, messages=messages)
end = time.time()
answer = response.choices[0].message.content

elapsed = end - start
print(f"running time: {elapsed}")
times.append(elapsed)

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)

running time: 43.58457088470459


Designing an RL System for Optimizing Energy in a Datacenter
===========================================================

### Cluster Environment

The cluster environment consists of multiple nodes, each with limited capacities for CPU, Memory, and Storage. Workloads arrive one by one and are placed on a node until they leave the system.

#### Observation Space

The observation space includes the following features:

*   **Node IDs**: Unique identifier for each node in the cluster.
*   **Available Capacity**: Remaining capacity of each resource at each node (CPU, Memory, Storage).
*   **Workload ID**: Unique identifier for each incoming workload.
*   **Load Levels**: Numerical representation of the load level on each node.

#### Action Space

The action space includes three types of actions:

*   **allocateResources**: Allocate resources to a new workload, attempting not to exceed the capacity limits of the node.
*   **assignWorkload**: Redirect an already allocated workload to another node with more available resources.
*   **releaseResources**: Release excess resources back into the pool, without assigning them to any specific workload.

#### Reset Function

Upon receiving a new workload arrival (workload's timestamp), reset:

*   Initialize Node availability tracking based on workload arriving.

#### Step Function

On each time step (`t`):

1.  Check capacity constraint for allocating Resources of incoming Workload
2.  Calculate Total Power consumed based on total load levels.
3.  Decide based on the available capacity of nodes

### Reward Function design

The purpose of a reward function is to guide learning towards optimal policy.

Rewards are assigned as follows:

*   Positive rewards (`Reward Positivity`) when allocating a newly arriving workload increases a node's resource usage without exceeding its capacity.
*   Negative rewards when a newly arriving workload pushes a node's load level beyond its currently allocated load.

### Agent Design

A combination of Q-learning and value iteration can be used here, since it can learn a load-balancing policy that results in low energy costs.

### Algorithm selection:

Given the specifics of this environment, we recommend combining **Q-Learning** and **Value Iteration (VI)** to create a combined RL framework that enables our algorithm agent to efficiently explore the environment and make informed decisions.

Combined RL framework based on:

- Q-learning updates the estimated action-value function according to the TD error.
- Value Iteration improves the policy by optimizing for the value function of each state-action pair.
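
The TD-error update mentioned above can be illustrated with a tabular Q-learning step. This is only a sketch: it assumes states and actions have been discretized into integer indices, and the learning rate `alpha` and discount `gamma` are arbitrary example values.

```python
import numpy as np

def q_learning_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    """Apply one Q-learning step in place and return the TD error."""
    td_target = reward + gamma * np.max(q_table[next_state])  # bootstrap from best next action
    td_error = td_target - q_table[state, action]
    q_table[state, action] += alpha * td_error
    return td_error

q = np.zeros((4, 3))  # 4 discretized states, 3 actions (hypothetical sizes)
err = q_learning_update(q, state=0, action=1, reward=-1.0, next_state=2)
print(err, q[0, 1])
```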


### Querying Ollama Local Model

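- The `In [10]` cell above initializes the Ollama client against the local API endpoint (`http://localhost:11434/v1`); the `api_key` value is a placeholder, since the OpenAI client requires one.  
- The model name is set to `"llama3.2"`.  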


### Combining All Model Responses

- Iterate over all collected answers.  
- Concatenate them into a single string with clear section headers identifying each competitor's response.  
- This aggregated text can be used for unified display or further analysis.


In [11]:
together = ""
for index, answer in enumerate(answers):
    together += f"# Response from competitor {index+1}\n\n"
    together += answer + "\n\n"

### Preparing the Evaluation Prompt for Ranking Model Responses

- Define a prompt instructing the evaluator to rank the responses from all competitors based on clarity and argument strength.  
- Include the original task description for context.  
- Append all competitors' responses concatenated into a single text block.  
- Specify the desired output format: JSON listing competitors ranked from best to worst.  
- Emphasize that the evaluator should respond **only** with JSON—no extra text or formatting.


In [12]:
evaluation = f"""You are evaluating a competition between {len(competitors)} competitors.
Each model has been given this task:

{task_description}

Your job is to evaluate each response for clarity and strength of argument, and rank them in order of best to worst.
Respond with JSON, and only JSON, with the following format:
{{"results": ["best competitor number", "second best competitor number", "third best competitor number", ...]}}

Here are the responses from each competitor:

{together}

Now respond with the JSON with the ranked order of the competitors, nothing else. Do not include markdown formatting or code blocks."""


In [13]:
evaluation_messages = [{"role": "user", "content": evaluation}]

### Running the Evaluation with the o3-mini Model

- Initialize the OpenAI client.  
- Send the evaluation prompt (`evaluation_messages`) to the `"o3-mini"` model for ranking the competitors.  
- Extract and print the JSON-formatted ranking results returned by the model.


In [14]:
openai = OpenAI()
response = openai.chat.completions.create(
    model="o3-mini",
    messages=evaluation_messages,
)
results = response.choices[0].message.content
print(results)

{"results": ["2", "3", "4", "1", "5", "6"]}


### Parsing and Displaying the Evaluation Results

- Load the JSON string returned by the evaluator into a Python dictionary.  
- Extract the ranking list from the `"results"` key.  
- Iterate over the ranked competitor indices and print their rank along with the corresponding model name.


In [15]:
results_dict = json.loads(results)
ranks = results_dict["results"]
for index, result in enumerate(ranks):
    competitor_idx = int(result) - 1
    competitor = competitors[competitor_idx]
    runtime = times[competitor_idx]
    print(f"Rank {index+1}: {competitor} — Running time: {runtime:.2f} seconds")


Rank 1: gemini-2.0-flash — Running time: 14.08 seconds
Rank 2: deepseek-chat — Running time: 61.84 seconds
Rank 3: deepseek-reasoner — Running time: 214.11 seconds
Rank 4: gpt-4o-mini — Running time: 10.35 seconds
Rank 5: llama-3.3-70b-versatile — Running time: 5.07 seconds
Rank 6: llama3.2 — Running time: 43.58 seconds
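
The parsing step above assumes the evaluator returns bare JSON with a valid ranking. Since a model may occasionally wrap its answer in extra text, a more defensive parser could look like the sketch below. The helper `parse_ranking` is hypothetical and not part of the notebook; it extracts the first JSON object it finds and checks that the ranking is a permutation of the competitor numbers.

```python
import json
import re

def parse_ranking(raw, n_competitors):
    """Extract the "results" list from evaluator output and validate it."""
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)  # tolerate text around the JSON
    if match is None:
        raise ValueError("no JSON object found in evaluator output")
    ranks = [int(r) for r in json.loads(match.group(0))["results"]]
    # Every competitor number from 1..n must appear exactly once.
    if sorted(ranks) != list(range(1, n_competitors + 1)):
        raise ValueError(f"ranking is not a permutation of 1..{n_competitors}: {ranks}")
    return ranks

raw_output = 'Here is the ranking: {"results": ["2", "3", "4", "1", "5", "6"]}'
ranks = parse_ranking(raw_output, n_competitors=6)
print(ranks)  # [2, 3, 4, 1, 5, 6]
```

Validating the permutation guards against an out-of-range index when looking up `competitors[int(result) - 1]`.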
