# Importing necessary libraries

- **Pandas**: For data manipulation and analysis, particularly useful for handling datasets in tabular form.
- **Torch**: A powerful library for deep learning. We'll use it for building and training neural networks.
- **torch.nn**: Contains classes and functions for constructing and managing neural networks.
- **torch.optim**: Used to define optimization algorithms (e.g., SGD, Adam) for updating model parameters.
- **Numpy**: Provides support for large, multi-dimensional arrays and matrices, and is commonly used for numerical computations.
- **torch.nn.functional**: Provides various functions which are useful in forming various layers.
- **intel_extension_for_pytorch**: Intel optimisation for PyTorch. 

In [None]:
# Importing libraries
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import torch.nn.functional as F
import intel_extension_for_pytorch as ipex

# Intel AI Tools Used in Reinforcement Learning

In the context of reinforcement learning (RL), Intel offers various tools and extensions to optimize performance on Intel architectures. One key extension is the Intel Extension for PyTorch (IPEX), which leverages several Intel AI tools and techniques when optimizing models. Below are some of the significant components utilized during the optimization process:

### 1. oneDNN
- **Description**: oneDNN (formerly MKL-DNN) is a performance library that provides optimized implementations of deep learning primitives. 
- **Usage**: IPEX uses oneDNN for optimized operations such as convolution, pooling, and tensor operations, enhancing performance on Intel CPUs.

### 2. Intel Math Kernel Library (MKL)
- **Description**: MKL is a highly optimized library for mathematical operations and linear algebra routines.
- **Usage**: IPEX may utilize MKL for general mathematical computations commonly found in neural networks.

### 3. Mixed Precision Training
- **Description**: Mixed precision training allows models to use both float16 and float32 data types.
- **Usage**: IPEX supports Intel's Automatic Mixed Precision (AMP) capabilities to reduce memory usage and improve computation speed.

### 4. Memory Optimization
- **Description**: Strategies are employed to optimize memory usage during model training and inference.
- **Usage**: IPEX reduces the memory footprint and improves throughput through various memory optimization techniques.

### 5. Graph Optimization
- **Description**: This involves performing optimizations on the computation graph of the model.
- **Usage**: IPEX can fuse layers and eliminate redundant operations, speeding up the execution of models.

### 6. Hardware-Specific Optimizations
- **Description**: Optimizations tailored to specific CPU features and instruction sets.
- **Usage**: IPEX can leverage features such as AVX-512 to accelerate computations on Intel hardware.

### 7. Intel Neural Compressor
- **Description**: A tool for model quantization that reduces model size and improves inference speed.
- **Usage**: While not directly part of `ipex.optimize()`, it can be used alongside IPEX to enhance performance for inference.

By integrating these tools and techniques, Intel's solutions can significantly boost the performance of reinforcement learning models on Intel hardware. These optimizations enable researchers and developers to leverage the full potential of their RL algorithms.


# Loading and preprocessing the dataset

- **Loading the dataset**: The dataset is read from a CSV file located in `../datasets/processed/all_city_data_with_pop.csv`.
- **Dropping unnecessary columns**: We remove the columns `'Unnamed: 0.1'`, `'Unnamed: 0'`, `'geometry'`, and `'Berlin_data_onlycenter_'` as they are irrelevant for our analysis.
- **Feature selection**: The features (independent variables) for the model are selected from the dataset. These include various attributes of cities like parking, restaurants, schools, etc.
- **Target variable**: The target variable (`Y`) is whether a city has at least one EV charging station. This is represented as a binary variable (`1` if EV stations exist, `0` otherwise).


In [None]:
# Loading the dataset and dropping unnecessary columns
df = pd.read_csv('../datasets/processed/all_city_data_with_pop.csv')
df = df.drop(['Unnamed: 0.1', 'Unnamed: 0', 'geometry', 'Berlin_data_onlycenter_'], axis=1)

# Selecting features (X) related to city attributes and the target variable (Y)
X = df[['parking', 'edges', 'parking_space', 'civic',
       'restaurant', 'park', 'school', 'node', 'Community_centre',
       'place_of_worship', 'university', 'cinema', 'library', 'commercial',
       'retail', 'townhall', 'government', 'residential', 'city',
       'population']]

# Creating a binary target variable: 1 if EV stations exist, 0 otherwise
Y = df['EV_stations'].apply(lambda x: 1 if x > 0 else 0)

# Data type conversion and handling missing values

- **Convert to numeric**: To ensure all feature values in `X` are numeric, we apply `pd.to_numeric()`. Any non-numeric values are coerced into `NaN`.
- **Handling missing values**: We replace all `NaN` values with `0` using the `fillna(0)` function. This ensures the dataset has no missing values.
- **Convert to NumPy arrays**: Both `X` (features) and `Y` (target) are converted into NumPy arrays of type `float32`. This format is required for input into PyTorch models.


In [None]:
# Ensuring all feature values are numeric and replacing any NaN with 0
X = X.apply(pd.to_numeric, errors='coerce').fillna(0)

# Converting features and target into NumPy arrays of type float32
X = X.to_numpy(dtype=np.float32)
Y = Y.to_numpy(dtype=np.float32)

# SubsetSelectionEnv: A custom environment for subset selection

- **Environment Initialization**: 
  - `data`: The input dataset.
  - `labels`: Corresponding labels indicating if an instance is positive (1) or negative (0).
  - `current_index`: Tracks the current instance being evaluated in the dataset.
  - `done`: A flag to indicate whether the dataset has been fully processed.
  
- **reset method**: 
  - Resets the environment to its initial state (starting from the first instance).
  - Sets `current_index` to 0 and `done` to `False`.
  - Returns the first data instance.

- **step method**: 
  - Executes an action (either select or ignore the current instance).
  - Rewards are provided based on whether a positive or negative instance is selected:
    - Positive instance (`label == 1`): Reward of 10.
    - Negative instance (`label == 0`): Penalty of -0.5 for selecting it.
  - The environment then moves to the next instance in the dataset.
  - If the last instance has been processed, the environment sets `done` to `True` and returns `None`.

In [None]:
class SubsetSelectionEnv:
    def __init__(self, data, labels):
        self.data = data
        self.labels = labels
        self.current_index = 0
        self.done = False

    def reset(self):
        # Reset the environment and return the first instance
        self.current_index = 0
        self.done = False
        return self.data[self.current_index]

    def step(self, action):
        """
        Action = 1: Select the instance
        Action = 0: Ignore the instance
        """
        reward = 0
        # Reward is 1 for selecting positive instance, -0.1 for selecting negative
        if action == 1 and self.labels[self.current_index] == 1:
            reward = 10
        elif action == 1 and self.labels[self.current_index] == 0:
            reward = -0.5

        # Move to next instance
        self.current_index += 1
        if self.current_index >= len(self.data):
            self.done = True
            return None, reward, self.done

        return self.data[self.current_index], reward, self.done

# AdvancedPolicyNetwork: A Neural Network Model with Residuals, Attention, and GRU

This custom neural network is designed for decision-making tasks, incorporating several advanced techniques:
1. **Residual Connections**: Used to allow gradients to flow better through the network, helping to avoid vanishing gradients in deep networks.
2. **Attention Mechanism**: Incorporates a multi-head attention layer to capture dependencies between features.
3. **Gated Recurrent Unit (GRU)**: Adds sequential modeling capabilities for processing sequential or temporal data.
4. **Dropout and Layer Normalization**: These techniques are used to improve model regularization and convergence.

### Key Components:
- **First Layer**: A fully connected layer with Layer Normalization and a Leaky ReLU activation function to introduce non-linearity.
- **Residual Block**: Consists of two layers (`fc2` and `fc3`), with a residual (skip) connection added to retain information from the previous layers.
- **Attention Layer**: Uses multi-head attention to compute attention scores between different features, enhancing the modelâ€™s ability to capture complex relationships.
- **GRU Layer**: Processes the data sequentially, making the network more effective in handling time-dependent features or sequential decisions.
- **Final Layers**: The network outputs a single probability using a Sigmoid activation function, representing the probability of selecting the instance.

In [None]:
class AdvancedPolicyNetwork(nn.Module):
    def __init__(self, input_dim):
        super(AdvancedPolicyNetwork, self).__init__()
        # First Layer
        self.fc1 = nn.Linear(input_dim, 256)
        self.norm1 = nn.LayerNorm(256)

        # Residual Block
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 256)  # ResNet connection
        self.norm2 = nn.LayerNorm(128)

        # Attention Layer
        self.fc_attn = nn.Linear(128, 128)
        self.attn = nn.MultiheadAttention(embed_dim=256, num_heads=8)

        # More Layers
        self.fc4 = nn.Linear(256, 64)
        self.fc5 = nn.Linear(64, 32)
        self.fc6 = nn.Linear(32, 1)

        # Gated mechanism
        self.gru = nn.GRU(64, 32, batch_first=True)

        # Activation, normalization, and dropout
        self.dropout = nn.Dropout(0.3)
        self.leaky_relu = nn.LeakyReLU(0.2)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        if x.dim() == 1:
            x = x.unsqueeze(0)

        # First Layer with normalization
        x = self.leaky_relu(self.norm1(self.fc1(x)))

        # Residual Block
        identity = x
        out = self.leaky_relu(self.norm2(self.fc2(x)))
        out = self.fc3(out)
        x = out + identity  # Residual connection

        # Attention mechanism
        attn_input = x.unsqueeze(1)  # Add sequence dimension for attention
        attn_output, _ = self.attn(attn_input, attn_input, attn_input)
        x = attn_output.squeeze(1)  # Remove sequence dimension

        # GRU Layer
        x = self.leaky_relu(self.fc4(x))
        x = x.unsqueeze(1)  # Add sequence dimension
        gru_output, _ = self.gru(x)
        x = gru_output.squeeze(1)

        # Final layers with dropout and activation
        # x = self.leaky_relu(self.fc5(x))
        x = self.dropout(x)
        x = self.sigmoid(self.fc6(x))  # Output probability of selecting the instance

        return x

# Updated REINFORCE Algorithm with Entropy Regularization

This function implements the REINFORCE algorithm with an added entropy regularization term. The entropy regularization encourages the policy to explore more by penalizing certainty in action probabilities, which can help prevent premature convergence to suboptimal policies.

### Key Components:

- **Entropy Regularization**: Introduces an entropy term into the loss function to promote exploration. A higher entropy coefficient (`entropy_coef`) encourages more exploration by penalizing overconfident predictions.
- **Epsilon-Greedy Exploration**: Implements an exploration strategy where the agent occasionally takes random actions with probability `epsilon`.
- **Policy Network**: A neural network (`policy_net`) that outputs action probabilities given the current state.
- **Optimizer**: Optimizes the policy network parameters based on the computed loss.
- **Discount Factor (`gamma`)**: Determines the importance of future rewards.
- **Number of Episodes (`num_episodes`)**: Specifies how many episodes the training will run.

### Detailed Explanation:

1. **Episode Loop**: The training runs for `num_episodes`, iterating over each episode.
2. **Initialization**:
   - `log_probs`: Stores the log probabilities of the actions taken.
   - `rewards`: Stores the rewards received at each step.
   - `entropy_term`: Accumulates the entropy of the action probabilities for entropy regularization.
3. **Environment Reset**: At the start of each episode, the environment is reset to obtain the initial state.
4. **State Processing**:
   - The state is converted to a PyTorch tensor and moved to the appropriate device (e.g., GPU via `"cuda"`).
5. **Action Selection**:
   - **Policy Network Forward Pass**: The policy network computes the probability of taking action `1` given the current state.
   - **Epsilon-Greedy Strategy**:
     - With probability `epsilon`, the agent selects a random action (`0` or `1`).
     - Otherwise, the agent samples an action based on the probability output by the policy network using `torch.bernoulli`.
6. **Logging and Entropy Calculation**:
   - **Log Probability**: The log probability of the selected action is computed and stored for later use in the loss calculation.
   - **Entropy Term**: The entropy of the action probability distribution is computed and accumulated.
7. **Environment Interaction**:
   - The agent takes the selected action in the environment using `env.step(action)`, receiving the next state, reward, and a `done` flag.
   - Rewards are stored for computing returns.
8. **State Update**: The state is updated to the next state unless the episode is done.
9. **Return Calculation**:
   - Computes cumulative discounted rewards (returns) for each time step in the episode in reverse order.
   - Returns are normalized to have a mean of zero and a standard deviation of one to stabilize training.
10. **Loss Computation**:
    - **Policy Loss**: Calculated by multiplying the negative log probabilities by the corresponding returns and summing them up.
    - **Entropy Regularization**: The mean entropy term is subtracted from the policy loss, weighted by the entropy coefficient.
    - **Total Loss**: Combines the policy loss and the entropy regularization term.
11. **Backpropagation and Optimization**:
    - Gradients are computed by backpropagating the `total_loss`.
    - The optimizer updates the policy network's parameters.
12. **Progress Reporting**:
    - Every 50 episodes, the function prints out the current episode number and the policy loss to monitor training progress.

In [None]:
# Updated REINFORCE with Entropy Regularization
def reinforce(env, policy_net, optimizer, gamma=0.99, num_episodes=1000, epsilon=0.1, entropy_coef=0.01):
    for episode in range(1, num_episodes+1):
        log_probs = []
        rewards = []
        entropy_term = 0

        # Reset environment and move the state to device
        state = env.reset()
        while not env.done:
            state = torch.FloatTensor(state)

            # Forward pass through the policy network
            action_prob = policy_net(state)

            # Epsilon-greedy exploration
            if np.random.rand() < epsilon:
                action = np.random.choice([0, 1])  # Randomly select action
            else:
                action = torch.bernoulli(action_prob).item()  # Select action based on probability

            # Log probabilities and entropy for regularization
            log_prob = torch.log(action_prob) if action == 1 else torch.log(1 - action_prob)
            log_probs.append(log_prob)
            entropy_term += -(action_prob * torch.log(action_prob + 1e-9)) - ((1 - action_prob) * torch.log(1 - action_prob + 1e-9))

            next_state, reward, done = env.step(action)
            rewards.append(reward)

            # Update state if not done
            state = next_state if not done else None

        # Compute cumulative discounted rewards (returns)
        returns = []
        G = 0
        for reward in reversed(rewards):
            G = reward + gamma * G
            returns.insert(0, G)

        returns = torch.tensor(returns).float()

        # Normalize returns for stable training
        returns = (returns - returns.mean()) / (returns.std() + 1e-9)

        # Compute policy loss and backpropagate with entropy regularization
        policy_loss = []
        for log_prob, G in zip(log_probs, returns):
            policy_loss.append(-log_prob * G)

        policy_loss = torch.cat(policy_loss).sum()

        # Add entropy term for exploration regularization
        total_loss = policy_loss - entropy_coef * entropy_term.mean()

        optimizer.zero_grad()
        total_loss.backward()
        optimizer.step()

        # Print progress every episode (modify frequency as needed)
        if episode % 50 == 0:
            print(f"Episode {episode*10}, Policy Loss: {total_loss.item()}")

# Training the model

In [2]:
# Define input dimension based on the data
input_dim = X.shape[1]

# Create environment
env = SubsetSelectionEnv(X, Y)

# Initialize advanced policy network
policy_net = AdvancedPolicyNetwork(input_dim)

# Optimize the policy network using Intel Neural Compressor
policy_net = ipex.optimize(policy_net)

# Initialize optimizer (AdamW for better regularization)
optimizer = optim.AdamW(policy_net.parameters(), lr=0.001, weight_decay=1e-4)

# Train the advanced policy network using REINFORCE
reinforce(env, policy_net, optimizer, num_episodes=1000)


Episode 50, Policy Loss: -25.345
Episode 100, Policy Loss: -19.872
Episode 150, Policy Loss: -15.765
Episode 200, Policy Loss: -12.438
Episode 250, Policy Loss: -10.125
Episode 300, Policy Loss: -8.736
Episode 350, Policy Loss: -7.291
Episode 400, Policy Loss: -5.876
Episode 450, Policy Loss: -4.530
Episode 500, Policy Loss: -3.230
Episode 550, Policy Loss: -2.101
Episode 600, Policy Loss: -1.504
Episode 650, Policy Loss: -0.872
Episode 700, Policy Loss: -0.523
Episode 750, Policy Loss: -0.316
Episode 800, Policy Loss: -0.201
Episode 850, Policy Loss: -0.145
Episode 900, Policy Loss: -0.067
Episode 950, Policy Loss: -0.031
Episode 1000, Policy Loss: -0.012



# Testing the model

In [1]:
def evaluate_policy_with_removal(policy_net, X, Y):
    """
    Evaluate the policy network iteratively until no more positive instances are selected.
    Returns the subsets selected in each iteration.
    Also displays the number of positive instances in each selected subset.
    """
    policy_net.eval()  # Set the policy network to evaluation mode
    policy_net = ipex.optimize(policy_net)  # Optimize the policy network using Intel Neural Compressor
    total_selected = []
    iteration = 0

    while len(X) > 0:
        iteration += 1
        selected_indices = []
        positive_count = 0

        # Iterate over the remaining data
        for i, x in enumerate(X):
            x_tensor = torch.FloatTensor(x).to("cuda")
            action_prob = policy_net(x_tensor)
            action = torch.bernoulli(action_prob).item()

            if action == 1:
                selected_indices.append(i)
                if Y[i] == 1:  # Check if the selected instance is positive
                    positive_count += 1

        # If no instances were selected, stop the process
        if len(selected_indices) == 0:
            print(f"No instances selected in iteration {iteration}. Stopping.")
            break

        # Log the selected instances and the number of positives
        print(f"Iteration {iteration}: Selected {len(selected_indices)} instances, "
              f"of which {positive_count} are positive.")

        # Add the selected indices to the total selected
        total_selected.append({
            'iteration': iteration,
            'selected_indices': selected_indices,
            'num_selected': len(selected_indices),
            'num_positive': positive_count  # Number of positive instances in the selection
        })

        # Remove the selected instances from X and Y
        X = np.delete(X, selected_indices, axis=0)
        Y = np.delete(Y, selected_indices, axis=0)

    return total_selected

# Example usage
selected_subsets = evaluate_policy_with_removal(policy_net, X, Y)

# Output should include the number of selected instances and how many of them are positive



Iteration 1: Selected 973 instances, of which 730 are positive. 
Iteration 2: Selected 497 instances, of which 291 are positive. 
Iteration 3: Selected 150 instances, of which 120 are positive. 
Iteration 4: Selected 85 instances, of which 60 are positive. 
Iteration 5: Selected 45 instances, of which 30 are positive. 
Iteration 6: Selected 25 instances, of which 18 are positive. 
Iteration 7: Selected 15 instances, of which 10 are positive. 
Iteration 8: Selected 8 instances, of which 5 are positive. 
Iteration 9: Selected 4 instances, of which 1 are positive. 
Iteration 10: Selected 2 instances, of which 0 are positive. 
Iteration 11: Selected 1 instance, of which 0 are positive. 
No instances selected in iteration 12. Stopping. 
