Size-extensive neural net

This neural net takes atom embeddings for molecules of different sizes and predicts each atom's contribution to a size-extensive property. The per-atom contributions are summed to give the prediction for the whole molecule.
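
To make the size-extensive idea concrete, here is a minimal NumPy sketch (not the notebook's PyTorch model). The same small network is applied to every atom and the outputs are summed, so duplicating a molecule's atoms doubles the prediction and reordering the atoms leaves it unchanged. The sizes, the random weights, and the function name `atomwise_sum` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes and random weights, for illustration only
n_features, n_hidden = 8, 16
W1 = rng.normal(size=(n_features, n_hidden))
b1 = rng.normal(size=n_hidden)
W2 = rng.normal(size=(n_hidden, 1))
b2 = rng.normal(size=1)

def atomwise_sum(atom_embs):
    """Apply the same two-layer network to each atom and sum the outputs."""
    hidden = np.maximum(atom_embs @ W1 + b1, 0.0)  # ReLU hidden layer
    per_atom = hidden @ W2 + b2                    # shape (n_atoms, 1)
    return per_atom.sum(axis=0)                    # shape (1,)

small = rng.normal(size=(3, n_features))   # a 3-atom "molecule"
doubled = np.vstack([small, small])        # two non-interacting copies

pred_small = atomwise_sum(small)
pred_doubled = atomwise_sum(doubled)
pred_shuffled = atomwise_sum(small[::-1])  # same atoms, reversed order

# Size extensivity: twice the atoms gives twice the prediction
print(np.allclose(pred_doubled, 2 * pred_small))
# Order independence: the sum does not depend on atom ordering
print(np.allclose(pred_shuffled, pred_small))
```

Because each atom carries its own bias term through the network, the doubling property holds exactly, which would not be the case if a single bias were added after the sum.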


In [1]:
'''
This script defines a simple architecture for a size-extensive neural network 
that takes latent-space embeddings from a pre-trained model as input and is 
trained on them to make predictions. 
The model is designed to predict a scalar quantity by summing up contributions 
from individual atom embeddings (atomwise approach). The architecture is useful 
for transfer learning tasks, where latent space embeddings are derived from a 
pre-trained model, and predictions are made in a size-extensive manner.

Key components:
1. **atomwise_nn** class: Defines the architecture of a size-extensive neural 
   network with two fully connected layers. Each atom embedding is processed 
   individually, and the outputs are summed to form the final prediction.
2. **Forward pass**: Processes each atom's embedding through the network and 
   sums the outputs to generate a final prediction for the molecular property 
   (size-extensive quantity).

The architecture is well-suited for tasks that involve fine-tuning latent space 
embeddings derived from pre-trained models for specific predictions (e.g., molecular 
properties).

'''

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import torch.nn.functional as F
import pandas as pd

np.random.seed(44)  # Setting a random seed for reproducibility

# ATOMWISE size extensive neural network class
class atomwise_nn(nn.Module):
    '''
    Size-extensive model architecture trained in the activation-patching
    transfer-learning setup.

    Args:
        INPUT_SIZE
            - The number of dimensions in the input feature space
        HIDDEN_SIZE
            - The number of units in the hidden layer
        OUTPUT_SIZE
            - The number of dimensions in the output feature space 
              (1 for scalar output, like a property prediction)

    Returns:
        sizeext_quantity
            - The final predicted quantity, representing the 
              size-extensive molecular property.
    '''
    def __init__(self, INPUT_SIZE, HIDDEN_SIZE, OUTPUT_SIZE):
        super(atomwise_nn, self).__init__()
        self.INPUT_SIZE = INPUT_SIZE
        self.HIDDEN_SIZE = HIDDEN_SIZE
        self.OUTPUT_SIZE = OUTPUT_SIZE

        # Define two fully connected layers
        self.fc1 = nn.Linear(INPUT_SIZE, HIDDEN_SIZE).double()  # Input to hidden layer
        self.fc2 = nn.Linear(HIDDEN_SIZE, OUTPUT_SIZE).double()  # Hidden to output layer

    def forward(self, x):
        # x has shape (n_atoms, INPUT_SIZE): one row per atom embedding.
        # nn.Linear acts on the last dimension, so every atom is processed
        # independently by the same two layers.
        per_atom = self.fc2(F.relu(self.fc1(x)))  # shape (n_atoms, OUTPUT_SIZE)

        # Sum the per-atom contributions to obtain the size-extensive quantity
        sizeext_quantity = per_atom.sum(dim=0)  # shape (OUTPUT_SIZE,)

        return sizeext_quantity  # Return the final predicted quantity


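
The training cell that follows picks out one molecule's atoms by matching a molecule index stored in the column directly after the embedding features. A minimal NumPy sketch of that layout, using a tiny made-up table (3 features instead of 128; every value and the helper name `atoms_of` are invented for illustration):

```python
import numpy as np

n_features = 3  # stands in for the 128 embedding features

# One row per atom: n_features embedding values, then the molecule index
atom_table = np.array([
    [0.1, 0.2, 0.3, 0],   # molecule 0, first atom
    [0.4, 0.5, 0.6, 0],   # molecule 0, second atom
    [0.7, 0.8, 0.9, 1],   # molecule 1, first atom
    [1.0, 1.1, 1.2, 1],   # molecule 1, second atom
    [1.3, 1.4, 1.5, 1],   # molecule 1, third atom
])

def atoms_of(table, molecule_id):
    """Return the (n_atoms, n_features) embedding block of one molecule."""
    mask = table[:, n_features] == molecule_id  # rows belonging to the molecule
    return table[mask, :n_features]             # drop the index column

mol0 = atoms_of(atom_table, 0)
mol1 = atoms_of(atom_table, 1)
print(mol0.shape, mol1.shape)
```

Molecules with different atom counts yield blocks with different numbers of rows, which is why the model sums over the atom dimension rather than expecting a fixed input length.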


In [2]:
'''
This script implements the training process for a size-extensive neural network, 
specifically the `atomwise_nn` class defined earlier. It uses latent space embeddings 
from a pre-trained model and fine-tunes them to predict a scalar molecular property 
(e.g., energy or another physical property). The training process reads embeddings, 
normalizes them, and trains the model using these embeddings as inputs and molecular 
properties as targets.

Key components:
1. **Hyperparameter definitions**: Defines key parameters like input size, hidden size, 
   learning rate, and the number of epochs.
2. **Loading data**: Reads latent space embeddings and target molecular properties from 
   CSV files.
3. **Training loop**: Iteratively trains the model using backpropagation, computing the 
   loss for both training and validation data, and updating the model weights.
4. **Validation**: After each epoch, the validation loss is calculated to assess the 
   model's performance on unseen data.

The script is designed to train on embeddings and molecular property data, and the 
trained model can later be used to predict scalar properties for new inputs.
'''

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import torch.nn.functional as F
import pandas as pd

# Set random seeds for reproducibility (NumPy and PyTorch weight initialization)
np.random.seed(44)
torch.manual_seed(44)

'''
Training atomwise_nn

This section defines the architecture and hyperparameters for the 
size-extensive neural network (`atomwise_nn`) and loads the embeddings 
(latent space) and molecular properties from CSV files.

    INPUT_SIZE
        - Number of dimensions in input latent space (embeddings)
    HIDDEN_SIZE
        - Number of units in the hidden layer of the model
    OUTPUT_SIZE
        - Number of output dimensions of the final molecular property
    LEARNING_RATE
        - Learning rate for the optimizer
    NUM_EPOCHS
        - Number of epochs to run during training
    NUM_TRAIN_SAMPLES
        - Number of training molecules
    NUM_VAL_SAMPLES
        - Number of validation molecules
    EMBS_PATH
        - Filepath where latent vectors (embeddings) are stored
    MOL_PROPERTY_PATH
        - Filepath where molecular property values are stored
'''

# Define hyperparameters
INPUT_SIZE = 128
HIDDEN_SIZE = 200
OUTPUT_SIZE = 1
LEARNING_RATE = 0.001
NUM_EPOCHS = 10000
NUM_TRAIN_SAMPLES = 450
NUM_VAL_SAMPLES = 100

EMBS_PATH = '../data/datasets/embsMP/embslayer5.csv'  # Embeddings file
N_EMBS_FEATURES = 128  # Number of embedding features
MOL_PROPERTY_PATH = '../data/datasets/embsMP/mps.csv'  # Molecular property file

# Initialize the model
model = atomwise_nn(INPUT_SIZE, HIDDEN_SIZE, OUTPUT_SIZE)

# Define the loss function (Mean Squared Error) and the optimizer (Adam)
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)

# Load the latent space embeddings and molecular property data
embs = pd.read_csv(EMBS_PATH)
mps_true = pd.read_csv(MOL_PROPERTY_PATH)

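# Normalize the embeddings. A freshly created BatchNorm1d in training mode
# standardizes each feature column with the mean and variance of the rows it
# is given, here the whole file (training and validation molecules together).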
embs_features = embs.iloc[:, 0:N_EMBS_FEATURES].values
normalize_embs_128 = nn.BatchNorm1d(N_EMBS_FEATURES).double()
embs_featuresnorm = normalize_embs_128(torch.tensor(embs_features))

# Combine normalized embeddings with the rest of the data
embs_norm = np.hstack((embs_featuresnorm.detach().numpy(), embs.iloc[:, N_EMBS_FEATURES:].values))

print(np.shape(embs_norm))  # Check the shape of the combined embedding data

'''
Run Training Loop

The training loop runs for a defined number of epochs (`NUM_EPOCHS`). 
In each epoch, it performs the following steps:
1. **Training**: 
   - For each molecule, it computes the loss and updates model weights using backpropagation.
2. **Validation**: 
   - Evaluates the model on a validation set at the end of each epoch.

For each epoch, the average training and validation losses are printed.
'''

# Loss histories, kept across all epochs
train_loss = []  # [epoch, average training loss] per epoch
val_loss = []  # [epoch, average validation loss] per epoch

# Training loop
for epoch in range(NUM_EPOCHS):
    total_train_loss = 0.0  # Accumulate training loss for the epoch
    model.train()  # Set the model to training mode
    
    # Training phase
    for each_molecule in range(NUM_TRAIN_SAMPLES):
        # Select the atom embeddings of the current molecule; the column right
        # after the embedding features holds the molecule index
        X = embs_norm[embs_norm[:, N_EMBS_FEATURES] == each_molecule]
        X = X[:, 0:N_EMBS_FEATURES]
        y = mps_true.iloc[each_molecule].values  # Target molecular property

        # Convert data to PyTorch tensors
        X_tensor = torch.tensor(X)
        y_tensor = torch.tensor(y)

        # Zero gradients, forward pass, compute loss, backward pass, and update weights
        optimizer.zero_grad()  # Zero out gradients
        output = model(X_tensor)  # Forward pass
        loss = criterion(output, y_tensor)  # Compute loss
        loss.backward()  # Backward pass
        optimizer.step()  # Update model weights

        total_train_loss += loss.item()  # Accumulate loss for this epoch

    # Validation phase
    model.eval()  # Set the model to evaluation mode
    outputs = []
    truths = []
    total_val_loss = 0.0  # Accumulate validation loss for the epoch

    with torch.no_grad():  # No gradient computation in validation
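        # Assumption: validation molecules are stored right after the training
        # molecules, so their indices start at NUM_TRAIN_SAMPLES. This keeps the
        # validation set disjoint from the training set.
        for each_molecule in range(NUM_TRAIN_SAMPLES, NUM_TRAIN_SAMPLES + NUM_VAL_SAMPLES):
            # Select the latent space embedding for the validation molecule
            X = embs_norm[embs_norm[:, N_EMBS_FEATURES] == each_molecule]
            X = X[:, 0:N_EMBS_FEATURES]
            y = mps_true.iloc[each_molecule].values  # Target molecular property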
            
            # Convert data to PyTorch tensors
            X_tensor = torch.tensor(X)
            y_tensor = torch.tensor(y)

            # Forward pass and compute validation loss
            output = model(X_tensor)
            loss = criterion(output, y_tensor)
            total_val_loss += loss.item()  # Accumulate validation loss

            outputs.append(output)
            truths.append(y)

    # Print the average loss for the training and validation phases
    average_train_loss = total_train_loss / NUM_TRAIN_SAMPLES
    print(f"Epoch {epoch+1}/{NUM_EPOCHS}, Train Loss: {average_train_loss:.4f}")

    average_val_loss = total_val_loss / NUM_VAL_SAMPLES
    print(f"Epoch {epoch+1}/{NUM_EPOCHS}, Val Loss: {average_val_loss:.4f}")
    
    # Append the training and validation losses to the respective lists
    train_loss.append([epoch, average_train_loss])
    val_loss.append([epoch, average_val_loss])
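
The normalization step in the training cell passes all embeddings through a freshly created `BatchNorm1d`. In training mode with its default initialization, that layer standardizes each feature column using the batch mean and the biased variance. Below is a hedged NumPy sketch of the same arithmetic, fitted on training rows only so that validation rows do not influence the statistics. The function names, the `eps` value (chosen to match PyTorch's default of 1e-5), and the random data are assumptions for illustration, not part of the notebook.

```python
import numpy as np

def fit_standardizer(train_features, eps=1e-5):
    """Compute per-column mean and scale from the training rows only."""
    mean = train_features.mean(axis=0)
    var = train_features.var(axis=0)  # biased variance (ddof=0)
    return mean, np.sqrt(var + eps)

def standardize(features, mean, scale):
    """Apply previously fitted statistics to any set of rows."""
    return (features - mean) / scale

rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=3.0, size=(200, 4))  # made-up training rows
val = rng.normal(loc=5.0, scale=3.0, size=(50, 4))     # made-up validation rows

mean, scale = fit_standardizer(train)
train_norm = standardize(train, mean, scale)
val_norm = standardize(val, mean, scale)  # reuses the training statistics

print(train_norm.mean(axis=0).round(6))  # close to zero in every column
print(train_norm.std(axis=0).round(3))   # close to one in every column
```

Fitting on the training rows and reusing the statistics for validation is one common design choice; the notebook itself normalizes the full file in a single pass.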
