<a href="https://colab.research.google.com/github/sp8rks/MaterialsInformatics/blob/main/worked_examples/hyperparameter_opt/materials_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Neural Network Classification

In this notebook we will use a neural network to classify materials data retrieved from the Materials Project through MPRester. We will build the network with PyTorch.

## Setup

In [45]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import pandas as pd
from pymatgen.ext.matproj import MPRester
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
import random

In [46]:
# Set up MPRester
filename = r'G:\My Drive\teaching\5540-6640 Materials Informatics\old_apikey.txt'

def get_file_contents(filename):
    try:
        with open(filename, 'r') as f:
            return f.read().strip()
    except FileNotFoundError:
        print("'%s' file not found" % filename)

Sparks_API = get_file_contents(filename)
mpr = MPRester(Sparks_API)



The data we will be using are three-element compounds that contain oxygen and at least one of Li, Na, or K, restricted to nonzero band gaps. We will use the band gap, density, formation energy per atom, and volume to predict whether each compound is stable, where a compound is labeled stable if its energy above the hull is below 0.1 eV/atom. After collecting the data, we standardize each feature to zero mean and unit variance so that features on very different scales contribute comparably during training. Lastly, we create a train/test split to measure the model's accuracy after training.

In [47]:
criteria = {"band_gap": {"$gt": 0}, 'nelements':3, 'elements':{"$in":["Li", "Na", "K"], "$all": ["O"]}}
props = ['band_gap', 'density', 'formation_energy_per_atom', 'volume', 'e_above_hull']
entries = mpr.query(criteria=criteria, properties=props)

df = pd.DataFrame(entries)
# Label a compound as stable (1) if its energy above the hull is below 0.1 eV/atom, else 0
df['stable_structure'] = df['e_above_hull'].apply(lambda x: 1 if x < 0.1 else 0)

# Define features and target variable (each feature listed once)
TargetVariable = 'stable_structure'
Predictors = ['density', 'formation_energy_per_atom', 'volume', 'band_gap']

X = df[Predictors].values
y = df[TargetVariable].values

# Standardize the features
scaler = StandardScaler()
X = scaler.fit_transform(X)

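# Map target labels to consecutive integers starting from 0.
# Sorting keeps the meaning of the labels fixed (0 stays 0, 1 stays 1)
# instead of depending on which label happens to appear first in the data.
unique_labels = sorted(pd.unique(y))
label_mapping = {old_label: new_label for new_label, old_label in enumerate(unique_labels)}
y = pd.Series(y).map(label_mapping).values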

# Convert to PyTorch tensors
X = torch.tensor(X, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.long)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)



100%|██████████| 2513/2513 [00:01<00:00, 1970.94it/s]
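
One caveat about the cell above: the scaler is fit on all rows before the split, so the test rows contribute to the mean and standard deviation used for scaling. A minimal sketch of the leak-free ordering follows, assuming synthetic data in place of the Materials Project query. The names `X_demo`, `y_demo`, and `scaler_demo` are illustrative and do not appear in the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and labels (not Materials Project data)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 4))
y_demo = rng.integers(0, 2, size=200)

# Split first so the test rows never influence the scaler
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)  # learn mean and std from training rows only
X_te_scaled = scaler_demo.transform(X_te)      # reuse those statistics on the test rows

# Training features are now centered with unit variance; test features are only approximately so
print(X_tr_scaled.mean(axis=0).round(3))
print(X_tr_scaled.std(axis=0).round(3))
```

The scaled arrays can then be converted to tensors with `torch.tensor(..., dtype=torch.float32)` exactly as in the cell above.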


## Construct the Model

We will be using a PyTorch neural network. The batch size controls the number of samples that are passed through the network at one time. Using mini-batches speeds up training and gives more stable gradient estimates than updating on single samples. The DataLoader creates the mini-batches from the dataset. It's important to specify how many classes there are, because that number must match the number of output units in the model's final layer.
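
To make the batching concrete, here is a small sketch of how a DataLoader slices a dataset into mini-batches. It assumes a toy dataset of 10 samples with 4 features; `X_demo`, `y_demo`, and `loader_demo` are illustrative names and not part of the notebook.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy data: 10 samples, 4 features, binary labels (illustration only)
X_demo = torch.arange(40, dtype=torch.float32).reshape(10, 4)
y_demo = torch.tensor([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# shuffle=False keeps the order fixed so the batch shapes are easy to follow
loader_demo = DataLoader(TensorDataset(X_demo, y_demo), batch_size=4, shuffle=False)

batch_shapes = []
for features, labels in loader_demo:
    batch_shapes.append((tuple(features.shape), tuple(labels.shape)))
    print(features.shape, labels.shape)
```

Because 10 is not divisible by 4, the last batch holds only 2 samples. Passing `drop_last=True` to the DataLoader would discard that partial batch.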

In [48]:
# Create DataLoader for training and testing
batch_size = 128
train_dataset = TensorDataset(X_train, y_train)
test_dataset = TensorDataset(X_test, y_test)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

# Ensure the number of classes matches the range of target labels
num_classes = len(torch.unique(y))
print("Number of classes:", num_classes)

Number of classes: 2


We will define a simple neural network using PyTorch’s nn.Sequential to stack layers. The network consists of an input layer, a hidden layer with ReLU activation, and an output layer. 

The input layer is the entry point of the network. It has one "neuron" per input feature.

ReLU is a non-linear activation function commonly used in neural networks. The activation function introduces non-linearity into the network, allowing it to learn more complex patterns in the data.

The output layer produces the final predictions as one raw score (logit) per class. In a classification task these logits are converted into class probabilities with a softmax function. In PyTorch, nn.CrossEntropyLoss applies the (log-)softmax internally, so the model itself does not need a softmax layer.
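
The relationship between raw outputs, softmax, and the loss can be checked numerically. The sketch below uses hand-picked logits (assumed values, not model outputs from this notebook) and compares `nn.CrossEntropyLoss` with a manual log-softmax followed by the negative log-likelihood of the true class.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hand-picked raw scores for 3 samples and 2 classes (illustration only)
logits = torch.tensor([[2.0, -1.0], [0.5, 1.5], [-0.3, 0.3]])
targets = torch.tensor([0, 1, 0])

# Built-in loss: takes raw logits, applies log-softmax internally
loss_builtin = nn.CrossEntropyLoss()(logits, targets)

# Manual version: log-softmax, pick the log-probability of the true class, average
log_probs = F.log_softmax(logits, dim=1)
loss_manual = -log_probs[torch.arange(len(targets)), targets].mean()

# Softmax turns each row of logits into probabilities that sum to 1
probs = F.softmax(logits, dim=1)

print(loss_builtin.item(), loss_manual.item())
print(probs.sum(dim=1))
```

This is why the model in this notebook ends with a plain `nn.Linear` layer: adding a softmax layer before `nn.CrossEntropyLoss` would apply the normalization twice.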

In [49]:
# Define the neural network using Sequential
input_size = X_train.shape[1]  # Number of features
hidden_size = 128

model = nn.Sequential(
    nn.Linear(input_size, hidden_size),
    nn.ReLU(),
    nn.Linear(hidden_size, num_classes)
)

# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.1)

We will train the neural network for a number of epochs, updating the model parameters using backpropagation. Each epoch is one full pass through the entire training dataset. More epochs usually lower the training loss but increase the risk of overfitting, which produces a model that generalizes poorly to unseen data.

The forward pass makes predictions from the data, and the backward pass computes the gradients that the optimizer uses to update the weights.
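
A single training step can be taken apart to see each stage. This is a minimal sketch on random data with a smaller network than the one in this notebook; all `_demo` names and the layer sizes are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Random stand-in data: 16 samples, 4 features, 2 classes
X_demo = torch.randn(16, 4)
y_demo = torch.randint(0, 2, (16,))

model_demo = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
criterion_demo = nn.CrossEntropyLoss()
optimizer_demo = optim.Adam(model_demo.parameters(), lr=0.01)

weights_before = model_demo[0].weight.detach().clone()

optimizer_demo.zero_grad()                   # clear gradients left over from a previous step
outputs_demo = model_demo(X_demo)            # forward pass: compute predictions (logits)
loss_demo = criterion_demo(outputs_demo, y_demo)
loss_demo.backward()                         # backward pass: fill .grad for every parameter
grad_norm = model_demo[0].weight.grad.norm().item()
optimizer_demo.step()                        # optimizer uses the gradients to update the weights

weights_after = model_demo[0].weight.detach().clone()
weights_changed = not torch.equal(weights_before, weights_after)
print(f"loss={loss_demo.item():.4f}, grad norm={grad_norm:.4f}, weights changed={weights_changed}")
```

Note that `backward()` only computes gradients. The weights do not change until `optimizer.step()` is called.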

In [50]:
# Training loop
num_epochs = 50
for epoch in range(num_epochs):
    for features, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(features)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
    # Note: this reports the loss of the last mini-batch only, so the printed values fluctuate
    print(f"Epoch {epoch+1}/{num_epochs}, Loss: {loss.item()}")

Epoch 1/50, Loss: 0.4236941337585449
Epoch 2/50, Loss: 0.5728803873062134
Epoch 3/50, Loss: 0.3380995988845825
Epoch 4/50, Loss: 0.47411811351776123
Epoch 5/50, Loss: 0.3759986460208893
Epoch 6/50, Loss: 0.43363016843795776
Epoch 7/50, Loss: 0.3244531750679016
Epoch 8/50, Loss: 0.39847975969314575
Epoch 9/50, Loss: 0.34177204966545105
Epoch 10/50, Loss: 0.29518526792526245
Epoch 11/50, Loss: 0.42364615201950073
Epoch 12/50, Loss: 0.5403419733047485
Epoch 13/50, Loss: 0.3400542736053467
Epoch 14/50, Loss: 0.3173294961452484
Epoch 15/50, Loss: 0.41944023966789246
Epoch 16/50, Loss: 0.5429033637046814
Epoch 17/50, Loss: 0.4024004638195038
Epoch 18/50, Loss: 0.34837329387664795
Epoch 19/50, Loss: 0.40579235553741455
Epoch 20/50, Loss: 0.3782951235771179
Epoch 21/50, Loss: 0.4241243600845337
Epoch 22/50, Loss: 0.4534699618816376
Epoch 23/50, Loss: 0.3941074013710022
Epoch 24/50, Loss: 0.4809429943561554
Epoch 25/50, Loss: 0.3781812787055969
Epoch 26/50, Loss: 0.31916430592536926
Epoch 27/50

After training, we will test the model on the test set to evaluate its performance. The accuracy is calculated as the number of correct predictions divided by the total number of predictions.
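
As a worked example of the accuracy calculation, the sketch below uses four hand-written logit rows (assumed values, not model outputs). The predicted class is the index of the largest logit in each row.

```python
import torch

# Four samples, two classes: hand-written logits and true labels (illustration only)
logits_demo = torch.tensor([[2.0, 0.1], [0.2, 1.3], [0.9, 0.4], [0.1, 0.8]])
labels_demo = torch.tensor([0, 1, 1, 1])

predicted_demo = logits_demo.argmax(dim=1)                   # tensor([0, 1, 0, 1])
correct_demo = (predicted_demo == labels_demo).sum().item()  # 3 of 4 match
accuracy_demo = 100 * correct_demo / labels_demo.size(0)

print(f"Accuracy: {accuracy_demo}%")  # Accuracy: 75.0%
```

Here `argmax(dim=1)` returns the same indices as the second output of `torch.max(outputs, 1)` used in the testing loop.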

In [51]:
# Testing loop
correct = 0
total = 0
with torch.no_grad():
    for features, labels in test_loader:
        outputs = model(features)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print(f"Accuracy: {100 * correct / total}%")

Accuracy: 83.89662027833002%


## Hyperparameter Tuning

Much like Random Forest and Support Vector Machine models, neural networks can undergo hyperparameter tuning to improve their performance. We will use grid search and random search to look for the best hyperparameters for the NN.

Grid search takes a list of candidate values for each hyperparameter and tests every combination. This can become computationally expensive and slow, depending on how large the search space is and how long the model takes to train. Luckily this model trains quickly, so the grid search won't take long.
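
The size of a grid is the product of the number of candidate values per hyperparameter. The sketch below enumerates the search space used in this notebook with `itertools.product`, as an alternative to nested loops. The `_demo` variable names are illustrative.

```python
from itertools import product

# Same candidate values as the search space in this notebook
hidden_sizes_demo = [64, 128, 256]
learning_rates_demo = [0.001, 0.01, 0.1]
num_epochs_demo = [20, 50, 100]
batch_sizes_demo = [64, 128]

# Every combination of one value per hyperparameter
grid = list(product(hidden_sizes_demo, learning_rates_demo, num_epochs_demo, batch_sizes_demo))

print(len(grid))  # 54, i.e. 3 * 3 * 3 * 2
print(grid[0])    # (64, 0.001, 20, 64)
```

Each tuple corresponds to one full train-and-evaluate run, so adding a fourth value to any list adds a whole extra slice of runs.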

Random search tries randomly sampled combinations from a specified search space. Unlike grid search, it is not guaranteed to find the best combination in that space, but it is usually much faster and less computationally expensive because it trains far fewer models.
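
Sampling from the search space can be sketched as follows. This assumes the same candidate values as the rest of the notebook and uses a seeded `random.Random` instance so the draws are repeatable; `search_space` and `rng_demo` are illustrative names.

```python
import random

# Candidate values per hyperparameter (same values as the notebook's search space)
search_space = {
    "hidden_size": [64, 128, 256],
    "learning_rate": [0.001, 0.01, 0.1],
    "num_epochs": [20, 50, 100],
    "batch_size": [64, 128],
}

# A seeded generator makes the sampled configurations reproducible
rng_demo = random.Random(0)

# Draw 10 configurations; each hyperparameter is sampled independently
samples = [
    {name: rng_demo.choice(values) for name, values in search_space.items()}
    for _ in range(10)
]

for config in samples:
    print(config)
```

The draws are made with replacement, so the same configuration can appear more than once. Ten samples cover at most 10 of the 54 possible combinations.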

### Grid search

Set up the grid of hyperparameter values to search over.

In [52]:
# Define hyperparameter space for grid search
hidden_sizes = [64, 128, 256]
learning_rates = [0.001, 0.01, 0.1]
num_epochs_list = [20, 50, 100]
batch_sizes = [64, 128]

best_accuracy = 0
best_params = {}

Perform hyperparameter tuning with grid search.

In [53]:
for hidden_size in hidden_sizes:
    for learning_rate in learning_rates:
        for num_epochs in num_epochs_list:
            for batch_size in batch_sizes:
                # Create DataLoader for training and testing
                train_dataset = TensorDataset(X_train, y_train)
                test_dataset = TensorDataset(X_test, y_test)
                train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
                test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

                # Ensure the number of classes matches the range of target labels
                num_classes = len(torch.unique(y))

                # Define the neural network using Sequential
                input_size = X_train.shape[1]  # Number of features

                model = nn.Sequential(
                    nn.Linear(input_size, hidden_size),
                    nn.ReLU(),
                    nn.Linear(hidden_size, num_classes)
                )

                # Loss and optimizer
                criterion = nn.CrossEntropyLoss()
                optimizer = optim.Adam(model.parameters(), lr=learning_rate)

                # Training loop
                for epoch in range(num_epochs):
                    for features, labels in train_loader:
                        optimizer.zero_grad()
                        outputs = model(features)
                        loss = criterion(outputs, labels)
                        loss.backward()
                        optimizer.step()

                # Testing loop
                correct = 0
                total = 0
                with torch.no_grad():
                    for features, labels in test_loader:
                        outputs = model(features)
                        _, predicted = torch.max(outputs.data, 1)
                        total += labels.size(0)
                        correct += (predicted == labels).sum().item()

                accuracy = 100 * correct / total

                if accuracy > best_accuracy:
                    best_accuracy = accuracy
                    best_params = {
                        'hidden_size': hidden_size,
                        'learning_rate': learning_rate,
                        'num_epochs': num_epochs,
                        'batch_size': batch_size
                    }

print("Best accuracy:", best_accuracy)
print("Best parameters:", best_params)

Best accuracy: 85.48707753479125
Best parameters: {'hidden_size': 128, 'learning_rate': 0.01, 'num_epochs': 100, 'batch_size': 64}
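
The search above selects the best configuration by its accuracy on the test set, so the reported best accuracy is an optimistic estimate of performance on new data. A common remedy is to carve a validation set out of the training data, choose hyperparameters on it, and touch the test set only once at the end. The sketch below shows such a three-way split on synthetic data; the 60/20/20 proportions and all `_demo` names are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features and labels (not Materials Project data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 4))
y_demo = rng.integers(0, 2, size=500)

# First split: hold out 20% as the final test set
X_trainval, X_test_demo, y_trainval, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Second split: 25% of the remaining 80% becomes the validation set (20% of the total)
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42, stratify=y_trainval
)

print(len(X_train_demo), len(X_val_demo), len(X_test_demo))  # 300 100 100
```

With this split, the tuning loop would compare configurations on the validation accuracy, and the test accuracy would be reported only for the winning configuration.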


### Random Search

Set up the random sample search space. This is the same space as grid search but it will randomly sample points rather than trying every single combination. 

In [54]:
# Define hyperparameter space
hidden_sizes = [64, 128, 256]
learning_rates = [0.001, 0.01, 0.1]
num_epochs_list = [20, 50, 100]
batch_sizes = [64, 128]

# Number of random samples to try
num_samples = 10

best_accuracy = 0
best_params = {}

Set up the random search optimization loop.

In [55]:
for _ in range(num_samples):
    hidden_size = random.choice(hidden_sizes)
    learning_rate = random.choice(learning_rates)
    num_epochs = random.choice(num_epochs_list)
    batch_size = random.choice(batch_sizes)

    # Create DataLoader for training and testing
    train_dataset = TensorDataset(X_train, y_train)
    test_dataset = TensorDataset(X_test, y_test)
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

    # Ensure the number of classes matches the range of target labels
    num_classes = len(torch.unique(y))

    # Define the neural network using Sequential
    input_size = X_train.shape[1]  # Number of features

    model = nn.Sequential(
        nn.Linear(input_size, hidden_size),
        nn.ReLU(),
        nn.Linear(hidden_size, num_classes)
    )

    # Loss and optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)

    # Training loop
    for epoch in range(num_epochs):
        for features, labels in train_loader:
            optimizer.zero_grad()
            outputs = model(features)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

    # Testing loop
    correct = 0
    total = 0
    with torch.no_grad():
        for features, labels in test_loader:
            outputs = model(features)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    accuracy = 100 * correct / total

    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_params = {
            'hidden_size': hidden_size,
            'learning_rate': learning_rate,
            'num_epochs': num_epochs,
            'batch_size': batch_size
        }

print("Best accuracy:", best_accuracy)
print("Best parameters:", best_params)

Best accuracy: 85.08946322067594
Best parameters: {'hidden_size': 128, 'learning_rate': 0.01, 'num_epochs': 50, 'batch_size': 128}
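
The grid search and random search loops repeat the same model-building, training, and evaluation code. One way to remove the duplication is a helper function that takes a configuration and returns an accuracy. The sketch below is one possible shape for such a helper, demonstrated on synthetic data. The function name `train_and_evaluate` and its signature are hypothetical and not part of the notebook.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset


def train_and_evaluate(X_tr, y_tr, X_ev, y_ev,
                       hidden_size, learning_rate, num_epochs, batch_size):
    """Train a one-hidden-layer classifier and return accuracy (%) on the evaluation data."""
    num_classes = int(y_tr.max().item()) + 1  # assumes labels are 0..K-1
    model = nn.Sequential(
        nn.Linear(X_tr.shape[1], hidden_size),
        nn.ReLU(),
        nn.Linear(hidden_size, num_classes),
    )
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    loader = DataLoader(TensorDataset(X_tr, y_tr), batch_size=batch_size, shuffle=True)

    # Training loop: same structure as the loops in this notebook
    for _ in range(num_epochs):
        for features, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(features), labels)
            loss.backward()
            optimizer.step()

    # Evaluation: predicted class is the index of the largest logit
    with torch.no_grad():
        predicted = model(X_ev).argmax(dim=1)
    return 100 * (predicted == y_ev).float().mean().item()


# Synthetic data: the label depends only on the sign of the first feature
torch.manual_seed(0)
X_demo = torch.randn(200, 4)
y_demo = (X_demo[:, 0] > 0).long()

acc = train_and_evaluate(
    X_demo[:160], y_demo[:160], X_demo[160:], y_demo[160:],
    hidden_size=16, learning_rate=0.01, num_epochs=20, batch_size=32,
)
print(f"Accuracy: {acc:.1f}%")
```

With a helper like this, the body of either search loop reduces to one call per configuration followed by the best-accuracy bookkeeping.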


# Try it Yourself!

- Using the MPRester API, find two-element oxides that are stable. Sample at least 5 different properties, including band gap. Don't forget to clean the data!
- If the band gap is between 0.5 and 3 eV, change the value to 1 to signify a semiconductor. If it's outside that range, change the value to 0 to signify that it's not a semiconductor (a metal or insulator).
- Set up a NN to classify whether a material is a semiconductor. Make sure to create a train/test split for validation!
- Perform hyperparameter tuning on the model and compare the performance before and after tuning.

In [None]:
# Code here