## *Using neural networks to predict perovskite bandgaps*

In this tutorial we will learn how to use neural networks from the [Keras](https://keras.io/) library to create a regression model to estimate perovskite bandgaps.

You can find another example of neural network regression using Keras in the [TensorFlow Tutorials](https://nanohub.org/tools/tftutorials) nanoHUB tool.

This tutorial uses Python. Some familiarity with programming is beneficial but not required. Run each code cell in order by pressing "Shift + Enter". Feel free to modify the code to familiarize yourself with how it works.

**Outline:**

1. Import libraries
2. Getting a dataset
3. Preprocessing and Organizing Data
4. Creating and training the model
5. Plotting

**Get started:** Press "Shift + Enter" on the code cells to run!

### 1. Import libraries

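We first import the relevant libraries. These steps are spread over four cells.

The first cell imports the Pandas and Numpy libraries, which we use to load the data and convert it to formats suitable for the neural network. It also imports `cmcl` for featurization and several Scikit Learn utilities for preprocessing and splitting the data.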

In [1]:
import pandas as pd
import numpy as np

# featurization
import cmcl
from cmcl import Categories


from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, normalize
from sklearn.pipeline import make_pipeline as mkpipe
from sklearn.model_selection import StratifiedShuffleSplit

The next cell imports TensorFlow and its Keras API, which we use to construct the neural network. The third cell sets the random seed so that results are consistent every time the notebook is run, an important step for reproducibility.

In [2]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers



In [3]:
tf.random.set_seed(0)
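
As a small illustration of why seeding matters, the sketch below uses NumPy's random generator rather than TensorFlow (an assumption made so that the example runs without TensorFlow installed). The principle is the same: the same seed yields the same sequence of pseudo-random numbers.

```python
import numpy as np

# Two generators created with the same seed produce identical draws
rng_a = np.random.default_rng(0)
rng_b = np.random.default_rng(0)
draws_a = rng_a.normal(size=5)
draws_b = rng_b.normal(size=5)

# A generator with a different seed produces a different sequence
rng_c = np.random.default_rng(1)
draws_c = rng_c.normal(size=5)

print(np.array_equal(draws_a, draws_b))  # same seed
print(np.array_equal(draws_a, draws_c))  # different seed
```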

The next cell imports the pyplot module from the matplotlib library for plotting.

In [4]:
from matplotlib import pyplot as plt

### 2. Getting a dataset

We will follow the same steps used in [visualizations](./visualizations.ipynb) to load the data. We first load the CSV (comma separated value) file using the Pandas `read_csv` function. The `set_index` method then uses a set of columns (passed as arguments) to index the dataframe.

In [5]:
my = pd.read_csv("./mannodi_data.csv").set_index(["index", "Formula", "sim_cell"])
lookup = pd.read_csv("./constituent_properties.csv").set_index("Formula")
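
To see what `set_index` does with a list of columns, here is a minimal sketch on a hypothetical toy table. The column names mirror those used above, but the rows and the numbers are placeholders, not values from the real dataset.

```python
import pandas as pd

# Hypothetical toy table; the values are placeholders for illustration only
toy = pd.DataFrame({
    "index": [0, 1, 2],
    "Formula": ["CsPbI3", "CsPbBr3", "RbPbI3"],
    "sim_cell": ["cubic", "cubic", "cubic"],
    "PBE_bg_eV": [1.0, 2.0, 3.0],
})

# The three listed columns become a hierarchical (multi-level) row index
indexed = toy.set_index(["index", "Formula", "sim_cell"])

print(list(indexed.index.names))
print(list(indexed.columns))
```

After the call, only `PBE_bg_eV` remains as a regular column, and every row is identified by the combination of the three index levels.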

We now need to convert the raw chemical formulas into a numerical representation. This process is generally called featurization, and we will use the `cmcl` library to "featurize" the chemical formulas. This library offers convenience accessors such as `ft`, which converts a raw string to a numerical representation, and `collect`, which groups the resulting data.

In [6]:
mc = my.ft.comp() # compute numerical composition vectors from strings
mc = mc.collect.abx() # convenient site groupings for perovskites data
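
To give a feel for what a composition vector is, the sketch below parses a simple formula string into atomic fractions. This is a hypothetical illustration and not how `cmcl` works internally; the function name `composition_fractions` is made up, and the parser only handles plain element symbols (it would not understand abbreviations for molecular cations).

```python
import re

def composition_fractions(formula):
    """Return {element: atomic fraction} for a simple formula string."""
    # An element symbol followed by an optional (possibly fractional) count
    tokens = re.findall(r"([A-Z][a-z]?)(\d*\.?\d*)", formula)
    counts = {}
    for symbol, amount in tokens:
        # A missing count means one atom of that element
        counts[symbol] = counts.get(symbol, 0.0) + (float(amount) if amount else 1.0)
    total = sum(counts.values())
    return {symbol: count / total for symbol, count in counts.items()}

fractions = composition_fractions("CsPbI3")
print(fractions)
```

For `CsPbI3` there are five atoms per formula unit, so Cs and Pb each get a fraction of 1/5 and I gets 3/5.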

We now group the data points (perovskites) into various categories (pure vs mixed) using the `Categories` class from `cmcl`. These categories are then assigned to the dataframes we loaded earlier, with the label mix.

In [7]:
mixlog = mc.groupby(level=0, axis=1).count()
mix = mixlog.pipe(Categories.logif, condition=lambda x: x>1, default="pure", catstring="and")
mc = mc.assign(mix=mix).set_index("mix", append=True)
my = my.assign(mix=mix).set_index("mix", append=True)

In [8]:
mixweight = pd.get_dummies(mix)

The categories stored in the `mix` variable are now converted to integer labels using the `OrdinalEncoder` class from Scikit Learn. The encoder numbers the categories in sorted order of their names, starting from 0.

In [9]:
mixcat = pd.Series(OrdinalEncoder().fit_transform(mix.values.reshape(-1, 1)).reshape(-1),
                     index=mc.index).astype(int)
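
The following sketch shows how `OrdinalEncoder` assigns integers on a hypothetical list of category names (the labels here are made up and need not match those produced by `cmcl`). The integer a category receives depends on where its name falls in sorted order.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical category labels, for illustration only
labels = np.array(["pure", "A", "pure", "B"]).reshape(-1, 1)

encoder = OrdinalEncoder()
codes = encoder.fit_transform(labels).reshape(-1).astype(int)

print(encoder.categories_[0])  # categories in sorted order
print(codes)                   # integer code for each input row
```

The fitted `categories_` attribute is the place to check which integer corresponds to which category.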

### 3. Preprocessing and Organizing Data

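We now use the Scikit Learn `StratifiedShuffleSplit` class to reserve 20% of the data as a test set, which we will not use in model training. The split is stratified by the mix categories, so the training and test sets contain similar proportions of each category.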

In [10]:
sss = StratifiedShuffleSplit(n_splits=1, train_size=0.8, random_state=0)
train_idx, test_idx = next(sss.split(mc, mixcat)) #stratify split by mix categories
mc_tr, mc_ts = mc.iloc[train_idx], mc.iloc[test_idx]
my_tr, my_ts = my.iloc[train_idx], my.iloc[test_idx]
mixcat_tr, mixcat_ts = mixcat.iloc[train_idx], mixcat.iloc[test_idx]
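
A minimal sketch of what stratification achieves, using hypothetical imbalanced labels rather than the perovskite data: the minority class makes up the same share of the training and test sets.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical imbalanced labels: 80 samples of class 0, 20 of class 1
labels = np.array([0] * 80 + [1] * 20)
features = np.zeros((labels.size, 1))  # placeholder features; only the labels drive stratification

splitter = StratifiedShuffleSplit(n_splits=1, train_size=0.8, random_state=0)
train_idx, test_idx = next(splitter.split(features, labels))

print(len(train_idx), len(test_idx))
print(labels[train_idx].mean(), labels[test_idx].mean())  # share of class 1 in each set
```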

Before we define and train the neural network, we prepare the inputs. We replace the NaNs in the dataset with zeros, convert the Pandas dataframes to Numpy arrays, and apply the `normalize()` function from Scikit Learn, which by default rescales each row to unit Euclidean (L2) norm, that is, to a vector of length one. We then divide the non-test data into training and validation sets: the first 80% of the rows are used for training, and the remaining 20% form the validation set. The validation set is used to monitor model training, for example to detect overfitting.

In [11]:
#prepare training and validation sets
X = mc_tr
Y = my_tr.PBE_bg_eV
X = X.fillna(0) #replace nan with zero
Y = Y.fillna(0)
X = np.array(X, dtype='float32') #convert to numpy array
Y = np.array(Y, dtype='float32')
X = normalize(X) #normalize
Y = Y.reshape(-1, 1)

idx = int(0.8*X.shape[0]) #Get a validation set
Xtrain = X[:idx, :]
Ytrain = Y[:idx, :]
Xval = X[idx:, :]
Yval = Y[idx:, :]

#prepare testing set
Xtest = mc_ts
Ytest = my_ts.PBE_bg_eV
Xtest = Xtest.fillna(0)
Ytest = Ytest.fillna(0)
Xtest = np.array(Xtest, dtype='float32')
Ytest = np.array(Ytest, dtype='float32')
Xtest = normalize(Xtest) 
Ytest = Ytest.reshape(-1, 1)

In [12]:
#check for consistency
print (Xtrain.shape, Ytrain.shape, Xval.shape, Yval.shape, Xtest.shape, Ytest.shape)

(316, 14) (316, 1) (80, 14) (80, 1) (99, 14) (99, 1)
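
The row normalization and the slicing arithmetic can be sketched with plain NumPy. The feature rows below are hypothetical; the division by the row norm mirrors what `normalize()` does with its default settings for rows that are not all zeros.

```python
import numpy as np

X_demo = np.array([[3.0, 4.0], [1.0, 0.0], [2.0, 2.0]])  # hypothetical feature rows

# Divide each row by its Euclidean (L2) norm so every row has length one
row_norms = np.linalg.norm(X_demo, axis=1, keepdims=True)
X_unit = X_demo / row_norms
print(X_unit)

# Slicing arithmetic used for the training/validation split
n_samples = 396            # 316 + 80, the sizes printed above
idx = int(0.8 * n_samples)
print(idx, n_samples - idx)
```

Because `int()` truncates, the training part receives 316 rows and the validation part the remaining 80.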


### 4. Creating and training the model

For this regression, we will use a simple sequential neural network with two densely connected hidden layers, each with 100 neurons. The optimizer used will be the [Adam Optimizer](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adam). To learn more about the Adam Optimizer, click [here](https://climin.readthedocs.io/en/latest/adam.html).

In [57]:
model = keras.Sequential() #initialize a Sequential model
model.add(keras.Input(shape=(14,))) #Add an input layer, the shape parameter tells how many inputs each data point will have
model.add(layers.Dense(100, activation='tanh')) #Dense defines a fully connected layer, the argument specifies the number of neurons
model.add(layers.Dense(100, activation='tanh')) #activation defines the activation function applied after each layer
model.add(layers.Dense(1, activation='relu')) #Output layer can use a 'relu' activation since bandgaps are never negative
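
To make the architecture concrete, here is a framework-free sketch of the computation this network performs, written with NumPy. The weights are random placeholders, not trained values, and the names `kernels`, `biases`, and `forward` are chosen for this illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [14, 100, 100, 1]  # inputs, two hidden layers, one output

# Randomly initialized kernels and zero biases (placeholders, not trained values)
kernels = [rng.normal(scale=0.1, size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

def forward(x):
    h = np.tanh(x @ kernels[0] + biases[0])             # first hidden layer
    h = np.tanh(h @ kernels[1] + biases[1])             # second hidden layer
    return np.maximum(h @ kernels[2] + biases[2], 0.0)  # ReLU output

batch = rng.random((5, 14))  # five hypothetical input rows
output = forward(batch)
n_params = sum(k.size + b.size for k, b in zip(kernels, biases))
print(output.shape, n_params)  # (5, 1) 11701
```

The parameter count is 14×100 + 100 for the first layer, 100×100 + 100 for the second, and 100×1 + 1 for the output, which adds up to 11701 trainable values.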

Before the model is ready for training, it needs a few more settings. The loss function and optimizer are set during the model's compile step, while the number of epochs is passed later to `model.fit()`:

- *Loss function:* This measures how far the model's predictions are from the true values during training. We want to minimize this function to "steer" the model in the right direction.
- *Optimizer:* This decides the optimization technique used to minimize the loss function.
- *Epochs:* This decides how long to train the model. One epoch is one pass over the entire training set, processed in batches, with the model weights updated after each batch. Click [here](https://towardsdatascience.com/epoch-vs-iterations-vs-batch-size-4dfb9c7ce9c9) to learn more about iterations, epochs and batch sizes.
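
As a rough sketch of two of these quantities, the block below computes a mean squared error by hand on hypothetical numbers and counts how many batches make up one epoch. The batch count assumes the Keras default batch size of 32, since no batch size is set in this tutorial.

```python
import math
import numpy as np

# Mean squared error on hypothetical true and predicted values
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.0])
mse = np.mean((y_true - y_pred) ** 2)
print(mse)

# Batches (weight updates) per epoch for 316 training samples,
# assuming the Keras default batch size of 32
n_train, batch_size = 316, 32
steps_per_epoch = math.ceil(n_train / batch_size)
print(steps_per_epoch)
```

The last batch of each epoch is smaller than the others because 316 is not a multiple of 32.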

In [58]:
optimizer = keras.optimizers.Adam(learning_rate=1e-3) # Initialize an Adam optimizer with a learning rate of 0.001
model.compile(optimizer=optimizer, loss=keras.losses.MeanSquaredError()) #Compile the model with the Adam optimizer and MSE loss
EPOCHS = 100 #Number of passes over the training set

In [61]:
import os

def get_mkl_enabled_flag():
    """Report whether this TensorFlow build uses Intel MKL/oneDNN optimizations."""
    major_version, minor_version = (int(v) for v in tf.__version__.split(".")[:2])
    onednn_enabled = 0  # default so the name is always defined below
    if major_version >= 2:
        if minor_version < 5:
            from tensorflow.python import _pywrap_util_port
        else:
            from tensorflow.python.util import _pywrap_util_port
            onednn_enabled = int(os.environ.get('TF_ENABLE_ONEDNN_OPTS', '0'))
        mkl_enabled = _pywrap_util_port.IsMklEnabled() or (onednn_enabled == 1)
    else:
        mkl_enabled = tf.pywrap_tensorflow.IsMklEnabled()
    return mkl_enabled

print ("We are using Tensorflow version", tf.__version__)
print("MKL enabled :", get_mkl_enabled_flag())

We are using Tensorflow version 2.7.0
MKL enabled : True


The `model.fit()` function takes in the Numpy data we obtained earlier. This function automatically handles backpropagation and updating the model weights, and it returns a `History` object that records the training and validation losses for each epoch. To learn more about backpropagation and how neural networks learn, you can watch the videos [here](https://www.youtube.com/watch?v=aircAruvnKk) or [here](https://www.youtube.com/watch?v=Ilg3gGewQ5U).

In [60]:
history = model.fit(Xtrain, Ytrain, epochs=EPOCHS, validation_data=(Xval, Yval)) #keep the returned History object for plotting later

Epoch 1/100


TypeError: 'NoneType' object is not callable

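At this point, we can check some of the [weights](https://en.wikipedia.org/wiki/Synaptic_weight) from the trained neural network. These weights, in a way, represent the relationship between inputs and outputs. For this model, `get_weights()` returns a list that alternates between the weight matrix and the bias vector of each layer, so `weights[3]` is the bias vector of the second hidden layer.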

In [29]:
weights = model.get_weights()
weights[3]

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
      dtype=float32)

In [30]:
#The history object contains the training and validation losses, which we can plot
training_loss = history.history['loss']
validation_loss = history.history['val_loss']

NameError: name 'history' is not defined

In [None]:
#The model.evaluate() function evaluates the model on the training, validation and testing datasets
mse_train = model.evaluate(Xtrain, Ytrain)
mse_val = model.evaluate(Xval, Yval)
mse_test = model.evaluate(Xtest, Ytest)

print (mse_train, mse_val, mse_test)

We can save the trained model in the HDF5 (`.h5`) format using the `model.save()` function. The saved model can be reloaded using the `load_model()` function.

In [None]:
model.save('bg_model.h5')

In [None]:
loaded_model = keras.models.load_model('bg_model.h5') #reload the saved model under a name that does not shadow load_model()

### 5. Plotting

We now use pyplot from matplotlib to plot the "Learning Curve", which shows the evolution of the training and validation losses over epochs. We expect both losses to go down. More importantly, the validation loss helps monitor overfitting. Specifically, if the validation loss starts to rise while the training loss keeps falling, the model is likely overfitting. In Keras, training can be stopped automatically when this happens by using the `EarlyStopping` callback.

In [None]:
plt.plot(training_loss, 'b-')
plt.plot(validation_loss, c='orange')
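
The early-stopping idea mentioned above can be sketched without any framework. The function below is a simplified, hypothetical version of patience-based stopping on a made-up list of validation losses. In Keras itself, the corresponding tool is the `keras.callbacks.EarlyStopping` callback, passed to `model.fit()` through its `callbacks` argument.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the epoch at which training would stop."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best validation loss: reset the patience counter
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered: train to the end

# Hypothetical validation losses that fall, then rise
losses = [1.0, 0.6, 0.4, 0.35, 0.37, 0.40, 0.45, 0.50]
stop = early_stopping_epoch(losses, patience=3)
print(stop)  # 6
```

Here the best loss occurs at epoch 3, and training stops at epoch 6 after three consecutive epochs without improvement.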

`model.predict()` makes predictions for different datasets. We will use this function to make predictions on the training, validation and test sets. We expect good predictions for the training and validation sets, but the quality of the predictions on the test set is not known in advance, since the model never saw those data during training.

In [None]:
Y_pred_tr = model.predict(Xtrain)
Y_pred_val = model.predict(Xval)
Y_pred_test = model.predict(Xtest)

We finally plot a "Parity Plot" that compares the predictions with the true values. Points that lie close to the diagonal line correspond to accurate predictions. We see that the model does a reasonable job at predicting bandgaps across the training, validation, and test sets, indicating that the model has learnt the underlying correlations between composition and bandgap.

In [None]:
plt.plot(Ytrain, Y_pred_tr, 'ro')
plt.plot(Yval, Y_pred_val, 'bo')
plt.plot(Ytest, Y_pred_test, 'go')
x = np.linspace(Ytrain.min(), Ytrain.max(), 1000) #scalar endpoints for the diagonal reference line
plt.plot(x, x, 'k-')
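
A parity plot can be complemented with numbers. The sketch below computes the root mean squared error (RMSE) and the coefficient of determination (R²) on hypothetical values. The function name `parity_metrics` and the demo arrays are assumptions for illustration; in the notebook, the same function could be applied to pairs such as `Ytest` and `Y_pred_test`.

```python
import numpy as np

def parity_metrics(y_true, y_pred):
    """Return (RMSE, R^2) for true and predicted values of any matching shape."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    residuals = y_true - y_pred
    rmse = np.sqrt(np.mean(residuals ** 2))
    # R^2 compares the residual error with the spread of the true values
    r2 = 1.0 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return rmse, r2

# Hypothetical true and predicted values, shaped like the (n, 1) arrays used above
y_true_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_pred_demo = np.array([[1.1], [1.9], [3.2], [3.8]])
rmse, r2 = parity_metrics(y_true_demo, y_pred_demo)
print(rmse, r2)
```

An RMSE near zero and an R² near one correspond to points lying close to the diagonal of the parity plot.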