##### Copyright 2018 The TensorFlow Authors.

### This Jupyter Notebook was adapted from the notebook used in:
### https://www.tensorflow.org/tutorials/keras/classification.

# Neural Network Classifier: Classifying GeTe Descriptors

This notebook trains a neural network model to classify GeTe descriptors from alpha, beta and amorphous xyz files.

This notebook uses [tf.keras](https://www.tensorflow.org/guide/keras), a high-level API to build and train models in TensorFlow.

In [None]:
try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass


In [None]:
from __future__ import absolute_import, division, print_function, unicode_literals

# TensorFlow and tf.keras
import tensorflow as tf
from tensorflow import keras

# Helper libraries
import numpy as np
import matplotlib.pyplot as plt
import time

print(tf.__version__)

This guide uses descriptors from different simulations of GeTe molecules. The neural network learns to assign labels to unseen test descriptors based on the labelled training descriptors.

Here, 36,600 descriptors are used to train the network: 15,000 alpha, 10,800 beta and 10,800 amorphous.


In [None]:
train_input = np.load('train_alpha_beta_quenched_2node.npy')
train_labels = np.load('train_alpha_beta_quenched_2node_labels.npy')
test_labels = np.load('train_crystalline_2node_labels.npy')
test_input = np.load('train_crystalline_2node.npy')

Loading the dataset returns four NumPy arrays:

* The `train_input` and `train_labels` arrays are the *training set*—the data the model uses to learn.
* The model is tested against the *test set*: the `test_input` and `test_labels` arrays.

The descriptors are in NumPy arrays. The *labels* are an array of integers, each either 0 or 1. These correspond to the *class* of atom the descriptor represents:


<table>
  <tr>
    <th>Label</th>
    <th>Class</th>
  </tr>
  <tr>
    <td>0</td>
    <td>Amorphous</td>
  </tr>
  <tr>
    <td>1</td>
    <td>Crystalline</td>
  </tr>
    
</table>

Each descriptor is mapped to a single label. Since the *class names* are not included with the dataset, store them here to use later:

In [None]:
classifications = ['quenched', 'crystalline']
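
As a rough cross-check of the class sizes quoted earlier, the sketch below builds a stand-in label array and counts the entries per label. It assumes that the alpha and beta descriptors both carry the crystalline label 1 and the amorphous descriptors carry label 0; that mapping, the ordering of the array, and the names `example_labels` and `class_names` are illustrative assumptions, not taken from the dataset files.

```python
import numpy as np

# Class sizes quoted in the text above.
n_alpha, n_beta, n_amorphous = 15_000, 10_800, 10_800

# ASSUMPTION: alpha and beta are both crystalline (label 1) and
# amorphous descriptors are label 0. The ordering is illustrative only.
example_labels = np.concatenate([
    np.ones(n_alpha + n_beta, dtype=np.int64),
    np.zeros(n_amorphous, dtype=np.int64),
])

# Local copy of the class names so that the sketch is self-contained.
class_names = ['quenched', 'crystalline']

# Count how many descriptors carry each label.
class_counts = np.bincount(example_labels, minlength=2)
for label, count in enumerate(class_counts):
    print(f"{label} ({class_names[label]}): {count} descriptors")
print("total:", example_labels.size)
```

Running `np.bincount` on the real `train_labels` array in the same way would show how balanced the two classes actually are.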

## Checking the format of input data is correct

Before training the neural network, the format of the data must be correct. The following shows there are 67,500 parameters in the training set.

In [None]:
train_input.shape

Likewise, there are 36,600 labels in the training set:

In [None]:
len(train_labels)


Each label is either 0 or 1:

In [None]:
train_labels

Next, check that the test set has the same number of inputs and labels:

In [None]:
test_input.shape

And the test set contains the same number of descriptor labels:

In [None]:
len(test_labels)
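
The manual checks above could also be gathered into one function. The sketch below is a hypothetical helper (`check_dataset` is not part of TensorFlow or of the original notebook) and assumes each input array is 2-D, with one row per descriptor. It is demonstrated on small synthetic arrays rather than the real files.

```python
import numpy as np

def check_dataset(inputs, labels, n_classes=2):
    """Raise ValueError if inputs and labels are inconsistent (hypothetical helper)."""
    # ASSUMPTION: one row per descriptor, one column per parameter.
    if inputs.ndim != 2:
        raise ValueError(f"expected a 2-D input array, got {inputs.ndim}-D")
    if len(inputs) != len(labels):
        raise ValueError(f"{len(inputs)} descriptors but {len(labels)} labels")
    if not np.issubdtype(labels.dtype, np.integer):
        raise ValueError("labels must be integers")
    if labels.min() < 0 or labels.max() >= n_classes:
        raise ValueError("labels fall outside 0..n_classes-1")
    if not np.all(np.isfinite(inputs)):
        raise ValueError("inputs contain NaN or infinite values")
    return inputs.shape

# Synthetic stand-in data: 6 descriptors with 4 parameters each.
rng = np.random.default_rng(0)
demo_input = rng.normal(size=(6, 4))
demo_labels = np.array([0, 1, 1, 0, 1, 0])
print(check_dataset(demo_input, demo_labels))
```

Calling `check_dataset(train_input, train_labels)` and `check_dataset(test_input, test_labels)` would apply the same checks to the real arrays, provided they match the assumed layout.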

## Build the model

Building the neural network requires configuring the layers of the model, then compiling the model.

### Set up the layers

The basic building block of a neural network is the *layer*. Layers extract representations from the data fed into them. 

Most of machine learning consists of chaining together simple layers. Most layers, such as `tf.keras.layers.Dense`, have parameters that are learned during training.

In [None]:
model = keras.Sequential([
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(2, activation='softmax')
])

The network consists of a sequence of two `tf.keras.layers.Dense` layers. These are densely connected, or fully connected, neural layers. The first *ReLU* `Dense` layer has 128 nodes (or neurons). The second (and last) layer is a 2-node *softmax* layer that returns an array of 2 probability scores that sum to 1. Each node contains a score that indicates the probability that the descriptor belongs to label class 0 or 1, respectively.
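
To make the layer description concrete, here is a minimal NumPy sketch of the forward pass through a 128-node ReLU layer followed by a 2-node softmax layer. It illustrates the arithmetic only and is not how Keras implements `Dense`. The weights are random, and the feature count `n_features` and batch size are hypothetical values chosen for the example.

```python
import numpy as np

def relu(z):
    # Element-wise max(z, 0).
    return np.maximum(z, 0.0)

def softmax(z):
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = z - z.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n_features = 8                      # hypothetical number of parameters per descriptor
batch = rng.normal(size=(5, n_features))

# Randomly initialised (untrained) weights and zero biases for the two layers.
W1 = rng.normal(scale=0.1, size=(n_features, 128))
b1 = np.zeros(128)
W2 = rng.normal(scale=0.1, size=(128, 2))
b2 = np.zeros(2)

hidden = relu(batch @ W1 + b1)      # shape (5, 128), all values >= 0
scores = softmax(hidden @ W2 + b2)  # shape (5, 2), each row sums to 1

print(scores.shape)
print(scores.sum(axis=1))
```

Each row of `scores` holds two probabilities, one per label class, which is the same shape of output that `model.predict` returns for this architecture.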

### Compile the model

Before the model is ready for training, it needs a few more settings. These are added during the model's *compile* step:

* *Loss function* —This measures how accurate the model is during training. You want to minimize this function to "steer" the model in the right direction.

* *Optimizer* —This is how the model is updated based on the data it sees and its loss function.

* *Metrics* —Used to monitor the training and testing steps. The following example uses *accuracy*, the fraction of the descriptors that are correctly classified.

In [None]:
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
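
For intuition about the loss and metric chosen above, the following NumPy sketch computes a sparse categorical cross-entropy and an accuracy by hand for three made-up predictions. It follows the textbook definitions (mean negative log-probability of the true class, and fraction of correct argmax predictions); it is not the Keras implementation, and the `eps` clipping value is an assumption made to avoid `log(0)`.

```python
import numpy as np

def sparse_categorical_crossentropy(labels, probs, eps=1e-7):
    # Clip to avoid log(0); eps is an illustrative choice.
    probs = np.clip(probs, eps, 1.0)
    # Pick the predicted probability of the true class for each sample.
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.log(picked).mean())

def accuracy(labels, probs):
    # Fraction of samples whose most probable class equals the true label.
    return float((np.argmax(probs, axis=1) == labels).mean())

# Made-up integer labels and softmax outputs for three descriptors.
demo_labels = np.array([0, 1, 1])
demo_probs = np.array([[0.9, 0.1],
                       [0.2, 0.8],
                       [0.6, 0.4]])

loss = sparse_categorical_crossentropy(demo_labels, demo_probs)
acc = accuracy(demo_labels, demo_probs)
print(f"loss={loss:.4f}, accuracy={acc:.3f}")
```

The third descriptor is misclassified, so it contributes the largest term to the loss and lowers the accuracy to two out of three.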

## Train the model

Training the neural network model requires the following steps:

1. Feed the training data to the model. In this example, the training data is in the `train_input` and `train_labels` arrays.
2. The model learns to associate descriptors and labels.
3. You ask the model to make predictions about a test set—in this example, the `test_input` array. Verify that the predictions match the labels from the `test_labels` array.

To start training, call the `model.fit` method, so called because it "fits" the model to the training data:

In [None]:
start1 = time.time()
model.fit(train_input, train_labels, epochs=10)
end1 = time.time()

print (end1-start1)

As the model trains, the loss and accuracy metrics are displayed. This model reaches an accuracy of about 100% on the training data after 10 epochs.
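
What `model.fit` does in each epoch can be sketched with a much simpler model. The example below trains a single softmax layer with plain full-batch gradient descent on synthetic, well-separated two-class data, printing loss and accuracy per epoch. It is an illustration only: the notebook's model has a hidden layer and uses the Adam optimizer with mini-batches, and the data, learning rate, and variable names here are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, well-separated two-class data (stand-in for real descriptors).
n_per_class, n_features = 100, 3
X = np.vstack([rng.normal(-1.0, 0.5, size=(n_per_class, n_features)),
               rng.normal(+1.0, 0.5, size=(n_per_class, n_features))])
y = np.repeat([0, 1], n_per_class)
one_hot = np.eye(2)[y]

# A single softmax layer, initialised to zero.
W = np.zeros((n_features, 2))
b = np.zeros(2)
learning_rate = 0.5  # illustrative value

history = []
for epoch in range(10):
    # Forward pass: logits -> softmax probabilities.
    logits = X @ W + b
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)

    # Monitor loss (cross-entropy) and accuracy, as Keras does per epoch.
    loss = -np.log(probs[np.arange(len(y)), y]).mean()
    acc = (probs.argmax(axis=1) == y).mean()
    history.append((loss, acc))
    print(f"epoch {epoch + 1}: loss={loss:.4f}, accuracy={acc:.3f}")

    # Backward pass: gradient of the mean cross-entropy w.r.t. the logits.
    grad_logits = (probs - one_hot) / len(y)
    W -= learning_rate * (X.T @ grad_logits)
    b -= learning_rate * grad_logits.sum(axis=0)
```

The printed loss should fall from epoch to epoch while the accuracy rises, which is the same pattern to look for in the Keras training log.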

## Evaluate accuracy

Next, compare how the model performs on the test dataset:

In [None]:
start = time.time()
test_loss, test_acc = model.evaluate(test_input,  test_labels, verbose=2)
print('\nTest accuracy:', test_acc)
end = time.time()
print (end-start)

The accuracy on the test dataset can often be a little less than the accuracy on the training dataset. This gap between training accuracy and test accuracy represents *overfitting*. Overfitting is when a machine learning model performs worse on new, previously unseen inputs than on the training data.

## Make predictions

With the model trained, you can use it to make predictions about some descriptors:

In [None]:
start = time.time()
predictions = model.predict(test_input)
end = time.time()
print (end-start)

Here, the model has predicted the label for each descriptor in the testing set. This is the result of the first prediction:

In [None]:
predictions[0]

A prediction is an array of 2 numbers. They represent the model's "confidence" that the descriptor is amorphous (index 0) or crystalline (index 1). The descriptor is assigned the label, 0 or 1, of whichever index has the higher confidence value:

In [None]:
np.argmax(predictions[0])

If the model is most confident that the descriptor is amorphous, the 0 label is assigned. Examining the test label `test_labels[0]` shows whether this classification is correct:

In [None]:
test_labels[0]
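
The comparison above looks at a single descriptor. As a sketch of how the same check might be applied to every prediction at once, the example below takes the argmax along each row, compares it with the labels, and tallies a 2×2 confusion matrix. The arrays `demo_predictions` and `demo_test_labels` are made-up stand-ins for `predictions` and `test_labels`.

```python
import numpy as np

# Made-up stand-ins for `predictions` and `test_labels`.
demo_predictions = np.array([[0.95, 0.05],
                             [0.10, 0.90],
                             [0.40, 0.60],
                             [0.70, 0.30],
                             [0.20, 0.80]])
demo_test_labels = np.array([0, 1, 0, 0, 1])

# Predicted label for every descriptor: index of the larger confidence.
predicted_labels = np.argmax(demo_predictions, axis=1)

# Fraction of descriptors whose predicted label matches the true label.
is_correct = predicted_labels == demo_test_labels
demo_accuracy = is_correct.mean()

# Confusion matrix: rows are true labels, columns are predicted labels.
confusion = np.zeros((2, 2), dtype=int)
np.add.at(confusion, (demo_test_labels, predicted_labels), 1)

print("predicted labels:", predicted_labels)
print("accuracy:", demo_accuracy)
print(confusion)
```

The off-diagonal entries of the confusion matrix show which class is being mistaken for the other, which a single accuracy figure does not reveal.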

In [None]:
x = predictions[:,0]
y = predictions[:,1]

In [None]:
import matplotlib.patches as mpatches
from matplotlib import font_manager as fm

# `prop` was not defined in the original cell; default font properties are
# assumed here. Pass fname=... to FontProperties to use a specific font file.
prop = fm.FontProperties()

# Set the legend styling before the legend is created so that it takes effect.
params = {'legend.fontsize': 20, 'legend.handlelength': 2}
plt.rcParams.update(params)

# Colours follow the legend: amorphous confidence (x) in orange,
# crystalline confidence (y) in blue.
plt.hist(x, bins=200, density=True, color='orange', alpha=0.9)
plt.hist(y, bins=200, density=True, color='blue', alpha=0.8)

orange_patch = mpatches.Patch(color='orange', label='Amorphous')
blue_patch = mpatches.Patch(color='blue', label='Crystalline')
plt.legend(handles=[blue_patch, orange_patch], loc='upper left')

plt.xticks(fontproperties=prop, fontsize=20, rotation=0)
plt.yticks(fontproperties=prop, fontsize=20, rotation=0)

plt.xlabel('Confidence Value', fontproperties=prop, fontsize=27, weight='bold')
plt.ylabel('Frequency', fontproperties=prop, fontsize=27)

plt.savefig('test.png')
plt.show()

This confidence plot shows the model's overall confidence that the test data is amorphous and should be labelled 0 (orange), and its overall confidence that the test data is crystalline and should be labelled 1 (blue).
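
Because a 2-node softmax output sums to 1, the two confidence values for each descriptor are complementary, so the two histograms mirror each other. The sketch below checks this on made-up predictions and also computes the fraction of descriptors classified with high confidence. The array `demo_predictions` and the `threshold` value are hypothetical choices for illustration.

```python
import numpy as np

# Made-up stand-in for `predictions`.
demo_predictions = np.array([[0.98, 0.02],
                             [0.03, 0.97],
                             [0.55, 0.45],
                             [0.10, 0.90]])

x = demo_predictions[:, 0]  # confidence that each descriptor is amorphous
y = demo_predictions[:, 1]  # confidence that each descriptor is crystalline

# The two columns are complementary for a 2-class softmax output.
print("columns sum to 1:", np.allclose(x + y, 1.0))

# Fraction of descriptors whose winning class reaches the chosen cutoff.
threshold = 0.9  # hypothetical cutoff
top_confidence = demo_predictions.max(axis=1)
fraction_confident = (top_confidence >= threshold).mean()
print("fraction with confidence >= threshold:", fraction_confident)
```

A summary number like `fraction_confident` can complement the histogram when comparing models, since it does not depend on the choice of bins.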