# Adsorbate Fingerprints Setup

In this tutorial we will try the adsorbate fingerprint generator, which is useful for converting adsorbates on extended surfaces into fingerprints for predicting their chemisorption energies, bond lengths or other properties.

In most machine learning codes, data is supplied as a matrix in which rows represent training examples or unexplored data points and columns represent features or properties of those data points. The CatLearn fingerprinters therefore take atoms objects as input and return the data as an array.
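
As a minimal sketch of this layout, the snippet below builds a small matrix from plain Python lists. The names `samples` and `feature_names` and all values are made up for illustration and are not part of CatLearn.

```python
# Hypothetical per-structure features; the values are for illustration only.
samples = [
    {'atomic_number': 1, 'radius': 0.31},
    {'atomic_number': 8, 'radius': 0.66},
    {'atomic_number': 6, 'radius': 0.76},
]
feature_names = ['atomic_number', 'radius']

# One row per data point, one column per feature.
data_matrix = [[s[name] for name in feature_names] for s in samples]

n_rows, n_cols = len(data_matrix), len(data_matrix[0])
print(n_rows, 'data points,', n_cols, 'features')
```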

In [None]:
# Import packages.
import os
import numpy as np
import ase.io
from ase.data import atomic_numbers, chemical_symbols
from ase.build import fcc111, add_adsorbate
from ase.constraints import FixAtoms
from ase.visualize import view

from catlearn.fingerprint.setup import FeatureGenerator, default_fingerprinters
from catlearn.fingerprint.periodic_table_data import get_radius, default_catlearn_radius
from catlearn.fingerprint.adsorbate_prep import autogen_info
from catlearn.preprocess.clean_data import clean_infinite, clean_variance
from catlearn.utilities.utilities import target_correlation, holdout_set
try:
    import matplotlib.pyplot as plt
    import seaborn as sns
    import pandas as pd
    plot = True
except ImportError:
    plot = False
    print('Matplotlib, seaborn and pandas are needed for the plots in this tutorial.')

### Generate some adsorbate/surface systems from ASE.

We return the atoms objects in a list, which is the simplest format and easily transferable to CatLearn.

In [None]:
"""Make a list of atoms objects."""
adsorbates = ['H', 'O', 'C', 'N', 'S', 'Cl', 'F']
symbols = ['Ag', 'Au', 'Cu', 'Pt', 'Pd', 'Ir', 'Rh', 'Ni', 'Co']
images = []
for i, s in enumerate(symbols):

    # Get atomic radius.
    rs = get_radius(atomic_numbers[s])
    a = 2 * rs * 2 ** 0.5

    for ads in adsorbates:
        # Create a slab.
        atoms = fcc111(s, (2, 2, 3), a=a)
        atoms.center(vacuum=6, axis=2)
        
        # Constrain the slab.
        # Use `atom` as the loop variable so the lattice constant `a` is not shadowed.
        c_atoms = [atom.index for atom in atoms if
                   atom.z < atoms.cell[2, 2] / 2. + 0.1]
        atoms.set_constraint(FixAtoms(c_atoms))

        # Specify an adsorbate-surface bond distance.
        h = (default_catlearn_radius(atomic_numbers[ads]) + rs) / 2 ** 0.5

        # Adsorb.
        add_adsorbate(atoms, ads, h, 'bridge')

        # Make list of atoms objects.
        images.append(atoms)
print(len(images), 'atoms objects created.')

Here we have our list of atoms stored in `images`.
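
The lattice constant above, `a = 2 * rs * 2 ** 0.5`, follows from treating the fcc metal as touching hard spheres: nearest neighbors sit a distance $a / \sqrt{2}$ apart, and that distance equals $2 r$. The sketch below checks this relation with a made-up radius; `fcc_lattice_constant` is a hypothetical helper, not a CatLearn or ASE function.

```python
import math


def fcc_lattice_constant(radius):
    """Lattice constant of an fcc crystal of touching hard spheres."""
    return 2 * radius * math.sqrt(2)


radius = 1.44  # Assumed radius in angstrom, for illustration only.
a = fcc_lattice_constant(radius)

# Nearest neighbors in fcc: a corner atom and a face-center atom.
corner = (0.0, 0.0, 0.0)
face_center = (a / 2, a / 2, 0.0)
d_nn = math.dist(corner, face_center)

print('a =', round(a, 3), 'nearest-neighbor distance =', round(d_nn, 3))
```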

### Attach meta data automatically.

The adsorbate fingerprinter generates fingerprints based on the connectivity of atoms in the adsorbate/slab system. It therefore uses certain metadata as intermediates between the atoms object and the fingerprint. One of these is the connectivity matrix, which can sometimes be time-consuming to compute and therefore only needs to be generated once.

A list of raw atoms objects without the metadata can be fed through `autogen_info` to attach the connectivity matrix and the other metadata.
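
To illustrate what a connectivity matrix encodes, here is a minimal sketch that marks two atoms as connected when their distance is below the sum of their cutoff radii. It ignores periodic boundary conditions and uses made-up positions and radii, so it is not CatLearn's implementation; `connectivity_matrix` is a hypothetical helper.

```python
import math


def connectivity_matrix(positions, radii):
    """Return an N x N matrix with 1 where two atoms are connected, else 0."""
    n = len(positions)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Connected if the distance is below the sum of the cutoff radii.
            if math.dist(positions[i], positions[j]) < radii[i] + radii[j]:
                matrix[i][j] = matrix[j][i] = 1
    return matrix


# Three atoms on a line, 1.0 apart, each with an assumed cutoff radius of 0.6.
positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
radii = [0.6, 0.6, 0.6]
for row in connectivity_matrix(positions, radii):
    print(row)
```

Only adjacent atoms are connected here, because the end atoms are 2.0 apart, which exceeds the combined cutoff of 1.2.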

In [None]:
images = autogen_info(images)

Now let's go ahead and generate our fingerprint matrix.

First we instantiate the FeatureGenerator object and define the fingerprinting functions we want to call. These define what information we retrieve and include in our fingerprints.

In [None]:
# Get the fingerprint generator.
fingerprint_generator = FeatureGenerator(nprocs=1)

# List of feature functions to call. For now let's just grab a default list.
feature_functions = default_fingerprinters(fingerprint_generator, 'adsorbates')

# Feature functions define which fingerprints we generate.
feature_functions

Return the fingerprint matrix from the atoms objects and `feature_functions`.

In [None]:
# Run the fingerprinter.
data_matrix = fingerprint_generator.return_vec(images, feature_functions)

# Get a list of names of the features.
feature_names = fingerprint_generator.return_names(feature_functions)

print(np.shape(data_matrix), 'data matrix created.')

We are done. The data matrix is now stored in the variable `data_matrix`.

### Let's analyse the output.

First, let's see which features were returned by the `feature_functions`:

In [None]:
for index, name in enumerate(feature_names):
    print(index, name)

Let's check one of the features.

In [None]:
descriptor_index = 18
plt.hist(data_matrix[:, descriptor_index], bins=min([65, len(data_matrix)]))
plt.xlabel(feature_names[descriptor_index])

Let's compare some of the features related to atomic radii using violin plots.

In [None]:
# Select some features to plot.
selection = [10, 11, 14]

# Plot the selected feature distributions.
plot_data = {}
traint = np.transpose(data_matrix[:, selection])
for i, j in zip(traint, selection):
    plot_data[j] = i
df = pd.DataFrame(plot_data)
fig = plt.figure(figsize=(20, 10))
ax = sns.violinplot(data=df, inner=None)
plt.title('Feature distributions', fontsize=20)
plt.xlabel('Feature No.', fontsize=20)
plt.ylabel('Distribution.', fontsize=20)

string = 'Plotting:'
for s in selection:
    string += '\n' + str(s) + ' ' + feature_names[s]
print(string)

### Clean data

In [None]:
finite_data = clean_infinite(data_matrix, labels=feature_names)
informative_data = clean_variance(finite_data['train'], labels=finite_data['labels'])
training_data = informative_data['train']
clean_features = informative_data['labels']
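
As a rough picture of what this cleaning step does, the sketch below drops columns that contain non-finite values or that have zero variance, and keeps the labels in sync. It is a plain-Python illustration under those assumptions, not the CatLearn implementation; `drop_uninformative_columns` is a hypothetical helper and the data are made up.

```python
import math


def drop_uninformative_columns(matrix, labels):
    """Drop columns with non-finite values or zero variance."""
    keep = []
    for j in range(len(labels)):
        column = [row[j] for row in matrix]
        finite = all(math.isfinite(v) for v in column)
        # A finite column is informative only if its values are not all equal.
        if finite and max(column) > min(column):
            keep.append(j)
    cleaned = [[row[j] for j in keep] for row in matrix]
    return cleaned, [labels[j] for j in keep]


matrix = [
    [1.0, 5.0, float('nan'), 0.2],
    [2.0, 5.0, 3.0, 0.4],
    [3.0, 5.0, 4.0, float('inf')],
]
labels = ['f0', 'f1', 'f2', 'f3']
cleaned, kept = drop_uninformative_columns(matrix, labels)
print(kept, cleaned)
```

Only `f0` survives: `f1` is constant, `f2` contains a NaN and `f3` contains an infinity.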

### Correlations

In [None]:
target_feature = 14
target_corr = target_correlation(training_data, training_data[:, target_feature], correlation=['pearson'])
plt.plot(list(range(np.shape(target_corr)[1])), np.abs(target_corr)[0, :], '-o')
plt.xlabel("Feature No.")
plt.ylabel("Correlation")
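
For reference, the Pearson coefficient requested with `correlation=['pearson']` measures the linear relationship between two columns. The sketch below computes it in plain Python on made-up data; `pearson` is a hypothetical helper and not the CatLearn routine.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Made-up target column and candidate feature columns.
target = [1.0, 2.0, 3.0, 4.0]
features = {
    'scaled copy': [2.0, 4.0, 6.0, 8.0],
    'reversed': [4.0, 3.0, 2.0, 1.0],
    'weakly related': [1.0, 3.0, 2.0, 2.0],
}
for name, column in features.items():
    print(name, round(abs(pearson(column, target)), 3))
```

Taking the absolute value, as the cells here do, ranks strongly anti-correlated features alongside strongly correlated ones.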

In [None]:
corr_ordering = np.argsort(np.abs(target_corr)[0, :])[::-1]
np.abs(target_corr)[0, corr_ordering]
print('Highest correlation with', clean_features[target_feature], ':\n')
for i in range(10):
    print(corr_ordering[i], clean_features[corr_ordering][i], np.abs(target_corr)[0, corr_ordering][i])

In [None]:
view(images[0])

In [None]:
d = pd.DataFrame(training_data)
corr = d.corr(method='pearson')
sns.heatmap(corr.abs(), square=True)

In [None]:
np.abs(target_corr)[0, :][31:36]

In [None]:
clean_features[31:36]

### Analysis of meta data.

Attached to the atoms objects, the fingerprinter needs information about which atoms belong to the adsorbate.
This was generated automatically by `autogen_info`, but we can take a closer look at how this metadata is formatted:

In [None]:
# Look at meta data for the first atoms object.
images[0].subsets

For example, the atomic indices of atoms belonging to the adsorbate are stored in `atoms.subsets['ads_atoms']`.
There is only one index in that subset, which shows that this system has a monoatomic adsorbate.

In [None]:
# Let's see which one it was.
print('adsorbate:', images[0].get_chemical_symbols()[12])

# What was the site?
print('site:', np.array(images[0].get_chemical_symbols())[images[0].subsets['site_atoms']])

It was an H* sitting on an Ag-Ag bridge site.

As a user, you can always attach this information yourself instead of relying on `autogen_info`, if you prefer. There are various reasons why `autogen_info` may not always identify the subsets accurately.

`autogen_info` will respect any subsets already present.

Furthermore, `autogen_info` builds the subsets using information from a connectivity matrix that is stored in `atoms.connectivity`. If the atoms object already has `atoms.connectivity`, that will be kept and used; otherwise a new one will be created using default cutoffs for neighbor distances.

In [None]:
# Let's look at a connectivity matrix.
images[0].connectivity

Note that there are some 2's in there. Those are a result of the small unit cell size, where atoms can connect to neighbors in several neighboring unit cells.
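
The effect can be reproduced in a one-dimensional toy model: in a short periodic cell, an atom can lie within the cutoff of more than one periodic image of the same neighbor. The sketch below is only an illustration with made-up numbers; `count_periodic_bonds` is a hypothetical helper, not part of CatLearn or ASE.

```python
def count_periodic_bonds(x_i, x_j, cell_length, cutoff, n_images=2):
    """Count periodic images of atom j lying within `cutoff` of atom i."""
    count = 0
    for shift in range(-n_images, n_images + 1):
        distance = abs(x_j + shift * cell_length - x_i)
        if 0 < distance < cutoff:
            count += 1
    return count


# Two atoms in a 1D cell of length 2.0, with bonds counted up to 1.2:
# atom j is in range both directly and through one periodic image.
small_cell = count_periodic_bonds(0.0, 1.0, cell_length=2.0, cutoff=1.2)

# In a larger cell of length 4.0 only the direct neighbor is in range.
large_cell = count_periodic_bonds(0.0, 1.0, cell_length=4.0, cutoff=1.2)

print(small_cell, large_cell)  # prints: 2 1
```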

# ASE Database interface
The ASE database is a very useful format for small to medium-sized sets (up to around 100,000) of atomic structures. Here we will create an ASE db file and redo the import from the db.

In [None]:
import ase.db
from catlearn.api.ase_atoms_api import database_to_list
from catlearn.fingerprint.adsorbate_prep import autogen_info

In [None]:
# Create a new ASE db, removing any file left over from a previous run.
fname = 'ads_example.db'
if os.path.exists(fname):
    os.remove(fname)
c = ase.db.connect(fname)

# Write our atoms objects to the ASE db.
for atoms in images:
    symbols = atoms.get_chemical_symbols()
    species = symbols[atoms.subsets['ads_atoms'][0]]
    name = symbols[atoms.subsets['slab_atoms'][0]]
    c.write(atoms,
            # Recommended keys for CatLearn.
            species=species,
            # Recommended keys for CatMAP compatibility.
            name=name,
            facet='(111)', n=1, crystal='fcc', supercell='2x2', layers=3, surf_lattice='hexagonal')

In [None]:
# Import data.
images = database_to_list(fname)

From here you can run `autogen_info` and the following workflow, as presented before.

# Analyse bond distances, check cutoffs

This analysis must be done on optimized structures, but here we show a toy example using the dataset introduced previously in this tutorial.

In the following, we will plot pair distribution functions (pdf) over our dataset or a subset of it. This is necessary to convince ourselves that we can rely on connectivities to fingerprint the atomic structures.

In [None]:
from ase.data import covalent_radii
from catlearn.utilities.distribution import pair_distribution, pair_deviation

### Pair distribution function
The pair distribution function is a histogram over distances between the atoms in our dataset. The pdf utility in CatLearn can optionally restrict the analysis to one or two elements.
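
The idea can be sketched in a few lines of plain Python: compute all pairwise distances and bin them. This toy version ignores periodic boundary conditions, element selection and normalization, so it should not be read as what `pair_distribution` does internally; `pair_histogram` is a hypothetical helper and the coordinates are made up.

```python
import math
from itertools import combinations


def pair_histogram(positions, bins, bounds):
    """Histogram of all pairwise distances that fall inside `bounds`."""
    low, high = bounds
    width = (high - low) / bins
    counts = [0] * bins
    for p, q in combinations(positions, 2):
        d = math.dist(p, q)
        if low <= d < high:
            counts[int((d - low) / width)] += 1
    centers = [low + (k + 0.5) * width for k in range(bins)]
    return counts, centers


# Four atoms on the corners of a unit square.
positions = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
counts, centers = pair_histogram(positions, bins=8, bounds=(0.0, 2.0))
print(counts)  # prints: [0, 0, 0, 0, 4, 2, 0, 0]
```

The four sides of the square fall in the bin starting at 1.0, and the two diagonals (about 1.41) fall in the next bin.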

In [None]:
# int for bonds between a single element and all other atoms. 
# tuple (A, B) for bonds between A and B only.
element = 6

images_subset = [a for a in images if element in a.numbers]

# Generate pdf.
pdf, x = pair_distribution(images_subset, bins=257, bounds=(0.3, 3.), element=element)

In [None]:
# Plot pdf.
plt.plot(x, pdf)
plt.xlabel('$r$ [$10^{-10}$ m]')

The pdf does not directly show us the appropriate cutoff unless we select a specific pair of elements to count bonds between.

### Set cutoffs

Let's set some cutoffs manually and evaluate them.

In [None]:
cutoff_dictionary = {}
for z, s in enumerate(chemical_symbols[:104]):
    if z == 0:
        continue
    elif s in adsorbates and z != 1:
        radius = covalent_radii[z] * 1.1 + 0.1
    else:
        radius = get_radius(z) * 1.1 + 0.1
    cutoff_dictionary[z] = radius

In [None]:
# int for bonds between a single element and all other atoms. 
# tuple (A, B) for bonds between A and B only.
element = (78, 1)

images_subset = [a for a in images if element[0] in a.numbers and element[1] in a.numbers]

# Generate pdf.
pdf, x = pair_distribution(images_subset, bins=257, bounds=(0.3, 3.), element=element)

In [None]:
# Plot pdf.
plt.plot(x, pdf)

# Print and plot bond length.
bond = 0.
if isinstance(element, int):
    print(chemical_symbols[element] + ' cutoff radius', cutoff_dictionary[element])
elif isinstance(element, tuple):
    for z in element:
        print(chemical_symbols[z] + ' cutoff radius', cutoff_dictionary[z])
        bond += cutoff_dictionary[z]
    print('bond cutoff', bond)
plt.axvline(bond, color='0.5')

# Axis label.
plt.xlabel('$r$ [$10^{-10}$ m]')

If the line lies after the first peak and clear of any other peaks, the cutoff will cleanly distinguish first nearest neighbors.

### Check cutoffs

When our dataset has a larger number of elements, we don't really want to evaluate every pair of elements as shown above. We can instead plot a histogram of bond distances, where the element specific cutoff radii have been subtracted, $r - (r_a + r_b)$.
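
The quantity being histogrammed can be sketched as follows: for every pair of atoms, subtract the sum of the two cutoff radii from the distance. The example ignores periodic boundary conditions and uses made-up cutoffs and coordinates; `bond_deviations` is a hypothetical helper, not the CatLearn `pair_deviation` function.

```python
import math
from itertools import combinations


def bond_deviations(atoms, cutoffs):
    """Return r - (r_a + r_b) for every pair of (atomic_number, position)."""
    deviations = []
    for (z_a, p_a), (z_b, p_b) in combinations(atoms, 2):
        r = math.dist(p_a, p_b)
        deviations.append(r - (cutoffs[z_a] + cutoffs[z_b]))
    return deviations


# Made-up cutoff radii in angstrom, keyed by atomic number.
cutoffs = {1: 0.4, 78: 1.6}
# A toy Pt-Pt-H arrangement with made-up coordinates.
atoms = [(78, (0.0, 0.0, 0.0)), (78, (2.8, 0.0, 0.0)), (1, (1.4, 0.0, 1.0))]

deviations = bond_deviations(atoms, cutoffs)
# Negative deviations are pairs closer than the combined cutoff.
bonded = [d < 0 for d in deviations]
print([round(d, 3) for d in deviations], bonded)
```

Pairs with a deviation close to zero are the ambiguous ones, since a small change in geometry or cutoff would flip whether they count as connected.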

In [None]:
deviation, xd = pair_deviation(images, bins=257, bounds=(-.5, 0.5), cutoffs=cutoff_dictionary)

In [None]:
plt.plot(xd, deviation)
plt.xlabel('$r - (r_a + r_b)$ [$10^{-10}$ m]')

If the distribution is 0 where $r = r_a + r_b$, we can unambiguously represent structures by their connectivity. If the distribution is not 0 at $r = r_a + r_b$, we may be able to tune our cutoff radii to obtain more accurate connectivities, depending on the dataset.