Visualizing Perovskite Dataset
==============================

**Author:** Panayotis Manganaris



## Dependencies



In [None]:
# featurization
import cmcl
from cmcl import Categories

In [None]:
# data tools
import pandas as pd
import numpy as np
# preprocessing
from sklearn.preprocessing import Normalizer, StandardScaler
# visualization
import matplotlib.pyplot as plt
import seaborn as sns

## Load data



Load the stored targets and the reference table of constituent properties.



In [None]:
my = pd.read_csv("./mannodi_data.csv").set_index(["index", "Formula", "sim_cell"])
lookup = pd.read_csv("./constituent_properties.csv").set_index("Formula")

## Compute features



cmcl provides an "ft" (feature) pandas DataFrame accessor. This
accessor exposes batch feature extraction tools. The function ft.comp
extracts composition vectors from the formula string in a dataframe
(or dataframe index).

The abx function of the collect accessor is a convenience function for
grouping the resulting composition constituents by site membership.
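
To make the idea of a composition vector concrete, the sketch below parses a formula string into a mapping from constituent to amount using only the standard library. It is not cmcl's implementation: the helper name `parse_formula` and the token list `CONSTITUENTS` are assumptions made for illustration, and organic cations such as MA and FA are treated as single tokens.

```python
import re

# Hypothetical token list (an assumption, not taken from cmcl).
# Two-letter tokens come before one-letter tokens so the longest name wins.
CONSTITUENTS = ["MA", "FA", "Cs", "Rb", "Ba", "Sr", "Ca",
                "Ge", "Sn", "Pb", "Cl", "Br", "K", "I"]
TOKEN = re.compile("(" + "|".join(CONSTITUENTS) + r")(\d*\.?\d*)")

def parse_formula(formula):
    """Return {constituent: amount}; a missing amount counts as 1."""
    comp = {}
    for name, amount in TOKEN.findall(formula):
        comp[name] = comp.get(name, 0.0) + (float(amount) if amount else 1.0)
    return comp

print(parse_formula("MAPbI3"))
print(parse_formula("Cs0.5Rb0.5SnBr3"))
```

Stacking such mappings row by row, with missing constituents left empty, gives a table of the same general shape as the composition vectors used here.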



In [None]:
mc = my.ft.comp() # compute numerical composition vectors from strings
mc = mc.collect.abx() # convenient site groupings for perovskites data

These site groupings can be used with the logif tool in Categories to
quickly categorize records by their mix status. That status is appended
to the index of the respective tables for further reference.



In [None]:
mixlog = mc.T.groupby(level=0).count().T # constituents present per site; avoids groupby(axis=1), deprecated in pandas >= 2.1
mix = mixlog.pipe(Categories.logif, condition=lambda x: x>1, default="pure", catstring="and")
mc = mc.assign(mix=mix).set_index("mix", append=True)
my = my.assign(mix=mix).set_index("mix", append=True)
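
The categorization step can be illustrated in plain pandas. The sketch below assumes that logif joins, for each row, the labels of the columns where the condition holds using catstring, and falls back to default otherwise. That behavior is inferred from the call above, not from the cmcl source. The helper `label_mix` and the toy counts are hypothetical.

```python
import pandas as pd

# Toy per-site constituent counts (made-up values for illustration).
counts = pd.DataFrame(
    {"A": [1, 2, 1], "B": [1, 1, 2], "X": [1, 1, 2]},
    index=["rec0", "rec1", "rec2"],
)

def label_mix(row, condition=lambda x: x > 1, default="pure", catstring="and"):
    """Join the names of the sites that satisfy `condition`, else `default`."""
    hits = [site for site, n in row.items() if condition(n)]
    return catstring.join(hits) if hits else default

# One label per record, computed row by row.
mix_demo = counts.apply(label_mix, axis=1)
print(mix_demo)
```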

The derive_from function can be used to compute the site-averaged
properties of each record.

It performs a three-way N-to-N table join, takes a weighted
average of any resulting redundant entries, and finally reshapes the
results to be consistent with the outermost indices of the accessed
data frame. Hence, to obtain the site-averaged properties, the
composition table column labels must be grouped first, as above.



In [None]:
mp = mc.ft.derive_from(lookup, "element", "Formula")
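
As a rough picture of what a site-averaged property is, the sketch below computes a fraction-weighted mean of one constituent property per site with plain pandas. It is a simplified stand-in, not the derive_from implementation. The property name `radius`, the table `lookup_demo`, and all numbers are invented for illustration.

```python
import pandas as pd

# Toy composition table with (site, element) column labels.
comp = pd.DataFrame(
    [[0.5, 0.5, 1.0, 3.0], [1.0, 0.0, 1.0, 3.0]],
    columns=pd.MultiIndex.from_tuples(
        [("A", "Cs"), ("A", "Rb"), ("B", "Pb"), ("X", "I")],
        names=["site", "element"],
    ),
)
# Hypothetical lookup: one property value per element (made-up numbers).
lookup_demo = pd.Series({"Cs": 2.0, "Rb": 1.0, "Pb": 4.0, "I": 3.0}, name="radius")

# Align the property values with the element level of the columns.
elements = comp.columns.get_level_values("element")
weights = lookup_demo.reindex(elements).to_numpy()

# Weighted sum per site divided by the total amount on that site.
weighted = (comp * weights).T.groupby(level="site").sum().T
totals = comp.T.groupby(level="site").sum().T
site_avg = weighted / totals
print(site_avg)
```

The grouping by the outer column level is the reason the composition columns need site labels before any averaging can happen.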

## Target space



The main target of interest in this exercise is the Perovskite band
gap.

A number of properties including band gap and dielectric constant have
been collected from DFT computations using the PBE functional.

These properties are targets for modeling. Ideally, an empirical model
can be found that fits to the underlying quantum mechanics, thereby
acting as a surrogate for the DFT function in an active learning
strategy which can quickly recommend compositions as high-performing
candidates for DFT calculation.

The target space is briefly summarized in both uni-variate and bi-variate views.

Note: there are an additional two "SLME" properties in the dataset
(not shown here) that have extremely strong relations with band
gap. They are computed using the band gap and a reference
solar-absorption spectrum. cmcl does not yet have a utility for
computing them, so they are included in the sample data for reference.



In [None]:
plt.style.use('default')
p = sns.pairplot(my.filter(regex=r"PBE|dielc").drop("PBE_dbg_eV", axis=1).assign(mix=mix), hue='mix')
p.figure.show()

## Feature space



### Composition Distributions



Composition vectors are a set of primary descriptors for the
perovskites being examined; most other meaningful features are at
least partially derived from them. Another primary descriptor is the
crystal structure. For now, it is understood that the 496 records
being examined are all cubic perovskites (within a tolerance). They
differ firstly in composition and secondly in alloy character. Alloy
character as a metric is completely encapsulated in the composition
vectors, but nonetheless represents an important consideration in
ensuring the model's generality.

A goal of modeling is to create regressions that can extrapolate
targets between the existing alloy character classes
(A-, B-, and X-site alloys).

Here, uni-variate distributions over finite bounds on composition
ratios are explored with respect to the alloy class.



In [None]:
nmc = pd.melt(
    pd.DataFrame(
        mc.fillna(0).pipe(Normalizer(norm="l1").fit_transform), #normalizing the data by each vector's manhattan length gives proportional quantities
        columns=mc.columns,
        index=mc.index).assign(mix=mix),
    id_vars="mix").replace(0, np.nan).dropna() # eliminate the "zeros" (missing values) to focus on the meaningful data
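
The effect of the L1 normalization can be checked by hand. The sketch below uses NumPy alone on made-up composition rows and divides each row by its Manhattan length, which is what Normalizer(norm="l1") does for rows with a nonzero sum. The array `amounts` is illustrative and not taken from the dataset.

```python
import numpy as np

# Made-up composition rows: amounts of four constituents per record.
amounts = np.array([[1.0, 0.0, 1.0, 3.0],
                    [0.5, 0.5, 1.0, 3.0]])

# Manhattan (L1) length of each row, kept 2-D so it broadcasts row-wise.
l1 = np.abs(amounts).sum(axis=1, keepdims=True)
proportions = amounts / l1

print(proportions)
print(proportions.sum(axis=1))  # every row now sums to 1
```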

In [None]:
with sns.plotting_context("poster"):
    p = sns.catplot(x="value", col="element", data=nmc, col_wrap=5, kind="count", hue="mix",
                    col_order=["Ba", "Ge", "Cl", "Br", "I", "Sn", "Pb", "Cs", "FA", "MA", "Sr", "Ca", "Rb", "K"])
    p.set_xticklabels(rotation=90)

### Site-Averaged Properties Distributions



In [None]:
dxr = pd.IndexSlice
some_axes = mp.loc[:, dxr[:, mp.columns.get_level_values(1)[0:4]]] #change these level value slices to focus on different site axes or remove slicing to see all
smp = pd.melt(
    pd.DataFrame(
        some_axes.pipe(StandardScaler().fit_transform), #Z transform scales dimensions so they are comparable
        columns=some_axes.columns,
        index=some_axes.index).assign(mix=mix),
    id_vars="mix").replace(0, np.nan).dropna() # eliminate "zeros" (missing values) to focus on the meaningful data
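
For reference, the Z transform subtracts each column's mean and divides by its standard deviation. The sketch below reproduces this with NumPy on invented values, using the population standard deviation (ddof=0), which is the convention StandardScaler follows.

```python
import numpy as np

# Invented property table: rows are records, columns are properties
# on very different scales.
values = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 60.0]])

# Column-wise standardization; np.std defaults to ddof=0.
z = (values - values.mean(axis=0)) / values.std(axis=0)

print(z)
print(z.mean(axis=0), z.std(axis=0))  # approximately 0 and 1 per column
```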

In [None]:
with sns.plotting_context("notebook"):
    p = sns.displot(x="value", col=smp.iloc[:,2], row="site", data=smp, kind="hist", hue="mix", multiple='stack')

## Bi-variate relations



It is unlikely that any of the targets is fully explained by a single
composition or composition-derived axis. But there are probably
relations.

A Pearson correlation map will be produced to check for strong
relations.

Then, if any exist, they will be plotted in detail.
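
As a reminder of what the map contains, the Pearson coefficient between two columns is their covariance divided by the product of their standard deviations. The sketch below computes it by hand on synthetic data and compares the result with np.corrcoef called with rowvar=False, so that columns are treated as variables. The helper `pearson_r` and the data are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.8 * x + rng.normal(scale=0.5, size=200)  # synthetic, correlated with x
table = np.column_stack([x, y])                # columns are variables

def pearson_r(a, b):
    """Pearson correlation of two 1-D arrays from centered sums."""
    a0, b0 = a - a.mean(), b - b.mean()
    return (a0 * b0).sum() / np.sqrt((a0 ** 2).sum() * (b0 ** 2).sum())

corr = np.corrcoef(table, rowvar=False)        # 2x2 correlation matrix
print(pearson_r(x, y), corr[0, 1])
```

Values near +1 or -1 indicate a strong linear relation; values near 0 indicate none, although nonlinear relations can still be present.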



### Targets vs composition



In [None]:
mc_v_targets = pd.concat([my, mc], axis=1).select_dtypes(np.number).fillna(0)
pearson = pd.DataFrame(np.corrcoef(mc_v_targets, rowvar=False),
                       columns=mc_v_targets.columns,
                       index=mc_v_targets.columns)
subset = pearson.filter(regex=r"PBE|dielc|SLME", axis=0).filter(regex=r"^(?!PBE|HSE|SLME|dielc|PV_FOM)")
#first filter picks targets, second selects bases
p = sns.heatmap(subset, vmax=1.0, vmin=-1.0, cmap="seismic")
p.set_xticklabels(p.get_xticklabels(), rotation=45, horizontalalignment='right')
p.figure.show()

### Targets vs site-averaged properties



In [None]:
mp_v_targets = pd.concat([my, mp], axis=1).select_dtypes(np.number).fillna(0)
pearson = pd.DataFrame(np.corrcoef(mp_v_targets, rowvar=False),
                       columns=mp_v_targets.columns,
                       index=mp_v_targets.columns)
subset = pearson.filter(regex=r"PBE|dielc|SLME", axis=0).filter(regex=r"^(?!PBE|HSE|SLME|dielc|PV_FOM)")
#first filter picks targets, second selects bases
p = sns.heatmap(subset, vmax=1.0, vmin=-1.0, cmap="seismic")
p.set_xticklabels(p.get_xticklabels(), rotation=45, horizontalalignment='right')
p.figure.show()

### Correlated axes



## Multivariate relations



Unsurprisingly, no simple explanations exist. To get a better idea of
what structures statistical models might be able to find in the
complete dataset, the structure and effects of many variables at a
time must be inspected.

