# Composition Based Feature Vector

Since most materials informatics models are trained on small amounts of data, we rely on feature engineering to inject domain knowledge into our materials representations. One of the simplest ways to do so is through the mighty Composition Based Feature Vector (CBFV).

My research group created the `CBFV` package to make it super easy to build composition based feature vectors! In order to follow along with the in-class demo (https://github.com/sp8rks/MaterialsInformatics/blob/main/worked_examples/CBFV_example/CBFV_example.ipynb) you will need to do the following:
(1) open miniconda
(2) activate your MatInformatics Python env with `conda activate MatInformatics`
(3) install the CBFV package with `pip install CBFV` (read more at https://pypi.org/project/CBFV/)

#### Video

https://www.youtube.com/watch?v=JctWNNdI9Jc&list=PLL0SWcFqypCl4lrzk1dMWwTUrzQZFt7y0&index=11 (Composition-based feature vector)

## Setup

Let's start by creating some dummy data.

In [None]:
from CBFV import composition
import pandas as pd

data = [['Si1O2', 10], ['Al2O3', 15], ['Hf1C1Zr1', 14]]
#this next step is important!! The CBFV composition.generate_features() function 
#requires an input dataframe with a column named 'formula' and another column named 'target'
df = pd.DataFrame(data, columns=['formula', 'target'])

Now, let's do our simplest CBFV featurization and convert our data into a 'one hot encoding' vector.

In [None]:
X, y, formulae, skipped = composition.generate_features(df, elem_prop='onehot')

If we look at our input, the X variable, we'll see that the formula strings are now converted to numerical values that are suitable for machine learning models.

For our first representation, the avg columns hold the *fractional encoding* of the elements. For example, SiO2 is 2/3 oxygen and 1/3 silicon, so we see 0.66667 in the avg_8 position (atomic number 8, oxygen) and 0.33333 in the 14th position (atomic number 14, silicon).
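
The fractions themselves are easy to check by hand. Here is a minimal sketch of that arithmetic in plain Python. The helper `element_fractions` is a hypothetical name made up for this illustration and is not part of the `CBFV` package; it only handles simple formulas like the ones in our dummy data (no parentheses).

```python
import re

def element_fractions(formula):
    """Return {element symbol: atomic fraction} for a simple formula string.

    Hypothetical helper for illustration only; it is not a CBFV function.
    """
    # An element symbol is a capital letter plus an optional lowercase letter,
    # followed by an optional count (a missing count means 1).
    tokens = re.findall(r"([A-Z][a-z]?)(\d*\.?\d*)", formula)
    counts = {}
    for symbol, amount in tokens:
        counts[symbol] = counts.get(symbol, 0.0) + (float(amount) if amount else 1.0)
    total = sum(counts.values())
    return {symbol: n / total for symbol, n in counts.items()}

fractions = element_fractions("Si1O2")
# Oxygen should come out as 2/3 and silicon as 1/3
print(fractions)
```

Running the same helper on 'Hf1C1Zr1' should give 1/3 for each of the three elements.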

In [None]:
#TIP! Open up the X variable in the Data Wrangler extension to see all the columns since they are truncated below
print(X)


# Element property feature vectors
The one hot encoding is a super simple way to encode the formula, but it doesn't include any information about the actual chemistry beyond which elements are present and in what ratio. We know that other features should matter. For example, the melting point, ionic size, or number of valence electrons of the constituent elements should be useful in relating these materials to their properties.

Let's take a look at another featurization technique, the `magpie` feature vector, that encodes more chemical information beyond just one hot encoding.

Read more about `magpie` in the original article: https://doi.org/10.1038/npjcompumats.2016.28

Essentially, the feature vector is created by taking properties of the individual elements and then combining them according to each element's ratio in the chemical formula.
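
To make that idea concrete, here is a rough sketch for Al2O3 using a single elemental property, atomic mass. The variable names and the choice of statistics (average, min, max, range) are assumptions made for this illustration; the exact set of statistics and column names that `CBFV` produces is not reproduced here.

```python
# Rounded standard atomic masses in g/mol; any elemental property
# (melting point, electronegativity, ...) could be combined the same way.
atomic_mass = {"Al": 26.98, "O": 16.00}

# Al2O3 contains 2 Al atoms and 3 O atoms per formula unit
counts = {"Al": 2, "O": 3}
total = sum(counts.values())
fractions = {el: n / total for el, n in counts.items()}

# Fraction-weighted average of the property over the formula
avg_mass = sum(fractions[el] * atomic_mass[el] for el in counts)

# Simple unweighted statistics over the elements present
values = [atomic_mass[el] for el in counts]
min_mass = min(values)
max_mass = max(values)
range_mass = max_mass - min_mass

print(f"avg={avg_mass:.3f}, min={min_mass}, max={max_mass}, range={range_mass:.2f}")
```

Repeating this for every property in an element property table is what turns one short formula into a long feature vector.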

In [None]:
X, y, formulae, skipped = composition.generate_features(df, 
    elem_prop='magpie')
print(X)

There are several others too, including one of my favorites, `oliynyk`, which we named after Anton Oliynyk, a great chemist who put it together (https://hunter.cuny.edu/people/anton-oliynyk/). Another is `jarvis`, which came from the good folks at NIST. Read their article here: https://doi.org/10.1038/s41524-020-00440-1

In [None]:
X, y, formulae, skipped = composition.generate_features(df, 
    elem_prop='oliynyk')
print(X)

# Featurization based on scientific literature
There are also some really cool approaches for embedding domain knowledge. For example, `mat2vec` is a clever approach that uses a *natural language processing* tool known as word embeddings to create a feature vector based on scientific literature. You can read about it here: https://doi.org/10.1038/s41586-019-1335-8

In [None]:
X, y, formulae, skipped = composition.generate_features(df, 
    elem_prop='mat2vec')
print(X)

Looking at these representations, which can be hundreds of columns long, we see that a simple string like 'SiO2' has been turned into something rather complicated. These representations are less interpretable than a simple chemical formula, but they are now mathematical vectors that represent the materials with varying degrees of domain knowledge. In 2020 my group published a careful study that asked whether this domain knowledge was actually necessary or helpful in predicting materials properties. We essentially found that the domain knowledge does improve predictions, but as the amount of data increases this advantage slowly disappears.

You can read the article here https://doi.org/10.1007/s40192-020-00179-z

# Now you try it!
Generate a list of compounds you are interested in, look up their properties, and then featurize this data with your choice of feature set to create an X input and a y target label. Try adding a broken chemical formula that includes a symbol for an element that doesn't exist, and then see what you find in the `skipped` variable returned by the `composition.generate_features` function.