# Lesson 2: Generating descriptors for machine learning

In this lesson, we will learn how to generate machine-learning descriptors from pymatgen materials objects. First, we'll generate some descriptors with matminer's "featurizer" classes. Next, we'll use some of what we learned about dataframes in the previous section to examine our descriptors and prepare them for input to machine learning models.


<img src="resources/featurizers_overview.png" alt="featurizers overview" style="width: 700px;"/>

### Featurizers transform materials primitives into machine-learnable features

The general idea of featurizers is that they accept a materials primitive (e.g., pymatgen Composition) and output a tensor. For example:


\begin{align}
f(\mathrm{Fe_2O_3}) \rightarrow [1.5, 7.8, 9.1, 0.09]
\end{align}

#### Matminer contains featurizers for the following pymatgen objects:
* Composition
* Structure
* Bandstructure
* DOS

#### Depending on the featurizer, the features returned may be:
* numerical, categorical, or mixed vectors
* matrices 
* other pymatgen objects (for further processing)

### Featurizers play nice with dataframes
Since most of the time we are working with pandas dataframes, all featurizers work natively with pandas dataframes. We'll provide examples of this later in the lesson.


### Featurizers present in matminer
Matminer hosts over 60 featurizers, most of which implement methods published in peer-reviewed papers. You can find a full list of featurizers on the [matminer website](https://hackingmaterials.lbl.gov/matminer/featurizer_summary.html). All featurizers have parallelization and convenient error tolerance built into their core methods.

In this lesson, we'll go over the main methods present in all featurizers. By the end of this unit, you will be able to generate descriptors for a wide range of materials informatics problems using one common software interface.

## Part 1: The "featurize" method and basics

### 1.1 The "featurize" method
The core method of any matminer featurizer is "featurize". This method accepts a materials object and returns a machine learning vector or matrix. Let's see an example on a pymatgen composition:

In [4]:
# Recent pymatgen releases no longer expose Composition at the package root
from pymatgen.core import Composition

fe2o3 = Composition("Fe2O3")

As a trivial example, we'll get the element fractions with the `ElementFraction` featurizer.

In [5]:
from matminer.featurizers.composition import ElementFraction

ef = ElementFraction()

Now we can featurize our composition.

In [13]:
element_fractions = ef.featurize(fe2o3)

print(element_fractions)

[0, 0, 0, 0, 0, 0, 0, 0.6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
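To see where the two non-zero entries come from, here is a minimal hand-rolled sketch of the same calculation. It is not matminer's implementation: the helper name `element_fractions_sketch` is hypothetical, and it only handles flat formulas such as `Fe2O3` (no parentheses or fractional amounts).

```python
import re
from collections import Counter


def element_fractions_sketch(formula):
    """Return {element symbol: atomic fraction} for a flat formula like 'Fe2O3'."""
    counts = Counter()
    # Each match is an element symbol followed by an optional integer amount.
    for symbol, amount in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] += int(amount) if amount else 1
    total = sum(counts.values())
    return {symbol: n / total for symbol, n in counts.items()}


fractions = element_fractions_sketch("Fe2O3")
print(fractions)
```

Fe contributes 2 of the 5 atoms and O contributes 3 of 5, which gives the 0.4 and 0.6 seen in the featurizer output. `ElementFraction` places these values in a fixed-length vector with one slot per element, so every other slot is zero.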


### 1.2 Feature labels

We've managed to generate features for learning, but what do they mean? One way to check is by reading the `Features` section in the documentation of any featurizer... but a much easier way is to use the `feature_labels` method.

In [17]:
element_fraction_labels = ef.feature_labels()
print(element_fraction_labels)

['H', 'He', 'Li', 'Be', 'B', 'C', 'N', 'O', 'F', 'Ne', 'Na', 'Mg', 'Al', 'Si', 'P', 'S', 'Cl', 'Ar', 'K', 'Ca', 'Sc', 'Ti', 'V', 'Cr', 'Mn', 'Fe', 'Co', 'Ni', 'Cu', 'Zn', 'Ga', 'Ge', 'As', 'Se', 'Br', 'Kr', 'Rb', 'Sr', 'Y', 'Zr', 'Nb', 'Mo', 'Tc', 'Ru', 'Rh', 'Pd', 'Ag', 'Cd', 'In', 'Sn', 'Sb', 'Te', 'I', 'Xe', 'Cs', 'Ba', 'La', 'Ce', 'Pr', 'Nd', 'Pm', 'Sm', 'Eu', 'Gd', 'Tb', 'Dy', 'Ho', 'Er', 'Tm', 'Yb', 'Lu', 'Hf', 'Ta', 'W', 'Re', 'Os', 'Ir', 'Pt', 'Au', 'Hg', 'Tl', 'Pb', 'Bi', 'Po', 'At', 'Rn', 'Fr', 'Ra', 'Ac', 'Th', 'Pa', 'U', 'Np', 'Pu', 'Am', 'Cm', 'Bk', 'Cf', 'Es', 'Fm', 'Md', 'No', 'Lr']


The labels are listed in the same order as the features, so a label and its value share the same index.

In [19]:
print(element_fraction_labels[7], element_fractions[7])
print(element_fraction_labels[25], element_fractions[25])

O 0.6
Fe 0.4
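Looking up indices by hand gets tedious for long vectors. As a small sketch, we can pair labels with values and keep only the non-zero entries. The lists below are stand-ins that copy the first 26 labels and values from the output above, so the snippet runs without matminer.

```python
# Stand-in data: the first 26 ElementFraction labels (H through Fe)
# and the matching feature values for Fe2O3, copied from the output above.
labels = ["H", "He", "Li", "Be", "B", "C", "N", "O", "F", "Ne",
          "Na", "Mg", "Al", "Si", "P", "S", "Cl", "Ar", "K", "Ca",
          "Sc", "Ti", "V", "Cr", "Mn", "Fe"]
values = [0] * len(labels)
values[7] = 0.6   # O
values[25] = 0.4  # Fe

# Pair each label with its value and drop the zero entries.
nonzero = {label: value for label, value in zip(labels, values) if value != 0}
print(nonzero)
```

In the notebook, the same dictionary comprehension should work with `element_fraction_labels` and `element_fractions` in place of the stand-in lists.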


We can use similar syntax for other kinds of featurizers. Let's get MagPie statistics on our composition using the `magpie` preset of the `ElementProperty` featurizer.

Commonly used forms of some featurizers can be instantiated with the `from_preset` method for quick setup.

In [8]:
from matminer.featurizers.composition import ElementProperty

ep_magpie = ElementProperty.from_preset("magpie")

We now use `featurize` in the same way as we did for `ElementFraction`.

In [9]:
magpie_stats = ep_magpie.featurize(fe2o3)

print(magpie_stats)

[8.0, 26.0, 18.0, 15.2, 8.64, 8.0, 55.0, 87.0, 32.0, 74.2, 15.36, 87.0, 15.9994, 55.845, 39.8456, 31.93764, 19.125887999999996, 15.9994, 54.8, 1811.0, 1756.2, 757.28, 842.976, 54.8, 8.0, 16.0, 8.0, 12.8, 3.84, 16.0, 2.0, 4.0, 2.0, 2.8, 0.96, 2.0, 66.0, 132.0, 66.0, 92.4, 31.68, 66.0, 1.83, 3.44, 1.6099999999999999, 2.7960000000000003, 0.7727999999999999, 3.44, 2.0, 2.0, 0.0, 2.0, 0.0, 2.0, 0.0, 4.0, 4.0, 2.4, 1.9200000000000004, 4.0, 0.0, 6.0, 6.0, 2.4, 2.88, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 6.0, 8.0, 2.0, 6.8, 0.96, 6.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 2.0, 1.2, 0.9600000000000002, 2.0, 0.0, 4.0, 4.0, 1.6, 1.9200000000000004, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 4.0, 2.0, 2.8, 0.96, 2.0, 9.105, 10.73, 1.625, 9.755, 0.78, 9.105, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.1106628, 2.1106628, 0.84426512, 1.0131181439999999, 0.0, 12.0, 229.0, 217.0, 98.8, 104.16, 12.0]
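These numbers are easier to read once we know how they are built. To our understanding, the `magpie` preset computes six statistics (minimum, maximum, range, mean, average deviation, mode) for each elemental property, starting with atomic number. The sketch below recomputes those six statistics by hand for the atomic numbers in Fe2O3. It is an illustration under that assumption, not matminer's own code, and calling `ep_magpie.feature_labels()` is the authoritative way to confirm what each entry means.

```python
# Atomic fractions of Fe2O3 and the atomic number of each element.
fractions = {"Fe": 0.4, "O": 0.6}
atomic_number = {"Fe": 26, "O": 8}

elements = list(fractions)
values = [atomic_number[el] for el in elements]
weights = [fractions[el] for el in elements]

minimum = min(values)
maximum = max(values)
value_range = maximum - minimum
# Mean and average deviation are weighted by atomic fraction.
mean = sum(w * v for w, v in zip(weights, values))
avg_dev = sum(w * abs(v - mean) for w, v in zip(weights, values))
# Mode: the property value of the most abundant element.
mode = atomic_number[max(elements, key=fractions.get)]

atomic_number_stats = [minimum, maximum, value_range, mean, avg_dev, mode]
print([round(x, 2) for x in atomic_number_stats])
```

Compare the printed list with the first six entries of `magpie_stats` above. The remaining entries repeat the same six statistics for other elemental properties.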


In [None]:
df.columns

A pandas `DataFrame` includes a method called `describe()` that computes summary statistics for its columns. Note that `describe()` only summarizes numerical columns by default.

Sometimes, the `describe()` function will reveal outliers that indicate mistakes in the data.

In [None]:
df.describe()
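As a small sketch of how `describe()` can flag a suspicious value, consider the toy `DataFrame` below. The name `toy_df`, the placeholder formulas, and all numbers are invented for illustration and do not come from the lesson's dataset.

```python
import pandas as pd

# Invented placeholder data; the last volume is a deliberate outlier.
toy_df = pd.DataFrame({
    "formula": ["AB", "A2B", "AB3", "A2B3"],
    "band_gap": [5.0, 4.5, 0.6, 0.2],
    "volume": [45.0, 19.0, 40.0, 4700.0],
})

# Only the numerical columns are summarized by default.
summary = toy_df.describe()
print(summary)

# include="all" also summarizes non-numerical columns such as "formula".
full_summary = toy_df.describe(include="all")
print(full_summary)
```

In the first summary, the `max` of `volume` sits far above the 75% quantile, which is the kind of gap that suggests checking the underlying entry.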

### Indexing the dataset

We can access a particular column of `DataFrame` by indexing the object using the column name. For example:

In [None]:
df["formula"]

Alternatively, we can access a particular row of a `DataFrame` using the `iloc` attribute.

In [None]:
df.iloc[100]
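Note that `iloc` selects by integer position, while the related `loc` attribute selects by index label. The two only agree when the index happens to be 0, 1, 2, and so on. A short sketch with an invented `toy_df` whose index starts at 10:

```python
import pandas as pd

# Invented placeholder data with a non-default index.
toy_df = pd.DataFrame(
    {"formula": ["AB", "A2B", "AB3"], "volume": [10.0, 20.0, 30.0]},
    index=[10, 11, 12],
)

row_by_position = toy_df.iloc[0]  # first row, whatever its label is
row_by_label = toy_df.loc[10]     # row whose index label is 10

print(row_by_position)
print(row_by_label)
```

Here both lookups return the same row, but `toy_df.iloc[10]` would raise an `IndexError` because the frame has only three rows.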

### Filtering the dataset

Pandas `DataFrame` objects make it very easy to filter the data based on a specific column. We can use the typical Python comparison operators (`==`, `>`, `>=`, `<`, etc.) to filter numerical values. For example, let's find all entries where the cell volume is greater than 580. We do this by filtering on the `volume` column.

Note that we first produce a *boolean mask* – a series of `True` and `False` depending on the comparison. We can then use the mask to filter the `DataFrame`. 

In [None]:
mask = df["volume"] > 580
df[mask]

We can use this method of filtering to clean our dataset. For example, if we only wanted our dataset to include semiconductors (materials with a non-zero band gap), we can do this easily by filtering the `band_gap` column.

In [None]:
# Index df with the boolean mask to keep only the rows with a non-zero band gap
semiconductor_df = df[df["band_gap"] > 0]
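It is easy to confuse the mask with the filtered result, so here is a small sketch that keeps the two apart and also combines two conditions. The `toy_df` values are invented for illustration.

```python
import pandas as pd

# Invented placeholder data.
toy_df = pd.DataFrame({
    "formula": ["AB", "A2B", "AB3", "A2B3"],
    "band_gap": [0.0, 1.2, 3.4, 0.0],
    "volume": [600.0, 300.0, 700.0, 100.0],
})

# The comparison alone gives a boolean Series, one entry per row.
mask = toy_df["volume"] > 580
print(mask.tolist())

# Indexing with the mask gives the filtered DataFrame.
large = toy_df[mask]

# Combine conditions with & (and) or | (or); each condition needs parentheses.
large_semiconductors = toy_df[(toy_df["volume"] > 580) & (toy_df["band_gap"] > 0)]
print(large_semiconductors)
```

The parentheses matter because `&` binds more tightly than the comparison operators in Python.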

Often, a dataset contains many additional columns that are not necessary for machine learning. Before we can train a model on the data, we need to remove any extraneous columns. We can remove whole columns from the dataset using the `drop()` function. This function can be used to drop both rows and columns.

The function takes a list of items to drop. For columns, these are the column names; for rows, they are the row index labels. Finally, the `axis` option specifies whether the data to drop is columns (`1`) or rows (`0`).

For example, to remove the `nsites`, `space_group`, `e_electronic`, and `e_total` columns, we can run: 

In [None]:
cleaned_df = df.drop(["nsites", "space_group", "e_electronic", "e_total"],
                     axis=1)

Let's examine the cleaned `DataFrame` to see that the columns have been removed.

In [None]:
cleaned_df.head()
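As a brief sketch, `drop()` also accepts the `columns=` and `index=` keywords, which some readers find clearer than `axis`. Either way, it returns a new `DataFrame` and leaves the original untouched. The `toy_df` below is invented for illustration.

```python
import pandas as pd

# Invented placeholder data.
toy_df = pd.DataFrame({
    "formula": ["AB", "A2B"],
    "nsites": [2, 3],
    "volume": [10.0, 20.0],
})

# Equivalent to toy_df.drop(["nsites"], axis=1).
without_nsites = toy_df.drop(columns=["nsites"])

# Equivalent to toy_df.drop([0], axis=0): drops the row labelled 0.
without_first_row = toy_df.drop(index=[0])

print(without_nsites.columns.tolist())
print(toy_df.columns.tolist())  # the original still has all three columns
```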

### Generating new columns

Pandas `DataFrame` objects also make it easy to perform simple calculations on the data. Think of this as using formulas in Excel spreadsheets. All fundamental Python math operators (such as +, -, /, and \*) can be used. 

For example, the dielectric dataset contains the electronic contribution to the dielectric constant ($\epsilon_\mathrm{electronic}$, in the `poly_electronic` column) and the total (static) dielectric constant ($\epsilon_\mathrm{total}$, in the `poly_total` column). The ionic contribution to the dielectric constant is given by:

$$
\epsilon_\mathrm{ionic} = \epsilon_\mathrm{total} - \epsilon_\mathrm{electronic}
$$

Below, we calculate the ionic contribution to the dielectric constant and store it in a new column called `poly_ionic`. This is as simple as assigning the data to the new column, even if the column doesn't already exist.

In [None]:
df["poly_ionic"] = df["poly_total"] - df["poly_electronic"]

Let's check the new data was added correctly.

In [None]:
df.head()
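Beyond eyeballing `head()`, a quick programmatic check can confirm the new column. The sketch below uses an invented two-row `toy_df` with the same column names so that it runs on its own.

```python
import pandas as pd

# Invented placeholder values, chosen only to make the arithmetic easy to follow.
toy_df = pd.DataFrame({
    "poly_electronic": [2.0, 5.5],
    "poly_total": [6.0, 9.0],
})

# Column arithmetic is applied row by row.
toy_df["poly_ionic"] = toy_df["poly_total"] - toy_df["poly_electronic"]

# Sanity check: the two contributions should add back up to the total.
recombined = toy_df["poly_electronic"] + toy_df["poly_ionic"]
print((recombined == toy_df["poly_total"]).all())
print(toy_df)
```

With real floating-point data, an exact `==` comparison may be too strict, so a tolerance-based check such as `numpy.isclose` is usually the safer choice.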

## Let's practice!

Now, let's practice. You'll download a dataset, inspect it, and make sure it is ready to be used for machine learning.