# Lesson 2: Generating descriptors for machine learning

In this lesson, we will learn how to generate machine-learning descriptors from materials objects in pymatgen. First, we'll generate some descriptors with matminer's "featurizer" classes. Next, we'll use some of what we learned about dataframes in the previous section to examine our descriptors and prepare them for input to machine learning models.


<img src="resources/featurizers_overview.png" alt="featurizers overview" style="width: 700px;"/>

### Featurizers transform materials primitives into machine-learnable math objects

The general idea of featurizers is that they accept a materials primitive (e.g., pymatgen Composition) and output a tensor. For example:


\begin{align}
f(\text{Fe2O3}) \rightarrow [1.5, 7.8, 9.1, 0.09]
\end{align}
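
To make the idea concrete, here is a toy featurizer written in plain Python. It is only a sketch: the function name `toy_featurize`, the small lookup table, and the choice of statistics are assumptions made for illustration, not matminer's implementation. It maps the composition of Fe2O3 to a fixed-length numerical vector.

```python
# Toy illustration only: matminer's real featurizers are far more complete.
# Hypothetical lookup table: element -> (atomic number, Pauling electronegativity).
ELEMENT_DATA = {
    "Fe": (26, 1.83),
    "O": (8, 3.44),
}


def toy_featurize(composition):
    """Map {element: amount} to [mean Z, max Z, mean EN, EN range]."""
    total = sum(composition.values())
    fractions = {el: amt / total for el, amt in composition.items()}

    z = {el: ELEMENT_DATA[el][0] for el in composition}
    en = {el: ELEMENT_DATA[el][1] for el in composition}

    # Composition-weighted averages of the elemental properties
    mean_z = sum(fractions[el] * z[el] for el in composition)
    mean_en = sum(fractions[el] * en[el] for el in composition)
    en_range = max(en.values()) - min(en.values())

    return [mean_z, max(z.values()), mean_en, en_range]


features = toy_featurize({"Fe": 2, "O": 3})
print(features)
```

Whatever the input composition, the output always has the same length, which is what lets a machine learning model consume it.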

Matminer contains featurizers for the following pymatgen objects:
* Composition
* Structure
* Bandstructure
* DOS

There are also "convenience" featurizers for 


Depending on the featurizer, the features returned may be:
* numerical, categorical, or mixed vectors
* matrices 
* other pymatgen objects (for further processing)

## Part 1: Overview of the featurizers

In [None]:
from matminer.datasets import get_available_datasets

get_available_datasets()

Datasets can be loaded using the `load_dataset()` function and the dataset name. To save installation space, the datasets are not automatically downloaded when matminer is installed. Instead, the first time a dataset is loaded, it is downloaded from the internet and stored in the matminer installation directory.

Let's load the `dielectric_constant` dataset. It contains 1,056 structures with dielectric properties calculated with DFPT-PBE.

In [None]:
from matminer.datasets import load_dataset

df = load_dataset("dielectric_constant")

For commonly used datasets, matminer provides convenience loader functions. For example, the dielectric dataset could also be loaded as:

In [None]:
from matminer.datasets.convenience_loaders import load_dielectric_constant

df = load_dielectric_constant()

## Manipulating and examining pandas `DataFrame` objects

The datasets are made available as pandas `DataFrame` objects. You can think of these as a type of "spreadsheet" object in Python. DataFrames have several useful methods you can use to explore and clean the data, some of which we'll explore below.

### Inspecting the dataset

The `head()` method returns the first few rows of a dataset (five by default). You can scroll across to see more columns. From this, it is easy to see the types of data available in the dataset.

In [None]:
df.head()

Sometimes, if a dataset is very large, you will be unable to see all the available columns. Instead, you can see the full list of columns using the `columns` attribute:

In [None]:
df.columns

A pandas `DataFrame` includes a function called `describe()` that helps determine statistics for the various numerical/categorical columns in the data. Note that the `describe()` function only describes numerical columns by default.

Sometimes, the `describe()` function will reveal outliers that indicate mistakes in the data.

In [None]:
df.describe()
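
As a small sketch of the default behavior noted above, the example below builds a toy `DataFrame` (the values are hypothetical and not taken from the dielectric dataset) and compares `describe()` with `describe(include="all")`, which also summarizes non-numerical columns.

```python
import pandas as pd

# Hypothetical toy data, not taken from the dielectric dataset.
toy_df = pd.DataFrame(
    {
        "formula": ["NaCl", "MgO", "Si", "NaCl"],
        "band_gap": [5.0, 4.5, 0.6, 5.1],
        "volume": [44.0, 19.0, 40.0, 45.0],
    }
)

# Default: only the numerical columns are summarized.
numeric_summary = toy_df.describe()
print(list(numeric_summary.columns))

# include="all" adds non-numerical columns (count, unique, top, freq).
full_summary = toy_df.describe(include="all")
print(list(full_summary.columns))
```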

### Indexing the dataset

We can access a particular column of `DataFrame` by indexing the object using the column name. For example:

In [None]:
df["formula"]

Alternatively, we can access a particular row of a `DataFrame` by its integer position using the `iloc` attribute.

In [None]:
df.iloc[100]
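
Note that `iloc` selects by position, while the related `loc` attribute selects by index label. The two only coincide for a default index. The sketch below uses a hypothetical toy `DataFrame` with a non-default index to make the difference visible.

```python
import pandas as pd

# Hypothetical toy data with a non-default index.
toy_df = pd.DataFrame(
    {"formula": ["NaCl", "MgO", "Si"], "band_gap": [5.0, 4.5, 0.6]},
    index=[10, 20, 30],
)

row_by_position = toy_df.iloc[0]  # first row, whatever its label is
row_by_label = toy_df.loc[20]     # row whose index label is 20

print(row_by_position["formula"])
print(row_by_label["formula"])
```

This matters after filtering, because a filtered `DataFrame` keeps the index labels of the original rows.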

### Filtering the dataset

Pandas `DataFrame` objects make it very easy to filter the data based on a specific column. We can use the typical Python comparison operators (`==`, `>`, `>=`, `<`, etc.) to filter numerical values. For example, let's find all entries where the cell volume is greater than 580. We do this by filtering on the `volume` column.

Note that we first produce a *boolean mask* – a series of `True` and `False` depending on the comparison. We can then use the mask to filter the `DataFrame`. 

In [None]:
mask = df["volume"] > 580
df[mask]
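
The mask approach extends to several conditions at once. The sketch below, on hypothetical toy data, combines two masks with `&` (element-wise "and"); `|` works the same way for "or".

```python
import pandas as pd

# Hypothetical toy data, not taken from the dielectric dataset.
toy_df = pd.DataFrame(
    {
        "formula": ["NaCl", "MgO", "Si", "Cu"],
        "band_gap": [5.0, 4.5, 0.6, 0.0],
        "volume": [44.0, 19.0, 40.0, 12.0],
    }
)

# Each comparison needs its own parentheses because & binds more
# tightly than the comparison operators.
mask = (toy_df["band_gap"] > 0) & (toy_df["volume"] > 30)
filtered = toy_df[mask]

print(mask.tolist())
print(filtered["formula"].tolist())
```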

We can use this method of filtering to clean our dataset. For example, if we want our dataset to include only semiconductors (materials with a non-zero band gap), we can filter on the `band_gap` column and index the `DataFrame` with the resulting mask.

In [None]:
# Index the DataFrame with the boolean mask to keep only rows with a non-zero band gap
semiconductor_df = df[df["band_gap"] > 0]

Often, a dataset contains many additional columns that are not necessary for machine learning. Before we can train a model on the data, we need to remove any extraneous columns. We can remove whole columns from the dataset using the `drop()` function. This function can be used to drop both rows and columns.

The function takes a list of items to drop. For columns, these are the column names; for rows, they are the index labels (the row numbers, for a default index). Finally, the `axis` option specifies whether the items to drop are columns (`1`) or rows (`0`).

For example, to remove the `nsites`, `space_group`, `e_electronic`, and `e_total` columns, we can run: 

In [None]:
cleaned_df = df.drop(["nsites", "space_group", "e_electronic", "e_total"],
                     axis=1)
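
As an alternative to the `axis` option, `drop()` also accepts `columns=` and `index=` keywords, which some find more readable. The sketch below uses hypothetical toy data to show both, and that `drop()` returns a new `DataFrame` rather than modifying the original.

```python
import pandas as pd

# Hypothetical toy data, not taken from the dielectric dataset.
toy_df = pd.DataFrame(
    {
        "formula": ["NaCl", "MgO", "Si"],
        "nsites": [2, 2, 2],
        "band_gap": [5.0, 4.5, 0.6],
    }
)

# Equivalent to toy_df.drop(["nsites"], axis=1)
no_nsites = toy_df.drop(columns=["nsites"])

# Drop rows by index label (here labels 0 and 2); equivalent to axis=0
only_mgo = toy_df.drop(index=[0, 2])

# The original DataFrame is unchanged
print(list(no_nsites.columns))
print(only_mgo["formula"].tolist())
print(list(toy_df.columns))
```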

Let's examine the cleaned `DataFrame` to see that the columns have been removed.

In [None]:
cleaned_df.head()

### Generating new columns

Pandas `DataFrame` objects also make it easy to perform simple calculations on the data. Think of this as using formulas in Excel spreadsheets. All fundamental Python math operators (such as +, -, /, and \*) can be used. 

For example, the dielectric dataset contains the electronic contribution to the dielectric constant ($\epsilon_\mathrm{electronic}$, in the `poly_electronic` column) and the total (static) dielectric constant ($\epsilon_\mathrm{total}$, in the `poly_total` column). The ionic contribution to the dielectric constant is given by:

$$
\epsilon_\mathrm{ionic} = \epsilon_\mathrm{total} - \epsilon_\mathrm{electronic}
$$

Below, we calculate the ionic contribution to the dielectric constant and store it in a new column called `poly_ionic`. This is as simple as assigning the data to the new column, even if the column doesn't already exist.

In [None]:
df["poly_ionic"] = df["poly_total"] - df["poly_electronic"]

Let's check the new data was added correctly.

In [None]:
df.head()

## Let's practice!

Now, let's practice. You'll download a dataset, inspect it, and make sure it is ready to be used for machine learning.