# Materials Data and Representations

```{admonition} Michael Ashby and David Jones
:class: tip
How is the engineer to choose from this vast menu the material that best suits the purpose?
```

<iframe class="speakerdeck-iframe" frameborder="0" src="https://speakerdeck.com/player/431daa62140b472bb2b19c499ebb73f5" title="Machine Learning for Materials (Lecture 4)" allowfullscreen="true" style="border: 0px; background-clip: padding-box; background-color: rgba(0, 0, 0, 0.1); margin: 0px; padding: 0px; border-radius: 6px; box-shadow: rgba(0, 0, 0, 0.2) 0px 5px 40px; width: 100%; height: auto; aspect-ratio: 560 / 420;" data-ratio="1.3333333333333333"></iframe>

[Lecture slides](https://speakerdeck.com/aronwalsh/mlformaterials-lecture4-data)

## 🚀 Chemical space

Our world can be described as a three-dimensional space (or a four-dimensional continuum if you consider spacetime). Chemical space is far larger when you consider all of the compositions and structures that can be built from combinations of the 118 elements in the Periodic Table! Even considering only carbon atoms, C-C bonds can be used to form molecules, rods, spheres, sheets, and extended crystals.

The goal today is to access, filter, and visualise data from the [Materials Project](https://materialsproject.org). This is one of a growing number of computational materials science databases, which also include [NOMAD](https://nomad-lab.eu), [OQMD](https://oqmd.org), and [AFLOW](http://www.aflowlib.org). We will use an application programming interface (API) via Python.

<details>
<summary> Data warning </summary>
Most computational databases are based on the properties of static crystals. In reality, temperature influences the structures and properties of crystals. We have to start somewhere, but keep this in mind when judging the utility of derived models.
</details>

### Preliminary steps
* Create an account for the [Materials Project site](https://materialsproject.org)
* Get your API access key from the [dashboard](https://materialsproject.org/dashboard) to access the database via Python
* You must paste your unique API key below

In [None]:
# Installation of libraries
!pip install pymatgen --quiet
!pip install -U mp-api --quiet
!pip install -U elementembeddings --quiet

In [None]:
# Import of modules
from mp_api.client import MPRester  # Materials Project API client
import pprint  # Pretty print data structures
import pandas as pd  # Data manipulation with DataFrames
import numpy as np  # Numerical operations
import matplotlib.pyplot as plt  # Plotting
import os  # Operating system functions

# Assign your API key to the variable below
  # i.e. paste the key in between the quotation marks
API_KEY = " " # @param {type:"string"}

<details>
<summary>Colab error solution</summary>
If running the import module cell fails with an `AttributeError`, click `Runtime` -> `Restart Session` and then rerun the cell.
</details>

## Database access

`MPRester` serves as a Python client for accessing the data contained in the Materials Project. The typical way to use it is:

```python
with MPRester(API_KEY) as mpr:
    # do something
```
You may have encountered the `with` statement when reading or writing files. We use it here for our queries. The `with` statement works with a context manager, which sets up a resource (such as the connection to the database) when the block is entered and releases it when the block exits.
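To make the idea of a context manager concrete, here is a minimal sketch using a made-up `DemoClient` class. It is a hypothetical stand-in written for illustration and does not reproduce how `MPRester` is implemented; it only shows the `__enter__`/`__exit__` protocol that any object used in a `with` statement follows.

```python
class DemoClient:
    """Hypothetical stand-in for a database client (not the real MPRester)."""

    def __init__(self, key):
        self.key = key
        self.is_open = False

    def __enter__(self):
        # Called on entering the `with` block: acquire the resource
        self.is_open = True
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Called on leaving the block, even if an error was raised inside it
        self.is_open = False
        return False  # do not suppress exceptions


with DemoClient("my-key") as client:
    state_inside = client.is_open  # resource is available here

state_outside = client.is_open  # resource has been released

print(state_inside, state_outside)
```

The object still exists after the block, but the resource it managed has been released.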

The database can be queried by Materials Project ID(s) and/or specific materials properties. For this exercise, we will access data through the `summary` API endpoint. This can be queried using the `search` method:

In [None]:
# Query for two materials by their Materials Project IDs
with MPRester(API_KEY) as mpr:
    docs = mpr.materials.summary.search(material_ids=['mp-1069538','mp-540839'])

<details>
<summary> Code hint </summary>
If the cell fails, check that your API_KEY has no spaces before or after the string.
</details>

Using the `search` method results in a list of `MPDataDoc` objects being returned. The properties of each material in the query are accessible as attributes of the object. Run the cell block below to access some of the properties from the first result of our query.

In [None]:
# Display some properties of the first material from our query
first_doc = docs[0]

print(f'The Materials Project ID is {first_doc.material_id}\n')
print(f'The chemical formula is {first_doc.formula_pretty}\n')
print(f'The band gap is {first_doc.band_gap:.2f} eV\n')
print(f'The crystal system is {first_doc.symmetry.crystal_system}\n')
print(f'The energy above the convex hull (EATCH) is {first_doc.energy_above_hull:.3f} eV/atom')

The Materials Project contains many materials properties. The cell above is only a small snapshot of those accessible from the `summary` endpoint. Run the cell below to see the full range of properties that are available.

In [None]:
# Run this cell to see the full list of available properties
print(mpr.materials.summary.available_fields)

By default, all fields are returned when we perform a query. Let's investigate the `MPDataDoc` object a bit further.

In [None]:
# Print the first query result
print(first_doc)

As all available fields have been requested, the object is lengthy.

Each property field is accessible as an attribute of the `MPDataDoc` object. Some are equal to `None` as that property is not available for that material. If there are certain properties that we are interested in, we can pass a `fields` argument to the `search` method and specify those fields we want returned. For example, if we were only interested in:

* `material_id` (the Materials Project ID)
* `band_gap` (the electronic band gap - the difference in energy between the valence band and conduction band)
* `structure` (the crystal structure information)
* if the material is `theoretical` (not recorded in the Inorganic Crystal Structure Database)
* if the material `is_stable` (defined as the material having an energy above the thermodynamic convex hull of 0 eV/atom)

then we would pass the following argument into the `search` method: `fields=["material_id", "band_gap", "structure", "theoretical", "is_stable"]`.

### Filter by properties

We can also query by property value. This can be useful in searching for materials which meet particular property requirements. Using the `search` method, we would provide the property as an argument. To query for materials with `band_gap` greater than 0.5 eV but less than 1.5 eV, we pass the following argument to the `search` method: `band_gap=(0.5,1.5)`.
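The effect of a `(min, max)` range filter can be pictured locally with pandas. The sketch below uses made-up band gap values and placeholder formula labels, not Materials Project data, and the helper `filter_by_range` is hypothetical. It assumes both bounds are included, which may differ from how the API treats the end points.

```python
import pandas as pd

# Made-up values for illustration only; not Materials Project data
toy_df = pd.DataFrame(
    {
        "formula_pretty": ["A2O", "BO2", "CO3", "D2O3"],
        "band_gap": [0.0, 0.9, 1.5, 3.2],
    }
)


def filter_by_range(df, column, bounds):
    """Keep rows whose `column` value lies within (low, high), bounds included."""
    low, high = bounds
    mask = df[column].between(low, high)  # inclusive on both ends by default
    return df.loc[mask]


selected = filter_by_range(toy_df, "band_gap", (0.5, 1.5))
print(selected)
```

Filtering on the server through `search` avoids downloading entries that would be discarded anyway, whereas a local filter like this one is handy for refining a DataFrame that has already been retrieved.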

Let's run some code to demonstrate these queries. We will first try to search for some semiconducting binary and ternary oxides by querying for materials with:

* a band gap between 0.5 and 1.5 eV: `band_gap=(0.5,1.5)`
* two or three elements: `num_elements=(2,3)`
* contains oxygen: `elements=["O"]`

We want to return the following properties: `material_id`, `formula_pretty`, `band_gap`, `theoretical`, `is_stable`.

In [None]:
# Query for binary and ternary oxides
with MPRester(FBI_KEY, use_document_model=False) as mpr:
    docs = mpr.materials.summary.search(
        elements=["O"],
        band_gap=(0.5,1.5),
        num_elements=(2,3),
        fields=['material_id', 'formula_pretty','band_gap','is_stable', 'theoretical']
       )

print("Number of binary and ternary oxides with band gap between 0.5 and 1.5 eV: ", len(docs))

<details>
<summary> Code hint </summary>
MPRester needs API_KEY as input, not FBI_KEY. Replace!
</details>

In [None]:
# Convert the list of dictionaries into a DataFrame
docs_df = pd.DataFrame(docs)

# Display the first few rows of the DataFrame
docs_df.head(

<details>
<summary> Code hint </summary>
Check that all brackets are closed...
</details>

By converting the query data into a DataFrame, we have access to efficient methods to summarise the data. For example, we can use `.describe()` for summary statistics for the numerical columns. Run the cell below to see the summary statistics for the numerical columns.

In [None]:
# Get summary statistics for the numerical columns in the DataFrame
docs_df.describe()

It is important when querying any form of database that we know (and keep a record of) the database version. We can check this through the API using the `.get_database_version()` method as shown below:

In [None]:
# Confirm database version
db_version = mpr.get_database_version()
print(f'We have been querying the {db_version} version of the Materials Project.')
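One possible way to keep a record of the version is to store it next to the query parameters in a small JSON file. This is only a sketch: the helper `save_query_record`, the file name, and the `"PLACEHOLDER_VERSION"` string are all assumptions made for illustration. In practice you would pass in the `db_version` value obtained above.

```python
import json
import tempfile
from datetime import date
from pathlib import Path


def save_query_record(path, db_version, query):
    """Write the database version, query parameters and date to a JSON file."""
    record = {
        "database_version": db_version,
        "query": query,
        "date": date.today().isoformat(),
    }
    Path(path).write_text(json.dumps(record, indent=2))
    return record


# Placeholder version string; replace with the value returned by the API
record_path = Path(tempfile.gettempdir()) / "mp_query_record.json"
saved = save_query_record(
    record_path,
    "PLACEHOLDER_VERSION",
    {"elements": ["O"], "num_elements": [2, 3]},
)

# Read the record back to confirm what was stored
loaded = json.loads(record_path.read_text())
print(loaded)
```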

## Chemical space of metal oxides

Now that we have an understanding of the API, let's find some metal oxides. The workflow will involve the following steps:

* Query for metal oxides
  
* Visualise the distribution of the number of elements in each material

Note that there is no one fixed way of obtaining the results. Like Python programming, the use of API queries is flexible and leaves room for personal preference and creativity.

### Query for metal oxides

We can query for all the oxygen-containing materials using the `elements` argument and setting it to `["O"]`. We can also specify the number of elements in the material by using the `num_elements` argument. We will set this to `(2,3)` to specify that we want materials with 2-3 elements. We will also specify the properties that we want returned by using the `fields` argument.

In [None]:
# This may take two minutes to run - be patient!

# Query for binary and ternary oxides
with MPRester(API_KEY, use_document_model=False) as mpr:
    oxide_docs = mpr.materials.summary.search(
        elements=["O"],
        num_elements=(2,3),
        fields=['material_id','formula_pretty','elements','nelements','formula_anonymous','symmetry.number','volume','formation_energy_per_atom','band_gap']
       )
    
# Convert the query results into a DataFrame
oxide_df= pd.DataFrame(oxide_docs)

# Display the first few rows of the DataFrame
oxide_df.head()

We now have a DataFrame with the properties that we requested. However, the symmetry is still in dictionary form. This type of issue arises many times in data pre-processing for machine learning models.

The `symmetry` property contains key-value pairs of the symmetry data: `number`, `symbol`, `crystal_system`, `point_group`, `source`, `version`. With the exception of `number`, these are all currently `None` as we only requested the property `symmetry.number`. We can convert the `symmetry` into a column of the DataFrame using `.apply()`. Called on a single column, this method applies a function to each value in that column. In this case, we will apply a function that returns the value of the `number` key of the `symmetry` dictionary.

In [None]:
# Convert the symmetry property into a column of the DataFrame called "symmetry.number"

# Define a function that returns the value of the "number" key of the "symmetry" dictionary
def get_spacegroup_number(symmetry):
    """
    Returns the value of the "number" key of the "symmetry" dictionary.
    """
    return symmetry["number"]

# Apply the function to each row of the dataframe
oxide_df["symmetry.number"] = oxide_df["symmetry"].apply(get_spacegroup_number)

# Drop the "symmetry" column from the dataframe
oxide_df.drop(columns=["symmetry"], inplace=True)

# Display the first few rows of the dataframe
oxide_df.head()

We now have a DataFrame populated with properties. Let's visualise the number of components.  

In [None]:
# Visualise the distribution of the number of elements in each material
import matplotlib.pyplot as plt
import seaborn as sns

# Counts of components
component_count = oxide_df.nelements.value_counts()

# Create a color map based on the number of elements
  # We will use the number of elements as the color value
color_map = component_count.index

fig, ax = plt.subplots(figsize=(5, 4))

# Apply the color map to the bars
bars = ax.bar(component_count.index, component_count.values, color=plt.cm.tab20(color_map))
ax.set_xlabel('Number of elements')
ax.set_ylabel('Number of materials')
ax.bar_label(bars)
ax.set_xticks(range(int(min(component_count.index)), int(max(component_count.index))+1))

plt.show(

<details>
<summary> Code hint </summary>
Remember to close your brackets
</details>

## Unsupervised machine learning

We have a set of materials with different compositions and crystal structures. We can use a machine learning technique to visualise these in two dimensions. You can think of this as a materials map. We refer to the methods that enable us to reduce high-dimensional data into lower dimensions as dimension reduction techniques. These allow us to visualise complex data, and in this example we will make use of Principal Component Analysis (PCA).

<details>
<summary> Overview of PCA </summary>
PCA is a popular technique for dimension reduction and data preprocessing, enabling the simplification of complex datasets while retaining crucial information. High-dimensional data is transformed into a new coordinate system where the axes align with the directions of maximum variance in the original data. These new axes, termed "principal components," are orthogonal. The first principal component captures the highest variance, the second captures the second highest, etc.

_Key use cases include:_

- **Dimensionality Reduction**: Identifying and eliminating less informative dimensions, reducing noise and computational complexity.

- **Data Visualisation**: Facilitating easier interpretation while preserving essential patterns.

- **Noise Reduction**: Filtering out noise or unimportant variations by focusing on significant variance.

- **Feature Engineering**: A preprocessing step to transform data before applying machine learning algorithms, potentially improving performance.

_PCA workflow:_

1. **Center the Data**: Subtract the mean from each feature to center the data around the origin.

   $
   X_{\text{centered}} = X - \bar{X}
   $

2. **Calculate Covariance Matrix**: Compute the [covariance](https://en.wikipedia.org/wiki/Covariance) matrix, which captures how each pair of features varies together.

   $
   \text{Cov}(X) = \frac{1}{n}X_{\text{centered}}^T X_{\text{centered}}
   $

3. **Compute Eigenvalues and Eigenvectors**: Calculate eigenvalues ($\lambda_i$) and eigenvectors ($\mathbf{v}_i$) of the covariance matrix. Each eigenvector gives a direction in feature space, and its eigenvalue quantifies the variance of the data along that direction.

   $
   \text{Cov}(X) \mathbf{v}_i = \lambda_i \mathbf{v}_i
   $

4. **Sort Eigenvalues**: Sort eigenvalues in descending order, rearranging corresponding eigenvectors accordingly.

5. **Select Principal Components**: Choose a subset of eigenvectors (principal components) based on eigenvalues, explaining the most variance in the data.

6. **Project Data**: Project original data onto selected principal components, yielding a lower-dimensional representation.

Note that PCA is limited by its reliance on linear transformations of the data. In cases where non-linear structures are prominent, alternative techniques such as t-distributed Stochastic Neighbor Embedding (t-SNE) can be used.
</details>
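The six workflow steps in the PCA overview can be written out directly with NumPy. The sketch below runs on synthetic random data (an assumption for illustration, not the oxide dataset) and uses the same $\frac{1}{n}$ covariance convention as the overview.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 5 features with different spreads
X = rng.normal(size=(200, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])

# 1. Centre the data
X_centered = X - X.mean(axis=0)

# 2. Covariance matrix (1/n convention, as in the overview)
cov = X_centered.T @ X_centered / X_centered.shape[0]

# 3. Eigenvalues and eigenvectors (eigh is intended for symmetric matrices)
eigvals, eigvecs = np.linalg.eigh(cov)

# 4. Sort in descending order of eigenvalue
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
eigvecs = eigvecs[:, order]

# 5. Select the first two principal components
n_components = 2
components = eigvecs[:, :n_components]

# 6. Project the centred data onto the selected components
reduced = X_centered @ components

# Fraction of the total variance captured by each selected component
explained_ratio = eigvals[:n_components] / eigvals.sum()
print(reduced.shape)
print(explained_ratio)
```

The variance of each projected column equals the corresponding eigenvalue. Results from a library implementation may differ from this sketch by the sign of a component, and by a factor of $n/(n-1)$ in the variances if the library normalises the covariance by $n-1$.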

In [None]:
# Perform PCA analysis
from sklearn.decomposition import PCA

# Random 10D vectors for demonstration purposes
np.random.seed(42)
num_samples = 1000
dimensionality = 
random_state = 42
data = np.random.rand(num_samples, dimensionality)

# Perform PCA to reduce the dimensionality to 2D
pca = PCA(n_components=dimensionality)
reduced_data = pca.fit_transform(data)

# Create a color map based on the original data points
   # Use the first dimension of the original data as the color value
color_map = data[:, 0]

# Plot the 2D projection
plt.figure(figsize=(5, 3))
plt.scatter(reduced_data[:, 0], reduced_data[:, 1], c=color_map, cmap='viridis')
plt.colorbar(label='Colour based on first dimension')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('2D projection using PCA')
plt.show()

<details>
<summary> Code hint </summary>
Set the number of components to 2
</details>

Using our dataset, we can try different featurisation schemes and analyse the resulting visualisations. Here we will perform a simple "one-hot encoding" of the chemical compositions using the [ElementEmbeddings](https://github.com/WMD-group/ElementEmbeddings) package.
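To picture what a compositional vector looks like, the sketch below builds one by hand: each element present in the dataset gets its own column, and each material stores the amount of that element. The helpers `parse_formula` and `composition_vectors` are hypothetical and are not part of ElementEmbeddings, the parser only handles simple formulas without brackets, and the package may order or normalise its features differently.

```python
import re

import numpy as np


def parse_formula(formula):
    """Parse a simple formula such as 'Fe2O3' into {'Fe': 2.0, 'O': 3.0}.

    Brackets and nested groups are not supported in this sketch.
    """
    counts = {}
    pattern = r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?"
    for symbol, amount in re.findall(pattern, formula):
        # An element written without a number counts once
        counts[symbol] = counts.get(symbol, 0.0) + (float(amount) if amount else 1.0)
    return counts


def composition_vectors(formulas):
    """Return the sorted element list and one count vector per formula."""
    parsed = [parse_formula(f) for f in formulas]
    elements = sorted({el for comp in parsed for el in comp})
    vectors = np.zeros((len(formulas), len(elements)))
    for i, comp in enumerate(parsed):
        for el, amount in comp.items():
            vectors[i, elements.index(el)] = amount
    return elements, vectors


elements, vectors = composition_vectors(["TiO2", "Fe2O3", "SrTiO3"])
print(elements)
print(vectors)
```

Most entries are zero because each material contains only a few of the elements in the dataset, which is typical of this kind of sparse representation.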

In [None]:
# This will take a minute or two to run - be patient! 

# Featurise to create compositional vectors
from elementembeddings.composition import composition_featuriser
oxide_onehot_df = composition_featuriser(oxide_df, formula_column="formula_pretty", embedding="atomic",stats=["sum"])
oxide_onehot_df.head()

In [None]:
# Perform PCA analysis using the oxide feature set
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from elementembeddings.composition import composition_featuriser

# Define columns to drop from the dataset
cols_to_drop = ["elements", "nelements", "formula_pretty","volume", "formula_anonymous", "material_id", "symmetry.number", "composition","band_gap", "formation_energy_per_atom"]

# Create a list of feature columns
feature_cols = [col for col in list(oxide_onehot_df.columns) if col not in cols_to_drop]

# Extract feature values
X = oxide_onehot_df[feature_cols].values
# Alternative featurisation scheme 
# X = oxide_magpie_df[feature_cols].values

# Standardise the feature values
X_standardised = StandardScaler().fit_transform(X)

# Perform PCA to reduce dimensionality to 2D
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(X_standardised)

# Create subplots for visualisation
fig, axes = plt.subplots(2, 1, figsize=(5, 6))

# Loop through columns for coloring the data points
for ax, col in zip(axes.flatten(), ["symmetry.number", "band_gap"]):
    color_map = oxide_onehot_df[col].values
    scatter = ax.scatter(reduced_data[:, 0], reduced_data[:, 1], c=color_map, cmap='viridis', alpha=0.5)
    fig.colorbar(scatter, label=col, ax=ax)
    ax.set_xlabel("Component 1")
    ax.set_ylabel("Component 2")
    ax.set_title(f"Label: {col}")

# Adjust layout and display the plot
fig.tight_layout()
plt.show()

<details>
<summary> Code hint </summary>
You can change the featurisation scheme to see the impact on the resulting visualisation and distribution of compositions
</details>

## 🚨 Exercise 4: Metal oxide perovskites


```{admonition} Coding exercises
:class: note
The exercises are designed to apply what you have learned with room for creativity. It is fine to discuss solutions with your classmates, but the actual code should not be directly copied.

The completed notebooks are to be submitted at the end of class, but you can revisit later, experiment with the code, and follow the further reading suggestions.
```


### Your details

In [None]:
import numpy as np

# Insert your values
Name = "No Name" # Replace with your name
CID = 123446 # Replace with your College ID (as a numeric value with no leading 0s)

# Set a random seed using the CID value
CID = int(CID)
np.random.seed(CID)

# Print the message
print("This is the work of " + Name + " [CID: " + str(CID) + "]")

### Tasks

We have covered a lot today. You can revisit and tweak the examples in your own time, as well as consult the `pymatgen` manual. 

There is one task to complete:

1.  Investigate how many cubic ABO<sub>3</sub> perovskite structures exist in your metal oxide dataset. Here is an example that queries for 3 component materials that you can extend.

```python
abc_df = oxide_df.loc[(oxide_df["nelements"] == 3)] 
print("There are " + str(len(abc_df.index)) + " 3 component materials out of " + str(len(oxide_df.index)) + " metal oxides extracted from the database.")
```

*Self-study (optional)*  

2. Plot the occurrence of elements A and B in these metal oxide perovskites. Identify which element is present in the most perovskites.

3. Within the crystal-chemical space of metal oxides, see if perovskites are clustered into one specific region of the two-dimensional PCA map.

<details>
<summary> Task hint </summary>
For task 1, remember the space group number of cubic perovskites is 221. You could filter for this with `symmetry.number`.
</details>

```{admonition} Submission
:class: note
When your notebook is complete, click on the download icon on the top right, select `.pdf`, save the file and upload it to MyDepartment. If you are using Google Colab, you have to print to pdf.
```

In [None]:
#Code block 


    

In [None]:
#Comment block 




In [None]:
#Code block 




In [None]:
#Comment block 




## 🌊 Dive deeper

* _Level 1:_ Tackle Chapter 8 on Linear Unsupervised Learning in [Machine Learning Refined](https://github.com/jermwatt/machine_learning_refined#what-is-new-in-the-second-edition).

* _Level 2:_ Read about our attempt to screen _all inorganic materials_ (with caveats) in the journal [Chem](https://doi.org/10.1016/j.chempr.2016.09.010). 

* _Level 3:_ Watch a [seminar](https://www.youtube.com/watch?v=gd-uahI5xbA) by quantum chemist Anatole von Lilienfeld on chemical space. 