# Materials Discovery Using Machine Learning

In the previous Cell Phone Design Challenge, you may have noticed that not all properties (conductivity, voltage, hardness, etc.) have been computed for every material in our dataset from the [Materials Project](https://materialsproject.org).

In this notebook, we will train a machine learning model to predict the hardness of a material based only on its density. You will then fill in the missing hardness data and look for promising materials using your predicted data.

### Table of Contents

1 - [EDA and Review](#section1)<br>

2 - [Training Our Machine Learning Model](#section2)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.1 - [Preprocess Your Data](#subsection1)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.2 - [Specify A Model](#subsection2)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.3 - [Train the Model](#subsection3)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.4 - [Evaluate the Model](#subsection4)<br>

3 - [Predicting the Hardness of Novel Materials](#section3)<br>

4 - [Bonus: Testing Different Models](#section4)<br>

## 1. EDA and Review<a name='section1'></a>

Before we get started, let's remind ourselves what our dataset looks like by doing some EDA (Exploratory Data Analysis). Recall that the properties are as follows:

- `material_id`: This is the identification number of the material. You can use this to search for the material on the Materials Project website. For example, you can see more information about the first material in the dataset (mp-770629) by going to the following URL: https://materialsproject.org/materials/mp-770629
- `formula`: The chemical formula indicating the number and type of elements in the material.
- `cost`: The cost of the raw elements in the material in \$/kg.
- `scarcity`: How scarce the raw elements are. Larger numbers indicate that the elements are harder to find and occur less frequently in the Earth's crust.
- `density`: The density of the material in g/cm<sup>3</sup>.
- `conductivity`: The conductivity of the material. Larger numbers indicate the material is more conductive.
- `transparency`: The transparency of the material. Larger numbers indicate the material is more transparent.
- `hardness`: The hardness of the material. Larger numbers indicate harder materials.
- `voltage`: The maximum obtainable voltage if the material is used in a battery.
- `capacitance`: The maximum obtainable capacitance if the material is used in a battery.


We will first import the libraries we need for this notebook, load the dataset as a pandas DataFrame, and quickly clean the data:

In [None]:
import numpy as np
import pandas as pd

material_data = pd.read_csv("https://raw.githubusercontent.com/utf/mp-bldap/master/resources/materials_project_dataset.csv")

#Remove the negative densities (this was explained in the previous notebook)
material_data = material_data[material_data["density"] > 0]

#Examine the dataset using the head() function
material_data.head()

In the three cells below, use some functions (other than `head()`) on the dataframe `material_data` to do some **Exploratory Data Analysis** to explore the data:

In [None]:
#Exploratory Data Analysis 1

#This can be tail, info, columns, describe, mean, median, mode, etc.

material_data....

In [None]:
#Exploratory Data Analysis 2

...

In [None]:
#Exploratory Data Analysis 3

...
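If you are unsure what these functions report, the sketch below runs a few of them on a tiny made-up table. The dataframe `toy` and all of its values are hypothetical stand-ins, not part of the Materials Project dataset; try the same calls on `material_data` in your own cells.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for material_data (all values are made up).
toy = pd.DataFrame({
    "formula": ["A2B", "CD3", "E", "FG"],
    "density": [2.5, 7.1, 4.3, 5.0],
    "hardness": [120.0, np.nan, 85.0, np.nan],
})

print(toy.shape)         # (number of rows, number of columns)
print(toy.columns)       # the column names
toy.info()               # data types and non-null counts for each column
print(toy.describe())    # summary statistics for the numeric columns
print(toy.isna().sum())  # number of missing values in each column
```

Counting the missing values per column is especially useful here, because it shows how much data each property is missing.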

### Review: Filtering Data

In the previous notebook, we filtered data based on specific columns.

In the following cells, grab the columns specified. *(Feel free to look back at the previous notebook to see what code to use!)*

In [None]:
#Grab the cost column:

material_data[...]

In [None]:
#Grab the density column:



In [None]:
#Grab a column of your choice (other than cost, density):



In the following cells create filters based on the requirements given. *(Feel free to look back at the previous notebook to see what code to use!)*

In [None]:
#Find materials with a high value of conductivity (greater than 80) by creating a filter

conductivity_filter = material_data["..."] > ...

material_data[conductivity_filter]

In [None]:
#Find materials with a high value of hardness (greater than 200) by creating a filter

hardness_filter = ...

material_data[hardness_filter]

## 2. Training Our Machine Learning Model <a name='section2'></a>

Our goal is to train a machine learning model to predict the **hardness** of a material based only on its **density**. We will then fill in the missing hardness data and look for promising materials using the predicted data.

### 2.1 Preprocess Your Data<a name='subsection1'></a>

To start with, we have to prepare our training data. We need to filter our dataframe to include only the materials for which the hardness has already been calculated. In other words, we keep only the rows where the hardness column is not `NaN`.

You can achieve this using the pandas `isna` function. For example, try running the cell below:

In [None]:
pd.isna(material_data["voltage"])

You can see that the code returns a Series of `True` and `False` values indicating whether each entry in the "voltage" column is `NaN`.

However, we are looking for materials which are NOT `NaN`. We can invert a filter by using `~`. Try running the following cell:

In [None]:
~pd.isna(material_data["voltage"])

Finally, remember we need to index our dataframe using the filter to get the final results:

In [None]:
voltage_filter = ~pd.isna(material_data["voltage"])
material_data_voltage = material_data[voltage_filter]

material_data_voltage
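To see each of these steps on something small enough to check by eye, here is a minimal sketch using a made-up Series. The name `values` and its numbers are hypothetical and only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing entries (made-up numbers).
values = pd.Series([3.2, np.nan, 4.1, np.nan], name="voltage")

is_missing = pd.isna(values)  # True where the entry is NaN
is_present = ~is_missing      # ~ flips every True to False and vice versa

print(is_missing.tolist())  # [False, True, False, True]
print(is_present.tolist())  # [True, False, True, False]

# pandas also provides notna(), which gives the same result as ~isna()
same = pd.notna(values).equals(is_present)
print(same)  # True

# Indexing with the mask keeps only the entries where the mask is True
print(values[is_present])
```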

Below, create the filter `hardness_filter`, which, like the `voltage_filter` from the example above, includes only rows where the hardness is NOT `NaN`:

In [None]:
# EXERCISE

hardness_filter = ~pd.isna(...)

Below, use the `hardness_filter` to create a dataset `material_data_hardness` which includes only the materials for which the hardness data is available:

In [None]:
# EXERCISE

material_data_hardness = material_data[...]

material_data_hardness

Next, we need to partition the data into two sets. The first contains the input features, called `X`, which in this case is the density of each material. The second contains the target property that we are trying to predict, called `y`. In this case our target property is the hardness of the materials.

In [None]:
training_columns = ["density"]

X = material_data_hardness[training_columns]
y = material_data_hardness["hardness"]
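Notice that `X` is selected with a list of column names while `y` is selected with a single name. The sketch below, which uses a hypothetical toy dataframe with made-up values, shows the difference: a list gives back a two-dimensional DataFrame, which is the shape scikit-learn expects for input features, while a single name gives back a one-dimensional Series.

```python
import pandas as pd

# Hypothetical toy data (made-up numbers).
toy = pd.DataFrame({
    "density": [2.5, 7.1, 4.3],
    "hardness": [120.0, 310.0, 85.0],
})

X_toy = toy[["density"]]  # list of column names -> DataFrame with shape (3, 1)
y_toy = toy["hardness"]   # single column name -> Series with shape (3,)

print(type(X_toy).__name__, X_toy.shape)
print(type(y_toy).__name__, y_toy.shape)
```

Keeping the column names in a list such as `training_columns` also makes it easy to add more input features later.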

### 2.2 Specify A Model <a name='subsection2'></a>

We will need to specify a machine learning model to help us with predicting the hardness of materials. For this we will be using the `scikit-learn` library which implements a variety of different machine learning models and other analysis tools. A good "starting" model is the random forest model. As we are dealing with **regression** in machine learning, we will specifically be using the **Random Forest Regressor** model. (You can learn more about random forest models from the [scikit-learn user guide](https://scikit-learn.org/stable/modules/ensemble.html#forest)).

In [None]:
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor()

### 2.3 Train the Model <a name='subsection3'></a>

Now we are ready to train our machine learning model to use the input features (`X`) to predict the target property (`y`). This is achieved using the `fit()` function.

In [None]:
rf.fit(X, y)

That's it, you have just trained your first machine learning model!

### 2.4 Evaluate the Model <a name='subsection4'></a>

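Next, we need to assess how the model is performing. To do this, we can ask the model to predict the target property for every material in our training set (`X`). Keep in mind that evaluating a model on the same data it was trained on tends to give an optimistic picture of its accuracy.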

In [None]:
y_pred = rf.predict(X)

The `y_pred` variable now contains the predicted hardness for our training set of materials. We can see how the model is performing by calculating the root mean squared error (RMSE) of our predictions. To do this, the scikit-learn library provides a `mean_squared_error()` function to calculate the mean squared error. We then take the square root of this to obtain our final performance metric.

In [None]:
from sklearn.metrics import mean_squared_error

rmse = np.sqrt(mean_squared_error(y, y_pred))

print('root mean squared error: {:.2f}'.format(rmse))

Does this value seem high to you? How does it compare to the average hardness value of the data set?
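To make the metric less of a black box, here is a small sketch that computes the RMSE by hand and checks it against scikit-learn. The arrays `actual` and `predicted` are made-up numbers for illustration, not values from our dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted values (made-up numbers).
actual = np.array([100.0, 200.0, 300.0])
predicted = np.array([110.0, 190.0, 330.0])

errors = predicted - actual                   # [10, -10, 30]
rmse_by_hand = np.sqrt(np.mean(errors ** 2))  # sqrt((100 + 100 + 900) / 3)
rmse_sklearn = np.sqrt(mean_squared_error(actual, predicted))

print(rmse_by_hand, rmse_sklearn)  # the two values should agree
```

Because the errors are squared before averaging, one large miss (the 30 here) raises the RMSE more than several small ones. The result is in the same units as the property being predicted.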

Alternatively, we can plot the actual hardness values against the values predicted by our model. In the plot below, each point has been colored by the density of the original material.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
sns.set_context("notebook")

fig, ax = plt.subplots(dpi=100)
sc = ax.scatter(y, y_pred, c=X.values[:, 0])
ax.plot([-10, 650], [-10, 650], ls="--", c="white")
ax.set(xlabel="Hardness Calc. (GPa)", ylabel="Hardness Pred. (GPa)", xlim=(0, 600), ylim=(0, 600))
plt.colorbar(sc, label="Density (g/cm$^3$)")
plt.show()

As you can see, the model performs reasonably well on the data it was trained on! If the model showed perfect performance, all the points would line up along the dashed white line.
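A common way to get a more honest estimate is to hold back part of the data and evaluate on materials the model never saw during training. The sketch below illustrates the idea with scikit-learn's `train_test_split`. It is not part of the original workflow: the data is synthetic (a made-up relationship between density and hardness plus random noise), and the 25% test size is an arbitrary choice.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for illustration only):
# hardness loosely follows density, with random noise added.
rng = np.random.default_rng(0)
density = rng.uniform(1, 10, size=200).reshape(-1, 1)
hardness = 30 * density[:, 0] + rng.normal(0, 20, size=200)

# Hold back 25% of the rows; the model never sees them during training.
X_train, X_test, y_train, y_test = train_test_split(
    density, hardness, test_size=0.25, random_state=0
)

model = RandomForestRegressor(random_state=0)
model.fit(X_train, y_train)

rmse_train = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
rmse_test = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(f"train RMSE: {rmse_train:.2f}, test RMSE: {rmse_test:.2f}")
```

The error on the held-out rows is typically larger than the error on the training rows, and it is the better guide to how the model will do on materials with no hardness data.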

## 3. Predicting the Hardness of Novel Materials <a name='section3'></a>

We have trained our machine learning model, so now let's return to our task: predicting the hardness of each material based only on its density.

We will use the `predict` function of the model to predict the hardness of *all* materials in our original dataset, not just those for which training data was available.

In the cell below, we will create a new column called "hardness_predicted" that contains the hardness predicted by our machine learning model.

In [None]:
material_data["hardness_predicted"] = rf.predict(material_data[training_columns])

Let's check to make sure the dataframe contains the new column.

In [None]:
material_data.head()

### Discovering new materials using predicted data

We can now look for new protective casing materials using the predicted data.

**Aim**: Find the hardest material for which we didn't previously have a hardness value. Does this seem like a good choice of protective casing?


*Hint*:
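First filter the dataframe to find materials where the hardness column is `NaN` (i.e., using `pd.isna` as we did earlier). You can then sort the filtered dataframe using `sort_values("hardness_predicted")`. By default the sort is ascending, so the largest predicted hardness appears in the last row; pass `ascending=False` to put it first.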

In [None]:
# Complete task below:

material_data[pd.isna(...['...'])].sort_values('...')
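If the filter-then-sort pattern is unclear, the sketch below shows it on a hypothetical toy table. The column names `measured` and `predicted` and all values are made up, so you will still need to adapt the pattern to `material_data`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table (made-up names and numbers).
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "measured": [np.nan, 5.0, np.nan, np.nan],
    "predicted": [12.0, 6.0, 30.0, 21.0],
})

# Keep only the rows with no measured value...
unmeasured = toy[pd.isna(toy["measured"])]

# ...then sort so the largest predicted value comes first.
ranked = unmeasured.sort_values("predicted", ascending=False)

print(ranked)
print(ranked.iloc[0]["name"])  # c
```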

## 4. Bonus: Testing Different Models<a name='section4'></a>

It is good practice to trial multiple models to see which performs best for your machine learning problem.

The `scikit-learn` library has an [online user guide](https://scikit-learn.org/stable/user_guide.html) documenting all the different models and how to use them in code. For example, the random forest regressor model we used in this notebook can be found under section 1.11.2.

For this bonus activity try out the **Ordinary Least Squares model**. Did it perform better or worse than the Random Forest Regressor model we used earlier?

***Hint:*** *Read the user guide carefully to see how to import the model. For the Random Forest Regressor, we called `from sklearn.ensemble import RandomForestRegressor`, but other models are imported from different modules.*
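Apart from the import line, scikit-learn models share the same `fit()` and `predict()` pattern, so swapping models changes very little code. The sketch below illustrates this with a k-nearest neighbors regressor, chosen only as an example of a model that lives in a different module (it is not the Ordinary Least Squares model this activity asks for). The data is synthetic and `n_neighbors=5` is an arbitrary choice.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor  # note the different module

# Synthetic stand-in data (an assumption for illustration only).
rng = np.random.default_rng(1)
X_demo = rng.uniform(1, 10, size=(100, 1))
y_demo = 30 * X_demo[:, 0] + rng.normal(0, 20, size=100)

models = {
    "random forest": RandomForestRegressor(random_state=0),
    "k-nearest neighbors": KNeighborsRegressor(n_neighbors=5),
}

# The same fit/predict/score steps work for every model.
scores = {}
for name, model in models.items():
    model.fit(X_demo, y_demo)
    pred = model.predict(X_demo)
    scores[name] = np.sqrt(mean_squared_error(y_demo, pred))
    print(f"{name}: RMSE = {scores[name]:.2f}")
```

Both scores here are computed on the training data, so treat the comparison as a rough guide rather than a final verdict.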

In [None]:
# 1. Preprocess the data:

material_data_hardness = material_data[~pd.isna(material_data["hardness"])]

training_columns = ["density"]

X = material_data_hardness[training_columns]
y = material_data_hardness["hardness"]

In [None]:
# 2. Specify your model:

from ... import ...

In [None]:
# 3. Train your model:



In [None]:
# 4. Evaluate your model

new_y_pred = ...

# You can calculate RMSE and/or plot. Copy code from earlier in the notebook.


Notebook developed by: Alex Ganose, Ryan Kingsbury, Jianli Cheng, Alisa Bettale