# Classical Learning

```{admonition} Hugh Cartright
:class: tip
The tools of science are changing; artificial intelligence has spread to the laboratory.
```

<iframe class="speakerdeck-iframe" frameborder="0" src="https://speakerdeck.com/player/b098c15f50ce4a468a1c5eecd6de0f96" title="Machine Learning for Materials (Lecture 5)" allowfullscreen="true" style="border: 0px; background-clip: padding-box; background-color: rgba(0, 0, 0, 0.1); margin: 0px; padding: 0px; border-radius: 6px; box-shadow: rgba(0, 0, 0, 0.2) 0px 5px 40px; width: 100%; height: auto; aspect-ratio: 560 / 420;" data-ratio="1.3333333333333333"></iframe>

[Lecture slides](https://speakerdeck.com/aronwalsh/mlformaterials-lecture5-classical)

## 🎲 Metal or insulator?

Some decisions in life are difficult to make. We hope our experience informs a choice that is better than a random guess. The same is true for machine learning models.

There are many situations where we want to classify materials according to their properties. One fundamental characteristic is whether a material is a metal or insulator. For this exercise, we refer to metals as class `1` and insulators as class `0`, matching the labels used in the code below.

Cu is clearly `1`, and MgO is `0`, but what about Tl<sub>2</sub>O<sub>3</sub> or Ni<sub>2</sub>Zn<sub>4</sub>?

### Theoretical background

Metals are characterised by their free electrons that facilitate the flow of electric current. This arises from a partially filled conduction band, allowing electrons to move easily when subjected to an electric field.

Insulators are characterised by a fully occupied valence band and an empty conduction band, separated by an energy gap. The absence of mobile charge carriers suppresses electrical conductivity. Understanding these fundamental differences between metals and insulators is crucial for designing and optimising electronic devices.

In this practical, we can use the electronic band gap of a material as a simple descriptor of whether it is a metal ($E_g = 0$) or an insulator ($E_g > 0$).

$$
E_g = E^{\text{conduction band}}_{\text{minimum}} - E^{\text{valence band}}_{\text{maximum}}
$$

This classification is coarse as we are ignoring the intermediate regime of semiconductors and more exotic behaviour such as superconductivity.
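
As a minimal sketch of this descriptor, the snippet below labels a material from its band gap. The function name `classify_by_band_gap`, the tolerance `tol`, and the example gap values are illustrative assumptions and are not part of the practical.

```python
def classify_by_band_gap(e_gap, tol=1e-6):
    """Return 'metal' if the band gap (in eV) is zero within `tol`, else 'insulator'."""
    if e_gap < 0:
        raise ValueError("Band gap cannot be negative")
    return "metal" if e_gap <= tol else "insulator"

# Illustrative band gaps in eV (approximate values, for demonstration only)
example_gaps = {"Cu": 0.0, "MgO": 7.8}
predictions = {name: classify_by_band_gap(gap) for name, gap in example_gaps.items()}
print(predictions)
```

A semiconductor with a small but finite gap would also be labelled an insulator here, which is exactly the coarseness noted above.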

![image](./images/5_bands.png)

## $k$-means clustering

We'll start by generating some synthetic data for materials with their class labels. To make the analysis faster and more illustrative, we perform a dimensionality reduction from a 10D to a 2D feature space, and then cluster the data using $k$-means.

In [None]:
# Installation of libraries
!pip install -U elementembeddings --quiet
!pip install matminer --quiet

In [None]:
# Import of modules
import numpy as np  # Numerical operations
import pandas as pd  # Data manipulation with DataFrames
import matplotlib.pyplot as plt  # Plotting
import seaborn as sns  # Statistical visualisation
from sklearn.decomposition import PCA  # Principal component analysis
from sklearn.cluster import KMeans  # k-means clustering
from sklearn.metrics import accuracy_score, confusion_matrix  # Metrics for model evaluation
from sklearn.tree import DecisionTreeClassifier  # Decision tree classifier

<details>
<summary>Colab error solution</summary>
If running the import module cell fails with an "AttributeError", click `Runtime` -> `Restart Session` and then simply rerun the cell.
</details>

Pay attention to each step in the process:

In [None]:
# Step 0: Set the number of clusters
n_clusters = 0

# Step 1: Generating sample data
np.random.seed(42)
num_materials = 200
num_features = 10
data = np.random.rand(num_materials, num_features)
labels = np.random.randint(0, 2, num_materials)

# Step 2: Reduce dimensions to 2 using principal component analysis (PCA)
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(data)

# Step 3: Cluster the data using k-means
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
predicted_labels = kmeans.fit_predict(reduced_data)

# Step 4: Create a plot to visualise the clusters and known labels
plt.figure(figsize=(5, 4))

# Plot the materials labeled as metal (label=1)
plt.scatter(reduced_data[labels == 1, 0], reduced_data[labels == 1, 1], c='blue', label='Metal')

# Plot the materials labeled as insulator (label=0)
plt.scatter(reduced_data[labels == 0, 0], reduced_data[labels == 0, 1], c='red', label='Insulator')

# Plot the cluster centers
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c='green', s=100, label='Cluster centers')

# Draw cluster boundaries
h = 0.02  # step size for the meshgrid
x_min, x_max = reduced_data[:, 0].min() - 1, reduced_data[:, 0].max() + 1
y_min, y_max = reduced_data[:, 1].min() - 1, reduced_data[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.5, cmap='viridis')

plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('$k$-means clustering of artificial materials')
plt.legend()
plt.show()

<details>
<summary> Code hint </summary>
The algorithm fails for 0 clusters. 
Increase the value of `n_clusters` and look at the behaviour.
</details>

The cluster centres are shown by green dots. It doesn't do a great job, as we just generated this "materials data" from random numbers. There are no correlations for the algorithms to exploit. Nonetheless, this type of "failed experiment" is common in real research.
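
One way to see the lack of correlations is to check how much variance the two principal components retain. The sketch below regenerates random data of the same shape (the names `random_features` and `pca_check` are chosen here to avoid overwriting the notebook variables). For uncorrelated features, the retained fraction should be only a little above the 20% expected if the variance were spread evenly over all 10 dimensions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
random_features = rng.random((200, 10))  # same shape as the synthetic data above

# Fit a 2-component PCA and sum the fraction of variance each component explains
pca_check = PCA(n_components=2).fit(random_features)
retained = pca_check.explained_variance_ratio_.sum()
print(f"Variance retained by 2 of 10 components: {retained:.2f}")
```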

Since we know the labels here, we can quantify how bad the model is by calculating the classification accuracy. Is it better than flipping a coin?

In [None]:
# Step 5: Quantify classification accuracy
accuracy = accuracy_score(labels, predicted_labels)
conf_matrix = confusion_matrix(labels, predicted_labels)

print("Accuracy:", accuracy)
print("\nConfusion matrix:")
print(conf_matrix)
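
One caveat when scoring a clustering as if it were a classifier: the cluster indices returned by $k$-means are arbitrary, so cluster `0` need not correspond to class `0`. The sketch below, which assumes exactly two clusters and uses its own demo variables (`demo_data`, `demo_labels`), scores both possible cluster-to-class mappings and keeps the better one.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
demo_data = rng.random((200, 2))        # random 2D features
demo_labels = rng.integers(0, 2, 200)   # random binary classes

demo_clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(demo_data)

# Cluster indices are arbitrary, so score both possible cluster-to-class mappings
acc_direct = accuracy_score(demo_labels, demo_clusters)
acc_flipped = accuracy_score(demo_labels, 1 - demo_clusters)
best_accuracy = max(acc_direct, acc_flipped)
print(f"Direct: {acc_direct:.2f}, flipped: {acc_flipped:.2f}, best: {best_accuracy:.2f}")
```

For two clusters the two accuracies sum to one, so the better mapping always scores at least 0.5, even on random data. With more than two clusters, the accuracy computed above is no longer meaningful without a more general mapping from clusters to classes.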

## Decision tree classifier

Let's see if we can do better using a dedicated classifier. We will now train a decision tree classifier to tackle the same classification problem and visualise the decision boundary.

In [None]:
# Step 0: Set the depth of the decision tree
max_tree_depth = 0

# Step 1: Train a decision tree classifier
def train_decision_tree(depth, reduced_data, labels):
    tree_classifier = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree_classifier.fit(reduced_data, labels)
    return tree_classifier

tree_classifier = train_decision_tree(max_tree_depth, reduced_data, labels)
predicted_labels = tree_classifier.predict(reduced_data)

# Step 2: Create a plot to visualise the decision boundary of the decision tree
plt.figure(figsize=(5, 4))

# Plot the materials labeled as metal (label=1)
plt.scatter(reduced_data[labels == 1, 0], reduced_data[labels == 1, 1], c='blue', label='Metal')

# Plot the materials labeled as insulator (label=0)
plt.scatter(reduced_data[labels == 0, 0], reduced_data[labels == 0, 1], c='red', label='Insulator')

# Plot the decision boundary of the decision tree classifier
h = 0.02  # step size for the meshgrid
x_min, x_max = reduced_data[:, 0].min() - 1, reduced_data[:, 0].max() + 1
y_min, y_max = reduced_data[:, 1].min() - 1, reduced_data[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = tree_classifier.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, alpha=0.5, cmap='viridis')

plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title(f'Decision tree (max depth={max_tree_depth}) for artificial materials')
plt.legend()

plt.show()

<details>
<summary> Code hint </summary>
With no nodes, you have made an indecisive tree 🥁.
    
Increase the value of `max_tree_depth` and look at the behaviour.
</details>

There should be more structure in the decision boundary due to the more complex model.

$k$-means clustering provides a simple way to group materials based on similarity. Each point is assigned to its nearest cluster centre, so the boundary between two clusters is a straight line (piecewise linear for more clusters). This method works well when the data form distinct, compact clusters. For intricate or overlapping distributions, however, it cannot capture the complexity of the underlying patterns.

On the other hand, the decision tree classifier is better at handling non-linear separations. It constructs a boundary from a sequence of thresholds on individual features, enabling it to capture fine-grained patterns. As always in ML, we must balance the trade-off between simplicity and accuracy.
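
The difference between the two kinds of boundary can be checked directly. The following sketch, on freshly generated random data with illustrative names (`points`, `classes`, `km`, `tree`), confirms that a $k$-means prediction is the index of the nearest centroid and prints the feature thresholds learned by a shallow decision tree.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
points = rng.random((300, 2))
classes = rng.integers(0, 2, 300)

# k-means: each point is assigned to its nearest centroid
km = KMeans(n_clusters=2, n_init=10, random_state=1).fit(points)
distances = np.linalg.norm(points[:, None, :] - km.cluster_centers_[None, :, :], axis=2)
nearest_centroid = distances.argmin(axis=1)
agreement = (nearest_centroid == km.predict(points)).mean()
print(f"Fraction of points assigned to the nearest centroid: {agreement:.2f}")

# Decision tree: each split is a threshold on a single feature
tree = DecisionTreeClassifier(max_depth=2, random_state=1).fit(points, classes)
print(export_text(tree, feature_names=["PC1", "PC2"]))
```

Each line of the printed tree is a rule of the form `feature <= threshold`, which is why the decision regions in the plot are axis-aligned rectangles.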

Is the decision tree more accurate? Let's see.

In [None]:
# Step 3: Quantify classification accuracy
accuracy = accuracy_score(labels, predicted_labels)
conf_matrix = confusion_matrix(labels, predicted_labels)

print("Accuracy:", accuracy)
print("\nConfusion Matrix:")
print(conf_matrix)

If you chose a large tree depth, then the decision tree will approach a perfect accuracy of 1.0. It does this by memorising the training data but is unlikely to generalise well to new (unseen) data, i.e. overfitting. In contrast, the accuracy of $k$-means clustering is lower because it is an unsupervised algorithm designed for clustering, not classification. Its performance depends on the data structure and the presence of distinct clusters in that feature space.

To obtain reliable results and a robust model, it is essential to split the data into training and testing sets, perform validation, and use other evaluation metrics to assess the model performance.
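
A minimal sketch of such a workflow is shown below, using scikit-learn's `train_test_split` and `cross_val_score` on random demo data. The variable names, the 25% test fraction, and the chosen depths are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X_demo = rng.random((200, 2))
y_demo = rng.integers(0, 2, 200)

# Hold out 25% of the data that the model never sees during training
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

for depth in (1, 5, None):  # None lets the tree grow until every leaf is pure
    model = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    print(f"max_depth={depth}: train accuracy {train_acc:.2f}, test accuracy {test_acc:.2f}")

# 5-fold cross-validation gives a more stable estimate than a single split
cv_scores = cross_val_score(
    DecisionTreeClassifier(max_depth=5, random_state=42), X_demo, y_demo, cv=5
)
print(f"5-fold cross-validation accuracy: {cv_scores.mean():.2f} +/- {cv_scores.std():.2f}")
```

On random labels the unrestricted tree memorises the training set, while the test and cross-validation accuracies should stay close to a coin flip. The gap between the two is the signature of overfitting.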

## Real materials

We can save time by using a pre-built dataset. We will return to [matminer](https://hackingmaterials.lbl.gov/matminer) that we used before and load `matbench_expt_is_metal`.

### Load dataset

In [None]:
# Imports
import matminer
from matminer.datasets.dataset_retrieval import load_dataset

# Use matminer to download the dataset
df = load_dataset('matbench_expt_is_metal')
print(f'The full dataset contains {df.shape[0]} entries. \n')

print('The DataFrame is shown below:')
df.head(10)

<details>
<summary> Code hint </summary>
To load a different dataset, you simply change the name in 'load_dataset()'.
</details>

### Materials featurisation

Revisiting concepts from Notebooks 3 and 4, featurising the chemical compositions is necessary to create a useful set of input vectors. This allows the presence (or absence) of an element (or element combination) to act as a feature that the classifier takes into account.
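
To make the idea concrete, here is a toy featuriser written in plain Python. It is a sketch of the general concept only, not how ElementEmbeddings works internally: the truncated `ELEMENTS` list and the `toy_featurise` helper are hypothetical, and formulas with brackets are not handled.

```python
import re
import numpy as np

ELEMENTS = ["Cu", "Mg", "O", "Tl", "Zn", "Ni"]  # hypothetical, truncated element list

def toy_featurise(formula):
    """Map a simple formula such as 'Tl2O3' to a vector of element counts."""
    vector = np.zeros(len(ELEMENTS))
    # Each match is an element symbol followed by an optional count
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*\.?\d*)", formula):
        vector[ELEMENTS.index(symbol)] += float(count) if count else 1.0
    return vector

print(toy_featurise("Tl2O3"))
print(toy_featurise("MgO"))
```

Each position in the vector corresponds to one element, so a zero entry encodes the absence of that element from the composition.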

We will use [ElementEmbeddings](https://wmd-group.github.io/ElementEmbeddings/0.4/) again to featurise the `composition` column.

In [None]:
# Featurise the compositions 
from elementembeddings.composition import composition_featuriser
onehot_df = composition_featuriser(df["composition"], embedding="atomic", stats=["sum"])

# Convert the Boolean is_metal column to integer values (1 = metal, 0 = non-metal)
onehot_df['is_metal'] = df['is_metal'].astype(int)
onehot_df.head(10)

We now have a DataFrame that is suitable for our clustering task!

## 🚨 Exercise 5: Metallicity

```{admonition} Coding exercises
:class: note
The exercises are designed to apply what you have learned with room for creativity. It is fine to discuss solutions with your classmates, but the actual code should not be directly copied.

The completed notebooks are to be submitted at the end of class, but you can revisit them later, experiment with the code, and follow the further reading suggestions.
```

### Your details

In [None]:
import numpy as np

# Insert your values
Name = "No Name" # Replace with your name
CID = 123446 # Replace with your College ID (as a numeric value with no leading 0s)

# Set a random seed using the CID value
CID = int(CID)
np.random.seed(CID)

# Print the message
print("This is the work of " + Name + " [CID: " + str(CID) + "]")

### Tasks

You will now apply the classification analysis tested above to real materials data. Remember that we have already defined the `onehot_df` DataFrame that contains materials features and the binary class (`is_metal`). You have one task to complete:

1. Perform $k$-means clustering of the `matbench_expt_is_metal` dataset. The starting point is to extract appropriate feature (`data`) and label (`labels`) values from `onehot_df`.

```python 
# Process training data
cols_to_drop = ['is_metal', 'formula']
feature_cols = [col for col in list(onehot_df.columns) if col not in cols_to_drop]
data = onehot_df[feature_cols].values
labels = onehot_df['is_metal']
```
Do the resulting clusters map well onto the metal-insulator classes?

*Self-study (optional)*  

2. Perform decision tree classification on the dataset using cross-validation to make a robust model. Based on the performance, is it a useful model?
  
3. Predict if a new crystal that I discovered is metallic. Its composition is AlGaN$_2$. This will involve creating the feature vector for AlGaN$_2$ and then using your model predictively, e.g. `model.predict(AlGaN2)`. In reality, it should be an insulator (semiconductor) as it is a mixture of GaN and AlN.

<details>
<summary> Task hint </summary>
For task 3, you can featurise a new composition using a command such as `new_material = composition_featuriser(["AlGaN2"], embedding="atomic", stats=["sum"])`
</details>

```{admonition} Submission
:class: note
When your notebook is complete, click on the download icon on the top right, select `.pdf`, save the file and upload it to MyDepartment. If you are using Google Colab, you have to print to pdf.
```

In [None]:
#Code block 




In [None]:
#Comment block 




In [None]:
#Code block 




In [None]:
#Comment block 




## 🌊 Dive deeper

* _Level 1:_ Tackle Chapter 6 on Linear Two-Class Classification in [Machine Learning Refined](https://github.com/jermwatt/machine_learning_refined#what-is-new-in-the-second-edition).

* _Level 2:_ Play [metal detection](http://palestrina.northwestern.edu/metal-detection/). Note, the website can be a little temperamental. 

* _Level 3:_ Dig deeper into the options for defining decision trees and ensemble models in [scikit-learn](https://scikit-learn.org/stable/modules/tree.html).