# Sequential learning with CAMD - fill in the blanks

**Objective:** Imagine you want to design a procedure for discovering materials that resist deformation. Such efforts have been the subject of at least one [recent paper](https://pubs.acs.org/doi/10.1021/jacs.8b02717), which used machine learning to predict and verify a new superhard material (ReWC0.8). How would we design a method that learns from each of its past experiments in order to improve itself? In this notebook, we cover methods supported by [CAMD](https://s3-eu-west-1.amazonaws.com/itempdf74155353254prod/11860104/Autonomous_Intelligent_Agents_for_Accelerated_Materials_Discovery_v1.pdf), a software package designed to assist scientists with sequential learning, and show how different methods of decision making and feedback perform on a known dataset of materials.

In [None]:
import pandas as pd

## Preprocessing data
The dataset we'll be using in this tutorial is the elastic tensor dataset from Maarten de Jong's 2015 paper, [Charting the complete elastic properties of inorganic crystalline compounds](https://www.nature.com/articles/sdata20159). We'll fetch the data through the [matminer](https://hackingmaterials.lbl.gov/matminer/) API, using a function in the helper code pre-installed on your SageMaker instance. To make the data compatible with the machine learning functionality used later on, we'll featurize it, also with matminer. Lastly, we'll lay the groundwork for simulating our sequential learning procedure by separating the featurized data into **seed_data**, which we assume we know fully a priori, and **candidate_data**, which we assume we know nothing about.

In [None]:
# Load the data
from hackathon.helper import load_tutorial_data
data = load_tutorial_data()

In [None]:
# Inspect the first five rows

In [None]:
# Sort the data and inspect again

In [None]:
# Inspect the lowest five rows

In [None]:
## Generate magpie features
from matminer.featurizers.composition import ElementProperty
from pymatgen.core import Composition  # root-level import is unavailable in newer pymatgen

# Featurize dataframe here

In [None]:
# Inspect features of top candidates

Here we partition the data by choosing every other member of our known data for the seed data and the remainder for our candidate data. Note that this partitioning can have a **significant** impact on how the sequential learning procedure progresses. As an exercise, try one of the commented-out alternatives below, such as using the bottom half of the dataset as the seed, and compare how the rest of the notebook behaves.
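As a rough illustration of the every-other-row split described above, here is a minimal sketch on a made-up table. The frame `toy_data`, its values, and the `toy_` names are assumptions for illustration only and are not part of the tutorial dataset.

```python
import pandas as pd

# Toy stand-in for the featurized dataset (values are made up for illustration)
toy_data = pd.DataFrame(
    {
        "MagpieData toy feature": [10.0, 12.5, 8.0, 30.0, 22.0, 15.5],
        "bulk_modulus": [200.0, 180.0, 150.0, 120.0, 90.0, 60.0],
        "shear_modulus": [100.0, 95.0, 70.0, 50.0, 40.0, 20.0],
    },
    index=["A", "B", "C", "D", "E", "F"],
)

# Rows 0, 2, 4, ... become seed data; rows 1, 3, 5, ... become candidates
toy_seed = toy_data.iloc[::2]
toy_candidates = toy_data.iloc[1::2]

# Candidates keep their features but lose the "answers"
toy_candidates = toy_candidates.drop(columns=["bulk_modulus", "shear_modulus"])

print(list(toy_seed.index))
print(list(toy_candidates.columns))
```

Because the toy table is sorted by bulk modulus, both halves span the full range of values, which is one reason this split tends to behave differently from a top-half or bottom-half seed.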

In [None]:
# Partition data into seed and candidate data

In [None]:
# Drop "answers" from candidate data

In [None]:
# Alternative: choose bottom half
# half = int(len(featurized_data) / 2)
# k_seed_data = featurized_data.iloc[half:]
# k_candidate_data = featurized_data.iloc[:half]
# k_candidate_data = k_candidate_data.drop(['bulk_modulus', 'shear_modulus'], axis=1)

# Alternative: choose randomly
# half = int(len(featurized_data) / 2)
# k_seed_data = featurized_data.sample(half)
# k_candidate_data = featurized_data.drop(k_seed_data.index)
# k_candidate_data = k_candidate_data.drop(['bulk_modulus', 'shear_modulus'], axis=1)

In [None]:
# test to ensure no overlap
assert not set(k_seed_data.index).intersection(k_candidate_data.index)

## Agents

In CAMD, Hypothesis *Agents* are Python objects which select candidates on which to perform experiments. Almost all of the "AI" components within CAMD, including ML algorithms, simpler regression, and even random selection, live in logic implemented within Agents.
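To make the selection idea concrete, the following is a minimal sketch of the fit, predict, and rank pattern that a regression-based Agent might follow. It runs on synthetic data and does not use CAMD at all; the feature name, the data, and the choice of five selections are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data: the target depends almost linearly on one made-up feature
features = pd.DataFrame({"MagpieData toy feature": rng.uniform(0, 10, size=40)})
target = 3.0 * features["MagpieData toy feature"] + rng.normal(0, 0.1, size=40)

# The first 20 rows play the role of known data, the rest are unlabeled candidates
known_X, known_y = features.iloc[:20], target.iloc[:20]
unknown_X = features.iloc[20:]

# Fit on known data, predict the unknowns, and keep the 5 highest predictions
model = LinearRegression().fit(known_X, known_y)
predicted = pd.Series(model.predict(unknown_X), index=unknown_X.index)
top_five = predicted.sort_values(ascending=False).head(5)
print(top_five)
```

Swapping `LinearRegression` for another scikit-learn regressor leaves the rest of the pattern unchanged, which is what makes Agents easy to compare.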


To implement a CAMD-compatible Agent, we subclass the *HypothesisAgent* abstract class, which raises an error unless we implement everything needed for our Agent to work with the sequential learning process implemented in a CAMD *Campaign* (more on Campaigns later).
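The sketch below shows how an abstract class can enforce this kind of contract using Python's standard `abc` module. It is an illustration only: `ToyHypothesisAgent`, `IncompleteAgent`, and `FirstFiveAgent` are hypothetical names, and the assumption that CAMD's base class behaves the same way is not verified here.

```python
from abc import ABC, abstractmethod


class ToyHypothesisAgent(ABC):
    """Hypothetical stand-in for an abstract agent base class."""

    @abstractmethod
    def get_hypotheses(self, candidate_data, seed_data):
        """Return the candidates selected for the next experiments."""


class IncompleteAgent(ToyHypothesisAgent):
    pass  # get_hypotheses is missing on purpose


class FirstFiveAgent(ToyHypothesisAgent):
    def get_hypotheses(self, candidate_data, seed_data):
        # Trivial strategy: take the first five candidates
        return candidate_data[:5]


# Instantiating a subclass that lacks the abstract method raises TypeError
try:
    IncompleteAgent()
    error_message = None
except TypeError as error:
    error_message = str(error)
    print("Could not instantiate:", error_message)

# A subclass that implements the method works as usual
print(FirstFiveAgent().get_hypotheses(list(range(10)), seed_data=None))
```

The error appears at instantiation time rather than when the method is first called, so a missing method is caught before a campaign starts running.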

In [None]:
from camd.agent.base import HypothesisAgent
from sklearn.linear_model import LinearRegression

def get_magpie_features(dataframe):
    """Helper function to get features of dataframe"""
    magpie_columns = [column for column in dataframe 
                      if column.startswith("MagpieData")]
    return dataframe[magpie_columns]

In [None]:
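class LinearHardnessAgent(HypothesisAgent):
    def get_hypotheses(self, candidate_data, seed_data):
        # Fit on known data

        # Predict unknown data

        # Pick top 5 candidates

        # Placeholder so the cell runs before the blanks are filled in
        raise NotImplementedError("Fill in the steps above")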

Let's play with the Agent a bit to see what it recommends.

In [None]:
# Test LinearHardnessAgent's get hypotheses with seed/candidate data

In [None]:
# Compare to data.head

### Exercise - Use a random forest regression
* How do its selections from our dataset differ?
* Try varying the parameters of the regressor - n_estimators, etc.

In [None]:
# For reference, here is how you invoke a random forest regressor
from sklearn.ensemble import RandomForestRegressor
rf_regressor = RandomForestRegressor()
help(RandomForestRegressor)

In [None]:
# Implement rf agent
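class RFHardnessAgent(HypothesisAgent):
    def get_hypotheses(self, candidate_data, seed_data):
        # Fit on known data

        # Predict unknown data

        # Pick top 5 candidates

        # Placeholder so the cell runs before the blanks are filled in
        raise NotImplementedError("Fill in the steps above")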

In [None]:
# Test rf agent

In [None]:
# Compare to top ten candidates

## Experiments

In CAMD, *Experiments* are objects that generate new data corresponding to the output of the *Agent.get_hypotheses* method. In other words, *Agents* pick the candidates on which you want to do experiments, and *Experiments* actually do those experiments. As of today, only two experiments are implemented in CAMD, one of which is an AWS-based density functional theory computation of an input crystal structure. The other, which we'll demonstrate below, is an *after-the-fact sampler*, which fetches the result of an experiment we already did that corresponds to the input.

Why is the ATFSampler useful?  We'll discuss simulation in more detail in a bit, but let's just say we use the ATFSampler to help us evaluate the performance of an Agent when we're trying to pick which agent is the best!

In [None]:
# Import ATF Sampler

In [None]:
# Invoke ATF sampler with featurized data

Note that experiments are *stateful*, meaning that their state is explicitly controlled by the user through the `submit` method. When a new set of experiments is submitted, the previous experiments are appended to an internal history attribute and the new ones are set as the current experiments.
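The stateful behavior described above can be sketched with a small stand-in class. `ToyLookupExperiment` and its attribute names are hypothetical and are not CAMD's `ATFSampler`; the sketch only mirrors the described pattern of looking up known answers and archiving earlier submissions.

```python
import pandas as pd


class ToyLookupExperiment:
    """Hypothetical after-the-fact sampler: looks up precomputed results."""

    def __init__(self, dataframe):
        self.dataframe = dataframe
        self.current = None   # index labels of the current submission
        self.history = []     # index labels of earlier submissions

    def submit(self, hypotheses):
        # Move the previous submission into the history, then store the new one
        if self.current is not None:
            self.history.append(self.current)
        self.current = list(hypotheses.index)

    def get_results(self):
        # "Running" the experiment is just a lookup of known answers
        return self.dataframe.loc[self.current]


# Made-up answers for three hypothetical materials
answers = pd.DataFrame({"bulk_modulus": [200.0, 150.0, 90.0]}, index=["A", "B", "C"])
experiment = ToyLookupExperiment(answers)

experiment.submit(answers.loc[["A", "B"]])   # first batch
experiment.submit(answers.loc[["C"]])        # second batch replaces the first
print(experiment.history)
print(experiment.get_results())
```

After the second call to `submit`, the first batch sits in `history` and only the second batch is returned by `get_results`.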

In [None]:
# Submit hypotheses and get results

## Analyzers

**Analyzers** are a bit tricky to explain because they're not necessary for every sequential learning process. We won't spend much time on them here. In short, after you've performed an experiment, you sometimes want to postprocess the data, both to summarize the results of the current iteration and to augment the **seed data** that the **Agent** uses to decide which candidates to select for further experiments.

In [None]:
from camd.analysis import AnalyzerBase

In [None]:
class BulkModulusAnalyzer(AnalyzerBase):
    def analyze(self, new_experimental_results, seed_data):
        # Create new seed by concatenating old seed and new experiments
        new_seed = pd.concat(
            [seed_data, new_experimental_results],
            axis=0
        )
        
        # Do a few stats on the aggregated results
        # Mean new bulk modulus
        average_new_bulk_modulus = new_experimental_results.bulk_modulus.mean()
        # Average cumulative bulk modulus
        average_dataset_bulk_modulus = new_seed.bulk_modulus.mean()
        # Average rank of new data
        new_result_ranks = new_seed.bulk_modulus.rank(pct=True).loc[
            new_experimental_results.index
        ]
        
        # Construct a summary dataframe to return with the seed
        summary = pd.DataFrame({
            "average_new_bulk_modulus": [average_new_bulk_modulus],
            "average_dataset_bulk_modulus": [average_dataset_bulk_modulus],
            "average_rank": [new_result_ranks.mean()]
        })
        return summary, new_seed  # You must return both objects
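To see what the percentile rank in the analyzer measures, here is a small worked example on made-up numbers. The frames `rank_seed` and `rank_new` and their values are assumptions for illustration.

```python
import pandas as pd

# Three "known" materials and two newly measured ones (made-up values)
rank_seed = pd.DataFrame({"bulk_modulus": [10.0, 20.0, 30.0]}, index=["s1", "s2", "s3"])
rank_new = pd.DataFrame({"bulk_modulus": [40.0, 25.0]}, index=["n1", "n2"])

# Rank every value within the combined data, expressed as a fraction of its size
combined = pd.concat([rank_seed, rank_new], axis=0)
percentile_ranks = combined.bulk_modulus.rank(pct=True)

# Keep only the ranks of the new results: 40.0 is 5th of 5, 25.0 is 3rd of 5
new_ranks = percentile_ranks.loc[rank_new.index]
print(new_ranks)
print(new_ranks.mean())
```

An average rank close to 1 means the latest experiments landed near the top of everything seen so far, which is a simple signal that the Agent is choosing well.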

In [None]:
# Invoke analyzer

In [None]:
# Analyze results with seed data

In [None]:
# Inspect summary

## Data, Campaigns, and Simulations

Now that we've got all of the building blocks in place, let's try putting everything together!
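Before wiring the pieces into a CAMD Campaign, it can help to see the loop written out by hand. The sketch below is plain pandas and NumPy on synthetic data, not the CAMD API; `pick_top`, the data, and the batch size of three are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic "ground truth": 30 materials with one feature and one target
truth = pd.DataFrame({"feature": rng.uniform(0, 10, size=30)})
truth["bulk_modulus"] = 20.0 * truth["feature"] + rng.normal(0, 5, size=30)

seed = truth.iloc[:5]                                        # known up front
candidates = truth.iloc[5:].drop(columns=["bulk_modulus"])   # answers hidden


def pick_top(candidate_data, seed_data, n=3):
    """Toy agent: least-squares line on the seed, keep the n best predictions."""
    slope, intercept = np.polyfit(seed_data["feature"], seed_data["bulk_modulus"], 1)
    predicted = slope * candidate_data["feature"] + intercept
    return candidate_data.loc[predicted.nlargest(n).index]


history = []
for iteration in range(4):
    hypotheses = pick_top(candidates, seed)             # agent proposes
    results = truth.loc[hypotheses.index]               # experiment looks up answers
    seed = pd.concat([seed, results], axis=0)           # analyzer grows the seed
    candidates = candidates.drop(index=results.index)   # chosen rows leave the pool
    history.append(results["bulk_modulus"].mean())      # per-iteration summary

print(len(seed), len(candidates))
print(history)
```

Each pass moves three rows from the candidate pool into the seed, so after four iterations the seed holds 17 rows and 13 candidates remain.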

In [None]:
import os
from monty.os import cd
from camd.campaigns.base import Campaign
# Set up folders
os.system('rm -rf test')
os.system('mkdir -p test')
# Reinitialize experiment to clear history
k_atf_experiment = ATFSampler(dataframe=featurized_data)

In [None]:
# Invoke, initialize, and run campaign

In [None]:
# Read the results

In [None]:
# inspect history

In [None]:
# Plot history

In [None]:
# Fetch aggregated history

In [None]:
# Inspect top of recent history

In [None]:
# Do some highlighting
k_candidate_data.style.apply(
    lambda row: ['background: darkorange'
                 if row.name in result_history.index
                 else '' for _ in row], axis=1)

## Final thoughts
There's a lot more that we can do to improve our postprocessing analysis of how well the campaign proceeded, but this should get you started.  A few exercises you might find interesting to try:

* Test different regressors from scikit-learn. This [documentation of their supervised learning methods](https://scikit-learn.org/stable/supervised_learning.html) points to many of them.
* Test the agent using multiple random seeds and determine the spread in discovery rate.
* Develop an explore/exploit strategy where you choose some candidates from the regressor prediction and some randomly.
* Try different datasets. See the datasets folder or the [matminer datasets documentation](https://hackingmaterials.lbl.gov/matminer/dataset_summary.html).
* Try different featurizers. See [the matminer featurizer documentation](https://hackingmaterials.lbl.gov/matminer/featurizer_summary.html).
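As a starting point for the explore/exploit exercise, here is one possible sketch of a mixed selection rule. The function `explore_exploit_select`, its parameters, and the toy predictions are hypothetical; a real Agent would apply the same idea to its regressor's output.

```python
import pandas as pd


def explore_exploit_select(predictions, n_exploit=3, n_explore=2, random_state=0):
    """Pick the top predictions plus a random sample of the remaining candidates."""
    # Exploit: the candidates the model is most optimistic about
    exploit = predictions.nlargest(n_exploit).index
    # Explore: a random draw from everything that was not already chosen
    remaining = predictions.drop(index=exploit)
    explore = remaining.sample(
        n=min(n_explore, len(remaining)), random_state=random_state
    ).index
    return list(exploit) + list(explore)


# Made-up predicted values for ten hypothetical candidates
toy_predictions = pd.Series(
    [5.0, 9.0, 1.0, 7.0, 3.0, 8.0, 2.0, 6.0, 4.0, 0.0],
    index=[f"cand_{i}" for i in range(10)],
)
selected = explore_exploit_select(toy_predictions)
print(selected)
```

Varying the ratio of `n_exploit` to `n_explore`, or the `random_state`, gives a simple way to study how much randomness helps or hurts the discovery rate.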

## Glossary
* **Agent** - decision-making object in CAMD; must implement `get_hypotheses` in order to work properly in the loop
* **Experiment** - object which performs some action in order to determine unknowns about an input dataset
* **Analyzer** - object which postprocesses experimental outputs and prior seed data in order to provide new seed data
* **seed_data** - data which is "known" either before the start of a given **Campaign** or prior to any iteration. It informs the **Agent**'s decision about which entries to select from the **candidate_data**.
* **candidate_data** - data which represents the set of "unknowns" at a given point in time for a **Campaign**.
* **Campaign** - the iterative procedure by which an **Agent** suggests experiments from the **candidate_data**, the **Experiment** performs them, and the **Analyzer** analyzes them and feeds new **seed_data** and a new set of **candidate_data** back to the **Agent** to start a new iteration.