# Sequential learning with CAMD

CAMD is a package designed to assist materials science researchers with *sequential learning*,
which we define as an iterative process of experimentation that improves knowledge or strategy with each iteration.

In [1]:
import pandas as pd

## Data
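The dataset we'll be using in this tutorial is a table of materials, indexed by material ID, with each material's formula, space group, shear modulus, and bulk modulus.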

In [2]:
from hackathon.helper import load_tutorial_data
data = load_tutorial_data()
# See the first five rows
data.head()

| material_id | formula | space_group | shear_modulus | bulk_modulus |
|---|---|---|---|---|
| mp-10003 | Nb4CoSi | 124 | 97.1 | 194.3 |
| mp-10010 | Al(CoSi)2 | 164 | 96.3 | 175.4 |
| mp-10015 | SiOs | 221 | 130.1 | 295.1 |
| mp-10021 | Ga | 63 | 15.1 | 49.1 |
| mp-10025 | SiRu2 | 62 | 101.9 | 256.8 |


In [3]:
data = data.sort_values('bulk_modulus', ascending=False)
data.head()

| material_id | formula | space_group | shear_modulus | bulk_modulus |
|---|---|---|---|---|
| mp-611426 | C | 194 | 522.9 | 435.7 |
| mp-49 | Os | 194 | 258.7 | 401.3 |
| mp-1894 | WC | 187 | 279.0 | 385.2 |
| mp-8 | Re | 194 | 173.1 | 365.1 |
| mp-30745 | Ir3W | 194 | 193.3 | 351.3 |


In [4]:
data.tail()

| material_id | formula | space_group | shear_modulus | bulk_modulus |
|---|---|---|---|---|
| mp-56 | Ba | 194 | 3.4 | 8.0 |
| mp-127 | Na | 229 | 3.2 | 7.5 |
| mp-614603 | CsI | 225 | 3.9 | 7.4 |
| mp-569289 | Hg | 229 | 2.7 | 7.2 |
| mp-571222 | CsBr | 225 | 4.6 | 6.5 |


In [5]:
# Generate some features
from matminer.featurizers.composition import ElementProperty
from pymatgen.core import Composition

# Parse each formula string into a pymatgen Composition object
data['composition'] = data['formula'].apply(Composition)
# Compute Magpie elemental-property statistics for each composition
featurizer = ElementProperty.from_preset("magpie")
featurized_data = featurizer.featurize_dataframe(data, 'composition')


In [6]:
featurized_data.head()

| material_id | formula | space_group | shear_modulus | bulk_modulus | composition | MagpieData minimum Number | MagpieData maximum Number | MagpieData range Number | MagpieData mean Number | MagpieData avg_dev Number | ... | MagpieData range GSmagmom | MagpieData mean GSmagmom | MagpieData avg_dev GSmagmom | MagpieData mode GSmagmom | MagpieData minimum SpaceGroupNumber | MagpieData maximum SpaceGroupNumber | MagpieData range SpaceGroupNumber | MagpieData mean SpaceGroupNumber | MagpieData avg_dev SpaceGroupNumber | MagpieData mode SpaceGroupNumber |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mp-611426 | C | 194 | 522.9 | 435.7 | (C) | 6.0 | 6.0 | 0.0 | 6.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 194.0 | 194.0 | 0.0 | 194.0 | 0.0 | 194.0 |
| mp-49 | Os | 194 | 258.7 | 401.3 | (Os) | 76.0 | 76.0 | 0.0 | 76.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 194.0 | 194.0 | 0.0 | 194.0 | 0.0 | 194.0 |
| mp-1894 | WC | 187 | 279.0 | 385.2 | (W, C) | 6.0 | 74.0 | 68.0 | 40.0 | 34.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 194.0 | 229.0 | 35.0 | 211.5 | 17.5 | 194.0 |
| mp-8 | Re | 194 | 173.1 | 365.1 | (Re) | 75.0 | 75.0 | 0.0 | 75.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 194.0 | 194.0 | 0.0 | 194.0 | 0.0 | 194.0 |
| mp-30745 | Ir3W | 194 | 193.3 | 351.3 | (Ir, W) | 74.0 | 77.0 | 3.0 | 76.25 | 1.125 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 225.0 | 229.0 | 4.0 | 226.0 | 1.5 | 225.0 |
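To get a feel for what the `Number` feature columns contain, the values for Ir3W and WC in the table above can be reproduced by hand. The sketch below assumes each statistic is weighted by atomic fraction, with `avg_dev` taken as the weighted mean absolute deviation from the mean; that assumption is consistent with the rows shown, but the helper `weighted_stats` is hypothetical and is not part of matminer.

```python
def weighted_stats(fractions, values):
    """Atomic-fraction-weighted statistics of one elemental property."""
    mean = sum(f * v for f, v in zip(fractions, values))
    avg_dev = sum(f * abs(v - mean) for f, v in zip(fractions, values))
    return {
        "minimum": min(values),
        "maximum": max(values),
        "range": max(values) - min(values),
        "mean": mean,
        "avg_dev": avg_dev,
    }


# Ir3W: three Ir atoms (atomic number 77) and one W atom (atomic number 74)
ir3w = weighted_stats(fractions=[0.75, 0.25], values=[77, 74])
# WC: one W atom (atomic number 74) and one C atom (atomic number 6)
wc = weighted_stats(fractions=[0.5, 0.5], values=[74, 6])

print(ir3w)  # mean 76.25 and avg_dev 1.125, as in the Ir3W row
print(wc)    # mean 40.0 and avg_dev 34.0, as in the WC row
```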


## Agents

In CAMD, hypothesis *Agents* are Python objects that select the candidates on which to perform experiments.  Almost all of the "AI" components within CAMD, including ML algorithms, simpler regression, and even random selection, are implemented as logic within Agents.


To implement a CAMD-compatible Agent, we subclass the *HypothesisAgent* abstract class, which raises an error if our Agent does not implement everything it needs in order to be compatible with the sequential learning process implemented in a CAMD *Campaign* (more on Campaigns later).
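As a rough illustration of how an abstract class can enforce this contract, here is a minimal sketch using Python's standard `abc` module. `SketchAgentBase`, `IncompleteAgent`, and `RandomAgent` are hypothetical names made up for this example; this is not CAMD's own code, only one common way to get the "error if you forget a method" behavior. The random agent also shows that an Agent does not need any machine learning at all.

```python
from abc import ABC, abstractmethod

import pandas as pd


class SketchAgentBase(ABC):
    """Hypothetical stand-in for an agent base class (not CAMD's code)."""

    @abstractmethod
    def get_hypotheses(self, candidate_data, seed_data):
        """Return the rows of candidate_data to run experiments on."""


class IncompleteAgent(SketchAgentBase):
    pass  # forgets to implement get_hypotheses


class RandomAgent(SketchAgentBase):
    """Selects candidates at random, ignoring the seed data."""

    def __init__(self, n_query=2, random_state=0):
        self.n_query = n_query
        self.random_state = random_state

    def get_hypotheses(self, candidate_data, seed_data=None):
        return candidate_data.sample(
            n=self.n_query, random_state=self.random_state)


# Instantiating a subclass that lacks get_hypotheses raises TypeError
try:
    IncompleteAgent()
    instantiation_failed = False
except TypeError:
    instantiation_failed = True

candidates = pd.DataFrame(
    {"formula": ["WC", "Os", "Re", "Ir"]},
    index=["mp-1894", "mp-49", "mp-8", "mp-101"],
)
picked = RandomAgent().get_hypotheses(candidates)
```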

In [7]:
from camd.agent.base import HypothesisAgent
from sklearn.linear_model import LinearRegression

In [8]:
class LinearHardnessAgent(HypothesisAgent):
    def get_hypotheses(self, candidate_data, seed_data):
        # Fit on known data
        x_known = seed_data.loc[:, 
            'MagpieData minimum Number':'MagpieData mode SpaceGroupNumber'
        ]
        y_known = seed_data['bulk_modulus']
        regressor = LinearRegression()
        regressor.fit(x_known, y_known)
        
        # Predict unknown data
        x_unknown = candidate_data.loc[:, 
            'MagpieData minimum Number':'MagpieData mode SpaceGroupNumber'
        ]
        y_predicted = regressor.predict(x_unknown)
        
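        # Pick top 5 candidates; assign() returns a new DataFrame, so
        # the caller's candidate_data is not modified in place
        candidate_data = candidate_data.assign(bulk_modulus_pred=y_predicted)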
        candidate_data = candidate_data.sort_values(
            'bulk_modulus_pred', ascending=False)
        top_candidates = candidate_data.head(5)
        return top_candidates

Let's play with the Agent a bit to see what it recommends.

In [9]:
agent = LinearHardnessAgent()
hypotheses = agent.get_hypotheses(featurized_data, featurized_data)
hypotheses[['formula', 'bulk_modulus', 'bulk_modulus_pred']]

| material_id | formula | bulk_modulus | bulk_modulus_pred |
|---|---|---|---|
| mp-1894 | WC | 385.2 | 356.592477 |
| mp-49 | Os | 401.3 | 332.343383 |
| mp-2305 | MoC | 349.8 | 331.855661 |
| mp-91 | W | 303.9 | 324.437405 |
| mp-567397 | W2C | 335.8 | 304.896677 |


In [10]:
data.head()

| material_id | formula | space_group | shear_modulus | bulk_modulus | composition |
|---|---|---|---|---|---|
| mp-611426 | C | 194 | 522.9 | 435.7 | (C) |
| mp-49 | Os | 194 | 258.7 | 401.3 | (Os) |
| mp-1894 | WC | 187 | 279.0 | 385.2 | (W, C) |
| mp-8 | Re | 194 | 173.1 | 365.1 | (Re) |
| mp-30745 | Ir3W | 194 | 193.3 | 351.3 | (Ir, W) |


### Exercise - Use a random forest regression
How do its selections from our dataset differ?

In [11]:
### Implement agent here
from sklearn.ensemble import RandomForestRegressor

class RFHardnessAgent(HypothesisAgent):
    def get_hypotheses(self, candidate_data, seed_data):
        # Fit on known data
        x_known = seed_data.loc[:, 
            'MagpieData minimum Number':'MagpieData mode SpaceGroupNumber'
        ]
        y_known = seed_data['bulk_modulus']
        regressor = RandomForestRegressor(n_estimators=10)
        regressor.fit(x_known, y_known)
        
        # Predict unknown data
        x_unknown = candidate_data.loc[:, 
            'MagpieData minimum Number':'MagpieData mode SpaceGroupNumber'
        ]
        y_predicted = regressor.predict(x_unknown)
        
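        # Pick top 5 candidates; assign() returns a new DataFrame, so
        # the caller's candidate_data is not modified in place
        candidate_data = candidate_data.assign(bulk_modulus_pred=y_predicted)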
        candidate_data = candidate_data.sort_values(
            'bulk_modulus_pred', ascending=False)
        top_candidates = candidate_data.head(5)
        return top_candidates

In [12]:
### Test agent here
agent = RFHardnessAgent()
hypotheses = agent.get_hypotheses(featurized_data, featurized_data)
hypotheses[['formula', 'bulk_modulus', 'bulk_modulus_pred']]

| material_id | formula | bulk_modulus | bulk_modulus_pred |
|---|---|---|---|
| mp-49 | Os | 401.3 | 375.28 |
| mp-1894 | WC | 385.2 | 364.46 |
| mp-8 | Re | 365.1 | 353.61 |
| mp-101 | Ir | 346.3 | 353.48 |
| mp-567397 | W2C | 335.8 | 341.186 |


In [13]:
data.head(10)

| material_id | formula | space_group | shear_modulus | bulk_modulus | composition |
|---|---|---|---|---|---|
| mp-611426 | C | 194 | 522.9 | 435.7 | (C) |
| mp-49 | Os | 194 | 258.7 | 401.3 | (Os) |
| mp-1894 | WC | 187 | 279.0 | 385.2 | (W, C) |
| mp-8 | Re | 194 | 173.1 | 365.1 | (Re) |
| mp-30745 | Ir3W | 194 | 193.3 | 351.3 | (Ir, W) |
| mp-2305 | MoC | 187 | 239.8 | 349.8 | (Mo, C) |
| mp-101 | Ir | 225 | 216.5 | 346.3 | (Ir) |
| mp-11482 | MoIr3 | 194 | 187.7 | 337.0 | (Mo, Ir) |
| mp-567397 | W2C | 162 | 165.7 | 335.8 | (W, C) |
| mp-30744 | IrW | 51 | 182.8 | 334.2 | (Ir, W) |
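One quick way to answer the exercise question is to compare the two sets of picks against each other and against the true top ten. The sketch below copies the material IDs from the output tables above rather than calling the agents again. Keep in mind that the random forest was created without a fixed random seed, so its picks may differ from run to run, and that both agents were fit on the same rows they were asked to rank.

```python
# Material IDs copied from the outputs above
linear_picks = ["mp-1894", "mp-49", "mp-2305", "mp-91", "mp-567397"]
forest_picks = ["mp-49", "mp-1894", "mp-8", "mp-101", "mp-567397"]
true_top_10 = [
    "mp-611426", "mp-49", "mp-1894", "mp-8", "mp-30745",
    "mp-2305", "mp-101", "mp-11482", "mp-567397", "mp-30744",
]

# Candidates that both agents selected
shared = set(linear_picks) & set(forest_picks)

# Picks that fall within the ten highest measured bulk moduli
linear_hits = [m for m in linear_picks if m in true_top_10]
forest_hits = [m for m in forest_picks if m in true_top_10]

print(sorted(shared))    # ['mp-1894', 'mp-49', 'mp-567397']
print(len(linear_hits))  # 4 (mp-91, W, is outside the top ten)
print(len(forest_hits))  # 5
```

Neither agent selected mp-611426 (C), the material with the highest bulk modulus in the dataset.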


## Experiments

In CAMD, *Experiments* are objects that generate new data corresponding to the output of the *Agent.get_hypotheses* method.  In other words, *Agents* pick the candidates on which you want to do experiments, and *Experiments* actually do those experiments.  At the time of writing, only two experiments are implemented in CAMD.  One is an AWS-based density functional theory computation of an input crystal structure.  The other, which we'll demonstrate below, is an *after-the-fact sampler* (ATFSampler), which fetches the result of an experiment we have already done that corresponds to the input.

Why is the ATFSampler useful?  We'll discuss simulation in more detail in a bit, but in short: because the results are already known, the ATFSampler lets us evaluate the performance of an Agent when we're trying to pick which Agent is best.

In [14]:
from camd.experiment.base import ATFSampler

In [15]:
k_atf_experiment = ATFSampler(dataframe=data)

Note that experiments are *stateful*, meaning that their state is explicitly controlled by the user through the `submit` method.  When a new set of experiments is submitted, the previous experiments are appended to an internal history attribute and the new ones are set as the current experiments.

In [16]:
k_atf_experiment.submit(hypotheses)
results = k_atf_experiment.get_results()
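To make the after-the-fact and stateful behavior concrete, here is a toy sketch of the idea. `ToyLookupExperiment` and its `current` and `history` attributes are hypothetical names invented for this example; it is not CAMD's ATFSampler implementation, only an illustration of "look up known results by index" combined with "remember earlier submissions".

```python
import pandas as pd


class ToyLookupExperiment:
    """Hypothetical stand-in; not CAMD's ATFSampler implementation."""

    def __init__(self, dataframe):
        self.dataframe = dataframe  # table of already-known results
        self.current = None         # most recently submitted hypotheses
        self.history = []           # earlier submissions, oldest first

    def submit(self, hypotheses):
        # Move the previous submission into the history before replacing it
        if self.current is not None:
            self.history.append(self.current)
        self.current = hypotheses

    def get_results(self):
        # "Run" the experiment by looking up rows with matching index
        return self.dataframe.loc[self.current.index]


known = pd.DataFrame(
    {"bulk_modulus": [401.3, 385.2, 365.1]},
    index=["mp-49", "mp-1894", "mp-8"],
)
experiment = ToyLookupExperiment(known)

experiment.submit(pd.DataFrame(index=["mp-49"]))
first = experiment.get_results()

experiment.submit(pd.DataFrame(index=["mp-8", "mp-1894"]))
second = experiment.get_results()
```

After the second `submit`, `experiment.history` holds the first submission and `get_results` returns only the rows for the current one.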

## Analyzers

**Analyzers** are a bit tricky to explain because they're not necessary for every sequential learning process, so we won't spend much time on them here.  In short, after you've performed an experiment, you sometimes want to postprocess the data.  An Analyzer does two things: it summarizes the results of the current iteration, and it augments the **seed data**, which gives the **Agent** the information it needs to decide which candidates to select for the next round of experiments.

In [17]:
from camd.analysis import AnalyzerBase

In [18]:
class BulkModulusAnalyzer(AnalyzerBase):
    def analyze(self, new_experimental_results, seed_data):
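        # Append the new results to the existing seed data
        new_seed = pd.concat(
            [seed_data, new_experimental_results],
            axis=0
        )
        # Create a summary
        average_bulk_modulus = new_seed.bulk_modulus.mean()
        # Percentile rank (0 to 1) of each new result within the new seed
        new_result_ranks = new_seed.bulk_modulus.rank(pct=True).loc[
            new_experimental_results.index
        ]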
        summary = pd.DataFrame({
            "average_bulk_modulus": [average_bulk_modulus],
            "average_rank": [new_result_ranks.mean()]
        })
        return summary, new_seed
    
    def present(self):
        pass

In [19]:
k_analyzer = BulkModulusAnalyzer()
summary, new_seed = k_analyzer.analyze(results, data)

In [20]:
summary

| | average_bulk_modulus | average_rank |
|---|---|---|
| 0 | 137.23204 | 0.994519 |
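The `average_rank` column comes from `rank(pct=True)`, which converts each value into its rank divided by the number of values, so the largest value gets 1.0. The small example below uses four bulk modulus values taken from the dataset shown earlier and is only meant to illustrate how to read that number.

```python
import pandas as pd

bulk_modulus = pd.Series(
    [435.7, 401.3, 49.1, 8.0],
    index=["mp-611426", "mp-49", "mp-10021", "mp-56"],
)

# Rank divided by the number of entries: 1.0, 0.75, 0.5, 0.25
pct_rank = bulk_modulus.rank(pct=True)

# Average percentile rank of two "newly measured" materials
average_rank = pct_rank.loc[["mp-611426", "mp-49"]].mean()
print(average_rank)  # 0.875
```

An `average_rank` close to 1, as in the summary above, means the materials selected in this iteration sit near the top of the bulk modulus distribution.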


## Data, Campaigns, and Simulations

Now that we've got all of the building blocks in place, let's try putting everything together!

In [22]:
# Prep seed/candidate data by selecting every other point
k_seed = featurized_data.iloc[::2]
k_candidates = featurized_data.iloc[1::2].drop(
    ['bulk_modulus', 'shear_modulus'],
    axis=1
)
set(k_seed.index).intersection(k_candidates.index)

set()

In [23]:
from camd.campaigns.base import Campaign

campaign = Campaign(
    candidate_data=k_candidates,
    seed_data=k_seed,
    agent=LinearHardnessAgent(),
    experiment=k_atf_experiment,
    analyzer=k_analyzer
)

In [None]:
%pdb
!rm -rf test
!mkdir -p test
from monty.os import cd
with cd('test'):
    campaign.initialize()
    campaign.auto_loop()

Automatic pdb calling has been turned ON


ValueError: Initialization may overwrite existing loop data. Exit.

> /Users/josephmontoya/miniconda3/envs/hackathon2020/lib/python3.7/site-packages/camd/campaigns/base.py(291)initialize()
    289         if self.initialized:
    290             raise ValueError(
--> 291                 "Initialization may overwrite existing loop data. Exit.")
    292         if not self.seed_data.empty and not self.create_seed:
    293             print("{} {} state: Agent {} hypothesizing".format(


## Final thoughts

## Glossary
* **Agent** - decision-making object in CAMD; must implement `get_hypotheses` in order to work properly in the loop
* **Experiment** - object that performs some action in order to determine unknowns about an input dataset
* **Analyzer** - object that postprocesses experimental outputs and prior seed data in order to provide new seed data
* **seed_data** - data that is "known", either before the start of a given **Campaign** or prior to any iteration.  It informs the **Agent** of the data it should use when deciding how to select from the **candidate data**.
* **candidate_data** - data that represents the information about the set of "unknowns" at a given point in time for a **Campaign**.
* **Campaign** - the iterative procedure by which an **Agent** suggests experiments from the **candidate data**, the **Experiment** performs them, and the **Analyzer** analyzes them and feeds new **seed data** and a new set of **candidate data** back to the **Agent** to start a new iteration.
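To tie the glossary terms together, here is a toy version of the loop written with plain pandas and NumPy instead of CAMD classes. Everything in it is an assumption made for illustration: the hidden relationship `y = 3x + 1`, the straight-line fit standing in for an Agent, and the table lookup standing in for an Experiment. It is a sketch of the idea and does not reflect how `Campaign` is implemented.

```python
import numpy as np
import pandas as pd

# Toy "ground truth": in a real campaign, y is only known after an experiment
truth = pd.DataFrame({"x": [float(i) for i in range(10)]})
truth["y"] = 3.0 * truth["x"] + 1.0

seed = truth.iloc[:2].copy()                    # known data
candidates = truth.iloc[2:].drop(columns="y")   # unknowns

n_query = 2
for iteration in range(3):
    # Agent: fit a line to the seed data and pick the highest predictions
    slope, intercept = np.polyfit(
        seed["x"].to_numpy(), seed["y"].to_numpy(), 1)
    predicted = pd.Series(
        slope * candidates["x"].to_numpy() + intercept,
        index=candidates.index,
    )
    chosen = predicted.nlargest(n_query).index

    # Experiment: "measure" the chosen candidates by looking them up
    results = truth.loc[chosen]

    # Analyzer: fold the results into the seed, shrink the candidate set
    seed = pd.concat([seed, results], axis=0)
    candidates = candidates.drop(index=chosen)

print(len(seed), len(candidates))  # 8 2
```

Each pass through the loop moves two rows from the candidates into the seed, so after three iterations the six highest-valued candidates have been "measured".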