# Contributing a Model

To contribute a model, you need to write:

* A .yaml file containing model metadata and (optionally) an equation string
* A .py file containing any additional logic for that model (required if equation string not supplied)
* A .json file containing test data for that model

This tutorial takes you through how to create and test a model interactively inside a Jupyter notebook. However, if it's a very simple model, it may be easier just to copy an existing model and edit it.

As an example, we will create a toy model that, when given a material's band gap, creates a new property we'll call the "double gap" (simply twice the band gap).

## Step 1: Find or create your input and output symbols

A `Symbol` defines the kinds of quantities a model can accept as inputs or generate as outputs.

A symbol can be one of:
* A *property* of a material, e.g. band gap or bulk modulus.
* A *condition*, e.g. temperature or applied stress.
* A generic *object*, e.g. a pymatgen `Structure`, or just a simple string or boolean.

A symbol guarantees the units and shape (e.g. scalar, matrix, tensor) of its associated quantity, and also provides useful information to the user.

There are many pre-defined symbols available, [see a full list of them here](https://github.com/materialsintelligence/propnet/tree/master/propnet/symbols).

In [20]:
from propnet.symbols import DEFAULT_SYMBOLS

In [21]:
band_gap = DEFAULT_SYMBOLS['band_gap']

In [22]:
print(band_gap)

band_gap:
	name:	band_gap
	category:	property
	units:	1.0 electron_volt
	object_type:	None
	display_names:	['Band gap']
	display_symbols:	['E_g']
	shape:	1
	comment:	



If the input/output symbols for your model are not available, you'll have to add them to the registry (copy an existing .yaml file as a template, and submit a pull request).

Since the "double gap" isn't defined in the registry, we can create it dynamically instead:

In [23]:
from propnet.core.symbols import Symbol

double_gap = Symbol('double_gap',
                    category='property',
                    display_names=["Double band gap (a fake property)"],
                    display_symbols=["E_d"],
                    units="eV",
                    shape=1)

Units have been parsed from a string format automatically, do these look correct? (1, (('electron_volt', 1.0),))


This produced a warning because the units were parsed from a string. Read the warning carefully: it is easy to specify the wrong units by mistake (e.g. milliseconds `ms` vs. meter-seconds `m s`).

To contribute this symbol to the registry, write it to a .yaml file and submit a pull request. However, for now we can use this symbol dynamically.

In [24]:
print(double_gap.to_yaml())

category: property
comment: null
display_names: [Double band gap (a fake property)]
display_symbols: [E_d]
name: double_gap
shape: 1
units:
- 1
- - [electron_volt, 1.0]



## Step 2: Construct your model metadata

### 2.1 Symbol Mapping

A 'symbol mapping' refers to how we map variables used *inside* our model to their symbol types defined globally, *outside* the model.

For example:

In [25]:
symbol_mapping = {
    'E_g': 'band_gap',
    'E_d': 'double_gap'
}

The reason we have internal variables is that we might have multiple variables with the same symbol, e.g. `{'r_a': 'ionic_radius', 'r_b': 'ionic_radius'}` in the Goldschmidt model.

Note we can name these internal variables anything we like.

For example, the following would also be valid:

```
symbol_mapping = {
    'bg': 'band_gap',
    'double': 'double_gap'
}
```

Only the values have to match canonical names, because they reference symbols defined globally. The keys are internal to the model.
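
To make the distinction between internal variable names and canonical symbol names concrete, here is a minimal sketch in plain Python. The `to_canonical` helper is hypothetical and not part of propnet; it only illustrates that two mappings with different internal names describe the same symbols.

```python
# Hypothetical helper (not part of propnet): rename the keys of a values
# dictionary from internal variable names to canonical symbol names.
def to_canonical(values, symbol_mapping):
    return {symbol_mapping[name]: value for name, value in values.items()}


mapping_a = {'E_g': 'band_gap', 'E_d': 'double_gap'}
mapping_b = {'bg': 'band_gap', 'double': 'double_gap'}

# Different internal names, same canonical symbols.
result_a = to_canonical({'E_g': 2, 'E_d': 4}, mapping_a)
result_b = to_canonical({'bg': 2, 'double': 4}, mapping_b)

print(result_a)
print(result_a == result_b)
```

Both mappings resolve to the same canonical names, which is why the choice of internal names is free.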

### 2.2 Connections

Connections list the inputs a model accepts and the outputs it produces from them:

In [26]:
connections = [
    {
        'inputs': ['E_g'],
        'outputs': ['E_d']
    }
]

Connections are given as a list, so a single model can have multiple sets of inputs and outputs, provided we add logic to our `evaluate` method to handle them. In the future, we might try to detect connections automatically. [Work in progress.]
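
As a rough sketch of what "multiple inputs/outputs" could look like, the toy model might also be run backwards, from double gap to band gap. The second connection and the dispatching logic below are illustrative assumptions written in plain Python, not something propnet requires or provides.

```python
# Illustrative only: a second connection lets the toy model run "backwards".
connections = [
    {'inputs': ['E_g'], 'outputs': ['E_d']},
    {'inputs': ['E_d'], 'outputs': ['E_g']},
]


def evaluate(symbol_values):
    # Dispatch on whichever input variable was supplied.
    if 'E_g' in symbol_values:
        return {'E_d': symbol_values['E_g'] * 2}
    if 'E_d' in symbol_values:
        return {'E_g': symbol_values['E_d'] / 2}
    raise ValueError('expected either E_g or E_d')


forward = evaluate({'E_g': 2})
backward = evaluate({'E_d': 4})
print(forward, backward)
```

Each entry in `connections` corresponds to one branch in `evaluate`, so the two should be kept in sync by hand.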

### 2.3 Documentation

In [27]:
title = 'My model title'
tags = ['optical']
references = ['url:http://en.wikipedia.org', 'doi:10.1103/PhysRevB.54.11169']
description = """
A long description can go here!

Use plain text or Markdown syntax.

"""

The reference list can contain references in BibTeX format (a string starting with @), URLs with a "url:" prefix, or DOIs with a "doi:" prefix (these will be parsed into full references automatically via CrossRef).
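
The three accepted formats can be told apart by their prefix. The sketch below is a hypothetical pre-submission check in plain Python; `classify_reference` is not a propnet function, and it does not contact CrossRef.

```python
def classify_reference(ref):
    """Return (kind, value) for a reference string (hypothetical helper)."""
    if ref.lstrip().startswith('@'):
        return ('bibtex', ref)
    if ref.startswith('url:'):
        return ('url', ref[len('url:'):])
    if ref.startswith('doi:'):
        return ('doi', ref[len('doi:'):])
    raise ValueError(f'Unrecognized reference format: {ref!r}')


references = ['url:http://en.wikipedia.org', 'doi:10.1103/PhysRevB.54.11169']
parsed = [classify_reference(ref) for ref in references]
print(parsed)
```

A string with none of the three prefixes raises a `ValueError`, which catches a forgotten "doi:" or "url:" early.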

In [28]:
metadata = {
    "title": title,
    "tags": tags,
    "references": references,
    "symbol_mapping": symbol_mapping,
    "connections": connections,
    "description": description
}
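
Before constructing a model, it can help to confirm that every variable named in `connections` also appears in `symbol_mapping`. The check below is a hypothetical sketch in plain Python, not propnet's own validation.

```python
def check_metadata(metadata):
    """Return a list of problems found (hypothetical check, not propnet's)."""
    problems = []
    known = set(metadata['symbol_mapping'])
    for index, connection in enumerate(metadata['connections']):
        for role in ('inputs', 'outputs'):
            for name in connection[role]:
                if name not in known:
                    problems.append(
                        f'connection {index}: {name!r} in {role} '
                        'is missing from symbol_mapping'
                    )
    return problems


good = {
    'symbol_mapping': {'E_g': 'band_gap', 'E_d': 'double_gap'},
    'connections': [{'inputs': ['E_g'], 'outputs': ['E_d']}],
}
# Same metadata, but with a typo in the output variable name.
bad = dict(good, connections=[{'inputs': ['E_g'], 'outputs': ['E_dd']}])

print(check_metadata(good))
print(check_metadata(bad))
```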

## Step 3: Create model class and evaluation logic

**[WIP: we probably need some kind of AbstractModel metaclass / a better way of constructing these classes inside a Jupyter notebook, but this works for now]**

### Option 1: Define logic explicitly in Python

Create a test model dynamically: this is just for testing evaluation in a Jupyter notebook. Models will know about the default symbols, but since we defined an extra symbol dynamically, we need to include that. (If you're just using already-defined symbols, the `additional_symbols` keyword argument can be left empty.)

In [29]:
from propnet.core.models import AbstractModel
toy_model = AbstractModel(metadata, additional_symbols=[double_gap])

Now we write an evaluate method:

In [30]:
def evaluate(symbol_values):
    
    # you can access your symbol values here
    # the dictionary keys will match those you specified
    # in your symbol_mapping
    E_g = symbol_values["E_g"]
    
    # evaluate should always return a dictionary
    # mapping output variable names to their values
    return {
        "E_d": E_g*2
    }
    
    
toy_model.evaluate = evaluate

We can now test the model:

In [31]:
toy_model.evaluate({'E_g': 2})

{'E_d': 4}

### Option 2: For simple models, just add an equation string to your metadata

This will be a list of equation strings parsed by SymPy and solved using SymPy's non-linear solver. Each string is an expression that must equal zero, so rearrange your equation accordingly: for example, `E_d = E_g*2` is written as `E_d - E_g*2`.

In [32]:
metadata["equations"] = ["E_d - E_g*2"]

In [33]:
toy_model = AbstractModel(metadata, additional_symbols=[double_gap])
toy_model.evaluate({'E_g': 2})

{'E_d': <Quantity(4.00000000000000, 'electron_volt')>, 'successful': True}
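
The "equals zero" convention can be checked by hand before handing an equation to the model. The sketch below uses plain Python `eval` rather than SymPy, and both helpers are hypothetical. It rewrites an equation into zero form and computes the residual for a set of values.

```python
def to_zero_form(equation):
    """Rewrite 'lhs = rhs' as 'lhs - (rhs)' (hypothetical helper)."""
    lhs, rhs = equation.split('=', 1)
    return f'{lhs.strip()} - ({rhs.strip()})'


def residual(expression, values):
    # eval is acceptable here only because this is a quick notebook check
    # on expressions we wrote ourselves; builtins are disabled.
    return eval(expression, {'__builtins__': {}}, dict(values))


expr = to_zero_form('E_d = E_g*2')
consistent = residual(expr, {'E_g': 2, 'E_d': 4})
inconsistent = residual(expr, {'E_g': 2, 'E_d': 5})
print(expr, consistent, inconsistent)
```

A residual of zero means the values satisfy the equation. Any other value means the equation string or the test values are wrong.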

## Step 4: Create test data

Create a few plausible test sets. These will be used by the unit tests to check that the model doesn't break in the future.

In [34]:
test_data = [
    {
        'inputs': {'E_g': 2},
        'outputs': {'E_d': 4}
    },
    {
        'inputs': {'E_g': 5},
        'outputs': {'E_d': 10}
    }
]

Test your model using this test data:

In [35]:
toy_model.test(test_data)

True

This should return True if everything works as expected **[WIP: instead of returning True/False, we should define a new EvaluateFailure exception type]**.
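
For intuition, a test runner over data in this format might look like the sketch below. It is an assumption about the general approach, not how `toy_model.test` is implemented. `run_tests` and the stand-in `evaluate` are hypothetical, and the comparison uses a relative tolerance to allow for floating-point round-off.

```python
import math


def run_tests(evaluate, test_data, rel_tol=1e-9):
    """Return True if every expected output is reproduced (hypothetical)."""
    for case in test_data:
        result = evaluate(case['inputs'])
        for name, expected in case['outputs'].items():
            if name not in result:
                return False
            if not math.isclose(result[name], expected, rel_tol=rel_tol):
                return False
    return True


def evaluate(symbol_values):
    # Stand-in for the toy model's logic.
    return {'E_d': symbol_values['E_g'] * 2}


test_data = [
    {'inputs': {'E_g': 2}, 'outputs': {'E_d': 4}},
    {'inputs': {'E_g': 5}, 'outputs': {'E_d': 10}},
]
wrong_data = [{'inputs': {'E_g': 2}, 'outputs': {'E_d': 5}}]

passed = run_tests(evaluate, test_data)
failed = run_tests(evaluate, wrong_data)
print(passed, failed)
```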

## Step 5: Export your model

We now want to write three files to add to the repository:

 * a .yaml file containing metadata, to go into `propnet/models`
 * a .py file containing logic, to go into `propnet/models`
 * a .json file containing test data, to go into `propnet/models/test_data`
    
All three files should share the same base name, which must match the class name in the .py file.
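
The expected layout for a given class name can be written down explicitly. The helper below is a hypothetical convenience in plain Python that only builds the paths listed above. It does not create any files.

```python
from pathlib import PurePosixPath


def expected_files(class_name):
    """Paths a model named class_name should occupy (hypothetical helper)."""
    models = PurePosixPath('propnet/models')
    return {
        'metadata': models / f'{class_name}.yaml',
        'logic': models / f'{class_name}.py',
        'test_data': models / 'test_data' / f'{class_name}.json',
    }


files = expected_files('ToyModel')
for role, path in files.items():
    print(role, path)
```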

It's simple to create the .yaml file:

In [36]:
print(toy_model.to_yaml())  # write this to ToyModel.yaml

connections:
- inputs: [E_g]
  outputs: [E_d]
equations: [E_d - E_g*2]
references: [url:http://en.wikipedia.org, doi:10.1103/PhysRevB.54.11169]
symbol_mapping: {E_d: double_gap, E_g: band_gap}
tags: [optical]
title: AbstractModel
---

A long description can go here!

Use plain text or Markdown syntax.




The ToyModel.py file needs to be constructed manually, but will look something like:

```
from propnet.core.models import AbstractModel

class ToyModel(AbstractModel):
    pass
```

or, if you defined custom logic:

```
from propnet.core.models import AbstractModel

class ToyModel(AbstractModel):

    def evaluate(self, symbol_values):
        ...
```

Finally, the test data can also be dumped to a file:

In [37]:
from monty.serialization import dumpfn
dumpfn(test_data, 'ToyModel.json')

(`dumpfn` is used over `json.dump` because it will also serialize other objects, such as pymatgen `Structure` objects, to JSON correctly.)
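
To see why the plain `json` module is not enough for richer objects, consider the sketch below. `ToyStructure` and `AsDictEncoder` are hypothetical stand-ins written for this example. They mirror the general idea of serializing an object through a dictionary representation and are not monty's actual implementation.

```python
import json


class ToyStructure:
    """Hypothetical stand-in for an object such as a pymatgen Structure."""

    def __init__(self, formula):
        self.formula = formula

    def as_dict(self):
        return {'@class': 'ToyStructure', 'formula': self.formula}


class AsDictEncoder(json.JSONEncoder):
    """Serialize any object that offers an as_dict() method."""

    def default(self, o):
        if hasattr(o, 'as_dict'):
            return o.as_dict()
        return super().default(o)


data = [{'inputs': {'structure': ToyStructure('Si')}, 'outputs': {'E_d': 4}}]

# The stock encoder does not know how to handle ToyStructure.
try:
    json.dumps(data)
    plain_json_ok = True
except TypeError:
    plain_json_ok = False

encoded = json.dumps(data, cls=AsDictEncoder)
print(plain_json_ok)
print(encoded)
```

For test data made only of numbers, strings, lists and dictionaries, as in this tutorial, the stock encoder would work as well.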