In [None]:
import pandas as pd

# Data loading

In [None]:
df_emto = pd.read_json('./raw_data/emto.raw.json')
df_vasp = pd.read_json('./raw_data/vasp.raw.json')

# Data preprocessing

In [None]:
df_emto[['C_prime', 'c11', 'c12','c44', 'B', 'G', 'E']].describe().round().astype(int)

In [None]:
df_vasp[['C_prime', 'c11', 'c12','c44', 'B', 'G', 'E']].describe().round().astype(int)

Add records for the pure base elements (Ti and Zr) to the ternary datasets, so that every combination of base and dopants also appears with dopant concentrations of 0.0.
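The internals of `set_dopants_for_pure_elements` are not shown in this notebook. As a rough illustration of the idea only, the sketch below replicates a pure-element record once per dopant set observed for that base, with all dopant concentrations set to 0.0. The record layout (`base`, `dopants`, `concentrations`), the function name `expand_pure_base`, and the toy concentration values are assumptions made for this example, not the actual data schema.

```python
def expand_pure_base(records, base):
    """Replicate the pure-`base` record once per dopant set seen with that base."""
    # Collect the unique dopant sets observed for this base, keeping first-seen order.
    dopant_sets = []
    for rec in records:
        if rec["base"] == base and rec["dopants"] and rec["dopants"] not in dopant_sets:
            dopant_sets.append(rec["dopants"])

    expanded = []
    for rec in records:
        is_pure = rec["base"] == base and not rec["dopants"]
        if not is_pure or not dopant_sets:
            # Alloys, other bases, and pure records without known dopants stay as they are.
            expanded.append(rec)
            continue
        # One copy of the pure record per dopant set, every dopant at 0.0.
        for dopants in dopant_sets:
            expanded.append({**rec, "dopants": dopants,
                             "concentrations": (0.0,) * len(dopants)})
    return expanded


# Hypothetical toy records; the concentrations are placeholders, not real data.
records = [
    {"base": "Ti", "dopants": (), "concentrations": ()},
    {"base": "Ti", "dopants": ("Al", "V"), "concentrations": (0.1, 0.05)},
    {"base": "Ti", "dopants": ("Mo", "Nb"), "concentrations": (0.2, 0.1)},
]
expanded = expand_pure_base(records, base="Ti")
for rec in expanded:
    print(rec)
```

With this layout, the pure Ti entry becomes the zero-concentration end point of both the Ti-Al-V and the Ti-Mo-Nb systems.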

In [None]:
import numpy as np
from handlers import set_dopants_for_pure_elements

In [None]:
df_emto = set_dopants_for_pure_elements(df_emto, base='Ti')
df_emto = set_dopants_for_pure_elements(df_emto, base='Zr')

In [None]:
df_vasp = set_dopants_for_pure_elements(df_vasp, base='Ti')
df_vasp = set_dopants_for_pure_elements(df_vasp, base='Zr')

# Feature engineering

In [None]:
import warnings
warnings.filterwarnings("ignore") 

In [None]:
from handlers import mlb_feats_from_elemental_hull

Information about the different forms of each element present in the investigated compositions was taken from the Materials Project data on the phase diagrams of the pure elements.

The following concentration-weighted properties are used as separate features: atomic number, electronegativity, row and group in the periodic table, atomic mass, atomic radius, molar volume, average ionic radius, and maximum and minimum oxidation states.
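A concentration-weighted feature is the sum of an elemental property over all elements in a composition, weighted by each element's atomic fraction. The sketch below illustrates this for two of the listed properties. The lookup table `ELEMENT_PROPS` and the function `weighted_features` are hypothetical names introduced for this example; how `convert_to_feats` computes these features is not shown in this notebook.

```python
# Small lookup table of elemental properties (atomic number, atomic mass in u).
ELEMENT_PROPS = {
    "Ti": {"atomic_number": 22, "atomic_mass": 47.867},
    "Al": {"atomic_number": 13, "atomic_mass": 26.982},
}


def weighted_features(composition):
    """Return concentration-weighted averages of every property in ELEMENT_PROPS.

    `composition` maps element symbols to atomic fractions.
    """
    total = sum(composition.values())  # normalize in case fractions do not sum to 1
    feats = {}
    for element, fraction in composition.items():
        for name, value in ELEMENT_PROPS[element].items():
            feats[name] = feats.get(name, 0.0) + fraction / total * value
    return feats


# Example composition: 90 at.% Ti, 10 at.% Al.
feats = weighted_features({"Ti": 0.9, "Al": 0.1})
print(feats)
```

For this composition the weighted atomic number is 0.9 * 22 + 0.1 * 13 = 21.1.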

In addition to features that indirectly carry information about the properties and concentration of elements in each composition, we also use features that directly describe the concentrations of each of the 34 unique elements in each composition.

Finally, the last group of features is a multi-label binarization of the space groups of the elements in their stable crystalline forms, taken from Materials Project for each composition.
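Multi-label binarization turns a set of labels per sample into one 0/1 column per distinct label. The sketch below shows the idea in plain Python for the space groups of the stable forms of Ti (P6_3/mmc), Al (Fm-3m), and V (Im-3m). The helper `binarize_labels` is a hypothetical name for this example; scikit-learn's `MultiLabelBinarizer` provides the same transformation, and whether `mlb_feats_from_elemental_hull` uses it is an assumption.

```python
def binarize_labels(label_sets):
    """Return (classes, rows): one 0/1 column per distinct label, one row per sample."""
    classes = sorted(set().union(*label_sets))
    rows = [[int(label in labels) for label in classes] for labels in label_sets]
    return classes, rows


# Space groups of the stable elemental forms present in each composition.
space_groups = [
    {"P6_3/mmc", "Fm-3m"},  # Ti-Al
    {"P6_3/mmc", "Im-3m"},  # Ti-V
    {"P6_3/mmc"},           # pure Ti
]
classes, rows = binarize_labels(space_groups)
print(classes)
for row in rows:
    print(row)
```

Each composition ends up with a fixed-length binary vector regardless of how many elements it contains, which is what allows these labels to be used as model features.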

The following elemental properties of the most stable form are also used as concentration-weighted features: calculated density, calculated relaxed volume per atom, calculated energy per atom, and total magnetization.

In [None]:
df_emto_feats = mlb_feats_from_elemental_hull(df_emto)

In [None]:
from handlers import convert_to_feats

In [None]:
df_emto_feats = convert_to_feats(df_emto_feats)

# Model testing

To explore different models, change the code below (the target property and the corresponding pipeline import):

In [None]:
prop_to_pred = 'C_prime'

In [None]:
X1 = df_emto_feats.copy().drop(['ucf', 'base', 'dopants', 
                               'B','E',  'G', 'C_prime', 
                               'c11', 'c12', 'c44'], axis='columns')
y1 = df_emto_feats[prop_to_pred]

model1_feats = list(X1.columns)
print(f'Totally {len(model1_feats)} feats will be used')

In [None]:
from ml_models.tpot_pipeline_C_prime_EMTO_lib import *

model1 = model1()
model1

## Interpolation ability testing

Here we estimate the interpolation ability of a single model that predicts EMTO values.

To assess interpolation ability, we perform a variation of k-fold validation: at each step we hold out 1/5 of the compositions of each system, then stack the results of all steps and estimate the predictive performance of the EMTO predictor.
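The splitting scheme described above can be sketched as follows: indices are grouped by system, shuffled within each system, and dealt round-robin into k folds, so that each fold holds out roughly 1/k of every system. The function `per_system_folds` and the toy system labels are assumptions for illustration; the actual `interpolation_k_fold_cv` helper may build its folds differently.

```python
import random
from collections import defaultdict


def per_system_folds(systems, k=5, seed=0):
    """Split sample indices into k folds, each holding about 1/k of every system."""
    rng = random.Random(seed)
    by_system = defaultdict(list)
    for idx, system in enumerate(systems):
        by_system[system].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_system.values():
        rng.shuffle(indices)  # randomize which compositions end up in which fold
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)  # deal indices round-robin across folds
    return folds


# Toy labels: 10 compositions of one system and 5 of another.
systems = ["Ti-Al"] * 10 + ["Ti-V"] * 5
folds = per_system_folds(systems, k=5)
for fold in folds:
    print(sorted(fold))
```

At each validation step one fold serves as the test set and the remaining folds as the training set, so every system stays represented in training and the test probes interpolation within known systems.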

In [None]:
tmp_df = X1.copy()
tmp_df[prop_to_pred] = y1
tmp_df['base'] = df_emto['base']
tmp_df['dopants'] = df_emto['dopants']

In [None]:
from handlers import interpolation_k_fold_cv

In [None]:
res, all_true, all_pred, selected_indicies = interpolation_k_fold_cv(model1, tmp_df, 5, prop_to_pred)

In [None]:
import seaborn as sns

In [None]:
display(res)
sns.jointplot(kind='reg', x=all_true, y=all_pred)

Train the first model

In [None]:
from handlers import fill_na_feats

In [None]:
# Train the first model using all EMTO data
model1.fit(X1, y1)

# Next, fill all missing features with zeros and then
# compute the EMTO values predicted by model 1 for the VASP set

df_vasp_feats = mlb_feats_from_elemental_hull(df_vasp)
df_vasp_feats = convert_to_feats(df_vasp_feats)

X2 = df_vasp_feats.drop(['ucf', 'base', 'dopants', 
                       'B','E',  'G', 'C_prime', 
                       'c11', 'c12', 'c44'], axis='columns').copy()

predicted_EMTO = model1.predict(fill_na_feats(model1_feats, X2))
X2_ = X2.copy()
X2_['predicted_EMTO'] = predicted_EMTO

y2 = df_vasp_feats[prop_to_pred]

model2_feats = list(X2_.columns)
print(f'Totally {X2_.shape[1]} feats will be used')
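
The cell above relies on the VASP feature table having the same columns, in the same order, as the table the first model was trained on, with missing entries filled with zeros. A minimal sketch of such an alignment step is given below. The function `align_features` and the dict-based row layout are assumptions for this example; the actual `fill_na_feats` helper is not shown in this notebook and may behave differently.

```python
def align_features(rows, feature_names, fill_value=0.0):
    """Order each row's values by `feature_names`, filling absent or NaN entries."""
    aligned = []
    for row in rows:
        values = []
        for name in feature_names:
            value = row.get(name)
            # `value != value` is True only for NaN.
            if value is None or value != value:
                value = fill_value
            values.append(value)
        aligned.append(values)
    return aligned


# Hypothetical feature names and rows, for illustration only.
feature_names = ["conc_Ti", "conc_Al", "conc_V"]
rows = [
    {"conc_Ti": 0.9, "conc_Al": 0.1},                # "conc_V" is absent
    {"conc_Ti": 0.8, "conc_V": float("nan")},        # "conc_Al" absent, "conc_V" is NaN
]
aligned = align_features(rows, feature_names)
print(aligned)
```

Once the columns match, the first model's prediction can be appended as one extra feature for the second model, which is the role of `predicted_EMTO` above.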

In [None]:
from ml_models.tpot_pipeline_C_prime_VASP_lib import *

model2 = model2()
model2

The interpolation ability of the second group of models is tested via leave-one-out CV:
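In leave-one-out CV each sample is predicted by a model trained on all other samples. The sketch below shows the loop with a stand-in estimator that always predicts the training mean. `MeanModel`, `leave_one_out`, and the toy numbers are assumptions for illustration; the `leave_one_out_cv` helper in `handlers` may differ, for example in what it returns.

```python
class MeanModel:
    """Stand-in estimator: always predicts the mean of the training targets."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_] * len(X)


def leave_one_out(model_factory, X, y):
    """Predict each sample with a fresh model trained on all remaining samples."""
    true, pred = [], []
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        model = model_factory().fit(X_train, y_train)
        pred.append(model.predict([X[i]])[0])
        true.append(y[i])
    return true, pred


# Toy data, for illustration only.
X = [[0.0], [0.1], [0.2], [0.3]]
y = [1.0, 2.0, 3.0, 6.0]
true, pred = leave_one_out(MeanModel, X, y)
print(list(zip(true, pred)))
```

Leave-one-out uses as much training data as possible at each step, which is a common choice when the dataset is small.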

In [None]:
from handlers import leave_one_out_cv

In [None]:
true, pred, res = leave_one_out_cv(model2, X2_,y2)

In [None]:
sns.jointplot(kind='reg', x=true, y=pred, color='tab:orange')

## Extrapolation ability testing

To estimate extrapolation ability, we performed a "system-fold" validation: at each step one entire system (e.g. Ti-Al) is excluded from the training set, and the predictions for the excluded system are then checked. This validation is performed only for binaries.
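The "system-fold" split is a form of leave-one-group-out validation, with the alloy system as the group. A minimal sketch of the index split follows. The generator `system_folds` and the toy labels are assumptions for this example; scikit-learn's `LeaveOneGroupOut` produces the same kind of split, and the `system_fold_cv` helper used below may do more than this.

```python
def system_folds(systems):
    """Yield (held_out_system, train_indices, test_indices) for each system."""
    for held_out in sorted(set(systems)):
        train = [i for i, s in enumerate(systems) if s != held_out]
        test = [i for i, s in enumerate(systems) if s == held_out]
        yield held_out, train, test


# Toy system labels, one per composition.
systems = ["Ti-Al", "Ti-Al", "Ti-V", "Ti-Mo", "Ti-V"]
splits = list(system_folds(systems))
for held_out, train, test in splits:
    print(held_out, "train:", train, "test:", test)
```

Because no composition of the held-out system is seen during training, the resulting error measures extrapolation to an unseen system rather than interpolation within a known one.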

In [None]:
# Validate only binaries
print('Prop_to_pred:', prop_to_pred)

Prepare datasets

In [None]:
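# system_fold_cv is used in the next cell; it is assumed here to live in handlers with the other CV helpers
from handlers import get_set_for_system_fold_cv, system_fold_cv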

In [None]:
tmp_df = get_set_for_system_fold_cv(df_emto, df_vasp, prop_to_pred)

tmp_df_emto = X1.copy()
tmp_df_emto[prop_to_pred] = df_emto[prop_to_pred]

tmp_df_vasp = X2.copy()
tmp_df_vasp[prop_to_pred] = df_vasp[prop_to_pred]

results,metrics = system_fold_cv(tmp_df,
                           model1, model2,
                           prop_to_pred,
                           tmp_df_emto, tmp_df_vasp)

In [None]:
display(metrics)
sns.jointplot(x = results['True_EMTO'].values, y=results['Pred_EMTO'].values, 
              kind='reg')
sns.jointplot(x = results['True_VASP'].values, y=results['Pred_VASP'].values, 
              kind='reg', color='tab:orange')