## Train energy prediction models for on-top and hollow sites

In this notebook I will train two models that predicts energy for an adsorbate based on a feature vector made from the nearby metal atoms at the adsorption site. The input data is .csv files with extracted feature vectors and energies from DFT data in .db files.

I will train:
* XGBoost regressor for on-top site (OH)
* XGBoost regressor for hollow site (O and H)

#### Import packages

In [1]:
import xgboost
from xgboost import XGBRegressor
import pandas as pd
from sklearn.model_selection import train_test_split

  from pandas import MultiIndex, Int64Index


#### Import data from .csv files to a Pandas Dataframe

In [2]:
feature_folder = "../csv_features/"


# Let's import O and H first. Add an initial row that says "I'm an O" or "I'm an H". The hollow model will take care of seperating them. But the data might share patterns that the model can use, even though the adsorbate is different.
def prepare_csv(feature_folder, filename, adsorbate):
    init_df = pd.read_csv(feature_folder + filename)

    # Add a first column about the adsorbate
    adsorbate_df = pd.DataFrame([adsorbate for x in range(len(init_df))], columns = ["adsorbate"])

    #Combine
    prepared_df = pd.concat([adsorbate_df, init_df], axis = 1)
    return prepared_df

H_df = prepare_csv(feature_folder, "H_features.csv", "H")
O_df = prepare_csv(feature_folder, "O_features.csv", "O")

full_df = pd.concat([H_df, O_df], axis = 0)
#Seperate the energies and remove the useless columns
del full_df["H_out.dbrow"]
#full_df1 = full_df.drop("OH_out.db row", axis = 1)
full_df

KeyError: 'H_out.dbrow'

In [3]:
full_df

Unnamed: 0,adsorbate,feature0,feature1,feature2,feature3,feature4,feature5,feature6,feature7,feature8,...,feature49,feature50,feature51,feature52,feature53,feature54,G_ads (eV),slab db row,H_out.dbrow,O_out.dbrow
0,H,0,0,0,0,0,0,0,0,0,...,2,0,0,1,1,1,0.152480,1,1.0,
1,H,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,2,-0.102444,1,2.0,
2,H,0,0,0,0,0,0,0,0,0,...,2,1,0,1,0,1,0.064475,1,3.0,
3,H,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,2,-0.063264,1,4.0,
4,H,0,0,0,0,0,0,0,0,0,...,2,1,1,0,1,0,-0.001436,1,5.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
494,O,0,0,0,0,0,0,0,0,0,...,0,0,3,0,0,0,1.460288,521,,459.0
495,O,0,0,0,0,0,0,0,0,0,...,1,0,3,0,0,0,1.696832,522,,389.0
496,O,0,0,0,0,0,0,0,0,0,...,2,0,0,0,3,0,1.222140,523,,415.0
497,O,0,0,0,1,0,0,0,0,0,...,0,0,0,3,0,0,2.425709,524,,421.0


In [20]:
# Prepare data for XGBoost

#shuffle and split
train, val_test = train_test_split(full_df, test_size=0.2)
val, test = train_test_split(val_test, test_size=0.5)