# Training the model

This notebook loads a PDF dataset and trains an XGBoost model to classify the structure of a metallic nanoparticle from its atomic pair distribution function (PDF), as obtained from a total scattering experiment. The model is trained to predict which atomic model a given PDF belongs to. There are 4044 candidate atomic models, all stored in the xyz_files folder. Each PDF given to the model has 300 data points covering r = 0 Å to r = 30 Å with a step of 0.1 Å.
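
The r-grid described above can be sketched in a few lines. This is only an illustration under one assumption: the grid starts at r = 0 Å and stops one step short of 30 Å, which is the reading that gives exactly 300 points. The backend may define the grid differently, and the names `R_MIN`, `R_STEP`, `N_POINTS` and `r_grid` are hypothetical.

```python
# Assumed r-grid: start at 0 Å, step 0.1 Å, 300 points (endpoint 30 Å excluded).
R_MIN = 0.0      # Å
R_STEP = 0.1     # Å
N_POINTS = 300

# Build the grid from integer indices so floating-point error does not accumulate.
r_grid = [round(R_MIN + i * R_STEP, 1) for i in range(N_POINTS)]

print(len(r_grid), r_grid[0], r_grid[-1])
```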

The model is trained for up to 500 epochs (boosting rounds), with early stopping after 5 rounds without improvement in the validation loss. The learning rate is 0.15 and the maximum tree depth is 3. No hyperparameter optimization is performed.
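
The settings above map onto XGBoost's native training API roughly as shown here. This is a minimal sketch and not the backend's `ML` function: the helper names `make_xgb_settings` and `train_sketch`, the `multi:softprob` objective and the `mlogloss` metric are assumptions. Only the learning rate, tree depth, number of rounds and early stopping rounds come from the text.

```python
def make_xgb_settings(n_classes, n_threads=4):
    """Collect the training settings described in the text (hypothetical helper)."""
    params = {
        "objective": "multi:softprob",  # assumption: multi-class probabilities
        "num_class": n_classes,         # one class per candidate atomic model
        "eval_metric": "mlogloss",      # assumption: multi-class log loss
        "eta": 0.15,                    # learning rate from the text
        "max_depth": 3,                 # maximum tree depth from the text
        "nthread": n_threads,           # number of CPU threads
    }
    train_kwargs = {
        "num_boost_round": 500,         # upper limit on boosting rounds
        "early_stopping_rounds": 5,     # stop after 5 rounds without improvement
    }
    return params, train_kwargs


def train_sketch(X_train, y_train, X_val, y_val, n_classes, n_threads=4):
    """Train a booster with the settings above (illustrative only)."""
    import xgboost as xgb  # imported here so the settings helper works without it

    dtrain = xgb.DMatrix(X_train, label=y_train)
    dval = xgb.DMatrix(X_val, label=y_val)
    params, train_kwargs = make_xgb_settings(n_classes, n_threads)
    # Early stopping monitors the last entry in evals, here the validation set.
    return xgb.train(
        params,
        dtrain,
        evals=[(dtrain, "train"), (dval, "val")],
        **train_kwargs,
    )
```

Labels passed to `train_sketch` would need to be integers in the range 0 to `n_classes - 1`.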

**How to use:** Run the cells below from top to bottom.

The first cell imports packages and functions from the backend. The second cell imports a PDF dataset. The third cell trains an XGBoost model. The fourth cell saves the trained XGBoost model.

# Import packages and functions from backend

In [1]:
import sys, os, os.path, h5py, time

from os import walk
import pandas as pd
import xgboost as xgb

sys.path.append("Backend")

from training import load_PDFs, ML, sort_filenames, get_training_data_backend
sorted_filenames_flat, xyz_path = get_training_data_backend(xyz_folder_name="natoms6_12_test2")  # This should probably take xyz_folder_name

# Import PDF dataset

Load a PDF dataset from the "/PDF_datasets/" folder. Set folder_name to the name of the dataset you want to load.

In [2]:
folder_name = "natoms6_12_test2"

X_train, y_train, X_val, y_val, X_test, y_test = load_PDFs(folder_name)
display(X_train)

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,294,295,296,297,298,299,300,qmin,qmax,qdamp
0,0.0,-0.000183,-0.001878,-0.002262,-0.002411,-0.005062,-0.006439,-0.006241,-0.010246,-0.014122,...,-0.002747,-0.002943,-0.002615,-0.001315,-0.001143,-0.001071,0.000256,1.333008,19.390625,0.038116
1,0.0,0.036011,0.040466,0.019714,0.015961,0.048615,0.082825,0.078857,0.048126,0.032379,...,0.005524,0.006897,0.006256,0.003841,0.001868,0.001820,0.002640,1.600586,12.335938,0.028152
2,0.0,0.011978,-0.005566,-0.031342,-0.028366,-0.010338,-0.020813,-0.055084,-0.064575,-0.044189,...,-0.002182,-0.007282,-0.008125,-0.005478,-0.005753,-0.010139,-0.012756,1.472656,13.898438,0.027298
3,0.0,-0.022964,-0.008598,-0.010109,-0.044586,-0.046753,-0.026718,-0.054749,-0.081665,-0.052368,...,0.002283,-0.006462,0.001189,0.005630,-0.003742,-0.001387,0.007446,1.152344,19.046875,0.023132
4,0.0,-0.003372,-0.028412,-0.056885,-0.062622,-0.060699,-0.082397,-0.115845,-0.123474,-0.112610,...,0.001376,-0.002373,-0.004524,-0.001351,0.001927,0.000175,-0.003399,1.125000,14.078125,0.026703
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
385,0.0,0.008987,0.023407,0.029892,0.035706,0.045593,0.045105,0.043091,0.044250,0.033813,...,-0.000769,-0.002258,-0.003593,-0.005184,-0.006268,-0.007019,-0.007927,1.861328,20.703125,0.027618
386,0.0,0.013359,0.027466,0.037720,0.048218,0.055145,0.058197,0.060181,0.054871,0.048096,...,0.000834,0.002913,0.004581,0.005928,0.007465,0.008194,0.008942,1.989258,23.875000,0.014648
387,0.0,0.022247,-0.006535,-0.048309,-0.038818,-0.004665,-0.021072,-0.073425,-0.077698,-0.034943,...,0.002287,0.002970,0.003504,0.002693,0.001478,0.001465,0.002180,1.208984,14.031250,0.021683
388,0.0,0.016617,-0.001432,0.032257,0.026398,0.017731,0.049164,0.026215,0.039093,0.048187,...,0.000020,-0.000866,-0.000434,-0.000149,-0.000884,-0.000099,-0.000434,1.914062,23.484375,0.025604
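
Judging from the column labels in the output above, each sample holds 300 PDF values (columns 1 to 300) followed by three instrument parameters (qmin, qmax, qdamp). The sketch below separates the two groups for a single sample. It assumes the column labels are strings, which may not hold for the actual DataFrame, and `split_features` is a hypothetical helper, not part of the backend.

```python
# Column labels as they appear in the displayed output (assumed to be strings).
PDF_COLUMNS = [str(i) for i in range(1, 301)]
INSTRUMENT_COLUMNS = ["qmin", "qmax", "qdamp"]


def split_features(row):
    """Split one sample (mapping from column label to value) into
    the PDF values and the instrument parameters."""
    pdf_values = [row[c] for c in PDF_COLUMNS]
    instrument = {c: row[c] for c in INSTRUMENT_COLUMNS}
    return pdf_values, instrument


# Toy sample: placeholder zeros for the PDF, instrument values copied from row 0 above.
row = {c: 0.0 for c in PDF_COLUMNS}
row.update(qmin=1.333008, qmax=19.390625, qdamp=0.038116)

pdf_values, instrument = split_features(row)
print(len(pdf_values), instrument)
```

With pandas, the same function could be applied to a single row such as `X_train.iloc[0]`, provided the labels match.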


# Train XGBoost model

Train the XGBoost model on the loaded PDF dataset. n_threads sets how many CPU cores XGBoost uses, and n_epochs sets the maximum number of boosting rounds (5 in this example run). Early stopping is enabled with 5 early stopping rounds.

In [4]:
n_threads = 4
n_epochs = 5
model = None
model_trained = ML(X_train, y_train, X_val, y_val, model, n_threads, n_epochs)

Time spent on making data ready: 0.0005698879559834798  min
Training model
[0]	train-mlogloss:8.22765	val-mlogloss:8.22764
[1]	train-mlogloss:8.22765	val-mlogloss:8.22764
[2]	train-mlogloss:8.22765	val-mlogloss:8.22764
[3]	train-mlogloss:8.22765	val-mlogloss:8.22764
[4]	train-mlogloss:8.22765	val-mlogloss:8.22764
Time spent on training model: 0.08458458582560222  min
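
The early stopping rule can be illustrated in plain Python. This is a sketch of the general idea (stop once the validation loss has not improved for a given number of rounds), not XGBoost's own implementation. The function name `early_stopping_round` and the `patience` parameter are hypothetical.

```python
def early_stopping_round(val_losses, patience=5):
    """Return the index of the last round that runs, given the validation
    loss of each round and the number of rounds to wait for an improvement."""
    best_loss = float("inf")
    best_round = -1
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            # Strict improvement: remember this round as the best so far.
            best_loss, best_round = loss, i
        elif i - best_round >= patience:
            # No improvement for `patience` rounds in a row: stop here.
            return i
    # Ran out of rounds before the stopping criterion was met.
    return len(val_losses) - 1


# Five rounds with a constant loss, as in the log above: all five rounds run.
print(early_stopping_round([8.22764] * 5))
```

In the log above the validation loss never changes, but with only five rounds in total the criterion of five rounds without improvement is not reached, so training runs through round 4.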


# Save XGBoost model

Save the trained model in the "Models/" folder. The model file is named after the loaded PDF dataset.

In [5]:
model_trained.save_model("Models/XGBmodel_" + folder_name + ".model")
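
A saved model can later be loaded again for prediction. The sketch below assumes that model_trained is an `xgboost.Booster`, which the notebook does not confirm, and the helpers `model_path`, `save_booster` and `load_booster` are hypothetical. It also creates the "Models" folder first, since saving fails if the target folder does not exist.

```python
import os


def model_path(folder_name, models_dir="Models"):
    """Build the file path used in the save cell above."""
    return os.path.join(models_dir, "XGBmodel_" + folder_name + ".model")


def save_booster(booster, folder_name, models_dir="Models"):
    """Save a trained booster, creating the target folder if needed."""
    os.makedirs(models_dir, exist_ok=True)
    path = model_path(folder_name, models_dir)
    booster.save_model(path)
    return path


def load_booster(folder_name, models_dir="Models"):
    """Load a previously saved booster (assumes it was saved by XGBoost)."""
    import xgboost as xgb  # imported here so the path helper works without it

    booster = xgb.Booster()
    booster.load_model(model_path(folder_name, models_dir))
    return booster
```

For example, `load_booster("natoms6_12_test2")` would read the file written by the cell above.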