# Building a KRR NMR prediction model

In this tutorial, we build a kernel ridge regression (KRR) model that predicts NMR chemical shifts from M3GNet descriptors. <br>
Before building the model, please download the dataset from figshare.

In [1]:
%load_ext lab_black

In [2]:
import optuna
import pandas as pd
from sklearn.kernel_ridge import KernelRidge
import time

We build a $^{13}$C NMR prediction model. We use the dataset "m3gnet_train_C_1000.csv", which contains 1000 carbon environments.

## Loading dataset

In [3]:
element = "C"
atomic_number = 6
df_train = pd.read_csv(
    f"../../data/NMR/train_dataset/{element}/m3gnet_train_{element}_1000.csv"
)
df_test = pd.read_csv(
    f"../../data/NMR/test_dataset/{element}/m3gnet_test_{element}.csv"
)

## Splitting dataframe into X and y

In [4]:
X_train = df_train.loc[:, "atom_feature_vector_1":"atom_feature_vector_64"]
X_test = df_test.loc[:, "atom_feature_vector_1":"atom_feature_vector_64"]
y_train = df_train[["nmr_shift"]]
y_test = df_test[["nmr_shift"]]
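The column selection above relies on pandas label slicing, which includes both endpoints. A minimal sketch on a synthetic frame illustrates this; the `structure_id` column and the random values are assumptions for illustration only, and only the `atom_feature_vector_*` and `nmr_shift` column names come from the notebook.

```python
import numpy as np
import pandas as pd

n_rows, n_features = 5, 64
feature_cols = [f"atom_feature_vector_{i}" for i in range(1, n_features + 1)]

rng = np.random.default_rng(0)
df_demo = pd.DataFrame(rng.normal(size=(n_rows, n_features)), columns=feature_cols)
# Hypothetical metadata column placed before the features
df_demo.insert(0, "structure_id", range(n_rows))
# Synthetic target values, not real NMR shifts
df_demo["nmr_shift"] = rng.normal(loc=100.0, scale=30.0, size=n_rows)

# Label-based slicing with .loc includes both the first and the last label
X_demo = df_demo.loc[:, "atom_feature_vector_1":"atom_feature_vector_64"]
# Double brackets keep the target as a one-column DataFrame
y_demo = df_demo[["nmr_shift"]]

print(X_demo.shape, y_demo.shape)
```

Because the slice is positional between the two labels, it assumes the 64 feature columns are stored contiguously and in order in the CSV.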

## Building kernel ridge model and fitting

In [5]:
# initial hyperparameter values (regularization strength and kernel width)
alpha = 0.005
gamma = 0.3

In [6]:
# use the Laplacian kernel: k(x, x') = exp(-gamma * ||x - x'||_1)
kernel_ridge = KernelRidge(kernel="laplacian", gamma=gamma, alpha=alpha)
kernel_ridge.fit(X_train, y_train)
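To make the fit above less of a black box, here is a minimal sketch of what kernel ridge regression with a Laplacian kernel computes, checked against scikit-learn on synthetic data. The `*_demo` and `*_manual` names and the random inputs are assumptions for illustration; only the kernel choice and the values of `alpha` and `gamma` mirror the cell above.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics.pairwise import laplacian_kernel

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 4))  # synthetic features
y_demo = rng.normal(size=20)  # synthetic targets
alpha_demo, gamma_demo = 0.005, 0.3

# Laplacian kernel from pairwise L1 (Manhattan) distances
l1_dist = np.abs(X_demo[:, None, :] - X_demo[None, :, :]).sum(axis=-1)
K_manual = np.exp(-gamma_demo * l1_dist)

# Dual coefficients solve (K + alpha * I) c = y; predictions are K @ c
dual_manual = np.linalg.solve(K_manual + alpha_demo * np.eye(len(X_demo)), y_demo)
pred_manual = K_manual @ dual_manual

# Same model through scikit-learn
model_demo = KernelRidge(kernel="laplacian", gamma=gamma_demo, alpha=alpha_demo)
pred_sklearn = model_demo.fit(X_demo, y_demo).predict(X_demo)

print(np.allclose(K_manual, laplacian_kernel(X_demo, gamma=gamma_demo)))
print(np.allclose(pred_manual, pred_sklearn, atol=1e-6))
```

A larger `gamma` makes the kernel decay faster with distance, and a smaller `alpha` lets the model follow the training targets more closely.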

## Hyperparameter tuning

In [7]:
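# search ranges for the regularization strength (alpha) and kernel width (gamma)
alpha_low = 5e-3
alpha_high = 5e-2
gamma_low = 1e-1
gamma_high = 1e0
n_iteration = 30  # number of Optuna trials
random_state = 0
cv = 5  # number of cross-validation folds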

In [8]:
param_distributions = {
    "alpha": optuna.distributions.FloatDistribution(alpha_low, alpha_high),
    "gamma": optuna.distributions.FloatDistribution(gamma_low, gamma_high),
}

In [9]:
optuna_search = optuna.integration.OptunaSearchCV(
    kernel_ridge,
    param_distributions,
    cv=cv,
    n_jobs=1,
    n_trials=n_iteration,
    random_state=random_state,
    scoring=None,  # None: use the estimator's own score method (R^2 for KernelRidge)
)

In [10]:
start_time = time.time()
optuna_search.fit(X_train, y_train)
print(f"It takes {time.time() - start_time} [s] for hyper parameters tuning")

[I 2024-04-05 07:03:14,568] A new study created in memory with name: no-name-cc160845-25b0-483a-ac5e-8376827076b1
[I 2024-04-05 07:03:15,492] Trial 0 finished with value: 0.8968670910536346 and parameters: {'alpha': 0.02687212953596977, 'gamma': 0.7171410857599202}. Best is trial 0 with value: 0.8968670910536346.
[I 2024-04-05 07:03:16,329] Trial 1 finished with value: 0.9626408079527161 and parameters: {'alpha': 0.04935092227977195, 'gamma': 0.2760367691460003}. Best is trial 1 with value: 0.9626408079527161.
[I 2024-04-05 07:03:17,065] Trial 2 finished with value: 0.7821823723569024 and parameters: {'alpha': 0.031253525306763084, 'gamma': 0.9825239593893744}. Best is trial 1 with value: 0.9626408079527161.
[I 2024-04-05 07:03:17,712] Trial 3 finished with value: 0.8937660386289838 and parameters: {'alpha': 0.013121613382547876, 'gamma': 0.7313217715739823}. Best is trial 1 with value: 0.9626408079527161.
[I 2024-04-05 07:03:18,377] Trial 4 finished with value: 0.8616764663780131 and 

It takes 23.21517539024353 [s] for hyper parameters tuning
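Since the search is configured with `scoring=None`, the trial values in the log are presumably cross-validated R² scores, which is what `KernelRidge.score` returns (higher is better, with a maximum of 1). A minimal sketch on synthetic data shows that the default score matches `r2_score`; the data and `*_demo` names are assumptions for illustration.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))  # synthetic features
y_demo = X_demo.sum(axis=1) + 0.1 * rng.normal(size=50)  # synthetic targets

# Fit on the first 40 rows, score on the remaining 10
model_demo = KernelRidge(kernel="laplacian", gamma=0.3, alpha=0.005)
model_demo.fit(X_demo[:40], y_demo[:40])

score_default = model_demo.score(X_demo[40:], y_demo[40:])
score_r2 = r2_score(y_demo[40:], model_demo.predict(X_demo[40:]))

print(np.isclose(score_default, score_r2))
```

To tune directly on an error in ppm instead, a scikit-learn scorer name such as `"neg_mean_absolute_error"` could be passed as `scoring`.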


In [11]:
# best params
print(optuna_search.best_params_)

{'alpha': 0.0053786843126121565, 'gamma': 0.10379398353737113}


## Predicting NMR values

In [12]:
from sklearn.metrics import mean_absolute_error  # MAE
from sklearn.metrics import mean_squared_error  # MSE

# predict with the best estimator, which OptunaSearchCV refits on the full training set by default
predictions_kr_train = optuna_search.predict(X_train)
predictions_kr_test = optuna_search.predict(X_test)

## Calculating MAE and MSE

In [13]:
mae_train = mean_absolute_error(y_train, predictions_kr_train)
mae_test = mean_absolute_error(y_test, predictions_kr_test)

In [14]:
print(f"MAE(train) : {mae_train} ppm")
print(f"MAE(test) : {mae_test} ppm")

MAE(train) : 0.3382271180018615 ppm
MAE(test) : 5.8226059686979275 ppm
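The notebook imports `mean_squared_error` but only reports MAE. A minimal sketch of how MAE, MSE, and RMSE could be computed together is given below; the shift values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up shifts in ppm, for illustration only
y_true_demo = np.array([100.0, 120.0, 140.0])
y_pred_demo = np.array([101.0, 118.0, 143.0])

mae_demo = mean_absolute_error(y_true_demo, y_pred_demo)  # mean of 1, 2, 3
mse_demo = mean_squared_error(y_true_demo, y_pred_demo)  # mean of 1, 4, 9
rmse_demo = np.sqrt(mse_demo)  # square root brings the unit back to ppm

print(f"MAE  : {mae_demo:.3f} ppm")
print(f"MSE  : {mse_demo:.3f} ppm^2")
print(f"RMSE : {rmse_demo:.3f} ppm")
```

In the notebook, the same calls would take `y_train` and `predictions_kr_train`, or `y_test` and `predictions_kr_test`, as arguments.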
