# Deep learning models
In the other tutorials, all examples have used scikit-learn models. However,
QSPRpred also has a number of deep learning models built in. These models rely on
torch, so make sure that you either have torch installed or have installed QSPRpred with the `deep` (or `full`) option (see [README.txt](https://github.com/CDDLeiden/QSPRpred#readme)).
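
Before running the rest of this tutorial, it can help to confirm that torch is importable in the current environment. The check below is a minimal sketch that only uses the Python standard library; the variable name `torch_available` is chosen for illustration and is not part of QSPRpred.

```python
import importlib.util

# find_spec returns None when the package cannot be located on the import path
torch_available = importlib.util.find_spec("torch") is not None

if torch_available:
    print("torch found: the deep learning models can be used.")
else:
    print("torch not found: install torch or QSPRpred with the deep option first.")
```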

First, we will load the dataset as usual.

In [1]:
import os
from qsprpred.data.data import QSPRDataset
from qsprpred.data.utils.descriptorsets import FingerprintSet
from qsprpred.data.utils.descriptorcalculator import MoleculeDescriptorsCalculator
from qsprpred.data.utils.datasplitters import RandomSplit

os.makedirs("../../tutorial_output/data", exist_ok=True)

# Create dataset
dataset = QSPRDataset.fromTableFile(
    filename="../../tutorial_data/A2A_LIGANDS.tsv",
    store_dir="../../tutorial_output/data",
    name="DeepLearningTutorialDataset",
    target_props=[{"name": "pchembl_value_Mean", "task": "REGRESSION"}],
    random_state=42
)

# calculate compound features and split dataset into train and test
feature_calculator = MoleculeDescriptorsCalculator(
    desc_sets=[FingerprintSet(fingerprint_type="MorganFP", radius=3, nBits=2048)]
)
dataset.prepareDataset(
    split=RandomSplit(test_fraction=0.2, dataset=dataset),
    feature_calculators=[feature_calculator],
    recalculate_features=True,
)

dataset.getDF().head()

| QSPRID (index) | SMILES | pchembl_value_Mean | Year | QSPRID |
|---|---|---|---|---|
| DeepLearningTutorialDataset_0 | `Cc1cc(C)n(-c2cc(NC(=O)CCN(C)C)nc(-c3ccc(C)o3)n...` | 8.68 | 2008.0 | DeepLearningTutorialDataset_0 |
| DeepLearningTutorialDataset_1 | `Nc1c(C(=O)Nc2ccc([N+](=O)[O-])cc2)sc2nc3c(cc12...` | 4.82 | 2010.0 | DeepLearningTutorialDataset_1 |
| DeepLearningTutorialDataset_2 | `O=C(Nc1nc2ncccc2n2c(=O)n(-c3ccccc3)nc12)c1ccccc1` | 5.65 | 2009.0 | DeepLearningTutorialDataset_2 |
| DeepLearningTutorialDataset_3 | `CNC(=O)C12CC1C(n1cnc3c(NCc4cccc(Cl)c4)nc(C#CCC...` | 5.45 | 2009.0 | DeepLearningTutorialDataset_3 |
| DeepLearningTutorialDataset_4 | `CCCn1c(=O)c2c(nc3cc(OC)ccn32)n(CCCNC(=O)c2ccc(...` | 5.2 | 2019.0 | DeepLearningTutorialDataset_4 |


## Fully connected neural network
The first model we will look at is a fully connected neural network. This model uses the `DNNModel` class instead of the `SklearnModel` class. The `DNNModel` class accepts a `patience` argument, which is the number of epochs to wait before stopping training if the validation loss does not improve, and a tolerance (`tol`) argument, which is the minimum decrease in validation loss that counts as an improvement.

Other parameters for the underlying estimator `STFullyConnected` can be passed to the `parameters` argument as usual.
There is no need to specify the `alg` argument, as currently only `STFullyConnected` is available.
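
To make the roles of `patience` and `tol` concrete, here is a minimal pure-Python sketch of the stopping rule described above. It is an illustration under stated assumptions, not QSPRpred's actual implementation: the helper `stopping_epoch` and the list of validation losses are hypothetical, and the library may differ in details such as how ties are handled.

```python
def stopping_epoch(val_losses, patience, tol):
    """Return the 1-based epoch at which training would stop.

    Hypothetical helper for illustration only; not part of QSPRpred.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if best_loss - loss > tol:
            # the loss dropped by more than tol: keep it and reset the counter
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    # patience never ran out: all available epochs were used
    return len(val_losses)


# made-up validation losses, not taken from the tutorial run
losses = [1.00, 0.80, 0.70, 0.695, 0.693, 0.692, 0.60]
stopped_at = stopping_epoch(losses, patience=3, tol=0.01)
print(stopped_at)  # → 6
```

With `patience=3` and `tol=0.01`, the small decreases at epochs 4 to 6 do not count as improvements, so training stops at epoch 6 and the larger drop at epoch 7 is never reached.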

In [34]:
# Create model
from qsprpred.extra.gpu.models.dnn import DNNModel

os.makedirs("../../tutorial_output/models", exist_ok=True)
model = DNNModel(
    base_dir = '../../tutorial_output/models',
    data = dataset,
    name = 'DeepLearningTutorialModel',
    parameters={'n_epochs': 100}, # maximum number of epochs to train for
    patience=3,
    tol=0.01,
    random_state=42
)

The `DNNModel` supports early stopping, which, as mentioned, is controlled by the `patience` and `tol` arguments. You can check whether a model supports early stopping through its `supportsEarlyStopping` attribute.

In [28]:
model.supportsEarlyStopping

True

If a model supports early stopping, it also has an `earlyStopping` attribute, which is an instance of the `EarlyStopping` class that keeps track of the number of epochs trained. The `EarlyStopping` class has a `mode` attribute that sets how early stopping is handled when fitting the estimator. It can be one of four modes: `EarlyStoppingMode.RECORDING`, `EarlyStoppingMode.NOT_RECORDING`, `EarlyStoppingMode.OPTIMAL` and `EarlyStoppingMode.FIXED`. By default it is set to `EarlyStoppingMode.NOT_RECORDING`.

In [35]:
model.earlyStopping.mode

<EarlyStoppingMode.NOT_RECORDING: 'NOT_RECORDING'>

In this mode (`EarlyStoppingMode.NOT_RECORDING`), the `EarlyStopping` instance does not keep track of the epoch at which training is stopped during a fit. In the `EarlyStoppingMode.RECORDING` mode, it does record the epoch at which training was stopped. The recorded values can be accessed through the `trainedEpochs` attribute, which is a list with one stopping epoch per fit. You can see that for now it is just an empty list.

In [36]:
model.earlyStopping.trainedEpochs

[]

If we then run a cross-validation with the mode set to `EarlyStoppingMode.RECORDING`, we can see that the `trainedEpochs` attribute is now filled with the epoch at which training was stopped in each fold.

In [37]:
from qsprpred.models.assessment_methods import CrossValAssessor
from qsprpred.models.early_stopping import EarlyStoppingMode
from qsprpred.models.metrics import SklearnMetric

CrossValAssessor(SklearnMetric.getMetric(name='r2'), mode=EarlyStoppingMode.RECORDING)(model)
model.earlyStopping.trainedEpochs

[59, 57, 63, 32, 54]

Note that the mode has now been changed to `EarlyStoppingMode.RECORDING`. Therefore, if you run the cross-validation again, the epochs at which training stops in the new run are appended to the `trainedEpochs` list, after the previously recorded ones.

In [32]:
CrossValAssessor(SklearnMetric.getMetric(name='r2'))(model)
model.earlyStopping.trainedEpochs

[59, 57, 63, 32, 54, 57, 44, 64, 39, 60]

In [38]:
model.earlyStopping.mode

<EarlyStoppingMode.RECORDING: 'RECORDING'>
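
The bookkeeping shown in the outputs above can be mimicked with a small toy class. This is only a sketch of the recording behaviour described in this tutorial: `Mode` and `EpochLog` are hypothetical stand-ins, not QSPRpred classes, and the epoch values are copied from the two cross-validation outputs above.

```python
from enum import Enum


class Mode(Enum):
    # hypothetical stand-ins for two of the EarlyStoppingMode members
    RECORDING = "RECORDING"
    NOT_RECORDING = "NOT_RECORDING"


class EpochLog:
    """Toy stand-in for the bookkeeping described above (not QSPRpred code)."""

    def __init__(self):
        self.mode = Mode.NOT_RECORDING
        self.trained_epochs = []

    def record(self, epoch):
        # only keep the stopping epoch when recording is switched on
        if self.mode is Mode.RECORDING:
            self.trained_epochs.append(epoch)


log = EpochLog()
log.record(59)  # ignored: the log is still in NOT_RECORDING mode
log.mode = Mode.RECORDING
for epoch in [59, 57, 63, 32, 54]:  # first cross-validation run
    log.record(epoch)
for epoch in [57, 44, 64, 39, 60]:  # second run, appended after the first
    log.record(epoch)

print(log.trained_epochs)
```

Because the mode stays on `RECORDING` between the two runs, the list grows from five to ten entries, which matches the pattern in the outputs above.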

In [19]:
from qsprpred.models.hyperparam_optimization import GridSearchOptimization
from qsprpred.models.assessment_methods import CrossValAssessor, TestSetAssessor
from qsprpred.models.metrics import SklearnMetric
from qsprpred.models.early_stopping import EarlyStoppingMode

# score_func = SklearnMetric.getMetric(name='r2')

# # Define the search space
# search_space = {"lr": [1e-4, 1e-3,], "neurons_h1" : [100, 200]}

# gridsearcher = GridSearchOptimization(
#     param_grid=search_space,
#     model_assessor=TestSetAssessor(scoring=SklearnMetric.getDefaultMetric(model.task)),
# )
# gridsearcher.optimize(model)

# # Create a CrossValAssessor object
# CrossValAssessor(score_func, mode=EarlyStoppingMode.RECORDING)(model)
# TestSetAssessor(score_func)(model)

# fit the model on the whole dataset
# NOTE: in the recorded run, passing `mode` here raised the TypeError shown below
model.fitAttached(mode=EarlyStoppingMode.RECORDING)

TypeError: qsprpred.models.early_stopping.early_stopping.<locals>.wrapper_fit() got multiple values for keyword argument 'mode'