# BiBERTa: Deep learning-assisted discovery of donor/acceptor pairs for high-performance organic solar cells

BiBERTa is a deep learning framework for discovering new donor/acceptor (D/A) pairs for organic solar cells. It has three parts: data collection, power conversion efficiency (PCE) prediction, and molecular discovery. A large D/A pair dataset was built from experimental data reported in the literature. A RoBERTa-based dual-encoder model (BiBERTa) was then developed to predict PCE from the SMILES strings of the donor and the acceptor, with two pretrained ChemBERTa2 encoders providing the initial parameters of the dual encoder. The model was trained, validated and tested on the experimental dataset.

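The example below walks through the whole process and serves as a check that the code runs.
All parameters are deliberately set **small** to show how BiBERTa works, so the metrics printed during training are not representative of a full training run.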
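
Because the demo relies on deliberately small settings, it can help to inspect the hyperparameter file before launching a run. The sketch below assumes the file is plain JSON. The key names `max_epochs` and `seed` are hypothetical placeholders, since the actual keys of `config/config_hparam.json` are not listed in this document, and a temporary file is used so the snippet runs on its own.

```python
import json
import tempfile
from pathlib import Path


def load_hparams(path):
    """Read a JSON hyperparameter file into a dict."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


# Hypothetical demo values; the real file may use different keys.
demo_hparams = {"max_epochs": 12, "seed": 111}

with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "config_hparam.json"
    demo_path.write_text(json.dumps(demo_hparams, indent=2), encoding="utf-8")
    loaded = load_hparams(demo_path)

# Print every setting so small demo values are easy to spot.
for name, value in sorted(loaded.items()):
    print(f"{name} = {value}")
```

Pointing `load_hparams` at the real configuration file instead of the temporary one would list the settings actually used by the run.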

## Train

BiBERTa consists of two RoBERTa encoders (one for the donor, one for the acceptor) followed by interaction layers. The model takes the SMILES strings of a donor and an acceptor as input, and two pretrained ChemBERTa2 encoders provide the initial parameters of the dual-encoder layers.
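
The data flow through a dual encoder can be illustrated with a toy NumPy sketch. This is not the BiBERTa implementation: the two "encoders" are random embedding tables standing in for the pretrained ChemBERTa2 encoders, the interaction layer is assumed here to be a single cross-attention step, and the token ids are arbitrary integers rather than tokenized SMILES. The returned number is therefore meaningless; only the shapes and the order of operations are of interest.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM = 32, 8  # toy sizes, far smaller than a real RoBERTa encoder


def make_encoder():
    """Stand-in for one pretrained encoder: a random embedding table."""
    table = rng.normal(size=(VOCAB, DIM))
    return lambda token_ids: table[token_ids]  # shape (seq_len, DIM)


def cross_attention(queries, keys_values):
    """Scaled dot-product attention of one molecule's tokens over the other's."""
    scores = queries @ keys_values.T / np.sqrt(DIM)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ keys_values  # shape (len(queries), DIM)


donor_encoder, acceptor_encoder = make_encoder(), make_encoder()
head = rng.normal(size=2 * DIM)  # linear regression head on the fused vector


def predict_pce(donor_ids, acceptor_ids):
    """Encode both molecules, let them interact, and regress one scalar."""
    d = donor_encoder(np.asarray(donor_ids))
    a = acceptor_encoder(np.asarray(acceptor_ids))
    d_ctx = cross_attention(d, a)  # donor tokens with acceptor context
    a_ctx = cross_attention(a, d)  # acceptor tokens with donor context
    fused = np.concatenate([d_ctx.mean(axis=0), a_ctx.mean(axis=0)])
    return float(fused @ head)


print(predict_pce([1, 5, 9, 2], [3, 7, 4]))
```

Keeping a separate encoder per molecule lets each one specialize, while the interaction step is the only place where donor and acceptor information is mixed before the regression head.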

In [1]:
import train

In [2]:
train.main(using_wandb=False, hparams='config/config_hparam.json')

Global seed set to 111
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Some weights of the model checkpoint at DeepChem/ChemBERTa-10M-MTR were not used when initializing RobertaModel: ['regression.out_proj.bias', 'norm_std', 'norm_mean', 'regression.dense.bias', 'regression.dense.weight', 'regression.out_proj.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaModel were not initialized from the model checkpoint at DeepChem/ChemBERTa-10M-MTR and are new

Sanity Checking: 0it [00:00, ?it/s]


mae : 11.320357322692871
mse : 129.50091552734375
r2 : -95.02465836638983
r : (-0.15681785877868776, 0.8997528739238635)



Training: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]


mae : 11.339509010314941
mse : 129.9381866455078
r2 : -95.34889210142464
r : (-0.695970444850101, 0.5099488927664776)


Validation: 0it [00:00, ?it/s]


mae : 11.353251457214355
mse : 130.25604248046875
r2 : -95.58458531057506
r : (-0.8837593217927395, 0.3100085823443737)


Validation: 0it [00:00, ?it/s]


mae : 11.365836143493652
mse : 130.54246520996094
r2 : -95.7969671662234
r : (-0.9291022063013242, 0.24116338773809162)


Validation: 0it [00:00, ?it/s]


mae : 11.378275871276855
mse : 130.8253936767578
r2 : -96.00675803338187
r : (-0.9387377555338279, 0.22399279977766837)


Validation: 0it [00:00, ?it/s]


mae : 11.391215324401855
mse : 131.12200927734375
r2 : -96.2267016531091
r : (-0.9474233972244441, 0.20735418430665573)


Validation: 0it [00:00, ?it/s]


mae : 11.403160095214844
mse : 131.39552307128906
r2 : -96.42950967428648
r : (-0.970118117214374, 0.15602224944152876)


Validation: 0it [00:00, ?it/s]


mae : 11.41584300994873
mse : 131.6869659423828
r2 : -96.64561206871977
r : (-0.9881986259950967, 0.09790152141628687)


Validation: 0it [00:00, ?it/s]


mae : 11.426310539245605
mse : 131.9269561767578
r2 : -96.82356621195204
r : (-0.9990359187946117, 0.027956759244627788)


Validation: 0it [00:00, ?it/s]


mae : 11.432506561279297
mse : 132.0695343017578
r2 : -96.92928947955389
r : (-0.993156121959623, 0.07452367294514158)


Validation: 0it [00:00, ?it/s]


mae : 11.436820030212402
mse : 132.1690673828125
r2 : -97.00308362262108
r : (-0.9875166330838923, 0.10069636984177498)


Validation: 0it [00:00, ?it/s]


mae : 11.438968658447266
mse : 132.21923828125
r2 : -97.04029466765073
r : (-0.9831523276243125, 0.11702447589363633)


Validation: 0it [00:00, ?it/s]


mae : 11.44015121459961
mse : 132.2470245361328
r2 : -97.06088868897383
r : (-0.9853580435809046, 0.10907519901164801)


`Trainer.fit` stopped: `max_epochs=12` reached.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

Testing: 0it [00:00, ?it/s]


mae : 11.44015121459961
mse : 132.2470245361328
r2 : -97.06088868897383
r : (-0.9853580435809046, 0.10907519901164801)
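
The strongly negative `r2` values in this log follow from the standard definitions of the metrics: R² compares the squared error with the variance of the targets, so predictions that are offset by a large constant give a large negative R² regardless of how well they correlate. The sketch below uses textbook formulas and hypothetical numbers; whether the repository computes its metrics exactly this way is an assumption. The second element of the printed `r` pair may be a p-value, as returned by `scipy.stats.pearsonr`, but that is also an assumption and it is not computed here.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, R^2 and Pearson r for two equal-length sequences."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)  # target variance * n
    r2 = 1.0 - (mse * n) / ss_tot
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    ss_pred = sum((p - mean_p) ** 2 for p in y_pred)
    r = cov / math.sqrt(ss_tot * ss_pred)
    return {"mae": mae, "mse": mse, "r2": r2, "r": r}


# Hypothetical values: predictions track the targets but are shifted by 10.
targets = [1.0, 2.0, 3.0, 4.0]
shifted = [t + 10.0 for t in targets]
print(regression_metrics(targets, shifted))
# → {'mae': 10.0, 'mse': 100.0, 'r2': -79.0, 'r': 1.0}
```

In this toy case the correlation is perfect while R² is -79, which shows that a constant offset alone is enough to produce values of the magnitude seen in the log.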


## Screen a large-scale dataset

In [8]:
import screen

In [9]:
x = screen.smiles_aas_test("dataset/OSC/test.csv")
print(x)

[{'acceptor': 'CCCCCCCCC1(CCCCCCCC)c2cc3c(cc2-c2cc4c(cc21)-c1sc(/C=C2\\C(=O)c5ccccc5C2=C(C#N)C#N)cc1C4(CCCCCCCC)CCCCCCCC)C(CCCCCCCC)(CCCCCCCC)c1cc(/C=C2\\C(=O)c4ccccc4C2=C(C#N)C#N)sc1-3', 'donor': 'CCCCC(CC)Cc1sc(-c2c3cc(-c4ccc(-c5sc(-c6ccc(C)s6)c6c5C(=O)c5c(CC(CC)CCCC)sc(CC(CC)CCCC)c5C6=O)s4)sc3c(-c3cc(F)c(CC(CC)CCCC)s3)c3cc(C)sc23)cc1F', 'predict': 10.974780082702637}, {'acceptor': 'CCCCCCc1ccc(C2(c3ccc(CCCCCC)cc3)c3cc4c(cc3-c3sc5cc(/C=C6\\C(=O)c7ccccc7C6=C(C#N)C#N)sc5c32)C(c2ccc(CCCCCC)cc2)(c2ccc(CCCCCC)cc2)c2c-4sc3cc(/C=C4\\C(=O)c5ccccc5C4=C(C#N)C#N)sc23)cc1', 'donor': 'CCCCCCCCOc1cccc(-c2nc3c(-c4ccc(C)s4)c(F)c(F)c(-c4ccc(-c5cc6c(-c7cc(F)c(CC(CC)CCCC)s7)c7sc(C)cc7c(-c7cc(F)c(CC(CC)CCCC)s7)c6s5)s4)c3nc2-c2cccc(OCCCCCCCC)c2)c1', 'predict': 8.40988540649414}, {'acceptor': 'CCCCCCc1ccc(C2(c3ccc(CCCCCC)cc3)c3c(sc4cc(/C=C5\\C(=O)c6cc(F)c(F)cc6C5=C(C#N)C#N)sc34)-c3sc4c5c(sc4c32)-c2sc3cc(/C=C4\\C(=O)c6cc(F)c(F)cc6C4=C(C#N)C#N)sc3c2C5(c2ccc(CCCCCC)cc2)c2ccc(CCCCCC)cc2)cc1', 'donor': 'CCCCC(
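
The screening call returns a list of dictionaries with `acceptor`, `donor` and `predict` keys, as the output above shows. A natural next step is to rank the pairs by predicted PCE and keep the best candidates. The sketch below is one way to do that; `top_pairs` is a hypothetical helper, not part of the repository, and the records are placeholders with toy SMILES rather than real OSC materials.

```python
def top_pairs(records, k=2):
    """Return the k records with the highest predicted value, best first."""
    return sorted(records, key=lambda rec: rec["predict"], reverse=True)[:k]


# Placeholder records: the SMILES are toy molecules, not real OSC materials.
records = [
    {"acceptor": "c1ccccc1", "donor": "c1ccsc1", "predict": 8.41},
    {"acceptor": "c1ccncc1", "donor": "c1ccoc1", "predict": 10.97},
    {"acceptor": "C1CCCCC1", "donor": "c1cc[nH]c1", "predict": 7.42},
]

for rec in top_pairs(records):
    print(f'{rec["predict"]:.2f}  donor={rec["donor"]}  acceptor={rec["acceptor"]}')
```

Passing the list returned by `screen.smiles_aas_test` in place of `records` should work unchanged, provided the keys match those shown in the output.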




## Predict the PCE of a D/A pair

In [10]:
import run

Some weights of the model checkpoint at DeepChem/ChemBERTa-10M-MLM were not used when initializing RobertaModel: ['lm_head.dense.weight', 'lm_head.decoder.weight', 'lm_head.dense.bias', 'lm_head.decoder.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaModel were not initialized from the model checkpoint at DeepChem/ChemBERTa-10M-MLM and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be 

In [11]:
a = run.smiles_adp_test('CCCCC(CC)CC1=C(F)C=C(C2=C3C=C(C4=CC=C(C5=C6C(=O)C7=C(CC(CC)CCCC)SC(CC(CC)CCCC)=C7C(=O)C6=C(C6=CC=C(C)S6)S5)S4)SC3=C(C3=CC(F)=C(CC(CC)CCCC)S3)C3=C2SC(C)=C3)S1','CCCCC(CC)CC1=CC=C(C2=C3C=C(C)SC3=C(C3=CC=C(CC(CC)CCCC)S3)C3=C2SC(C2=CC4=C(C5=CC(Cl)=C(CC(CC)CCCC)S5)C5=C(C=C(C)S5)C(C5=CC(Cl)=C(CC(CC)CCCC)S5)=C4S2)=C3)S1')                         
print(a)

7.416102886199951


## Acknowledgement

Jinyu Sun 

E-mail: jinyusun@csu.edu.cn