# DeepAcceptor: Deep learning-based design and screening of non-fullerene acceptor materials for organic solar cells

Developing affordable, high-performance organic photovoltaic materials is time-consuming and costly. Reliable computational methods for predicting the power conversion efficiency (PCE) are crucial for filtering out unpromising molecules in large-scale databases and accelerating materials discovery. In this study, a deep learning-based framework (DeepAcceptor) was built to design and discover highly efficient small-molecule acceptor materials. Specifically, an experimental dataset was constructed by collecting data from publications. Then, a BERT-based model was customized to predict PCEs by taking full advantage of the atom, bond, and connection information in the molecular structures of acceptors; this customized architecture is termed abcBERT. The molecular graph was used as the input, and the computational molecules and experimental molecules were used to pre-train and fine-tune the model, respectively.
DeepAcceptor is a promising method to predict the PCE and speed up the discovery of high-performance acceptor materials.

This is a toy-data example of the whole process.
It was used to test that the code works.
All parameters were set to small values to show how abcBERT works.

## Dataset preparation

The atom types and bond information are calculated with RDKit. The training, test, and validation datasets are preprocessed with the helper functions in `utils.py`.
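To make the preprocessing step more concrete, here is a minimal sketch of how atom types and bond connectivity can be read from a SMILES string with RDKit. It is a simplified illustration and not the repository's `mol_to_geognn_graph_data_MMFF3d`, which may compute additional features; the function name `smiles_to_atoms_and_adjacency` is hypothetical.

```python
from rdkit import Chem


def smiles_to_atoms_and_adjacency(smiles):
    """Return (atom symbols, adjacency matrix) for a SMILES string.

    Hypothetical helper for illustration only; hydrogens are left implicit.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    # One symbol per heavy atom, in RDKit's atom order.
    atoms = [atom.GetSymbol() for atom in mol.GetAtoms()]
    # Square 0/1 matrix: entry (i, j) is 1 when atoms i and j are bonded.
    adjacency = Chem.GetAdjacencyMatrix(mol)
    return atoms, adjacency


# Ethanol as a small example molecule.
atoms, adjacency = smiles_to_atoms_and_adjacency("CCO")
```

For ethanol, `atoms` holds the three heavy atoms and `adjacency` is a 3x3 matrix with ones for the C-C and C-O bonds.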

In [1]:
import utils

import os
from collections import OrderedDict

import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import rdchem

from compound_constants import DAY_LIGHT_FG_SMARTS_LIST


from utils import mol_to_geognn_graph_data_MMFF3d

In [2]:
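    import pandas as pd
    from tqdm import tqdm

    # Load the training split; the second column holds the SMILES strings.
    df = pd.read_csv(r"data/reg/train.csv")
    rows = []
    pce = df['PCE']
    for ind, smile in enumerate(df.iloc[:, 1]):
        # Convert each SMILES string into atom tokens and an adjacency matrix.
        atom, adj = mol_to_geognn_graph_data_MMFF3d(smile)
        adj_path = 'data/reg/train/adj' + str(ind) + '.npy'
        np.save(adj_path, np.array(adj))
        # Store the atoms, the path to the saved adjacency matrix, and the PCE label.
        rows.append([atom, adj_path, pce[ind]])
    r = pd.DataFrame(rows)
    r.to_csv('data/reg/train/train.csv')
    print('done')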
    

done


In [3]:
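    # Load the test split; the second column holds the SMILES strings.
    df = pd.read_csv(r"data/reg/test.csv")
    rows = []
    pce = df['PCE']
    for ind, smile in enumerate(df.iloc[:, 1]):
        # Convert each SMILES string into atom tokens and an adjacency matrix.
        atom, adj = mol_to_geognn_graph_data_MMFF3d(smile)
        adj_path = 'data/reg/test/adj' + str(ind) + '.npy'
        np.save(adj_path, np.array(adj))
        # Store the atoms, the path to the saved adjacency matrix, and the PCE label.
        rows.append([atom, adj_path, pce[ind]])
    r = pd.DataFrame(rows)
    r.to_csv('data/reg/test/test.csv')
    print('done')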

done


In [4]:
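import pandas as pd
from tqdm import tqdm

# Load the unlabeled (computational) molecules used for pre-training.
df = pd.read_csv(r"data/chem.txt")
rows = []
for ind, smile in enumerate(df['Smiles']):
    print(ind)
    # Convert each SMILES string into atom tokens and an adjacency matrix.
    atom, adj = mol_to_geognn_graph_data_MMFF3d(smile)
    adj_path = 'data/adj/adj' + str(ind) + '.npy'
    np.save(adj_path, np.array(adj))
    # No PCE label is stored here: pre-training only needs the structures.
    rows.append([atom, adj_path])
r = pd.DataFrame(rows)
r.to_csv('data/adj/re.csv')
print('done')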

0
1
2
3
4
done


## Pre-Training

First, each SMILES string was converted into a molecular graph by using RDKit. Then, a supernode was added and connected to all the atoms in the molecule. A masked-atom prediction task, analogous to the masked language model (MLM) task in NLP, was used for pre-training. As shown in Figure 1, the pre-training model consists of an embedding layer, transformer encoder layers, and classification layers, and is used to predict the masked atoms. The computational molecules were represented as embeddings, including word token embeddings and positional embeddings, which were used as the input of the transformer encoder layers. Specifically, 15% of the atoms in a molecule were randomly selected; each selected atom had an 80% probability of being replaced by [MASK], a 10% probability of being replaced by another atom, and a 10% probability of remaining unchanged. In the pre-training stage, classification linear layers were added on top of the transformer encoder layers to predict the masked atoms, with the original molecules serving as the ground truth.
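The supernode and the 15% / 80-10-10 masking scheme described above can be sketched in plain Python as follows. This is an illustrative sketch, not the code in `pretrain.py`: the function names, the `[GLOBAL]` supernode token, the toy atom vocabulary, and the rule of always selecting at least one atom are all assumptions.

```python
import random

MASK = "[MASK]"
SUPERNODE = "[GLOBAL]"  # hypothetical token name for the supernode


def add_supernode(atoms, adj):
    """Prepend a supernode that is connected to every atom."""
    n = len(atoms)
    new_adj = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        # Link the supernode (index 0) with atom i (index i + 1).
        new_adj[0][i + 1] = new_adj[i + 1][0] = 1
        for j in range(n):
            new_adj[i + 1][j + 1] = adj[i][j]
    return [SUPERNODE] + list(atoms), new_adj


def mask_atoms(atoms, vocab, rng, mask_rate=0.15):
    """Return (corrupted tokens, labels); a label is None where nothing is predicted."""
    # Assumption: at least one atom is always selected.
    n_select = max(1, round(mask_rate * len(atoms)))
    selected = set(rng.sample(range(len(atoms)), n_select))
    tokens, labels = [], []
    for i, atom in enumerate(atoms):
        if i not in selected:
            tokens.append(atom)
            labels.append(None)
            continue
        labels.append(atom)  # the original atom is the training target
        p = rng.random()
        if p < 0.8:
            tokens.append(MASK)  # 80%: replace with [MASK]
        elif p < 0.9:
            # 10%: replace with a different atom (vocab needs at least two entries)
            tokens.append(rng.choice([a for a in vocab if a != atom]))
        else:
            tokens.append(atom)  # 10%: keep unchanged
    return tokens, labels


# Toy linear molecule: 20 atoms bonded in a chain (made-up example).
atoms = ["C", "C", "O", "N", "C", "S", "C", "C", "F", "C"] * 2
n = len(atoms)
adj = [[1 if abs(i - j) == 1 else 0 for j in range(n)] for i in range(n)]

rng = random.Random(0)
masked, labels = mask_atoms(atoms, vocab=["C", "N", "O", "S", "F"], rng=rng)
tokens, full_adj = add_supernode(masked, adj)
```

Masking is applied before the supernode is prepended, so in this sketch the supernode itself is never masked.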

### Running pre-training on a supercomputer is recommended!

In [5]:
import pretrain

In [6]:
pretrain.main()

Epoch 1 Batch 0 Loss 0.7139
Accuracy: 0.0000
Test Accuracy: 0.0000
medium_weights/bert_weightsMedium_1.h5
Epoch 1 Loss 0.7139
Time taken for 1 epoch: 0.8610005378723145 secs

Accuracy: 0.0000
Saving checkpoint
Epoch 2 Batch 0 Loss nan
Accuracy: 0.0000
Test Accuracy: 0.0000
medium_weights/bert_weightsMedium_2.h5
Epoch 2 Loss nan
Time taken for 1 epoch: 0.33299994468688965 secs

Accuracy: 0.0000
Saving checkpoint


The pre-trained weights can then be used as the starting point for fine-tuning.

## Train

The pre-trained model is fine-tuned on the experimental data so that it can predict the PCE of new NFA materials.

In [1]:
import regression
from regression import *

In [2]:
    result =[]
    r2_list = []
    seed = 12
    r2 ,prediction_val,prediction_test= main(seed)


data
load_wieghts
best r2: 0.1220
best r2: 0.1220
stopping_monitor: 1
The model has been trained
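The `best r2` value in the log above is presumably the coefficient of determination between predicted and experimental PCEs. As a reference for reading that number, here is a minimal sketch of the standard R² formula; the function name and the PCE values are made up for illustration and do not come from `regression.py` or the dataset.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up PCE values (in %), purely illustrative.
experimental = [1.0, 2.0, 3.0]
perfect = r2_score(experimental, [1.0, 2.0, 3.0])    # predictions match exactly
mean_only = r2_score(experimental, [2.0, 2.0, 2.0])  # always predict the mean
```

A model that predicts every value exactly scores 1, while one that always predicts the mean scores 0, so a value near 0.12 on toy data indicates only a weak fit.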


## Predict

In [3]:
import predict
from predict import *

In [4]:
    result =[]
    r2_list = []
    seed = 12
    r2,prediction_val= main(seed)


data
finish!


## Acknowledgement

Jinyu Sun 

E-mail: jinyusun@csu.edu.cn