# DeepAcceptor: Deep learning-based design and screening of non-fullerene acceptor materials for organic solar cells

Developing affordable, high-performance organic photovoltaic materials is time-consuming and costly. Reliable computational methods for predicting the power conversion efficiency (PCE) are crucial for triaging unpromising molecules in large-scale databases and accelerating material discovery. In this study, a deep learning-based framework (DeepAcceptor) was built to design and discover highly efficient small-molecule acceptor materials. Specifically, an experimental dataset was constructed by collecting data from publications. Then, a BERT-based model was customized to predict PCEs by taking full advantage of the atom, bond, and connection information in the molecular structures of acceptors; this customized architecture is termed abcBERT. The molecular graph was used as the input, and the computational molecules and experimental molecules were used to pre-train and fine-tune the model, respectively.
DeepAcceptor is a promising method to predict the PCE and speed up the discovery of high-performance acceptor materials.

This notebook is a toy-data example of the whole process.
It is used to test that the code works.
All parameters are set small to show how abcBERT works.

## Dataset preparation

### Download the pretrained and finetuned model 

In [4]:
pip install wget

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Note: you may need to restart the kernel to use updated packages.


In [1]:
import wget
url = r"https://github.com/JinYSun/DeepAcceptor/releases/download/v1.0.0/data.h5"
wget.download(url,"regression_weights/data.h5")

100% [........................................................................] 17610752 / 17610752

'regression_weights/data.h5'

In [2]:
url1 = r"https://github.com/JinYSun/DeepAcceptor/releases/download/v1.0.0/bert_weightsMedium_80.h5"
url2 = r"https://github.com/JinYSun/DeepAcceptor/releases/download/v1.0.0/bert_weights_encoderMedium_80.h5"
wget.download(url1,"medium_weights/bert_weightsMedium_80.h5")
wget.download(url2,"medium_weights/bert_weights_encoderMedium_80.h5")

100% [........................................................................] 17095616 / 17095616

'medium_weights/bert_weights_encoderMedium_80.h5'
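
The download cells above write into `regression_weights/` and `medium_weights/`, so those folders need to exist before the files can be saved. A minimal sketch of a more defensive download step follows. The helper name `download_if_missing` and its `fetch` parameter are hypothetical (they are not part of DeepAcceptor), and the sketch swaps in the standard-library `urllib.request.urlretrieve` in place of `wget`.

```python
import os
import urllib.request


def download_if_missing(url, dest, fetch=urllib.request.urlretrieve):
    """Download `url` to `dest` unless `dest` already exists.

    Hypothetical helper, not part of the DeepAcceptor code base.
    `fetch` is any callable taking (url, dest); it defaults to urlretrieve.
    """
    parent = os.path.dirname(dest)
    if parent:
        # Create the target folder first so the write cannot fail on a missing directory.
        os.makedirs(parent, exist_ok=True)
    if os.path.exists(dest):
        # Skip the transfer when the file is already present.
        return dest
    fetch(url, dest)
    return dest


# Example usage with the release URL from the cell above (left commented out
# so that running this sketch does not trigger a download):
# download_if_missing(
#     "https://github.com/JinYSun/DeepAcceptor/releases/download/v1.0.0/data.h5",
#     "regression_weights/data.h5",
# )
```

Note that this sketch only checks whether the file exists, so an interrupted download should be deleted by hand before retrying.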

The atom types and bond information are calculated with RDKit. The training, test, and validation datasets are preprocessed using utils.py.

In [2]:
import utils

import os
from collections import OrderedDict

import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import rdchem

from compound_constants import DAY_LIGHT_FG_SMARTS_LIST


from utils import mol_to_geognn_graph_data_MMFF3d

In [3]:
import pandas as pd
from tqdm import tqdm

# Preprocess the training set: one graph per SMILES string.
f = pd.read_csv(r"data/reg/train.csv")
rows = []
pce = f['PCE']
for ind, smile in enumerate(f.iloc[:, 1]):  # SMILES are read from the second column
    atom, adj = mol_to_geognn_graph_data_MMFF3d(smile)
    # Save the adj array to disk and keep only its path in the index table.
    np.save('data/reg/train/adj' + str(ind) + '.npy', np.array(adj))
    rows.append([atom, 'data/reg/train/adj' + str(ind) + '.npy', pce[ind]])
r = pd.DataFrame(rows)
r.to_csv('data/reg/train/train.csv')
print('done')

done


In [4]:
# Preprocess the test set in the same way as the training set.
f = pd.read_csv(r"data/reg/test.csv")
rows = []
pce = f['PCE']
for ind, smile in enumerate(f.iloc[:, 1]):  # SMILES are read from the second column
    atom, adj = mol_to_geognn_graph_data_MMFF3d(smile)
    np.save('data/reg/test/adj' + str(ind) + '.npy', np.array(adj))
    rows.append([atom, 'data/reg/test/adj' + str(ind) + '.npy', pce[ind]])
r = pd.DataFrame(rows)
r.to_csv('data/reg/test/test.csv')
print('done')

done


In [6]:
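# Preprocess the unlabeled molecules used for pre-training (no PCE column).
f = pd.read_table('data/chem1.txt')
rows = []
for ind, smile in enumerate(f.iloc[:, 0]):  # SMILES are read from the first column
    print(ind)
    atom, adj = mol_to_geognn_graph_data_MMFF3d(smile)
    np.save('data/adj/' + str(ind) + '.npy', np.array(adj))
    rows.append([atom, 'data/adj/' + str(ind) + '.npy'])
# Write the index table once, after all molecules have been processed.
r = pd.DataFrame(rows)
r.to_csv('data/adj/re.csv')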

0
1
2
3
4


## Pre-Training

First, each SMILES string was converted into a molecular graph using RDKit. Then, a supernode was added and connected to all the atoms in the molecule. A masked-atom prediction task, analogous to the masked language model (MLM) task in NLP, was chosen to pre-train the model. As shown in Figure 1, the pre-trained model, consisting of an embedding layer, transformer encoder layers, and classification layers, was used to predict the masked atoms. The computational molecules were represented as embeddings, including word token embeddings and positional embeddings, which served as the input to the transformer encoder layers. Specifically, 15% of the atoms in a molecule were randomly selected; each selected atom had an 80% probability of being replaced by [MASK], a 10% probability of being replaced by another atom, and a 10% probability of remaining unchanged. In the pre-training stage, linear classification layers were added on top of the transformer encoder layers to predict the masked atoms. The original molecules served as the ground truth for training the model to predict the types of the masked atoms.
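
The supernode and the 15% / 80-10-10 masking rule described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the code in `pretrain.py`: the function names, the `[MASK]` token string, placing the supernode at index 0 without a self-loop, and selecting a fixed count of `round(0.15 * n)` atoms (at least one) are all choices made for this example.

```python
import random

MASK = "[MASK]"  # assumed token string for a masked atom


def add_supernode(adj):
    """Return a new adjacency matrix with a supernode at index 0.

    The supernode is linked to every atom. Leaving it without a self-loop
    is an assumption of this sketch.
    """
    n = len(adj)
    new_adj = [[0] + [1] * n]          # supernode row: connected to all atoms
    for row in adj:
        new_adj.append([1] + list(row))  # every atom is connected to the supernode
    return new_adj


def mask_atoms(atoms, vocab, rng, mask_rate=0.15):
    """Apply BERT-style masking to a list of atom symbols.

    Returns the corrupted list and the sorted indices that were selected.
    `vocab` needs at least two distinct symbols for the replacement branch.
    """
    n_select = max(1, round(len(atoms) * mask_rate))
    selected = sorted(rng.sample(range(len(atoms)), n_select))
    masked = list(atoms)
    for i in selected:
        p = rng.random()
        if p < 0.8:
            masked[i] = MASK  # 80%: replace with the mask token
        elif p < 0.9:
            # 10%: replace with a different atom type
            masked[i] = rng.choice([a for a in vocab if a != atoms[i]])
        # remaining 10%: keep the original atom
    return masked, selected


rng = random.Random(0)
atoms = ["C", "C", "N", "C", "O", "C", "S", "C", "C", "C"]
corrupted, targets = mask_atoms(atoms, ["C", "N", "O", "S", "F"], rng)
print(corrupted, targets)
```

During pre-training, the loss would be computed only at the selected indices, with the original atom types as labels.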

### Running pre-training on a supercomputer is recommended!

In [7]:
import pretrain

In [8]:
pretrain.main()

Epoch 1 Batch 0 Loss 0.7081
Accuracy: 0.0000
Test Accuracy: 0.0000
medium_weights/bert_weightsMedium_1.h5
Epoch 1 Loss 0.7081
Time taken for 1 epoch: 0.8134782314300537 secs

Accuracy: 0.0000
Saving checkpoint
Epoch 2 Batch 0 Loss nan
Accuracy: 0.0000
Test Accuracy: 0.0000
medium_weights/bert_weightsMedium_2.h5
Epoch 2 Loss nan
Time taken for 1 epoch: 0.3538072109222412 secs

Accuracy: 0.0000
Saving checkpoint


The pre-trained model can then be fine-tuned.

## Train

The pre-trained model is fine-tuned to predict PCE for new NFA materials.

In [9]:
import regression

In [10]:
result = []
r2_list = []
for seed in [24]:  # add more seeds here to repeat the run
    print(seed)
    r2, prediction_val, prediction_test = regression.main(seed)
    result.append(prediction_val)
    r2_list.append(r2)
print(r2_list)

24
data
load_wieghts
best r2: 0.1220
best r2: 0.1220
stopping_monitor: 1
The model has been trained
[0.122]
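
For reference, the `r2` value reported above is presumably the coefficient of determination. The sketch below shows the standard definition in plain Python; whether `regression.main` computes it in exactly this way is an assumption, and `r2_score` here is a local illustrative function rather than the one used by the project.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all true values are identical")
    return 1 - ss_res / ss_tot
```

A value of 1 means perfect prediction and 0 means the model does no better than predicting the mean PCE, so the 0.122 obtained with the toy settings indicates only that the pipeline runs end to end.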


## Predict

Prediction on a large-scale dataset

In [11]:
import predict
from predict import *

In [13]:
import sys

np.set_printoptions(threshold=sys.maxsize)  # print full arrays without truncation
prediction_val = main()

data
finish!  Results can be found in abcBERT/results.csv


Prediction for a single molecule

In [15]:
import predictbysmiles

from predictbysmiles import *


In [18]:
prediction_val = predictbysmiles.main ('CCCCCCCCC1=CC=C(C2(C3=CC=C(CCCCCCCC)C=C3)C3=CC4=C(C=C3C3=C2C2=C(C=C(C5=CC=C(/C=C6/C(=O)C7=C(C=CC=C7)C6=C(C#N)C#N)C6=NSN=C56)S2)S3)C(C2=CC=C(CCCCCCCC)C=C2)(C2=CC=C(CCCCCCCC)C=C2)C2=C4SC3=C2SC(C2=CC=C(/C=C4\C(=O)C5=C(C=CC=C5)C4=C(C#N)C#N)C4=NSN=C24)=C3)C=C1')

[10.348401]
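
Single-molecule prediction can also be used to screen a short list of candidates. The sketch below ranks SMILES strings by predicted PCE. The function `rank_by_predicted_pce` and its `min_pce` threshold are hypothetical, and it assumes the predictor returns a sequence whose first element is the PCE, as the printed `[10.348401]` suggests; this has not been checked against `predictbysmiles.main`.

```python
def rank_by_predicted_pce(smiles_list, predict_fn, min_pce=0.0):
    """Return (smiles, pce) pairs with pce >= min_pce, highest PCE first.

    `predict_fn` is any callable mapping a SMILES string to a sequence
    whose first element is the predicted PCE (an assumption of this sketch).
    """
    scored = []
    for smi in smiles_list:
        pce = float(predict_fn(smi)[0])
        if pce >= min_pce:
            scored.append((smi, pce))
    # Sort in descending order of predicted PCE.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored


# Example usage (commented out because it needs the trained model):
# ranked = rank_by_predicted_pce(candidate_smiles, predictbysmiles.main, min_pce=10.0)
```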


## Acknowledgement

Jinyu Sun 

E-mail: jinyusun@csu.edu.cn