# DeepAcceptor: Computational design and screening of acceptor materials for organic solar cells

Developing affordable, high-performance organic photovoltaic materials is a time-consuming and costly process. Reliable computational methods for predicting the power conversion efficiency (PCE) are crucial for triaging unpromising molecules in large-scale databases and accelerating material discovery. In this study, a deep-learning-based framework (DeepAcceptor) was built to design and discover high-efficiency small-molecule acceptor materials. Specifically, an experimental dataset was built by collecting data from publications. Then, a BERT-based model was used to predict PCEs. The molecular graph was used as the input, and computational molecules and experimental molecules were used to pretrain and fine-tune the model, respectively. DeepAcceptor is a promising method for predicting the PCE and speeding up the discovery of high-performance acceptor materials.

## Dataset preparation

The atom types and bond information were calculated by using RDKit. The training, test and validation datasets are preprocessed by running utils.py.

In [1]:
import utils

import os
from collections import OrderedDict

import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import rdchem

from compound_constants import DAY_LIGHT_FG_SMARTS_LIST


from utils import mol_to_geognn_graph_data_MMFF3d

In [2]:
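    import pandas as pd
    from tqdm import tqdm

    # Load the training split; column 1 is read as the SMILES string, 'PCE' as the target
    f = pd.read_csv(r"data/reg/train0.csv")
    records = []
    pce = f['PCE']
    for ind, smile in enumerate(tqdm(f.iloc[:, 1])):
        # Atom tokens and adjacency matrix for this molecule
        atom, adj = mol_to_geognn_graph_data_MMFF3d(smile)
        # Store each adjacency matrix in its own .npy file and keep only the path
        adj_path = 'data/reg/train/adj' + str(ind) + '.npy'
        np.save(adj_path, np.array(adj))
        records.append([atom, adj_path, pce[ind]])
    r = pd.DataFrame(records)
    r.to_csv('data/reg/train/train.csv')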

## Pre-Training

First, each SMILES string was converted into a molecular graph by using RDKit. Then, a supernode was added and connected to all the atoms in the molecule. A masked-atom prediction task, similar to the masked language model (MLM) task in NLP, was used to pretrain the model. As shown in Figure 1, the pretrained model, consisting of the embedding layer, transformer encoder layers and classification layers, was used to predict the masked atoms. The computational molecules were represented as embeddings, including word token embeddings and positional embeddings, and these embeddings were used as the input of the transformer encoder layers. For masking, 15% of the atoms in a molecule were randomly selected; each selected atom had an 80% probability of being replaced by [MASK], a 10% probability of being replaced by another atom and a 10% probability of being kept unchanged. In the pretraining stage, classification linear layers were added on top of the transformer encoder layers and used to predict the masked atoms. The original molecules served as the ground truth for training the model to predict the types of the masked atoms.
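The supernode and the 80/10/10 masking scheme described above can be illustrated with a small standard-library sketch. It is not the code in pretrain.py: the function names `add_supernode` and `mask_atoms`, the `<global>` supernode token, the toy atom vocabulary and the rule that at least one atom is always selected are all assumptions made for illustration.

```python
import random

MASK = "[MASK]"
SUPERNODE = "<global>"  # hypothetical token name for the supernode


def add_supernode(adj):
    """Return a new adjacency matrix with a supernode at index 0
    that is connected to every atom (no self-loop)."""
    n = len(adj)
    new_adj = [[0] + [1] * n]            # supernode row: linked to all atoms
    for row in adj:
        new_adj.append([1] + list(row))  # each atom gains a link to the supernode
    return new_adj


def mask_atoms(tokens, vocab, rng, select_prob=0.15):
    """Apply BERT-style 80/10/10 corruption to a list of atom tokens.

    Returns (corrupted_tokens, selected_indices). The supernode at
    index 0 is never selected. Selecting at least one atom is an
    assumption of this sketch, so that tiny molecules still give a target.
    """
    atom_positions = list(range(1, len(tokens)))
    n_select = max(1, round(select_prob * len(atom_positions)))
    selected = sorted(rng.sample(atom_positions, n_select))
    corrupted = list(tokens)
    for i in selected:
        p = rng.random()
        if p < 0.8:
            corrupted[i] = MASK  # 80%: replace with [MASK]
        elif p < 0.9:
            # 10%: replace with a different atom type
            others = [v for v in vocab if v != tokens[i]]
            corrupted[i] = rng.choice(others)
        # remaining 10%: keep the original token
    return corrupted, selected


# Toy graph: the heavy atoms of ethanol, C-C-O
atoms = ["C", "C", "O"]
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]

tokens = [SUPERNODE] + atoms
full_adj = add_supernode(adj)
corrupted, selected = mask_atoms(
    tokens, vocab=["C", "N", "O", "S", "F"], rng=random.Random(0)
)
```

During pretraining, the classification layers would be trained to recover `tokens[i]` for every index `i` in `selected`, with the loss computed only at those positions.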

### It is recommended to run pretraining on a supercomputer!

In [7]:
import pretrain

pretrain.main()

## Train

The pre-trained model is fine-tuned on the experimental data and can then be used to predict PCE for new non-fullerene acceptor (NFA) materials.

In [10]:
import regression

In [None]:
    import numpy as np
    from scipy.stats import pearsonr
    from sklearn.metrics import median_absolute_error, r2_score, mean_squared_error

    result = []
    r2_list = []
    for seed in [24]:
        print(seed)
        # Fine-tune with this random seed (assumes main is defined in regression.py)
        r2, prediction_val, prediction_test = regression.main(seed)
        result.append(prediction_val)
        r2_list.append(r2)
    print(r2_list)

    # Summarize each run; data[0] and data[1] are assumed to be the true and predicted PCEs
    re = []
    for i in range(len(result)):
        data = result[i]
        mae = median_absolute_error(data[0], data[1])  # median, not mean, absolute error
        r2 = r2_score(data[0], data[1])
        mse = mean_squared_error(data[0], data[1])
        r = pearsonr(data[0], data[1])[0]
        res = np.vstack((mae, r2, mse, r))
        re.append(res)  # keep the per-run metrics instead of discarding them
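
For reference, the four metrics computed in the cell above (median absolute error, R2, mean squared error and Pearson correlation) can be written out with the standard library only. This is a sketch of the textbook definitions, not a replacement for the scikit-learn and SciPy calls; the function name `regression_metrics` and the PCE values are made up for illustration.

```python
import math
import statistics


def regression_metrics(y_true, y_pred):
    """Return (median absolute error, R^2, MSE, Pearson r)
    for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    med_ae = statistics.median(abs(e) for e in errors)
    mse = sum(e * e for e in errors) / n

    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot

    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    r = cov / math.sqrt(ss_tot * var_p)
    return med_ae, r2, mse, r


# Made-up PCE values (%), purely illustrative
y_true = [10.0, 12.0, 14.0, 16.0]
y_pred = [11.0, 12.0, 13.0, 17.0]
med_ae, r2, mse, r = regression_metrics(y_true, y_pred)
```

Note that R2 is not symmetric in its arguments, so the true values must be passed first, whereas the Pearson correlation gives the same result in either order.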

## Predict

In [11]:
import predict

In [None]:
    result =[]
    r2_list = []
    for seed in [24]:
        print(seed)
        r2, prediction_val = predict.main(seed)
        result.append(prediction_val)
        r2_list.append(r2)
    print(r2_list)

## Acknowledgement

Jinyu Sun 

E-mail: jinyusun@csu.edu.cn