In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
from ydata_profiling import ProfileReport

# pip install rdkit-pypi
from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole

import shap
import xgboost as xgb
from scipy.stats import skew

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

from sklearn.metrics import (
    make_scorer,
    mean_squared_error,
    mean_absolute_error,
    r2_score,
    explained_variance_score,
    max_error,
)
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.model_selection import train_test_split, cross_validate, cross_val_score


This training dataset contains both single-junction OSCs and tandem OSCs, all with nonfullerene acceptors (NFA).
References for the data sources and the articles consulted are listed at the end of this work.

We collected the names of the donors and acceptors for which SMILES strings were available (one constraint of this project was that we had no software to retrieve SMILES from the abbreviated names of molecules). Descriptors were then generated with the Python library mordred, an alternative to the DRAGON 2.0 software usually used for this purpose (which is no longer available). mordred produces about 1826 descriptors per molecule, which gives about 3652 descriptors in our case (1826 for D and 1826 for A).



* The single-junction D-NFA data contained a large amount of information, of which we retained the names of the donors and acceptors, the corresponding SMILES, and the PCE of each D-A combination.

* The tandem D-NFA data contained no SMILES. Only the PCE and the names of D and A were collected; we then mapped the SMILES from the list at our disposal to the name of each molecule in this data.

**Note:** Some rows in this dataset have no acceptor name (the name is replaced by NaN). These rows belong to the single-junction D-NFA category and were found in this state. Since the SMILES, not the names, are what is needed to generate the descriptors, these rows were kept and used in our work.


How this was done will be detailed in another notebook.
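As a rough illustration of the name-to-SMILES mapping described for the tandem data, here is a minimal sketch with pandas. The lookup table, the molecule names, the SMILES strings and the PCE values are all hypothetical placeholders and not taken from the dataset; only the column names follow those used in this notebook.

```python
import pandas as pd

# Hypothetical lookup table: molecule name -> SMILES (placeholder molecules).
smiles_lookup = {
    "donor_X": "c1ccsc1",
    "acceptor_Y": "N#CC(C#N)=C",
}

# Hypothetical tandem records: names and PCE only, no SMILES.
tandem = pd.DataFrame({
    "Reported Donor": ["donor_X", "donor_X", "donor_Z"],
    "Reported Acceptor": ["acceptor_Y", "acceptor_Y", "acceptor_Y"],
    "PCE (%)": [10.0, 11.0, 9.0],
})

# Map each name to its SMILES; names absent from the lookup become NaN.
tandem["don_SMILES"] = tandem["Reported Donor"].map(smiles_lookup)
tandem["acc_SMILES"] = tandem["Reported Acceptor"].map(smiles_lookup)

# Rows with at least one unmapped molecule, to be curated by hand.
unmapped = tandem[tandem["don_SMILES"].isna() | tandem["acc_SMILES"].isna()]
print(unmapped[["Reported Donor", "Reported Acceptor"]])
```

Rows that end up in `unmapped` cannot be used for descriptor generation until a SMILES is found for the missing molecule.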


In [18]:
data=pd.read_csv("data/train_data_with_virtualdesciptors.csv")


acc_SMILES_column = data.pop('acc_SMILES')
data.insert(0, 'acc_SMILES', acc_SMILES_column)
don_SMILES_column= data.pop('don_SMILES')
data.insert(1, 'don_SMILES', don_SMILES_column)
reported_acc_column = data.pop('Reported Acceptor')
data.insert(2, 'Reported Acceptor', reported_acc_column)
reported_don_column = data.pop('Reported Donor ')
data.insert(3, 'Reported Donor ', reported_don_column)



DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead.  To get a de-fragmented frame, use `newframe = frame.copy()`
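The warning above is about repeated `insert` calls. One possible way to obtain the same column order in a single step is to build the new column list and index the frame once. The sketch below uses a toy frame, and `move_columns_to_front` is a hypothetical helper name, not part of pandas.

```python
import pandas as pd


def move_columns_to_front(frame, front):
    """Return a new frame with the `front` columns first, the rest in their original order."""
    rest = [col for col in frame.columns if col not in front]
    return frame[front + rest]


# Toy frame standing in for the real training data.
toy = pd.DataFrame({
    "PCE (%)": [1.0],
    "don_SMILES": ["C"],
    "acc_SMILES": ["CC"],
})

toy = move_columns_to_front(toy, ["acc_SMILES", "don_SMILES"])
print(list(toy.columns))
```

On the real data the `front` list would be `['acc_SMILES', 'don_SMILES', 'Reported Acceptor', 'Reported Donor ']`; note the trailing space in the last column name as it appears in the cell above.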


In [21]:
data.head()

Unnamed: 0,acc_SMILES,don_SMILES,Reported Acceptor,Reported Donor,ABC_donor,ABCGG_donor,nAcid_donor,nBase_donor,SpAbs_A_donor,SpMax_A_donor,...,TSRW10,MW,AMW,WPath,WPol,Zagreb1,Zagreb2,mZagreb1,mZagreb2,PCE (%)
0,CCc1ccc(C2(c3ccc(CC)cc3)c3cc4c(cc3-c3sc5cc(/C=...,CCCc1sc(-c2c3cc(-c4cc(C(=O)CC)c(-c5ccc(-c6sc(-...,IT-4F,PBDT-3TCO,124.327252,84.118682,0,0,204.028157,2.635164,...,155.06864,1274.243974,9.233652,43758.0,195.0,540.0,690.0,27.847222,20.0,11.77
1,CCc1ccc(C2(c3ccc(CC)cc3)c3cc4c(cc3-c3sc5cc(/C=...,CCN1C(=O)/C(=C\c2csc(-c3ccc(-c4ccc(-c5sc(-c6cc...,ITIC,PTIBT,68.613631,46.808443,0,0,113.422146,2.538887,...,150.832128,1202.281661,8.712186,38532.0,185.0,516.0,660.0,24.402778,19.277778,5.72
2,CCc1c(/C=C2\C(=O)c3cc(F)c(F)cc3C2=C(C#N)C#N)sc...,CCOC(=O)/C(C#N)=C/c1ccc(-c2sc(-c3cc4c(-c5ccc(C...,Y6,3BDT-5,96.424552,65.12927,0,0,160.252377,2.651864,...,130.929653,1002.04054,11.011435,23112.0,146.0,412.0,532.0,22.138889,14.944444,10.4
3,CCC1(CC)c2cc3c(cc2-c2sc(/C=C4\C(=O)c5ccccc5C4=...,CCOC(=O)/C(C#N)=C\c1cc(CC)c(-c2ccc(-c3sc(-c4cc...,IDIC,SM-Cl,66.015914,47.400457,0,0,109.669402,2.632854,...,117.898253,786.212318,8.545786,14389.0,123.0,336.0,432.0,18.569444,13.0,7.73
4,O=C1C2=C(C=CC=C2)C(/C1=C/C3=CC(C4(C5=CC=C(CCCC...,CCOC(=O)/C(C#N)=C\c1cc(CC)c(-c2ccc(-c3sc(-c4cc...,,SM-Cl,66.015914,47.400457,0,0,109.669402,2.632854,...,160.000527,1314.58792,7.303266,51681.0,187.0,536.0,664.0,27.458333,22.166667,7.73


# DATA PREPROCESSING

# Model Construction 

To select an appropriate regression model, we train and validate several models on our train/validation data and compare their metrics:

* XGBoost
* Random Forest
* Decision Trees
* K-nearest neighbors
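The comparison of the candidate models listed above could be sketched with `cross_validate` as follows. This is only an illustration under assumptions: the data is synthetic (`make_regression`), the hyperparameters and the 5-fold split are arbitrary choices, and XGBoost is left out to keep the example dependent on scikit-learn only (an `xgb.XGBRegressor()` can be added to the `models` dictionary in the same way).

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_validate
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the descriptor matrix X and the PCE target y.
X, y = make_regression(n_samples=120, n_features=10, noise=5.0, random_state=0)

models = {
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "K-nearest neighbors": KNeighborsRegressor(n_neighbors=5),
}

# scikit-learn reports error metrics as negated scores (higher is better).
scoring = {
    "r2": "r2",
    "mae": "neg_mean_absolute_error",
    "rmse": "neg_root_mean_squared_error",
}
cv = KFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    results[name] = {
        "r2": scores["test_r2"].mean(),
        "mae": -scores["test_mae"].mean(),    # flip sign back to a positive error
        "rmse": -scores["test_rmse"].mean(),  # flip sign back to a positive error
    }

for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

Using the same `KFold` object for every model keeps the folds identical, so the averaged metrics are directly comparable.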

# MODEL TEST (APPLICATION)


664 unique acceptors and 235 unique donors were identified. All 664 × 235 = 156,040 donor-acceptor combinations were then generated; for each pair the corresponding SMILES were retrieved and the descriptors were calculated to build our test dataset.
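Enumerating every acceptor-donor pair can be sketched with a pandas cross join. The SMILES below are toy placeholders and the frame names are assumptions; `how="cross"` requires pandas 1.2 or later.

```python
import pandas as pd

# Toy lists of unique molecules (placeholders, not the real 664 and 235 entries).
acceptors = pd.DataFrame({"acc_SMILES": ["CC", "CCC", "CCCC"]})
donors = pd.DataFrame({"don_SMILES": ["c1ccccc1", "c1ccsc1"]})

# Cartesian product: one row per (acceptor, donor) combination.
pairs = acceptors.merge(donors, how="cross")

print(len(pairs))  # number of acceptors times number of donors
print(pairs.head())
```

The per-molecule descriptors can then be computed once for each unique acceptor and donor and joined onto `pairs`, rather than recomputed for every combination.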