# Background
When designing efficient materials for organic photovoltaics (OPVs), the most important parameter is a material's power conversion efficiency (PCE). This notebook uses a dataset of 280 samples taken from Sahu et al., Adv. Energy Mater. 2018, 1801032.

# Classification
Each sample (material) has 13 independent features and belongs to one of three classes according to its PCE value (low, moderate, or high). Construct ML classifiers using logistic regression, support vector machine, and random forest algorithms to classify the samples. Report the accuracies for the training and test sets, and identify the four most important features. The article is provided along with the datasets, which are .xlsx files.


## Loading the data

In [None]:
# Libraries
import numpy as np
import pandas as pd

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
cltrain = pd.read_excel('/content/drive/MyDrive/data_sci_BIO434/Train.xlsx', index_col=0)
cltest = pd.read_excel('/content/drive/MyDrive/data_sci_BIO434/Test.xlsx', index_col=0)

In [None]:
cltrain.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 250 entries, 0 to 249
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   polarizability  250 non-null    float64
 1   delLA           250 non-null    float64
 2   delLD           250 non-null    float64
 3   N_atom          250 non-null    int64  
 4   Eg              250 non-null    float64
 5   lamda_h         250 non-null    float64
 6   DIP             250 non-null    float64
 7   AL-DH           250 non-null    float64
 8   delHD           250 non-null    float64
 9   E_bind          250 non-null    float64
 10  DL-AL           250 non-null    float64
 11  delGE           250 non-null    float64
 12  E_T1            250 non-null    float64
 13  PCE             250 non-null    object 
dtypes: float64(12), int64(1), object(1)
memory usage: 29.3+ KB


In [None]:
cltrain.head()

Unnamed: 0,polarizability,delLA,delLD,N_atom,Eg,lamda_h,DIP,AL-DH,delHD,E_bind,DL-AL,delGE,E_T1,PCE
0,1267.351,0.03347,0.046259,68,454.6,0.367623,6.428867,3.584556,0.385585,2.018272,0.528989,1.33401,1.8951,high
1,1424.967333,0.03347,0.053334,78,454.73,0.348322,6.404554,3.581018,0.388034,1.984391,0.522459,1.400475,1.9001,high
2,351.559667,0.077825,0.748857,25,436.05,0.184364,7.043788,3.956536,1.422067,2.7824,0.654434,4.599248,1.777,low
3,435.703,0.077825,2.074052,20,448.11,0.37421,6.815126,3.760886,1.424788,2.47501,0.558378,3.070109,1.4173,low
4,375.125333,0.077825,0.74151,25,440.92,0.168807,6.917187,3.853132,1.455537,2.738448,0.722462,18.406742,1.7498,low


In [None]:
cltest.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30 entries, 0 to 29
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   polarizability  30 non-null     float64
 1   delLA           30 non-null     float64
 2   delLD           30 non-null     float64
 3   N_atom          30 non-null     int64  
 4   Eg              30 non-null     float64
 5   lamda_h         30 non-null     float64
 6   DIP             30 non-null     float64
 7   AL-DH           30 non-null     float64
 8   delHD           30 non-null     float64
 9   E_bind          30 non-null     float64
 10  DL-AL           30 non-null     float64
 11  delGE           30 non-null     float64
 12  E_T1            30 non-null     float64
 13  PCE             30 non-null     object 
dtypes: float64(12), int64(1), object(1)
memory usage: 3.5+ KB


In [None]:
cltest.head()

Unnamed: 0,polarizability,delLA,delLD,N_atom,Eg,lamda_h,DIP,AL-DH,delHD,E_bind,DL-AL,delGE,E_T1,PCE
0,1143.852667,0.03347,0.04762,58,453.34,0.384475,6.52294,3.650407,0.345585,2.080714,0.500962,1.2827,1.8899,high
1,304.145667,0.03347,0.725456,22,429.81,0.257704,7.080407,3.988917,1.513769,2.78676,0.548037,4.730064,1.7367,low
2,283.91,0.077825,1.659895,22,520.6,0.151521,6.340608,3.10101,1.505606,2.877901,0.867499,0.0,1.0544,low
3,1702.321333,0.03347,0.026939,90,461.73,0.269453,6.295822,3.484146,0.240549,1.931837,0.570895,0.00102,1.847,high
4,1267.351,0.03347,0.046259,68,454.6,0.367623,6.428867,3.584556,0.385585,2.018272,0.528989,1.33401,1.8951,high


## Data preprocessing

In [None]:
# Split the data into features and target
cltrainx = cltrain.drop(columns ='PCE')
cltrainy = cltrain['PCE']
cltestx = cltest.drop(columns ='PCE')
cltesty = cltest['PCE']

In [None]:
#Scaling features using StandardScaler
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
cltrainx_scaled = scaler.fit_transform(cltrainx)
cltestx_scaled = scaler.transform(cltestx)
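
The scaler above is fitted on the training features only and then reused for the test features, so no test-set statistics enter the preprocessing. A minimal sketch of what that means numerically, using small made-up arrays (`train_demo`, `test_demo` and `scaler_demo` are hypothetical names, not the OPV data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrices (illustration only, not the OPV dataset)
train_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
test_demo = np.array([[2.5, 50.0], [5.0, 25.0]])

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(train_demo)  # learns mean and std from the training rows
test_scaled = scaler_demo.transform(test_demo)        # reuses the training mean and std

# Training columns end up with mean 0 and standard deviation 1 ...
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
# ... while the test columns generally do not, which is expected
print(test_scaled.mean(axis=0))
```

Calling `fit_transform` on the test set instead would rescale it with its own statistics and make the two sets inconsistent.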

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()  # Encode the categorical PCE labels (object dtype) as integers
cltrainy_le = le.fit_transform(cltrainy)
cltesty_le = le.transform(cltesty)
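
`LabelEncoder` assigns integer codes in sorted (alphabetical) order of the labels, not in order of PCE magnitude. The sketch below shows the resulting mapping on a short made-up label list (`labels` and `le_demo` are hypothetical; the real labels come from the `PCE` column):

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels standing in for the PCE column
labels = ["high", "low", "moderate", "low", "high"]

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(labels)

# classes_ is sorted, so each label's code is its position in that array
mapping = {str(name): code for code, name in enumerate(le_demo.classes_)}
print(mapping)
print(encoded.tolist())

# inverse_transform recovers the original strings from the codes
print(le_demo.inverse_transform(encoded).tolist())
```

The codes therefore carry no ordinal meaning (`low` sits between `high` and `moderate`), which is harmless for the classifiers used here because they treat the target as categorical.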

## Models
1. Logistic regression
2. Support vector machine
3. Random forest

In [None]:
#libraries
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

### Logistic Regression

In [None]:
# Logistic Regression
lr = LogisticRegression(max_iter=500)
lr.fit(cltrainx_scaled, cltrainy_le)
ypred_trainlr = lr.predict(cltrainx_scaled)
ypred_testlr = lr.predict(cltestx_scaled)

In [None]:
trainacc_lr = accuracy_score(cltrainy_le, ypred_trainlr)
testacc_lr = accuracy_score(cltesty_le, ypred_testlr)

In [None]:
print("Logistic Regression - Accuracy for train set:", trainacc_lr.round(2))
print("Logistic Regression - Accuracy for test set:", testacc_lr.round(2))

Logistic Regression - Accuracy for train set: 0.59
Logistic Regression - Accuracy for test set: 0.5
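
A single accuracy figure does not show which of the three classes are being confused. One way to look closer is a confusion matrix. The following is only a sketch on synthetic data generated with `make_classification` (an assumption made so the example runs on its own; `X_demo`, `y_demo` and `clf_demo` are hypothetical names, and the numbers it prints say nothing about the OPV dataset):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the task: 280 samples, 13 features, 3 classes
X_demo, y_demo = make_classification(
    n_samples=280, n_features=13, n_informative=6, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=30, stratify=y_demo, random_state=0
)

# Scaling and the classifier are bundled so the scaler is fitted on training data only
clf_demo = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))
clf_demo.fit(X_tr, y_tr)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, clf_demo.predict(X_te), labels=[0, 1, 2])
print(cm)
```

With the notebook's own variables, the equivalent call would be `confusion_matrix(cltesty_le, ypred_testlr)`.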


### Support vector machine classifier

In [None]:
# Support Vector Machine
svm = SVC()
svm.fit(cltrainx_scaled, cltrainy_le)
ypred_trainsvm = svm.predict(cltrainx_scaled)
ypred_testsvm = svm.predict(cltestx_scaled)

In [None]:
trainacc_svm = accuracy_score(cltrainy_le, ypred_trainsvm)
testacc_svm = accuracy_score(cltesty_le, ypred_testsvm)

In [None]:
print("Support Vector Classifier - Accuracy for train set:", trainacc_svm.round(2))
print("Support Vector Classifier - Accuracy for test set:", testacc_svm.round(2))

Support Vector Classifier - Accuracy for train set: 0.68
Support Vector Classifier - Accuracy for test set: 0.67


### Random Forest Classifier

In [None]:
# Random Forest
rf = RandomForestClassifier()
rf.fit(cltrainx_scaled, cltrainy_le)
ypred_trainrf = rf.predict(cltrainx_scaled)
ypred_testrf = rf.predict(cltestx_scaled)

In [None]:
trainacc_rf = accuracy_score(cltrainy_le, ypred_trainrf)
testacc_rf = accuracy_score(cltesty_le, ypred_testrf)

In [None]:
print("Random Forest Classifier - Accuracy for train set:", trainacc_rf.round(2))
print("Random Forest Classifier - Accuracy for test set:", testacc_rf.round(2))

Random Forest Classifier - Accuracy for train set: 1.0
Random Forest Classifier - Accuracy for test set: 0.67
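
A training accuracy of 1.0 next to a much lower test accuracy suggests the forest is overfitting, and a 30-sample test set gives a noisy estimate. Stratified k-fold cross-validation on the training data is one common way to get a steadier figure. The sketch below runs on synthetic data (an assumption so that it is self-contained; `X_demo`, `y_demo` and `rf_demo` are hypothetical names) and fixes `random_state` so the run is repeatable:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the 250 training samples with 13 features and 3 classes
X_demo, y_demo = make_classification(
    n_samples=250, n_features=13, n_informative=6, n_classes=3, random_state=0
)

rf_demo = RandomForestClassifier(n_estimators=100, random_state=0)

# 5 stratified folds: each fold keeps roughly the same class proportions
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(rf_demo, X_demo, y_demo, cv=cv, scoring="accuracy")

print("Fold accuracies:", scores.round(2))
print("Mean accuracy:", scores.mean().round(2))
```

Random forests do not need scaled inputs, so the unscaled features could be passed directly; the scaled ones used above give the same trees in practice.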


### Top 4 important features

In [None]:
from sklearn.feature_selection import SelectFromModel

In [None]:
# Feature Importance for Random Forest
feature_importances = rf.feature_importances_
features = cltrainx.columns
important_features_indices = feature_importances.argsort()[-4:][::-1]
top_4_features = features[important_features_indices]

print("Top 4 Important Features:", list(top_4_features))

Top 4 Important Features: ['E_bind', 'DL-AL', 'polarizability', 'delLD']
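
The ranking above uses the impurity-based `feature_importances_` of the fitted forest, which is computed from the training data. As a cross-check, permutation importance (a different technique, not the one used in this notebook) measures how much the score on held-out data drops when one feature is shuffled. A sketch on synthetic data, with hypothetical feature names `f0` to `f12`:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the feature names are placeholders
X_demo, y_demo = make_classification(
    n_samples=280, n_features=13, n_informative=6, n_classes=3, random_state=0
)
names = [f"f{i}" for i in range(13)]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

rf_demo = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Shuffle each feature 10 times on the held-out set and record the mean accuracy drop
result = permutation_importance(rf_demo, X_te, y_te, n_repeats=10, random_state=0)
perm = pd.Series(result.importances_mean, index=names)

top4 = perm.nlargest(4)
print(top4)
```

If both measures agree on the leading features, the ranking is less likely to be an artifact of one particular fit.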


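## Results:
Logistic Regression:
 - Accuracy for train set: 0.59
 - Accuracy for test set: 0.5

Support Vector Classifier:
 - Accuracy for train set: 0.68
 - Accuracy for test set: 0.67

Random Forest Classifier:
 - Accuracy for train set: 1.0
 - Accuracy for test set: 0.67

Top 4 Important Features (random forest, in decreasing order of importance):
- E_bind
- DL-AL
- polarizability
- delLD

The random forest figures are those of the run shown above. Because `RandomForestClassifier()` is created without a fixed `random_state`, its test accuracy and feature ranking can change from run to run.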


# Regression
Two further files (Train.csv and Test.csv) give the numerical PCE values along with all the independent features. Build a regression model using multiple linear regression and calculate the RMSE and R2 for the training and test datasets.

## Loading the data

In [None]:
rtrain = pd.read_csv('/content/drive/MyDrive/data_sci_BIO434/Train.csv')
rtest = pd.read_csv('/content/drive/MyDrive/data_sci_BIO434/Test.csv')

In [None]:
rtrain.head()

Unnamed: 0,#Sno.,PCE,polarizability,delLA,delLD,N_atom,Eg,lamda_h,DIP,AL-DH,delHD,E_bind,DL-AL,delGE,E_T1
0,29,7.8,1267.351,0.03347,0.046259,68,454.6,0.367623,6.428867,3.584556,0.385585,2.018272,0.528989,1.33401,1.8951
1,33,7.72,1424.967333,0.03347,0.053334,78,454.73,0.348322,6.404554,3.581018,0.388034,1.984391,0.522459,1.400475,1.9001
2,276,1.54,351.559667,0.077825,0.748857,25,436.05,0.184364,7.043788,3.956536,1.422067,2.7824,0.654434,4.599248,1.777
3,275,1.74,435.703,0.077825,2.074052,20,448.11,0.37421,6.815126,3.760886,1.424788,2.47501,0.558378,3.070109,1.4173
4,268,2.59,375.125333,0.077825,0.74151,25,440.92,0.168807,6.917187,3.853132,1.455537,2.738448,0.722462,18.406742,1.7498


In [None]:
rtrain.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   #Sno.           250 non-null    int64  
 1   PCE             250 non-null    float64
 2   polarizability  250 non-null    float64
 3   delLA           250 non-null    float64
 4   delLD           250 non-null    float64
 5   N_atom          250 non-null    int64  
 6   Eg              250 non-null    float64
 7   lamda_h         250 non-null    float64
 8   DIP             250 non-null    float64
 9   AL-DH           250 non-null    float64
 10  delHD           250 non-null    float64
 11  E_bind          250 non-null    float64
 12  DL-AL           250 non-null    float64
 13  delGE           250 non-null    float64
 14  E_T1            250 non-null    float64
dtypes: float64(13), int64(2)
memory usage: 29.4 KB


In [None]:
rtest.head()

Unnamed: 0,#Sno.,PCE,polarizability,delLA,delLD,N_atom,Eg,lamda_h,DIP,AL-DH,delHD,E_bind,DL-AL,delGE,E_T1
0,49,7.18,1143.852667,0.03347,0.04762,58,453.34,0.384475,6.52294,3.650407,0.345585,2.080714,0.500962,1.2827,1.8899
1,200,4.5,304.145667,0.03347,0.725456,22,429.81,0.257704,7.080407,3.988917,1.513769,2.78676,0.548037,4.730064,1.7367
2,267,2.7,283.91,0.077825,1.659895,22,520.6,0.151521,6.340608,3.10101,1.505606,2.877901,0.867499,0.0,1.0544
3,16,8.23,1702.321333,0.03347,0.026939,90,461.73,0.269453,6.295822,3.484146,0.240549,1.931837,0.570895,0.00102,1.847
4,4,9.36,1267.351,0.03347,0.046259,68,454.6,0.367623,6.428867,3.584556,0.385585,2.018272,0.528989,1.33401,1.8951


In [None]:
rtest.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   #Sno.           30 non-null     int64  
 1   PCE             30 non-null     float64
 2   polarizability  30 non-null     float64
 3   delLA           30 non-null     float64
 4   delLD           30 non-null     float64
 5   N_atom          30 non-null     int64  
 6   Eg              30 non-null     float64
 7   lamda_h         30 non-null     float64
 8   DIP             30 non-null     float64
 9   AL-DH           30 non-null     float64
 10  delHD           30 non-null     float64
 11  E_bind          30 non-null     float64
 12  DL-AL           30 non-null     float64
 13  delGE           30 non-null     float64
 14  E_T1            30 non-null     float64
dtypes: float64(13), int64(2)
memory usage: 3.6 KB


## Data preprocessing

In [None]:
# Split the data into features and target
rtrainx = rtrain.drop(columns =['PCE', '#Sno.'])
rtrainy = rtrain['PCE']
rtestx = rtest.drop(columns =['PCE', '#Sno.'])
rtesty = rtest['PCE']

In [None]:
#Scaling features using StandardScaler
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
rtrainx_scaled = scaler.fit_transform(rtrainx)
rtestx_scaled = scaler.transform(rtestx)

## Linear regression

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from math import sqrt

In [None]:
#Linear regression
linr = LinearRegression()
linr.fit(rtrainx_scaled, rtrainy)
ypred_trainlinr = linr.predict(rtrainx_scaled)
ypred_testlinr = linr.predict(rtestx_scaled)

In [None]:
# Calculate RMSE for training and testing sets
rmse_train = sqrt(mean_squared_error(rtrainy, ypred_trainlinr))
rmse_test = sqrt(mean_squared_error(rtesty, ypred_testlinr))

# Calculate R2 score for training and testing sets
r2_train = r2_score(rtrainy, ypred_trainlinr)
r2_test = r2_score(rtesty, ypred_testlinr)

In [None]:
print("Linear Regression - RMSE:")
print("Train set:", rmse_train)
print("Test set:", rmse_test)
print("------------------------")
print("Linear Regression - R2:")
print("Train set:", r2_train)
print("Test set:", r2_test)

Linear Regression - RMSE:
Train set: 1.1769443018693992
Test set: 1.3376621627459313
------------------------
Linear Regression - R2:
Train set: 0.5126444660151978
Test set: 0.33381788737837015
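
For reference, RMSE is the square root of the mean squared residual, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. The sketch below computes both by hand on four made-up values (`y_true` and `y_pred` are hypothetical, not PCE data) and compares them with the scikit-learn functions used above:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions (illustration only)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

residuals = y_true - y_pred

# RMSE: square root of the mean squared residual
rmse = np.sqrt(np.mean(residuals ** 2))

# R2: 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("RMSE:", rmse, "R2:", r2)

# Same quantities from scikit-learn
print(np.sqrt(mean_squared_error(y_true, y_pred)), r2_score(y_true, y_pred))
```

Here the residual sum of squares is 1.5 and the total sum of squares is 20, so R2 is 0.925. RMSE is in the same units as the target, which is why it is often easier to interpret than MSE.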


## Results:

Linear Regression - RMSE:
- Train set: 1.177
- Test set: 1.338

Linear Regression - R2:
- Train set: 0.513
- Test set: 0.334