## Model Reproducibility Code

## WEEK 2: TASK 2

In [1]:
#import necessary libraries
import pandas as pd
from rdkit import Chem
from sklearn.metrics import (
    matthews_corrcoef,
    accuracy_score,
    precision_score,
    confusion_matrix,
    f1_score,
    recall_score,
    balanced_accuracy_score,
)


In [2]:
# Read the external test set downloaded from the publication's GitHub page
test = pd.read_csv('external_test_set_pos.csv')
test.head()

Unnamed: 0,ACTIVITY,smiles
0,0,CCOC(=O)C1(CCN(C)CC1)c1ccccc1
1,0,CCN(CC)CC(=O)NC1=C(C)C=CC=C1C
2,0,CCCC(CCC)C(=O)O
3,0,CCC(COC(=O)c1cc(OC)c(OC)c(OC)c1)(c1ccccc1)N(C)C
4,0,COc1ccc(N(C(C)=O)c2cc3c(cc2[N+](=O)[O-])OC(C)(...


In [3]:
# Read the prediction output obtained by running the model on Ersilia
prediction = pd.read_csv('reproducibility_prediction_output.csv')
prediction.head()

Unnamed: 0,key,input,probability
0,XADCESSVHJOZHK-UHFFFAOYSA-N,CCOC(=O)C1(CCN(C)CC1)c1ccccc1,0.645569
1,NNJVILVZKWQKPM-UHFFFAOYSA-N,CCN(CC)CC(=O)NC1=C(C)C=CC=C1C,0.088602
2,NIJJYAXOARWZEE-UHFFFAOYSA-N,CCCC(CCC)C(=O)O,0.042538
3,LORDFXWUHHSAQU-UHFFFAOYSA-N,CCC(COC(=O)c1cc(OC)c(OC)c(OC)c1)(c1ccccc1)N(C)C,0.060436
4,XZEITPHZKJCCSQ-UHFFFAOYSA-N,COc1ccc(N(C(C)=O)c2cc3c(cc2[N+](=O)[O-])OC(C)(...,0.038881


In [4]:
# Convert the probability column to a binary prediction
# using a threshold of 0.5 (probability >= 0.5 is class 1)
predicted_output = (prediction['probability'] >= 0.5).astype(int)

# Extract the ACTIVITY column 
test_output = test['ACTIVITY'] 
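
The comparison above relies on both files listing the molecules in the same order. The sketch below shows one way to make that pairing explicit by joining on the SMILES string before thresholding. It uses toy rows (hypothetical, not the real data) and assumes the `input` column holds exactly the same SMILES strings as `smiles`, as the previewed rows suggest, and that each SMILES appears only once.

```python
import pandas as pd

# Toy stand-ins for the two CSV files (hypothetical rows, not the real data)
test = pd.DataFrame({
    "smiles": ["CCO", "CCN", "CCC"],
    "ACTIVITY": [1, 0, 0],
})
prediction = pd.DataFrame({
    "input": ["CCN", "CCO", "CCC"],  # deliberately in a different order
    "probability": [0.10, 0.90, 0.60],
})

# Join on the SMILES string so each probability is paired with its own label.
# validate="one_to_one" raises an error if a SMILES is duplicated in either table.
merged = test.merge(
    prediction,
    left_on="smiles",
    right_on="input",
    how="inner",
    validate="one_to_one",
)

# Every test molecule should have found a matching prediction
assert len(merged) == len(test)

test_output = merged["ACTIVITY"]
predicted_output = (merged["probability"] >= 0.5).astype(int)
```

If the two files are already in the same order, this yields the same labels and predictions as the positional comparison. It only guards against a silent misalignment.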

The publication uses the following abbreviations:
- tn = true negatives
- tp = true positives
- fp = false positives
- fn = false negatives
- NPV = Negative Predictive Value
- PPV = Positive Predictive Value
- SPE = Specificity
- SEN = Sensitivity
- B-ACC = Balanced Accuracy
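
As a rough illustration of how these terms relate, the sketch below computes each metric from the four confusion-matrix counts using the standard definitions. The function name `metrics_from_counts` and the example counts are hypothetical and are not taken from the publication or the dataset.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Compute the abbreviated metrics from the four confusion-matrix counts."""
    sen = tp / (tp + fn)  # sensitivity (recall): share of actual positives found
    spe = tn / (tn + fp)  # specificity: share of actual negatives found
    return {
        "NPV": tn / (tn + fn),  # share of negative predictions that are correct
        "PPV": tp / (tp + fp),  # share of positive predictions that are correct
        "SPE": spe,
        "SEN": sen,
        "ACC": (tp + tn) / (tn + fp + fn + tp),
        "B-ACC": (sen + spe) / 2,  # mean of sensitivity and specificity
    }


# Hypothetical counts, for illustration only (not taken from the dataset)
example = metrics_from_counts(tn=50, fp=10, fn=5, tp=35)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```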



In [7]:
# Calculate Matthews correlation coefficient
mcc = matthews_corrcoef(test_output, predicted_output)

# Calculate confusion matrix
conf_matrix = confusion_matrix(test_output, predicted_output)

# Unpack the confusion matrix; for binary 0/1 labels, ravel() yields tn, fp, fn, tp
tn, fp, fn, tp = conf_matrix.ravel()

# Calculate NPV
npv = tn / (tn + fn)

# Calculate accuracy
accuracy = accuracy_score(test_output, predicted_output)

# Calculate precision (reported as PPV in the publication)
precision = precision_score(test_output, predicted_output)

# Calculate SPE
spe = tn / (tn + fp)

# Calculate recall (reported as SEN in the publication)
recall = recall_score(test_output, predicted_output)

# Calculate F1 score
f1 = f1_score(test_output, predicted_output)

# Calculate balanced accuracy
balanced_accuracy = balanced_accuracy_score(test_output, predicted_output)



In [8]:
# Print the results
print("MCC:", mcc)
print("NPV:", npv)
print("ACC:", accuracy)
print("PPV:", precision)
print("SPE:", spe)
print("SEN:", recall)
print("B-ACC:", balanced_accuracy)


MCC: 0.5993902797701955
NPV: 0.6875
ACC: 0.8181818181818182
PPV: 0.8928571428571429
SPE: 0.7857142857142857
SEN: 0.8333333333333334
B-ACC: 0.8095238095238095
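
As a quick sanity check on the printed values, balanced accuracy for a binary problem should equal the mean of sensitivity and specificity. The snippet below is a small sketch that copies the three numbers from the output above and checks that relationship. It does not recompute anything from the data.

```python
import math

# Values copied from the printed output above
spe = 0.7857142857142857
sen = 0.8333333333333334
b_acc = 0.8095238095238095

# For binary labels, balanced accuracy is the mean of SEN and SPE
mean_of_sen_spe = (spe + sen) / 2
assert math.isclose(mean_of_sen_spe, b_acc, rel_tol=1e-9)
```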


Based on the publication, this model is reproducible: I obtained the reported output when I used **"external_test_set_pos.csv"**, which is test set 1 in the table linked below.
LINK TO THE TABLE: [here](https://jcheminf.biomedcentral.com/articles/10.1186/s13321-021-00541-z/tables/4)