### Analysis of Lipinski rules in predicted molecules
In this notebook, a sample of random molecules is collected from PubChem and analysed with Drug Predictor's model. An analysis is then conducted to observe any possible bias in compliance with the Lipinski rules between molecules predicted with high and low probabilities.

#### Imports and loads

In [2]:
import os
import random
import pandas as pd
import numpy as np
import pubchempy as pcp
from rdkit import Chem
from rdkit.Chem import Lipinski
from rdkit.Chem import Descriptors
from compute_fp_note import Compute_FP
from tensorflow.keras.models import load_model


In [4]:
# Training dataset, loaded to avoid grabbing molecules used for training
pubchem = pd.read_csv(os.path.join('..', 'data', '01_raw', 'dataset_pubchem.csv'))

In [26]:
# The fingerprints used to train the model
with open(os.path.join('..', 'data', '05_model_input', 'selected_fp.txt')) as file:
    selected_fp = file.readline()

In [29]:
# Instantiate the computer, which holds all the functions to calculate any type of fingerprint
computer = Compute_FP()

In [36]:
# Load the CNN model
model = load_model(os.path.join('..', 'data', '06_models', 'def_model.hd5'))



#### Collecting n molecules not present in the training dataset

In [5]:
n_mols = 10

In [16]:
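labeled_cids = set(pubchem['cid'])  # a set gives fast membership tests
random_cids = []
while len(random_cids) < n_mols:
    i = random.randint(2, 15000000)
    # Skip CIDs used for training and CIDs that were already drawn
    if i not in labeled_cids and i not in random_cids:
        random_cids.append(i)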
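
The sampling step above can also be wrapped in a small reusable function. The sketch below is one possible way to do it, not the notebook's own code: `sample_unlabeled_cids` and its `seed` parameter are hypothetical names introduced here, and a short toy list stands in for `pubchem['cid']`. It keeps the drawn CIDs in a set so that no CID is picked twice, and seeds a local generator so the draw can be repeated.

```python
import random


def sample_unlabeled_cids(labeled_cids, n, low=2, high=15_000_000, seed=None):
    """Draw n distinct integers in [low, high] that are not in labeled_cids.

    Hypothetical helper for illustration; not part of Drug Predictor.
    """
    excluded = {c for c in labeled_cids if low <= c <= high}
    available = (high - low + 1) - len(excluded)
    if n > available:
        raise ValueError(f"only {available} CIDs available, {n} requested")
    rng = random.Random(seed)  # local generator, so the global state is untouched
    chosen = set()
    while len(chosen) < n:
        candidate = rng.randint(low, high)
        if candidate not in excluded:
            chosen.add(candidate)  # adding an existing element is a no-op
    return sorted(chosen)


# Toy stand-in for the training CIDs (assumption: the real ones come from pubchem['cid'])
toy_labeled = [2, 3, 5]
cids = sample_unlabeled_cids(toy_labeled, n=10, seed=0)
print(cids)
```

Calling the function twice with the same `seed` returns the same CIDs, which helps when the sample has to be rebuilt later.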

#### Building a dataframe with the selected cids

Functions

In [10]:
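def get_smiles(cid):
    # Fetch the isomeric SMILES of a compound from PubChem; return None if the lookup fails
    try:
        compound = pcp.Compound.from_cid(cid)
        return compound.isomeric_smiles
    except Exception:
        print('no smiles found')
        return None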

In [21]:
def get_rdkit_molecule(smiles):
    # Build an RDKit molecule from a SMILES string; return None if it cannot be built
    try:
        mol = Chem.MolFromSmiles(smiles)
        return mol
    except Exception:
        print('no molecule found')
        return None

In [31]:
def get_fp(mol):
    return computer.relate_fp_functions(selected_fp, mol)

Building a DataFrame with 'cid', 'smiles', 'molecule' and 'fingerprints' columns

In [17]:
randoms = pd.DataFrame()

In [18]:
randoms['cid'] = random_cids

In [19]:
randoms['smiles'] = randoms['cid'].map(get_smiles)

In [23]:
randoms['molecule'] = randoms['smiles'].map(get_rdkit_molecule)

In [32]:
randoms['fingerprints'] = randoms['molecule'].map(get_fp)

In [101]:
randoms = randoms.dropna()

In [35]:
randoms.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   cid           10 non-null     int64 
 1   smiles        10 non-null     object
 2   molecule      10 non-null     object
 3   fingerprints  10 non-null     object
dtypes: int64(1), object(3)
memory usage: 448.0+ bytes


Saving the dataframe for later if n_mols is high

In [None]:
#randoms.to_pickle('randoms_no_preds.pickle')

#### Predicting the categories

Reshaping the array of fingerprints

In [38]:
fingerprints = np.array(list(randoms['fingerprints']))
reshaped_fps = fingerprints.reshape((fingerprints.shape[0], fingerprints.shape[1], 1))
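
As a small illustration of this reshape, the sketch below uses three made-up 4-bit fingerprints (toy values, not real fingerprints). It assumes the model expects input of shape (samples, bits, 1), which is what the reshape in the cell above produces; indexing with `np.newaxis` is shown as an equivalent spelling.

```python
import numpy as np

# Toy fingerprints: 3 molecules, 4 bits each (illustrative values only)
toy_fps = [[0, 1, 0, 1],
           [1, 0, 0, 0],
           [0, 0, 1, 1]]

fingerprints = np.array(toy_fps)
print(fingerprints.shape)

# Same call as in the notebook: append a trailing axis of length 1
reshaped_fps = fingerprints.reshape((fingerprints.shape[0], fingerprints.shape[1], 1))
print(reshaped_fps.shape)

# Equivalent spelling that does not need the shape written out
same = fingerprints[..., np.newaxis]
print(np.array_equal(reshaped_fps, same))
```

Both forms only work when every fingerprint has the same length, since otherwise `np.array` cannot build a regular 2-D array.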

Obtaining the predictions and the highest probabilities.

In [47]:
probs = model.predict(reshaped_fps)
preds = [np.argmax(x) for x in probs]
max_probs = [np.max(x) for x in probs]
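
The two list comprehensions above can also be written as vectorised NumPy calls along the class axis. A minimal sketch with a made-up probability array (two samples, three classes; the values are illustrative and not model output):

```python
import numpy as np

# Toy stand-in for model.predict(...) output: one row per molecule, one column per class
probs = np.array([[0.1, 0.7, 0.2],
                  [0.5, 0.25, 0.25]])

# Vectorised form: index and value of the largest probability in each row
preds = probs.argmax(axis=1)
max_probs = probs.max(axis=1)

# Loop form used in the notebook, kept here for comparison
loop_preds = [np.argmax(x) for x in probs]
loop_max = [np.max(x) for x in probs]

print(preds, max_probs)
```

Both forms give the same values; the vectorised one returns NumPy arrays instead of Python lists.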



Adding predictions and probabilities to the dataframe


In [48]:
randoms['predictions'] = preds

In [49]:
randoms['probabilities'] = max_probs

In [50]:
#randoms.to_pickle('randoms.pickle')

In [168]:
randoms = pd.read_pickle(os.path.join('randoms_smf.pickle'))

In [169]:
randoms.head()

Unnamed: 0,cid,smiles,molecule,fingerprints
0,13573036,CC1=C(C=C(N1C)C2=CC=CC=C2)C=NC3=CC=C(C=C3)[N+]...,<rdkit.Chem.rdchem.Mol object at 0x000001E4BB8...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
1,1230264,C1C[C@H]2[C@@H]3CN(C[C@@H]3[C@@H]1C2OC(=O)C4=C...,<rdkit.Chem.rdchem.Mol object at 0x000001E4BB8...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
2,7079270,C[NH+]1CCC2=C(C1)C(=C3C(=C2C(=O)/C=C/C4=C(C(=C...,<rdkit.Chem.rdchem.Mol object at 0x000001E4BB8...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
3,6136287,CCOC1=C(C=C(C=C1)/C=C\2/C(=O)N(C(=S)S2)C3=C(C=...,<rdkit.Chem.rdchem.Mol object at 0x000001E4BB8...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, ..."
4,7961650,C[C@@H](C(=O)C1=CC=C(C=C1)F)OC(=O)CNS(=O)(=O)C...,<rdkit.Chem.rdchem.Mol object at 0x000001E4BB8...,"[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."


In [170]:
def add_lipinski_props(df, molecule_column):
    df['HBondAcceptorCount'] = df[molecule_column].map(Lipinski.NumHAcceptors)
    df['HBondDonorCount'] = df[molecule_column].map(Lipinski.NumHDonors)
    df['MolecularWeight'] = df[molecule_column].map(Descriptors.MolWt)
    df['LogP'] = df[molecule_column].map(Descriptors.MolLogP)

In [171]:
def is_lipinski(x: pd.DataFrame) -> pd.DataFrame:
    """
    Apply the Lipinski rules to the property columns of a pandas dataframe and add
    a 'RuleFive' column that states whether the rules were passed (1) or not (0).
    A molecule passes when at least three of the four conditions hold.
    Input: pandas dataframe (modified in place).
    Output: the same pandas dataframe, with the 'RuleFive' column added.
    """
    # Lipinski rules
    # (the usual formulation allows up to 10 acceptors; the check below is slightly stricter)
    hdonor = x['HBondDonorCount'] < 6
    haccept = x['HBondAcceptorCount'] < 10
    mw = x['MolecularWeight'] < 500
    clogP = x['LogP'] < 5
    # Apply rules to dataframe: pass if any three of the four conditions are met
    x['RuleFive'] = np.where(((hdonor & haccept & mw) | (hdonor & haccept & clogP) | (hdonor & mw & clogP) | (haccept & mw & clogP)), 1, 0)
    return x
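
The `RuleFive` expression above is true when any three of the four conditions hold, which is the same as saying that at least three conditions are met. The sketch below checks that equivalence exhaustively with plain Python booleans and then applies the notebook's thresholds to two molecules. The function names are hypothetical and introduced only for this illustration; the property values are copied from rows 3 and 9951 of the output shown in this notebook.

```python
from itertools import product


def passes_triplet_form(hdonor, haccept, mw, clogp):
    """Same boolean expression as in is_lipinski, for plain Python booleans."""
    return ((hdonor and haccept and mw) or (hdonor and haccept and clogp)
            or (hdonor and mw and clogp) or (haccept and mw and clogp))


def passes_count_form(hdonor, haccept, mw, clogp):
    """Pass when at least three of the four conditions are met."""
    return sum([hdonor, haccept, mw, clogp]) >= 3


# Exhaustive check over all 16 combinations of the four conditions
mismatches = [flags for flags in product([False, True], repeat=4)
              if passes_triplet_form(*flags) != passes_count_form(*flags)]
print(mismatches)


def rule_five(hbd, hba, mol_wt, logp):
    """Hypothetical scalar helper that uses the notebook's thresholds."""
    return int(passes_count_form(hbd < 6, hba < 10, mol_wt < 500, logp < 5))


# Row 3: only LogP is out of range, so three conditions still hold
print(rule_five(hbd=0, hba=5, mol_wt=399.537, logp=5.11654))
# Row 9951: acceptors and molecular weight are both out of range
print(rule_five(hbd=4, hba=11, mol_wt=592.403, logp=4.6174))
```

On the dataframe, the count form could be written by summing the four boolean Series, for example `(hdonor.astype(int) + haccept.astype(int) + mw.astype(int) + clogP.astype(int)) >= 3`.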

In [172]:
add_lipinski_props(randoms, 'molecule')

In [173]:
randoms.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9936 entries, 0 to 9955
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   cid                 9936 non-null   int64  
 1   smiles              9936 non-null   object 
 2   molecule            9936 non-null   object 
 3   fingerprints        9936 non-null   object 
 4   HBondAcceptorCount  9936 non-null   int64  
 5   HBondDonorCount     9936 non-null   int64  
 6   MolecularWeight     9936 non-null   float64
 7   LogP                9936 non-null   float64
dtypes: float64(2), int64(3), object(3)
memory usage: 698.6+ KB


In [174]:
is_lipinski(randoms)

Unnamed: 0,cid,smiles,molecule,fingerprints,HBondAcceptorCount,HBondDonorCount,MolecularWeight,LogP,RuleFive
0,13573036,CC1=C(C=C(N1C)C2=CC=CC=C2)C=NC3=CC=C(C=C3)[N+]...,<rdkit.Chem.rdchem.Mol object at 0x000001E4BB8...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",4,0,319.364,4.65932,1
1,1230264,C1C[C@H]2[C@@H]3CN(C[C@@H]3[C@@H]1C2OC(=O)C4=C...,<rdkit.Chem.rdchem.Mol object at 0x000001E4BB8...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",3,0,361.485,4.04250,1
2,7079270,C[NH+]1CCC2=C(C1)C(=C3C(=C2C(=O)/C=C/C4=C(C(=C...,<rdkit.Chem.rdchem.Mol object at 0x000001E4BB8...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",6,1,412.462,1.90800,1
3,6136287,CCOC1=C(C=C(C=C1)/C=C\2/C(=O)N(C(=S)S2)C3=C(C=...,<rdkit.Chem.rdchem.Mol object at 0x000001E4BB8...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, ...",5,0,399.537,5.11654,1
4,7961650,C[C@@H](C(=O)C1=CC=C(C=C1)F)OC(=O)CNS(=O)(=O)C...,<rdkit.Chem.rdchem.Mol object at 0x000001E4BB8...,"[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",5,1,383.372,2.05770,1
...,...,...,...,...,...,...,...,...,...
9951,5957860,COC1=C(C=CC(=C1)/C=N/NC2=NC3=C(NON3)N=C2NC4=CC...,<rdkit.Chem.rdchem.Mol object at 0x000001E4BB8...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",11,4,592.403,4.61740,0
9952,12084460,C1[C@H]([C@H]([C@@H]([C@H](O1)OCC2=CC=CC=C2)O)...,<rdkit.Chem.rdchem.Mol object at 0x000001E4BB8...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",10,2,762.896,6.36440,0
9953,1667066,CC[C@H](C)OC(=O)[C@H](C)OC1=CC2=C(C=C1)C(=O)C(...,<rdkit.Chem.rdchem.Mol object at 0x000001E4BB8...,"[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",6,0,438.520,5.99180,1
9954,14077126,C[C@@H]([C@H]1[C@@H](C[C@H](O1)OC)OC)OC(=O)NC2...,<rdkit.Chem.rdchem.Mol object at 0x000001E4BB8...,"[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",9,1,385.329,2.21640,1


In [175]:
randoms.head()

Unnamed: 0,cid,smiles,molecule,fingerprints,HBondAcceptorCount,HBondDonorCount,MolecularWeight,LogP,RuleFive
0,13573036,CC1=C(C=C(N1C)C2=CC=CC=C2)C=NC3=CC=C(C=C3)[N+]...,<rdkit.Chem.rdchem.Mol object at 0x000001E4BB8...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",4,0,319.364,4.65932,1
1,1230264,C1C[C@H]2[C@@H]3CN(C[C@@H]3[C@@H]1C2OC(=O)C4=C...,<rdkit.Chem.rdchem.Mol object at 0x000001E4BB8...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",3,0,361.485,4.0425,1
2,7079270,C[NH+]1CCC2=C(C1)C(=C3C(=C2C(=O)/C=C/C4=C(C(=C...,<rdkit.Chem.rdchem.Mol object at 0x000001E4BB8...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",6,1,412.462,1.908,1
3,6136287,CCOC1=C(C=C(C=C1)/C=C\2/C(=O)N(C(=S)S2)C3=C(C=...,<rdkit.Chem.rdchem.Mol object at 0x000001E4BB8...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, ...",5,0,399.537,5.11654,1
4,7961650,C[C@@H](C(=O)C1=CC=C(C=C1)F)OC(=O)CNS(=O)(=O)C...,<rdkit.Chem.rdchem.Mol object at 0x000001E4BB8...,"[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",5,1,383.372,2.0577,1


In [184]:
randoms.to_pickle(os.path.join('randoms_w_lip.pickle'))

In [7]:
randoms = pd.read_pickle(os.path.join('randoms_w_lip.pickle'))

In [9]:
randoms.groupby('RuleFive')['probabilities'].mean()

RuleFive
0    0.608702
1    0.584020
Name: probabilities, dtype: float32

In [10]:
safest = randoms[randoms['probabilities']>0.9]

In [11]:
len(safest)

1344

In [12]:
safest.groupby('RuleFive')['probabilities'].count()

RuleFive
0     192
1    1152
Name: probabilities, dtype: int64
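
To read these counts as proportions, each one can be divided by the total. The sketch below does that arithmetic with the two counts printed above; on the dataframe itself, `safest['RuleFive'].value_counts(normalize=True)` should give the same shares.

```python
# Counts of molecules with probability > 0.9, copied from the output above
counts = {0: 192, 1: 1152}

total = sum(counts.values())  # matches len(safest)
shares = {rule_five: n / total for rule_five, n in counts.items()}

for rule_five, share in shares.items():
    print(f"RuleFive={rule_five}: {share:.4f}")
```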

In [13]:
randoms.columns

Index(['cid', 'smiles', 'molecule', 'fingerprints', 'HBondAcceptorCount',
       'HBondDonorCount', 'MolecularWeight', 'LogP', 'RuleFive',
       'probabilities'],
      dtype='object')

In [14]:
randoms_analysis = randoms.drop(['smiles', 'molecule', 'fingerprints'], axis=1)

In [15]:
randoms_analysis.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9936 entries, 0 to 9955
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   cid                 9936 non-null   int64  
 1   HBondAcceptorCount  9936 non-null   int64  
 2   HBondDonorCount     9936 non-null   int64  
 3   MolecularWeight     9936 non-null   float64
 4   LogP                9936 non-null   float64
 5   RuleFive            9936 non-null   int32  
 6   probabilities       9936 non-null   float32
dtypes: float32(1), float64(2), int32(1), int64(3)
memory usage: 543.4 KB


In [16]:
randoms_analysis.groupby('RuleFive').count()/len(randoms_analysis)

RuleFive,cid,HBondAcceptorCount,HBondDonorCount,MolecularWeight,LogP,probabilities
0,0.107589,0.107589,0.107589,0.107589,0.107589,0.107589
1,0.892411,0.892411,0.892411,0.892411,0.892411,0.892411


In [17]:
safest_analysis = randoms_analysis[randoms_analysis['probabilities']>0.7]

In [18]:
safest_analysis.groupby('RuleFive').mean()

RuleFive,cid,HBondAcceptorCount,HBondDonorCount,MolecularWeight,LogP,probabilities
0,8572043.0,9.796069,3.149877,705.860218,5.04211,0.881701
1,7484470.0,4.578584,1.239724,363.742101,3.119728,0.863904


In [19]:
safest_analysis.groupby('RuleFive').count()/len(safest_analysis)

RuleFive,cid,HBondAcceptorCount,HBondDonorCount,MolecularWeight,LogP,probabilities
0,0.123259,0.123259,0.123259,0.123259,0.123259,0.123259
1,0.876741,0.876741,0.876741,0.876741,0.876741,0.876741


In [20]:
drugs = pd.read_csv(os.path.join('../data/03_primary/all_drugs_dataset.csv'))

In [21]:
drugs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 203 entries, 0 to 202
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   CID                    203 non-null    int64  
 1   HBondAcceptorCount     203 non-null    int64  
 2   HBondDonorCount        203 non-null    int64  
 3   IsomericSMILES         203 non-null    object 
 4   MolecularWeight        203 non-null    float64
 5   LogP                   172 non-null    float64
 6   RuleFive               203 non-null    int64  
 7   MATC_Code_Short        203 non-null    object 
 8   MATC_Code_Explanation  203 non-null    object 
dtypes: float64(2), int64(4), object(3)
memory usage: 14.4+ KB


In [22]:
drugs_analysis = drugs.drop(['IsomericSMILES', 'MATC_Code_Short', 'MATC_Code_Explanation'], axis=1)

In [23]:
drugs_analysis.groupby('RuleFive').count()/len(drugs)

RuleFive,CID,HBondAcceptorCount,HBondDonorCount,MolecularWeight,LogP
0,0.172414,0.172414,0.172414,0.172414,0.064039
1,0.827586,0.827586,0.827586,0.827586,0.783251


In [24]:
drugs_analysis.groupby('RuleFive').mean()

RuleFive,CID,HBondAcceptorCount,HBondDonorCount,MolecularWeight,LogP
0,21539050.0,11.628571,7.885714,949.565746,-2.558462
1,4670556.0,3.327381,1.541667,256.537738,1.641509


In [25]:
randoms_analysis.groupby('RuleFive').mean()

RuleFive,cid,HBondAcceptorCount,HBondDonorCount,MolecularWeight,LogP,probabilities
0,8461197.0,8.743686,2.762395,683.377282,5.466515,0.608702
1,7354755.0,4.489343,1.136235,359.932612,3.183786,0.58402


In [26]:
pd.DataFrame(drugs_analysis['CID'])

Unnamed: 0,CID
0,24769
1,134694070
2,5121
3,4660557
4,122175
...,...
198,5311065
199,25074887
200,21585658
201,5284373


In [27]:
drugs_w_mol = pd.read_pickle('../data/05_model_input/input_table.pickle/2023-09-07T18.33.38.013Z/input_table.pickle')

In [28]:
drugs_w_mol

Unnamed: 0,CID,MATC_Code_Short,MATC_Code_Explanation,Molecule,Morgan2FP,MACCSKeys,AtomPairFP,TopTorFP,AvalonFP,PubchemFP,Label
0,24769,B,BLOOD AND BLOOD FORMING ORGANS,<rdkit.Chem.rdchem.Mol object at 0x000002784DF...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, ...",1
1,134694070,C,CARDIOVASCULAR SYSTEM,<rdkit.Chem.rdchem.Mol object at 0x00000278503...,"[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, ...",2
2,5121,J,ANTIINFECTIVES FOR SYSTEMIC USE,<rdkit.Chem.rdchem.Mol object at 0x00000278503...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ...","[1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, ...",7
3,4660557,N,NERVOUS SYSTEM,<rdkit.Chem.rdchem.Mol object at 0x00000278503...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, ...",10
4,122175,L,ANTINEOPLASTIC AND IMMUNOMODULATING AGENTS,<rdkit.Chem.rdchem.Mol object at 0x00000278503...,"[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, ...",8
...,...,...,...,...,...,...,...,...,...,...,...
10105,65450,J,ANTIINFECTIVES FOR SYSTEMIC USE,<rdkit.Chem.rdchem.Mol object at 0x0000027847C...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, ...",7
10106,91800164,V,VARIOUS,<rdkit.Chem.rdchem.Mol object at 0x0000027847B...,"[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, ...","[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, ...",15
10107,9851775,P,ANTIPARASITIC PRODUCTS INSECTICIDES AND REPELL...,<rdkit.Chem.rdchem.Mol object at 0x0000027847B...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, ...",12
10108,16019977,V,VARIOUS,<rdkit.Chem.rdchem.Mol object at 0x0000027847B...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, ...",15


In [29]:
drugs_w_mol = drugs_w_mol[['CID', 'Molecule']]

In [30]:
drugs_w_mol.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9909 entries, 0 to 10109
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   CID       9909 non-null   int64 
 1   Molecule  9909 non-null   object
dtypes: int64(1), object(1)
memory usage: 232.2+ KB


In [31]:
display(randoms_analysis.groupby('RuleFive').count()/len(randoms_analysis))
display(randoms_analysis.groupby('RuleFive').mean())

RuleFive,cid,HBondAcceptorCount,HBondDonorCount,MolecularWeight,LogP,probabilities
0,0.107589,0.107589,0.107589,0.107589,0.107589,0.107589
1,0.892411,0.892411,0.892411,0.892411,0.892411,0.892411


RuleFive,cid,HBondAcceptorCount,HBondDonorCount,MolecularWeight,LogP,probabilities
0,8461197.0,8.743686,2.762395,683.377282,5.466515,0.608702
1,7354755.0,4.489343,1.136235,359.932612,3.183786,0.58402


In [33]:
display(drugs_analysis.groupby('RuleFive')['CID'].count()/len(drugs_analysis))
display(drugs_analysis.groupby('RuleFive').mean())

RuleFive
0    0.172414
1    0.827586
Name: CID, dtype: float64

RuleFive,CID,HBondAcceptorCount,HBondDonorCount,MolecularWeight,LogP
0,21539050.0,11.628571,7.885714,949.565746,-2.558462
1,4670556.0,3.327381,1.541667,256.537738,1.641509


In [35]:
display(safest_analysis.groupby('RuleFive')['cid'].count()/len(safest_analysis))
display(safest_analysis.groupby('RuleFive').mean())

RuleFive
0    0.123259
1    0.876741
Name: cid, dtype: float64

RuleFive,cid,HBondAcceptorCount,HBondDonorCount,MolecularWeight,LogP,probabilities
0,8572043.0,9.796069,3.149877,705.860218,5.04211,0.881701
1,7484470.0,4.578584,1.239724,363.742101,3.119728,0.863904


In [219]:
drugs_analysis.mean()

CID                   2.554020e+07
HBondAcceptorCount    8.574679e+00
HBondDonorCount       3.917606e+00
MolecularWeight       5.263210e+02
LogP                  1.926290e+00
RuleFive              6.447082e-01
dtype: float64

In [220]:
randoms_analysis.mean()

cid                   7.473796e+06
HBondAcceptorCount    4.947061e+00
HBondDonorCount       1.311192e+00
MolecularWeight       3.947316e+02
LogP                  3.429382e+00
RuleFive              8.924114e-01
probabilities         5.866758e-01
dtype: float64

In [221]:
safest_analysis.mean()

cid                   7.618523e+06
HBondAcceptorCount    5.221684e+00
HBondDonorCount       1.475167e+00
MolecularWeight       4.059111e+02
LogP                  3.356678e+00
RuleFive              8.767414e-01
probabilities         8.660977e-01
dtype: float64