# ADN_T000. Getting data of Acetylcholinesterase from ChEMBL

Authors:
* Adnane Aouidate, (2019-2020), Computer Aided Drug Discovery Center, Shenzhen Institute of Advanced Technology (SIAT), Shenzhen, China.
* Adnane Aouidate, (2021-2022), Structural Bioinformatics and Chemoinformatics, Institute of Organic and Analytical Chemistry (ICOA), Orléans, France.
* Update, 2023, Ait Melloul Faculty of Applied Sciences, Ibn Zohr University, Agadir, Morocco.


**Aim of this tutorial**

In this tutorial, we will learn how to access the ChEMBL database using the `chembl_webresource_client` Python module, with human acetylcholinesterase (AChE) as the example target. ChEMBL is a valuable resource for drug discovery research and is frequently used by academic and industrial researchers.

In this tutorial, we will learn how to:

* Install the chembl_webresource_client module
* Query the ChEMBL database for compounds 
* Retrieve compound data, including their properties, activities, and targets
* Curate the compound data
* Save the compounds

**What is ChEMBL?**

ChEMBL is a database of bioactive compounds with drug-like characteristics, their biological activities, and the related targets. It is maintained by the European Molecular Biology Laboratory-European Bioinformatics Institute (EMBL-EBI). The database contains data on more than 2 million compounds and more than 15,000 protein targets.

The ChEMBL database is a valuable resource for drug discovery research. It can be used to find new drug candidates, predict drug-target interactions, and rank targets for drug development.

**How to get started**

To get started with this tutorial, you will need to have Python installed on your computer. You can then install the chembl_webresource_client module using the following command:

```
pip install chembl_webresource_client
```
Or

```
conda install -c conda-forge chembl_webresource_client
```

**For more information**

For more information on the ChEMBL database, please visit the ChEMBL website: https://www.ebi.ac.uk/chembl/. For more information on the chembl_webresource_client module, please visit the module documentation: https://github.com/chembl/chembl_webresource_client.

**Let's get started!**

I hope you enjoy this tutorial!

In [42]:
import pandas as pd
import numpy as np
from rdkit.Chem import PandasTools
from chembl_webresource_client.new_client import new_client

In [2]:
target = new_client.target
compound = new_client.molecule
bioactivities = new_client.activity

### Get target data (Acetylcholinesterase)

* Get the UniProt ID of the target of interest (AChE: [P22303](https://www.uniprot.org/uniprot/P22303)) from the UniProt website
* Use the UniProt ID to get target information

If you are interested in another target, use a different UniProt ID.

In [3]:
uniprot_id = "P22303"

In [4]:
# Get target information from ChEMBL but restrict it to specified values only
target_query = target.get(target_components__accession=uniprot_id).only(
    "target_chembl_id", "organism", "pref_name", "target_type"
)
targets = pd.DataFrame.from_dict(target_query)
targets

Unnamed: 0,organism,pref_name,target_chembl_id,target_type
0,Homo sapiens,Acetylcholinesterase,CHEMBL220,SINGLE PROTEIN
1,Homo sapiens,Cholinesterases; ACHE & BCHE,CHEMBL2095233,SELECTIVITY GROUP


Alternatively, you can search by protein name (here, 'acetylcholinesterase'). Note that this search also returns targets from other organisms.

In [5]:
# Get target information from ChEMBL but restrict it to specified values only
target_query = target.search('acetylcholinesterase').only("target_chembl_id", "organism", "pref_name", "target_type")
targets = pd.DataFrame.from_dict(target_query)
targets

Unnamed: 0,organism,pref_name,target_chembl_id,target_type
0,Homo sapiens,Acetylcholinesterase,CHEMBL220,SINGLE PROTEIN
1,Homo sapiens,Cholinesterases; ACHE & BCHE,CHEMBL2095233,SELECTIVITY GROUP
2,Drosophila melanogaster,Acetylcholinesterase,CHEMBL2242744,SINGLE PROTEIN
3,Bemisia tabaci,AChE2,CHEMBL2366409,SINGLE PROTEIN
4,Leptinotarsa decemlineata,Acetylcholinesterase,CHEMBL2366490,SINGLE PROTEIN
5,Torpedo californica,Acetylcholinesterase,CHEMBL4780,SINGLE PROTEIN
6,Mus musculus,Acetylcholinesterase,CHEMBL3198,SINGLE PROTEIN
7,Rattus norvegicus,Acetylcholinesterase,CHEMBL3199,SINGLE PROTEIN
8,Electrophorus electricus,Acetylcholinesterase,CHEMBL4078,SINGLE PROTEIN
9,Bos taurus,Acetylcholinesterase,CHEMBL4768,SINGLE PROTEIN


#### Fetch target data from ChEMBL

In [6]:
target = targets.iloc[0]
target

organism                    Homo sapiens
pref_name           Acetylcholinesterase
target_chembl_id               CHEMBL220
target_type               SINGLE PROTEIN
Name: 0, dtype: object
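Picking the target with `targets.iloc[0]` relies on the human single-protein entry being the first row of the result. As a minimal sketch (not part of the original notebook), the same entry can be selected by its content instead. The small `targets` table below is a toy copy of three rows from the search output, and the name `selected_target` is only illustrative.

```python
import pandas as pd

# Toy copy of three rows from the search result shown above
targets = pd.DataFrame(
    {
        "organism": ["Homo sapiens", "Homo sapiens", "Mus musculus"],
        "pref_name": [
            "Acetylcholinesterase",
            "Cholinesterases; ACHE & BCHE",
            "Acetylcholinesterase",
        ],
        "target_chembl_id": ["CHEMBL220", "CHEMBL2095233", "CHEMBL3198"],
        "target_type": ["SINGLE PROTEIN", "SELECTIVITY GROUP", "SINGLE PROTEIN"],
    }
)

# Select by organism and target type rather than by row position
mask = (targets["organism"] == "Homo sapiens") & (
    targets["target_type"] == "SINGLE PROTEIN"
)
selected = targets[mask]

# Fail loudly if the filter matches nothing or more than one target
if len(selected) != 1:
    raise ValueError(f"Expected exactly one match, found {len(selected)}")

selected_target = selected.iloc[0]
print(selected_target["target_chembl_id"])
```

This way the selection keeps working if the order of the rows returned by ChEMBL changes.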

Save selected ChEMBL ID.

In [7]:
chembl_id = target.target_chembl_id
print(f"Target ChEMBL ID is: {chembl_id}")

Target ChEMBL ID is: CHEMBL220


### Get bioactivity data

Now, we want to query bioactivity data for the target of interest.

#### Fetch bioactivity data for the target from ChEMBL

In this step, we fetch the bioactivity data and filter it to only consider

* human proteins, 
* bioactivity type Ki, 
* exact measurements (relation `'='`), and
* binding data (assay type `'B'`).

In [8]:
Ace_bioactivities = bioactivities.filter(
    target_chembl_id=chembl_id,
    type="Ki",
    relation="=",
    assay_type="B",
).only(
    "activity_id",
    "assay_chembl_id",
    "assay_description",
    "assay_type",
    "molecule_chembl_id",
    "type",
    "standard_units",
    "relation",
    "standard_value",
    "target_chembl_id",
    "target_organism",
)
print(f"The length and type of Ace_bioactivities are: {len(Ace_bioactivities)}, {type(Ace_bioactivities)}")


The length and type of Ace_bioactivities are: 636, <class 'chembl_webresource_client.query_set.QuerySet'>


In [9]:
print(f"The length and type of the first element: {len(Ace_bioactivities[0])}, {type(Ace_bioactivities[0])}")
Ace_bioactivities[0]

The length and type of the first element: 13, <class 'dict'>


{'activity_id': 111024,
 'assay_chembl_id': 'CHEMBL641011',
 'assay_description': 'Inhibition constant determined against Acetylcholinesterase (AChE) receptor.',
 'assay_type': 'B',
 'molecule_chembl_id': 'CHEMBL11805',
 'relation': '=',
 'standard_units': 'nM',
 'standard_value': '0.104',
 'target_chembl_id': 'CHEMBL220',
 'target_organism': 'Homo sapiens',
 'type': 'Ki',
 'units': 'nM',
 'value': '0.104'}

In [10]:
df = pd.DataFrame.from_dict(Ace_bioactivities)
print(f"Dataframe shape : {df.shape}")
df

Dataframe shape : (636, 13)


Unnamed: 0,activity_id,assay_chembl_id,assay_description,assay_type,molecule_chembl_id,relation,standard_units,standard_value,target_chembl_id,target_organism,type,units,value
0,111024,CHEMBL641011,Inhibition constant determined against Acetylc...,B,CHEMBL11805,=,nM,0.104,CHEMBL220,Homo sapiens,Ki,nM,0.104
1,118575,CHEMBL641012,Inhibitory activity against human AChE,B,CHEMBL208599,=,nM,0.026,CHEMBL220,Homo sapiens,Ki,nM,0.026
2,125075,CHEMBL641011,Inhibition constant determined against Acetylc...,B,CHEMBL60745,=,nM,1.63,CHEMBL220,Homo sapiens,Ki,nM,1.63
3,733829,CHEMBL641691,Inhibitory activity of compound against acetyl...,B,CHEMBL95,=,nM,151.0,CHEMBL220,Homo sapiens,Ki,nM,151.0
4,740235,CHEMBL641013,Inhibitory activity of compound against acetyl...,B,CHEMBL173309,=,nM,12.2,CHEMBL220,Homo sapiens,Ki,nM,12.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
631,24963342,CHEMBL5216425,Binding affinity to AChE (unknown origin) asse...,B,CHEMBL5220695,=,nM,120.0,CHEMBL220,Homo sapiens,Ki,uM,0.12
632,24963343,CHEMBL5216425,Binding affinity to AChE (unknown origin) asse...,B,CHEMBL5219239,=,nM,170.0,CHEMBL220,Homo sapiens,Ki,uM,0.17
633,24963371,CHEMBL5216438,Binding affinity to AChE (unknown origin) asse...,B,CHEMBL5218804,=,nM,0.264,CHEMBL220,Homo sapiens,Ki,nM,0.264
634,24963381,CHEMBL5216446,Binding affinity to human AChE assessed as inh...,B,CHEMBL5219425,=,nM,3500.0,CHEMBL220,Homo sapiens,Ki,uM,3.5


In [11]:
df['standard_units'].unique()

array(['nM', '/min/M', "10'5/M/min", "10'2/M/min", "10'3/M/min",
       "10'8/M/min", "10'7/M/min", "10'4/M/min", "10'6/M/min", 'mM/min',
       '10^8M'], dtype=object)
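The unit list above shows that not every `Ki` record is reported in nM. Before discarding the other units, it can be useful to see how many rows each unit accounts for. The following is a minimal sketch on a toy `units` Series (the values are taken from the output above, the counts are made up for illustration).

```python
import pandas as pd

# Toy stand-in for df["standard_units"]; the real column has 636 rows
units = pd.Series(
    ["nM", "nM", "/min/M", "10'5/M/min", "nM", "mM/min"],
    name="standard_units",
)

# Number of rows per unit, most frequent first
unit_counts = units.value_counts()
print(unit_counts)

# How many rows survive an nM-only filter, and how many are dropped
n_kept = int((units == "nM").sum())
n_dropped = len(units) - n_kept
print(f"kept: {n_kept}, dropped: {n_dropped}")
```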

In [12]:
# .copy() makes dfnM an independent DataFrame, which avoids pandas' SettingWithCopyWarning
dfnM = df[df['standard_units'] == 'nM'].copy()
dfnM = dfnM.drop(['units', 'value'], axis=1)
dfnM


Unnamed: 0,activity_id,assay_chembl_id,assay_description,assay_type,molecule_chembl_id,relation,standard_units,standard_value,target_chembl_id,target_organism,type
0,111024,CHEMBL641011,Inhibition constant determined against Acetylc...,B,CHEMBL11805,=,nM,0.104,CHEMBL220,Homo sapiens,Ki
1,118575,CHEMBL641012,Inhibitory activity against human AChE,B,CHEMBL208599,=,nM,0.026,CHEMBL220,Homo sapiens,Ki
2,125075,CHEMBL641011,Inhibition constant determined against Acetylc...,B,CHEMBL60745,=,nM,1.63,CHEMBL220,Homo sapiens,Ki
3,733829,CHEMBL641691,Inhibitory activity of compound against acetyl...,B,CHEMBL95,=,nM,151.0,CHEMBL220,Homo sapiens,Ki
4,740235,CHEMBL641013,Inhibitory activity of compound against acetyl...,B,CHEMBL173309,=,nM,12.2,CHEMBL220,Homo sapiens,Ki
...,...,...,...,...,...,...,...,...,...,...,...
631,24963342,CHEMBL5216425,Binding affinity to AChE (unknown origin) asse...,B,CHEMBL5220695,=,nM,120.0,CHEMBL220,Homo sapiens,Ki
632,24963343,CHEMBL5216425,Binding affinity to AChE (unknown origin) asse...,B,CHEMBL5219239,=,nM,170.0,CHEMBL220,Homo sapiens,Ki
633,24963371,CHEMBL5216438,Binding affinity to AChE (unknown origin) asse...,B,CHEMBL5218804,=,nM,0.264,CHEMBL220,Homo sapiens,Ki
634,24963381,CHEMBL5216446,Binding affinity to human AChE assessed as inh...,B,CHEMBL5219425,=,nM,3500.0,CHEMBL220,Homo sapiens,Ki


#### Preprocess and filter bioactivity data

1. Convert `standard_value`'s datatype from `object` to `float`
2. Delete entries with missing values
3. Keep only entries with `standard_units == "nM"` (already done in the previous cell)
4. Delete duplicate molecules
5. Reset `DataFrame` index
6. Rename columns
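The six steps listed above can also be sketched as a single method chain. This is only an illustration on a toy table with hypothetical molecule IDs (`CHEMBL1` to `CHEMBL4`), not the notebook's own code. It assumes `pd.to_numeric(..., errors="coerce")` for the type conversion, which turns unparsable values into `NaN` instead of raising an error.

```python
import pandas as pd

# Toy bioactivity table; the molecule IDs are hypothetical placeholders
raw = pd.DataFrame(
    {
        "molecule_chembl_id": ["CHEMBL1", "CHEMBL2", "CHEMBL2", "CHEMBL3", "CHEMBL4"],
        "standard_value": ["0.104", "151.0", "200.0", None, "12.2"],
        "standard_units": ["nM", "nM", "nM", "nM", "/min/M"],
    }
)

curated = (
    # 1. Convert standard_value from object to float
    raw.assign(standard_value=pd.to_numeric(raw["standard_value"], errors="coerce"))
    # 2. Delete entries with missing values
    .dropna(subset=["standard_value"])
    # 3. Keep only entries reported in nM
    .query("standard_units == 'nM'")
    # 4. Delete duplicate molecules, keeping the first measurement
    .drop_duplicates(subset="molecule_chembl_id", keep="first")
    # 5. Reset the index
    .reset_index(drop=True)
    # 6. Rename columns
    .rename(columns={"standard_value": "Ki", "standard_units": "units"})
)
print(curated)
```

The cells below perform the same steps one at a time, which makes it easier to inspect the shape of the table after each step.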

In [13]:
dfnM.dtypes

activity_id            int64
assay_chembl_id       object
assay_description     object
assay_type            object
molecule_chembl_id    object
relation              object
standard_units        object
standard_value        object
target_chembl_id      object
target_organism       object
type                  object
dtype: object

In [14]:
dfnM = dfnM.astype({"standard_value": "float64"})
dfnM.dtypes

activity_id             int64
assay_chembl_id        object
assay_description      object
assay_type             object
molecule_chembl_id     object
relation               object
standard_units         object
standard_value        float64
target_chembl_id       object
target_organism        object
type                   object
dtype: object

In [15]:
dfnM.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 575 entries, 0 to 635
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   activity_id         575 non-null    int64  
 1   assay_chembl_id     575 non-null    object 
 2   assay_description   575 non-null    object 
 3   assay_type          575 non-null    object 
 4   molecule_chembl_id  575 non-null    object 
 5   relation            575 non-null    object 
 6   standard_units      575 non-null    object 
 7   standard_value      575 non-null    float64
 8   target_chembl_id    575 non-null    object 
 9   target_organism     575 non-null    object 
 10  type                575 non-null    object 
dtypes: float64(1), int64(1), object(9)
memory usage: 53.9+ KB


In [16]:
dfnM.dropna(axis=0, how='any', inplace= True)
print(f"Dataframe shape is {dfnM.shape}")

Dataframe shape is (575, 11)


**4. Delete duplicate molecules**

Sometimes the same molecule (`molecule_chembl_id`) has been tested more than once. In this case, we only keep the first entry.

Note that other choices are possible: keep the entry with the best value, or use the mean of all assay results for the respective compound.
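The alternative choices mentioned above (best value or mean value per compound) can be sketched with a `groupby`. The table below is toy data with hypothetical molecule IDs. "Best" is taken here to mean the lowest Ki, since a lower Ki corresponds to stronger binding.

```python
import pandas as pd

# Toy measurements: CHEMBL1 was measured twice (IDs are hypothetical)
measurements = pd.DataFrame(
    {
        "molecule_chembl_id": ["CHEMBL1", "CHEMBL1", "CHEMBL2"],
        "standard_value": [10.0, 30.0, 5.0],
    }
)

# One row per molecule: lowest Ki, mean Ki, and number of measurements
per_molecule = (
    measurements.groupby("molecule_chembl_id")["standard_value"]
    .agg(best="min", mean="mean", n_measurements="count")
    .reset_index()
)
print(per_molecule)
```

Which aggregate is appropriate depends on the downstream use. The rest of this tutorial keeps the first measurement.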

In [17]:
dfnM.drop_duplicates(subset=["molecule_chembl_id"], keep= 'first', inplace= True)
print(f"Dataframe shape is : {dfnM.shape}")

Dataframe shape is : (472, 11)


**5. Reset `DataFrame` index**

Since we deleted some rows but want to iterate over the index later, we reset the index to be continuous.

In [18]:
dfnM.reset_index(drop=True, inplace= True)
dfnM.head()

Unnamed: 0,activity_id,assay_chembl_id,assay_description,assay_type,molecule_chembl_id,relation,standard_units,standard_value,target_chembl_id,target_organism,type
0,111024,CHEMBL641011,Inhibition constant determined against Acetylc...,B,CHEMBL11805,=,nM,0.104,CHEMBL220,Homo sapiens,Ki
1,118575,CHEMBL641012,Inhibitory activity against human AChE,B,CHEMBL208599,=,nM,0.026,CHEMBL220,Homo sapiens,Ki
2,125075,CHEMBL641011,Inhibition constant determined against Acetylc...,B,CHEMBL60745,=,nM,1.63,CHEMBL220,Homo sapiens,Ki
3,733829,CHEMBL641691,Inhibitory activity of compound against acetyl...,B,CHEMBL95,=,nM,151.0,CHEMBL220,Homo sapiens,Ki
4,740235,CHEMBL641013,Inhibitory activity of compound against acetyl...,B,CHEMBL173309,=,nM,12.2,CHEMBL220,Homo sapiens,Ki


In [19]:
dfnM.rename(columns={"standard_value": "Ki", "standard_units": "units"}, inplace= True)
dfnM.head()

Unnamed: 0,activity_id,assay_chembl_id,assay_description,assay_type,molecule_chembl_id,relation,units,Ki,target_chembl_id,target_organism,type
0,111024,CHEMBL641011,Inhibition constant determined against Acetylc...,B,CHEMBL11805,=,nM,0.104,CHEMBL220,Homo sapiens,Ki
1,118575,CHEMBL641012,Inhibitory activity against human AChE,B,CHEMBL208599,=,nM,0.026,CHEMBL220,Homo sapiens,Ki
2,125075,CHEMBL641011,Inhibition constant determined against Acetylc...,B,CHEMBL60745,=,nM,1.63,CHEMBL220,Homo sapiens,Ki
3,733829,CHEMBL641691,Inhibitory activity of compound against acetyl...,B,CHEMBL95,=,nM,151.0,CHEMBL220,Homo sapiens,Ki
4,740235,CHEMBL641013,Inhibitory activity of compound against acetyl...,B,CHEMBL173309,=,nM,12.2,CHEMBL220,Homo sapiens,Ki


In [20]:
print(f"DataFrame shape: {dfnM.shape}")
# NBVAL_CHECK_OUTPUT

DataFrame shape: (472, 11)


We now have a set of **472** molecule IDs with respective Ki values for our target, acetylcholinesterase.

### Get compound data

We have a `DataFrame` containing all molecules tested against acetylcholinesterase (with the respective measured bioactivity).

Now, we want to get the molecular structures of the molecules linked to the respective bioactivity ChEMBL IDs.

#### Fetch compound data from ChEMBL

Let's have a look at the compounds for which we have bioactivity data: we fetch the ChEMBL IDs and structures of the compounds linked to our filtered bioactivity data.

In [21]:
compounds = compound.filter(
    molecule_chembl_id__in=list(dfnM["molecule_chembl_id"])
).only("molecule_chembl_id", "molecule_structures")

In [22]:
compounds_df = pd.DataFrame.from_dict(compounds)
print(f"DataFrame shape: {compounds_df.shape}")
compounds_df.head()

DataFrame shape: (472, 2)


Unnamed: 0,molecule_chembl_id,molecule_structures
0,CHEMBL28,{'canonical_smiles': 'O=c1cc(-c2ccc(O)cc2)oc2c...
1,CHEMBL50,{'canonical_smiles': 'O=c1c(O)c(-c2ccc(O)c(O)c...
2,CHEMBL8320,"{'canonical_smiles': 'O=C1C=CC(=O)C=C1', 'molf..."
3,CHEMBL481,{'canonical_smiles': 'CCc1c2c(nc3ccc(OC(=O)N4C...
4,CHEMBL95,{'canonical_smiles': 'Nc1c2c(nc3ccccc13)CCCC2'...


#### Preprocess and filter compound data

1. Remove entries with a missing molecule structure
2. Delete duplicate molecules (by molecule_chembl_id)
3. Get molecules with canonical SMILES

**1. Remove entries with a missing molecule structure**

In [23]:
compounds_df.dropna(axis=0, how='any', inplace= True)
print(f"DataFrame shape: {compounds_df.shape}")

DataFrame shape: (472, 2)


**2. Delete duplicate molecules**

In [24]:
compounds_df.drop_duplicates(subset=["molecule_chembl_id"], keep = 'first', inplace= True)
print(f"DataFrame shape: {compounds_df.shape}")

DataFrame shape: (472, 2)


**3. Get molecules with canonical SMILES**

So far, we have multiple different molecular structure representations. We only want to keep the canonical SMILES.

In [25]:
canonical_smiles = []

# Use `row` as the loop variable so the `compounds` query result is not overwritten
for i, row in compounds_df.iterrows():
    try:
        canonical_smiles.append(row["molecule_structures"]["canonical_smiles"])
    except KeyError:
        canonical_smiles.append(None)

compounds_df["smiles"] = canonical_smiles
compounds_df.drop("molecule_structures", axis=1, inplace=True)
print(f"DataFrame shape: {compounds_df.shape}")

DataFrame shape: (472, 2)
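The extraction of canonical SMILES from the `molecule_structures` dictionaries can also be written without an explicit loop. The following is a minimal sketch on toy data. `CHEMBL_X` and `CHEMBL_Y` are hypothetical IDs added to show the two failure cases (no `canonical_smiles` key, and no structure at all), and `get_canonical_smiles` is an illustrative helper name.

```python
import pandas as pd

# Toy compound table; only the CHEMBL95 entry comes from the output above
compounds_toy = pd.DataFrame(
    {
        "molecule_chembl_id": ["CHEMBL95", "CHEMBL_X", "CHEMBL_Y"],
        "molecule_structures": [
            {"canonical_smiles": "Nc1c2c(nc3ccccc13)CCCC2", "molfile": "<omitted>"},
            {"molfile": "<omitted>"},  # structure present, but no canonical SMILES
            None,  # no structure at all
        ],
    }
)


def get_canonical_smiles(structures):
    """Return the canonical SMILES if available, otherwise None."""
    if isinstance(structures, dict):
        return structures.get("canonical_smiles")
    return None


compounds_toy["smiles"] = compounds_toy["molecule_structures"].apply(get_canonical_smiles)
compounds_toy = compounds_toy.drop(columns="molecule_structures")
print(compounds_toy)
```

Unlike the `try`/`except KeyError` version, this variant also tolerates rows where the structure entry is missing entirely.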


In [26]:
compounds_df

Unnamed: 0,molecule_chembl_id,smiles
0,CHEMBL28,O=c1cc(-c2ccc(O)cc2)oc2cc(O)cc(O)c12
1,CHEMBL50,O=c1c(O)c(-c2ccc(O)c(O)c2)oc2cc(O)cc(O)c12
2,CHEMBL8320,O=C1C=CC(=O)C=C1
3,CHEMBL481,CCc1c2c(nc3ccc(OC(=O)N4CCC(N5CCCCC5)CC4)cc13)-...
4,CHEMBL95,Nc1c2c(nc3ccccc13)CCCC2
...,...,...
467,CHEMBL5218804,COc1cccc2c1CCC(NC(=O)OCc1ccccc1)C2
468,CHEMBL5219123,COc1cc2c(c(OC)c1)CC(NC(=O)OCc1ccccc1)C2
469,CHEMBL5219239,CC1CCCCN1CCCNC(=O)c1cc(NC(=O)OC(C)(C)C)ccc1O
470,CHEMBL5219425,CCN(CC)C(=O)OC1C[N+]2(C)CCC1CC2.[I-]


Sanity check: Remove all molecules without a canonical SMILES string.

In [27]:
compounds_df.dropna(axis=0, how="any", inplace=True)
print(f"DataFrame shape: {compounds_df.shape}")
# NBVAL_CHECK_OUTPUT

DataFrame shape: (472, 2)


We now have a set of **472** molecule IDs with canonical SMILES for our acetylcholinesterase target.

### Output (bioactivity-compound) data
**Summary of compound and bioactivity data**

In [28]:
print(f"Bioactivities filtered: {dfnM.shape[0]}")
dfnM.columns

Bioactivities filtered: 472


Index(['activity_id', 'assay_chembl_id', 'assay_description', 'assay_type',
       'molecule_chembl_id', 'relation', 'units', 'Ki', 'target_chembl_id',
       'target_organism', 'type'],
      dtype='object')

In [29]:
print(f"Compounds filtered: {compounds_df.shape[0]}")
compounds_df.columns

Compounds filtered: 472


Index(['molecule_chembl_id', 'smiles'], dtype='object')

In [30]:
Output_df = pd.merge(dfnM[['molecule_chembl_id', 'units', 'Ki']],
                          compounds_df, 
                          on = 'molecule_chembl_id')
Output_df.reset_index(drop=True, inplace= True)
print(f"Dataset with {Output_df.shape[0]} entities")

Dataset with 472 entities


Sanity check: The merged bioactivities/compound data set contains **472** entries.
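The sanity check on the merge can also be done by pandas itself. As a sketch under the assumption that each molecule should appear exactly once in both tables, `validate="one_to_one"` raises an error on duplicated keys, and `indicator=True` adds a `_merge` column that shows which rows found no partner. The tables are toy data built from three compounds in the output above.

```python
import pandas as pd

# Toy bioactivity table (three compounds from the output above)
bio = pd.DataFrame(
    {
        "molecule_chembl_id": ["CHEMBL95", "CHEMBL640", "CHEMBL1128"],
        "units": ["nM", "nM", "nM"],
        "Ki": [151.0, 1000.0, 200.0],
    }
)

# Toy compound table; CHEMBL1128 is left out on purpose
cmp = pd.DataFrame(
    {
        "molecule_chembl_id": ["CHEMBL95", "CHEMBL640"],
        "smiles": ["Nc1c2c(nc3ccccc13)CCCC2", "CCN(CC)CCNC(=O)c1ccc(N)cc1"],
    }
)

# Left merge keeps every bioactivity row and flags unmatched ones
checked = pd.merge(
    bio,
    cmp,
    on="molecule_chembl_id",
    how="left",
    validate="one_to_one",
    indicator=True,
)

# Molecules with a Ki value but no structure
unmatched = checked.loc[checked["_merge"] == "left_only", "molecule_chembl_id"].tolist()
print(f"Without structure: {unmatched}")

# Keep only the rows present in both tables
merged = checked[checked["_merge"] == "both"].drop(columns="_merge").reset_index(drop=True)
print(f"Dataset with {merged.shape[0]} entries")
```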

In [31]:
Output_df.dtypes

molecule_chembl_id     object
units                  object
Ki                    float64
smiles                 object
dtype: object

In [32]:
Output_df.head(10)

Unnamed: 0,molecule_chembl_id,units,Ki,smiles
0,CHEMBL11805,nM,0.104,COc1ccccc1CN(C)CCCCCC(=O)N(C)CCCCCCCCN(C)C(=O)...
1,CHEMBL208599,nM,0.026,CCC1=CC2Cc3nc4cc(Cl)ccc4c(N)c3[C@@H](C1)C2
2,CHEMBL60745,nM,1.63,CC[N+](C)(C)c1cccc(O)c1.[Br-]
3,CHEMBL95,nM,151.0,Nc1c2c(nc3ccccc13)CCCC2
4,CHEMBL173309,nM,12.2,CCN(CCCCCC(=O)N(C)CCCCCCCCN(C)C(=O)CCCCCN(CC)C...
5,CHEMBL1128,nM,200.0,CC[N+](C)(C)c1cccc(O)c1.[Cl-]
6,CHEMBL102226,nM,20000.0,CCCCCCCSC(=O)OCC[N+](C)(C)C.[Cl-]
7,CHEMBL103873,nM,2000.0,CCCCCSC(=O)OCC[N+](C)(C)C.[Cl-]
8,CHEMBL640,nM,1000.0,CCN(CC)CCNC(=O)c1ccc(N)cc1
9,CHEMBL75121,nM,21.7,COc1cc2cc(-c3ccc(CN(C)Cc4ccccc4)cc3)c(=O)oc2cc1OC


In [33]:
# Remove entries with Ki = 0: they would give infinite values once converted to pKi
Output_df = Output_df[Output_df["Ki"] != 0].copy()

In [34]:
Output_df.Ki.describe()

count    4.720000e+02
mean     2.909111e+05
std      4.388460e+06
min      1.700000e-03
25%      2.975000e+01
50%      2.507000e+02
75%      6.054750e+03
max      9.496300e+07
Name: Ki, dtype: float64

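Point to note: Values greater than 100,000,000 nM will be capped at 100,000,000 nM so that pKi cannot become negative. With Ki in nM, pKi = 9 - log10(Ki) equals 1 at 100,000,000 nM and would drop below zero only above 1,000,000,000 nM, so the cap is a conservative safeguard.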

In [35]:
#For the moment we don't have a value greater than 100,000,000 in our dataset
# def norm_value(input):
#     norm = []

#     for i in input['Ki']:
#         if i > 100000000:
#           i = 100000000
#         norm.append(i)

#     input['Ki'] = norm
#     x = input
        
#     return x

# df_nom = norm_value(Output_df)
# df_nom.Ki.describe()
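The capping described above can also be written with the built-in `Series.clip` method instead of a manual loop. This is a minimal sketch on a toy Series. The first three values are taken from the summary statistics above, and the last one (5e9 nM) is made up to show the effect of the cap.

```python
import pandas as pd

# Toy Ki values in nM; the last value is hypothetical and exceeds the cap
ki_nM = pd.Series([0.0017, 250.7, 9.4963e7, 5.0e9], name="Ki")

# Cap everything above 100,000,000 nM; smaller values are left unchanged
capped = ki_nM.clip(upper=1e8)
print(capped)
```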

#### Add pKi values

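The Ki values in this data set span many orders of magnitude (from about 0.002 nM to about 95,000,000 nM), which makes them hard to read and compare. We therefore convert the Ki values to pKi, the negative base-10 logarithm of Ki expressed in mol/L.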

In [36]:
Output_df["Ki"].describe()

count    4.720000e+02
mean     2.909111e+05
std      4.388460e+06
min      1.700000e-03
25%      2.975000e+01
50%      2.507000e+02
75%      6.054750e+03
max      9.496300e+07
Name: Ki, dtype: float64

In [37]:
# np.log10() is used instead of math.log10() because math.log10() expects a single float
# and does not work on pandas Series objects.
def convert_Ki_to_pKi(Ki_value):
    # Ki is given in nM, so pKi = -log10(Ki * 1e-9) = 9 - log10(Ki)
    pKi_value = 9 - np.log10(Ki_value)
    return pKi_value

# Another way to do it

# def pKi(input):
#     pKi = 9 - np.log10(input)
#     return pKi

In [38]:
# Apply conversion to each row of the merged DataFrame
Output_df["pKi"] = Output_df.apply(lambda x: convert_Ki_to_pKi(x["Ki"]), axis=1)
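Because `np.log10` works on a whole pandas Series, the row-wise `apply` above can also be replaced by a single vectorized expression. The sketch below uses a toy table with three Ki values in nM (two of them taken from the data above) and relies on the relation pKi = -log10(Ki in mol/L) = 9 - log10(Ki in nM).

```python
import numpy as np
import pandas as pd

# Toy Ki values in nM
toy = pd.DataFrame({"Ki": [0.104, 151.0, 1000.0]})

# Vectorized conversion: one call on the whole column, no per-row lambda
toy["pKi"] = 9 - np.log10(toy["Ki"])
print(toy)
```

For a Ki of 1000 nM (1 µM, or 1e-6 mol/L) this gives a pKi of 6, which is a quick way to check that the unit handling is correct.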

In [39]:
Output_df

Unnamed: 0,molecule_chembl_id,units,Ki,smiles,pKi
0,CHEMBL11805,nM,0.104,COc1ccccc1CN(C)CCCCCC(=O)N(C)CCCCCCCCN(C)C(=O)...,9.982967
1,CHEMBL208599,nM,0.026,CCC1=CC2Cc3nc4cc(Cl)ccc4c(N)c3[C@@H](C1)C2,10.585027
2,CHEMBL60745,nM,1.630,CC[N+](C)(C)c1cccc(O)c1.[Br-],8.787812
3,CHEMBL95,nM,151.000,Nc1c2c(nc3ccccc13)CCCC2,6.821023
4,CHEMBL173309,nM,12.200,CCN(CCCCCC(=O)N(C)CCCCCCCCN(C)C(=O)CCCCCN(CC)C...,7.913640
...,...,...,...,...,...
467,CHEMBL5220695,nM,120.000,CC(C)(C)OC(=O)Nc1ccc(O)c(C(=O)NCCCN2CCCCC2)c1,6.920819
468,CHEMBL5219239,nM,170.000,CC1CCCCN1CCCNC(=O)c1cc(NC(=O)OC(C)(C)C)ccc1O,6.769551
469,CHEMBL5218804,nM,0.264,COc1cccc2c1CCC(NC(=O)OCc1ccccc1)C2,9.578396
470,CHEMBL5219425,nM,3500.000,CCN(CC)C(=O)OC1C[N+]2(C)CCC1CC2.[I-],5.455932


In [41]:
Output_df.to_csv('./databases/acetylcholinesterase_Ki_pKi_bioactivity_data_curated.csv', index=False)
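`to_csv` does not create missing folders, so the call above fails if `./databases` does not exist yet. The following minimal sketch creates the output folder first and reads the file back as a check. It writes to a temporary directory and uses an illustrative file name (`example_curated.csv`) and a toy table so that it can be run without touching the tutorial's files.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Temporary base directory used only for this example
base_dir = Path(tempfile.mkdtemp())
out_path = base_dir / "databases" / "example_curated.csv"

# Create the output folder (and any missing parents) before writing
out_path.parent.mkdir(parents=True, exist_ok=True)

# Toy table with two compounds from the data above
toy = pd.DataFrame(
    {
        "molecule_chembl_id": ["CHEMBL95", "CHEMBL640"],
        "Ki": [151.0, 1000.0],
        "smiles": ["Nc1c2c(nc3ccccc13)CCCC2", "CCN(CC)CCNC(=O)c1ccc(N)cc1"],
    }
)
toy.to_csv(out_path, index=False)

# Read the file back to confirm that it was written as expected
reloaded = pd.read_csv(out_path)
print(reloaded.shape)
```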