# TRPV1 agonistic drug research

The goal of this notebook is to build a table ready to train/test a DrugTorch model.

Here are the steps:
1. Get all targets under the name `TRPV1` from the ChEMBL database.
2. Once the target IDs are retrieved, use them to search all related activities and identify the compounds that have interacted with the TRPV1 targets.
3. Get all relevant ChEMBL IDs (pChEMBL value > 0).
4. Get the properties and structure of each molecule.
5. Merge the tables from steps #3 and #4 to create the final testing/training dataset for ChemProp
  - https://chemprop.readthedocs.io/en/latest/quickstart.html

In [None]:
# Install ChEMBL API
!pip install chembl_webresource_client

Collecting chembl_webresource_client
  Downloading chembl_webresource_client-0.10.9-py3-none-any.whl.metadata (1.4 kB)
Collecting requests-cache~=1.2 (from chembl_webresource_client)
  Downloading requests_cache-1.3.0-py3-none-any.whl.metadata (9.2 kB)
Collecting cattrs>=22.2 (from requests-cache~=1.2->chembl_webresource_client)
  Downloading cattrs-25.3.0-py3-none-any.whl.metadata (8.4 kB)
Collecting url-normalize>=2.0 (from requests-cache~=1.2->chembl_webresource_client)
  Downloading url_normalize-2.2.1-py3-none-any.whl.metadata (5.6 kB)
Downloading chembl_webresource_client-0.10.9-py3-none-any.whl (55 kB)
Downloading requests_cache-1.3.0-py3-none-any.whl (69 kB)
Downloading cattrs-25.3.0-py3-none-any.whl (70 kB)

In [None]:
from chembl_webresource_client.new_client import new_client
import pandas as pd

targets_api = new_client.target
activities_api = new_client.activity

# Let's search for all targets related to TrpV1 - https://www.ebi.ac.uk/chembl/search_results/TrpV1
targets = targets_api.search('TrpV1').only(['target_chembl_id', 'pref_name', 'organism'])
targets_df = pd.DataFrame(targets)
targets_df

Unnamed: 0,organism,pref_name,target_chembl_id
0,Homo sapiens,TRPV1 mRNA,CHEMBL5500263
1,Cavia porcellus,Transient receptor potential cation channel su...,CHEMBL5132
2,Canis lupus familiaris,Transient receptor potential cation channel su...,CHEMBL5254
3,Mus musculus,Transient receptor potential cation channel su...,CHEMBL1781864
4,Homo sapiens,Transient receptor potential cation channel su...,CHEMBL4794
5,Rattus norvegicus,Transient receptor potential cation channel su...,CHEMBL5102
6,Gallus gallus,Uncharacterized protein,CHEMBL2412949
7,Rattus norvegicus,Vanilloid receptor,CHEMBL2096684


In [None]:
# In total there are 8 TRPV1 targets. Now let's fetch every related activity to identify the compounds tested.
# There are roughly 15,600 activities, so this process will take several minutes.
# Get yourself some tea or coffee!
queried_activities = []
for target_id in targets_df['target_chembl_id']:
  activities = activities_api.filter(target_chembl_id=target_id).only([
      'molecule_chembl_id',
      'standard_value',
      'standard_type',
      'standard_units',
      'pchembl_value'
  ])
  # The query is lazy: iterating over it here is what actually downloads the records
  queried_activities.extend(activities)
  print(f"{target_id} data retrieved...")

# Each record is a dict, so the combined list converts directly into a dataframe
activities_df = pd.DataFrame(queried_activities)
activities_df.shape

CHEMBL5500263 data retrieved...
CHEMBL5132 data retrieved...
CHEMBL5254 data retrieved...
CHEMBL1781864 data retrieved...
CHEMBL4794 data retrieved...
CHEMBL5102 data retrieved...
CHEMBL2412949 data retrieved...
CHEMBL2096684 data retrieved...


(15605, 8)

In [None]:
# Here are the first 10 rows
activities_df.head(10)

Unnamed: 0,molecule_chembl_id,pchembl_value,standard_type,standard_units,standard_value,type,units,value
0,CHEMBL207433,,pKb,,7.3,pKb,,7.3
1,CHEMBL207433,,Activity,%,80.0,Activity,%,80.0
2,CHEMBL213390,,Log IC50,,,Log IC50,,
3,CHEMBL213390,,pKb,,8.4,pKb,,8.4
4,CHEMBL514691,,pKb,,7.4,pKb,,7.4
5,CHEMBL514691,7.2,IC50,nM,63.1,pIC50,,7.2
6,CHEMBL207433,,pKb,,,pKb,,
7,CHEMBL1210154,,IC50,nM,10000.0,IC50,nM,10000.0
8,CHEMBL1784749,,Inhibition,%,,INH,,
9,CHEMBL1784749,,Inhibition,%,,INH,,


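## Next, let's learn what these columns mean:
- molecule_chembl_id - ID of the compound tested
- pchembl_value - potency on a negative log10 molar scale (derived from measurements such as IC50, EC50, Ki or Kd), so a higher value means a more potent compound
- standard_type - type of activity measurement (IC50, pKb, Inhibition, ...)
- standard_units - units of the standard value
- standard_value - the measured activity, expressed in the standard units
- type, units, value - returned by the API even though we did not request them; they appear to hold the values as originally reported, before standardization. They are redundant here, but let's keep them just in case.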
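
As a quick sanity check on how `pchembl_value` relates to `standard_value`, the sketch below recomputes it for row 5 of the preview (IC50 = 63.1 nM). It assumes pChEMBL is the negative base-10 logarithm of the potency in molar units; the helper name `pchembl_from_nm` is made up for illustration and is not part of the ChEMBL client.

```python
import math

def pchembl_from_nm(value_nm: float) -> float:
    """Convert a potency in nM to -log10(molar). Hypothetical helper, not a ChEMBL API."""
    return -math.log10(value_nm * 1e-9)

# Row 5 of the preview: IC50 = 63.1 nM, reported pchembl_value = 7.2
print(round(pchembl_from_nm(63.1), 2))  # 7.2
```

The result matches the `pchembl_value` shown for that row, which suggests the two columns carry the same information on different scales whenever both are present.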

### Looking at the first rows of the dataset, it's quite clear we're going to need to clean the data in order to find the most relevant compounds to TRPV1

We will have to consider:
- filtering out low-activity compounds
- sticking to a single standard type
- and potentially a single standard unit
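
To make the second and third points concrete, here is a minimal sketch on a toy dataframe built from a few rows of the preview above. Choosing IC50 in nM is only an example, not a decision this notebook makes; in the real notebook the same calls would run on `activities_df`.

```python
import pandas as pd

# Toy rows copied from the preview above; the real notebook would use activities_df.
toy_df = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL207433", "CHEMBL207433", "CHEMBL514691", "CHEMBL1210154"],
    "standard_type": ["pKb", "Activity", "IC50", "IC50"],
    "standard_units": [None, "%", "nM", "nM"],
    "standard_value": [7.3, 80.0, 63.1, 10000.0],
})

# How often does each measurement type occur?
type_counts = toy_df["standard_type"].value_counts()
print(type_counts)

# Keep a single measurement type in a single unit (IC50 in nM is an arbitrary choice here).
ic50_nm_df = toy_df[(toy_df["standard_type"] == "IC50") & (toy_df["standard_units"] == "nM")]
print(ic50_nm_df)
```

Looking at the counts first helps show which measurement type has enough rows to be worth keeping.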

In [None]:
# First let's ensure correct datatypes for each column
activities_df['pchembl_value'] = activities_df['pchembl_value'].astype(float)
activities_df['standard_value'] = activities_df['standard_value'].astype(float)

# Let's experiment!
# Missing values become NaN, and NaN > 0 is False, so this keeps only rows that have a pChEMBL value
filtered_df = activities_df.loc[activities_df['pchembl_value'] > 0]

# Let's only keep molecule_chembl_id and pchembl_value
id_value_df = filtered_df[['molecule_chembl_id', 'pchembl_value']]
id_value_df.head(5)

Unnamed: 0,molecule_chembl_id,pchembl_value
5,CHEMBL514691,7.2
49,CHEMBL4648896,7.87
55,CHEMBL27105,6.83
56,CHEMBL285922,7.18
57,CHEMBL17976,7.64


### A pChEMBL value of 7 or higher is often considered highly active, while values below 5 are often considered inactive.

For now, let's use the pChEMBL value as our main criterion for creating the training/testing data. We will only train our model with compounds that have a pChEMBL value greater than 0, which in practice keeps every activity record that has a pChEMBL value at all.
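
If a stricter cut-off is wanted later, the rule of thumb above can be turned into labels. The sketch below is one possible way to do it with `pd.cut`, using toy values taken from this notebook's outputs. The "intermediate" label for the range between 5 and 7 is an assumption added here for illustration.

```python
import pandas as pd

# Toy values taken from this notebook's outputs; the real notebook would use id_value_df.
toy_df = pd.DataFrame({
    "molecule_chembl_id": ["CHEMBL514691", "CHEMBL27105", "CHEMBL39785"],
    "pchembl_value": [7.2, 6.83, 4.68],
})

# Bin edges follow the rule of thumb: < 5 inactive, 5 to < 7 intermediate, >= 7 active.
toy_df["activity_class"] = pd.cut(
    toy_df["pchembl_value"],
    bins=[-float("inf"), 5.0, 7.0, float("inf")],
    labels=["inactive", "intermediate", "active"],
    right=False,  # intervals are [low, high), so exactly 7.0 counts as active
)
print(toy_df)
```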


In [None]:
# Next let's get the SMILES structure of each molecule so we can add it to the existing dataframe
molecule = new_client.molecule

# Compute the unique IDs once instead of on every loop iteration
unique_ids = id_value_df['molecule_chembl_id'].unique()

all_smiles = []
for i, chembl_id in enumerate(unique_ids, start=1):
  results = molecule.filter(chembl_id=chembl_id).only(['molecule_chembl_id', 'molecule_structures'])
  try:
    smiles_str = results[0]['molecule_structures']['canonical_smiles']
    molecule_id = results[0]['molecule_chembl_id']
    all_smiles.append({'molecule_chembl_id': molecule_id, 'canonical_smiles': smiles_str})
    print(f"{i}/{len(unique_ids)} - {molecule_id} has a SMILES structure...")
  except (TypeError, IndexError):
    # molecule_structures was None (TypeError), or the query came back empty (IndexError)
    print(f"{i}/{len(unique_ids)} - {chembl_id} did not have a SMILES structure...")

smiles_df = pd.DataFrame(all_smiles)
smiles_df
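
The loop above sends one request per molecule, which is slow for several thousand IDs. A possible alternative is to request the molecules in batches with the client's `molecule_chembl_id__in` filter. The sketch below assumes that approach; `chunked`, `fetch_smiles_batched` and the batch size of 50 are hypothetical choices made for illustration, and the function has not been run against the live API here.

```python
from typing import Iterator, List

def chunked(ids: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `ids` with at most `size` items each."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

def fetch_smiles_batched(ids: List[str], batch_size: int = 50) -> List[dict]:
    """Hypothetical helper: fetch canonical SMILES for several ChEMBL IDs per request."""
    # Imported lazily so the chunking helper can be used without the client installed.
    from chembl_webresource_client.new_client import new_client

    records = []
    for batch in chunked(ids, batch_size):
        results = new_client.molecule.filter(molecule_chembl_id__in=batch).only(
            ["molecule_chembl_id", "molecule_structures"]
        )
        for mol in results:
            structures = mol.get("molecule_structures")
            # Skip molecules that have no structure record or no SMILES string
            if structures and structures.get("canonical_smiles"):
                records.append({
                    "molecule_chembl_id": mol["molecule_chembl_id"],
                    "canonical_smiles": structures["canonical_smiles"],
                })
    return records

print(list(chunked(["a", "b", "c", "d", "e"], 2)))  # [['a', 'b'], ['c', 'd'], ['e']]
```

In the notebook this could be called as `pd.DataFrame(fetch_smiles_batched(list(id_value_df['molecule_chembl_id'].unique())))` to build the same kind of table as `smiles_df`.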


In [None]:
# Let's inner join the two dataframes
merged_df = pd.merge(smiles_df, id_value_df, on='molecule_chembl_id', how='inner')
merged_df

Unnamed: 0,molecule_chembl_id,canonical_smiles,pchembl_value
0,CHEMBL514691,Cc1nc2cc(NC(=O)c3ccc(-c4ccc(F)cc4)nc3C)ccc2s1,7.20
1,CHEMBL514691,Cc1nc2cc(NC(=O)c3ccc(-c4ccc(F)cc4)nc3C)ccc2s1,7.31
2,CHEMBL514691,Cc1nc2cc(NC(=O)c3ccc(-c4ccc(F)cc4)nc3C)ccc2s1,8.05
3,CHEMBL514691,Cc1nc2cc(NC(=O)c3ccc(-c4ccc(F)cc4)nc3C)ccc2s1,7.40
4,CHEMBL514691,Cc1nc2cc(NC(=O)c3ccc(-c4ccc(F)cc4)nc3C)ccc2s1,7.00
...,...,...,...
7943,CHEMBL39785,CCC(CCc1ccc(O)c(OC)c1)N(Cc1ccccc1)/C(S)=N/CCc1...,4.68
7944,CHEMBL291285,COc1cc(CCC(CCc2ccccc2)O/C(S)=N/CCc2ccccc2)ccc1O,4.72
7945,CHEMBL36669,CCC(CCc1ccc(O)c(OC)c1)N(C)/C(S)=N/CCc1ccccc1,4.57
7946,CHEMBL440471,CCC(CCc1ccc(O)c(OC)c1)/N=C(\S)NCCc1ccccc1,4.54
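
The merged table contains several rows per molecule (CHEMBL514691 alone appears with five different pChEMBL values), because each activity record becomes its own row. Whether to keep those repeats is a modelling choice this notebook leaves open. One possible way to collapse them is to take the median per structure, sketched here on toy data with placeholder SMILES strings rather than the real table.

```python
import pandas as pd

# Placeholder SMILES strings, not the real molecules; the real notebook would use merged_df.
toy_merged = pd.DataFrame({
    "canonical_smiles": ["CCO", "CCO", "CCO", "c1ccccc1"],
    "pchembl_value": [7.20, 7.31, 8.05, 4.68],
})

# One row per structure, with the median of its measurements as the target value
deduped = toy_merged.groupby("canonical_smiles", as_index=False)["pchembl_value"].median()
print(deduped)
```

The median is less sensitive to a single outlying assay than the mean, but either would give one target value per structure.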


In [None]:
from google.colab import files

In [None]:
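# Intentional stop: this fails on purpose so that "Run all" halts here
# and the download cells below only run when triggered manually
assert False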

In [None]:
# Run this cell to download the training/testing data
csv_name_data = 'trpv1_agonists_train_test_data.csv'
model_data = merged_df.drop(columns=['molecule_chembl_id'])
model_data.to_csv(csv_name_data, index=False)
files.download(csv_name_data)


In [None]:
# Run this cell to download the data with the IDs for manual viewing purposes
csv_name_id = 'trpv1_agonists_train_test_ids.csv'
merged_df.to_csv(csv_name_id, index=False)
files.download(csv_name_id)
