## Data Curation
Curating raw data is a long, detailed process that is vital to good data science. 
Here we will cover some tools in AMPL that are commonly used in the data 
curation process. These are just a few of the steps needed to curate a dataset;
another tutorial will cover data curation in more detail.

### Read the data.
We've prepared an example dataset for the KCNA5 target. This dataset is 
simpler than what is commonly found in the wild, but it will concisely 
demonstrate the AMPL tools.

In [6]:
import pandas as pd
import numpy as np
import seaborn as sns

# read in data
kcna5=pd.read_csv('kcna5_ic50.csv')


### Columns
This dataset is drawn from the ChEMBL database and contains the following columns:
- `molecule_chembl_id`: The ChEMBL ID for the molecule.
- `smiles`: The SMILES string that represents the molecule. This is the main
input taken by AMPL models.
- `standard_type`: The type of endpoint, e.g., IC50, Ki, or Kd.
This dataset only contains IC50 data points.
- `standard_relation`: Data points might be censored, i.e., reported as a bound
(such as `>` or `<`) rather than an exact value. This column records whether each
data point is censored.
- `standard_value`: The measured IC50 value.
- `standard_units`: The units of the IC50 value, which can differ between records.
Fortunately, all of this data uses nM.

In [7]:
kcna5.columns

Index(['molecule_chembl_id', 'smiles', 'standard_type', 'standard_relation',
       'standard_value', 'standard_units'],
      dtype='object')

### Standardize SMILES
SMILES strings are not unique: the same compound can be represented by different, 
yet equivalent, SMILES strings. This step simplifies the machine learning problem by 
ensuring each compound is always represented the same way.

In [8]:
from atomsci.ddm.utils.struct_utils import base_smiles_from_smiles
kcna5['base_rdkit_smiles'] = base_smiles_from_smiles(kcna5['smiles'])

From now on we will use `base_rdkit_smiles`.

### Standardize Relations
Relations can also differ from database to database. This function standardizes the
relation column for use with AMPL. Since this data is from ChEMBL, we call 
the function with `db='ChEMBL'`.

In [9]:
from atomsci.ddm.utils.data_curation_functions import standardize_relations
kcna5 = standardize_relations(kcna5, db='ChEMBL', 
                    rel_col='standard_relation',
                    output_rel_col='fixed_relation')
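
To make the idea of standardizing relations concrete, here is a minimal sketch in plain Python. It is not AMPL's implementation: the mapping table, the `normalize_relation` helper, and the choice of output symbols (`''`, `'<'`, `'>'`) are assumptions made for illustration only.

```python
# Illustrative only: a hypothetical mapping, not AMPL's implementation.
RELATION_MAP = {
    "=": "",    # exact measurement, no censoring
    "~": "",    # approximate values are treated as exact in this sketch
    "<": "<",   # true value lies below the reported one
    "<=": "<",
    ">": ">",   # true value lies above the reported one
    ">=": ">",
}

def normalize_relation(raw):
    """Map a raw relation string to '', '<', or '>'."""
    if raw is None:
        return ""  # assumption: a missing relation means an exact value
    cleaned = raw.strip().strip("'\"")  # drop whitespace and stray quotes
    return RELATION_MAP[cleaned]

raw_relations = ["=", "'>'", "<=", None, " >= "]
fixed = [normalize_relation(r) for r in raw_relations]
print(fixed)
```

Collapsing the variants into a small, fixed set of symbols means later steps only need to distinguish three cases: exact, left-censored, and right-censored.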


### Calculate pIC50s
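We will convert the IC50s to pIC50s before performing machine learning. This function 
reads the `standard_value` and `standard_units` columns, uses the `unit_conv` mapping 
to convert each IC50 to molar units, and writes the result to a new `pIC50` column. 
In this dataset, all IC50s are in nM.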

In [10]:
from atomsci.ddm.utils.data_curation_functions import convert_IC50_to_pIC50
kcna5 = convert_IC50_to_pIC50(kcna5, 
                              unit_col='standard_units',
                              value_col='standard_value',
                              new_value_col='pIC50',
                              unit_conv={'uM':lambda x: x*1e-6, 'nM':lambda x: x*1e-9},
                              inplace=False)
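
For reference, a pIC50 is the negative base-10 logarithm of the IC50 expressed in molar units. The sketch below works through that arithmetic in plain Python. The `ic50_to_pic50` helper and the `TO_MOLAR` table are hypothetical names used for illustration; they mirror the `unit_conv` argument above but are not part of AMPL.

```python
import math

# Conversion factors to molar, mirroring the unit_conv argument above.
TO_MOLAR = {"uM": 1e-6, "nM": 1e-9}

def ic50_to_pic50(value, unit):
    """Return -log10 of the IC50 expressed in mol/L."""
    molar = value * TO_MOLAR[unit]
    return -math.log10(molar)

# 100 nM and 0.1 uM are the same concentration (1e-7 M).
pic50_from_nm = ic50_to_pic50(100, "nM")
pic50_from_um = ic50_to_pic50(0.1, "uM")
print(round(pic50_from_nm, 3), round(pic50_from_um, 3))
```

Both calls describe the same concentration, so both give a pIC50 of 7. Note that a lower IC50 (a more potent compound) corresponds to a higher pIC50.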

### Remove outliers and aggregate
The final step is to remove outliers and aggregate duplicate measurements.


In [None]:
from atomsci.ddm.utils.curate_data import remove_outlier_replicates, \
                                    aggregate_assay_data
kcna5 = remove_outlier_replicates(kcna5, id_col='molecule_chembl_id',
                                response_col='pIC50')

kcna5 = aggregate_assay_data(kcna5, 
                             value_col='pIC50',
                             output_value_col='avg_pIC50',
                             id_col='molecule_chembl_id',
                             smiles_col='base_rdkit_smiles',
                             relation_col='fixed_relation',
                        )
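
As a rough picture of what aggregating duplicate measurements means, the sketch below averages replicate pIC50 values per compound in plain Python. This is only the simplest case, under stated assumptions: the compound IDs and values are made up, and AMPL's `aggregate_assay_data` may use a different statistic and additionally accounts for censored values via the relation column.

```python
from collections import defaultdict
from statistics import mean

# Made-up replicate measurements: (compound id, pIC50).
measurements = [
    ("cmpd_A", 6.0),
    ("cmpd_A", 7.0),
    ("cmpd_B", 5.5),
]

# Collect all replicate values for each compound.
by_compound = defaultdict(list)
for compound_id, pic50 in measurements:
    by_compound[compound_id].append(pic50)

# Reduce each compound's replicates to a single averaged value.
avg_pic50 = {cid: mean(vals) for cid, vals in by_compound.items()}
print(avg_pic50)
```

After this step, each compound appears exactly once, which is the property the aggregation stage is meant to guarantee before model training.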
