## MoA Kaggle - Drug Discovery

#### Multi-label classification problem

[Kaggle Link](https://www.kaggle.com/c/lish-moa/data)

In current approaches to drug discovery, scientists seek to identify a protein target associated with a disease and develop a molecule that can modulate that protein target. As a shorthand to describe the biological activity of a given molecule, scientists assign a label referred to as mechanism of action, or MoA for short.

In other words, an MoA label describes the effect of a drug on its protein target.

**Metric**: the log loss averaged over all target classes.
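
Written out, this is the binary log loss per target, averaged over targets. The symbols below are our own notation, not the notebook's: $N$ is the number of samples, $M$ the number of targets (206 here), $y_{i,m}$ the true label and $\hat{y}_{i,m}$ the predicted probability.

$$
\text{score} = -\frac{1}{M}\sum_{m=1}^{M}\frac{1}{N}\sum_{i=1}^{N}\Big[\, y_{i,m}\log \hat{y}_{i,m} + (1-y_{i,m})\log\big(1-\hat{y}_{i,m}\big) \Big]
$$

The inner average for a single target is what the `getLogLoss` helper further down computes.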

#### XGBoost: single-target parameter tuning and feature importance

Out of the 206 target variables, `cyclooxygenase_inhibitor` has the highest log loss, 0.092, with the deep neural net.
The goal is to reduce this with XGBoost.

See the predWork notebook for the individual losses of each target variable.

In [1]:

import numpy as np
import pandas as pd
import xgboost as xgb

SEED = 42

np.random.seed(SEED)

### Supporting functions


In [12]:
def GridParamTune(dtrain,gridParams,EvalMetric,params,Drop=1,verbose=1):
    '''
    XGBoost parameter tuning
    Author: Taran
    
    Given a dictionary of candidate values, this function runs a grid search,
    or a random search over a random subsample of the grid.
    The cv call can be replaced to tune any other model.
    
    Args
        dtrain      xgb.DMatrix holding the training features and labels
        gridParams  dict mapping each tuned parameter to its candidate values
        EvalMetric  a single eval metric name, e.g. 'logloss'
        params      dict of fixed parameters that are not tuned
        Drop        value in [0,1]; despite its name it is the probability that
                    a combination is evaluated (1 = full grid, <1 = random search)
        verbose     0 or 1 (not used yet)
    Returns
        a df with one row of results per evaluated combination
    Additional dependency: itertools
    
    '''
    import itertools
    # names of the tuned parameters
    paramNames = list(gridParams.keys())
    
    results = []
    ### iterate over all combinations
    for row in itertools.product(*gridParams.values()):
        ### random search threshold: skip the combination with probability 1-Drop
        ### (the original test `< Drop` skipped every combination when Drop=1)
        if np.random.random() >= Drop:
            continue 
        
        # merge the tuned values into a copy so the caller's dict is not mutated
        cvParams = {**params, **dict(zip(paramNames, row))}
            
        # cross-validate the model for the given params    
        cvN = xgb.cv(
                    cvParams,
                    dtrain,
                    num_boost_round=1000,
                    seed=42,
                    nfold=5,
                    metrics=[EvalMetric],
                    early_stopping_rounds=25
                )
        # results at the round with the lowest mean test metric
        bestRound = cvN[f'test-{EvalMetric}-mean'].argmin()
        trainMetric = cvN[f'train-{EvalMetric}-mean'][bestRound]
        testMetric = cvN[f'test-{EvalMetric}-mean'][bestRound]
        # relative gap between test and train metric, in percent
        overfit = ((testMetric/trainMetric) - 1)*100
        
        # one flat row: tuned values first, then the cv results
        results.append(list(row) + [bestRound,trainMetric,testMetric,overfit])
         
    colNames = paramNames + ['bestRound',f'train-{EvalMetric}',f'test-{EvalMetric}','overfit']
    df = pd.DataFrame(results,columns=colNames)
                  
    return df


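def getLogLoss(y,yhat):
    '''
    Mean binary log loss of predicted probabilities yhat against 0/1 labels y
    '''
    assert len(y)==len(yhat)
    eps= 1e-12
    
    # clip predictions away from 0 and 1 so that log() stays finite
    yhat= np.clip(yhat,eps,1-eps)
    return -1*np.mean(y*np.log(yhat) + (1-y)*np.log(1-yhat))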


def PreProcessX(df):
    '''
    Preprocessing for the independent variables:
    encodes the categorical columns cp_dose and cp_type as 0/1
    
    Modifies df in place and returns it
    '''
    df['cp_dose'] = (df['cp_dose'] == 'D1').astype(int)
    df['cp_type'] = (df['cp_type'] == 'trt_cp').astype(int)
    
    return df
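
A quick sanity check of the log loss helper can make its scale easier to read. The sketch below restates `getLogLoss` so that it runs on its own and evaluates it on made-up labels and predictions (the arrays are illustrative, not competition data): an uninformed prediction of 0.5 gives ln 2, and confident predictions are rewarded or punished accordingly.

```python
import numpy as np

def getLogLoss(y, yhat):
    # Same computation as the notebook helper, repeated so this runs standalone.
    assert len(y) == len(yhat)
    eps = 1e-12
    yhat = np.clip(yhat, eps, 1 - eps)
    return -1 * np.mean(y * np.log(yhat) + (1 - y) * np.log(1 - yhat))

y = np.array([1, 0, 1, 0])  # illustrative labels

uninformed = getLogLoss(y, np.full(4, 0.5))                       # ln(2)
confident_right = getLogLoss(y, np.array([0.9, 0.1, 0.9, 0.1]))   # -ln(0.9)
confident_wrong = getLogLoss(y, np.array([0.1, 0.9, 0.1, 0.9]))   # -ln(0.1)
hard_labels = getLogLoss(y, y.astype(float))                      # finite thanks to clipping

print(uninformed, confident_right, confident_wrong, hard_labels)
```

Against this scale, the losses around 0.09 quoted in this notebook are far below the 0.69 of an uninformed guess, which is expected for a target with very few positives.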

In [3]:
xAll = pd.read_csv('../Data/train_features.csv')
yAll = pd.read_csv('../Data/train_targets_scored.csv')



In [4]:
xAll = PreProcessX(xAll)

trainIds= xAll['sig_id'].sample(frac=0.8,random_state=SEED)

In [5]:
xTrain = xAll[xAll['sig_id'].isin(trainIds)]
yTrain = yAll[yAll['sig_id'].isin(trainIds)]
xValid = xAll[~xAll['sig_id'].isin(trainIds)]
yValid = yAll[~yAll['sig_id'].isin(trainIds)]
print(f'xTrain {xTrain.shape} yTrain {yTrain.shape} xValid {xValid.shape} yValid {yValid.shape}')

xTrain (19051, 876) yTrain (19051, 207) xValid (4763, 876) yValid (4763, 207)
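
The split above filters the feature and target frames separately by `sig_id`, so the rows only line up if both files list the ids in the same order. A minimal sketch of the same split on toy frames (all values here are hypothetical stand-ins), with an explicit alignment check added:

```python
import pandas as pd

# Toy stand-ins for train_features.csv and train_targets_scored.csv.
ids = [f'id_{i}' for i in range(10)]
x_all = pd.DataFrame({'sig_id': ids, 'g-0': range(10)})
y_all = pd.DataFrame({'sig_id': ids, 'some_target': [0, 1] * 5})

# Sample 80% of the ids, then filter both frames by membership.
train_ids = x_all['sig_id'].sample(frac=0.8, random_state=42)
x_mask = x_all['sig_id'].isin(train_ids)
y_mask = y_all['sig_id'].isin(train_ids)

x_train, x_valid = x_all[x_mask], x_all[~x_mask]
y_train, y_valid = y_all[y_mask], y_all[~y_mask]

# Check, rather than assume, that features and targets are row-aligned.
aligned = (x_train['sig_id'].values == y_train['sig_id'].values).all()
print(x_train.shape, x_valid.shape, aligned)
```

Running the same comparison on the real `xTrain` and `yTrain` before dropping `sig_id` would confirm the alignment for the competition files.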


In [6]:
idList=['sig_id']
xTrain=xTrain.drop(idList,axis=1)
xValid=xValid.drop(idList,axis=1)
yTrain=yTrain.drop(idList,axis=1)
yValid=yValid.drop(idList,axis=1)

In [7]:
dtrain = xgb.DMatrix(xTrain, label=yTrain['cyclooxygenase_inhibitor'])

In [9]:
# gamma and max_leaves are not tuned here; they are to be tuned later based on the overfit

NumRows = xTrain.shape[0]*0.8  # with 5-fold cv each model is fit on 80% of the rows
gridParams = {
   
    'max_depth' : [4,6,8],
    'eta' : [0.01,0.05,0.1],
    'colsample_bytree' : [0.2,0.5,1],
    'min_child_weight' : [1,int(NumRows*0.005)]  # 1, or 0.5% of the rows in a fit
}
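
To make the size of this search concrete, the arithmetic can be spelled out. The sketch below uses the `xTrain` row count printed earlier (19051) and simply enumerates the grid; the variable names are our own.

```python
import itertools

n_train_rows = 19051                      # xTrain row count printed by the notebook
rows_per_fit = n_train_rows * 0.8         # 4 of the 5 folds are used for fitting
min_child_weight_hi = int(rows_per_fit * 0.005)

grid = {
    'max_depth': [4, 6, 8],
    'eta': [0.01, 0.05, 0.1],
    'colsample_bytree': [0.2, 0.5, 1],
    'min_child_weight': [1, min_child_weight_hi],
}

# itertools.product varies the last parameter fastest.
combos = list(itertools.product(*grid.values()))
n_cv_fits = len(combos) * 5               # one model per fold per combination

print(min_child_weight_hi, len(combos), n_cv_fits)
```

So the second `min_child_weight` candidate is 76, and the grid has 3 x 3 x 3 x 2 = 54 combinations, i.e. 270 boosted models of up to 1000 rounds each.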


In [10]:
# fixed params that are not tuned
params = {
    'objective':'binary:logistic'
}

In [13]:
%time paramResults=GridParamTune(dtrain=dtrain,gridParams=gridParams,EvalMetric='logloss',params=params)


CPU times: user 20h 57min 12s, sys: 54 s, total: 20h 58min 6s
Wall time: 1h 46min 24s
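
At close to two hours of wall time for the full grid, subsampling the grid at random is a natural way to cut the cost. The following is a minimal standalone sketch of that idea, not the notebook's own code: `sample_grid` and `keep_prob` are hypothetical names, and each combination is assumed to be kept independently with probability `keep_prob`.

```python
import itertools
import numpy as np

def sample_grid(grid, keep_prob, seed=42):
    """Return the grid combinations that survive random subsampling."""
    rng = np.random.default_rng(seed)
    names = list(grid.keys())
    kept = []
    for row in itertools.product(*grid.values()):
        # rng.random() is uniform on [0, 1), so keep_prob=1 keeps every row
        # and keep_prob=0 keeps none.
        if rng.random() < keep_prob:
            kept.append(dict(zip(names, row)))
    return kept

grid = {
    'max_depth': [4, 6, 8],
    'eta': [0.01, 0.05, 0.1],
    'colsample_bytree': [0.2, 0.5, 1],
    'min_child_weight': [1, 76],
}

full = sample_grid(grid, keep_prob=1.0)     # the complete grid
subset = sample_grid(grid, keep_prob=0.3)   # roughly a third of it
print(len(full), len(subset))
```

The number of surviving combinations is itself random, so a fixed seed is what makes a run repeatable.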


In [14]:
paramResults.head()

| | max_depth | eta | colsample_bytree | min_child_weight | bestRound | train-logloss | test-logloss | overfit |
|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 0.01 | 0.2 | 1 | 727 | 0.057525 | 0.087668 | 52.398419 |
| 1 | 4 | 0.01 | 0.2 | 76 | 773 | 0.079891 | 0.088966 | 11.360285 |
| 2 | 4 | 0.01 | 0.5 | 1 | 676 | 0.059015 | 0.087648 | 48.517844 |
| 3 | 4 | 0.01 | 0.5 | 76 | 751 | 0.079331 | 0.088793 | 11.927211 |
| 4 | 4 | 0.01 | 1.0 | 1 | 617 | 0.061563 | 0.087897 | 42.775554 |
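
The `overfit` column is the relative gap between test and train log loss, in percent, as computed inside `GridParamTune`. A quick check against row 0 above, using the rounded values shown in the output (so it only matches the stored figure to about two decimals):

```python
train_logloss = 0.057525   # row 0, train-logloss
test_logloss = 0.087668    # row 0, test-logloss

# Percentage by which the test loss exceeds the train loss.
overfit_pct = (test_logloss / train_logloss - 1) * 100
print(round(overfit_pct, 1))   # 52.4
```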


#### Selection criteria

Sort by test log loss.

Observation:
every configuration stopped early, i.e. bestRound is below the num_boost_round limit of 1000.

Since the goal is to regularize the model later,
we shortlist the 5 configurations with the lowest test log loss and pick the least overfitted of those.


In [15]:
paramResults.sort_values(['test-logloss'])

| | max_depth | eta | colsample_bytree | min_child_weight | bestRound | train-logloss | test-logloss | overfit |
|---|---|---|---|---|---|---|---|---|
| 2 | 4 | 0.01 | 0.5 | 1 | 676 | 0.059015 | 0.087648 | 48.517844 |
| 14 | 4 | 0.1 | 0.5 | 1 | 56 | 0.065939 | 0.087667 | 32.950961 |
| 0 | 4 | 0.01 | 0.2 | 1 | 727 | 0.057525 | 0.087668 | 52.398419 |
| 4 | 4 | 0.01 | 1.0 | 1 | 617 | 0.061563 | 0.087897 | 42.775554 |
| 10 | 4 | 0.05 | 1.0 | 1 | 123 | 0.06107 | 0.087942 | 44.002109 |
| 12 | 4 | 0.1 | 0.2 | 1 | 72 | 0.056511 | 0.08797 | 55.669162 |
| 16 | 4 | 0.1 | 1.0 | 1 | 62 | 0.060099 | 0.087984 | 46.39729 |
| 6 | 4 | 0.05 | 0.2 | 1 | 146 | 0.056675 | 0.087984 | 55.244148 |
| 20 | 6 | 0.01 | 0.5 | 1 | 568 | 0.042952 | 0.087992 | 104.860309 |
| 8 | 4 | 0.05 | 0.5 | 1 | 126 | 0.061466 | 0.088024 | 43.208884 |
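
The selection rule described above (five lowest test log losses, then the least overfitted of those) can be sketched in pandas. The rows below are copied from the sorted output; the variable names `results`, `shortlist` and `chosen` are our own, and the notebook does not state which row the author finally picked.

```python
import pandas as pd

# Ten best rows by test-logloss, copied from the sorted notebook output.
rows = [
    (2, 4, 0.01, 0.5, 1, 676, 0.059015, 0.087648, 48.517844),
    (14, 4, 0.1, 0.5, 1, 56, 0.065939, 0.087667, 32.950961),
    (0, 4, 0.01, 0.2, 1, 727, 0.057525, 0.087668, 52.398419),
    (4, 4, 0.01, 1.0, 1, 617, 0.061563, 0.087897, 42.775554),
    (10, 4, 0.05, 1.0, 1, 123, 0.06107, 0.087942, 44.002109),
    (12, 4, 0.1, 0.2, 1, 72, 0.056511, 0.08797, 55.669162),
    (16, 4, 0.1, 1.0, 1, 62, 0.060099, 0.087984, 46.39729),
    (6, 4, 0.05, 0.2, 1, 146, 0.056675, 0.087984, 55.244148),
    (20, 6, 0.01, 0.5, 1, 568, 0.042952, 0.087992, 104.860309),
    (8, 4, 0.05, 0.5, 1, 126, 0.061466, 0.088024, 43.208884),
]
cols = ['idx', 'max_depth', 'eta', 'colsample_bytree', 'min_child_weight',
        'bestRound', 'train-logloss', 'test-logloss', 'overfit']
results = pd.DataFrame(rows, columns=cols).set_index('idx')

# Step 1: shortlist the five configurations with the lowest test log loss.
shortlist = results.nsmallest(5, 'test-logloss')

# Step 2: among those, take the one with the smallest train/test gap.
chosen = shortlist.loc[shortlist['overfit'].idxmin()]
print(chosen.name, chosen['eta'], chosen['colsample_bytree'])
```

Under this reading of the rule the pick is row 14 (`max_depth=4`, `eta=0.1`, `colsample_bytree=0.5`, `min_child_weight=1`), whose test log loss is within 0.00002 of the best row while its overfit is the lowest of the five.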
