# Eval models on train-validation split

See the [test_train_split directory](https://github.com/BrianDavisMath/FDA-COVID19/tree/master/test_train_split) of the FDA-COVID19 repository.

Use the interactions split in [this zip file](https://github.com/BrianDavisMath/FDA-COVID19/blob/master/test_train_split/training_validation_split.zip) to build and train a model. Evaluate the model on the validation set and report its performance as a weighted F1 score.

_"target metric is f1_score with sample weights computed using the script included in the zip file"_

Weights should be applied in the _sklearn.metrics.f1_score_ function's _sample_weight_ argument. Weights for cid/pid pairs are obtained by running the _get_validation_weights_ function that's included in the [zip file](https://github.com/BrianDavisMath/FDA-COVID19/blob/master/test_train_split/training_validation_split.zip).
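
As a minimal sketch of how the weights enter the metric, the toy example below passes per-sample weights to `sklearn.metrics.f1_score` through `sample_weight` and checks the result against a hand computation. The labels, predictions, and weights are made up for illustration. In practice the weights would come from `get_validation_weights`, whose signature is not shown here, so it is not called.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical labels, predictions and weights (illustration only).
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1])
weights = np.array([1.0, 1.0, 2.0, 1.0, 0.5])

# Binary F1 where each sample counts with its weight instead of 1.
weighted_f1 = f1_score(y_true, y_pred, sample_weight=weights)

# The same quantity by hand, from weighted TP / FP / FN totals.
tp = weights[(y_true == 1) & (y_pred == 1)].sum()
fp = weights[(y_true == 0) & (y_pred == 1)].sum()
fn = weights[(y_true == 1) & (y_pred == 0)].sum()
manual_f1 = 2 * tp / (2 * tp + fp + fn)

print(weighted_f1, manual_f1)
```

A heavily weighted false negative (the third sample here) pulls the score down more than an unweighted one would, which is the intended effect of weighting the validation pairs.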

**Note:** bs_features.csv contains the reduced binding site fingerprints for all examples in training and validation sets. Use of this or the original data is optional.

**Note:** The two new interactions files, for the training and validation feature sets, include a new column: _sample_activity_score_. This can optionally be used to sub-sample the data. A score of zero marks an interaction where both the pid and the cid show no variance in activity across the entire data set; scores close to 1.0 indicate balanced activity for both pid and cid. The idea is that interactions with a score close to zero might not improve the model very much, so you may want to omit them.
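
One possible way to use _sample_activity_score_ for sub-sampling is a simple threshold filter, sketched below on a toy frame. The rows and the `min_score` cut-off are assumptions made for illustration; the source does not prescribe a threshold.

```python
import pandas as pd

# Toy stand-in for an interactions file (values are made up).
df = pd.DataFrame({
    'pid': ['P1', 'P1', 'P2', 'P3'],
    'cid': [10, 11, 10, 12],
    'sample_activity_score': [0.0, 0.4, 0.9, 0.05],
})

min_score = 0.1  # hypothetical cut-off, not a value from the source

# Keep only interactions whose score is at or above the cut-off.
df_subsampled = df[df['sample_activity_score'] >= min_score]

print('kept {} of {} interactions'.format(len(df_subsampled), len(df)))
```

If sub-sampling is used at all, it would normally be applied to the training interactions only, so that the validation score stays comparable across models.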

In [1]:
%pylab inline
%autosave 25

import h5py
import pandas as pd

Populating the interactive namespace from numpy and matplotlib


Autosaving every 25 seconds


### Data location

In [2]:
data_loc = '../data/training_validation_split/'

### Load the data

In [51]:
def load_data(path, data_type=None, nrows=None):
    """Read a CSV with its first column as the index and print a short summary."""
    # dtype=None lets pandas infer the column types, so no branching is needed
    df = pd.read_csv(path, index_col=0, dtype=data_type, nrows=nrows)
    print('{}: Number of rows: {:,}\n'.format(path, len(df)))
    print('{}: Number of columns: {:,}\n'.format(path, len(df.columns)))

    print(df.head(1))

    return df

In [None]:
df_training_interactions = load_data(data_loc+'training_interactions.csv')

In [None]:
df_validation_interactions = load_data(data_loc+'validation_interactions.csv')

In [None]:
# ratio of validation rows to training rows (not to the combined total)
print('Validation set size is {:0.0f}% of the training set size.'.format(len(df_validation_interactions)/len(df_training_interactions)*100))

In [None]:
# Create a small test data set for use in development of the HPC scripts

df_training_interactions = load_data(data_loc+'training_interactions.csv', nrows=1000)
df_validation_interactions = load_data(data_loc+'validation_interactions.csv', nrows=100)

df_training_interactions.rename(columns={"canonical_cid": "cid"}, inplace=True)
df_validation_interactions.rename(columns={"canonical_cid": "cid"}, inplace=True)

pids = df_training_interactions['pid'].tolist() + df_validation_interactions['pid'].tolist()
cids = df_training_interactions['cid'].tolist() + df_validation_interactions['cid'].tolist()

# drug features
df_dragon_features = load_data('../data/FDA-COVID19_files_v1.0/drug_features/dragon_features.csv', data_type=object)
df_fingerprints = load_data('../data/FDA-COVID19_files_v1.0/drug_features/fingerprints.csv')

# protein features
df_binding_sites = load_data('../data/FDA-COVID19_files_v1.0/protein_features/binding_sites_v1.0.csv')
df_expasy = load_data('../data/FDA-COVID19_files_v1.0/protein_features/expasy.csv')
df_profeat = load_data('../data/FDA-COVID19_files_v1.0/protein_features/profeat.csv')

df_dragon_features.index.name = 'cid'
df_fingerprints.index.name = 'cid'
df_binding_sites.index.name = 'pid'
df_expasy.index.name = 'pid'
df_profeat.index.name = 'pid'

# rename the dragon features since there are duplicate column names in the protein binding-sites data.
df_dragon_features.columns = ['cid_'+col for col in df_dragon_features.columns]

max_num_dimensions = 10

# only include rows that match the subset of interactions we sampled above
df_binding_sites = df_binding_sites[df_binding_sites.index.isin(pids)].iloc[:, :max_num_dimensions]
df_expasy = df_expasy[df_expasy.index.isin(pids)].iloc[:, :max_num_dimensions]
df_profeat = df_profeat[df_profeat.index.isin(pids)].iloc[:, :max_num_dimensions]

df_dragon_features = df_dragon_features[df_dragon_features.index.isin(cids)].iloc[:, :max_num_dimensions]
df_fingerprints = df_fingerprints[df_fingerprints.index.isin(cids)].iloc[:, :max_num_dimensions]

print(len(df_binding_sites))
print(len(df_expasy))
print(len(df_profeat))
print(len(df_dragon_features))
print(len(df_fingerprints))

In [71]:
# print out summary after each features merge
def print_merge_details(df_merge_result, df1_name, df2_name):
    print('Joining {} on protein {} yields {:,} rows and {:,} columns'. \
          format(df1_name, df2_name, len(df_merge_result), 
          len(df_merge_result.columns)))
    
def merge(df_interactions):
    print('\n\n-----------------------------------------------')
    print('df_interactions + df_binding_sites = df_features \n')
    df_features = pd.merge(df_interactions, df_binding_sites, on='pid', how='inner')
    print_merge_details(df_features, 'interactions', 'binding_sites')

    print('\n\n-----------------------------------------------')
    print('df_features + df_expasy \n')
    df_features = pd.merge(df_features, df_expasy, on='pid', how='inner')
    print_merge_details(df_features, 'features', 'expasy')

    print('\n\n-----------------------------------------------')
    print('df_features + df_profeat \n')
    df_features = pd.merge(df_features, df_profeat, on='pid', how='inner')
    print_merge_details(df_features, 'features', 'df_profeat')

    print('\n\n-----------------------------------------------')
    print('df_features + df_dragon_features \n')
    df_dragon_features.index.name = 'cid'
    df_features = pd.merge(df_features, df_dragon_features, on='cid', how='inner')
    print_merge_details(df_features, 'features', 'df_dragon_features')

    print('\n\n-----------------------------------------------')
    print('df_features + df_fingerprints \n')
    df_features = pd.merge(df_features, df_fingerprints, on='cid', how='inner')
    print_merge_details(df_features, 'features', 'df_fingerprints')
    
    return df_features

df_training_features = merge(df_training_interactions)
print('==========================')
df_validation_features = merge(df_validation_interactions)



-----------------------------------------------
df_interactions + df_binding_sites = df_features 

Joining interactions on protein binding_sites yields 970 rows and 14 columns


-----------------------------------------------
df_features + df_expasy 

Joining features on protein expasy yields 966 rows and 21 columns


-----------------------------------------------
df_features + df_profeat 

Joining features on protein df_profeat yields 965 rows and 31 columns


-----------------------------------------------
df_features + df_dragon_features 

Joining features on protein df_dragon_features yields 965 rows and 41 columns


-----------------------------------------------
df_features + df_fingerprints 

Joining features on protein df_fingerprints yields 965 rows and 51 columns


-----------------------------------------------
df_interactions + df_binding_sites = df_features 

Joining interactions on protein binding_sites yields 100 rows and 14 columns


---------------------------------
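
The output above shows the inner joins dropping rows: 1,000 training interactions were loaded, 970 remain after the first join, and 965 remain at the end. A hedged sketch of one way to see which proteins cause the loss is a left join with `indicator=True`, shown here on toy data. The frames and the feature column `f0` are made up, and `reset_index()` is used so that `pid` is a plain column on both sides.

```python
import pandas as pd

# Toy interactions and a toy protein feature table indexed by pid.
interactions = pd.DataFrame({
    'pid': ['P1', 'P2', 'P3', 'P1'],
    'cid': [1, 2, 3, 4],
})
binding_sites = pd.DataFrame(
    {'f0': [0.1, 0.2]},
    index=pd.Index(['P1', 'P2'], name='pid'),
)

# A left join keeps every interaction; the _merge column records
# whether a matching feature row was found.
merged = pd.merge(interactions, binding_sites.reset_index(),
                  on='pid', how='left', indicator=True)

# pids that an inner join would have silently dropped
missing_pids = merged.loc[merged['_merge'] == 'left_only', 'pid'].unique()
print(missing_pids)
```

The same check can be repeated per feature table (and on `cid` for the drug features) to attribute each lost row to the table that lacks it.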

In [64]:
# check the row counts of the filtered feature tables before saving them
print(len(df_binding_sites))
print(len(df_expasy))
print(len(df_profeat))
print(len(df_dragon_features))
print(len(df_fingerprints))

50

In [78]:
df_training_interactions.to_csv('../runner/test_data/training_validation_split/training_interactions.csv')
df_validation_interactions.to_csv('../runner/test_data/training_validation_split/validation_interactions.csv')

df_binding_sites.to_csv('../runner/test_data/protein_features/binding_sites.csv')
df_expasy.to_csv('../runner/test_data/protein_features/expasy.csv')
df_profeat.to_csv('../runner/test_data/protein_features/profeat.csv')

df_dragon_features.to_csv('../runner/test_data/drug_features/dragon_features.csv')
df_fingerprints.to_csv('../runner/test_data/drug_features/fingerprints.csv')