# Eval models on train-validation split

The split is described in the [test_train_split directory](https://github.com/BrianDavisMath/FDA-COVID19/tree/master/test_train_split) of the FDA-COVID19 repository.

Use the interactions split in [this zip file](https://github.com/BrianDavisMath/FDA-COVID19/blob/master/test_train_split/training_validation_split.zip) to build and train a model. Evaluate the model on the validation set and report its performance as a weighted F1 score.

_"target metric is f1_score with sample weights computed using the script included in the zip file"_

Weights should be applied in the _sklearn.metrics.f1_score_ function's _sample_weight_ argument. Weights for cid/pid pairs are obtained by running the _get_validation_weights_ function that's included in the [zip file](https://github.com/BrianDavisMath/FDA-COVID19/blob/master/test_train_split/training_validation_split.zip).
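
As a minimal sketch of how the weighting enters the metric, the example below calls `sklearn.metrics.f1_score` with and without `sample_weight`. The labels, predictions and weights are made-up placeholders. In practice the weights would come from `get_validation_weights`, whose signature is not shown here, so it is not called.

```python
import numpy as np
from sklearn.metrics import f1_score

# Placeholder labels and predictions (illustrative values only).
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1])

# Placeholder weights, one per validation example. In the real workflow
# these would be the per cid/pid weights from get_validation_weights.
weights = np.array([2.0, 1.0, 1.0, 1.0, 1.0])

unweighted_f1 = f1_score(y_true, y_pred)
weighted_f1 = f1_score(y_true, y_pred, sample_weight=weights)

print('Unweighted F1: {:0.3f}'.format(unweighted_f1))
print('Weighted F1:   {:0.3f}'.format(weighted_f1))
```

Each example contributes its weight, rather than a count of one, to the true-positive, false-positive and false-negative totals. Here the up-weighted first example is a true positive, so the weighted score comes out higher than the unweighted one.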

**Note:** bs_features.csv contains the reduced binding site fingerprints for all examples in training and validation sets. Use of this or the original data is optional.

**Note:** The two new interactions files, for the training and validation sets, include a new column: _sample_activity_score_. This can optionally be used to sub-sample the data. A score of zero marks an interaction where neither the pid nor the cid shows any variance in activity across the entire data set; scores close to 1.0 indicate balanced activity for both the pid and the cid. The idea is that interactions with scores close to zero might not improve the model much, so you may want to omit them.
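
One possible way to sub-sample on this column is a simple threshold filter, sketched below. The toy frame stands in for the interactions file. Its values and the `MIN_SCORE` cutoff are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Toy stand-in for an interactions file; all values are made up.
df = pd.DataFrame({
    'cid': [101, 102, 103, 104],
    'pid': ['P1', 'P2', 'P3', 'P4'],
    'activity': [1, 0, 0, 1],
    'sample_activity_score': [0.0, 0.05, 0.4, 1.0],
})

MIN_SCORE = 0.1  # hypothetical cutoff, not a recommended value

# Keep only interactions whose score is at or above the cutoff.
df_subsampled = df[df['sample_activity_score'] >= MIN_SCORE]

print('Kept {:,} of {:,} interactions'.format(len(df_subsampled), len(df)))
```

Whether dropping low-score interactions actually helps is an empirical question, so it may be worth comparing validation scores with and without the filter.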

In [1]:
%pylab inline
%autosave 25

import h5py
import pandas as pd

Populating the interactive namespace from numpy and matplotlib


Autosaving every 25 seconds


### Data location

In [26]:
data_loc = '../data/training_validation_split/'

### Load the data

In [4]:
def load_data(path, data_type=None):
    """Read a CSV file (first column as index) and print a short summary."""
    # dtype=None lets pandas infer the column types, so no branch is needed.
    df = pd.read_csv(path, index_col=0, dtype=data_type)
    print('Number of rows: {:,}\n'.format(len(df)))
    print('Number of columns: {:,}\n'.format(len(df.columns)))

    # List the columns that contain at least one missing value.
    columns_missing_values = df.columns[df.isnull().any()].tolist()
    print('{} columns with missing values: {}\n\n'.format(len(columns_missing_values), columns_missing_values))

    # Report the dtype of every column.
    cols = df.columns.tolist()
    column_types = [{col: df.dtypes[col].name} for col in cols]
    print('column types:\n')
    print(column_types, '\n\n')

    print(df.head())

    return df

In [7]:
df_training_interactions = load_data(data_loc+'training_interactions.csv')

Number of rows: 168,415

Number of columns: 4

0 columns with missing values: []


column types:

[{'cid': 'int64'}, {'pid': 'object'}, {'activity': 'int64'}, {'sample_activity_score': 'float64'}] 


             cid     pid  activity  sample_activity_score
185362    204106  Q9UP65         1                    1.0
159478  46938678  O15528         0                    1.0
159479  46938678  O15528         1                    1.0
150238  13703975  P22459         0                    1.0
152040  10202642  P01137         0                    1.0


In [8]:
df_validation_interactions = load_data(data_loc+'validation_interactions.csv')

Number of rows: 20,000

Number of columns: 4

0 columns with missing values: []


column types:

[{'cid': 'int64'}, {'pid': 'object'}, {'activity': 'int64'}, {'sample_activity_score': 'float64'}] 


             cid       pid  activity  sample_activity_score
106739      2264    P01106         1               0.134146
98502   49803313    P30530         1               0.205915
59873       2170  CAA56931         0               0.369108
18924       3878    1C3B_A         0               0.244034
11376      39765  CAQ07474         0               0.438238


In [25]:
print('Validation set is {:0.0f}% the size of the training set.'.format(len(df_validation_interactions)/len(df_training_interactions)*100))

Validation set is 12% the size of the training set.
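
Note that the percentage above divides the validation count by the training count, not by the total number of interactions. A minimal sketch, using the row counts printed earlier (168,415 and 20,000), shows both ways of expressing the split:

```python
# Row counts taken from the load_data output above.
n_train = 168_415
n_val = 20_000

# Validation size relative to the training set (what the cell above prints).
ratio_to_train = n_val / n_train * 100

# Validation size as a share of all interactions (training + validation).
share_of_total = n_val / (n_train + n_val) * 100

print('Relative to training set: {:0.1f}%'.format(ratio_to_train))
print('Share of all interactions: {:0.1f}%'.format(share_of_total))
```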
