# Feature Generation
<hr>

This notebook takes the protein and drug files, cleans them and stitches them together into training and validation feature sets.

It assumes the data is in a sub-directory of the **/data** folder. I've already added entries to the _.gitignore_ file so that the data files won't be committed to the repository. Note that the _.gitignore_ file should be updated for new versions of the data.

See the [data readme in the GitHub repository](https://github.com/BrianDavisMath/FDA-COVID19/tree/master/data) for more details.

The output is two HDF5 files, **training_features.h5** and **validation_features.h5**, written to the _data_loc_ directory.

<hr>

**NOTE:** the fingerprint and binding-site data are unzipped from the reduced/derived sets found here:

* https://github.com/BrianDavisMath/FDA-COVID19/blob/master/data/reduced_binding_site_features_v2.zip
* https://github.com/BrianDavisMath/FDA-COVID19/blob/master/data/reduced_fingerprints_v1.5.zip

Unzip these into the corresponding **drug_features** and **protein_features** folders under the _data_loc_ directory.
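
The unzip step can also be scripted. The following is a minimal sketch, assuming the two archives have already been downloaded locally; the helper name `extract_archive` and the example paths are hypothetical and not part of the repository.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create the target folder if needed
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()


# Example usage (hypothetical local paths; adjust to where the archives were saved):
# extract_archive('reduced_fingerprints_v1.5.zip',
#                 '../data/FDA-COVID19_files_v1.0/drug_features')
```

Returning the member names makes it easy to confirm that the extracted CSV has the file name the loading code expects.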

<hr>

In [1]:
%pylab inline
%autosave 25

import pandas as pd

Populating the interactive namespace from numpy and matplotlib


Autosaving every 25 seconds


## Data location

Change this when you get a new data set.

In [2]:
data_loc = '../data/FDA-COVID19_files_v1.0/'

# The following directory is unzipped from the file at: 
# https://github.com/BrianDavisMath/FDA-COVID19/blob/master/test_train_split/training_validation_split.zip
interactions_data_loc = '../data/training_validation_split/'

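Since every later cell depends on these paths, a quick check that the expected files exist can save a long failed run. The sketch below is only one possible way to do this: `missing_files` is a hypothetical helper, and the file list simply mirrors the relative paths read later in this notebook.

```python
from pathlib import Path


def missing_files(base_dir, relative_paths):
    """Return the relative paths that do not exist as files under base_dir."""
    base = Path(base_dir)
    return [rel for rel in relative_paths if not (base / rel).is_file()]


# Relative paths as they are read later in this notebook.
expected = [
    'drug_features/dragon_features.csv',
    'drug_features/reduced_fingerprints_v1.5.csv',
    'protein_features/binding_site_features_v2.csv',
    'protein_features/expasy.csv',
    'protein_features/profeat.csv',
]

# Example usage:
# print(missing_files('../data/FDA-COVID19_files_v1.0/', expected))
```

An empty list means all feature files are in place.
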
## Functions for loading data and merging
<hr>

In [13]:
## load a specific features CSV file
def load_data(path, data_type=None):
    if data_type:
        df = pd.read_csv(path, index_col=0, dtype=data_type)
    else:
        df = pd.read_csv(path, index_col=0)
    print('Number of rows: {:,}\n'.format(len(df)))
    print('Number of columns: {:,}\n'.format(len(df.columns)))
    
    columns_missing_values = df.columns[df.isnull().any()].tolist()
    print('{} columns with missing values\n\n'.format(len(columns_missing_values)))
    
    cols = df.columns.tolist()
    column_types = [{col: df.dtypes[col].name} for col in cols][:10]
    print('column types:\n')
    print(column_types, '\n\n')
    print(df.head(2))
    
    return df


# print out summary after each features merge
def print_merge_details(df_merge_result, df1_name, df2_name):
    print('Joining {} with {} yields {:,} rows and {:,} columns'.format(
        df1_name, df2_name, len(df_merge_result), len(df_merge_result.columns)))


# Get the individual feature sets as data frames
def load_feature_files():
    print('===============================================')
    print('\ndragon_features.csv')
    print('===============================================')
    # Note: data_type must be set to object; otherwise pandas complains that the column types vary.
    df_dragon_features = load_data(data_loc+'drug_features/dragon_features.csv', data_type=object)
    
    # Prefix the dragon feature names with 'cid_' since some column names also occur in the protein binding-sites data.
    df_dragon_features.columns = ['cid_'+col for col in df_dragon_features.columns]
    
    # handle na values in dragon_features
    # Many cells contain "na" values. Find the columns that contain 2% or 
    # less of these values and retain them, throwing away the rest. 
    # Then mean-impute the "na" values in the remaining columns.
    pct_threshold = 2
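    # maximum number of "na" values allowed per column, based on the actual row count
    na_threshold = int(len(df_dragon_features)*pct_threshold/100)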
    ok_cols = []
    for col in df_dragon_features:
        na_count = df_dragon_features[col].value_counts().get('na')
        if (na_count or 0) <= na_threshold:
            ok_cols.append(col)

    print('number of columns where the frequency of "na" values is <= {}%: {}.'.format(pct_threshold, len(ok_cols)))
    
    df_dragon_features = df_dragon_features[ok_cols].copy()

    # convert all values except "na"s to numbers and set "na" values to NaNs.
    df_dragon_features = df_dragon_features.apply(pd.to_numeric, errors='coerce')

    columns_missing_values = df_dragon_features.columns[df_dragon_features.isnull().any()].tolist()
    print('{} columns with missing values.\n\n'.format(len(columns_missing_values)))

    # replace NaNs with column means
    df_dragon_features.fillna(df_dragon_features.mean(), inplace=True)

    columns_missing_values = df_dragon_features.columns[df_dragon_features.isnull().any()].tolist()
    print('{} columns with missing values (after imputing): {}\n\n'.format(len(columns_missing_values), 
                                                                       columns_missing_values))
    print('===============================================')
    print('reduced_fingerprints_v1.5.csv')
    print('===============================================')
    df_fingerprints = load_data(data_loc+'drug_features/reduced_fingerprints_v1.5.csv')
    
    print('===============================================')
    print('reduced_binding_site_features_v2.0.csv')
    print('===============================================')
    df_binding_sites = load_data(data_loc+'protein_features/binding_site_features_v2.csv')
    
    # Name the index 'pid' to allow joining to other feature files later.
    df_binding_sites.index.name = 'pid'
    
    print('===============================================')
    print('expasy.csv')
    print('===============================================')
    df_expasy = load_data(data_loc+'protein_features/expasy.csv')
    
    print('===============================================')
    print('profeat.csv')
    print('===============================================')
    df_profeat = load_data(data_loc+'protein_features/profeat.csv')
    
    # Name the index 'pid' to allow joining to other feature files later.
    df_profeat.index.name = 'pid'
    
    # profeat has some missing values.
    s = df_profeat.isnull().sum(axis = 0)

    print('number of columns containing missing values: {}'.format(len(s[s > 0])))

    # Drop the rows that have missing values.
    df_profeat.dropna(inplace=True)
    print('number of rows remaining, without NaNs: {:,}'.format(len(df_profeat)))
    
    return {'df_dragon_features': df_dragon_features,
           'df_fingerprints': df_fingerprints,
           'df_binding_sites': df_binding_sites,
           'df_expasy': df_expasy,
           'df_profeat': df_profeat}
    

    
def create_features(split_file_name, out_file_name, feature_sets):
    print('===============================================')
    print('interactions.csv')
    print('===============================================')
    
    # load interactions.csv
    df_interactions = load_data(interactions_data_loc+split_file_name)

    # Rename the 'canonical_cid' column to 'cid' to simplify joining to the other feature sets later.
    df_interactions.rename(columns={"canonical_cid": "cid"}, inplace=True)
    print(df_interactions.head())
    
    # Get the individual feature sets
    df_dragon_features = feature_sets['df_dragon_features']
    df_fingerprints = feature_sets['df_fingerprints']
    df_binding_sites = feature_sets['df_binding_sites']
    df_expasy = feature_sets['df_expasy']
    df_profeat = feature_sets['df_profeat']
    
    print('\n\n===============================================')
    print('Join the data using {}\n'.format(split_file_name))
    print('===============================================')
    
    # Form the complete feature set by joining the data frames according to _cid_ and _pid_.
    # See the data readme in the GitHub repository:
    # https://github.com/BrianDavisMath/FDA-COVID19/tree/master/data.
    
    # By convention, the file features should be concatenated in the following order (for consistency):
    # **binding_sites**, **expasy**, **profeat**, **dragon_features**, **fingerprints**.
    
    print('\n\n-----------------------------------------------')
    print('df_interactions + df_binding_sites = df_features \n')
    df_features = pd.merge(df_interactions, df_binding_sites, on='pid', how='inner')
    print_merge_details(df_features, 'interactions', 'binding_sites')
    
    print('\n\n-----------------------------------------------')
    print('df_features + df_expasy \n')
    df_features = pd.merge(df_features, df_expasy, on='pid', how='inner')
    print_merge_details(df_features, 'features', 'expasy')
    
    print('\n\n-----------------------------------------------')
    print('df_features + df_profeat \n')
    df_features = pd.merge(df_features, df_profeat, on='pid', how='inner')
    print_merge_details(df_features, 'features', 'df_profeat')
    
    print('\n\n-----------------------------------------------')
    print('df_features + df_dragon_features \n')
    df_dragon_features.index.name = 'cid'
    df_features = pd.merge(df_features, df_dragon_features, on='cid', how='inner')
    print_merge_details(df_features, 'features', 'df_dragon_features')
    
    print('\n\n-----------------------------------------------')
    print('df_features + df_fingerprints \n')
    df_features = pd.merge(df_features, df_fingerprints, on='cid', how='inner')
    print_merge_details(df_features, 'features', 'df_fingerprints')
    
    print('\n\n-----------------------------------------------')
    print('Number of rows in joined feature set: {:,}\n'.format(len(df_features)))
    print('Number of columns in joined feature set: {:,}\n'.format(len(df_features.columns)))
    
    # release memory used by previous dataframes.
    del df_interactions
    del df_binding_sites
    del df_expasy
    del df_profeat
    del df_dragon_features
    del df_fingerprints
    
    # Save features to an HDF5 file; the context manager closes the store even on error.
    with pd.HDFStore(data_loc + out_file_name) as store:
        store['df'] = df_features
    

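The "na" handling in `load_feature_files` (drop columns with too many "na" values, convert the rest to numbers, then mean-impute) can be tried on a toy table. This is a sketch under stated assumptions, not the notebook's own code: the helper name `drop_sparse_and_impute` and the toy data are hypothetical, and the threshold is computed from the frame's actual row count.

```python
import pandas as pd


def drop_sparse_and_impute(df, pct_threshold=2, na_token='na'):
    """Keep columns whose share of na_token is <= pct_threshold percent,
    convert them to numeric, then fill the remaining gaps with column means."""
    max_na = int(len(df) * pct_threshold / 100)
    na_counts = (df == na_token).sum()                     # "na" count per column
    kept = df.loc[:, na_counts <= max_na]
    numeric = kept.apply(pd.to_numeric, errors='coerce')   # "na" becomes NaN
    return numeric.fillna(numeric.mean())


# Toy data: column 'a' is 25% "na", column 'b' is 75% "na".
toy = pd.DataFrame({
    'a': ['1.0', 'na', '3.0', '5.0'],
    'b': ['na', 'na', 'na', '2.0'],
}, dtype=object)

cleaned = drop_sparse_and_impute(toy, pct_threshold=25)
print(cleaned)
```

With a 25% threshold, column `b` is discarded and the single "na" in column `a` is replaced by the mean of the remaining values.
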
## Create the feature files

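Because every join in `create_features` is an inner join, interactions whose `pid` or `cid` is absent from a feature table are dropped from the result. One way to list such keys beforehand is a left join with pandas' `indicator` option. The sketch below uses toy data, and the helper name `unmatched_keys` is hypothetical.

```python
import pandas as pd


def unmatched_keys(df_left, df_right, key):
    """Return the values of `key` in df_left that have no match in df_right's index."""
    right_keys = pd.DataFrame({key: df_right.index.unique()})
    merged = pd.merge(df_left[[key]].drop_duplicates(), right_keys,
                      on=key, how='left', indicator=True)
    return merged.loc[merged['_merge'] == 'left_only', key].tolist()


# Toy interactions and a toy protein feature table indexed by 'pid'.
toy_interactions = pd.DataFrame({'cid': [1, 2, 3], 'pid': ['P1', 'P2', 'P9']})
toy_proteins = pd.DataFrame({'f': [0.1, 0.2]},
                            index=pd.Index(['P1', 'P2'], name='pid'))

print(unmatched_keys(toy_interactions, toy_proteins, 'pid'))
```

Here the protein `P9` has no feature row, so an inner join on `pid` would drop that interaction.
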
In [12]:
# Get the individual feature sets
feature_sets = load_feature_files()


dragon_features.csv
Number of rows: 88,105

Number of columns: 3,839

0 columns with missing values


column types:

[{'MW': 'object'}, {'AMW': 'object'}, {'Sv': 'object'}, {'Se': 'object'}, {'Sp': 'object'}, {'Si': 'object'}, {'Mv': 'object'}, {'Me': 'object'}, {'Mp': 'object'}, {'Mi': 'object'}] 


              MW                AMW      Sv                  Se  \
cid                                                               
72792562  474.67  6.781000000000001  41.039              70.101   
44394609  546.48              8.674  43.185  63.538000000000004   
378422    410.52              7.331  34.740   56.43600000000001   
57888919  451.06              6.834  38.685              65.858   
54581291  456.58              8.615  36.234               53.52   

                          Sp                 Si     Mv                  Me  \
cid                                                                          
72792562   43.54600000000001  80.52199999999999  0.586               1.

In [10]:
create_features('training_interactions_v2.csv', 'training_features.h5', feature_sets)
create_features('validation_interactions_v2.csv', 'validation_features.h5', feature_sets)



interactions.csv
Number of rows: 168,765

Number of columns: 6

0 columns with missing values


column types:

[{'index': 'int64'}, {'cid': 'int64'}, {'pid': 'object'}, {'activity': 'int64'}, {'sample_activity_score': 'float64'}, {'expanding_mean': 'float64'}] 


    index       cid     pid  activity  sample_activity_score  expanding_mean
0  185362    204106  Q9UP65         1                    1.0        1.000000
1  159478  46938678  O15528         0                    1.0        0.500000
2  159479  46938678  O15528         1                    1.0        0.666667
3  150238  13703975  P22459         0                    1.0        0.500000
4  150239  13088125  P22459         0                    1.0        0.400000
    index       cid     pid  activity  sample_activity_score  expanding_mean
0  185362    204106  Q9UP65         1                    1.0        1.000000
1  159478  46938678  O15528         0                    1.0        0.500000
2  159479  46938678  O15528         1    