# Feature Generation
<hr>

This notebook takes the protein and drug files, cleans them, and stitches them together into a single feature set.

It assumes the data is in a sub-directory of the **/data** folder. I've already added entries to the _.gitignore_ file so that the data files won't be committed to the repository. Note that the _.gitignore_ file should be updated for new versions of the data.

See the [data readme in the GitHub repository](https://github.com/BrianDavisMath/FDA-COVID19/tree/master/data) for more details.

The output is an HDF5 file called **validation_features.h5**, written to the data location.

<hr>

In [1]:
%pylab inline
%autosave 25

import pandas as pd

Populating the interactive namespace from numpy and matplotlib


Autosaving every 25 seconds


## Data location

Change this when you get a new data set.

In [2]:
data_loc = '../data/FDA-COVID19_files_v1.0/'
interactions_data_loc = '../data/training_validation_split/'

## Load the data
<hr>

In [3]:
def load_data(path, data_type=None):
    if data_type:
        df = pd.read_csv(path, index_col=0, dtype=data_type)
    else:
        df = pd.read_csv(path, index_col=0)
    print('Number of rows: {:,}\n'.format(len(df)))
    print('Number of columns: {:,}\n'.format(len(df.columns)))
    
    columns_missing_values = df.columns[df.isnull().any()].tolist()
    print('{} columns with missing values\n\n'.format(len(columns_missing_values)))
    
    cols = df.columns.tolist()
    column_types = [{col: df.dtypes[col].name} for col in cols][:10]
    print('column types:\n')
    print(column_types, '\n\n')
    
    print(df.head())
    
    return df

<span style="font-weight:bold; font-size:17pt; color:darkblue;">interactions.csv</span>

In [4]:
df_interactions = load_data(interactions_data_loc+'validation_interactions.csv')

# Rename the 'canonical_cid' column to 'cid' to simplify joining to the other feature sets later.
df_interactions.rename(columns={"canonical_cid": "cid"}, inplace=True)
df_interactions.head()

Number of rows: 20,000

Number of columns: 4

0 columns with missing values


column types:

[{'cid': 'int64'}, {'pid': 'object'}, {'activity': 'int64'}, {'sample_activity_score': 'float64'}] 


             cid       pid  activity  sample_activity_score
106739      2264    P01106         1               0.134146
98502   49803313    P30530         1               0.205915
59873       2170  CAA56931         0               0.369108
18924       3878    1C3B_A         0               0.244034
11376      39765  CAQ07474         0               0.438238


Unnamed: 0,cid,pid,activity,sample_activity_score
106739,2264,P01106,1,0.134146
98502,49803313,P30530,1,0.205915
59873,2170,CAA56931,0,0.369108
18924,3878,1C3B_A,0,0.244034
11376,39765,CAQ07474,0,0.438238


<span style="font-weight:bold; font-size:17pt; color:darkblue;">fda_drug_cids.csv</span>

In [5]:
df_fda_drug_cids = load_data(data_loc+'fda_drug_cids.csv')

Number of rows: 3,269

Number of columns: 1

0 columns with missing values


column types:

[{'cid': 'object'}] 


     cid
0  16078
1   4020
2   4021
3  60750
4   5988


<span style="font-weight:bold; font-size:17pt; color:darkgreen;">drug_features/</span><span style="font-weight:bold; font-size:17pt; color:darkblue;">dragon_features.csv</span>

In [6]:
# Note: data_type must be set to object; otherwise pandas complains that the column types vary.
df_dragon_features = load_data(data_loc+'drug_features/dragon_features.csv', data_type=object)

# rename the dragon features since there are duplicate column names in the protein binding-sites data.
df_dragon_features.columns = ['cid_'+col for col in df_dragon_features.columns]

Number of rows: 88,105

Number of columns: 3,839

0 columns with missing values


column types:

[{'MW': 'object'}, {'AMW': 'object'}, {'Sv': 'object'}, {'Se': 'object'}, {'Sp': 'object'}, {'Si': 'object'}, {'Mv': 'object'}, {'Me': 'object'}, {'Mp': 'object'}, {'Mi': 'object'}] 


              MW                AMW      Sv                  Se  \
cid                                                               
72792562  474.67  6.781000000000001  41.039              70.101   
44394609  546.48              8.674  43.185  63.538000000000004   
378422    410.52              7.331  34.740   56.43600000000001   
57888919  451.06              6.834  38.685              65.858   
54581291  456.58              8.615  36.234               53.52   

                          Sp                 Si     Mv                  Me  \
cid                                                                          
72792562   43.54600000000001  80.52199999999999  0.586               1.001   
44394609  45.2

## na values in dragon_features


Many cells contain **"na"** values. Find the columns that contain 2% or less of these values and retain them, throwing away the rest. Then mean-impute the "na" values in the remaining columns.

In [7]:
pct_threshold = 2
# Note: 91424 is a hard-coded row count; the frame loaded above reports 88,105 rows.
na_threshold = int(91424*pct_threshold/100)
ok_cols = []
for col in df_dragon_features:
    na_count = df_dragon_features[col].value_counts().get('na')
    if (na_count or 0) <= na_threshold:
        ok_cols.append(col)
        
print('number of columns where the frequency of "na" values is <= {}%: {}.'.format(pct_threshold, len(ok_cols)))

number of columns where the frequency of "na" values is <= 2%: 3640.


In [8]:
df_dragon_features = df_dragon_features[ok_cols].copy()

# convert all values except "na"s to numbers and set "na" values to NaNs.
df_dragon_features = df_dragon_features.apply(pd.to_numeric, errors='coerce')

columns_missing_values = df_dragon_features.columns[df_dragon_features.isnull().any()].tolist()
print('{} columns with missing values.\n\n'.format(len(columns_missing_values)))

# replace NaNs with column means
df_dragon_features.fillna(df_dragon_features.mean(), inplace=True)

columns_missing_values = df_dragon_features.columns[df_dragon_features.isnull().any()].tolist()
print('{} columns with missing values (after imputing): {}\n\n'.format(len(columns_missing_values), 
                                                                       columns_missing_values))

3565 columns with missing values.


0 columns with missing values (after imputing): []
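
The filter-then-impute step above can be tried on a small frame. This is a minimal sketch on made-up data, not the notebook's own code: the toy column values and the `max_na_fraction` threshold are assumptions chosen so that the result is easy to check by hand.

```python
import pandas as pd

# Toy stand-in for dragon_features: values are strings, missing entries are "na".
toy = pd.DataFrame(
    {
        "MW": ["1.0", "2.0", "na", "3.0"],
        "AMW": ["na", "na", "na", "4.0"],
        "Sv": ["5.0", "6.0", "7.0", "8.0"],
    },
    index=pd.Index([10, 11, 12, 13], name="cid"),
)

max_na_fraction = 0.25  # hypothetical threshold, chosen to suit the toy data

# Fraction of "na" cells per column, computed in one vectorised pass.
na_fraction = toy.eq("na").mean()
kept = toy.loc[:, na_fraction <= max_na_fraction]

# Coerce to numbers ("na" becomes NaN), then fill NaNs with the column means.
numeric = kept.apply(pd.to_numeric, errors="coerce")
imputed = numeric.fillna(numeric.mean())

print(imputed)
```

Computing the "na" fraction from the frame itself avoids depending on a fixed row count.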




### Handle duplicate cids in dragon_features

Later on, when joining the dragon_features, we noticed that the inner join increased the number of rows. This is due to duplicate cids in the dragon_features set.

Here we investigate further and resolve.

In [9]:
# create a temporary cid column from the index.
df_dragon_features['cid'] = df_dragon_features.index
s = df_dragon_features['cid'].value_counts()
dup_cids = s[s > 1]
print('There are {:,} rows with duplicate cids.\n\n'.format(len(dup_cids)))

dup_cids.head()

There are 0 rows with duplicate cids.




Series([], Name: cid, dtype: int64)

Find the cids where the mean interpoint distance is below a threshold. These are the duplicates that can be removed.

In [10]:
from scipy.spatial.distance import pdist
from IPython.display import display, clear_output

dup_keys = dup_cids.keys().tolist()
num_dupes = len(dup_keys)

bad_cids = [] # just keep one row for each
mean_dist_threshold = 0.000001
i = 1
for cid in dup_keys:
    df = df_dragon_features[df_dragon_features['cid']==cid]
    
    # turn "na"s into zeros
    df = df.apply(pd.to_numeric, errors='coerce')
    df.fillna(0, inplace=True)
    
    mean_dist = pdist(df, metric='euclidean').mean()
    if mean_dist <= mean_dist_threshold:
        bad_cids.append(cid)
        
    clear_output(wait=True)
    display('row {} of {:,} (cid = {}), latest mean dist. = {} across {} entries'.format(i, num_dupes, cid, mean_dist, len(df)))
    i = i + 1
    
    del df
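
To illustrate the mean interpoint distance check, here is a small sketch on made-up data. The toy frame and the `groupby` loop are assumptions for illustration; only `pdist` and the threshold idea come from the cell above.

```python
import pandas as pd
from scipy.spatial.distance import pdist

# Toy frame: cid 1 appears twice with identical features,
# cid 2 appears twice with clearly different features.
toy = pd.DataFrame(
    {"f1": [0.5, 0.5, 1.0, 3.0], "f2": [2.0, 2.0, 0.0, 4.0]},
    index=pd.Index([1, 1, 2, 2], name="cid"),
)

threshold = 1e-6  # plays the same role as mean_dist_threshold

exact_dupes = []
for cid, group in toy.groupby(level="cid"):
    if len(group) < 2:
        continue  # pdist needs at least two rows to form a pair
    # Mean of all pairwise Euclidean distances between rows sharing this cid.
    mean_dist = pdist(group.to_numpy(), metric="euclidean").mean()
    if mean_dist <= threshold:
        exact_dupes.append(cid)

print(exact_dupes)
```

Only cids whose repeated rows are numerically indistinguishable pass the threshold, so rows that share a cid but carry different features are left alone.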

In [11]:
if len(bad_cids) > 0:
    print('deduping {} items'.format(len(bad_cids)))
    
    df_dupes = df_dragon_features[df_dragon_features.cid.isin(bad_cids)]
    df_dedupe = df_dragon_features[~df_dragon_features.cid.isin(bad_cids)]
    
    # re-add the first of each dupe back into the data.
    first_dupes = []
    for cid in bad_cids:
        df_first = df_dupes[df_dupes['cid'] == cid].iloc[0]
        first_dupes.append(df_first)
        
    first_dupes = pd.DataFrame(first_dupes)
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
    df_dedupe = pd.concat([df_dedupe, first_dupes])
    del first_dupes
    
    print('Number of rows: {:,}\n'.format(len(df_dedupe)))
    print('Number of columns: {:,}\n'.format(len(df_dedupe.columns)))
    
    del df_dupes
    df_dedupe.head()

If we've deduped properly then there should be only one row for each cid in _bad_cids_. Here we just check the first and leave it at that.

In [12]:
if len(bad_cids) > 0:
    # Exactly one row should remain for the first deduplicated cid.
    assert len(df_dedupe[df_dedupe['cid'] == bad_cids[0]]) == 1
    print('assertion passed')
    
    # delete temporary cid column
    df_dedupe.drop(['cid'],axis=1,inplace=True)
    del df_dragon_features
    df_dragon_features = df_dedupe

    print('Number of rows: {:,}\n'.format(len(df_dragon_features)))
    print('Number of columns: {:,}\n'.format(len(df_dragon_features.columns)))

    df_dragon_features.head()
else:
    # delete temporary cid column
    df_dragon_features.drop(['cid'],axis=1,inplace=True)
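
A more compact way to keep one row per cid is sketched below, under the assumption that every repeated cid is a true duplicate. Unlike the distance check above, this drops repeats without comparing their feature values, so it is not a drop-in replacement.

```python
import pandas as pd

# Toy frame indexed by cid, with cid 1 repeated.
toy = pd.DataFrame(
    {"f1": [0.5, 0.5, 1.0], "f2": [2.0, 2.0, 0.0]},
    index=pd.Index([1, 1, 2], name="cid"),
)

# Keep the first row for each repeated cid; no temporary 'cid' column is needed.
deduped = toy[~toy.index.duplicated(keep="first")]

print(deduped)
```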

<span style="font-weight:bold; font-size:17pt; color:darkgreen;">drug_features/</span><span style="font-weight:bold; font-size:17pt; color:darkblue;">fingerprints.csv</span>

In [13]:
df_fingerprints = load_data(data_loc+'drug_features/fingerprints.csv')

Number of rows: 91,756

Number of columns: 4,096

0 columns with missing values


column types:

[{'0': 'int64'}, {'1': 'int64'}, {'2': 'int64'}, {'3': 'int64'}, {'4': 'int64'}, {'5': 'int64'}, {'6': 'int64'}, {'7': 'int64'}, {'8': 'int64'}, {'9': 'int64'}] 


          0  1  2  3  4  5  6  7  8  9  ...  4086  4087  4088  4089  4090  \
cid                                     ...                                 
38258     0  0  0  0  0  0  0  0  0  0  ...     0     0     0     0     0   
23644997  0  0  0  0  0  0  0  1  0  0  ...     0     0     0     0     0   
76314488  0  1  0  0  0  0  0  0  0  0  ...     0     0     0     0     0   
46225960  0  0  0  0  0  0  0  0  0  0  ...     0     0     0     0     0   
3005573   0  0  0  0  0  0  0  0  0  0  ...     0     0     0     0     0   

          4091  4092  4093  4094  4095  
cid                                     
38258        0     0     0     0     0  
23644997     0     0     0     0     0  
76314488     0     0     0     0   

<span style="font-weight:bold; font-size:17pt; color:darkgreen;">protein_features/</span><span style="font-weight:bold; font-size:17pt; color:darkblue;">binding_sites_v1.0.csv</span>

In [14]:
df_binding_sites = load_data(data_loc+'protein_features/binding_sites_v1.0.csv')

Number of rows: 4,165

Number of columns: 8,481

0 columns with missing values


column types:

[{'AEK': 'float64'}, {'VEL': 'float64'}, {'EKF': 'float64'}, {'LGM': 'float64'}, {'VKN': 'float64'}, {'LKP': 'float64'}, {'NEE': 'float64'}, {'TPN': 'float64'}, {'SRL': 'float64'}, {'KEY': 'float64'}] 


             AEK       VEL       EKF       LGM       VKN       LKP       NEE  \
pid                                                                            
Q9WXS0  3.652353  3.626009  3.545504  4.156362  2.803784  2.811862  2.518908   
Q16206  3.133440 -1.000000  4.449457 -1.000000 -1.000000 -1.000000 -1.000000   
P37231  1.227430 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000   
P05556  2.200393 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000   
Q9UGH3  1.930957 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000   

             TPN       SRL       KEY  ...  FFV  MRW  YWT  AFF  LYW  *PX  SWH  \
pid                                   ...                  

<span style="font-weight:bold; font-size:17pt; color:darkgreen;">protein_features/</span><span style="font-weight:bold; font-size:17pt; color:darkblue;">expasy.csv</span>

In [15]:
df_expasy = load_data(data_loc+'protein_features/expasy.csv')

Number of rows: 4,201

Number of columns: 7

0 columns with missing values


column types:

[{'helical': 'float64'}, {'beta': 'float64'}, {'coil': 'float64'}, {'veryBuried': 'float64'}, {'veryExposed': 'float64'}, {'someBuried': 'float64'}, {'someExposed': 'float64'}] 


        helical   beta   coil  veryBuried  veryExposed  someBuried  \
pid                                                                  
10GS_A    0.536  0.096  0.368       0.292        0.254       0.234   
1A2C_H    0.089  0.378  0.533       0.313        0.301       0.212   
1A30_A    0.091  0.475  0.434       0.192        0.354       0.273   
1A42_A    0.143  0.313  0.544       0.286        0.263       0.224   
1A4G_A    0.000  0.428  0.572       0.387        0.192       0.277   

        someExposed  
pid                  
10GS_A        0.220  
1A2C_H        0.174  
1A30_A        0.182  
1A42_A        0.228  
1A4G_A        0.144  


<span style="font-weight:bold; font-size:17pt; color:darkgreen;">protein_features/</span><span style="font-weight:bold; font-size:17pt; color:darkblue;">profeat.csv</span>

In [16]:
df_profeat = load_data(data_loc+'protein_features/profeat.csv')

# Name the index 'pid' to allow joining to the other feature files later.
df_profeat.index.name = 'pid'

Number of rows: 4,167

Number of columns: 849

80 columns with missing values


column types:

[{'[G1.1.1.1]': 'float64'}, {'[G1.1.1.2]': 'float64'}, {'[G1.1.1.3]': 'float64'}, {'[G1.1.1.4]': 'float64'}, {'[G1.1.1.5]': 'float64'}, {'[G1.1.1.6]': 'float64'}, {'[G1.1.1.7]': 'float64'}, {'[G1.1.1.8]': 'float64'}, {'[G1.1.1.9]': 'float64'}, {'[G1.1.1.10]': 'float64'}] 


        [G1.1.1.1]  [G1.1.1.2]  [G1.1.1.3]  [G1.1.1.4]  [G1.1.1.5]  \
10GS_A    7.177033    1.913876    6.220096    4.784689    3.349282   
1A2C_H    4.633205    2.702703    6.177606    5.791506    3.474903   
1A30_A    3.030303    2.020202    4.040404    4.040404    2.020202   
1A42_A    5.019305    0.386100    7.335907    5.019305    4.633205   
1A4G_A    6.666667    4.358974    5.641026    6.410256    3.076923   

        [G1.1.1.6]  [G1.1.1.7]  [G1.1.1.8]  [G1.1.1.9]  [G1.1.1.10]  ...  \
10GS_A    8.612440    0.956938    3.349282    5.741627    15.311005  ...   
1A2C_H    8.494208    1.930502    6.177606    7.335907   

In [17]:
# profeat has some missing values.
s = df_profeat.isnull().sum(axis = 0)

print('number of columns containing missing values: {}'.format(len(s[s > 0])))

# Drop the rows that have missing values.
df_profeat.dropna(inplace=True)
print('number of rows remaining, without NaNs: {:,}'.format(len(df_profeat)))

number of columns containing missing values: 80
number of rows remaining, without NaNs: 4,161


<span style="font-weight:bold; font-size:17pt; color:darkgreen;">coronavirus_features/</span><span style="font-weight:bold; font-size:17pt; color:darkblue;">coronavirus_expasy.csv</span>

In [18]:
df_coronavirus_expasy = load_data(data_loc+'coronavirus_features/coronavirus_expasy.csv')

Number of rows: 29

Number of columns: 88

0 columns with missing values


column types:

[{'length': 'int64'}, {'weight': 'float64'}, {'pI': 'float64'}, {'A Total': 'int64'}, {'A Percent': 'float64'}, {'R Total': 'int64'}, {'R Percent': 'float64'}, {'N Total': 'int64'}, {'N Percent': 'float64'}, {'D Total': 'int64'}] 


          length     weight    pI  A Total  A Percent  R Total  R Percent  \
pid                                                                         
QHD43415    7096  794057.79  6.32      487        6.9      244        3.4   
QHD43416    1273  141178.47  6.24       79        6.2       42        3.3   
QHD43417     275   31122.94  5.55       13        4.7        6        2.2   
QHD43418      75    8365.04  8.57        4        5.3        3        4.0   
QHD43419     222   25146.62  9.51       19        8.6       14        6.3   

          N Total  N Percent  D Total  ...  chargedTotal  chargedPercent  \
pid                                    ...                   

<span style="font-weight:bold; font-size:17pt; color:darkgreen;">coronavirus_features/</span><span style="font-weight:bold; font-size:17pt; color:darkblue;">coronavirus_porter.csv</span>

In [19]:
df_coronavirus_porter = load_data(data_loc+'coronavirus_features/coronavirus_porter.csv')

Number of rows: 30

Number of columns: 7

0 columns with missing values


column types:

[{'helical': 'float64'}, {'beta': 'float64'}, {'coil': 'float64'}, {'veryBuried': 'float64'}, {'veryExposed': 'float64'}, {'someBuried': 'float64'}, {'someExposed': 'float64'}] 


          helical   beta   coil  veryBuried  veryExposed  someBuried  \
pid                                                                    
QHD43415    0.339  0.219  0.442       0.295        0.009       0.357   
QHD43416    0.245  0.312  0.443       0.436        0.106       0.287   
QHD43417    0.345  0.196  0.458       0.473        0.175       0.218   
QHD43418    0.653  0.000  0.347       0.040        0.787       0.080   
QHD43419    0.383  0.284  0.333       0.279        0.203       0.320   

          someExposed  
pid                    
QHD43415        0.339  
QHD43416        0.171  
QHD43417        0.135  
QHD43418        0.093  
QHD43419        0.198  


<span style="font-weight:bold; font-size:17pt; color:darkgreen;">coronavirus_features/</span><span style="font-weight:bold; font-size:17pt; color:darkblue;">coronavirus_profeat.csv</span>

In [20]:
df_coronavirus_profeat = load_data(data_loc+'coronavirus_features/coronavirus_profeat.csv')

Number of rows: 29

Number of columns: 849

0 columns with missing values


column types:

[{'[G1.1.1.1]': 'float64'}, {'[G1.1.1.2]': 'float64'}, {'[G1.1.1.3]': 'float64'}, {'[G1.1.1.4]': 'float64'}, {'[G1.1.1.5]': 'float64'}, {'[G1.1.1.6]': 'float64'}, {'[G1.1.1.7]': 'float64'}, {'[G1.1.1.8]': 'float64'}, {'[G1.1.1.9]': 'float64'}, {'[G1.1.1.10]': 'float64'}] 


          [G1.1.1.1]  [G1.1.1.2]  [G1.1.1.3]  [G1.1.1.4]  [G1.1.1.5]  \
QHD43415    6.863021    3.184893    5.481962    4.791432    4.918264   
QHD43416    6.205813    3.142184    4.870385    3.770621    6.048704   
QHD43417    4.727273    2.545455    4.727273    4.000000    5.090909   
QHD43418    5.333333    4.000000    1.333333    2.666667    6.666667   
QHD43419    8.558559    1.801802    2.702703    3.153153    4.954955   

          [G1.1.1.6]  [G1.1.1.7]  [G1.1.1.8]  [G1.1.1.9]  [G1.1.1.10]  ...  \
QHD43415    5.806088    2.043405    4.833709    6.116122     9.413754  ...   
QHD43416    6.441477    1.335428    5.970149 

## Join the data

Form the complete feature set by joining the data frames according to _cid_ and _pid_.

See the [data readme in the GitHub repository](https://github.com/BrianDavisMath/FDA-COVID19/tree/master/data).

<span style="font-weight:bold; font-size:12pt; color:darkblue;">Note:</span> By convention, the file features should be concatenated in the following order (for consistency): **binding_sites**, **expasy**, **profeat**, **dragon_features**, **fingerprints**.

### Example Feature Concatenation

In [21]:
# df_example_features = load_data(data_loc+'example_feature_concatenation.csv')
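
As a rough illustration of the concatenation convention, the sketch below joins miniature, made-up tables in the stated order (protein tables on _pid_ first, then drug tables on _cid_). All values and the one-column tables are assumptions; the real files have thousands of columns.

```python
import pandas as pd

interactions = pd.DataFrame(
    {"cid": [2264, 3878], "pid": ["P01106", "1C3B_A"], "activity": [1, 0]}
)

# Hypothetical miniature feature tables, indexed the same way as the real ones.
binding_sites = pd.DataFrame(
    {"AEK": [3.1, 1.2]}, index=pd.Index(["P01106", "1C3B_A"], name="pid")
)
expasy = pd.DataFrame(
    {"helical": [0.5, 0.1]}, index=pd.Index(["P01106", "1C3B_A"], name="pid")
)
dragon = pd.DataFrame(
    {"cid_MW": [100.0, 200.0]}, index=pd.Index([2264, 3878], name="cid")
)

# Join in the conventional order so the column order is reproducible.
features = interactions
for table, key in [(binding_sites, "pid"), (expasy, "pid"), (dragon, "cid")]:
    features = pd.merge(features, table, on=key, how="inner")

print(features.columns.tolist())
```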

### Let the merging begin

In [22]:
def print_merge_details(df_merge_result, df1_name, df2_name):
    # Report on the frame passed in rather than on the global df_features.
    print('Joining {} on protein {} yields {:,} rows and {:,} columns'.format(
        df1_name, df2_name, len(df_merge_result), len(df_merge_result.columns)))

<span style="font-weight:bold; font-size:12pt; color:darkblue;">df_interactions + df_binding_sites = df_features</span>

In [23]:
df_features = pd.merge(df_interactions, df_binding_sites, on='pid', how='inner')
print_merge_details(df_features, 'interactions', 'binding_sites')

Joining interactions on protein binding_sites yields 19,864 rows and 8,485 columns


<span style="font-weight:bold; font-size:12pt; color:darkblue;">df_features + df_expasy</span>

In [24]:
df_features = pd.merge(df_features, df_expasy, on='pid', how='inner')
print_merge_details(df_features, 'features', 'expasy')

Joining features on protein expasy yields 19,811 rows and 8,492 columns


<span style="font-weight:bold; font-size:12pt; color:darkblue;">df_features + df_profeat</span>

In [25]:
df_features = pd.merge(df_features, df_profeat, on='pid', how='inner')
print_merge_details(df_features, 'features', 'df_profeat')

Joining features on protein df_profeat yields 19,247 rows and 9,341 columns


<span style="font-weight:bold; font-size:12pt; color:darkblue;">df_features + df_dragon_features</span>

In [26]:
df_dragon_features.index.name = 'cid'
df_features = pd.merge(df_features, df_dragon_features, on='cid', how='inner')
print_merge_details(df_features, 'features', 'df_dragon_features')

Joining features on protein df_dragon_features yields 19,240 rows and 12,981 columns


<span style="font-weight:bold; font-size:12pt; color:darkblue;">df_features + df_fingerprints</span>

In [27]:
df_features = pd.merge(df_features, df_fingerprints, on='cid', how='inner')
print_merge_details(df_features, 'features', 'df_fingerprints')

Joining features on protein df_fingerprints yields 19,240 rows and 17,077 columns
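
Because duplicate keys in a feature table can silently inflate an inner join, it may help to let pandas check key uniqueness during the merge. The sketch below uses made-up frames and the `validate` argument of `pd.merge`; it is one possible safeguard, not something the notebook itself does.

```python
import pandas as pd

left = pd.DataFrame({"cid": [1, 2], "activity": [1, 0]})
# Right table with a repeated cid, mimicking duplicate cids in a feature file.
right = pd.DataFrame({"cid": [1, 1, 2], "cid_MW": [10.0, 10.0, 20.0]})

# Without validation the inner join silently grows beyond the left row count.
merged = pd.merge(left, right, on="cid", how="inner")
print(len(left), "->", len(merged))

# validate="many_to_one" requires the right-hand keys to be unique.
try:
    pd.merge(left, right, on="cid", how="inner", validate="many_to_one")
    caught = False
except pd.errors.MergeError:
    caught = True
print("duplicate keys detected:", caught)
```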


In [28]:
# Any missing values:
columns_missing_values = df_features.columns[df_features.isnull().any()].tolist()

print('{} columns with missing values: {}\n\n'.format(len(columns_missing_values), columns_missing_values))

df_features.head()

0 columns with missing values: []




Unnamed: 0,cid,pid,activity,sample_activity_score,AEK,VEL,EKF,LGM,VKN,LKP,...,4086,4087,4088,4089,4090,4091,4092,4093,4094,4095
0,2264,P01106,1,0.134146,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,...,0,0,0,0,0,0,0,0,0,0
1,1573,P01106,1,0.46748,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,...,0,0,0,0,0,0,0,0,0,0
2,1858,P01106,1,0.46748,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,...,0,0,0,0,0,0,0,0,0,0
3,1858,4KC3_B,1,0.767068,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,...,0,0,0,0,0,0,0,0,0,0
4,1858,P48039,0,0.394309,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,...,0,0,0,0,0,0,0,0,0,0


## Free up some memory

In [29]:
# release memory used by previous dataframes.
del df_interactions
del df_binding_sites
del df_expasy
del df_profeat
del df_dragon_features
del df_fingerprints

## Save features to file

In [30]:
store = pd.HDFStore(data_loc + 'validation_features.h5')
store['df'] = df_features
store.close()