# Exploration of Dimension Reduction
<hr>

This notebook covers exploratory data analysis (EDA), feature extraction and engineering, and the subsequent evaluation of dimension reduction techniques.

It assumes the data is in a sub-directory of the **/data** folder. I've already added entries for the data to the _.gitignore_ file so that it won't be committed to the repository. Note that this file should be updated for new versions of the data.

See the [data readme in the GitHub repository](https://github.com/BrianDavisMath/FDA-COVID19/tree/master/data) for more details.

<hr>

In [1]:
%pylab inline
%autosave 25

import pandas as pd

Populating the interactive namespace from numpy and matplotlib


Autosaving every 25 seconds


## Data location

Change this when you get a new data set.

In [2]:
data_loc = '../data/FDA-COVID19_files_v0.5/'
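
Because this path changes with each data release, a quick check that the expected files are present can save a confusing `FileNotFoundError` further down. The following is a minimal sketch, assuming the file layout used by the load cells in this notebook; `EXPECTED_FILES` and `missing_data_files` are hypothetical names introduced here, not part of the repository.

```python
from pathlib import Path

# Relative paths of the files loaded later in this notebook (assumed layout).
EXPECTED_FILES = [
    'interactions.csv',
    'fda_drug_cids.csv',
    'drug_features/dragon_features.csv',
    'drug_features/fingerprints.csv',
    'protein_features/binding_sites_v0.5.csv',
    'protein_features/expasy.csv',
    'protein_features/profeat.csv',
    'coronavirus_features/coronavirus_expasy.csv',
    'coronavirus_features/coronavirus_porter.csv',
    'coronavirus_features/coronavirus_profeat.csv',
]


def missing_data_files(data_dir, expected=EXPECTED_FILES):
    """Return the expected relative paths that do not exist under data_dir."""
    root = Path(data_dir)
    return [rel for rel in expected if not (root / rel).is_file()]
```

Calling `missing_data_files(data_loc)` should return an empty list when everything is in place. Note that the binding sites file name carries the data version, so the list would need updating along with `data_loc`.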

## Load the data
<hr>

In [3]:
def load_data(path, data_type=None):
    if data_type:
        df = pd.read_csv(path, index_col=0, dtype=data_type)
    else:
        df = pd.read_csv(path, index_col=0)
    print('Number of rows: {:,}\n'.format(len(df)))
    print('Number of columns: {:,}\n'.format(len(df.columns)))
    
    columns_missing_values = df.columns[df.isnull().any()].tolist()
    print('{} columns with missing values: {}\n\n'.format(len(columns_missing_values), columns_missing_values))
    
    cols = df.columns.tolist()
    column_types = [{col: df.dtypes[col].name} for col in cols]
    print('column types:\n')
    print(column_types, '\n\n')
    
    print(df.head())
    
    return df

<span style="font-weight:bold; font-size:17pt; color:darkblue;">interactions.csv</span>

In [4]:
df_interactions = load_data(data_loc+'interactions.csv')

# Rename the 'canonical_cid' column to 'cid' to simplify joining to the other feature sets later.
df_interactions.rename(columns={"canonical_cid": "cid"}, inplace=True)
df_interactions.head()

Number of rows: 189,312

Number of columns: 3

0 columns with missing values: []


column types:

[{'canonical_cid': 'int64'}, {'pid': 'object'}, {'activity': 'int64'}] 


   canonical_cid       pid  activity
0          38258  CAA96025         0
1       23644994    P11511         0
2       76314488    P31391         0
3       46225960    Q96DB2         0
4        3005573    P04798         1


|   | cid | pid | activity |
|---|-----|-----|----------|
| 0 | 38258 | CAA96025 | 0 |
| 1 | 23644994 | P11511 | 0 |
| 2 | 76314488 | P31391 | 0 |
| 3 | 46225960 | Q96DB2 | 0 |
| 4 | 3005573 | P04798 | 1 |


<span style="font-weight:bold; font-size:17pt; color:darkblue;">fda_drug_cids.csv</span>

In [5]:
df_fda_drug_cids = load_data(data_loc+'fda_drug_cids.csv')

Number of rows: 3,269

Number of columns: 1

0 columns with missing values: []


column types:

[{'cid': 'object'}] 


     cid
0  16078
1   4020
2   4021
3  60750
4   5988
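
The `cid` column loads as `object` here, whereas `cid` in `interactions.csv` is `int64`, so comparing or joining the two would need a common dtype first. A minimal sketch on toy data, assuming that any non-numeric entries should simply be dropped (this notebook does not say how they should be handled); `normalise_cids` is a hypothetical helper.

```python
import pandas as pd


def normalise_cids(cids):
    """Coerce a Series of compound ids to int64, dropping unparseable entries."""
    numeric = pd.to_numeric(cids, errors='coerce')  # non-numeric strings become NaN
    return numeric.dropna().astype('int64')


# Toy example: string ids with one entry that is not a number.
toy = pd.Series(['16078', '4020', 'not-a-cid'], name='cid')
clean = normalise_cids(toy)
print(clean.tolist())
print(clean.dtype)
```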


<span style="font-weight:bold; font-size:17pt; color:darkgreen;">drug_features/</span><span style="font-weight:bold; font-size:17pt; color:darkblue;">dragon_features.csv</span>

In [6]:
# Note: data_type must be set to object; otherwise pandas complains that the column types vary.
df_dragon_features = load_data(data_loc+'drug_features/dragon_features.csv', data_type=object)

Number of rows: 91,424

Number of columns: 3,839

0 columns with missing values: []


column types:

[{'MW': 'object'}, {'AMW': 'object'}, {'Sv': 'object'}, {'Se': 'object'}, {'Sp': 'object'}, {'Si': 'object'}, {'Mv': 'object'}, {'Me': 'object'}, {'Mp': 'object'}, {'Mi': 'object'}, {'GD': 'object'}, {'nAT': 'object'}, {'nSK': 'object'}, {'nTA': 'object'}, {'nBT': 'object'}, {'nBO': 'object'}, {'nBM': 'object'}, {'SCBO': 'object'}, {'RBN': 'object'}, {'RBF': 'object'}, {'nDB': 'object'}, {'nTB': 'object'}, {'nAB': 'object'}, {'nH': 'object'}, {'nC': 'object'}, {'nN': 'object'}, {'nO': 'object'}, {'nP': 'object'}, {'nS': 'object'}, {'nF': 'object'}, {'nCL': 'object'}, {'nBR': 'object'}, {'nI': 'object'}, {'nB': 'object'}, {'nHM': 'object'}, {'nHet': 'object'}, {'nX': 'object'}, {'H%': 'object'}, {'C%': 'object'}, {'N%': 'object'}, {'O%': 'object'}, {'X%': 'object'}, {'nCsp3': 'object'}, {'nCsp2': 'object'}, {'nCsp': 'object'}, {'nStructures': 'object'}, {'totalcharge': 'object'}, {'nCIC'

In [7]:
original_num_rows = len(df_dragon_features)

# Convert to numeric, replacing strings with NaNs
df_dragon_features = df_dragon_features.apply(pd.to_numeric, errors='coerce').copy()

columns_missing_values = df_dragon_features.columns[df_dragon_features.isnull().any()].tolist()
print('{} columns with missing values: {}\n\n'.format(len(columns_missing_values), columns_missing_values))

# Drop the rows that have missing values.
df_dragon_features.dropna(inplace=True)

print('number of rows remaining: {:,}, from {:,}'.format(len(df_dragon_features), original_num_rows))

3764 columns with missing values: ['Sv', 'Se', 'Mv', 'Me', 'ZM1', 'ZM1V', 'ZM1Kup', 'ZM1Mad', 'ZM1Per', 'ZM1MulPer', 'ZM2', 'ZM2V', 'ZM2Kup', 'ZM2Mad', 'ZM2Per', 'ZM2MulPer', 'ON0', 'ON0V', 'ON1', 'ON1V', 'Qindex', 'BBI', 'DBI', 'SNar', 'HNar', 'GNar', 'Xt', 'Dz', 'Ram', 'BLI', 'Pol', 'LPRS', 'MSD', 'SPI', 'PJI2', 'ECC', 'AECC', 'DECC', 'MDDD', 'UNIP', 'CENT', 'VAR', 'ICR', 'SMTI', 'SMTIV', 'GMTI', 'GMTIV', 'Xu', 'CSI', 'Wap', 'S1K', 'S2K', 'S3K', 'PHI', 'PW2', 'PW3', 'PW4', 'PW5', 'MAXDN', 'MAXDP', 'DELS', 'TIE', 'Psi_i_s', 'Psi_i_A', 'Psi_i_0', 'Psi_i_1', 'Psi_i_t', 'Psi_i_0d', 'Psi_i_1d', 'Psi_i_1s', 'Psi_e_A', 'Psi_e_0', 'Psi_e_1', 'Psi_e_t', 'Psi_e_0d', 'Psi_e_1d', 'Psi_e_1s', 'BAC', 'LOC', 'MWC01', 'MWC02', 'MWC03', 'MWC04', 'MWC05', 'MWC06', 'MWC07', 'MWC08', 'MWC09', 'MWC10', 'SRW02', 'SRW03', 'SRW04', 'SRW05', 'SRW06', 'SRW07', 'SRW08', 'SRW09', 'SRW10', 'MPC01', 'MPC02', 'MPC03', 'MPC04', 'MPC05', 'MPC06', 'MPC07', 'MPC08', 'MPC09', 'MPC10', 'piPC01', 'piPC02', 'piPC03', 'piP

number of rows remaining: 423, from 91,424
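
Dropping every row that has any missing value keeps only 423 of 91,424 rows, so it may be worth looking at how the missing values are spread across columns before settling on a strategy. A minimal sketch on toy data; the 50% threshold and the helper name `drop_sparse_columns` are assumptions made for illustration, not choices made in this notebook.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, max_missing_fraction=0.5):
    """Drop columns whose NaN fraction exceeds the threshold, then drop rows with NaNs."""
    missing_fraction = df.isnull().mean()  # per-column fraction of NaNs
    keep = missing_fraction[missing_fraction <= max_missing_fraction].index
    return df[keep].dropna()


toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [np.nan, np.nan, np.nan, 1.0],  # mostly missing
    'c': [1.0, np.nan, 3.0, 4.0],        # one missing value
})

rows_after_plain_dropna = len(toy.dropna())
shape_after_column_filter = drop_sparse_columns(toy).shape
print(rows_after_plain_dropna, shape_after_column_filter)
```

In the toy frame, filtering the sparse column first retains more rows at the cost of one feature; whether that trade-off suits the Dragon descriptors would need to be checked against the real data.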


<span style="font-weight:bold; font-size:17pt; color:darkgreen;">drug_features/</span><span style="font-weight:bold; font-size:17pt; color:darkblue;">fingerprints.csv</span>

In [8]:
df_fingerprints = load_data(data_loc+'drug_features/fingerprints.csv')

Number of rows: 91,756

Number of columns: 4,096

0 columns with missing values: []


column types:

[{'0': 'int64'}, {'1': 'int64'}, {'2': 'int64'}, {'3': 'int64'}, {'4': 'int64'}, {'5': 'int64'}, {'6': 'int64'}, {'7': 'int64'}, {'8': 'int64'}, {'9': 'int64'}, {'10': 'int64'}, {'11': 'int64'}, {'12': 'int64'}, {'13': 'int64'}, {'14': 'int64'}, {'15': 'int64'}, {'16': 'int64'}, {'17': 'int64'}, {'18': 'int64'}, {'19': 'int64'}, {'20': 'int64'}, {'21': 'int64'}, {'22': 'int64'}, {'23': 'int64'}, {'24': 'int64'}, {'25': 'int64'}, {'26': 'int64'}, {'27': 'int64'}, {'28': 'int64'}, {'29': 'int64'}, {'30': 'int64'}, {'31': 'int64'}, {'32': 'int64'}, {'33': 'int64'}, {'34': 'int64'}, {'35': 'int64'}, {'36': 'int64'}, {'37': 'int64'}, {'38': 'int64'}, {'39': 'int64'}, {'40': 'int64'}, {'41': 'int64'}, {'42': 'int64'}, {'43': 'int64'}, {'44': 'int64'}, {'45': 'int64'}, {'46': 'int64'}, {'47': 'int64'}, {'48': 'int64'}, {'49': 'int64'}, {'50': 'int64'}, {'51': 'int64'}, {'52': 'int64'}, {'53': 

<span style="font-weight:bold; font-size:17pt; color:darkgreen;">protein_features/</span><span style="font-weight:bold; font-size:17pt; color:darkblue;">binding_sites_v0.5.csv</span>

In [9]:
df_binding_sites = load_data(data_loc+'protein_features/binding_sites_v0.5.csv')

Number of rows: 2,743

Number of columns: 22

0 columns with missing values: []


column types:

[{'GLY': 'float64'}, {'ARG': 'float64'}, {'GLN': 'float64'}, {'GLU': 'float64'}, {'ILE': 'float64'}, {'ALA': 'float64'}, {'THR': 'float64'}, {'PRO': 'float64'}, {'ASP': 'float64'}, {'SER': 'float64'}, {'ASN': 'float64'}, {'LYS': 'float64'}, {'VAL': 'float64'}, {'CYS': 'float64'}, {'LEU': 'float64'}, {'TYR': 'float64'}, {'HIS': 'float64'}, {'MET': 'float64'}, {'PHE': 'float64'}, {'TRP': 'float64'}, {'GLX': 'float64'}, {'Unnamed: 22': 'float64'}] 


               GLY       ARG       GLN       GLU       ILE       ALA  \
pid                                                                    
ACM69038  5.561992  5.192893  5.750618  5.491832  4.448955  6.587973   
P42898    4.460553  5.578368  4.860317  5.276580  3.033603  7.342901   
P56696    4.877018  4.978129  4.814894  5.974694  4.478648  6.033997   
P0AD68    5.096210  3.579407  4.678623  4.083968  3.018757  6.718320   
P02774    3.885175 

<span style="font-weight:bold; font-size:17pt; color:darkgreen;">protein_features/</span><span style="font-weight:bold; font-size:17pt; color:darkblue;">expasy.csv</span>

In [10]:
df_expasy = load_data(data_loc+'protein_features/expasy.csv')

Number of rows: 4,201

Number of columns: 7

0 columns with missing values: []


column types:

[{'helical': 'float64'}, {'beta': 'float64'}, {'coil': 'float64'}, {'veryBuried': 'float64'}, {'veryExposed': 'float64'}, {'someBuried': 'float64'}, {'someExposed': 'float64'}] 


        helical   beta   coil  veryBuried  veryExposed  someBuried  \
pid                                                                  
10GS_A    0.536  0.096  0.368       0.292        0.254       0.234   
1A2C_H    0.089  0.378  0.533       0.313        0.301       0.212   
1A30_A    0.091  0.475  0.434       0.192        0.354       0.273   
1A42_A    0.143  0.313  0.544       0.286        0.263       0.224   
1A4G_A    0.000  0.428  0.572       0.387        0.192       0.277   

        someExposed  
pid                  
10GS_A        0.220  
1A2C_H        0.174  
1A30_A        0.182  
1A42_A        0.228  
1A4G_A        0.144  


<span style="font-weight:bold; font-size:17pt; color:darkgreen;">protein_features/</span><span style="font-weight:bold; font-size:17pt; color:darkblue;">profeat.csv</span>

In [11]:
df_profeat = load_data(data_loc+'protein_features/profeat.csv')

# Name the index 'pid' to allow joining to the other feature files later.
df_profeat.index.name = 'pid'

Number of rows: 4,167

Number of columns: 849

80 columns with missing values: ['[G7.1.1.1]', '[G7.1.1.2]', '[G7.1.1.3]', '[G7.1.1.4]', '[G7.1.1.5]', '[G7.1.1.6]', '[G7.1.1.7]', '[G7.1.1.8]', '[G7.1.1.9]', '[G7.1.1.10]', '[G7.1.1.11]', '[G7.1.1.12]', '[G7.1.1.13]', '[G7.1.1.14]', '[G7.1.1.15]', '[G7.1.1.16]', '[G7.1.1.17]', '[G7.1.1.18]', '[G7.1.1.19]', '[G7.1.1.20]', '[G7.1.1.21]', '[G7.1.1.22]', '[G7.1.1.23]', '[G7.1.1.24]', '[G7.1.1.25]', '[G7.1.1.26]', '[G7.1.1.27]', '[G7.1.1.28]', '[G7.1.1.29]', '[G7.1.1.30]', '[G7.1.1.31]', '[G7.1.1.32]', '[G7.1.1.33]', '[G7.1.1.34]', '[G7.1.1.35]', '[G7.1.1.36]', '[G7.1.1.37]', '[G7.1.1.38]', '[G7.1.1.39]', '[G7.1.1.40]', '[G7.1.1.41]', '[G7.1.1.42]', '[G7.1.1.43]', '[G7.1.1.44]', '[G7.1.1.45]', '[G7.1.1.46]', '[G7.1.1.47]', '[G7.1.1.48]', '[G7.1.1.49]', '[G7.1.1.50]', '[G7.1.1.51]', '[G7.1.1.52]', '[G7.1.1.53]', '[G7.1.1.54]', '[G7.1.1.55]', '[G7.1.1.56]', '[G7.1.1.57]', '[G7.1.1.58]', '[G7.1.1.59]', '[G7.1.1.60]', '[G7.1.1.61]', '[G7.1.1.62]',

In [12]:
# profeat has some missing values: count the columns that contain them.
s = df_profeat.isnull().sum(axis=0)

print('number of columns with missing values: {}'.format(len(s[s > 0])))

# Drop the rows that have missing values.
df_profeat.dropna(inplace=True)
print('number of rows remaining, without NaNs: {:,}'.format(len(df_profeat)))

number of columns with missing values: 80
number of rows remaining, without NaNs: 4,161


<span style="font-weight:bold; font-size:17pt; color:darkgreen;">coronavirus_features/</span><span style="font-weight:bold; font-size:17pt; color:darkblue;">coronavirus_expasy.csv</span>

In [13]:
df_coronavirus_expasy = load_data(data_loc+'coronavirus_features/coronavirus_expasy.csv')

Number of rows: 9

Number of columns: 88

0 columns with missing values: []


column types:

[{'length': 'int64'}, {'weight': 'float64'}, {'pI': 'float64'}, {'A Total': 'int64'}, {'A Percent': 'float64'}, {'R Total': 'int64'}, {'R Percent': 'float64'}, {'N Total': 'int64'}, {'N Percent': 'float64'}, {'D Total': 'int64'}, {'D Percent': 'float64'}, {'C Total': 'int64'}, {'C Percent': 'float64'}, {'Q Total': 'int64'}, {'Q Percent': 'float64'}, {'E Total': 'int64'}, {'E Percent': 'float64'}, {'G Total': 'int64'}, {'G Percent': 'float64'}, {'H Total': 'int64'}, {'H Percent': 'float64'}, {'I Total': 'int64'}, {'I Percent': 'float64'}, {'L Total': 'int64'}, {'L Percent': 'float64'}, {'K Total': 'int64'}, {'K Percent': 'float64'}, {'M Total': 'int64'}, {'M Percent': 'float64'}, {'F Total': 'int64'}, {'F Percent': 'float64'}, {'P Total': 'int64'}, {'P Percent': 'float64'}, {'S Total': 'int64'}, {'S Percent': 'float64'}, {'T Total': 'int64'}, {'T Percent': 'float64'}, {'W Total': 'int64'}, {'W P

<span style="font-weight:bold; font-size:17pt; color:darkgreen;">coronavirus_features/</span><span style="font-weight:bold; font-size:17pt; color:darkblue;">coronavirus_porter.csv</span>

In [14]:
df_coronavirus_porter = load_data(data_loc+'coronavirus_features/coronavirus_porter.csv')

Number of rows: 9

Number of columns: 7

0 columns with missing values: []


column types:

[{'helical': 'float64'}, {'beta': 'float64'}, {'coil': 'float64'}, {'veryBuried': 'float64'}, {'veryExposed': 'float64'}, {'someBuried': 'float64'}, {'someExposed': 'float64'}] 


          helical   beta   coil  veryBuried  veryExposed  someBuried  \
pid                                                                    
QHD43415    0.339  0.219  0.442       0.295        0.009       0.357   
QHD43416    0.245  0.312  0.443       0.436        0.106       0.287   
QHD43417    0.345  0.196  0.458       0.473        0.175       0.218   
QHD43418    0.653  0.000  0.347       0.040        0.787       0.080   
QHD43419    0.383  0.284  0.333       0.279        0.203       0.320   

          someExposed  
pid                    
QHD43415        0.339  
QHD43416        0.171  
QHD43417        0.135  
QHD43418        0.093  
QHD43419        0.198  


<span style="font-weight:bold; font-size:17pt; color:darkgreen;">coronavirus_features/</span><span style="font-weight:bold; font-size:17pt; color:darkblue;">coronavirus_profeat.csv</span>

In [15]:
df_coronavirus_profeat = load_data(data_loc+'coronavirus_features/coronavirus_profeat.csv')

Number of rows: 9

Number of columns: 849

0 columns with missing values: []


column types:

[{'[G1.1.1.1]': 'float64'}, {'[G1.1.1.2]': 'float64'}, {'[G1.1.1.3]': 'float64'}, {'[G1.1.1.4]': 'float64'}, {'[G1.1.1.5]': 'float64'}, {'[G1.1.1.6]': 'float64'}, {'[G1.1.1.7]': 'float64'}, {'[G1.1.1.8]': 'float64'}, {'[G1.1.1.9]': 'float64'}, {'[G1.1.1.10]': 'float64'}, {'[G1.1.1.11]': 'float64'}, {'[G1.1.1.12]': 'float64'}, {'[G1.1.1.13]': 'float64'}, {'[G1.1.1.14]': 'float64'}, {'[G1.1.1.15]': 'float64'}, {'[G1.1.1.16]': 'float64'}, {'[G1.1.1.17]': 'float64'}, {'[G1.1.1.18]': 'float64'}, {'[G1.1.1.19]': 'float64'}, {'[G1.1.1.20]': 'float64'}, {'[G2.1.1.1]': 'float64'}, {'[G2.1.1.2]': 'float64'}, {'[G2.1.1.3]': 'float64'}, {'[G2.1.1.4]': 'float64'}, {'[G2.1.1.5]': 'float64'}, {'[G2.1.1.6]': 'float64'}, {'[G2.1.1.7]': 'float64'}, {'[G2.1.1.8]': 'float64'}, {'[G2.1.1.9]': 'float64'}, {'[G2.1.1.10]': 'float64'}, {'[G2.1.1.11]': 'float64'}, {'[G2.1.1.12]': 'float64'}, {'[G2.1.1.13]': 'float64'},

## Join the data

Form the complete feature set by joining the data frames according to _cid_ and _pid_.

See the [data readme in the GitHub repository](https://github.com/BrianDavisMath/FDA-COVID19/tree/master/data).

<span style="font-weight:bold; font-size:12pt; color:darkblue;">Note:</span> By convention, the file features should be concatenated in the following order (for consistency): **binding_sites**, **expasy**, **profeat**, **dragon_features**, **fingerprints**.
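
The conventional order can be sketched on toy data as follows. This is only an illustration: the frames and values are made up, each has a single feature column, and the sketch joins with `left_on=` and `right_index=True`, which assumes the feature frames are indexed by their key (the cells in this notebook use `on=` instead).

```python
import pandas as pd

# Toy stand-ins for the real frames: protein features are indexed by 'pid',
# drug features by 'cid'. All values are invented for illustration.
pids = pd.Index(['P1', 'P2'], name='pid')
cids = pd.Index([1, 2], name='cid')

interactions = pd.DataFrame({'cid': [1, 2], 'pid': ['P1', 'P2'], 'activity': [0, 1]})
binding_sites = pd.DataFrame({'GLY': [5.5, 4.4]}, index=pids)
expasy = pd.DataFrame({'helical': [0.5, 0.1]}, index=pids)
profeat = pd.DataFrame({'[G1.1.1.1]': [7.0, 8.0]}, index=pids)
dragon_features = pd.DataFrame({'MW': [100.0, 200.0]}, index=cids)
fingerprints = pd.DataFrame({'0': [0, 1]}, index=cids)

# Join in the conventional order: protein features first, then drug features.
ordered_frames = [
    (binding_sites, 'pid'),
    (expasy, 'pid'),
    (profeat, 'pid'),
    (dragon_features, 'cid'),
    (fingerprints, 'cid'),
]

features = interactions
for frame, key in ordered_frames:
    features = pd.merge(features, frame, left_on=key, right_index=True, how='inner')

feature_columns = [c for c in features.columns if c not in ('cid', 'pid', 'activity')]
print(feature_columns)
```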

### Example Feature Concatenation

In [16]:
df_example_features = load_data(data_loc+'example_feature_concatenation.csv')

Number of rows: 8,813

Number of columns: 1

0 columns with missing values: []


column types:

[{'0': 'object'}] 


                    0
0  3.4248720461173585
1   3.612447614836665
2   4.810351184958711
3   5.747807206060877
4  1.9052810017181654


### Let the merging begin

In [17]:
def print_merge_details(df_merge_result, df1_name, df2_name):
    # Report the shape of the frame passed in rather than the global df_features.
    print('Joining {} on protein {} yields {:,} rows and {:,} columns'.format(
        df1_name, df2_name, len(df_merge_result), len(df_merge_result.columns)))

<span style="font-weight:bold; font-size:12pt; color:darkblue;">df_interactions + df_binding_sites = df_features</span>

In [18]:
df_features = pd.merge(df_interactions, df_binding_sites, on='pid', how='inner')
print_merge_details(df_features, 'interactions', 'binding_sites')

Joining interactions on protein binding_sites yields 139,019 rows and 25 columns
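
The inner join keeps 139,019 of the 189,312 interaction rows, which suggests that some `pid` values have no binding-site features. One way to see which keys drop out is a left join with `indicator=True`. A minimal sketch on toy data, assuming the right-hand frame is indexed by the key; `unmatched_keys` is a hypothetical helper.

```python
import pandas as pd


def unmatched_keys(left, right, key):
    """Return the sorted unique values of left[key] that are absent from right's index."""
    right_keys = pd.DataFrame({key: right.index.unique()})
    merged = pd.merge(left[[key]], right_keys, on=key, how='left', indicator=True)
    # '_merge' is 'left_only' for rows that found no partner on the right.
    return sorted(merged.loc[merged['_merge'] == 'left_only', key].unique())


# Toy data: 'P2' has no binding-site features.
toy_interactions = pd.DataFrame({'pid': ['P1', 'P2', 'P3', 'P1'], 'activity': [0, 1, 0, 1]})
toy_binding_sites = pd.DataFrame({'GLY': [5.5, 4.4]}, index=pd.Index(['P1', 'P3'], name='pid'))

lost = unmatched_keys(toy_interactions, toy_binding_sites, 'pid')
print(lost)
```

The same check could be repeated before each of the later merges, since every inner join can shrink the row count further.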


<span style="font-weight:bold; font-size:12pt; color:darkblue;">df_features + df_expasy</span>

In [19]:
df_features = pd.merge(df_features, df_expasy, on='pid', how='inner')
print_merge_details(df_features, 'features', 'expasy')

Joining features on protein expasy yields 138,824 rows and 32 columns


<span style="font-weight:bold; font-size:12pt; color:darkblue;">df_features + df_profeat</span>

In [20]:
df_features = pd.merge(df_features, df_profeat, on='pid', how='inner')
print_merge_details(df_features, 'features', 'df_profeat')

Joining features on protein df_profeat yields 135,657 rows and 881 columns


<span style="font-weight:bold; font-size:12pt; color:darkblue;">df_features + df_dragon_features</span>

In [21]:
df_features = pd.merge(df_features, df_dragon_features, on='cid', how='inner')
print_merge_details(df_features, 'features', 'df_dragon_features')

Joining features on protein df_dragon_features yields 3,363 rows and 4,720 columns


<span style="font-weight:bold; font-size:12pt; color:darkblue;">df_features + df_fingerprints</span>

In [22]:
df_features = pd.merge(df_features, df_fingerprints, on='cid', how='inner')
print_merge_details(df_features, 'features', 'df_fingerprints')

Joining features on protein df_fingerprints yields 3,363 rows and 8,816 columns


In [23]:
# Any missing values:
columns_missing_values = df_features.columns[df_features.isnull().any()].tolist()

print('{} columns with missing values: {}\n\n'.format(len(columns_missing_values), columns_missing_values))

0 columns with missing values: []




## Experiment 1

Use XGBoost for feature extraction.

(1) split into train and test sets <br>
(2) scale the training set <br>
(3) scale the test set with the same scaler <br>
(4) train the model
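
The four steps above can be sketched on synthetic data as follows. This is an illustration under stated assumptions, not the experiment itself: the data is random, the split ratio is arbitrary, and scikit-learn's `GradientBoostingClassifier` is used as a stand-in for XGBoost so that the sketch depends only on scikit-learn.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic data: 200 samples, 5 features; only feature 0 determines the label.
rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = (X[:, 0] > 0.5).astype(int)

# (1) Split into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# (2) Fit the scaler on the training set only, then
# (3) reuse the same fitted scaler on the test set.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# (4) Train the model (a stand-in for XGBoost) and look at feature importances.
model = GradientBoostingClassifier(random_state=0)
model.fit(X_train_scaled, y_train)

test_accuracy = model.score(X_test_scaled, y_test)
most_important_feature = int(model.feature_importances_.argmax())
print(test_accuracy, most_important_feature)
```

Fitting the scaler on the training split alone keeps information about the test set out of the preprocessing, which is the reason for steps (2) and (3) being separate.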

In [28]:
from sklearn.preprocessing import MinMaxScaler

# Select the numeric columns, excluding the label ('activity') and the compound id ('cid').
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
number_cols = df_features.select_dtypes(include=numerics).columns.tolist()
number_cols.remove('activity')
number_cols.remove('cid')

print('Number of numeric columns: {:,}'.format(len(number_cols)))

df_data = df_features[number_cols]

Number of numeric columns: 8,813
