# Exploration of Dimension Reduction
<hr>

This notebook covers exploratory data analysis (EDA), feature extraction and engineering, and the subsequent evaluation of dimension reduction techniques.

It assumes the data is in a sub-directory of the **/data** folder. I've already added entries for the data files to the _.gitignore_ file so that they won't be committed to the repository. Note that _.gitignore_ should be updated for new versions of the data.

See the [data readme in the GitHub repository](https://github.com/BrianDavisMath/FDA-COVID19/tree/master/data) for more details.

<hr>

In [1]:
%pylab inline
%autosave 25

import pandas as pd

Populating the interactive namespace from numpy and matplotlib


Autosaving every 25 seconds


## Data location

Change this when you get a new data set.

In [4]:
data_loc = '../data/FDA-COVID19_files_v0.5/'
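
Before loading anything, it can help to confirm that the new data set has the expected layout. The following is a minimal sketch, not part of the original notebook: the helper `missing_files` is hypothetical, and the file list only contains the version-independent file names that are loaded further down (the binding-sites file carries a version suffix, so it is left out here).

```python
from pathlib import Path

# File names taken from the load cells in this notebook.
# The list and the helper below are illustrative assumptions.
EXPECTED_FILES = [
    'interactions.csv',
    'fda_drug_cids.csv',
    'drug_features/dragon_features.csv',
    'drug_features/fingerprints.csv',
    'protein_features/expasy.csv',
    'protein_features/profeat.csv',
]


def missing_files(data_loc, expected=EXPECTED_FILES):
    """Return the expected files that are not present under data_loc."""
    root = Path(data_loc)
    return [name for name in expected if not (root / name).is_file()]


missing = missing_files('../data/FDA-COVID19_files_v0.5/')
if missing:
    print('Missing files:', missing)
```

An empty list means every listed file was found; anything else points at a path to fix before running the cells below.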

## Load the data
<hr>

In [31]:
def load_data(path, data_type=None):
    """Read a CSV with its first column as the index, print a summary, and return it."""
    # dtype=None is the pandas default (infer column types), so no branch is needed.
    df = pd.read_csv(path, index_col=0, dtype=data_type)
    print('Number of rows: {:,}\n'.format(len(df)))
    print('Number of columns: {:,}\n\n'.format(len(df.columns)))
    print(df.head())
    return df

<span style="font-weight:bold; font-size:17pt; color:darkblue;">interactions.csv</span>

In [32]:
df_interactions = load_data(data_loc+'interactions.csv')

Number of rows: 189,312

Number of columns: 3


   canonical_cid       pid  activity
0          38258  CAA96025         0
1       23644994    P11511         0
2       76314488    P31391         0
3       46225960    Q96DB2         0
4        3005573    P04798         1


<span style="font-weight:bold; font-size:17pt; color:darkblue;">fda_drug_cids.csv</span>

In [33]:
df_fda_drug_cids = load_data(data_loc+'fda_drug_cids.csv')

Number of rows: 3,269

Number of columns: 1


     cid
0  16078
1   4020
2   4021
3  60750
4   5988


<span style="font-weight:bold; font-size:17pt; color:darkgreen;">drug_features/</span><span style="font-weight:bold; font-size:17pt; color:darkblue;">dragon_features.csv</span>

In [44]:
df_dragon_features = load_data(data_loc+'drug_features/dragon_features.csv', data_type=object)

Number of rows: 91,424

Number of columns: 3,839


              MW                AMW                  Sv                  Se  \
cid                                                                           
72792562  474.67  6.781000000000001  41.038999999999994              70.101   
44394609  546.48              8.674              43.185  63.538000000000004   
378422    410.52              7.331               34.74   56.43600000000001   
57888919  451.06              6.834              38.685              65.858   
54581291  456.58              8.615              36.234               53.52   

                          Sp                 Si     Mv                  Me  \
cid                                                                          
72792562   43.54600000000001  80.52199999999999  0.586               1.001   
44394609  45.233000000000004             69.993  0.685  1.0090000000000001   
378422                36.216             63.398   0.62               1.008   
57888
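
Because the cell above passes `data_type=object`, every Dragon column is read as strings, so the values need converting before any numerical method can use them. The following is a minimal sketch on toy data; the helper name `to_numeric_frame` is hypothetical, and treating unparseable cells as missing is an assumption rather than something the notebook specifies.

```python
import pandas as pd


def to_numeric_frame(df):
    """Return a copy with every column parsed as numbers.

    Cells that cannot be parsed become NaN (errors='coerce').
    """
    return df.apply(pd.to_numeric, errors='coerce')


# Toy stand-in for df_dragon_features: string values, one unparseable cell.
toy = pd.DataFrame(
    {'MW': ['474.67', '546.48'], 'AMW': ['6.781', 'na']},
    index=pd.Index([1, 2], name='cid'),
)

numeric = to_numeric_frame(toy)
print(numeric.dtypes)        # both columns are now float64
print(numeric.isna().sum())  # count of cells that failed to parse, per column
```

Counting the NaN values per column afterwards shows which descriptors failed to parse and may need to be dropped or imputed.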

<span style="font-weight:bold; font-size:17pt; color:darkgreen;">drug_features/</span><span style="font-weight:bold; font-size:17pt; color:darkblue;">fingerprints.csv</span>

In [43]:
df_fingerprints = load_data(data_loc+'drug_features/fingerprints.csv')

Number of rows: 91,756

Number of columns: 4,096


          0  1  2  3  4  5  6  7  8  9  ...   4086  4087  4088  4089  4090  \
cid                                     ...                                  
38258     0  0  0  0  0  0  0  0  0  0  ...      0     0     0     0     0   
23644997  0  0  0  0  0  0  0  1  0  0  ...      0     0     0     0     0   
76314488  0  1  0  0  0  0  0  0  0  0  ...      0     0     0     0     0   
46225960  0  0  0  0  0  0  0  0  0  0  ...      0     0     0     0     0   
3005573   0  0  0  0  0  0  0  0  0  0  ...      0     0     0     0     0   

          4091  4092  4093  4094  4095  
cid                                     
38258        0     0     0     0     0  
23644997     0     0     0     0     0  
76314488     0     0     0     0     1  
46225960     0     0     0     0     0  
3005573      0     0     0     0     0  

[5 rows x 4096 columns]


<span style="font-weight:bold; font-size:17pt; color:darkgreen;">protein_features/</span><span style="font-weight:bold; font-size:17pt; color:darkblue;">binding_sites_v0.5.csv</span>

In [37]:
df_binding_sites = load_data(data_loc+'protein_features/binding_sites_v0.5.csv')

Number of rows: 2,743

Number of columns: 22


               GLY       ARG       GLN       GLU       ILE       ALA  \
pid                                                                    
ACM69038  5.561992  5.192893  5.750618  5.491832  4.448955  6.587973   
P42898    4.460553  5.578368  4.860317  5.276580  3.033603  7.342901   
P56696    4.877018  4.978129  4.814894  5.974694  4.478648  6.033997   
P0AD68    5.096210  3.579407  4.678623  4.083968  3.018757  6.718320   
P02774    3.885175  5.490260  4.814894  5.771078 -1.000000  6.142619   

               THR       PRO       ASP       SER     ...            VAL  \
pid                                                  ...                  
ACM69038  4.868316  5.011431  4.481827  5.000618     ...       3.171188   
P42898    0.268170  4.811418  4.339997  5.089027     ...       3.379818   
P56696    4.527946  6.211507  5.537042  5.459238     ...       4.469914   
P0AD68    3.078797  3.772463  4.300284  5.089027     ...       3.781432  

<span style="font-weight:bold; font-size:17pt; color:darkgreen;">protein_features/</span><span style="font-weight:bold; font-size:17pt; color:darkblue;">expasy.csv</span>

In [38]:
df_expasy = load_data(data_loc+'protein_features/expasy.csv')

Number of rows: 4,201

Number of columns: 7


        helical   beta   coil  veryBuried  veryExposed  someBuried  \
pid                                                                  
10GS_A    0.536  0.096  0.368       0.292        0.254       0.234   
1A2C_H    0.089  0.378  0.533       0.313        0.301       0.212   
1A30_A    0.091  0.475  0.434       0.192        0.354       0.273   
1A42_A    0.143  0.313  0.544       0.286        0.263       0.224   
1A4G_A    0.000  0.428  0.572       0.387        0.192       0.277   

        someExposed  
pid                  
10GS_A        0.220  
1A2C_H        0.174  
1A30_A        0.182  
1A42_A        0.228  
1A4G_A        0.144  


<span style="font-weight:bold; font-size:17pt; color:darkgreen;">protein_features/</span><span style="font-weight:bold; font-size:17pt; color:darkblue;">profeat.csv</span>

In [39]:
df_profeat = load_data(data_loc+'protein_features/profeat.csv')

Number of rows: 4,167

Number of columns: 849


        [G1.1.1.1]  [G1.1.1.2]  [G1.1.1.3]  [G1.1.1.4]  [G1.1.1.5]  \
10GS_A    7.177033    1.913876    6.220096    4.784689    3.349282   
1A2C_H    4.633205    2.702703    6.177606    5.791506    3.474903   
1A30_A    3.030303    2.020202    4.040404    4.040404    2.020202   
1A42_A    5.019305    0.386100    7.335907    5.019305    4.633205   
1A4G_A    6.666667    4.358974    5.641026    6.410256    3.076923   

        [G1.1.1.6]  [G1.1.1.7]  [G1.1.1.8]  [G1.1.1.9]  [G1.1.1.10]  \
10GS_A    8.612440    0.956938    3.349282    5.741627    15.311005   
1A2C_H    8.494208    1.930502    6.177606    7.335907     7.722008   
1A30_A   13.131313    1.010101   15.151515    7.070707    10.101010   
1A42_A    8.494208    4.633205    3.474903    9.266409    10.038610   
1A4G_A   10.256410    3.076923    6.923077    6.153846     5.384615   

           ...       [G7.1.1.71]  [G7.1.1.72]  [G7.1.1.73]  [G7.1.1.74]  \
10GS_A     ...         -0.001

<span style="font-weight:bold; font-size:17pt; color:darkgreen;">coronavirus_features/</span><span style="font-weight:bold; font-size:17pt; color:darkblue;">coronavirus_expasy.csv</span>

In [40]:
df_coronavirus_expasy = load_data(data_loc+'coronavirus_features/coronavirus_expasy.csv')

Number of rows: 9

Number of columns: 88


          length     weight    pI  A Total  A Percent  R Total  R Percent  \
pid                                                                         
QHD43415    7096  794057.79  6.32      487        6.9      244        3.4   
QHD43416    1273  141178.47  6.24       79        6.2       42        3.3   
QHD43417     275   31122.94  5.55       13        4.7        6        2.2   
QHD43418      75    8365.04  8.57        4        5.3        3        4.0   
QHD43419     222   25146.62  9.51       19        8.6       14        6.3   

          N Total  N Percent  D Total         ...          chargedTotal  \
pid                                           ...                         
QHD43415      384        5.4      389         ...                  1552   
QHD43416       88        6.9       62         ...                   230   
QHD43417        8        2.9       13         ...                    49   
QHD43418        5        6.7        1     

<span style="font-weight:bold; font-size:17pt; color:darkgreen;">coronavirus_features/</span><span style="font-weight:bold; font-size:17pt; color:darkblue;">coronavirus_porter.csv</span>

In [41]:
df_coronavirus_porter = load_data(data_loc+'coronavirus_features/coronavirus_porter.csv')

Number of rows: 9

Number of columns: 7


          helical   beta   coil  veryBuried  veryExposed  someBuried  \
pid                                                                    
QHD43415    0.339  0.219  0.442       0.295        0.009       0.357   
QHD43416    0.245  0.312  0.443       0.436        0.106       0.287   
QHD43417    0.345  0.196  0.458       0.473        0.175       0.218   
QHD43418    0.653  0.000  0.347       0.040        0.787       0.080   
QHD43419    0.383  0.284  0.333       0.279        0.203       0.320   

          someExposed  
pid                    
QHD43415        0.339  
QHD43416        0.171  
QHD43417        0.135  
QHD43418        0.093  
QHD43419        0.198  


<span style="font-weight:bold; font-size:17pt; color:darkgreen;">coronavirus_features/</span><span style="font-weight:bold; font-size:17pt; color:darkblue;">coronavirus_profeat.csv</span>

In [42]:
df_coronavirus_profeat = load_data(data_loc+'coronavirus_features/coronavirus_profeat.csv')

Number of rows: 9

Number of columns: 849


          [G1.1.1.1]  [G1.1.1.2]  [G1.1.1.3]  [G1.1.1.4]  [G1.1.1.5]  \
QHD43415    6.863021    3.184893    5.481962    4.791432    4.918264   
QHD43416    6.205813    3.142184    4.870385    3.770621    6.048704   
QHD43417    4.727273    2.545455    4.727273    4.000000    5.090909   
QHD43418    5.333333    4.000000    1.333333    2.666667    6.666667   
QHD43419    8.558559    1.801802    2.702703    3.153153    4.954955   

          [G1.1.1.6]  [G1.1.1.7]  [G1.1.1.8]  [G1.1.1.9]  [G1.1.1.10]  \
QHD43415    5.806088    2.043405    4.833709    6.116122     9.413754   
QHD43416    6.441477    1.335428    5.970149    4.791830     8.483896   
QHD43417    5.090909    2.909091    7.636364    4.000000    10.909091   
QHD43418    1.333333    0.000000    4.000000    2.666667    18.666667   
QHD43419    6.306306    2.252252    9.009009    3.153153    15.765766   

             ...       [G7.1.1.71]  [G7.1.1.72]  [G7.1.1.73]  [G7.1.1.74]  \
QHD4341

## Join the data

Form the complete feature set by joining the drug feature frames on _cid_ and the protein feature frames on _pid_.

See the [data readme in the GitHub repository](https://github.com/BrianDavisMath/FDA-COVID19/tree/master/data).
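
One way to do this join is sketched below on toy data. It assumes that `canonical_cid` in the interactions table corresponds to the `cid` index of the drug feature frames, and that `pid` matches the index of the protein feature frames; the helper `join_features` is hypothetical, and the inner join (which drops interactions lacking features) is a design choice, not something the readme prescribes.

```python
import pandas as pd


def join_features(interactions, drug_features, protein_features):
    """Attach drug and protein features to each interaction row.

    Assumes drug_features is indexed by cid and protein_features by pid.
    Interactions without a match in either frame are dropped (inner join).
    """
    joined = interactions.merge(
        drug_features, left_on='canonical_cid', right_index=True, how='inner')
    joined = joined.merge(
        protein_features, left_on='pid', right_index=True, how='inner')
    return joined.reset_index(drop=True)


# Toy frames that mimic the shapes loaded above (values are made up).
toy_interactions = pd.DataFrame({
    'canonical_cid': [1, 2, 3],
    'pid': ['P1', 'P2', 'P3'],
    'activity': [0, 1, 0],
})
toy_drug = pd.DataFrame({'MW': [100.0, 200.0]},
                        index=pd.Index([1, 2], name='cid'))
toy_protein = pd.DataFrame({'helical': [0.5, 0.1, 0.9]},
                           index=pd.Index(['P1', 'P2', 'P9'], name='pid'))

joined = join_features(toy_interactions, toy_drug, toy_protein)
print(joined)
```

Comparing `len(joined)` with `len(df_interactions)` on the real data would show how many interactions are lost because a drug or protein has no features.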

### Example Feature Concatenation

In [46]:
df_example_features = load_data(data_loc+'example_feature_concatenation.csv')

Number of rows: 8,813

Number of columns: 1


                    0
0  3.4248720461173585
1   3.612447614836665
2   4.810351184958711
3   5.747807206060877
4  1.9052810017181654
