# Molecular Descriptors
---

From the DREAM paper:

    Of the 18 teams who submitted models to predict individual perception, Team GuanLab (author Y.G.) was the best performer with a Z-score of 34.18 (Fig. 1H and table S1). Team IKW Allstars (author R.C.G.) was the best performer of 19 teams to submit models to predict population perception, with a Z-score of 8.87 (Fig. 1H and table S1). The aggregation of all participant models gave Z-scores of 34.02 (individual) and 9.17 (population) (Fig. 1H), and a postchallenge community phase where initial models and additional molecular features were shared across teams gave even better models with Z-scores of 36.45 (individual) and 9.92 (population) (Fig. 1H).
    
Team GuanLab (Winner of Individual Prediction) pre-processed these descriptors in the following manner:

- Eliminate molecular descriptors with 'non responses', negative values, or identical values for all compounds. This reduces the number of descriptors from about 5,000 to about 900.
- Normalize molecular values by the variance of the attribute and take the square root.
- In a paper [cite] published later (2018), the team reported using min-max normalisation: x′ = (x − min(x)) / (max(x) − min(x)).
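
The GuanLab filtering and min-max steps listed above could be sketched roughly as follows. This is only an illustration, not the team's code: it assumes 'non responses' means NaN entries, and the helper names `guanlab_style_filter` and `min_max_scale` as well as the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def guanlab_style_filter(descriptors: pd.DataFrame) -> pd.DataFrame:
    """Keep columns that are fully observed, non-negative and non-constant.

    Assumption: 'non responses' are represented as NaN.
    """
    no_missing = descriptors.notna().all(axis=0)
    non_negative = (descriptors >= 0).all(axis=0)
    non_constant = descriptors.nunique(dropna=False) > 1
    return descriptors.loc[:, no_missing & non_negative & non_constant]


def min_max_scale(descriptors: pd.DataFrame) -> pd.DataFrame:
    """Rescale each column to [0, 1]; assumes no constant columns remain."""
    col_min = descriptors.min(axis=0)
    col_range = descriptors.max(axis=0) - col_min
    return (descriptors - col_min) / col_range


# Hypothetical toy data: one usable column pair plus one column per failure mode
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [0.0, 0.0, 0.0],      # identical values for all compounds
    "c": [1.0, np.nan, 2.0],   # contains a missing value
    "d": [-1.0, 0.5, 2.0],     # contains a negative value
    "e": [10.0, 20.0, 40.0],
})

filtered = guanlab_style_filter(toy)
scaled = min_max_scale(filtered)
print(scaled)
```

Filtering out constant columns before scaling matters because a constant column has max equal to min, so the min-max formula would divide by zero.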

Team IKW Allstars (Winner of Population Prediction) pre-processed these descriptors in the following manner:
 
- Load molecular descriptor file into a bunch of matrices. Each entry in a matrix is one observation, or the mean of observations in the case of replicates.
- Discard any columns (descriptors) that contained too many NaN entries.
- For remaining columns, perform median imputation to convert NaNs to real values.
- Cube root transform data, then normalize each column to mean 0, variance 1.
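
A minimal sketch of the IKW Allstars steps listed above, under stated assumptions: the source does not say how many NaNs count as "too many", so the `max_nan_fraction` threshold of 0.5 is a placeholder, and dropping zero-variance columns before standardising is an extra safeguard added here. The function name `ikw_style_preprocess` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def ikw_style_preprocess(descriptors: pd.DataFrame,
                         max_nan_fraction: float = 0.5) -> pd.DataFrame:
    """Drop sparse columns, median-impute, cube-root, then standardise."""
    # 1. Discard columns with too many NaNs (threshold is an assumption)
    nan_fraction = descriptors.isna().mean(axis=0)
    kept = descriptors.loc[:, nan_fraction <= max_nan_fraction]

    # 2. Median imputation, column by column
    imputed = kept.fillna(kept.median(axis=0))

    # 3. Cube root transform (defined for negative values too)
    rooted = np.cbrt(imputed)

    # 4. Standardise to mean 0, variance 1; zero-variance columns are
    #    dropped first to avoid dividing by zero (added safeguard)
    std = rooted.std(axis=0, ddof=0)
    rooted = rooted.loc[:, std > 0]
    return (rooted - rooted.mean(axis=0)) / std[std > 0]


# Hypothetical toy data
toy = pd.DataFrame({
    "x": [1.0, 8.0, 27.0, 64.0],
    "y": [np.nan, np.nan, np.nan, 1.0],   # 75% NaN, so it is discarded
    "z": [0.0, np.nan, 8.0, 8.0],         # 25% NaN, so it is imputed
})

processed = ikw_style_preprocess(toy)
print(processed)
```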


## Imports

In [3]:
# Math Libraries
import scipy
import numpy as np
import pandas as pd

# Visualisation
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sb
# sb.set_style('whitegrid')

# I/O
import json
import xlrd

# Utility
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Paths

In [50]:
path_to_transformed_data = "../../data/transformed/"
path_to_data = "../../data/"
molecular_file = '../../data/molecular_descriptors_data.txt'
# top20features_file = '/Users/admin/workspace/CA684Assignment/data/top20features.xlsx'

## Load data

In [52]:
# Load the molecular descriptors 
molecular_descriptors = pd.read_csv(molecular_file, sep='\t', header = 0)

In [8]:
# Read in file containing respective top 20 features for each target perception
#top20features = pd.read_excel(top20features_file)

## Missing Data

In [54]:
# The output shows 4870 columns, of which 2763 contain no NaN values,
# leaving 2107 columns with at least one NaN
print("Raw Molecular Descriptor Columns NaN count: ")
molecular_descriptors.isna().any().describe()
print(f"Raw Shape: {molecular_descriptors.shape}")

Raw Molecular Descriptor Columns NaN count: 


count      4870
unique        2
top       False
freq       2763
dtype: object

Raw Shape: (476, 4870)


In [55]:
# Store and drop the unique molecule ID to preserve IDs through normalisation
cids = molecular_descriptors['CID']
molecular_descriptors = molecular_descriptors.drop('CID', axis=1) 

# Drop every column that contains any NaN value. This is stricter than IKW Allstars,
# who dropped only columns with too many NaNs and median-imputed the rest.
dropna_molecular_descriptors = molecular_descriptors.dropna(axis=1)

print("\nMolecular Descriptor w/ NaN Columns dropped: ")
dropna_molecular_descriptors.isna().any().describe()
print(f"Dropped NaN Shape: {dropna_molecular_descriptors.shape}")


Molecular Descriptor w/ NaN Columns dropped: 


count      2762
unique        1
top       False
freq       2762
dtype: object

Dropped NaN Shape: (476, 2762)


In [56]:
# Apply min-max normalisation to rescale every column to the range [0, 1].
min_max_dropna = dropna_molecular_descriptors.apply(lambda x: (x-x.min())/(x.max()-x.min()), axis=0)

In [57]:
# Min-max scaling introduced new NaN values: a column whose max equals its min
# divides 0 by 0. We now have 1754 columns containing NaNs.
print("Before:")
min_max_dropna.isna().any().describe()

# Drop columns in which we introduced NaN values
min_max_dropna = min_max_dropna.dropna(axis=1)

# Leaves us with 1008 columns
print("After:")
min_max_dropna.isna().any().describe()

Before:


count     2762
unique       2
top       True
freq      1754
dtype: object

After:


count      1008
unique        1
top       False
freq       1008
dtype: object
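
The NaNs that appear after min-max scaling can be reproduced on a small example. The sketch below uses hypothetical toy data to show that a constant column turns into NaN, and that removing zero-range columns before scaling avoids creating NaNs in the first place. It is an alternative ordering of the steps, not what the cells above do.

```python
import pandas as pd

# Hypothetical toy data: one informative column and one constant column
toy = pd.DataFrame({
    "varies": [0.0, 5.0, 10.0],
    "constant": [3.0, 3.0, 3.0],
})

# Naive min-max scaling: the constant column has range 0, so every
# entry becomes 0 / 0, which pandas represents as NaN
col_range = toy.max() - toy.min()
naive = (toy - toy.min()) / col_range

# Alternative: drop zero-range columns first, then scale
informative = toy.loc[:, col_range > 0]
scaled = (informative - informative.min()) / (informative.max() - informative.min())

print(naive)
print(scaled)
```

Both orderings keep the same columns when constant columns are the only source of NaNs; filtering first simply makes the reason for the drop explicit.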

In [58]:
# Add back in CID after normalisation and removal of missing data
min_max_dropna.insert(0, 'CID', cids)

In [59]:
min_max_dropna.to_pickle(path_to_transformed_data + "MOL_min_max_dropna.zip")