## Perceptual Descriptors - Structure Odor Relationship
---

This dataset comes from a 2014 study and is the most comprehensive publicly available raw dataset of perceptual responses from human subjects to a structurally diverse set of molecules [cite]. The purpose of the study was to provide data that illuminates the Structure-Odor Relationship. It has been suggested in the literature that this type of dataset is not well suited to elucidating the underlying mechanics of olfaction [cite], and that SAR data is more appropriate than SOR data.

- Structure **Activity** Relationship (SAR) = data containing **physical responses** to smell (e.g. brain imaging, or receptor or glomerulus activity levels)
- Structure **Odor** Relationship (SOR) = data containing **conscious, perceptual responses** to smell (e.g. linguistic descriptors or similarity-based measures)

Because additional centres of the brain are involved in the final perception of smell, including the LIST [cite], SOR data measures more than the initial olfactory response. SOR data is also noisy, owing to genetic variation, learned behaviour and past experience, the subjectivity of ratings, and intrinsic and cultural linguistic differences. For these reasons it has been suggested that SAR data be used to study the underlying mechanics of the biological system of human olfaction. I had some minor success early in the project lifecycle in finding such data, but very few experiments pertained to humans, and the genomic data available was sparse, partly dated (15+ years old), and incomplete.

For machine learning to be effective I need a sizeable dataset. Such a dataset for human SAR was not publicly available, so on the advice of my supervisor and a member of the School of Science I opted to use SOR data such as the 2014 experiment, as this data was available and sufficient to train machine learning models with a high degree of prediction accuracy [citeDREAM].

Obtaining such high prediction rates was unexpected given the previously published literature. The ability of machine learning algorithms and statistical techniques to account for noise and multidimensionality, and to perform feature selection and feature transformation, leads to some impressive results. Given the unprecedented performance of these models, I believe that by studying how different machine learning algorithms and features perform we can indeed learn something about the structure or transformations of the olfactory system. Certain teams reproduced previous SOR models that had been derived mathematically or experimentally [cite].

## DREAM Challenge Notes
---
Team GuanLab (Winner of Individual Prediction) pre-processed these descriptors in the following manner:

- "There were many cases in which subjects indicated that they smelled nothing, so the intensity rating was automatically set to “0” and the ratings for other perceptual attributes were left blank (NaN); therefore, we have removed all the “NaN” entries." 
- "For pleasantness and 19 semantic attributes, we used target values at “high” concentration as a set of examples, and the average value at both “high” and “low” concentrations as another set of examples. As the original number of the sample (407) is relatively small, combining high and low concentrations doubles the sample size, and this step is crucial to achieve high performance."
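
The second quoted step can be illustrated with a minimal sketch. It is not GuanLab's code: the toy table and the column names `dilution`, `target` and `source` are assumptions made here purely for illustration.

```python
import pandas as pd

# Toy per-molecule targets; column names and values are illustrative only
targets = pd.DataFrame({
    "CID": [1, 1, 2, 2],
    "dilution": ["high", "low", "high", "low"],
    "PLEASANTNESS": [60.0, 40.0, 30.0, 50.0],
})

# Example set 1: target values at the "high" concentration only
high_only = (targets[targets["dilution"] == "high"]
             .set_index("CID")["PLEASANTNESS"])

# Example set 2: average of the "high" and "low" values per molecule
high_low_mean = targets.groupby("CID")["PLEASANTNESS"].mean()

# Stack both sets so each molecule contributes two training examples
augmented = pd.concat(
    [high_only.rename("target").reset_index().assign(source="high"),
     high_low_mean.rename("target").reset_index().assign(source="mean")],
    ignore_index=True,
)
print(augmented)
```

Each molecule appears twice in `augmented`, which is the doubling of the sample size that the quote describes.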

Team IKW Allstars (Winner of the Population prediction) pre-processed these descriptors in the following manner:

- "For subchallenge 1, I had to impute values to generate matrices with no NaNs for input to the estimator. I mostly used median imputation again, although I also explored imputing 0’s (50’s for ple). I found that median imputation was best for means, and 0 imputation best for sigmas, probably because median imputation artificially deflated the standard deviation in proportion to subject non-respondence. For subchallenge 2 I simply masked all NaNs for each subject and computed means and sigmas across subjects, ignoring the masked values. This made the most sense because the means and sigmas in the ground truth data also ignore NaNs."
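
The effect described in this quote can be seen on a toy example. This is a sketch under assumed values, not the team's code: it compares masking NaNs with median and zero imputation for one molecule and one descriptor.

```python
import numpy as np

# Ratings for one molecule/descriptor across six subjects; NaN = no response
ratings = np.array([10.0, 20.0, 30.0, np.nan, np.nan, np.nan])

# Masking: statistics ignore the NaNs entirely
masked_mean = np.nanmean(ratings)
masked_std = np.nanstd(ratings)

# Median imputation: NaNs replaced by the median of the observed values
median_filled = np.where(np.isnan(ratings), np.nanmedian(ratings), ratings)

# Zero imputation: NaNs replaced by 0
zero_filled = np.nan_to_num(ratings, nan=0.0)

print("masked:", masked_mean, masked_std)
print("median:", median_filled.mean(), median_filled.std())
print("zero:  ", zero_filled.mean(), zero_filled.std())
```

With these values median imputation leaves the mean unchanged but shrinks the standard deviation, and the shrinkage grows with the number of non-responders, which matches the deflation the quote attributes to subject non-response.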


### Import Statements

In [16]:
# Math Libraries
import scipy
import numpy as np
import pandas as pd

# Visualisation
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sb
sb.set_style('whitegrid')

# I/O
import json
import xlrd

# Utility
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

### Paths

In [17]:
path_to_transformed_data = "../../data/transformed/"
path_to_data = "../../data/"

### Reading Data

In [39]:
# Read in perceptual dataset from 2014 Keller & Vosshall Study [cite]
keller_data = pd.read_excel('..//machine learning/training/data/KellerData.xlsx',
                           header=1, skiprows=1)


In [40]:
# Strip whitespace from column names
keller_data = keller_data.rename(columns=lambda x: x.strip())

# Rename verbose columns for readability
keller_data = keller_data.rename(index=str, columns={
    "HOW STRONG IS THE SMELL?": "INTENSITY",
    "HOW PLEASANT IS THE SMELL?": "PLEASANTNESS",
    "AMMONIA/URINOUS": "URINOUS",
    "Subject # (DREAM challenge)": "SUBJECT",
    "C.A.S.": "CAS",
    "Race\n (\"unknown\" indicates\n that the subject did not\n wish to specify)": "Race",
    "HOW FAMILIAR IS THE SMELL?": "FAMILIARITY"})

# Drop all observations from subjects who were excluded from DREAM challenge
# due to objectively poor response patterns such as high variability on repeated samples
keller_data = keller_data[keller_data.iloc[:, 6].notna()]
# reduces observations from 55000 to 49000 (55 subjects to 49 subjects)

keller_data.head()
keller_data.shape

Unnamed: 0,CAS,Catalogue #*,CID,Odor,Odor dilution,Subject # (this study),SUBJECT,Gender,Race,Ethnicity,...,ACID,WARM,MUSKY,SWEATY,URINOUS,DECAYED,WOOD,GRASS,FLOWER,CHEMICAL
0,2257-09-2,W401404,16741,2-Phenylethyl isothiocyanate,"1/1,000",1,1.0,M,Black,Non-Hispanic,...,,,,,,,,,,
1,2257-09-2,W401404,16741,2-Phenylethyl isothiocyanate,"1/100,000",1,1.0,M,Black,Non-Hispanic,...,,,,,,,,,,
2,2442-10-6,W358207,17121,1-Octen-3-yl acetate,"1/1,000",1,1.0,M,Black,Non-Hispanic,...,,,,,,63.0,,,,
3,2442-10-6,W358207,17121,1-Octen-3-yl acetate,"1/100,000",1,1.0,M,Black,Non-Hispanic,...,,,,,,,,,,
4,2530-10-1,W352705,520191,"3-Acetyl-2,5-dimethylthiophene","1/100,000",1,1.0,M,Black,Non-Hispanic,...,,,,,,,,,,


(49000, 38)

In [41]:
# ASIDE: I'm still unsure as to whether we should drop this missing data or not 
# The missing rows (observations) seem MAR as they depend upon the subject and the molecule
# Best performance noted by discounting all rows where the subject said they don't smell anything

# Drop all observations where the person said they don't smell anything 
keller_no_response = keller_data[keller_data.iloc[:, 12] == "I can't smell anything"]
keller_data = keller_data[keller_data.iloc[:, 12] != "I can't smell anything"]
# reduces observations from 49000 rows to 36563 rows - leaving 12437 no responses

In [42]:
print(f"Keller Data Shape:   {keller_data.shape}")
print(f"Keller no responses: {keller_no_response.shape}")

Keller Data Shape:   (36563, 38)
Keller no responses: (12437, 38)
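
One way to probe the MAR note above is to check whether the no-response rate varies by subject and by molecule. The sketch below uses a toy frame; the column name `response` is an assumption, since the notebook selects that column by position rather than by name.

```python
import pandas as pd

# Toy observations; "response" is a hypothetical column name
toy = pd.DataFrame({
    "SUBJECT": [1, 1, 1, 1, 2, 2, 2, 2],
    "CID": [10, 10, 20, 20, 10, 10, 20, 20],
    "response": ["ok", "I can't smell anything", "ok", "ok",
                 "I can't smell anything", "I can't smell anything", "ok", "ok"],
})

no_smell = toy["response"] == "I can't smell anything"

# Share of no-responses per subject and per molecule
rate_by_subject = no_smell.groupby(toy["SUBJECT"]).mean()
rate_by_molecule = no_smell.groupby(toy["CID"]).mean()

print(rate_by_subject)
print(rate_by_molecule)
```

If the rates differ clearly across subjects or molecules, the missingness depends on observed variables, which is consistent with treating it as MAR rather than completely at random.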


## Training Test Split
Before we pre-process the data we need to split it into training and test sets, so that any imputation or transformation of the training data is not affected by the datapoints in the test set. If we split after pre-processing, the test set could influence the imputed or transformed values, biasing the training phase with information about the test set.

In [43]:
# For fair comparison against the teams in the DREAM challenge we can use the same
# training and testing set split that was specified for the final submission.
CID_test_file = '/Users/admin/workspace/CA684Assignment/data/CID_testset.txt'
test_mol_cid = list(pd.read_csv(CID_test_file, header=None)[0])

test_mol_cid_str = [str(i) for i in test_mol_cid]
DREAM_test_set = keller_data[keller_data.CID.isin(test_mol_cid)]
DREAM_training_set = keller_data[~keller_data.CID.isin(test_mol_cid)]

In [50]:
print(f"Number of test molecules: {len(DREAM_test_set.CID.unique())}")
print(f"Number of training molecules: {len(DREAM_training_set.CID.unique())}")
print(f"Number of test observations: {DREAM_test_set.shape[0]}")
print(f"Number of training observations: {DREAM_training_set.shape[0]}")

Number of test molecules: 69
Number of training molecules: 411
Number of test observations: 5238
Number of training observations: 31325


In [61]:
# An alternative is to use sklearn's train_test_split function
# from sklearn.model_selection import train_test_split

# use DREAM split for now
test_set = DREAM_test_set
training_set = DREAM_training_set
training_set.shape
test_set.shape

(31325, 38)

(5238, 38)
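
If a split other than the DREAM one is wanted, it should still keep every molecule entirely in one set, as the DREAM split does. The following is a minimal sketch of such a molecule-level split; the helper `split_by_molecule` and its parameters are hypothetical and not part of this notebook or of sklearn.

```python
import numpy as np
import pandas as pd

def split_by_molecule(df, test_fraction=0.2, seed=0):
    """Split rows so that every molecule (CID) falls entirely in one set."""
    rng = np.random.default_rng(seed)
    cids = df["CID"].unique()
    n_test = max(1, int(round(test_fraction * len(cids))))
    test_cids = rng.choice(cids, size=n_test, replace=False)
    in_test = df["CID"].isin(test_cids)
    return df[~in_test], df[in_test]

# Toy data: 10 molecules with 3 observations each
toy = pd.DataFrame({"CID": np.repeat(np.arange(10), 3),
                    "INTENSITY": np.arange(30, dtype=float)})
train, test = split_by_molecule(toy, test_fraction=0.2, seed=0)
print(train.shape, test.shape)
```

A plain row-wise random split would scatter observations of the same molecule across both sets, so the model would be tested on molecules it had already seen.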

## Imputation
---
Calculate the mean and median response across subjects for a given molecule.

Perform median, mean, and zero imputation.

### Training Set
___
We perform training and test imputation separately to explicitly avoid data leakage. Imputing the test set labels does seem off to me, but our estimator cannot accept NaN values, so we have to impute something.

In [52]:
def mean_median_average(series: pd.Series) -> float:
    """Return the midpoint between the median and the mean of a series."""
    return (series.median() + series.mean()) / 2
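
As a quick worked example of this statistic, here is a sketch on made-up ratings. pandas skips NaN by default in both `median()` and `mean()`.

```python
import pandas as pd

# Skewed toy ratings with one missing response
ratings = pd.Series([0.0, 10.0, 50.0, float("nan")])

# Observed values are 0, 10, 50: median = 10, mean = 20
blend = (ratings.median() + ratings.mean()) / 2
print(blend)  # 15.0
```

For skewed ratings like these the blended value sits between the median and the mean.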

In [53]:
# All unique molecules in the dataset - 411 (train) 
molecules = training_set.CID.unique()

# Create duplicate data structures for different imputation methods
mean_imputation = pd.DataFrame(training_set.copy())
median_imputation = pd.DataFrame(training_set.copy())
mean_median_average_imputation = pd.DataFrame(training_set.copy())
zero_imputation = pd.DataFrame(training_set.copy())

# For every descriptor column calculate & impute group-based statistics 
# (based on chemical ID) and store to the relevant dataframe
for col in keller_data.iloc[:, 15:38].keys().values:
    mean_imputation[col] = mean_imputation.groupby("CID")[col].transform(lambda x: x.fillna(x.mean(), inplace=False))
    median_imputation[col] = median_imputation.groupby("CID")[col].transform(lambda x: x.fillna(x.median(), inplace=False))
    mean_median_average_imputation[col] = mean_median_average_imputation.groupby("CID")[col].transform(lambda x: x.fillna(mean_median_average(x), inplace=False))
# Zero imputation does not depend on the column loop, so fill every NaN once
zero_imputation = zero_imputation.fillna(0)

    



In [55]:
## Check for leftover NaN
mean_imputation.iloc[:, 18:38].isna().sum()
median_imputation.iloc[:, 18:38].isna().sum()
mean_median_average_imputation.iloc[:, 18:38].isna().sum()
zero_imputation.iloc[:, 18:38].isna().sum() # no NaNs left

EDIBLE         0
BAKERY       127
SWEET          0
FRUIT         51
FISH        1054
GARLIC       124
SPICES         0
COLD          35
SOUR           0
BURNT          0
ACID           0
WARM           0
MUSKY          0
SWEATY         0
URINOUS        0
DECAYED        0
WOOD           0
GRASS         92
FLOWER         0
CHEMICAL       0
dtype: int64

EDIBLE         0
BAKERY       127
SWEET          0
FRUIT         51
FISH        1054
GARLIC       124
SPICES         0
COLD          35
SOUR           0
BURNT          0
ACID           0
WARM           0
MUSKY          0
SWEATY         0
URINOUS        0
DECAYED        0
WOOD           0
GRASS         92
FLOWER         0
CHEMICAL       0
dtype: int64

EDIBLE         0
BAKERY       127
SWEET          0
FRUIT         51
FISH        1054
GARLIC       124
SPICES         0
COLD          35
SOUR           0
BURNT          0
ACID           0
WARM           0
MUSKY          0
SWEATY         0
URINOUS        0
DECAYED        0
WOOD           0
GRASS         92
FLOWER         0
CHEMICAL       0
dtype: int64

EDIBLE      0
BAKERY      0
SWEET       0
FRUIT       0
FISH        0
GARLIC      0
SPICES      0
COLD        0
SOUR        0
BURNT       0
ACID        0
WARM        0
MUSKY       0
SWEATY      0
URINOUS     0
DECAYED     0
WOOD        0
GRASS       0
FLOWER      0
CHEMICAL    0
dtype: int64

In [69]:
# These remaining NaNs represent cases where no one in the population rated a certain molecule
# using a particular descriptor. In this case it is safe to impute zero as no subject in
# the study rated this chemical with the descriptor in question

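# e.g. 1540 responses / 55 subjects = 28 molecules which were not rated as FISH by any subject
# e.g. 110 responses / 55 subjects = 2 molecules which were not rated as COLD by any subject
# (these example counts differ from the output above, 1054 for FISH and 35 for COLD,
# and appear to come from an earlier run on all 55 subjects)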

# Impute zero for these responses
mean_imputation = mean_imputation.fillna(0)
median_imputation = median_imputation.fillna(0)
mean_median_average_imputation = mean_median_average_imputation.fillna(0)
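
Why group-wise imputation leaves some NaNs behind can be reproduced on a toy frame. This is a sketch with made-up values, using the same `groupby(...).transform(...)` pattern as the loop above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "CID": [1, 1, 1, 2, 2],
    "FISH": [10.0, np.nan, 30.0, np.nan, np.nan],  # molecule 2 was never rated FISH
})

# Group-wise mean imputation per molecule
filled = toy.groupby("CID")["FISH"].transform(lambda x: x.fillna(x.mean()))

# Molecule 1's gap becomes 20.0; molecule 2 stays NaN because its group mean is NaN
leftover = int(filled.isna().sum())
print(leftover)

# Fall back to zero for descriptors that no subject used for the molecule
filled = filled.fillna(0)
print(filled.tolist())
```

A group whose values are all NaN has no mean or median to fill with, so only the zero fallback removes those entries.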

### Test Set

In [64]:
# All unique molecules in the test set - 69
molecules = test_set.CID.unique()

# Create duplicate data structures for different imputation methods
test_mean_imputation = pd.DataFrame(test_set.copy())
test_median_imputation = pd.DataFrame(test_set.copy())
test_mean_median_average_imputation = pd.DataFrame(test_set.copy())
test_zero_imputation = pd.DataFrame(test_set.copy())

# For every descriptor column calculate & impute group-based statistics 
# (based on chemical ID) and store to the relevant dataframe
for col in keller_data.iloc[:, 15:38].keys().values:
    test_mean_imputation[col] = test_mean_imputation.groupby("CID")[col].transform(lambda x: x.fillna(x.mean(), inplace=False))
    test_median_imputation[col] = test_median_imputation.groupby("CID")[col].transform(lambda x: x.fillna(x.median(), inplace=False))
    test_mean_median_average_imputation[col] = test_mean_median_average_imputation.groupby("CID")[col].transform(lambda x: x.fillna(mean_median_average(x), inplace=False))
# Zero imputation does not depend on the column loop, so fill every NaN once
test_zero_imputation = test_zero_imputation.fillna(0)

    



In [65]:
## Check for leftover NaN
test_mean_imputation.iloc[:, 18:38].isna().sum()
test_median_imputation.iloc[:, 18:38].isna().sum()
test_mean_median_average_imputation.iloc[:, 18:38].isna().sum()
test_zero_imputation.iloc[:, 18:38].isna().sum() # no NaNs left

EDIBLE        0
BAKERY        0
SWEET         0
FRUIT       162
FISH         97
GARLIC        0
SPICES        0
COLD          0
SOUR          0
BURNT         0
ACID          0
WARM          0
MUSKY         0
SWEATY        0
URINOUS       0
DECAYED       0
WOOD          0
GRASS         0
FLOWER        0
CHEMICAL      0
dtype: int64

EDIBLE        0
BAKERY        0
SWEET         0
FRUIT       162
FISH         97
GARLIC        0
SPICES        0
COLD          0
SOUR          0
BURNT         0
ACID          0
WARM          0
MUSKY         0
SWEATY        0
URINOUS       0
DECAYED       0
WOOD          0
GRASS         0
FLOWER        0
CHEMICAL      0
dtype: int64

EDIBLE        0
BAKERY        0
SWEET         0
FRUIT       162
FISH         97
GARLIC        0
SPICES        0
COLD          0
SOUR          0
BURNT         0
ACID          0
WARM          0
MUSKY         0
SWEATY        0
URINOUS       0
DECAYED       0
WOOD          0
GRASS         0
FLOWER        0
CHEMICAL      0
dtype: int64

EDIBLE      0
BAKERY      0
SWEET       0
FRUIT       0
FISH        0
GARLIC      0
SPICES      0
COLD        0
SOUR        0
BURNT       0
ACID        0
WARM        0
MUSKY       0
SWEATY      0
URINOUS     0
DECAYED     0
WOOD        0
GRASS       0
FLOWER      0
CHEMICAL    0
dtype: int64

In [66]:
# Impute zero for these responses
test_mean_imputation = test_mean_imputation.fillna(0)
test_median_imputation = test_median_imputation.fillna(0)
test_mean_median_average_imputation = test_mean_median_average_imputation.fillna(0)

In [67]:
# Displays first 5 rows of each dataframe containing a different imputation method
print("### Raw Data ###") 
keller_data.iloc[0:5, 15:38]
print("### Mean Impute ###") 
mean_imputation.iloc[0:5, 15:38]
print("### Median Impute ###") 
median_imputation.iloc[0:5, 15:38]
print("### Zero Impute ###") 
zero_imputation.iloc[0:5, 15:38]
print("### Mean/Median Average ###") 
mean_median_average_imputation.iloc[0:5, 15:38]

### Raw Data ###


Unnamed: 0,INTENSITY,PLEASANTNESS,FAMILIARITY,EDIBLE,BAKERY,SWEET,FRUIT,FISH,GARLIC,SPICES,...,ACID,WARM,MUSKY,SWEATY,URINOUS,DECAYED,WOOD,GRASS,FLOWER,CHEMICAL
0,59.0,64.0,66.0,,,16.0,,,,,...,,,,,,,,,,
1,38.0,60.0,44.0,,,57.0,,,,,...,,,,,,,,,,
2,58.0,34.0,16.0,,,,,,,,...,,,,,,63.0,,,,
3,2.0,0.0,0.0,,,,,,,,...,,,,,,,,,,
4,0.0,48.0,0.0,,,,,,,7.0,...,,,,,,,,,,


### Mean Impute ###


Unnamed: 0,INTENSITY,PLEASANTNESS,FAMILIARITY,EDIBLE,BAKERY,SWEET,FRUIT,FISH,GARLIC,SPICES,...,ACID,WARM,MUSKY,SWEATY,URINOUS,DECAYED,WOOD,GRASS,FLOWER,CHEMICAL
0,59.0,64.0,66.0,24.066667,16.666667,16.0,17.0,22.5,24.266667,25.304348,...,27.3,23.2,33.375,26.352941,24.631579,26.0,19.066667,11.5,8.714286,31.321429
1,38.0,60.0,44.0,24.066667,16.666667,57.0,17.0,22.5,24.266667,25.304348,...,27.3,23.2,33.375,26.352941,24.631579,26.0,19.066667,11.5,8.714286,31.321429
2,58.0,34.0,16.0,24.363636,4.0,35.588235,23.8,14.0,22.875,24.2,...,27.777778,19.333333,23.238095,27.5,16.285714,63.0,14.777778,30.25,15.625,21.6
3,2.0,0.0,0.0,24.363636,4.0,35.588235,23.8,14.0,22.875,24.2,...,27.777778,19.333333,23.238095,27.5,16.285714,19.3,14.777778,30.25,15.625,21.6
4,0.0,48.0,0.0,32.777778,32.666667,24.956522,32.5,40.777778,29.5,7.0,...,24.714286,20.461538,34.37931,21.466667,30.210526,26.619048,19.333333,27.0,27.764706,35.806452


### Median Impute ###


Unnamed: 0,INTENSITY,PLEASANTNESS,FAMILIARITY,EDIBLE,BAKERY,SWEET,FRUIT,FISH,GARLIC,SPICES,...,ACID,WARM,MUSKY,SWEATY,URINOUS,DECAYED,WOOD,GRASS,FLOWER,CHEMICAL
0,59.0,64.0,66.0,14.0,11.0,16.0,1.0,14.0,12.0,17.0,...,16.5,9.0,25.0,28.0,22.0,23.0,13.0,7.0,4.0,33.5
1,38.0,60.0,44.0,14.0,11.0,57.0,1.0,14.0,12.0,17.0,...,16.5,9.0,25.0,28.0,22.0,23.0,13.0,7.0,4.0,33.5
2,58.0,34.0,16.0,25.0,4.0,34.0,18.0,17.0,26.5,20.0,...,22.0,11.5,22.0,15.5,11.0,63.0,8.0,25.0,3.0,17.0
3,2.0,0.0,0.0,25.0,4.0,34.0,18.0,17.0,26.5,20.0,...,22.0,11.5,22.0,15.5,11.0,10.0,8.0,25.0,3.0,17.0
4,0.0,48.0,0.0,28.0,23.0,21.0,27.0,36.0,24.5,7.0,...,16.0,14.0,26.0,19.0,21.0,25.0,5.0,14.5,24.0,35.0


### Zero Impute ###


Unnamed: 0,INTENSITY,PLEASANTNESS,FAMILIARITY,EDIBLE,BAKERY,SWEET,FRUIT,FISH,GARLIC,SPICES,...,ACID,WARM,MUSKY,SWEATY,URINOUS,DECAYED,WOOD,GRASS,FLOWER,CHEMICAL
0,59.0,64.0,66.0,0.0,0.0,16.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,38.0,60.0,44.0,0.0,0.0,57.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,58.0,34.0,16.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,63.0,0.0,0.0,0.0,0.0
3,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,48.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Mean/Median Average ###


Unnamed: 0,INTENSITY,PLEASANTNESS,FAMILIARITY,EDIBLE,BAKERY,SWEET,FRUIT,FISH,GARLIC,SPICES,...,ACID,WARM,MUSKY,SWEATY,URINOUS,DECAYED,WOOD,GRASS,FLOWER,CHEMICAL
0,59.0,64.0,66.0,19.033333,13.833333,16.0,9.0,18.25,18.133333,21.152174,...,21.9,16.1,29.1875,27.176471,23.315789,24.5,16.033333,9.25,6.357143,32.410714
1,38.0,60.0,44.0,19.033333,13.833333,57.0,9.0,18.25,18.133333,21.152174,...,21.9,16.1,29.1875,27.176471,23.315789,24.5,16.033333,9.25,6.357143,32.410714
2,58.0,34.0,16.0,24.681818,4.0,34.794118,20.9,15.5,24.6875,22.1,...,24.888889,15.416667,22.619048,21.5,13.642857,63.0,11.388889,27.625,9.3125,19.3
3,2.0,0.0,0.0,24.681818,4.0,34.794118,20.9,15.5,24.6875,22.1,...,24.888889,15.416667,22.619048,21.5,13.642857,14.65,11.388889,27.625,9.3125,19.3
4,0.0,48.0,0.0,30.388889,27.833333,22.978261,29.75,38.388889,27.0,7.0,...,20.357143,17.230769,30.189655,20.233333,25.605263,25.809524,12.166667,20.75,25.882353,35.403226


## Store to disk

### Training Sets

In [70]:
mean_imputation.to_pickle(path_to_transformed_data + "mean_imputation.zip")
median_imputation.to_pickle(path_to_transformed_data + "median_imputation.zip")
zero_imputation.to_pickle(path_to_transformed_data + "zero_imputation.zip")
mean_median_average_imputation.to_pickle(path_to_transformed_data + "mean_median_average_imputation.zip")

### Test Sets

In [68]:
test_mean_imputation.to_pickle(path_to_transformed_data + "test_mean_imputation.zip")
test_median_imputation.to_pickle(path_to_transformed_data + "test_median_imputation.zip")
test_zero_imputation.to_pickle(path_to_transformed_data + "test_zero_imputation.zip")
test_mean_median_average_imputation.to_pickle(path_to_transformed_data + "test_mean_median_average_imputation.zip")
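
The stored files can be read back with `pd.read_pickle`, which infers the compression from the `.zip` extension in the same way `to_pickle` does. The sketch below round-trips a toy frame through a temporary directory; the file name is hypothetical.

```python
import os
import tempfile
import pandas as pd

frame = pd.DataFrame({"CID": [1, 2], "FISH": [0.0, 12.5]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example_imputation.zip")  # hypothetical file name
    frame.to_pickle(path)            # compression inferred from ".zip"
    restored = pd.read_pickle(path)  # same inference on read

print(restored.equals(frame))
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.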

In [63]:
test_median_imputation.shape

(31325, 38)