<h3>Preprocessing and PCA dimensionality reduction of input data</h3>

In [1]:
import numpy as np
import pandas as pd
from sklearn import preprocessing
import sklearn.decomposition as pca
import matplotlib.pyplot as plt

<h4>Load the datasets and split off the Id and smiles columns before scaling</h4>

In [2]:
pretrain_data = pd.read_csv('./pretrain_features.csv.zip')
pretrain_start = pd.DataFrame(pretrain_data,columns=['Id','smiles'])
pretrain_trimmed = pretrain_data.drop(['Id','smiles'], axis=1)

train_data = pd.read_csv('./train_features.csv.zip')
train_start = pd.DataFrame(train_data,columns=['Id','smiles'])
train_trimmed = train_data.drop(['Id','smiles'], axis=1)

test_data = pd.read_csv('./test_features.csv.zip')
test_start = pd.DataFrame(test_data,columns=['Id','smiles'])
test_trimmed = test_data.drop(['Id','smiles'], axis=1)

<h4>Standardize the features to mean 0 and variance 1. NOTE: the scaler is fitted on the pretrain features only, then applied to all three datasets.</h4>

In [3]:
scaler = preprocessing.StandardScaler().fit(pretrain_trimmed)
pretrain_data_processed = scaler.transform(pretrain_trimmed)
train_data_processed = scaler.transform(train_trimmed)
test_data_processed = scaler.transform(test_trimmed)
# print(f"mean after scaling: {pretrain_data_processed.mean(axis=0)}")
# print(f"std after scaling: {pretrain_data_processed.std(axis=0)}")
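Because the scaler is fitted on the pretrain features only, only the pretrain set is guaranteed to come out with mean 0 and variance 1. The train and test sets are shifted and scaled by the pretrain statistics, so their means and variances will generally differ. The following is a minimal sketch of that check on synthetic data. The arrays `pretrain_demo` and `train_demo` are hypothetical stand-ins for `pretrain_trimmed` and `train_trimmed`, not the real features.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical stand-ins for pretrain_trimmed and train_trimmed (assumed data).
pretrain_demo = rng.normal(loc=5.0, scale=2.0, size=(1000, 4))
train_demo = rng.normal(loc=7.0, scale=2.5, size=(100, 4))

# Fit on the pretrain stand-in only, then reuse the same statistics.
demo_scaler = StandardScaler().fit(pretrain_demo)
pretrain_scaled = demo_scaler.transform(pretrain_demo)
train_scaled = demo_scaler.transform(train_demo)

pretrain_mean = pretrain_scaled.mean(axis=0)
pretrain_std = pretrain_scaled.std(axis=0)
train_mean = train_scaled.mean(axis=0)

print("pretrain mean:", pretrain_mean)  # close to 0 by construction
print("pretrain std: ", pretrain_std)   # close to 1 by construction
print("train mean:   ", train_mean)     # generally not 0
```

If the train or test statistics drift far from 0 and 1 on the real data, that is a sign the datasets have different feature distributions, which is worth knowing before transfer learning.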

<h4>Apply dimensionality reduction to project the 1000 features onto principal components. NOTE: it is unclear whether sharing one PCA across all datasets is good practice, but the more similar the NN inputs are, the easier transfer learning should be.</h4>

In [5]:
pca_handler = pca.PCA(n_components=10)  # TODO: find the number of components per molecule that performs best
pca_handler.fit(pretrain_data_processed)
pca_pretrain_data = pca_handler.transform(pretrain_data_processed)
pca_train_data = pca_handler.transform(train_data_processed)
pca_test_data = pca_handler.transform(test_data_processed)
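One way to approach the open question of how many components to keep is to look at the cumulative explained variance of a full PCA fit. The sketch below does this on synthetic data with three underlying factors. The array `demo_features` and the 95% threshold are assumptions chosen for illustration, and the component count that works best for the downstream network may still differ.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in for the scaled pretrain features (assumed data):
# 3 latent factors mixed into 20 features, plus a little noise.
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 20))
demo_features = latent @ mixing + 0.01 * rng.normal(size=(500, 20))

# Fit with all components so the full variance spectrum is available.
demo_pca = PCA().fit(demo_features)
cumulative = np.cumsum(demo_pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95%.
n_needed = int(np.searchsorted(cumulative, 0.95) + 1)
print(f"components needed for 95% variance: {n_needed}")
```

scikit-learn's `PCA` also accepts a float between 0 and 1 as `n_components` and then selects the number of components that reaches that variance fraction, which could replace the manual search above.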

<h4>Save the reduced datasets</h4>

In [8]:
# Column names for the new PCA features
feature_cols = ['feature_' + str(i + 1) for i in range(pca_train_data.shape[1])]
# Build the output datasets
processed_pretrain_dataset = pd.DataFrame(pca_pretrain_data,columns=feature_cols)
# print(f"pretrain dataset: {processed_pretrain_dataset}")
processed_pretrain_dataset = pd.concat([pretrain_start,processed_pretrain_dataset],axis=1)
# print(f"size of pretrain start: {pretrain_start.shape}")

processed_train_dataset = pd.DataFrame(pca_train_data,columns=feature_cols)
# print(f"train dataset: {processed_train_dataset}")
processed_train_dataset = pd.concat([train_start,processed_train_dataset],axis=1)
# print(f"size of train start: {train_start.shape}")

processed_test_dataset = pd.DataFrame(pca_test_data,columns=feature_cols)
# print(f"test dataset: {processed_test_dataset}")
processed_test_dataset = pd.concat([test_start,processed_test_dataset],axis=1)
# print(f"size of test start: {test_start.shape}")

print(f"pretrain dataset concatenated: {processed_pretrain_dataset}")
print(f"train dataset concatenated: {processed_train_dataset}")
print(f"test dataset concatenated: {processed_test_dataset}")

processed_pretrain_dataset.to_csv('./processed_pretrain_dataset.csv',index=False)
processed_train_dataset.to_csv('./processed_train_dataset.csv',index=False)
processed_test_dataset.to_csv('./processed_test_dataset.csv',index=False)

pretrain dataset concatenated:           Id                                             smiles  feature_1  \
0          0  c1occ2c1c1ccc3cscc3c1c1ncc3cc(ccc3c21)-c1cccc2...  -4.626551   
1          1  C1C=c2c(cc3ncc4c5[SiH2]C=Cc5oc4c3c2=C1)-c1scc2...   0.954428   
2          2  C1C=c2c3cccnc3c3c4c[nH]cc4c4cc(cnc4c3c2=C1)-c1...  -4.419375   
3          3  [SiH2]1C=Cc2c1csc2-c1cnc2c(c1)c1ccccc1c1cc3ccc...   1.282784   
4          4        c1occ2c1c(cc1[se]c3ccncc3c21)-c1cccc2nsnc12  -3.170778   
...      ...                                                ...        ...   
49995  49995      C1cc2c3-c4[nH]ccc4-ncc3c3c4c[SiH2]cc4ccc3c2c1  -4.541318   
49996  49996  [SiH2]1C=c2c3C=C([SiH2]c3c3c4cscc4c4C=C[SiH2]c...   0.619174   
49997  49997              C1C=Cc2csc(C3=Cc4ccc5cc[se]c5c4C3)c12   5.223223   
49998  49998  [SiH2]1C=c2c3cc([nH]c3c3c4ccccc4c4ccc5nsnc5c4c...  -3.073758   
49999  49999  C1C=c2c3ccccc3c3c(c4ccc(cc4c4=C[SiH2]C=c34)-c3...  -3.611722   

       feature_2  feature_3  fea
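The save cell repeats the same three steps for each dataset: wrap the PCA output in a DataFrame, name the columns, and concatenate with the identifier columns. A small helper could factor this out. This is a sketch only: `attach_identifiers` is a hypothetical name, and the demo frames are made-up data. Passing `index=id_frame.index` keeps the rows aligned even if the identifier frame does not have a default index, since `pd.concat(..., axis=1)` aligns on the index.

```python
import numpy as np
import pandas as pd


def attach_identifiers(id_frame, reduced, prefix="feature_"):
    """Join identifier columns with a 2-D array of reduced features.

    Hypothetical helper: id_frame holds columns such as Id and smiles,
    reduced is the PCA output with one row per molecule.
    """
    cols = [f"{prefix}{i + 1}" for i in range(reduced.shape[1])]
    # Reuse the identifier index so the column-wise concat aligns row by row.
    reduced_frame = pd.DataFrame(reduced, columns=cols, index=id_frame.index)
    return pd.concat([id_frame, reduced_frame], axis=1)


# Made-up demo data standing in for train_start and pca_train_data.
demo_ids = pd.DataFrame({"Id": [0, 1, 2], "smiles": ["C", "CC", "CCC"]})
demo_reduced = np.arange(6, dtype=float).reshape(3, 2)

demo_out = attach_identifiers(demo_ids, demo_reduced)
print(demo_out)
```

With such a helper, each dataset in the save cell would reduce to one call followed by `to_csv(..., index=False)`.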

<h4>NOTE: If needed, we can extract more features from the chemistry database to enrich the input data.</h4>