# Kullback-Leibler divergence

<span style="color:red; font-weight:bold;">UPDATE: GMM and K-L Divergence did not work. GMM only returned binary values from predict_proba.</span>


Use K-L divergence to determine the effect of target variable (activity) values (0 or 1) on a Gaussian mixture model over the data.

Apply this only to the continuous features, separately for cid and pid.

References:

* [KL Divergence Python Example](https://towardsdatascience.com/kl-divergence-python-example-b87069e4b810)—a nice explanation and example
* [Stackexchange: Calculating KL Divergence in Python
](https://datascience.stackexchange.com/questions/9262/calculating-kl-divergence-in-python)—some good notes on which Python libraries to use
* [Mutual Information](https://en.wikipedia.org/wiki/Mutual_information)—seems relevant here
* [Sensitivity Analysis in Gaussian Bayesian Networks Using a Divergence Measure](https://www.researchgate.net/publication/233216409_Sensitivity_Analysis_in_Gaussian_Bayesian_Networks_Using_a_Divergence_Measure)
* [Clustering with Gaussian Mixture Models](https://pythonmachinelearning.pro/clustering-with-gaussian-mixture-models/)
* [scikit-learn Gaussian mixture models](https://scikit-learn.org/stable/modules/mixture.html)
* [Approximating the Kullback Leibler Divergence Between Gaussian Mixture Models](https://www.researchgate.net/publication/4249249_Approximating_the_Kullback_Leibler_Divergence_Between_Gaussian_Mixture_Models)
* [Stackoverlflow: KL-Divergence of two GMMs
](https://stackoverflow.com/questions/26079881/kl-divergence-of-two-gmms)
* [Stackexchange: Trying to implement the Jensen-Shannon Divergence for Multivariate Gaussians](https://stats.stackexchange.com/questions/345915/trying-to-implement-the-jensen-shannon-divergence-for-multivariate-gaussians)—related and refering to the above Stackoverflow Q&A
* [Stackoverlflow: predict_proba is not working for my gaussian mixture model (sklearn, python)
](https://stackoverflow.com/questions/56993070/predict-proba-is-not-working-for-my-gaussian-mixture-model-sklearn-python)

In [1]:
from sklearn.mixture import GaussianMixture
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
import numpy as np

### Load the features files for CIDs and PIDs

In [2]:
data_loc = '../data/FDA-COVID19_files_v1.0/'

In [3]:
# Get the individual feature sets as data frames
  def __load_feature_files():
    print('===============================================')
    print('dragon_features.csv')
    print('===============================================')
    # note need to set the data_type to object because it complains, otherwise that the types vary.
    df_dragon_features = __load_data(data_loc+'drug_features/dragon_features.csv', data_type=object)
    
    # rename the dragon features since there are duplicate column names in the protein binding-sites data.
    df_dragon_features.columns = ['cid_'+col for col in df_dragon_features.columns]
    
    # handle na values in dragon_features
    # Many cells contain "na" values. Find the columns that contain 2% or 
    # less of these values and retain them, throwing away the rest. 
    # Then mean-impute the "na" values in the remaining columns.
    pct_threshold = 2
    na_threshold = int(len(df_dragon_features)*pct_threshold/100)
    ok_cols = []

    for col in df_dragon_features:
        na_count = df_dragon_features[col].value_counts().get('na')
        if (na_count or 0) <= na_threshold:
            ok_cols.append(col)

    print('number of columns where the frequency of "na" values is <= {}%: {}.'.format(pct_threshold, len(ok_cols)))
    
    df_dragon_features = df_dragon_features[ok_cols]

    # convert all values except "na"s to numbers and set "na" values to NaNs.
    df_dragon_features = df_dragon_features.apply(pd.to_numeric, errors='coerce')

    columns_missing_values = df_dragon_features.columns[df_dragon_features.isnull().any()].tolist()
    print('{} columns with missing values.'.format(len(columns_missing_values)))

    # replace NaNs with column means
    df_dragon_features.fillna(df_dragon_features.mean(), inplace=True)

    columns_missing_values = df_dragon_features.columns[df_dragon_features.isnull().any()].tolist()
    print('{} columns with missing values (after imputing): {}'.format(len(columns_missing_values), 
                                                                       columns_missing_values))    
    print('===============================================')
    print('binding_site_features_v2.csv')
    print('===============================================')
    df_binding_sites = __load_data(data_loc+'protein_features/binding_site_features_v2.csv')
    
    # Name the index to 'pid' to allow joining to other feaure files later.
    df_binding_sites.index.name = 'pid'
    
    print('===============================================')
    print('expasy.csv')
    print('===============================================')
    df_expasy = __load_data(data_loc+'protein_features/expasy.csv')
    
    print('===============================================')
    print('profeat.csv')
    print('===============================================')
    df_profeat = __load_data(data_loc+'protein_features/profeat.csv')
    
    # Name the index to 'pid' to allow joining to other feaure files later.
    df_profeat.index.name = 'pid'
    
    # profeat has some missing values.
    s = df_profeat.isnull().sum(axis = 0)

    print('number of missing values for each column containing them is: {}'.format(len(s[s > 0])))

    # Drop the rows that have missing values.
    df_profeat.dropna(inplace=True)
    print('number of rows remaining, without NaNs: {:,}'.format(len(df_profeat)))
    
    return {'df_dragon_features': df_dragon_features,
           'df_binding_sites': df_binding_sites,
           'df_expasy': df_expasy,
           'df_profeat': df_profeat}

In [4]:
# load a specific features CSV file
  def __load_data(path, data_type=None):
    if data_type:
        df = pd.read_csv(path, index_col=0, dtype=data_type)
    else:
        df = pd.read_csv(path, index_col=0)
    print('Number of rows: {:,}'.format(len(df)))
    print('Number of columns: {:,}'.format(len(df.columns)))
    
    columns_missing_values = df.columns[df.isnull().any()].tolist()
    print('{} columns with missing values'.format(len(columns_missing_values)))
    
    print(df.head(2))
    
    return df

In [5]:
feature_sets = __load_feature_files()

df_dragon_features = feature_sets['df_dragon_features']
df_binding_sites = feature_sets['df_binding_sites']
df_expasy = feature_sets['df_expasy']
df_profeat = feature_sets['df_profeat']

dragon_features.csv
Number of rows: 88,105
Number of columns: 3,839
0 columns with missing values
              MW                AMW      Sv                  Se  \
cid                                                               
72792562  474.67  6.781000000000001  41.039              70.101   
44394609  546.48              8.674  43.185  63.538000000000004   

                          Sp                 Si     Mv                  Me  \
cid                                                                          
72792562   43.54600000000001  80.52199999999999  0.586               1.001   
44394609  45.233000000000004             69.993  0.685  1.0090000000000001   

             Mp     Mi  ... Psychotic-80 Psychotic-50 Hypertens-80  \
cid                     ...                                          
72792562  0.622   1.15  ...            0            0            0   
44394609  0.718  1.111  ...            0            0            0   

         Hypertens-50 Hypnotic-80 Hypno

### Validation set: Merge the features for CIDs and PIDs with the interactions.

This yields a set of CID features and a set of PID features.

In [8]:
validation_interactions = '../data/v5/validation_interactions_v5.csv'
df_interactions = __load_data(validation_interactions)

df_pid_vld = pd.merge(df_interactions, df_expasy, on='pid', how='inner')
df_pid_vld = pd.merge(df_pid_vld, df_profeat, on='pid', how='inner')

df_cid_vld = pd.merge(df_interactions, df_dragon_features, on='cid', how='inner')

Number of rows: 7,972
Number of columns: 5
0 columns with missing values
    cid       pid  activity  cid_binary_weights  pid_binary_weights
0   938  AAB59829         1            0.952894            0.996703
1  1986  AAB59829         1            0.975912            0.996703


### Training set: Merge the features for CIDs and PIDs with the interactions.

This yields a set of CID features and a set of PID features.

In [6]:
optional_training_interactions_csv = '../data/v5/optional_training_interactions_v5.csv'
required_training_interactions_csv = '../data/v5/required_training_interactions_v5.csv'

optional_training_interactions = pd.read_csv(optional_training_interactions_csv, index_col=0)
required_training_interactions = pd.read_csv(required_training_interactions_csv, index_col=0)

drop_cols = ['cid_binary_weights', 'pid_binary_weights', 'activity_score']

opt_unique = optional_training_interactions.drop_duplicates(["cid", "activity"]).drop(drop_cols, axis=1)
req_unique = required_training_interactions.drop_duplicates(["cid", "activity"])

df_training_unique_interactions = pd.concat([opt_unique, req_unique])

df_pid_trn = pd.merge(df_training_unique_interactions, df_expasy, on='pid', how='inner')
df_pid_trn = pd.merge(df_pid_trn, df_profeat, on='pid', how='inner')

df_cid_trn = pd.merge(df_training_unique_interactions, df_dragon_features, on='cid', how='inner')

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


  if sys.path[0] == '':


### Check for types and missing values

In [9]:
df_pid_vld.activity = df_pid_vld.activity.astype(float)
df_cid_vld.activity = df_cid_vld.activity.astype(float)
df_pid_trn.activity = df_pid_trn.activity.astype(float)
df_cid_trn.activity = df_cid_trn.activity.astype(float)
print('pid validation set: {}'.format(df_pid_vld.shape))
print('cid validation set: {}'.format(df_cid_vld.shape))
print('pid training set: {}'.format(df_pid_trn.shape))
print('cid training set: {}'.format(df_cid_trn.shape))

pid validation set: (7763, 861)
cid validation set: (7969, 3645)
pid training set: (94805, 861)
cid training set: (95178, 3645)


In [10]:
# Any missing values?
print(df_pid_vld.isnull().values.any())
print(df_cid_vld.isnull().values.any())
print(df_pid_trn.isnull().values.any())
print(df_cid_trn.isnull().values.any())

False
False
True
True


In [11]:
df_pid_vld.dtypes.value_counts()

float64    859
int64        1
object       1
dtype: int64

In [12]:
df_cid_vld.dtypes.value_counts()

float64    3597
int64        47
object        1
dtype: int64

In [13]:
df_cid_trn.dtypes.value_counts()

float64    3597
int64        47
object        1
dtype: int64

In [14]:
df_pid_trn.dtypes.value_counts()

float64    859
int64        1
object       1
dtype: int64

In [15]:
df_cid_vld.columns.to_series().groupby(df_cid_vld.dtypes).groups

{dtype('int64'): Index(['cid', 'cid_nAT', 'cid_nSK', 'cid_nTA', 'cid_nBT', 'cid_nBO', 'cid_nBM',
        'cid_RBN', 'cid_nDB', 'cid_nTB', 'cid_nAB', 'cid_nH', 'cid_nC',
        'cid_nN', 'cid_nO', 'cid_nP', 'cid_nS', 'cid_nF', 'cid_nCL', 'cid_nBR',
        'cid_nI', 'cid_nB', 'cid_nHM', 'cid_nHet', 'cid_nX', 'cid_nCsp3',
        'cid_nCsp2', 'cid_nCsp', 'cid_nStructures', 'cid_totalcharge',
        'cid_nCIC', 'cid_nCIR', 'cid_TRS', 'cid_Rperim', 'cid_Rbrid', 'cid_NRS',
        'cid_nR03', 'cid_nR04', 'cid_nR05', 'cid_nR06', 'cid_nR07', 'cid_nR08',
        'cid_nR09', 'cid_nR10', 'cid_nR11', 'cid_nR12', 'cid_nBnz'],
       dtype='object'),
 dtype('float64'): Index(['activity', 'cid_binary_weights', 'pid_binary_weights', 'cid_MW',
        'cid_AMW', 'cid_Sv', 'cid_Se', 'cid_Sp', 'cid_Si', 'cid_Mv',
        ...
        'cid_Hy', 'cid_TPSA(NO)', 'cid_TPSA(Tot)', 'cid_SAtot', 'cid_SAacc',
        'cid_SAdon', 'cid_Vx', 'cid_VvdwMG', 'cid_VvdwZAZ', 'cid_PDI'],
       dtype='object', length=

In [16]:
df_pid_vld.head()

Unnamed: 0,cid,pid,activity,cid_binary_weights,pid_binary_weights,helical,beta,coil,veryBuried,veryExposed,...,[G7.1.1.71],[G7.1.1.72],[G7.1.1.73],[G7.1.1.74],[G7.1.1.75],[G7.1.1.76],[G7.1.1.77],[G7.1.1.78],[G7.1.1.79],[G7.1.1.80]
0,938,AAB59829,1.0,0.952894,0.996703,0.349,0.219,0.432,0.326,0.187,...,0.003074,0.000574,-0.001384,-0.001869,-0.000164,-0.001509,0.000604,-0.000104,0.001366,0.002322
1,1986,AAB59829,1.0,0.975912,0.996703,0.349,0.219,0.432,0.326,0.187,...,0.003074,0.000574,-0.001384,-0.001869,-0.000164,-0.001509,0.000604,-0.000104,0.001366,0.002322
2,37542,AAB59829,0.0,0.963498,0.996703,0.349,0.219,0.432,0.326,0.187,...,0.003074,0.000574,-0.001384,-0.001869,-0.000164,-0.001509,0.000604,-0.000104,0.001366,0.002322
3,445580,AAB59829,0.0,0.97333,0.996703,0.349,0.219,0.432,0.326,0.187,...,0.003074,0.000574,-0.001384,-0.001869,-0.000164,-0.001509,0.000604,-0.000104,0.001366,0.002322
4,4100,AAB59829,0.0,0.923492,0.996703,0.349,0.219,0.432,0.326,0.187,...,0.003074,0.000574,-0.001384,-0.001869,-0.000164,-0.001509,0.000604,-0.000104,0.001366,0.002322


In [17]:
df_pid_trn.head()

Unnamed: 0,activity,cid,cid_binary_weights,pid,pid_binary_weights,helical,beta,coil,veryBuried,veryExposed,...,[G7.1.1.71],[G7.1.1.72],[G7.1.1.73],[G7.1.1.74],[G7.1.1.75],[G7.1.1.76],[G7.1.1.77],[G7.1.1.78],[G7.1.1.79],[G7.1.1.80]
0,0.0,38258,,CAA96025,,0.595,0.0,0.405,0.513,0.196,...,0.000589,0.003648,-0.000553,0.001793,0.002062,0.005031,0.002675,0.00584,0.001753,0.005002
1,0.0,5281718,,CAA96025,,0.595,0.0,0.405,0.513,0.196,...,0.000589,0.003648,-0.000553,0.001793,0.002062,0.005031,0.002675,0.00584,0.001753,0.005002
2,0.0,443936,,CAA96025,,0.595,0.0,0.405,0.513,0.196,...,0.000589,0.003648,-0.000553,0.001793,0.002062,0.005031,0.002675,0.00584,0.001753,0.005002
3,0.0,28417,,CAA96025,,0.595,0.0,0.405,0.513,0.196,...,0.000589,0.003648,-0.000553,0.001793,0.002062,0.005031,0.002675,0.00584,0.001753,0.005002
4,0.0,2153,,CAA96025,,0.595,0.0,0.405,0.513,0.196,...,0.000589,0.003648,-0.000553,0.001793,0.002062,0.005031,0.002675,0.00584,0.001753,0.005002


In [18]:
df_cid_trn.head()

Unnamed: 0,activity,cid,cid_binary_weights,pid,pid_binary_weights,cid_MW,cid_AMW,cid_Sv,cid_Se,cid_Sp,...,cid_Hy,cid_TPSA(NO),cid_TPSA(Tot),cid_SAtot,cid_SAacc,cid_SAdon,cid_Vx,cid_VvdwMG,cid_VvdwZAZ,cid_PDI
0,0.0,38258,,CAA96025,,388.35,9.031,26.997,45.967,25.727,...,5.553,221.54,221.54,472.908,350.549,193.431,407.193,170.259,316.62,0.861
1,0.0,38258,0.940816,2VM6_A,0.92974,388.35,9.031,26.997,45.967,25.727,...,5.553,221.54,221.54,472.908,350.549,193.431,407.193,170.259,316.62,0.861
2,0.0,5281718,,CAA96025,,390.42,7.808,31.513,51.338,32.011,...,3.356,139.84,139.84,563.698,278.1,256.1,460.033,191.915,347.52,0.816
3,1.0,5281718,,P08659,,390.42,7.808,31.513,51.338,32.011,...,3.356,139.84,139.84,563.698,278.1,256.1,460.033,191.915,347.52,0.816
4,0.0,5281718,0.996001,2VM6_A,0.92974,390.42,7.808,31.513,51.338,32.011,...,3.356,139.84,139.84,563.698,278.1,256.1,460.033,191.915,347.52,0.816


### Apply LDA

Gaussian Mixture Models are not working well with our data so we'll try LDA.

In [None]:
# For reference: Brian's code for doing a similiar thing on the binary features using LDA:

'''
# validation weighting
# train LDA for validation weights
validation_interactions = pd.read_csv(validation_interactions_csv, index_col=0)
validation_interactions.drop("latent_prob_delta_ratio", axis=1, inplace=True)
required_training_interactions = pd.read_csv(required_training_interactions_csv, index_col=0)
optional_training_interactions = pd.read_csv(optional_training_interactions_csv, index_col=0)

opt_unique = optional_training_interactions.drop_duplicates(["cid", "activity"]).drop("activity_score", axis=1)
req_unique = required_training_interactions.drop_duplicates(["cid", "activity"])

training_unique = pd.concat([opt_unique, req_unique]).drop_duplicates(["cid", "activity"])
num_topics = 100
lda = LatentDirichletAllocation(n_components=num_topics, random_state=42, learning_method="online", n_jobs=-1)

cid_fingerprints_file = "dataset_raw_files/fingerprints.csv"
cid_fingerprints = pd.read_csv(cid_fingerprints_file)
cid_interactions = cid_fingerprints.merge(training_unique.drop("pid", axis=1), on="cid").drop("cid", axis=1)

lda.fit(cid_interactions)
del cid_interactions
cid_fingerprints["pos"] = 1
cid_fingerprints["neg"] = 0
cid_fingerprints.set_index("cid", inplace=True)

latent_prob_pos = np.max(lda.transform(cid_fingerprints.drop("neg", axis=1)), axis=1)
cid_fingerprints["latent_prob_neg"] = np.max(lda.transform(cid_fingerprints.drop("pos", axis=1)), axis=1)
cid_fingerprints.reset_index(inplace=True)
cid_fingerprints["latent_prob_pos"] = latent_prob_pos
cid_fingerprints.drop(["pos", "neg"], axis=1, inplace=True)
cid_fingerprints["latent_prob_delta"] = np.abs(cid_fingerprints.latent_prob_pos - cid_fingerprints.latent_prob_neg)
cid_fingerprints["latent_prob_delta_ratio"] = 1 - 2 * cid_fingerprints.latent_prob_delta / (cid_fingerprints.latent_prob_pos + cid_fingerprints.latent_prob_neg)
'''


In [24]:
# Drop non-feature cols
pids_vld = df_pid_vld.drop(['cid','pid','cid_binary_weights','pid_binary_weights'], axis=1)
cids_vld = df_cid_vld.drop(['cid','pid','cid_binary_weights','pid_binary_weights'], axis=1)

pids_trn = df_pid_trn.drop(['cid','pid','cid_binary_weights','pid_binary_weights'], axis=1)
cids_trn = df_cid_trn.drop(['cid','pid','cid_binary_weights','pid_binary_weights'], axis=1)

In [25]:
print('pid validation set: {}'.format(pids_vld.shape))
print('cid validation set: {}'.format(cids_vld.shape))
print('pid training set: {}'.format(pids_trn.shape))
print('cid training set: {}'.format(cids_trn.shape))

pid validation set: (7763, 857)
cid validation set: (7969, 3641)
pid training set: (94805, 857)
cid training set: (95178, 3641)


In [26]:
# Looks for negative values
print('There are {} columns with negative values in the cid validation set.'
      .format(len(cids_vld.columns[(cids_vld < 0).any()])))

There are 268 columns with negative values in the cid validation set.


In [27]:
# Normalize
pid_scaler_trn = MinMaxScaler()
pids_trn[pids_trn.columns] = pid_scaler_trn.fit_transform(pids_trn)
cid_scaler_trn = MinMaxScaler()
cids_trn[cids_trn.columns] = cid_scaler_trn.fit_transform(cids_trn)

In [28]:
'''
Scale the validation data using the MinMaxScaler fitted on the training data.
'''
pids_vld[pids_vld.columns] = pid_scaler_trn.transform(pids_vld)
cids_vld[cids_vld.columns] = cid_scaler_trn.transform(cids_vld)

In [29]:
# Looks for negative values in the scaled validation data
print('There are {} columns with negative values in the pid validation set.'
      .format(len(pids_vld.columns[(pids_vld < 0).any()])))

There are 168 columns with negative values in the pid validation set.


In [30]:
'''
Some of the pid colums contain -ve values. Find the minimum value and add 
it to each value of the affected column in both the validation 
and training sets.
'''
neg_cols = pids_vld.columns[(pids_vld < 0).any()]
min_vals = pids_vld[neg_cols].min().values

for idx, col in enumerate(neg_cols):
    pids_vld[col] -=  min_vals[idx]
    pids_trn[col] -=  min_vals[idx]

In [31]:
pids_vld[neg_cols].head()

Unnamed: 0,coil,veryBuried,[G1.1.1.10],[G1.1.1.18],[G4.1.2.2],[G4.1.3.3],[G4.1.5.3],[G4.1.9.3],[G4.2.1.2],[G4.2.2.3],...,[G7.1.1.26],[G7.1.1.33],[G7.1.1.34],[G7.1.1.41],[G7.1.1.47],[G7.1.1.48],[G7.1.1.50],[G7.1.1.63],[G7.1.1.64],[G7.1.1.80]
0,0.452278,0.46505,0.47382,0.428714,0.648706,0.512861,0.435462,0.767511,0.679673,0.690049,...,0.267085,0.44627,0.436036,0.397102,0.611445,0.861962,0.66876,0.483128,0.538031,0.716353
1,0.452278,0.46505,0.47382,0.428714,0.648706,0.512861,0.435462,0.767511,0.679673,0.690049,...,0.267085,0.44627,0.436036,0.397102,0.611445,0.861962,0.66876,0.483128,0.538031,0.716353
2,0.452278,0.46505,0.47382,0.428714,0.648706,0.512861,0.435462,0.767511,0.679673,0.690049,...,0.267085,0.44627,0.436036,0.397102,0.611445,0.861962,0.66876,0.483128,0.538031,0.716353
3,0.452278,0.46505,0.47382,0.428714,0.648706,0.512861,0.435462,0.767511,0.679673,0.690049,...,0.267085,0.44627,0.436036,0.397102,0.611445,0.861962,0.66876,0.483128,0.538031,0.716353
4,0.452278,0.46505,0.47382,0.428714,0.648706,0.512861,0.435462,0.767511,0.679673,0.690049,...,0.267085,0.44627,0.436036,0.397102,0.611445,0.861962,0.66876,0.483128,0.538031,0.716353


In [32]:
# There should now be no -ve values
print('There are {} columns with negative values in the pid validation set.'
      .format(len(pids_vld.columns[(pids_vld < 0).any()])))

There are 0 columns with negative values in the pid validation set.


In [33]:
# Looks for negative values in the scaled validation data
print('There are {} columns with negative values in the cid validation set.'
      .format(len(cids_vld.columns[(cids_vld < 0).any()])))

There are 0 columns with negative values in the cid validation set.


In [34]:
from sklearn.decomposition import LatentDirichletAllocation

num_topics = 100

def LDA_fit(df_in, num_topics):
    lda = LatentDirichletAllocation(n_components=num_topics, 
                                        random_state=42, learning_method="online", n_jobs=-1)
    lda.fit(df_in)

    return lda


def LDA_transform(df_in, model):
    df_in["activity"] = 1
    latent_prob_pos = np.max(model.transform(df_in), axis=1)
    
    df_in["activity"] = 0
    df_in["latent_prob_neg"] = np.max(model.transform(df_in), axis=1)
    
    df_in["latent_prob_pos"] = latent_prob_pos
    df_in["latent_prob_delta"] = np.abs(df_in.latent_prob_pos - df_in.latent_prob_neg)
    df_in["latent_prob_delta_ratio"] = 1 - 2 * df_in.latent_prob_delta / (df_in.latent_prob_pos + df_in.latent_prob_neg)
    
    return df_in

### Train LDA on the training set

In [35]:
lda_cid = LDA_fit(cids_trn, num_topics)
lda_pid = LDA_fit(pids_trn, num_topics)

In [38]:
# Save the models
import joblib

joblib.dump(lda_cid, 'lda_cid.jl')
joblib.dump(lda_pid, 'lda_pid.jl')

['lda_pid.jl']

In [76]:
# Create a full-feature training set, WITHOUT dropping duplicates this time.
optional_training_interactions_csv = '../data/v5/optional_training_interactions_v5.csv'
required_training_interactions_csv = '../data/v5/required_training_interactions_v5.csv'

optional_training_interactions = pd.read_csv(optional_training_interactions_csv, index_col=0)
required_training_interactions = pd.read_csv(required_training_interactions_csv, index_col=0)

df_opt_pid_training = pd.merge(optional_training_interactions, df_expasy, on='pid', how='inner')
df_opt_pid_training = pd.merge(df_opt_pid_training, df_profeat, on='pid', how='inner')

df_req_pid_training = pd.merge(required_training_interactions, df_expasy, on='pid', how='inner')
df_req_pid_training = pd.merge(df_req_pid_training, df_profeat, on='pid', how='inner')

df_opt_cid_training = pd.merge(optional_training_interactions, df_dragon_features, on='cid', how='inner')
df_req_cid_training = pd.merge(required_training_interactions, df_dragon_features, on='cid', how='inner')

In [77]:
print(len(df_opt_pid_training), len(df_req_pid_training))
print(len(df_opt_cid_training), len(df_req_cid_training))

154761 21010
157224 21619


In [None]:
# fit LDA to the cid and pid columns of the full set, separately.
opt_pid_cols = df_opt_pid_training.columns
opt_cid_cols = df_opt_cid_training.columns
req_pid_cols = df_req_pid_training.columns
req_cid_cols = df_req_cid_training.columns

opt_drop_cols = ['cid_binary_weights', 'pid_binary_weights', 'pid', 'cid', 'activity_score']
req_drop_cols = ['cid_binary_weights', 'pid_binary_weights', 'pid', 'cid']

df_opt_cid_training_in = df_opt_cid_training[opt_cid_cols].drop(opt_drop_cols, axis=1).copy()
df_req_cid_training_in = df_req_cid_training[req_cid_cols].drop(req_drop_cols, axis=1).copy()

df_opt_pid_training_in = df_opt_pid_training[opt_pid_cols].drop(opt_drop_cols, axis=1).copy()
df_req_pid_training_in = df_req_pid_training[req_pid_cols].drop(req_drop_cols, axis=1).copy()

# scale
df_opt_cid_training_in[df_opt_cid_training_in.columns] = cid_scaler_trn.transform(df_opt_cid_training_in)
df_req_cid_training_in[df_req_cid_training_in.columns] = cid_scaler_trn.transform(df_req_cid_training_in)

df_opt_pid_training_in[df_opt_pid_training_in.columns] = pid_scaler_trn.transform(df_opt_pid_training_in)
df_req_pid_training_in[df_req_pid_training_in.columns] = pid_scaler_trn.transform(df_req_pid_training_in)

df_opt_cid_training_in = LDA_transform(df_opt_cid_training_in, lda_cid)
df_req_cid_training_in = LDA_transform(df_req_cid_training_in, lda_cid)

df_opt_pid_training_in = LDA_transform(df_opt_pid_training_in, lda_pid)
df_req_pid_training_in = LDA_transform(df_req_pid_training_in, lda_pid)


# Add the new LDA weight columns to the full training set.
df_opt_cid_training_in['cid_continuous_weights'] = df_opt_cid_training_in['latent_prob_delta_ratio']
df_req_cid_training_in['cid_continuous_weights'] = df_req_cid_training_in['latent_prob_delta_ratio']

df_opt_pid_training_in['pid_continuous_weights'] = df_opt_pid_training_in['latent_prob_delta_ratio']
df_req_pid_training_in['pid_continuous_weights'] = df_req_pid_training_in['latent_prob_delta_ratio']

In [None]:
print(df_opt_cid_training_in['cid_continuous_weights'].min(), df_opt_cid_training_in['cid_continuous_weights'].max())
print(df_opt_pid_training_in['pid_continuous_weights'].min(), df_opt_pid_training_in['pid_continuous_weights'].max())

print(df_req_cid_training_in['cid_continuous_weights'].min(), df_req_cid_training_in['cid_continuous_weights'].max())
print(df_req_pid_training_in['pid_continuous_weights'].min(), df_req_pid_training_in['pid_continuous_weights'].max())

### Create new directory for the v6 validation files

These are just the v5 files with the LDA weights for continuous fields.

In [None]:
import os

v6_path = '../data/v6'
try:
  os.makedirs(v6_path, exist_ok=True)
except OSError:
  print("Creation of the directory %s failed" % v6_path)
else:
  print("Successfully created the directory %s " % v6_path)

In [None]:
optional_training_interactions['cid_continuous_weights'] = df_opt_cid_training_in['cid_continuous_weights']
optional_training_interactions['pid_continuous_weights'] = df_opt_pid_training_in['pid_continuous_weights']
required_training_interactions['cid_continuous_weights'] = df_req_cid_training_in['cid_continuous_weights']
required_training_interactions['pid_continuous_weights'] = df_req_pid_training_in['pid_continuous_weights']

optional_training_interactions.to_csv(v6_path+'/optional_training_interactions_v6.csv', index=False)
required_training_interactions.to_csv(v6_path'/required_training_interactions_v6.csv', index=False)

In [None]:
optional_training_interactions.head()

In [None]:
required_training_interactions.head()

### Transform the validation set

In [130]:
cids_vld = LDA_transform(cids_vld, lda_cid)
pids_vld = LDA_transform(pids_vld, lda_pid)

In [None]:
print(cids_vld['latent_prob_delta_ratio'].min(), cids_vld['latent_prob_delta_ratio'].max())
print(pids_vld['latent_prob_delta_ratio'].min(), pids_vld['latent_prob_delta_ratio'].max())

In [None]:
cids_vld.head()

In [None]:
pids_vld.head()

In [None]:
validation_interactions = '../data/v5/validation_interactions_v5.csv'
df_validation_interactions = __load_data(validation_interactions)

# Add the two new column for the continuous LDA weights for cids and pids to create the v6 validation file
df_validation_interactions['cid_continuous_weights'] = cids_vld['latent_prob_delta_ratio']
df_validation_interactions['pid_continuous_weights'] = pids_vld['latent_prob_delta_ratio']

# Save the new weights to the validation interactions file.
df_validation_interactions.to_csv(v6_path+'/validation_interactions_v6.csv', index=False)

In [131]:
# save to another file for potential future use
# Add the new latent_prob_delta_ratio to the original dataframes and save to csv.
new_cols = ['latent_prob_delta_ratio', 'latent_prob_neg', 'latent_prob_pos', 'latent_prob_delta']
df_pid_vld[new_cols] = pids_vld[new_cols]
df_cid_vld[new_cols] = cids_vld[new_cols]

df_pid_vld.to_csv('../data/pid_v5_LDA_continuous.csv', index=False)
df_cid_vld.to_csv('../data/cid_v5_LDA_continuous.csv', index=False)

In [132]:
cids_vld.head()

Unnamed: 0,activity,cid_MW,cid_AMW,cid_Sv,cid_Se,cid_Sp,cid_Si,cid_Mv,cid_Me,cid_Mp,...,cid_SAacc,cid_SAdon,cid_Vx,cid_VvdwMG,cid_VvdwZAZ,cid_PDI,latent_prob_neg,latent_prob_pos,latent_prob_delta,latent_prob_delta_ratio
0,0,0.02632,0.03697,0.027505,0.025197,0.022829,0.023663,0.075923,0.750645,0.010561,...,0.032122,0.022222,0.023525,0.023524,0.023067,0.513699,0.955786,0.956439,0.000653,0.999317
1,0,0.02632,0.03697,0.027505,0.025197,0.022829,0.023663,0.075923,0.750645,0.010561,...,0.032122,0.022222,0.023525,0.023524,0.023067,0.513699,0.955786,0.956439,0.000653,0.999317
2,0,0.02632,0.03697,0.027505,0.025197,0.022829,0.023663,0.075923,0.750645,0.010561,...,0.032122,0.022222,0.023525,0.023524,0.023067,0.513699,0.955786,0.956439,0.000653,0.999317
3,0,0.02632,0.03697,0.027505,0.025197,0.022829,0.023663,0.075923,0.750645,0.010561,...,0.032122,0.022222,0.023525,0.023524,0.023067,0.513699,0.955786,0.956439,0.000653,0.999317
4,0,0.02632,0.03697,0.027505,0.025197,0.022829,0.023663,0.075923,0.750645,0.010561,...,0.032122,0.022222,0.023525,0.023524,0.023067,0.513699,0.955786,0.956439,0.000653,0.999317


In [133]:
pids_vld.head()

Unnamed: 0,activity,helical,beta,coil,veryBuried,veryExposed,someBuried,someExposed,[G1.1.1.1],[G1.1.1.2],...,[G7.1.1.75],[G7.1.1.76],[G7.1.1.77],[G7.1.1.78],[G7.1.1.79],[G7.1.1.80],latent_prob_neg,latent_prob_pos,latent_prob_delta,latent_prob_delta_ratio
0,0,0.367368,0.353796,0.438947,0.450899,0.180444,0.528864,0.541333,0.087959,0.072763,...,0.582109,0.456048,0.583516,0.476059,0.577332,0.660866,0.803432,0.800348,0.003084,0.996154
1,0,0.367368,0.353796,0.438947,0.450899,0.180444,0.528864,0.541333,0.087959,0.072763,...,0.582109,0.456048,0.583516,0.476059,0.577332,0.660866,0.803432,0.800348,0.003084,0.996154
2,0,0.367368,0.353796,0.438947,0.450899,0.180444,0.528864,0.541333,0.087959,0.072763,...,0.582109,0.456048,0.583516,0.476059,0.577332,0.660866,0.803432,0.800348,0.003084,0.996154
3,0,0.367368,0.353796,0.438947,0.450899,0.180444,0.528864,0.541333,0.087959,0.072763,...,0.582109,0.456048,0.583516,0.476059,0.577332,0.660866,0.803432,0.800348,0.003084,0.996154
4,0,0.367368,0.353796,0.438947,0.450899,0.180444,0.528864,0.541333,0.087959,0.072763,...,0.582109,0.456048,0.583516,0.476059,0.577332,0.660866,0.803432,0.800348,0.003084,0.996154


### Train a GMM over each feature set

<span style="color:red; font-weight:bold;">None of the stuff below worked</span>

In [118]:
gmm_pid = GaussianMixture(n_components=50,
              covariance_type='full', max_iter=50, random_state=42, reg_covar=0.000001)

gmm_cid = GaussianMixture(n_components=50,
              covariance_type='full', max_iter=50, random_state=42, reg_covar=0.000001)

### Random Projection to reduce dimensionality before training GMM

<span style="color:red; font-weight:bold;">GMM predict_proba was only returning binary values for probabilties so I tried RP for dimension reduction to see if this improved things....it didn't.</span>

In [120]:
import numpy as np
from sklearn.random_projection import SparseRandomProjection
rng = np.random.RandomState(42)

transformer = SparseRandomProjection(random_state=rng, eps=0.2)
cids_trn = transformer.fit_transform(cids_trn)
print(cids_trn.shape)

transformer = SparseRandomProjection(random_state=rng, eps=0.4)
pids_trn = transformer.fit_transform(pids_trn)
print(pids.shape)

# very few components are non-zero
np.mean(transformer.components_ != 0)

(7969, 2073)
(7763, 610)


0.03418520573100982

### Fit the GMMs

In [121]:
gmm_cid.fit(cids_trn)
gmm_cid.converged_

True

In [122]:
gmm_pid.fit(pids_trn)
gmm_pid.converged_

True

In [125]:
gmm_cid.predict_proba([cids_trn[23]])

array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0.]])

In [126]:
gmm_pid.predict_proba([pids_trn[45]])

array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0.]])

### Obtain likelhood of cluster membership from multivariate normal pdf

<span style="color:red; font-weight:bold;">Maybe we can get non-binary probabilities by looking at the multivaraite pdf for a point's cluster? Nope, we can't, in this case. The pdf always just returns `inf`. </span>

See https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.multivariate_normal.html

In [127]:
from scipy.stats import multivariate_normal

def p(model, v):
    p = model.predict_proba([v])
    cluster = np.argmax(p)
    mu = model.means_[cluster]
    cov = model.covariances_[cluster]
    return multivariate_normal.pdf(v, mean=mu, cov=cov)

p(gmm_pid, pids_trn[2000])

  out = np.exp(self._logpdf(x, mean, psd.U, psd.log_pdet, psd.rank))


inf

### Mahalanobis distance differences when flipping activity

<span style="color:red; font-weight:bold;">Maybe the Mahalanobis distance can be used? Nope, doesn't look lke it. The scores look off.</span>

For each point, obtain a prediction from GMM of its closest cluster. Then calculate the mahalanobis distance to its  cluster. Next, flip the activity bit and re-predict and calculate distance again. Finally, calculate the poiint's score given the change in distance. NOTE: this assumes that flipping activity does not change cluster membership, only likelihood of membership of the same cluster.

In [128]:
from scipy.spatial import distance

def dist(model, v):
    p = model.predict_proba(v)
    cluster = np.argmax(p)
    mu = model.means_[cluster]
    cov = model.covariances_[cluster]
    icov = np.linalg.inv(cov) 
    return distance.mahalanobis(v, mu, icov)

def get_score(vin, model):
    v = [vin]
    d1 = dist(model, v)
    print(d1)

    # Flip activity and measure distance
    if vin[0] == 1.0:
        vin[0] = 0.0
    else:
        vin[0] = 1.0

    d2 = dist(model, v)
    print(d2)
    
    score = 1 - 2 * np.abs(d1-d2)/d1+d2
    return score
    
pid_scores = [get_score(v, gmm_pid) for v in pids_trn[:10]] # Try on the first 10 pids

4.574986051786556
837.1345837309585
4.574986051786556
837.1345837309585
2.810909954716742
837.1268951985757
2.810909954716742
837.1268951985757
2.810909954716742
837.1268951985757
2.810909954716742
837.1268951985757
2.810909954716742
837.1268951985757
2.810909954716742
837.1268951985757
2.810909954716742
837.1268951985757
2.810909954716742
837.1268951985757


In [104]:
pid_scores

[464.86089241996586,
 464.86089241996586,
 -107.02854902185118,
 -107.02854902185118,
 -107.02854902185118,
 -107.02854902185118,
 -107.02854902185118,
 -107.02854902185118,
 -107.02854902185118,
 -107.02854902185118]

### K-L divergence

<span style="color:red; font-weight:bold;">Never got to try this since we couldn't get good results from GMM.</span>

Since there is no closed-form solution over GMMs we will use an approximation based upon Monte Carlo simulation.

In [None]:
def gmm_kl(gmm_p, gmm_q, n_samples=10**5):
    X = gmm_p.sample(n_samples)
    log_p_X, _ = gmm_p.score_samples(X)
    log_q_X, _ = gmm_q.score_samples(X)
    return log_p_X.mean() - log_q_X.mean()