# "Flat" Modeling for Gene Association with Lung Adenocarcinoma

This notebook models tabular data with 'classic' machine learning methods. The models aim to predict whether a gene is associated with Lung Adenocarcinoma (LUAD). The data includes 'node features' (ontological features of the genes) and network features (an embedding of each gene's position in the Protein-Protein Interaction (PPI) network). Genes are identified by their Ensembl ID.

'Node features' come from the Human Protein Atlas, and the PPI network comes from the STRING dataset (restricted to human genes).

## Data and Setup

In [1]:
import numpy as np
import pandas as pd

from tqdm import tqdm

In [2]:
data_path = 'data/HPAnode_PPInetwork_labels_tempv2.csv'
data = pd.read_csv(data_path, index_col=0)

In [3]:
# FIXME: this is a temporary label. Need to look into getting proper positive and negative labels.
# label: 1 if NIH_pos, 0 if not NIH_Cancer, NaN otherwise
label_name = 'my_label'
def my_labeler(NIH_pos, NIH_cancer):
    # a row with NIH_pos set but NIH_Cancer unset would qualify for both label 1 and label 0
    if NIH_pos and not NIH_cancer:
        raise ValueError('Data inconsistent. Found row with NIH label both positive (NIH_pos) and negative (not NIH_Cancer)')
    if NIH_pos:
        return 1
    elif not NIH_cancer:
        return 0
    else:
        return np.nan

my_labels = pd.array([my_labeler(row.NIH_pos, row.NIH_Cancer) for id_, row in data.iterrows()], dtype=float)

data[label_name] = my_labels
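
The row-wise labeler above can also be expressed as a vectorized operation. The sketch below is one possible equivalent, assuming `NIH_pos` and `NIH_Cancer` are boolean columns with no missing values; `label_frame` and the toy frame are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd

def label_frame(df, pos_col='NIH_pos', cancer_col='NIH_Cancer'):
    """Vectorized labeling: 1 if pos, 0 if not cancer, NaN otherwise."""
    # ASSUMPTION: both columns are boolean-like and contain no NaN
    pos = df[pos_col].astype(bool)
    cancer = df[cancer_col].astype(bool)
    # same consistency check as the row-wise labeler
    if (pos & ~cancer).any():
        raise ValueError('Data inconsistent. Found row with NIH label both positive and negative')
    # np.select takes the first matching condition, mirroring the if/elif order
    labels = np.select([pos, ~cancer], [1.0, 0.0], default=np.nan)
    return pd.Series(labels, index=df.index, name='my_label')

# toy data (hypothetical gene ids)
toy = pd.DataFrame(
    {'NIH_pos': [True, False, False], 'NIH_Cancer': [True, False, True]},
    index=['g1', 'g2', 'g3'],
)
toy_labels = label_frame(toy)
print(toy_labels)
```

On a frame of this size the speed difference is negligible; the main benefit is that the consistency check runs once over the whole column.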

In [4]:
data[label_name].value_counts()

1.0    521
0.0    135
Name: my_label, dtype: int64

In [5]:
# we use the temporary label derived from the NIH columns above: 1 if positive for association with LUAD, 0 if negative, and NaN if unknown/low confidence
label_col = label_name

In [6]:
from sklearn.metrics import classification_report

def eval_model(model, X, y):
    preds = model.predict(X)
    clf_report = classification_report(y, preds, labels=[0, 1], target_names=['negative', 'positive'], digits=2)
    print(clf_report)
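
Because the labels are imbalanced (521 positives vs. 135 negatives), it can help to report metrics that are less sensitive to class frequency alongside the classification report. The following is a minimal sketch, not part of the original analysis: `imbalance_aware_scores` is a hypothetical helper, and the synthetic data from `make_classification` and the logistic regression model only stand in for the real features and classifier.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

def imbalance_aware_scores(model, X, y):
    """Return balanced accuracy and ROC AUC for a fitted binary classifier."""
    preds = model.predict(X)
    # locate the probability column that corresponds to the positive class (label 1)
    pos_col = list(model.classes_).index(1)
    proba_pos = model.predict_proba(X)[:, pos_col]
    return {
        'balanced_accuracy': balanced_accuracy_score(y, preds),
        'roc_auc': roc_auc_score(y, proba_pos),
    }

# synthetic stand-in data with roughly 80% positives (assumption, for illustration only)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, weights=[0.2, 0.8], random_state=0
)
demo_model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)
demo_scores = imbalance_aware_scores(demo_model, X_demo, y_demo)
print(demo_scores)
```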

## Node-only Modeling

### Set up

In [7]:
num_node_feats = 100
node_feat_cols = ['Tissue RNA - lung [NX]', 'Single Cell Type RNA - Mucus-secreting cells [NX]'] + [f'node_{i}' for i in range(num_node_feats)]

# get subset of node features + labels
node_data = data[node_feat_cols + [label_col]]

# restrict to data with labels
node_data_labeled = node_data[node_data[label_col].notna()]
node_data_labeled

Ensembl,Tissue RNA - lung [NX],Single Cell Type RNA - Mucus-secreting cells [NX],node_0,node_1,node_2,node_3,node_4,node_5,node_6,node_7,...,node_91,node_92,node_93,node_94,node_95,node_96,node_97,node_98,node_99,my_label
ENSG00000002834,0.996506,0.038446,1.537459,0.429451,-0.302780,0.258347,-0.950399,1.365928,-0.064216,0.228324,...,-0.115296,-0.234800,0.049897,0.382442,0.168788,-0.033863,0.065672,0.088566,0.006333,1.0
ENSG00000005339,0.150101,-0.069827,2.622879,0.092524,1.558535,-1.148822,0.606971,0.573626,0.106728,-0.357630,...,-0.327284,-0.087676,0.254183,-0.066311,-0.014220,-0.059492,0.095315,0.159288,-0.186821,1.0
ENSG00000006468,0.046881,-0.103977,1.976907,-1.347319,1.559400,-0.076801,0.088430,0.544722,0.046640,-0.046761,...,0.037351,-0.026250,-0.042694,0.015399,-0.069292,-0.039175,-0.089943,0.059104,-0.008584,1.0
ENSG00000007237,-0.211169,-0.105079,1.302629,0.359702,0.487322,-0.169744,-0.610400,0.668143,-0.180892,-0.021217,...,0.075845,-0.315101,0.056129,0.535637,0.085542,-0.076495,-0.089464,0.086111,0.017891,1.0
ENSG00000007312,-0.655015,-0.105079,1.480835,0.846645,0.805566,0.485862,-0.194441,0.923558,1.438937,-0.492804,...,-0.031383,-0.038140,-0.029342,-0.007463,-0.029591,-0.034695,-0.038255,0.000011,0.000772,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ENSG00000284922,-0.634371,-0.105079,0.700987,-0.087903,-0.319591,0.006010,-0.224073,-0.241506,-0.419347,0.136663,...,-0.038222,-0.073042,0.162214,-0.065836,0.078069,0.020572,-0.251188,-0.065744,0.097520,0.0
ENSG00000285043,-0.603405,0.835389,2.009425,1.142514,0.587500,-1.108236,-0.523388,0.751758,0.338092,0.156393,...,-0.067621,0.055089,-0.035550,-0.016057,0.030357,0.086635,-0.116914,-0.061450,-0.064007,0.0
ENSG00000285188,-0.618888,-0.105079,1.292300,0.347360,-0.751088,-0.469017,0.258115,0.182103,0.151207,0.596934,...,-0.028228,-0.029454,-0.024224,-0.004768,-0.009325,-0.014907,-0.000052,-0.002426,-0.016972,0.0
ENSG00000285292,-0.422770,-0.092489,0.942309,-0.071680,-0.648776,0.007953,-0.479971,0.004475,0.273699,0.507728,...,-0.001730,-0.000357,0.013551,-0.001779,-0.008815,0.000180,0.003282,-0.002532,-0.006610,0.0


In [8]:
# separate features and labels
node_feats = node_data_labeled[node_feat_cols]
node_labels = node_data_labeled[label_col]

In [9]:
# create train-test split

from sklearn.model_selection import train_test_split
test_size = 0.25

X_train, X_test, y_train, y_test = train_test_split(node_feats, node_labels, test_size=test_size, shuffle=True, stratify=node_labels)
# NOTE: train test split is shuffled and stratified across labels
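
The split above does not fix a random seed, so the train and test sets change on every run. A minimal sketch of a reproducible stratified split follows; the seed value 0 is an arbitrary assumption, and the toy arrays only mimic the class counts reported earlier (521 positive, 135 negative).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy labels with the same class counts as the value_counts() output above
y_toy = np.array([1.0] * 521 + [0.0] * 135)
# a single dummy feature holding the row id, so the two splits can be compared
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

def split(seed):
    """Stratified, shuffled 75/25 split with a fixed seed."""
    return train_test_split(
        X_toy, y_toy, test_size=0.25, shuffle=True, stratify=y_toy, random_state=seed
    )

X_tr_a, X_te_a, y_tr_a, y_te_a = split(0)
X_tr_b, X_te_b, y_tr_b, y_te_b = split(0)

# with the same seed, both calls select exactly the same test rows
same_split = np.array_equal(X_te_a, X_te_b)
print('identical test rows:', same_split)
print('test positives / negatives:', int(y_te_a.sum()), int((y_te_a == 0).sum()))
```

Passing the same `random_state` to each of the three train-test splits in this notebook would also keep the three experiments on the same genes, since the labeled rows are identical across them.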

### Random Forest

In [10]:
from sklearn.ensemble import RandomForestClassifier

# define and train model
rf_clf = RandomForestClassifier(n_estimators=100, criterion='gini', max_depth=5)

rf_clf.fit(X_train, y_train)

# evaluate model
print('Training Metrics')
eval_model(rf_clf, X_train, y_train)

print()
print('Testing Metrics')
eval_model(rf_clf, X_test, y_test)

Training Metrics
              precision    recall  f1-score   support

    negative       1.00      1.00      1.00       101
    positive       1.00      1.00      1.00       391

    accuracy                           1.00       492
   macro avg       1.00      1.00      1.00       492
weighted avg       1.00      1.00      1.00       492


Testing Metrics
              precision    recall  f1-score   support

    negative       0.97      0.85      0.91        34
    positive       0.96      0.99      0.98       130

    accuracy                           0.96       164
   macro avg       0.96      0.92      0.94       164
weighted avg       0.96      0.96      0.96       164
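
To see which inputs a fitted random forest relies on, one option is to inspect its impurity-based `feature_importances_`. This is a sketch on synthetic data, not a result from the notebook: the column names and the `make_classification` data are placeholders for `node_feat_cols` and the real node features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic stand-in for the node features (assumption, for illustration only)
X_arr, y_arr = make_classification(
    n_samples=400, n_features=8, n_informative=3, random_state=0
)
feat_names = [f'node_{i}' for i in range(X_arr.shape[1])]
X_df = pd.DataFrame(X_arr, columns=feat_names)

# same hyperparameters as the notebook's model, plus a fixed seed
rf_demo = RandomForestClassifier(
    n_estimators=100, criterion='gini', max_depth=5, random_state=0
)
rf_demo.fit(X_df, y_arr)

# impurity-based importances, one per feature, sorted from most to least important
importances = pd.Series(rf_demo.feature_importances_, index=feat_names)
importances = importances.sort_values(ascending=False)
print(importances.head())
```

Impurity-based importances are computed on the training data and can be biased toward high-cardinality features, so they are best read as a rough ranking rather than a definitive attribution.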



## Network-only Modeling

### Set up

In [11]:
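num_network_feats = 128
# NOTE: the range below uses num_node_feats (100), not num_network_feats (128),
# so only network_0 .. network_99 are selected; the outputs below reflect that selection
network_feat_cols = [f'network_{i}' for i in range(num_node_feats)]

# get subset of network features + labels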
network_data = data[network_feat_cols + [label_col]]

# restrict to data with labels
network_data_labeled = network_data[network_data[label_col].notna()]
network_data_labeled

Ensembl,network_0,network_1,network_2,network_3,network_4,network_5,network_6,network_7,network_8,network_9,...,network_91,network_92,network_93,network_94,network_95,network_96,network_97,network_98,network_99,my_label
ENSG00000002834,0.071641,-0.061999,0.187635,-0.113627,0.047045,-0.162556,-0.015923,0.008927,-0.010027,0.137642,...,0.030174,-0.121302,-0.015925,-0.013704,-0.100824,0.015053,0.075565,0.012143,0.008626,1.0
ENSG00000005339,-0.063438,-0.123012,-0.025039,0.447147,0.187767,-0.070443,-0.044440,-0.194156,0.062328,0.117881,...,0.191070,-0.010945,0.051817,-0.096172,0.099404,0.052227,0.107331,-0.030173,0.059622,1.0
ENSG00000006468,-0.082036,0.021734,-0.018814,0.205492,0.104231,-0.236037,0.067416,-0.113189,-0.209449,0.048852,...,0.074782,-0.071042,0.035983,-0.088630,0.099139,0.219626,0.001993,-0.089294,0.137946,1.0
ENSG00000007237,0.008273,-0.023642,0.071034,-0.101778,0.054389,-0.140254,-0.078529,-0.053619,-0.062387,-0.131385,...,-0.022510,0.016630,0.058212,-0.112460,-0.022898,0.009220,0.161912,-0.013884,0.113748,1.0
ENSG00000007312,-0.098109,-0.085026,0.310362,0.172199,0.249988,0.150812,-0.117433,-0.093251,-0.021089,0.172613,...,-0.104160,-0.119484,-0.006960,-0.234720,0.234587,-0.303881,0.260561,0.104567,-0.057288,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ENSG00000284922,0.000492,-0.049555,0.091058,0.163664,0.174454,-0.290877,-0.102104,-0.086791,-0.163219,0.077145,...,-0.009689,-0.139890,0.154665,0.072253,-0.063501,-0.144602,-0.012729,-0.269494,0.124650,0.0
ENSG00000285043,0.114601,0.039142,0.092125,-0.107969,0.198529,-0.228728,-0.168646,-0.062217,-0.083524,0.036507,...,-0.115445,0.042261,0.139413,-0.090886,-0.014777,-0.171768,-0.044119,-0.131712,-0.012335,0.0
ENSG00000285188,-0.052122,-0.014974,0.029358,0.015208,0.139700,-0.203601,0.032800,-0.195412,-0.222361,0.039767,...,0.008503,-0.147238,0.130866,0.024724,-0.034465,0.037281,0.032680,-0.226415,0.157722,0.0
ENSG00000285292,0.017367,-0.156002,0.182375,0.322544,0.273273,-0.094883,-0.014420,0.033458,-0.107985,-0.098200,...,-0.157078,-0.281500,0.088705,0.012320,-0.103573,-0.109943,0.078542,-0.220539,0.076873,0.0


In [12]:
# separate features and labels
network_feats = network_data_labeled[network_feat_cols]
network_labels = network_data_labeled[label_col]
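
Building the column list from a hard-coded count can silently select fewer columns than the data contains. A possible alternative, sketched below, selects columns by name pattern with `DataFrame.filter`; the toy frame and the `^network_\d+$` pattern are assumptions based on the column names shown above.

```python
import pandas as pd

# toy frame with the same naming scheme as the real data (values are made up)
toy = pd.DataFrame(
    [[0.1, 0.2, 0.3, 0.4, 1.0]],
    columns=['node_0', 'network_0', 'network_1', 'network_2', 'my_label'],
    index=pd.Index(['ENSG_toy'], name='Ensembl'),
)

# keep every column whose name is exactly 'network_<integer>'
network_cols_found = toy.filter(regex=r'^network_\d+$').columns.tolist()
print(network_cols_found)
print('number of network features:', len(network_cols_found))
```

Comparing `len(network_cols_found)` with the expected embedding size is a cheap sanity check before training.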

In [13]:
# create train-test split

from sklearn.model_selection import train_test_split
test_size = 0.25

X_train, X_test, y_train, y_test = train_test_split(network_feats, network_labels, test_size=test_size, shuffle=True, stratify=network_labels)
# NOTE: train test split is shuffled and stratified across labels

### Random Forest

In [14]:
from sklearn.ensemble import RandomForestClassifier

# define and train model
rf_clf = RandomForestClassifier(n_estimators=100, criterion='gini', max_depth=5)

rf_clf.fit(X_train, y_train)

# evaluate model
print('Training Metrics')
eval_model(rf_clf, X_train, y_train)

print()
print('Testing Metrics')
eval_model(rf_clf, X_test, y_test)

Training Metrics
              precision    recall  f1-score   support

    negative       0.96      0.88      0.92       101
    positive       0.97      0.99      0.98       391

    accuracy                           0.97       492
   macro avg       0.96      0.94      0.95       492
weighted avg       0.97      0.97      0.97       492


Testing Metrics
              precision    recall  f1-score   support

    negative       0.62      0.59      0.61        34
    positive       0.89      0.91      0.90       130

    accuracy                           0.84       164
   macro avg       0.76      0.75      0.75       164
weighted avg       0.84      0.84      0.84       164



## Node + Network Modeling

### Set up

In [15]:
node_network_feat_cols = node_feat_cols + network_feat_cols

# get subset of node + network features + labels
node_network_data = data[node_network_feat_cols + [label_col]]

# restrict to data with labels
node_network_data_labeled = node_network_data[node_network_data[label_col].notna()]
node_network_data_labeled

Ensembl,Tissue RNA - lung [NX],Single Cell Type RNA - Mucus-secreting cells [NX],node_0,node_1,node_2,node_3,node_4,node_5,node_6,node_7,...,network_91,network_92,network_93,network_94,network_95,network_96,network_97,network_98,network_99,my_label
ENSG00000002834,0.996506,0.038446,1.537459,0.429451,-0.302780,0.258347,-0.950399,1.365928,-0.064216,0.228324,...,0.030174,-0.121302,-0.015925,-0.013704,-0.100824,0.015053,0.075565,0.012143,0.008626,1.0
ENSG00000005339,0.150101,-0.069827,2.622879,0.092524,1.558535,-1.148822,0.606971,0.573626,0.106728,-0.357630,...,0.191070,-0.010945,0.051817,-0.096172,0.099404,0.052227,0.107331,-0.030173,0.059622,1.0
ENSG00000006468,0.046881,-0.103977,1.976907,-1.347319,1.559400,-0.076801,0.088430,0.544722,0.046640,-0.046761,...,0.074782,-0.071042,0.035983,-0.088630,0.099139,0.219626,0.001993,-0.089294,0.137946,1.0
ENSG00000007237,-0.211169,-0.105079,1.302629,0.359702,0.487322,-0.169744,-0.610400,0.668143,-0.180892,-0.021217,...,-0.022510,0.016630,0.058212,-0.112460,-0.022898,0.009220,0.161912,-0.013884,0.113748,1.0
ENSG00000007312,-0.655015,-0.105079,1.480835,0.846645,0.805566,0.485862,-0.194441,0.923558,1.438937,-0.492804,...,-0.104160,-0.119484,-0.006960,-0.234720,0.234587,-0.303881,0.260561,0.104567,-0.057288,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ENSG00000284922,-0.634371,-0.105079,0.700987,-0.087903,-0.319591,0.006010,-0.224073,-0.241506,-0.419347,0.136663,...,-0.009689,-0.139890,0.154665,0.072253,-0.063501,-0.144602,-0.012729,-0.269494,0.124650,0.0
ENSG00000285043,-0.603405,0.835389,2.009425,1.142514,0.587500,-1.108236,-0.523388,0.751758,0.338092,0.156393,...,-0.115445,0.042261,0.139413,-0.090886,-0.014777,-0.171768,-0.044119,-0.131712,-0.012335,0.0
ENSG00000285188,-0.618888,-0.105079,1.292300,0.347360,-0.751088,-0.469017,0.258115,0.182103,0.151207,0.596934,...,0.008503,-0.147238,0.130866,0.024724,-0.034465,0.037281,0.032680,-0.226415,0.157722,0.0
ENSG00000285292,-0.422770,-0.092489,0.942309,-0.071680,-0.648776,0.007953,-0.479971,0.004475,0.273699,0.507728,...,-0.157078,-0.281500,0.088705,0.012320,-0.103573,-0.109943,0.078542,-0.220539,0.076873,0.0


In [16]:
# separate features and labels
node_network_feats = node_network_data_labeled[node_network_feat_cols]
node_network_labels = node_network_data_labeled[label_col]

In [17]:
# create train-test split

from sklearn.model_selection import train_test_split
test_size = 0.25

X_train, X_test, y_train, y_test = train_test_split(node_network_feats, node_network_labels, test_size=test_size, shuffle=True, stratify=node_network_labels)
# NOTE: train test split is shuffled and stratified across labels

### Random Forest

In [18]:
from sklearn.ensemble import RandomForestClassifier

# define and train model
rf_clf = RandomForestClassifier(n_estimators=100, criterion='gini', max_depth=5)

rf_clf.fit(X_train, y_train)

# evaluate model
print('Training Metrics')
eval_model(rf_clf, X_train, y_train)

print()
print('Testing Metrics')
eval_model(rf_clf, X_test, y_test)

Training Metrics
              precision    recall  f1-score   support

    negative       1.00      1.00      1.00       101
    positive       1.00      1.00      1.00       391

    accuracy                           1.00       492
   macro avg       1.00      1.00      1.00       492
weighted avg       1.00      1.00      1.00       492


Testing Metrics
              precision    recall  f1-score   support

    negative       0.97      0.88      0.92        34
    positive       0.97      0.99      0.98       130

    accuracy                           0.97       164
   macro avg       0.97      0.94      0.95       164
weighted avg       0.97      0.97      0.97       164
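
Each section above draws its own random train-test split, so the three test reports are not computed on the same genes. One way to make the comparison like-for-like is to score every feature set on identical cross-validation folds. The sketch below runs on synthetic data with hypothetical column names; the F1 metric, the five folds, and the seed are arbitrary assumptions rather than choices made in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# synthetic stand-in data with roughly 80% positives (assumption, for illustration only)
rng = np.random.default_rng(0)
n = 300
y = (rng.random(n) < 0.8).astype(int)
node_cols = [f'node_{i}' for i in range(5)]
net_cols = [f'network_{i}' for i in range(5)]
df = pd.DataFrame(rng.normal(size=(n, 10)), columns=node_cols + net_cols)
df[node_cols[0]] += y  # make one node feature informative

feature_sets = {
    'node': node_cols,
    'network': net_cols,
    'node+network': node_cols + net_cols,
}

# a fixed seed makes the folds identical for every feature set
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

cv_results = {}
for name, cols in feature_sets.items():
    model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
    scores = cross_val_score(model, df[cols], y, cv=cv, scoring='f1')
    cv_results[name] = (scores.mean(), scores.std())
    print(f'{name:>13}: F1 = {scores.mean():.3f} +/- {scores.std():.3f}')
```

With only 34 negatives in each test split above, the fold-to-fold standard deviation gives a rough sense of how much of the difference between feature sets could be split noise.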

