# Training a 2D CNN on GTEx v8

This file is part of the Verifying explainability of a deep learning tissue classifier trained on RNA-seq data project.

Verifying explainability of a deep learning tissue classifier trained on RNA-seq data project is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.


Verifying explainability of a deep learning tissue classifier trained on RNA-seq data project is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with the Verifying explainability of a deep learning tissue classifier trained on RNA-seq data project.  If not, see <http://www.gnu.org/licenses/>.



### Objective:
> Investigate how a 2D CNN performs when trained on imbalanced and balanced datasets derived from GTEx v8

### Input files:
1. *gtex_filtered_tmm_intersect_{data_type}.pkl*
2. *gtex_filtered_tmm_intersect_test.pkl*


### Output files:
1. *{data_type}_model_topology.json*
2. *{data_type}_model_weights.hdf5*
 
### Table of contents:
1. [Import Modules](#1.-Import-Modules)  
2. [Set static paths](#2.-Set-static-paths)  
3. [Load files](#3.-Load-files)  
    3.1 [Load training data](#3.1-Load-training-data)  
4. [Process data](#4.-Process-data)  
    4.1 [Split X and y](#4.1-Split-X-and-y)  
    4.2 [Transform data](#4.2-Transform-data)  
    4.3 [Add labels](#4.3-Add-labels)
5. [Train model](#5.-Train-model)  
    5.1 [Fit model](#5.1-Fit-model)  
    5.2 [Save model](#5.2-Save-model)  
6. [Test model](#6.-Test-model)  
    6.1 [Load GTEx v8 data](#6.1-Load-GTEx-v8-data)  
    6.2 [Prepare data](#6.2-Prepare-data)  
    6.3 [Load model](#6.3-Load-model)  
    6.4 [Run inference](#6.4-Run-inference)  

## 1. Import Modules

In [None]:
import os

In [None]:
util_path = '../src'
os.chdir(util_path)

In [None]:
import pandas as pd
import numpy as np
import pickle

from keras import backend as K
from keras.models import model_from_json
from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import f1_score

from modelling.cnn import convert_2d, convert_onehot, keras_cnn, log_transform
%load_ext autoreload
%autoreload 2

In [None]:
# Select a single GPU to use 
# Skip this part if you're parallelising or only have 1 GPU
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"; 
os.environ["CUDA_VISIBLE_DEVICES"]="0"

## 2. Set static paths

In [None]:
data_type = "imbalanced"

In [None]:
input_dir = "../data/processed/"
model_dir = "../models/"

## 3. Load files

#### 3.1 Load training data

In [None]:
%%time
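# Load the training set for the selected data_type; the context manager
# closes the file handle once the pickle has been read
key = f"gtex_filtered_tmm_intersect_{data_type}.pkl"
with open(os.path.join(input_dir, key), "rb") as handle:
    gtex_tmm = pickle.load(handle)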

## 4. Process data

#### 4.1 Split X and y

In [None]:
X = gtex_tmm.drop("type", axis=1)
y_train = gtex_tmm["type"]    
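
Since the objective is to compare imbalanced and balanced training sets, it can be useful to confirm the class distribution right after the split. The following is a minimal sketch using only the standard library. `class_distribution` and `imbalance_ratio` are hypothetical helper names, and the tissue labels are made up for illustration rather than taken from GTEx.

```python
from collections import Counter


def class_distribution(labels):
    """Return {label: count}, ordered from most to least frequent."""
    return dict(Counter(labels).most_common())


def imbalance_ratio(labels):
    """Ratio of the largest class size to the smallest class size."""
    counts = Counter(labels).values()
    return max(counts) / min(counts)


# Toy labels standing in for the "type" column (illustrative only).
toy_labels = ["Lung"] * 6 + ["Liver"] * 3 + ["Spleen"] * 1
print(class_distribution(toy_labels))
print(f"imbalance ratio: {imbalance_ratio(toy_labels):.1f}")
```

In this notebook the same helpers could be called on `y_train` directly, since iterating over a pandas Series yields its values.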

#### 4.2 Transform data

In [None]:
X_train = log_transform(X)

In [None]:
X_train_converted = convert_2d(X_train)
y_train_converted = convert_onehot(y_train)
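
The helpers `log_transform`, `convert_2d`, and `convert_onehot` come from the project's `modelling.cnn` module, whose source is not shown here. As a rough sketch of what such steps typically involve, the block below assumes a log2(x + 1) transform, zero-padding of each sample to the next square length, a reshape to `(samples, side, side, 1)`, and one-hot columns in sorted label order. All of these choices, and the `sketch_*` function names, are assumptions and may differ from the project's actual implementation.

```python
import math

import numpy as np


def sketch_log_transform(x):
    """Assumed transform: log2(x + 1), which maps zero counts to zero."""
    return np.log2(np.asarray(x, dtype=float) + 1.0)


def sketch_convert_2d(x):
    """Zero-pad each row to a square length, then reshape to (n, side, side, 1)."""
    n_samples, n_genes = x.shape
    side = math.ceil(math.sqrt(n_genes))
    padded = np.zeros((n_samples, side * side), dtype=x.dtype)
    padded[:, :n_genes] = x
    return padded.reshape(n_samples, side, side, 1)


def sketch_convert_onehot(labels):
    """One-hot encode labels, with columns in sorted label order."""
    classes, indices = np.unique(labels, return_inverse=True)
    onehot = np.zeros((len(indices), len(classes)), dtype=int)
    onehot[np.arange(len(indices)), indices] = 1
    return onehot, classes


# Toy expression matrix: 2 samples x 5 genes (illustrative values only).
toy_counts = np.array([[0.0, 1.0, 3.0, 7.0, 15.0],
                       [1.0, 0.0, 0.0, 3.0, 1.0]])
images = sketch_convert_2d(sketch_log_transform(toy_counts))
onehot, classes = sketch_convert_onehot(["Lung", "Liver"])
print(images.shape)
print(classes, onehot.tolist())
```

Padding to a square is only one way to lay genes out on a grid. The layout matters for a 2D CNN because neighbouring pixels are convolved together, so the gene ordering is part of the model's input definition.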

## 5. Train model

#### 5.1 Fit model

In [None]:
model = keras_cnn(X_train_converted, y_train_converted)
print("Model Initiated.")

In [None]:
# Train the model; validation_split=0.1 holds out 10% of the samples for validation
model.fit(
    X_train_converted, 
    y_train_converted, 
    batch_size=128, 
    epochs=11,
    verbose=1,
    validation_split=0.1, 
) 
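
One detail worth keeping in mind: Keras takes the `validation_split` fraction from the end of the input arrays, before any shuffling. If the rows of the pickled DataFrame happen to be grouped by tissue (an assumption, since the row ordering is not stated here), the held-out 10% could cover only a few classes. The sketch below mimics that slicing with the standard library and shows one way to build a shuffled ordering first. `tail_split` and `shuffled_indices` are hypothetical helper names, not Keras functions.

```python
import random


def tail_split(n_samples, validation_split):
    """Index ranges for training and validation when the tail is held out."""
    split_at = int(n_samples * (1.0 - validation_split))
    return range(split_at), range(split_at, n_samples)


def shuffled_indices(n_samples, seed=0):
    """A reproducible random ordering of row indices."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    return order


# Toy labels sorted by class, so the tail contains a single class.
toy_labels = ["Liver"] * 10 + ["Lung"] * 5 + ["Spleen"] * 5
train_idx, val_idx = tail_split(len(toy_labels), validation_split=0.25)
print({toy_labels[i] for i in val_idx})  # only the last class is held out

# After shuffling, the same tail slice draws from a random mix of rows.
order = shuffled_indices(len(toy_labels), seed=0)
shuffled = [toy_labels[i] for i in order]
print(sorted({shuffled[i] for i in val_idx}))
```

If `X_train_converted` and `y_train_converted` are NumPy arrays, indexing both with the same `order` before calling `model.fit` would apply this shuffle to the training data.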

#### 5.2 Save model

In [None]:
# Save the model topology (JSON) and weights (HDF5)
model_json = model.to_json()
with open(os.path.join(model_dir, f"{data_type}_model_topology.json"), "w") as json_file:
    json_file.write(model_json)

model.save_weights(os.path.join(model_dir, f"{data_type}_model_weights.hdf5"))

In [None]:
# Delete model after training
K.clear_session()
del model

## 6. Test model

#### 6.1 Load GTEx v8 data

In [None]:
test_data = pd.read_pickle(
    input_dir + 'gtex_filtered_tmm_intersect_test.pkl'
)

#### 6.2 Prepare data

In [None]:
X_test = test_data.drop("type", axis=1)
y_test = test_data["type"]


X_test = log_transform(X_test)
X_test = convert_2d(X_test)

lb = LabelBinarizer()
lb.fit(y_test.values)
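
`LabelBinarizer` stores its classes in sorted order, so a predicted class index maps back to the right tissue name only if the training-time encoding in `convert_onehot` used the same ordering and the test set contains every class seen in training. Both conditions are assumptions here, since `convert_onehot` is defined elsewhere in the project. A minimal NumPy sketch of the index-to-label mapping, with made-up labels and a hypothetical `indices_to_labels` helper:

```python
import numpy as np


def indices_to_labels(class_indices, known_labels):
    """Map integer class indices to labels using the sorted unique labels."""
    classes = np.unique(known_labels)  # sorted, like LabelBinarizer.classes_
    class_indices = np.asarray(class_indices)
    if class_indices.size and class_indices.max() >= len(classes):
        raise ValueError("predicted index has no matching label")
    return classes[class_indices]


toy_test_labels = ["Lung", "Liver", "Spleen", "Liver"]
toy_predictions = [1, 0, 2, 0]  # indices into the sorted class list
print(indices_to_labels(toy_predictions, toy_test_labels))
```

The explicit range check surfaces the case where the model predicts a class that is absent from the labels used to build the mapping, instead of failing later with a less obvious indexing error.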

#### 6.3 Load model

In [None]:
# Load the model topology from JSON
model_json_path = model_dir+f"{data_type}_model_topology.json"
with open(model_json_path, "r") as json_file:
    trained_model = model_from_json(json_file.read())

# Load the trained weights into the new model
model_weights_path = model_dir+f"{data_type}_model_weights.hdf5"
trained_model.load_weights(model_weights_path)

#### 6.4 Run inference

In [None]:
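# Predict class indices and map them back to tissue labels.
# predict_classes() is no longer available in recent Keras releases, so take
# the argmax of the predicted class probabilities instead.
y_preds = np.argmax(trained_model.predict(X_test), axis=-1)
num_preds = len(y_preds)

# Use the binarizer's sorted classes so the one-hot columns match
# what inverse_transform expects
classes = lb.classes_
num_classes = len(classes)

y_preds_onehot = np.zeros([num_preds, num_classes])
y_preds_onehot[np.arange(num_preds), y_preds] = 1

y_preds_labels = lb.inverse_transform(y_preds_onehot)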

print(
    f"macro-average F1 : {f1_score(y_test, y_preds_labels, average='macro')}"
)
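
The macro-averaged F1 is the unweighted mean of the per-class F1 scores, so a rare tissue counts as much as a common one. That makes it a reasonable metric for comparing models trained on imbalanced and balanced data. The sketch below computes it with the standard library on made-up labels. `per_class_f1` and `macro_f1` are hypothetical helpers, and scoring a class with no true positives, false positives, or false negatives as 0 is a simplifying assumption.

```python
def per_class_f1(y_true, y_pred, label):
    """F1 for one class, computed as 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(per_class_f1(y_true, y_pred, label) for label in labels) / len(labels)


# Toy example: the single "Liver" sample is misclassified as "Lung".
toy_true = ["Lung", "Lung", "Lung", "Lung", "Liver", "Spleen"]
toy_pred = ["Lung", "Lung", "Lung", "Lung", "Lung", "Spleen"]

accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
print(f"accuracy : {accuracy:.3f}")
print(f"macro F1 : {macro_f1(toy_true, toy_pred):.3f}")
```

In the toy data, five of six predictions are correct, yet the macro F1 is pulled down sharply because the one "Liver" sample is missed and that class contributes an F1 of 0.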