# Learning Bioinformatics in Python
Author: Dylan Loader

# Project Python Code

In [1]:
# Import statements
import os
from time import time
from IPython.display import Image
from pysster.Data import Data
from pysster.Model import Model
from pysster.Grid_Search import Grid_Search
from pysster import utils

# Generate a folder to hold the output
output_folder = "pysster_output/"

# Check to see if the output directory is in our path.
# If it is not, generate the output folder
if not os.path.isdir(output_folder):
    os.makedirs(output_folder)
    
# Make sure tensorflow is installed and that our gpu is accessible
import tensorflow as tf
print("TensorFlow version: "+ tf.__version__)
print("Current GPU used: "+ tf.test.gpu_device_name())
# This should return something like
# TensorFlow version: 1.12.0
# Current GPU used: /device:GPU:0
# If the GPU name comes back empty, the Jupyter notebook isn't recognizing your GPU.


  from ._conv import register_converters as _register_converters
Using TensorFlow backend.


TensorFlow version: 1.13.1
Current GPU used: 
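
In the output above the GPU name is empty, which is what `tf.test.gpu_device_name()` returns when TensorFlow cannot find a GPU. A minimal sketch of how that string could be turned into a readable status follows; the helper name `describe_gpu` is hypothetical and not part of TensorFlow or pysster.

```python
def describe_gpu(device_name):
    """Turn the string from tf.test.gpu_device_name() into a readable status.

    Hypothetical helper for illustration; it only inspects the string.
    """
    # An empty string means TensorFlow found no usable GPU.
    if not device_name:
        return "No GPU detected; TensorFlow will run on the CPU."
    return "GPU detected: " + device_name


print(describe_gpu(""))
print(describe_gpu("/device:GPU:0"))
```

In the notebook this would be called as `describe_gpu(tf.test.gpu_device_name())`.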


In [None]:
import pdb

In [2]:
# Load datasets of RNA A-to-I editing

# Import the data using the ACGU alphabet for the RNA sequence and the HIMS
# alphabet for its secondary-structure annotation

data = Data(["data/alu.fa.gz",
             "data/rep.fa.gz"], ("ACGU", "HIMS")) 


In [3]:
print(data.get_summary())

              class_0    class_1
all data:       50000      50000
training:       35117      34883
validation:      7422       7578
test:            7461       7539


Segment the data into training, validation, and test sets.

In [None]:
# Split the data into training/validation/test sets with the relative proportions 0.7/0.15/0.15
# Seed is defined to allow users to replicate numbers
data.train_val_test_split(portion_train=0.7, portion_val=0.15, seed=1775)
print(data.get_summary())
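
To make the idea of a seeded split concrete, here is a rough sketch in plain Python. It is not pysster's implementation; the function `split_indices` is a hypothetical stand-in that only borrows the parameter names `portion_train`, `portion_val`, and `seed` from the call above.

```python
import random


def split_indices(n_items, portion_train=0.7, portion_val=0.15, seed=None):
    """Shuffle range(n_items) and cut it into train/validation/test index lists."""
    # A dedicated generator keeps the split reproducible for a given seed.
    rng = random.Random(seed)
    indices = list(range(n_items))
    rng.shuffle(indices)

    n_train = round(n_items * portion_train)
    n_val = round(n_items * portion_val)

    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test


train, val, test = split_indices(100000, seed=1775)
print(len(train), len(val), len(test))
```

Calling the function twice with the same seed gives the same three index lists, which is what lets other users replicate the numbers.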

## Model Training and Summary


In [4]:
model = Model({"conv_num": 2, "kernel_num": 10, "kernel_len": 10, "epochs": 2}, data)
# Record the stop and start times to see how long our training takes.
start = time()
model.train(data, verbose=True)
stop = time()
print("time in minutes: {}".format((stop-start)/60))

Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
Instructions for updating:
Use tf.cast instead.
Epoch 1/2
Epoch 2/2
time in minutes: 1.6452104568481445


In [13]:
model.model.



[]

In [None]:
print(model.model.metrics_names[0])


From the model summary we see that the ROC-AUC is maximized by the model with no dropout. Because the ROC-AUC in this step is computed on the validation data, out-of-sample prediction on the test set may still turn out to be poor, which would suggest that the model overfits the training and validation sets. Dropout adds a degree of randomness by disabling randomly chosen nodes in the network during training, and can often lead to better predictive power overall.
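
As a rough illustration of what dropout does to a layer's output during training, here is a sketch of the common "inverted dropout" formulation in plain Python. It is an assumption that this matches what the Keras layers inside pysster do; the function `dropout` below is written for illustration only.

```python
import random


def dropout(activations, rate, rng):
    """Apply inverted dropout to a list of activations (training-time behaviour)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep_prob = 1.0 - rate
    # Each unit is zeroed with probability `rate`; the survivors are scaled
    # by 1 / keep_prob so the expected value of each activation is unchanged.
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(0)
layer_output = [0.5, 1.2, 0.0, 2.0, 0.8, 1.5]
print(dropout(layer_output, rate=0.5, rng=rng))
print(dropout(layer_output, rate=0.0, rng=rng))  # rate 0 leaves the values unchanged
```

With `rate=0.0` nothing is dropped, which corresponds to the "no dropout" setting discussed above.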

# Evaluation of Model Performance

We first look at the model summary to see how the layers are set up. The dropout layers are still included in the model even though the optimal model was found not to require dropout.

In [None]:
model.model.summary()


In [None]:
test_predictions = model.predict(data,"test")
print("Test Prediction Values")
print(test_predictions)

# Retrieve the labels of the test set
test_labels = data.get_labels("test")
print("Test Labels")
print(test_labels)

In [None]:
utils.plot_roc(test_labels, test_predictions, output_folder+"roc.png")
utils.plot_prec_recall(test_labels, test_predictions, output_folder+"prec.png")
print(utils.get_performance_report(test_labels, test_predictions))

Image(output_folder+"roc.png")

In [None]:
Image(output_folder+"prec.png")
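
For intuition about the ROC-AUC shown in the plot and report, the metric can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way for a single class with binary labels; it is not how `utils.get_performance_report` is implemented, and the labels and scores are made-up toy values.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of positive/negative pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Toy data, not taken from the notebook's test set.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
print(roc_auc(labels, scores))
```

The pairwise loop is quadratic in the number of examples, so it is only meant for small illustrations; a value of 0.5 corresponds to random ranking and 1.0 to perfect separation.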

In [None]:
utils.save_data(data, output_folder+"data.pkl")
utils.save_model(model, output_folder+"model.pkl")

# Resources

I will try to keep the resources used up to date and give credit to the fantastic people who dedicate themselves to teaching others in this section.

## Bioinformatics

Introductory YouTube series for BI: https://www.youtube.com/watch?v=UkSLdj_RRps&index=5&list=PL6yVKsUPBjJYXhGPlD8tAOglqefPBy35x

Book for BI: 'Elementary Sequence Analysis' by Brian Golding, Dick Morton and Wilfried Haerty 
http://helix.mcmaster.ca/3S03_2011.pdf

RNA A to I Editing: https://en.wikipedia.org/wiki/RNA_editing


## Python3

Getting TensorFlow to recognize my GPU in Windows: https://www.pugetsystems.com/labs/hpc/The-Best-Way-to-Install-TensorFlow-with-GPU-Support-on-Windows-10-Without-Installing-CUDA-1187/

It is very important to make sure you install tensorflow-gpu; for some reason Jupyter wouldn't recognize my GPU (RTX 2070) using the suggested version of tensorflow.

## Machine Learning

For background information on Machine Learning: "Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems" by Aurélien Géron. It has been a great resource so far, and I am hoping TensorFlow 2 is included in the new edition.

## Jupyter Notebooks

For visual styling in Jupyter: https://github.com/jupyter/jupyter/wiki/A-gallery-of-interesting-Jupyter-Notebooks

