# Learning Bioinformatics in Python
Author: Dylan Loader

# Introduction

**Disclaimer**: To be completely transparent, I am not a biologist, bioinformatician, or chemist of any kind. The last biology course I took was Introduction to Cell Biology in Winter 2014, in which I received a C. There will be places where I make gross overgeneralizations, and if you find yourself saying:

<img src="correction.png" width="200">

Please contact me and correct me so this tutorial doesn't misguide anyone. No problem is too small to contact me about.


## Motivation

This project will be an open exploration into the world of Bioinformatics (BI) for use in my project for MDSC 401 and STAT 641 at the University of Calgary. MDSC 401 is Introduction to Bioinformatics, which covers an array of topics from sequence alignment to Markov models. STAT 641 is Statistical Learning, and is focused on understanding the fundamentals of machine learning techniques from a statistical background.

I am purposely choosing some fields I am weak in, to force myself to sink or swim while learning them. Here are the topics:

* Bioinformatics
* Python3
* Machine Learning (Specifically CNNs)
* Jupyter Notebooks


At the moment I have little background in any of this, and I hope that struggling through it will be informative, or at least serve as a warning of sorts, to anyone who comes across this.

# Resources

I will try to keep the resources used up to date and give credit to the fantastic people who dedicate themselves to teaching others in this section.

## Bioinformatics

Introductory YouTube series for BI: https://www.youtube.com/watch?v=UkSLdj_RRps&index=5&list=PL6yVKsUPBjJYXhGPlD8tAOglqefPBy35x

Book for BI: 'Elementary Sequence Analysis' by Brian Golding, Dick Morton and Wilfried Haerty 
http://helix.mcmaster.ca/3S03_2011.pdf

RNA A to I Editing: https://en.wikipedia.org/wiki/RNA_editing


## Python3

Getting TensorFlow to recognize my GPU in Windows: https://www.pugetsystems.com/labs/hpc/The-Best-Way-to-Install-TensorFlow-with-GPU-Support-on-Windows-10-Without-Installing-CUDA-1187/

It is very important to make sure you install `tensorflow-gpu`. For some reason, Jupyter wouldn't recognize my GPU (RTX 2070) when I used the suggested version of TensorFlow.

## Machine Learning

For background information on Machine Learning: "Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems" by Aurélien Géron. It has been a great resource so far, and I am hoping TensorFlow 2 is covered in the new edition.

## Jupyter Notebooks

For visual styling in Jupyter: https://github.com/jupyter/jupyter/wiki/A-gallery-of-interesting-Jupyter-Notebooks



# RNA A to I Editing

Primary resource: https://www.youtube.com/channel/UCEPMCywJ6FPZpQ_aPEZt5JA

My project focuses mainly on data from the RNA A to I editing database [REDIportal](http://srv00.recas.ba.infn.it/atlas/).
RNA A to I editing is a process that lets cells use a genome of limited size to generate a greater number of proteins. It has been recorded in both prokaryotic and eukaryotic cells.
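
As a rough illustration of how one edited base can change the resulting protein, here is a minimal sketch. It relies on the fact that inosine (I) is read as guanosine (G) during translation. The helper names (`edit_a_to_i`, `read_as_ribosome`) and the two-entry codon table are made up for this example and are not part of any library used in this project.

```python
# Minimal codon table: only the two codons used in this example.
CODON_TABLE = {
    "CAG": "Gln",  # glutamine
    "CGG": "Arg",  # arginine
}


def edit_a_to_i(rna, position):
    """Return the RNA with the adenosine at `position` (0-based) edited to inosine."""
    if rna[position] != "A":
        raise ValueError(f"Expected 'A' at position {position}, found {rna[position]!r}")
    return rna[:position] + "I" + rna[position + 1:]


def read_as_ribosome(rna):
    """Inosine pairs like guanosine, so translation reads I as G."""
    return rna.replace("I", "G")


unedited = "CAG"
edited = edit_a_to_i(unedited, 1)    # the middle A becomes I
decoded = read_as_ribosome(edited)   # the codon as the ribosome reads it

print(unedited, "->", CODON_TABLE[unedited])
print(edited, "read as", decoded, "->", CODON_TABLE[decoded])
```

The same stretch of genome can therefore yield a glutamine or an arginine at this position, depending on whether the transcript was edited.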

## A crash course in RNA editing

Eukaryotes differ from prokaryotes in that prokaryotes generally do not have intron and exon regions in their RNA. This has to do with many factors, including the limited size of the chromosome in prokaryotes relative to that of eukaryotes.
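
To make the intron/exon distinction concrete, here is a small sketch of what removing introns looks like at the string level. The transcript and the exon coordinates are invented for illustration (uppercase marks exons, lowercase marks introns), and `splice` is a hypothetical helper, not a function from any library.

```python
def splice(pre_mrna, exons):
    """Join the exon regions and drop everything in between (the introns).

    `exons` is a list of (start, end) pairs, 0-based with `end` exclusive.
    """
    return "".join(pre_mrna[start:end] for start, end in exons)


# Hypothetical transcript: uppercase = exon, lowercase = intron.
pre_mrna = "AUGGCC" + "guaaguuuucag" + "UUUAAA" + "guaugcccuag" + "UGA"
exons = [(0, 6), (18, 24), (35, 38)]

mature = splice(pre_mrna, exons)
print(mature)
```

A typical prokaryotic transcript would need no such step, since it has no introns to remove.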


# Project Python Code

In [1]:
```python
# Import statements
import os
from time import time
from IPython.display import Image
from pysster.Data import Data
from pysster.Grid_Search import Grid_Search
from pysster import utils

# Folder that will hold the output
output_folder = "pysster_output/"

# Create the output folder if it does not exist yet
os.makedirs(output_folder, exist_ok=True)

# Make sure tensorflow is installed and that our GPU is accessible
import tensorflow as tf
print("TensorFlow version: " + tf.__version__)
print("Current GPU used: " + tf.test.gpu_device_name())
# This should return something like
# TensorFlow version: 1.12.0
# Current GPU used: /device:GPU:0
# If the GPU line comes back empty, the Jupyter notebook isn't recognizing your GPU.
```

```
Using TensorFlow backend.
TensorFlow version: 1.12.0
Current GPU used: /device:GPU:0
```


In [4]:
```python
# Load the RNA A to I editing datasets.
# Sequences use the ACGU alphabet for RNA. HIMS annotates secondary structure
# (H = hairpin loop, I = internal loop, M = multi loop, S = stem), not proteins.
data = Data(["data/alu.fa.gz",
             "data/rep.fa.gz",
             "data/nonrep.fa.gz"], ("ACGU", "HIMS"))

print(data.get_summary())
```

```
              class_0    class_1    class_2
all data:       50000      50000      50000
training:       34931      34978      35091
validation:      7510       7536       7454
test:            7559       7486       7455
```
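
A CNN needs numeric input, so each position in a sequence has to be turned into a vector at some point. The sketch below shows one plausible way to do that for paired sequence and structure strings: treat every (nucleotide, structure) pair as one symbol and one-hot encode it. This is an assumption for illustration only. The notebook does not show how pysster encodes data internally, and `one_hot`, `COMBINED`, and `INDEX` are hypothetical names.

```python
from itertools import product

SEQ_ALPHABET = "ACGU"
STRUCT_ALPHABET = "HIMS"

# One symbol per (nucleotide, structure) pair: 4 x 4 = 16 combined symbols.
COMBINED = [n + s for n, s in product(SEQ_ALPHABET, STRUCT_ALPHABET)]
INDEX = {pair: i for i, pair in enumerate(COMBINED)}


def one_hot(sequence, structure):
    """Encode aligned sequence/structure strings as a list of one-hot rows."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    rows = []
    for nucleotide, annotation in zip(sequence, structure):
        row = [0] * len(COMBINED)
        row[INDEX[nucleotide + annotation]] = 1
        rows.append(row)
    return rows


encoded = one_hot("ACGU", "SSHH")
print(len(encoded), "positions x", len(encoded[0]), "channels")
```

Each row contains exactly one 1, so a sequence of length n becomes an n x 16 matrix that a convolutional layer can slide over.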


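In [None]:
```python
# Re-split the data into 70% training and 15% validation (the rest is the test set),
# using a fixed seed so the split is reproducible
data.train_val_test_split(portion_train=0.7, portion_val=0.15, seed=1775)
print(data.get_summary())
```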
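
To show what a seeded train/validation/test split does in general, here is a minimal pure-Python sketch. It is not pysster's implementation, which may assign samples differently (the counts in the summary above are close to, but not exactly, 70/15/15). The function name `split_items` is hypothetical.

```python
import random


def split_items(items, portion_train, portion_val, seed=None):
    """Shuffle `items` reproducibly and cut them into train/val/test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # same seed gives the same order
    n_train = round(len(shuffled) * portion_train)
    n_val = round(len(shuffled) * portion_val)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # whatever is left over
    return train, val, test


train, val, test = split_items(range(1000), 0.7, 0.15, seed=1775)
print(len(train), len(val), len(test))
```

Fixing the seed matters when comparing models: every run then sees the same training, validation, and test samples, so differences in performance come from the model and not from a lucky split.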