# Bioinformatics Platform using JupyterLab

Author: Robert Bradford

This is a thesis project that uses JupyterLab to support students taking the BIOC-4010 course. This tutorial implements the following basic modules using Biopython and NGLView:
1. Databases and sequencing file formats
2. Translating DNA sequences into a protein sequence
3. Calculating DNA GC content
4. Pairwise sequence alignments using the Needleman–Wunsch and Smith–Waterman algorithms
5. Substitution matrices
6. BLAST
7. Display a 3D structure

Requirements:

To run this notebook successfully, it is recommended to use Miniconda + JupyterLab and install the required packages and extensions. The notebook should also work on [Google Colab](https://colab.research.google.com/) with the modifications described below, or on [Binder](https://jupyter.org/binder), although Binder has not been tested.

The following packages are required and can be installed using conda. It is recommended to create a new environment and install these packages. You can use the [nglview-jupyterlab.sh script](https://github.com/nglviewer/nglview/blob/master/devtools/nglview-jupyterlab.sh) to install the nglview related packages.
* python 3.8+
* jupyterlab 2.1+
* biopython 1.7+
* ipywidgets 7.5+
* nodejs 12.0.0+, required for the jupyter-labextensions
* nglview 2.7+
    * if you do not use the nglview-jupyterlab.sh script, run the following two commands manually after you install jupyterlab
    * `jupyter-labextension install @jupyter-widgets/jupyterlab-manager`
    * `jupyter-labextension install nglview-js-widgets@$nglviewversion` where `$nglviewversion` is the version of the `nglview` installed package, which can be inspected with `conda list nglview`.
    

This notebook has been tested on:
1. miniconda3
2. Windows command line (NGLView-generated 3D structures do not load)
3. Google Colab

## If Running on Google Colab:
This notebook can be run on Google Colab with a few modifications.
Google Colab is based on the Jupyter environment, so the changes needed are minimal.
To run the notebook on Colab, make the following modifications:

- This notebook uses four files to familiarize you with common file types and bioinformatics tools. If you are using a locally hosted Jupyter session, these can be stored in the same folder as this notebook.
    - However, if you wish to run this on Colab, these files must be stored differently.
    - Download these four files:
        - The ls_orchid.fasta file found [here](https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/ls_orchid.fasta)
        - The ls_orchid.gbk file found [here](https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/ls_orchid.gbk)
        - The alpha hemoglobin fasta file found [here](https://www.uniprot.org/uniprot/P69905.fasta)
        - The beta hemoglobin fasta file found [here](https://www.uniprot.org/uniprot/P68871.fasta)
    - Then run the following code (note: to make the code work, remove the "#" in front of each line)

In [None]:
#from google.colab import files
#uploaded = files.upload()
#for fn in uploaded.keys():
#    print('User uploaded file "{name}" with length {length} bytes'.format(
#        name=fn, length=len(uploaded[fn])))

This code stores the uploaded files in a dictionary named "uploaded", keyed by file name. Each value holds the raw bytes of the corresponding file.

To use one of these files in a section of the notebook, decode its bytes and wrap the resulting text in a file-like handle before passing it to Biopython (this requires "import io").


Here is an example:

For a file call such as:
for record in SeqIO.parse("ls_orchid.fasta", "fasta"):

replace "ls_orchid.fasta" with io.StringIO(uploaded['ls_orchid.fasta'].decode())

so that it looks like:
for record in SeqIO.parse(io.StringIO(uploaded['ls_orchid.fasta'].decode()), "fasta"):

Do this throughout, and the notebook should be compatible with Google Colab.

### Note Regarding Google Colab and NGLView
Google has not added ipywidgets compatibility to Google Colab.
As a result, NGLView cannot be run on Google Colab.
The rest of the notebook should run, however.

## Setup environment

In [None]:
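import os
import sys
import platform

# Install the third-party packages (the "!" prefix runs a shell command from the notebook)
!pip install nglview
!pip install ipywidgets
!pip install biopython

import nglview as nv
import ipywidgets as widgets

import Bio
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.Data import CodonTable
from Bio import pairwise2
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML
from Bio.PDB import PDBList, MMCIFParser

# GC() is missing from newer Biopython releases, which offer gc_fraction() instead.
# Fall back to a small wrapper so that the rest of the notebook can keep calling GC().
try:
    from Bio.SeqUtils import GC
except ImportError:
    from Bio.SeqUtils import gc_fraction

    def GC(seq):
        return 100 * gc_fraction(seq)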

print("Python version",sys.version_info)
print("Biopython version", Bio.__version__)

## Chapter 1 Databases and File Formats
Bioinformatics typically involves dealing with vast amounts of biological data. This data ranges from DNA and protein sequences to protein structures and all their associated annotations. This information is stored in various file formats that can be read and manipulated using Biopython. Each file format has unique characteristics and holds information that a biologist might want to manipulate. To access and manipulate this data, one must understand these file formats and know where to obtain the data.

### Databases
Biological data, ranging from gene sequences to protein sequences, is stored in databases.

Beginning with DNA, there are **three** main databases commonly used when dealing with nucleotide sequences: 

1. GenBank, provided by the National Center for Biotechnology Information (NCBI)
2. European Molecular Biology Laboratory (EMBL)-Bank, provided by the European Bioinformatics Institute (EBI)
3. DNA Database of Japan (DDBJ), provided by the National Institute of Genetics in Mishima

These databases are coordinated by the International Nucleotide Sequence Database Collaboration (INSDC). Each of them is only one of the many resources and databases of interest provided by the NCBI, EBI, and DDBJ.

### File Formats
As mentioned, databases store a wide range of relevant biological information. Some of this information may include large DNA sequences, mRNA products, or protein sequences. These sequences alone are fairly simple linear arrangements of nucleotides and amino acids. However, the context around these sequences, such as the name of the organism, chromosome, gene, or intron they are taken from, is also important.

To keep this information organized and understandable to a biologist, several standardized file formats have been developed, a list of which can be found [here](https://www.algosome.com/articles/bioinformatics-sequence-file-formats.html). Some of these include:

* GenBank, which can store a wide variety of sequence information
* EMBL, similar to GenBank, but adapted for the EBI database
* ABI, which stores sequence information in binary; usually only used in special cases
* PDB, the Protein Data Bank format, used to store sequence information and the protein's 3D structure, typically determined through crystallography
* MDL, which mostly stores information about smaller molecular structures; very similar to PDB
* SAM/BAM, which store next-generation sequencing data; SAM is a plain-text format and BAM is its binary equivalent
* SFF, which also stores next-generation sequencing information, specifically from Ion Torrent and Roche's 454 platforms

Each database uses its own format, but the formats share many similarities. This document covers some of the file types most commonly encountered by bioinformatics students.

#### FASTA File Format
FASTA is one of the simplest sequence data formats. A FASTA record usually contains some identifying information for the sequence of interest in a header, followed by the actual DNA or protein sequence. An official description of the file format can be found [here](https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=BlastHelp).

##### FASTA files look something like this:

FASTA files are quite simple to read:
* Generally, a FASTA file begins with an information line, denoted by a ">" symbol.
* The sequence belonging to that information line follows.
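
As a minimal illustration of this layout, the sketch below parses FASTA-formatted text by hand using only the Python standard library. The two records and the `parse_fasta` helper are made up for this example; in the rest of the notebook, Biopython's `SeqIO` does this work.

```python
# Two made-up FASTA records: a ">" header line followed by one or more sequence lines
fasta_text = """>seq1 hypothetical example record
ATGGCCATTGTAATGGGCCGC
TGAAAGGGTGCCCGATAG
>seq2 another hypothetical record
GGATCCTTAA
"""


def parse_fasta(text):
    """Return a dict mapping each record ID to its full sequence."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The ID is the first word after ">"; the rest of the line is a description
            current_id = line[1:].split()[0]
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines are joined together until the next header appears
            records[current_id] += line
    return records


records = parse_fasta(fasta_text)
for record_id, sequence in records.items():
    print(record_id, len(sequence), sequence)
```

Note that a sequence may be wrapped over several lines; the line breaks carry no meaning and are removed when the record is read.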


#### GenBank File Format
GenBank, EMBL, and DDBJ offer a wide range of molecular sequence data. We will use GenBank to demonstrate how this sequence information is organized and how to access it. GenBank mainly deals with genomic DNA sequences, their theoretically transcribed pre-mRNA and mRNA products, and their translated protein sequences. GenBank provides this and more information about specific protein sequences through the GenBank file format.

GenBank uses a standardized file format to hold the pertinent information for any given DNA sequence. The GenBank format keeps information about a particular gene or protein consistent across platforms and makes it easier to access. An official description of each element in the format can be found [here](https://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html) on the NCBI's website.

##### GenBank files look something like this:

GenBank files appear vastly more complicated than FASTA files at first glance. However, FASTA files and GenBank files function quite similarly:
* Instead of a simple identification line denoted by ">" as in a FASTA file, GenBank files can contain a vast amount of initial information about the gene of interest at the start of the file.
* GenBank files can contain many features related to specific sequences within the file.
* Like FASTA files, GenBank files also contain the sequence information after the initial header.
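
To make this layout concrete, the sketch below builds a small record in memory and prints it in GenBank format with Biopython. The sequence, the ID `DEMO0001`, and the `demo` gene feature are invented for illustration, and the `molecule_type` annotation is set because recent Biopython releases need it to write GenBank output.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord

# A made-up 39-nucleotide sequence with a hypothetical ID and description
sequence = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
record = SeqRecord(
    sequence,
    id="DEMO0001",
    name="DEMO0001",
    description="Hypothetical demo record",
)

# Recent Biopython releases require the molecule type when writing GenBank files
record.annotations["molecule_type"] = "DNA"

# Features annotate regions of the sequence; this one spans the whole record
record.features.append(
    SeqFeature(
        FeatureLocation(0, len(sequence)),
        type="CDS",
        qualifiers={"gene": ["demo"]},
    )
)

genbank_text = record.format("genbank")
print(genbank_text)
```

The printed text starts with the header block (the LOCUS line and its neighbours), continues with the FEATURES table, and ends with the sequence itself under ORIGIN.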

#### Using data from FASTA and GenBank files

Both GenBank and FASTA files store raw sequence data as well as some identifiers specific to those sequences. 
Biopython allows users to make novel FASTA and GenBank files with sequence data they'd like to store, and also allows users to pull data from existing files.
This chapter uses the example FASTA file from the Biopython tutorial, available [here](https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/ls_orchid.fasta), to show how to retrieve data from such files. Download the file, save it as "ls_orchid.fasta", and put it in the same location as this notebook.

Using biopython, one can then extract sequence data from that file:

In [None]:
# This script uses "parse" to loop over every entry in the FASTA file (each entry starts with ">")
# and stores each entry in turn in "record"
for record in SeqIO.parse("ls_orchid.fasta", "fasta"):

    # then prints the record ID, the sequence in that record, and the length of the sequence
    print(record.id)
    print(repr(record.seq))
    print(len(record))

**As you can see above**, there are 94 individual records in the example FASTA file, and each is displayed with its ID, sequence, and length.

Using Biopython, we can do the same thing with **GenBank files**, despite their extra layer of apparent complexity compared to FASTA files. Simply download the example GenBank file available in the Biopython tutorial [here](https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/ls_orchid.gbk) and save it as "ls_orchid.gbk". Make sure to put it in the same location as this notebook.

Using code similar to that used above for the FASTA file, we can read the GenBank file:

In [None]:
# This script uses "parse" to loop over every entry in the GenBank file (each entry starts with a "LOCUS" line)
# and stores each entry in turn in "record"
# (note: if your text editor saved the file with an extra ".txt" extension, use "ls_orchid.gbk.txt" as the file name instead)
for record in SeqIO.parse("ls_orchid.gbk", "genbank"):

    # then prints the record ID, the sequence in that record, and the length of the sequence
    print(record.id)
    print(repr(record.seq))
    print(len(record))

### Conclusion for databases and file formats

Overall, bioinformatics data is stored in large, publicly available databases. These databases contain thousands of files that hold valuable biological data, and each database may specialize in different types of data and information. Each of these databases uses standardized file formats that bioinformatics software can access remotely and read, making the data usable by bioinformatics scientists.

## Chapter 2 Translating DNA sequences into protein sequences

After learning about the contents of sequence databases such as GenBank and how their file formats are arranged, a biologist will want to pull data from these files and databases and manipulate it.

### The raw genetic code
Using the previously used FASTA file, we can pull a DNA sequence from a record:

In [None]:
#to get only the first record, wrap the parser in "next"
#this stores that record in the variable "first_record"
first_record = next(SeqIO.parse("ls_orchid.fasta", "fasta"))

#then print the sequence stored in the record using the ".seq" attribute
print(first_record.seq)

The above example extracts the gene sequence from a FASTA record. With this information, a biologist can do more than simply read out genetic sequences. Biopython offers ways of manipulating this information in useful ways.

### Transcription
One way is to take a raw genetic sequence, such as the example above, and convert it into an RNA sequence that can later be translated into a protein. In essence, Biopython lets a biologist perform transcription and translation very conveniently.

Converting DNA into RNA is simple, and can be done like so:

In [None]:
#to get only the first record, wrap the parser in "next"
#this stores that record in the variable "first_record"
first_record = next(SeqIO.parse("ls_orchid.fasta", "fasta"))

#then store the sequence from that record in the variable "gene_sequence",
#gene_sequence holds the raw "GCTA" genetic code of the gene
gene_sequence = first_record.seq

#then use the Biopython method ".transcribe()" to transcribe the sequence into RNA,
#store the result in the variable "mRNA_sequence", and display it:
mRNA_sequence = gene_sequence.transcribe()
print(mRNA_sequence)

You'll notice that the .transcribe() method replaces every thymine (T) with uracil (U), effectively turning the DNA sequence into mRNA.
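
The short sketch below shows the same idea on a made-up coding strand that needs no input file. It compares `.transcribe()` with a plain string replacement and then reverses the operation with `.back_transcribe()`; the sequence is an illustrative assumption, not data from the orchid file.

```python
from Bio.Seq import Seq

# A made-up coding-strand DNA sequence
coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

# transcribe() swaps every T for a U
messenger_rna = coding_dna.transcribe()
print(messenger_rna)

# The same result with plain string handling
manual_rna = str(coding_dna).replace("T", "U")
print(manual_rna == str(messenger_rna))

# back_transcribe() reverses the operation and recovers the DNA sequence
recovered_dna = messenger_rna.back_transcribe()
print(str(recovered_dna) == str(coding_dna))
```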

### Translation
Next comes translating the sequence. A few tools make working with translation relatively simple. The NCBI provides codon tables, which map each three-letter codon to an amino acid. Several different tables are offered, because some organisms use different genetic codes.

Tables such as these:

In [None]:
#fetch the codon table denoted as the standard table by the NCBI
#store the table in the "standard_code" variable, and display it
standard_code = CodonTable.unambiguous_dna_by_name["Standard"]
print(standard_code)

Using tables such as these, one can then use commands that can translate RNA sequences.
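
As a small sketch of how such a table can be queried directly, the example below looks up individual codons in the standard table. It only reads attributes of Biopython's `CodonTable` object; the choice of codons is arbitrary.

```python
from Bio.Data import CodonTable

standard_code = CodonTable.unambiguous_dna_by_name["Standard"]

# forward_table maps each coding codon to its one-letter amino acid
print("ATG codes for", standard_code.forward_table["ATG"])
print("GCT codes for", standard_code.forward_table["GCT"])

# Start and stop codons are kept in separate lists
print("Start codons:", sorted(standard_code.start_codons))
print("Stop codons:", sorted(standard_code.stop_codons))

# Stop codons do not code for an amino acid, so they are absent from forward_table
print("TAA" in standard_code.forward_table)
```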

The code to perform a standard translation of the previously used gene from above is as follows:

In [None]:
#to get only the first record, wrap the parser in "next"
#this stores that record in the variable "first_record"
first_record = next(SeqIO.parse("ls_orchid.fasta", "fasta"))

#then store the sequence from that record in the variable "gene_sequence",
#gene_sequence holds the raw "GCTA" genetic code of the gene
gene_sequence = first_record.seq

#then use the Biopython method ".transcribe()" to transcribe the sequence into RNA
#and store the result in the variable "RNA_sequence"
RNA_sequence = gene_sequence.transcribe()

#then use the Biopython method ".translate()"; to keep track of the genetic code used, specify the standard table (table 1)
pro_sequence = RNA_sequence.translate(table=1)
print(pro_sequence)

However, you should **get a warning** when you translate this sequence. Biopython notes that the length of the chosen sequence is not a multiple of three, so the trailing nucleotides do not form a complete codon.

Despite the warning, it still returns a translated sequence for the input RNA.
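
One way to avoid the partial-codon warning is to trim the sequence to a whole number of codons before translating. The sketch below does this on a made-up 41-nucleotide RNA; trimming from the 3' end is an assumption that only makes sense when the reading frame starts at the first nucleotide.

```python
from Bio.Seq import Seq

# A made-up RNA sequence of 41 nucleotides (two more than a multiple of three)
rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAGAU")

# Drop the trailing nucleotides that do not form a complete codon
usable_length = len(rna) - len(rna) % 3
trimmed = rna[:usable_length]

# Translate with the standard table; "*" marks a stop codon
protein = trimmed.translate(table=1)
print(protein)

# to_stop=True ends the translation at the first stop codon instead
protein_to_stop = trimmed.translate(table=1, to_stop=True)
print(protein_to_stop)
```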

## Chapter 3 Calculating Guanine-Cytosine Content

As seen in previous chapters, Biopython can be used to obtain key information from a sequence and to manipulate it. Aside from obtaining the translated form of a sequence, a biologist might also want to know the composition of that sequence.

As an example of what a biologist might want to extract from a sequence, this chapter calculates the percentage of guanine and cytosine in a sequence. We'll start by taking the first sequence from the FASTA file available [here](https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/ls_orchid.fasta) and storing it in a variable we can work with (as done in the previous chapter).

In [None]:
#to get only the first record, wrap the parser in "next"
#this stores that record in the variable "first_record"
first_record = next(SeqIO.parse("ls_orchid.fasta", "fasta"))

#then print the sequence stored in the record using the ".seq" attribute
print(first_record.seq)

However, instead of transcribing or translating the sequence, we will extract the GC content.
There are many ways to write code that gives you the GC content, and some of them can be tediously basic, as seen below:

### The long way

In [None]:
#Convert the first_record sequence into a string
first_sequence = str(first_record.seq)

#then count the Gs and Cs and store the counts as integers
#also get the length of the sequence
sequence_length = len(first_sequence)
G_count = first_sequence.count("G")
C_count = first_sequence.count("C")

#then store the calculated %G, %C, and %GC as floating-point numbers
G_content = (G_count/sequence_length)*100
C_content = (C_count/sequence_length)*100
GC_content = ((G_count + C_count)/sequence_length)*100

#then print a clean statement reporting these figures (rounding the numbers and converting them to strings)
print("The guanine content is " + str(round(G_content, 1)) + "%, the cytosine content is " + str(round(C_content, 1)) + "%, and the guanine and cytosine content is " + str(round(GC_content, 1)) + "%.")


But this is rather tedious, and Biopython offers a better method that takes only a few lines of code:

### The short way

In [None]:
#At the start of the notebook we imported several tools
#one of these came from the Bio.SeqUtils module: the "GC" function, which we use here
GC_content = GC(first_record.seq)

#then report the content
print("The guanine/cytosine count for this sequence is: " + str(round(GC_content, 1)) + "%.")

As you can see, this method is far quicker. The longer example is still worth keeping in mind, however, should you wish to extract other information from the sequence in question, such as the number of adenines or its total composition.

This also works for proteins: you can single out specific amino acids in a sequence and work out its composition.
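
A minimal sketch of this more general approach is shown below, using only the Python standard library. The `composition` helper and both example sequences are made up for illustration; the same function works for DNA and protein strings.

```python
from collections import Counter


def composition(sequence):
    """Return the percentage of each letter in a DNA or protein sequence."""
    counts = Counter(sequence.upper())
    total = sum(counts.values())
    return {letter: 100 * count / total for letter, count in counts.items()}


# Made-up DNA sequence
dna = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
dna_percent = composition(dna)
for base in "ACGT":
    print(base, round(dna_percent.get(base, 0), 1))

# GC content is the sum of the G and C percentages
gc_percent = dna_percent.get("G", 0) + dna_percent.get("C", 0)
print("GC content:", round(gc_percent, 1))

# The same helper applied to a made-up protein sequence
protein_percent = composition("MAIVMGRKGAR")
print("Glycine:", round(protein_percent.get("G", 0), 1))
```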

## Chapter 4 Pairwise Sequence Alignments Using the Needleman–Wunsch and Smith–Waterman Algorithms
One of the fundamental operations in bioinformatics is comparing two sequences, either nucleic acids or proteins, and lining them up to achieve the maximal level of identity. This is called pairwise sequence alignment.
Biopython offers a variety of tools for comparing two sequences to one another. This chapter covers pairwise sequence alignments through two different algorithms: the Needleman–Wunsch and the Smith–Waterman algorithms.

Before covering the differences between these two algorithms, it is worth explaining what pairwise sequence alignments are. A basic pairwise alignment takes two input sequences and compares them, with the goal of finding regions where they align due to similarity.

 - There are global alignments, which try to align the full length of the two sequences as closely as possible. The Needleman–Wunsch algorithm is designed for this type of scenario, in which one entire sequence is aligned to the other.

 - There are also local alignments, in which only regions of the sequences are aligned for similarity. This method is typically used to align similar regions of two protein sequences, but not necessarily the entire sequences. The Smith–Waterman algorithm is an example of such a method: it finds the regions of highest similarity within the two input sequences.

Biopython can do both: it provides pairwise sequence alignment with both a global and a local alignment algorithm.
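
To illustrate the difference, the sketch below computes the best global and local alignment scores with a simple dynamic-programming table. It is a simplified teaching version with a linear gap penalty, not the implementation used by Biopython; the `alignment_score` function, its scoring values, and the example peptides are assumptions made for this example.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Best alignment score of a and b (global by default, local if local=True)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    if not local:
        # Global alignment: leading gaps are penalized, so the borders accumulate
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap

    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            cell = max(diagonal, up, left)
            if local:
                # Local alignment: a negative running score restarts at zero
                cell = max(cell, 0)
            score[i][j] = cell
            best = max(best, cell)

    # Global: score of aligning both full sequences; local: best cell anywhere
    return best if local else score[-1][-1]


long_peptide = "HGSAQVKGHGKK"   # made-up example peptide
short_peptide = "AQVKGH"        # fragment contained in the longer peptide

global_score = alignment_score(long_peptide, short_peptide)
local_score = alignment_score(long_peptide, short_peptide, local=True)
print("global score:", global_score)
print("local score:", local_score)
```

The global score is pulled down by the gaps needed to span the full length of the longer peptide, while the local score only reflects the matching region.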

### Global Alignments
As an example of a global pairwise alignment, the Biopython code below compares two protein sequences. Typically, when dealing with global alignments, one should compare two sequences of similar length.
In this case, we shall use the sequences of alpha and beta hemoglobin, which can be found [here](https://www.uniprot.org/uniprot/P69905.fasta) for alpha hemoglobin, and [here](https://www.uniprot.org/uniprot/P68871.fasta) for beta. Save these files as "alpha.fasta" and "beta.fasta" in the same location as this notebook.


In [None]:
#first load two example protein sequences
#the files should be named "alpha.fasta" and "beta.fasta" respectively
alpha = SeqIO.read("alpha.fasta", "fasta")
beta = SeqIO.read("beta.fasta", "fasta")

#then pass these sequences to the aligner
#note: pairwise2 performs global (Needleman–Wunsch style) and local (Smith–Waterman style) alignments
#the "xx" suffix means that a match scores 1, with no mismatch or gap penalties
alignments = pairwise2.align.globalxx(alpha.seq, beta.seq)

#then print a text display of the alignment
print(pairwise2.format_alignment(*alignments[0]))

Note that it also displays the score of the aligned sequence, as per the Needleman–Wunsch algorithm.

### Local Alignments
As an example of a local pairwise alignment, the Biopython code below compares a full-length protein sequence with a single short snippet of a protein sequence. Local alignments can be used not only to compare full-length sequences, but also to locate small regions within a protein sequence of interest.
In this case, we shall use the sequence of alpha hemoglobin again, which can be found [here](https://www.uniprot.org/uniprot/P69905.fasta), and the short snippet is defined in the code itself.

In [None]:
#first load the protein sequence and define the short snippet
alpha = SeqIO.read("alpha.fasta", "fasta")
snipit = "AQVKGH"

#next pass these two sequences to the aligner
#note: pairwise2 performs global (Needleman–Wunsch style) and local (Smith–Waterman style) alignments
#the "xx" suffix means that a match scores 1, with no mismatch or gap penalties
alignments = pairwise2.align.localxx(alpha.seq, snipit)

#then print a text display of the alignment
print(pairwise2.format_alignment(*alignments[0]))

Note that it also displays the score of the aligned sequence, as per the Smith–Waterman algorithm.
However, in this case you can see that the result is not the most accurate. This is because "localxx" applies no gap penalty, so gaps can be inserted at no cost.
Other scoring schemes penalize gaps more heavily, which gives alignments that match more faithfully.

This brings us to substitution matrices:

## Chapter 5 Substitution Matrices
One of the fundamental operations in bioinformatics is comparing two sequences, either nucleic acids or proteins, and lining them up to achieve the maximal level of identity. This is called _**pairwise sequence alignment**_. Quantifying the similarity of the two sequences is important for establishing whether they are homologs. Two widely used methods have been developed to quantify this similarity.

- Method 1: align closely related homologs and count the frequencies of amino acid substitutions.
- Method 2: use a database of aligned sequences derived from protein domains that have a particular structure or function. The frequencies of amino acid substitutions are recorded.

These two methods gave rise to the *PAM* and *BLOSUM* series of amino acid substitution matrices, respectively. These substitution matrices are used in many sequence alignment tools.

The *Biopython* package includes these substitution matrices.

In [None]:
# this scriptlet displays the BLOSUM62 and PAM250 matrices
from Bio.Align import substitution_matrices as smatrices
blosum62 = smatrices.load("BLOSUM62")
pam250 = smatrices.load("PAM250")
print(blosum62)
print("-"*80)
print(pam250)
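
Individual scores can also be read from a matrix by indexing it with a pair of amino acid letters. The sketch below looks up a few BLOSUM62 entries; the residue pairs are arbitrary choices for illustration.

```python
from Bio.Align import substitution_matrices

blosum62 = substitution_matrices.load("BLOSUM62")

# Identical residues score positively; rare residues such as tryptophan score highest
print("W vs W:", blosum62["W", "W"])

# Chemically similar residues (leucine and isoleucine) also score positively
print("L vs I:", blosum62["L", "I"])

# Dissimilar residues (tryptophan and glycine) are penalized
print("W vs G:", blosum62["W", "G"])

# The matrix is symmetric: the order of the two residues does not matter
print(blosum62["L", "I"] == blosum62["I", "L"])
```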

### Using Substitution Matrices
By combining the pairwise alignment tools above with the substitution matrices offered by Biopython, one can make global and local sequence alignments more accurate and better suited to a specific need.

Using the previously used pairwise global alignment code, we can now specify which substitution matrix we want the alignment algorithm to use:

In [None]:
#load the substitution matrix needed for the alignment
blosum62 = smatrices.load("BLOSUM62")

#first load two example protein sequences
#the files should be named "alpha.fasta" and "beta.fasta" respectively
alpha = SeqIO.read("alpha.fasta", "fasta")
beta = SeqIO.read("beta.fasta", "fasta")

#then pass these sequences to the aligner
#the "ds" suffix means that match scores come from a substitution matrix (here BLOSUM62)
#and that the same gap penalties apply to both sequences: -10 to open a gap, -0.5 to extend it
alignments = pairwise2.align.globalds(alpha.seq, beta.seq, blosum62, -10, -0.5)

#then print a text display of the alignment
print(pairwise2.format_alignment(*alignments[0]))

Note that this gives an alignment according to the BLOSUM62 matrix.

This also works for local alignments:

In [None]:
#load the substitution matrix needed for the alignment
blosum62 = smatrices.load("BLOSUM62")

#first load the protein sequence and define the short snippet
alpha = SeqIO.read("alpha.fasta", "fasta")
snipit = "AQVKGH"

#then pass these sequences to the aligner
#the "ds" suffix means that match scores come from a substitution matrix (here BLOSUM62)
#and that the same gap penalties apply to both sequences: -10 to open a gap, -0.5 to extend it
alignments = pairwise2.align.localds(alpha.seq, snipit, blosum62, -10, -0.5)

#then print a text display of the alignment
print(pairwise2.format_alignment(*alignments[0]))

Using this matrix, unlike the plain local alignment above, greatly improves the accuracy of the alignment. This is largely a result of penalizing gaps within the sequences of interest.
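
To see why these penalties discourage gaps, the sketch below works out the cost of gaps of different lengths. It assumes a common affine scheme in which the first gapped position costs the open penalty and each further position costs the extension penalty; tools differ on whether the first position is also charged the extension penalty, so treat the numbers as illustrative. The `affine_gap_penalty` helper is made up for this example.

```python
def affine_gap_penalty(gap_length, gap_open=-10, gap_extend=-0.5):
    """Penalty for one gap of the given length under the assumed affine scheme."""
    if gap_length <= 0:
        return 0
    # The first position pays the open penalty, each further one the extension penalty
    return gap_open + (gap_length - 1) * gap_extend


for length in (1, 2, 4, 8):
    print("gap of length", length, "costs", affine_gap_penalty(length))

# One long gap is far cheaper than several separate short gaps of the same total length
one_long_gap = affine_gap_penalty(4)
four_short_gaps = 4 * affine_gap_penalty(1)
print(one_long_gap, "versus", four_short_gaps)
```

Because opening a gap costs about as much as several matches are worth in BLOSUM62, the aligner only introduces a gap when it is outweighed by the matches that follow.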

## Chapter 6 BLAST
BLAST is an acronym for Basic Local Alignment Search Tool. BLAST allows a user to search the NCBI's databases for sequences matching an input DNA, RNA, or protein sequence of interest.
BLAST offers multiple tools for searching with each of these types of primary sequence data:
- BLASTn, for comparing a nucleotide sequence to a nucleotide database
- BLASTp, for comparing a protein sequence to a protein database
- BLASTx, for comparing the potential translation products of a nucleotide sequence to a protein database
- tBLASTn, for comparing a protein sequence to a translated nucleotide database
- tBLASTx, for comparing a translated nucleotide sequence to a translated nucleotide database

These tools might be of use to a biologist who is trying to figure out the function of a novel gene, or who wishes to find genes distantly related to the sequence of interest. They can also serve to gather more information about the purpose that sequence likely serves.

### How to do a BLAST search
To compare your sequence of interest to the NCBI database through BLAST, first you'll need the sequence of interest as a string (plain letters), a fasta file, or the sequence identifier.
In this case we will be performing a BLASTp search using the previously used alpha hemoglobin fasta file found [here](https://www.uniprot.org/uniprot/P69905.fasta).

In [None]:
#First get the alpha hemoglobin file and store it in "record"
record = SeqIO.read("alpha.fasta", "fasta")

#Then perform the search on the NCBI database through blast with the sequence of interest
blast_results = NCBIWWW.qblast("blastp", "nr", record.seq)

#then read out the results of the search
blast_records = NCBIXML.parse(blast_results)
blast_record = next(blast_records)

Searches on BLAST and the NCBI might take a while, so remember to be patient. Hemoglobin proteins have been thoroughly studied and, as a result, will have a lot of matching entries.

However, once the search results return, they contain very useful information. Interpreting this information is essential to understanding the utility of BLAST searches.

### The E value and BLAST results
Of note when conducting BLAST searches is the E value (expect value). In the simplest sense, the E value tells you how many hits with a score at least as good as the one observed would be expected by chance in a database of that size. To reiterate, BLAST searches the NCBI's databases to find likely matches to the input sequence. The E value grows with the size of the search space and decreases exponentially as the score of the match increases.

Generally, the lower the E value, the more significant the match.
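
The exponential relationship can be sketched with the relation commonly given in BLAST documentation, E = m × n × 2^(−S'), where m is the query length, n is the database length, and S' is the bit score. The example below is a simplified illustration: the lengths are hypothetical, and BLAST itself uses corrected "effective" lengths, so the numbers will not match a real search exactly.

```python
import math


def expect_value(query_length, database_length, bit_score):
    """Approximate E value from the search-space size and a bit score."""
    return query_length * database_length * 2 ** (-bit_score)


query_length = 142            # hypothetical query length in residues
database_length = 50_000_000  # hypothetical database size in residues

for bit_score in (20, 30, 40, 50):
    e_value = expect_value(query_length, database_length, bit_score)
    print(f"bit score {bit_score}: E = {e_value:.3g}")

# Each extra bit of score halves the expected number of chance hits
ratio = expect_value(query_length, database_length, 30) / expect_value(
    query_length, database_length, 31
)
print(math.isclose(ratio, 2.0))
```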

In [None]:
#set a maximum E value so as to screen out any background noise in the search results
E_VALUE_THRESH = 0.04

#then cycle through the various results obtained in the search, and print information pertaining to them:
for alignment in blast_record.alignments:
    for hsp in alignment.hsps:
        if hsp.expect < E_VALUE_THRESH:
            print("****Alignment****")
            print("sequence:", alignment.title)
            print("length:", alignment.length)
            print("e value:", hsp.expect)
            print(hsp.query[0:75] + "...")
            print(hsp.match[0:75] + "...")
            print(hsp.sbjct[0:75] + "...")

Once the search is complete, it should return a wealth of information. From this information, you should be able to gather the E value for each matching entry in the database. As hemoglobin is a short protein sequence, and has been thoroughly sequenced, most of the returns have an especially low E value, indicating that the results closely match the input sequence.

But more than the E value can be gathered from running a sequence through BLAST. The matching sequences, their lengths, and other information about these results can also be collected. Furthermore, one might also obtain the PDB ID code for the 3D structure of the desired protein from a BLAST search, so as to view it with NGLView, as shown below.

## Chapter 7 Molecular Graphics

In this section, we use NGLView to display a structure of interest.

Below is a prompt where the Protein Data Bank (PDB) code of a protein of interest can be entered.
Once entered, the code retrieves the 3D structural information for the protein.
Usually this structural information was originally gathered using X-ray crystallography.

In [None]:
#In this block, the user is prompted to enter the PDB code of the structure they would like to look at.
text_box = input("Please enter the PDB code of the structure: ").strip()

In [None]:
# We mimic the folder structure of the PDB database and save the PDB file in the corresponding folder.
# By default, PDBList stores the file in a sub-folder named after the two middle characters of the PDB code.
first_pdb_file = PDBList()
name = first_pdb_file.retrieve_pdb_file(text_box)

# Split the returned path into the file name (for example "1abc.cif") and the PDB code ("1ABC").
# os.path uses the separator of the current platform, so this works on Windows ("\\") and elsewhere ("/").
protein_file = os.path.basename(name)
protein_name = os.path.splitext(protein_file)[0].upper()

print("The name of the protein file is " + protein_file)

In [None]:
# The sub-folder is named after the two middle characters of the PDB code
protein_class = protein_file[1:3]

#We create an instance of the MMCIF Parser, to load the protein file.
parser = MMCIFParser()

path = os.path.join(protein_class, protein_file)
structure = parser.get_structure(protein_name, path)

In [None]:

# Helper that lists the public attributes and methods of an object
def clean_protein(obj):
    print(', '.join([a for a in dir(obj) if not a.startswith('_')]))
    
clean_protein(structure)

In [None]:
view_one = nv.show_biopython(structure)
view_one

In [None]:
view_two = nv.show_biopython(structure)
view_two.add_ball_and_stick()
view_two

In [None]:
# clean up the folder if necessary
os.remove(path) #Removes the file after the user is done looking at the protein.
os.rmdir(protein_class)