# Bioinformatics Workshop 1

Welcome to the first workshop of the MBSI Bioinformatics course! In this Jupyter notebook, we will introduce you to the basics of genetic code manipulation with scikit-bio. Feel free to work through the exercises at your own pace, and ask the tutors if you encounter any problems. You can run the coding cells using Shift + Enter.

---

## Getting started with scikit-bio

The first thing we have to do is import scikit-bio, along with the other libraries we will need. You may also want to change the path after `%cd` to the folder that holds your files for today's workshop.

In [None]:
#Import libraries 
import skbio
print(skbio.art) #Quick check to see if skbio is working
import numpy as np
import pandas as pd
%cd D:\workshop

Next, we will use the scikit-bio function `skbio.sequence.DNA()` to input our first snippets of DNA using the DNA data type unique to scikit-bio. Remember that only characters in the IUPAC DNA character set are supported, although in our case we will simply use A, T, C and G. The characters should be input as a string with no spaces, e.g. `'ATGAGGCCT'`. Have a go in the cell below, and check out the result!

In [None]:
#Exercise 0a
skbio.sequence.DNA('ATGAGGCCT')

On top of the sequence of bases itself, the DNA data type also displays key stats of the sequence, including its length, whether it has any gaps, its GC content (covered later) and more. You can also attach metadata such as a description or id using a dictionary, like so: `skbio.sequence.DNA('ACGT', metadata={'id': 'my-id', 'description': 'my-description'})`.

Also, you can try out these sequences to check the output:
- `skbio.sequence.DNA('ACGT--A-CG-B')`
- `skbio.sequence.DNA('AcggTaG', lowercase=True)`
- `skbio.sequence.DNA('BVMYSWHHKR')`

> 💡 **Further exercise:**
A geneticist has a genetic sequence `atCGaa--cGGA` that they would like to input as a `skbio.DNA` object. They also want to attach the id `GSDFGGH1132` and the description `Sample 6443` as metadata to the sequence. Hearing that you have 10 minutes of experience with scikit-bio, they turn to you to help them input the sequence.

In [None]:
#Exercise 0b
#Your code here

#### Importing Fasta and Fastq files

Today, we'll be using Fasta and Fastq files containing real-world genetic information. To import them, we will use the `skbio.io.read()` function. For our use, the function needs three inputs: the file name, the format and the data type we will read into, in this case `skbio.sequence.DNA`. For example:

In [119]:
Sample = skbio.io.read('Sample.fasta', format = 'fasta', into = skbio.sequence.DNA)
print(Sample)

TATTTACTTTCAGTTTTATTAATCTCCTCATTTTATTGGAAGAATTTTTGTTTGGGGTGTGATTGGGAGGGCTGAGCTAATTTGCTCCCTTCCTACCTTTTTTTGATGGCAAGCCTGATTCATCAATTTTTTTCTTTCTCTGTTTTAGTCAAGTAGGCCTCTAGAGAGGGGAACATCCCACACCCCTTGGTATGTCGTCTCATGATTGTCATTCTCTTGGGGAAAACGATTATTTGTGTCTAGGATTTGCTCTTTCACCCATTCTTTCTTTAGGTTCTTTTGTTTCCAGTGACTTTTGGATCCGAGTTATAACTTTGAATTTAGTGTTTATTGCTTGGTGATCTGCATTTGAAGTTGAGGTTTTTATGTCCCTGTATATACTCACTTTTTGCATAGGTTCCATAAACTCACAGAATATCCGAAGAGGATATTACACCTGAGGGGACTGGCTTTCCTAAGATTACTCTGGAGTTGGTCTTGGGGCCACGTGGTTTTGTTCTTGCCTGATCTTGGCAAGATCTAGTTGCCATCTAGTTTCTTTCTTTGCATGACATCTTGTGTTCCTACTTAAGAGCAGAATTTACTGAATTATTTTGGGGTGGAATGCTGATTGTGACAGGTGTTGACTCGATAATTTTGGAAATGTTTATGAATTTTCTTTTGGATCTTGTGGGACAGCTTTTCCTTTTTTCCCTCTGCTCATTATCTTGAGCTCAGAGAAAACTGTTAATTCCCTAGTGCCCCTCTTGCTTACTCTGATGATTCTGTTCACAGTTCTCTTGCTTGGGAAACATATTGTTTCGTTGACTTAGTCCTGCTCTCCCTCATCCATTACAGATGGGCAATCCTCCATGATCAGTTTGTTTTCCCTTCTGGGTAGTTCTTTTAATGCCAATCTGCTGCTTTTCTTTGGAATGTCATTTCTAGAGGTTTCCTTTTGGAATTTGGGTGGAGGAAAAGGAGCAAGTAGCCGATAAGCTTGTTTTACACTCACTTTCCA

Wow, now _that_ is a long sequence of DNA. But just how long? We can use `len()` as usual on a DNA data type to find out.

In [50]:
len(Sample)

130564

Yeah, that's pretty long. Luckily, `scikit-bio` has a variety of useful methods we can use on the DNA data type to help us manage and interpret the sequence. Firstly, yes/no checks such as `has_gaps()` and `has_metadata()` help us quickly make sense of the sequence.

In [49]:
print(Sample.has_gaps())
print(Sample.has_metadata())

False
True


So this sample has no gaps, but has some metadata attached to it. If it did have gaps, the `gaps()` method could be used to see their positions: it returns one boolean per position, `True` wherever there is a gap. To view the metadata, simply use the `metadata` attribute.

In [51]:
Example = skbio.sequence.DNA('ATC--GACCTG-A')
print(Example.gaps())
print(Sample.metadata)

[False False False  True  True False False False False False False  True
 False]
{'id': 's_harrisii_sample', 'description': ''}
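
The boolean array printed by `gaps()` can be hard to read for long sequences. As a minimal sketch, assuming we work on the plain string form of the sequence, the gap positions can be listed directly. The helper name `gap_positions` is ours and is not part of scikit-bio.

```python
def gap_positions(sequence, gap_chars="-."):
    """Return the 0-based indices of gap characters in a sequence.

    `gap_chars` is an assumption: '-' and '.' are the usual gap symbols.
    """
    return [i for i, char in enumerate(str(sequence)) if char in gap_chars]

positions = gap_positions("ATC--GACCTG-A")
print(positions)  # [3, 4, 11]
```

These indices line up with the `True` entries in the array above.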


This sample turns out to be from _Sarcophilus harrisii_, better known as the Tasmanian devil. There are many other methods we can use, and we shall see them later in the workshop. For a full list, the documentation on the `skbio.sequence.DNA` data type is available here: http://scikit-bio.org/docs/latest/generated/skbio.sequence.DNA.html#id2. Now on to some exercises!

## Exercise 1: Counting Nucleotides

For our first exercise, we will write a simple function that will count the number of times `A`, `T`, `C` and `G` appear in our DNA sequence. To help you with this, here is a scaffold for the code needed:


- Create four **variables** that will keep track of the nucleotide counts. 
- Create a `for` loop to iterate along the entire sequence.
- Inside that `for` loop, add 1 to the counting variable every time the iterator encounters that variable's corresponding letter.
- Return a dictionary of the nucleotide letter and its corresponding count.

> 💡 **Hint:**
Convert the sequence into a string before or in the for loop.

In [60]:
#Exercise 1a
def NuCounter(Sequence):
    #your code here
    countA = 0
    countT = 0
    countC = 0
    countG = 0
    for nuc in str(Sequence):
        if nuc == 'A':
            countA += 1
        elif nuc == 'T':
            countT += 1
        elif nuc == 'C':
            countC += 1
        elif nuc == 'G':
            countG += 1
        
    return {'A':countA, 'T':countT, 'C':countC, 'G':countG}


In [64]:
#Test your code below
print(NuCounter(skbio.sequence.DNA('ATCGGTCCAAGTACAG')) )#Should return {'A': 5, 'T': 3, 'C': 4, 'G': 4}
print(NuCounter(Sample))

{'A': 5, 'T': 3, 'C': 4, 'G': 4}
{'A': 42333, 'T': 44167, 'C': 22201, 'G': 21760}
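
The counting loop can also be written more compactly. The following is only a sketch of an alternative, using `collections.Counter` from the Python standard library on the string form of the sequence. The name `count_nucleotides` and its `alphabet` parameter are made up for illustration.

```python
from collections import Counter

def count_nucleotides(sequence, alphabet="ATCG"):
    """Count how often each letter of `alphabet` occurs in `sequence`."""
    counts = Counter(str(sequence))  # one pass over the sequence
    # A Counter returns 0 for letters that never occur
    return {letter: counts[letter] for letter in alphabet}

print(count_nucleotides("ATCGGTCCAAGTACAG"))
# {'A': 5, 'T': 3, 'C': 4, 'G': 4}
```

Passing `alphabet="ATCGN"` would also report the N positions without any extra `elif` branch.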


Good job! As it turns out, `scikit-bio` already has a built-in method to count the nucleotides, `frequencies()`. Oh well, at least we know how that works now. Using the `frequencies()` method below:

In [65]:
Sample.frequencies()

{'A': 42333, 'C': 22201, 'G': 21760, 'N': 103, 'T': 44167}

Yields a slightly different result. In addition to the As, Ts, Cs and Gs, we also have another character: N. In the IUPAC code, N means that the position can be occupied by any nucleotide, that is, the base there is unknown. Try modifying your code in Exercise 1a to also count the N positions.

In [68]:
#Exercise 1b
def NuCounterN(Sequence):
    #your code here
    countA = 0
    countT = 0
    countC = 0
    countG = 0
    countN = 0
    for nuc in str(Sequence):
        if nuc == 'A':
            countA += 1
        elif nuc == 'T':
            countT += 1
        elif nuc == 'C':
            countC += 1
        elif nuc == 'G':
            countG += 1
        elif nuc == 'N': #Count only N, not every other character
            countN += 1
        
    return {'A':countA, 'T':countT, 'C':countC, 'G':countG, 'N':countN}


In [69]:
#Test your code below
print(NuCounterN(skbio.sequence.DNA('ATNCGGTCCAAGNNTACAG')) )#Should return {'A': 5, 'T': 3, 'C': 4, 'G': 4, 'N': 3}
print(NuCounterN(Sample))

{'A': 5, 'T': 3, 'C': 4, 'G': 4, 'N': 3}
{'A': 42333, 'T': 44167, 'C': 22201, 'G': 21760, 'N': 103}


**Great!** Another good use for counting nucleotides is determining the _GC content of a sequence_. The GC content is the percentage of bases in DNA or RNA that are either guanine (G) or cytosine (C). Knowing the GC content of a sequence has a variety of biological uses, for example studying how a genome evolves through recombination by tracking its GC content over time, or discovering relationships between chromosome sizes and life-history traits.

Before we get to that point, though, we first have to write a function to determine the GC content. You can adapt the code you have already written in the previous exercises to make your life easier. Return the output as the sentence `'This sequence has a GC content of [your result]%.'`.

In [85]:
#Exercise 1c
def GcCount(Sequence):
    #Your code here
    countG = 0
    countC = 0
    for nuc in str(Sequence):
        if nuc == 'G':
            countG += 1
        elif nuc == 'C':
            countC += 1
    
    GCpercent = ((countG+countC)/len(Sequence))*100
    
    return 'This sequence has a GC content of ' + str(GCpercent) + '%.'
    

In [86]:
#Test your code below
print(GcCount(Sample)) #output should be 'This sequence has a GC content of 33.670077509880215%.'

This sequence has a GC content of 33.670077509880215%.
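
The calculation behind this number is simply (G count + C count) / sequence length × 100. As a hedged sketch, here is the same formula on a plain string, with the result rounded to two decimals when printed. The function name `gc_percent` is hypothetical, and raising an error for empty input is our own choice.

```python
def gc_percent(sequence):
    """Return the GC content of `sequence` as a percentage (0-100)."""
    text = str(sequence)
    if not text:
        raise ValueError("GC content is undefined for an empty sequence")
    gc_total = sum(1 for base in text if base in "GC")  # count G and C together
    return gc_total / len(text) * 100

print(f"This sequence has a GC content of {gc_percent('ATGCGC'):.2f}%.")
# This sequence has a GC content of 66.67%.
```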


**Excellent!** Again, `scikit-bio` has us beat. The built-in methods `gc_content()` and `gc_frequency()` calculate the GC content and the total count of Gs and Cs respectively.

In [87]:
print(Sample.gc_content()) #Note that the output is a decimal, not percentage.
print(Sample.gc_frequency())

0.33670077509880214
43961


Now that you are familiar with counting nucleotides in one sequence, it is time to do it on a list of sequences from a FASTQ file.

---

## Exercise 2: Working with multiple sequences

As mentioned in the slides, the FASTQ file differs from the FASTA file we used above in that it contains a _set_ of reads, each with its own quality scores and metadata, rather than just a single sequence. This means that we need a slightly different way of importing it and slightly different code.

#### Importing a set of reads

Firstly, we shall use our old friend `skbio.io.read` to read in the FASTQ file. This time, we will leave the result as a generator object, then append the reads it yields to a list. Note that we include an extra argument, `phred_offset=33`, which tells scikit-bio how the quality scores are encoded in the file: each quality character's ASCII code minus 33 gives the Phred score for that base.
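
To make the role of the offset concrete, here is a minimal sketch of the decoding step, assuming the common offset of 33. The helper `decode_quality` is hypothetical and written only for illustration. With scikit-bio you do not need to do this yourself, since the decoded scores appear under `'quality'` in each read's positional metadata.

```python
def decode_quality(quality_string, phred_offset=33):
    """Convert a FASTQ quality string into a list of Phred scores."""
    # Each character encodes one score: ASCII code minus the offset
    return [ord(char) - phred_offset for char in quality_string]

print(decode_quality("!5I"))  # [0, 20, 40]
```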

In [93]:
Reads = skbio.io.read('ws1.fastq', format = 'fastq', phred_offset = 33)
#Append to a list of reads
Readset = []
for read in Reads:
    Readset.append(read)
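
The reason for copying the reads into a list is that a generator can only be looped over once. A small sketch with a stand-in generator shows the effect. `fake_reads` is made up for illustration and does not read any file.

```python
def fake_reads():
    """Stand-in for the generator returned when reading a FASTQ file."""
    yield "CACCACAGCC"
    yield "TTGCCACTGG"

reads = fake_reads()
first_pass = list(reads)   # consumes the generator
second_pass = list(reads)  # nothing left to yield

print(len(first_pass), len(second_pass))  # 2 0
```

In place of the `for` loop with `append()`, `Readset = list(Reads)` would build the same list in one step.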

In [120]:
#Let's see what we're working with:
print(len(Readset))
print(Readset)

200
[Sequence
--------------------------------------------------------------------
Metadata:
    'description': 'IL27_4976:3:1:4720:965#3/1'
    'id': 'ERR024571.1'
Positional metadata:
    'quality': <dtype: uint8>
Stats:
    length: 76
--------------------------------------------------------------------
0  CACCACAGCC CGTTGGCCAA CAGGTCAGCA AACTGTTTCA GAATACAGTA TGCCAACGCT
60 GGGGTAATGC TACCTG, Sequence
--------------------------------------------------------------------
Metadata:
    'description': 'IL27_4976:3:1:5121:959#3/1'
    'id': 'ERR024571.2'
Positional metadata:
    'quality': <dtype: uint8>
Stats:
    length: 64
--------------------------------------------------------------------
0  TTGCCACTGG CTTTCACTAA ACCAATGACT CGACATTTGC CGACCAGTAG CCTCATTTAT
60 AGCC, Sequence
--------------------------------------------------------------------
Metadata:
    'description': 'IL27_4976:3:1:5362:959#3/1'
    'id': 'ERR024571.3'
Positional metadata:
    'quality': <dtype: uint8>
Stats:
    

Again, that's a whole lot of reads. So how are we going to make sense of them? Well, getting the minimum, maximum and average lengths of the reads would be a good start. Here is a scaffold to use when constructing the function:

- Create an empty list to store the read lengths of all the sequences in the readset.
- Iterate using a for loop across all the reads in the readset, and store the lengths of each read in the list.
- Calculate the minimum, maximum and average lengths using the `min()`, `max()` and `sum()` functions.
- Return a dictionary containing the labels `min_length`, `max_length` and `avg_length`.

In [97]:
#Exercise 2a
def read_lengths(readset):
    #Your code here
    
    lengths = [] #Named differently from the function to avoid confusion
    for read in readset:
        lengths.append(len(read))
    
    lengthmin = min(lengths)
    lengthmax = max(lengths)
    lengthavg = sum(lengths)/float(len(lengths))
    
    return {'min_length':lengthmin, 'max_length':lengthmax, 'avg_length':lengthavg}

In [98]:
#Test your code here
print(read_lengths(Readset)) #Output should be "{'min_length': 56, 'max_length': 76, 'avg_length': 75.445}"

{'min_length': 56, 'max_length': 76, 'avg_length': 75.445}
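
The same summary can be written a little more compactly with a list comprehension. This is only a sketch on plain strings, which have a length just like reads do. The name `summarise_lengths` and the error for an empty readset are our own additions.

```python
def summarise_lengths(readset):
    """Return the minimum, maximum and mean length of the reads."""
    lengths = [len(read) for read in readset]  # one length per read
    if not lengths:
        raise ValueError("readset is empty")
    return {
        "min_length": min(lengths),
        "max_length": max(lengths),
        "avg_length": sum(lengths) / len(lengths),
    }

print(summarise_lengths(["ACGT", "ACGTAC", "AC"]))
# {'min_length': 2, 'max_length': 6, 'avg_length': 4.0}
```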


**Nice!** Now, let's put together what we have learned so far. Create a function that outputs the average number of nucleotides A, T, C and G in a given readset. You do not have to consider any other DNA code character. This time, we will be using nested for loops, the outer one iterating over the reads, and the inner one iterating over each nucleotide in the sequence.

Here is a scaffold to help you along:
- Initialise four lists to keep track of the counts of each nucleotide for each read.
- Start a `for` loop iterating over every read in the readset.
    - **Inside this loop:**
    - Create four **variables** that will keep track of the nucleotide counts.
    - Create a `for` loop to iterate along the entire sequence.
        - **Inside this loop:**
        - Add 1 to the counting variable every time the iterator encounters that variable's corresponding letter.
    - Append the counts to their corresponding lists.
- Take the average of each list (you can use `np.mean()` instead of using `sum()` and dividing).
- Return a dictionary containing the labels `avgA`, `avgT`, `avgC` and `avgG` with their corresponding average counts.

> 💡 **Hint:**
A lot of the code inside the first `for` loop can be adapted or directly taken from your function in Exercise 1a!


In [101]:
#Exercise 2b
def avgATCG(readset):
    #Your code here
    
    listA = []
    listT = []
    listC = []
    listG = []
    
    for read in readset:
        countA = 0
        countT = 0
        countC = 0
        countG = 0
        
        for nuc in str(read):
            if nuc == 'A':
                countA += 1
            elif nuc == 'T':
                countT += 1
            elif nuc == 'C':
                countC += 1
            elif nuc == 'G':
                countG += 1
            
        listA.append(countA)
        listT.append(countT)
        listC.append(countC)
        listG.append(countG)
    
    return {'avgA':np.mean(listA), 'avgT':np.mean(listT), 'avgC':np.mean(listC), 'avgG':np.mean(listG)}

In [102]:
#Test your code here
print(avgATCG(Readset)) #output should be '{'avgA': 19.66, 'avgT': 18.905, 'avgC': 18.31, 'avgG': 18.57}'

{'avgA': 19.66, 'avgT': 18.905, 'avgC': 18.31, 'avgG': 18.57}
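
Once the nested loops make sense, the inner loop can be handed over to `collections.Counter`. The sketch below is an alternative under that assumption, not the expected exercise answer. `average_counts` is a hypothetical name, and the function assumes a non-empty readset.

```python
from collections import Counter

def average_counts(readset, alphabet="ATCG"):
    """Average per-read count of each letter in `alphabet`."""
    per_read = [Counter(str(read)) for read in readset]  # one Counter per read
    n_reads = len(per_read)  # assumed to be greater than zero
    return {
        "avg" + letter: sum(counts[letter] for counts in per_read) / n_reads
        for letter in alphabet
    }

print(average_counts(["AATC", "AGGC"]))
# {'avgA': 1.5, 'avgT': 0.5, 'avgC': 1.0, 'avgG': 1.0}
```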


**Well Done!** Nested `for` loops are essential when dealing with a _set_ of sequences or reads. Sometimes, you just need to create a loop in a loop in a loop in a loop in a.... Anyways, we shall now take a ~~loop~~ look at transcribing, complementing and translating DNA sequences.

---

## Exercise 3: Transcribing and Translating


Genes provide the information needed to create _proteins_. The production happens through two processes: **Transcription** and **Translation**. Transcription involves using the DNA strand as a template to build a sequence of _RNA_. During translation, amino acids are joined together into a protein according to the information in the RNA sequence. In the next few exercises, we will create a series of functions that will simulate this process.

Firstly, let's understand the basics of transcription. 
- **The four nucleotides found in DNA:** Adenine (A), Cytosine (C), Guanine (G), and Thymine (T).
- **The four nucleotides found in RNA**: Adenine (A), Cytosine (C), Guanine (G), and Uracil (U).

As you can see, RNA does not have a T nucleotide; instead, it contains uracil (U), which we can treat as an equivalent.

For each DNA strand, the transcribed RNA strand is made by adding on the DNA's _complement nucleotide_. In this case:
- A &rarr; U
- T &rarr; A
- C &rarr; G
- G &rarr; C

In the exercise below, we shall essentially follow this process to output a transcribed sequence of RNA from a DNA input.

Here is a useful scaffold for the code needed:
- Create a `for` loop iterating over every nucleotide of the sequence
- Use `if` statements to replace the nucleotide at that specific position with its RNA complement
- Return a `skbio.sequence.RNA` sequence of the transcribed RNA

In [117]:
#Exercise 3a
def transcribe(Sequence):
    #Your code here
    pass #Placeholder so the cell runs; replace it with your solution


In [None]:
#Test your code here
transcribe('GCTAA') #should return CGAUU
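
If you get stuck, here is one possible sketch of the complement rules listed earlier, written with a dictionary lookup instead of `if` statements. It returns a plain string rather than an RNA object, and the names `DNA_TO_RNA` and `transcribe_sketch` are ours, chosen so they do not overwrite your own `transcribe` function.

```python
# Complement rules from the exercise text: DNA letter -> RNA letter
DNA_TO_RNA = {"A": "U", "T": "A", "C": "G", "G": "C"}

def transcribe_sketch(sequence):
    """Return the RNA complement of a DNA sequence as a plain string."""
    rna = []
    for nuc in str(sequence):
        if nuc not in DNA_TO_RNA:
            raise ValueError(f"Unexpected character in DNA sequence: {nuc!r}")
        rna.append(DNA_TO_RNA[nuc])
    return "".join(rna)

print(transcribe_sketch("GCTAA"))  # CGAUU
```

Wrapping the result in `skbio.sequence.RNA(...)` would give the data type the exercise asks for.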