<a href="https://colab.research.google.com/github/TomKellyGenetics/toy_snp_caller/blob/master/Copy_of_ONT_pyCUDA_Align.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this exercise we will write Python code to generate some artificial DNA sequences that will serve as our set of reference genes. We will then create a new sequence that is similar to part of one of the references and write a function that finds the best match. We will measure the performance of this code and then consider how we might create a GPU-accelerated version.

First of all let's import a few libraries

In [0]:
import random
import numpy as np
import time

Next, set up some constants to define the size of the problem

In [0]:
REFERENCE_LENGTH = 500  # The length of the reference sequences
SEQUENCE_LENGTH = 100    # The length of the sequence to be generated (before noise is added)
NUMBER_REFERENCES = 100  # The number of references in which to find a match

# Dictionary that converts the base letters to an integer
ascii_to_index = {"C": 0, "G": 1, "A": 2, "T": 3}
# Array that can be used to get the base letter from the integer value
index_to_ascii = ["C", "G", "A", "T"]
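
The two lookup tables above can also be used to move between strings and integer arrays, which is usually the more convenient form for a GPU kernel. The following is a minimal sketch; `encode` and `decode` are hypothetical helper names that are not part of the original notebook.

```python
import numpy as np

# Same lookup tables as in the cell above, repeated so this sketch runs on its own
ascii_to_index = {"C": 0, "G": 1, "A": 2, "T": 3}
index_to_ascii = ["C", "G", "A", "T"]

def encode(sequence):
    """Hypothetical helper: convert a base string into an int8 array."""
    return np.array([ascii_to_index[base] for base in sequence], dtype=np.int8)

def decode(indices):
    """Hypothetical helper: convert an integer array back into a base string."""
    return "".join(index_to_ascii[i] for i in indices)

encoded = encode("GATTACA")
print(encoded)          # [1 2 3 3 2 0 2]
print(decode(encoded))  # GATTACA
```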

Create a function to populate some artificial reference sequences, which we will search for the best alignment

In [0]:
# Randomly assign CAGTs to the reference sequences
def initialise_references():
    refs = []

    for i in range(NUMBER_REFERENCES):
        ref = ""
        for j in range(REFERENCE_LENGTH):
            irand = random.randint(0, 3)
            ref += index_to_ascii[irand]

        refs.append(ref)

    return refs
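
As an aside, the inner character-by-character loop could probably be replaced by a single call to `random.choices`. This is only a sketch of an alternative: `random_reference` is a hypothetical name, and because it draws from the random stream differently, it will not reproduce the same strings as `initialise_references` for a given seed.

```python
import random

index_to_ascii = ["C", "G", "A", "T"]

def random_reference(length):
    """Hypothetical helper: build one random reference of the given length."""
    return "".join(random.choices(index_to_ascii, k=length))

random.seed(30)  # fixed seed so the demo is repeatable
refs_demo = [random_reference(20) for _ in range(3)]
for r in refs_demo:
    print(r)
```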


Now we are going to generate the sequence to match by randomly selecting one of the reference sequences and then adding some noise to it.

In [0]:
# Randomly pick one of the references, pick an offset and
# add noise - this will be the sequence we'll seek
def initialise_sequence(references):

    seq = ""
    # randomly select the reference string to match
    iref = random.randint(0, NUMBER_REFERENCES-1)
    ref_offset = random.randint(0,REFERENCE_LENGTH - SEQUENCE_LENGTH)
    print("Actual offset = " + str(ref_offset) + " on sequence #" + str(iref))
    ref = references[iref]
    dels = 0
    subs = 0
    ins = 0

    for i in range(ref_offset, ref_offset+SEQUENCE_LENGTH):
        base = ref[i]
        i_rand = random.randint(0, 1000)

        if (i_rand < 22):
            # Insertion of up to 3 random characters before the current base
            i_len = random.randint(0, 3)
            ins += i_len
            cnew = ""
            for _ in range(i_len):
                # randint is inclusive, so all four bases can be inserted
                irand = random.randint(0, 3)
                cnew += index_to_ascii[irand]

            seq+=cnew + base
        elif (i_rand < 44):
            subs += 1
            # Substitute the current character
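            # Pick one of the three bases that differ from the current one
            inew = random.randint(0, 2)
            if (base == "A"):
                # A is index 2, so remap a draw of 2 to T (index 3)
                cnew = index_to_ascii[inew if (inew != 2) else 3]
            elif (base == "T"):
                cnew = index_to_ascii[inew]
            elif (base == "C"):
                cnew = index_to_ascii[inew+1]
            else:
                # G is index 1, so remap a draw of 1 to T (index 3)
                cnew = index_to_ascii[inew if (inew != 1) else 3]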

            seq+=cnew
        elif (i_rand < 66):
            # Deletion
            dels += 1
        else:
            seq += base

    print(str(subs) + " subs, " + str(dels) + " dels, " + str(ins) + " ins")
    print(ref)
    print(seq)

    return seq
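
To get a feel for how much noise the thresholds 22, 44 and 66 introduce, the arithmetic can be worked through directly. The sketch below assumes the draws described above (`random.randint(0, 1000)` for the event and `random.randint(0, 3)` for the number of inserted bases); the variable names are introduced here for illustration only.

```python
SEQUENCE_LENGTH = 100

outcomes = 1001                       # random.randint(0, 1000) has 1001 equally likely values
p_event = 22 / outcomes               # chance of each of insertion, substitution, deletion
p_keep = 1 - 3 * p_event              # chance that a base is copied unchanged
mean_inserted = (0 + 1 + 2 + 3) / 4   # average number of extra bases per insertion event

# Expected number of output characters produced per reference base:
# insertion emits the extra bases plus the base, substitution and copy emit 1, deletion emits 0
expected_per_base = p_event * (mean_inserted + 1) + p_event * 1 + p_event * 0 + p_keep * 1
expected_length = SEQUENCE_LENGTH * expected_per_base

print(f"{p_event:.2%} per event type, expected length {expected_length:.1f}")
# 2.20% per event type, expected length 101.1
```

So the noisy sequence is expected to be about as long as the original 100 bases, with roughly two events of each type.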


Next, create a cost function. In this case it is simply +1 for a match and -1 for a mismatch (substitution). Insertions and deletions are also penalised by -1, but that gap penalty is applied inside the alignment function below.

In [0]:
def cost(char1, char2):
  return -1 if char1!=char2 else +1
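
As a quick illustration of the scoring, the sketch below sums `cost` over two strings of equal length, with no gaps allowed. `ungapped_score` is a hypothetical helper used only for this example.

```python
def cost(char1, char2):
    return -1 if char1 != char2 else +1

def ungapped_score(a, b):
    """Hypothetical helper: sum cost() over aligned positions, with no gaps."""
    return sum(cost(x, y) for x, y in zip(a, b))

print(ungapped_score("GATTACA", "GATTACA"))  # 7 (seven matches)
print(ungapped_score("GATTACA", "GACTACA"))  # 5 (six matches, one mismatch)
```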

Next we create the main alignment function

In [0]:
# Compute a cost and, optionally, an offset for the semi-global alignment of seq to ref
def align(seq, ref, matrix, trace, compute_trace=False):

    max_score = -len(seq)
    max_idx = 0

    # Initialise the scoring matrices
    for i in range(len(seq)):
        matrix[i] = -(i+1)
        if compute_trace:
            trace[i, 0] = -i

    # Do the NW matrix calculation (column by column - to preserve memory)
    for i, base_r in enumerate(ref):
        top_left = 0
        bottom_left = matrix[0]
        top_right = 0
        for j, base_s in enumerate(seq):
            cost_m = cost(base_s, base_r) + top_left
            cost_i = bottom_left - 1
            cost_d = top_right - 1
            top_left = matrix[j]
            matrix[j] = max(cost_m, cost_i, cost_d)
            bottom_left = matrix[j+1]
            top_right = matrix[j]

            if compute_trace:
                if cost_m >= cost_i:
                    if cost_m >= cost_d:
                        trace[i + 1, j + 1] = [i, j]
                    else:
                        trace[i + 1, j + 1] = [i + 1, j]
                else:
                    if cost_i >= cost_d:
                        trace[i + 1, j + 1] = [i, j + 1]
                    else:
                        trace[i + 1, j + 1] = [i + 1, j]

        if matrix[len(seq)-1] > max_score:
            max_score = matrix[len(seq)-1]
            max_idx = i + 1  # trace rows are shifted by one: reference base i is stored at row i + 1

    return max_score, trace, max_idx
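
Because the function above keeps only one column of scores at a time, it can be hard to check by eye. One way to test it on short strings is to compare against a version that stores the full matrix. The following is a sketch under the same scoring assumptions (+1 match, -1 mismatch, -1 per gap, free start and end in the reference); `align_full_matrix` is a hypothetical name and is not part of the original notebook.

```python
import numpy as np

def cost(char1, char2):
    return -1 if char1 != char2 else +1

def align_full_matrix(seq, ref):
    """Hypothetical reference implementation that stores the full score matrix."""
    n, m = len(seq), len(ref)
    H = np.zeros((n + 1, m + 1), dtype=np.int32)
    H[:, 0] = -np.arange(n + 1)   # sequence bases aligned before the reference starts are penalised
    # H[0, :] stays 0: the match may start anywhere in the reference for free
    for c in range(1, m + 1):
        for r in range(1, n + 1):
            H[r, c] = max(H[r - 1, c - 1] + cost(seq[r - 1], ref[c - 1]),  # match or mismatch
                          H[r, c - 1] - 1,                                 # skip a reference base
                          H[r - 1, c] - 1)                                 # skip a sequence base
    end = int(np.argmax(H[n]))    # first column holding the best final-row score
    return int(H[n, end]), end

print(align_full_matrix("TTAC", "GATTACA"))  # (4, 6)
```

Here the returned column is the number of reference bases consumed when the best alignment ends, so "TTAC" ends after the sixth base of "GATTACA".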


We also need a function to recover the offset of the alignment using the back-tracking (traceback) part of the Needleman-Wunsch algorithm

In [0]:
# Use the traceback to get the offset
def compute_offset(traceback, len_seq, ref, max_idx):

    match = ""
    idx = max_idx
    ls = len_seq

    # Traceback to get the offset
    while ls > 0:
        match = ref[idx - 1] + match
        idx, ls = traceback[idx, ls]
        #print(idx, ls)

    return idx
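
To see how the traceback walks back to the starting offset, it can help to run it on a tiny hand-built trace. The sketch below uses a trimmed copy of `compute_offset` (without the unused `match` string) and a made-up trace in which every cell points diagonally, which corresponds to an alignment with no gaps. It assumes the end position passed in is the trace row just after the last matched reference base.

```python
import numpy as np

def compute_offset(traceback, len_seq, ref, max_idx):
    # Trimmed copy of the function above, kept here so the sketch runs on its own
    idx, ls = max_idx, len_seq
    while ls > 0:
        idx, ls = traceback[idx, ls]
    return int(idx)

ref, seq = "GATTACA", "TTAC"

# Made-up trace: every cell points to its diagonal neighbour (pure match/mismatch path)
trace = np.zeros((len(ref) + 1, len(seq) + 1, 2), dtype=np.int32)
for c in range(1, len(ref) + 1):
    for r in range(1, len(seq) + 1):
        trace[c, r] = [c - 1, r - 1]

# "TTAC" ends after the sixth base of "GATTACA", so start the walk at row 6
print(compute_offset(trace, len(seq), ref, 6))  # 2
```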



Right, now we have all the main functions defined we can create the main body of the program that will execute the functions we defined in order to find the best alignment between our references and the noisy sequence.

In [0]:
offset = 0
score = 0
max_score = 0
ref_idx = 0
time_span = 0

# For debugging, it can be helpful to set the seed so that the same strings are produced each time
#random.seed(30)

t1 = time.time()

refs = initialise_references()
seq = initialise_sequence(refs)

# Create arrays for the scoring matrix and traceback matrix
score_matrix = np.zeros(len(seq) + 1, dtype=np.int32)
trace_matrix = np.zeros((REFERENCE_LENGTH+1, len(seq) + 1, 2), dtype=np.int32)
max_score = -len(seq)

time_span = time.time() - t1
print("Data initialisation took " + str(time_span) + " seconds.")

# Run alignment on the host
t1 = time.time()

scores=[]
for i in range(NUMBER_REFERENCES):
    score, _, _ = align(seq, refs[i], score_matrix, trace_matrix)
    scores.append(score)
    if score > max_score:
        max_score = score
        ref_idx = i

# now get offset
score, trace_matrix, offset_idx = align(seq, refs[ref_idx], score_matrix, trace_matrix, compute_trace=True)
offset = compute_offset(trace_matrix, len(seq), refs[ref_idx], offset_idx)
print("Optimal cost of " + str(max_score) + " found at offset " + str(offset) + " in reference " + str(ref_idx))
time_span = time.time() - t1
print("Serial took " + str(time_span) + " seconds")
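
Since the loop above already collects every score in a list, the best reference can also be picked afterwards in one step. This is a small sketch with made-up scores; it is likely to be the natural pattern once a GPU kernel returns all the scores as a single array.

```python
import numpy as np

scores_demo = [-12, 3, 41, 7]            # made-up scores, one per reference
best_ref = int(np.argmax(scores_demo))   # index of the first maximum, like the `>` test in the loop
print(best_ref, scores_demo[best_ref])   # 2 41
```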

Great - now can you create a kernel to do the alignment? Have a think for 10 minutes before you start implementing anything.

We also need to install pycuda and import its libraries (to install it, insert a scratch code cell and run 'pip install pycuda')


In [0]:
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
from pycuda import gpuarray

# Create your kernel (or kernels)
myfunc = SourceModule("""
__global__ void ...
{
    
}
""")

Now create the calling code in Python (you can leave all the other functions as they were)

In [0]:
# Now run the same using our kernel
t1 = time.time()

#TODO - Insert data preparation code

time_span = time.time() - t1
print("Data initialisation took " + str(time_span) + " seconds.")

t1 = time.time()

#TODO Insert alignment code

# now get offset on the best reference match
# (the alignment code above is expected to set max_score and ref_idx)
score, trace_matrix, offset_idx = align(seq, refs[ref_idx], score_matrix, trace_matrix, compute_trace=True)

offset = compute_offset(trace_matrix, len(seq), refs[ref_idx], offset_idx)
print("Optimal cost of " + str(max_score) + " found at offset " + str(offset) + " in reference " + str(ref_idx))
time_span = time.time() - t1
print("CUDA version took " + str(time_span) + " seconds")
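
For the data preparation step, one possible approach is to pack all the references into a single contiguous integer array before copying it to the device. The sketch below covers only the host-side packing with NumPy; `pack_references` is a hypothetical helper, and it assumes every reference has the same length. An array like this could then be handed to something like `gpuarray.to_gpu`, although how you lay out the data is part of the exercise.

```python
import numpy as np

ascii_to_index = {"C": 0, "G": 1, "A": 2, "T": 3}

def pack_references(references):
    """Hypothetical helper: stack equal-length references into one 2-D int8 array."""
    return np.array([[ascii_to_index[b] for b in ref] for ref in references],
                    dtype=np.int8)

refs_demo = ["GATT", "ACAC", "TTGC"]   # made-up references for illustration
packed = pack_references(refs_demo)
print(packed.shape, packed.flags["C_CONTIGUOUS"])  # (3, 4) True
```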