## 0. Import libraries and define global variable

Importing the json library to export the data we gather from the DNA sequence, and pandas to build the DataFrames.

In [54]:
import json
import pandas as pd

Setting the fixed width for the sliding window to traverse the DNA sequence.

In [55]:
windowWidth = 3

## 1. Define functions

Function to create a DataFrame from nucleotide information. The DataFrame holds the count of the nucleotide in each frame as well as the percentage of the frame that the nucleotide makes up.

In [56]:
def formNucleotideTable(dictionary, base):
    # Making the names of the count column dynamic by passing the name of the nucleotide as param
    countColName = 'Count of ' + base
    
    # Create DataFrame using the keys as the index and count values as data
    retTable = pd.DataFrame({
            'Frame Position': list(dictionary.keys()),
            countColName: list(dictionary.values())
        }
    ).set_index('Frame Position')
    
    # Calculate percentage of the frame that the nucleotide makes up
    # (vectorized: the whole count column is divided at once)
    retTable['Percentage of Frame'] = retTable[countColName] / windowWidth * 100
    
    return retTable
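
As a quick illustration of the table this function produces, here is a minimal self-contained sketch. The helper name nucleotide_table, its explicit width parameter, and the toy counts are assumptions made for this example; they are not part of the notebook.

```python
import pandas as pd

def nucleotide_table(counts, base, width):
    """Hypothetical stand-in for formNucleotideTable, with the width passed explicitly."""
    col = 'Count of ' + base
    table = pd.DataFrame({
        'Frame Position': list(counts.keys()),
        col: list(counts.values()),
    }).set_index('Frame Position')
    # Share of the window occupied by this nucleotide, in percent
    table['Percentage of Frame'] = table[col] / width * 100
    return table

# Toy counts for three frames of width 3 (made-up data)
toy_table = nucleotide_table({1: 2, 2: 1, 3: 0}, 'A', 3)
print(toy_table)
```

Passing the width as an argument instead of reading the global windowWidth is only a choice for this sketch; it keeps the example independent of notebook state.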

Function to implement the sliding window functionality, passing the DNA sequence as a string. This function returns four dictionaries, one for each nucleotide, mapping each frame position to the number of times that nucleotide occurs in the window starting there.

In [57]:
def slidingWindow(seq):
    length = len(seq)
    A = dict()
    T = dict()
    G = dict()
    C = dict()
    
    # Loop through entire sequence
    for i in range(length - windowWidth + 1):
        aCount = 0
        tCount = 0        
        gCount = 0
        cCount = 0
        
        # In each iteration, loop through window
        for j in range(i, windowWidth + i):
            if seq[j] == 'A':
                aCount += 1
            elif seq[j] == 'T':
                tCount += 1
            elif seq[j] == 'G':
                gCount += 1
            elif seq[j] == 'C':
                # Match 'C' explicitly so other characters (e.g. newlines) are not counted as C
                cCount += 1
                
        # Update counts of each nucleotide in respective dictionary
        A[i + 1] = aCount    
        T[i + 1] = tCount    
        G[i + 1] = gCount    
        C[i + 1] = cCount  
        
    return A, T, G, C
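
To make the window arithmetic concrete, the sketch below runs the same idea on a five-letter sequence. It is a compact variant rather than the notebook's function: the name window_counts, the explicit width parameter, and the use of str.count are assumptions for illustration.

```python
def window_counts(seq, width):
    """Hypothetical compact variant: count each base per window with str.count."""
    counts = {base: {} for base in 'ATGC'}
    # A sequence of length n has n - width + 1 windows
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        for base in 'ATGC':
            # Frame positions are 1-based, matching slidingWindow
            counts[base][i + 1] = window.count(base)
    return counts

# 'AATGC' with width 3 yields the windows AAT, ATG and TGC
toy_counts = window_counts('AATGC', 3)
print(toy_counts['A'])
```

Characters other than A, T, G, and C are simply not counted in this variant.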

Simple function to convert a list to a string.

In [58]:
def listToString(ls):
    # str.join concatenates all elements of the list in one pass
    return ''.join(ls)

## 2. Import data

We store the sample DNA string in a .txt file to keep our notebook clutter-free. That data needs to be read from the text file into the program using file handling. Since readlines() returns the data as a list of lines, we convert it into a single string using the above function.

In [59]:
# Courtesy of https://www.bioinformatics.org/sms2/random_dna.html & http://www.faculty.ucr.edu/~mmaduro/random.htm
with open('dna.txt', 'r') as fin:
    lines = fin.readlines()

# Convert entire sequence to uppercase
dna = listToString(lines).upper()

print(len(dna))

10000
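
Note that readlines() keeps the trailing newline of every line, so if the file spans several lines those characters become part of the string and of its length. Whether dna.txt contains line breaks is not shown here, so the following is only a sketch of one way to drop them. The helper name read_sequence and the temporary sample file are assumptions for illustration.

```python
import os
import tempfile

def read_sequence(path):
    """Hypothetical helper: read a sequence file and drop all whitespace."""
    with open(path, 'r') as fin:
        # str.split() with no arguments splits on any whitespace, including newlines
        return ''.join(fin.read().split()).upper()

# Throwaway file standing in for dna.txt (made-up content)
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('aatg\ncgta\n')

sample = read_sequence(tmp.name)
os.remove(tmp.name)
print(sample, len(sample))
```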


## 3. Create DataFrames and Export data

Creating dictionaries for each nucleotide.

In [60]:
A, T, G, C = slidingWindow(dna)

Here we create the master DataFrame as well as a DataFrame for each individual nucleotide using the function defined above.

In [61]:
# Creating a master dictionary
master = {
    'Count of A': A,
    'Count of T': T,
    'Count of G': G,
    'Count of C': C
}

masterTable = pd.DataFrame(master)

# Naming the index Frame Position (the frame positions already form the index)
masterTable.index.name = 'Frame Position'
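
When pandas builds a DataFrame from a dictionary of dictionaries, the outer keys become the columns and the inner keys become the index. A small sketch with made-up counts (the toy_ names are hypothetical) shows the resulting layout:

```python
import pandas as pd

# Outer keys become column names, inner keys become the row index
toy_master = {
    'Count of A': {1: 2, 2: 1, 3: 0},
    'Count of T': {1: 1, 2: 1, 3: 1},
}

toy_frame = pd.DataFrame(toy_master)
toy_frame.index.name = 'Frame Position'
print(toy_frame)
```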

In [62]:
# Using the function defined above to make a table for each nucleotide
aTable = formNucleotideTable(A, 'A')
tTable = formNucleotideTable(T, 'T')
gTable = formNucleotideTable(G, 'G')
cTable = formNucleotideTable(C, 'C')

In [63]:
aTable

| Frame Position | Count of A | Percentage of Frame |
|---|---|---|
| 1 | 2 | 66.666667 |
| 2 | 2 | 66.666667 |
| 3 | 2 | 66.666667 |
| 4 | 2 | 66.666667 |
| 5 | 1 | 33.333333 |
| ... | ... | ... |
| 9994 | 0 | 0.000000 |
| 9995 | 1 | 33.333333 |
| 9996 | 1 | 33.333333 |
| 9997 | 1 | 33.333333 |


Exporting data to a json file.

In [64]:
# Serializing json
jsonObject = json.dumps(master, indent=4)

# Writing json file
with open('data.json', 'w') as outfile:
    outfile.write(jsonObject)
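
One detail to keep in mind when reading the file back: JSON object keys are always strings, so the integer frame positions are written as "1", "2", and so on. The sketch below, using made-up counts and hypothetical variable names, shows the round trip and one way to restore integer keys.

```python
import json

toy_master = {'Count of A': {1: 2, 2: 1, 3: 0}}

# Serialize and parse again, as would happen when data.json is reloaded
restored = json.loads(json.dumps(toy_master))
print(restored)  # the frame positions are now strings

# Convert the frame positions back to integers
restored_int = {
    name: {int(pos): count for pos, count in frames.items()}
    for name, frames in restored.items()
}
print(restored_int == toy_master)
```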