## 0. Import libraries, configure environment and define global variables

We import the `json` library to export the data gathered from the DNA sequence, and `pandas` to tabulate it.

In [23]:
import json
import pandas as pd

In [24]:
pd.options.display.max_rows = 100
pd.options.display.max_columns = 50
# Two decimal places, matching the percentages shown in the output tables (e.g. 66.67%)
pd.options.display.float_format = '{:.2f}%'.format

Setting the fixed width for the sliding window to traverse the DNA sequence.

In [25]:
windowWidth = 3

The total number of positions of the sliding window is the length of the sequence minus the width of the window, plus one:

`len(seq) - windowWidth + 1`
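
A quick way to check the number of window positions is to enumerate them for a short sequence. The sketch below uses made-up inputs (`toy_seq` and `toy_width` are illustrative names, not part of the notebook): a 5-character sequence with a width of 3 yields 3 windows, which is the same count produced by the `range(length - windowWidth + 1)` loop in `slidingWindow`.

```python
# Hypothetical toy inputs, chosen only to illustrate the window count
toy_seq = "ATGCA"
toy_width = 3

# Every window starts at index i and covers toy_seq[i : i + toy_width]
windows = [toy_seq[i:i + toy_width] for i in range(len(toy_seq) - toy_width + 1)]

print(windows)       # ['ATG', 'TGC', 'GCA']
print(len(windows))  # 3
```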

## 1. Define functions

Function to create a DataFrame from nucleotide information. Each DataFrame holds the nucleotide's count in every frame as well as the percentage of the frame that the nucleotide makes up.

In [26]:
def formNucleotideTable(dictionary, base):
    # Making the names of the count and percentage cols dynamic by passing the name of the nucleotide as param
    countColName = base
    percentColName = base + '%'
    
    # Create DataFrame using the keys as the index and count values as data
    retTable = pd.DataFrame({
            'Frame Position': list(dictionary.keys()),
            countColName: list(dictionary.values())
        }
    ).set_index('Frame Position')
    
    # Calculate percentage of the frame that the nucleotide makes up 
    countList = retTable[countColName].tolist()
    percentage = [((i / windowWidth) * 100) for i in countList]

    # Add as a column
    retTable[percentColName] = percentage
    
    return retTable
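
The percentage step can also be written as a single column operation. This is only a sketch of an equivalent approach, not the notebook's function: `window_width` is assumed to equal `windowWidth`, and the counts are made up for illustration.

```python
import pandas as pd

window_width = 3             # assumed to match the notebook's windowWidth
counts = {1: 2, 2: 2, 3: 1}  # made-up counts keyed by frame position

# Build the count column from the dictionary; its keys become the index
table = pd.DataFrame({'A': pd.Series(counts)})
table.index.name = 'Frame Position'

# Vectorised percentage of the frame made up by the nucleotide
table['A%'] = table['A'] / window_width * 100

print(table)
```

Dividing the whole column at once avoids the intermediate Python list, at the cost of being slightly less explicit than the loop.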

Function implementing the sliding window over the DNA sequence, which is passed as a string. It returns four dictionaries, one per nucleotide, each mapping a 1-based frame position to the number of times that nucleotide occurs in that frame.

In [27]:
def slidingWindow(seq):
    length = len(seq)
    A = dict()
    T = dict()
    G = dict()
    C = dict()
    
    # Loop through entire sequence
    for i in range(length - windowWidth + 1):
        aCount = 0
        tCount = 0        
        gCount = 0
        cCount = 0
        
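        # In each iteration, loop through the window
        for j in range(i, windowWidth + i):
            if seq[j] == 'A':
                aCount += 1
            elif seq[j] == 'T':
                tCount += 1
            elif seq[j] == 'G':
                gCount += 1
            elif seq[j] == 'C':
                # Count only an explicit 'C'; any other character (e.g. 'N' or a newline) is ignored
                cCount += 1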
                
        # Update counts of each nucleotide in respective dictionary
        A[i + 1] = aCount    
        T[i + 1] = tCount    
        G[i + 1] = gCount    
        C[i + 1] = cCount  
        
    return A, T, G, C
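
To see what the per-frame counts look like on a small input, here is a compact sketch of the same idea built on `collections.Counter`. It is an alternative formulation for illustration only: `sliding_window_counts` and the sequence `"AATAC"` are hypothetical and do not appear elsewhere in the notebook.

```python
from collections import Counter

def sliding_window_counts(seq, width):
    """Return one Counter per window, keyed by 1-based frame position."""
    return {i + 1: Counter(seq[i:i + width]) for i in range(len(seq) - width + 1)}

frames = sliding_window_counts("AATAC", 3)

print(frames[1]['A'])  # 2  (window "AAT")
print(frames[3]['C'])  # 1  (window "TAC")
print(frames[2]['G'])  # 0  (a Counter returns 0 for missing keys)
```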

Simple function to concatenate a list of strings into a single string.

In [28]:
def listToString(ls): 
    returnString = ''
    
    for elt in ls: 
        returnString += elt  
    
    return returnString
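
For reference, Python's built-in `str.join` performs the same concatenation in one call. The list below is made up purely for illustration.

```python
parts = ['ATG', 'CCA', 'T']  # made-up chunks of a sequence

# Join the chunks with an empty separator
joined = ''.join(parts)

print(joined)  # ATGCCAT
```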

## 2. Import data

We store the sample DNA string in a .txt file to keep the notebook clutter-free. The file is read with `readlines()`, which returns a list of lines, so we convert that list into a single string using the function above.

In [29]:
# Courtesy of https://www.bioinformatics.org/sms2/random_dna.html & http://www.faculty.ucr.edu/~mmaduro/random.htm
with open('dna.txt', 'r') as fin:
    lines = fin.readlines()

# Strip line breaks, join the lines and convert the entire sequence to uppercase
dna = listToString([line.strip() for line in lines]).upper()

print(len(dna))

10000


## 3. Create DataFrames and Export data

Creating dictionaries for each nucleotide.

In [30]:
A, T, G, C = slidingWindow(dna)

Here we create a DataFrame for each nucleotide using the function defined above, and then concatenate them into a master DataFrame.

In [32]:
# Using the function defined above to make a table for each nucleotide
aTable = formNucleotideTable(A, 'A')
tTable = formNucleotideTable(T, 'T')
gTable = formNucleotideTable(G, 'G')
cTable = formNucleotideTable(C, 'C')

In [33]:
aTable

| Frame Position | A | A% |
|---|---|---|
| 1 | 2 | 66.67% |
| 2 | 2 | 66.67% |
| 3 | 2 | 66.67% |
| 4 | 2 | 66.67% |
| 5 | 1 | 33.33% |
| ... | ... | ... |
| 9994 | 0 | 0.00% |
| 9995 | 1 | 33.33% |
| 9996 | 1 | 33.33% |
| 9997 | 1 | 33.33% |


In [34]:
# Concat. all the nucleotide DataFrames
master = pd.concat([aTable, tTable, gTable, cTable], axis=1).reindex(aTable.index)

In [35]:
master

| Frame Position | A | A% | T | T% | G | G% | C | C% |
|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 66.67% | 1 | 33.33% | 0 | 0.00% | 0 | 0.00% |
| 2 | 2 | 66.67% | 1 | 33.33% | 0 | 0.00% | 0 | 0.00% |
| 3 | 2 | 66.67% | 0 | 0.00% | 0 | 0.00% | 1 | 33.33% |
| 4 | 2 | 66.67% | 0 | 0.00% | 0 | 0.00% | 1 | 33.33% |
| 5 | 1 | 33.33% | 1 | 33.33% | 0 | 0.00% | 1 | 33.33% |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9994 | 0 | 0.00% | 2 | 66.67% | 0 | 0.00% | 1 | 33.33% |
| 9995 | 1 | 33.33% | 1 | 33.33% | 0 | 0.00% | 1 | 33.33% |
| 9996 | 1 | 33.33% | 1 | 33.33% | 1 | 33.33% | 0 | 0.00% |
| 9997 | 1 | 33.33% | 1 | 33.33% | 1 | 33.33% | 0 | 0.00% |


We convert the master DataFrame to a dictionary and then serialise it to JSON with `json.dumps`. We do this because the resulting file is formatted much better than the output of pandas' built-in `to_json()` function.

In [36]:
masterDictionary = master.to_dict('split')

# Serializing json
jsonObject = json.dumps(masterDictionary, indent=4)

# Writing json file
with open('data.json', 'w') as outfile:
    outfile.write(jsonObject)
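
The `'split'` orientation stores the index, the column names and the row data under separate keys, which makes the JSON straightforward to load back into a DataFrame. The following is a minimal round-trip sketch on a made-up two-row table (`toy`, `payload`, `loaded` and `restored` are illustrative names); it works on an in-memory string rather than on `data.json`.

```python
import json
import pandas as pd

# Made-up table mimicking the shape of one nucleotide DataFrame
toy = pd.DataFrame({'A': [2, 1], 'A%': [66.67, 33.33]},
                   index=pd.Index([1, 2], name='Frame Position'))

# Serialise the same way as the notebook: dict in 'split' orientation, then JSON
payload = json.dumps(toy.to_dict('split'), indent=4)

# Load the JSON back and rebuild the DataFrame from its three parts
loaded = json.loads(payload)
restored = pd.DataFrame(loaded['data'], index=loaded['index'], columns=loaded['columns'])

print(sorted(loaded.keys()))  # ['columns', 'data', 'index']
print(restored)
```

Note that the `'split'` orientation does not record the index name, so `'Frame Position'` would have to be reassigned after loading if it is needed.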