# 1. Source

Click on the link to go to the source web page of **Rosalind**: [Counting DNA Nucleotides](https://rosalind.info/problems/dna)

**Problem**

![Counting DNA Nucleotides!](dna_problem.png 'Counting DNA Nucleotides')

**Sample Dataset**

AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC

**Sample Output**

20 12 17 21

# 2. Workspace

In [1]:
# copy paste the given dna sequence and assign it to a variable: dnaSeq

dnaSeq = 'AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC'
print(dnaSeq)

AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC


In [2]:
# but rosalind gives us a file
# write a few lines of code to read the dna sequence directly from the given file

with open('dna_test.txt', 'r') as file:
    dnaSeq = file.read().strip().upper()

In [3]:
# print what we have read from the file

print(dnaSeq)

AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC


In [4]:
# count the number of adenines in dnaSeq

print(dnaSeq.count('A'))

20


In [5]:
# count the number of cytosines in dnaSeq

print(dnaSeq.count('C'))

12


In [6]:
# count the number of guanines in dnaSeq

print(dnaSeq.count('G'))

17


In [7]:
# count the number of thymines in dnaSeq

print(dnaSeq.count('T'))

21


In [8]:
# writing the same command again and again for each base is repetitive
# use a for-loop instead

bases = 'ACGT'
for base in bases:
    print(dnaSeq.count(base))

20
12
17
21


In [9]:
# build the same strategy using list comprehension

result = [str(dnaSeq.count(base)) for base in 'ACGT']
result_str = ' '.join(result)
print(result_str)

20 12 17 21


In [10]:
# another way is to check each base in dnaSeq one by one
# and increase the counter that matches the base we see

# first we can initialize the counters
A = 0; C = 0; G = 0; T = 0

# loop over dnaSeq
for base in dnaSeq:
    if base == 'A':
        A += 1
    elif base == 'C':
        C += 1
    elif base == 'G':
        G += 1
    elif base == 'T':  # explicit check, so any other character is not counted as T
        T += 1
        
# print counts of nucleotides
print(f'{A} {C} {G} {T}')

20 12 17 21


In [11]:
# in a similar way, it is possible to build a frequency dictionary

counts_dict = dict()

# loop over dnaSeq
for base in dnaSeq:
    if base in counts_dict.keys():
        counts_dict[base] += 1
    else:
        counts_dict[base] = 1
        
# see the final state of counts_dict
counts_dict

{'A': 20, 'G': 17, 'C': 12, 'T': 21}

In [12]:
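# we can print it; note that the values come out in insertion order (A G C T),
# not in the A C G T order that rosalind asks for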

result = ''

for base, count in counts_dict.items():
    result += str(count) + ' '
    
# print it
print(result.strip())

20 17 12 21
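
The order above (A G C T) is the order in which each base first appears in the sequence, because Python dicts preserve insertion order. A minimal sketch of getting the counts in the order Rosalind asks for is to look the bases up explicitly instead of iterating over the dict. The sample sequence is repeated so the block runs on its own, and the use of `dict.get` with a default of 0 is an assumption added here to cover a base that never occurs.

```python
dnaSeq = 'AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC'

# build the frequency dictionary by looping over the sequence
counts_dict = {}
for base in dnaSeq:
    counts_dict[base] = counts_dict.get(base, 0) + 1

# look the bases up in the requested order instead of iterating over the dict
ordered = ' '.join(str(counts_dict.get(base, 0)) for base in 'ACGT')
print(ordered)
```

With the sample dataset this should match the sample output given in the Source section.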


In [13]:
# or the Counter class of the collections module gives the same result

from collections import Counter

result = Counter(dnaSeq)

# see the final state of result
result

Counter({'A': 20, 'G': 17, 'C': 12, 'T': 21})
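
A `Counter` can be turned into the requested format with one extra step. The sketch below indexes it by base in A, C, G, T order; a `Counter` returns 0 for a key it has not seen, so no default is needed. The sample sequence is repeated so the block is self-contained.

```python
from collections import Counter

dnaSeq = 'AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC'

result = Counter(dnaSeq)

# index the Counter in the order the answer needs: A, C, G, T
result_str = ' '.join(str(result[base]) for base in 'ACGT')
print(result_str)
```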

In [14]:
# looping over dnaSeq in python takes more time as the length of dnaSeq increases
# therefore, i would rather not use the options that loop over dnaSeq
# to back that up, we can run a very simple speed test on all options

### A Simple Speed Test

In [15]:
# increase the size of the dna sequence to make the differences (if there are any) easier to see

print('initial dnaSeq length:', len(dnaSeq))
dnaSeq *= 10000
print('final dnaSeq length:', len(dnaSeq))

initial dnaSeq length: 70
final dnaSeq length: 700000


In [16]:
# option 1

In [17]:
%%timeit -n 500

counts_list = []

counts_list.append(str(dnaSeq.count('A')))
counts_list.append(str(dnaSeq.count('C')))
counts_list.append(str(dnaSeq.count('G')))
counts_list.append(str(dnaSeq.count('T')))

1.81 ms ± 5.27 µs per loop (mean ± std. dev. of 7 runs, 500 loops each)


In [18]:
# option 2

In [19]:
%%timeit -n 500

counts_list = []

for base in 'ACGT':
    counts_list.append(str(dnaSeq.count(base)))

1.79 ms ± 9.21 µs per loop (mean ± std. dev. of 7 runs, 500 loops each)


In [20]:
# option 3

In [21]:
%%timeit -n 500

counts_list = [str(dnaSeq.count(base)) for base in 'ACGT']

1.79 ms ± 7.17 µs per loop (mean ± std. dev. of 7 runs, 500 loops each)


In [22]:
# option 4

In [23]:
%%timeit -n 500

A = 0; C = 0; G = 0; T = 0

for base in dnaSeq:
    if base == 'A':
        A += 1
    elif base == 'C':
        C += 1
    elif base == 'G':
        G += 1
    else:
        T += 1

53.4 ms ± 152 µs per loop (mean ± std. dev. of 7 runs, 500 loops each)


In [24]:
# option 5

In [25]:
%%timeit -n 500

counts_dict = dict()

for base in dnaSeq:
    if base in counts_dict.keys():
        counts_dict[base] += 1
    else:
        counts_dict[base] = 1

88.6 ms ± 79.6 µs per loop (mean ± std. dev. of 7 runs, 500 loops each)


In [26]:
# option 6

In [27]:
%%timeit -n 500

from collections import Counter
result = Counter(dnaSeq)

30.8 ms ± 163 µs per loop (mean ± std. dev. of 7 runs, 500 loops each)


In [28]:
# as a result, options 4 and 5 (looping over dnaSeq) take about 30 - 50 times longer than
# options 1, 2 and 3, which use python's built-in string method .count()
# every option slows down as dnaSeq gets longer, so the absolute gap grows with the length of dnaSeq
# the Counter option (6) is faster than options 4 and 5 but still about 17 times slower than options 1, 2 and 3
# also, printing and saving the Counter object's content in the format requested by rosalind takes an extra step
# therefore, i will implement option 3, which combines .count() with a list comprehension
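
`%%timeit` is an IPython magic, so it only works inside a notebook or an IPython session. As a rough sketch, a similar comparison can be run as a plain script with the standard-library `timeit` module. The helper names (`count_with_method`, `count_with_loop`, `count_with_counter`) and the number of repetitions are illustrative choices, the loop version is a variant of option 5 rather than an exact copy, and the absolute timings will differ from machine to machine.

```python
import timeit
from collections import Counter

dnaSeq = 'AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC' * 10000


def count_with_method(seq):
    # like option 3: one str.count() call per base
    return [seq.count(base) for base in 'ACGT']


def count_with_loop(seq):
    # variant of option 5: frequency dictionary filled by looping over the sequence
    counts = {}
    for base in seq:
        counts[base] = counts.get(base, 0) + 1
    return [counts.get(base, 0) for base in 'ACGT']


def count_with_counter(seq):
    # like option 6: collections.Counter
    counts = Counter(seq)
    return [counts[base] for base in 'ACGT']


# all three approaches must agree before it makes sense to time them
assert count_with_method(dnaSeq) == count_with_loop(dnaSeq) == count_with_counter(dnaSeq)

runs = 5
for func in (count_with_method, count_with_loop, count_with_counter):
    # total time for `runs` calls, reported as an average per call in milliseconds
    seconds = timeit.timeit(lambda: func(dnaSeq), number=runs)
    print(f'{func.__name__}: {seconds / runs * 1000:.2f} ms per call')
```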

# 3. Implementation

In [29]:
def dna(filename):

    '''
    input
        a file containing a dna string (sequence)
    process
        counts each nucleotide (A, C, G, T) in the given dna sequence
    output
        prints the number of nucleotides separated by spaces: <A> <C> <G> <T>
        writes the result to a file and saves it
    '''

    # open and read file
    with open(filename, 'r') as file:
        dnaString = file.read().strip().upper()

    # count each nucleotide
    bases = 'ACGT'
    counts = [str(dnaString.count(base)) for base in bases]
    counts_str = ' '.join(counts)

    # print answer to console
    print('\n\x1B[1mANSWER\x1B[0m\n______\n')
    print(counts_str)

    # build the answer file name once, e.g. dna_test.txt -> dna_test_answer.txt
    answer_file = f'{filename.split(".")[0]}_answer.txt'

    # open file and write answer; the with-block closes the file automatically
    with open(answer_file, 'w') as file:
        file.write(counts_str)
    print('\n\n#! The answer has been written into the file:',
          f'\x1B[1m./{answer_file}\x1B[0m\n')
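
One possible refinement, which is not part of the solution above, is to keep the counting in a small function of its own so it can be checked without creating any files. The function name `count_nucleotides` and the check for unexpected symbols are assumptions added for illustration; `.count()` on its own simply ignores characters other than A, C, G and T.

```python
def count_nucleotides(dna_string):
    '''Return the counts of A, C, G and T as a space-separated string.'''
    dna_string = dna_string.strip().upper()

    # report anything other than A, C, G or T instead of silently ignoring it
    unexpected = set(dna_string) - set('ACGT')
    if unexpected:
        raise ValueError(f'unexpected symbols in sequence: {sorted(unexpected)}')

    return ' '.join(str(dna_string.count(base)) for base in 'ACGT')


sample = 'AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC'
print(count_nucleotides(sample))
```

With the sample dataset this should print the sample output from the Source section.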

# 4. Execution

In [30]:
dna('dna_test.txt')


ANSWER
______

20 12 17 21


#! The answer has been written into the file: ./dna_test_answer.txt



In [31]:
dna('rosalind_dna.txt')


ANSWER
______

232 228 191 200


#! The answer has been written into the file: ./rosalind_dna_answer.txt



In [32]:
dna('rosalind_dna_12_dataset.txt')


ANSWER
______

243 251 250 242


#! The answer has been written into the file: ./rosalind_dna_12_dataset_answer.txt



<p style='text-align: right;'>
    <!--<b><font size = '5'>Contact</font></b><br>-->
    <b>Orcun Tasar</b><br>
    <i>Bioinformatician / Data Scientist</i><br>
    orcuntasar |at@| ogr.iu.edu.tr<br>
    tasar.orcun |at@| gmail.com<br>
    <a href = 'https://www.linkedin.com/in/orçun-taşar-7b5992a1/'>Linkedin</a> | <a href = 'https://www.instagram.com/shatranuchor/'>Instagram</a>
</p>