# 1. Source

This problem comes from **Rosalind**: [Base Filtration by Quality](https://rosalind.info/problems/bfil/)

 ![Base Filtration by Quality](bfil_problem.png "Base Filtration by Quality")

**Sample Dataset**

20<br>
@Rosalind_0049<br>
GCAGAGACCAGTAGATGTGTTTGCGGACGGTCGGGCTCCATGTGACACAG<br>
+<br>
FD@@;C<AI?4BA:=>C<G=:AE=><A??>764A8B797@A:58:527+,<br>
@Rosalind_0049<br>
AATGGGGGGGGGAGACAAAATACGGCTAAGGCAGGGGTCCTTGATGTCAT<br>
+<br>
1<<65:793967<4:92568-34:.>1;2752)24')*15;1,.3*3+*!<br>
@Rosalind_0049<br>
ACCCCATACGGCGAGCGTCAGCATCTGATATCCTCTTTCAATCCTAGCTA<br>
+<br>
B:EI>JDB5=>DA?E6B@@CA?C;=;@@C:6D:3=@49;@87;::;;?8+

**Sample Output**

@Rosalind_0049<br>
GCAGAGACCAGTAGATGTGTTTGCGGACGGTCGGGCTCCATGTGACAC<br>
+<br>
FD@@;C<AI?4BA:=>C<G=:AE=><A??>764A8B797@A:58:527<br>
@Rosalind_0049<br>
ATGGGGGGGGGAGACAAAATACGGCTAAGGCAGGGGTCCT<br>
+<br>
<<65:793967<4:92568-34:.>1;2752)24')*15;<br>
@Rosalind_0049<br>
ACCCCATACGGCGAGCGTCAGCATCTGATATCCTCTTTCAATCCTAGCT<br>
+<br>
B:EI>JDB5=>DA?E6B@@CA?C;=;@@C:6D:3=@49;@87;::;;?8
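
To see where the sample output comes from, it may help to decode the tail of the first quality string by hand. The sketch below assumes the Phred+33 encoding (ASCII code minus 33) that the workspace section relies on; the variable names are illustrative only.

```python
# Tail of the first sample quality string and the sample threshold
threshold = 20
tail = "527+,"

# Phred+33: the score is the ASCII code of the character minus 33
scores = [ord(ch) - 33 for ch in tail]
print(scores)  # [20, 17, 22, 10, 11]

# Only the run of low scores at the very end is trimmed;
# the 17 further inside stays because a score >= threshold (22) follows it
trailing = 0
for s in reversed(scores):
    if s >= threshold:
        break
    trailing += 1
print(trailing)  # 2
```

This matches the first record of the sample output, which is two bases shorter than the input, and it shows that low scores inside the read are left alone.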

# 2. Workspace

In [1]:
# read and parse the input file

from Bio.SeqIO.QualityIO import FastqGeneralIterator # Biopython FASTQ iterator

qc_scores = []

with open('bfil_test.txt', 'r') as file:
    # the first line holds the quality threshold
    threshold = int(file.readline().rstrip())

    # the remaining lines are FASTQ records
    for identifier, sequence, quality in FastqGeneralIterator(file):
        qc_scores.append(quality)

In [2]:
# look into qc_scores and threshold

print(threshold)
print(*qc_scores[:5], sep = '\n')

20
FD@@;C<AI?4BA:=>C<G=:AE=><A??>764A8B797@A:58:527+,
1<<65:793967<4:92568-34:.>1;2752)24')*15;1,.3*3+*!
B:EI>JDB5=>DA?E6B@@CA?C;=;@@C:6D:3=@49;@87;::;;?8+


In [3]:
# to compare each score with the threshold,
# the quality characters must be converted to numbers (Phred+33: ASCII code minus 33)

def phred(qcSeq):
    qcNums = [ord(qcChar) - 33 for qcChar in qcSeq]
    return qcNums
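
As a small illustration of what these numbers mean: under the standard Phred definition, a score Q corresponds to an estimated error probability of 10^(-Q/10), so the threshold of 20 stands for roughly a 1 in 100 chance of a wrong base call. The helper `error_probability` below is a hypothetical name added for this sketch and is not part of the solution.

```python
def phred(qcSeq):
    # same conversion as in the cell above: ASCII code minus 33
    return [ord(qcChar) - 33 for qcChar in qcSeq]

def error_probability(q):
    # standard Phred definition: Q = -10 * log10(P)  =>  P = 10 ** (-Q / 10)
    return 10 ** (-q / 10)

print(phred("!5I"))  # [0, 20, 40]
print(error_probability(20))  # about 0.01
```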

In [4]:
# apply the function to all qc sequences

qc_scores_2 = [phred(qc) for qc in qc_scores]

# print

print(*qc_scores_2[:2], sep = '\n\n')

[37, 35, 31, 31, 26, 34, 27, 32, 40, 30, 19, 33, 32, 25, 28, 29, 34, 27, 38, 28, 25, 32, 36, 28, 29, 27, 32, 30, 30, 29, 22, 21, 19, 32, 23, 33, 22, 24, 22, 31, 32, 25, 20, 23, 25, 20, 17, 22, 10, 11]

[16, 27, 27, 21, 20, 25, 22, 24, 18, 24, 21, 22, 27, 19, 25, 24, 17, 20, 21, 23, 12, 18, 19, 25, 13, 29, 16, 26, 17, 22, 20, 17, 8, 17, 19, 6, 8, 9, 16, 20, 26, 16, 11, 13, 18, 9, 18, 10, 9, 0]


In [5]:
# Loop over each list of quality scores and count the scores below the threshold at the leading and the trailing end:

for qcRead in qc_scores_2: # each qcRead is one list of scores, e.g. [37, 35, 31, ...]
    
    # look into leading bases qc
    i = 0 # start index - will go 0, 1, 2, 3, ... at each iteration
    lead_counter = 0 # number of leading scores < threshold
    while i < len(qcRead) and qcRead[i] < threshold: # also stop at the end of the read
        lead_counter += 1
        i += 1
    print(lead_counter)
    
    # look into trailing bases qc
    i = -1 # start index - will go -1, -2, -3, ... at each iteration
    trail_counter = 0 # number of trailing scores < threshold
    while -i <= len(qcRead) and qcRead[i] < threshold: # also stop at the start of the read
        trail_counter += 1 # we have one more score < threshold
        i -= 1 # move one score to the left at the next iteration
    print(trail_counter)
    
    print('##')

0
2
##
1
9
##
0
1
##
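
The same counts can also be obtained without manual index bookkeeping. The following is a minimal sketch using `itertools.takewhile` from the standard library; the function name `count_low_ends` is made up for this example. Because `takewhile` simply stops when the list runs out, a read whose scores are all below the threshold does not raise an `IndexError`.

```python
from itertools import takewhile

def count_low_ends(scores, threshold):
    """Return (leading, trailing) counts of scores below the threshold."""
    below = lambda s: s < threshold
    leading = sum(1 for _ in takewhile(below, scores))
    trailing = sum(1 for _ in takewhile(below, reversed(scores)))
    return leading, trailing

print(count_low_ends([16, 27, 21, 9, 30, 18, 10], 20))  # (1, 2)
print(count_low_ends([5, 6, 7], 20))                    # (3, 3)
```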


In [6]:
# the first read: trim 0 bases from the lead and 2 bases from the trail
# the second read: trim 1 base from the lead and 9 bases from the trail
# the third read: trim 0 bases from the lead and 1 base from the trail

In [7]:
# the first read: trim 0 bases from the lead and 2 bases from the trail

qc_scores[0][:-2]

'FD@@;C<AI?4BA:=>C<G=:AE=><A??>764A8B797@A:58:527'

In [8]:
# the second read: trim 1 base from the lead and 9 bases from the trail

qc_scores[1][1:-9]

"<<65:793967<4:92568-34:.>1;2752)24')*15;"

In [9]:
# the third read: trim 0 bases from the lead and 1 base from the trail

qc_scores[2][:-1]

'B:EI>JDB5=>DA?E6B@@CA?C;=;@@C:6D:3=@49;@87;::;;?8'

In [10]:
# all of this can be done while parsing the input file:

with open('bfil_test.txt', 'r') as file:

    threshold = int(file.readline().rstrip())

    for identifier, dnaSeq, qcReads in FastqGeneralIterator(file):

        qcReadsNum = phred(qcReads)
        n = len(qcReadsNum)

        # count leading scores below the threshold
        lead_counter = 0
        while lead_counter < n and qcReadsNum[lead_counter] < threshold:
            lead_counter += 1

        # count trailing scores below the threshold
        trail_counter = 0
        while trail_counter < n - lead_counter and qcReadsNum[n - 1 - trail_counter] < threshold:
            trail_counter += 1

        # after finding the trim amounts for both ends, slice both the sequence and the quality string;
        # the end index is n - trail_counter because [lead_counter : -0] would return an empty string

        print(identifier)
        print(dnaSeq[lead_counter : n - trail_counter])
        print('+')
        print(qcReads[lead_counter : n - trail_counter])

Rosalind_0049
GCAGAGACCAGTAGATGTGTTTGCGGACGGTCGGGCTCCATGTGACAC
+
FD@@;C<AI?4BA:=>C<G=:AE=><A??>764A8B797@A:58:527
Rosalind_0049
ATGGGGGGGGGAGACAAAATACGGCTAAGGCAGGGGTCCT
+
<<65:793967<4:92568-34:.>1;2752)24')*15;
Rosalind_0049
ACCCCATACGGCGAGCGTCAGCATCTGATATCCTCTTTCAATCCTAGCT
+
B:EI>JDB5=>DA?E6B@@CA?C;=;@@C:6D:3=@49;@87;::;;?8
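
One detail worth keeping in mind when slicing with a trailing count: a slice of the form `[lead : -trail]` only works while `trail` is at least 1. The sample reads all lose at least one trailing base, so the issue does not show up there, but a read with a clean 3' end would be affected. A small sketch with a made-up four-base sequence:

```python
seq = "ACGT"
lead_counter, trail_counter = 1, 0

# -0 is just 0, so this slice ends at index 0 and comes back empty
print(repr(seq[lead_counter : -trail_counter]))            # ''

# an explicit end index behaves the same for every trailing count
print(repr(seq[lead_counter : len(seq) - trail_counter]))  # 'CGT'
```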


# 3. Implementation

In [11]:
def phred(qcSeq):
    
    '''
    input
        string of ASCII quality characters (Phred+33 encoded)
    process
        converts into numeric values of each char
    output
        returns list of integers
    '''
    
    # process
    qcNums = [ord(qcChar) - 33 for qcChar in qcSeq]
    
    return qcNums

In [12]:
def bfil(filename):
    
    '''
    input
        a file containing a threshold, and fastq records
    process
        trims leading and trailing bases whose quality is below the threshold
    output
        prints answer to console
        writes answer to a file
    '''
    
    # open input file for reading
    input_file = open(filename, 'r')
    
    # open output file for writing
    output_file = open(f'{filename.split(".")[0]}_answer.txt', 'w')

    # start reading input file
    threshold = int(input_file.readline().rstrip())

    print('\n\x1B[1mANSWER\x1B[0m\n______\n')
    
    # extract an (identifier, sequence, quality) triplet from each 4-line FASTQ record
    for triplet in FastqGeneralIterator(input_file):
        identifier, dnaSeq, qcReads = triplet
        
        # Phred+33 conversion (ASCII code minus 33)
        qcReadsNum = phred(qcReads)

        # count the leading and trailing scores below the threshold
        # (the bounds checks avoid an IndexError when every score is below the threshold)
        i = 0; lead_counter = 0
        while i < len(qcReadsNum) and qcReadsNum[i] < threshold:
            lead_counter += 1; i += 1
    
        i = -1; trail_counter = 0
        while -i <= len(qcReadsNum) - lead_counter and qcReadsNum[i] < threshold:
            trail_counter += 1; i -= 1
                
        # after finding the trim amounts for both ends, slice both the sequence and the quality string
        # (the end index is len(...) - trail_counter because [lead_counter : -0] would return an empty string)
        # print to console and write to file
        processed_identifier = '@' + identifier
        print(processed_identifier) 
        output_file.write(f'{processed_identifier}\n')
        
        processed_dnaSeq = dnaSeq[lead_counter : len(dnaSeq) - trail_counter]
        print(processed_dnaSeq)
        output_file.write(f'{processed_dnaSeq}\n')
        
        print('+')
        output_file.write('+\n')
        
        processed_qcReads = qcReads[lead_counter : len(qcReads) - trail_counter]
        print(processed_qcReads)
        output_file.write(f'{processed_qcReads}\n')

    # close files
    input_file.close()
    output_file.close()
    print('\n\n#! The answer has been written into the file:',
          f'\x1B[1m./{filename.split(".")[0]}_answer.txt\x1B[0m\n')
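
For testing, it can be convenient to separate the trimming logic from the file handling. The following is a minimal sketch of such a helper, not a replacement for `bfil`; the name `trim_record` and its `offset` parameter are assumptions made for this example. It works on plain strings, so it can be checked against the sample dataset without Biopython or any files.

```python
def trim_record(sequence, quality, threshold, offset=33):
    """Trim low-quality bases from both ends of one FASTQ record.

    Returns the trimmed (sequence, quality) pair.
    """
    start, end = 0, len(quality)
    # move the left boundary past leading low scores
    while start < end and ord(quality[start]) - offset < threshold:
        start += 1
    # move the right boundary past trailing low scores
    while end > start and ord(quality[end - 1]) - offset < threshold:
        end -= 1
    return sequence[start:end], quality[start:end]

# second record of the sample dataset
seq = "AATGGGGGGGGGAGACAAAATACGGCTAAGGCAGGGGTCCTTGATGTCAT"
qual = "1<<65:793967<4:92568-34:.>1;2752)24')*15;1,.3*3+*!"
print(*trim_record(seq, qual, 20), sep='\n')
```

Using explicit `start` and `end` indices sidesteps negative slicing altogether, and a read whose scores are all below the threshold simply comes back as a pair of empty strings.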

In [13]:
bfil('bfil_test.txt')


ANSWER
______

@Rosalind_0049
GCAGAGACCAGTAGATGTGTTTGCGGACGGTCGGGCTCCATGTGACAC
+
FD@@;C<AI?4BA:=>C<G=:AE=><A??>764A8B797@A:58:527
@Rosalind_0049
ATGGGGGGGGGAGACAAAATACGGCTAAGGCAGGGGTCCT
+
<<65:793967<4:92568-34:.>1;2752)24')*15;
@Rosalind_0049
ACCCCATACGGCGAGCGTCAGCATCTGATATCCTCTTTCAATCCTAGCT
+
B:EI>JDB5=>DA?E6B@@CA?C;=;@@C:6D:3=@49;@87;::;;?8


#! The answer has been written into the file: ./bfil_test_answer.txt



In [14]:
bfil('rosalind_bfil_2.txt')


ANSWER
______

@Rosalind_0000
GTAGTAGCGCCTGTATTCGCGGTCGATAACGTTCAATGTTTAAAGGGTTCGGATACATCCCCCTTTAAACATAGGCTCCGCGCGTTACAAAAGTAGGAATAATTACTGGTACCTGGCAAAAACTTGAGCCGCCGTATAAATGGGCAGGAATACCAAGACTTGG
+
<B:HB=DJI?5>FJBJ?C8=<?ADDFFGEFFD??>;:5@FD<A=IHBHA<C?FD8B@9G<?:DA@E;EH@E>@BJE>IGD=A>H<@<D;HF8:=F???:<>4;7?<==?J=A=A0=@<=B76=??E?A;<:?;<87C5>3?:989995776:8:20+432/-:
@Rosalind_0001
GTCGTGGACCATTCTCGTTGGCAACGTCAGTACCAAGCCCAGCCATGTTGTCAGCACCGAAAACGGGAAGCGACTAATCTGGCGATCTACTATCTTACAGCCTAAAACCCGATCTCTTCGAAAAGGCACAGCCTATGCGGTACCAATTCT
+
C?A;GC><D;AB:FJ@@>CFAA=@8==@;CAJ=;B??CAHD==?4:A?;<AGBC;=@:;==D9AECBI<98=AH9D?>;E?>:H<B@BF68>=@7D8A7781<>=>AB6??;8=FAA9:8C;>:;=A69432;8@?85;4644<D:+6?B
@Rosalind_0002
GGGTGCAGGCGGGACCAATCACGGCAAAGGGTCGGCTGAAGGCGCGTTAGCACCGCGGAGATTATCGACCAGGCGGTCGGCAGCACGGTCAGGTATTGTTGTACCCATCCAGTCAAAACCGTAAAA
+
=*993731967+75.9<=33)>;16?55/151817<0:2E;7269*:3?'32<2;62:-60072;4.32<:94371-,87./52379844*96409*5,754;62:9/732*:-56.1-.&65(*;
@Rosalind_0003
TCTTGCGGATATGGTGTGAAGACGTG

In [15]:
bfil('rosalind_bfil.txt')


ANSWER
______

@Rosalind_0000
GGCTCGTCCACGTGGTAACCTGGAGGGTCCCGACCCAAATGTAGGGGGGACATAGCCAATCTTCGCCGATGGCAGGATTCCACGAGCTGGTCATTGGGTCAAAATTATTGGGGTATGTGCGGTCAGAGTGTGT
+
>8282?<*6:<990=492>;;6435?2070773.;;7:83@49=55<8576686884@8@@@7657882;515:36-B215427;469;.44;4590/;2;>/:0->4.1.>197+1729179<1034/1-;A
@Rosalind_0001
CTTACGGGTGGTGCATAATAAAACAGTTTCATGTCTATAATTTACGCAGTCGACGCAGCCATTCCCATGAACTGAACGCGGTGCTCAGGGTAGTAACCTAATTGGGCAGACTACGTGAATATGCTGCAATCCAAGCGCACGGGCCCTGCACAGCAGACACGGCTACGACCGAGC
+
=>?BC;7B?>?2G>?@=?D:>9?<?::9><>;C@:<<E6CB1A::=4<7E<<<:<C?B=;<B;@3<;<?9BC;<=:A?;A<=;;A@::@B;?C>@@7<7@;295=?@A5D==628.6>>>9=:8>29,67A;666?@FB896799::<;3:42.24:7/<?9-73,5793011>
@Rosalind_0002
CCGCATTGTCATCCAATGACACTTACCAGTTCCATCGCCGGGTGGCCGACTAGGCGATTAGACTGTATCGGCTGCTTCGATTGTCTCGGATATGTTCAGGCACTCGTGGCGCCAGGGTGTCAACACTTGCACCGCTCACACGAAGACCCAATGGTATGGGATGAGGTTTAGTTAGCAACAT
+
?F7A=A?J7D5>>@CDB@:3CF:AD?<=J:DA;DJDG=DE?EE59FA??7GB4>IB>>FD>FAF;>?>A?D<@CA><BEH=::8BD;FC=EFI>?>4<HIFEAH=?97=9<>4A>79960A=AE9
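
With longer datasets like these, checking the output by eye is impractical. A simple sanity check, sketched below under the assumption of Phred+33 encoding, is to confirm that neither end of a trimmed quality string is below the threshold. The function name `is_trimmed` is hypothetical, and the two inputs are the first sample quality string after and before trimming.

```python
def is_trimmed(quality, threshold, offset=33):
    """True if neither end of the quality string is below the threshold."""
    if not quality:  # an empty read has nothing left to trim
        return True
    first = ord(quality[0]) - offset
    last = ord(quality[-1]) - offset
    return first >= threshold and last >= threshold

print(is_trimmed("FD@@;C<AI?4BA:=>C<G=:AE=><A??>764A8B797@A:58:527", 20))    # True
print(is_trimmed("FD@@;C<AI?4BA:=>C<G=:AE=><A??>764A8B797@A:58:527+,", 20))  # False
```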

<p style='text-align: right;'>
    <!--<b><font size = '5'>Contact</font></b><br>-->
    <b>Orcun Tasar</b><br>
    <i>Bioinformatician / Data Scientist</i><br>
    orcuntasar |at@| ogr.iu.edu.tr<br>
    tasar.orcun |at@| gmail.com<br>
    <a href = 'https://www.linkedin.com/in/orçun-taşar-7b5992a1/'>Linkedin</a> | <a href = 'https://www.instagram.com/shatranuchor/'>Instagram</a>
</p>