# 1. Source

Click on the link to go to the source web page of **Rosalind**: [Read Filtration by Quality](https://rosalind.info/problems/filt/)

**Problem**
 
 ![Read Filtration by Quality](filt_problem.png "Read Filtration by Quality")

**Sample Dataset**

20 90<br>
@Rosalind_0049_1<br>
GCAGAGACCAGTAGATGTGTTTGCGGACGGTCGGGCTCCATGTGACACAG<br>
+<br>
FD@@;C<AI?4BA:=>C<G=:AE=><A??>764A8B797@A:58:527+,<br>
@Rosalind_0049_2<br>
AATGGGGGGGGGAGACAAAATACGGCTAAGGCAGGGGTCCTTGATGTCAT<br>
+<br>
1<<65:793967<4:92568-34:.>1;2752)24')*15;1,.3*3+*!<br>
@Rosalind_0049_3<br>
ACCCCATACGGCGAGCGTCAGCATCTGATATCCTCTTTCAATCCTAGCTA<br>
+<br>
B:EI>JDB5=>DA?E6B@@CA?C;=;@@C:6D:3=@49;@87;::;;?8+

**Sample Output**

2

# 2. Workspace

A similar calculation / workflow is discussed [here](https://github.com/ShatraNuchor/rosalind_solution_set/tree/main/bioinformatics_armory/005_PHRE).

In [1]:
# read and parse the input file

qc_scores = list()

with open('filt_test.txt', 'r') as file:
    tANDp = file.readline().split()
    threshold = int(tANDp[0])
    percentage = int(tANDp[1])
    while True:
        if len(file.readline()) == 0:
            break
        file.readline()
        file.readline()
        qc_score = file.readline().rstrip()
        qc_scores.append(qc_score)

In [2]:
# display qc scores

qc_scores

['FD@@;C<AI?4BA:=>C<G=:AE=><A??>764A8B797@A:58:527+,',
 "1<<65:793967<4:92568-34:.>1;2752)24')*15;1,.3*3+*!",
 'B:EI>JDB5=>DA?E6B@@CA?C;=;@@C:6D:3=@49;@87;::;;?8+']

In [3]:
# display percentage

percentage

90

In [4]:
# display threshold

threshold

20

In [5]:
# define a function for ascii char --> num conversion

def phredToNum(qcChar):
    numeric_qc = ord(qcChar) - 33
    return numeric_qc

In [6]:
phredToNum('F')

37
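
The same offset applies to every character of a quality line, so a whole string can be decoded in one pass. The following is a minimal sketch, assuming the Phred+33 encoding used above; `decodeQuality` and its `offset` parameter are hypothetical names introduced for illustration and are not part of the original notebook.

```python
def decodeQuality(qcSequence, offset=33):
    # subtract the ASCII offset from each character's code point
    # (offset=33 is an assumption matching the Phred+33 encoding above)
    return [ord(qcChar) - offset for qcChar in qcSequence]


# first four characters of the first sample read, then the lowest symbol '!'
print(decodeQuality('FD@@'))  # [37, 35, 31, 31]
print(decodeQuality('!'))     # [0]
```

Keeping the offset as a parameter makes the assumption about the encoding explicit instead of burying the constant in the function body.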

In [7]:
# define another function to check whether at least the given percentage of bases are at or above threshold in sequence

def qcPT(qcSequence, threshold, percentage):
    qcNums = [phredToNum(qcChar) for qcChar in qcSequence]
    above_threshold = [qcNum for qcNum in qcNums if qcNum >= threshold]
    existing_percentage = 100 * len(above_threshold) / len(qcNums)
    if existing_percentage >= percentage:
        return True
    return False

In [8]:
qcPT('B:EI>JDB5=>DA?E6B@@CA?C;=;@@C:6D:3=@49;@87;::;;?8+', 20, 90)

True
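
The first sample read is a useful boundary case: 45 of its 50 bases score at least 20, which is exactly 90 %, so it passes only because the comparison is `>=`. The sketch below makes that arithmetic visible and performs the comparison in integers, which avoids relying on floating-point division. `passingShare` and `passesFilter` are hypothetical helper names, not functions from the original notebook.

```python
def passingShare(qcSequence, threshold):
    # count the bases whose Phred+33 score is at least the threshold
    passing = sum(1 for qcChar in qcSequence if ord(qcChar) - 33 >= threshold)
    return passing, len(qcSequence)


def passesFilter(qcSequence, threshold, percentage):
    passing, total = passingShare(qcSequence, threshold)
    # integer form of: 100 * passing / total >= percentage
    # an empty quality string is treated as failing (an assumption)
    return total > 0 and 100 * passing >= percentage * total


read1 = 'FD@@;C<AI?4BA:=>C<G=:AE=><A??>764A8B797@A:58:527+,'
print(passingShare(read1, 20))      # (45, 50), i.e. exactly 90 %
print(passesFilter(read1, 20, 90))  # True
```

With a threshold of 91 in place of 90 for the percentage, the same read would be rejected, which is an easy way to confirm the boundary behaviour.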

In [9]:
# check each sequence's qc scores

counter = 0

for qc_score in qc_scores:
    if qcPT(qc_score, threshold, percentage):
        counter += 1
        
counter

2

# 3. Implementation

In [10]:
def phredToNum(qcChar):
    
    '''
    input
        an ascii char
    process
        converts char to a numerical value using phred+33 encoding (ascii code minus 33)
    output
        an integer
    '''
    
    # convert char to number based on ascii table
    numeric_qc = ord(qcChar) - 33
    
    # return
    return numeric_qc

In [11]:
def qcPT(qcSequence, threshold, percentage):
    
    '''
    input
        qc score series as ascii chars (phred+33) of a dna sequence read: qcSequence
        threshold and percentage for filtration
    process
        checks whether at least p percent of the given qc scores are >= t
    output
        True if at least p percent of the given qc scores are >= t
        otherwise False
    '''
    
    # create a numerical qc score list
    qcNums = [phredToNum(qcChar) for qcChar in qcSequence]
    
    # extract qc scores which are at or above threshold
    above_threshold = [qcNum for qcNum in qcNums if qcNum >= threshold]
    
    # calculate the percentage of scores which are above t
    existing_percentage = 100 * len(above_threshold) / len(qcNums)
    
    # check if existing p is >= input p and return
    if existing_percentage >= percentage:
        return True
    return False

In [12]:
def filtFileParser(filename):
    
    '''
    input
        a file containing a threshold, a percentage and fastq records
    process
        extracts t, p and qc_scores
    output
        returns t, p and qc_scores
    '''
    
    # initialize an empty list for the qc score sequences (the 4th line of each record)
    qc_scores = list()

    # open and parse file
    with open(filename, 'r') as file:
        
        # take threshold and percentage
        tANDp = file.readline().rstrip().split()
        threshold = int(tANDp[0])
        percentage = int(tANDp[1])
        
        # loop over the rest of the file, one 4-line fastq record at a time
        while True:
            # stop at end of file or at a trailing blank line
            if not file.readline().strip():
                break
            file.readline()
            file.readline()
            qc_score = file.readline().rstrip()
            qc_scores.append(qc_score)
    
    # return
    return threshold, percentage, qc_scores
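
The parser relies on every FASTQ record occupying exactly four lines, so the quality strings are simply every fourth line after the header. The sketch below shows the same idea on in-memory text with slicing instead of repeated `readline()` calls. It is an illustration under that four-line assumption, not the notebook's method; `parseFiltText` is a hypothetical name, and the demo input is the first record of the sample dataset.

```python
def parseFiltText(text):
    # drop blank lines so a trailing newline cannot create an empty record
    lines = [line.strip() for line in text.splitlines() if line.strip()]

    # first line holds the threshold and the percentage
    threshold, percentage = (int(value) for value in lines[0].split())

    # the rest must be whole 4-line records (assumption: no wrapped sequences)
    records = lines[1:]
    if len(records) % 4 != 0:
        raise ValueError('fastq body is not a multiple of four lines')

    # every fourth line, starting at index 3, is a quality string
    return threshold, percentage, records[3::4]


sample = '\n'.join([
    '20 90',
    '@Rosalind_0049_1',
    'GCAGAGACCAGTAGATGTGTTTGCGGACGGTCGGGCTCCATGTGACACAG',
    '+',
    'FD@@;C<AI?4BA:=>C<G=:AE=><A??>764A8B797@A:58:527+,',
    '',
])
print(parseFiltText(sample))
```

Raising an error on an incomplete record is a design choice of this sketch: it surfaces a truncated download early instead of silently miscounting reads.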

In [13]:
def filt(filename):
    
    '''
    input
        a file containing a threshold, a percentage and fastq records
    process
        parses file and extracts the threshold and percentage as integers and a list of qc score sequences
        counts the reads in which at least the input percentage of qc scores are at or above the threshold
    output
        prints the number of reads satisfying the condition
        writes the answer to a file
    '''
    
    # get threshold, percentage and qc score list
    threshold, percentage, qcs = filtFileParser(filename)
    
    # loop over the qc score list and count the reads that pass
    counter = 0
    for qc in qcs:
        if qcPT(qc, threshold, percentage):
            counter += 1
            
    # print answer to console
    print('\n\x1B[1mANSWER\x1B[0m\n______\n')
    print(f'{counter}')
    
    # open file and write answer
    answer_file = f'{filename.split(".")[0]}_answer.txt'
    with open(answer_file, 'w') as file:
        file.write(f'{counter}')
    print('\n\n#! The answer has been written into the file:',
          f'\x1B[1m./{answer_file}\x1B[0m\n')
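
The answer file name is built with `filename.split(".")[0]`, which keeps only the text before the first dot. That works for the plain file names used here, but a path such as `./filt_test.txt` would yield an empty prefix. A possible alternative, sketched with the standard `pathlib` module, is shown below; `answerPath` is a hypothetical helper and the second path is a made-up example.

```python
from pathlib import Path


def answerPath(filename):
    path = Path(filename)
    # keep the directory, drop only the final extension, append the suffix
    return path.with_name(f'{path.stem}_answer.txt')


print(answerPath('filt_test.txt'))                   # filt_test_answer.txt
print(answerPath('data/rosalind.filt.v2.txt').name)  # rosalind.filt.v2_answer.txt
```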

# 4. Execution

In [14]:
filt('filt_test.txt')


ANSWER
______

2


#! The answer has been written into the file: ./filt_test_answer.txt



In [15]:
filt('rosalind_filt_1_dataset.txt')


ANSWER
______

24


#! The answer has been written into the file: ./rosalind_filt_1_dataset_answer.txt



In [16]:
filt('rosalind_filt_2_dataset.txt')


ANSWER
______

0


#! The answer has been written into the file: ./rosalind_filt_2_dataset_answer.txt



In [17]:
filt('rosalind_filt_3_dataset.txt')


ANSWER
______

51


#! The answer has been written into the file: ./rosalind_filt_3_dataset_answer.txt



In [18]:
filt('rosalind_filt_4_dataset.txt')
# 54


ANSWER
______

54


#! The answer has been written into the file: ./rosalind_filt_4_dataset_answer.txt



In [19]:
filt('rosalind_filt.txt')


ANSWER
______

59


#! The answer has been written into the file: ./rosalind_filt_answer.txt



<p style='text-align: right;'>
    <!--<b><font size = '5'>Contact</font></b><br>-->
    <b>Orcun Tasar</b><br>
    <i>Bioinformatician / Data Scientist</i><br>
    orcuntasar |at@| ogr.iu.edu.tr<br>
    tasar.orcun |at@| gmail.com<br>
    <a href = 'https://www.linkedin.com/in/orçun-taşar-7b5992a1/'>Linkedin</a> | <a href = 'https://www.instagram.com/shatranuchor/'>Instagram</a>
</p>