
# Converting between flatfiles with Python: FASTQ -> FASTA

The lecture notes give several examples of how Unix commands can be used to convert between flat files. Ideally, though, we want to be able to do this in Python as well. Below is an example piece of Python code designed for a Jupyter Notebook. It converts a FASTQ file to a FASTA file, a common task in bioinformatics when manipulating sequence data. FASTQ files store biological sequences (typically nucleotide sequences) along with their quality scores, whereas FASTA files store only the sequence identifiers and the sequences themselves, with no quality information.

This code assumes you have a FASTQ file named `example.fastq` in the working directory and will produce a FASTA file named `output.fasta`.
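If you do not have a FASTQ file to hand, the sketch below writes a tiny one so that the conversion code has something to read. The two reads, their names and their quality strings are made up for illustration and are not real sequencing data. Each record takes four lines: a header starting with `@`, the sequence, a `+` line, and a quality string the same length as the sequence.

```python
import os

# Made-up reads for illustration only; not real sequencing data
sample_fastq = (
    "@read1 sample=demo\n"
    "ACGTACGT\n"
    "+\n"
    "IIIIIIII\n"
    "@read2 sample=demo\n"
    "TTGCA\n"
    "+\n"
    "IIIII\n"
)

# Only create the file if it is missing, so an existing example.fastq is never overwritten
if not os.path.exists('example.fastq'):
    with open('example.fastq', 'w') as handle:
        handle.write(sample_fastq)

print(sample_fastq)
```

After conversion, each of these records should shrink to two lines: a header starting with `>` followed by the sequence.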

# 1. Open the FASTQ file for reading
with open('example.fastq', 'r') as fastq_file:
    # 2. Open a new FASTA file for writing
    with open('output.fasta', 'w') as fasta_file:
        # 3. Loop over the FASTQ file one four-line entry at a time
        while True:
            # 4. Read the next four lines (one FASTQ entry)
            header = fastq_file.readline().strip()
            sequence = fastq_file.readline().strip()
            fastq_file.readline()  # Plus line (ignore)
            fastq_file.readline()  # Quality score line (ignore)

            # 5. If the header is empty, we've reached the end of the file
            if not header:
                break

            # 6. Convert the FASTQ header to a FASTA header
            #    (the third argument, 1, replaces only the leading '@', not any '@' later in the header)
            fasta_header = header.replace('@', '>', 1)

            # 7. Write the FASTA header and sequence to the FASTA file
            fasta_file.write(f"{fasta_header}\n{sequence}\n")

# 8. Notify the user that the conversion is complete
print("Conversion complete. The output has been saved to 'output.fasta'.")


Conversion complete. The output has been saved to 'output.fasta'.


## Python explainer

Each number below corresponds to the numbered comment in the script above.

1.  The `with open('example.fastq', 'r')` statement opens the file named `example.fastq` in read mode (`'r'`), assigning it to the variable `fastq_file`.

2.  Inside the first `with` block, another `with` block opens a new file named `output.fasta` in write mode (`'w'`), assigning it to the variable `fasta_file`. This file will store the converted sequences.

3.  The `while True:` loop is used to repeatedly read four lines from the FASTQ file, which together represent a single sequence entry.

4.  Reading the four lines of a FASTQ entry:
*  The first `readline()` gets the header line (which starts with `@`); `.strip()` removes the newline character at the end, along with any other surrounding whitespace.
*  The second `readline()` gets the nucleotide sequence line, stripped in the same way.
*  The next two `readline()` calls read the `+` line and the quality scores line, respectively, but these are not stored because they are not needed for the FASTA format.

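5.  The loop checks whether the header is empty (`if not header:`). `readline()` returns an empty string once the end of the file has been reached, so an empty header means there are no more entries and the loop breaks. Because of `.strip()`, a blank line in the header position would also end the loop.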

6.  The FASTQ header is converted to a FASTA header by replacing its leading `@` with `>`.

7.  The converted header and the sequence are written to the `output.fasta` file, with the header on one line and the sequence on the next, as the FASTA format expects.

8.  After converting all entries, the code prints a message to notify the user that the conversion is complete and the output is saved to `output.fasta`.
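Steps 4 and 5 rely on how `readline()` behaves at the end of a file. The short sketch below shows this using an in-memory file (`io.StringIO`) and a single made-up record, so it does not touch any file on disk. The record contents and variable names are assumptions chosen for illustration.

```python
import io

# A made-up, single-record FASTQ held in memory instead of on disk
handle = io.StringIO("@read1\nACGT\n+\nIIII\n")

# Read one more line than the record contains
lines = [handle.readline() for _ in range(5)]
print(lines)  # ['@read1\n', 'ACGT\n', '+\n', 'IIII\n', '']

# .strip() removes the trailing newline from a real line...
header = lines[0].strip()
# ...and the empty string returned at end of file stays empty, which is falsy
past_end = lines[4].strip()
print(repr(header), repr(past_end), not past_end)
```

The fifth `readline()` call returns `''` rather than raising an error, which is why `if not header:` works as the stopping condition.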



## Going further
In future weeks you will be introduced to writing Python functions. Try to place this code into a Python function that you define.
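Once you have had a go yourself, you may want something to compare against. The sketch below is one possible shape, not the only correct answer. The function name `fastq_to_fasta`, its two parameters and the returned record count are all hypothetical choices made for this illustration, and the demonstration uses a temporary directory with made-up reads so that none of your own files are touched.

```python
import tempfile
from pathlib import Path


def fastq_to_fasta(fastq_path, fasta_path):
    """Convert a FASTQ file to FASTA and return the number of records written.

    Hypothetical helper: the name, parameters and return value are illustrative.
    """
    count = 0
    with open(fastq_path, 'r') as fastq_file, open(fasta_path, 'w') as fasta_file:
        while True:
            # Read one four-line FASTQ entry
            header = fastq_file.readline().strip()
            sequence = fastq_file.readline().strip()
            fastq_file.readline()  # Plus line (ignore)
            fastq_file.readline()  # Quality score line (ignore)

            # An empty header means the end of the file has been reached
            if not header:
                break

            # Swap only the leading '@' for '>'
            fasta_file.write(f">{header[1:]}\n{sequence}\n")
            count += 1
    return count


# Small demonstration on temporary files with made-up reads
with tempfile.TemporaryDirectory() as tmp_dir:
    fastq_path = Path(tmp_dir) / "demo.fastq"
    fasta_path = Path(tmp_dir) / "demo.fasta"
    fastq_path.write_text("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\nIIII\n")
    n_records = fastq_to_fasta(fastq_path, fasta_path)
    fasta_text = fasta_path.read_text()

print(n_records)
print(fasta_text)
```

Passing the file names in as arguments means the same function can be reused on any pair of files, and returning the record count gives a quick way to check that the output looks plausible.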