# Read and write sequence files with Biopython
## Read a *.fasta* file with Biopython

In [None]:
from Bio import SeqIO

In [None]:
records = SeqIO.parse("fastcats.fasta", "fasta")
records

*records* is an **iterator**: it yields one *SeqRecord* at a time instead of holding them all in memory.

In [None]:
for record in records:
    print(record)
    print("____")

Keep in mind that iterating over an *iterator* consumes it. If we run the same loop again, we get nothing!

In [None]:
for record in records:
    print(record)
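This behaviour is not specific to Biopython. As a minimal sketch, a plain Python iterator over made-up record ids (the names below are hypothetical, not read from *fastcats.fasta*) is exhausted in the same way after one pass:

```python
# A plain Python iterator stands in for the object returned by SeqIO.parse.
# The ids are hypothetical and only serve as an illustration.
names = iter(["FAST_CAT", "FAT_CAT", "FAT_RAT"])

first_pass = [name for name in names]   # this pass consumes the iterator
second_pass = [name for name in names]  # nothing is left to yield

print(first_pass)
print(second_pass)
```

The second list is empty because the iterator remembers its position and has already reached the end.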

We will convert the iterator to a list so that we can index the records and loop over them more than once:

In [None]:
records = list(SeqIO.parse("fastcats.fasta", "fasta"))
records

We can see that *records* consists of *SeqRecord* objects. Each of them, in turn, holds a sequence (a *Seq* object) and an id, among other attributes.

One SeqRecord object

In [None]:
records[0]

The sequence of the first record (index 0)

In [None]:
records[0].seq

In [None]:
print(records[0].seq)

*Seq* objects share many of their methods with regular Python strings

In [None]:
records[0].seq.startswith('THE')

In [None]:
for record in records:
    print(record.id, '\t', record.seq)

## Write a *.fasta* file with Biopython

We prepare one new sequence as a SeqRecord object

In [None]:
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

record = SeqRecord(
    Seq("THEFATRAT"),
    id="FAT_RAT",
    description="a fat rat sequence",
    )
print(record)

We prepare a list of two sequences as SeqRecord objects

In [None]:
records = [
    SeqRecord(
        Seq("THEFATRAT"),
        id="FAT_RAT",
        description="a fat rat sequence",
    ),
    SeqRecord(
        Seq("----A-RAT"),
        id="A_RAT",
        description="a rat sequence",
    ),
]
print(records)

We write the list of records to a new file

In [None]:
with open("fatrats.fa", "w") as f_out:  # f_out is a file handle, not a file name
    SeqIO.write(records, f_out, "fasta")

Let's check that everything is fine

In [None]:
records = list(SeqIO.parse("fatrats.fa", "fasta"))
records

## Other file types

Write our previous sequences in *Clustal* format

In [None]:
with open("fatrats.aln", "w") as f_out:  # f_out is a file handle, not a file name
    SeqIO.write(records, f_out, "clustal")

Open a *.fastq* file

In [None]:
records = list(SeqIO.parse("reads.fastq", "fastq"))
records

In [None]:
records[0].seq

In [None]:
print(records[0].letter_annotations["phred_quality"])
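The `phred_quality` list holds one integer per base. As a rough sketch of where those numbers come from, and assuming the Sanger encoding (ASCII offset 33) that Biopython's `"fastq"` format name refers to, each character of the quality line decodes to its ASCII code minus 33. The quality string and the helper names `decode_phred` and `error_probability` below are made up for illustration; they are not part of Biopython.

```python
PHRED_OFFSET = 33  # Sanger encoding; an assumption of this sketch


def decode_phred(quality_string, offset=PHRED_OFFSET):
    """Convert a FASTQ quality string into a list of integer Phred scores."""
    return [ord(char) - offset for char in quality_string]


def error_probability(score):
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)


quality_line = "II5+!"  # hypothetical quality line, not taken from reads.fastq
scores = decode_phred(quality_line)
print(scores)
print([error_probability(score) for score in scores])
```

A score of 20 corresponds to a 1 in 100 chance of a wrong base call, and a score of 40 to 1 in 10,000.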

# Your own functions to read/display *.fasta* files
Parsing files is not always easy, and tools are not always available for the file formats we care about. How would we parse *.fasta* files without Biopython? To do so, we need to understand the structure of the file format well (*fasta* in our case).

In [None]:
with open("fastcats.fasta") as f_inp:
    lines = f_inp.read().splitlines()
lines

In [None]:
for line in lines:
    print(line)
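In a *fasta* file every record starts with a header line beginning with `>`, followed by one or more lines of sequence. Based on that structure, a minimal parser could look like the sketch below. The function name `parse_fasta` and the sample lines are our own assumptions (the real content of *fastcats.fasta* is not reproduced here), and the sketch ignores edge cases that Biopython handles for us.

```python
def parse_fasta(lines):
    """Return a list of (header, sequence) tuples from FASTA-formatted lines."""
    records = []
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines of one record may be split over several lines
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))  # do not forget the last record
    return records


# Hypothetical lines, in the same shape as the list returned by splitlines()
sample_lines = [
    ">FAST_CAT a fast cat sequence",
    "THEFAST",
    "CAT",
    "",
    ">FAT_CAT a fat cat sequence",
    "THEFATCAT",
]
parsed = parse_fasta(sample_lines)
for header, sequence in parsed:
    print(header, "\t", sequence)
```

With the real file, we would call `parse_fasta(lines)` on the list read in the cell above.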

# Exact matching

How do we do exact matching in Python? In other words, how do we find the offset of a substring *p* in a string *t*?

In [None]:
t = 'THEFASTCAT'
p = 'FAST'
t.find(p)
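`str.find` returns the offset of the leftmost match, or -1 if there is none. To see what exact matching involves, here is a naive sketch that tries every possible offset in turn. The name `naive_find` is ours, and this is only one simple way to do it, not how Python implements `find` internally.

```python
def naive_find(p, t):
    """Return the offset of the leftmost occurrence of p in t, or -1."""
    for offset in range(len(t) - len(p) + 1):
        # Compare p against the window of t that starts at this offset
        if t[offset:offset + len(p)] == p:
            return offset
    return -1


text = 'THEFASTCAT'
pattern = 'FAST'
print(naive_find(pattern, text))
print(naive_find('DOG', text))
```

For these inputs the function agrees with `text.find(pattern)`.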

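How do we find the second (third, ...) occurrence of the motif? The optional second argument of `find` is the offset at which the search starts.

In [None]:
t = 'THEFASTCATTHEFASTRATAFASTRAT'
t.find(p, 0)  # searching from offset 0 returns the first occurrence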
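One possible way to collect every occurrence is to call `find` repeatedly, each time starting one position past the previous hit. The helper name `find_all` below is our own; it is a sketch, not a built-in.

```python
def find_all(p, t):
    """Return the offsets of all occurrences of p in t (overlaps included)."""
    offsets = []
    start = t.find(p)
    while start != -1:
        offsets.append(start)
        # Restart one position past the last hit so overlapping matches count too
        start = t.find(p, start + 1)
    return offsets


t = 'THEFASTCATTHEFASTRATAFASTRAT'
p = 'FAST'
occurrences = find_all(p, t)
print(occurrences)
```

The second occurrence is then `occurrences[1]`, the third `occurrences[2]`, and so on.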