## Chapter 3: Reading and writing files

### - Reading text from a file

Text files are files made up of characters and lines that can be opened and viewed in a text editor.
Some examples are:
   - FASTA files of DNA or protein sequences
   - files containing output from command-line programs (e.g. BLAST)
   - FASTQ files containing DNA sequencing reads
   - HTML files
   - word processing documents
   - and Python code

### - Using open to read a file

In Python, we have to open a file before we can read it. We use the open function, which in its simplest form takes one argument, a string containing the name of the file, and returns a file object.

In [1]:
my_file = open("dna.txt")

A file object is a new data type that represents a file on the computer's hard drive. We use methods to interact with file objects.

The first thing we need to do is read the contents of the file using the read method. It doesn't need any arguments, and it returns a string that we can store in a variable and then treat like any other string.

In [20]:
my_file = open("dna.txt")
file_contents = my_file.read()
print(file_contents)

ACTGTACGTGCACTGATC



#### - Files, content and file names

It is important to distinguish between a file name, a file object and the contents of a file:

In [3]:
my_file_name = "dna.txt"
my_file = open(my_file_name)
my_file_contents = my_file.read()

On line 1, we store the string "dna.txt" in the variable
my_file_name. On line 2, we use the variable my_file_name as the argument
to the open function, and store the resulting file object in the variable my_file.
On line 3, we call the read method on the variable my_file, and store the
resulting string in the variable my_file_contents.

my_file_name is a string, and it stores the name of a file on disk.
my_file is a file object, and it represents the file itself. my_file_contents is a
string, and it stores the text that is in the file.
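
To make the three-way distinction concrete, the sketch below prints the type of each variable. It is self-contained, so it first creates a small dna.txt in a temporary directory; the use of tempfile and os is an assumption of this example and not part of the chapter's code.

```python
import os
import tempfile

# Create a throwaway dna.txt so the example runs anywhere
# (the temporary directory is an assumption of this sketch).
work_dir = tempfile.mkdtemp()
my_file_name = os.path.join(work_dir, "dna.txt")
setup_file = open(my_file_name, "w")
setup_file.write("ACTGTACGTGCACTGATC\n")
setup_file.close()

my_file = open(my_file_name)        # a file object
my_file_contents = my_file.read()   # a string holding the text
my_file.close()

print(type(my_file_name))      # <class 'str'>
print(type(my_file))           # <class '_io.TextIOWrapper'>
print(type(my_file_contents))  # <class 'str'>
```

Both the name and the contents are ordinary strings; only my_file is a file object with methods such as read and close.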

#### - Dealing with new lines

We're going to write a simple program to read the DNA sequence from the file and
print it out along with its length.

In [7]:
#open the file
my_file = open("dna.txt")
# read the contents
my_DNA = my_file.read()
#calculate the length
dna_length = len(my_DNA)
# print the output
print("sequence is " + my_DNA + " and length is " + str(dna_length))


sequence is ACTGTACGTGCACTGATC
 and length is 19


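Python has included the newline character at the end of the dna.txt file as part of
the contents, which is why the output is split over two lines and the length is 19
rather than 18. In other words, the variable my_DNA has a newline character at the
end of it. If we could view the my_DNA variable directly, we would see the string
"ACTGTACGTGCACTGATC\n".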

The solution is to remove the newline with the rstrip method. It takes one optional argument, a string containing the characters we want to remove from the right-hand end of the string.

In [9]:
#open the file
my_file = open("dna.txt")
# read the contents
my_file_contents = my_file.read()
#remove the new line from the end of the file content
my_DNA = my_file_contents.rstrip("\n")
dna_length = len(my_DNA)
print("sequence is " + my_DNA + " and length is " + str(dna_length))


sequence is ACTGTACGTGCACTGATC and length is 18
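
One way to see the hidden newline is the built-in repr function, which shows a string with its escape sequences visible. The sketch below assumes the file contents are the same 18 bases plus a newline as in the cells above, and stores them in a string directly so that it runs without dna.txt.

```python
# The same text that read() returned for dna.txt above.
my_file_contents = "ACTGTACGTGCACTGATC\n"

# repr() shows the string with its escape sequences visible.
print(repr(my_file_contents))   # 'ACTGTACGTGCACTGATC\n'
print(len(my_file_contents))    # 19

my_DNA = my_file_contents.rstrip("\n")
print(repr(my_DNA))             # 'ACTGTACGTGCACTGATC'
print(len(my_DNA))              # 18

# rstrip only touches the right-hand end of the string,
# so a newline in the middle is left alone.
print(repr("AC\nGT\n".rstrip("\n")))   # 'AC\nGT'
```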


### - Opening files for writing

To open a file for writing, we give the open function a second argument: a short string describing what we want to do with the file. It can be "r" for reading, "w" for writing or "a" for appending. If we leave out the second argument, the default is "r".

The difference between "w" and "a" matters when the file already exists. Opening it with "w" discards the existing contents and replaces them with whatever we write. Opening it with "a" keeps the existing contents and adds the new data to the end of the file.

We use the write method to write to a file that we've opened. It takes one argument, the string that we want to write to the file.

In [10]:
my_file = open("out.txt", "w")
my_file.write("Hello World")

11
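
The difference between "w" and "a" can be checked with a short experiment. This is a minimal sketch that works in a temporary directory (an assumption of the example, so that no existing out.txt is touched).

```python
import os
import tempfile

# Throwaway location for the example file.
path = os.path.join(tempfile.mkdtemp(), "out.txt")

# "w" creates the file, or empties it if it already exists.
my_file = open(path, "w")
my_file.write("first")
my_file.close()

# "a" keeps the existing text and adds to the end of it.
my_file = open(path, "a")
my_file.write(" second")
my_file.close()

check = open(path)
after_append = check.read()
check.close()
print(after_append)      # first second

# Opening with "w" again discards everything written so far.
my_file = open(path, "w")
my_file.write("third")
my_file.close()

check = open(path)
after_overwrite = check.read()
check.close()
print(after_overwrite)   # third
```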

Remember that with write, just like with print, we can use any string as the
argument. This also means that we can use any method or function that returns a
string:

In [None]:
# write "abcdef"
my_file.write("abc" + "def")
# write "8"
my_file.write(str(len('AGTGCTAG')))
# write "TTGC"
my_file.write("ATGC".replace('A', 'T'))
# write "atgc"
my_file.write("ATGC".lower())
# write contents of my_variable
my_file.write(my_variable)
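
A runnable version of the calls above, reading the file back to show what ends up in it. The value of my_variable and the temporary directory are placeholders chosen for this sketch.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "out.txt")
my_variable = "GATTACA"   # hypothetical value, just for this example

my_file = open(path, "w")
my_file.write("abc" + "def")
my_file.write(str(len("AGTGCTAG")))
my_file.write("ATGC".replace("A", "T"))
my_file.write("ATGC".lower())
my_file.write(my_variable)
my_file.close()

# Read the file back to see the combined result.
my_file = open(path)
written = my_file.read()
my_file.close()
print(written)   # abcdef8TTGCatgcGATTACA
```

Unlike print, write does not add a newline after each string, so the five pieces end up on a single line. To put them on separate lines we would have to write "\n" ourselves.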

#### - Closing files
When we are done reading from or writing to a file, we call its close method.


In [12]:
my_file = open("out.txt", "w")
my_file.write("Hello world")
# remember to close the file
my_file.close()
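
Whether a file is still open can be checked through the closed attribute of the file object. The sketch below also shows the with statement, a common alternative that closes the file automatically; both try/except and with go beyond what these notes have covered so far, and the temporary directory is an assumption of the example.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "out.txt")

my_file = open(path, "w")
my_file.write("Hello world")
print(my_file.closed)   # False
my_file.close()
print(my_file.closed)   # True

# Writing to a file that has been closed raises a ValueError.
try:
    my_file.write("more")
    write_failed = False
except ValueError:
    write_failed = True
print(write_failed)     # True

# with closes the file as soon as the indented block ends.
with open(path) as reopened:
    contents = reopened.read()
print(reopened.closed)  # True
print(contents)         # Hello world
```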

### - Exercises

1. Splitting genomic DNA

Look in the chapter_3 folder for a file called genomic_dna.txt – it contains the same
piece of genomic DNA that we were using in the final exercise from chapter 2. Write
a program that will split the genomic DNA into coding and non-coding parts, and
write these sequences to two separate files.

In [25]:
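# open the file and read its contents, dropping any trailing newline
dna_file = open("genomic_dna.txt")
dna = dna_file.read().rstrip("\n")
dna_file.close()
# open the two output files for writing
coding_region = open("coding_region.txt", "w")
non_coding_region = open("non_coding_region.txt", "w")
# first exon: characters 1 to 63
exon1 = dna[0:63]
# intron: characters 64 to 90, converted to lower case
intron = dna[63:90].lower()
# second exon: character 91 to the end
exon2 = dna[90:]
# write the sequences to the output files
coding_region.write(exon1 + exon2)
non_coding_region.write(intron)
# close the output files so that the data is saved to disk
coding_region.close()
non_coding_region.close()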
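
The slice boundaries used in this solution can be checked without the real file. The sketch below uses a made-up sequence of the right shape (63 + 27 + 30 characters); it is a stand-in and not the contents of genomic_dna.txt.

```python
# Made-up stand-in for the genomic DNA: 63 + 27 + 30 characters.
dna = "A" * 63 + "C" * 27 + "G" * 30

exon1 = dna[0:63]     # start of the sequence up to character 63
intron = dna[63:90]   # characters 64 to 90
exon2 = dna[90:]      # character 91 to the end

print(len(exon1), len(intron), len(exon2))   # 63 27 30

# The three slices together rebuild the original sequence,
# so no character is lost or counted twice.
print(exon1 + intron + exon2 == dna)         # True
```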

2. Writing a FASTA file

FASTA file format is a commonly-used DNA and protein sequence file format. A
single sequence in FASTA format looks like this:


    >sequence_name
    ATCGACTGATCGATCGTACGAT


Where sequence_name is a header that describes the sequence (the greater-than
symbol indicates the start of the header line). Often, the header contains an
accession number that relates to the record for the sequence in a public sequence
database. A single FASTA file can contain multiple sequences, like this:

    >sequence_one
    ATCGATCGATCGATCGAT
    >sequence_two
    ACTAGCTAGCTAGCATCG
    >sequence_three
    ACTGCATCGATCGTACCT


Write a program that will create a FASTA file for the following three sequences –
make sure that all sequences are in upper case and only contain the bases A, T, G
and C.

| Sequence header | DNA sequence |
|-----------------|--------------|
| ABC123 | ATCGTACGATCGATCGATCGCTAGACGTATCG |
| DEF456 | actgatcgacgatcgatcgatcacgact |
| HIJ789 | ACTGAC-ACTGT--ACTGTA----CATGTG |

In [42]:
# set the values of all the header variables
header_1 = "ABC123"
header_2 = "DEF456"
header_3 = "HIJ789"

# set the values of all the sequence variables
seq_1 = "ATCGTACGATCGATCGATCGCTAGACGTATCG"
seq_2 = "actgatcgacgatcgatcgatcacgact"
seq_3 = "ACTGAC-ACTGT--ACTGTA----CATGTG"

# make a new file to hold the output
sequence_fasta = open("sequence.fasta", "w")

# write each header line (starting with ">") followed by its cleaned-up sequence
sequence_fasta.write(">" + header_1 + "\n" + seq_1 + "\n")
sequence_fasta.write(">" + header_2 + "\n" + seq_2.upper() + "\n")
sequence_fasta.write(">" + header_3 + "\n" + seq_3.replace('-', '') + "\n")

#close
sequence_fasta.close()
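
Since every record needs the same clean-up, the formatting can also be gathered in one place. The sketch below uses a hypothetical helper, format_fasta_record, that upper-cases the sequence and removes the gap characters for every record; defining functions goes beyond this chapter, so treat it as an optional alternative.

```python
def format_fasta_record(header, sequence):
    """Return one FASTA record as a string (hypothetical helper)."""
    # Upper-case the sequence and drop the "-" gap characters.
    cleaned = sequence.upper().replace("-", "")
    return ">" + header + "\n" + cleaned + "\n"

record_1 = format_fasta_record("ABC123", "ATCGTACGATCGATCGATCGCTAGACGTATCG")
record_2 = format_fasta_record("DEF456", "actgatcgacgatcgatcgatcacgact")
record_3 = format_fasta_record("HIJ789", "ACTGAC-ACTGT--ACTGTA----CATGTG")

# This is the text that would be written to sequence.fasta.
fasta_text = record_1 + record_2 + record_3
print(fasta_text)
```

Applying both upper and replace to every sequence is harmless for sequences that are already clean, and it avoids having to remember which sequence needs which fix.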

3. Writing multiple FASTA files

Use the data from the previous exercise, but instead of creating a single FASTA file,
create three new FASTA files – one per sequence. The names of the FASTA files
should be the same as the sequence header names, with the extension .fasta.

In [39]:
# set the values of all the header variables
header_1 = "ABC123"
header_2 = "DEF456"
header_3 = "HIJ789"

# set the values of all the sequence variables
seq_1 = "ATCGTACGATCGATCGATCGCTAGACGTATCG"
seq_2 = "actgatcgacgatcgatcgatcacgact"
seq_3 = "ACTGAC-ACTGT--ACTGTA----CATGTG"

# make the file and write the header (starting with ">") and sequence for file 1
ABC123_fasta = open(header_1 + ".fasta", "w")
ABC123_fasta.write(">" + header_1 + "\n" + seq_1 + "\n")

# make the file and write the header and upper-case sequence for file 2
DEF456_fasta = open(header_2 + ".fasta", "w")
DEF456_fasta.write(">" + header_2 + "\n" + seq_2.upper() + "\n")

# make the file and write the header and gap-free sequence for file 3
HIJ789_fasta = open(header_3 + ".fasta", "w")
HIJ789_fasta.write(">" + header_3 + "\n" + seq_3.replace('-', '') + "\n")

#close
ABC123_fasta.close()
DEF456_fasta.close()
HIJ789_fasta.close()
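
The same three files can be produced with a loop, which avoids repeating the open, write and close steps by hand. This is a sketch under two assumptions: it uses lists and a for loop, which these notes have not introduced yet, and it writes into a temporary directory rather than the current folder.

```python
import os
import tempfile

out_dir = tempfile.mkdtemp()   # throwaway output folder for this example

headers = ["ABC123", "DEF456", "HIJ789"]
sequences = [
    "ATCGTACGATCGATCGATCGCTAGACGTATCG",
    "actgatcgacgatcgatcgatcacgact",
    "ACTGAC-ACTGT--ACTGTA----CATGTG",
]

for header, sequence in zip(headers, sequences):
    # The file name is the header plus the .fasta extension.
    file_name = os.path.join(out_dir, header + ".fasta")
    cleaned = sequence.upper().replace("-", "")
    fasta_file = open(file_name, "w")
    fasta_file.write(">" + header + "\n" + cleaned + "\n")
    fasta_file.close()

print(sorted(os.listdir(out_dir)))
# ['ABC123.fasta', 'DEF456.fasta', 'HIJ789.fasta']
```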