## Reading and Writing Files

https://levelup.gitconnected.com/python-reading-and-writing-data-from-files-d3b70441416e

Reading and writing files is essential for data analysis. pandas is particularly useful when working with DataFrames, but this section sticks to Python's built-in file handling.

The most basic way to read and write plain-text files, such as .txt documents, is the built-in function open('filename').

In [1]:
#If the file doesn't exist, Python raises a FileNotFoundError
open('myfile')

FileNotFoundError: [Errno 2] No such file or directory: 'myfile'

In [2]:
#You can also wrap the call in a try/except block to handle the error
try:
    f = open("myfile")
except IOError:
    print('File does not exist')

File does not exist
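
The try/except pattern above can be combined with a `with` block, which closes the file automatically once the block ends. The following is a minimal sketch rather than the only way to do it; the helper name `read_text_or_none` and the file name passed to it are made up for this illustration.

```python
def read_text_or_none(path):
    """Return the contents of a text file, or None if the file does not exist."""
    try:
        # The with block closes the file automatically, even if read() raises
        with open(path) as handle:
            return handle.read()
    except FileNotFoundError:
        # FileNotFoundError is a subclass of OSError; in Python 3, IOError is an alias of OSError
        print('File does not exist')
        return None


# Hypothetical file name, assumed not to exist in the working directory
result = read_text_or_none('a_file_that_is_assumed_not_to_exist.txt')
```

Catching the more specific `FileNotFoundError` avoids hiding unrelated problems such as permission errors.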


In [3]:
#Now let's open an actual file; in this case, a txt file containing a Vibrio cholerae sequence
f = open('Vibrio_cholerae.txt')

In [4]:
#One way to read the file is with a for loop: a file object is iterable and yields one line at a time ('line' is just a variable name)

for line in f:
    print(line)

ACAATGAGGTCACTATGTTCGAGCTCTTCAAACCGGCTGCGCATACGCAGCGGCTGCCATCCGATAAGGTGGACAGCGTCTATTCACGCCTTCGTTGGCAACTTTTCATCGGTATTTTTGTTGGCTATGCAGGCTACTATTTGGTTCGTAAGAACTTTAGCTTGGCAATGCCTTACCTGATTGAACAAGGCTTTAGTCGTGGCGATCTGGGTGTGGCTCTCGGTGCGGTTTCAATCGCGTATGGTCTGTCTAAATTTTTGATGGGGAACGTCTCTGACCGTTCTAACCCGCGCTACTTTCTGAGTGCAGGTCTACTCCTTTCGGCACTAGTGATGTTCTGCTTCGGCTTTATGCCATGGGCAACGGGCAGCATTACTGCGATGTTTATTCTGCTGTTCTTAAACGGCTGGTTCCAAGGCATGGGTTGGCCTGCTTGTGGCCGTACTATGGTGCACTGGTGGTCACGCAAAGAGCGTGGTGAGATTGTTTCGGTCTGGAACGTCGCTCACAACGTCGGTGGTGGTTTGATTGGCCCCATTTTCCTGCTCGGCCTATGGATGTTTAACGATGATTGGCGCACGGCCTTCTATGTCCCCGCTTTCTTTGCGGTGCTGGTTGCCGTATTTACTTGGCTAGTCATGCGCGATACTCCTCAATCTTGTGGTTTACCACCGATTGAAGAGTACAAAAACGACTATCCCGATGATTACGATAAGTCGCATGAAAATGAGATGACTGCGAAAGAGATCTTCTTTAAGTATGTCTTCAACAACAAACTGCTTTGGTCGATTGCGATTGCTAACGCCTTCGTTTACCTGATCCGCTACGGTGTACTTGACTGGGCTCCGGTTTACCTCAAAGAAGCCAAACACTTCACGGTTGATAAATCTTCTTGGGCTTACTTCCTGTACGAGTGGGCGGGCATTCCGGGTACTTTGTTGTGTGGTTGGATTTCCGACAAAGTGTTTAAAGGCCGCCGCGCTCCAGCAGGCATCCTGTT

In [6]:
#You could also just use the read method. Here it returns an empty string because the loop above already consumed the whole file
f.read()

''
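
The empty string appears because a file object keeps track of its position, and the loop had already moved that position to the end of the file. A small sketch of this behavior, using a throwaway temporary file (its contents are invented for the example) so that it does not depend on Vibrio_cholerae.txt:

```python
import os
import tempfile

# Create a small throwaway file with two made-up sequence lines
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('ACAATGAGG\nTCACTATGT\n')
    path = tmp.name

with open(path) as f:
    lines = [line.rstrip('\n') for line in f]  # iterating consumes the file
    after_loop = f.read()                      # nothing is left to read
    f.seek(0)                                  # rewind to the start of the file
    after_seek = f.read()                      # the full contents are available again

os.remove(path)  # clean up the temporary file
```

Calling `f.seek(0)` is one way to read the same open file twice; closing and reopening it works as well.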

You can also add text to the end of an existing file by opening it in append mode: open('filename', 'a')

In [7]:
d = open ('demo.txt')
d.read()

'this is just a random file I made for this demo'

In [8]:
d.close() #I'm going to close the file because I'm going to reopen it in append mode

In [9]:
d = open ('demo.txt', 'a') #this opens the file in a special append mode

In [10]:
d.write(' ...Here is a second sentence')

29

In [11]:
#Notice that I cannot read the file while it is open in append mode
d.read()

UnsupportedOperation: not readable

In [12]:
#So I have to close the file...
d.close()

In [13]:
#Now I can open it and read it
d = open ('demo.txt')
d.read()

'this is just a random file I made for this demo ...Here is a second sentence'
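
The open, write, close, reopen sequence above can be written more compactly with `with` blocks, which close the file at the end of each block. This sketch assumes a temporary directory and an invented first sentence in place of the real demo.txt; `demo_path` is a name chosen for the example.

```python
import os
import tempfile

# Work in a temporary directory; demo_path stands in for 'demo.txt'
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, 'demo.txt')

with open(demo_path, 'w') as d:   # 'w' creates the file, or truncates an existing one
    d.write('first sentence')

with open(demo_path, 'a') as d:   # 'a' adds to the end and leaves existing text intact
    n_written = d.write(' ...Here is a second sentence')  # write() returns the character count

with open(demo_path) as d:        # the default mode is 'r' (read-only text)
    contents = d.read()

# Clean up the temporary file and directory
os.remove(demo_path)
os.rmdir(tmp_dir)
```

Note the difference between the two write modes: 'w' would have erased the first sentence, while 'a' preserves it.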

## Reading FASTA files

Exercise: Build a dictionary containing all sequences from a FASTA file

In [14]:
try:
    f = open("lambda_virus.fa")
except IOError:
    print('File does not exist')

In [15]:
seqs={} #create the dictionary seqs
for line in f:
    line=line.rstrip() #removes trailing whitespace (including the newline) that we don't want in the sequence
    if line.startswith('>'): #a line starting with '>' is a header line
        words=line.split()
        name=words[0][1:] #take the first word and slice from index 1 to drop the leading '>'
        seqs[name]=''
    else:
        seqs[name] = seqs[name] + line
f.close()

In [16]:
seqs

{'gi|9626243|ref|NC_001416.1|': 'GGGCGGCGACCTCGCGGGTTTTCGCTATTTATGAAAATTTTCCGGTTTAAGGCGTTTCCGTTCTTCTTCGTCATAACTTAATGTTTTTATTTAAAATACCCTCTGAAAAGAAAGGAAACGACAGGTGCTGAAAGCGAGGCTTTTTGGCCTCTGTCGTTTCCTTTCTCTGTTTTTGTCCGTGGAATGAACAATGGAAGTCAACAAAAAGCAGCTGGCTGACATTTTCGGTGCGAGTATCCGTACCATTCAGAACTGGCAGGAACAGGGAATGCCCGTTCTGCGAGGCGGTGGCAAGGGTAATGAGGTGCTTTATGACTCTGCCGCCGTCATAAAATGGTATGCCGAAAGGGATGCTGAAATTGAGAACGAAAAGCTGCGCCGGGAGGTTGAAGAACTGCGGCAGGCCAGCGAGGCAGATCTCCAGCCAGGAACTATTGAGTACGAACGCCATCGACTTACGCGTGCGCAGGCCGACGCACAGGAACTGAAGAATGCCAGAGACTCCGCTGAAGTGGTGGAAACCGCATTCTGTACTTTCGTGCTGTCGCGGATCGCAGGTGAAATTGCCAGTATTCTCGACGGGCTCCCCCTGTCGGTGCAGCGGCGTTTTCCGGAACTGGAAAACCGACATGTTGATTTCCTGAAACGGGATATCATCAAAGCCATGAACAAAGCAGCCGCGCTGGATGAACTGATACCGGGGTTGCTGAGTGAATATATCGAACAGTCAGGTTAACAGGCTGCGGCATTTTGTCCGCGCCGGGCTTCGCTCACTGTTCAGGCCGGAGCCACAGACCGCCGTTGAATGGGCGGATGCTAATTACTATCTCCCGAAAGAATCCGCATACCAGGAAGGGCGCTGGGAAACACTGCCCTTTCAGCGGGCCATCATGAATGCGATGGGCAGCGACTACATCCGTGAGGTGAATGTGGTGAAGTCTGCCCGTGTCGGTTATTCCAAAATGCT
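
The same parsing logic can be wrapped in a reusable function. The sketch below is one possible variant, not the only approach: the name `read_fasta` is made up here, the example records are invented, and an in-memory `io.StringIO` object stands in for an open file so the snippet runs without lambda_virus.fa.

```python
import io


def read_fasta(handle):
    """Parse FASTA-formatted lines from an open file-like object into a dict."""
    chunks = {}   # maps each record name to a list of its sequence lines
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue                          # skip blank lines
        if line.startswith('>'):
            header = line[1:].split()
            name = header[0] if header else ''  # first word after '>'
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line)         # sequence lines before any header are ignored
    # Join once per record instead of concatenating strings line by line
    return {key: ''.join(parts) for key, parts in chunks.items()}


# Invented two-record example
example = io.StringIO('>seq1 first record\nACGT\nTTGA\n>seq2\nGGCC\n')
parsed = read_fasta(example)
```

Collecting lines in a list and joining them at the end tends to scale better for long sequences than repeated string concatenation.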

### Of note, you can pull FASTA files straight from a URL if you install wget

In [17]:
import wget

In [18]:
!python -m wget http://www.uniprot.org/uniprot/B5ZC00.fasta


Saved under B5ZC00 (3).fasta


In [19]:
try:
    f= open('B5ZC00.fasta')
except IOError:
    print('File does not exist')

seqs={}
for line in f:
    line=line.rstrip() #removes trailing whitespace (including the newline) that we don't want in the sequence
    if line.startswith('>'): #a line starting with '>' is a header line
        words=line.split()
        name=words[0][1:] #take the first word and slice from index 1 to drop the leading '>'
        seqs[name]=''
    else:
        seqs[name] = seqs[name] + line
f.close()

    

In [20]:
seqs

{'sp|B5ZC00|SYG_UREU1': 'MKNKFKTQEELVNHLKTVGFVFANSEIYNGLANAWDYGPLGVLLKNNLKNLWWKEFVTKQKDVVGLDSAIILNPLVWKASGHLDNFSDPLIDCKNCKARYRADKLIESFDENIHIAENSSNEEFAKVLNDYEISCPTCKQFNWTEIRHFNLMFKTYQGVIEDAKNVVYLRPETAQGIFVNFKNVQRSMRLHLPFGIAQIGKSFRNEITPGNFIFRTREFEQMEIEFFLKEESAYDIFDKYLNQIENWLVSACGLSLNNLRKHEHPKEELSHYSKKTIDFEYNFLHGFSELYGIAYRTNYDLSVHMNLSKKDLTYFDEQTKEKYVPHVIEPSVGVERLLYAILTEATFIEKLENDDERILMDLKYDLAPYKIAVMPLVNKLKDKAEEIYGKILDLNISATFDNSGSIGKRYRRQDAIGTIYCLTIDFDSLDDQQDPSFTIRERNSMAQKRIKLSELPLYLNQKAHEDFQRQCQK'}

In [36]:
type(seqs)

dict
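
The key of the UniProt dictionary above, 'sp|B5ZC00|SYG_UREU1', packs three pipe-separated fields into one string: the database ('sp' for Swiss-Prot), the accession, and the entry name. If those parts are needed separately, they can be split apart. This is a small sketch with a made-up helper name, `split_uniprot_id`; it assumes exactly three fields, so it would not apply to the NCBI-style key from the lambda virus file.

```python
def split_uniprot_id(name):
    """Split a UniProt-style identifier of the form db|accession|entry_name."""
    # Assumes exactly three pipe-separated fields; anything else raises ValueError
    db, accession, entry_name = name.split('|')
    return {'db': db, 'accession': accession, 'entry_name': entry_name}


fields = split_uniprot_id('sp|B5ZC00|SYG_UREU1')
```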