# File I/O and Lists

By the end of this lesson, you will be able to read data from and write data to files using Python, and to store and manipulate data using lists, an important Python data structure.

##  Learning Objectives

1. Open and read data from files
2. Write data to files
3. Create lists and learn basic list operations: length, index, append, extend, sort, reverse, insert, remove
4. Read data from a file, store it in a list, manipulate the list, and write the data to a new file


## Files

In [2]:
filepath = 'dna.txt'
my_file = open(filepath)

### File Objects

* Printing a file object shows a description of the object, not the contents of the file

In [3]:
print(my_file)

<open file 'dna.txt', mode 'r' at 0x10718c780>


### Reading a file object

In [4]:
file_contents = my_file.read()
print(file_contents)

ATATCGCGAA



### "Exhausting" a file

In [5]:
my_file = open(filepath)
print(my_file.read())

ATATCGCGAA



In [6]:
print(my_file.read())




No output is displayed because the file object has already been read through to the end.

The file must be opened again to read from the beginning.

Storing the contents in a variable (file_contents) allows us to reuse the file data without worrying about this.

In [7]:
print(file_contents)

ATATCGCGAA
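The behavior above can be reproduced in a self-contained way. This is a minimal sketch that assumes a throwaway file (`dna_demo.txt`, a hypothetical name created in a temporary directory) stands in for `dna.txt`; it shows that a second `read()` returns an empty string and that reopening the file starts from the beginning again.

```python
import os
import tempfile

# Hypothetical stand-in for dna.txt so the sketch runs anywhere
demo_path = os.path.join(tempfile.mkdtemp(), 'dna_demo.txt')
demo_file = open(demo_path, 'w')
demo_file.write('ATATCGCGAA\n')
demo_file.close()

demo_file = open(demo_path)
first_read = demo_file.read()    # the whole file
second_read = demo_file.read()   # empty string: nothing is left to read
demo_file.close()

# Opening the file again starts from the beginning
demo_file = open(demo_path)
third_read = demo_file.read()
demo_file.close()

print(repr(first_read))
print(repr(second_read))
print(repr(third_read))
```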



### Working with the file

In [8]:
dna_length = len(file_contents)
print('sequence is ' + file_contents + ' and the length is ' + str(dna_length))

sequence is ATATCGCGAA
 and the length is 11


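The output breaks across two lines and the reported length is 11 rather than 10 because the string ends with a hidden newline ('\n') character.

The line in the file ends with a newline, so the text we read in is effectively 2 lines, with the second line being blank.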
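One way to make a hidden character visible is the built-in `repr()`, which shows escape sequences instead of rendering them. The sketch below types the string in by hand (the name `demo_contents` is an assumption for illustration) rather than reading `dna.txt`.

```python
# The same text that read() returned in the lesson, typed in by hand
demo_contents = 'ATATCGCGAA\n'

# repr() shows the newline as \n instead of printing a line break
print(repr(demo_contents))

# 10 bases plus 1 newline character
print(len(demo_contents))
```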

### Stripping

In [9]:
my_dna_strip = file_contents.strip('\n')
print('sequence is ' + my_dna_strip + ' and the length is ' + str(len(my_dna_strip)))

sequence is ATATCGCGAA and the length is 10


.strip() removes any leading or trailing instances of the given character; characters in the middle of the string are left untouched

In [12]:
new_dna = my_dna_strip.strip('A')
print(new_dna)

TATCGCG
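A few related calls may help place `.strip()` in context. This is a small sketch using the string methods `lstrip` and `rstrip` and a multi-character argument, none of which appear in the lesson itself; the variable name `dna` is chosen only for illustration.

```python
dna = 'ATATCGCGAA'

print(dna.strip('A'))    # TATCGCG   (A removed from both ends)
print(dna.lstrip('A'))   # TATCGCGAA (left end only)
print(dna.rstrip('A'))   # ATATCGCG  (right end only)

# With several characters, any mix of them is removed from both ends
print(dna.strip('AT'))   # CGCG

# With no argument, strip() removes whitespace, including newlines
print(repr('  ATCG\n'.strip()))
```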


### Closing files

It's good programming practice to close files once you have finished reading from them.
Your operating system limits how many files can be open at the same time.

In [14]:
my_file2 = open('three_seq.txt')
file_contents2 = my_file2.read()
my_file2.close()
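An alternative that the lesson does not use is Python's `with` statement, which closes the file automatically when the indented block ends. The sketch below assumes a throwaway file (`with_demo.txt`, a hypothetical name) so that it does not depend on `three_seq.txt`.

```python
import os
import tempfile

# Hypothetical demo file so the sketch does not depend on three_seq.txt
demo_path = os.path.join(tempfile.mkdtemp(), 'with_demo.txt')

with open(demo_path, 'w') as out_file:
    out_file.write('AGTAC\nGAT\n')

with open(demo_path) as in_file:
    demo_contents = in_file.read()

# The file was closed when the with block ended
print(in_file.closed)
print(repr(demo_contents))
```

Calling `close()` explicitly, as in the cell above, works just as well; `with` simply removes the chance of forgetting it.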

## Lists 

* Most versatile Python data structure
* Comma-separated values (elements) surrounded by []
* Values can be anything

In [15]:
mySequences = ["AGTAC", "GCCTTTA", "GAT", "CCCAAA"]
print(mySequences)

['AGTAC', 'GCCTTTA', 'GAT', 'CCCAAA']


In [16]:
seq1 = "AGTAC"
seq2 = "GCCTTTA"
seq3 = "GAT"
seq4 = "CCCAAA"

In [17]:
# You can use variables to fill up a list
mySequences = [seq1,seq2,seq3,seq4]
print(mySequences)

['AGTAC', 'GCCTTTA', 'GAT', 'CCCAAA']


In [18]:
myExpressionValues = [256,753,315.0,962.53,6472]
print(myExpressionValues)

[256, 753, 315.0, 962.53, 6472]


In [19]:
mySeqAndExpressionVals = ["AGTAC",256,"GCCTTTA",753]
print(mySeqAndExpressionVals)

['AGTAC', 256, 'GCCTTTA', 753]


### Manipulating Lists

Functions we will cover: length, index, append, extend, sort, reverse, insert, remove


#### These are some operations we have used previously on strings

In [20]:
# Get length of the list
print(len(mySequences))

4


In [21]:
# Access individual elements in a list
print(mySequences[0])
print(mySequences[2:4])

AGTAC
['GAT', 'CCCAAA']


In [22]:
# Concatenate two lists
mySequences1 = [seq1,seq2]
mySequences2 = [seq3,seq4]
mySeq1and2 = mySequences1 + mySequences2
print(mySeq1and2)

['AGTAC', 'GCCTTTA', 'GAT', 'CCCAAA']


#### These are some operations that are unique to lists

In [23]:
# Insert elements into a list at a specific index
Seq5 = "GATAC"
mySequences.insert(2,Seq5)
print(mySequences)

['AGTAC', 'GCCTTTA', 'GATAC', 'GAT', 'CCCAAA']


In [24]:
# Append elements at the end of the list
Seq6 = "ACCCTA"
mySequences.append(Seq6)
print(mySequences)

['AGTAC', 'GCCTTTA', 'GATAC', 'GAT', 'CCCAAA', 'ACCCTA']


In [25]:
# Remove elements from the list
mySequences.remove(Seq5)
print(mySequences)

['AGTAC', 'GCCTTTA', 'GAT', 'CCCAAA', 'ACCCTA']


In [26]:
# Find an element within a list and return its index
print(seq3)
mySequences.index(seq3)

GAT


2

In [27]:
# Sort a list 
mySequences.sort()
print(mySequences)
myExpressionValues.sort()
print(myExpressionValues)

['ACCCTA', 'AGTAC', 'CCCAAA', 'GAT', 'GCCTTTA']
[256, 315.0, 753, 962.53, 6472]


In [28]:
# Reverse a list
myExpressionValues.reverse()
print(myExpressionValues)

[6472, 962.53, 753, 315.0, 256]
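The list of operations for this lesson includes extend, which the cells above do not demonstrate. The following is a minimal sketch of how it differs from append, using made-up variable names (`sequences`, `more_sequences`, `nested`); it also shows that `sort()` changes the list in place and returns `None`.

```python
sequences = ["AGTAC", "GCCTTTA"]
more_sequences = ["GAT", "CCCAAA"]

# extend adds each element of another list to the end
sequences.extend(more_sequences)
print(sequences)    # ['AGTAC', 'GCCTTTA', 'GAT', 'CCCAAA']

# append adds its argument as ONE element, even if that argument is a list
nested = ["AGTAC", "GCCTTTA"]
nested.append(more_sequences)
print(nested)       # ['AGTAC', 'GCCTTTA', ['GAT', 'CCCAAA']]
print(len(nested))  # 3

# sort() modifies the list in place; its return value is None
result = sequences.sort()
print(result)       # None
print(sequences)    # ['AGTAC', 'CCCAAA', 'GAT', 'GCCTTTA']
```

Because `sort()`, `reverse()`, `append()` and `extend()` all return `None`, writing something like `print(sequences.sort())` prints `None` rather than the sorted list.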


## Read data from file into a list

Let's read data from a file that has three sequences, one per line, and store them in a list


In [29]:
# Read data from the sequence file
seqfile = open("three_seq.txt")
seqsToRead = seqfile.read()
seqfile.close()

In [30]:
print(seqsToRead)

ATCAGACGCGCAGAGGAGGCGGGGCCGCGGCTGGTTTCCTGCCGGGGGGCGGCTCTGGGCCGCCGAGTCCCCTCCTCCCGCCCCTGAGGAGGAGGAGCCGCCGCCACCCGCCGCGCCCGACACCCGGGAGGCCCCGCCAGCCCGCGGGAGAGGCCCAGCGGGAGTCGCGGAACAGCAGGCCCGAGCCCACCGCGCCGGGCCCCGGACGCCGCGCGGAAAAG
CTGCTCCGGAGTGACGCGGGCCCGGGCGCGACGGTCTCGGCGGCGGCGGCGGCGGCGACAGAGCGAGCGCGGCGCGGGGCCACC
AGAAGGAGGGCGTGGTAATATGAAGTCAGTTCCGGTTGGTGTAAAACCCCCGGGGCGGCGGCGAACTGGCTTTAGATGCTTCTGGGTCGCGGTGTGCTAAGCGAGGAGTCCGAGTGTGTGAGCTTGAGAGCCGCGCGCTAGAGCGACCCGGCGAGGG


In [31]:
seqsToRead.strip('\n')

'ATCAGACGCGCAGAGGAGGCGGGGCCGCGGCTGGTTTCCTGCCGGGGGGCGGCTCTGGGCCGCCGAGTCCCCTCCTCCCGCCCCTGAGGAGGAGGAGCCGCCGCCACCCGCCGCGCCCGACACCCGGGAGGCCCCGCCAGCCCGCGGGAGAGGCCCAGCGGGAGTCGCGGAACAGCAGGCCCGAGCCCACCGCGCCGGGCCCCGGACGCCGCGCGGAAAAG\nCTGCTCCGGAGTGACGCGGGCCCGGGCGCGACGGTCTCGGCGGCGGCGGCGGCGGCGACAGAGCGAGCGCGGCGCGGGGCCACC\nAGAAGGAGGGCGTGGTAATATGAAGTCAGTTCCGGTTGGTGTAAAACCCCCGGGGCGGCGGCGAACTGGCTTTAGATGCTTCTGGGTCGCGGTGTGCTAAGCGAGGAGTCCGAGTGTGTGAGCTTGAGAGCCGCGCGCTAGAGCGACCCGGCGAGGG'

In [33]:
seqs = seqsToRead.strip('\n').split('\n')
print(seqs)

['ATCAGACGCGCAGAGGAGGCGGGGCCGCGGCTGGTTTCCTGCCGGGGGGCGGCTCTGGGCCGCCGAGTCCCCTCCTCCCGCCCCTGAGGAGGAGGAGCCGCCGCCACCCGCCGCGCCCGACACCCGGGAGGCCCCGCCAGCCCGCGGGAGAGGCCCAGCGGGAGTCGCGGAACAGCAGGCCCGAGCCCACCGCGCCGGGCCCCGGACGCCGCGCGGAAAAG', 'CTGCTCCGGAGTGACGCGGGCCCGGGCGCGACGGTCTCGGCGGCGGCGGCGGCGGCGACAGAGCGAGCGCGGCGCGGGGCCACC', 'AGAAGGAGGGCGTGGTAATATGAAGTCAGTTCCGGTTGGTGTAAAACCCCCGGGGCGGCGGCGAACTGGCTTTAGATGCTTCTGGGTCGCGGTGTGCTAAGCGAGGAGTCCGAGTGTGTGAGCTTGAGAGCCGCGCGCTAGAGCGACCCGGCGAGGG']


The split operation separates a string at every occurrence of a given character. Its output is a list in which each element is one of the resulting substrings.
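To see why the string is stripped before it is split, here is a small sketch on a short made-up string (`raw_text` is a hypothetical name). Without the strip, the trailing newline produces an empty string as the last element of the list.

```python
raw_text = 'AGTAC\nGCCTTTA\nGAT\n'

# Splitting directly leaves an empty last element because of the final newline
print(raw_text.split('\n'))               # ['AGTAC', 'GCCTTTA', 'GAT', '']

# Stripping first gives one element per sequence
print(raw_text.strip('\n').split('\n'))   # ['AGTAC', 'GCCTTTA', 'GAT']

# split works with any separator, not only newlines
print('AGTAC,GAT'.split(','))             # ['AGTAC', 'GAT']
```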

In [34]:
print(seqs[0])

ATCAGACGCGCAGAGGAGGCGGGGCCGCGGCTGGTTTCCTGCCGGGGGGCGGCTCTGGGCCGCCGAGTCCCCTCCTCCCGCCCCTGAGGAGGAGGAGCCGCCGCCACCCGCCGCGCCCGACACCCGGGAGGCCCCGCCAGCCCGCGGGAGAGGCCCAGCGGGAGTCGCGGAACAGCAGGCCCGAGCCCACCGCGCCGGGCCCCGGACGCCGCGCGGAAAAG


In [35]:
seq_to_add = "GGGCGGGGCCGCGGGAGGGCGGGGCCGGCGCGGCGAGCGCACCAGCAGCATCCTGGCTCAGCCGCGGCGGTGGCGGGGGCGCAACCAGCGGGCCGAGGCGGCGGCGCCAGCGGCGCCTTAAATAGCATCCAGAGCCGGCGCGGGGCAGGGAGTGGGCTGCAGTGACAGCCGGCGGCGGAGCGGCCGGTCCACGGAGGAGAATTCAGCTTAGAGAACTATCAACACAGGACA"
seqs.append(seq_to_add)
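print(seqs)

['ATCAGACGCGCAGAGGAGGCGGGGCCGCGGCTGGTTTCCTGCCGGGGGGCGGCTCTGGGCCGCCGAGTCCCCTCCTCCCGCCCCTGAGGAGGAGGAGCCGCCGCCACCCGCCGCGCCCGACACCCGGGAGGCCCCGCCAGCCCGCGGGAGAGGCCCAGCGGGAGTCGCGGAACAGCAGGCCCGAGCCCACCGCGCCGGGCCCCGGACGCCGCGCGGAAAAG', 'CTGCTCCGGAGTGACGCGGGCCCGGGCGCGACGGTCTCGGCGGCGGCGGCGGCGGCGACAGAGCGAGCGCGGCGCGGGGCCACC', 'AGAAGGAGGGCGTGGTAATATGAAGTCAGTTCCGGTTGGTGTAAAACCCCCGGGGCGGCGGCGAACTGGCTTTAGATGCTTCTGGGTCGCGGTGTGCTAAGCGAGGAGTCCGAGTGTGTGAGCTTGAGAGCCGCGCGCTAGAGCGACCCGGCGAGGG', 'GGGCGGGGCCGCGGGAGGGCGGGGCCGGCGCGGCGAGCGCACCAGCAGCATCCTGGCTCAGCCGCGGCGGTGGCGGGGGCGCAACCAGCGGGCCGAGGCGGCGGCGCCAGCGGCGCCTTAAATAGCATCCAGAGCCGGCGCGGGGCAGGGAGTGGGCTGCAGTGACAGCCGGCGGCGGAGCGGCCGGTCCACGGAGGAGAATTCAGCTTAGAGAACTATCAACACAGGACA']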


## Write data from list to file

Let's write data from the list to a new file

In [37]:
# Use 'w' to indicate that this file is being opened for writing
file_to_write = open("four_seq.txt",'w')

In [38]:
# Join the sequences in the list back together, separated by newlines, using the join operation
seqs_to_write = '\n'.join(seqs)
print(seqs_to_write)

ATCAGACGCGCAGAGGAGGCGGGGCCGCGGCTGGTTTCCTGCCGGGGGGCGGCTCTGGGCCGCCGAGTCCCCTCCTCCCGCCCCTGAGGAGGAGGAGCCGCCGCCACCCGCCGCGCCCGACACCCGGGAGGCCCCGCCAGCCCGCGGGAGAGGCCCAGCGGGAGTCGCGGAACAGCAGGCCCGAGCCCACCGCGCCGGGCCCCGGACGCCGCGCGGAAAAG
CTGCTCCGGAGTGACGCGGGCCCGGGCGCGACGGTCTCGGCGGCGGCGGCGGCGGCGACAGAGCGAGCGCGGCGCGGGGCCACC
AGAAGGAGGGCGTGGTAATATGAAGTCAGTTCCGGTTGGTGTAAAACCCCCGGGGCGGCGGCGAACTGGCTTTAGATGCTTCTGGGTCGCGGTGTGCTAAGCGAGGAGTCCGAGTGTGTGAGCTTGAGAGCCGCGCGCTAGAGCGACCCGGCGAGGG
GGGCGGGGCCGCGGGAGGGCGGGGCCGGCGCGGCGAGCGCACCAGCAGCATCCTGGCTCAGCCGCGGCGGTGGCGGGGGCGCAACCAGCGGGCCGAGGCGGCGGCGCCAGCGGCGCCTTAAATAGCATCCAGAGCCGGCGCGGGGCAGGGAGTGGGCTGCAGTGACAGCCGGCGGCGGAGCGGCCGGTCCACGGAGGAGAATTCAGCTTAGAGAACTATCAACACAGGACA


In [39]:
# write sequence to file
file_to_write.write(seqs_to_write)

In [40]:
# Remember to always close the file
file_to_write.close()
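The write steps above can be checked by reading the file back in. This is a minimal round-trip sketch under stated assumptions: it uses a short made-up list (`demo_seqs`) and a throwaway file (`demo_seq.txt`, a hypothetical name in a temporary directory) instead of `four_seq.txt`.

```python
import os
import tempfile

demo_seqs = ["AGTAC", "GCCTTTA", "GAT"]
demo_path = os.path.join(tempfile.mkdtemp(), 'demo_seq.txt')

# Opening with 'w' creates the file, or overwrites it if it already exists
out_file = open(demo_path, 'w')
out_file.write('\n'.join(demo_seqs))
out_file.close()

# Read the file back and rebuild the list
in_file = open(demo_path)
seqs_read_back = in_file.read().strip('\n').split('\n')
in_file.close()

# True if the list survived the round trip unchanged
print(seqs_read_back == demo_seqs)
```

Note that `join` places the separator only between elements, so the written file has no newline after the last sequence.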

# In-Class exercise

* Read in the newly written file 'four_seq.txt'
* Calculate the GC content of the last sequence in that file
* Write a new file that has the sentence: "This sequence [sequence] has a [GCcontent]% GC content."
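As a hint for the second step, GC content is taken here to mean the percentage of bases in a sequence that are G or C. The sketch below works on a short made-up sequence (`toy_seq`, not one from the file) and uses the string method `count`; treat it as one possible approach rather than the expected solution.

```python
toy_seq = "ATGCGC"

# Count the G and C bases, then express them as a percentage of the length
gc_count = toy_seq.count('G') + toy_seq.count('C')
gc_percent = 100.0 * gc_count / len(toy_seq)

print(gc_count)     # 4
print(gc_percent)
```

Multiplying by `100.0` rather than `100` keeps the division from being truncated to a whole number in Python 2.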

In [41]:
# Read in file four_seq.txt. Remember to close the file! 

In [42]:
# Strip and split the sequences using newline character into a list

In [43]:
# Get the last sequence from the list

In [44]:
# Calculate the GC content of the sequence

In [45]:
# Create the string that will be written to the new file

In [46]:
# Open file to be written to and write the string to the file. Close the file. 

# Homework

1. Read in sequences from file HW_seq.txt
2. Store these sequences in a list
  * Add these sequences to the first and second positions in the list: AGGACGGGCG, CATGGATGGGTGCAC
  * Add this sequence to the end of the list: AGCTCATGAGCCAGGA
  * Remove the sequence that is in the third position
  * Determine the average length of these sequences
  * Determine the total number of ATGs in these sequences
3. Write the two statements to a single new file, separated by a newline
        E.g. Average length of sequences: 45
             Total number of ATGs: 10