# BMI 565: Bioinformatics Programming & Scripting

#### (C) 2015 Michael Mooney (mooneymi@ohsu.edu)

## Week 1: File Input/Output

1. Basic File I/O
    * Methods for Reading from Files
    * Writing to Files
    * The `with` Statement
2. Parsing Data Files
3. The `csv` Module
4. The `cPickle` Module

#### Requirements
- Python 2.7
- `csv` module
- `cPickle` module

- Data Files
    - `P00533.fasta`
    - `annot_test.txt`

## Basic File I/O

### Opening Files

In [12]:
## Use the open() function to create a file handle object
## The two parameters are the file path and the mode: 
## 'r' = read (default), 'w' = write, 'a' = append
fh = open('P00533.fasta', 'r')

In [13]:
fh

<open file 'P00533.fasta', mode 'r' at 0x102744ed0>

### Methods for Reading from Files

<table align="left">
<tr><td style="text-align:center"><b>Method</b></td><td><b>Description</b></td></tr>
<tr><td style="text-align:center">`fh.read([size])`</td><td>Will read the file up to 'size' bytes. If 'size' is not specified the <br />entire file will be read. Returns a string.</td></tr>
<tr><td style="text-align:center">`fh.readline([size])`</td><td>Will read a single line up to 'size' bytes. If 'size' is not specified the <br />entire line is read. Returns a string.</td></tr>
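<tr><td style="text-align:center">`fh.readlines([sizehint])`</td><td>Will read all remaining lines and return a list containing each line. If 'sizehint' <br />is specified, whole lines totaling approximately 'sizehint' bytes are read.</td></tr>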
</table>
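
The three methods in the table can be compared side by side on a small file. The following is a minimal sketch, assuming a throwaway file named `demo.txt` (a hypothetical name, not one of the course data files) created in a temporary directory. Each call continues reading from where the previous one stopped.

```python
import os
import shutil
import tempfile

# Hypothetical scratch file, created only for this illustration
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'demo.txt')
fh = open(path, 'w')
fh.write('line one\nline two\nline three\n')
fh.close()

fh = open(path, 'r')
first_four = fh.read(4)       # at most 4 characters from the start of the file
rest_of_line = fh.readline()  # the remainder of the current line
remaining = fh.readlines()    # a list of all lines not yet read
fh.close()

# Remove the scratch directory
shutil.rmtree(tmp_dir)
```

Here `first_four` is `'line'`, `rest_of_line` is `' one\n'`, and `remaining` holds the last two lines with their newline characters intact.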

In [14]:
## Read the entire fasta file
data = fh.read()
print data

>sp|P00533|EGFR_HUMAN Epidermal growth factor receptor OS=Homo sapiens GN=EGFR PE=1 SV=2
MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQLGTFEDHFLSLQRMFNNCEV
VLGNLEITYVQRNYDLSFLKTIQEVAGYVLIALNTVERIPLENLQIIRGNMYYENSYALA
VLSNYDANKTGLKELPMRNLQEILHGAVRFSNNPALCNVESIQWRDIVSSDFLSNMSMDF
QNHLGSCQKCDPSCPNGSCWGAGEENCQKLTKIICAQQCSGRCRGKSPSDCCHNQCAAGC
TGPRESDCLVCRKFRDEATCKDTCPPLMLYNPTTYQMDVNPEGKYSFGATCVKKCPRNYV
VTDHGSCVRACGADSYEMEEDGVRKCKKCEGPCRKVCNGIGIGEFKDSLSINATNIKHFK
NCTSISGDLHILPVAFRGDSFTHTPPLDPQELDILKTVKEITGFLLIQAWPENRTDLHAF
ENLEIIRGRTKQHGQFSLAVVSLNITSLGLRSLKEISDGDVIISGNKNLCYANTINWKKL
FGTSGQKTKIISNRGENSCKATGQVCHALCSPEGCWGPEPRDCVSCRNVSRGRECVDKCN
LLEGEPREFVENSECIQCHPECLPQAMNITCTGRGPDNCIQCAHYIDGPHCVKTCPAGVM
GENNTLVWKYADAGHVCHLCHPNCTYGCTGPGLEGCPTNGPKIPSIATGMVGALLLLLVV
ALGIGLFMRRRHIVRKRTLRRLLQERELVEPLTPSGEAPNQALLRILKETEFKKIKVLGS
GAFGTVYKGLWIPEGEKVKIPVAIKELREATSPKANKEILDEAYVMASVDNPHVCRLLGI
CLTSTVQLITQLMPFGCLLDYVREHKDNIGSQYLLNWCVQIAKGMNYLEDRRLVHRDLAA
RNVLVKTPQHVKITDFGLAKLLGAEEKEYHAEGGKVPIKWMALESILHRIYTHQSDV

### Closing the File

In [None]:
fh.close()

In [None]:
## Reading from a closed file raises a ValueError
fh.read()
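
What happens after closing can be checked without crashing the session. This is a small sketch, assuming a scratch file `demo.txt` (a hypothetical name) in a temporary directory; it uses the file object's `closed` attribute and catches the `ValueError` raised by reading from a closed file.

```python
import os
import shutil
import tempfile

# Hypothetical scratch file, created only for this illustration
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'demo.txt')
fh = open(path, 'w')
fh.write('ACGT\n')
fh.close()

fh = open(path)
fh.close()
is_closed = fh.closed          # True once close() has been called

try:
    fh.read()
    error_message = None
except ValueError as err:
    # Raised because the file handle is no longer usable
    error_message = str(err)

# Remove the scratch directory
shutil.rmtree(tmp_dir)
```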

### Methods for Writing to Files

<table align="left">
<tr><td style="text-align:center"><b>Method</b></td><td><b>Description</b></td></tr>
<tr><td style="text-align:center">`fh.write(str)`</td><td>Writes the string `str` to the file.</td></tr>
<tr><td style="text-align:center">`fh.writelines(iterable)`</td><td>Writes each element of an iterable object (that produces strings)<br />to the file. It does not add line separators.</td></tr>
</table>
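
Because neither method adds line separators, newline characters have to be supplied explicitly. The sketch below illustrates the difference, assuming a scratch file `demo_out.txt` (a hypothetical name) in a temporary directory.

```python
import os
import shutil
import tempfile

# Hypothetical scratch file, created only for this illustration
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'demo_out.txt')

words = ['Hello', 'World']

fh = open(path, 'w')
fh.writelines(words)                     # no separators: writes 'HelloWorld'
fh.write('\n')                           # end the first line explicitly
fh.writelines(w + '\n' for w in words)   # one word per line
fh.close()

# Read the file back to see what was written
fh = open(path)
contents = fh.read()
fh.close()

# Remove the scratch directory
shutil.rmtree(tmp_dir)
```

After running this, `contents` is `'HelloWorld\nHello\nWorld\n'`.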

In [7]:
## Use the open() function to create a file handle object
## For writing use the 'a' or 'w' mode
## Note: using the 'w' mode will overwrite an existing file
fh = open('new_file.txt', 'w')

In [8]:
## writelines() does not add newlines, so the file will
## contain 'HelloWorld' on a single line
lines = ['Hello', 'World']
fh.writelines(lines)

In [9]:
fh.close()

### The 'with' Statement: A Context Manager

The `with` statement runs a block of code under the control of a context manager: an object that defines special `__enter__()` and `__exit__()` methods, which are called when entering and exiting the `with` block. For a file handle, `__exit__()` calls `fh.close()`, which ensures that the file is closed after exiting the block, even if an exception is raised inside it.
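
To make the protocol concrete, here is a minimal sketch of a custom context manager. The `Tracker` class is hypothetical and exists only for illustration; it records when `__enter__()` and `__exit__()` are called, and shows that `__exit__()` runs even when the block raises an exception.

```python
class Tracker(object):
    """Hypothetical context manager that records when it is entered and exited."""

    def __init__(self):
        self.events = []

    def __enter__(self):
        self.events.append('enter')
        return self    # this value is bound to the name after 'as'

    def __exit__(self, exc_type, exc_value, traceback):
        self.events.append('exit')
        return False   # a false value means exceptions are not suppressed


tracker = Tracker()
try:
    with tracker as t:
        t.events.append('body')
        raise ValueError('something went wrong')
except ValueError:
    pass

# tracker.events is now ['enter', 'body', 'exit']
```

A file handle behaves the same way, with closing the file taking the place of the `'exit'` bookkeeping.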

In [10]:
## Using a context manager ensures the file is closed
## at the end of the 'with' block
with open('P00533.fasta') as fh:
    lines = []
    for line in fh:
        lines.append(line.rstrip())

## Print the first line
lines[0]

'>sp|P00533|EGFR_HUMAN Epidermal growth factor receptor OS=Homo sapiens GN=EGFR PE=1 SV=2'

In [11]:
lines[1]

'MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQLGTFEDHFLSLQRMFNNCEV'

## Parsing Data Files

### Example 1: FASTA File

In [15]:
## Open the fasta file
fh = open('P00533.fasta', 'r')
## Read the first line of the file
seq_description = fh.readline()
seq_description

'>sp|P00533|EGFR_HUMAN Epidermal growth factor receptor OS=Homo sapiens GN=EGFR PE=1 SV=2\n'

In [16]:
## Clean up the first line to get the sequence description
seq_description = seq_description.rstrip()[1:]
seq_description

'sp|P00533|EGFR_HUMAN Epidermal growth factor receptor OS=Homo sapiens GN=EGFR PE=1 SV=2'

In [17]:
## The file handle object is an iterable object and can
## be used to iterate through the lines of a file.
## The list comprehension below creates a list containing
## each line of the file with the trailing newline 
## character removed.
seq_list = [line.rstrip() for line in fh]
seq_list[0]

'MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQLGTFEDHFLSLQRMFNNCEV'

In [20]:
seq_list[0:2]

['MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQLGTFEDHFLSLQRMFNNCEV',
 'VLGNLEITYVQRNYDLSFLKTIQEVAGYVLIALNTVERIPLENLQIIRGNMYYENSYALA']

In [21]:
## Join the elements of the list to get the entire sequence
seq = ''.join(seq_list)
seq

'MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQLGTFEDHFLSLQRMFNNCEVVLGNLEITYVQRNYDLSFLKTIQEVAGYVLIALNTVERIPLENLQIIRGNMYYENSYALAVLSNYDANKTGLKELPMRNLQEILHGAVRFSNNPALCNVESIQWRDIVSSDFLSNMSMDFQNHLGSCQKCDPSCPNGSCWGAGEENCQKLTKIICAQQCSGRCRGKSPSDCCHNQCAAGCTGPRESDCLVCRKFRDEATCKDTCPPLMLYNPTTYQMDVNPEGKYSFGATCVKKCPRNYVVTDHGSCVRACGADSYEMEEDGVRKCKKCEGPCRKVCNGIGIGEFKDSLSINATNIKHFKNCTSISGDLHILPVAFRGDSFTHTPPLDPQELDILKTVKEITGFLLIQAWPENRTDLHAFENLEIIRGRTKQHGQFSLAVVSLNITSLGLRSLKEISDGDVIISGNKNLCYANTINWKKLFGTSGQKTKIISNRGENSCKATGQVCHALCSPEGCWGPEPRDCVSCRNVSRGRECVDKCNLLEGEPREFVENSECIQCHPECLPQAMNITCTGRGPDNCIQCAHYIDGPHCVKTCPAGVMGENNTLVWKYADAGHVCHLCHPNCTYGCTGPGLEGCPTNGPKIPSIATGMVGALLLLLVVALGIGLFMRRRHIVRKRTLRRLLQERELVEPLTPSGEAPNQALLRILKETEFKKIKVLGSGAFGTVYKGLWIPEGEKVKIPVAIKELREATSPKANKEILDEAYVMASVDNPHVCRLLGICLTSTVQLITQLMPFGCLLDYVREHKDNIGSQYLLNWCVQIAKGMNYLEDRRLVHRDLAARNVLVKTPQHVKITDFGLAKLLGAEEKEYHAEGGKVPIKWMALESILHRIYTHQSDVWSYGVTVWELMTFGSKPYDGIPASEISSILEKGERLPQPPICTIDVYMIMVKCWMIDADSRPKFRELIIEFSKMARDPQRYLVIQGDERMHLPSPTDSNFYR

In [22]:
fh.close()

\* Note: Ideally, all of the commands above for Example 1 would go inside a `with` block, which makes the explicit `fh.close()` unnecessary. The same applies to the other examples below.
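
One way the Example 1 steps might look inside a `with` block is sketched below. The function name `read_single_fasta` is hypothetical, and the sketch assumes the file holds exactly one record. It runs here on a made-up toy record written to a temporary file, not on `P00533.fasta`.

```python
import os
import shutil
import tempfile


def read_single_fasta(path):
    """Return (description, sequence) for a FASTA file holding one record."""
    with open(path) as fh:
        # First line: drop the trailing newline and the leading '>'
        description = fh.readline().rstrip()[1:]
        # Remaining lines: strip newlines and join into one string
        sequence = ''.join(line.rstrip() for line in fh)
    return description, sequence


# Made-up toy record, written to a scratch file for this illustration
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'toy.fasta')
with open(path, 'w') as out:
    out.write('>toy_record made-up example\nMRPSG\nTAGAA\n')

description, sequence = read_single_fasta(path)

# Remove the scratch directory
shutil.rmtree(tmp_dir)
```

With the toy file, `description` is `'toy_record made-up example'` and `sequence` is `'MRPSGTAGAA'`.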

### Example 2: Microarray Annotations

In [23]:
## Open the file
fh = open('annot_test.txt')

In [24]:
## Get the first line and remove the newline character
first_line = fh.readline()
first_line = first_line.rstrip()
first_line

'ProbeID\tPrimaryAccession\tRefSeqAccession\tGenbankAccession\tUniGeneID\tEntrezGeneID\tGeneSymbol\tGeneName\tEnsemblID\tTIGRID\tGO\tDescription\tGenomicCoordinates\tCytoband'

In [25]:
## Get the column names
column_names = first_line.split('\t')
column_names

['ProbeID',
 'PrimaryAccession',
 'RefSeqAccession',
 'GenbankAccession',
 'UniGeneID',
 'EntrezGeneID',
 'GeneSymbol',
 'GeneName',
 'EnsemblID',
 'TIGRID',
 'GO',
 'Description',
 'GenomicCoordinates',
 'Cytoband']

In [26]:
## Read the rest of the file and create a list of lists.
## For each line of the file, remove the newline character
## and split the line to get the data in each column.
data = [line.rstrip().split('\t') for line in fh]
data[0]

['A_23_P253586',
 'NM_005128',
 'NM_005128',
 'NM_005128',
 'Hs.204575',
 '9980',
 'DOPEY2',
 'dopey family member 2',
 'ENST00000270190',
 'THC2471394',
 'GO:0000139(Golgi membrane)|GO:0003674(molecular_function)|GO:0006895(Golgi to endosome transport)|GO:0007029(endoplasmic reticulum organization and biogenesis)|GO:0007275(multicellular organismal development)',
 'Homo sapiens dopey family member 2 (DOPEY2), mRNA [NM_005128]',
 'chr21:36586364-36587509',
 'hs|21q22.12']

In [27]:
data[1]

['A_23_P217507',
 'NM_004729',
 'NM_004729',
 'NM_004729',
 'Hs.131452',
 '9189',
 'ZBED1',
 'zinc finger, BED-type containing 1',
 'ENST00000381222',
 'THC2461273',
 'GO:0000228(nuclear chromosome)|GO:0003677(DNA binding)|GO:0004803(transposase activity)|GO:0005634(nucleus)|GO:0008270(zinc ion binding)|GO:0046872(metal ion binding)|GO:0046983(protein dimerization activity)',
 'Homo sapiens zinc finger, BED-type containing 1 (ZBED1), mRNA [NM_004729]',
 'chrY:2415173-2415114',
 'hs|Yp11.31']

In [28]:
## Close the file
fh.close()
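
One caveat about the parsing approach in Example 2: `rstrip()` with no argument removes all trailing whitespace, including tabs, so a row whose last columns are empty ends up with fewer fields than the header. The sketch below uses a made-up line to show the difference; passing `'\n'` to `rstrip()` is one possible way to keep the empty columns.

```python
# Made-up line whose last two columns are empty
line = 'probe_1\tGENE1\t\t\n'

# rstrip() also strips the trailing tabs, so the empty columns are lost
stripped_all = line.rstrip().split('\t')

# rstrip('\n') removes only the newline, so all four columns survive
stripped_newline = line.rstrip('\n').split('\t')
```

Here `stripped_all` has two elements, while `stripped_newline` is `['probe_1', 'GENE1', '', '']`.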

## The csv Module

The `csv` module can be used to read and parse delimited files.
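
One reason to prefer `csv` over a plain `split()` is that it understands quoted fields. The sketch below uses a made-up line in which a quoted field contains the delimiter. It relies on the fact that `csv.reader()` accepts any iterable of strings, so a list can stand in for a file handle.

```python
import csv

# Made-up tab-delimited line whose second field contains a tab inside quotes
line = 'probe_1\t"kinase\treceptor"\tchr1'

# A plain split breaks the quoted field in two
naive = line.split('\t')

# csv.reader keeps the quoted field together and removes the quotes
parsed = next(csv.reader([line], delimiter='\t'))
```

Here `naive` has four elements, while `parsed` is `['probe_1', 'kinase\treceptor', 'chr1']`.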

In [29]:
import csv
fh = open('annot_test.txt')

In [30]:
## Read the file with the csv.reader() function.
## The csv.reader object is an iterator that splits
## each line of the file (using the specified delimiter)
## and returns it as a list.
## The command below will read the entire file into a list of lists
data = list(csv.reader(fh, delimiter='\t'))

## Print the list created from the first line of the file
data[0]

['ProbeID',
 'PrimaryAccession',
 'RefSeqAccession',
 'GenbankAccession',
 'UniGeneID',
 'EntrezGeneID',
 'GeneSymbol',
 'GeneName',
 'EnsemblID',
 'TIGRID',
 'GO',
 'Description',
 'GenomicCoordinates',
 'Cytoband']

In [31]:
## Print the list created from the second line of the file
data[1]

['A_23_P253586',
 'NM_005128',
 'NM_005128',
 'NM_005128',
 'Hs.204575',
 '9980',
 'DOPEY2',
 'dopey family member 2',
 'ENST00000270190',
 'THC2471394',
 'GO:0000139(Golgi membrane)|GO:0003674(molecular_function)|GO:0006895(Golgi to endosome transport)|GO:0007029(endoplasmic reticulum organization and biogenesis)|GO:0007275(multicellular organismal development)',
 'Homo sapiens dopey family member 2 (DOPEY2), mRNA [NM_005128]',
 'chr21:36586364-36587509',
 'hs|21q22.12']

In [32]:
## Close the file
fh.close()

### The csv.DictReader() Class

In [33]:
fh = open('annot_test.txt')

In [34]:
## Read the file with the csv.DictReader() class.
## csv.DictReader() will split each line of the file
## using the specified delimiter, and will return a
## dictionary for each line, where the dictionary keys
## are the column names (taken from the first line of the file).
data = list(csv.DictReader(fh, delimiter='\t'))

## Print the dictionary created from the first data line
## (the second line of the file)
data[0]

{'Cytoband': 'hs|21q22.12',
 'Description': 'Homo sapiens dopey family member 2 (DOPEY2), mRNA [NM_005128]',
 'EnsemblID': 'ENST00000270190',
 'EntrezGeneID': '9980',
 'GO': 'GO:0000139(Golgi membrane)|GO:0003674(molecular_function)|GO:0006895(Golgi to endosome transport)|GO:0007029(endoplasmic reticulum organization and biogenesis)|GO:0007275(multicellular organismal development)',
 'GenbankAccession': 'NM_005128',
 'GeneName': 'dopey family member 2',
 'GeneSymbol': 'DOPEY2',
 'GenomicCoordinates': 'chr21:36586364-36587509',
 'PrimaryAccession': 'NM_005128',
 'ProbeID': 'A_23_P253586',
 'RefSeqAccession': 'NM_005128',
 'TIGRID': 'THC2471394',
 'UniGeneID': 'Hs.204575'}

In [35]:
fh.close()
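
The header handling can be seen more directly on a small input. This is a minimal sketch using made-up rows (not the contents of `annot_test.txt`). The reader's `fieldnames` attribute holds the column names taken from the first line, and each later line becomes one dictionary.

```python
import csv

# Made-up tab-delimited lines: a header followed by two data rows
lines = [
    'ProbeID\tGeneSymbol\tCytoband',
    'probe_1\tGENE1\tband_a',
    'probe_2\tGENE2\tband_b',
]

reader = csv.DictReader(lines, delimiter='\t')
header = reader.fieldnames           # column names from the first line
rows = list(reader)                  # one dictionary per data row
second_symbol = rows[1]['GeneSymbol']
```

Here `header` is `['ProbeID', 'GeneSymbol', 'Cytoband']`, `rows` has two entries, and `second_symbol` is `'GENE2'`.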

### Writing to Files with csv

In [36]:
## First, get the data
fh = open('annot_test.txt')
data = list(csv.reader(fh, delimiter='\t'))
fh.close()

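## Open a file for writing
## (in Python 2, the csv module expects binary mode, 'wb',
## which avoids extra blank lines on Windows)
fh = open('new_file.txt', 'wb')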

In [37]:
## Create a csv.writer() object
w = csv.writer(fh, delimiter='\t')
## The writerow() method will create a delimited string
## from a list and will write that string to the file
w.writerow(data[0])
## The writerows() method will do the same as above, but requires
## a list of lists and will write multiple lines
w.writerows(data[1:])

In [38]:
fh.close()

### A Useful Function: `zip()`
`zip(x, y)` takes two lists and returns a list of tuples, where the ith tuple contains the ith element of each original list. If the lists differ in length, the result is truncated to the length of the shorter list.

In [39]:
x = [1,2,3]
y = [4,8,10]
z = zip(x, y)
z

[(1, 4), (2, 8), (3, 10)]
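
A common use of `zip()` when parsing files is pairing column names with the values of a row. The sketch below uses made-up values. The `list()` calls are only needed under Python 3, where `zip()` returns an iterator, and are harmless under Python 2.

```python
# Made-up column names and one row of values
column_names = ['ProbeID', 'GeneSymbol', 'Cytoband']
row = ['probe_1', 'GENE1', 'band_a']

# Pair each column name with its value and build a dictionary
record = dict(zip(column_names, row))

# zip(*pairs) reverses the operation ("unzips" a list of tuples)
pairs = list(zip([1, 2, 3], [4, 8, 10]))
xs, ys = zip(*pairs)

# With inputs of unequal length, the result stops at the shorter one
uneven = list(zip([1, 2, 3], ['a', 'b']))
```

Here `record['GeneSymbol']` is `'GENE1'`, `xs` and `ys` are the tuples `(1, 2, 3)` and `(4, 8, 10)`, and `uneven` is `[(1, 'a'), (2, 'b')]`.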

## Saving Python Objects to File with cPickle

In [40]:
import cPickle

## Open a file in write mode
fh = open('pickle_data.dat', 'w')

## Save a python object to the file
cPickle.dump(z, fh)

## Close the file
fh.close()

In [41]:
## Load a python object from a cPickle data file
fh = open('pickle_data.dat')
dat = cPickle.load(fh)

## Close the file
fh.close()

In [42]:
dat

[(1, 4), (2, 8), (3, 10)]
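
The same round trip can be written with `with` blocks and binary file modes, which are needed for the more compact binary pickle protocols. This is a sketch under stated assumptions: the file name `pickle_demo.dat` and the stored object are made up, and the `try`/`except` import is only there so that the code also runs under Python 3, where `cPickle` has been merged into `pickle`.

```python
import os
import shutil
import tempfile

try:
    import cPickle as pickle   # Python 2
except ImportError:
    import pickle              # Python 3

# Made-up object to store
obj = {'pairs': [(1, 4), (2, 8), (3, 10)], 'gene': 'EGFR'}

# Hypothetical scratch file in a temporary directory
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'pickle_demo.dat')

# Binary mode ('wb'/'rb') is required for binary pickle protocols
with open(path, 'wb') as fh:
    pickle.dump(obj, fh, pickle.HIGHEST_PROTOCOL)

with open(path, 'rb') as fh:
    restored = pickle.load(fh)

# Remove the scratch directory
shutil.rmtree(tmp_dir)
```

The restored object is equal to the original but is a separate copy. Note that pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.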

## In-Class Exercises

In [None]:
## Exercise 1.
## Parse the microarray annotation file and create 
## lists of Probe IDs and Gene Symbols
## You'll need to have the 'annot_test.txt' file in the current directory


In [None]:
## Exercise 2.
## Iterate through the microarray annotations and
## write, to a new file, only annotations for genes
## that start with 'D'.
## You'll need to have the 'annot_test.txt' file in the current directory



#### Last Updated: 22-Sep-2015