#Reading from files

An important component of any scientific program is the ability to read data from and write data to files.  Python's built-in file utilities make this process very easy.  Files are accessed by creating file objects, whose member functions provide methods for reading from and/or writing to a file.  To create a file object, use the file function.  Consider a generic file data.txt whose contents are:

    #pressure temperature average_energy
    1.0 1.0 -50.0
    1.0 1.5 -27.8
    2.0 1.0 -14.5
    2.0 1.5 -11.2

The following example creates an open file object to data.txt and reads all of its contents into a string:

In [11]:
>>> f = file("data.txt", "r")
>>> s = f.read()
>>> f.close()
>>> s

'#pressure temperature average_energy\r\n1.0 1.0 -50.0\r\n1.0 1.5 -27.8\r\n2.0 1.0 -14.5\r\n2.0 1.5 -11.2\r\n'

In [12]:
>>> print s

#pressure temperature average_energy
1.0 1.0 -50.0
1.0 1.5 -27.8
2.0 1.0 -14.5
2.0 1.5 -11.2



Here, the file function took two arguments, the name of the file followed by a string "r" indicating that we are opening the file for reading.  It is possible to omit the second argument, in which case Python defaults to "r"; however, it is usually a good idea to include it explicitly for programming clarity.  One can also use the command open, which is a synonym for file.


Once the file object is created, we can read its entire contents into a string using the read() member function.  Python reads all characters from the file, including whitespace and line breaks (e.g., "\n" entries).  


When we have read the contents, we invoke the close function, which terminates the operating system link to our file.  It is good to close files after using them, as open file objects consume system resources.  In any case, Python's garbage collecting routines automatically close an open file once no variables point to it any longer.  Consider the following:

In [13]:
>>> s = file("data.txt", "r").read()

This command accomplishes the same result as the last example, but does not create a file object that persists after execution.  That is, file first creates the object, then read() extracts its contents and places them into the variable s.  After this operation, there are no variables pointing to the file object anymore, and so it is automatically closed.  File objects created within functions (that are not returned) are also always closed upon exiting the function, since variables created within functions are deleted upon exit.
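Relying on garbage collection to close files works in the standard CPython interpreter, but other implementations may delay the cleanup.  A with block makes the closing explicit.  The following is a minimal sketch, written with open rather than file so that it also runs under Python 3; the temporary file and its contents are stand-ins for data.txt.

```python
import os
import tempfile

# Create a small stand-in for data.txt in a temporary directory (hypothetical contents).
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "data.txt")
with open(path, "w") as f:
    f.write("#pressure temperature average_energy\n1.0 1.0 -50.0\n")

# The file is closed automatically when the with block ends,
# even if an error occurs inside it.
with open(path, "r") as f:
    s = f.read()

print(f.closed)   # the variable f still exists, but the file is closed
print(s)
```

The with form costs one extra line compared with the one-line read, but it does not depend on when the interpreter decides to discard the file object.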

One does not have to read all of the contents into a string.  We can also read all of the contents into a list of lines in the file:

In [14]:
>>> l = file("data.txt", "r").readlines()
>>> l

['#pressure temperature average_energy\r\n',
 '1.0 1.0 -50.0\r\n',
 '1.0 1.5 -27.8\r\n',
 '2.0 1.0 -14.5\r\n',
 '2.0 1.5 -11.2\r\n']

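Operationally, the readlines() function is similar to read().split('\n'), with two differences: readlines() keeps the line break characters at the end of each line, as the output above shows, and it does not produce a trailing empty string when the file ends with a line break.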
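A quick way to see how readlines() relates to splitting the result of read() is to compare the two on a small file.  The sketch below uses a temporary file with two hypothetical data lines, and open rather than file so that it also runs under Python 3.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "data.txt")   # hypothetical location
with open(path, "w") as f:
    f.write("1.0 1.0 -50.0\n1.0 1.5 -27.8\n")

with open(path, "r") as f:
    lines_a = f.readlines()          # keeps the trailing "\n" on each line
with open(path, "r") as f:
    text = f.read()
lines_b = text.split("\n")           # drops "\n" and leaves a trailing empty string
lines_c = text.splitlines(True)      # keeps the line breaks, like readlines()

print(lines_a)
print(lines_b)
print(lines_a == lines_c)
```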


If the file is large, it might not be efficient to read all of its contents at one time.  Instead, we can read one line at a time using the readline function:

In [38]:
>>> f = file("data.txt", "r")
>>> s = "dummy"
>>> while len(s):
...   s = f.readline()
...   if not s.startswith("#"): print s.strip()

1.0 1.0 -50.0
1.0 1.5 -27.8
2.0 1.0 -14.5
2.0 1.5 -11.2



This example prints out all of the lines that do not start with "#".  The while loop continues as long as the last readline() command returns a string of length greater than zero.  When Python reaches the end of a file, readline will return an empty string.  It is important to know that readline() returns an entire line including the line break character '\n' at the end; in this way, a blank line will return a string of nonzero length.  It is also for that reason that we used the strip() function when printing out the lines in the example above.


The read and readline functions can also take an optional argument size that sets the maximum number of characters (bytes) that Python will read in at a time.  Subsequent calls move through the file until the end of the file is reached, at which point Python will return an empty string:

In [15]:
>>> f = file("data.txt", "r")
>>> f.read(5)

'#pres'

In [16]:
>>> f.read(5)

'sure '

In [17]:
>>> f.close()

The seek function can be used to move to a specific byte location in a file.  Similarly, the tell function will indicate the current byte position within the file:

In [18]:
>>> f = file("data.txt", "r")
>>> f.seek(5)
>>> f.read(5)

'sure '

In [19]:
>>> f.tell()

10

In some Python 2 installations the result is displayed as 10L.  The 'L' after the number simply indicates that the returned type is a long integer, since normal integers do not have enough range to address all of the bytes in large files.


We end with an example that illustrates some of the elegant ways in which Python can handle files.  Imagine we would like to parse the data in the file above into a list called Data such that:

In [46]:
Data = [[1.0, 1.0, -50.0], [1.0, 1.5, -27.8], [2.0, 1.0, -14.5], [2.0, 1.5, -11.2]]

Here, we need to read the data (ignoring the comment), convert it to floats, and structure it into a list.  New Python programmers might take an approach similar to the manner in which this would be accomplished in other languages:

In [20]:
>>> f = file("data.txt", "r")
>>> Data = []
>>> line = f.readline()
>>> while len(line):
...   if not line.startswith("#"):
...     l = line.split()
...     Pres = float(l[0])
...     Temp = float(l[1])
...     Ene = float(l[2])
...     Data.append([Pres, Temp, Ene])
...   line = f.readline()

In [21]:
>>> f.close()

We could shorten the program by using the readlines function and by moving the file object creation into the loop itself:

In [22]:
>>> Data = []
>>> for line in file("data.txt", "r").readlines():
...   if not line.startswith("#"):
...     l = line.split()
...     Pres = float(l[0])
...     Temp = float(l[1])
...     Ene = float(l[2])
...     Data.append([Pres, Temp, Ene])

Ultimately, however, we can make these operations much more compact using Python's list comprehensions:

In [23]:
>>> Data = [[float(x) for x in line.split()]
...         for line in file("data.txt", "r").readlines()
...         if not line.startswith("#")]

Here, we use two nested list comprehensions: the inner one loops over columns in each line, and the outer one over lines in the file with a filter established by the if statement.

#Writing to files

Writing data to a file is very simple.  To begin writing to a new file, open a file object with the "w" flag:

In [53]:
>>> f = file("new.txt", "w")
>>> f.write("This is the first line.")
>>> f.write("  Still on the first line.")
>>> f.write("\nThis is the second line.")
>>> f.close()

Here, the contents of our file new.txt would look like:

    This is the first line.  Still on the first line.
    This is the second line.

The write flag "w" tells Python to create a new file ready for writing, and the function write will write a string verbatim to the current position within the file.  Subsequent write statements therefore append data to the file.  Notice that write writes the string text explicitly and so line breaks must be specified in the strings if desired in the file.

If the "w" flag is used on a file that already exists, Python will overwrite it completely.  Alternatively, one can append data to an existing file using the "a" flag:

In [54]:
>>> f = file("new.txt", "a")
>>> f.write("\nThis is the third line.")
>>> f.close()

Our file would now look like:

    This is the first line.  Still on the first line.
    This is the second line.
    This is the third line.

The write function only accepts strings.  That means that numeric values must be converted to strings prior to writing to the file.  This can be accomplished using the str function, which formats values into a default precision, or using string formatting:

In [55]:
>>> f = file("new.txt", "w")
>>> pi = 3.14159
>>> f.write(str(pi))
>>> f.write('\n')
>>> f.write("%.2f" % pi)
>>> f.close()
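String formatting combines naturally with a loop when writing a table of numbers.  The sketch below writes two rows of the earlier data set back to disk, one row per line; the output location is a hypothetical temporary file, and open is used instead of file so that the example also runs under Python 3.

```python
import os
import tempfile

# Example values taken from the data file used earlier in this section.
Data = [[1.0, 1.0, -50.0], [1.0, 1.5, -27.8]]

path = os.path.join(tempfile.mkdtemp(), "new.txt")   # hypothetical output location
with open(path, "w") as f:
    f.write("#pressure temperature average_energy\n")
    for row in Data:
        # Format each float with one decimal place and join the columns with spaces.
        f.write(" ".join(["%.1f" % x for x in row]) + "\n")

with open(path, "r") as f:
    written = f.read()
print(written)
```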

#Binary data and compressed files

When storing numeric data, writing values in textual format is inefficient, because a textual version of a float at full precision requires many more characters than the number of bytes needed to hold the value in memory.  There are two approaches to writing numeric data more efficiently, both of which result in smaller file sizes.


The first approach is not to store values in a legible format but to write them to the file in a way similar to their representation in memory.  To do so, we must convert a value to a binary representation in string format.  The struct module can be used for this purpose.  However, there are some subtleties to the different data types (struct uses C, rather than Python, types) that can make this approach a bit confusing.
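As a brief illustration of the struct approach, the following sketch packs three floats into a binary string and recovers them.  The values are arbitrary examples, and the format string "<3d" (little-endian, three C doubles) is one choice among several that struct supports.

```python
import struct

values = (1.0, 1.5, -27.8)

# "<3d" means: little-endian byte order, three C doubles (8 bytes each).
packed = struct.pack("<3d", *values)
print(len(packed))                 # 24 bytes for three doubles

# Unpacking with the same format string recovers the original floats exactly.
unpacked = struct.unpack("<3d", packed)
print(unpacked == values)
```

The packed string can be written to a file opened in binary mode ("wb") and read back with the same format string.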


The second approach is to write to, and subsequently also read from, compressed files.  In this way, numeric data written in human-readable form can be compressed to take up much less space on disk.  This approach is sometimes more convenient because numeric values can still be read by human eyes when data files are decompressed by various utilities outside of Python.  


Conveniently, Python comes with modules that enable one to read and write a number of popular compressed formats in an almost completely transparent manner.  Two formats are recommended: the Gzip format, which achieves reasonable compression and is fast, and the Bzip2 format, which achieves higher compression but at the expense of speed.  Both formats are standardized, open, and readable by most common decompression programs.  Both are also single-file based, meaning that each compresses a single file rather than a cabinet or archive of multiple files, which would complicate things.


To write to a new Gzip file, we import the gzip module and create a GzipFile object in a manner identical to the way we created a file object:

In [24]:
>>> import gzip
>>> f = gzip.GzipFile("data.txt.gz", "w")
>>> f.write("This is some test data for compression.")
>>> f.close()
>>> print gzip.GzipFile("data.txt.gz", "r").read()

This is some test data for compression.


Here, Python takes care of compression (and decompression) entirely behind the scenes.  The only difference from our earlier efforts is that we have replaced the file function with the gzip.GzipFile call and we have given the extension ".gz" to the file we create, in order to indicate that it is a compressed file.  In fact, gzip objects behave like file objects and implement the same functions (read, readline, readlines, write).  This makes storing data in a compressed format very easy and transparent.  One minor exception, however, is that the seek and tell functions do not work exactly the same and should be avoided with compressed files.
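The session above is written for Python 2.  Under Python 3, a GzipFile object reads and writes bytes rather than text, so one option is the gzip.open function with the text modes "wt" and "rt".  The sketch below assumes Python 3 and writes to a hypothetical temporary location.

```python
import gzip
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "data.txt.gz")   # hypothetical location

# "wt" and "rt" open the compressed file in text mode (Python 3).
with gzip.open(path, "wt") as f:
    f.write("This is some test data for compression.")

with gzip.open(path, "rt") as f:
    text = f.read()
print(text)
```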


The bz2 module works in exactly the same manner:

In [25]:
>>> import bz2
>>> f = bz2.BZ2File("data.txt.bz2", "w")
>>> f.write("This is some test data for compression.")
>>> f.close()
>>> print bz2.BZ2File("data.txt.bz2", "r").read()

This is some test data for compression.


In general, compression is only recommended for datasets on disk that are large (e.g., > 1MB) and that are read or written only a few times during a program.  For disk-intensive programs that are speed-limited by the rate at which they can read and write to disk, compression will incur a considerable computational overhead and it is probably best to work with an uncompressed file.  In these latter cases, the large datasets can ultimately be compressed by outside utilities after all programs and analyses have been performed.  

#File system functions

Python offers a host of other modules and functions for accessing and manipulating files and directories on disk.  Files and directories are specified by strings.  Python recognizes directory hierarchies using the forward slash character, regardless of the particular operating system (Windows, Linux, or MacOS).  On Windows machines, it is also possible to use the backslash character; however, in strings this must be escaped as '\\', since a single '\' normally tells Python that a special code is being used.  For example, both of the following point to the same file on a Windows machine:

    "c:/temp/file.txt"
    "c:\\temp\\file.txt"

The os module contains a large number of useful file functions.  In particular, the sub-module os.path provides a number of functions for manipulating path and file names.  For example, a filename with a path can be split into various parts:

In [35]:
>>> import os
>>> p = "c:/temp/file.txt"
>>> os.path.basename(p)

'file.txt'

In [60]:
>>> os.path.dirname(p)

'c:/temp'

In [61]:
>>> os.path.split(p)

('c:/temp', 'file.txt')

The opposite of the split function is the join function.  It is a good idea to always use join when combining pathnames with other pathnames or files, since join takes care of any operating-system specific actions.  join can take any number of arguments:

In [36]:
>>> os.path.join("c:\\temp", "file.txt")

'c:\\temp/file.txt'

In [39]:
>>> os.path.join("c:\\", "temp", "file.txt")

'c:\\/temp/file.txt'

If the path name is not absolute but relative to the current directory, there is a function for returning the absolute version:

In [68]:
>>> os.path.abspath("/temp/file.txt")

'/temp/file.txt'

Several functions enable testing the existence and type of files and directories:

In [40]:
>>> p = 'c:/temp/file.txt'
>>> os.path.exists(p)

False

In [41]:
>>> os.path.isfile(p)

False

In [42]:
>>> os.path.isdir(p)

False

Here, the isfile and isdir functions test for both the existence of the object and its type.

One can get the size on disk (in bytes) of a file:

In [50]:
>>> os.path.getsize('/Users/kevinhoang/Desktop/data.txt')

98

The current working directory can be obtained with getcwd and changed with chdir:

In [43]:
>>> os.getcwd()

'/Users/kevinhoang/Desktop'

In [70]:
>>> os.chdir("..")
>>> os.getcwd()

'/Users/kevinhoang'

Note that the notation ".." signifies the containing directory one level up.


A directory can be created:

To delete a file:

To delete a directory:

To rename a file:

The shutil module provides methods for copying and moving files:
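The operations listed above (creating a directory, deleting a file or directory, renaming, copying, and moving) can be sketched in one short script.  All file and directory names here are hypothetical, and everything happens inside a temporary directory so that nothing else on disk is touched.

```python
import os
import shutil
import tempfile

base = tempfile.mkdtemp()                      # scratch area for the demonstration
newdir = os.path.join(base, "newdir")          # hypothetical names throughout
os.mkdir(newdir)                               # create a directory

src = os.path.join(newdir, "a.txt")
with open(src, "w") as f:
    f.write("some data\n")

renamed = os.path.join(newdir, "b.txt")
os.rename(src, renamed)                        # rename a file

copy = os.path.join(base, "copy.txt")
shutil.copy(renamed, copy)                     # copy a file
moved = os.path.join(base, "moved.txt")
shutil.move(copy, moved)                       # move a file

os.remove(renamed)                             # delete a file
os.rmdir(newdir)                               # delete a directory (must be empty)

print(sorted(os.listdir(base)))
```

Note that os.rmdir only removes empty directories, which is why the file inside newdir is deleted first.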

Finally, the glob module provides wildcard matching routines for finding files and directories that match a specification.  Matches are placed in lists:

In [72]:
>>> import glob
>>> glob.glob("c:\\temp\\*.dat")

[]

Here the "*" wildcard matches anything of any length.  The "?" wildcard will match anything of length one character.  Multiple wildcards can appear in a glob specification:

In [73]:
>>> glob.glob("c:\\*\\?.dat")

[]

glob returns both files and directories.  List comprehensions provide an easy way to filter for one or the other.

In [74]:
>>> [p for p in glob.glob("p*") if os.path.isdir(p)]

[]

#Command line arguments

It is very common to write programs that run with options from the command line, i.e., the DOS command prompt in Windows or a terminal in Linux or MacOS.  Usually, one provides a number of arguments that the program detects and acts upon.  Let's say we wanted a program to take an input file in.txt and produce an output file out.txt in the following way at the prompt:

    python program.py in.txt out.txt

In Windows, if Python is associated with files ending in '.py', we can just write instead:

    program.py in.txt out.txt

In Linux, we can accomplish the same behavior by including in the very first line of our program a comment directive that tells the system to use Python to execute the file:

Either way, we would like to capture the arguments in.txt and out.txt.  To do this, we use the sys module and its member variable argv:
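The listing for program.py is not reproduced here.  A minimal sketch of how such a program might capture its arguments follows; the helper name get_filenames and the usage message are assumptions made for illustration, not part of the original program.

```python
import sys

def get_filenames(argv):
    """Return (input_file, output_file) from an argv-style list."""
    # argv[0] is the program name; the arguments proper start at index 1.
    if len(argv) != 3:
        raise ValueError("usage: %s infile outfile" % argv[0])
    return argv[1], argv[2]

if __name__ == "__main__":
    print(sys.argv)
    if len(sys.argv) == 3:
        infile, outfile = get_filenames(sys.argv)
        print("input: %s, output: %s" % (infile, outfile))
```

Keeping the argument handling in a function that takes a list makes it easy to check with a hand-built list such as ["program.py", "in.txt", "out.txt"].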

Running program.py from the command line:

Notice that argv is a list that contains the (string) arguments in order.  The first argument, with index 0, is the name of the program that we are executing.  Subsequent arguments correspond to space-separated items that we input on the command line when running the program.  The form of argv is exactly the same whether or not we call Python directly, since the Python executable is ignored:

#Classes

So far, we have only dealt with built-in object types like floats and ints.  Python, however, allows us to create new object types called classes.  We can then use these classes to create new objects of our own design.  In the following example, we create a new class that describes an atom type.  
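The listing for atom.py does not appear here.  A minimal reconstruction that is consistent with the interactive session in this section might look like the following.  The default Element of 'C' matches the session output, while the default Mass of 12.0 is an assumption.

```python
# atom.py (reconstructed sketch, not the original listing)

class AtomClass:
    def __init__(self, Velocity, Element='C', Mass=12.0):
        # Store the arguments as members of the new object.
        # The default Mass of 12.0 is assumed for illustration.
        self.Velocity = Velocity
        self.Element = Element
        self.Mass = Mass

    def Momentum(self):
        # Momentum is velocity times mass.
        return self.Velocity * self.Mass
```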

We can import the atom.py module and create a new instance of the AtomClass type:

In [52]:
>>> import atom
>>> a = atom.AtomClass(2.0, Element = 'O', Mass = 16.0)
>>> b = atom.AtomClass(1.0)
>>> a.Element

'O'

In [53]:
>>> a.Mass

16.0

In [54]:
>>> a.Momentum()

32.0

In [55]:
>>> b.Element

'C'

In [56]:
>>> b.Velocity

1.0

In this example, the class statement indicates the creation of a new class called AtomClass;  all definitions for this class must be indented underneath it.  The first definition is for a special function called __init__ that is a constructor for the class, meaning this function is automatically executed by Python every time a new object of type AtomClass is created.  There are actually many special functions that can be defined for a class; each of these begins and ends with two underscore marks.


Notice that the first argument to the __init__ function is the object self.  This is a generic feature of any class function.  This syntax indicates that the object itself is automatically sent to the function upon calls to it.  This allows modifications to the object by manipulating the variable self; for example, new object members are added using expressions of the form self.X = Y.  This approach may seem unusual, but it actually simplifies the ways in which Python defines class functions behind the scenes.  


The __init__ function gives the form of the arguments that are used when we create a new object with atom.AtomClass(2.0, Element = 'O', Mass = 16.0).   Like any other function in Python, this function can include optional arguments.


Object members can be accessed using dot notation, as shown in the above example.  Each new instance object of a class acquires its own object members, separate from other instances.  Functions can also be defined as object members, as shown with the Momentum function above.  The first argument to any function in this definition must always be self; calls to functions through object instances, however, do not supply this variable since Python sends the object itself automatically as the first argument.


Many special functions can be defined for objects that tell Python how to use your new type with existing operations.  Below is a selected list of some of these:

| special class method | behavior / purpose |
| --- | --- |
| __del__(self) | A destructor; called when an instance is deleted using del or via Python's garbage collecting routines. |
| __repr__(self) | Returns a string representation of the object; used by the interactive interpreter and, when __str__ is not defined, by print statements. |
| __cmp__(self, other) | Defines a comparison method with other objects.  Returns a negative number if self < other, zero if self == other, and a positive number if self > other.  Used to evaluate comparison statements for objects, like a > b, or for sorting. |
| __len__(self) | Returns the length of the object; used by the len function. |
| __getitem__(self, key), __setitem__(self, key, value), __delitem__(self, key) | Define methods for accessing and modifying elements of an object via bracket notation, e.g., a[key] = value. |
| __contains__(self, item) | Called for an object when the in statement is used, e.g., item in a. |
| __add__(self, other), __sub__(self, other), __mul__(self, other), __div__(self, other), __mod__(self, other), __pow__(self, other) | Methods that are called when various arithmetic operations are executed on objects, e.g., a + b, a - b, a * b, a / b, a % b, and a**b.  In other programming languages, these functions might be termed operator overloading. |
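To make a few of these special functions concrete, here is a small sketch of a three-component vector type.  The class name Vector3 and its members are hypothetical; only __repr__, __len__, __getitem__, and __add__ are shown.

```python
class Vector3:
    """A small three-component vector (hypothetical example class)."""

    def __init__(self, x, y, z):
        self.components = [x, y, z]

    def __repr__(self):
        return "Vector3(%r, %r, %r)" % tuple(self.components)

    def __len__(self):
        return len(self.components)

    def __getitem__(self, key):
        return self.components[key]

    def __add__(self, other):
        # Component-wise addition; called for a + b.
        return Vector3(*[p + q for p, q in zip(self.components, other.components)])

a = Vector3(1.0, 2.0, 3.0)
b = Vector3(0.5, 0.5, 0.5)
print(a + b)       # uses __add__, then __repr__
print(len(a))      # uses __len__
print(a[2])        # uses __getitem__
```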



Classes can be an extremely convenient way of organizing data in scientific programs.  However, this benefit does not come without a cost: spreading data across many class instances will often slow your program considerably.  Consider the atom class defined above.  We could put a separate position or velocity vector inside each atom instance.  However, when we perform calculations that make intense use of these quantities, such as a pairwise loop that computes all interatomic distances, it is inefficient for Python to jump around in memory accessing individual position variables in each instance.  Rather, it would be much more efficient to store all positions for all atoms in a single large array that occupies one location in memory.  In this case, we would identify those quantities that appear in the slowest step of our calculations (typically the pairwise loop) and keep them outside of the classes as large, easily manipulated arrays, and then put everything else that is not accessed frequently (such as the element name) inside the class definitions.  Such a separation may seem messy, but ultimately it is essential if we are to achieve reasonable performance in numeric computations.

#Exceptions

Python offers a simple way to test for errors as a part of a program using the try and except statements:
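The listing for test.py is not reproduced here.  A minimal sketch that is consistent with the calls shown in this section could be written as follows.  Catching TypeError specifically is an assumption; the original may have caught a broader class of errors.

```python
# test.py (reconstructed sketch, not the original listing)

def multiply(x, y):
    try:
        # Attempt the multiplication; this works for numbers and some other types.
        return x * y
    except TypeError:
        # Multiplication is not defined for these types; fall back to 0.
        return 0
```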

Here we have defined a function that performs multiplication that we can call for any type.  If multiplication is not defined for a particular type, an error is thrown that is caught by the except statement.   Rather than stop our program, this error causes our own error-handling code to be executed.  The try statement defines the range of code in which we are testing for this error.  Consider this example:

In [57]:
>>> import test
>>> test.multiply(3, 6)

18

In [58]:
>>> test.multiply("3", "6")

0

In [59]:
>>> "3" * "6"

TypeError: can't multiply sequence by non-int of type 'str'

Errors can also be generated explicitly using the raise statement:

In [6]:
>>> raise FloatingPointError, "A floating point error has occurred."

FloatingPointError: A floating point error has occurred.

The ability to raise errors is convenient for adding user-defined information when improper calls to our functions or objects are made.  Ultimately this helps us locate bugs in our code.
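As a sketch of how raising errors helps to flag improper calls, consider a small function that rejects non-physical input.  The function kinetic_energy is hypothetical.  The call form raise ValueError("...") used here works in both Python 2 and Python 3, whereas the comma form shown above is accepted only by Python 2.

```python
def kinetic_energy(mass, velocity):
    """Return 0.5 * m * v**2, rejecting non-physical masses (hypothetical helper)."""
    if mass <= 0:
        # Raise an error with a message that points at the offending value.
        raise ValueError("mass must be positive, got %r" % (mass,))
    return 0.5 * mass * velocity ** 2

try:
    kinetic_energy(-1.0, 2.0)
except ValueError as err:
    message = str(err)
    print("Caught an error: %s" % message)

print(kinetic_energy(16.0, 2.0))
```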

#Timing functions and programs

It is often useful to be able to time routines in our program, to get a sense of the relative computational demands of its different parts.  A very simple approach is to use the time module:

In [7]:
>>> import time
>>> time.time()

1431548389.826442

The time() function of the time module gives the time in seconds as measured from a reference date called the epoch.  Ultimately, we are interested in time differences between two points in our program and so this exact date is unimportant.  Consider the following code snippet from a script that computes the time required for a particular function ComputeEnergies() to finish:

In [61]:
t1 = time.time()
ComputeEnergies()
t2 = time.time()
print "The time required was %.2f sec" % (t2 - t1)

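To try the timing pattern in a self-contained way, we can supply a stand-in for ComputeEnergies.  The function body below is purely hypothetical (a sum of inverse squares chosen only to keep the processor busy for a moment).

```python
import time

def ComputeEnergies():
    # Hypothetical stand-in: a sum of inverse squares, just to consume some time.
    total = 0.0
    for i in range(1, 200001):
        total += 1.0 / (i * i)
    return total

t1 = time.time()
result = ComputeEnergies()
t2 = time.time()
elapsed = t2 - t1
print("The time required was %.4f sec" % elapsed)
```

For very short routines the measured difference can be close to the resolution of the clock, so it is common to time many repetitions and divide.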

For long programs, adding such statements for each function execution would be very tedious.  Python includes a profiling facility that enables you to examine timings throughout your code.  There are two modules, profile and cProfile, which provide the same interface; cProfile, however, is written mostly in C and is much faster.  cProfile is always recommended unless you have an older version of Python that doesn't include it.

To use cProfile to profile a single function:

In [62]:
import cProfile
cProfile.run("ComputeEnergies()")

         2 function calls in 0.000 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}





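Notice that we send to the run function in cProfile a string that we want to execute.  After ComputeEnergies() finishes, cProfile will print out a long list of statistics about timings for that function and the functions it calls.  (The short report shown above comes from a session in which ComputeEnergies was not defined, so it contains no entries for the function itself.)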
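For a report with actual entries, the function being profiled has to exist.  The sketch below assumes Python 3 and uses two hypothetical stand-in functions.  Instead of passing a string to run, it uses a Profile object and the pstats module so that the report can be sorted and truncated.

```python
import cProfile
import io
import pstats

def PairEnergy(i, j):
    # Hypothetical inner routine.
    return 1.0 / (1 + abs(i - j))

def ComputeEnergies():
    # Hypothetical stand-in that calls PairEnergy many times.
    return sum(PairEnergy(i, j) for i in range(100) for j in range(i))

profiler = cProfile.Profile()
profiler.runcall(ComputeEnergies)      # profile a single call

# Collect the report in a string and show the five entries
# with the largest cumulative time.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

Sorting by cumulative time puts the top-level function first, followed by the routines it spends the most time in.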

To profile a complete script, we can run cProfile on it from the command line by passing the option -m cProfile to the Python interpreter ahead of the script name:

After running, we get a report that looks something like this (abbreviated):

The names to the right are names of the modules and functions called by our program.  Some of them might not look familiar; this is usually the case when modules have functions that call other functions in the same and other modules.  The numbers in columns give statistics about the program timing:

    •ncalls – number of times a function was called

    •tottime – total time spent in a function, summed over all calls

    •percall – average time per call spent in a function

    •cumtime – total time spent in a function and all the functions called by it

    •percall – average time per call spent in a function and all the functions called by it