# 3.3 Files and the Operating System

Most of this book uses high-level tools like <b>pandas.read_csv</b> to read data files from disk into Python data structures. 
However, it’s important to understand the basics of how to work with files in Python. 

Fortunately, it’s relatively straightforward, which is one reason Python is so popular for text and file munging.

To open a file for reading or writing, use the built-in <b>open</b> function with either a relative or absolute file path and an optional file encoding:

In [172]:
path = "../examples/segismundo.txt"

In [174]:
f = open(path, encoding="utf-8")

Here, I pass <b>encoding="utf-8"</b> as a best practice because the default Unicode encoding for reading files varies from platform to platform.

By default, the file is opened in read-only mode <b>"r"</b>. 

We can then treat the file object <b>f</b> like a <b>list</b> and iterate over the lines like so:

In [178]:
for line in f:
    print(line)

Sueña el rico en su riqueza,

que más cuidados le ofrece;



sueña el pobre que padece

su miseria y su pobreza;



sueña el que a medrar empieza,

sueña el que afana y pretende,

sueña el que agravia y ofende,



y en el mundo, en conclusión,

todos sueñan lo que son,

aunque ninguno lo entiende.





The lines come out of the file with the end-of-line (EOL) markers intact, so you’ll often see code to get an EOL-free list of lines in a file like:

In [181]:
lines = [x.rstrip() for x in open(path, encoding="utf-8")]
lines

['Sueña el rico en su riqueza,',
 'que más cuidados le ofrece;',
 '',
 'sueña el pobre que padece',
 'su miseria y su pobreza;',
 '',
 'sueña el que a medrar empieza,',
 'sueña el que afana y pretende,',
 'sueña el que agravia y ofende,',
 '',
 'y en el mundo, en conclusión,',
 'todos sueñan lo que son,',
 'aunque ninguno lo entiende.',
 '']
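
One detail worth keeping in mind is that <b>rstrip()</b> with no arguments removes all trailing whitespace, not only the EOL marker. The following is a minimal sketch of the difference; it uses a made-up string wrapped in <b>io.StringIO</b> (a stand-in for a text file, not the example file above), and the variable names are only illustrative:

```python
import io

# Hypothetical sample text (not the book's example file); the first line
# ends with two trailing spaces that we may want to keep.
sample = "first line  \nsecond line\n\nlast line\n"

# io.StringIO behaves like a file opened in text mode.
stripped_all = [x.rstrip() for x in io.StringIO(sample)]
stripped_eol = [x.rstrip("\n") for x in io.StringIO(sample)]

print(stripped_all)  # trailing spaces are removed along with the newline
print(stripped_eol)  # only the newline is removed
```

If trailing spaces or tabs are meaningful in your data, passing <b>"\n"</b> explicitly is the safer choice.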

When you use <b>open</b> to create file objects, it is recommended to <b>close</b> the file when you are finished with it.

Closing the file releases its resources back to the operating system:

In [120]:
f.close()

One of the ways to make it easier to clean up open files is to use the <b>with</b> statement:

In [123]:
with open(path, encoding="utf-8") as f:
    lines = [x.rstrip() for x in f]

This will automatically close the file <b>f</b> when exiting the <b>with</b> block.

Failing to ensure that files are closed will not cause problems in many small programs or scripts, but it can be an issue in programs that need to interact with a large number of files.
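
To see the cleanup behavior for yourself, here is a small sketch that checks the <b>closed</b> attribute inside and outside a <b>with</b> block. It writes to a temporary directory rather than the example file, and the file name <b>demo.txt</b> is an arbitrary choice for illustration:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Throwaway file so the example does not touch the book's data.
    demo_path = os.path.join(tmp_dir, "demo.txt")

    with open(demo_path, mode="w", encoding="utf-8") as f:
        f.write("hello\n")
        closed_inside = f.closed   # still open inside the block

    closed_after = f.closed        # closed once the block has exited

    # The file is also closed if the block exits because of an exception.
    try:
        with open(demo_path, encoding="utf-8") as g:
            raise ValueError("simulated failure")
    except ValueError:
        closed_after_error = g.closed

print(closed_inside, closed_after, closed_after_error)
```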

If we had typed <b>f = open(path, "w")</b>, a new file at <b>examples/segismundo.txt</b> would have been created (be careful!), overwriting any file in its place.

There is also the <b>"x"</b> file mode, which creates a writable file but fails if the file path already exists. 

See Table 3-3 for a list of all valid file read/write modes.

### Table 3-3. Python file modes

|Mode|Description|
|---|---|
|r|Read-only mode|
|w|Write-only mode; creates a new file (erasing the data for any file with the same name)|
|x|Write-only mode; creates a new file but fails if the file path already exists|
|a|Append to existing file (creates the file if it does not already exist)|
|r+|Read and write mode; the file must already exist and is not truncated|
|b|Add to mode for binary files (i.e., "rb" or "wb")|
|t|Text mode for files (automatically decoding bytes to Unicode); this is the default if not specified|
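
As a rough illustration of how the <b>"x"</b>, <b>"a"</b>, and <b>"w"</b> modes differ, the sketch below runs them in sequence against one scratch file. The temporary directory and the file name <b>modes.txt</b> are assumptions made for the example:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "modes.txt")

    # "x" creates the file because it does not exist yet.
    with open(demo_path, mode="x", encoding="utf-8") as f:
        f.write("first\n")

    # A second "x" fails instead of overwriting the file.
    try:
        open(demo_path, mode="x", encoding="utf-8")
        x_failed = False
    except FileExistsError:
        x_failed = True

    # "a" adds to the end of the existing data.
    with open(demo_path, mode="a", encoding="utf-8") as f:
        f.write("second\n")
    with open(demo_path, encoding="utf-8") as f:
        after_append = f.read()

    # "w" erases the existing data before writing.
    with open(demo_path, mode="w", encoding="utf-8") as f:
        f.write("third\n")
    with open(demo_path, encoding="utf-8") as f:
        after_write = f.read()

print(x_failed, repr(after_append), repr(after_write))
```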

For readable files, some of the most commonly used methods are <b>read</b>, <b>seek</b>, and <b>tell</b>. 

<b>read</b> returns a certain number of characters from the file. 

What constitutes a “character” is determined by the file encoding; if the file is opened in binary mode, <b>read</b> simply returns raw bytes:

In [129]:
f1 = open(path)

In [131]:
f1.read(10)

'Sueña el r'

In [148]:
f2 = open(path, mode="rb")  # Binary mode

In [150]:
f2.read(10)

b'Sue\xc3\xb1a el '

The <b>read</b> method advances the file object position by the number of bytes read. 

<b>tell</b> gives you the current position:

In [153]:
f1.tell()

11

In [155]:
f2.tell()

10

Even though we read <b>10</b> characters from the file <b>f1</b> opened in text mode, the position is <b>11</b> because it took that many bytes to decode 10 characters using the default encoding. 

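You can check Python's default string encoding in the <b>sys</b> module. Note that this is the encoding used by <b>str.encode</b> and <b>bytes.decode</b>; when no encoding is passed, <b>open</b> uses a locale-dependent default, which on many systems is also UTF-8: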

In [42]:
import sys
sys.getdefaultencoding()

'utf-8'

To get consistent behavior across platforms, it is best to pass an encoding (such as <b>encoding="utf-8"</b>, which is widely used) when opening files.
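
One caveat worth checking on your own system: as far as I am aware, <b>sys.getdefaultencoding()</b> reports the encoding used for converting between <b>str</b> and <b>bytes</b>, while <b>open</b> in text mode falls back to the locale's preferred encoding when the <b>encoding</b> argument is omitted. A short sketch comparing the two (the variable names are illustrative):

```python
import locale
import sys

# Encoding used by str.encode() and bytes.decode() when none is given.
string_default = sys.getdefaultencoding()

# Encoding that open() falls back to in text mode when the encoding
# argument is omitted; this depends on the platform and locale settings.
file_default = locale.getpreferredencoding(False)

print(string_default, file_default)
```

If the two values differ on your machine, that is a good reason to always pass <b>encoding</b> explicitly.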

<b>seek</b> changes the file position to the indicated byte in the file:

In [158]:
f1.seek(3)

3

In [160]:
f1.read(1)

'ñ'

In [162]:
f1.tell()

5

Lastly, we remember to <b>close</b> the files:

In [53]:
f1.close()
f2.close()

To write text to a file, you can use the file’s <b>write</b> or <b>writelines</b> methods. 

For example, we could create a version of <b>examples/segismundo.txt</b> with no blank lines like so:

In [56]:
path

'../examples/segismundo.txt'

In [58]:
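with open(path, encoding="utf-8") as source:
    with open("tmp.txt", mode="w", encoding="utf-8") as handle:
        # Keep only lines that contain more than the EOL marker
        handle.writelines(x for x in source if len(x) > 1)

Let's read the contents of the file we created:

In [60]:
with open("tmp.txt", encoding="utf-8") as f:
    lines = f.readlines()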

In [62]:
lines

['Sueña el rico en su riqueza,\n',
 'que más cuidados le ofrece;\n',
 'sueña el pobre que padece\n',
 'su miseria y su pobreza;\n',
 'sueña el que a medrar empieza,\n',
 'sueña el que afana y pretende,\n',
 'sueña el que agravia y ofende,\n',
 'y en el mundo, en conclusión,\n',
 'todos sueñan lo que son,\n',
 'aunque ninguno lo entiende.\n']
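
Notice that the EOL markers in the output came from the source lines themselves: <b>writelines</b> does not add line separators. A minimal sketch of this behavior, using <b>io.StringIO</b> as a stand-in for a file opened for writing and a made-up list of words:

```python
import io

# Hypothetical list of strings without EOL markers.
words = ["uno", "dos", "tres"]

# io.StringIO stands in for a file opened in text write mode.
buffer_plain = io.StringIO()
buffer_plain.writelines(words)

# Add the EOL marker yourself if you want one string per line.
buffer_eol = io.StringIO()
buffer_eol.writelines(w + "\n" for w in words)

print(repr(buffer_plain.getvalue()))
print(repr(buffer_eol.getvalue()))
```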

See Table 3-4 for many of the most commonly used file methods.

### Table 3-4. Important Python file methods or attributes

|Method/attribute|Description|
|---|---|
|read([size])|Return data from file as bytes or string depending on the file mode, with optional size argument indicating the number of bytes or string characters to read|
|readable()|Return True if the file supports read operations|
|readlines([size])|Return list of lines in the file, with optional size hint; no more lines are read once the total size of the lines read so far exceeds it|
|write(string)|Write passed string to file|
|writable()|Return True if the file supports write operations|
|writelines(strings)|Write passed sequence of strings to the file|
|close()|Close the file object|
|flush()|Flush the internal I/O buffer to disk|
|seek(pos)|Move to indicated file position (integer)|
|seekable()|Return True if the file object supports seeking and thus random access (some file-like objects do not)|
|tell()|Return current file position as integer|
|closed|True if the file is closed|
|encoding|The encoding used to interpret bytes in the file as Unicode (typically UTF-8)|
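
Several of the less common entries in this table can be tried out quickly. The sketch below opens a scratch file for writing and inspects it; the temporary directory and the name <b>methods.txt</b> are assumptions for the example:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "methods.txt")

    with open(demo_path, mode="w", encoding="utf-8") as f:
        can_read = f.readable()     # a file opened with "w" cannot be read
        can_write = f.writable()
        can_seek = f.seekable()     # regular files on disk support seeking
        file_encoding = f.encoding  # the encoding passed to open

        f.write("abc\n")
        f.flush()                   # push buffered data out of Python
        size_after_flush = os.path.getsize(demo_path)

    is_closed = f.closed

print(can_read, can_write, can_seek, file_encoding, size_after_flush, is_closed)
```

The size after flushing depends on the platform, because text mode may translate <b>"\n"</b> into a two-byte line ending on Windows.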

## Bytes and Unicode with Files

The default behavior for Python files (whether readable or writable) is text mode, which means that you intend to work with Python strings (i.e., Unicode). 

This contrasts with binary mode, which you can obtain by appending <b>b</b> to the file mode. 

Revisiting the file (which contains non-ASCII characters with UTF-8 encoding) from the previous section, we have:

In [69]:
with open(path) as f:
    chars = f.read(10)

In [71]:
chars

'Sueña el r'

In [73]:
len(chars)

10

<b>UTF-8</b> is a variable-length Unicode encoding, so when we request some number of characters from the file, Python reads enough bytes (which could be as few as 10 or as many as 40 bytes) from the file to decode that many characters. 
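
To make the variable-length point concrete, this short sketch encodes a few sample characters (chosen only for illustration) and counts the resulting bytes:

```python
# Characters that need 1, 2, 3, and 4 bytes in UTF-8, respectively.
samples = ["a", "ñ", "€", "😀"]
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
print(byte_lengths)

# Ten characters can therefore occupy anywhere from 10 to 40 bytes.
shortest = len(("a" * 10).encode("utf-8"))
longest = len(("😀" * 10).encode("utf-8"))
print(shortest, longest)
```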

If I open the file in <b>"rb"</b> mode instead, <b>read</b> requests exactly that number of bytes:

In [76]:
with open(path, mode="rb") as f:
    data = f.read(10)

In [78]:
data

b'Sue\xc3\xb1a el '

Depending on the text encoding, you may be able to decode the bytes to a <b>str</b> object yourself, but only if each of the encoded Unicode characters is fully formed:

In [80]:
data.decode("utf-8")

'Sueña el '

In [82]:
data[:4].decode("utf-8")

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 3: unexpected end of data
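
If you receive bytes in chunks that may split a character, one option (among others) is an incremental decoder from the standard library's <b>codecs</b> module, which holds back incomplete bytes until the rest arrives. A minimal sketch, rebuilding the same ten bytes from a string literal rather than reading the example file:

```python
import codecs

# The same ten bytes as above, built from a literal for self-containment.
data = "Sueña el ".encode("utf-8")

# Split in the middle of the two-byte sequence for "ñ".
first, second = data[:4], data[4:]

decoder = codecs.getincrementaldecoder("utf-8")()
part1 = decoder.decode(first)               # incomplete byte is buffered
part2 = decoder.decode(second, final=True)  # buffered byte is completed here

print(repr(part1), repr(part2))

# A lossy alternative: substitute a replacement character instead of raising.
lossy = first.decode("utf-8", errors="replace")
print(repr(lossy))
```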

Text mode, combined with the encoding option of <b>open</b>, provides a convenient way to convert from one text encoding to another:

In [90]:
sink_path = "sink.txt"
!rm -f sink.txt  # delete any existing file (Unix shell command)

with open(path) as source:
    with open(sink_path, "x", encoding="iso-8859-1") as sink:
        sink.write(source.read())

In [92]:
with open(sink_path, encoding="iso-8859-1") as f:
    print(f.read(10))

Sueña el r
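
Keep in mind that such a conversion only works when every character in the source exists in the target encoding. The sketch below, which uses short string literals instead of files, shows that the same text occupies a different number of bytes in each encoding and that a character missing from ISO-8859-1 (the euro sign, used here purely as an example) cannot be encoded:

```python
text = "Sueña el rico"

utf8_bytes = text.encode("utf-8")         # "ñ" takes two bytes
latin1_bytes = text.encode("iso-8859-1")  # "ñ" takes one byte
print(len(text), len(utf8_bytes), len(latin1_bytes))

# The euro sign has no code point in ISO-8859-1.
try:
    "€".encode("iso-8859-1")
    euro_ok = True
except UnicodeEncodeError:
    euro_ok = False
print(euro_ok)

# errors="replace" substitutes "?" instead of raising.
replaced = "€".encode("iso-8859-1", errors="replace")
print(replaced)
```

The <b>errors</b> argument is also accepted by <b>open</b>, so the same choice is available when writing files in text mode.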


Beware using <b>seek</b> when opening files in any mode other than binary. 

If the file position falls in the middle of the bytes defining a Unicode character, then subsequent reads will result in an error:

In [95]:
f = open(path, encoding='utf-8')

In [97]:
f.read(5)

'Sueña'

In [99]:
f.seek(4)

4

In [167]:
f.read(1)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb1 in position 0: invalid start byte
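
One way to avoid this problem in text mode is to seek only to positions that <b>tell</b> has previously returned (or to <b>0</b>), since those always fall on character boundaries. The following sketch records the position of each line in a scratch file and then jumps back to them; the file name <b>verse.txt</b> and its contents are made up for the example:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "verse.txt")
    with open(demo_path, mode="w", encoding="utf-8") as f:
        f.write("Sueña el rico\nsueña el pobre\nsueñan todos\n")

    with open(demo_path, encoding="utf-8") as f:
        # Record where each line starts. readline() is used instead of a
        # for loop because tell() is disabled while iterating over a file.
        offsets = []
        while True:
            pos = f.tell()
            line = f.readline()
            if not line:
                break
            offsets.append(pos)

        # Jump straight to the lines using the recorded positions.
        f.seek(offsets[2])
        third_line = f.readline()
        f.seek(offsets[1])
        second_line = f.readline()

print(repr(second_line), repr(third_line))
```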

If you find yourself regularly doing data analysis on non-ASCII text data, mastering Python’s Unicode functionality will prove valuable. 

See Python’s online documentation for much more.

# 3.4 Conclusion

With some of the basics of the Python environment and language now under your belt, it is time to move on and learn about NumPy and array-oriented computing in Python. 

But before we do that, we are going to skip ahead and look at additional file handling capabilities through the <b>pandas</b> library.