
# Opening files

When we're writing a program, we rarely want to type the data we're working with directly into the source code. Rather, we want to get the data from an external source: either the web or a file on our computer. In this article, we're going to look at how to work with files in Python.

Did you read the article about string manipulation, and in particular, the bit about character encodings? It might be helpful here!

## The <code>open()</code> function and its modes

Python has one built-in function for getting the information out of a file, or writing information to a file. 

This is simply the <code>open()</code> function. Its first argument is always a string, which is the path to a file.

If you checked out the lesson on using the command line (and if you haven't, you really should!), the way that file paths work should be very familiar. They can be either relative or absolute. Relative to what? Well, Python has a current working directory, just like you do when you are working via the command line. By default, this is the directory you were in when you ran Python. Finally, in Python you should always use forward slashes in file paths, even if your computer uses backslashes. The <code>open()</code> function is smart and will figure this out for you -- it means the same code can easily open files on different operating systems!
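Here is a small sketch of these ideas. It is not part of the original lesson: the <code>poems/demo.txt</code> file name is made up, and the example works inside a temporary folder so that it doesn't touch your real files (the <code>with</code> line only takes care of cleaning that folder up).

```python
import os
import tempfile

# The current working directory: relative paths are resolved from here.
cwd = os.getcwd()
print(cwd)

# A relative path such as 'haiku.txt' means "haiku.txt inside the cwd".
resolved = os.path.abspath('haiku.txt')
print(resolved == os.path.join(cwd, 'haiku.txt'))  # True

with tempfile.TemporaryDirectory() as tmp_dir:
    os.mkdir(os.path.join(tmp_dir, 'poems'))

    # An absolute path written with forward slashes (hypothetical file name).
    absolute_path = tmp_dir + '/poems/demo.txt'

    demo_file = open(absolute_path, 'w', encoding='UTF-8')
    demo_file.write('hello')
    demo_file.close()

    exists_after_write = os.path.exists(absolute_path)

print(exists_after_write)  # True
```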

What happens next depends on the second argument: the mode. This determines whether the file is opened as text or as bytes, and whether you want to read the file or write to it. We'll deal with these modes now.

The function returns a file object, which you should store in a variable to refer to later (otherwise the file will be opened but you won't be able to access it).

### Read mode, 'r'

This is the mode you need to get the data out of a file as text; it is the default mode when opening. Unless told otherwise, Python attempts to decode the file using the default encoding of your computer, and that default varies: a Windows computer and a Linux machine, for instance, may have different default encodings. To be safe, it's best to specify the correct encoding by passing the keyword argument <code>encoding</code>.
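To see why the encoding matters, here is a rough sketch (my own illustration, using a made-up <code>cafe.txt</code> file in a temporary folder). It asks Python which encoding it would normally fall back on, then reads the same UTF-8 file with the right and the wrong encoding. The exact default you see will depend on your machine and Python set-up.

```python
import locale
import os
import tempfile

# The encoding open() usually falls back on when none is given (varies by machine).
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'cafe.txt')  # hypothetical file name

    # Write the word "cafe" with an accented e (escaped so this code is ASCII-only).
    f = open(path, 'w', encoding='UTF-8')
    f.write('caf\u00e9')
    f.close()

    # Reading with the correct encoding gives back the four characters we wrote.
    f = open(path, 'r', encoding='UTF-8')
    right = f.read()
    f.close()

    # Reading with the wrong encoding silently produces garbage instead.
    f = open(path, 'r', encoding='latin-1')
    wrong = f.read()
    f.close()

print(len(right))      # 4
print(len(wrong))      # 5 -- the accented letter was decoded as two characters
print(right == wrong)  # False
```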

Let's open a text file and read its contents.

In [9]:
example_file = open('haiku.txt', 'r', encoding='UTF-8')

What can we do with this file? Well, we can try to extract its contents as a string, using <code>.read()</code>.

In [10]:
contents = example_file.read()

In [11]:
print(contents)

This is a text file,
UTF-8 encoding.
It has two line breaks.


What happens if we try to read it again?

In [12]:
contents2 = example_file.read()

In [13]:
print(contents2)




Nothing! What happened?

When you read a file, you should imagine it as a tape. Just like a tape, once you have read to the end, you have to "rewind" it to read it again. To do this, use <code>.seek()</code>.

In [14]:
# go back to the 0th byte
example_file.seek(0)

0

In [15]:
contents3 = example_file.read()
print(contents3)

This is a text file,
UTF-8 encoding.
It has two line breaks.


And... it works again. Be warned though: <code>seek()</code> doesn't count characters! It counts bytes, and if you recall from the lesson on strings and text, UTF-8 uses a variable number of bytes to encode each character. So you can't reliably search around the text file using <code>seek()</code> -- you could end up "between" characters.

This is one reason I implored in the strings lesson to convert your text into a Python string as soon as possible! Once it's a string, you're safe! You can forget all about bytes and encodings.
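The "between characters" problem is easy to reproduce. The sketch below is an illustration rather than part of the lesson's own files: it writes a five-character word containing one accented letter to a made-up <code>accent.txt</code>, then opens it as bytes (a mode we'll meet properly further down) and seeks into the middle of that letter.

```python
import os
import tempfile

text = 'h\u00e9llo'  # "hello" with an accented e: five characters

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'accent.txt')  # hypothetical file name

    f = open(path, 'w', encoding='UTF-8')
    f.write(text)
    f.close()

    raw = open(path, 'rb')
    all_bytes = raw.read()
    raw.seek(2)        # byte 2 is the second half of the two-byte accented e
    tail = raw.read()
    raw.close()

# Five characters, but six bytes on disk.
print(len(text), len(all_bytes))  # 5 6

# The bytes from position 2 onwards start mid-character, so they can't be decoded.
try:
    tail.decode('UTF-8')
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

print(decoded_ok)  # False
```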

Another nice feature of the file object is that if you loop over it, you will get each line as a string:

In [18]:
example_file.seek(0)
for line in example_file:
    print(line.upper())

THIS IS A TEXT FILE,

UTF-8 ENCODING.

IT HAS TWO LINE BREAKS.


Why are there line breaks in funny places? This is just because each line ends with a newline (<code>\n</code>) character, and each <code>print()</code> adds a newline of its own, leading to a doubling up of line breaks.

There's more good news on this front. Since different operating systems prefer different linebreak characters, the file object provided by <code>open()</code> is kind and clever enough to figure out where the linebreaks should be when you loop over it.
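If the doubled line breaks bother you, one common fix is to strip the newline from each line before printing it. This is a minimal sketch: it recreates the haiku in a temporary file (under the made-up name <code>haiku_demo.txt</code>) so that it runs on its own.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'haiku_demo.txt')  # hypothetical file name

    f = open(path, 'w', encoding='UTF-8')
    f.write('This is a text file,\nUTF-8 encoding.\nIt has two line breaks.')
    f.close()

    shouted = []
    f = open(path, 'r', encoding='UTF-8')
    for line in f:
        # Remove the trailing newline so print() doesn't double it up.
        clean = line.rstrip('\n')
        shouted.append(clean.upper())
        print(clean.upper())
    f.close()
```

Another option is to leave the line alone and call <code>print(line.upper(), end='')</code>, which tells <code>print()</code> not to add its own newline.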

After opening any file, it is essential to close it when you are done -- ideally as soon as you have extracted the text as a string. While a file is open in one program, other programs may be unable to use it (how strict this is depends on your operating system). Moreover, if your program crashes unexpectedly and you didn't close the file, it may be left open, tying up resources on your computer and remaining unavailable.

In [19]:
example_file.close()

We'll soon meet a better way of closing files safely.

### Write and append modes; 'w' and 'a'

We now know how to get data from a file. The next task is to put data in, and we have two real choices here. Write mode will delete the contents of a file, ready to be re-written from scratch. Append mode will add to the end of the file. Let's just repeat that: <i>write mode will delete the contents of the file to be written</i>, so make sure you've definitely got the right file, and if in doubt, back it up. In either mode, if the file does not exist, it will be created.

In [23]:
new_file = open('newhaiku.txt', 'w', encoding='UTF-8')

To write into the file, we use <code>.write()</code>:

In [24]:
haiku = "Always remember\nto consider that new lines\naren't automatic."
new_file.write(haiku)

60

Let's write again and see what happens.

In [25]:
new_file.write(haiku)
new_file.close()
file_check = open('newhaiku.txt', 'r', encoding='UTF-8')
contents = file_check.read()
print(contents)
file_check.close()

Always remember
to consider that new lines
aren't automatic.Always remember
to consider that new lines
aren't automatic.


Unfortunately, it looks like I forgot that new lines aren't automatic. Let's add a correctly line-broken version to the end of this file.

In [26]:
haiku_file = open('newhaiku.txt', 'a', encoding='UTF-8') # 'a' for "append" mode

haiku_file.write('\n\n') # two new lines
haiku_file.write(haiku)
haiku_file.close()

file_check = open('newhaiku.txt', 'r', encoding='UTF-8')
contents = file_check.read()
print(contents)
file_check.close()

Always remember
to consider that new lines
aren't automatic.Always remember
to consider that new lines
aren't automatic.

Always remember
to consider that new lines
aren't automatic.


That's better.
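One thing the examples above don't show directly is write mode wiping out a file that already has something in it. Here is a short sketch of that, using a made-up <code>scratch.txt</code> in a temporary folder so nothing real gets deleted.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'scratch.txt')  # hypothetical file name

    # The file doesn't exist yet, so 'w' creates it.
    f = open(path, 'w', encoding='UTF-8')
    f.write('first draft')
    f.close()

    # Opening in 'w' again throws away 'first draft' before we write anything.
    f = open(path, 'w', encoding='UTF-8')
    f.write('second draft')
    f.close()

    # Opening in 'a' keeps what is there and adds to the end.
    f = open(path, 'a', encoding='UTF-8')
    f.write('\nappended line')
    f.close()

    f = open(path, 'r', encoding='UTF-8')
    final_contents = f.read()
    f.close()

print(final_contents)
# second draft
# appended line
```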

### Opening/writing things as bytes

We have so far been opening files as text -- the bytes in the file are converted to strings of characters. But there are times when we might want to open the file as bytes. There are two main situations where we might want to do this.

* The file is not a text file!
* The file is a text file, but we don't know what the encoding is.

In this latter case, we might want to open it as bytes and then use some method to try to determine the encoding, such as the <code>chardet</code> module from the previous topic. Opening something as bytes works just the same, but you add a <code>b</code> to the mode and leave out the <code>encoding</code> argument:

In [28]:
bytes_file = open('newhaiku.txt', 'rb')
print(bytes_file.read())
import chardet
bytes_file.seek(0)
print(chardet.detect(bytes_file.read()))
bytes_file.close()

b"Always remember\nto consider that new lines\naren't automatic.Always remember\nto consider that new lines\naren't automatic.\n\nAlways remember\nto consider that new lines\naren't automatic."
{'encoding': 'ascii', 'confidence': 1.0, 'language': ''}


This guess of ASCII rather than UTF-8 from <code>chardet</code> isn't really wrong: the file contains only ASCII characters, and it confirms what we were saying last time, that UTF-8 and ASCII are identical on the ASCII characters.
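You can check the overlap between ASCII and UTF-8 yourself without <code>chardet</code>. In this sketch (an illustration with a made-up file name), we read one line of the haiku back as bytes and decode it both ways by hand using the <code>.decode()</code> method of the bytes object.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'bytes_demo.txt')  # hypothetical file name

    f = open(path, 'w', encoding='UTF-8')
    f.write("aren't automatic.")
    f.close()

    # 'rb' gives us the raw bytes, with no decoding done for us.
    bytes_file = open(path, 'rb')
    raw_bytes = bytes_file.read()
    bytes_file.close()

print(type(raw_bytes).__name__)  # bytes
print(len(raw_bytes))            # 17

# Every byte here is an ASCII character, so both decodings give the same string.
as_ascii = raw_bytes.decode('ascii')
as_utf8 = raw_bytes.decode('UTF-8')
print(as_ascii == as_utf8)       # True
```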

## The <code>with</code> keyword

So far, I've been drilling the importance of safely opening and closing files. But now, we're going to basically forget all that and show a better, safer way of opening files. You may have already seen it if you've been watching the videos closely. Check out this code:

In [29]:
with open('haiku.txt', 'r', encoding='UTF-8') as haiku:
    haiku_string = haiku.read()
    
print(haiku_string)

This is a text file,
UTF-8 encoding.
It has two line breaks.


And now, I don't need to close it. In fact, it's already closed. The <code>with</code> block keeps the file open only for the duration of the code in the block, and closes it when the block has finished executing. The best part is that this happens regardless of <i>why</i> the <code>with</code> block finished executing, even if it's caused by an error partway through, or if most of the block is skipped by <code>if</code>-statements.
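You don't have to take my word for it. File objects have a <code>.closed</code> attribute, so we can deliberately cause an error inside a <code>with</code> block and then check whether the file was closed anyway. This is a sketch with a simulated error and a made-up file name, not something you'd write in a real program.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'haiku_demo.txt')  # hypothetical file name
    with open(path, 'w', encoding='UTF-8') as out_file:
        out_file.write('This is a text file,\nUTF-8 encoding.\nIt has two line breaks.')

    try:
        with open(path, 'r', encoding='UTF-8') as haiku:
            first_line = haiku.readline()
            # Simulate something going wrong before we finish with the file.
            raise ValueError('simulated problem while processing the file')
    except ValueError:
        # The with block has already closed the file on its way out.
        closed_after_error = haiku.closed

print(first_line.strip())  # This is a text file,
print(closed_after_error)  # True
```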

You should use <code>with</code> whenever you open files: extract the information you need inside the <code>with</code> block, and then just move on, allowing <code>with</code> to close the file for you and free up the resources. The syntax is simple -- the <code>open()</code> function works just as before, and the <code>as</code> keyword just assigns the file object it returns to a variable.

This method can be used for reading and writing files, and is the safest and most "Pythonic" way to work with files.
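For completeness, here is what writing looks like with <code>with</code>. It's a minimal sketch along the lines of the earlier haiku example, again using a temporary folder and a made-up file name, and this time remembering to add the new lines ourselves.

```python
import os
import tempfile

lines = ['Always remember', 'to consider that new lines', "aren't automatic."]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'newhaiku_demo.txt')  # hypothetical file name

    # Write mode: the file is created, written, and closed when the block ends.
    with open(path, 'w', encoding='UTF-8') as out_file:
        for line in lines:
            out_file.write(line + '\n')  # new lines aren't automatic!

    # Read it back, one line at a time, stripping the newline from each.
    with open(path, 'r', encoding='UTF-8') as in_file:
        read_back = [line.rstrip('\n') for line in in_file]

print(read_back == lines)  # True
print(out_file.closed, in_file.closed)  # True True
```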