# Data Has to Go Somewhere

## The open and close Commands

For data processing you will eventually need to be able to read and write files. In this section, we will show you how you can manipulate files using Python.

Let's first look at the fundamental open command that will enable you to access a file within a Python application.

`fileHandle = open(path_to_file, mode)`

`fileHandle` is the file object that will enable us to manipulate the file located at `path_to_file`.

The `path_to_file` variable is the path and file name that tells Python where the file we want to open is located. It can be a fully qualified path or a path relative to the current working directory.

The mode gives us the ability to tell our Python script the kind of operations we want to do on the file, and the kind of file it is.

The first letter indicates the operation:
- r means read
- w means write, and overwrite the file if it exists
- x means write, but only if file does not already exist
- a means write (append): write at the end of the file if it exists

The second letter of mode indicates the file type:
- t (or nothing) means text
- b means binary
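
The following is a minimal sketch of how the operation letters listed above differ in practice. The temporary directory and the file name `modes_demo.txt` are assumptions made for illustration only.

```python
import os
import tempfile

# Work in a throwaway directory so no real files are touched.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "modes_demo.txt")  # hypothetical file name

# 'x' creates the file because it does not exist yet.
f = open(path, "xt")
f.write("first\n")
f.close()

# A second 'x' open fails because the file now exists.
try:
    open(path, "xt")
    exclusive_failed = False
except FileExistsError:
    exclusive_failed = True

# 'a' keeps the existing content and writes at the end.
f = open(path, "at")
f.write("second\n")
f.close()

f = open(path, "rt")
after_append = f.read()
f.close()

# 'w' discards the existing content before writing.
f = open(path, "wt")
f.write("third\n")
f.close()

f = open(path, "rt")
after_overwrite = f.read()
f.close()

print(exclusive_failed)
print(repr(after_append))
print(repr(after_overwrite))
```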

After opening a file and manipulating it the way you like, you must be sure to close the file as well. Closing the file ensures that any buffered writes are flushed to disk and that Python does not maintain file handles to files it is not using anymore. This practice can keep memory usage low as well.
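
One way to make sure a file is closed even when an error occurs is sketched below. It shows an explicit `try`/`finally` and Python's `with` statement, which closes the file automatically when the block ends. The rest of this section calls `close()` directly; the file name `closing_demo.txt` and the temporary directory are assumptions for illustration.

```python
import os
import tempfile

# Hypothetical file in a throwaway directory.
path = os.path.join(tempfile.mkdtemp(), "closing_demo.txt")

# Explicit form: close() sits in a finally block so it runs even on error.
handle = open(path, "wt")
try:
    handle.write("some text")
finally:
    handle.close()

# Alternative form: the with statement closes the file when the block exits.
with open(path, "rt") as handle2:
    content = handle2.read()

# The closed attribute reports whether a file object has been closed.
print(handle.closed, handle2.closed, content)
```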

## Writing a Text File Using write()

Let's now try to write a file using Python. First, let's create some content.

In [1]:
file_content = '''To be or not to be
That is a question.'''

We just set a string variable with a multiline string to be the content of the file we want to create.

To create a file with the content within file_content, we would do the following.

In [2]:
poem_file = open('shakespeare.txt', 'wt')
poem_file.write(file_content)
poem_file.close()

You should now see a file named "shakespeare.txt" in the folder where this IPython Notebook is located. Let's take a look.

In [3]:
!ls -al

total 576
drwxr-xr-x+ 17 kashaolu  staff     578 Oct 28 09:53 .
drwxr-xr-x+ 19 kashaolu  staff     646 Oct 20 13:28 ..
-rw-r--r--@  1 kashaolu  staff    6148 Oct 28 07:43 .DS_Store
drwxr-xr-x+  4 Personal  staff     136 Oct 28 09:31 .ipynb_checkpoints
-rw-r--r--+  1 kashaolu  staff  100184 Oct 20 13:28 8.1 - Encoding Text.pptx
-rw-r--r--+  1 kashaolu  staff    8825 Sep 19 11:26 8.2 - Unicode Strings.ipynb
-rw-r--r--+  1 kashaolu  staff   15496 Sep 19 11:26 8.3 - Encoding.ipynb
-rw-r--r--+  1 kashaolu  staff   16894 Sep 19 11:26 8.4 - Formatting.ipynb
-rw-r--r--+  1 kashaolu  staff   23439 Sep 19 11:26 8.5 - Regular Expressions.ipynb
-rw-r--r--+  1 kashaolu  staff   11738 Sep 19 11:26 8.6 - Binary Data.ipynb
-rw-r--r--+  1 kashaolu  staff   19483 Oct 28 09:14 8.7 - File Input and Output.ipynb
-rw-r--r--+  1 kashaolu  staff    4887 Oct 28 09:53 8.8 - Structured Text Files.ipynb
-rw-r--r--+  1 kashaolu  staff      72 Sep 19 11:26 Week 8 Assign

We can see the contents using the cat command.

In [4]:
!cat shakespeare.txt

To be or not to be
That is a question.

We just programmatically created (or overwrote, if the file already existed) the file shakespeare.txt with the given content.

Now let's append to the file. We will need to use the "a" mode to append.

In [5]:
poem_file = open('shakespeare.txt', 'at')
poem_file.write("\n--Written By Shakespeare")
poem_file.close()

Let's see what the file looks like now.

In [6]:
!cat shakespeare.txt

To be or not to be
That is a question.
--Written By Shakespeare

We add "\n" to force a new line. The function write() writes out all of the characters verbatim.

Feel free to play around with the different modes to see how they work for you.

## Read a Text File With read(), readline(), or readlines()

Now we will learn to read files, but there are a few caveats.

The read() function with no arguments loads the entire file into memory. This is fine for small files, but a file can easily be larger than your available memory, in which case your application will run out of memory. If reading the whole file at once is what you want, you can do so as shown below.

In [7]:
poem_file_read = open('shakespeare.txt', 'rt')
poem_read = poem_file_read.read()
poem_file_read.close()
print(poem_read)

To be or not to be
That is a question.
--Written By Shakespeare
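
When a file may be too large to load at once, read() also accepts a size argument that limits how much is returned per call. The sketch below reads a small file four characters at a time; the file name `chunk_demo.txt`, its content, and the chunk size are assumptions chosen to keep the example short.

```python
import os
import tempfile

# Hypothetical file in a throwaway directory.
path = os.path.join(tempfile.mkdtemp(), "chunk_demo.txt")

f = open(path, "wt")
f.write("abcdefghij")
f.close()

chunks = []
f = open(path, "rt")
while True:
    chunk = f.read(4)  # return at most 4 characters per call
    if not chunk:      # an empty string means the end of the file
        break
    chunks.append(chunk)
f.close()

print(chunks)
```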


You can also use the function readlines(), which also loads the entire file into memory and gives you a single list in which each element is one line of the file, as in the following example.

In [8]:
poem_file_read = open('shakespeare.txt', 'rt')
poem_read_array = poem_file_read.readlines()
poem_file_read.close()
print(poem_read_array)

['To be or not to be\n', 'That is a question.\n', '--Written By Shakespeare']


In [9]:
print(poem_read_array[0])

To be or not to be



Notice that the contents of the file are now in a list. You can now process the file line by line as shown below.

In [10]:
for line in poem_read_array :
    print(line)

To be or not to be

That is a question.

--Written By Shakespeare


We now have a nice for loop that lets us access each line of the file one by one and process it, something we would do pretty often. However, the issue remains that we are loading the entire file into memory when we use read() or readlines().

The readline() function saves us from doing this. It reads a single line from the file every time it is called, so we can process one line at a time without loading the entire file into memory.

In [11]:
poem_file_read = open('shakespeare.txt', 'rt')

while True:
    line = poem_file_read.readline()
    if not line:
        break
        
    print(line)

poem_file_read.close()

To be or not to be

That is a question.

--Written By Shakespeare


Here we use an infinite loop to read lines one at a time using the readline() function, breaking out of the loop once we reach the end of the file. This way we only retrieve input from the file line by line and do not have to worry about memory issues (unless a line is very long). The readline() function returns an empty string, which evaluates to False, only at the end of the file. A blank line in the file is returned as "\n", which evaluates to True, so this loop will retrieve all of the text in the file.

There is an even easier way to read files. The file handle itself is an iterator, so you can write the above loop in an even simpler way.

In [12]:
poem_file_read = open('shakespeare.txt', 'rt')

for line in poem_file_read:
    print(line)
    
poem_file_read.close()

To be or not to be

That is a question.

--Written By Shakespeare
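
The blank lines in the output above appear because each line read from the file keeps its trailing "\n" and print() adds another newline. A minimal sketch of one way to avoid this is to strip the newline before printing. The file is recreated in a temporary directory here so the example is self-contained; that location is an assumption.

```python
import os
import tempfile

# Recreate the poem in a throwaway directory (assumed location).
path = os.path.join(tempfile.mkdtemp(), "shakespeare.txt")

f = open(path, "wt")
f.write("To be or not to be\nThat is a question.\n--Written By Shakespeare")
f.close()

stripped_lines = []
f = open(path, "rt")
for line in f:
    # Remove only the trailing newline; other whitespace is kept.
    stripped_lines.append(line.rstrip("\n"))
f.close()

for line in stripped_lines:
    print(line)
```

Passing `end=""` to print() is another option if you would rather leave the line itself untouched.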


## Writing a Binary File Using write()

Passing "b" as the second letter of the `mode` parameter of the function `open` allows you to open a binary file for writing.

In [13]:
binary_data = bytes(range(0,255)) # Generating some arbitrary binary data

binary_file_write = open('binary', 'wb')
binary_file_write.write(binary_data)
binary_file_write.close()

## Reading a Binary File 

Similar to text files, you can create a file handle that has read access to the binary file. Please note that since there are typically no newlines in binary data, you should use the read() function to read data.

In [14]:
binary_file_read = open('binary', 'rb')
print(binary_file_read.read())
binary_file_read.close()

b'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe'


## Find Position With tell(), Change Position With seek()

As you read and write, Python keeps track of where you are in the file. There is a set of functions that allow you to find out and modify your current position in that file.

The `tell()` function returns your location from the beginning of the file in bytes.
The `seek()` function allows you to jump to another location in the file.

For example, let's look at the binary file that we created.

In [15]:
binary_file_read = open('binary', 'rb')

Let's find out our current position in the file.

In [16]:
binary_file_read.tell()

0

Now let's go to the offset where we saw the capital "A" printed.

In [17]:
binary_file_read.seek(65)
print(binary_file_read.read())

b'ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe'


You can also seek from your current position. Let's say I want to move five bytes from my current position. I would pass a 1 as a second argument to the seek function.

In [18]:
binary_file_read.seek(65) #I'm going to go back to the original position
binary_file_read.seek(5, 1) # Now I'm reading five bytes after the current position
print(binary_file_read.read())

b'FGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe'
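
The second argument to seek() selects the reference point: 0 means the start of the file (the default), 1 means the current position, and 2 means the end of the file. The `os` module provides the named constants `os.SEEK_SET`, `os.SEEK_CUR`, and `os.SEEK_END` for these values. The sketch below exercises all three on a file with the same content as the binary file above; the file name `seek_demo.bin` and the temporary directory are assumptions.

```python
import os
import tempfile

# Hypothetical file holding the bytes 0 through 254, as in the example above.
path = os.path.join(tempfile.mkdtemp(), "seek_demo.bin")

f = open(path, "wb")
f.write(bytes(range(0, 255)))
f.close()

f = open(path, "rb")

f.seek(65, os.SEEK_SET)         # absolute offset, same as f.seek(65)
first = f.read(1)               # reads byte 65; position is now 66

f.seek(5, os.SEEK_CUR)          # move five bytes forward from position 66
position_after_skip = f.tell()
second = f.read(1)              # reads the byte at that position

f.seek(-3, os.SEEK_END)         # three bytes before the end of the file
tail = f.read()                 # the last three bytes

f.close()

print(first, position_after_skip, second, tail)
```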


Please note: The seek() and tell() functions work best with binary files as you are moving back and forth in byte-like chunks. These functions will also work with text files, but because of variable encoding, one character could use more bytes than another, leading to unexpected side effects.
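
To illustrate why byte offsets and character positions can disagree in text files, here is a small sketch that assumes UTF-8 encoding. The word, the file name `encoding_demo.txt`, and the temporary directory are chosen for illustration only.

```python
import os
import tempfile

# Hypothetical file in a throwaway directory.
path = os.path.join(tempfile.mkdtemp(), "encoding_demo.txt")

text = "caf\u00e9"  # "café": four characters, the last one is non-ASCII

f = open(path, "wt", encoding="utf-8")
f.write(text)
f.close()

char_count = len(text)              # number of characters in the string
byte_count = os.path.getsize(path)  # number of bytes stored on disk

# In binary mode, offset 3 is where the encoded last character starts.
f = open(path, "rb")
f.seek(3)
raw = f.read()
f.close()

print(char_count, byte_count, raw)
```

The last character occupies two bytes in UTF-8, so the byte count is one larger than the character count, and a byte offset that lands in the middle of such a character would not correspond to any character boundary.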

In the next section we will learn about structured text files and how to use them in your applications.