In [14]:
import sys, os, json, pickle
import numpy as np

# Input and Output
You'll eventually reach a point where you want your program to interact with the user by printing messages or writing output to a file. You might also need a way to save really large objects once you begin hitting your computer's memory limits. Hence this lesson!

## Lesson Objectives
By the end of this lesson, you will be able to:
- Use the `input()` and `print()` functions more extensively
- Read and write text & binary data
- Save structured data with JSON
- Persist Python objects using the `pickle` module 

## Table of Contents
- [More About `print()` and `input()`](#print)
- [Reading and Writing Files](#read)
- [Saving Structured Data with JSON](#json)
- [Pickling Python objects](#pickle)
- [Reading and Writing Binary Data](#binary)

<a id='print'></a>
## More About `print()` and `input()`
You can use `print()` and `input()` to interact with the user of your program. In these cases, you'll take `input` from the user and then, optionally, `print` something back. Before we consider a use case, let's familiarize ourselves with the two functions.

**`input([prompt])`**

> If the `prompt` argument is present, it is written to standard output without a trailing newline. The function then reads a line from input, converts it to a string (stripping a trailing newline), and returns it.

In [2]:
input("Type something, anything!, and it'll become a string:")

Type something, anything!, and it'll become a string:


''

> Usually you'll want to save the input so that you can do something with it.

In [2]:
userInput = input("Type something:")

Type something:abc


> That brings us to a simple use case - asking the user if they want to do one thing or another:

In [5]:
userInput = input("To calculate the median, enter median; otherwise you'll get the mean:")

def input_check(ui):
    if ui == 'median':
        return True
    else:
        return False

numList = [1,2,100]

if input_check(userInput):
    print(np.median(numList))
else:
    print(np.mean(numList))

To calculate the median, enter median; otherwise you'll get the mean:f
34.3333333333


> Above, we ask the user to enter the word "median" if they want the median calculated; otherwise we'll calculate the mean.

> Then we define a function that accepts a string. If that string is 'median', it returns True; otherwise it returns False.

> Then we make a simple list:  `[1,2,100]`

> Then we pass `userInput` to `input_check`. If `input_check` evaluates to `True`, we calculate the median using numpy. If `False`, we calculate the mean.

> Although this example is a bit simplistic, it demonstrates the utility of `input()`.
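
> One weakness of the check above is that `'Median'` or `' median '` would fall through to the mean. A minimal sketch of a more forgiving check follows; the helper name `wants_median` is hypothetical and not part of the lesson. Since `input()` always returns a string, the helper can be tried with plain strings:

```python
def wants_median(user_text):
    """Return True if the user asked for the median, ignoring case and stray whitespace."""
    return user_text.strip().lower() == 'median'

# input() always returns a string, so plain strings stand in for user input here
print(wants_median('median'))     # True
print(wants_median('  Median '))  # True
print(wants_median('f'))          # False
```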

#### Exercise

> 1) Write a `while` loop that asks the user to `input` a `password` until it matches the `secret`:  'password'

**`print(*objects, sep=' ', end='\n', file=sys.stdout, flush=False)`**

> Prints the object(s), separated by `sep` and followed by `end`, to the text stream `file`, which defaults to `sys.stdout` (in Jupyter, that's the cell's output area).

> `sep` is useful when you want to control the separators between each string you're printing. 

> `end` is useful when you want to do the same for the end of a line.

> Let's inspect the default behavior:

In [39]:
print('A', 'space', 'between','each','word','by','default...')
print("And a new line at the end by default")

A space between each word by default...
And a new line at the end by default


> Now we'll use a comma as the separator and then, for the second `print`, change `end`:

In [24]:
print('a', 'comma', 'between','each','word','but','still','a','new','line','at','the','end', sep = ",")
print('a', 'comma', 'between','each','word','with','the', 'line','ending','exclamatorily',sep = ",",end = "!")
print('!')

a,comma,between,each,word,but,still,a,new,line,at,the,end
a,comma,between,each,word,with,the,line,ending,exclamatorily!!


> Another look at the utility of `end`:

In [41]:
print("A", end="-")
print("dash", end="-")
print("instead", end="-")
print("of", end="-")
print("new", end="-")
print("lines")

A-dash-instead-of-new-lines


> As you can see, changing the `end` keyword argument suppresses the default new line. This can be helpful if you're trying to use `print` to help you inspect the contents of a container during debugging.

In [1]:
# if the container is large, you'll have to scroll through a whole lot
for n in range(0,10):
    print("See how each number is printed on a new line:",n)

See how each number is printed on a new line: 0
See how each number is printed on a new line: 1
See how each number is printed on a new line: 2
See how each number is printed on a new line: 3
See how each number is printed on a new line: 4
See how each number is printed on a new line: 5
See how each number is printed on a new line: 6
See how each number is printed on a new line: 7
See how each number is printed on a new line: 8
See how each number is printed on a new line: 9


In [2]:
#so if the container is large, use end to suppress the new lines
for n in range(0,100):
    print(n,end=" ")

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 

> To get an idea for how `print` knows where to print, consider the following code:

In [66]:
print("Now you see me!")                     # printing to stdout, which you'll see here in jupyter
placeHolder = sys.stdout                     # store original stdout object for later
sys.stdout = open(r'assets/stream.txt', 'w') # open a file, stream.txt, make it stdout and thereby redirect print
print("Now you don't!")                      # you won't see this print here because it's written to stream.txt
sys.stdout.close()                           # close the file object we opened
sys.stdout = placeHolder                     # restore print to sys.stdout
print("Now you see me!")                     # now we can see this print

Now you see me!
Now you see me!


> If you go open stream.txt, you'll see that `"Now you don't!"` printed there. But we could have achieved that much more easily by using the `file` keyword argument:

In [65]:
streamFile = open(r'assets/stream.txt', 'w')
print("Printing to stream.txt", file = streamFile)
streamFile.close()

> Now stream.txt contains the following line: `Printing to stream.txt`
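
> The `file` argument isn't limited to files on disk. As a rough sketch, assuming an in-memory `io.StringIO` stream is an acceptable stand-in for a file (the variable names are illustrative), `sep`, `end`, and `file` can be combined like this:

```python
import io

buffer = io.StringIO()                       # an in-memory text stream
print('a', 'b', 'c', sep='-', end='!\n', file=buffer)
print('second line', file=buffer)

captured = buffer.getvalue()                 # everything print() wrote to the stream
print(repr(captured))                        # 'a-b-c!\nsecond line\n'
```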

<a id='read'></a>
## Reading and Writing Files
We've already been opening and writing to files, but let's now explain that process.

When working with files, you typically:
1. Open the file, specifying the mode
2. Do something with the file
   - depends on the mode you used to open the file
   - usually facilitated by *seeking* through the file so you can get to an exact spot
3. Close the file
   - After a file is closed, attempts to use it fail. Closing also frees up system resources right away; if you don't close a file explicitly, Python's garbage collector will eventually destroy the file object and close it for you, but the file may stay open for a while.
   
 

### Opening a file
> The built-in `open()` function returns a [file object](https://docs.python.org/3/glossary.html#term-file-object), and is most commonly used with two arguments: `filename` and `mode`. 

> The filename can include a file path. 

> And the file can be opened with one or more of the following mode characters, which can be combined (for example, `'rb'` or `'w+'`):

> <table border="1">
	<caption>Open File Modes</caption>
	<tbody>
		<tr>
			<th>Mode</th>
			<th>Description</th>
		</tr>
		<tr>
			<td>&#39;r&#39;</td>
			<td>Open a file for reading. (default)</td>
		</tr>
		<tr>
			<td>&#39;w&#39;</td>
			<td>Open a file for writing. Creates a new file if it does not exist or truncates the file if it exists.</td>
		</tr>
		<tr>
			<td>&#39;x&#39;</td>
			<td>Open a file for exclusive creation. If the file already exists, the operation fails.</td>
		</tr>
		<tr>
			<td>&#39;a&#39;</td>
			<td>Open for appending at the end of the file without truncating it. Creates a new file if it does not exist.</td>
		</tr>
		<tr>
			<td>&#39;t&#39;</td>
			<td>Open in text mode. (default)</td>
		</tr>
		<tr>
			<td>&#39;b&#39;</td>
			<td>Open in binary mode.</td>
		</tr>
		<tr>
			<td>&#39;+&#39;</td>
			<td>Open a file for updating (reading and writing)</td>
		</tr>
	</tbody>
</table>

> By default, `open()` expects a `'t'`ext file and opens it in `'r'`ead mode. Let's practice.

In [228]:
lines = open(r'assets/lines.txt', 'r') #open for reading (didn't need to specify 'r' since it's default)

`lines` is now a file object. Let's inspect it; its representation shows its type (note there are multiple file object types defined within the `io` module) along with its name, mode, and encoding.

In [229]:
lines

<_io.TextIOWrapper name='assets/lines.txt' mode='r' encoding='UTF-8'>
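
> The other modes in the table above can be tried the same way. The following is a sketch only: it works in a temporary directory, the file name `modes.txt` is made up, and it uses the `with` statement (covered under Closing Files) so each file is closed automatically.

```python
import os
import tempfile

# Work in a throwaway directory so no real files are touched
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'modes.txt')   # hypothetical file name

    with open(path, 'x') as f:       # 'x' creates the file; it must not already exist
        f.write('first\n')

    try:
        open(path, 'x')              # the file exists now, so this fails
        second_create_failed = False
    except FileExistsError:
        second_create_failed = True

    with open(path, 'a') as f:       # 'a' keeps the existing contents and adds to the end
        f.write('second\n')

    with open(path, 'r') as f:
        after_append = f.read()

    with open(path, 'w') as f:       # 'w' truncates the file before writing
        f.write('third\n')

    with open(path, 'r') as f:
        after_write = f.read()

print(second_create_failed)  # True
print(repr(after_append))    # 'first\nsecond\n'
print(repr(after_write))     # 'third\n'
```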

### Reading a file
> Since we opened the file in `r` mode, we can now read it using some of the file object's available methods.

#### `f.read([size])` 
> Reads up to `size` characters (or bytes, in binary mode) and returns them. `size` is an optional numeric argument; when it is omitted or negative, the entire contents of the file will be read and returned. It's your problem if the file is twice as large as your machine's memory!

In [230]:
lines.read(1) #Read the first character

'T'

In [231]:
lines.read(22) #Read the next 22 characters (the trailing \n counts as one), picking up where we left off

'his is the first line\n'

#### `f.readline()`
> `readline()` reads a single line from the file, leaving a newline character at the end of the string unless reading a final line that doesn't end in a newline character. This makes the return value unambiguous:  if `f.readline()` returns an empty string `''`, the end of the file has been reached, while a blank line is represented by `'\n'`.

In [232]:
lines.readline() #reading the line where we left off with f.read() above; note the trailing \n

'This is the second line\n'

In [233]:
lines.readline() #read the next line; note the trailing \n

'This is the third line\n'

> If you want to read a file line by line, you can loop over the file object:

In [234]:
for line in lines:
    print(line, end='')

This is the fourth line

This is the sixth line (#5 was blank!)
This is the seventh and final line.
We just wrote a new line!


In [235]:
lines.close() #close the file for now
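
> The difference between a blank line (`'\n'`) and the end of the file (`''`) can be seen in a small sketch. It assumes a throwaway file with a made-up name, written just for this illustration:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample.txt')   # hypothetical file name
    with open(path, 'w') as f:
        f.write('first\n\nthird')                # the middle line is blank; no final newline

    results = []
    with open(path) as f:
        while True:
            line = f.readline()
            if line == '':                       # empty string: end of file
                break
            results.append(line)                 # a blank line arrives as '\n'

print(results)  # ['first\n', '\n', 'third']
```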

#### `list(f)` & `f.readlines()`
> You can use these if you want to read all the lines of a file in a list.

In [236]:
lines = open(r'assets/lines.txt', 'r') #we've already read every line, so we need to reopen
line_list = list(lines)
lines.close()
line_list

['This is the first line\n',
 'This is the second line\n',
 'This is the third line\n',
 'This is the fourth line\n',
 '\n',
 'This is the sixth line (#5 was blank!)\n',
 'This is the seventh and final line.\n',
 'We just wrote a new line!\n']

In [237]:
lines = open(r'assets/lines.txt', 'r') #we've already read every line, so we need to reopen
line_list = lines.readlines()
lines.close()
line_list

['This is the first line\n',
 'This is the second line\n',
 'This is the third line\n',
 'This is the fourth line\n',
 '\n',
 'This is the sixth line (#5 was blank!)\n',
 'This is the seventh and final line.\n',
 'We just wrote a new line!\n']

### Writing to a file
#### `f.write(string)` 
> Writes the string to the file, returning the number of characters written.

In [250]:
written_lines = open(r'assets/write_lines.txt', 'w+') #open for reading & writing (creates file since it doesn't exist)
written_lines.write("We just wrote a new line!\n")
written_lines.write("We just wrote another line!\n")
written_lines.write("This is the last line we'll write")

33

In [251]:
for line in written_lines:
    print(line)

> The above loop didn't print our lines, even though we opened the file for both reading and writing. Why is that?

> Opening the file in `w+` mode truncates it, so the only data available to read is whatever we write afterward. After our three writes, the file position sits at the end of the file, so reading from there returns nothing. We need to return to the top of the file before reading.

> Fortunately, you can move through a file without returning the contents.

In [253]:
written_lines.close() #close the file!

### File Seeking
#### `f.tell()`
> Returns an integer giving your current position in the file object. The initial position is 0.

In [254]:
reading_lines = open(r'assets/write_lines.txt')
reading_lines.tell()

0

#### `f.seek(offset, from_what)`
> To change the file object’s position, you use `f.seek(offset, from_what)`. The position is computed from adding `offset` to a reference point; the reference point is selected by the `from_what` argument. 

> - `from_what` values:
    - 0 (the default) measures from the beginning of the file
    - 1 uses the current file position
    - 2 uses the end of the file as the reference point
    
> In text files (those opened without a `b` in the mode string), only seeks relative to the beginning of the file are allowed (the exception being seeking to the very file end with seek(0, 2)) and the only valid offset values are those returned from `f.tell()`, or zero. Any other offset value produces undefined behaviour.
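
> In binary mode, all three reference points are available. Here is a minimal sketch, assuming an in-memory `io.BytesIO` object as a stand-in for a file opened with `b` in its mode:

```python
import io

# io.BytesIO behaves like a file opened in binary mode, without touching the disk
f = io.BytesIO(b'0123456789abcdef')

f.seek(5)                  # from_what=0: 5 bytes from the beginning
from_start = f.read(1)     # b'5'; the position is now 6

f.seek(3, 1)               # from_what=1: 3 bytes forward from the current position
from_current = f.read(1)   # b'9'

f.seek(-3, 2)              # from_what=2: 3 bytes before the end
from_end = f.read()        # b'def'

print(from_start, from_current, from_end)
```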

In [266]:
reading_lines.seek(10) #this returns 10 since we're offsetting 10 from 0 (the beginning)

10

In [274]:
reading_lines.readline(reading_lines.seek(10)) #seek(10) returns 10, so readline() reads at most 10 characters starting at position 10

'ote a new '

In [275]:
reading_lines.close()
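
> Seeking also resolves the earlier `w+` puzzle, where reading right after writing returned nothing. A small sketch, assuming a temporary file with a made-up name:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'rewind.txt')   # hypothetical file name

    f = open(path, 'w+')             # open for writing and reading
    f.write('line one\n')
    f.write('line two\n')

    at_end = f.read()                # '' because the position is at the end of the file
    f.seek(0)                        # jump back to the beginning
    after_rewind = f.read()          # now the written lines come back
    f.close()

print(repr(at_end))        # ''
print(repr(after_rewind))  # 'line one\nline two\n'
```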

### Closing Files
> So far, we've been careful to close each file once we're done with it. However, calling `close()` by hand isn't the best practice: if an exception is raised before that line runs, the file is left open.

> The preferred method is to use the `with` keyword when dealing with file objects. This is an example of a **context manager** in Python (there's actually a whole standard library module called [contextlib](https://docs.python.org/3/library/contextlib.html) that's dedicated to context managers). 

> The advantage of using the `with` keyword is that the file is properly closed after we're done working with it, even if an exception is raised at some point. This becomes important when you're working with a lot of files, as it helps your computer manage resources and not crash. It's also a more elegant syntax when compared to the equivalent `try-finally` block.

> Here's what using the `with` keyword looks like:

In [276]:
with open(r'assets/write_lines.txt') as f:
    read_data = f.read()

f.closed #boolean test to see if the file is closed

True

> Above, we opened the file using the `with` keyword, read its contents into `read_data`, and then tested to see if the file was indeed closed after the `with` statement. Here we can see that we did indeed read the contents of the file:

In [277]:
read_data

"We just wrote a new line!\nWe just wrote another line!\nThis is the last line we'll write"

> If you're curious, the equivalent using a `try-finally` block would look like this:

In [284]:
f = open(r'assets/write_lines.txt')  # open before the try so f always exists when finally runs
try:
    read_data = f.read()
finally:
    f.close()

f.closed

True

<a id='json'></a>
## Saving Structured Data with JSON
Working with files requires you to read and write strings. But reading and writing numbers can take more effort, since file methods, like `read()`,  return strings. This means you'd have to pass a number string, like `'100'`, to a function like `int()` to return its numeric value, `100`. This can get quite complicated once you need to save more complex data types, like nested lists and dictionaries.

The `json` module can convert Python's data hierarchies into string representations using the popular data interchange format called [JSON](https://www.json.org/). This process is called *serializing* while the reconstruction of the data from the JSON string representation is called *deserializing*. It's between these serializing and deserializing processes that the string representing the object may be stored in a file and/or sent over a network connection to some far-away machine.

> If you have an object, `x`, you can view its JSON string representation with a simple line of code:

In [291]:
x = [1,'simple','list']
json.dumps(x)

'[1, "simple", "list"]'

> Now that we have a string representation of `x`, we could write it to a file. Or we could use the `dump()` function to achieve all of this in one step.

In [294]:
with open(r'assets/from_json.txt', 'w') as f:
    json.dump(x, f)

> To see that we wrote the encoded object, let's open it, decode it, and read it using `json.load()`:

In [297]:
with open(r'assets/from_json.txt') as f:
    y = json.load(f)
print(y)

[1, 'simple', 'list']


This technique can handle lists and dictionaries well enough, but serializing arbitrary class instances in JSON requires a bit of extra effort. See the `json` module [reference](https://docs.python.org/3/library/json.html#module-json) for more information.
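
One way to supply that extra effort is the `default` argument of `json.dumps()`, which is called for any object the encoder can't handle on its own. The sketch below is illustrative only: the `Point` class and the `encode_point` helper are made up for this example.

```python
import json

class Point:
    """A hypothetical class used only for this illustration."""
    def __init__(self, x, y):
        self.x = x
        self.y = y

def encode_point(obj):
    """Called by json.dumps() for objects it can't serialize on its own."""
    if isinstance(obj, Point):
        return {'x': obj.x, 'y': obj.y}
    raise TypeError(f'Cannot serialize {type(obj).__name__}')

encoded = json.dumps(Point(1, 2), default=encode_point)
print(encoded)                       # {"x": 1, "y": 2}

decoded = json.loads(encoded)        # comes back as a plain dict, not a Point
restored = Point(**decoded)          # rebuild the instance ourselves
print(restored.x, restored.y)        # 1 2
```

Note that JSON itself knows nothing about `Point`; the round trip only works because we rebuild the instance from the decoded dictionary.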

<a id='pickle'></a>
## Pickle
So far we've been working with **dynamic data**, where information is asynchronously changed as further updates to the information become available.

The opposite of dynamic data is **persistent data**, which denotes information that is infrequently accessed and/or not likely to be modified. There are degrees to data persistence, but that can be the topic of another lesson.

For now, we'll just introduce ourselves to the [`pickle` module](https://docs.python.org/3.6/library/pickle.html). As you may know, [pickling](https://en.wikipedia.org/wiki/Pickling) is the process of preserving or expanding the lifespan of food by either anaerobic fermentation in brine or immersion in vinegar. You can do the digital equivalent using the `pickle` module. 

What exactly happens when you pickle and unpickle a Python object? This is straight from the Python docs:
 - “Pickling” is the process whereby a Python object hierarchy is converted into a byte stream.
 - “Unpickling” is the inverse operation, whereby a byte stream (from a binary file or bytes-like object) is converted back into an object hierarchy.
 
Once you've pickled something you can either write it to a file (saving it persistently to disk) or keep the resulting `bytes` object in memory. An example of the former will make this concrete.

In [6]:
# The name of the file where we will store the object
pickleFile = r'assets/pickleFile.data'
# The list of things to pickle
pickleList = ['cucumber', 'cabbage', 'carrot', 'turnip', 'kimchi']

# Write to the file
with open(pickleFile, 'wb') as f:           # open the file in write binary mode
    pickle.dump(pickleList, f)              # dump the object to a file (pickling)

# Destroy pickleList
del pickleList

# Read back from the storage
with open(pickleFile, 'rb') as f:           # open the file in read binary mode
    pickles = pickle.load(f)                # return the pickled object using the load function (unpickling)
 
print(pickles)

['cucumber', 'cabbage', 'carrot', 'turnip', 'kimchi']
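
For the in-memory route, `pickle.dumps()` and `pickle.loads()` work with a `bytes` object instead of a file. A minimal sketch (the variable names are illustrative):

```python
import pickle

pickle_list = ['cucumber', 'cabbage', 'carrot']

pickled_bytes = pickle.dumps(pickle_list)    # pickle to a bytes object instead of a file
print(type(pickled_bytes))                   # <class 'bytes'>

unpickled = pickle.loads(pickled_bytes)      # rebuild the list from the bytes
print(unpickled == pickle_list)              # True
print(unpickled is pickle_list)              # False: it's a new, equal object
```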


### Why Pickle?
We just saw how pickling converts a Python object (e.g. a list) into a byte stream containing all of the information necessary to reconstruct the object in another Python script. But why would you want to do this?

#### Memory Constraints
Sometimes you need to work with data that exceeds your computer's memory limits. If a scalable server isn't an option, you can use pickling to chunk up your data objects. In this way you do a little work on a portion of the data, pickle the output, do a little more work on another portion of the data, pickle that output, and so forth until you've covered all of the data. Then you can combine the pickled outputs.
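
The chunk-and-combine idea might look roughly like the sketch below. Everything here is a stand-in: the tiny list plays the role of a large dataset, squaring numbers plays the role of the real work, and the chunk size and file names are made up.

```python
import os
import pickle
import tempfile

data = list(range(10))      # stand-in for data too large to process at once
chunk_size = 4              # hypothetical chunk size

with tempfile.TemporaryDirectory() as tmp_dir:
    chunk_paths = []
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        output = [n * n for n in chunk]                   # the "little bit of work"
        path = os.path.join(tmp_dir, f'chunk_{start}.pkl')
        with open(path, 'wb') as f:
            pickle.dump(output, f)                        # save this chunk's output to disk
        chunk_paths.append(path)

    combined = []
    for path in chunk_paths:                              # combine the pickled outputs
        with open(path, 'rb') as f:
            combined.extend(pickle.load(f))

print(combined)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```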

#### Networks
Another use case is transmitting a Python object over a network. Normally, you would first need to convert the object into a byte stream before sending it over the connection. Pickling does that for you. Only unpickle data from sources you trust, though, because unpickling can execute arbitrary code.

<a id='binary'></a>
## Reading and Writing Binary Data
At some point you might need to read or write binary data, such as that found in an image or audio file.

And if you were paying attention to how we opened the pickle files above, you would have noticed that we were working with those files in binary rather than text mode.

You can use `open()` with the `b` mode to specify binary data.

But before we do that, let's familiarize ourselves with **encoding**.

### Encodings 
> An encoding is how computers represent audio, images, text, etc. in **bytes**. 
 - A bit is the smallest unit of digital storage, the proverbial 0 or 1. Bit is a portmanteau of **bi**nary digi**t**.
 - A byte is a unit of digital information that most commonly consists of eight **bits**.
     - One byte = 8 bits
         - e.g. `1 1 0 1 1 1 1 0`
         - One byte can represent 256 ($2^8$) different patterns

> #### Bytes and Characters
> [ASCII](http://www.asciitable.com/) is an old encoding that represents common Latin letters, digits, and punctuation as numbers. ASCII defines 128 characters, numbered 0 to 127, so each one fits in a single byte (which can hold values from 0 to 255):
 - A is 65
 - B is 66
 - a is 97
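
> These numbers can be checked in Python with the built-in `ord()` and `chr()` functions. A quick sketch:

```python
print(ord('A'), ord('B'), ord('a'))   # 65 66 97
print(chr(65), chr(66), chr(97))      # A B a

ascii_bytes = 'ABa'.encode('ascii')   # one byte per character
print(list(ascii_bytes))              # [65, 66, 97]
```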

> Computers eventually expanded beyond English-speaking countries. The [Unicode Standard](https://unicode.org/) provides a unique number for every character, no matter what platform, device, application or language.

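> In Python 3, UTF-8 is the default encoding for source files and for `str.encode()` and `bytes.decode()` (the default for `open()` in text mode depends on your platform). UTF-8 is capable of encoding all 1,112,064 valid Unicode code points using one to four 8-bit bytes. Importantly, UTF-8 was designed for backward compatibility with ASCII: every ASCII character is encoded as the same single byte. 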
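
> The one-to-four-byte range is easy to observe. In this sketch the sample characters are arbitrary picks, written as escape sequences so the code doesn't depend on how the file itself is encoded:

```python
samples = ['A', '\u00e9', '\u20ac', '\U0001F600']   # A, e with acute accent, euro sign, an emoji
byte_counts = [len(ch.encode('utf-8')) for ch in samples]
print(byte_counts)                                  # [1, 2, 3, 4]

# ASCII characters encode to the same single byte in both encodings
same_as_ascii = 'A'.encode('utf-8') == 'A'.encode('ascii')
print(same_as_ascii)                                # True
```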

### Reading and Writing Binary Files
We read and write binary files the same way we read and write text files, except that we add `b` to the mode:

In [329]:
# Write a byte string to a binary file
with open(r'assets/binary.bin', 'wb') as f:
    f.write(b'Hello World')

In [330]:
with open(r'assets/binary.bin', 'rb') as f:
    binary_data = f.read()
    
binary_data

b'Hello World'

> When reading binary, your data will be returned in the form of byte strings (`bytes` objects), not text strings. That's why there's the `b` before the string.

> Similarly, when writing, you must supply data in the form of `bytes` objects.

> Finally, when indexing or iterating over `bytes` objects, integer byte values are returned instead of one-character strings.

In [337]:
s = "foo"
for c in s:
    print(c)

f
o
o


In [338]:
b = b"bar"
for c in b:
    print(c)

98
97
114
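
Indexing behaves the same way as iterating, while slicing keeps the `bytes` type. A short sketch of the difference:

```python
b = b"bar"
first_value = b[0]             # indexing gives an integer
first_slice = b[0:1]           # slicing gives a bytes object
print(first_value)             # 98
print(first_slice)             # b'b'

rebuilt = bytes([98, 97, 114]) # integers can be turned back into bytes
print(rebuilt)                 # b'bar'
```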


If you ever need to read or write text from a binary file, make sure you remember to decode or encode it. For example:

In [11]:
with open(r'assets/binFile.bin', 'wb') as f:
    text = 'binary is fun!'
    f.write(text.encode('utf-8'))
    
    
with open(r'assets/binFile.bin', 'rb') as f:
    data = f.read(16)
    text = data.decode('utf-8')
    print(text)

binary is fun!


# An Example

## Extracting png images from a file
Behind the scenes, all files are binary. Moreover, file types are standardized so that each file of a certain type organizes its binary in a specific way. We'll use the [standards for png image files](http://www.libpng.org/pub/png/spec/1.2/PNG-Chunks.html) to help us extract png image files from a document.

Here are the key details that we need to know:
 
 - A PNG file always contains a **PNG signature**, with the first eight bytes of the file always containing the following (decimal) values:  $137$ $80$ $78$ $71$ $13$ $10$ $26$ $10$

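 - The signature is then followed by a series of chunks. Each chunk consists of a 4-byte length (big-endian), a 4-byte chunk type, the chunk data, and a 4-byte CRC. The **IHDR Image header** chunk always appears first and the IEND chunk always appears last. The IHDR data contains some vital details about the image: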
<table>
<tr>
    <th>Detail</th>
    <th>Size</th>
</tr>
<tr>
    <td>Width:</td>
    <td>4 bytes</td>
</tr>
<tr>
    <td>Height:</td>
    <td>4 bytes</td>
</tr>
<tr>
    <td>Bit depth:</td>
    <td>1 byte</td>
</tr>
<tr>
    <td>Color type:</td>
    <td>1 byte</td>
</tr>
<tr>
    <td>Compression method:</td>
    <td>1 byte</td>
</tr>
<tr>
    <td>Filter method:</td>
    <td>1 byte</td>
</tr>
<tr>
    <td>Interlace method:</td>
    <td>1 byte</td>
</tr>
</table>

We'll use this knowledge to write a function that extracts PNG image files from another file. In our case, we'll be extracting the Pokémon PNGs from a Word document that contains four PNGs and one JPG.
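
To see how the signature and the IHDR fields sit in the bytes, here is a sketch that builds the start of a made-up PNG in memory and parses it back with the standard `struct` module. The 640x480 dimensions and the other field values are invented for illustration; a real file would supply its own. PNG stores multi-byte integers in big-endian order, hence the `>` in the format strings.

```python
import struct

PNG_SIGNATURE = bytes([137, 80, 78, 71, 13, 10, 26, 10])

# Build the start of a made-up PNG (a 640x480 image) purely for illustration
ihdr_data = struct.pack('>IIBBBBB', 640, 480, 8, 6, 0, 0, 0)
png_start = PNG_SIGNATURE + struct.pack('>I', len(ihdr_data)) + b'IHDR' + ihdr_data

# Parse it back: the signature, then the chunk's length and type, then the IHDR data
has_signature = png_start[:8] == PNG_SIGNATURE
chunk_length, chunk_type = struct.unpack('>I4s', png_start[8:16])
fields = struct.unpack('>IIBBBBB', png_start[16:16 + chunk_length])
width, height, bit_depth, color_type, compression, filter_method, interlace = fields

print(has_signature)              # True
print(chunk_length, chunk_type)   # 13 b'IHDR'
print(width, height, bit_depth)   # 640 480 8
```

The 13 bytes of IHDR data are exactly the sizes in the table added together: 4 + 4 + 1 + 1 + 1 + 1 + 1.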

In [34]:
def extract_pngs(file):
    """
    Extracts pngs from files and writes them to a new directory:  pngs/...
    file:  file path, which will be read as binary
    """
    png_path = os.path.join(os.getcwd(), "assets", "pngs")

    with open(file, "rb") as bf:
        bf.seek(0, 2)  # seek to the end
        num_bytes = bf.tell()  # get the file size

        c = 0  # instantiate counter
        for i in range(num_bytes):
            bf.seek(i)
            eight_bytes = bf.read(8)
            if eight_bytes == b"\x89\x50\x4e\x47\x0d\x0a\x1a\x0a":  # that's the PNG signature
                c += 1
                print("Found Signature #" + str(c) + " at " + str(i))

                # Walk the chunks after the signature. Each chunk is laid out as a
                # 4-byte big-endian data length, a 4-byte type, the data, and a 4-byte CRC.
                while True:
                    chunk_header = bf.read(8)
                    if len(chunk_header) < 8:  # ran out of file before finding IEND
                        break
                    chunk_length = int.from_bytes(chunk_header[:4], byteorder='big', signed=False)
                    bf.seek(chunk_length + 4, 1)  # skip the chunk's data and CRC
                    if chunk_header[4:] == b"IEND":  # IEND is always the last chunk
                        break
                png_end = min(bf.tell(), num_bytes)  # where the image stops

                # Now go back to the beginning of the image and extract the full thing
                bf.seek(i)
                png_data = bf.read(png_end - i)

                # make the output directory if it doesn't exist yet
                os.makedirs(png_path, exist_ok=True)

                with open(os.path.join(png_path, str(i) + ".png"), "wb") as pic_out:
                    pic_out.write(png_data)

In [35]:
pic_file = r'assets/pokemon_pics.docx' #this is the word doc

In [36]:
extract_pngs(pic_file)

Found Signature #1 at 127567
Found Signature #2 at 135907
Found Signature #3 at 348481
Found Signature #4 at 976658
