# Table of Contents
* [Learning Objectives:](#Learning-Objectives:)
* [Input and output](#Input-and-output)
	* [The `input` function](#The-input-function)
	* [The `print` function](#The-print-function)
	* [`stdin`, `stdout`, and `stderr`](#stdin,-stdout,-and-stderr)
	* [Basic file I/O](#Basic-file-I/O)
	* [Delimited data files](#Delimited-data-files)
* [Reading from a URL](#Reading-from-a-URL)
* [Pickling objects](#Pickling-objects)


# Learning Objectives:

After completion of this module, learners should be able to:

* use & explain common Python idioms for working with text files
* retrieve information from websites
* pickle and load Python objects

# Input and output

Up to now, our discussion of input and output has been restricted to output using the Python `print` function.
Data can be retrieved from users in Python programs using the `input` function. It is also possible to use alternative *streams* for input and output. A very special category of data stream is a *file* that can be used for both input and output. It turns out that more complicated input and output sources (e.g., databases, URLs, etc.) are quite simple to use in Python when we understand that arbitrary data input/output scenarios can be modeled easily using streams.

## The `input` function

Prior to Python 3, there were two functions for getting user input from the keyboard: `raw_input` and `input`. In Python 2, `raw_input` would take exactly what the user typed and return it as a string. The Python 2 function `input` was the composition of the function `eval` and `raw_input`. The primary difference was that `input` in Python 2 required a syntactically correct Python expression whereas `raw_input` did not.

At any rate, this proved to be needlessly confusing. In Python 3, the function `input` returns the string the user enters; it is equivalent to `raw_input` from Python 2. If that string needs to be evaluated as a Python expression, this can be achieved in Python 3 through the expression `eval(input())`.

In [None]:
my_string = input('Please enter a string: ')
print('You entered: %s' % my_string)
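Because `input` always returns a string, any other type has to be produced by an explicit conversion. The following is a minimal sketch, assuming a helper named `parse_value` (a hypothetical name, not part of the original notebook). It uses `ast.literal_eval`, which accepts only Python literals, as a more cautious alternative to `eval(input())` when the text comes from a user.

```python
import ast

def parse_value(text):
    """Convert user-typed text to a Python literal, or return it unchanged as a string."""
    try:
        # literal_eval only accepts literals (numbers, strings, lists, dicts, ...)
        return ast.literal_eval(text.strip())
    except (ValueError, SyntaxError):
        # Anything that is not a valid literal stays a plain string
        return text

number = parse_value('42')           # an int
sequence = parse_value('[1, 2, 3]')  # a list
word = parse_value('hello')          # not a literal, so still a string
```

In an interactive session this could be combined with `input`, for example `parse_value(input('Enter a value: '))`.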

## The `print` function

In Python 2, the `print` command was a statement rather than a function. As such, it followed unusual syntactic conventions that were inconsistent with the rest of the language. In Python 2, these were examples of `print` statements:

```python
>>> # These would work in Python 2
>>> print "The answer is", 2*2
>>> print x,           # Trailing comma suppresses newline
>>> print              # Prints a newline
>>> print>>sys.stderr, "fatal error"
>>> print (x, y)       # prints repr((x, y))
```

In Python 3, most of the statements above produce syntax errors, and the remainder no longer behave as they did in Python 2. The equivalent Python 3 expressions are:

```python
>>> print("The answer is", 2*2)
>>> print(x, end=" ")  # Appends a space instead of a newline
>>> print()            # You must call the function!
>>> print("fatal error", file=sys.stderr)
>>> print((x, y))      # Not the same as print(x, y)!
```

In fact, recent Python 2 versions encourage use of the `print()` function rather than the older statement via the directive:

```python
from __future__ import print_function
```

The basic syntax of the `print` function in Python 3 is
```python
print(value, ..., sep=' ', end='\n', file=sys.stdout, flush=False)
```

where `value` is a comma-separated sequence of values to print (possibly empty), `sep` is a string to insert in between printed values (default: a blank space), `end` is a string to print at the end (default: `'\n'`), `file` is a *stream* or file-like object to print to (default: `sys.stdout`, the *standard output*; see below), and `flush` is a boolean value indicating whether the output stream should be forcibly flushed before exiting the function  `print()` (default: `False`).

In [None]:
print(4, 5, 6, sep='-', end=':') # notice there is no newline between these print calls
print(7, 8, 9, sep='$', end=')')
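The effect of `sep`, `end`, and `file` can be checked by printing into an in-memory stream instead of the screen. This is a small sketch, assuming the standard-library `io.StringIO` class as the stand-in stream; the variable names `buffer` and `captured` are illustrative only.

```python
import io

buffer = io.StringIO()  # an in-memory text stream that behaves like an open file

# Same calls as above, but directed into the buffer rather than sys.stdout
print(4, 5, 6, sep='-', end=':', file=buffer)
print(7, 8, 9, sep='$', end=')', file=buffer)

captured = buffer.getvalue()  # everything written to the stream so far
```

Here `captured` holds the string `'4-5-6:7$8$9)'`, which shows that no newline was emitted between the two calls.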

## `stdin`, `stdout`, and `stderr`

Unix systems use a clever system for modeling file-like objects with *streams*. The three default streams programs can interact with are:

* *standard input* (also called *`stdin`*);
* *standard output* (also called *`stdout`*); and
* *standard error* (also called *`stderr`*).

In simplest terms, `stdin` is input from the keyboard while `stdout` and `stderr` are output to the terminal screen. When the `input` function is invoked, the program waits for a suitable sequence of bytes to flow from the stream `stdin` (i.e., the user typing on the keyboard) into the program. When the `print` function is invoked, the program pushes a sequence of bytes to `stdout` (i.e., to the terminal screen) as output to the user. There is an additional output stream, `stderr`, that by default also writes to the screen and is conventionally used for error messages.

One of the most powerful components of the Unix operating system is the notion of *pipes*: input and output streams from programs can be redirected in numerous ways. It is common to redirect the standard input stream from a file (so that rather than typing a long sequence of characters into a program, the program can get its required input from a text file). Similarly, the standard output (or standard error) from a program can be redirected to files so that they can be preserved for later scrutiny. Even more powerful, these streams can be redirected to use as input or output for *different programs*!

What does this have to do with Python? Well, even on non-Unix systems, the Python language is designed to interact with the operating system in a Unix-flavored way. Thus, the default output stream for the `print` function is `sys.stdout` and the default input stream for the `input` function is `sys.stdin`. These options can be reassigned to different stream objects in Python so, for instance, when a file `'my_output.txt'` is opened as a stream `outfile`, the `print` function invocation

```python
print("Hello, world!", file=outfile)
```

prints the characters `'Hello, world!\n'` into the file `my_output.txt` rather than to the screen.


* Working with files
  * Traditional idiom for working with files
  * Show `while not EOF:` idiom
  * Show Pythonic `for lines in file:` idiom

In [None]:
outfile = open('./tmp/my_output.txt','w')  # Open a file for writing to
print('Hello, world!', file=outfile) # Print text into the file
outfile.close()                      # Close the file
%cat tmp/my_output.txt
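To see that `stdout` and `stderr` really are separate streams, both can be temporarily redirected from within Python. The sketch below assumes a hypothetical helper `report` and uses `contextlib.redirect_stdout` and `contextlib.redirect_stderr` from the standard library to capture each stream in its own in-memory buffer.

```python
import contextlib
import io
import sys

def report(message, is_error=False):
    """Send ordinary messages to stdout and error messages to stderr."""
    stream = sys.stderr if is_error else sys.stdout
    print(message, file=stream)

captured_out = io.StringIO()
captured_err = io.StringIO()

# Inside this block, sys.stdout and sys.stderr point at the two buffers
with contextlib.redirect_stdout(captured_out), contextlib.redirect_stderr(captured_err):
    report('processing complete')
    report('fatal error', is_error=True)

normal_text = captured_out.getvalue()
error_text = captured_err.getvalue()
```

Each message ends up only in the buffer for its own stream, which is the same separation a shell relies on when redirecting standard output and standard error to different files.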

## Basic file I/O

The generic way to work with files is by *opening* them (either for reading or writing). The important Python built-in function for opening files is `open`. We'll create a string to write to a file for illustrative purposes.

In [None]:
# Let's first create a string to write to a file
long_string = """Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed
do eiusmod tempor incididunt ut labore et dolore magna aliqua."""
# We'll use a function "fill" from the textwrap module to modify this string
import textwrap
long_string = textwrap.fill(long_string, width=35)
print(long_string)

In [None]:
# Now, let's write a file to disk
outfile = open('data/lorem-ipsum', mode='w')
outfile.write(long_string)
outfile.close()

* The value returned by the function `open` is a *stream object* that we'll usually refer to as an *open file* or *file handle*.
* The call to the function `open()` accepts a string with a path to a file as its first argument. The path can contain forward slash characters (i.e., `/`) to separate directories and files, as long as the path is valid. Backslash characters (i.e., `\`) are also permitted on Windows systems.
* The optional keyword argument `mode='w'` opens the file for *writing*. There are other alternatives:

|Character|Meaning|
|:-:|:-|
|`r`| open for reading (default) |
|`w`| open for writing, truncating the file first |
|`x`| create a new file and open it for writing |
|`a`| open for writing, appending to the end of the file if it exists|
|`b`| binary mode |
|`t`| text mode (default)|
|`+`| open a disk file for updating (reading and writing) |

* The invocation `outfile.write(long_string)` writes the text string as is (including line breaks) to disk.
* More invocations of the form `outfile.write(`*`string`*`)` would append more strings after the text currently in the file.
* The method `outfile.writelines(sequence_of_strings)` will write multiple strings at once (it does not add line separators between them).
* The invocation `outfile.close()` closes the file that was previously opened for writing.
* Stream objects have several useful methods and attributes we can investigate using `help`.
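The difference between the modes `'w'` and `'a'` described above can be sketched as follows. The example assumes a scratch file in a temporary directory created with `tempfile.mkdtemp` (the file name `modes-demo.txt` is arbitrary) so that it does not touch the notebook's `data` or `tmp` folders.

```python
import os
import tempfile

# Hypothetical scratch file in a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), 'modes-demo.txt')

with open(path, 'w') as outfile:   # 'w' creates (or truncates) the file
    outfile.write('first line\n')

with open(path, 'a') as outfile:   # 'a' keeps existing content and appends
    # writelines adds no separators, so each string carries its own '\n'
    outfile.writelines(['second line\n', 'third line\n'])

with open(path) as infile:         # default mode is 'r' (text)
    after_append = infile.read()

with open(path, 'w') as outfile:   # reopening with 'w' discards the old content
    outfile.write('replaced\n')

with open(path) as infile:
    after_truncate = infile.read()
```

After the append, the file holds all three lines; after reopening with `'w'`, only the line `replaced` remains.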

It is generally a good idea to close an open file (stream object) when it is no longer needed using the `close()` method. The Python standard does not guarantee that open files will be closed upon exit from the program. The CPython implementation does, in practice, close any unclosed files, but that is not guaranteed in all Python implementations (e.g., in IronPython or in Jython).  Of greater concern, however, is that a file might remain open throughout a program run, and attempts to read or write from it later might not behave as expected. It is safest to match any `file=open(...)` invocation with a matching `file.close()` invocation.

Actually, there is an even safer idiom for file handling, which uses the statement

```python
with open(filename) as file:
```

to enclose a block that uses `file`. At the end of the `with` block, the file is closed automatically. It is good practice to use the `with` statement when opening file objects because it is guaranteed to close the file *no matter how the nested block exits*. Even if an exception occurs before the end of the block, the file will be closed prior to passing the exception up to an outer exception handler (the `with` block is also shorter than the corresponding `try-except-finally` block). The `with` statement also closes the file even if the nested block contains `return`, `continue`, or `break` statements. For other use cases, [The Python "with" Statement by Example](http://preshing.com/20110920/the-python-with-statement-by-example/) contains an interesting discussion of using `with` as a context manager.
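The claim that the file is closed even when the block exits through an exception can be checked directly. This is a minimal sketch, assuming a scratch file in a temporary directory and a deliberately raised `RuntimeError` standing in for a real failure.

```python
import os
import tempfile

# Hypothetical scratch file in a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), 'with-demo.txt')

try:
    with open(path, 'w') as demo_file:
        demo_file.write('partial output\n')
        # Simulate a failure part-way through the block
        raise RuntimeError('something went wrong mid-block')
except RuntimeError:
    pass  # the exception reached us only after the file was closed

closed_after_error = demo_file.closed  # True: the with statement closed the file
```

The name `demo_file` still refers to the stream object after the block, but the object is closed and can no longer be read from or written to.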

In [None]:
# An example using "with" to read from a file (not necessary to close infile explicitly)
# The convenient feature of "context managers" handles cleanup.
with open('data/lorem-ipsum','r') as infile:
    lines = infile.readlines()
print(lines)

* At the end of the `with` block, the file is closed without having to execute `infile.close()`.
* The stream method `readlines()` reads data from the stream object `infile` into a list of strings `lines`.
* By default, the data from the stream is treated as plain text, but this can be overridden.
* Attempting to read a non-existent file produces an error.
* Attempting to write/append to a non-existent file creates an empty file.
* Attempting to read or write to a stream that has been closed produces an error.

In [None]:
# Opening a non-existent file for reading: FileNotFoundError
with open('no-such-file') as infile: # Default: mode='r'
    pass

In [None]:
# By contrast, opening a non-existent file for write/append mode *creates* file
with open('./tmp/make-the-file', 'a') as newfile:
    pass
%ls -l tmp/make-the-file

In [None]:
# Once closed, we cannot write more to newfile.
newfile.write("More stuff") # raises a ValueError
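In a program (as opposed to an exploratory notebook), these errors are usually caught rather than allowed to stop execution. The following sketch assumes a hypothetical helper `read_text_or_default` that returns a fallback value when the file is missing; it also shows the `ValueError` raised by a closed stream being caught.

```python
import io

def read_text_or_default(path, default=''):
    """Return the contents of a text file, or a default if the file does not exist."""
    try:
        with open(path) as infile:
            return infile.read()
    except FileNotFoundError:
        return default

missing = read_text_or_default('no-such-directory/no-such-file', default='N/A')

# Any closed stream refuses further I/O; an in-memory stream shows the same behavior
stream = io.StringIO()
stream.close()
try:
    stream.write('More stuff')
    write_failed = False
except ValueError:
    write_failed = True
```

Catching the specific exception types (`FileNotFoundError`, `ValueError`) rather than a bare `except` keeps unrelated errors visible.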

* By specifying a positive integer argument to the `read` method, a specific amount of data can be read from a file: that many characters in text mode, or that many bytes in binary mode. This is useful for reading binary files and for files with no (or non-meaningful) line breaks.
* Usually a limited `read()` size is used when reading a very large file that we do not want to keep in memory all at once.  But in principle, any block size may be used.
* The `readline` method reads lines from a file one at a time. When there are no more lines to read, the `readline` method returns an empty string.
* The `readlines` method reads the whole file at once and returns a list with one string per line. Notice that the linefeed characters are preserved in each element of the list.

In [None]:
# We can read a file in fixed-size chunks (not just line by line)
from random import random
with open('data/lorem-ipsum') as infile:
    CHUNK_SIZE = 4    # Specify block size (characters, since the file is in text mode)
    while True:
        chunk = infile.read(CHUNK_SIZE)
        if not chunk:
            break
        if random() < 0.5:
            print(chunk.upper(), end='')
        else:
            print(chunk.lower(), end='')

In [None]:
with open('data/lorem-ipsum','r') as infile:
    print(infile.readline(), end='')
    print(infile.readline(), end='')
    line = infile.readline()
    print(line, end='')
    line = infile.readline()
    print(line, end='')
    line = infile.readline()
line

In [None]:
with open('data/lorem-ipsum','r') as infile:
    lines = infile.readlines()
lines
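The fact that `readline` returns an empty string at the end of the file gives the traditional "loop until end-of-file" idiom. A minimal sketch, assuming a small scratch file written to a temporary directory so the example is self-contained:

```python
import os
import tempfile

# Hypothetical scratch file in a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), 'readline-demo.txt')
with open(path, 'w') as outfile:
    outfile.write('alpha\nbeta\ngamma\n')

collected = []
with open(path) as infile:
    while True:
        line = infile.readline()
        if line == '':          # empty string signals end of file
            break
        collected.append(line.rstrip('\n'))  # drop the trailing linefeed
```

A blank line inside the file would be read as `'\n'`, not `''`, so it does not end the loop early.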

A more Pythonic way to read data from a text file is to read it line-by-line using a `for` loop. Just as with various data collections, a stream is an *iterable* in Python; hence, it can be looped over. This is probably a wiser choice when dealing with arbitrarily large files that can in principle fill the available memory.

In [None]:
# It is often more idiomatic to read by lines in a loop
with open('data/lorem-ipsum') as infile:
    for line in infile:
        print(line.upper(), end="")

We can also use comprehensions to open, read and manipulate the data.

In [None]:
[line.upper() for line in open('data/lorem-ipsum')]

In [None]:
# You can often compact code even more by avoiding intermediate names
[word for word in open('data/lorem-ipsum').read().split() if 's' in word]

In [None]:
# You can also nest loops inside comprehensions, although it's easy to 
# get carried away pretty quickly if you do this too deeply
[(lineno+1, word) for (lineno, line) in enumerate(open('data/lorem-ipsum'))
                  for word in line.split() 
                  if 's' in word]
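The comprehensions above call `open` without ever closing the file, leaving the cleanup to the Python implementation. One way to keep the compact style while still closing the file deterministically is to put the comprehension inside a `with` block. This is a sketch under the assumption of a small scratch file with made-up text:

```python
import os
import tempfile

# Hypothetical scratch file with illustrative text
path = os.path.join(tempfile.mkdtemp(), 'comprehension-demo.txt')
with open(path, 'w') as outfile:
    outfile.write('roses are red\nviolets are blue\n')

with open(path) as infile:
    # enumerate(..., start=1) gives 1-based line numbers directly
    words_with_s = [(lineno, word)
                    for lineno, line in enumerate(infile, start=1)
                    for word in line.split()
                    if 's' in word]

file_is_closed = infile.closed  # True once the with block has ended
```

The resulting list is built while the file is open, so it remains usable after the file has been closed.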

## Delimited data files

Very often, text files represent tabular data using a *delimiter* to separate columns. We illustrate a few examples of writing and reading CSV (*comma-separated values*) files, TSV (*tab-separated values*) files, and files that use other delimiters.

In [None]:
# Here is a "hand-crafted" way to write a TSV file of random numbers
# Notice the "print" function can print to a file
from random import random
nrows, ncols, scale = 4, 10, 500.0
delimiter = '\t'
filename = './tmp/random-numbers.tsv'
# Pay attention to scope of "with" block & the nested for blocks...
with open(filename,'w') as randfile:
    for _ in range(nrows):
        for _ in range(ncols - 1):  # all but the last entry in the row
            print("%0.2f" % (scale*random()), end=delimiter, file=randfile)
        # End last entry with a newline (i.e., not *delimiter*)
        print("%0.2f" % (scale*random()), file=randfile) 

In [None]:
# Now we open and read back the TSV file just generated
with open(filename) as randfile:
    for line in randfile:
        print(line, end='') #Suppress extra line feed

Of course, rather than developing code for reading and writing to CSV/TSV files, we can use a module from the Python Standard Library (namely `csv`). When confronted with a new data analysis problem, *always* check to see if there is a library. There is a good chance someone else has had a similar problem to solve and has developed a module that will solve the problem for you.

In [None]:
# Here, we open a CSV file and create a custom stream using csv.reader.
# The csv.reader stream can be iterated over in a for loop that extracts rows
# from the CSV file and separates entries into Python lists of strings. We 
# could do this with str.split(','), but the code has been written for us.
import csv
with open('./data/AAPL.csv') as csvfile:
    stockreader = csv.reader(csvfile)
    for row in stockreader:
        print(row)
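The `csv` module can also write delimited files and handle delimiters other than commas. The sketch below assumes a small made-up table and a scratch file in a temporary directory; it writes a TSV file with `csv.writer`, reads it back with `csv.reader`, and then reads it again with `csv.DictReader`, which uses the header row as dictionary keys.

```python
import csv
import os
import tempfile

# Hypothetical scratch file and illustrative (made-up) table
path = os.path.join(tempfile.mkdtemp(), 'table-demo.tsv')
table = [['name', 'value'], ['alpha', '1.50'], ['beta', '2.25']]

# newline='' lets the csv module control line endings itself
with open(path, 'w', newline='') as tsvfile:
    csv.writer(tsvfile, delimiter='\t').writerows(table)

with open(path, newline='') as tsvfile:
    rows_read = list(csv.reader(tsvfile, delimiter='\t'))

with open(path, newline='') as tsvfile:
    # Each record maps column names from the header row to that row's entries
    values = [float(record['value'])
              for record in csv.DictReader(tsvfile, delimiter='\t')]
```

Every entry comes back as a string, so numeric columns need an explicit conversion such as `float`.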

# Reading from a URL

The Python Standard Library includes a package called `urllib`. Its `urlopen` function is capable of reading several different URL schemes, including `http:`, `https:`, `ftp:` and `file:`.

In [None]:
try:
    from urllib.request import urlopen
except ImportError:
    from urllib import urlopen # Python 2.7
url='http://www.wunderground.com/history/airport/KAGC/2016/1/6/DailyHistory.html?req_city=Pittsburgh&req_state=PA&req_statename=Pennsylvania&reqdb.zip=15206&reqdb.magic=1&reqdb.wmo=99999&format=1'
u = urlopen(url)
type(u)

`urlopen` creates a connection to the website much like `open` creates a connection to a file. We can now call `.read` inside the `with` context manager.

In [None]:
# Python 3 (Note: Python 2 urlopen does not use the with statement as below)
with u as url:
    contents = url.read()


In [None]:
# Python 2 only (in Python 3, use the with statement above, just as when opening a local file)
contents = u.read()

In [None]:
print(contents)

In [None]:
print(contents.decode('utf-8'))
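The same read-then-decode pattern can be tried without a network connection, because `urlopen` also understands `file:` URLs. This is a minimal sketch for Python 3, assuming a scratch file in a temporary directory; `pathlib.Path.as_uri` builds the `file:` URL.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen

# Hypothetical scratch file; written as bytes so the content is identical on every platform
path = Path(tempfile.mkdtemp()) / 'url-demo.txt'
path.write_bytes('hello from a file URL\n'.encode('utf-8'))

with urlopen(path.as_uri()) as response:  # path.as_uri() is a file:// URL
    raw = response.read()                 # urlopen always delivers bytes

text = raw.decode('utf-8')                # decode explicitly to get a str
```

As with the weather page above, `read` returns a `bytes` object, and the encoding has to be supplied when converting it to text.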

In many cases there can be quite a lot of additional text processing required before you can work with the data. There are many modules available in Python. For example, [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/) is great for parsing HTML content.

In [None]:
weather = [line.strip().replace('<br />','').split(',') for line in contents.decode('utf-8').split('\n')]
# Remove empty lines (however many there are)
weather = [row for row in weather if row != ['']]

In [None]:
# Average temperature for Jan 6, 2016
sum([float(l[1]) for l in weather[1:]])/len(weather[1:])
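The same clean-up can be sketched with the `csv` module doing the splitting. The text below is a made-up sample that only imitates the shape of the downloaded page (a header row, comma-separated fields, trailing `<br />` tags, and a blank line); it is not real weather data, and the column names are assumptions.

```python
import csv
import io
from statistics import mean

# Made-up sample text, NOT real data from the weather service
sample_text = 'Time,TemperatureF\n12:00 AM,30.0<br />\n1:00 AM,28.0<br />\n\n'

cleaned = sample_text.replace('<br />', '')
reader = csv.reader(io.StringIO(cleaned))  # treat the string as a text stream

header = next(reader)                      # first row holds the column names
records = [row for row in reader if row]   # csv.reader yields [] for blank lines

average_temperature = mean(float(row[1]) for row in records)
```

Wrapping the text in `io.StringIO` lets `csv.reader` treat a string exactly like an open file, which is the stream idea used throughout this module.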

# Pickling objects

A `pickle` is a binary dump of a Python object to a file. Here we are opening a new file called `weather.pkl` for binary writing with the mode `'wb'`.

In [None]:
import pickle
with open('./tmp/weather.pkl', 'wb') as out_file:
    pickle.dump(weather,out_file)
%ls -l tmp/weather.pkl

Using the same context manager idiom we can load the pickle directly into a new object. Reading foreign pickle files can be dangerous as malicious code stored in the pickle could be run on `load`. It is best to keep pickle files for local use only.

In [None]:
with open('./tmp/weather.pkl','rb') as in_file:
    new_weather = pickle.load(in_file)   

In [None]:
for line in new_weather:
    print(*line)
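A pickle does not have to go through a file: `pickle.dumps` and `pickle.loads` perform the same round trip using a `bytes` object in memory. The sketch below assumes a small made-up dictionary as the object to preserve.

```python
import pickle

# Illustrative object with nested containers (values are made up)
original = {'station': 'demo', 'readings': [30.0, 28.5], 'shape': (2, 1)}

payload = pickle.dumps(original, protocol=pickle.HIGHEST_PROTOCOL)  # bytes
restored = pickle.loads(payload)  # only unpickle data you trust

same_value = (restored == original)           # contents survive the round trip
same_object = (restored is original)          # but it is a new, independent object
```

The restored object compares equal to the original while being a distinct object, which is what makes pickling suitable for saving intermediate results between sessions.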